What if there were one place where you could go to get all the data you need? No matter what kind of data you are looking for, you would find it in this place. It could be your company’s master data, transactional data, analytics data, IoT data or even a document or a video. Sure, you would need to authenticate yourself and have the rights to access the data, but if that’s the case, then you just need to look in one place. And best of all, you could rely on the data presented in this place being accurate. It would be like a bedspread wrapped over all your data assets, giving shelter while at the same time providing a common interface to the surrounding world. Wouldn’t that be wonderful? Well, that is the main idea behind the concept of “Data Fabric”.
Data Fabric as a concept has developed over the past few years as an answer to the increasing challenge of getting the full benefit out of enterprise-wide data.
The ideas behind Data Fabric have been in development for many years. It started with collecting data in data warehouses, data lakes and BI platforms, and continued with the addition of integration, security, data lineage and master data aspects. In the beginning, this was branded as Enterprise Information Hubs or sometimes even as API platforms, but that push was unfortunately driven by software vendors who had a hard time delivering on their promises. So the more general architectural pattern of Data Fabric was born, though the definition is somewhat blurry and depends on who you ask.
The figure below shows a conceptual architecture of Data Fabric in relation to other established concepts.
For example, the exhibit below gives an interpretation of how different vendors in this area define Data Fabric:
As the diverse interpretations in the table above show, gaining a comprehensive understanding of Data Fabric is not straightforward, as it involves multiple perspectives from various sources and vendors. Analyzing the existing discussion around the definition does, however, reveal a consensus that a Data Fabric serves as a mechanism for generating data pipelines and integrations from diverse sources within a unified platform. Still, there are divergent views on whether a Data Fabric is an architecture or a broader concept encompassing various technologies and architectures.
Furthermore, when it comes to what makes up a Data Fabric, some suggestions push the idea that it should be built around metadata and metadata analysis. In a metadata-driven Data Fabric, metadata is “activated”: it is pushed towards users as they create pipelines, and new metadata is suggested when data arrives from external sources. The metadata is also enriched with semantics, giving it meaning and context through knowledge graphs. On top of these knowledge graphs it becomes possible to apply artificial intelligence and machine learning, and at that point we have reached the concept of active metadata. Active metadata, analyzed using semantics, knowledge graphs, artificial intelligence and machine learning, is considered one of the key features in achieving a Data Fabric architecture.
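To make the idea slightly more concrete, the sketch below illustrates the gist of it in Python: a tiny knowledge graph of metadata triples links technical column names to business terms, and a newly registered column gets suggested context by walking that graph. All names, triples and the crude matching rule are hypothetical; a real implementation would rely on far richer semantics and models.

```python
# Minimal, illustrative sketch of "active metadata": a tiny knowledge graph
# links technical metadata to business terms, and a newly registered column
# gets suggested context from those links. All names here are hypothetical,
# and the name heuristic stands in for real semantic matching.

# Knowledge graph represented as (subject, predicate, object) triples.
triples = [
    ("crm.customers.cust_no", "maps_to", "Customer Number"),
    ("erp.kunden.kundnr",     "maps_to", "Customer Number"),
    ("Customer Number",       "part_of", "Customer"),
    ("Customer",              "steward", "Sales Operations"),
]

def related(node: str) -> list[tuple[str, str]]:
    """Return every (predicate, object) edge leaving a node."""
    return [(p, o) for s, p, o in triples if s == node]

def suggest_context(new_column: str) -> list[str]:
    """Suggest business context for a new column by walking the graph
    from columns whose names look similar to the new one."""
    suggestions = []
    key = new_column.split(".")[-1][:4]  # crude similarity: shared name prefix
    for s, p, o in triples:
        if p == "maps_to" and key in s:
            suggestions.append(f"{new_column} probably means '{o}'")
            # Follow the business term one more hop for extra context.
            for p2, o2 in related(o):
                suggestions.append(f"'{o}' {p2} '{o2}'")
    return suggestions

if __name__ == "__main__":
    for line in suggest_context("webshop.orders.custno"):
        print(line)
```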
Market participants also hold a few contrasting views of how Data Fabric should be defined. Some call it a design concept, which could be interpreted as an architecture, while others view the Data Fabric as a “solution”, thereby interpreting it as an instantiated architecture. Interestingly, most market participants agree that Data Management should be an integral part of the Data Fabric definition, but one goes as far as viewing Data Fabric as a data management approach, a broad view that leaves a lot of room for interpretation.
One of the challenges in organizations is the diversity of sources and systems dealing with data. Data is generated at an increasing pace with the development of new technologies, regulations, and business needs. An increase in data volumes and in the number of data sources will make the landscape more complex, and finding the right data and understanding it in context will therefore become more difficult. Additionally, with different systems and user groups, it is not unusual for data to become siloed. Each system or user group will have access to and understand the data within their respective silos, while their knowledge about data outside of their organizational unit will be limited.
This may also lead to difficulties in harmonizing data and establishing consistent data categorization across the organization. Instances may arise where identical data objects exist in multiple locations but with varying identification formats or, worse, different taxonomies. A prime example is product or customer information, which may be scattered across numerous systems and often lacks consistency, or sometimes even presents contradictory information.
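As a hypothetical illustration of how small the differences can be and still cause trouble, the sketch below matches a customer record from a CRM system against one from an ERP system, where the identifiers follow different formats and the names are spelled slightly differently. The records, normalization rule and similarity threshold are all invented for the example.

```python
# Hypothetical illustration of the harmonization problem: the same customer
# exists in two systems with different identifier formats and spellings.
from difflib import SequenceMatcher

crm_record = {"id": "CUST-004711", "name": "ACME Industries AB", "country": "SE"}
erp_record = {"id": "4711",        "name": "Acme Industries",    "country": "Sweden"}

def normalize_id(raw: str) -> str:
    """Strip prefixes and leading zeros so identifiers become comparable."""
    return raw.split("-")[-1].lstrip("0")

def name_similarity(a: str, b: str) -> float:
    """Simple fuzzy match on names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

same_id = normalize_id(crm_record["id"]) == normalize_id(erp_record["id"])
similar_name = name_similarity(crm_record["name"], erp_record["name"]) > 0.8

print("Likely the same customer:", same_id and similar_name)  # True
```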
Another common problem is that the depth of the data architecture needed is underestimated. Large companies tend to simplify the complexity of their own operations in order to achieve a higher degree of freedom in their business processes. Consequently, processes are not adequately documented, and data is not captured and stored with the requisite granularity and quality. When these companies are then faced with higher requirements on data quality from external parties such as end-customers, regulating authorities or business partners, it can be a painful wake-up call. Attempting to impose better order on the base data of a fully operating business is comparable to performing engine repairs on an airplane while it is airborne. Frequently, the solution involves implementing a new IT platform, such as a more robust ERP or master data platform, so that the improvement of data quality can coincide with the implementation of the new platform.
Yet another problem is simplifying data architecture work by putting it in the hands of large, commercially available off-the-shelf platforms. To avoid tedious data management work, organizations rely on the data architecture presented to them by these solution platforms. The argument is that the platforms serve many kinds of customers and hence have probably already thought all data architecture aspects through. Later, it is not uncommon that these organizations discover the hard way that the complexity of their own business does not fit into this architecture. Costly adjustments are then made to the solution platforms, data lakes are installed to bridge the gaps, and analytical tools are bolted on to try to understand the data. If these mistakes are repeated across the organization, the end state will be characterized by disorder and disintegration. Data Fabric has been presented as the cure for these problems, but the question managers should ask themselves is whether it actually fixes them.
Well, a Data Fabric focused on connecting all the data in a business through one easily accessible platform will mostly amount to damage control. To increase data quality and reliability, it is not enough to simply connect the data with its surroundings. This is where a structured approach to cleansing, normalizing, and analyzing the data on the fly can make a real difference. However, even if we could attain a higher level of data quality by incorporating capabilities for on-the-fly data quality improvement such as AI, active metadata, and machine learning, it is essential to remain skeptical about the reliability of the output. Data-driven decision-making assumes that the data is as accurate as possible, but would you trust entirely machine-generated data when making critical decisions?
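As a toy sketch of what “on-the-fly” cleansing and normalization could look like (the rules and records are invented for illustration and are not a feature of any particular product), the pipeline below normalizes records as they stream past the consumer rather than in the source systems:

```python
# Toy sketch of on-the-fly cleansing: records are normalized and flagged as
# they stream past the consumer, without touching the source systems.
import re
from typing import Iterable, Iterator

def clean_stream(records: Iterable[dict]) -> Iterator[dict]:
    for rec in records:
        rec = dict(rec)  # never mutate the source record
        # Trim and re-case names (hypothetical rule).
        rec["name"] = rec.get("name", "").strip().title()
        # Normalize phone numbers to digits only (hypothetical rule).
        rec["phone"] = re.sub(r"\D", "", rec.get("phone", ""))
        # Flag rather than guess when a mandatory field is missing.
        rec["quality_flags"] = [] if rec.get("email") else ["missing_email"]
        yield rec

source = [
    {"name": "  alice svensson ", "phone": "+46 70-123 45 67", "email": "alice@example.com"},
    {"name": "BOB LIND",          "phone": "0701234568",       "email": ""},
]

for row in clean_stream(source):
    print(row)
```

The point of the flag in the last rule is exactly the skepticism raised above: where the pipeline cannot be confident, it should expose the problem rather than silently fabricate a value.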
Given this rationale, it’s probable that Data Fabric architectures will initially see implementations in domains characterized by high-volume public data flows. In these contexts, data can be interpreted and acted upon with a lower risk of serious consequences in the event of faulty actions. However, as algorithms improve, Data Fabric architectures are likely to ascend in the value chain, eventually becoming a substantial data source in executive decision-making.
Another common problem is establishing and operating a working data governance organization. Many consider data governance a tedious and time-consuming task, and it often does not receive adequate attention. Even when it is prioritized, it is frequently assigned to individuals with limited understanding of the potential business consequences of poor or inaccurate data. A sophisticated Data Fabric architecture could potentially address some of the problems caused by poor data governance by establishing a self-healing active metadata architecture. While it may seem utopian at the moment, accomplishing this feat would yield significant benefits.
When it comes to technologies related to Data Fabric, these can be classified into different categories:
Not surprisingly, the different vendors tend to fold the Data Fabric concept as much as possible into their own domains. In the table below we highlight some of the vendors in this space and their current offerings in the Data Fabric area.
Depending on how we view Data Fabrics, and what we need to get out of them, we could look at the following three components:
However, focusing merely on data discovery and automation of data pipelines would only address part of the issue outlined earlier. The common perspective is that a Data Fabric platform should encompass data discovery, a data catalog, and hybrid integration tools to effectively tackle these challenges.
Some providers offer AI capabilities to achieve the active metadata setup and argue that without them, a platform cannot really be considered a Data Fabric platform. The main argument is that if we know which data assets exist and what level of quality they hold, we can address data quality issues in real time, using AI and machine learning to provide a shield of automated data improvements that results in better quality than the underlying sources. Nevertheless, to make decisions based on this aggregated data, you must have a high level of trust in your algorithms.
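To make that trust question concrete, here is a deliberately simple sketch: an automated “quality shield” only applies a correction when its confidence clears a threshold, and otherwise escalates the value to a human. The reference list, the similarity measure and the 0.85 threshold are all invented for illustration; choosing that threshold is, in effect, the trust decision.

```python
# Toy example of an automated "quality shield": a value is only auto-corrected
# when the match confidence clears a threshold; otherwise it is escalated.
from difflib import SequenceMatcher, get_close_matches

REFERENCE_COUNTRIES = ["Sweden", "Norway", "Denmark", "Finland"]
AUTO_FIX_THRESHOLD = 0.85  # hypothetical cut-off; this choice encodes how much you trust the algorithm

def shield(value: str) -> tuple[str, str]:
    """Return the (possibly corrected) value and a note on what was done."""
    best = get_close_matches(value, REFERENCE_COUNTRIES, n=1, cutoff=0.0)[0]
    confidence = SequenceMatcher(None, value.lower(), best.lower()).ratio()
    if confidence >= AUTO_FIX_THRESHOLD:
        return best, f"auto-corrected (confidence {confidence:.2f})"
    return value, f"escalated to a data steward (confidence {confidence:.2f})"

for raw in ["Sweeden", "Swedne", "SE"]:
    print(raw, "->", shield(raw))
```

Running it shows the trade-off: obvious misspellings are fixed automatically, while borderline cases are routed to a person instead of being silently “improved”.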
Working with metadata has traditionally been a time-consuming and relatively static process, involving the analysis of required attributes and characteristics of data objects. However, in the future, advancements in technology could enable machines to interpret metadata dynamically, analyzing input data and utilizing AI algorithms to meta-tag data based on similarities with previous instances. This would undoubtedly represent a paradigm shift in data interpretation, capture, and analytics, unlocking the potential for advanced data handling in a fraction of the time compared to current methods.
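A highly simplified sketch of what such dynamic meta-tagging could look like is shown below, with a trivial string similarity standing in for what would realistically be an AI model trained on names, sample values and usage patterns. The tagged columns and tags are hypothetical.

```python
# Simplified sketch of dynamic meta-tagging: a new column inherits the tag of
# the most similar previously tagged column. A trivial string similarity
# stands in for a real model over names, values and usage patterns.
from difflib import SequenceMatcher

TAGGED_COLUMNS = {          # previously classified instances (hypothetical)
    "customer_email":  "PII/Contact",
    "cust_phone_nbr":  "PII/Contact",
    "invoice_amount":  "Finance/Amount",
    "order_total_sek": "Finance/Amount",
}

def suggest_tag(column_name: str) -> tuple[str, float]:
    """Return the tag of the most similar known column and the similarity score."""
    best_tag, best_score = "Unclassified", 0.0
    for known, tag in TAGGED_COLUMNS.items():
        score = SequenceMatcher(None, column_name.lower(), known.lower()).ratio()
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag, best_score

print(suggest_tag("customer_e_mail"))  # likely PII/Contact
print(suggest_tag("invoice_amt_eur"))  # likely Finance/Amount
```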
Some providers are also taking this one step further and are integrating AI tools to create a spoken, natural-language interface to the data discovery module, making the data even more accessible.
Implementing a seamless data layer on top of a large and scattered data landscape brings its fair share of challenges. Organizations with complex IT and data landscapes are likely the ones with the most to gain from investing in a Data Fabric architecture. On the other hand, implementing such a fabric in those landscapes could be an arduous and time-consuming process, increasing the risk of failure.
While legacy issues such as data accessibility and presenting data in a consumer-friendly format remain, modern hybrid platforms have built-in tools to address them.
If a company has successfully implemented technologies for data access and publication, the primary challenge moving forward will be ensuring the trustworthiness of the outcomes. It is tempting to see an enterprise-wide Data Fabric as a cure for a bad underlying structure. What you actually achieve by implementing a Data Fabric on top of this mess is a centralized mess where you can easily access the underlying bad data. The poor data quality becomes more visible than when it was confined to a backbone legacy environment, and business decisions are then likely to be taken based on incorrect data!
So, the real concern will be data quality. The most effective way to address it is to improve data quality at the source through processes such as data cleansing and enrichment. Furthermore, an understanding of metadata and data structures will be required. AI can assist in automating manual and time-consuming tasks, not only in tidying up data sources but also during real-time data consumption.
To summarize, we feel the following observations are noteworthy:
Data Fabric architecture will find its first successful and cost-saving implementations in organizations with a high volume of data exchange with the external world and where the quality level of that exchange is not critical, for example the distribution of reviews of hotels, products and restaurants to many different sites and platforms.
Hans Bergström, Mattias Gustrin & Eric Wallhoff