Every five years, a small group of leaders in the data management research community gets together to do a self-assessment --- what are we doing right as a community, and how can we improve? What are the important research problems we should be working on in the next decade? How can we communicate the importance of our work to key contributors --- funding agencies, industry analysts, venture capitalists, and corporate boards? What are the key challenges we must overcome? The reports generated as the outcome of the three most recent meetings are from 2018, 2013, and 2008.

Five years is a long time in computer science. The field is developing so rapidly that this meeting ought to look completely different each time it is held. However, the running joke is that there’s one topic that comes up every time --- literally every time the meeting has been held for the past 35 years: data integration (or “information integration,” as the term appeared in previous reports). Usually it comes up in the context of a proposal for a grand challenge for the community, one that has the potential to unify our voices as we make our case to funding agencies that more investment is necessary in our field. Companies invest huge amounts of money in data integration every year --- imagine the impact we could have if we could “solve” that problem and make integration easy for enterprises and scientists everywhere!

In truth, the reason why data integration keeps coming up as a “grand challenge” is that we have been unable to make much headway in the past 35 years. Data integration is a painstaking process that requires much human intervention. Artificial intelligence and machine learning can only serve as an aid for what ultimately requires human intelligence --- the ability to discern the semantics of disparate datasets and use external knowledge to understand that two different names or values refer to the same real-world entity.
Data integration is hard, slow, and expensive, and few people want to do it. Yet it must be done --- otherwise insights generated from the data are based on a narrow vision that misses the global picture. The question is: who should do it?

Data integration: how it’s done today and what is different in a Data Mesh

Today data integration is typically done in a centralized fashion. Data from different domains are brought together into a centralized data warehouse or data lake. Because the source data sets being combined were created independently of each other, they need to be integrated, which is typically done during a data cleaning and transformation process that occurs prior to loading the data into the centralized data store. Primary identifiers (also known as “master data,” though the industry is moving towards more inclusive language) are chosen for key real-world entities (e.g. a primary list of customers, products, suppliers, etc.), and an attempt is made to transform the input data sets to use these primary identifiers and thereby become integratable with each other.

Recently the Data Mesh has come along, which advocates eliminating centralized data stores and managing data across an organization in a distributed fashion. An organization is partitioned into a set of domains that control and maintain expertise over the data within that domain, and that make data from that domain available to other domains within the enterprise via self-contained data products. When there is no centralized data store, as in the Data Mesh, the impetus to perform data integration at load time into the centralized store disappears. If the Data Mesh is implemented improperly by the enterprise, then once that impetus goes away, data integration will simply not happen --- especially since it is a painful process that nobody wants to do. Such an outcome would have disastrous effects on the ability to generate global and unified insights from data.
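To make the centralized, primary-identifier approach described above a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and chosen purely for illustration: the customer list, the alias table, the field names, and the helper functions `normalize`, `to_primary_id`, and `integrate` are assumptions, not part of any real product or of the process used by any particular organization. Real integration pipelines involve far more than a lookup table, and much of the work lies in the human curation of that table.

```python
# Hypothetical primary list of customers: primary identifier -> canonical name.
PRIMARY_CUSTOMERS = {
    "C001": "Acme Corporation",
    "C002": "Globex Inc",
}

# Hypothetical, human-curated mapping from normalized source names to primary identifiers.
ALIASES = {
    "acme corp": "C001",
    "acme corporation": "C001",
    "globex": "C002",
    "globex inc": "C002",
}


def normalize(name):
    """Lowercase the name, drop punctuation, and collapse whitespace."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def to_primary_id(name):
    """Return the primary identifier for a source name, or None if it is unknown."""
    return ALIASES.get(normalize(name))


def integrate(records, name_field):
    """Split records into those mapped to a primary identifier and those left for human review."""
    resolved, unresolved = [], []
    for rec in records:
        pid = to_primary_id(rec[name_field])
        if pid is None:
            unresolved.append(rec)
        else:
            resolved.append(
                {**rec, "customer_id": pid, "customer_name": PRIMARY_CUSTOMERS[pid]}
            )
    return resolved, unresolved


# Two records from a made-up source system that names customers inconsistently.
online_orders = [
    {"customer": "ACME Corp.", "total": 120},
    {"customer": "Initech", "total": 40},
]
resolved, unresolved = integrate(online_orders, "customer")
```

In this sketch the unmatched "Initech" record ends up in `unresolved`, which is the part that still needs a person who understands the data.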
There is thus a great debate brewing: does the Data Mesh result in a set of unintegrated data silos? And if someone does attempt to perform integration, is the result typically better or worse in a Data Mesh vs. traditional centralized data management? Let’s make the case for both sides of this debate, starting with the negative viewpoint:

Data Mesh reduces the scope and quality of data integration across an enterprise

Domains in a Data Mesh are typically mapped to business units (e.g. online business vs. bricks and mortar business). Business units are charged with optimizing their part of the business, and have little concern for the "global good". Thus, they have little incentive to spend time and money integrating the data products that they generate with the data products generated by other domains located in other business units. And since there is no centralized team performing data integration in the Data Mesh, the result might be that data integration does not happen at all, and each data product in the mesh turns into a separate silo with limited explicit relationships to other datasets across the enterprise.

Furthermore, even if someone were to make the effort of integrating two datasets, new data products can show up (at any point in time) in the Data Mesh that use a totally different set of keys and terms to refer to the same real world entities. If those new data products become popular, then the effort performed to integrate the older data sets is wasted, since they refer to those same entities using outdated keys and terms.

The end result of a Data Mesh would thus be a data mess of silos everywhere and a data miss at query time, when a relevant data mass is omitted from a query result due to lack of integration.

No, it’s the opposite! Data Mesh improves the quality of data integration across an enterprise!

Data integration so often comes down to context.
Michael Jordan the famous basketball player is almost never confused with Michael Jordan the famous computer scientist when context is taken into account. If the data is associated with a publication or a patent, the correct identifier is probably the one associated with the computer scientist. If it involves an athletic or entertainment context, the correct identifier is probably the one associated with the basketball player. Sometimes the process requires a little more subtlety.

Who are the individuals within an organization who understand the context of a dataset best? They are the ones who are charged with generating it or extracting it from a data source. In other words, they are the domain experts in the Data Mesh who create the data product. Who are the individuals within an organization who understand the context the least? Quite possibly, it is the IT team that is charged with running the extraction jobs that load data into a centralized data store. Centralized IT people are absolutely the wrong people to be doing data integration. Hired consultants and external integration teams, who have even less of an understanding of the business, are not the solution either. The domain experts are the ones who should be doing data integration!

Furthermore, as data proceeds through a pipeline, source information and context are often lost as the data is transformed in each step. Waiting until later stages of the pipeline to attempt to integrate the data with other data sets only makes the data integration task more challenging. The best time to attach global identifiers to data is often as close to the source as possible, before context is lost. Once again, the people working at the data source --- the ones creating the source-side data products --- are the ones most capable of doing data integration.

Therefore, the Data Mesh has the potential to substantially increase the quality of data integration across an enterprise.
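The Michael Jordan example can be sketched in a few lines of Python. This is only an illustration of the idea that context picks the right identifier, not a real entity-resolution technique: the identifiers, the keyword sets, and the `resolve` function are all hypothetical, and the keywords are taken directly from the example above. Note that the sketch deliberately gives up when the context is missing or points both ways, which is exactly where someone who knows the domain has to step in.

```python
# Hypothetical global identifiers, each with context keywords that point to it.
CANDIDATES = {
    "person:michael-jordan-basketball": {"athletic", "entertainment"},
    "person:michael-jordan-computer-science": {"publication", "patent"},
}


def resolve(context_terms):
    """Pick the candidate whose keywords overlap most with the record's context.

    Returns None when there is no overlap at all or when two candidates tie,
    signalling that a human with domain knowledge needs to decide.
    """
    terms = {term.lower() for term in context_terms}
    scores = {cid: len(terms & keywords) for cid, keywords in CANDIDATES.items()}
    best = max(scores.values())
    winners = [cid for cid, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


# A record that still carries its source context resolves cleanly ...
with_context = resolve(["Patent", "2019"])
# ... while one whose context was lost along the pipeline does not.
without_context = resolve([])
```

The second call is the point of the argument: once the context has been stripped away downstream, no amount of code can recover which Michael Jordan was meant.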
Though data integration will still require human effort to support and supplement the automated techniques, it will be done by the humans who understand the data and context the best and are most capable of doing a good job. Furthermore, they are performing this effort at the point in the data pipeline when effort by any human will be most successful --- before context is lost.

Another reason why the Data Mesh may increase the quality of data integration is that it is a fundamentally more scalable approach to data management. Distributing the effort across an array of different domains scales up as you add more domains to an enterprise. In contrast, it is much harder for a centralized data integration team to scale up as an organization, or the amount of data it manages, grows.

Who wins the debate?

Both sides of the debate as presented thus far have valid points. There is indeed a risk of reduced impetus and incentives to integrate data in a Data Mesh, so it is very possible for a Data Mesh to result in much worse data integration across the enterprise. On the other hand, it is also true that moving to a bottom-up, distributed data integration process has the potential to improve the overall quality of the data integration result, because the work is done when the context and the experts are most available.

Ultimately, the leaders within an organization implementing a Data Mesh are going to settle this debate. Can they replace the impetus to integrate data caused by a centralized data management system with new incentives that will create an impetus to perform data integration in a bottom-up fashion?

At first glance, it may seem that creating these incentives will be a challenge. The effort involved in performing data integration increases the value of a dataset. But how is it possible to measure this increase in value in order to create an incentive system that provides a reward commensurate with the effort and increased value involved?
Fortunately, creating these incentives will likely be easier than it first appears. At the end of the day, not only is integrated data more valuable to an organization, but it will also usually be more popular. If dataset A is fully integrated with dataset B, a client that would usually only access dataset A may now drag along dataset B into a data analysis task, since the prior integration effort makes it convenient to include the extra data that can be found in dataset B. And vice versa: clients that would previously only access dataset B may now drag along dataset A.

Therefore, although creating appropriate incentives around increasing the value of data is hard to accomplish, because the increase in value is difficult to assess quantitatively, creating incentives using data popularity as a proxy is a viable way forward. Teams that create data products that are frequently accessed should be rewarded for their efforts in creating these products. Some organizations may prefer to create these rewards in the form of monetary bonuses. Others may prefer accolades. Others may implement budget or power increases. The particular format of the reward depends on the culture of the organization and the wisdom of its leaders, but as long as some sort of currency is attached to successful data products, there will be a natural push on the part of data product creators to ensure that their product integrates with other existing products.

After 35 years of failing to introduce a new technology that can solve the data integration problem, it is time for a shift in approach. The Data Mesh is certainly not a silver bullet. For an organization without strong leadership and without the right architectural infrastructure to implement the vision, the Data Mesh may do more harm than good.
Nonetheless, if done correctly, with appropriate investment in enabling software and incentives for the data product creators, the Data Mesh might actually lead to a vast improvement in quality and reduction in cost of data integration across an enterprise.
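As a closing illustration of the popularity proxy discussed earlier, the following Python sketch counts how often each data product is accessed and credits the owning team. The product names, the team names, the shape of the query log, and the functions `popularity` and `team_scores` are all hypothetical; a real organization would derive such numbers from whatever access logging its own infrastructure provides, and would need to decide for itself how to turn them into rewards.

```python
from collections import Counter

# Hypothetical mapping from data product to the domain team that owns it.
OWNERS = {
    "dataset_a": "online-team",
    "dataset_b": "stores-team",
}


def popularity(query_log):
    """Count, per data product, the number of queries that touched it."""
    counts = Counter()
    for products in query_log:
        counts.update(set(products))  # count each product once per query
    return counts


def team_scores(query_log):
    """Credit each owning team with the accesses to its data products."""
    scores = Counter()
    for product, accesses in popularity(query_log).items():
        scores[OWNERS[product]] += accesses
    return scores


# Before integration: clients query dataset A or dataset B, never both.
before = [["dataset_a"], ["dataset_a"], ["dataset_b"]]
# After integration: the same clients drag the other dataset along.
after = [["dataset_a", "dataset_b"], ["dataset_a", "dataset_b"], ["dataset_b", "dataset_a"]]

before_scores = team_scores(before)
after_scores = team_scores(after)
```

In this made-up log, both teams see their access counts rise after integration, which is the effect the incentive is meant to reward.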