On March 1st, I partnered with IBM’s Technical Group to present “ten lessons learned from building a Data Mesh.” The one-hour event lasted a little longer and, as expected, offered a Q&A session which we did not finish, so I offered to take the questions offline and answer them here.
As a reminder, these are my personal views; they do not engage my employer, PayPal.
How about data virtualization? If you have different Data Hubs with different data models, how do you integrate them?
As illustrated in the next figure, you can use data virtualization pointing to various physical data stores. Your onboarding pipeline can be “virtual” or at least leverage virtualized data stores. You will gain in data freshness by reducing latency, but you may be limited in the data transformations you can perform towards your interoperable model.
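To make the trade-off concrete, here is a minimal sketch of a "virtual" onboarding step in Python. It is not our implementation: `VirtualView`, the two hubs, and the column names are all hypothetical. Rows are read from the physical stores at query time and mapped into one interoperable model, so nothing is copied.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]


class VirtualView:
    """Hypothetical virtual view: maps several stores to one model at read time."""

    def __init__(self) -> None:
        self._sources: List[Tuple[Callable[[], Iterable[Row]], Dict[str, str]]] = []

    def add_source(self, fetch: Callable[[], Iterable[Row]], mapping: Dict[str, str]) -> None:
        # mapping: source column name -> interoperable column name
        self._sources.append((fetch, mapping))

    def rows(self) -> Iterator[Row]:
        # Nothing is materialized: each call reads the stores as they are now.
        for fetch, mapping in self._sources:
            for row in fetch():
                yield {target: row[source] for source, target in mapping.items()}


# Two hypothetical Data Hubs with different data models (in-memory stand-ins).
hub_a = [{"cust_id": 1, "country": "US"}]
hub_b = [{"customerNumber": 2, "countryCode": "FR"}]

view = VirtualView()
view.add_source(lambda: hub_a, {"cust_id": "customer_id", "country": "country"})
view.add_source(lambda: hub_b, {"customerNumber": "customer_id", "countryCode": "country"})

for row in view.rows():
    print(row)
```

The sketch only renames columns, which illustrates the limitation: anything heavier than a simple mapping becomes harder to express once the pipeline is purely virtual.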
What tools do you use for data cataloging?
We decided that, for Data Mesh, we would have our own cataloging solution, as we wanted to leverage the richness of the data contract in our discovery process. Later, we will integrate with other cataloging solutions within the ecosystem.
Gartner rated data mesh “obsolete before plateau” and favors data fabric.
I am not going to comment on Gartner’s rating. I do not see data mesh and data fabric as opposed; they should work together [note: more on that in the second part of this article].
Which tool do you use for Data Observability?
We are using our own solutions.
What is the engineering profile of the Data Mesh builders at PayPal? The ones who are building & operating the data platform. Are they SWEs or Data Engineers? How big is the team?
I have covered this topic in the Data Engineering Podcast with Tobias Macey and a bit in the panel discussion organized by Data Mesh Learning, called: Data Mesh Operating Model – The key to a successful data mesh journey.
In short, the configuration evolves depending on the phase of your project.
What would (if any) commonalities exist between the various data quanta?
There are an awful lot of commonalities between the data quanta. Everything that is not data related is common. Dictionary, observability, and control (basically active & passive metadata) come to mind. If you are familiar with JDBC drivers, imagine that the quantum behaves a bit like a JDBC driver on steroids.
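As a rough illustration of that idea, the following Python sketch puts everything that is not data in a shared base class. Each quantum only supplies its own data. The class names, the methods (`describe`, `observe`, `read`), and the metrics are assumptions made for this example, not the actual design.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List


class DataQuantum(ABC):
    """Hypothetical base class holding what is common to every quantum."""

    def __init__(self, name: str, dictionary: Dict[str, str]) -> None:
        self.name = name
        self._dictionary = dictionary      # field name -> description
        self._metrics: Dict[str, int] = {}  # passive metadata, updated on reads

    def describe(self) -> Dict[str, str]:
        # Dictionary: the same call works on any quantum.
        return dict(self._dictionary)

    def observe(self) -> Dict[str, int]:
        # Observability: usage counters collected by the shared code.
        return dict(self._metrics)

    def read(self) -> List[Dict[str, Any]]:
        # Common access path: fetch the data, then record what was served.
        rows = list(self._fetch())
        self._metrics["reads"] = self._metrics.get("reads", 0) + 1
        self._metrics["rows_served"] = self._metrics.get("rows_served", 0) + len(rows)
        return rows

    @abstractmethod
    def _fetch(self) -> Iterable[Dict[str, Any]]:
        """The only data-specific part a concrete quantum has to provide."""


class PaymentsQuantum(DataQuantum):
    """Hypothetical quantum serving a small hard-coded dataset."""

    def _fetch(self) -> Iterable[Dict[str, Any]]:
        return [
            {"payment_id": 1, "amount": 10.0},
            {"payment_id": 2, "amount": 25.5},
        ]


quantum = PaymentsQuantum(
    "payments",
    {"payment_id": "Unique payment identifier", "amount": "Payment amount"},
)
rows = quantum.read()
print(quantum.describe())
print(quantum.observe())
```

A consumer talks to every quantum through the same few calls, much as an application talks to any database through the same driver interface.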
What would be the ideal (recommended) tech stacks for shaping data mesh?
I mentioned Java a thousand times in the talk. I’ll try to be less subtle in the future. Everything derives from there. We would need to drill down together, but it is really dependent on your existing infrastructure.
[Data Mesh] is an interesting combination of data engineering and software engineering. We’ve been working on stuff like this in the agile data community for a long time. Challenge was always to get the developers to take data seriously and the data folks to take software engineering seriously.
It’s working for us; I may have a larger whip. We have instilled a customer-supplier culture. Our data engineering teams are customers of the software engineering teams. As such, software engineers are motivated by satisfying their data engineer counterparts. The data engineers are invited to build and discuss features, so the relationship goes both ways.
Are cloud-based databases good for handling huge data? I heard that the I/O process is worse.
It really depends on your use cases. The context of our Data Mesh is analytics and with the expected growth in data, cloud-equivalent architecting is the only viable choice. You are further away from the silicon, but you compensate with other benefits like fewer data movements.
Data contract + quantum dictionary is akin to metadata management?
In a way, yes, I never said you had to reinvent everything… The following illustration shows how you can leverage the different implementation planes, based on Zhamak Dehghani’s three experience planes.
Additionally, I’d like to hear reflections on attribute-based (ABAC) or Next-Generation (NGAC) access control applied in practice to data meshes. Reference: https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-204B.pdf.
As you can imagine, access control is a sensitive topic that I will not elaborate much on. Data Mesh leverages the security system in place. We did not feel that the Data Mesh should own the security, but rather that it should indicate the security aspects involved with the data.
How can a non-technical person learn about these concepts on a high level if they want to lead a team?
I would highly recommend “Data Mesh for all ages.” It may look silly, but it gives a serious introduction to the principles of a Data Mesh. My Medium article has also been extremely well regarded by readers and practitioners, so it may help too. I am always happy to exchange on LinkedIn.
What do you think about the Microsoft Cloud Scale Analytics, which is an implementation of Data Mesh?
I will not comment on any product, whether Open Source or commercial. I would challenge the vendor. How do they apply the four principles? Can it leverage my existing infrastructure (and future as well)? In the case of Microsoft, is it available only on Azure? It may be fine for you… for now. I have seen a few vendors trying to jump on the Data Mesh bandwagon with their own definitions. Unfortunately, they primarily muddy the water. Return to the basics and bring back the four principles.
If you want to read more of the Q&A, the second part of this article is available here.