On March 1st and March 2nd, I partnered with IBM’s Technical Group to present “ten lessons learned from building a Data Mesh.” This article is a follow-up on the Q&A from March 1st. You can read the first part of the Q&A.
As a reminder, these are my personal views; they do not represent those of my employer, PayPal.
What is your view on building 360-style analytics, such as a 360-view of customers, a 360-view of the business, and so on, from a data mesh perspective? Does building such views/analytics conflict with domain ownership?
I do not think they conflict. I think domain ownership will help you create the proper data products that will become the large 360-degree solution. Data Mesh will ensure better data quality and SLAs.
How did PayPal address Federated Computational governance today?
We partnered with our enterprise data governance since the beginning of the project. We leverage their tools in our tooling. This is an ongoing process, and they are a fantastic partner on our journey.
Which parts are technology driven and which are org/process driven?
This is the topic of a book, not a Q&A… In short… tools and implementation are technology-driven, and so is building the data mesh tooling. Building the data products is heavily process-driven, with more and more tools coming to simplify the steps of the process. It is not a complex process, and Data Mesh helped describe it more than we originally expected.
So “document-based JSON” really only describes one type of NoSQL DBs.
Correct, this presentation is not about document stores; I just wanted to illustrate the variety.
Can you elaborate on the data architect role? What are their key responsibilities?
For us, the data architect helped define the data contract and continuously worked with the data engineers for new interoperable models. More recently, they helped us with SLA, data quality, best practices, etc.
As we’re talking about roles, is the CDO both in and out?
I kept the CDO on this slide as it is a bit of a mystery to me, and I want to have discussions with the audience. Is he in? Is he out? What is your opinion?
How about the CPO (Chief Privacy Officer), or is privacy (and associated regulation) part of the CDO remit?
Great comment. I associate privacy with security, and as you could see during the presentation, I barely spoke about security — on purpose.
Is there a reference source you like for the definition and originator of the Data Mesh concept?
The founder of the Data Mesh principles is Zhamak Dehghani. She wrote several papers and a great book that laid the foundation for Data Mesh. I keep track of her articles and essential videos, as well as other interesting resources, at /datamesh.
Comment, not a question – the hexagon graphic is used for process, not data storage/schema/flow. Unless you are Schema-less (schema-on-read), you still need to define data stores and schemas.
Zhamak Dehghani used the hexagon to describe the data product’s architecture (or data quantum). I was assuming that she leverages the hexagonal architecture per Alistair Cockburn’s original idea (https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)). I will have to ask her the next time we chat!
How does the Data Mesh concept differ from similar efforts in the past, like EDM (Enterprise Data Model) or MDM (Master Data Management)?
Data Mesh will help us achieve those goals more quickly as those EDM and MDM projects are usually slow, and the ROI starts showing only after deployment. The product approach of Data Mesh for its data products enables a product lifecycle mentality that will help get from a current state to an (end?) state like EDM through versioning. It also allows EDM to be versioned more efficiently and reduces time to market.
Most items under “data contract” are system-related, not data related, and would apply to any enterprise system.
There is a big part that is data related, specifically the schema, but you are right; hence my nickname for a data contract is a “computational brochure.”
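To make the split between system-related and data-related items concrete, here is a minimal sketch of what a data contract could look like in Python. This is not PayPal's format, and it is not a standard: every field name (`product`, `owner`, `latency_sla_hours`, and so on) and the `violations` helper are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Column:
    """One entry of the schema, the data-related part of the contract."""
    name: str
    type: type
    required: bool = True


@dataclass(frozen=True)
class DataContract:
    # All field names below are hypothetical, for illustration only.
    product: str              # system-related: which data product this is
    version: str              # system-related: contract version
    owner: str                # system-related: who is accountable
    latency_sla_hours: int    # system-related: a sample SLA item
    schema: Tuple[Column, ...] = ()  # data-related: the schema

    def violations(self, record: dict) -> List[str]:
        """Return the schema problems found in a single record."""
        problems = []
        for col in self.schema:
            value = record.get(col.name)
            if value is None:
                if col.required:
                    problems.append(f"missing required column: {col.name}")
            elif not isinstance(value, col.type):
                problems.append(f"wrong type for column: {col.name}")
        return problems


contract = DataContract(
    product="customer-profile",
    version="1.0.0",
    owner="customer-domain-team",
    latency_sla_hours=24,
    schema=(
        Column("customer_id", str),
        Column("signup_year", int),
        Column("nickname", str, required=False),
    ),
)

good = contract.violations({"customer_id": "c1", "signup_year": 2020})
bad = contract.violations({"customer_id": 1})
```

In this sketch, only `schema` is specific to data; the other fields would look at home in the description of any enterprise system, which is the point made in the comment above.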
How long will it take other large organizations to start adopting data mesh?
Many organizations are adopting it. PayPal is one of many. You can join the Data Mesh Learning community and discuss with peers.
You said “data contracts” are an industry standard. Where would I find such standards?
I am sorry if I said that because they are not. I wish they were, and I hope they will become one. You can find some examples online, but I don’t like them, so I will not advertise them. I am happy to entertain the idea of building a standard.
Efforts like this occur about every 5-10 years across the industry. They always run into organizational barriers like budget, ownership of coordinated efforts, cultural barriers, legacy technology integration, etc. How does the Data Mesh concept manage these issues?
Data Mesh is not at the five-year mark yet, so I am hoping for the best. However, I think that, as Data Mesh is not a “vendor solution” and tackles some of those issues as foundations, we have a better chance of getting things right this time. Ownership is one of the principles. Legacy technology is not affected and can be phased out incrementally thanks to product thinking. Cultural barriers will always be there, and it’s a leadership dilemma, but giving autonomy to a decentralized team and more purpose to a centralized team will help motivate people.
At PayPal, how do business users (non-tech) find the data they are looking for: search catalog and more?
We have invested in cataloging features directly at the Data Mesh level with a Google-like interface. People love it.
Does Avro address the data catalog requirement?
Our implementation is technology agnostic. The Avro schema is very interesting, but I do not think a storage metadata definition should become the enterprise standard for metadata.
Who should propose data mesh to the enterprise? CDO, data architect, etc.
I have been building data platforms for more than 15 years, either for my own companies, as a consultant, or, now, for major corporations. Every one of them had something special, and Data Mesh is finally unifying the approach and covering most (if not all) cases. Who proposes it varies from company to company. It can come from a business unit that wants to accelerate or from the enterprise level, where people see that they are losing control. In any situation, it starts with discussion and adopting agile principles.
What are the main challenges and technology necessary when meshing data products from different sources into valuable interoperable data?
The number of sources does not matter. What matters is bringing value, as you clearly state. The main challenge is defining your domain: something not too big or small. As for technology, we designed our Data Mesh to interconnect with any of our systems, so that’s the pattern I would highly recommend.
Data Scientists almost entirely use Jupyter and Python. That would simplify your slide a great deal. Many also are using cloud-native notebooks like SageMaker, and adding would be helpful.
Unfortunately, I do not know which slides you are referring to. However, I recommend that you keep listening to your users. If they use Jupyter (or SageMaker, or IBM Watson Studio) notebooks, yes, bring the power of Data Mesh to notebooks.
BI/Analytics is missing as a persona, and it’s a far bigger community than Data Scientists.
You are correct; they are missing. Our priority was not this persona. Things may evolve, and this is an important lesson while building a Data Mesh. It is a product, not a project.
What is your point of view on Data Mesh vs. Data Fabric, which are often presented as opposing?
Some people oppose carrots and oranges; however, they go perfectly together in juice. I would highly recommend this article on IBM’s website. I particularly like the last sentence of the “Data Mesh vs. Data Fabric” section, which I took the liberty to copy here.
In fact, the data fabric makes the data mesh better because it can automate key parts of the data mesh, such as creating data products faster, enforcing global governance, and making it easier to orchestrate the combination of multiple data products. (IBM, retrieved 2023-03-03)
So, is Data Mesh going to leave the CDOs without a job?
I had to reply with the following:
What modeling techniques are used to build a data product? Star Schema? Data Vault?
Don’t care. Don’t want to know. This is the data architect’s decision based on the use case. Data Mesh should adapt to the implementation, not the other way around.
What does a data quantum map to in a relational DB? Is it table-level? DB-level? Line-of-Business?
A data quantum maps to a domain. It can contain multiple databases, tables, and other resources based on the complexity of the domain. I try to keep them simple.
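As a rough illustration of that mapping, the sketch below models a data quantum as one domain that groups several resources. The class names, the `kind` values, and the sample resource names are all hypothetical; this only shows the one-domain-to-many-resources relationship, not an actual implementation.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Resource:
    kind: str  # hypothetical values: "database", "table", "file", ...
    name: str


@dataclass
class DataQuantum:
    """One data quantum maps to one domain and holds many resources."""
    domain: str
    resources: List[Resource] = field(default_factory=list)

    def add(self, kind: str, name: str) -> None:
        self.resources.append(Resource(kind, name))

    def count_by_kind(self) -> Dict[str, int]:
        """Summarize how many resources of each kind the quantum holds."""
        return dict(Counter(r.kind for r in self.resources))


quantum = DataQuantum(domain="customer")
quantum.add("database", "customer_db")
quantum.add("table", "customer_db.profiles")
quantum.add("table", "customer_db.addresses")

summary = quantum.count_by_kind()
```

Keeping the resource list short per quantum is one way to follow the "keep them simple" advice: if `count_by_kind` starts returning dozens of databases, the domain is probably too big.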
Which vendor tools have you had some experience with in implementing the data mesh? I’ve heard of Collibra, but I’m sure there are others.
Zhamak Dehghani created her company called Nextdata, but we still have to see what she comes up with. I do not think there is a solution now that can pretend to be a “Data Mesh in a box” (except this one and this one, of course!). However, you will have to interface with the existing applications in your company: Collibra is one mainly for cataloging, and so is IBM Watson Knowledge Catalog or Acryl Data Hub. But you will have to cover more than just cataloging. Cataloging is primarily for the self-service principle of Data Mesh.
What is the difference in software tools and platforms between the Data Product and Data Mesh layer? What are some of the tools used at the Data Mesh layer?
Following Dehghani’s documentation, we call them experience planes, not layers. At the data product experience plane, we leverage tools to manage a single data product, while at the mesh experience plane, we manage the mesh: so you can think that the catalog sits in the mesh experience plane, as an example.
What is the highest throughput of TB data that you could process with those principles?
Last I checked, we managed something like 3.14 Petabytes per millisecond. I am just kidding. Data Mesh, at least the way we implemented it, will not slow the users down. If your data is in text files on an Intel 386sx with ESDI drives, you will not have the same performance as a current database service leveraging a cluster of Xeon processors and NVMe drives. Data Mesh will not increase or decrease the performance.
Who owns the Data Architecture at the Data Quantum level? That is not the role of a DBA or Data Engineer.
The data architecture is owned by the data architect.
Hope skiing was fun yesterday!
It was, thanks… Here’s a picture!