As part of the Data Mesh Learning Community, Eric Broda invited Laveena Kewlani, Kruthika Potlapally, and me to discuss the implementation of Data Mesh at PayPal. As expected, the session went longer than scheduled, and some questions remained open. As with the previous Q&A sessions ([#1] and [#2]), here is an attempt to answer them.
The recording of the session and the initial part of the Q&A is available on YouTube. In the video and in this article, each participant expressed their views, not those of their employer.
What is the definition of a data product? A container?
My most straightforward definition is that a data product is a data set with a data contract. It can be a container, but it is primarily a logical construct: I see the data contract as the interface between the logical and physical world.
What is the definition of a component? Is it a library, sidecar, or a service connected to the database?
For us, a component follows the Oxford dictionary definition: “a part or element of a larger whole, especially a part of a machine or vehicle.” If your question is about how we implement components, then the answer is all three.
Do the data products get built as part of a pipeline/CICD?
Can you share how you manage DevOps for each data product (new products vs. inefficiencies)?
I am not sure I fully understand your question. Could you rephrase it in the comments?
When you speak about a data contract, is it a written SLA or a standard understanding of data products?
The data contract is divided into several sections; one is the service-level agreements (SLA). It also contains demographics, dataset & schema, data quality, pricing, stakeholders, roles, and other properties. PayPal recently decided to open source its data contract template, which is now available on GitHub.
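To make the list of sections more concrete, here is a minimal Python sketch of a contract as a logical structure. The section names come from the answer above; the class name, field types, and the `missing_sections` helper are assumptions made for illustration and do not reflect the actual template published on GitHub.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataContract:
    """Hypothetical, simplified contract; not the actual PayPal template."""

    # One attribute per section named in the answer above.
    demographics: dict[str, Any] = field(default_factory=dict)
    dataset_and_schema: list[dict[str, Any]] = field(default_factory=list)
    data_quality: list[dict[str, Any]] = field(default_factory=list)
    pricing: dict[str, Any] = field(default_factory=dict)
    stakeholders: list[dict[str, Any]] = field(default_factory=list)
    roles: list[dict[str, Any]] = field(default_factory=list)
    service_level_agreements: list[dict[str, Any]] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of the sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative values only.
contract = DataContract(
    demographics={"name": "example-data-product", "version": "0.1.0"},
    roles=[{"role": "reader", "access": "read"}],
)
print(contract.missing_sections())
```

A structure like this can back a simple completeness check before a data product is published, but what counts as "complete" is a governance decision, not something the sketch decides.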
How do users interact with the data mesh? Did you define a unified data protocol?
We made it as streamlined as you would hope for your data consumers based on the tool they use. Are they using Notebooks? They can continue to use Notebooks. Are they using a BI tool? They can continue to use their BI tool (we have some optimization for some BI tools).
What tool are you using for data transformations?
Whatever tool you want. I often suggest Perl to show flexibility (and as a bit of a joke). I do not think that your data mesh implementation should dictate the technology used in your internal pipelines.
Do you use Dataplex? Are your data contracts similar to data cards in Google’s data mesh?
We do not use Google Dataplex. Given the overall complexity of a Data Mesh, I do not see a commercial or open source off-the-shelf (OTS) offering yet. I am constantly educating myself on new products and solutions, and I am happy to be proven wrong.
Are there common tests the code must go through, or is everything defined by the data product context?
The inner pipeline should implement all the testing (common or not) the data should go through. Some tests can be done by the data product itself, as it embeds data quality in the data contract.
Indeed, it is difficult to understand how it is really implemented.
I think the idea here is to share our experience, not to reveal PayPal’s IP…
In a decentralized environment, how are common standards being managed – are there domain-based standards being managed by an enablement team?
Data Mesh enables decentralization: it does not mean that everything is decentralized. Policies and standards from your central data governance team can and should accommodate the business units’ needs. As an example, it is ok to have a centrally defined retention period, but local business units may be legally obliged to keep their data for a longer or a shorter period of time.
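The retention example can be sketched as a central default with local, legally motivated overrides. This is only an illustration: the domain names, the durations, and the `effective_retention_days` function are all hypothetical and say nothing about how PayPal implements it.

```python
# Hypothetical central default set by the data governance team.
CENTRAL_RETENTION_DAYS = 365

# Hypothetical local overrides; a unit may need a longer or a shorter period.
LOCAL_LEGAL_OVERRIDES = {
    "payments-eu": 3650,  # assumed legal obligation to keep data longer
    "marketing": 90,      # assumed obligation to delete data sooner
}


def effective_retention_days(
    domain: str,
    central_days: int = CENTRAL_RETENTION_DAYS,
    overrides: dict[str, int] = LOCAL_LEGAL_OVERRIDES,
) -> int:
    """Return the local legal period if one exists, else the central default."""
    return overrides.get(domain, central_days)


print(effective_retention_days("marketing"))  # local override applies
print(effective_retention_days("risk"))       # falls back to the central default
```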
Also, how is domain-based data discovery being managed at the Producers?
When a data product is published, it is discoverable.
“We became good friends with many teams” — key success indicator (along with product thinking and leveraging the company’s cloud-native maturity).
It is not only about cloud adoption maturity: a data mesh impacts everything, and you need to partner with many services and departments in the company.
It sounds like you have progressed significantly in your data mesh journey. Congratulations on that, especially automating as much as you have! May I ask, how long you have been on this journey?
Thanks! We started this project in December 2021, aligned to the Data Mesh vision in March 2022, delivered our first POC in May 2022, and went into production in December 2022. We just released our second production version in Q1 of 2023.
What did you use to create a catalog for your community?
For this project, we built our custom cataloging solution as it allowed us to capitalize on the richness of the data contract. It may not stay this way in the future.
Are the pipelines exclusively managed by the platform team, or can domain teams manage their own pipelines?
Both teams can manage pipelines; it depends on the context. Domain teams have the domain expertise, but they may not have the knowledge of processing or deployment at scale. As is often the case, it is a partnership.
Are all of your existing data products currently domain-oriented? Or do you have consumer-oriented/aggregated data products existing on your platform?
They are all domain-oriented. The next good question would be to define a domain…
Do you have a maturity framework for your domain teams or data mesh maturity?
We do not have a data mesh maturity model, but I am really thinking of establishing one. I know that there is one out there.
Can you give examples of domains/subdomains in your business?
Unfortunately, I cannot share anything specific about PayPal. You can think about marketing as a domain and advertisement as a subdomain.
Is Data Mesh being used externally for revenue generation purposes?
This really is groundbreaking, a great success story. Innovator status for PayPal.
Thanks! It is part of our excellent company culture.
An end-user selects which data products they want to use; how is accessibility/sharing managed?
Based on roles. You have the role; you can access it. If you don’t, you will have to request and justify your need. Data Mesh does not replace this but guides the user through it.
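As a rough sketch of that flow (not our implementation), a role check either grants access or points the user to the request-and-justify step. The names `AccessDecision` and `check_access` and the role labels are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessDecision:
    granted: bool
    next_step: str


def check_access(user_roles: set[str], required_roles: set[str]) -> AccessDecision:
    """Grant access when the user holds at least one of the required roles."""
    if user_roles & required_roles:
        return AccessDecision(True, "access the data product")
    # No matching role: the user is guided to the existing request process.
    return AccessDecision(False, "request access and justify the need")


# Hypothetical roles for illustration.
print(check_access({"marketing-analyst"}, {"marketing-analyst", "data-steward"}))
print(check_access({"finance-analyst"}, {"marketing-analyst", "data-steward"}))
```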
Accolades and comments in the chat (keep sending them, they are good for our ego):
- This really is groundbreaking, a great success story: innovator status for PayPal.
- I have to agree.
- Thank you ever so much for this chat; it’s been great to see somebody further on the journey and feel the benefits. Kudos to the team @Paypal and thanks for hosting @Eric.
- Really helpful conversation, thank you!
- Thank you so much — looking forward to the open sourcing of data contract, and congrats on your accomplishments to date!
- Thank you for the session
- Very helpful thank you so much
- Be great to do an update in a year!
- Thanks JGP, Kruthika, Laveena. Eric you’ve been a great host.
Thanks to everyone for attending our discussion. My sincerest kudos to Laveena and Kruthika as it was their first panel.