I keep getting questions about data pipelines, and data pipelines in Data Mesh is a topic I should tackle. So… is the data pipeline the root of all evil?
As I continue working on and advocating for Data Mesh, I keep receiving great questions from people at conferences, internally at PayPal, or randomly on LinkedIn or in the Data Mesh Learning community. Please keep sending them, as I love a challenge and a good discussion. The role of a data pipeline is an important one. I will update this article based on your feedback.
Short answer: no. Long answer: yes. Yes has three letters, making it a longer answer than no with two letters. Note that this would not work in French or Spanish and would be very confusing in German. Nevertheless, let’s try to answer it seriously.
The data pipeline in traditional data engineering
In traditional data engineering, everything revolves around the extract, transform, and load (ETL) processes. You have data sources, you have a target, and you connect the two. More advanced projects try to design the target to accommodate multiple consumer use cases.
The problem in this situation is that the data pipeline:
- Takes all the room: I have seen many cases where the pipeline is so important that the engineers do not implement data quality checks, the kind of thing you would consider a baseline in software engineering (and I am not even talking about TDD).
- Works in isolation: it can be implemented using any technology, and it has barely any guardrails.
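To make the first point concrete, here is a minimal sketch in Python of what a baseline quality gate inside an ETL pipeline could look like. Everything in it is hypothetical: the in-memory source, the function names, and the two rules (a non-null id and a non-negative amount) are assumptions chosen for illustration, not a reference implementation.

```python
# Hypothetical sketch: a tiny ETL pipeline with a data quality gate.

def extract() -> list[dict]:
    # In-memory records standing in for a real source system (assumption).
    return [
        {"id": 1, "amount": 10.0},
        {"id": None, "amount": 5.0},   # fails: missing id
        {"id": 3, "amount": -2.0},     # fails: negative amount
    ]


def is_valid(record: dict) -> bool:
    # Illustrative quality rules only: id must be set, amount must be >= 0.
    amount = record.get("amount")
    return (
        record.get("id") is not None
        and isinstance(amount, (int, float))
        and amount >= 0
    )


def transform(records: list[dict]) -> tuple[list[dict], list[dict]]:
    # Split records into those that pass the quality rules and those that do not.
    valid = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]
    return valid, rejected


def load(records: list[dict], target: list[dict]) -> None:
    # The target is a plain list standing in for a real sink (assumption).
    target.extend(records)


def run_pipeline(target: list[dict]) -> dict:
    valid, rejected = transform(extract())
    load(valid, target)
    # Report what happened instead of silently dropping bad records.
    return {"loaded": len(valid), "rejected": len(rejected)}
```

Calling run_pipeline with an empty list loads only the records that pass the rules and returns the counts of loaded and rejected records, so the quality outcome is visible to whoever operates the pipeline.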
Here comes Data Mesh
In data mesh, data pipelines are critical to the proper operation of the data quantum (or data product). However, as you can see in the figure below, the data pipeline is an integral part of a whole.
The control API supervises the data pipeline. The pipeline delivers data to the interoperable model, which is exposed by the observability and dictionary services. I am assuming a scenario where you would need an ETL process and not data virtualization (but you can see that it would work in both scenarios).
In more advanced data mesh implementations, the pipeline is observed and must obey precise reporting and logging rules. In this scenario, your observability API should offer a way to introspect the pipeline.
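As an illustration of what introspecting the pipeline could mean, here is a small Python sketch. It assumes a pipeline made of named steps that each take and return a list of rows; the ObservedPipeline class, the StepReport structure, and the introspect method are all hypothetical names, not part of any data mesh standard or product.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StepReport:
    name: str
    status: str        # "ok" or "failed"
    rows_out: int
    duration_s: float


@dataclass
class ObservedPipeline:
    # Each step is a (name, function) pair; the function maps rows to rows.
    steps: list[tuple[str, Callable[[list], list]]]
    reports: list[StepReport] = field(default_factory=list)

    def run(self, rows: list) -> list:
        for name, step in self.steps:
            started = time.perf_counter()
            try:
                rows = step(rows)
                status = "ok"
            except Exception:
                rows = []
                status = "failed"
            elapsed = time.perf_counter() - started
            self.reports.append(StepReport(name, status, len(rows), elapsed))
            if status == "failed":
                break  # stop at the first failing step
        return rows

    def introspect(self) -> dict:
        # What a hypothetical observability endpoint could return.
        return {
            "healthy": all(r.status == "ok" for r in self.reports),
            "steps": [
                {"name": r.name, "status": r.status, "rows_out": r.rows_out}
                for r in self.reports
            ],
        }
```

The design choice here is that the pipeline does not decide what to report: every step is wrapped the same way, so reporting is uniform and the pipeline cannot opt out of it.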
Finally, the data contract governs how the pipeline should behave.
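Here is one way to picture a contract governing a pipeline, as a Python sketch. The DataContract class, its fields, and the publish function are assumptions made for this example; a real data contract would cover much more (service levels, ownership, semantics, and so on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    # Hypothetical, deliberately tiny contract: field name -> expected type.
    required_fields: dict
    min_rows: int = 1


def violations(contract: DataContract, rows: list[dict]) -> list[str]:
    # Return a human-readable list of everything that breaks the contract.
    problems = []
    if len(rows) < contract.min_rows:
        problems.append(
            f"expected at least {contract.min_rows} rows, got {len(rows)}"
        )
    for i, row in enumerate(rows):
        for name, expected in contract.required_fields.items():
            if name not in row:
                problems.append(f"row {i}: missing field '{name}'")
            elif not isinstance(row[name], expected):
                problems.append(
                    f"row {i}: field '{name}' is not {expected.__name__}"
                )
    return problems


def publish(contract: DataContract, rows: list[dict], target: list[dict]) -> None:
    # The pipeline's last step: nothing reaches the target unless
    # the output honors the contract.
    problems = violations(contract, rows)
    if problems:
        raise ValueError("; ".join(problems))
    target.extend(rows)
```

In this sketch the contract lives outside the pipeline code, so the same checks can be applied to whatever technology the pipeline happens to use.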
So we’re not getting rid of the pipeline?
No, we are not eliminating the data pipeline in data mesh. Data mesh tames it and makes sure its arrogance is under control.
Feature photo by Ambady Kolazhikkaran.
This is terrific, Jean-George, thanks.
The pipeline is dead. Long live the pipeline!
Thanks, David! I like that a lot. You can also think of King Pipeline… Is it in for a revolution à la française, or a constitutional monarchy? Whatever the solution is, it will be different.