Data Catering — How you could replicate your production data flow

The problem

Let’s imagine this scenario for a moment. I’ve recently been hired at a new company as a Software Engineer. I’m eager to get stuck in and get my hands dirty. I’m given an introduction to the team, architecture, tech stack and some terminology, and soon enough my first task is assigned to me.

I need some more context about this task, so I ask a colleague and they give me a quick rundown: “Check out this project, find the correct data sources in the data catalog, run this job and you should be able to replicate the error.” Sure. Sounds easy enough. But I find that the unit and integration tests are out of date and haven’t kept up with the latest schema changes. I start looking through the data catalog and quickly get drowned in the complexity. Field A is defined as a subset of field B, but only under certain conditions. In data source A, field C is a foreign key to field D in data source B. There can also be multiple records corresponding to the primary key of data source C, and so on.

Soon after, I get told that we also have to notify the data consumer team, as we will have to test that our changes don’t break their logic. The testing team also needs to create the corresponding tests for when the change gets deployed into a non-prod environment. By now, you can probably see the point I am making. Data is complex, spans multiple teams, and requires coordination and alignment between producers and consumers. This affects not only new hires but also seasoned veterans, as it is very difficult to stay on top of all domains, their intricacies and any changes to them.

The questions

How could we streamline this process?
What tools do we use to align everyone on what the data looks like and test it in each environment?
How do we do this in a reliable and consistent manner?
Can we get confidence that our changes actually work and don’t break existing logic?
How do we cater (😉) for future changes?

The solution

Imagine if we could connect to the data catalog (or other metadata services or even directly to the data source), gather information about the schema and data properties, generate data corresponding to this contract, feed it into the producer and then validate the consumer. This is where Data Caterer comes into play.

A generic data generation and validation tool that can automatically discover, generate and validate data across any data source.

How it works

In the above diagram, you can see at a high level how Data Caterer could be run. Going into a bit more detail:

1. Gather schema information either from metadata sources (such as data catalogs or a schema registry) or directly from the data source. Define any relationships between the data (e.g. for each account create event in Kafka, the same account_id should appear in the CSV file), or have them gathered automatically.
2. Generate data based on the schema provided plus any extra metadata to guide the generator (e.g. account_id follows the pattern ‘ACC[0–9]{8}’, id is incremental, max value of merchant_number is 1000). Both batch and event data can be generated.
3. Your applications consume and process the generated data. At this point, you may catch errors, or your consumer successfully pushes the data to another data source.
4. Data Caterer can then run data validations on this data source to ensure the data has been consumed correctly. Types of validations include basic single column validations (contains, equalTo, lessThan, unique, etc.), aggregations (sum of amount per account should be less than 100), relationships (at least one transaction in CSV per account create event in Kafka) and data profile comparison (how closely the produced data profile matches the expected data profile). More details on validations can be found here.
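To make the generate and validate steps more concrete, here is a minimal sketch in plain Python. It does not use Data Caterer or its API; the function names, field layout and record counts are assumptions made up for illustration. It only mirrors the examples above: account_id follows ACC[0-9]{8}, id is incremental, merchant_number is at most 1000, every account has at least one transaction, and the sum of amount per account stays below 100.

```python
import random
import re
from collections import defaultdict

# Pattern taken from the example metadata in the text above.
ACCOUNT_ID_PATTERN = re.compile(r"ACC[0-9]{8}")


def generate_accounts(count, rng):
    """Hypothetical stand-in for the 'account create' events (e.g. a Kafka topic)."""
    account_ids = set()
    while len(account_ids) < count:  # retry on the rare duplicate so ids stay unique
        digits = "".join(rng.choice("0123456789") for _ in range(8))
        account_ids.add("ACC" + digits)
    return [
        {"id": i, "account_id": account_id, "merchant_number": rng.randint(1, 1000)}
        for i, account_id in enumerate(sorted(account_ids), start=1)  # incremental id
    ]


def generate_transactions(accounts, rng):
    """Hypothetical stand-in for the CSV file; reuses account_id so the relationship holds."""
    transactions = []
    for account in accounts:
        for _ in range(rng.randint(1, 3)):  # at least one transaction per account
            transactions.append({
                "account_id": account["account_id"],
                # At most 3 transactions of at most 30 keeps each account's sum under 100.
                "amount": round(rng.uniform(1, 30), 2),
            })
    return transactions


def validate(accounts, transactions):
    """Run single column, aggregation and relationship checks; return name -> passed."""
    totals = defaultdict(float)
    for txn in transactions:
        totals[txn["account_id"]] += txn["amount"]
    account_ids = [a["account_id"] for a in accounts]
    return {
        "account_id matches pattern": all(
            ACCOUNT_ID_PATTERN.fullmatch(a) is not None for a in account_ids
        ),
        "account_id is unique": len(set(account_ids)) == len(account_ids),
        "merchant_number at most 1000": all(a["merchant_number"] <= 1000 for a in accounts),
        "sum of amount per account below 100": all(t < 100 for t in totals.values()),
        "at least one transaction per account": all(a in totals for a in account_ids),
    }


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs generate the same data
    accounts = generate_accounts(5, rng)
    transactions = generate_transactions(accounts, rng)
    for name, passed in validate(accounts, transactions).items():
        print(f"{'PASS' if passed else 'FAIL'}: {name}")
```

In a real flow, the generated records would be pushed to the producer's data sources and the validations would run against whatever the consumer writes out, rather than against the generated data itself as in this sketch.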

Following this pattern, we could replicate our production data flows in any environment and then run validations on them. If you only want to generate some sample data to quickly test your application, that is also possible, as seen below in the different ways you can run Data Caterer.

Pros

Portable: Given that Data Caterer is a single Docker image, it can be run on your local laptop or in a specific environment. Define it in your docker-compose to help developers start up an app or job and run it with production-like data straight away.
Customizable: Derive all your base schema and metadata from existing services, while keeping the choice to override or alter their behaviour.
Streamlined: Remove the friction and reliance between teams as producers and consumers use Data Caterer to test their logic. Perfect for a Data Mesh style architecture or where data contracts are in place (check out ODCS for a great example of what a data contract should encompass).
Accurate: Get confidence that your application will work in production as it consumes production-like data.

Cons

Load testing: Any number of records can be generated by Data Caterer, but it has limited metric capturing abilities in regard to load testing, especially in terms of event load testing. There are more suitable tools, such as Gatling, that can serve this purpose.
Data cleanup: Data Caterer, in its current state, is able to remove the data it has generated but is unable to remove the consumers’ data. This is one of the potential roadmap items that can be found here.
Metadata storage: You could store and use metadata within the data generation/validation tasks, but this is not the recommended approach. Rather, this metadata should be gathered from existing services that handle metadata on behalf of Data Caterer.

Comparison

There are a number of similar tools in this space, but they are usually focused on either data generation or data validation, lacking one or more of the following features:

Generated data cleanup
Batch and event data generation
Define simple/complex relationships between datasets
Data validation

For more details on how Data Caterer compares with other tools, you can check here.

Conclusion

The ecosystem of data tooling has become more mature over the years, but we still have room for improvement. Many tools focus on being reactive, such as data monitoring tools (which detect non-compliant data being produced or consumed) or data catalogs (where you collect metadata and expose it to users but don’t necessarily take action on it until things go wrong). Data Caterer pushes you towards being proactive by making better use of your test environments (replicating production-like data flows consistently) and, most importantly, saving time (less reliance on other teams and earlier bug detection), so you can focus on value-adding efforts or other interesting tasks.

If you want to get some hands-on experience with Data Caterer, you can follow the quick start here or, for a more detailed guided experience, check out the guides here. For any questions, queries or feedback, feel free to contact me at:
peter.flook@data.catering
