In this article, I am trying to join data quality and service-level indicators in the context of data management.
Since I wrote my article about the CACTAR acronym for Data Quality, following my talk at Spark Summit in 2017, I have suffered from a “too little” syndrome.
As a reminder, CACTAR stands for consistency, accuracy, completeness, timeliness, accessibility, and reliability. Those attributes define data quality and are finite. CACTAR has six attributes; I have read other publications with seven, at most ten.
It quickly became clear to me that data quality lacked service-level agreements (SLAs). Well… truly, you and I need to define pretty quickly.
For example, where can I find the guaranteed delivery time of a batch process? How long is this data going to be retained? When is my dataset going to be available? For how long?
Building on CACTAR, I started thinking about service-level indicators (SLIs). I soon realized that the only limit on the number of possible indicators is the creativity of the human brain, and that can be pretty vast.
The following table lists some of those indicators combined with the data quality attributes.
Property | Abbreviation | DQ/SLI | Group | In CACTAR? | Description |
---|---|---|---|---|---|
Availability | Av | SLI | Rest | N | Simply put, is my data available? A data source can be unavailable for many reasons (server is down, files are corrupted…). The basic need is that the database answers your JDBC’s connect() method positively, right? |
Accessibility | Ac | DQ | Rest | Y – cactAr | How easily can data be accessed and manipulated? One (manual) way to measure data quality is to sample it. Do you have the right tool? How representative is your sample, or are you always getting the same first 100 rows of your dataset? |
Reliability | Rl | DQ | Rest | Y – cactaR | How reliable your sources are. A source can be a database system, but also a human operator inputting data into a system. |
Completeness | Cm | DQ | Rest | Y – caCtar | To make sense, your data must be complete — at least, for key data. Missing fields, values, or rows will harm your analysis. |
Accuracy | Au | DQ | Motion | Y – cActar | The value must mean something, and this something must be accurate. |
Consistency | Cn | DQ | Motion | Y – Cactar | Data must be consistent across multiple data sources, servers, and platforms. |
General availability | Ga | SLI | Lifecycle | N | The date your data is available for consumption. It can be a date in the future for a development version. |
End of support | Es | SLI | Lifecycle | N | The date at which your data will not have support anymore. |
End of life | El | SLI | Lifecycle | N | The date at which your data will not be available anymore. |
Retention | Re | SLI | Behavior | N | Answers the question of how long we keep the data. |
Frequency | Fy | SLI | Behavior | N | How often is your data updated? For most batch processes, one expects a daily update. |
Latency | Ly | SLI | Behavior | N | Measures the time between the production of the data and its availability for consumption. |
Time to detect | Td | SLI | Time | N | How fast will a problem be detected? |
Time to notify | Tn | SLI | Time | N | How much time will you need to notify your users after you have detected an issue? |
Time to repair | Tr | SLI | Time | N | How long do you need to fix the issue once it is detected? |
Timeliness | Tm | DQ | Time | Y – cacTar | How recent and relevant is the data? If you are building an industrial system, you must have your data in time. With the growing world of IoT, you expect data to be available right away. |
As you can see, I combined those indicators into five categories (or groups):
- Data at rest.
- Data in motion or as the result of a motion.
- Lifecycle properties.
- Behavior.
- Time.
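Several indicators in the Time and Behavior groups are simply differences between two timestamps, so they are easy to compute once the timestamps are recorded. Here is a minimal sketch in Python, following the definitions in the table above. The `Incident` record, its field names, and the sample timestamps are hypothetical; they only illustrate the arithmetic and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical record of one data issue (field names are assumptions)."""
    occurred_at: datetime   # when the problem actually started
    detected_at: datetime   # when the problem was noticed
    notified_at: datetime   # when the users were told about it
    repaired_at: datetime   # when the fix was in place


def time_to_detect(incident: Incident) -> timedelta:
    # How fast was the problem detected?
    return incident.detected_at - incident.occurred_at


def time_to_notify(incident: Incident) -> timedelta:
    # Time needed to notify users after the issue was detected.
    return incident.notified_at - incident.detected_at


def time_to_repair(incident: Incident) -> timedelta:
    # Time needed to fix the issue once it was detected.
    return incident.repaired_at - incident.detected_at


def latency(produced_at: datetime, available_at: datetime) -> timedelta:
    # Time between the production of the data and its availability.
    return available_at - produced_at


# Made-up timestamps, for illustration only.
incident = Incident(
    occurred_at=datetime(2023, 9, 17, 2, 0),
    detected_at=datetime(2023, 9, 17, 2, 20),
    notified_at=datetime(2023, 9, 17, 2, 35),
    repaired_at=datetime(2023, 9, 17, 4, 20),
)

print(time_to_detect(incident))  # 0:20:00
print(time_to_notify(incident))  # 0:15:00
print(time_to_repair(incident))  # 2:00:00
```

In this sketch, time to repair starts at detection, as in the table; if you prefer to count from the moment the problem occurred, subtract `occurred_at` instead.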
A more visual way to represent the table is to borrow the layout of Mendeleev's periodic table. You can see the groups and the periods. The period rows list the time from earlier to later: availability happens before completeness, general availability happens before the end of life, and so on.
This article is an essay; many details still need to be fleshed out. Please share in the comments the indicators (for service levels) that are missing here or the data quality attributes you use.
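If you want to play with the groups programmatically, one possible way is to encode the table as data. This is only a sketch: the `Indicator` class, the `Kind` enum, and the `by_group` helper are names I made up for illustration, while the names, abbreviations, and groups are copied from the table above.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    DQ = "data quality"
    SLI = "service-level indicator"


@dataclass(frozen=True)
class Indicator:
    name: str
    abbreviation: str
    kind: Kind
    group: str


# One entry per row of the table, in the same order.
CATALOG = [
    Indicator("Availability", "Av", Kind.SLI, "Rest"),
    Indicator("Accessibility", "Ac", Kind.DQ, "Rest"),
    Indicator("Reliability", "Rl", Kind.DQ, "Rest"),
    Indicator("Completeness", "Cm", Kind.DQ, "Rest"),
    Indicator("Accuracy", "Au", Kind.DQ, "Motion"),
    Indicator("Consistency", "Cn", Kind.DQ, "Motion"),
    Indicator("General availability", "Ga", Kind.SLI, "Lifecycle"),
    Indicator("End of support", "Es", Kind.SLI, "Lifecycle"),
    Indicator("End of life", "El", Kind.SLI, "Lifecycle"),
    Indicator("Retention", "Re", Kind.SLI, "Behavior"),
    Indicator("Frequency", "Fy", Kind.SLI, "Behavior"),
    Indicator("Latency", "Ly", Kind.SLI, "Behavior"),
    Indicator("Time to detect", "Td", Kind.SLI, "Time"),
    Indicator("Time to notify", "Tn", Kind.SLI, "Time"),
    Indicator("Time to repair", "Tr", Kind.SLI, "Time"),
    Indicator("Timeliness", "Tm", Kind.DQ, "Time"),
]


def by_group(catalog):
    """Return a dict mapping each group to its abbreviations, in table order."""
    groups = defaultdict(list)
    for indicator in catalog:
        groups[indicator.group].append(indicator.abbreviation)
    return dict(groups)


for group, abbreviations in by_group(CATALOG).items():
    print(group, abbreviations)

# The six data quality attributes are the CACTAR ones.
cactar = [i.name for i in CATALOG if i.kind is Kind.DQ]
print(len(cactar))  # 6
```

Keeping the list in table order preserves the "earlier to later" reading within each group, which is what the periods are meant to convey.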
Featured photo by Pexels
Updates
- 2023-09-17 Added link to AAAS Science.org’s periodic table, which is a pretty gorgeous visualization and was worth referencing here. Better table.