Squares
The squares remind me of Mendeleev’s periodic table.

In this article, I am trying to join data quality and service-level indicators in the context of data management.

Since I wrote my article about the CACTAR acronym for Data Quality, following my talk at Spark Summit in 2017, I have suffered from a “too little” syndrome.

As a reminder, CACTAR stands for consistency, accuracy, completeness, timeliness, accessibility, and reliability. Those attributes define data quality and are finite. CACTAR has six attributes; other publications I have read list seven, and at most ten.

It quickly became apparent to me that data quality lacked a notion of service-level agreement (SLA). Well… truly, it is one that you and I need to define pretty quickly.

For example, where can I find the guaranteed delivery time of a batch process? How long is this data going to be retained? When is my dataset going to be available? For how long?

Building on CACTAR, I started thinking about service-level indicators (SLI). I soon realized that the only limit to the number of indicators is the creativity of the human brain, and that can be pretty vast.

The following table lists some of those indicators combined with the data quality attributes.

| Property | Abbreviation | DQ/SLI | Group | In CACTAR? | Description |
|---|---|---|---|---|---|
| Availability | Av | SLI | Rest | N | Simply put, is my data available? A data source can be unavailable for many reasons (server is down, files are corrupted…). The basic need is that the database answers your JDBC’s connect() method positively, right? |
| Accessibility | Ac | DQ | Rest | Y – cactAr | How easily can data be accessed and manipulated? One (manual) way to measure data quality is to sample it: do you have the right tool? How representative is your sample, or are you always getting the same first 100 rows of your dataset? |
| Reliability | Rl | DQ | Rest | Y – cactaR | How reliable are your sources? A source can be a database system, but also a human operator inputting data into a system. |
| Completeness | Cm | DQ | Rest | Y – caCtar | To make sense, your data must be complete, at least for key data. Missing fields, values, or rows will harm your analysis. |
| Accuracy | Au | DQ | Motion | Y – cActar | The value must mean something, and this something must be accurate. |
| Consistency | Cn | DQ | Motion | Y – Cactar | Data must be consistent across multiple data sources, servers, and platforms. |
| General availability | Ga | SLI | Lifecycle | N | The date your data is available for consumption. It can be a date in the future for a development version. |
| End of support | Es | SLI | Lifecycle | N | The date at which your data will not have support anymore. |
| End of life | El | SLI | Lifecycle | N | The date at which your data will not be available anymore. |
| Retention | Re | SLI | Behavior | N | How long are we keeping the data? |
| Frequency | Fy | SLI | Behavior | N | How often is your data updated? For most batch processes, one expects a daily update. |
| Latency | Ly | SLI | Behavior | N | Measures the time between the production of the data and its availability for consumption. |
| Time to detect | Td | SLI | Time | N | How fast will a problem be detected? |
| Time to notify | Tn | SLI | Time | N | How much time will you need to notify your users after you detected an issue? |
| Time to repair | Tr | SLI | Time | N | How long do you need to fix the issue once it is detected? |
| Timeliness | Tm | DQ | Time | Y – cacTar | How recent and relevant is the data? If you are building an industrial system, you must have your data in time. With the growing world of IoT, you expect data to be available right away. |

As you can see, I combined those indicators into five categories (or groups):

  • Data at rest.
  • Data in motion or as the result of a motion.
  • Lifecycle properties.
  • Behavior.
  • Time.
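
To make a few of these indicators more concrete, here is a minimal Python sketch of how latency, completeness, and the three "time to" indicators could be measured. It is only an illustration under my own assumptions: the `Incident` record, its field names, and the `latency` and `completeness` helpers are hypothetical and not part of CACTAR or of any existing library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Timestamps of one data incident (hypothetical record layout)."""
    occurred_at: datetime
    detected_at: datetime
    notified_at: datetime
    repaired_at: datetime

    @property
    def time_to_detect(self) -> timedelta:
        # Td: delay between the problem occurring and its detection
        return self.detected_at - self.occurred_at

    @property
    def time_to_notify(self) -> timedelta:
        # Tn: delay between detection and notifying the users
        return self.notified_at - self.detected_at

    @property
    def time_to_repair(self) -> timedelta:
        # Tr: delay between detection and the fix
        return self.repaired_at - self.detected_at


def latency(produced_at: datetime, available_at: datetime) -> timedelta:
    """Ly: time between production of the data and its availability."""
    return available_at - produced_at


def completeness(rows: list[dict], key_fields: list[str]) -> float:
    """Cm: share of rows in which every key field is present and not None."""
    if not rows:
        # Assumption: an empty dataset is treated as fully incomplete.
        return 0.0
    complete = sum(
        1 for row in rows
        if all(row.get(field) is not None for field in key_fields)
    )
    return complete / len(rows)
```

Each function returns a raw measurement; comparing it to a target (for example, a latency under two hours) is what would turn an indicator into an objective or an agreement.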

A more visual way to represent the table is to borrow the layout of Mendeleev’s periodic table. You can see the groups and the periods. The periods (the rows) run from earlier to later: availability happens before completeness, general availability happens before the end of life, and so on.

Using a Mendeleev-like table to show how data quality and data service-level indicators can coexist.
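
The layout idea (groups as columns, periods as rows) can be sketched in a few lines of Python. This is a rough illustration, not the code behind the figure: it assumes that an indicator's period is simply its position within its group in the table above, so the actual figure may place some elements differently. The `build_grid` and `render` helpers are hypothetical names.

```python
from collections import defaultdict

# (abbreviation, group) pairs, in the order the table lists them.
INDICATORS = [
    ("Av", "Rest"), ("Ac", "Rest"), ("Rl", "Rest"), ("Cm", "Rest"),
    ("Au", "Motion"), ("Cn", "Motion"),
    ("Ga", "Lifecycle"), ("Es", "Lifecycle"), ("El", "Lifecycle"),
    ("Re", "Behavior"), ("Fy", "Behavior"), ("Ly", "Behavior"),
    ("Td", "Time"), ("Tn", "Time"), ("Tr", "Time"), ("Tm", "Time"),
]
GROUPS = ["Rest", "Motion", "Lifecycle", "Behavior", "Time"]


def build_grid(indicators, groups):
    """Return a list of period rows; each row has one cell per group."""
    columns = defaultdict(list)
    for abbreviation, group in indicators:
        columns[group].append(abbreviation)
    # Assumption: the period is the position of the indicator in its group.
    periods = max(len(columns[group]) for group in groups)
    return [
        [columns[g][p] if p < len(columns[g]) else "" for g in groups]
        for p in range(periods)
    ]


def render(grid, groups):
    """Format the grid as aligned plain text, one line per period."""
    width = max(len(group) for group in groups)
    lines = ["  ".join(group.ljust(width) for group in groups)]
    for row in grid:
        lines.append("  ".join(cell.ljust(width) for cell in row))
    return "\n".join(line.rstrip() for line in lines)


if __name__ == "__main__":
    print(render(build_grid(INDICATORS, GROUPS), GROUPS))
```

Keeping the indicators as plain data makes it cheap to add a new element or a new group and see where it lands in the grid.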

This article is an essay; many details still need to be fleshed out. Please share in the comments the service-level indicators that are missing here or the data quality attributes you use.


Featured photo by Pexels


Updates

  • 2023-09-17 Added link to AAAS Science.org’s periodic table, which is a pretty gorgeous visualization and was worth referencing here. Better table.