Earlier this month, I was in San Francisco, CA, to attend Spark Summit 2017. I gave a talk on the phase that comes before you can apply Machine Learning to your data, using Apache Spark and Java.

Slides

Video on YouTube

Spark and Java are Great for Data Quality

Spark Summit 2017 was in San Francisco, CA, a few blocks away from Databricks' HQ and IBM's Spark Technology Center.

The talk illustrates how important Data Quality (DQ) is, especially in the context of Machine Learning (ML). In ML, data are ground, hashed, and analyzed. As those operations are done over and over in a very intense and repetitive way, it is important to ensure we do not have too much “garbage in”. Once we get through the boring stuff (which I really tried not to make too boring), we dig into Java code examples using the Spark API, Spark SQL, and Spark UDFs (User Defined Functions). The talk also introduces Cactar, the Mongolian warlord of Data Quality (more on him soon).

We first clean the data using UDFs and reusing “existing” code. The UDF only performs a call to the service, which keeps business logic out of the one part we do not control: the UDF signature. As a result, the DQ rules stay independent of Spark and can be reused elsewhere.
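To make the "thin UDF" idea concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: the class names, the rule, and the threshold of 20 are hypothetical and are not taken from the talk's code. To keep the sketch runnable without Spark on the classpath, the adapter implements `java.util.function.Function` instead of Spark's `UDF1` interface.

```java
import java.util.function.Function;

// Hypothetical data quality service: plain Java, no Spark dependency,
// so the rule can be reused outside of Spark.
final class PriceCorrectionService {
    private PriceCorrectionService() {
    }

    // Illustrative rule (assumption): a missing price or a price under 20
    // is considered invalid and is flagged with -1.
    static double minimumPriceRule(Double price) {
        if (price == null || price < 20.0) {
            return -1.0;
        }
        return price;
    }
}

// Thin adapter: no business logic here, only a call to the service.
// With Spark, this class would implement UDF1<Double, Double> and the
// method would be named call() instead of apply().
final class MinimumPriceUdf implements Function<Double, Double> {
    @Override
    public Double apply(Double price) {
        return PriceCorrectionService.minimumPriceRule(price);
    }
}
```

In a real Spark application, the adapter would implement `org.apache.spark.sql.api.java.UDF1` and be registered under a name through `spark.udf().register(...)` with its return type, after which it could be called from the DataFrame API or from Spark SQL. The rule itself would not change.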

The second part of the code follows the same UDF principle. We use a UDF to format our data into labels and features, so we can apply a very simple linear regression.
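The label and features step can be sketched as follows, again in plain Java so it runs on its own. This is not the code from the talk: the class and method names are hypothetical, and a closed-form least-squares fit on a single feature stands in for Spark ML's linear regression.

```java
import java.util.List;

// Hypothetical row shape: the value to predict (label) and its inputs (features).
final class LabeledRow {
    final double label;
    final double[] features;

    LabeledRow(double label, double[] features) {
        this.label = label;
        this.features = features;
    }
}

final class SimpleRegression {
    private SimpleRegression() {
    }

    // Stand-in for the formatting UDF: turns a raw (x, y) pair into
    // a label and a one-element feature vector.
    static LabeledRow toLabeledRow(double x, double y) {
        return new LabeledRow(y, new double[] { x });
    }

    // Ordinary least squares on one feature; returns { intercept, slope }.
    static double[] fit(List<LabeledRow> rows) {
        int n = rows.size();
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (LabeledRow row : rows) {
            double x = row.features[0];
            sumX += x;
            sumY += row.label;
            sumXY += x * row.label;
            sumXX += x * x;
        }
        double denominator = n * sumXX - sumX * sumX;
        if (denominator == 0) {
            throw new IllegalArgumentException("Need at least two distinct feature values");
        }
        double slope = (n * sumXY - sumX * sumY) / denominator;
        double intercept = (sumY - slope * sumX) / n;
        return new double[] { intercept, slope };
    }

    // Applies the fitted model to a new feature value.
    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }
}
```

With Spark, the formatting function would live behind a UDF, as in the cleaning step, and the fit would be delegated to the ML library. The point of the sketch is only that the data has to be shaped into a label and a feature vector before any regression can run.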

The example code and data, all in Java, can be downloaded from GitHub.
