The Apache Spark dataframe is pretty majestic, covered in chapter 3 of Spark with Java

Chapter 3 of Spark with Java focuses on the dataframe. There is something majestic about Apache Spark’s dataframe, like the mountains of Montana. Apache Spark revolves around the concept of the dataframe, which combines storage with a rich transformation API (application programming interface). Spark’s dataframe is fully programmable with Java.

Spark with Java chapter 3 overview

In chapter 3 of Spark with Java, you will learn how to use the dataframe and will (hopefully) understand why it is so important in Spark applications: it holds typed data described by a schema and offers a powerful API.

To use Spark in a programmatic way, you need to understand some of its key APIs. Transformations are operations you perform on data. Examples include extracting a year from a date, combining two fields, normalizing data, and so on. In this chapter, I teach how to use dataframe-specific functions to perform transformations, as well as methods directly attached to the dataframe API. Spark’s dataframe API is used within Spark SQL, streaming, machine learning, and GraphX to manipulate graph-based data structures within Spark. The dataframe drastically simplifies access to those technologies via a unified API.
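
As a rough illustration of the kinds of transformations listed above, the sketch below extracts a year from a date and combines two fields using only the JDK. It is not Spark code and it is not taken from the chapter: the class `TransformationSketch`, its method names, and the sample values are hypothetical. In Spark, comparable column-level functions such as `year` and `concat` live in `org.apache.spark.sql.functions`.

```java
import java.time.LocalDate;

public class TransformationSketch {

    // Extracts the year from an ISO-8601 date string such as "2019-03-14".
    public static int extractYear(String isoDate) {
        return LocalDate.parse(isoDate).getYear();
    }

    // Combines two fields into a single value, separated by a comma and a space.
    public static String combine(String first, String second) {
        return first + ", " + second;
    }

    public static void main(String[] args) {
        // Hypothetical sample values, used only for illustration.
        System.out.println(extractYear("2019-03-14"));
        System.out.println(combine("Durham", "NC"));
    }
}
```

With a dataframe, the same idea is applied to a whole column at once rather than to one value at a time.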

You will also learn about other data structures in Spark and the subtle difference between a dataset and a dataframe. You will reuse your good ol’ POJOs (plain old Java objects) in the context of transformations within Spark.

This chapter will also talk about the RDD (resilient distributed dataset). RDDs were the first generation of storage in Spark, which the dataframe completely supports, extends, and makes sublime (but not majestic, that’s the dataframe).

Lab

This chapter’s lab will transform two open data datasets in Spark with Java. Those datasets contain restaurant data from two North Carolinian counties (Durham and Wake). You will union the two datasets. The transformations are as close as they can be to those of a production application.

Anything more?

You will also get a definition of immutability. And, no, it is not a swear word. It may be a somewhat personal definition, though.
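
One way to picture immutability, sketched here in plain Java rather than with Spark’s API: a transformation leaves the original data untouched and hands back a new value. This is an illustration under my own assumptions, not the chapter’s definition; the class `ImmutabilitySketch`, the `toUpperCase` helper, and the sample names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class ImmutabilitySketch {

    // Returns a new, read-only list; the input list is never modified.
    public static List<String> toUpperCase(List<String> names) {
        List<String> result = new ArrayList<>();
        for (String name : names) {
            result.add(name.toUpperCase(Locale.ROOT));
        }
        return Collections.unmodifiableList(result);
    }

    public static void main(String[] args) {
        // The original data is wrapped so that it cannot be changed in place.
        List<String> original =
                Collections.unmodifiableList(Arrays.asList("durham", "wake"));

        // The "transformation" produces a second list next to the first one.
        List<String> transformed = toUpperCase(original);

        System.out.println(original);
        System.out.println(transformed);
    }
}
```

Calling `original.add(...)` on the wrapped list would throw an `UnsupportedOperationException`, which is the point: changes only ever show up in a new value.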

Outside of this new chapter, please go share the love on the new Facebook page for Spark with Java!