Starting today, I will host a weekly live show about data. You can join, attend “live,” and ask questions as I go through a data-oriented topic. For now, the topic is not announced before the show, but you can suggest topics via Twitter. After the live broadcast, I will share a recording on a dedicated YouTube channel, originally named DataFriday.

Today's topic is ingesting a comma-separated values (CSV) file with Apache Spark. It is indeed a very basic example, but let’s keep it simple for the first episode, right? I will certainly add more complex discussions as we move forward.

The code is available on GitHub. The full description of the process is available in the first chapter of Spark in Action, 2nd edition. You can have a look at it on Manning’s live book website.

For convenience, the Java code is included here. This small application has three main steps:

  1. Get a Spark session.
  2. Read the CSV file.
  3. Show the content of the dataframe.

package net.jgp.books.spark.ch01.lab100_csv_to_dataframe;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

/**
 * CSV ingestion in a dataframe.
 * 
 * @author jgp
 */
public class CsvToDataframeApp {

  /**
   * main() is your entry point to the application.
   * 
   * @param args command-line arguments, not used by this application
   */
  public static void main(String[] args) {
    CsvToDataframeApp app = new CsvToDataframeApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    // Reads a CSV file with header, called books.csv, stores it in a
    // dataframe
    Dataset<Row> df = spark.read().format("csv")
        .option("header", true)
        .load("data/books.csv");

    // Shows at most 5 rows from the dataframe
    df.show(5);
  }
}
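
If you want a feel for what steps 2 and 3 amount to before installing Spark, here is a minimal plain-Java sketch. It is not how Spark works internally, only an illustration of the idea behind option("header", true) and show(5): the first line provides the column names, and only a limited number of the remaining rows is displayed. The class name HeaderCsvSketch, its two methods, and the sample rows are hypothetical placeholders, not the content of data/books.csv. The sketch assumes there is at least a header line and splits naively on commas, so it does not handle quoted fields the way a real CSV parser does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java illustration (no Spark) of "read a CSV with a header, then
 * show at most n rows". All names here are hypothetical.
 */
public class HeaderCsvSketch {

  /**
   * Returns the column names, taken from the first line. Assumes the list
   * contains at least the header line.
   */
  public static List<String> columns(List<String> lines) {
    return Arrays.asList(lines.get(0).split(",", -1));
  }

  /**
   * Returns at most n data rows, skipping the header line. Naive split on
   * commas: quoted fields containing commas are not supported.
   */
  public static List<List<String>> firstRows(List<String> lines, int n) {
    List<List<String>> rows = new ArrayList<>();
    for (int i = 1; i < lines.size() && rows.size() < n; i++) {
      rows.add(Arrays.asList(lines.get(i).split(",", -1)));
    }
    return rows;
  }

  public static void main(String[] args) {
    // Placeholder data for illustration, not the real books.csv
    List<String> lines = Arrays.asList(
        "id,title",
        "1,First placeholder title",
        "2,Second placeholder title",
        "3,Third placeholder title");

    // Prints the column names, then at most 2 of the 3 data rows
    System.out.println(columns(lines));
    for (List<String> row : firstRows(lines, 2)) {
      System.out.println(row);
    }
  }
}
```

Asking for more rows than the file contains simply returns all of them, which is why the comment in the Spark code says "at most 5 rows."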

This is a really basic example, but I needed one to start, no? The next episodes will certainly be a bit more complex.

More resources:

The YouTube channel for DataFriday lists all episodes. You can attend the live show every Friday morning at 8 AM EST on Zoom.