Starting today, I will host a weekly live show about data. You may join, attend “live,” and ask questions as I go through a data-oriented topic. For now, the topic is not known before the show, but you can suggest topics via Twitter. After the live broadcast, I will share a recording on a dedicated YouTube channel, originally named DataFriday.
Today, the topic is ingesting a comma-separated values (CSV) file with Apache Spark. It is indeed a very basic example, but let’s keep it simple for the first episode, right? I will certainly add more complex discussions as we move forward.
The code is available on GitHub. The full description of the process is available in the first chapter of Spark in Action, 2nd edition. You can have a look at it on Manning’s live book website.
For convenience, the Java code is added here. You will see the main steps of this small application.
- Get a Spark session.
- Read the CSV file.
- Show the content of the dataframe.
package net.jgp.books.spark.ch01.lab100_csv_to_dataframe;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

/**
 * CSV ingestion in a dataframe.
 *
 * @author jgp
 */
public class CsvToDataframeApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    CsvToDataframeApp app = new CsvToDataframeApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    // Reads a CSV file with header, called books.csv, stores it in a
    // dataframe
    Dataset<Row> df = spark.read().format("csv")
        .option("header", true)
        .load("data/books.csv");

    // Shows at most 5 rows from the dataframe
    df.show(5);
  }
}
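To make the effect of the header option and of show(5) more concrete, here is a minimal plain-Java sketch that uses no Spark at all. It is only an illustration, not how Spark implements CSV ingestion: the class HeaderCsvSketch, its Table type, the methods parse and firstRows, and the sample lines are all hypothetical, and the naive comma split ignores quoting and escaping, which Spark's CSV reader handles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java illustration (no Spark) of header handling and row limiting.
 * Every name in this class is hypothetical and not part of the Spark API.
 */
public class HeaderCsvSketch {

  /** Column names plus data rows, loosely mirroring a dataframe. */
  public static final class Table {
    public final List<String> columns;
    public final List<List<String>> rows;

    Table(List<String> columns, List<List<String>> rows) {
      this.columns = columns;
      this.rows = rows;
    }
  }

  /**
   * Treats the first line as column names and the rest as data, which is
   * the behavior requested by option("header", true). Assumption: fields
   * never contain commas or quotes, so a naive split is enough here.
   */
  public static Table parse(List<String> lines) {
    if (lines.isEmpty()) {
      return new Table(new ArrayList<>(), new ArrayList<>());
    }
    List<String> columns = Arrays.asList(lines.get(0).split(",", -1));
    List<List<String>> rows = new ArrayList<>();
    for (String line : lines.subList(1, lines.size())) {
      rows.add(Arrays.asList(line.split(",", -1)));
    }
    return new Table(columns, rows);
  }

  /** Returns at most n rows: fewer if the table is shorter than n. */
  public static List<List<String>> firstRows(Table table, int n) {
    return table.rows.subList(0, Math.min(n, table.rows.size()));
  }

  public static void main(String[] args) {
    // Made-up sample data, not the content of data/books.csv
    List<String> lines = Arrays.asList(
        "id,title",
        "1,First made-up title",
        "2,Second made-up title",
        "3,Third made-up title",
        "4,Fourth made-up title",
        "5,Fifth made-up title",
        "6,Sixth made-up title");

    Table table = parse(lines);
    System.out.println(table.columns);
    for (List<String> row : firstRows(table, 5)) {
      System.out.println(row);
    }
  }
}
```

With six data lines, firstRows(table, 5) keeps the first five and drops the sixth. With fewer than five data lines, it would return them all, which is why the comment in the Spark code says "at most 5 rows."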
This is a really basic example, but I needed one to start, no? The next episodes will certainly be a bit more complex.
More resources:
- Loading CSV in Spark.
- (Almost) All you need to know about file ingestion in Apache Spark.
- File Ingestion in Apache Spark.
The YouTube channel for DataFriday lists all episodes. You can attend the live show every Friday morning at 8 AM EST on Zoom.