Loading data can be tricky because of all the possible options, even (or especially) with a format like CSV.

Loading CSV in Apache Spark has been a standard feature since version 2.0; previously, you needed a free plugin provided by Databricks.

The value proposition looks basic: Comma-Separated Values (aka CSV). But, as you know, CSV can be tricky:

  • Do you have a header?
  • C stands for Comma, but what is your real delimiter?
  • Do your fields span multiple lines?

And many more questions and configuration options.

Loading a CSV file is fast and easy, as you can see in the next Java example.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Create a local Spark session for the ingestion job
    SparkSession spark = SparkSession
        .builder()
        .appName("Ingesting CSV")
        .master("local")
        .getOrCreate();

    // Read the CSV file, using the first line as column names and
    // letting Spark infer the type of each column
    Dataset<Row> df = spark
        .read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("my_csv_file.csv");
    df.show();
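
Because inferSchema asks Spark to guess a type for each column, it is worth checking what it inferred. A minimal follow-up, using the same df as above:

    // Display the schema that Spark inferred from the data
    df.printSchema();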

However, the list of options for reading CSV is long and somewhat hard to find. Therefore, here it is, with additional explanations, updated as of Spark v2.2.0. Note that file ingestion is covered in detail in chapter 7 of Spark with Java.

CSV options for ingestion in Spark

Option | Default | Comment | Since version
sep | , | Sets a single character as a separator for each field and value. | v2.0.0
encoding | UTF-8 | Decodes the CSV files by the given encoding type. | v2.0.0
quote | " | Sets a single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, set an empty string rather than null. This behavior is different from com.databricks.spark.csv (the optional parser for Spark v1.x). | v2.0.0
escape | \ | Sets a single character used for escaping quotes inside an already quoted value. | v2.0.0
comment | empty string | Sets a single character used for skipping lines beginning with this character. By default, it is disabled. | v2.0.0
header | false | Uses the first line as names of columns. Two-line headers are not supported. | v2.0.0
inferSchema | false | Infers the input schema automatically from data. It requires one extra pass over the data. | v2.0.0
ignoreLeadingWhiteSpace | false | Flag indicating whether or not leading whitespaces from values being read should be skipped. | v2.0.0
ignoreTrailingWhiteSpace | false | Flag indicating whether or not trailing whitespaces from values being read should be skipped. | v2.0.0
nullValue | empty string | Sets the string representation of a null value. Since v2.0.1, this applies to all supported types including the string type. | v2.0.0
nanValue | NaN | Sets the string representation of a non-number value. | v2.0.0
positiveInf | Inf | Sets the string representation of a positive infinity value. | v2.0.0
negativeInf | -Inf | Sets the string representation of a negative infinity value. | v2.0.0
dateFormat | yyyy-MM-dd | Sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to the date type. | v2.0.0
timestampFormat | yyyy-MM-dd'T'HH:mm:ss.SSSXXX | Sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to the timestamp type. | v2.1.0
maxColumns | 20480 | Defines a hard limit of how many columns a record can have. | v2.0.0
maxCharsPerColumn | -1 | Defines the maximum number of characters allowed for any given value being read. By default, it is -1, meaning unlimited length. In Spark v2.0.0, the default was 1000000. | v2.0.0
mode | PERMISSIVE | Allows a mode for dealing with corrupt records during parsing. It supports the case-insensitive modes listed after this table. | v2.0.0
columnNameOfCorruptRecord | the value specified in spark.sql.columnNameOfCorruptRecord | Allows renaming the new field containing the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. | v2.2.0
multiLine | false | Parses one record, which may span multiple lines. | v2.2.0

The mode option supports the following case-insensitive values:

  • PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, a user can add a string-type field named by columnNameOfCorruptRecord to a user-defined schema. If the schema does not have this field, corrupt records are dropped during parsing. When a parsed record has fewer tokens than the schema expects, null is set for the extra fields.
  • DROPMALFORMED: ignores whole corrupted records.
  • FAILFAST: throws an exception when it meets corrupted records.

Source: latest Javadoc.
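
To see how these options combine, here is a minimal sketch of a read that uses several of them together. The file name orders.csv, its semicolon separator, and the N/A marker for null values are assumptions made for the example:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession
        .builder()
        .appName("Ingesting CSV with options")
        .master("local")
        .getOrCreate();

    // Hypothetical semicolon-separated file with quoted fields that can
    // span several lines; records that cannot be parsed are dropped
    Dataset<Row> df = spark
        .read()
        .format("csv")
        .option("header", "true")
        .option("sep", ";")
        .option("multiLine", "true")
        .option("nullValue", "N/A")
        .option("mode", "DROPMALFORMED")
        .load("orders.csv");
    df.show(5);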

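Since PERMISSIVE mode only keeps corrupt records when the schema contains a string field to receive them, here is a sketch of that pattern. The schema, the _malformed column name, and the file books.csv are assumptions for the example:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    SparkSession spark = SparkSession
        .builder()
        .appName("Keeping corrupt CSV records")
        .master("local")
        .getOrCreate();

    // User-defined schema with an extra string field to hold the raw
    // text of any record that fails to parse
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, true),
        DataTypes.createStructField("title", DataTypes.StringType, true),
        DataTypes.createStructField("_malformed", DataTypes.StringType, true)
    });

    Dataset<Row> df = spark
        .read()
        .format("csv")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_malformed")
        .schema(schema)
        .load("books.csv");

    // Corrupt records keep their original text in the _malformed column
    df.filter(df.col("_malformed").isNotNull()).show();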