Loading data can be tricky because of all the possible options, even (or especially) with a format like CSV.

Loading CSV in Apache Spark has been a standard feature since version 2.0; previously, you needed a free plugin provided by Databricks.

The value proposition looks basic: Comma-Separated Values (aka CSV). But, as you know, CSV can be tricky:

  • Do you have a header?
  • C stands for Comma, but what is your real delimiter?
  • Do your fields span multiple lines?

And many more questions and configuration options.

Loading a CSV file is fast and easy, as you can see in the next Java example.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // Create a local Spark session for the ingestion job
    SparkSession spark = SparkSession
        .builder()
        .appName("Ingesting CSV")
        .master("local")
        .getOrCreate();

    // Read the CSV file, using the first line as column names and
    // letting Spark infer the type of each column
    Dataset<Row> df = spark
        .read()
        .format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load("my_csv_file.csv");
    df.show();
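
Because inferSchema asks Spark to guess a type for each column, it is worth checking what it inferred. A minimal follow-up, using the same df as above:

    // Display the schema that Spark inferred from the data
    df.printSchema();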

However, the list of options for reading CSV is long and somewhat hard to find. Therefore, here it is, with additional explanations, updated as of Spark v2.2.0. Note that file ingestion is covered in detail in chapter 7 of Spark with Java.

CSV options for ingestion in Spark

Option | Default | Comment | Since version
sep | , | Sets a single character as a separator for each field and value. | v2.0.0
encoding | UTF-8 | Decodes the CSV files by the given encoding type. | v2.0.0
quote | " | Sets a single character used for escaping quoted values where the separator can be part of the value. If you would like to turn off quotations, set an empty string rather than null. This behavior is different from com.databricks.spark.csv (the optional parser for Spark v1.x). | v2.0.0
escape | \ | Sets a single character used for escaping quotes inside an already quoted value. | v2.0.0
comment | empty string | Sets a single character used for skipping lines beginning with this character. By default, it is disabled. | v2.0.0
header | false | Uses the first line as names of columns. Two-line headers are not supported. | v2.0.0
inferSchema | false | Infers the input schema automatically from data. It requires one extra pass over the data. | v2.0.0
ignoreLeadingWhiteSpace | false | Flag indicating whether or not leading whitespaces from values being read should be skipped. | v2.0.0
ignoreTrailingWhiteSpace | false | Flag indicating whether or not trailing whitespaces from values being read should be skipped. | v2.0.0
nullValue | empty string | Sets the string representation of a null value. Since v2.0.1, this applies to all supported types including the string type. | v2.0.0
nanValue | NaN | Sets the string representation of a non-number value. | v2.0.0
positiveInf | Inf | Sets the string representation of a positive infinity value. | v2.0.0
negativeInf | -Inf | Sets the string representation of a negative infinity value. | v2.0.0
dateFormat | yyyy-MM-dd | Sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to the date type. | v2.0.0
timestampFormat | yyyy-MM-dd'T'HH:mm:ss.SSSXXX | Sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to the timestamp type. | v2.1.0
maxColumns | 20480 | Defines a hard limit of how many columns a record can have. | v2.0.0
maxCharsPerColumn | -1 | Defines the maximum number of characters allowed for any given value being read. By default, it is -1, meaning unlimited length. In Spark v2.0.0, the default was 1000000. | v2.0.0
mode | PERMISSIVE | Allows a mode for dealing with corrupt records during parsing. It supports the case-insensitive modes listed after this table. | v2.0.0
columnNameOfCorruptRecord | the value specified in spark.sql.columnNameOfCorruptRecord | Allows renaming the new field containing the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. | v2.2.0
multiLine | false | Parses one record, which may span multiple lines. | v2.2.0

The mode option supports the following case-insensitive values:

  • PERMISSIVE: sets other fields to null when it meets a corrupted record, and puts the malformed string into a field configured by columnNameOfCorruptRecord. To keep corrupt records, a user can add a string-type field named by columnNameOfCorruptRecord to a user-defined schema. If the schema does not have this field, corrupt records are dropped during parsing. When a parsed record has fewer tokens than the schema expects, null is set for the extra fields.
  • DROPMALFORMED: ignores whole corrupted records.
  • FAILFAST: throws an exception when it meets corrupted records.

Source: latest Javadoc.
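
To see how these options combine, here is a minimal sketch of a read that uses several of them together. The file name orders.csv, its semicolon separator, and the N/A marker for null values are assumptions made for the example:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession
        .builder()
        .appName("Ingesting CSV with options")
        .master("local")
        .getOrCreate();

    // Hypothetical semicolon-separated file with quoted fields that can
    // span several lines; records that cannot be parsed are dropped
    Dataset<Row> df = spark
        .read()
        .format("csv")
        .option("header", "true")
        .option("sep", ";")
        .option("multiLine", "true")
        .option("nullValue", "N/A")
        .option("mode", "DROPMALFORMED")
        .load("orders.csv");
    df.show(5);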

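Since PERMISSIVE mode only keeps corrupt records when the schema contains a string field to receive them, here is a sketch of that pattern. The schema, the _malformed column name, and the file books.csv are assumptions for the example:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    SparkSession spark = SparkSession
        .builder()
        .appName("Keeping corrupt CSV records")
        .master("local")
        .getOrCreate();

    // User-defined schema with an extra string field to hold the raw
    // text of any record that fails to parse
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, true),
        DataTypes.createStructField("title", DataTypes.StringType, true),
        DataTypes.createStructField("_malformed", DataTypes.StringType, true)
    });

    Dataset<Row> df = spark
        .read()
        .format("csv")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_malformed")
        .schema(schema)
        .load("books.csv");

    // Corrupt records keep their original text in the _malformed column
    df.filter(df.col("_malformed").isNotNull()).show();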