Loading CSV in Apache Spark has been a standard feature since version 2.0; before that, you needed a free plugin provided by Databricks.
The value proposition sounds simple: Comma Separated Values (aka CSV). But, as you know, CSV can be tricky:
- Do you have a header?
- C stands for Comma, but what is your real delimiter?
- Are your fields splitting over multiple lines?
And many more questions and configuration options.
Loading a CSV file is fast and easy, as you can see in the next Java example.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("Ingesting CSV")
    .master("local")
    .getOrCreate();

Dataset<Row> df = spark.read()
    .format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("my_csv_file.csv");

df.show();
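As the option table below notes, inferSchema requires an extra pass over the data. If you already know the column types, a rough alternative is to supply an explicit schema instead; this is only a sketch, and the column names and types below are made up for illustration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

SparkSession spark = SparkSession.builder()
    .appName("Ingesting CSV with an explicit schema")
    .master("local")
    .getOrCreate();

// Hypothetical columns: adjust names and types to match your file.
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("id", DataTypes.IntegerType, true),
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("birth_date", DataTypes.DateType, true)
});

Dataset<Row> df = spark.read()
    .format("csv")
    .option("header", "true")
    .schema(schema)               // no extra pass: the schema is given, not inferred
    .load("my_csv_file.csv");

df.printSchema();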
However, the list of options for reading CSV is long and somewhat hard to find. Therefore, here it is, with additional explanations, updated as of Spark v2.2.0. Note that file ingestion is covered in detail in Spark with Java's chapter 7.
CSV options for ingestion in Spark
Option | Default | Comment | Since version |
---|---|---|---|
sep | , | Sets the single character as a separator for each field and value. | v2.0.0 |
encoding | UTF-8 | Decodes the CSV files by the given encoding type. | v2.0.0 |
quote | " | Sets the single character used for escaping quoted values where the separator can be part of the value. To turn off quotations, set an empty string rather than null. This behavior differs from com.databricks.spark.csv (the optional parser for Spark v1.x). | v2.0.0 |
escape | \ | Sets the single character used for escaping quotes inside an already quoted value. | v2.0.0 |
comment | empty string | Sets the single character used for skipping lines beginning with this character. By default, it is disabled. | v2.0.0 |
header | false | Uses the first line as names of columns. Two-line headers are not supported. | v2.0.0 |
inferSchema | false | Infers the input schema automatically from data. It requires one extra pass over the data. | v2.0.0 |
ignoreLeadingWhiteSpace | false | Flag indicating whether or not leading whitespaces from values being read should be skipped. | v2.0.0 |
ignoreTrailingWhiteSpace | false | Flag indicating whether or not trailing whitespaces from values being read should be skipped. | v2.0.0 |
nullValue | empty string | Sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type. | v2.0.0 |
nanValue | NaN | Sets the string representation of a "non-number" value. | v2.0.0 |
positiveInf | Inf | Sets the string representation of a positive infinity value. | v2.0.0 |
negativeInf | -Inf | Sets the string representation of a negative infinity value. | v2.0.0 |
dateFormat | yyyy-MM-dd | Sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to the date type. | v2.0.0 |
timestampFormat | yyyy-MM-dd'T'HH:mm:ss.SSSXXX | Sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to the timestamp type. | v2.1.0 |
maxColumns | 20480 | Defines a hard limit of how many columns a record can have. | v2.0.0 |
maxCharsPerColumn | -1 | Defines the maximum number of characters allowed for any given value being read. By default, it is -1 meaning unlimited length. In Spark v2.0.0, the default was 1000000. | v2.0.0 |
mode | PERMISSIVE | Allows a mode for dealing with corrupt records during parsing. It supports the following case-insensitive modes: PERMISSIVE, DROPMALFORMED, and FAILFAST. | v2.0.0 |
columnNameOfCorruptRecord | the value specified in spark.sql.columnNameOfCorruptRecord | Allows renaming the new field containing the malformed string created by PERMISSIVE mode. This overrides spark.sql.columnNameOfCorruptRecord. | v2.2.0 |
multiLine | false | Parse one record, which may span multiple lines. | v2.2.0 |
Source: latest Javadoc.
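To make the table concrete, here is a minimal sketch combining a few of these options. It assumes a hypothetical semicolon-separated file named books.csv with a header row, fields that may span several lines, and the occasional malformed record; the file name and sample values are made up, while the option names come straight from the table above.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("Ingesting CSV with options")
    .master("local")
    .getOrCreate();

Dataset<Row> df = spark.read()
    .format("csv")
    .option("header", "true")            // first line holds the column names
    .option("sep", ";")                  // the real delimiter is a semicolon, not a comma
    .option("quote", "\"")               // values containing ';' are wrapped in double quotes
    .option("multiLine", "true")         // a quoted field may span several lines
    .option("dateFormat", "yyyy-MM-dd")  // used once a column is typed as a date
    .option("nullValue", "N/A")          // treat N/A as null
    .option("mode", "DROPMALFORMED")     // drop records that cannot be parsed
    .load("books.csv");

df.show();

With mode left at PERMISSIVE instead of DROPMALFORMED, malformed lines would be kept and their raw text placed in the column named by columnNameOfCorruptRecord.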