(Almost) All you need to know about file ingestion in Apache Spark

Ingestion in Apache Spark has nothing to do with digestion or, at least, with difficult digestion.

As you may know, I started writing Apache Spark with Java (now renamed Spark in Action, 2nd edition). Usually, as a book develops, authors share a few excerpts from it through articles. Spark in Action is no exception to the tradition.

The great team at Manning just finished sharing a series on file ingestion in Apache Spark using Java. I wanted to share the links to those articles here.

Complex ingestion from CSV: everybody uses CSV (comma-separated values), but it can be tricky from time to time. Apache Spark offers a lot of options for dealing with CSV, and this article should give you a pretty good intro to them.
Ingesting a JSON file: JSON is becoming so omnipresent that it is now everywhere. However, did you know that there are multiple types of JSON formats? Find out in the article, and of course, learn how to ingest one of them in Apache Spark. The lab associated with the article uses a real-life dataset downloaded from an Open Data portal.
Ingesting an XML file: XML is still in the picture. I loved XML and I miss it. I think I liked its rigor and associated constraints, probably all the things that made it less popular than JSON… In this article, you’ll play with the NASA patent datasets. Have fun!
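
To give a feel for why CSV "can be tricky", here is a minimal, plain-Java sketch (no Spark involved) of a record whose quoted field contains the separator. The sample record, the class name CsvQuoteSketch, and the helper splitQuoted are made up for illustration and are not taken from the articles; the semicolon separator and star quote character are arbitrary choices too. Spark's CSV reader handles this case through options such as sep and quote, so you would not write this parsing yourself.

```java
import java.util.ArrayList;
import java.util.List;

public class CsvQuoteSketch {

    /**
     * Splits one CSV line on sep, treating separators found between a pair
     * of quote characters as data. Illustrative only: escaped quotes and
     * records spanning several lines are not handled.
     */
    public static List<String> splitQuoted(String line, char sep, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quote) {
                // Toggle quoted mode; the quote character itself is dropped.
                inQuotes = !inQuotes;
            } else if (c == sep && !inQuotes) {
                // An unquoted separator closes the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical record: id;title;year, with ';' inside the quoted title.
        String line = "4;*Spark; the Java way*;2019";

        // A naive split sees four fields because it cuts the title in two.
        System.out.println(line.split(";").length);

        // The quote-aware split sees the three fields that were intended.
        System.out.println(splitQuoted(line, ';', '*'));
    }
}
```

The naive split is the kind of surprise the CSV article helps you avoid: once the separator and the quote character are declared, the title stays in one piece.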

All the examples are available in the GitHub repository linked to chapter 7, and they are all updated to work with Apache Spark v2.4 (there was nothing to change, really). The previous articles (what happens behind the scenes with [Apache] Spark and the Majestic role of the dataframe in Spark) are also great resources. They share the love.

Finally, a quick update: I am working on finalizing chapter 12, which is a cornerstone of the book. I will be attending (and speaking at) IBM Think 2019 in San Francisco, CA on February 11th. Shout out if you’ll be there. There is also a new Facebook page for the book; please like it (and like my jokes).

Happy reading!
