Chapter 8 of Spark with Java is out. Like chapter 7, it covers ingestion, but where chapter 7 focused on ingestion from files, chapter 8 focuses on ingestion from databases. It explains how to get data from databases, both relational and NoSQL, into Apache Spark.
Not only RDBMS
In this chapter, you’ll learn about ingesting data from relational databases, with examples using MySQL. When you use a database that Spark does not support natively, you will need a specific dialect. You will learn the role dialects play in the communication between Spark and databases, with an in-depth example using IBM Informix, one (if not the only one) of the RDBMSs dear to my heart.
Ingesting full tables at a time might not do it for you, so you’ll also learn how to build advanced queries in Spark that run against the database prior to ingestion, and get to understand more advanced communication with databases.
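To give a rough idea of what "addressing the database prior to ingestion" can look like, here is a minimal sketch (not taken from the book's labs). Spark's JDBC reader accepts a parenthesized subquery with an alias in its `dbtable` option, so the database does the filtering before any rows reach Spark. The class `JdbcIngestionOptions`, its method names, and the sample query are hypothetical names I made up for illustration; only the option keys (`url`, `user`, `password`, `dbtable`) are Spark's.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper (not from the book): prepares the options that
 * Spark's JDBC data source expects when ingesting the result of a query
 * instead of a full table.
 */
final class JdbcIngestionOptions {

  private JdbcIngestionOptions() {
    // Static helpers only.
  }

  /**
   * Wraps a SELECT statement as "(query) alias" so it can be used
   * wherever Spark expects a table name.
   */
  static String asSubquery(String sql, String alias) {
    String trimmed = sql.trim();
    // A trailing semicolon is not valid inside a subquery, so drop it.
    if (trimmed.endsWith(";")) {
      trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
    }
    return "(" + trimmed + ") " + alias;
  }

  /** Builds the option map for a query-based JDBC ingestion. */
  static Map<String, String> forQuery(
      String url, String user, String password, String sql, String alias) {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("url", url);
    options.put("user", user);
    options.put("password", password);
    // The subquery goes where a plain table name would normally go.
    options.put("dbtable", asSubquery(sql, alias));
    return options;
  }
}
```

With Spark and a JDBC driver on the classpath, a map built this way (for example from a query such as `SELECT title FROM film WHERE rating = 'PG'`, assuming a table with those columns) could be handed to `spark.read().format("jdbc").options(options).load()`. The chapter's own labs may well structure this differently.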
Finally, you’ll get into ingesting data from Elasticsearch. Elasticsearch does not define itself as a database. However, it’s a pretty cool technology that I call a database (or datastore, for that matter). If you do not know Elastic, the appendices also give you a quick intro.
Extra material
Along with chapter 8, Manning is releasing appendix G, for those who need help with relational databases, mainly installing them and getting started with some of them. We also released appendix E, about getting started with Elastic and installing the sample data.
Appendix I (I for Ingestion) is already published. It serves as a reference for options for ingestion, connecting to data sources, etc.
Of course, all the labs/examples are on GitHub at https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch08.
Feel free to ask questions either here or on the Manning forum at https://forums.manning.com/forums/spark-with-java.
Happy reading, and thanks for supporting this effort!