A new chapter of Spark in Action, 2e (formerly known as Spark with Java), is available. Chapter 11 is titled “Working with SQL”. In chapter 11, you will explore how […]
(Almost) All you need to know about file ingestion in Apache Spark
As you may know, I started writing Apache Spark with Java (now renamed Spark in Action, 2nd edition). Usually, as the book develops, authors share a few excerpts of the book […]
What is Apache Spark, the podcast
A couple of weeks ago, I chatted with Tobias Macey about data engineering and, more specifically, Apache Spark. Tobias Macey runs the Data Engineering Podcast, which you can directly […]
Checklist for All Things Open (ATO)
The checklist is updated for ATO 2019! All Things Open 2018 (ATO 2018), a premier open source conference, will open its doors on October 21st, 2018, at the Raleigh Convention Center, […]
Lazy is good: understand why it’s good for you that Spark is lazy
This new chapter, chapter 4, of Spark with Java (https://www.manning.com/books/spark-with-java) is not only about celebrating laziness; it also teaches, through examples and experiments, the fundamental differences in building a data […]
The majestic dataframe in Apache Spark
Chapter 3 of Spark with Java focuses on the dataframe. There is something majestic about Apache Spark’s dataframe, like those mountains of Montana. Apache Spark revolves around the concept of […]
Advanced Spark Ingestion
Chapter 9 still covers Spark ingestion (like chapter 7 and chapter 8), but this time, it’s about “anything can become a Spark datasource.” When I was working for Zaloni, we […]
Ingestion of data from databases into Apache Spark
Chapter 8 of Spark with Java is out and it covers ingestion, as did chapter 7. However, while chapter 7 focused on ingestion from files, chapter 8 focuses on […]
File Ingestion in Apache Spark
In a typical Big Data analytics scenario, you will probably be tempted to ingest files. You know, those pesky CSV files where the comma is sometimes a semicolon or a […]
Apache Spark with Java
Apache Spark has been a game changer for distributed data processing, thanks to an easy-to-understand API, a focus on simplicity, and an adoption of modern infrastructure. However, rumors […]