What is Apache Spark, the podcast

A couple of weeks ago, I chatted with Tobias Macey about data engineering and, more specifically, Apache Spark. Tobias Macey runs the Data Engineering Podcast, which you can directly…

Checklist for All Things Open (ATO)

The checklist is updated for ATO 2019! All Things Open 2018 (ATO 2018), a premier open source conference, will open its doors on October 21st, 2018, at the Raleigh Convention Center,…

Lazy is good: understand why it’s good for you that Spark is lazy

This new chapter, chapter 4, of Spark with Java is not only about celebrating laziness; it also teaches, through examples and experiments, the fundamental differences in building a data…

The majestic dataframe in Apache Spark

Chapter 3 of Spark with Java focuses on the dataframe. There is something majestic about Apache Spark’s dataframe, like those mountains of Montana. Apache Spark revolves around the concept of…

Advanced Spark Ingestion

Chapter 9 still covers Spark ingestion (like chapter 7 and chapter 8), but this time, it’s about “anything can become a Spark datasource.” When I was working for Zaloni, we…

Ingestion of data from databases into Apache Spark

Chapter 8 of Spark with Java is out and it covers ingestion, as did chapter 7. However, whereas chapter 7 focused on ingestion from files, chapter 8 focuses on…

Apache Spark with Java

Apache Spark has been a game changer for distributed data processing, thanks to an easy-to-understand API, a focus on simplicity, and an adoption of modern infrastructure. However, rumors…

Loading CSV in Spark

Loading CSV in Apache Spark has been a standard feature since version 2.0; previously, you needed a free plugin (provided by Databricks). Although it starts with a basic value proposition: Comma…

A New Dimension for Apache Spark Clusters

Summer has been busy and it’s now behind us. I won’t annoy you with all the details of what happened, but I wanted to come back to a project I…

Meet Cactar, the Ancient Mongolian Warlord of Data Quality

A Little History: On August 18, 1227, the well-known Mongolian emperor Genghis Khan passed away. Despite numerous criticisms, based on rumors of genocide and brutality, he united Mongolia. One of his…

Let's be social

jgperrin.substack

/in/jgperrin

/jgperrin
