As you may know, I start writing Apache Spark with Java (now renamed Spark in Action, 2nd edition). Usually, as the book develops, authors share a few excerpt of the book […]
The Key to Machine Learning is Prepping the Right Data
Earlier this month, I was in San Francisco, CA, to attend Spark Summit 2017. I gave a talk on the phase before you can apply Machine Learning on data, using […]
HDP 2.6 is Out: Spark 2, Hive 2, and Zeppelin 0.7 are GA
Hortonworks Data Platform (HDP) v2.6 has been released and you can download the platform from their website. The sandbox is not yet available in v2.6. New Versions of Key Components […]
What are Spark Checkpoints on Dataframes?
Let’s understand what can checkpoints do for your Spark dataframes and go through a Java example on how we can use them. Checkpoint on Dataframe In v2.1.0, Apache Spark introduced […]