Earlier this month, I was in San Francisco, CA, to attend Spark Summit 2017. I gave a talk on the phase that comes before you can apply Machine Learning to data, using […]
HDP 2.6 is Out: Spark 2, Hive 2, and Zeppelin 0.7 are GA
Hortonworks Data Platform (HDP) v2.6 has been released and you can download the platform from their website. The sandbox is not yet available in v2.6. New Versions of Key Components […]
Recent Publications
A quick flashback on a few articles I published recently. You Are Not a Machine, So Learn Machine Learning published by Database Trends and Applications on February 21st, 2017. What Are Spark […]
What are Spark Checkpoints on Dataframes?
Let’s understand what checkpoints can do for your Spark dataframes and go through a Java example of how we can use them. Checkpoint on Dataframe In v2.1.0, Apache Spark introduced […]
Apache Spark Event in the Triangle
A quick post to share the next Spark event that we will run in the NC Triangle (RTP – Chapel Hill, Durham, Raleigh). This event will be held on December […]
Let’s Start Slack’ing
To help foster the Apache Spark community in the (Research) Triangle region (Raleigh, Durham, and Chapel Hill in North Carolina), with some friends, we decided to create a Slack team […]
Spark Recipes Updated
This week has seen the release of Apache Spark v2.0.0. As with every major release, you can expect some changes. My Java recipes for Apache Spark have been affected, but […]
The new Apache Spark is out (2.0.0)
Unlike the new iPhone, the release of Apache Spark v2.0.0 did not gather 1,000s of people in a room, but it is a very important event in the small world of […]
Ways to Run your Apps with Apache Spark
When you start an application, you need to think about where it’s going to run, and also how it’s going to run. Basically, the way I use Spark is in […]
Managing Java UDF with Apache Spark
UDF stands for User Defined Function. With UDFs, you can easily extend Apache Spark with your own routines and business logic. Let’s see how we can build them and deploy […]