Earlier this month, I was in San Francisco, CA, to attend Spark Summit 2017. I gave a talk on the phase before you can apply Machine Learning to data, using […]

Hortonworks Data Platform (HDP) v2.6 has been released and you can download the platform from their website. The sandbox is not yet available in v2.6. New Versions of Key Components […]

A quick flashback on a few articles I published recently. You Are Not a Machine, So Learn Machine Learning published by Database Trends and Applications on February 21st, 2017. What Are Spark […]

Let’s understand what checkpoints can do for your Spark dataframes and go through a Java example of how we can use them. Checkpoint on Dataframe In v2.1.0, Apache Spark introduced […]

A quick post to share the next Spark event that we will run in the NC Triangle (RTP – Chapel Hill, Durham, Raleigh). This event will be held on December […]

To help foster the Apache Spark community in the (Research) Triangle region (Raleigh, Durham, and Chapel Hill in North Carolina), some friends and I decided to create a Slack team […]

This week has seen the release of Apache Spark v2.0.0. As with every major release, you can expect some changes. My Java recipes for Apache Spark have been affected, but […]

Unlike the new iPhone, the release of Apache Spark v2.0.0 did not gather thousands of people in a room, but it is a very important event in the small world of […]

When you start an application, you need to think about where it’s going to run, and also how it’s going to run. Basically, the way I use Spark is in […]

UDF stands for User Defined Functions. With those, you can easily extend Apache Spark with your own routines and business logic. Let’s see how we can build them and deploy […]