The new Apache Spark is out (2.0.0)

Unlike the new iPhone, the release of Apache Spark v2.0.0 did not gather thousands of people in a room, but it is a very important event in the small world of analytics. This article does not cover all the updates, only a few that I consider important or that affect my day-to-day life. Note that I have not deployed v2 yet (I may be less excited afterwards).

Not counting the new features, the release includes 2,500+ patches from 300+ contributors. Thanks to this awesome community.

Improvements

Definitely Dataset, Goodbye to the Rest

For me, the most important feature of version 2 is the standardization on Dataset for everything. Goodbye RDD and DataFrame as separate concepts: in Scala, DataFrame is now simply an alias for Dataset[Row] (in Java, you use Dataset<Row>). RDDs and DataFrames will not completely disappear, but they are less in the way, and application developers can focus on their code using the Dataset object and the APIs built on it.

Configuration

Welcome to sessions! Like other data-oriented (and web-oriented) tools, Spark now has sessions: SparkSession replaces SQLContext (and HiveContext) as the entry point. It comes with a new, reportedly simpler, API. Welcome SparkSession.
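As a minimal sketch of the new entry point, here is how a session could be created from Python (PySpark). The function name, the application name, and the local[*] master are placeholders I picked for illustration; they do not come from the release notes.

```python
def build_session(app_name="spark-v2-demo", master="local[*]"):
    """Create (or reuse) a SparkSession, the Spark 2.x entry point."""
    # Imported inside the function so this file still loads on a machine
    # without PySpark; calling the function requires pyspark >= 2.0.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)  # hypothetical application name
        .master(master)     # local mode on all cores; change for a cluster
        .getOrCreate()      # reuses an already running session if there is one
    )
```

Calling spark = build_session() gives you the session, and spark.stop() releases it when the job is done.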

SQL

Lots of SQL improvements, such as subqueries and a parser that supports both ANSI SQL and HiveQL. I still need to find out more about the native DDL command implementations.
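To give an idea of what subquery support allows, here is a small PySpark sketch. It assumes that spark is an existing SparkSession and that df is a DataFrame with a numeric column called value; the function name, the view name, and the column name are all hypothetical.

```python
def rows_above_average(spark, df, view_name="measurements"):
    """Return the rows of df whose `value` exceeds the overall average."""
    # createOrReplaceTempView (new in Spark 2.0) exposes the DataFrame to SQL.
    df.createOrReplaceTempView(view_name)
    # Uncorrelated scalar subquery: the inner SELECT yields a single value
    # that the outer WHERE clause compares against.
    query = (
        "SELECT * FROM {v} "
        "WHERE value > (SELECT avg(value) FROM {v})"
    ).format(v=view_name)
    return spark.sql(query)
```

Before v2, this kind of query typically meant computing the average first and injecting it into a second query.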

CSV Natively Crunched

Wasn’t it a little miserable to have to use Databricks’ library to digest CSV? Well, now we don’t need it. Thanks to Databricks for building it in the first place (really appreciated) and, most certainly, for giving it to the community.
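Here is a short PySpark sketch of the built-in reader, assuming spark is an existing SparkSession. The function name is mine and any path you pass is up to you; with Spark 1.x, the equivalent code had to load the external com.databricks.spark.csv format.

```python
def read_csv(spark, path):
    """Load a CSV file with the reader built into Spark 2.0."""
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # let Spark guess the column types
        .csv(path)
    )
```

Schema inference costs an extra pass over the file, so on large inputs an explicit schema is probably the better choice.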

OMG – Other Miraculous Goody

Off-heap memory management for both caching and runtime execution. I need to repeat that and put it in bold: off-heap memory management for both caching and runtime execution. This may seem like nothing, but it should simplify developers’ lives considerably, as they will not have to tweak the JVM when it is a little on the “low memory” side. Sysadmins will certainly be pleased to receive fewer calls at 2:27am because the big job just blew the heap on the production server.
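As far as I can tell from the configuration documentation, off-heap memory is disabled by default and is switched on with two properties, spark.memory.offHeap.enabled and spark.memory.offHeap.size. The PySpark sketch below applies them to a session builder; the 2 GB size and the helper name are assumptions for illustration, not recommendations.

```python
# Off-heap settings; the size is expressed in bytes (2 GB here, an assumed value).
OFF_HEAP_SETTINGS = {
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": str(2 * 1024 ** 3),
}


def with_off_heap(builder, settings=OFF_HEAP_SETTINGS):
    """Apply the off-heap settings to a SparkSession builder and return it."""
    for key, value in settings.items():
        # builder.config(key, value) returns the builder, so calls can chain.
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, with_off_heap(SparkSession.builder).getOrCreate() would start a session using these settings.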

MLlib

As I wrote earlier, the generalization of the Dataset/DataFrame API is great, and it applies to MLlib too: the DataFrame-based API is now the primary API for MLlib. This will simplify all your machine learning applications (more on that soon).
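To illustrate what working with DataFrames in MLlib looks like, here is a minimal PySpark sketch of a two-stage pipeline. The column names x1, x2, and label, the choice of a linear regression, and the function name are all hypothetical; training_df is assumed to be a DataFrame that has those columns.

```python
def fit_pipeline(training_df, feature_cols=("x1", "x2"), label_col="label"):
    """Fit a small DataFrame-based MLlib pipeline and return the fitted model."""
    # Imported here so the file loads without PySpark; calling needs pyspark >= 2.0.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # Stage 1: pack the feature columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=list(feature_cols), outputCol="features")
    # Stage 2: a linear regression reading that vector column.
    regression = LinearRegression(featuresCol="features", labelCol=label_col)
    return Pipeline(stages=[assembler, regression]).fit(training_df)
```

The fitted model's transform(df) method returns the input DataFrame with an extra prediction column, so the whole flow stays in DataFrames from loading to scoring.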

Deprecations

Note that some deprecations come with the 2.x branch:

Fine-grained mode in Apache Mesos.
Support for Java v7.
Support for Python v2.6.

Check out the details of what has been removed in Spark v2.0.0.

Documentation

As I write these lines, http://spark.apache.org/docs/latest still points to v1.6.2 (the previous release), but http://spark.apache.org/docs/2.0.0 is fully operational and very well done.

And Now

I am surely missing a lot of features and improvements. I will update my recipes to make sure they work with v2, and I have added a new tag to this blog (very originally named Spark v2.0) for v2.0-specific features and articles. Fantastic new release ❤️. What’s your favorite enhancement?
