Spark is a Hadoop Killer

Apache Spark, a Hadoop Killer? Photo © Wikipedia, Illustration © JG Perrin

Of course, nobody will tell you I am right. At least officially. But what was the goal of Hadoop? To perform analytics across a wide range of servers.

Of course, Hadoop comes with plenty of side libraries, like Pig and Mahout, and with a whole ecosystem of companies providing support or otherwise involved in the game, like Cloudera, Hortonworks, IBM, and quite a few more.

Of course, Hadoop is more than that: it is a very powerful file system through HDFS, it is a resource allocator thanks to YARN, and it is certainly one of the best-performing freely available MapReduce implementations. But isn’t that the problem? Is Hadoop only about MapReduce?

But what I hear from the field is that it is complex, requires yet another development paradigm, and is limited and slow…
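To make the "another development paradigm" complaint a bit more concrete, here is a minimal sketch of the MapReduce model as a word count in plain Python. This is not Hadoop's API, only an illustration of the three phases you have to fit your problem into; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are mine and purely hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key (a real framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all the values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence, on one machine, for illustration only."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Spark is fast", "Hadoop is big", "Spark is a Hadoop killer"]
    print(word_count(sample))
```

Even for something as simple as counting words, you have to think in terms of emitting key/value pairs and reducing them per key, which is the kind of mental shift I hear about from the field.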

Here comes Apache Spark.

Spark is designed to perform very fast, in-memory computation across multiple servers, using as much memory as you give it. It provides a common interface to a SQL-like language called Spark SQL, a machine learning library creatively called MLlib, a graph extension called GraphX, and one of my favorites, Spark Streaming, for stream computing.

Spark does not provide its own storage like Hadoop does with HDFS, nor does it require its own resource manager like YARN, but it can use the storage that is already out there, including HDFS, and it can run on top of YARN.

So yes, technically Spark is not a Hadoop killer, but when it comes to enterprise applications, they serve a similar purpose: Big Data Analytics or Big Analytics.

What do you think?
