NC State teaching room: I love all the multimedia gadgets, I think I used them all.

NCDevCon is a yearly event in the Triangle, targeted at developers of all breeds, from front-end to back-end. Its origins go back to the ol’ days of Adobe ColdFusion, and it remains a reference on that subject. However, the conference has grown to embrace mobile and more of the web, including APIs and Spark.

NCDevCon used NC State University’s facilities on Centennial Campus. It’s almost as nice as UNC’s campus in Chapel Hill (Go Heels!).

Tutorial

Offering my first tutorial on Apache Spark in this context was a real honor and pleasure. In a little less than two hours, with about 20 people, we went through the installation of the required software, discovered a bit of theory, and walked through two examples of Spark with Java.

I explained what an Analytics Operating System is, or at least my take on it. We then went over some use cases, including the original approach Veracity is working on, as well as IBM DSX and Event Store. Of course, I mentioned CLEGO, but it is not a required piece of hardware!

All the code can be downloaded from GitHub, as it should be. However, we only analyzed the two examples I had planned (I was hoping for more). We focused on the DataFrame as the data container, without going into too much detail on the RDD. Finally, we also went into the architecture of Spark, understanding key concepts such as the driver, the master, and the workers/slaves. I tried to alternate theory and examples to make the lecture more agreeable, but only my students-of-the-day can tell you whether I succeeded.
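
If you are curious about how the DataFrame relates to the RDD and to the driver/master/worker trio, here is a minimal sketch. It is not part of the tutorial repository; the package name, class name, and the local[*] master are just assumptions for a laptop run.

package net.jgp.labs.spark.sketches; // hypothetical package, not in the tutorial repo

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameVsRddSketch {

    public static void main(String[] args) {
        // The driver is this JVM; with "local[*]" Spark also acts as its own
        // master and runs the workers on all local cores.
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame vs RDD")
                .master("local[*]")
                .getOrCreate();

        // A DataFrame is a Dataset<Row>: distributed data plus a schema.
        Dataset<Row> df = spark.read().format("csv")
                .option("header", "true")
                .option("inferSchema", "true")
                .load("data/authors.csv"); // reusing the tutorial's sample file

        // The lower-level RDD is still available underneath when you need it.
        JavaRDD<Row> rdd = df.javaRDD();
        System.out.println("Partitions: " + rdd.getNumPartitions());
    }
}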

Examples

Of course, the challenge is making sure that everyone gets the right software. Java 8 is pretty straightforward; Eclipse too, as long as you use Oxygen. I have forgotten how Maven and Git integration was in older versions of the IDE: just switch to Oxygen.
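
If you start from an empty Maven project rather than cloning the repository, a single dependency is enough to run the examples. The Scala suffix and the version below are assumptions from that era; align them with the Spark release you actually use.

<!-- Hedged example: spark-sql pulls in Spark Core transitively. -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
</dependency>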

We walked through a basic CSV ingestion.

package net.jgp.labs.spark.l000_ingestion.l000_csv;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDatasetApp {

    public static void main(String[] args) {
        System.out.println("Working directory = " + System.getProperty("user.dir"));
        CsvToDatasetApp app = new CsvToDatasetApp();
        app.start();
    }

    private void start() {
        // Local Spark session: the driver, master, and worker all run in this JVM.
        SparkSession spark = SparkSession.builder()
                .appName("CSV to Dataset")
                .master("local")
                .getOrCreate();

        // Ingest the CSV file into a DataFrame (a Dataset of Rows),
        // letting Spark infer the column types.
        String filename = "data/tuple-data-file.csv";
        Dataset<Row> df = spark.read().format("csv")
                .option("inferSchema", "true")
                .option("header", "false")
                .load(filename);
        df.show();
        df.printSchema();
    }
}

Our second example was slightly more complex: it involved joining two dataframes and applying an aggregation function.

package net.jgp.labs.spark.l200_join.l030_count_books;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AuthorsAndBooksCountBooksApp {

    public static void main(String[] args) {
        AuthorsAndBooksCountBooksApp app = new AuthorsAndBooksCountBooksApp();
        app.start();
    }

    private void start() {
        SparkSession spark = SparkSession.builder()
                .appName("Authors and Books")
                .master("local").getOrCreate();

        String filename = "data/authors.csv";
        Dataset<Row> authorsDf = spark.read()
                .format("csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load(filename);
        authorsDf.show();

        filename = "data/books.csv";
        Dataset<Row> booksDf = spark.read()
                .format("csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load(filename);
        booksDf.show();

        // Join authors with their books (the left join keeps authors with no book),
        // rename the book's id column to bookId, then count the rows per author.
        Dataset<Row> libraryDf = authorsDf
                .join(
                        booksDf,
                        authorsDf.col("id").equalTo(booksDf.col("authorId")),
                        "left")
                .withColumn("bookId", booksDf.col("id"))
                .drop(booksDf.col("id"))
                .groupBy(
                        authorsDf.col("id"),
                        authorsDf.col("name"),
                        authorsDf.col("link"))
                .count();

        libraryDf.show();
        libraryDf.printSchema();
    }
}

You can see similar examples in my developerWorks articles: Offloading your Informix data in Spark, part 1, part 2, and part 3.

Takeaways

I think we had fun; there was definitely a lot of laughter in the gorgeous NC State University room. I want to do that again (and more!). Please share your thoughts on the approach, I’d love to hear from you.
