I was looking for an effortless data visualization tool that would interface easily with Apache Spark. I found a few interesting tools, but all of them required some complex interfacing, setup, or infrastructure. In good geek fashion, I decided to write the tool myself. This lack of simple tools is how Drsti (pronounced drishti) was born.
Aren’t you tired of looking at dataframes that look like they came straight from a 1980 VT100? Sure, if you use notebooks, either standalone or hosted (IBM Watson Studio, Databricks…), you are not confronted with the issue, or at least less so. However, if you are building pipelines outside of the Data Science toys, oops, tools, you may need to visualize data in a graph.
Simply put, the idea was to go from:
+----------+----------------+-----------+-----+
|month     |internationalPax|domesticPax|pax  |
+----------+----------------+-----------+-----+
|2000-01-01|5394            |41552      |46946|
|2000-02-01|5249            |43724      |48973|
|2000-03-01|6447            |52984      |59431|
|2000-04-01|6062            |50349      |56411|
|2000-05-01|6342            |52320      |58662|
+----------+----------------+-----------+-----+
only showing top 5 rows
To:
Installing and first run
You will need node and yarn on your machine. Yarn as in the JavaScript package tool, not the Hadoop resource scheduler. Then, get Drsti from GitHub:
git clone https://github.com/jgperrin/ai.jgp.drsti.git
cd ai.jgp.drsti
yarn install
That’s a good time to get a coffee. Before you run Drsti, note where you installed the visualization tool by running pwd. On my machine, it is /Users/jgp/git/ai.jgp.drsti. Time to run Drsti.
yarn start
yarn will compile, prepare, crunch, massage the code, and start a browser (or a tab) pointing to http://localhost:3000/. After a short time, you should get something like:
You can play with the top right icons to export the graph or data. When you click on Raw data, Drsti displays the data in a tabular way.
The Row payload tab shows you how the data is being fetched from the server, which is extremely useful when you are debugging.
Finally, the last tab gives you metadata about the payload: you can see the weight of the file you download, the number of records, and the number of columns.
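To make those three figures concrete, here is a minimal sketch in plain Java of how the size, record count, and column count of a CSV payload could be computed. This is an illustration only, not Drsti's code: the PayloadStats class and its fields are hypothetical names, and the sketch assumes a simple CSV with one header line and no quoted commas.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only, not Drsti's code: summarizes a CSV payload. */
public class PayloadStats {
    public final int sizeInBytes;
    public final int recordCount;
    public final int columnCount;

    private PayloadStats(int sizeInBytes, int recordCount, int columnCount) {
        this.sizeInBytes = sizeInBytes;
        this.recordCount = recordCount;
        this.columnCount = columnCount;
    }

    /**
     * Assumes a simple CSV: the first line is the header, and no cell
     * contains a quoted comma or an embedded line break.
     */
    public static PayloadStats of(String csv) {
        // Weight of the payload, measured in UTF-8 bytes.
        int size = csv.getBytes(StandardCharsets.UTF_8).length;
        if (csv.trim().isEmpty()) {
            return new PayloadStats(size, 0, 0);
        }
        String[] lines = csv.trim().split("\\R");
        // The header line defines the number of columns.
        int columns = lines[0].split(",", -1).length;
        // Every non-empty line after the header counts as one record.
        int records = 0;
        for (int i = 1; i < lines.length; i++) {
            if (!lines[i].trim().isEmpty()) {
                records++;
            }
        }
        return new PayloadStats(size, records, columns);
    }

    public static void main(String[] args) {
        PayloadStats stats = PayloadStats.of(
            "month,pax\n2000-01-01,46946\n2000-02-01,48973\n");
        System.out.println(stats.sizeInBytes + " bytes, "
            + stats.recordCount + " records, "
            + stats.columnCount + " columns");
    }
}
```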
For this specific graph, dṛṣṭi uses the data from /public/data.csv and the metadata required for the graph rendering from /public/metadata.json.
Understanding how dṛṣṭi fits in an Apache Spark pipeline
Let’s see how Drsti interacts with Apache Spark. This section will focus on describing the architecture and data transfer. In the next section, you will start your first visualization.
As with any Spark pipeline, you will first ingest data. You can then transform it or directly render it. Before rendering, you can set some optional parameters to configure Drsti.
When you ask dṛṣṭi to render your graph, it will split the dataframe data and the graph parameters into two files:
- Data will be saved in one CSV file, with headers, called data.csv.
- Metadata will be saved in a JSON file, called metadata.json.
Both files will be saved in the public folder of the Drsti web app. Saving them automatically triggers a reload, and the app displays the updated data.
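The two-file split can be sketched in plain Java, without Spark, to show what ends up in the public folder. Treat this as an assumption-laden illustration rather than the Drsti for Spark implementation: the TwoFileExport class, its method names, and the "title" key in the JSON are all hypothetical, and the real metadata format is not described in this article.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Illustration only: writes a data file and a metadata file side by side. */
public class TwoFileExport {

    /** Builds the CSV text: one header line, then one line per row. */
    public static String toCsv(List<String> header, List<List<String>> rows) {
        // Simplification: cells are not quoted, so they must not contain commas.
        StringBuilder csv = new StringBuilder(String.join(",", header)).append('\n');
        for (List<String> row : rows) {
            csv.append(String.join(",", row)).append('\n');
        }
        return csv.toString();
    }

    /** Builds a tiny JSON document. The "title" key is a hypothetical example. */
    public static String toMetadataJson(String title) {
        String escaped = title.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"title\": \"" + escaped + "\"}";
    }

    /** Writes data.csv and metadata.json into the given public directory. */
    public static void export(Path publicDir, List<String> header,
            List<List<String>> rows, String title) {
        try {
            Files.createDirectories(publicDir);
            Files.write(publicDir.resolve("data.csv"),
                toCsv(header, rows).getBytes(StandardCharsets.UTF_8));
            Files.write(publicDir.resolve("metadata.json"),
                toMetadataJson(title).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Keeping the data and the rendering parameters in separate files means either one can be inspected or replaced by hand while debugging.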
A quick word on the internals of Drsti itself. Drsti is a React app written in TypeScript (yes, the future of JavaScript is typed). Drsti uses the IBM Carbon Design System, an awesome framework for building business applications. It currently relies on Carbon v10, but v11 is coming along soon.
Your first Drsti rendering
Let’s integrate Drsti in an Apache Spark pipeline using a very simple example. Drsti comes with a few demos using real-life datasets. Let’s study a simple one together. As usual, I will use Java.
Drsti for Spark is the component that will simplify your Apache Spark integration with Drsti. You can get it from GitHub at https://github.com/jgperrin/ai.jgp.drsti-spark. Clone it in your favorite IDE. The first demo uses the gross domestic product (GDP) of the United States of America (USA). You will find it in: /ai.jgp.drsti-spark/src/main/java/ai/jgp/drsti/spark/demo/us_gdp/UsGdpHistoryApp.java.
This is what we are going to render:
As with any Apache Spark application, get a session.
SparkSession spark = SparkSession.builder()
.appName("US GDP History")
.master("local[*]")
.getOrCreate();
Once you have a session, ingest the GDP data.
Dataset<Row> df = spark.read().format("csv")
.option("header", true)
.load("data/us_gdp/usa_gdp.csv");
If you displayed the dataframe at this point, you would get something like:
+--------------------+----+--------+-------+-------+------+----------+---------+---------+
| President|Year|Receipts|Outlays|Surplus|G.D.P.|Receipts %|Outlays %|Surplus %|
+--------------------+----+--------+-------+-------+------+----------+---------+---------+
| Calvin Coolidge|1930| 4.1| 3.3| 0.7| 98.4| 4.1| 3.4| 0.8|
| Herbert Hoover|1931| 3.1| 3.6| -0.5| 84.8| 3.7| 4.2| -0.5|
| Herbert Hoover|1932| 1.9| 4.7| -2.7| 68.5| 2.8| 6.8| -4|
| Herbert Hoover|1933| 2| 4.6| -2.6| 58.3| 3.4| 7.9| -4.5|
| Herbert Hoover|1934| 3| 6.5| -3.6| 62| 4.8| 10.6| -5.8|
|Franklin D. Roose...|1935| 3.6| 6.4| -2.8| 70.5| 5.1| 9.1| -4|
+--------------------+----+--------+-------+-------+------+----------+---------+---------+
only showing top 6 rows
Now, you only need to prepare the dataframe. As of now (and in the future as a default), the first column of the dataframe represents the value on the X-axis. Therefore, in this example, we need to do a little cleanup:
df = df
.drop("President")
.drop("Receipts")
.drop("Outlays")
.drop("Surplus")
.drop("Receipts %")
.drop("Outlays %")
.drop("Surplus %")
.orderBy("Year");
At this stage, your dataframe looks like this:
+----+------+
|Year|G.D.P.|
+----+------+
|1930| 98.4|
|1931| 84.8|
|1932| 68.5|
|1933| 58.3|
|1934| 62|
|1935| 70.5|
+----+------+
only showing top 6 rows
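To make the first-column convention concrete, here is a small sketch in plain Java that reads CSV text and treats the first column as the X values and every remaining column as a named series. This is only an illustration of the convention, not how Drsti itself parses the data: the FirstColumnAsX class is a hypothetical name, and the sketch assumes unquoted cells and numeric values in all columns but the first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: first CSV column is the X-axis, the others are series. */
public class FirstColumnAsX {
    /** X values, taken from the first column, kept as text. */
    public final List<String> x = new ArrayList<>();
    /** One list of Y values per remaining column, keyed by column name. */
    public final Map<String, List<Double>> series = new LinkedHashMap<>();

    /** Assumes one header line, unquoted cells, and numeric Y values. */
    public static FirstColumnAsX parse(String csv) {
        FirstColumnAsX result = new FirstColumnAsX();
        String[] lines = csv.trim().split("\\R");
        String[] header = lines[0].split(",");
        // Every column after the first becomes a series.
        for (int c = 1; c < header.length; c++) {
            result.series.put(header[c], new ArrayList<>());
        }
        for (int r = 1; r < lines.length; r++) {
            String[] cells = lines[r].split(",");
            result.x.add(cells[0]);
            for (int c = 1; c < header.length; c++) {
                result.series.get(header[c]).add(Double.parseDouble(cells[c]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FirstColumnAsX chart = parse("Year,G.D.P.\n1930,98.4\n1931,84.8\n");
        System.out.println("X: " + chart.x);
        System.out.println("Series: " + chart.series);
    }
}
```

With the cleaned-up dataframe above, Year would provide the X values and G.D.P. would be the single series.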
At this point, our only remaining task is to initialize Drsti and ask it to render our graph.
DrstiChart d = new DrstiLineChart(df);
// d.setWorkingDirectory("<path to the public directory of your Drsti web app>");
d.render();
You do need to tell Drsti for Spark where to save the files. There are three ways to do it:
- Do nothing: the default path is /var/drsti/public. You will have to make sure you install and configure Drsti there first.
- Using the DRSTI_DIR environment variable.
- Using the setWorkingDirectory() method of DrstiChart.
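The way these three options could combine is sketched below in plain Java. The precedence used here (explicit setting first, then DRSTI_DIR, then the default) is my assumption for illustration; the article does not state the order Drsti for Spark actually applies, and the DrstiDirectory class is a hypothetical name.

```java
/** Illustration only: picks the directory where the files would be saved. */
public class DrstiDirectory {
    /** Default path mentioned in the article. */
    public static final String DEFAULT_DIR = "/var/drsti/public";

    /**
     * Assumed precedence: an explicitly set directory wins, then the value
     * of the DRSTI_DIR environment variable, then the default path.
     */
    public static String resolve(String explicitDir, String envDir) {
        if (explicitDir != null && !explicitDir.trim().isEmpty()) {
            return explicitDir;
        }
        if (envDir != null && !envDir.trim().isEmpty()) {
            return envDir;
        }
        return DEFAULT_DIR;
    }

    public static void main(String[] args) {
        // The first program argument stands in for a call to setWorkingDirectory().
        String explicit = args.length > 0 ? args[0] : null;
        System.out.println(resolve(explicit, System.getenv("DRSTI_DIR")));
    }
}
```

Passing the environment value as a parameter, rather than reading it inside resolve(), keeps the logic easy to test.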
You’re all set. Run your Java app and you should very quickly see a simple graph of the US GDP, which, you’ll admit, is nicer than a data dump.
What’s next?
Watch out for more articles and videos on Drsti. In the meantime, you can have a look at the Air Traffic demo in the GitHub repo, which uses more Drsti for Spark APIs to customize your graphs. Customizations include labels, titles, and more.
Of course, as Drsti is an Open Source application, under the Apache 2 license, contributions are welcome!