Standing in front of the convention center, next to the statue of Sir Walter Raleigh. On September 15th, 2021, after more than 18 months, I was finally able to give…
Bringing vision to Apache Spark
Drsti (pronounced drishti) is an effortless data visualization tool that interfaces easily with Apache Spark
DataFriday: manipulating the schemas of Spark dataframes
A stretch… Data is organized by schemas, data is stored on disk (or in memory), but nothing like a good old-school disk to illustrate data. In this fifth episode of…
DataFriday: extracting metadata from photos
This Rolleiflex requires a physical piece of paper and a pencil to store the photo’s metadata. Following episode 3, where I talked about metadata in relational databases, this week, I am…
DataFriday: what is Metadata?
Metadata is like the foundation of your data. In this episode, I will explain what metadata is, or at least some metadata: more specifically, metadata in relational databases. It’s a quick…
DataFriday: basic ETL ops with Apache Spark
In this episode, you will learn about doing a basic ETL (extract, transform, and load) operation using Apache Spark. You will load a basic CSV file with Apache Spark, make…
How I built the perfect data science team
When I assembled my first data science team, the term was barely getting printed in the Harvard Business Review. I had no clue that I was building a team pioneering…
Eleven key elements of data science outcome
Before thinking about what the outcome of data science is, maybe I should take the two seconds I think it takes to define it. As for how to define data science,…
(Almost) All you need to know about file ingestion in Apache Spark
As you may know, I started writing Apache Spark with Java (now renamed Spark in Action, 2nd edition). Usually, as the book develops, authors share a few excerpts of the book…
Eight very hot data trends for 2019
Read about eight very hot predictions for data management in 2019, in usages, shapes, governance, and people.
