In this episode, you will learn about doing a basic ETL (extract, transform, and load) operation using Apache Spark. You will load a basic CSV file with Apache Spark, make […]
Spark in Action, Second Edition MEAP Update
I just wanted to share with you the latest update on Spark in Action, Second Edition. What’s new? Chapter 12, “Transforming your data”; Chapter 13, “Transforming entire documents”; Appendix K, […]
How I built the perfect data science team
When I assembled my first data science team, the term was barely getting printed in the Harvard Business Review. I had no clue that I was building a team pioneering […]
Advanced Spark Ingestion
Chapter 9 still covers Spark ingestion (like chapter 7 and chapter 8), but this time, it’s about “anything can become a Spark datasource.” When I was working for Zaloni, we […]
Ingestion of data from databases into Apache Spark
Chapter 8 of Spark with Java is out and it covers ingestion, as did chapter 7. However, while chapter 7 focused on ingestion from files, chapter 8 focuses on […]
File Ingestion in Apache Spark
In a typical Big Data analytics scenario, you will probably be tempted to ingest files. You know, those pesky CSV files where the comma is sometimes a semicolon or a […]
Apache Spark with Java
Apache Spark has been a game changer for distributed data processing, thanks to an easy-to-understand API, a focus on simplicity, and an adoption of modern infrastructure. However, rumors […]
A Deep-Dive Introduction to Spark for RDBMS Users
Earlier in the summer, I started a series of articles for IBM developerWorks. Those articles focus on Apache Spark from an RDBMS user perspective; of course, the database of choice […]
Meet Cactar, the Ancient Mongolian Warlord of Data Quality
A Little History: On August 18, 1227, the well-known Mongolian emperor Genghis Khan passed away. Despite numerous criticisms, based on rumors of genocide and brutality, he united Mongolia. One of his […]
Spark Boosts IBM Event Store
IBM just announced Event Store, a hybrid datastore to store events. The originality? Events can be streamed in and it is based on Apache Spark. IBM claims to be able […]