From Spark To Databricks

As discussed in a previous article, Spark is the leading open-source platform for processing and working with big data.

Because Spark is fairly complex to deploy and consume, the founders of the project launched Databricks, an opinionated, cloud-hosted and managed Spark solution that is consumed through a Software as a Service (SaaS) model.

Databricks can greatly shorten the time it takes to adopt Spark. There is no need to build and manage the infrastructure and the cluster yourself, so you can concentrate on working with the data immediately.

Databricks has been adopted extremely quickly by industry, and has a well-deserved unicorn valuation as a result.

Unified Workflows For Data Professionals

Most businesses have several types of data role, including Data Analysts, Data Scientists and Data Engineers. Databricks aims to bring the work of all of these data professionals into one collaborative platform. For instance:

  • Data Engineers can use Databricks to host their ETL and build data lakes or data warehouses for their business;
  • Data Analysts can use Databricks for 'slice and dice' type analytics and business intelligence reporting;
  • Data Scientists can use Databricks for their analytics, model building and model deployment.

Furthermore, because of the way that Spark and Databricks are designed, these people are given considerable flexibility to use the languages and tools that they prefer.  Often, Data Engineers prefer to use Scala for their transformations, Data Analysts prefer to use SQL, and Data Scientists prefer to use Python.  All of these are accommodated within the platform.  

Bringing all of these people onto the same platform, whilst allowing them to use the tools they are comfortable with, is a remarkable achievement. It also avoids considerable investment in building and maintaining separate technology for each group.

Notebook Based Interface

From a user interface perspective, Databricks is based around the Notebook format.  

Notebooks are an interactive programming environment, usually hosted in the browser, where we execute code iteratively and see the results step by step. A notebook is divided into cells: each cell holds a block of code, and its output appears directly beneath it as soon as the cell is run.
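To make the cell-by-cell workflow concrete, here is a minimal sketch of what two notebook cells might contain. It uses plain Python rather than Spark so that it runs anywhere, and the `sales` data and the variable names are made up purely for illustration. In a real notebook each cell would be run separately, with its output shown underneath.

```python
from collections import defaultdict

# --- Cell 1: load a small, made-up dataset and confirm it looks right ---
sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 50.0},
]
print(f"Loaded {len(sales)} rows")
# Output shown under cell 1: Loaded 3 rows

# --- Cell 2: build on the result of cell 1 and total the sales per region ---
totals_by_region = defaultdict(float)
for row in sales:
    totals_by_region[row["region"]] += row["amount"]
totals = dict(totals_by_region)
print(totals)
# Output shown under cell 2: {'North': 170.0, 'South': 80.0}
```

Because the intermediate result from cell 1 stays in memory, cell 2 can be edited and re-run on its own, which is what makes this style of working iterative.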

Notebooks are a great fit for the data domain, and are widely used there, because they show step by step what is happening. Without the Notebook format, code would be more of a black box and would be harder to debug and collaborate on.