Over the last six months, we’ve been working with Databricks on a client project.
For those who aren’t aware, Databricks is a managed, SaaS-style platform built around Apache Spark, the popular open-source big data processing framework.
Though we were initially sceptical about Databricks and leaned towards a DIY Spark deployment, our experience has been very positive, and we now recommend it over custom-built big data analytics environments where it is a good fit.
These are the key advantages we have found:
Common Environment For Data Analysts, Data Scientists and Data Engineers
Typically, these different roles in a data team are siloed, each using their own languages, tools and approaches. For instance, data engineers often use Scala as a legacy of the Hadoop/Spark evolution, whilst data scientists prefer Python. Data analysts might prefer Tableau or Power BI, whereas data scientists might prefer working in notebook-based UIs.
The real power of Databricks is that it gets all of these people onto a common platform, logging into the same system and interacting with the same datasets through the same notebook-based UI.
Not only can this avoid the significant cost of building and managing separate tooling stacks, it also gives the team a common language and frame of reference for data innovation.
Fully Managed Spark
Though deploying Spark isn’t the hardest thing in the world, it is not trivial: it takes real effort to deploy, monitor, maintain and upgrade.
Using Databricks, you simply select an auto-scaling cluster size and create it. The cluster comes configured according to best practices for security, and upgrades and maintenance are hugely simplified because the platform handles them.
The real benefit here is time to value: you avoid building out cloud infrastructure and can redirect that time and budget straight into business deliverables.
In Your Own Cloud Account
Often, people are concerned about using SaaS products for their data, due to regulatory requirements, information security, or the value of the data assets.
Databricks solves this elegantly by keeping data storage and processing within your own cloud account, where they can be securely managed and controlled. This makes information security and governance much easier, whilst still giving you a SaaS-like experience.
Unification Of Data Warehouse And Data Lake
Within an enterprise, there will often be a number of data lakes for unstructured data and a data warehouse for more business-intelligence-type scenarios, with lots of ETL moving data between the databases and the data lakes.
Databricks has the concept of a “Data Lakehouse”, whereby you organise your data as a data lake but can then query it using SQL and data-warehouse-style semantics.
This can massively simplify the technology estate and allow you to consolidate down onto one data store.
Unification Of Batch and Streaming
As with data lakes and data warehouses, enterprises today also run a mixture of batch and streaming workloads.
Databricks and Spark have a number of features, notably Structured Streaming, that let you express streaming jobs with the same APIs you use for batch, reducing complexity in the data infrastructure.
Exit Strategy Through Open Source Spark
For the most part, if you become unhappy with Databricks for any reason, there is usually a fairly simple exit path to open-source or another managed Spark. Because your data typically lives in your own cloud storage such as S3, in an open format such as Parquet, lock-in is minimised.
I am aware that this article sounds like a sales pitch for Databricks, but to us these benefits are hard to argue with. Less infrastructure and middleware to manage, faster time to value, and a simplified architecture with limited lock-in is a win/win.