What Are Spark & Databricks And How Can They Benefit Your Business?

Imagine you need to analyse and work with a very large dataset.  

In most situations, you'll be able to do this on your laptop or a single server, using a tool such as MATLAB or Excel, or Python with libraries such as Pandas and NumPy.  
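To make that concrete, here is a minimal single-machine sketch in Pandas; the file name and column names are purely illustrative:

```python
import pandas as pd

# Load a dataset that fits comfortably in memory.
# The file and column names here are purely illustrative.
df = pd.read_csv("sales.csv")

# A typical single-machine aggregation: total revenue per region.
summary = df.groupby("region")["revenue"].sum()
print(summary)
```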

However, the challenge comes when your data grows too big to fit in memory, or your computations become too slow.  In today's environment of huge user activity logs, web-scale data, IoT devices and the general explosion in data, this problem is becoming more common and more acute.  

To solve this, one approach is to divide and distribute the data and the processing over a cluster of machines.  That way, the calculations can be performed in parallel on small subsets of the data, and the results then combined to give you your final answer.  Spreading your work over multiple machines in this way introduces some complexity, but tends to give a better performance and cost profile than simply buying bigger and bigger servers.  

Spark is a framework and runtime for supporting this type of distributed processing. 
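As an illustration, here is a sketch of the same kind of aggregation as the Pandas example above, written against Spark's DataFrame API.  The session setup is standard PySpark; the file and column names are again illustrative.  Run locally it uses a single process, but the identical code on a cluster spreads the work across many machines:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session.  Locally this runs in a single
# process; on a cluster the same code spreads the work across workers.
spark = SparkSession.builder.appName("distributed-aggregation").getOrCreate()

# Spark reads the file into partitions that are processed in parallel.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Each partition is aggregated locally, then the partial results are
# combined into the final answer.
summary = df.groupBy("region").agg(F.sum("revenue").alias("total_revenue"))
summary.show()
```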

It allows you to work with your data in a few ways:

  • To search your data: Spark will allow you to work with millions of data items in order to search, filter, aggregate and interrogate them.  This type of work is bread and butter to relational databases, but even they do not scale to very large datasets stored in diverse locations, leaving you stuck partitioning your data across servers and buying bigger and bigger machines to keep up.  Spark provides an interface called Spark SQL for this kind of analysis (see the first sketch after this list);
  • For numerical analysis of your data: Spark will allow you to move beyond the type of work a database might do, to more complex statistical work where you reprocess, analyse and interrogate the statistical properties of your data, moving into the realms of forecasting, anomaly detection and other machine learning model creation.  This is exactly the type of work that benefits from being carried out on a large cluster;
  • To manage your data: Spark will allow you to move, filter, cleanse and aggregate your data between different destinations, in a manner similar to traditional extract, transform and load (ETL).  There are lots of tools in the ETL space, but all hit limitations with large, complex datasets and lack good support for dividing the work across a cluster;
  • To process streaming data: The data we are mainly interested in at Timeflow is streaming data, unbounded streams of events which arrive with high velocity over time.  Spark Streaming allows you to process and analyse this data as it occurs, reducing the time to insight (see the second sketch after this list);
  • And more: Spark also provides libraries for building and deploying machine learning models (MLlib), for graph processing (GraphX) and much more, all running across a distributed cluster in a resilient and robust way.
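To illustrate the first point above, here is a minimal Spark SQL sketch.  The input file, view name and columns are assumptions made for the example, not part of any particular dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Register a DataFrame as a temporary view so it can be queried in SQL.
events = spark.read.json("events.json")
events.createOrReplaceTempView("events")

# A familiar SQL query, executed in parallel across the cluster.
top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS event_count
    FROM events
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
""")
top_users.show()
```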
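And to illustrate the streaming point, here is a minimal sketch using Spark's Structured Streaming API (the newer successor to the original Spark Streaming library).  The input directory, schema and window size are illustrative choices:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Treat a directory of incoming JSON files as an unbounded stream.
events = (spark.readStream
          .schema("user_id STRING, action STRING, ts TIMESTAMP")
          .json("/data/incoming/"))

# Count events per action in one-minute windows, updated as data arrives.
counts = events.groupBy(F.window("ts", "1 minute"), "action").count()

# Write running results to the console; in production this would be a
# durable sink such as Delta, Kafka or a database.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```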

Databricks 

Because Spark is fairly complex to deploy and operate, the founders of the project launched Databricks: an opinionated, cloud-hosted and managed Spark solution with a friendlier notebook-based user interface for working with Spark.  

Databricks aims to unify the work of all data professionals in one collaborative platform.  This includes Data Engineers working on ETL and building data lakes, Data Analysts doing slice-and-dice analytics, and Data Scientists doing analytics, model building and model deployment.  

Databricks can massively accelerate the adoption of Spark, avoiding the need to build and manage all of the infrastructure yourself so that you can concentrate on working with the data immediately.  It also brings opinionated workflows and patterns, such as the Delta Lake architecture, to encourage the adoption of best practices.
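As a flavour of what that looks like in practice, here is a minimal sketch of the Delta Lake pattern.  It assumes a Delta-enabled Spark environment (such as Databricks), and the table paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

# Land raw data as a Delta table.  Delta adds ACID transactions,
# schema enforcement and time travel on top of plain Parquet files.
# (Requires a Delta-enabled Spark environment, e.g. Databricks.)
raw = spark.read.json("/data/raw/events/")
raw.write.format("delta").mode("overwrite").save("/delta/bronze/events")

# Downstream jobs read the table like any other Spark data source.
bronze = spark.read.format("delta").load("/delta/bronze/events")
bronze.show()
```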