Imagine you need to analyse and work with a very large dataset.
For most situations, you’ll be able to do this on your laptop or a single server, using a tool such as MATLAB or Excel, or Python with libraries such as Pandas and NumPy.
However, the challenge comes when your data grows too big to fit in memory, or your computations become too slow. In today’s environment of huge user-activity logs, web-scale data, IoT devices and a general explosion in data volumes, this problem is becoming more common and acute.
To solve this, one approach is to divide and distribute the data and the processing over a cluster of machines. That way, the calculations can be performed in parallel on small subsets of the data, and the results then combined to give you your final answer. Spreading your work over multiple machines in this way introduces some complexity, but tends to give a better performance and cost profile than simply buying bigger and bigger servers.
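To make the pattern concrete, here is a toy single-machine sketch of the same split-compute-combine idea, using only Python’s standard library. The dataset and worker count are invented for illustration; Spark applies this same pattern, but across many machines and against data that doesn’t fit on any one of them.

```python
# Toy illustration of divide-and-combine: split the data into chunks,
# compute a partial result per chunk in parallel, then combine.
from concurrent.futures import ProcessPoolExecutor

def chunk_total(chunk):
    # Each worker computes a partial result on its own subset of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(10_000_000))   # made-up dataset
    n_workers = 4
    size = len(data) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]

    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(chunk_total, chunks)

    # Combine the partial results into the final answer.
    print(sum(partials))
```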
Spark is a framework and runtime for supporting this type of distributed processing.
It allows you to work with your data in a few ways, as the sketch after this list illustrates:

- SQL, via Spark SQL, for familiar declarative queries.
- DataFrames, a tabular API similar in spirit to Pandas, available from Python, Scala, Java and R.
- RDDs, the lower-level resilient distributed dataset API, when you need fine-grained control.

On top of these sit libraries such as Structured Streaming for continuous data and MLlib for machine learning.
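As a taste of the first two styles, here is a minimal PySpark sketch. It assumes a local Spark installation and a hypothetical events.csv file with a country column; neither comes from this article.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session, the entry point to the API.
spark = SparkSession.builder.appName("example").getOrCreate()

# DataFrame API: read a CSV and aggregate, much like Pandas.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("country").count().show()

# SQL: register the DataFrame as a temporary view and query it.
df.createOrReplaceTempView("events")
spark.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()

spark.stop()
```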
Because Spark is fairly complex to deploy and operate, the founders of the project launched Databricks: an opinionated, cloud-hosted and managed Spark solution with a friendlier notebook-based user interface for working with Spark.
Databricks aims to unify the work of all data professionals in one collaborative platform. This includes Data Engineers working on ETL and building data lakes, Data Analysts doing slice-and-dice analytics, and Data Scientists doing analytics, model building and model deployment.
Databricks can massively accelerate Spark adoption: it avoids the need to build and manage all of the infrastructure yourself, so you can concentrate on working with the data immediately. It also brings opinionated workflows and patterns, such as the Delta Lake architecture, to encourage the adoption of best practices.
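For a flavour of what Delta Lake adds, here is a hedged sketch of writing and reading a Delta table. It assumes a Databricks cluster (where Delta is built in) or a local Spark session configured with the delta-spark package; the path and columns are invented for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

# A made-up DataFrame standing in for real data.
df = spark.createDataFrame([("alice", 3), ("bob", 5)], ["user", "clicks"])

# Write it as a Delta table: Parquet files plus a transaction log,
# which is what gives Delta its ACID updates and time travel.
df.write.format("delta").mode("overwrite").save("/tmp/delta/clicks")

# Read it back like any other Spark data source.
spark.read.format("delta").load("/tmp/delta/clicks").show()
```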