Snowflake Architecture

The Snowflake architecture is an evolution of the traditional data warehouse, modernised to take advantage of the cloud.

In this article, we introduce the main features of this architecture and explain how it leads to the desirable properties of Snowflake, such as its performance, cost profile and security.

Software As A Service

Snowflake is a fully managed Software As A Service platform.  This means that there is nothing to deploy and manage on your own servers or in your own cloud accounts. This is in contrast to a platform such as Databricks, which has traditionally managed the "control plane" while executing the processing and storing the data in your own cloud account.

The upside to this is, of course, having essentially no technology to deploy and manage.  However, some organisations will naturally be concerned about handing their critical business data over entirely to a third party.

Under the hood, Snowflake runs on the infrastructure of one of the major cloud providers: AWS, Azure or GCP.  Though this is somewhat abstracted away from the Snowflake user or administrator, some appreciation of those environments is useful.

Internal Architecture

The Snowflake architecture is best thought of as three tiers: storage at the bottom, compute in the middle and cloud services at the top.

At the bottom level, we have the storage tier.  This is provided by the cloud provider, through a service such as AWS S3 or Azure Blob Storage.  As a Snowflake user or administrator, you won't interact with these services directly, though it's definitely worth understanding how they work and the reliability guarantees they provide.  As a Snowflake administrator, you may also be loading data from, or extracting data to, these cloud services.

At the second level, we have a cluster of compute nodes which is responsible for executing the queries, data manipulations and other processing.  This cluster is referred to as a Virtual Warehouse, and will typically serve one user or group of users.  It is common to create multiple Virtual Warehouses, each of which may be long lived, or be created for a short period of time to execute a particular task.
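To make the idea of several warehouses with different lifetimes more concrete, here is a minimal sketch in Python that renders a CREATE WAREHOUSE statement from a small specification. The `WarehouseSpec` class, the `create_warehouse_sql` helper and the warehouse name in the test are hypothetical names invented for this illustration, and the list of sizes is deliberately incomplete. The sketch only builds the SQL text; it does not connect to Snowflake.

```python
from dataclasses import dataclass

# Sizes accepted by this sketch; Snowflake offers larger sizes as well.
VALID_SIZES = ("XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE")


@dataclass(frozen=True)
class WarehouseSpec:
    """Hypothetical description of one Virtual Warehouse."""
    name: str
    size: str = "XSMALL"
    auto_suspend_seconds: int = 60  # idle time before the warehouse suspends


def create_warehouse_sql(spec: WarehouseSpec) -> str:
    """Render a CREATE WAREHOUSE statement for the given spec."""
    if spec.size not in VALID_SIZES:
        raise ValueError(f"unsupported size: {spec.size}")
    if not spec.name.isidentifier():
        raise ValueError(f"invalid warehouse name: {spec.name}")
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {spec.name} "
        f"WITH WAREHOUSE_SIZE = '{spec.size}' "
        f"AUTO_SUSPEND = {spec.auto_suspend_seconds} "
        "AUTO_RESUME = TRUE "
        "INITIALLY_SUSPENDED = TRUE"
    )
```

A long-lived reporting warehouse and a short-lived one for a nightly load would simply be two specs with different names and sizes. The auto-suspend and auto-resume settings mean neither has to run while idle.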

At the top tier, we have a set of cloud services which manage things such as authentication, security, access control and metadata.  These services are internal to Snowflake, but again it's worth understanding what they are responsible for, as occasionally they will touch on your day-to-day work.

Separation Of Storage And Compute

Separation of Storage and Compute is one of the most commonly cited benefits of Snowflake.  

In a traditional database such as Oracle or MySQL, the two were historically tied together.  If we wanted more storage, we would add another node which stored data locally, and we would usually be required to buy extra licenses from the vendor. Things were arranged this way due to the low latency requirements of transactional systems.

Over the years, a few things have changed with regard to this picture.  Firstly, compute and network performance have improved to the extent that we can now realistically store data across the network from the compute nodes.  Secondly, the data warehousing workloads that Snowflake targets can tolerate the somewhat higher latency of reading over the network.  Together, these make decoupling storage from compute viable.

What this separation means is that the storage and the compute can scale independently. For instance, we could have an extremely large dataset which takes up petabytes of storage, and a very small virtual warehouse.  On the flip side, we could have a very small dataset and a very large processing tier to meet the needs of our business. This processing tier could then, of course, grow and shrink multiple times throughout the day, independently of the storage.  This allows businesses to rightsize their databases, and to save significant costs in doing so.
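The independent scaling described above can be sketched as a simple cost model with one term for storage and one for compute. The `monthly_cost` function and both default rates are assumptions made up for this illustration and are not Snowflake's actual pricing. The point is only that changing one term leaves the other untouched.

```python
def monthly_cost(
    storage_tb: float,
    warehouse_hours: float,
    credits_per_hour: float,
    price_per_tb: float = 23.0,     # assumed rate, for illustration only
    price_per_credit: float = 3.0,  # assumed rate, for illustration only
) -> float:
    """Storage and compute are separate terms, so each scales on its own."""
    storage_cost = storage_tb * price_per_tb
    compute_cost = warehouse_hours * credits_per_hour * price_per_credit
    return storage_cost + compute_cost


# Petabyte-scale data with a small warehouse running 50 hours a month.
big_data_small_compute = monthly_cost(1000, 50, credits_per_hour=1)

# One terabyte of data with a large warehouse running 200 hours a month.
small_data_big_compute = monthly_cost(1, 200, credits_per_hour=16)
```

Under these assumed rates the first scenario is dominated by storage and the second by compute. In a coupled system, both would have been forced to grow together.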

Shared Disk & Shared Nothing

This arrangement of separating data out into its own tier, which is then shared by multiple warehouses, is referred to as a "Shared Disk" architecture. At the same time, each virtual warehouse is an independent compute cluster which shares no compute resources with the others, and this side of the design is sometimes referred to as a "Shared Nothing" architecture.  This hybrid of shared disk and shared nothing is really the secret sauce of Snowflake, and it underpins its rapid adoption.
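As a rough illustration of this hybrid, the toy model below has one shared storage object that every warehouse reads from, while each warehouse keeps its own nodes and its own cache. `SharedStorage`, `ToyWarehouse` and the table name are hypothetical names for this sketch, and nothing here reflects how Snowflake is actually implemented internally.

```python
from typing import Dict, List


class SharedStorage:
    """Toy stand-in for the central storage tier that every warehouse can read."""

    def __init__(self) -> None:
        self._tables: Dict[str, List[int]] = {}

    def write(self, table: str, values: List[int]) -> None:
        self._tables[table] = list(values)

    def read(self, table: str) -> List[int]:
        return list(self._tables[table])


class ToyWarehouse:
    """Toy compute cluster: its nodes and cache are private to this warehouse."""

    def __init__(self, name: str, storage: SharedStorage, nodes: int) -> None:
        if nodes < 1:
            raise ValueError("a warehouse needs at least one node")
        self.name = name
        self.storage = storage
        self.nodes = nodes
        self.cache: Dict[str, List[int]] = {}  # not shared with other warehouses

    def total(self, table: str) -> int:
        """Sum a table by giving each node its own slice of the data."""
        if table not in self.cache:
            self.cache[table] = self.storage.read(table)
        values = self.cache[table]
        slices = [values[i::self.nodes] for i in range(self.nodes)]
        return sum(sum(s) for s in slices)


storage = SharedStorage()
storage.write("sales", [10, 20, 30, 40])

etl = ToyWarehouse("etl", storage, nodes=4)
bi = ToyWarehouse("bi", storage, nodes=2)
etl_total = etl.total("sales")  # only the etl warehouse has cached the table
```

Both warehouses can see the same `sales` data because the storage is shared, yet running a query on `etl` leaves the cache on `bi` empty because the compute side shares nothing.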