When we load data into Snowflake, we will typically follow a two step process:
- Upload the data from your source to some staging area.
- Perform the COPY INTO operation to take data from that staging area and copy it into the Snowflake table in a relational format.
With regards to the staging area, there are a few options open to us. In this article, we will explain the choices and when to use which option.
Internal Vs External
The first decision is whether to use Internal or External staging areas.
- External staging areas are the external storage services offered by AWS, Azure and GCP. These are outside of control of Snowflake, but can be used as a staging area.
- Internal staging areas are managed directly by Snowflake.
In many cases, your data will already be stored e.g. on S3 or Azure Blob Storage, at which point, we may as well just do our load directly from those external stages.
If this isn't the case, the preference is usually to push to an Internal stage. This way, we will benefit from compression, the ability to query the data in a slightly more performant manner, and benefit from improved audit. It also means that we could potentially avoid interacting with the cloud provider and leave this to be brokered through Snowflake, keeping everything simple.
User Stage, Table Stage, Named Stage
With regards to internal stages, there are 3 different types:
- User Stage is an area available to the currently logged in user. Data stored in this stage is private to the logged in user and can not be seen by any other users.
- Table Stages are associated with a specific table. Every table has 1 and only 1 table stage and is created at table creation time.
- Named Stages are created separately and are named. They are not tied to any particular table or view.
When To Use Which?
The decision of where to stage from is sometimes quite subtle.
- User Stage: It makes sense to use this where an individual data engineer or analyst is loading their data, and has no need to share with anybody else within the organisation. This could be the case during development, whereby an individual data engineer is working on the ETL process before moving to a shared development, test, or production environment. The data could also be private and need to be locked down to an individual user scope.
- Table Stage: When data has to be shared between multiple Snowflake users, and has a one to one mapping with the table, this is the natural option to reach for when staging data. Some teams will also have a workflow whereby they develop within their user stage, then promote the change to the table stage as the code matures.
- Named Stage: Unlike the user and table stages, we can also specify additional details on the named stage which describe, for instance, the format of the data at load time instead of COPY INTO stage. Named stages are also specific database objects, so can be created, dropped, and have role based permissions applied. These stages tend to be more complex to work with, but more explicit for production pipelines.
-- Named stage with associated file format create or replace stage my_stage file_format = my_json_format;