Around six years ago, Amazon disrupted the data warehousing market by introducing a cloud-based and scalable solution for data warehousing, named Amazon Redshift. For the first time, it was possible to spin up a data warehouse in the cloud and scale it based on the volume of the company's data.
Amazon Redshift, especially coupled with S3, initiated a significant change in IT regarding data warehousing, but the solution was still missing some critical features.
It was challenging to work with non-tabular data like JSON, Avro, etc., and computation was tightly coupled with storage. If you wanted to increase your cluster size for computational reasons, you also had to add storage, because each cluster node delivered storage and processing at the same time.
What kind of database is Snowflake?
Snowflake built on the success of Amazon Redshift and brought the second wave of disruption to the data warehousing space. It introduced a true separation between storage and processing, along with support for different types of data, from well-structured tabular data to serialization formats like JSON, Avro, and Parquet. At the same time, the need for managing the cluster was removed entirely. IT has little to do to administer Snowflake; storage and processing can easily scale up and down with the company's needs, without the need for sophisticated DB management.
As a result, Snowflake is currently one of the fastest-growing and most mature data warehousing solutions. It offers a single integrated system with fully independent scaling for compute, storage, and services.
But a data warehouse, no matter how sophisticated it might be, is almost useless without data. So, how can someone fill a Snowflake instance with data? In this article, we will discuss:
- the capabilities of Snowflake when it comes to loading data into it
- and how you can establish ETL or ELT processes for it.
To ELT or not to ETL?
The traditional method of filling your data warehouse with data was ETL, or Extract, Transform, and Load. The sequence of the steps is important: someone would extract the data from a data source, e.g., a production database, transform the data based on the schema the BI team had decided on, and then load the data into the database.
Transforming the data before it was loaded into the database was essential to ensure that the data warehouse was utilized well. Applying transformations after the data was loaded was considered bad practice: it was difficult to optimize the process for the data warehouse technology, and it added a lot of overhead. At the same time, computing was expensive, and the CPU could easily become the bottleneck. Add to this that BI-related queries typically run aggregations on vast amounts of data, and transforming the data before loading made a lot of sense.
What is Snowflake ETL?
But since that era, with the introduction of technologies like the cloud and Snowflake, the requirements have changed. Data warehouses now operate under different restrictions and capabilities. The most important are:
- The separation of computation from storage makes it easy and cheap to transform the data on the data warehouse, especially as the technology can scale almost linearly with the data you have.
- The volume and complexity of the data make it more difficult and expensive to iterate on transformations before loading. Deciding to change a transformation and re-run it over all your data can cost a lot of time.
- Data today lives in many different systems. Some, like databases, can be accessed without adding much overhead to the source system, but a lot of necessary data lives behind APIs, either as data generated by a SaaS product or data exposed by a microservice. Interacting with these services is expensive in terms of time and resources, and extracting the data again and again becomes a huge bottleneck.
So, with a solution like Snowflake, ETL can be redefined: instead of extracting, transforming, and then loading your data, you can now extract and load your data and transform it on Snowflake, something that offers impressive flexibility to both your IT and analyst teams.
Loading data into Snowflake
When loading data into a Snowflake database, you first have to consider the nature of your data and what you are planning to do with it. Do you require real-time or near-real-time access to the data? Or do you want to take the traditional BI approach of building reports on big amounts of data? The good news is that with Snowflake you can have both!
Loading data into Snowflake in batches
That is the most traditional and standard way of loading data into a data warehouse. Snowflake supports several sources for loading data. These are:
- Data staged into Snowflake itself for loading; you can think of this as an S3 bucket that Snowflake exposes.
- External stages, like Amazon S3, Google Cloud Storage, and Microsoft Azure.
Whether you stage your data in Snowflake or keep your files on S3 or any other supported cloud storage provider, it's easy and transparent to bulk load your data into a Snowflake database.
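As a rough sketch, setting up an external stage and staging a local file might look like this (the stage name, bucket URL, credentials, and file path are all hypothetical placeholders):

```sql
-- Create a named external stage pointing at an S3 bucket (hypothetical URL/credentials)
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/events/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

-- Alternatively, upload a local file to a table's internal stage (run via SnowSQL):
-- PUT file:///tmp/events.csv @%events;
```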
This is achieved with the COPY command. Similar to Amazon Redshift and Google BigQuery, you execute a COPY command that takes as parameters the location of a file, staged locally or on any of the providers mentioned above, and the name of the table to load the data into.
Writing some scripts and scheduling their execution with something like cron ensures that you can easily manage repeating loading jobs. Always make sure that you structure your staged data according to the guidelines given by Snowflake.
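A minimal bulk-load command, assuming a hypothetical table named events and a stage named my_s3_stage, could look like this:

```sql
-- Bulk load staged CSV files into the target table
COPY INTO events
  FROM @my_s3_stage/2019/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = 'SKIP_FILE';  -- skip files that fail to parse instead of aborting
```

A cron-scheduled script would simply issue such a statement on a fixed interval and check the load results for errors.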
Does Snowflake run on Azure?
Snowflake recently brought its data warehouse to Microsoft Azure. It uses Azure Storage and Blobs to store raw data, and you may use Azure Data Lake for unstructured data.
Continuously load data into Snowflake
As we saw, loading bulk data into Snowflake is not a very complicated task. It is crucial, however, to make sure that when we need to load data into the data warehouse repeatedly, as is usually the case, we structure and schedule the process correctly and monitor it for errors.
The second way of loading data is using Snowpipe for continuous loading of data. This approach makes more sense for applications where access to fresh and live information is critical.
The way Snowpipe works is by observing staged files, and as soon as they are available, it automatically performs a COPY command to load the data into a Snowflake cloud data warehouse. A user provides a COPY statement as part of the Snowpipe object, which executes whenever changes to the staged files are observed.
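As a sketch (the pipe, table, and stage names are hypothetical), a Snowpipe definition wraps exactly such a COPY statement:

```sql
-- The pipe runs the embedded COPY whenever new files land in the stage
CREATE PIPE events_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO events
    FROM @my_s3_stage/
    FILE_FORMAT = (TYPE = 'JSON');
```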
Apart from serving different use cases, Snowpipe and bulk copying differ in many other ways; you should consult the Snowflake documentation before you decide to choose one or the other.
As we can see, Snowflake provides many options to load data into the data warehouse. Keep in mind, though, that loading data once is an entirely different task in terms of complexity, compared to a continuous EL process at scale, and many things can go wrong. For this reason, there’s a rich ecosystem of Snowflake technology partners that can offer such services as part of their products.
Blendo is a Snowflake Technology partner.
Transforming data in Snowflake
Snowflake has enriched its data loading capabilities with transformation features. That is a compelling option that allows you to greatly simplify your ETL process by incorporating standard transformation patterns into your COPY command.
As a first step, it is possible to query your staged data for inspection. That can give you an idea of what the data looks like and what kind of transformations you would like to perform. For example, you might want to exclude some PII-related fields from loading into the data warehouse for privacy reasons.
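For instance, assuming a staged, compressed CSV file and an existing named file format (both hypothetical), you might peek at the first rows like this:

```sql
-- Query staged data directly; $1, $2, ... refer to the file's positional columns
SELECT t.$1, t.$2
FROM @my_s3_stage/users.csv.gz (FILE_FORMAT => 'my_csv_format') t
LIMIT 10;
```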
Snowflake allows you to perform tasks like:
- Filtering fields/columns out during COPY
- Cast fields to change their data types
- Reorder your columns
- Truncate data
Your COPY command for a bulk or Snowpipe loading process can contain a SELECT part that transforms the data based on your requirements.
This is possible because of the clean separation between storage and processing in Snowflake.
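Putting these together, a COPY with a transforming SELECT might look like the following sketch (the table, stage, and column positions are hypothetical; the third file column, assumed here to hold PII, is simply never selected):

```sql
-- Filter, cast, and reorder columns during the load itself
COPY INTO users (id, signup_date, country)
  FROM (
    SELECT t.$1::INTEGER,   -- cast the first file column to a number
           t.$4::DATE,      -- reorder: the fourth file column becomes the second table column
           t.$2              -- $3 (e.g. an email address) is never loaded
    FROM @my_s3_stage/users/ t
  )
  FILE_FORMAT = (TYPE = 'CSV');
```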
We saw that it is possible to perform some core transformation tasks while loading your data into Snowflake. Although this is a powerful mechanism, you might still need to transform your data further to make it ready for consumption by your analysts. For example, you might want to create fact and dimension tables as part of a star schema.
That can be done quickly and at scale using Snowflake's support for views. Both regular and materialized views are supported, so it's possible to achieve high performance with your modeling. Additionally, the separation of storage and processing ensures that modeling your data on Snowflake will be efficient while keeping the maximum amount of information available in the data warehouse.
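A hypothetical dimension built on top of a raw loaded table could look like this (table and column names are illustrative; materialized views additionally require a Snowflake edition that supports them):

```sql
-- A customer dimension modeled as a materialized view over the raw data
CREATE MATERIALIZED VIEW dim_customers AS
SELECT customer_id,
       country,
       MIN(created_at) AS first_seen
FROM raw_events
GROUP BY customer_id, country;
```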
Extracting data and final thoughts
So far, we have seen that Snowflake is a very feature-rich product that offers many different options for loading and transforming your data. We deliberately didn't discuss how to achieve data extraction, as this is not the job of Snowflake or any other data warehouse solution.
At the same time, it’s essential to consider that when you connect all the dots – from extracting the data to staging, loading, and modeling the data – many things can go wrong. Building a robust infrastructure, even based on the advanced features of Snowflake, is a significant engineering challenge.
For this reason, Snowflake has created an amazing ecosystem of technology partners that offer end-to-end solutions for data analytics with Snowflake technology at their core. It's worth checking what these partners have to offer, either to substitute your in-house solution or to complement it.
Simplifying Snowflake ELT with Blendo
Using third-party Snowflake ETL tools, like Blendo, can make the process of loading data into Snowflake completely hassle-free.
Blendo is a modern Snowflake ETL tool that makes importing analytics-ready data into Snowflake easy and straightforward. In fact, with Blendo, it is a simple 3-step process without any underlying considerations:
- Connect the Snowflake cloud data warehouse as a destination.
- Add a data source.
- Blendo will automatically import all the data and load it into the Snowflake data warehouse.
With Blendo, you can integrate any data source within minutes, without having to write any code. It is designed to offer self-serve, analytics-ready data in modern data warehouses. Don't hesitate to contact us and try Blendo for free.