A Redshift ETL Guide (For Everyone)

Giorgos Psistakis

Bonus Material: FREE Amazon Redshift Guide for Data Analysts PDF

Businesses today have access to humongous volumes of valuable, high-quality consumer information like never before. Thanks to data analytics, companies are no longer fumbling in the dark. Instead, their decisions are well-informed and based on valuable consumer insights.

Businesses can get to know what people are talking about them and which products or services are doing well in the market. They can analyze what seems to be working for their competitors and make informed decisions accordingly.

From launching a product or service to marketing it to the target audience, from improving potential customers’ engagement to enabling conversions, businesses are relying heavily on the individual as well as mass customer information that’s continuously available at their disposal and is expanding every second.

In this post, we will see three ways you can get your business data into a robust data warehouse for performant analytics.

Handling Data (ETL)

ETL (Extract, Transform, Load) emerged as an excellent solution for providing businesses with clean, analytics-ready data from multiple sources containing both structured and unstructured data.

You extract raw data from a source, refine and transform it into a structured format, and finally, load it in the target data warehouse where you can manage and analyze for driving actionable insights. Read more here.

What is Amazon Redshift?

Amazon Redshift is a fully managed data warehouse service in the cloud that allows storing as little as a few hundred gigabytes to as much as a petabyte of data and even more. Redshift enables fast query performance for data analytics on pretty much any size of data sets due to Massively Parallel Processing (MPP).

Click here to get our FREE 90+ page PDF Amazon Redshift Guide!

In other words, MPP databases are usually more flexible, scalable, and cost-effective than the traditional systems, and this adds up to Redshift’s performance for your data analytics. Read more here.

How to do ETL in Amazon Redshift

Below we will see the ways, you may leverage ETL tools or what you need to build an ETL process alone.

The manual way of Redshift ETL

All the best practices below are essential for an efficient Redshift ETL pipeline, and they need a considerable manual and technical effort. TLDR; It’s difficult so if you are clicking a faster, flexible way to ETL, scroll down by clicking here now.

Amazon suggests keeping in mind the Amazon Redshift’s architecture when designing an ETL pipeline in order not to lead to scalability and performance issues later.

Here is how Amazon Redshift ETL should be done:

1. Use COPY

Amazon Redshift is an MPP database, where each compute node is further divided into slices. To ensure each slice receives an equal amount of burden, split data into multiple, similar sized files instead of a single large or an unevenly divided file.

Also, then, use a single COPY command to load data from multiple files into a single table.

2. The WLM Method

Use Amazon Redshift’s WLM (workload management) for defining a dedicated queue for the ETL process. Configuring the ETL queue with a small number of slots will help in avoiding excessive COMMITs. Also, avoid COMMITing separately for each transaction since commits are expensive.

Instead, surround multiple steps of the ETL process by a BEGIN…END statement. You can perform COMMIT only after all transformation logic is executed.

3. Tables in Amazon Redshift receive new records using the COPY command and remove useless data using the DELETE command. However, COPY adds records to an unsorted region in the table and DELETE marks the records for deletion which does not free the occupied space.

Click here to get our 90+ page PDF Amazon Redshift Guide and read about performance, tools and more!

That could lead to performance issues as unsorted regions and space occupied by deleted records accumulate over time. After completing an ETL process, run the VACUUM command to re-sort tables and remove deleted records.

Use ANALYZE to keep database statistics updated.

4. Use Amazon S3 to stage data after retrieving from a data source before COPYing to the target table. Use a manifest file to retrieve large data volumes from multiple files. Perform transformations on data in temporary staging tables which you can drop after the ETL process is completed.

To transfer data from staging tables to the target table, use ALTER and APPEND, which is faster than CREATE TABLE AS or INSERT INTO operations.

5. For extracting a large number of rows, use UNLOAD to directly extract records to S3 instead of using the SELECT operation which can slow down the cluster leader node.

6. Keep an eye on the performance of the ETL process through Amazon’s monitoring scripts, and resolve issues before they start impacting your Redshift cluster.

You may want to read our more detailed guide on how to Load Data Into Amazon Redshift. or best practices for ETL here.

The not so manual way to do ETL to Redshift

If building and managing a Redshift ETL pipeline on your own seems too cumbersome, you can also utilize either Amazon’s ETL service called Glue, or you can leverage a third-party AWS Redshift ETL tools.

With Glue, you can enjoy automatic data model and schema discovery, and ETL scripts generation. It’s expensive, but a great option if your entire data is on Amazon; otherwise, you’re better off considering other options, such as Blendo – a Modern ETL tool for Redshift – that provides all the perks of data pipeline management while supporting several external data sources as well.

An alternative to Amazon Redshift ETL tools

Undoubtedly, ETL was the best way forward for data warehouses of the past that could not handle the immense volume and complexity of raw data. However, technological advances and the rise of cloud have made virtually unlimited storage and computational power a reality.

ELT allows businesses to leverage the cheap, limitless storage of modern data warehouses, and infinite scalability and performance of the cloud. Considering the columnar data storage and massively parallel processing of Redshift, it only makes sense to present decision makers with raw data in all its entirety.

That can allow them to decide themselves what they need instead of IT and marketing technology experts choosing on their behalf. Essentially you can shift the compute-intensive transformations to Redshift, which can carry out multiple transformations in parallel.

Read our ETL vs ELT guide.

The Best of Both Worlds with Blendo

Blendo, an AWS Redshift Partner, can take the burden of setting up and managing a Redshift ETL pipeline off your shoulder.

Although primarily an ELT platform, it performs some transformation ahead of loading to create analytics-ready tables in your data warehouse. However, you can leverage the flexibility of ELT by accessing all the data you need right away and carry out transformations as required by leveraging the MPP of Amazon Redshift.

Unlike Glue which is meant for Amazon’s infrastructure only, Blendo can connect multiple sales, marketing, or financial platforms to Amazon Redshift. In fact, should you need to add a new data source, a few clicks is all it takes.

Connect Redshift, connect a data source, and let Blendo load all the data for your access in a matter of minutes; it’s as simple as that. Using a tool that comes pre-integrated with dozens of data sources means you can invest business resources and time in understanding the data instead of setting up Redshift ETL on your own.

Blendo is a modern ETL platform (ELT) designed to offer self-serve analytics-ready data in modern data warehouses. Don’t hesitate to contact us and try Blendo for free.