Amazon Redshift Column Compression Settings

Data cleaning and preparation are among the most time-consuming parts of a data analyst's or data scientist's job. During these tasks, the analyst develops a deeper understanding of the data: what can be achieved with it in analytic terms and how it should be organized to meet those analytic goals.

In this chapter, we are going to see how to use this knowledge to optimize an Amazon Redshift cluster and improve query performance. One way to achieve such an improvement is by selecting an appropriate Column Compression Encoding, a choice largely driven by the nature of the data that the column holds.

Why is Column Compression Important?

Choosing the appropriate Column Compression Encoding is usually perceived as a choice made while designing the tables of a database. However, data changes over time, and a decision that made sense a few months ago might not be the optimal one anymore.

For this reason, we prefer to treat Column Compression Settings as part of cluster maintenance, another area where the work of a data analyst can drive the related choices more efficiently.

Amazon Redshift is a columnar database, and the compression of columns can significantly affect the performance of queries. Smaller columns require fewer disk I/O operations, which improves query performance.

How Column Compression Works

By default, Amazon Redshift stores data in its raw, uncompressed format. It is possible to define a Column Compression Encoding manually or to ask Amazon Redshift to select an encoding automatically during the execution of a COPY command.

The recommended way of applying Column Compression Encodings is to allow Amazon Redshift to do it automatically, but there are cases where manual selection might result in better-optimized columns and tables.

Automatic compression works by analyzing the data imported by the COPY command. Still, it has the following limitations:

  1. You cannot apply automatic Column Compression Encoding to a table that already contains data.
  2. It requires enough rows in the load data to select an appropriate Column Compression Encoding reliably. The recommended amount of data is at least 100,000 rows.

Automatic compression takes a sample of the data to be loaded and selects the most appropriate Column Compression Encoding for each column of that table.

In the end, it recreates the table with the selected Column Compression Encoding for each column.
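As a sketch of this workflow, a COPY into an empty table with COMPUPDATE enabled asks Redshift to choose the encodings itself (the bucket, file, and IAM role below are placeholders):

```sql
-- Load into an empty table; COMPUPDATE ON makes Redshift sample the
-- incoming rows and apply a compression encoding to each column.
copy customers
from 's3://my-bucket/customers.csv'                       -- hypothetical file
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'  -- placeholder role
csv
compupdate on;
```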

Amazon Redshift does not provide a mechanism for automatically detecting if a Compression Encoding of a column should change. For this reason, we can use the input of our data analyst to decide if the encoding should adjust and select the most appropriate one.

How to change the Compression Encoding of a column on an existing table

Currently, Amazon Redshift does not provide a mechanism to modify the Compression Encoding of a column on a table that already has data.

The preferred way of performing such a task is to follow this process:

  1. Create a new column with the desired Compression Encoding
  2. Copy the data of the initial column to the new one
  3. Delete the old column
  4. Rename the new column to the name of the old one

-- 1. Create the new column with the desired encoding (LZO here)
alter table customers add column name_new varchar encode lzo;
-- 2. Copy the data from the old column to the new one
update customers set name_new = name;
-- 3. Drop the old column
alter table customers drop column name;
-- 4. Rename the new column to the old column's name
alter table customers rename column name_new to name;
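
Afterwards, the applied encoding can be verified through the pg_table_def system view (assuming the table is in the current search_path):

```sql
-- Show each column's data type and compression encoding for the table
select "column", type, encoding
from pg_table_def
where tablename = 'customers';
```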

The time this process takes depends on the size of our table.

Selecting the right Compression Encoding for your data

As we mentioned earlier, the right Compression Encoding depends on the nature of our stored data. Amazon Redshift tries to analyze the data and select the best possible Encoding while offering a broad range of different Encodings that can cover different scenarios.

The default selections by Amazon Redshift are the following:

  • Columns defined as sort keys are assigned RAW compression, which means that they are not compressed. This is something for the analyst to consider: when we set a sort key to improve the performance of a query, we cannot also benefit from compressing that column.
  • Columns of the numeric types REAL and DOUBLE PRECISION, together with BOOLEAN columns, are also assigned RAW compression.
  • Any other column is assigned LZO compression.
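
As a sketch, explicit encodings can also be set at table-creation time with the ENCODE keyword (the table and columns below are hypothetical):

```sql
create table page_visits (
    visit_id    bigint,                        -- sort key column, left RAW
    country     varchar(64) encode bytedict,   -- small controlled vocabulary
    visited_at  timestamp encode delta32k,     -- time-series column
    http_status integer encode runlength,      -- long runs of repeated values
    url         varchar(2048) encode lzo
)
sortkey (visit_id);
```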

Amazon Redshift supports a larger set of encodings that we can select manually based on the nature of the data we are going to work with. More specifically:

If an analyst knows that a column takes its values from a small, controlled vocabulary, something that usually becomes evident during preliminary analysis, a Byte-Dictionary Encoding can be more efficient. Possible candidates for such an encoding are columns that hold data like country names, ISO codes, currency codes, or HTTP response codes.
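
A minimal sketch of such a column (the table and names are hypothetical):

```sql
-- BYTEDICT builds a dictionary of up to 256 distinct values per block;
-- a column like currency codes fits comfortably within that limit.
create table payments (
    payment_id bigint,
    currency   char(3) encode bytedict,   -- e.g. 'USD', 'EUR', 'GBP'
    amount     decimal(12,2)
);
```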

Among the most important data types an analyst works with are dates and timestamps, as some of the most valuable data comes in the form of time series. For such columns, the Delta encodings (DELTA and DELTA32K) can be more efficient, since they store the difference between consecutive values rather than the values themselves.
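
For instance, a hypothetical time-series table might look like this:

```sql
-- DELTA32K records each value as a 2-byte offset from the previous one,
-- which suits timestamps collected at regular, closely spaced intervals.
create table sensor_readings (
    sensor_id   integer,
    recorded_at timestamp encode delta32k,
    value       double precision
);
```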

LZO is automatically assigned by Amazon Redshift to the majority of the available data types, as it offers a good trade-off between performance and compression ratio. It works extremely well with the CHAR and VARCHAR data types, so for columns containing strings this encoding is a good choice.

Mostly Encoding is an interesting case, mainly used with numerical data. We may use it when the declared data type of a column is larger than what the majority of its stored values require. With this encoding, the majority of the values are compressed to a smaller standard data type, while the few larger outliers are kept in their raw form.

This encoding can be used to optimize numerical columns, and the decision to use it can be driven by the analysis an analyst performs before we load the data into the database. If, for example, you have relatively small integers and a few huge outliers, you can declare a data type that can hold the large numbers but compress the majority of the values using a smaller numerical data type.
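
A sketch of that scenario, with a hypothetical table:

```sql
-- MOSTLY8 stores values that fit in one byte in compressed form,
-- while the rare larger outliers remain in the column's full BIGINT form.
create table order_items (
    order_id bigint,
    quantity bigint encode mostly8   -- almost always a small number
);
```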

Runlength Encoding can be used to compress a column that takes values from a relatively small dictionary and whose values are at least partially ordered. This encoding works by replacing each run of repeated values with a token pointing to the actual value, together with a count of how many times the value appears consecutively.

The use of this type of encoding requires an in-depth data understanding, and the insights of a data analyst might be of great use for selecting it.
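A sketch of a column that benefits from run-length compression (the table and data are hypothetical):

```sql
-- Rows loaded in order of region produce long runs of identical values,
-- which RUNLENGTH collapses into (value, count) pairs.
create table shipments (
    shipment_id bigint,
    region      varchar(32) encode runlength,
    shipped_on  date
);
```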

Text255 and Text32K encodings, like the Runlength Encoding, exploit the higher occurrence of specific words inside a text to compress the column data. If the analyst has identified that a VARCHAR column contains the same words frequently, she can select one of these encodings.

Finally, Zstandard (ZSTD) is a generic encoding that can be used with all the available data types of Amazon Redshift, although it works best with CHAR and VARCHAR columns. If you work with text data and cannot exploit any characteristics that would lead you to one of the previous encodings, it is a good generic choice.

Things to keep in mind when choosing Column Compression Encodings

Selecting the best compression encodings for your data requires a deep understanding of both the data that you are working with and the analysis that you intend to perform. Both are subject to iteration, and as time goes on, things might change.

For this reason, it is important to go back and rethink your compression encodings whenever your analysts detect any substantial change. It may not be as frequent a process as vacuuming, but you should certainly do it, leveraging your understanding of the data to choose or update the encodings.

Useful Resources

The SQL command ANALYZE COMPRESSION performs a compression analysis on your data and returns suggestions for the compression encoding to be used.
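
For example, running it against the hypothetical customers table samples the table's data and reports a suggested encoding per column, along with an estimate of the space savings:

```sql
-- Reports, per column, the suggested compression encoding and the
-- estimated percentage reduction in size.
analyze compression customers;
```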

In this repository, you can find a tool, written in Python and maintained by Amazon, that may help you automate the process of analyzing and selecting compression encodings for your tables.

If you are serious about the performance of your Amazon Redshift Cluster and you want to get the most out of it, you must validate your hypotheses in regards to the right compression encodings of your table columns. Here you can find a good methodology for testing your assumptions.