Understanding and Selecting Sort Keys

“Any inaccuracies in this index may be explained by the fact that it has been sorted with the help of a computer”

Donald Knuth, Sorting and Searching (Addison-Wesley, 1973)

The above is a small in-joke you may find in the index of Donald Knuth's famous work (these jokes were replaced in later publications). Jokes aside, sorting is necessary, as it helps us find the information we are looking for quickly. Machines are excellent at sorting data, but it is the responsibility of a human to select what we should sort and for what reason.

In this section, we discuss why sorting is necessary on Amazon Redshift, how it affects query performance and how an analyst can leverage Sort Keys to optimize the performance of an Amazon Redshift cluster.

What are Sort Keys?

When you create a table, you can optionally define one or more columns as sort keys. As data is loaded into the table, Amazon Redshift uses these columns to store the rows on disk in sorted order. During this process some metadata is also generated; for example, the minimum and maximum values of each block are stored and can be read directly without scanning the data every time a query executes.

This metadata is passed to the query planner, which uses it to generate more performant execution plans. For instance, when a query filters on a range of the sort column, blocks whose minimum and maximum values fall outside that range can be skipped.
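As a minimal sketch of the idea, the statements below define a single-column sort key and then look at the per-block minimum and maximum values that Redshift keeps. The table page_views and its columns are hypothetical names chosen for illustration; STV_BLOCKLIST and STV_TBL_PERM are Redshift system tables that are typically visible only to superusers.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE page_views (
    view_time TIMESTAMP     NOT NULL,
    user_id   BIGINT        NOT NULL,
    url       VARCHAR(2048)
)
SORTKEY (view_time);

-- A range filter on the sort column can be compared against each block's
-- stored min and max values, so blocks outside the range need not be read.
SELECT COUNT(*)
FROM page_views
WHERE view_time >= '2023-01-01'
  AND view_time <  '2023-02-01';

-- Inspect the block metadata for the first column (col = 0) of the table.
-- minvalue and maxvalue are stored as integers, not as the original type.
SELECT b.slice, b.blocknum, b.num_values, b.minvalue, b.maxvalue
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id AND b.slice = p.slice
WHERE TRIM(p.name) = 'page_views'
  AND b.col = 0
ORDER BY b.slice, b.blocknum;
```

On a well-sorted table, the min and max ranges of consecutive blocks within a slice should overlap very little, which is what makes skipping blocks effective.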

Based on the above, it becomes obvious that Sort Keys are an important performance tuning parameter of our tables:

  • They can improve query performance, and
  • Tuning them depends heavily on the queries we plan to execute, so the analyst's understanding of the planned analysis is important in finding the most efficient Sort Keys.

Sort Key Types

Amazon Redshift supports two types of Sort Keys: Compound Sort Keys and Interleaved Sort Keys. Selecting the right kind requires knowledge of the queries that you plan to execute.

Compound Sort Keys

Compound Sort Keys are made up of all the columns that are listed in the sort key definition during the creation of the table, in the order that they are listed. The order is important, as the performance benefit decreases when queries depend only on the secondary sort columns, without referencing the primary ones.

When you define a Compound Sort Key, make sure to put the column most frequently used in your queries first in the list.

Compound Sort Keys work best when a query's conditions, such as filters and joins, use a prefix of the sort key columns. They can also improve the performance of queries with the following operations:

  • Joins
  • GROUP BY
  • ORDER BY
  • Window functions that use
    • PARTITION BY
    • ORDER BY
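To illustrate what a prefix of the sort key means in practice, here is a sketch that assumes a hypothetical sales table; the table, its columns, and the filter values are illustrative and not taken from any real schema.

```sql
-- Hypothetical table; sale_date is assumed to be the dominant filter column.
CREATE TABLE sales (
    sale_date   DATE          NOT NULL,
    region      VARCHAR(32)   NOT NULL,
    customer_id BIGINT        NOT NULL,
    amount      DECIMAL(12,2)
)
COMPOUND SORTKEY (sale_date, region);

-- Uses a prefix of the sort key (sale_date, then region):
-- this is the access pattern a Compound Sort Key favors.
SELECT region, SUM(amount) AS total_amount
FROM sales
WHERE sale_date BETWEEN '2023-01-01' AND '2023-01-31'
  AND region = 'EMEA'
GROUP BY region;

-- Filters only on the secondary column: the sort order helps much less,
-- because rows for one region are spread across all sale_date values.
SELECT SUM(amount) AS total_amount
FROM sales
WHERE region = 'EMEA';
```

If most of your queries resemble the second one, that is a hint that either the column order should change or a different Sort Key Type is worth testing.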

Interleaved Sort Keys

In contrast to Compound Sort Keys, Interleaved Sort Keys give equal weight to each of the columns included in the sort key. If there is no dominant column in your queries, then you might get improved query performance by creating an Interleaved Sort Key.

Notably, in the case where a query uses restrictive predicates on secondary sort columns, Interleaved Sort Keys might significantly improve query performance.

As an example of a restrictive predicate, consider a WHERE clause that filters your data using an equality operator, e.g.:

SELECT email FROM users WHERE name = 'John';
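A table serving the query above could be defined along the following lines. This is only a sketch: the column list of users and the choice of name and country as sort key columns are assumptions made for illustration.

```sql
-- Hypothetical definition of the users table from the example above.
CREATE TABLE users (
    user_id BIGINT       NOT NULL,
    name    VARCHAR(128),
    email   VARCHAR(256),
    country CHAR(2)
)
INTERLEAVED SORTKEY (name, country);

-- Both queries apply an equality predicate to a single sort key column.
-- With an Interleaved Sort Key neither column is favored over the other.
SELECT email FROM users WHERE name = 'John';
SELECT email FROM users WHERE country = 'GR';
```

With a Compound Sort Key on the same two columns, the second query would filter only on a secondary column and would be expected to benefit far less from the sort order.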

Interleaved Sort Keys are more efficient with large tables. To find out if a table is a good candidate for using them, you can query the STV_BLOCKLIST system table. Good candidates are tables with a high number of 1 MB blocks per slice, ideally distributed over all slices.
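One way to run this check is sketched below. It counts the blocks each table occupies on each slice by joining STV_BLOCKLIST to STV_TBL_PERM to obtain table names; both are Redshift system tables that usually require superuser access, and what counts as a high number of blocks is left to your judgment.

```sql
-- Number of 1 MB blocks per table and slice, largest first.
SELECT TRIM(p.name) AS table_name,
       b.slice,
       COUNT(*)     AS blocks_1mb
FROM stv_blocklist b
JOIN stv_tbl_perm p
  ON b.tbl = p.id AND b.slice = p.slice
GROUP BY 1, 2
ORDER BY blocks_1mb DESC, table_name, b.slice;
```

Tables that appear near the top with similar counts on every slice are both large and evenly distributed, which matches the profile described above.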

Another case where you might want to consider an Interleaved Sort Key is when you plan to sort on only one column. In this situation, it might give better query performance than a Compound Sort Key if the column values have a long common prefix. For example, URLs commonly begin with http://www, and a Compound Sort Key uses only a limited number of characters from the prefix, so many keys end up identical. Again, the nature of the data is important here, so the knowledge of a data analyst might help to figure out which Sort Key Type makes sense.

How to select Sort Key Types

To summarize all the above information, when you choose Sort Keys for your tables, keep the following in mind:

  1. Do you plan to use more than one column as Sort Keys?
    • NO. Then Interleaved Sort Keys might work better.
    • YES. Is there a dominant column appearing in your queries?
      • NO. Then again you should consider Interleaved Sort Keys.
      • YES. Then use Compound Sort Keys and make sure that the dominant column is first in the column list.
  2. Do you have highly selective restrictive predicates in your queries?
    • YES. Consider Interleaved Sort Keys.
    • NO. If your queries include joins, GROUP BY, ORDER BY, or window functions with PARTITION BY or ORDER BY, consider Compound Sort Keys.
  3. Do you work with large tables (make sure to check table statistics)?
    • YES. Consider using Interleaved Sort Keys.
    • NO. Compound Sort Keys might work better.

Finally, it is important to know that as you load more data into your sorted tables, performance deteriorates over time, because newly loaded rows are not merged into the existing sort order right away. To fix this problem, you need to VACUUM your tables.
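As a rough sketch of how this might look, the query below uses the SVV_TABLE_INFO system view to list tables with a noticeable unsorted portion, followed by a vacuum of one of them. The 10 percent threshold and the table name sales are arbitrary assumptions for illustration.

```sql
-- Tables whose unsorted portion exceeds an (arbitrary) 10 percent threshold.
SELECT "table", tbl_rows, unsorted, sortkey1
FROM svv_table_info
WHERE unsorted > 10
ORDER BY unsorted DESC;

-- Re-sort rows and reclaim space for a hypothetical table named sales.
-- VACUUM cannot be run inside a transaction block.
VACUUM FULL sales;

-- Refresh planner statistics afterwards.
ANALYZE sales;
```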

When you select your Sort Keys, you need to understand that

  • Vacuuming your tables is unavoidable, and you have to consider the performance hit it has on your operations.
  • The Sort Key Type affects the performance of the vacuuming process.

In general, Interleaved Sort Keys are more sensitive to vacuuming, and it usually takes longer to vacuum tables that have this kind of sort key defined. As this type is preferred for large tables, the result might be long vacuuming times.

So, plan accordingly: make sure you have a good understanding of your data as an analyst, and consult the SVV_INTERLEAVED_COLUMNS system view for vital statistics on your tables that help you figure out the best possible vacuuming strategy.
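A sketch of such a check follows. It reads the skew of each interleaved sort key column and the time of the last reindex, then reindexes one table. The table name users is a hypothetical example; an interleaved_skew of 1.00 means no skew, and larger values suggest that a reindex may be worthwhile.

```sql
-- Skew per interleaved sort key column, most skewed first.
-- DISTINCT is used because STV_TBL_PERM holds one row per table and slice.
SELECT DISTINCT
       TRIM(p.name) AS table_name,
       c.col,
       c.interleaved_skew,
       c.last_reindex
FROM svv_interleaved_columns c
JOIN stv_tbl_perm p
  ON c.tbl = p.id
WHERE c.interleaved_skew IS NOT NULL
ORDER BY c.interleaved_skew DESC;

-- Re-analyze the distribution of the sort key values, then run a full vacuum.
-- This is typically the slowest vacuum variant, so schedule it with care.
VACUUM REINDEX users;
```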

Useful Resources

As always, it helps to start with the documentation that is provided by Amazon.

It also always helps to run tests with your actual data before you settle on a scheme of Sort Keys. Here you can find an example of how to do this.

Sort Keys are just one key ingredient of performance tuning on Amazon Redshift. Make sure you read about Distribution Keys.

Also, you should periodically re-assess the validity of your choices regarding Sort Keys. Check the section on Maintenance about Monitoring Query Performance to see what else you should be taking care of on a regular basis.
