Table Distribution Styles

Bonus Material: FREE Amazon Redshift Guide for Data Analysts PDF

Amazon Redshift is a distributed relational database system capable of performing queries efficiently over petabytes of data. That is achieved by the combination of highly parallel processing, columnar design and targeted data compression encoding schemes.

In this section, we see how tables can be optimized to leverage the highly parallel nature of Amazon Redshift by defining Distribution Keys and selecting the appropriate Table Distribution Style.

Parallelization on Amazon Redshift

Distribution and parallelization on Amazon Redshift happen on two levels. What follows is a brief description of the high-level architecture of Amazon Redshift. For more information check this Section and also the Amazon documentation.

Amazon Redshift consists of a number of nodes where each one is an actual computing node on Amazon AWS, with dedicated resources like memory, CPU and disk space. These nodes create an Amazon Redshift Cluster over high-speed network connections.

One node is denoted the Leader and the rest act as compute nodes. The leader is responsible for distributing the data across the different compute nodes to achieve the desired performance.

Each compute node has its disk storage divided into slices. Although the number of slices varies depending on the type and size of a node, there are at least two of them on each compute node. Data distributed on slices is queries in parallel on each compute node.

When a table is created, and data is loaded into it, Redshift distributes its data across the different compute nodes and slices of a cluster. This distribution of data happens according to a Distribution Plan that can be selected by the user. When we perform a query, the Query Optimizer figures out if data has to be moved from one node or slice to another for the query to execute successfully.

This redistribution of data across the cluster might account for the poor performance of your queries if the distribution plan that was selected distributed the data in a way that is not optimal for the query execution.

This redistribution of data across the cluster might account for the poor performance of your queries if the distribution plan that was selected distributed the data in a way that is not optimal for the query execution.

Keep in mind that redistribution might include the relocation of a small number of rows, up to the copying of entire tables across all the nodes.

When you decide which distribution style to use for each table, keep the following in your mind:

  • When data is distributed unevenly across your computing nodes, some of them have to work more than the other. That is not desirable; you do not want underutilized compute nodes. So you should aim for an as much as a uniform distribution of your data.
  • Think of the analysis and the queries you plan to execute. A data analyst can help tremendously in the design of the database here. If tables have to move around your cluster, then you should probably reconsider your distribution style.
  • Distribution is per table. So you can select a different distribution style for each of the tables you are going to have in your database.

Click here to get our FREE 90+ page PDF Amazon Redshift Guide!

Distribution Styles

Amazon Redshift supports three distinct table distribution styles.

  • Even Distribution. This is the default distribution style of a table.
  • Key Distribution.
  • All Distribution.

Even Distribution

In Even Distribution the Leader node distributes the data of a table evenly across all slices, using a round robin approach.

Key Distribution

In Key Distribution, data is distributed across slices by the Leader node matching the values of a specific column. So all rows with the same value in the selected column will end up at the same slice.

All Distribution

With All Distribution, Leader node distributes the table to all the computing nodes. A copy of the table is maintained on all nodes which means that the storage needed for the table is multiplied by the number of nodes. As such, each operation of this particular table takes longer as it has to repeat on all nodes.

To view the distribution style of a table, you need to query the PG_CLASS system catalog table. The RELDISTSTYLE column contains the information of the table distribution style.

  1. 0 for EVEN distribution style
  2. 1 for KEY distribution style
  3. 8 for ALL distribution style

Use the following query to find the distribution style of your tables.

select relname, reldiststyle from pg_class
where relname like 'YOUR_TABLE_NAME;

A table’s Distribution Style cannot change after its creation. If you decide at some point that a different style is preferable, you should perform a DEEP COPY of the entire table.

A table’s Distribution Style cannot change after its creation. If you decide at some point that a different style is preferable, you should perform a DEEP COPY of the entire table.

Choosing Distribution Styles

Choosing the right Distribution Style for each of your tables might be a tricky process. At this point, we discuss some general rules that might help you to decide which distribution style to choose for your tables.

Before you start thinking about the table distribution style, you can decide upon the primary and foreign keys. Even if Redshift does not enforce these constraints, it helps the query planner to optimize the queries, and also it helps you understand better what queries you are going to use.

This knowledge can also help you choose the right distribution style. Of course, as the selection of a distribution style is highly sensitive to the queries that will run on the cluster, the input of a data analyst is valuable, let’s see why.

If a table does not participate in any JOIN, it is highly denormalized, and in general, you cannot think of a specific distribution style for it, chose the EVEN style.

If you have small tables, that do not often change, for example, authority tables, then you should choose ALL style. Think for example a table with country names and country codes; you might be joining this table to get the country name out of the code. Such tables are best to copy over the whole cluster.

Click here to get our 90+ page PDF Amazon Redshift Guide and read about performance, tools and more!

If a table participates in JOINS, it might be beneficial to select a KEY distribution. Depending on how the table participates in the JOIN, you should distribute it either on the primary or the foreign key of the table. Keep in mind that joins happens in pairs, and so you should also consider the other joining table and its distribution style.

In general, you should strive to select a distribution style that evenly distributes the data across all worker nodes. An EVEN style is not always an option as JOINS are involved, but your data analyst is your best friend here. She should have a good idea beforehand on the nature of the data that each candidate for a Distribution Key.

You are looking for columns that contribute to a uniform distribution, so columns with “heavy hitters” are to be avoided.

Using the Query Planner for Optimizing the Distribution Style

The above rules of thumb might be helpful when you work with a small database and simple queries, but as the queries get more complex and there are multiple JOINS involved, it becomes more and more difficult to figure out the right distribution style based on the queries performed.

In this case, the PLAN of a query is reported by using the EXPLAIN command.

EXPLAIN command

When you execute a query together with the EXPLAIN command, you get back a plan on how the query is going to be executed. Inside this plan, there’s useful information that can be used to decide if there’s space for improvement by selecting different distribution styles. Each step of the plan comes with a label that can help identifying issues related to the distribution style.

DS_DIST_NONE  and DS_DIST_ALL_NONE labels indicate that no redistribution is required for the execution of the step. This is good as no re-evaluation of the distribution style is suggested.

DS_DIST_INNER label is an indication that the redistribution cost might be high as the as the inner table has to be redistributed.

DS_DIST_ALL_INNER indicates that you should probably use a distribution key or even style for the outer table of a join.

DS_BCAST_INNER and DS_DIST_BOTH are not good and indicate that you should consider changing the distribution key of the outer table or use an ALL style for the inner table.

Useful Resources

Amazon provides an excellent plan on how to test your Distribution Styles using samples of your actual data.

It also hosts an extensive documentation reference for Distribution Styles.

load data into any data warehouse - Blendo