Why to Vacuum Amazon Redshift?
Most people who have worked with a Windows PC, especially in the past where SSD disks were not that common, are aware of a utility tool called, Disk Defragmenter.
Microsoft included Disk Defragmenter with all their OS distribution. Now, you might wonder why we mention a Windows OS utility tool in a guide about Amazon Redshift, but there’s a similar process related to Amazon Redshift. That process is Vacuuming, and it also happens to be quite important for the health and maintenance of our Redshift cluster.
The Vacuuming process, is quite important for the health and maintenance of your AWS Redshift cluster.
The purpose of the Disk Defragmenter was to rearrange the files on the disk to increase access speed. The Vacuuming process of tables on an Amazon Redshift cluster is of course not about rearranging files, but instead about reclaiming freed space under some specific conditions.
However, as with Disk Defragmenter, the Vacuum command process is an important process which also affects the performance of our queries, among others.
Vacuum is an important process which also affects the performance of our queries, among others
Why does Vacuuming exist on Amazon Redshift?
Amazon Redshift does not reclaim free space automatically. Such available space is created whenever you delete or update rows on a table. This process is a design choice inherited from PostgreSQL and a routine maintenance process which we need to follow for our tables if we want to maximize the utilization of our Amazon Redshift cluster.
So by running a Vacuum command on one of our tables, we reclaim any free space that is the result of delete and update operations. At the same time, the data of the table get sorted.
This way, we end up with a compact and sorted table, which are useful for the performance of our cluster.
If you wonder why the update operations are also included together with deletes, this happens because behind the scenes an UPDATE command is the combination of a DELETE command, where the old row is first deleted, and then an INSERT command where the new row is inserted.
During a DELETE command, a row is marked as deleted but not removed. Additionally, the query processor has to scan all the rows, including those marked as deleted. So it is easy to understand that keeping deleted rows on a table costs additional process and thus slow down your queries.
In extreme situations, you might event end up with queries which may time-out due to the extra overhead the deleted but not reclaimed space might add.
How often should we run the Vacuum command in our Amazon Redshift tables?
The most common advice when it comes to vacuuming process is, doing as often as you can.
However, as in most cases, you need to consider a few factors to come up with the ideal vacuuming plan for your situation. Figuring out the vacuuming frequency might sound like a task for a DBA, but it can be significantly improved if the data analyst or the data scientist also contributes to it.
There are two reasons for this:
- The data analyst is the person who knows best the importance of the queries planned for execution. Thus it is possible to rank their execution differently based on importance.
- The data analyst is aware of the structure, queries and the data involved and thus can help in determining the vacuuming type needed based on each query.
So, determining an efficient vacuuming plan requires knowledge related to the above two points.
Still, you need to consider the following factors too:
- Vacuuming process should happen during periods of inactivity or at least minimal activity on your cluster. So query planning is again essential here.
- The longer the time between consecutive vacuum commands for a table, the longer it takes for the vacuuming process to end.
- As vacuuming is about going through your data and reclaiming rows marked as deleted, it is an I/O intensive process. So, it affects any other queries or processes that you might be running concurrently, but the good thing is,
- Vacuuming can happen concurrently with other processes, so it may not block any ETL processes or queries you might be running.
The Vacuuming process comes in different flavors, it is a configurable process and depending on what we want to achieve with our data and the type of queries we intend to perform on it, we have different options.
- VACUUM FULL. Vacuum Full is also the default configuration of a vacuum command, so if you do not provide any parameters to the command, this is performed on your data. With a Full Vacuum type, we both reclaim space, and we also sort the remaining data. These steps happen one after the other, so Amazon Redshift first recovers the space and then sorts the remaining data.
- VACUUM DELETE ONLY. If we select this option, then we only reclaim space and the remaining data in not sorted.
- VACUUM SORT ONLY. With this option, we do not reclaim any space, but we try to sort the existing data.
- VACUUM REINDEX. This command is probably the most resource intensive of all the table vacuuming options on Amazon Redshift. It is a full vacuum type together with reindexing of interleaved data. It makes sense only for tables that use interleaved sort keys.
Selecting the most efficient Vacuum Type requires knowledge of all the queries involved with the tables to vacuum.
At this point, a data analyst can help to decide the most effective strategy for each table. The central questions that the analyst can answer and which can help us select the proper plan are:
- Do any of your queries benefit from a sorted key on the tables used by these queries?
- Does the data analyst have created any interleaved sort keys?
If the answer to (1) is no, then it might be better to proceed with a VACUUM DELETE strategy. If the answer to (2) is yes, then it might be beneficial to move forward with a VACUUM REINDEX strategy.
In any case, as the queries change, the need for indexes and sort keys change, the type and the way we schedule our vacuuming process might have to change as well. Thus it is important to always have the data analyst in the loop regarding the maintenance of an Amazon Redshift Cluster.
Interleaved Sort Keys and Vacuuming
Interleaved Sort Keys are a useful tool for improving the performance of specific queries. The case of Interleaved Sort Keys worths a particular mention when it comes to the vacuuming process for the following reasons:
- A data analyst will most likely use Interleaved Sort Keys with big tables
- There’s a particular type of Vacuuming when these keys exist in a table, which is computationally heavy.
- The effect of UPDATE and DELETE operations to the deterioration of the query performance, is greater when Interleaved Sort Keys are involved.
So, it is important for an analyst to consider as part of the pros and cons of Interleaved Sort Keys, also the performance penalty of a vacuuming process and monitor closer the performance of queries that involve such sort keys.
The key metric we should watch to decide when a VACUUM REINDEX should happen is the skew of values of columns that have acted as an Interleaved Sort Key for the table.
Things to keep in mind
Table Vacuuming on Amazon Redshift is an important maintenance function that affects the work of a data analyst but how it is used, should also be influenced by her. A few things an analyst should keep in mind regarding the Vacuuming process:
- Vacuuming can help improve the query performance
- The Vacuuming frequency and the Vacuuming type should be affected by the type of queries we plan to perform.
- The Vacuuming process does not block the rest of the operations on a cluster but while it runs might affect the performance. So plan the query execution and Vacuum accordingly.
- Interleaved Sort Keys are significantly affected by unclaimed space on your cluster, so monitor these tables even closer.
- As a data analyst, it is important to keep Vacuum in mind when you design your tables. There is a certain limit on the number of columns that a table have to perform a Vacuum, and this number is less than the maximum number of allowed columns for an Amazon Redshift Table. This number varies depending on your cluster configuration but nevertheless, try not to reach it. If for any reason you exceed the limit, Vacuum is not possible; we need to perform a DEEP COPY to reclaim free space. Unfortunately, DEEP COPY cannot happen while you operate on the table and you should avoid it.
The system view SVV_VACUUM_PROGRESS returns an estimate of remaining time for a vacuuming process that is currently running.
The system table STL_VACUUM displays raw and block statistics for tables we vacuumed.
Finally, you can have a look to the Analyze & Vacuum Schema Utility provided and maintained by Amazon. This script can help you automate the vacuuming process for your Amazon Redshift cluster.