Modeling Time Series Data
Working with time series data is an important part of a data analyst's job. Especially when you delve into behavioral data, the temporal dimension cannot be ignored.
Data such as a customer's interactions with a product over time, the responses of mail recipients to campaigns, and the behavior of buyers on an e-commerce site can all be treated as time series.
Time series data is usually characterized by its volume; sensor data and log files are typical examples. When someone is looking for a solution to work with large amounts of time series data, Amazon Redshift always comes up, thanks to its massively parallel architecture and its ability to handle large volumes of data.
In this section, we will see how time series data can be efficiently stored and queried on Amazon Redshift.
Storing Time Series Data on Amazon Redshift
Traditionally, storing time series data in a relational database involved the partitioning of the database. Partitioning is the process of logically splitting a larger table into smaller physical ones.
There are many reasons, mainly performance-related, for partitioning tables, but it makes particular sense when you have to deal with time series data. In this case, you partition your data based on time, e.g. on a monthly or even a daily basis, depending on the kind of analysis you intend to perform and the amount of data you have to deal with.
For an interesting article on how to work with partitions in PostgreSQL, you can check here.
Partitions are also used to implement retention policies; for example, it is possible to retain only the latest 30 days of data. Retention is necessary with time series data, as its volume can be enormous and, after some time, no longer relevant to the kind of analysis a data analyst wants to perform.
The problem is that Amazon Redshift does not support table partitioning in the traditional way that other RDBMSs do, so using table partitioning to work with time series data is not an option. Instead, you will need to implement the partitioning logic in your ETL process.
As Amazon also recommends, if you plan to work with time series data and you have a fixed retention period then you should organize your data around tables that correspond to that retention period. For instance, if you plan to retain your data on a monthly basis, then you should have one table per month. If you intend to have a yearly retention, then you should have one table per year, and so forth.
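As a sketch of what this looks like in practice, the DDL below creates one table per month for a monthly retention period. All names and columns (`events_2017_01`, `user_id`, `created_at`, and so on) are hypothetical, chosen only for illustration:

```sql
-- One physical table per month; names and columns are illustrative.
CREATE TABLE events_2017_01 (
    user_id    BIGINT,
    event      VARCHAR(64),
    created_at TIMESTAMP
)
DISTKEY (created_at)
SORTKEY (created_at);

-- The next month's table inherits the same structure,
-- including the distribution and sort keys.
CREATE TABLE events_2017_02 (LIKE events_2017_01);
```

Using `CREATE TABLE ... (LIKE ...)` keeps all the monthly tables structurally identical, which matters later when they are combined in a UNION ALL view.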
There are some pros to this approach.
First, it is easy to delete old data, as you only have to drop tables. Keep in mind that deletion is not an easy task on Amazon Redshift, and if you do it regularly, Vacuuming will be required at some point.
Second, you can use a UNION ALL view to get a unified view of all your time series data, without having to deal with many different tables and joins in your queries.
Finally, if you have a timestamp column in your data and you use it as a sort key, you effectively store your data in sort key order, which eliminates the need for vacuuming altogether.
The above table design emulates what partitioning techniques do on a traditional RDBMS: each table on Amazon Redshift corresponds to a partition, and the UNION ALL view to the original table. The difference is that the logic deciding how data is split across the different tables is something you have to implement yourself.
It is important to remember that whenever you drop old data, you need to recreate your UNION ALL view to ensure that it reflects your current tables, as Amazon Redshift does not update views automatically.
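As a sketch, assuming hypothetical monthly tables named `events_2017_01` and `events_2017_02`, the unified view could be defined, and later redefined whenever a table is added or dropped, like this:

```sql
-- Redefine the view whenever a monthly table is added or dropped;
-- CREATE OR REPLACE updates the existing view in place.
CREATE OR REPLACE VIEW events AS
    SELECT * FROM events_2017_01
    UNION ALL
    SELECT * FROM events_2017_02;
```

Queries can then target `events` as if it were a single table, while the underlying monthly tables come and go according to the retention policy.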
If you do not wish to maintain multiple tables, you can keep only one and enforce a retention policy on it. For example, assume that you want a weekly retention policy.
You keep a single table, and when the week ends, you should:
- Rename your table
- Create a new one using the original name
- Insert into the new table any rows from the old table that arrived after the retention period ended
- Drop the old table
Be careful, and do not forget to wrap all the above steps in a single TRANSACTION.
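The rotation steps above can be sketched as follows, assuming a hypothetical `events` table with a `created_at` timestamp column and a seven-day retention window:

```sql
-- Weekly rotation for a single-table retention policy (illustrative names).
BEGIN;

-- Step 1: rename the current table out of the way.
ALTER TABLE events RENAME TO events_old;

-- Step 2: recreate an empty table under the original name,
-- with the same structure, distribution key, and sort key.
CREATE TABLE events (LIKE events_old);

-- Step 3: carry over rows that are still inside the retention window.
INSERT INTO events
    SELECT * FROM events_old
    WHERE created_at >= DATEADD(day, -7, GETDATE());

-- Step 4: drop the old table.
DROP TABLE events_old;

COMMIT;
```

Running everything inside one transaction means queries against `events` never see a half-rotated state.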
Sort Keys and Distribution Keys for Time Series data
Previous chapters have explained why selecting the appropriate sort and distribution keys is an important part of maximizing the value you can get from your Amazon Redshift cluster. It is equally important for time series data.
Selecting a distribution key
Selecting an appropriate distribution key for time series data is an easy task. Whenever you have a timestamp column (and you should have one, as you are dealing with time series data), use it as your distribution key. Timestamps ensure the uniformity of such a key, which is a desirable attribute for Amazon Redshift.
In this way, you can be sure that your data will not be skewed and that you will end up with balanced processing across your cluster.
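As a sketch, the hypothetical table below is distributed on its timestamp column; the follow-up query uses the `SVV_TABLE_INFO` system view to verify that the rows ended up evenly spread across slices:

```sql
-- Hypothetical events table distributed on its timestamp column.
CREATE TABLE events (
    user_id    BIGINT,
    event      VARCHAR(64),
    created_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (created_at);

-- A skew_rows value close to 1.0 means the slices are evenly loaded.
SELECT "table", skew_rows
FROM svv_table_info
WHERE "table" = 'events';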
Selecting a sort key
Sort Keys are a bit more complicated, but again you should exploit the temporal nature of your data. First, if you have a timestamp column, you should consider using it as a Sort Key. By doing this, you also get the benefit of not having to perform Vacuuming, as mentioned earlier.
Additionally, having the timestamp as a sort key will benefit the performance of your queries: if you plan to work with time series, you will most likely use time-related columns in your queries.
The type of the Sort Key is not dictated by the temporal nature of your data; instead, you should consider the type of analysis you plan to perform. Check also the section on Sort Keys for guidelines on selecting the type of Sort Key for your time series tables.
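For example, with a hypothetical `events` table sorted on its `created_at` column, a range predicate on that column lets Redshift skip over entire blocks whose min/max values fall outside the requested range:

```sql
-- With created_at as the sort key, this range restriction allows
-- Redshift to prune blocks instead of scanning the whole table.
SELECT event, COUNT(*) AS occurrences
FROM events
WHERE created_at BETWEEN '2017-01-01' AND '2017-01-31'
GROUP BY event;
```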
Examples of Time Series Data
A typical example of time series data that you might want to load into your Amazon Redshift cluster is data coming from Mixpanel. This data corresponds to events triggered by the users of your product and is an excellent example of a time series. You can also read a more elaborate analysis of how to work with this time series data on Redshift.
Data coming from Mixpanel has quite a simple format; we are interested in the following two attributes:
- properties_time: the time the event was triggered
- event: the event that was triggered at that time
The data coming from Mixpanel can be stored in different tables using whatever retention strategy we want, e.g. monthly or weekly, by following the guidelines of this section.
Additionally, the attribute properties_time can be used as a distribution key, which will ensure the uniform distribution of the data across your cluster. It should also be used as a Sort Key, which keeps the table's data stored in order and thus avoids the need to vacuum it.
The attribute event is most likely to be used in your queries in conjunction with properties_time. For this reason, it is a good idea to create a Compound Sort Key on your Mixpanel tables, using properties_time as the first key and event as the second.
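Putting the pieces together, a hypothetical table for Mixpanel events (table name and column sizes are assumptions) might be defined like this:

```sql
-- Illustrative table for Mixpanel events: properties_time serves as
-- the distribution key, and a compound sort key covers both the
-- timestamp and the event name, in that order.
CREATE TABLE mixpanel_events (
    event           VARCHAR(255),
    properties_time TIMESTAMP
)
DISTKEY (properties_time)
COMPOUND SORTKEY (properties_time, event);
```

With a compound sort key, queries filtering on `properties_time` alone, or on `properties_time` together with `event`, benefit the most, which matches the query pattern described above.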
Here you can find a post describing how to export data from Mixpanel to Redshift to use it for analytics.