All of us have at some point worked with some spreadsheet software, like Excel or Google Sheets, or BI tools and we have to admit that they offer certain functionalities that are very handy when it comes to data presentation and reporting, like the so-called pivot tables. Since many business applications require some sort of pivot tables, I am sure many of you have found themselves struggling with how to satisfy these requirements using a database instead of a spreadsheet.
Unfortunately, for many years applying the pivot functionality to data in a database table was cumbersome. The main reason for this is that in a typical relational schema, tables are supposed to grow vertically rather than horizontally, repeating value in the same columns when necessary.
Some may claim that this representation is sufficient -and indeed it is- but having a pivot table that would extend the data across and present metrics in the blink of an eye simplifies comparisons of large amounts of data and filtering on certain attributes’ values.
Lucky for us, during the past few years, some well-known database systems have implemented functions used for pivot table creation, saving us from the development of complicated and “hacky” queries.
At this post, we are going to demonstrate how you can create pivot tables when using PostgreSQL, Redshift or SQL Server. The data we are going to use come from Mixpanel (it is 10 sec work to connect and load your data from Mixpanel to any data warehouse) and represent the number of purchases of 5 different product along different countries.
For convenience, we have constructed a view over the database’s table to have immediate access only to the relevant data, and so our initial table looks like this:
So let the pivoting begin!
Since PostgreSQL version 8.3 was released, the crosstab function was introduced that allows users to apply pivoting over some column. In the examples, we are going to use the view’s data from the product catalog and pivot them over the location of the buyer.
However, before moving to the actual query implementation, we need to enable the tablefunc module in our PostgreSQL database that includes, among others, the crosstab function we are going to use.
From the crosstab documentation one can find a few options regarding the function’s options and chose those that fit best his data case.
For our example, the syntax that best fitted the needs was the following:
The SQL parameter produces the source set of data including three columns. The first refers to data that will be presented as rows names, the second to those that will be offered as column names and the last one as the corresponding values.
Using the following query, we can now create a pivot table for the sample data:
CREATE EXTENSION tablefunc; SELECT * FROM crosstab( 'select country, product, count from mixpanel_view group by country, product order by 1,2', 'select distinct properties_role from mixpanel_events where properties_role in (''product_1'',''product_2'', ''product_3'',''product_4'',''product_5'')') AS (country character varying, "product_1" numeric, "product_2" numeric, "product_3" numeric, "product_4" numeric, "product_5" numeric);
Presenting the queries’ results as a pivot table, we obtain the following view. Now, one can easily depict that in Argentina we sold 1 unit of product_1, four units of product_4, etc.
Another approach, more general and also applicable in most databases including PostgreSQL, Amazon Redshift, and Google BigQuery, involves the use of the CASE clause, as shown in the following example:
SELECT country, sum(CASE WHEN product = 'product_1' THEN count ELSE NULL END) AS mssql, sum(CASE WHEN product = 'product_2' THEN count ELSE NULL END) AS postgres, sum(CASE WHEN product = 'product_3' THEN count ELSE NULL END) AS bigquery, sum(CASE WHEN product = 'product_4' THEN count ELSE NULL END) AS redshift, sum(CASE WHEN product = 'product_5' THEN count ELSE NULL END) AS panoply FROM mixpanel_view GROUP BY country ORDER BY 1
For example, in a BigQuery database we would get the same results as before. To prove my point you can see the following table:
All these can also be implemented in MS SQL Server as well using the PIVOT clause which allows us to create a cross tabulation. According to the official documentation, it provides a more straightforward and more readable syntax than the one that may otherwise be specified in a complex series of SELECT…CASE statements.
Over the same initial view as before we can use the following query to create the desired pivot table:
SELECT * FROM example_data pivot(COUNT FOR product IN product([product_1], [product_2], [product_3], [product_4], [product_5]))
The result we come up to is the same as before.
Overall, it seems that tasks that used to be cumbersome in the past, are now tackled quite easily with the new functionalities that many databases have introduced, including PostgreSQL, Amazon Redshift, Google BigQuery and SQL Server.
However, if the pivot is often accessed by multiple users, it would be a good idea to store it as a materialized view so that others users can use the pivot table directly instead of having to manipulate the standard query results.