SQL Window Functions. Probably the coolest SQL feature which lacks popularity due to its peculiar syntax. But once you get used of it, you are running a high risk of putting them all over the place.
So what is a window function?
We will borrow the definition from PostgreSQL documentation and claim that :
“A window function performs a calculation across a set of table rows that are somehow related to the current row and is comparable to the type of calculation that can be done with an aggregate function”.
This blog post is part of our helpful SQL short guides
For those who wonder why is this so convenient I’ll start with an example scenario.
Imagine that you have a table where all the products of your company are stored along with product category and their price.
Product_Name | Product_Type | Price ($) |
Boho Bag | Accessory | 30 |
Flying Socks | Socks | 10 |
Royal Slippers | Homewear | 15 |
Happy Leggings | Socks | 7 |
Funky Hat | Accessory | 15 |
Colorful Pijamas | Homewear | 25 |
Awesome Necklace | Accessory | 7 |
What we want is a table with an extra column which will represent the average price of all products belonging to the same category as the one on the current line.
One approach to solve this problem is to calculate the average price per category using an aggregate function and then join the result with the initial table over the Product Type column in order to get a new table looking at which you can easily find out if a product is more expensive than the average of its category.
Although this would definitely do the job, the query would be quite complicated and lengthy and may lack readability. To avoid these, an alternative approach would be to make use of window function where there is no need to mess with subqueries and joins. When using a windowed function, you can retrieve both aggregated and non-aggregated values at the same time while when using GROUP BY you can get only the results grouped into a single output row.
In this example. As we mentioned previously, the type of calculation which we would like to perform is an average while the window over which we want to apply it is the set of products belonging to the same category as the one in the current line.
No matter if you are a user of PostgreSQL, Microsoft Server or Redshift, the SQL query which will do the job is as simple as the following:
SELECT Product_Name, Product_Type, Price, avg(Price) OVER (PARTITION BY Product_Type) FROM Product_Table;
On the same table, one may want to produce a numerical rank according to price field in each product category of the products within the current row’s partition. For this purpose, the most suitable function is the RANK() which takes no arguments as its behavior is entirely defined by the OVER clause. The SQL query suitable for this is syntactically the same as the one presented above except that it utilizes another function:
SELECT Product_Name, Product_Type, Price, rank(Price) OVER (PARTITION BY Product_Type) FROM Product_Table;
As you can see there are some special clauses being used that may need to be explained a bit more.
To begin with, a window function call always contains an OVER clause followed by the window function name and its arguments. This is the main syntactical difference from the rest of the function available in SQL.
In fact, the OVER clause when combined with PARTITION BY determines how the rows are split up for processing and states that the preceding function call must be done analytically by evaluating the returned rows of the query.
The PARTITION BY clause with OVER divides the rows into groups (can be thought as “windows”) that share the same values of the PARTITION BY expression. The OVER() clause actually implements this window definition and is the one that differentiates window functions from other analytical and reporting functions. For each row, the window function is computed across the rows that fall into the same partition as the current row.
In case the PARTITION BY clause is not present, the group or window over which the calculation is made is considered to be the whole table. In our case altering the initial goal, by omitting the PARTITION BY clause we can compare the price of each item with the average price of all items being sold by the company.
Furthermore, you control the order in which rows are processed using ORDER BY within partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row. This is a very handy feature which can easily answer the question regarding what is happening up to a certain point.
To better understand this functionality suppose that this time we are looking at the table where the sales of this store are being recorded.
Product_Name | Product_Type | Price ($) | Date_sold |
Boho Bag | Accessory | 30 | 1/2/2018 |
Flying Socks | Socks | 10 | 1/2/2018 |
Royal Slippers | Homewear | 15 | 1/4/2018 |
Happy Leggings | Socks | 7 | 1/5/2018 |
Funky Hat | Accessory | 15 | 1/5/2018 |
Colorful Pijamas | Homewear | 25 | 1/5/2018 |
Awesome Necklace | Accessory | 7 | 1/6/2018 |
Now what we want is to add an extra column which will present the total revenues up until the date of the current row. So practically we are going to use the sum() function and try to produce a table like the following:
Product_Name | Product_Type | Price ($) | Date_sold | Revenues ($) |
Boho Bag | Accessory | 30 | 1/2/2018 | 40 |
Flying Socks | Socks | 10 | 1/2/2018 | 40 |
Royal Slippers | Homewear | 15 | 1/4/2018 | 55 |
Happy Leggings | Socks | 7 | 1/5/2018 | 102 |
Funky Hat | Accessory | 15 | 1/5/2018 | 102 |
Colorful Pijamas | Homewear | 25 | 1/5/2018 | 102 |
Awesome Necklace | Accessory | 7 | 1/6/2018 | 109 |
Again if you are a user of PostgreSQL, Microsoft Server or Redshift, the SQL query you are looking for is like the following:
SELECT salary, sum(Price) OVER (ORDER BY Date_sold) FROM Sales_Table;
As explained before the summation will be computed over the products sold since the start date of the partition up through the date of the current row.
Although the examples presented in this post made use of SUM()
, RANK()
and AVG()
functions, other functions can be used as well. Either you need a more classical function such as COUNT()
, MIN()
, MAX()
or a more modern like ROW_NUMBER()
, RATIO_TO_REPORT()
you are able to do so.
Key takeaways
To summarize, it seems that the use of window functions can significantly decrease the complexity of your SQL queries, increase the efficiency and also come handy in many use cases including the following:
- Ranking results within a specific window (e.g. per-group ranking)
- Accessing data from another row in a specific window
- Aggregation within a specific window using ORDER BY (e.g. running totals)
No matter what database system you are working on, there is no need to mess up with unnecessary subqueries and demanding table joins once you get familiar with window functions syntax!
Further Reading: