SQL Window Functions Introduction

Eleni Markou

SQL Window Functions. Probably the coolest SQL feature which lacks popularity due to its peculiar syntax. But once you get used of it, you are running a high risk of putting them all over the place.

So what is a window function?

We will borrow the definition from PostgreSQL documentation and claim that :
“A window function performs a calculation across a set of table rows that are somehow related to the current row and is comparable to the type of calculation that can be done with an aggregate function”.

 

For those who wonder why is this so convenient I’ll start with an example scenario.

Imagine that you have a table where all the products of your company are stored along with product category and their price.

Product_Name Product_Type Price ($)
Boho Bag Accessory 30
Flying Socks Socks 10
Royal Slippers Homewear 15
Happy Leggings Socks 7
Funky Hat Accessory 15
Colorful Pijamas Homewear 25
Awesome Necklace Accessory 7

What we want is a table with an extra column which will represent the average price of all products belonging to the same category as the one on the current line.

One approach to solve this problem is to calculate the average price per category using an aggregate function and then join the result with the initial table over the Product Type column in order to get a new table looking at which you can easily find out if a product is more expensive than the average of its category.

Although this would definitely do the job, the query would be quite complicated and lengthy and may lack readability. To avoid these, an alternative approach would be to make use of window function where there is no need to mess with subqueries and joins. When using a windowed function, you can retrieve both aggregated and non-aggregated values at the same time while when using GROUP BY you can get only the results grouped into a single output row.

In this example. As we mentioned previously, the type of calculation which we would like to perform is an average while the window over which we want to apply it is the set of products belonging to the same category as the one in the current line.

No matter if you are a user of PostgreSQL, Microsoft Server or Redshift, the SQL query which will do the job is as simple as the following:

SELECT Product_Name, Product_Type, Price, 
      avg(Price) OVER (PARTITION BY Product_Type) 
FROM Product_Table;

On the same table, one may want to produce a numerical rank according to price field in each product category of the products within the current row’s partition. For this purpose, the most suitable function is the RANK() which takes no arguments as its behavior is entirely defined by the OVER clause. The SQL query suitable for this is syntactically the same as the one presented above except that it utilizes another function:

SELECT Product_Name, Product_Type, Price, 
      rank(Price) OVER (PARTITION BY Product_Type) 
FROM Product_Table;

As you can see there are some special clauses being used that may need to be explained a bit more.

To begin with, a window function call always contains an OVER clause followed by the window function name and its arguments. This is the main syntactical difference from the rest of the function available in SQL.

In fact, the OVER clause when combined with PARTITION BY determines how the rows are split up for processing and states that the preceding function call must be done analytically by evaluating the returned rows of the query.

The PARTITION BY clause with OVER divides the rows into groups (can be thought as “windows”) that share the same values of the PARTITION BY expression. The OVER() clause actually implements this window definition and is the one that differentiates window functions from other analytical and reporting functions. For each row, the window function is computed across the rows that fall into the same partition as the current row.

In case the PARTITION BY clause is not present, the group or window over which the calculation is made is considered to be the whole table. In our case altering the initial goal, by omitting the  PARTITION BY clause we can compare the price of each item with the average price of all items being sold by the company.

Furthermore, you control the order in which rows are processed using ORDER BY within partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row. This is a very handy feature which can easily answer the question regarding what is happening up to a certain point.

To better understand this functionality suppose that this time we are looking at the table where the sales of this store are being recorded.

Product_Name Product_Type Price ($) Date_sold
Boho Bag Accessory 30 1/2/2018
Flying Socks Socks 10 1/2/2018
Royal Slippers Homewear 15 1/4/2018
Happy Leggings Socks 7 1/5/2018
Funky Hat Accessory 15 1/5/2018
Colorful Pijamas Homewear 25 1/5/2018
Awesome Necklace Accessory 7 1/6/2018

Now what we want is to add an extra column which will present the total revenues up until the date of the current row. So practically we are going to use the sum() function and try to produce a table like the following:

Product_Name Product_Type Price ($) Date_sold Revenues ($)
Boho Bag Accessory 30 1/2/2018 40
Flying Socks Socks 10 1/2/2018 40
Royal Slippers Homewear 15 1/4/2018 55
Happy Leggings Socks 7 1/5/2018 102
Funky Hat Accessory 15 1/5/2018 102
Colorful Pijamas Homewear 25 1/5/2018 102
Awesome Necklace Accessory 7 1/6/2018 109

Again if you are a user of PostgreSQL, Microsoft Server or Redshift, the SQL query you are looking for is like the following:

SELECT salary, sum(Price) OVER (ORDER BY Date_sold)
FROM Sales_Table;

As explained before the summation will be computed over the products sold since the start date of the partition up through the date of the current row.

Although the examples presented in this post made use of  SUM(), RANK() and AVG() functions, other functions can be used as well.  Either you need a more classical function such as COUNT(), MIN()MAX() or a more modern like ROW_NUMBER()RATIO_TO_REPORT() you are able to do so.

Key takeaways

To summarize, it seems that the use of window functions can significantly decrease the complexity of your SQL queries, increase the efficiency and also come handy in many use cases including the following:

  • Ranking results within a specific window (e.g. per-group ranking)
  • Accessing data from another row in a specific window
  • Aggregation within a specific window using ORDER BY (e.g. running totals)

No matter what database system you are working on, there is no need to mess up with unnecessary subqueries and demanding table joins once you get familiar with window functions syntax!

Further Reading:

Work with analytics-ready data into your database with no configuration - no maintenance data pipelines. Raw data to insights in minutes.

Blendo is the easiest way to automate powerful data integrations.

Try Blendo free for 14 days. No credit card required.