跳至正文

【SQL自我精进】SQL累积窗口计算与Tableau表计算

标签:

作者推荐:作为曾经的SQL小白和如今的SQL学习者,我觉得SQL中优雅的逻辑语法对于理解Tableau的表计算大有用处。前几日给一家客户做客户流失率的分析,我就提出了一个“客户累计购买次数及新存标签”的需求,从而基于昨日状态、今日状态判断客户迁徙。只是,我是完全无法实现这样的数据准备的,因为需要用SQL移动汇总,而且要填充上客户没有消费的日期。

因此,我觉得今年进一步进修SQL课程,也为下半年参加某大学的博士课程做准备。

在这篇文章中,作者简单扼要的介绍了 SQL移动汇总的方法、只有一个order by次序的语法、分别包含order by和partition by的语法、partiton by包含多个参数的语法。

这个汇总可以和Tableau 表计算对应,但是我又隐约发现了一些不同。

  • 在Tableau表计算中,我们说所有的表计算都必须引用聚合,也就是window_sum( AGG)的样式。但是我们发现SQL似乎就没有这样的要求,可以直接 SUM() partition by order by ——这个时候,也没有group by,似乎可以理解为行级别的查询和聚合。
  • 这就是奇怪的地方,我还需要确认我在Tableau中诠释逻辑,是否可以推演到SQL中,这样就可以用可视化的、层次的理解方式,理解sql的窗口计算。特别是 “窗口计算是维度参与的计算”。

我学习的原文如下:

What Is a SQL Running Total and How Do You Compute It?

Author's photo

Dorota Wdzięczna

Dorota is an IT engineer and works as a Data Science Writer for Vertabelo. She has experience as a Java programmer, webmaster, teacher, lecturer, IT specialist, and coordinator of IT systems. In her free time, she loves working in the garden, taking photos of nature, especially macro photos of insects, and visiting beautiful locations in Poland.

SQL中的移动汇总计算是普遍用于财务和时序分析中的关键语法。在这篇文章中,你可以学习如何相关知识并使用SQL语法完成。

The SQL running total is a very common pattern, used frequently in finance and in trend analysis. In this article, you’ll learn what a running total is and how to write a SQL query to compute it.

So, without further ado, let’s get started on the first part of the question.

什么是SQL移动汇总计算?

What’s a SQL Running Total?

在SQL中,移动汇总是在一列中把此前数据值累计聚合的特殊方法。以下面数据为例,展示了一家在线商店每天的注册用户数(第二列)。

In SQL, a running total is the cumulative sum of the previous numbers in a column. Look at the example below, which presents the daily registration of users for an online shop:

registration_dateregistered_userstotal_users
2020-03-053232
2020-03-061547
2020-03-07653

第一列是日期,第二列是对应第一列日期的注册人数;第三列total-users累计用户则是把当前注册用户的累积情况。(比如第二天3月6日,累计汇总就是第一天32人和第二天15人的合计)

The first column shows the date. The second column shows the number of users who registered on that date. The third column, total_users, sums the total number of registered users on that day. 

For example, on the first day (2020-03-05), 32 users registered and the total value of registered users was 32. The next day (2020-03-06) 15 users registered; the total_usersvalue became 47 (32+15). The third day (2020-03-07), six users registered and the total_users value was 53. In other words, total_users is a running value that changes from day to day. It is the total number of users on each day.

The next example uses the total_running column to deal with company revenue in a similar way. Look at the table below:

daterevenuetotal_revenue
2020-04-02125 000125 000
2020-04-03125 000250 000
2020-04-0420 500270 500
2020-04-05101 000371 500

For each day, the total_revenue column is calculating the amount of revenue generated up to the given day. On 2020-04-04, the company achieved a total revenue of $270,500 because that is the sum of all revenues from 2020-04-02 to 2020-04-04. 

Relational databases (like SQL SERVERORACLEPOSTGRESQL, and MYSQL) and even non-relational engines like HIVE and PRESTO provide window functions that allow us to calculate a running total. Next, we’ll talk about the SQL query that builds such a sum and learn more about window functions.

How to Compute a Cumulative Sum in SQL

If you would like to compute running total in SQL, you need to be familiar with the window functions provided by your database. Window functions operate on a set of rows and return an aggregate value for each row in the result set. If you are interested in learning more about window functions, try the WINDOW FUNCTIONS course on LEARNSQL.COM platform.

不要急于阅读相关的窗口函数知识(可能会吓退你),最好是边看边练习。

Don’t just read about window functions – practice what you’re learning. I recommend LearnSQL.com’s Window Functions course. It’s a great hands-on way to dig into using analytical functions to power up your SQL.

The syntax of the SQL window function that computes a cumulative sum across rows is:

window_function ( column ) OVER ( [ PARTITION BY partition_list ] [ ORDER BY order_list] )

It’s mandatory to use the OVER clause in a window function, but the arguments in this clause are optional. We will discuss them in the next paragraphs of this article.

Example 1

In this example, we will calculate the total running sum of the registered users each day. 

registration_dateregistered_users
2020-03-0532
2020-03-0615
2020-03-076

This query …

SELECT registration_date,registred_users,  SUM(registred_users) OVER (ORDER BY registration_date)  AS total_usersFROM registration;

… selects the registration date for all users. We also need the sum of all users for each day, starting from the first given day (2020-03-05) to the day in that row. 

This is the result set:

registration_dateregistered_userstotal_users
2020-03-055757
2020-03-062784
2020-03-0716100

To calculate the running total, we use the SUM() aggregate function and put the column registered_users as the argument; we want to obtain the cumulative sum of users from this column. 

The next step is to use the OVER clause. In our example, this clause has one argument: ORDER BY registration_date. The rows of the result set are sorted according to this column (registration_date). For each value in the registration_date column, the total sum of the previous column values is computed (i.e. the sum of the number of users before the date in the current row) and the current value (i.e. users registered on the day of the current row) is added to it. 

Notice that the total sum is shown in the new column, which we named total_users

In the first step (the registration date 2020-03-05), we have 57 registered users. The sum of users registered this day is the same 57. In the next step, we add to this total value (57). What do we add? The number of users registered on the current date (2020-03-06), which is 27; this gives us a running total of 84. In the last row of the result set (for the last registration date, 2020-03-07) the running total is 100. 

Thanks to SQL window functions, it is easy to find the cumulative total number of users during a given period of time. For example, during 2020-03-05 – 2020-03-06, the total number of registered users was 84.

Example 2

In the second example, we’ll go into more details about users. We’ll show users with their countries. Look at the table below: 

countryregistration_dateregistered_users
England2020-03-0525
England2020-03-0612
England2020-03-0710
Poland2020-03-0532
Poland2020-03-0615
Poland2020-03-076

Notice that for each day we have the number of users for each country shown separately. In this example, we will compute a separate cumulative sum of registered users for each country. 

This query …

SELECT
country,
registration_date,
registred_users,  
SUM(registred_users)   OVER (PARTITION BY country ORDER BY registration_date)  AS total_users
FROM registration;

… calculates the sum of users for each day, first for users from England and then for users from Poland.

Here’s the result set:

countryregistration_dateregistered_userstotal_users
England2020-03-052525
England2020-03-061237
England2020-03-071047
Poland2020-03-053232
Poland2020-03-061547
Poland2020-03-07653

For each country, each registration day gets a running total. The PARTITION BY clause in the OVER clause has the column country as its argument. This partitions rows by country, allowing SQL to compute a running total for that country only (instead of both countries together). Thus, in England from 2020-03-05 to 2020-03-07, we have a total of 47 users. For the same period in Poland, the total of registered users was 53.

Tired of doing simple SQL exercises? Let’s move to a more advanced level! Check out our Advanced SQL track!

Example 3

In the last example, we’ll analyze the data in the competition table, which stores the columns game_idgamer_idgame_levelcompetition_date, and score

game_idgame_levelgamer_idcompetition_datescore
1342020-04-024
1242020-04-035
1142020-04-042
1352020-04-021
1252020-04-032
2372020-04-074
2272020-04-086
2172020-04-072
2362020-04-081
2262020-04-091
2382020-04-072

We need to check each gamer’s total cumulative score for each day in two different games. Look at the query below, which creates this running total:

SELECT game_id,game_level,gamer_id,competition_date,score,  SUM(score)  OVER (PARTITION BY game_id, gamer_id    ORDER BY competition_date)  AS total_scoreFROM competition;

The result:

game_idgame_levelgamer_idcompetition_datescoretotal_score
1342020-04-0244
1242020-04-0359
1142020-04-04211
1352020-04-0211
1252020-04-0323
2362020-04-0711
2262020-04-0812
2372020-04-0744
2272020-04-08610
2172020-04-09212
2382020-04-0722

In this result table, we can read that the gamer with ID=4 starts from a score of 4 and finishes with a total score of 11. The best was the gamer with ID=7, who finished with a total score of 12. 

Once again, in the OVER clause we use PARTITION BY. This time, we use a list of columns (game_id, gamer_id). This allows us to create two partitions: one for game 1 and one for game 2. 

Next, rows were divided by gamer_id for each game. In game 1, we have the gamers 4 and 5; in game 2, we have the gamers 6, 7, and 8. Among each group (a given gamer plays in a given game), rows are sorted by competition_date and the score from each day is summed. In each group, we can observe each gamer’s changing score in a given game. 

How Will You Use SQL Running Totals?

Using a running total value in SQL reports can be very handy, especially for financial specialists. Therefore, it is worthwhile to know what a cumulative sum is and how to use SQL window functions to create one. This article presented a few selected use cases. For more about window functions, check out our article SQL WINDOW FUNCTION EXAMPLE WITH EXPLANATIONS or the LearnSQL course WINDOW FUNCTIONS.

Want to learn about window functions? Click here for a great interactive experience!


了解 喜乐君 的更多信息

订阅后即可通过电子邮件收到最新文章。