跳至正文

【读书】cross join妙用——在Tableau中借SQL生成“日历表”

标签:

2024/1/7 本文在之前发布时,最后的案例选择了错误的方法,从公众号转到博客过程中,做了修订。 ——喜乐君


喜乐君按:我的一家关键客户之前多次问过我如何在tableau中生成“日历表”,比如2022年的全年日期数据,从而和车间的生产数据匹配,我表示只能借助手工创建成数据表(毕竟几年也不过千行)。

今天阅读leaning SQL 第10章中,惊奇地发现作者Alan Beaulieu用 cross join生成笛卡尔积,巧妙地解决了这个问题——虽然tableau不能无中生有创建明细表,但是SQL却可以借助select间接完成。因此,使用tableau中的自定义SQL,就可以生成日历表了。

⚠️ 本文适用于高级用户,SQL用mysql 8完成

Feb 3, 2022 喜乐君 

1- Cross Joins 交叉连接 生成日历表

cross joins的目的在于生成笛卡尔积,虽然没有on条件的join默认是笛卡尔积,不过建议使用cross join的连接方式。

// from 网络假设使用CROSS JOIN连接两个表,结果集将包括两个表中的所有行,其中结果集中的每一行都是第一个表中的行与第二个表中的行的组合。要特别注意的是,如果每个表有1000行,那么结果集中就有1000 x 1000 = 1,000,000行。//原文出自【易百教程】,商业转载请联系作者获得授权,非商业转载请保留原文链接:https://www.yiibai.com/mysql/cross-join.html

作者提出了一个 Data Fabrication(数据构造)的业务场景,而后使用了union all方式完成。原图如下(P193):

接下来,才是最重要的内容——如何使用union all和cross join的功能,创建2018年的日历。

书中先构建了一个从0到399的序列,fabricate有创造、伪造、虚构之意,fabricated table即本来不存在,因分析需要而构建的数据表。

我先用一个简单的练习,比如set{0,1,2,3,4} 和 set {10,11} 做cross join,如下

mysql> -- cross joinmysql> select one.num ,two.num, one.num+two.num as 1_plus_2 -> from -> (select 0 num union all -> select 1 num union all -> select 2 num union all -> select 3 num union all -> select 4) one -> cross join -> (select 10 num union all -> select 11 num ) two ;+-----+-----+----------+| num | num | 1_plus_2 |+-----+-----+----------+| 0 | 10 | 10 || 0 | 11 | 11 || 1 | 10 | 11 || 1 | 11 | 12 || 2 | 10 | 12 || 2 | 11 | 13 || 3 | 10 | 13 || 3 | 11 | 14 || 4 | 10 | 14 || 4 | 11 | 15 |+-----+-----+----------+10 rows in set (0.01 sec)

我们发现,这样会出现很多重复数据。既然10已经可以和set{0,1,2,3,4} ,因此就不需要11,可以直接从 10+4的下个数字开始。

同理,要完成0~99的 data set,只可以通过0~9的数据set 和0、10、20……90的set计算笛卡尔积完成。

m = {0,1,2,3,4,5,6,7,8,9} 

n= {0,10,20,30,40,50,60,70,80,90}

通过sql的cross join,如下生成,为了便于理解,我增加了order by clause,通过二者的和排序,即0~99的序列:

mysql> -- cross joinmysql> select one.num ,two.num, one.num+two.num as 1_plus_2 -> from -> (select 0 num union all -> select 1 num union all -> select 2 num union all -> select 3 num union all -> select 4 num union all -> select 5 num union all -> select 6 num union all -> select 7 num union all -> select 8 num union all -> select 9 num ) one -> cross join -> (select 0 num union all -> select 10 num union all -> select 20 num union all -> select 30 num union all -> select 40 num union all -> select 50 num union all -> select 60 num union all -> select 70 num union all -> select 80 num union all -> select 90) two -> order by 1_plus_2 ;+-----+-----+----------+| num | num | 1_plus_2 |+-----+-----+----------+| 0 | 0 | 0 || 1 | 0 | 1 || 2 | 0 | 2 || 3 | 0 | 3 || 4 | 0 | 4 || 5 | 0 | 5 || 6 | 0 | 6 || 7 | 0 | 7 || 8 | 0 | 8 || 9 | 0 | 9 || 0 | 10 | 10 || 1 | 10 | 11 || 2 | 10 | 12 || 3 | 10 | 13 || 4 | 10 | 14 || 5 | 10 | 15 || 6 | 10 | 16 || 7 | 10 | 17 || 8 | 10 | 18 || 9 | 10 | 19 || 0 | 20 | 20 || 1 | 20 | 21 || 2 | 20 | 22 || 3 | 20 | 23 || 4 | 20 | 24 || 5 | 20 | 25 || 6 | 20 | 26 || 7 | 20 | 27 || 8 | 20 | 28 || 9 | 20 | 29 || 0 | 30 | 30 || 1 | 30 | 31 || 2 | 30 | 32 || 3 | 30 | 33 || 4 | 30 | 34 || 5 | 30 | 35 || 6 | 30 | 36 || 7 | 30 | 37 || 8 | 30 | 38 || 9 | 30 | 39 || 0 | 40 | 40 || 1 | 40 | 41 || 2 | 40 | 42 || 3 | 40 | 43 || 4 | 40 | 44 || 5 | 40 | 45 || 6 | 40 | 46 || 7 | 40 | 47 || 8 | 40 | 48 || 9 | 40 | 49 || 0 | 50 | 50 || 1 | 50 | 51 || 2 | 50 | 52 || 3 | 50 | 53 || 4 | 50 | 54 || 5 | 50 | 55 || 6 | 50 | 56 || 7 | 50 | 57 || 8 | 50 | 58 || 9 | 50 | 59 || 0 | 60 | 60 || 1 | 60 | 61 || 2 | 60 | 62 || 3 | 60 | 63 || 4 | 60 | 64 || 5 | 60 | 65 || 6 | 60 | 66 || 7 | 60 | 67 || 8 | 60 | 68 || 9 | 60 | 69 || 0 | 70 | 70 || 1 | 70 | 71 || 2 | 70 | 72 || 3 | 70 | 73 || 4 | 70 | 74 || 5 | 70 | 75 || 6 | 70 | 76 || 7 | 70 | 77 || 8 | 70 | 78 || 9 | 70 | 79 || 0 | 80 | 80 || 1 | 80 | 81 || 2 | 80 | 82 || 3 | 80 | 83 || 4 | 80 | 84 || 5 | 80 | 85 || 6 | 80 | 86 || 7 | 80 | 87 || 8 | 80 | 88 || 9 | 80 | 89 || 0 | 90 | 90 || 1 | 90 | 91 || 2 | 90 | 92 || 3 | 90 | 93 || 4 | 90 | 94 || 5 | 90 | 95 || 6 | 90 | 96 || 7 | 90 | 97 || 8 | 90 | 98 || 9 | 90 | 99 |+-----+-----+----------+100 rows in set (0.00 sec)
mysql>

完美的fabricated table。

而为了完成0~399的序列,就可以再增加一个select clause,增加一个{0,100,200,300}和上面的结果计算笛卡尔积,最高可以获得99+300 =399的数字。

-- m = {0,1,2,3,4,5,6,7,8,9} -- n= {0,10,20,30,40,50,60,70,80,90}-- l= {0,100,200,300}select one.num ,two.num,three.num, one.num+two.num+three.num as total from(select 0 num union allselect 1 num union allselect 2 num union allselect 3 num union allselect 4 num union allselect 5 num union allselect 6 num union allselect 7 num union allselect 8 num union allselect 9 num ) one cross join (select 0 num union allselect 10 num union allselect 20 num union allselect 30 num union allselect 40 num union allselect 50 num union allselect 60 num union allselect 70 num union allselect 80 num union allselect 90) two cross join (select 0 num union allselect 100 num union allselect 200 num union allselect 300) three order by total ;
+-----+-----+-----+-------+| num | num | num | total |+-----+-----+-----+-------+| 0 | 0 | 0 | 0 || 1 | 0 | 0 | 1 || 2 | 0 | 0 | 2 || 3 | 0 | 0 | 3 || 4 | 0 | 0 | 4 |……| 6 | 90 | 300 | 396 || 7 | 90 | 300 | 397 || 8 | 90 | 300 | 398 || 9 | 90 | 300 | 399 |+-----+-----+-----+-------+400 rows in set (0.00 sec)

接下来,如果要生成2022年1月1日到年底的日期,就可以通过date_add函数,结合上面的SQL语句构建了——fabricate 

cross join 用于生成 0~399的data set数据集,而date_add函数用于把数据集转化为日期,如下所示:

mysql> mysql> select "2022-01-01", -> -- one.num ,two.num,three.num, -> one.num+two.num+three.num as total, -> date_add( '2022-01-01', interval one.num+two.num+three.num Day) dt -> from -> ( -> (select 0 num union all -> select 1 num union all -> select 2 num union all -> select 3 num union all -> select 4 num union all -> select 5 num union all -> select 6 num union all -> select 7 num union all -> select 8 num union all -> select 9 num ) one -> cross join -> (select 0 num union all -> select 10 num union all -> select 20 num union all -> select 30 num union all -> select 40 num union all -> select 50 num union all -> select 60 num union all -> select 70 num union all -> select 80 num union all -> select 90) two -> cross join -> (select 0 num union all -> select 100 num union all -> select 200 num union all -> select 300) three ) -> order by total ;+------------+-------+------------+| 2022-01-01 | total | dt |+------------+-------+------------+| 2022-01-01 | 0 | 2022-01-01 || 2022-01-01 | 1 | 2022-01-02 || 2022-01-01 | 2 | 2022-01-03 || 2022-01-01 | 3 | 2022-01-04 || 2022-01-01 | 4 | 2022-01-05 || 2022-01-01 | 5 | 2022-01-06 || 2022-01-01 | 6 | 2022-01-07 || 2022-01-01 | 7 | 2022-01-08 || 2022-01-01 | 8 | 2022-01-09 || 2022-01-01 | 9 | 2022-01-10 || 2022-01-01 | 10 | 2022-01-11 |……| 2022-01-01 | 393 | 2023-01-29 || 2022-01-01 | 394 | 2023-01-30 || 2022-01-01 | 395 | 2023-01-31 || 2022-01-01 | 396 | 2023-02-01 || 2022-01-01 | 397 | 2023-02-02 || 2022-01-01 | 398 | 2023-02-03 || 2022-01-01 | 399 | 2023-02-04 |+------------+-------+------------+400 rows in set (0.00 sec)
mysql>

这样就生成了从2022-1-1 到2023-2-4的数据集。业务过程中,我们并不在乎多构建出来的日期,join自然会处理掉,而且提前构建未来的日期也有好处。

如果只想保留2022年的日期,也可以在上述的SQL最后,增加一个where条件如下:

select "2022-01-01", -- one.num ,two.num,three.num, one.num+two.num+three.num as total,date_add( '2022-01-01', interval one.num+two.num+three.num Day) dtfrom ((select 0 num union allselect 1 num union allselect 2 num union allselect 3 num union allselect 4 num union allselect 5 num union allselect 6 num union allselect 7 num union allselect 8 num union allselect 9 num ) one cross join (select 0 num union allselect 10 num union allselect 20 num union allselect 30 num union allselect 40 num union allselect 50 num union allselect 60 num union allselect 70 num union allselect 80 num union allselect 90) two cross join (select 0 num union allselect 100 num union allselect 200 num union allselect 300) three )where date_add( '2022-01-01', interval one.num+two.num+three.num Day) <='2022-12-31' -- filter to year 2022order by total ;这样,我们就构建了一个完成的日期函数。

2- 销售数据与日历表的结合应用

接下来,作者提供了一个业务需求:计算全年下来,每天的账户数量(如果当天没有营业,当然就应该是0)。

Now that you have a mechanism for fabricating all the days in 2008, what shou you do with it ? well , you might be asked to generate a query that shows every day in 2008 along with the number of banking transactions couducted on that day, the number of accounts opened on that day, and so forth. 

比如我用超市的数据来看,2018年,并非每一天都有营业,此时不管是离散的日期,还是连续的日期,都不能得到每一天的数据——数据表中压根没有“闭门歇业”期间的任何数据。比如在tableau中摘取一段如下所示:

【补充2022】

上面的问题,其实不需要用join创建数据。如果只是为了显示0,Tableau Desktop可以借助于格式设置完成。

3、使用数据合并在明细行中补充数据

如果我们需要标记的不是0,而是其他逻辑。比如没有购买标记“客户流程”,然后分析每一天“前一天购买,当日流失的客户比例”,这样就需要在数据明细中明确的增加一行,从而完成后续判断。

喜乐君之前写过一篇Prep文章,旨在在数据准备阶段完成此类分析。 【深度Tableau】Prep Builder2021.3“新行”与客户迁徙数据准备案例 如果数量比较少,追求简单灵活,也可以在视图中使用合并完成。

比如,可以把已有的数据,和日历表做一个关联,从而生成每一天的数据,这里我先使用tableau desktop完成这个简易过程。

这样,即便没有营业的日期,也至少有一行明细数据,只是对应的字段都是空值,此时的柱状图或者折线图,都会出现歇业日期的业绩——0.

当然,书中自然是用SQL完成这里的过程,我可以将超市数据和日历表合并如下


-- superstore right outer join calendar day
Select cast(sp.订单日期 AS DATE) sp订单日期, cast(calender.dt AS DATE) dt , count(distinct sp.`订单 Id`) 订单数
from tableau.superstore sp
right outer join 
(
select  -- "2018-01-01", 
-- one.num ,two.num,three.num, 
one.num+two.num+three.num as total,
cast(date_add( '2018-01-01', interval one.num+two.num+three.num Day) as DATE) dt
from 
(
(select 0 num union all
select 1 num union all
select 2 num union all
select 3 num union all
select 4 num union all
select 5 num union all
select 6 num union all
select 7 num union all
select 8 num union all
select 9 num ) one 
cross join 
(select 0 num union all
select 10 num union all
select 20 num union all
select 30 num union all
select 40 num union all
select 50 num union all
select 60 num union all
select 70 num union all
select 80 num union all
select 90) two 
cross join 
(select 0 num union all
select 100 num union all
select 200 num union all
select 300) three )
where date_add( '2018-01-01', interval one.num+two.num+three.num Day) <='2018-12-31' -- filter to year 2022
-- order by total
) calender
on cast(sp.订单日期 AS DATE) = cast(calender.dt AS DATE)
group by cast(sp.订单日期 AS DATE), cast(calender.dt AS DATE)
order by 2;

可以通过UI查看前面的几行数据,

书中,作者也兴奋地说,

“This is one of the more interesting queries thus far in the book, in that it includes cross joins, outer joins, a data function, grouping, set operations(union all), and an aggregation . It is also not the most elegant solution to the given problem, but it should server as an example of how, with a little creativity and a firm grasp on the language, you can make even a seldom-used feature like cross joins a potent tool in your SQL toolkit”


了解 喜乐君 的更多信息

订阅后即可通过电子邮件收到最新文章。