【读书】Learning_SQL by Alan_Beaulieu-第三章-Tableau喜乐君-Tableau Visionary，敏捷BI布道师

XILEJUN

喜乐君 Tableau Visionary ✦ 5

📊 业务数据分析「专家」· 敏捷 BI 布道师

📚 《数据可视化分析》《业务可视化分析》多本书作者

🎓 中国地质大学（武汉）经管学院 MBA 校外导师

🤝 以 Tableau 会友，致力于构建业务分析通识框架

🌐 XILEJUN.com 全球 🌐 VIZWISE.cn 国内

📚 本文配套课程 · SQL 系列

🎬 B 站课程：DAX「别裁新解」 — https://www.bilibili.com/cheese/play/ss8780

注：本书是2022年春节期间阅读的笔记，之前发布在微信公众号中，这里整理到博客中，并必要调整。

【读书笔记】Learning_SQL by Alan_Beaulieu第三章 SQL关键子句入门

Learning_SQL by Alan_Beaulieu第七章日期函数行级别函数之SQL中与日期相关的函数

[sql]Learning_SQL by Alan_Beaulieu Chapter 8 分组聚合深刻理解分析的本质——分组聚合

【读书】Learning_SQL by Alan_Beaulieu-第五/十章连接理解多表合并，特别是Join

【读书】cross join妙用——在Tableau中借SQL生成“日历表” 多表合并的特殊性：笛卡尔积，构建日历表

喜乐君

2024年㊗️ 新年慢乐·喜乐平安

喜乐君按：假期居家修订《数据可视化分析》一书，期间阅读了一些不错的书籍，今日将Learning_SQL第三章、第四章做简要笔记，分享读者。SQL并非学习Tableau的必备，但却是深入理解数据的好工具，也是为数不多历经50年没有消失的极少数代码语言之一。

SQL近期的阅读材料为：

Introduction to SQL by Phil Spector （University of California, Berkeley）
Learning_SQL by Alan_Beaulieu （2009 O’Reilly Media）

Chapter 3 Query Primer

Query Mechanics

SQL的查询是一个先验证权限和语法（permission and syntax），而后执行（execution）并返回数据集（data sets）的过程。

查询过程中，有多个部分构成（query clauses），不过只有select是必须mandatory。

select：determines which columns include the query’s result set
from :identifies the tables from which to dray data and how the tables should be joined
where:filters out unwanted data
group by:used to group rows together by common column values
having: filters out unwanted groups
order by :sorts the rows of the final result set by one or more columns

select

select部分最好理解，*代表all columns，或者就指定字段名称（还可以使用as增加别名alias）；不过，随着学习的深入，select后面可以使用多个样式：

literals，比如数字或者字符串
expressions表达式，比如 amount*1.1
build-in function calls，内置函数，比如round（amount，2）
user-defined function calls

from

from ，如果按照难易程度，也可以分为至少两个类型：

from single table

单表中又可以分为三种类型，

permanent tables(i.e. created using the create table statement)
temporary tables(i.e. rows returned by a subquery )
virtual tables (i.e. created using the create view statement)

这和tableau中的数据表类型多有类似，也可以视为是“物理-逻辑”的演进。系统中已有的表、通过create table生成的表是包含数据的物理表，而通过create view创建的视图，则只是一个未来可用的query查询，因此逻辑上存在、内容是空，故名virtual ；临时表介于二者之间，通过subquery子查询临时创建，在短时间内存在，事后就释放，如同很多数据工具中的“影子数据”或者浏览器的缓存cache。不过，SQL中create view生成的view，是数据视图，和tableau的可视化视图viz虽然有类似之处（都是临时的数据集），但最终展现的形式不同，view可以认为是cross table，而tableau viz是可视化visualization

create view employee_view AS select emp_id, name, YEAR(start_date)from employee

from table links

数据可以来自于多表，此时就需要多表创建关联，否则默认生成的多表字段的笛卡尔积。

默认的join 是inner join， join需要和on condition结合使用；同时多表可以使用AS增加table alias。

select e.emp_id, e.emp_name, d.department from employee e INNER JOIN department d ON e.dept_id = d.dept_id

where

Most of time, you will not with to retrieve every row from a table but will want to filter out those rows that are not of interest. this is a job for the where clause.

书中Charter 4深入介绍了where子句的使用，把它分为四种类型，对我更好地理解tableau筛选，并做进一步分类起到了重要作用。

后面单独分享

group by and having clauses

从group by开始，我觉得开进入分析的世界。我们可以把数据的query分为两类：（1）query raw data without any maunipulation (2) group datas by columns to find trends.

group by是显形的语法，与之相对的潜在的聚合过程是aggregate。也许是正因此，书中Chapter 8的标题是 grouping and aggregates。在我最近的图书修订中，我发现自己过度的强调了aggregate的部分，而对group的阐述不足，虽然group的进一步抽象是问题的层次，但是还需要更具体化的解释才行。后续介绍chapter 8.

having正是对聚合后的数据做的筛选，因此理解它和where的不同至关重要。

where针对的row level data，优先级高于group by
having针对group by之后的数据，是对聚合后的数据筛选

-- how many departments and num_employees of each dept
-- num_employees more than 2
select d.name, count(e.emp_id) as num_employees
from deparment d
inner join employee e
on d.dept_id = e.dept_id
group by d.namehaving count(e.emp_id)>2;

order by clauses

The order by clause is the mechanism sorting your result set using either raw column data or expressions based on column data .

sort by可以是单一字段，可以是多个字段，注意多个字段的排序是有优先级的，从前往后。另外有几个可选项：

ASC或者DESC两种排序方式
可以使用limit 限制行数
可以使用expression ，比如 right(fred_id,3)
可以使用数字位置符numeric placeholders，不过谨慎地使用（sparingly）

⚠️ 排序对于性能是负面影响，因此如果只是出数据，并非必须。

select emp_id, name, title, fname,lnamefrom employeee order by title,lname -- order by 3,5   -- sorting via numeric placeholders

Chapter 4 Filtering

Filter 广泛应用于query，modify，delete等多个场景。

Chapter 3介绍的where 和 having都可以完成类似的筛选，前者是narrow your foucs to a subset of a table’s rows,后者则是 fitler conditions pertaining to grouped data 。二者的差异与group by息息相关，或者说where是针对明细的，having 是针对聚合的。

本章介绍rows的筛选

Basic of logic

大学期间学过逻辑学，虽然大部分知识似乎忘却，但是一些关键的逻辑方法却已经深入内心，所以这里关于and、or、not、parentheses（圆括号）等operator就好理解了。

另外，还有两类operators ：

Comparision operators：=, !=, <,>,LIKE, IN , BETWEEN
Arithmeic operators: +,- ,* , / （四则运算）

之后，作者把filter分为了四类，这里值得我参考借鉴。作者并未像tableau可视化工具一样指出前面的分类是基于discrete and continuous fiels的字段分类，。

Equality Condition

使用comparisons operators可以完成equality或者inequality conditions。

……where title = "manager"where amount = 13.4 where dep_id = (select dept_id from department where name ='Joy')where title <> 'manager'

当然，如果是使用多个条件，就借助于and 和or等逻辑符。

where title = 'manager' OR belong_to = 'CEO

Range Condition

ranges of values to include or exclude，建立在连续的字段基础上，因此书中有句话：

this type of condition is common when working with numeric or temporal data. 这也应了tableau中的连续字段，要么是数字，要么是日期。

这里有几个关键：

如果只是指定一侧，那么就用大于或者小于符号
如果是两侧的范围，那么between and是简化逻辑的超级选择

SELECT emp_id, fname, lname, start_date FROM employee WHERE start_date < '2007-01-01'          AND start_date >= '2005-01-01';

SELECT emp_id, fname, lname, start_date FROM employee WHERE start_date BETWEEN '2005-01-01' AND '2007-01-01';

这对于数字依然是适用的。

WHERE avail_balance BETWEEN 3000 AND 5000; -- --WHERE avail_balance >=3000 and  avail_balance <=5000;

现在再回头看tableau中的筛选，是不是瞬间一摸一样了。

Membership Condition

set of values to include or exclude

多个值（注意是离散值）的筛选，书中作为单独一类介绍，可能是为了介绍独特的IN语法吧。

比如下面的一段SQL，locate all accounts whose product code is either ‘CHK’, ‘SAV’, ‘CD’, or ‘MM’。

SELECT account_id, product_cd, cust_id, avail_balance FROM account WHERE product_cd = 'CHK' OR product_cd = 'SAV' OR product_cd = 'CD'      OR product_cd = 'MM';

看上去，使用OR逻辑符太麻烦了，此时可以用IN函数简化。如下：

-- WHERE product_cd IN ('CHK','SAV','CD','MM');

当然，一些嵌套的筛选，也可以使用IN函数加subquery完成，。

SELECT account_id, product_cd, cust_id, avail_balance FROM accountWHERE product_cd IN (SELECT product_cd FROM product     WHERE product_type_cd = 'ACCOUNT');

只是，这种用法通常需要深厚的逻辑能力和SQL水平了。

Tableau 在2020年版本增加了IN函数，也是为了让此类问题更加简单。同时也可以与“条件筛选”等问题对应，诸如“筛选销售额大于1万的客户”，可以视为是一个简单的in函数。

比如，我可以使用以下SQL查询订单次数大于10的客户：

select `客户 Id` ,count(distinct `订单 Id`) from tableau.superstore group by `客户 Id` having count(distinct `订单 Id`) >10 ;

之后，我把上述结果，作为IN函数后的subquery，然后查询这些客户所属的地区及利润。

select 地区 ,sum(利润)from tableau.superstore where `客户 Id` in (select `客户 Id` ,count(distinct `订单 Id`) from tableau.superstore     group by `客户 Id`     having count(distinct `订单 Id`) >10 )group by 地区;

当然，这个语法可能不是最优的，通过性能查看器或者其他方式，也可以查看tableau自动生成的SQL语言。

Matching Conidition

various pattern-searching techniques to look up partial matches (string)

match规则有很多，比如以字母T开头，以S结尾，或者包含K（通配符）。SQL中可以使用build-in function完成，或者借助于通配符。

SELECT emp_id, fname, lname FROM employee WHERE LEFT(lname, 1) = 'T' -- 首字母为T OR lname LIKE '_a%'; -- 或者任意字幕开始，第二位为a

Wildcard character

这里是用了两个通配符字符Wildcard character ——短横和%，短横代表任意一个字符，%代表多个。

_ Exactly one character
% Any number of characters (including 0)

作者还给出了一些实例：

Using regular expressions

正则表达式针对字符串处理，有点难，先跳过吧。

总结：

tableau基于SQL而创造了用于可视化的VizQL，不过它的底层依然是SQL逻辑，二者的相同之处让我在阅读这本书时游刃有余，并能进一步增加对Tableau的理解。