![](https://img2023.cnblogs.com/blog/3076680/202308/3076680-20230809235422409-1190410594.png) # 1. 2版DB版本 ## 1.1. DB2 11.5 ## 1.2. Oracle 19c ## 1.3. ...
1. 2版DB版本
1.1. DB2 11.5
1.2. Oracle 19c
1.3. PostgreSQL 12
1.4. MySQL 8.0
1.5. SQL Server 2017
2. 子查詢
2.1. 如果想創建虛擬表,以便對其執行包含視窗函數或聚合函數的查詢,那麼最簡單的做法無疑是使用子查詢
2.2. 只需編寫一個查詢並將其放在括弧內,然後再編寫另一個使用它的查詢
2.3. 有些 RDBMS 要求給子查詢表指定別名,有些則不要求這樣做
2.3.1. Oracle 不要求
2.4. 示例
2.4.1. sql
select max(HeadCount) as HighestJobHeadCount from
(select job,count(empno) as HeadCount
from emp
group by job) head_count_tab
3. 通用表表達式
3.1. 為了剋服子查詢的一些局限性,引入了 CTE
3.1.1. 讓 SQL 支持遞歸
3.2. CTE 的可讀性更高
3.2.1. sql
with head_count_tab (job,HeadCount) as
(select job,count(empno)
from emp
group by job)
select max(HeadCount) as HighestJobHeadCount
from head_count_tab
3.3. 遞歸 CTE 來計算前 20 個斐波那契數
3.3.1. sql
with recursive workingTable (fibNum, NextNumber, index1)
as
(select 0,1,1
union all
select fibNum+nextNumber,fibNUm,index1+1
from anchor
where index1<20)
select fibNum from workingTable as fib
3.3.2. 關鍵字 RECURSIVE,它在 MySQL、Oracle 和 PostgreSQL 中必不可少,但在 SQL Server 和 DB2 中是可選的
3.3.3. index1 列旨在簡化在 WHERE 子句中指定返回行數的工作
3.3.4. WHERE 子句至關重要,如果沒有它,查詢將不會終止
4. 根據發音比較字元串
4.1. 匹配單詞
4.1.1. 匹配拼寫正確和拼寫錯誤的單詞
4.1.2. 匹配拼寫方式不同(比如英式拼寫和美式拼寫)的單詞
4.1.3. 匹配由不同字元串表示的單詞
4.1.3.1. 查找拼寫不同但發音相同的字元串
4.2. 函數 SOUNDEX 將字元串轉換為英語發音
4.2.1. SOUNDEX 會保留第一個字母,並將其他字母替換為數字
4.2.2. 發音相似的字母將被替換為相同的數字
4.3. SQL Server 函數 DIFFERENCE 會使用 SOUNDEX 對兩個字元串進行比較,並返回表示相似程度的數字 0~4
4.4. sql
a_name
----
1 Johnson
2 Jonson
3 Jonsen
4 Jensen
5 Johnsen
6 Shakespeare
7 Shakspear
8 Shaekspir
9 Shakespar
4.4.1. sql
select an1.a_name as name1, an2.a_name as name2,
SOUNDEX(an1.a_name) as Soundex_Name
from author_names an1
join author_names an2
on (SOUNDEX(an1.a_name)=SOUNDEX(an2.a_name)
and an1.a_name not like an2.a_name)
5. 查找與模式不匹配的文本
5.1. sql
select emp_id, text
from employee_comment
EMP_ID TEXT
---------- ------------------------------------------------------------
7369 126 Varnum, Edmore MI 48829, 989 313-5351
7499 1105 McConnell Court
Cedar Lake MI 48812
Home: 989-387-4321
Cell: (237) 438-3333
5.2. 列出其中電話號碼格式不正確的行
5.3. sql
select emp_id, text
from employee_comment
where regexp_like(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')
and regexp_like(
regexp_replace(text,
'[0-9]{3}([-. ])[0-9]{3}\1[0-9]{4}',''),
'[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}')
EMP_ID TEXT
---------- ------------------------------------------------------------
7369 126 Varnum, Edmore MI 48829, 989 313-5351
7844 989-387.5359
9999 906-387-1698, 313-535.8886
6. 使用絕對中位差找出異常值
6.1. 值存在疑問的原因
6.1.1. 數據收集方式有問題
6.1.1.1. 記錄值的儀錶存在誤差
6.1.2. 數據輸入錯誤導致的
6.1.3. 數據生成時環境出現異常
6.1.3.1. 意味著數據點是正確的,但應謹慎根據數據得出任何結論
6.2. 檢測異常數據的常用方法
6.2.1. 計算數據的標準偏差,並將超過 3 倍標準偏差(或其他類似距離)的數據點視為異常數據
6.2.2. 如果數據不符合正態分佈,則這種方法可能錯誤地識別異常數據,而當數據分佈不對稱,或者如果你遠離平均值,數據就不像正態分佈那樣變得稀疏時更是如此
6.3. 偏差是中值與各個值的絕對差
6.4. 絕對中位差是偏差的中值
6.5. Oracle
6.5.1. sql
with
Deviation (Deviation)
as
(select abs(sal-median(sal))
from emp),
MAD (MAD) as
(select median(Deviation)
from Deviation )
select abs(sal-median)/MAD, sal, ename, job
FROM MAD join emp
6.6. SQL Server
6.6.1. sql
with median (median)
as
(select distinct percentile_cont(0.5) within group(order by sal)
over()
from emp),
Deviation (Deviation)
as
(Select abs(sal-median)
from emp join median on 1=1),
MAD (MAD) as
(select DISTINCT PERCENTILE_CONT(0.5) within group(order by deviation) over()
from Deviation )
select abs(sal-median)/MAD, sal, ename, job
from MAD join emp on 1=1
6.7. DB2
6.8. PostgreSQL
6.9. PERCENTILE_CONT 被視為聚合函數,而不是視窗函數
6.9.1. sql
with median (median)
as
(select percentile_cont(0.5) within group(order by sal)
from emp),
devtab (deviation)
as
(select abs(sal-median)
from emp join median),
MedAbsDeviation (MAD) as
(select percentile_cont (0.5) within group(order by deviation)
from devtab)
select abs(sal-median)/MAD, sal, ename, job
FROM MedAbsDeviation join emp
6.10. MySQL
6.10.1. sql
with rank_tab (sal, rank_sal) as (
select sal, cume_dist() over (order by sal)
from emp),
inter as
(
select sal, rank_sal from rank_tab
where rank_sal>=0.5
union
select sal, rank_sal from rank_tab
where rank_sal<=0.5
)
,
medianSal (medianSal) as
(
select (max(sal)+min(sal))/2
from inter),
deviationSal (Sal,deviationSal) as
(select Sal,abs(sal-medianSal)
from emp join medianSal
on 1=1
)
,
distDevSal (sal,deviationSal,distDeviationSal) as
(
select sal,deviationSal,cume_dist() over (order by deviationSal)
from deviationSal
),
DevInter (DevInter, sal) as
(
select min(deviationSal), sal
from distDevSal
where distDeviationSal >= 0.5
union
select max(DeviationSal), sal
from distDevSal
where distDeviationSal <= 0.5
),
MAD (MedianAbsoluteDeviance) as
(
select abs(emp.sal-(min(devInter)+max(devInter))/2)
from emp join DevInter on 1=1
)
select emp.sal,MedianAbsoluteDeviance,
(emp.sal-deviationSal)/MedianAbsoluteDeviance
from (emp join MAD on 1=1)
join deviationSal on emp.sal=deviationSal.sal
6.10.2. 沒有提供函數 MEDIAN 或 PERCENTILE_CONT
6.10.3. 需要在 CTE 中使用兩個子查詢
6.11. 計算中值,然後計算這個中值與各個值的絕對偏差的中值,即絕對中位差
6.12. 使用查詢來找出每個值相對於中值的偏差與絕對中位差的比值
6.13. 可以像使用標準偏差那樣使用這些比值了
6.13.1. 如果一個值相對於中值的偏差是絕對中位差的 3 倍以上,就可以認為它是異常值
6.14. 優點
6.14.1. 即便數據不呈正態分佈,它依然有效
6.14.2. 即便數據分佈不平衡,絕對中位差給出的答案依然合理
7. 使用本福特法則查找反常數據
7.1. 檢測不像異常值那樣顯而易見的反常數據的一種方式是查看數字位的出現頻率,這種頻率通常符合本福特法則
7.2. 本福特法則最常用於檢測數據造假
7.2.1. 在數據集中人為地添加偽造的數字
7.2.2. 用於檢測不符合預期規律的數據
7.3. 本福特法則
7.3.1. 計算數字位的期望分佈
7.3.2. 將其與實際分佈進行比較
7.4. sql
with
FirstDigits (FirstDigit)
as
(select left(cast(SAL as CHAR),1) as FirstDigit
from emp),
TotalCount (Total)
as
(select count(*)
from emp),
ExpectedBenford (Digit,Expected)
as
(select ID,(log10(ID + 1) - log10(ID)) as expected
from t10
where ID < 10)
select count(FirstDigit),Digit,
coalesce(count(*)/Total,0) as ActualProportion,Expected
From FirstDigits
Join TotalCount
Right Join ExpectedBenford
on FirstDigits.FirstDigit=ExpectedBenford.Digit
group by Digit
order by Digit
7.5. 最終的結果集包含 4 列數據,分別是第一位的預測頻率、第一位的實際頻率、本福特法則預測的前幾位的頻率,以及前幾位的實際頻率
8. SQL 不像專用包 SAS、統計編程語言 R 和 Python 統計庫那樣提供了完備的統計工具
9. SQL Server
9.1. PIVOT操作符
9.1.1. sql
DEPT_10 DEPT_20 DEPT_30 DEPT_40
------- ---------- ---------- ----------
3 5 6 0
9.1.2. sql
select [10] as dept_10,
[20] as dept_20,
[30] as dept_30,
[40] as dept_40
from (select deptno, empno from emp) driver
pivot (
count(driver.empno)
for driver.deptno in ( [10],[20],[30],[40] )
) as empPivot
9.1.2.1. 不使用CASE表達式或額外的連接操作
9.2. UNPIVOT操作符
9.2.1. sql
ACCOUNTING RESEARCH SALES OPERATIONS
---------- ---------- ---------- ----------
3 5 6 0
9.2.2. sql
DNAME CNT
-------------- ----------
ACCOUNTING 3
RESEARCH 5
SALES 6
OPERATIONS 0
9.2.3. sql
select DNAME, CNT
from (
select [ACCOUNTING] as ACCOUNTING,
[SALES] as SALES,
[RESEARCH] as RESEARCH,
[OPERATIONS] as OPERATIONS
from (
select d.dname, e.empno
from emp e,dept d
where e.deptno=d.deptno
) driver
pivot (
count(driver.empno)
for driver.dname in ([ACCOUNTING],[SALES],[RESEARCH],[OPERATIONS])
) as empPivot
) new_driver
unpivot (cnt for dname in (ACCOUNTING,SALES,RESEARCH,OPERATIONS)
) as un_pivot