Hive執行計劃之hive依賴及許可權查詢和常見使用場景

[TOC] ## 概述 Hive查看執行計劃的命令中還有兩個不怎麼常用但很重要的命令，接下來詳細介紹一下。有一個問題：**如何在hiveSQL執行之前就探查到這段邏輯的血緣依賴關係？** hive血緣是很多生產級數倉必須要提供的功能，大多數解決方案都是**使用hive hooks的方法通過SQL執 ...

概述
1.explain dependency的查詢與使用
2.藉助explain dependency解決一些常見問題
- 2.1.識別看似等價的SQL代碼實際上是不等價的：
- 2.2 通過explain dependency驗證將過濾條件在不同位置的查詢區別
3.查看SQL操作涉及到的相關許可權信息

概述

Hive查看執行計劃的命令中還有兩個不怎麼常用但很重要的命令，接下來詳細介紹一下。

有一個問題：如何在hiveSQL執行之前就探查到這段邏輯的血緣依賴關係？

hive血緣是很多生產級數倉必須要提供的功能，大多數解決方案都是使用hive hooks的方法通過SQL執行後解析得到hive表的依賴關係。

這個方案能細粒度到欄位級依賴，屬於很完善的一個解決方案，但有很多場景我們需要在SQL執行之前就得到依賴關係，那麼如何解決的呢？

1.explain dependency的查詢與使用

explain dependency 提供了這樣的一個解決方案，它可以查詢一段SQL需要的數據來源，以JSON的形式展現結果數據。裡面主要包含兩部分內容：

input_tables：描述一段SQL依賴的數據來源表，裡面存儲的是hive表名的列表，格式如下：

{"tablename":"庫名@表名","tabletype":"表的類型（外部表/內部表）"}
input_partitions：描述一段SQL依賴的數據來源表分區，裡面存儲的是分區名稱的列表，格式如下：

{"partitionName":"庫名@表名@分區列=分區列的值"}

如果查詢的表為非分區表，則顯示為空。

可以通過以下例子來進行比對，其中例1是查詢非分區普通表SQL的explain dependency，例2是查詢分區表SQL的explain dependency。

例1 使用explain dependency查看SQL非分區普通表。

explain dependency
-- 統計年齡小於30歲各個年齡里，昵稱裡帶“小”的人數
select age,count(0) as num from temp.user_info_all_no
where age < 30 and nick like '%小%'
group by age;

輸出結果內容：

{"input_tables":[{"tablename":"temp@user_info_all_no","tabletype":"MANAGED_TABLE"}],"input_partitions":[]}

例2 使用explain dependency查看SQL查詢分區表。

explain dependency
-- 統計年齡小於30歲各個年齡里，昵稱裡帶“小”的人數,其中ymd欄位為分區欄位
select age,count(0) as num from temp.user_info_all where ymd >= '20230501'
and age < 30 and nick like '%小%'
group by age;

輸出結果內容：

{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"}]}

2.藉助explain dependency解決一些常見問題

explain dependency的使用場景有以下幾個：

場景一，快速排除。快速排除因為讀不到相應分區的數據而導致任務數據輸出異常。例如，在一個以天為分區的任務中，上游任務因為生產過程不可控因素出現異常或者空跑，導致下游任務引發異常。通過這種方式，可以快速查看SQL讀取的分區是否出現異常。

場景二，理清表的輸入，幫助理解程式的運行，特別是有助於理解有多重子查詢，多表連接的依賴輸入。

場景三，提前通過解析hiveSQL腳本進行血緣依賴解析，用於一些定製化數據平臺工具開發中的血緣構建。

explain dependency的使用能幫助開發者解決哪些問題呢？

2.1.識別看似等價的SQL代碼實際上是不等價的：

對於接觸SQL不久的程式員來說，很多人容易將

select * from a left join b on a.no=b.no and a.f>1 and a.f<3;

這段邏輯等價於 select * from a left join b on a.no=b.no where a.f>1 and a.f<3;

這兩段的邏輯的區別是在多表left join的時候where 後加條件是否等價與on後面加條件。

我們通過實例來看看其中的區別：

例3 使用explain dependency識別看似等價的SQL代碼。

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd and a.ymd >= '20230501' and a.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where a.ymd >= '20230501' and a.ymd <= '20230502';

輸出結果內容：

// 代碼1輸出結果
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230503"},{"partitionName":"temp@user_act_info@ymd=20230606"}]}

// 代碼2輸出結果
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

通過以上輸出結果可以看出，上面例子里的兩段SQL其實並不等價。在left join（left outer join）的連接條件中加入非等值的過濾條件後，這裡特指作用於a表，也就是連接的基表，並沒有將左外連接的左右兩個表按照過濾條件進行過濾，左外連接在執行時會讀取所有分區數據，然後進行關聯數據過濾操作。

left outer join 針對左表非等值條件on和where查詢數據on條件查詢數據大於where條件查詢數據。

下麵查看left outer join對右表的過濾條件實例：

例4 使用explain dependency識別left outer join 右表過濾非等值條件區別

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd and b.ymd >= '20230501' and b.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where b.ymd >= '20230501' and b.ymd <= '20230502';

輸出結果內容：

// 代碼1輸出結果,on後跟非等值條件
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

// 代碼2輸出結果,where後跟非等值條件
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230503"},{"partitionName":"temp@user_act_info@ymd=20230606"}]}

可以看到left outer join 針對右表非等值條件on和where查詢數據左表都是全表掃描，右表on條件是條件過濾，where條件是全表掃描。

接下來對inner join，right outer join，full outer join進行測試。會發現

inner join 的類似針對左右表非等值條件on和where查詢數據是等價的。

right outer join和left join相反。

full outer join都是全表掃描。

那麼可以很好的判斷出一下兩段SQL的過濾條件數據讀取範圍是完全不一樣的。就不貼執行結果了。

例5 left outer join下的對左表和右表不等值條件過濾。

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd and a.ymd >= '20230501' and a.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd and b.ymd >= '20230501' and b.ymd <= '20230502';

以上不同join類型數據查詢範圍不一致主要原因和hive對join和where的謂詞下推支持不同有關。通過explain dependency可以直接驗證hive對join和where進行謂詞下推規則的驗證。

謂詞下推可詳細查看什麼是謂詞下推，看這一篇就夠了

2.2 通過explain dependency驗證將過濾條件在不同位置的查詢區別

如果要使用外連接並需要對左右兩個表進行條件過濾，做好的方式是將過濾條件放到就近處，即如果已經知道表數據過濾篩選條件，那麼在使用該表前，就先用過濾條件進行過濾，然後進行其他操作。

一些SQL內置優化器會做一些過濾下推優化，但部分條件還是不會進行下推。所以我們在寫SQL時儘量養成先過濾而後進行其他操作（聚合，關聯）的習慣。

可以看如下實例：

例6 left outer join對左表過濾數據的優化對比。

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where a.ymd >= '20230501' and a.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from (
	select uid,ymd from temp.user_info_all
  -- 在子查詢內部進行過濾
  where ymd >= '20230501' and ymd <= '20230502'  
) a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd;

-- 代碼3
explain dependency
select a.uid from (
	select uid,ymd from temp.user_info_all
  -- 在子查詢內部進行過濾
  where ymd >= '20230501' and ymd <= '20230502'  
) a
left outer join (
	select uid,ymd from temp.user_act_info
  where ymd >= '20230501' and ymd <= '20230502'
) b
on a.uid = b.uid and a.ymd = b.ymd;

執行結果：

//代碼1，左右表都進行了過濾
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

//代碼2，右表進行了全表掃描
{"input_tables":[{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230503"},{"partitionName":"temp@user_act_info@ymd=20230606"}]}

//代碼3，左右表都進行了過濾
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

可以看到left outer join對左表過濾數據的優化中代碼1片段等價於代碼3片段，即兩表都在就近處都過濾。

例7 left outer join對右表過濾數據的優化對比。

-- 代碼1
explain dependency
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where b.ymd >= '20230501' and b.ymd <= '20230502';

-- 代碼2
explain dependency
select a.uid from (
	select uid,ymd from temp.user_info_all
  -- 在子查詢內部進行過濾
  where ymd >= '20230501' and ymd <= '20230502'  
) a
left outer join (
	select uid,ymd from temp.user_act_info
  where ymd >= '20230501' and ymd <= '20230502'
) b
on a.uid = b.uid and a.ymd = b.ymd;

-- 代碼3
explain dependency
select a.uid from temp.user_info_all a
left outer join (
	select uid,ymd from temp.user_act_info
  where ymd >= '20230501' and ymd <= '20230502'
) b
on a.uid = b.uid and a.ymd = b.ymd;

執行結果內容：

// 代碼1 ，左右表都進行了全表掃描
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230503"},{"partitionName":"temp@user_act_info@ymd=20230606"}]}

//代碼2，左右表都進行了過濾
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"}]}

//代碼3，右表都進行了過濾
{"input_tables":[{"tablename":"temp@user_info_all","tabletype":"MANAGED_TABLE"},{"tablename":"temp@user_act_info","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"temp@user_act_info@ymd=20230501"},{"partitionName":"temp@user_act_info@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230430"},{"partitionName":"temp@user_info_all@ymd=20230501"},{"partitionName":"temp@user_info_all@ymd=20230502"},{"partitionName":"temp@user_info_all@ymd=20230503"},{"partitionName":"temp@user_info_all@ymd=20230504"},{"partitionName":"temp@user_info_all@ymd=20230505"},{"partitionName":"temp@user_info_all@ymd=20230529"}]}

可以看到left outer join對右表過濾數據的優化中代碼2是最優，代碼3次之，代碼1最差。

3.查看SQL操作涉及到的相關許可權信息

通過explain authorization可以知道當前SQL訪問的數據來源（INPUTS）和數據輸出（OUTPUTS），以及當前Hive的訪問用戶（CURRENT_USER）和操作（OPERATION）。

可以看以下實例：

例8 使用explain authorization查看許可權相關信息。

explain authorization
select a.uid from temp.user_info_all a
left outer join temp.user_act_info b
on a.uid = b.uid and a.ymd = b.ymd 
where a.ymd >= '20230501' and a.ymd <= '20230502';

執行結果：

INPUTS: 
  temp@user_info_all
  temp@user_act_info
  temp@user_info_all@ymd=20230501
  temp@user_info_all@ymd=20230502
  temp@user_act_info@ymd=20230501
  temp@user_act_info@ymd=20230502
OUTPUTS: 
  hdfs://nameservice1/tmp/hive/hdfs/a88cc133-c310-4129-bfa0-28011ac23904/hive_2023-06-07_19-42-55_464_2777807904847671424-1/-mr-10000
CURRENT_USER: 
  hdfs
OPERATION: 
  QUERY
AUTHORIZATION_FAILURES: 
  Permission denied: Principal [name=hdfs, type=USER] does not have following privileges for operation QUERY [[SELECT] on Object [type=TABLE_OR_VIEW, name=temp.user_act_info], [SELECT] on Object [type=TABLE_OR_VIEW, name=temp.user_info_all]]

從上面的信息可知：

上面案例的數據來源是temp資料庫中的 user_info_all表和user_act_info表；

數據的輸出路徑是hdfs://nameservice1/tmp/hive/hdfs/a88cc133-c310-4129-bfa0-28011ac23904/hive_2023-06-07_19-42-55_464_2777807904847671424-1/-mr-10000；

當前的操作用戶是hdfs，操作是查詢（QUERY）；

觀察上面的信息我們還會看到AUTHORIZATION_FAILURES信息，提示對當前的輸入沒有查詢許可權，但如果運行上面的SQL的話也能夠正常運行。為什麼會出現這種情況？Hive在預設不配置許可權管理的情況下不進行許可權驗證，所有的用戶在Hive裡面都是超級管理員，即使不對特定的用戶進行賦權，也能夠正常查詢。

通過上面對explain相關參數的介紹，可以發現explain中有很多值得我們去研究的內容，讀懂 explain 的執行計劃有利於我們優化Hive SQL，同時也能提升我們對SQL的掌控力。

下一期：Hive執行計劃之什麼是hiveSQL向量化模式及優化詳解

按例，歡迎點擊此處關註我的個人公眾號，交流更多知識。

後臺回覆關鍵字 hive，隨機贈送一本魯邊備註版珍藏大數據書籍。