DDL (Data Definition Language) operations

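　　Hive ships with one built-in database named default.

　　　　create database [if not exists] <database name>; -- create a database

　　　　show (databases | schemas); -- list all databases

　　　　drop database [if exists] <database name> [restrict|cascade]; -- drop a database. By default Hive refuses to drop a database that still contains tables; drop the tables first, otherwise the statement fails.
　　　　-- Adding the cascade keyword force-drops the database together with its tables. The default is restrict, which enforces the check above.
　　　　　　e.g. hive> drop database if exists users cascade;

　　　　use <database name>; -- switch to a database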
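
　　The statements above can be strung together into a short session. This is only a sketch: demo_db is a hypothetical database name chosen for illustration, not one used elsewhere in these notes.

```sql
-- demo_db is a hypothetical name; any valid identifier works
create database if not exists demo_db comment 'scratch database for DDL practice';
show databases;
use demo_db;
describe database demo_db;

-- switch away first, then drop the database together with any tables left in it
use default;
drop database if exists demo_db cascade;
```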

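　　Display commands

　　　　show tables; -- list all tables in the current database

　　　　show partitions table_name; -- list a table's partitions; fails if the table is not partitioned

　　　　show functions; -- list all functions supported by the current Hive version

　　　　desc extended table_name; -- show detailed table information

　　　　desc formatted table_name; -- show detailed table information (formatted for readability)

　　　　describe database database_name; -- show information about a database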

　　Creating tables

　　　　create [external] table [if not exists] table_name
　　　　　　[(col_name data_type [comment col_comment], ...)]
　　　　　　[comment table_comment]
　　　　　　[partitioned by (col_name data_type [comment col_comment], ...)]
　　　　　　[clustered by (col_name, col_name, ...)
　　　　　　[sorted by (col_name [asc|desc], ...)] into num_buckets buckets]
　　　　　　[row format row_format]
　　　　　　[stored as file_format]
　　　　　　[location hdfs_path]

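　　　　Notes:

　　　　　　1. create table creates a table with the given name. If a table with the same name already exists, an exception is thrown; the if not exists option suppresses it.

　　　　　　2. The external keyword creates an external table, and the table definition points at the actual data through a path (LOCATION).

　　　　　　When data is loaded into a managed (internal) table, Hive moves it into the warehouse directory; for an external table Hive only records where the data lives and leaves it in place. Dropping a managed table deletes both its metadata and its data, whereas dropping an external table deletes only the metadata and keeps the data.

　　　　　　3. like copies the structure of an existing table without copying its data.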

　　　　　　create [external] table [if not exists] [db_name.]table_name like existing_table;
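
　　　　　　A minimal sketch of like, assuming a table named student already exists (one is created later in these notes); student_copy is a hypothetical name.

```sql
-- copy only the structure of student; student_copy is a hypothetical name
create table if not exists student_copy like student;

-- the column list matches student
desc formatted student_copy;

-- no rows were copied, so the new table is empty
select count(*) from student_copy;
```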

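　　　　　　4. row format delimited (specifying delimiters)

　　　　　　　　[fields terminated by char]

　　　　　　　　[collection items terminated by char]

　　　　　　　　[map keys terminated by char]

　　　　　　　　[lines terminated by char] | serde serde_name

　　　　　　　　[with serdeproperties(property_name=property_value, property_name=property_value,...)]

　　　　　　When no delimiter is given at table creation, Hive uses '\001' as the default field delimiter, so files loaded into the table must be delimited by '\001'. If a file uses a different delimiter, Hive does not raise an error, but queries on the table return NULL for the columns it cannot parse (typically all of them).

　　　　　　In vi, press Ctrl+v followed by Ctrl+a to type '\001'; it is displayed as ^A.

　　　　　　SerDe is short for Serializer/Deserializer; it handles serialization and deserialization.

　　　　　　How Hive reads a file: it first calls the InputFormat (TextInputFormat by default), which returns records one at a time (by default one line is one record). It then calls the Deserializer of the SerDe (LazySimpleSerDe by default) to split each record into fields (on '\001' by default).

　　　　　　How Hive writes a file: writing a Row goes through the SerDe's Serializer and the OutputFormat, in the reverse order of reading. Run desc formatted table_name; to see which classes a table uses. When the data format is unusual, you can write a custom SerDe.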
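
　　　　　　As an illustration of the serde branch of the syntax, the sketch below uses the OpenCSVSerde class that ships with Hive (0.14 and later) instead of row format delimited. The table name csv_demo and its columns are hypothetical. Note that this SerDe exposes every column as string, whatever type is declared.

```sql
-- csv_demo and its columns are hypothetical, for illustration only
create table csv_demo (id string, name string, note string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
  "separatorChar" = ",",    -- field separator
  "quoteChar"     = "\"",   -- fields may be wrapped in double quotes
  "escapeChar"    = "\\"    -- backslash escapes special characters
)
stored as textfile;

-- confirm which SerDe, InputFormat and OutputFormat the table uses
desc formatted csv_demo;
```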

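　　　　　　5. partitioned by (partitioning)

　　　　A Hive select normally scans the whole table, which wastes time when only part of the data is of interest. Partitions are declared at table creation so that a query can scan just the relevant portion.

　　　　A partitioned table is one that declares partition columns when it is created. A table can have one or more partitions, and each partition is stored as its own directory under the table's directory. Table and column names are case-insensitive. A partition column appears in the table structure and is listed by describe table_name, but it holds no data in the files; it only identifies the partition.

　　　　　　6. stored as sequencefile|textfile|rcfile

　　　　If the file data is plain text, use stored as textfile. If the data needs to be compressed, use stored as sequencefile.

　　　　textfile is the default file format; use the delimited clause to read delimited files.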
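
　　　　A sketch of the sequencefile case, assuming the text-format student table created later in these notes; student_seq is a hypothetical name. A plain text file cannot simply be loaded into a sequencefile table, so the data is written through insert ... select, which lets Hive convert the format.

```sql
-- student_seq is a hypothetical name; the columns mirror the student table
create table student_seq (Sno int, Sname string, Sex string, Sage int, Sdept string)
stored as sequencefile;

-- ask Hive to compress the files it writes
set hive.exec.compress.output=true;

-- rewrite the text data as (compressed) sequence files
insert overwrite table student_seq
select * from student;
```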

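　　　　　　7. clustered by ... into num_buckets buckets (bucketing)

　　　　Each table or partition can be further organized into buckets, a finer-grained division of the data. Buckets are organized on a column: Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into.

　　　　There are two reasons to organize a table (or partition) into buckets:

　　　　(1) More efficient queries. Buckets add extra structure to a table, and Hive can exploit it for some queries. Specifically, a join of two tables that are bucketed on the same columns, where those columns include the join columns, can be implemented efficiently as a map-side join. Because rows with the same join key sit in corresponding buckets, only matching buckets need to be joined, which greatly reduces the amount of data the join handles.

　　　　(2) More efficient sampling. When working with large data sets, it is very convenient to try a query on a small portion of the data while the query is still being developed and revised.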
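
　　　　To illustrate point (2), the sketch below samples one bucket with tablesample. It assumes the stu_buck table defined later in these notes, which is bucketed into 4 buckets on Sno.

```sql
-- read only the first of the 4 buckets, roughly a quarter of the rows
select * from stu_buck tablesample(bucket 1 out of 4 on Sno);

-- a coarser sample: treat the 4 buckets as 2 groups and read the first group
select * from stu_buck tablesample(bucket 1 out of 2 on Sno);
```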

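　　For a managed (internal) table, the file must sit under the table's directory for the mapping to succeed:

　　/user/hive/warehouse/xxx.db
　　The table must also declare the delimiter that the file actually uses:

　　row format delimited fields terminated by ","
　　Finally, the order and types of the table's columns must match the structured file.

　　Local mode

set hive.exec.mode.local.auto=true; -- let Hive run small jobs locally instead of on the cluster

　Hive delimiters:

　　　　row format delimited (Hive's built-in delimiter handling) | serde (a custom or other SerDe class)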

create table day_table (id int, content string) partitioned by (dt string) row format delimited fields terminated by ','; -- partitioned table with an explicit field delimiter

　　　Specifying delimiters for tables with complex types

create table complex_array(name string,work_locations array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';
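
A short sketch of querying the array column. The sample row in the comment is hypothetical; it only shows the layout that the delimiters above expect.

```sql
-- hypothetical input line (<TAB> separates fields, commas separate array items):
--   zhangsan<TAB>beijing,shanghai,tianjin
select name,
       work_locations[0]    as first_location,   -- array indexes start at 0
       size(work_locations) as location_count    -- number of items in the array
from complex_array;
```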

Sample data for the t_map table:
1,zhangsan,唱歌:非常喜歡-跳舞:喜歡-游泳:一般般
2,lisi,打游戲:非常喜歡-籃球:不喜歡

create table t_map(id int,name string,hobby map<string,string>)
    row format delimited 
    fields terminated by ','
    collection items terminated by '-'
    map keys terminated by ':' ;
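
A sketch of loading and querying the map column, assuming the two sample rows above have been saved to a local file. The path /root/hivedata/t_map.txt is a hypothetical location.

```sql
-- hypothetical path holding the two sample rows
load data local inpath '/root/hivedata/t_map.txt' into table t_map;

select id,
       name,
       hobby['唱歌']    as singing,   -- look up one key ("singing"); NULL if the key is absent
       map_keys(hobby)  as hobbies    -- all keys of the map as an array
from t_map;
```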

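　　Partitioned tables (PARTITIONED BY) (creates subdirectories)

　　　　There are two kinds of partitioned tables. A single-partition table has one level of directories under the table directory. A multi-partition table has nested directories under the table directory.

　　　　A partition column is not a real column stored in the table's data files; it is a virtual column whose value only identifies the partition.

　　　　A partition column cannot reuse the name of a column that already exists in the table; otherwise compilation fails.

　　　 Creating a single-partition table:

　　　create table day_table (id int, content string) partitioned by (dt string); -- single-partition table, partitioned by day; the table structure shows three columns: id, content and dt.

　　　　Loading data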

     LOAD DATA local INPATH '/root/hivedata/dat_table.txt' INTO TABLE day_table partition(dt='2017-07-07');

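　　　　Creating a two-level partitioned table:

create table day_hour_table (id int, content string) partitioned by (dt string, hour string); -- two partition levels, by day and by hour; dt and hour are added to the table structure as two extra columns.

　　　　Loading data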

LOAD DATA local INPATH '/root/hivedata/dat_table.txt' INTO TABLE day_hour_table PARTITION(dt='2017-07-07', hour='08');

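　　　　Querying by partition:

SELECT day_table.* FROM day_table WHERE day_table.dt = '2017-07-07';

　　　　Listing partitions:

show partitions day_hour_table;

　　　　　　In short, partitions assist queries: they narrow the range scanned, speed up data retrieval, and let data be managed by well-defined rules and conditions.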
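
　　　　The same idea applies to the two-level table. The sketch below filters on both partition columns of day_hour_table, so Hive only needs to read the one subdirectory that was loaded earlier (dt=2017-07-07/hour=08 under the table directory).

```sql
-- both predicates are on partition columns, so only one partition directory is scanned
select id, content
from day_hour_table
where dt = '2017-07-07' and hour = '08';

-- filtering on the first level only reads every hour directory under that day
select count(*)
from day_hour_table
where dt = '2017-07-07';
```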

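　　Bucketed tables (clustered by ... into num_buckets buckets)

A bucketed table divides the data more finely at the file level.
The definition must name the column to bucket on.
The number of buckets should match the number of reducers you set when loading the table.
Bucketing speeds up joins by reducing the size of the Cartesian (cross) product that has to be evaluated.

　　　　　　-- enforce bucketing (the number of buckets is the number of output files)

set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4; -- same as the number of buckets; using fewer reducers does not break bucketing but slows the job down

　　　　TRUNCATE TABLE stu_buck; -- empty the table if it already holds data

　　　　drop table stu_buck; -- or drop an earlier version before recreating it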

create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string)
clustered by(Sno) 
into 4 buckets
row format delimited
fields terminated by ',';

　　Loading data into a bucketed table

insert overwrite table stu_buck
select * from student cluster by(Sno); -- load through a query on a staging table; the two tables should have the same structure


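　　Bucketing and sorting in queries: cluster by, sort by, distribute by

select * from student cluster by(Sno);

insert overwrite table stu_buck
select * from student cluster by(Sno) sort by(Sage); -- fails: cluster by and sort by cannot be used together

　　Bucket on one column while sorting on another

insert overwrite table stu_buck
select * from student distribute by(Sno) sort by(Sage asc);

　　distribute by splits the data into groups on the given column; the number of groups is determined by set mapreduce.job.reduces=?

　　When the distribution column and the sort column differ, use distribute by(xxx) sort by(yyy).

　　order by sorts globally on the given column. Whatever mapreduce.job.reduces is set to, the job runs with a single reducer, because a global sort requires all the data to pass through one reduce task into one file.

　　Summary:
　　　　cluster by (distribute and sort on the same column) == distribute by + sort by (which may use different columns)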
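
　　The relationship in the summary can be seen side by side in the sketch below, which uses the student table from these notes. The reducer count of 4 is just an example value.

```sql
set mapreduce.job.reduces=4;   -- example value: rows are spread over 4 reducers

-- these two queries are equivalent: distribute on Sno, then sort each reducer's rows by Sno
select * from student cluster by Sno;
select * from student distribute by Sno sort by Sno;

-- different columns: group rows by department, sort by age (descending) within each reducer
select * from student distribute by Sdept sort by Sage desc;

-- global ordering: runs with a single reducer regardless of the setting above
select * from student order by Sage;
```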

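　　Managed (internal) and external tables

　　　　Creating a managed table (to map a file onto it, upload the file to the table's fixed directory in HDFS; dropping the table removes the metadata and also deletes the files under Hive's path in HDFS)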

create table student(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',';

　　　　Creating an external table (the files can be anywhere in HDFS, but location must point at their path)

create external table student_ext(Sno int,Sname string,Sex string,Sage int,Sdept string) row format delimited fields terminated by ',' location '/stu';

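　　　　Loading data into managed and external tables:

　　　　　　With local, the data comes from the local file system of the machine running the hiveserver2 service.

　　　　　Without local, the data is somewhere in HDFS.

　　　　　　If the data is local, load copies it.
　　　　　　If the data is in HDFS, load moves it.

load data local inpath '/root/hivedata/students.txt' overwrite into table student; -- when connecting through hiveserver2 (for example with beeline), local refers to the server running hiveserver2, not the client machine (in production the two are usually different machines)

load data inpath '/stu' into table student_ext; -- a file already in HDFS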

Altering tables

　　　　Adding partitions:

　　　　　　alter table table_name add partition (dt='20170101') location '/user/hadoop/warehouse/table_name/dt=20170101'; -- add one partition

　　　　　　alter table table_name add partition (dt='2008-08-08', country='us') location '/path/to/us/part080808' partition (dt='2008-08-09', country='us') location '/path/to/us/part080809'; -- add several partitions at once

　　　　Dropping partitions

　　　　　　alter table table_name drop if exists partition (dt='2008-08-08');

　　　　　　alter table table_name drop if exists partition (dt='2008-08-08', country='us');

　　　　Renaming a partition

　　　　　　alter table table_name partition (dt='2008-08-08') rename to partition (dt='20080808');

　　　　Adding columns

　　　　　　alter table table_name add|replace columns (col_name string);

　　　　Note: add appends new columns after all existing columns (but before the partition columns).

　　　　　　replace replaces all of the table's columns.
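
　　　　The difference between add and replace is easiest to see on a throwaway table. In this sketch col_demo and its columns are hypothetical names.

```sql
-- col_demo is a hypothetical table used only for this illustration
create table col_demo (id int, name string);

-- add: the new column is appended after the existing ones
alter table col_demo add columns (age int comment 'added later');
desc col_demo;   -- columns are now id, name, age

-- replace: the whole column list is swapped for the new one
alter table col_demo replace columns (id int, full_name string);
desc col_demo;   -- columns are now id, full_name
```

　　　　Both statements change only the table's metadata; the data files on disk are not rewritten.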

　　　　Changing columns

　　　　create table test_change (a int, b int, c int);

　　　　　　alter table test_change change a a1 int; -- rename column a to a1

　　　　-- Starting again from (a int, b int, c int): rename column a to a1, change its type to string, and put it after column b. The new table's structure is: b int, a1 string, c int

　　　　　　alter table test_change change a a1 string after b;

　　　　-- Starting again from (a int, b int, c int): rename column b to b1 and make it the first column. The new table's structure is: b1 int, a int, c int

　　　　　　alter table test_change change b b1 int first;

　　　　Renaming a table

　　　　　　alter table table_name rename to new_table_name;
