Hive official documentation: Home-UserDocumentation
Hive DML official documentation: LanguageManual DML
Reference article: Hive 用戶指南 (Hive User Guide)
1. Loading files into tables
When performing a Load operation, Hive does not do any data transformation; it is a pure copy/move operation that places the data file into the location corresponding to the Hive table.
Syntax
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
Example
# Load local data into tables
# See Hive-1.2.1_03_DDL操作
load data local inpath '/app/software/hive/t_sz05_buck.dat' into table t_sz05;    # load the data
load data local inpath '/app/software/hive/t_sz03_part.dat' into table t_sz03_part partition (dt='20180711', country='CN');
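When LOCAL is omitted, 'filepath' is taken from HDFS and the file is moved (not copied) into the table's directory; OVERWRITE replaces the table's existing data. A minimal sketch, with a hypothetical HDFS source path:

# hypothetical HDFS source path; without LOCAL the file is moved into the table's warehouse directory
load data inpath '/user/yun/input/t_sz05_buck.dat' overwrite into table t_sz05;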
2. Inserting data into Hive Tables from queries
Query results can be inserted into a table with the INSERT clause.
Syntax
# Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

# Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

# Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
Example
# Create the table
create table t_sz10 (id int, name string)
row format delimited fields terminated by ',';

# Steps
0: jdbc:hive2://mini01:10000> select * from t_sz02_ext;    # the source table to query
+----------------+------------------+--+
| t_sz02_ext.id  | t_sz02_ext.name  |
+----------------+------------------+--+
| 1              | 劉晨             |
| 2              | 王敏             |
| 3              | 張立             |
| 4              | 劉剛             |
| 5              | 孫慶             |
| 6              | 易思玲           |
| 7              | 李娜             |
| 8              | 夢圓圓           |
| NULL           | NULL             |
+----------------+------------------+--+
9 rows selected (0.099 seconds)
0: jdbc:hive2://mini01:10000> insert into table t_sz10 select id, name from t_sz02_ext where id < 5;
………………    # MapReduce
No rows affected (16.029 seconds)
0: jdbc:hive2://mini01:10000> select * from t_sz10;    # the data has been inserted
+------------+--------------+--+
| t_sz10.id  | t_sz10.name  |
+------------+--------------+--+
| 1          | 劉晨         |
| 2          | 王敏         |
| 3          | 張立         |
| 4          | 劉剛         |
+------------+--------------+--+
4 rows selected (0.092 seconds)
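The example above is a static insert. In the dynamic-partition form from the syntax above, the partition values come from the query itself. A minimal sketch, assuming a hypothetical non-partitioned staging table t_sz03_part_src whose last two selected columns supply the dt and country partition values of t_sz03_part (the partitioned table used in the next section):

# dynamic partition inserts require these settings; nonstrict mode is needed when no static partition value is given
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table t_sz03_part partition (dt, country)
select id, name, dt, country from t_sz03_part_src;    # t_sz03_part_src is hypothetical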
3. Writing data into the filesystem from queries
Export data based on query results. With LOCAL, the output is written to the local filesystem; without LOCAL, it is written to HDFS.
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...


row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
    [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
    [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
Example 1
### This is a partitioned table
0: jdbc:hive2://mini01:10000> select * from t_sz03_part;
+-----------------+-------------------+-----------------+----------------------+--+
| t_sz03_part.id  | t_sz03_part.name  | t_sz03_part.dt  | t_sz03_part.country  |
+-----------------+-------------------+-----------------+----------------------+--+
| 1               | 張三_20180711     | 20180711        | CN                   |
| 2               | lisi_20180711     | 20180711        | CN                   |
| 3               | Wangwu_20180711   | 20180711        | CN                   |
| 11              | Tom_20180711      | 20180711        | US                   |
| 12              | Dvid_20180711     | 20180711        | US                   |
| 13              | cherry_20180711   | 20180711        | US                   |
| 1               | 張三_20180712     | 20180712        | CN                   |
| 2               | lisi_20180712     | 20180712        | CN                   |
| 3               | Wangwu_20180712   | 20180712        | CN                   |
| 11              | Tom_20180712      | 20180712        | US                   |
| 12              | Dvid_20180712     | 20180712        | US                   |
| 13              | cherry_20180712   | 20180712        | US                   |
+-----------------+-------------------+-----------------+----------------------+--+
12 rows selected (0.543 seconds)
Export 1
### Export 1: if the target directory does not exist, it is created
0: jdbc:hive2://mini01:10000> insert overwrite local directory '/app/software/hive/export/t_sz03_part_exp.dat'
0: jdbc:hive2://mini01:10000> select a.* from t_sz03_part a;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:2
INFO : Submitting tokens for job: job_1531701073794_0001
INFO : The url to track the job: http://mini02:8088/proxy/application_1531701073794_0001/
INFO : Starting Job = job_1531701073794_0001, Tracking URL = http://mini02:8088/proxy/application_1531701073794_0001/
INFO : Kill Command = /app/hadoop/bin/hadoop job  -kill job_1531701073794_0001
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
INFO : 2018-07-16 09:42:32,888 Stage-1 map = 0%,  reduce = 0%
INFO : 2018-07-16 09:42:43,496 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.87 sec
INFO : 2018-07-16 09:42:44,575 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.58 sec
INFO : MapReduce Total cumulative CPU time: 6 seconds 580 msec
INFO : Ended Job = job_1531701073794_0001
INFO : Copying data to local directory /app/software/hive/export/t_sz03_part_exp.dat from hdfs://mini01:9000/tmp/hive/yun/38de38d6-11fc-4090-957d-21d983d9df04/hive_2018-07-16_09-42-16_386_323439967845595583-1/-mr-10000
INFO : Copying data to local directory /app/software/hive/export/t_sz03_part_exp.dat from hdfs://mini01:9000/tmp/hive/yun/38de38d6-11fc-4090-957d-21d983d9df04/hive_2018-07-16_09-42-16_386_323439967845595583-1/-mr-10000
No rows affected (29.35 seconds)

# Data exported to the local filesystem, with no visible field delimiter (the default delimiter \001 is non-printing)
[yun@mini01 t_sz03_part_exp.dat]$ pwd
/app/software/hive/export/t_sz03_part_exp.dat
[yun@mini01 t_sz03_part_exp.dat]$ ll
total 8
-rw-r--r-- 1 yun yun 176 Jul 16 09:42 000000_0
-rw-r--r-- 1 yun yun 176 Jul 16 09:42 000001_0
[yun@mini01 t_sz03_part_exp.dat]$ cat 000000_0
1張三_2018071120180711CN
2lisi_2018071120180711CN
3Wangwu_2018071120180711CN
11Tom_2018071220180712US
12Dvid_2018071220180712US
13cherry_2018071220180712US
[yun@mini01 t_sz03_part_exp.dat]$ cat 000001_0
11Tom_2018071120180711US
12Dvid_2018071120180711US
13cherry_2018071120180711US
1張三_2018071220180712CN
2lisi_2018071220180712CN
3Wangwu_2018071220180712CN
Export 2
# Export 2: with a field delimiter
0: jdbc:hive2://mini01:10000> insert overwrite local directory '/app/software/hive/export/t_sz03_part_exp2.dat'
0: jdbc:hive2://mini01:10000> row format delimited fields terminated by ','
0: jdbc:hive2://mini01:10000> select a.* from t_sz03_part a;
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:2
INFO : Submitting tokens for job: job_1531701073794_0002
INFO : The url to track the job: http://mini02:8088/proxy/application_1531701073794_0002/
INFO : Starting Job = job_1531701073794_0002, Tracking URL = http://mini02:8088/proxy/application_1531701073794_0002/
INFO : Kill Command = /app/hadoop/bin/hadoop job  -kill job_1531701073794_0002
INFO : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
INFO : 2018-07-16 09:49:23,516 Stage-1 map = 0%,  reduce = 0%
INFO : 2018-07-16 09:49:34,985 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.2 sec
INFO : 2018-07-16 09:49:36,245 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 6.49 sec
INFO : MapReduce Total cumulative CPU time: 6 seconds 490 msec
INFO : Ended Job = job_1531701073794_0002
INFO : Copying data to local directory /app/software/hive/export/t_sz03_part_exp2.dat from hdfs://mini01:9000/tmp/hive/yun/38de38d6-11fc-4090-957d-21d983d9df04/hive_2018-07-16_09-49-09_419_2948346934380749234-1/-mr-10000
INFO : Copying data to local directory /app/software/hive/export/t_sz03_part_exp2.dat from hdfs://mini01:9000/tmp/hive/yun/38de38d6-11fc-4090-957d-21d983d9df04/hive_2018-07-16_09-49-09_419_2948346934380749234-1/-mr-10000
No rows affected (27.983 seconds)

# Data exported to the local filesystem, separated by commas (,)
[yun@mini01 t_sz03_part_exp2.dat]$ pwd
/app/software/hive/export/t_sz03_part_exp2.dat
[yun@mini01 t_sz03_part_exp2.dat]$ ll
total 8
-rw-r--r-- 1 yun yun 176 Jul 16 09:49 000000_0
-rw-r--r-- 1 yun yun 176 Jul 16 09:49 000001_0
[yun@mini01 t_sz03_part_exp2.dat]$ cat 000000_0
1,張三_20180711,20180711,CN
2,lisi_20180711,20180711,CN
3,Wangwu_20180711,20180711,CN
11,Tom_20180712,20180712,US
12,Dvid_20180712,20180712,US
13,cherry_20180712,20180712,US
[yun@mini01 t_sz03_part_exp2.dat]$ cat 000001_0
11,Tom_20180711,20180711,US
12,Dvid_20180711,20180711,US
13,cherry_20180711,20180711,US
1,張三_20180712,20180712,CN
2,lisi_20180712,20180712,CN
3,Wangwu_20180712,20180712,CN
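Without LOCAL, the same statement writes the result files to an HDFS directory instead of the local filesystem. A minimal sketch, with a hypothetical HDFS output path:

# export to HDFS (no LOCAL); the target directory is overwritten
insert overwrite directory '/user/yun/export/t_sz03_part_exp3'
row format delimited fields terminated by ','
select a.* from t_sz03_part a;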
4. Insert
Syntax
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]

Where values_row is:
( value [, value ...] )
This is just a regular INSERT statement.
Example 1
# Table definition
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
  CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;

# The INSERT statement; the insert runs a MapReduce job
INSERT INTO TABLE students
  VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);

# Query the result
0: jdbc:hive2://mini01:10000> select * from students;
+------------------+---------------+---------------+--+
|  students.name   | students.age  | students.gpa  |
+------------------+---------------+---------------+--+
| fred flintstone  | 35            | 1.28          |
| barney rubble    | 32            | 2.32          |
+------------------+---------------+---------------+--+
2 rows selected (0.241 seconds)
Example 2
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
  PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;

INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
  VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);


# Query the result
0: jdbc:hive2://mini01:10000> select * from pageviews;
+-------------------+-----------------+----------------------+----------------------+--+
| pageviews.userid  | pageviews.link  | pageviews.came_from  | pageviews.datestamp  |
+-------------------+-----------------+----------------------+----------------------+--+
| jsmith            | mail.com        | sports.com           | 2014-09-23           |
| jdoe              | mail.com        | NULL                 | 2014-09-23           |
+-------------------+-----------------+----------------------+----------------------+--+
2 rows selected (0.123 seconds)
5. Select
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
  [LIMIT [offset,] rows]
Notes:
1. order by performs a global sort of the input, so there is only one reducer; when the input is large, this can take a long time to compute.
2. sort by is not a global sort; it sorts data before it enters each reducer. So if you sort with sort by and set mapred.reduce.tasks > 1, sort by only guarantees that each reducer's output is ordered, not that the overall output is globally ordered.
3. distribute by (column) distributes rows to different reducers according to the specified column, using a hash of that column.
4. cluster by (column) has the functionality of distribute by and additionally sorts by that column.
So when the distribute and sort columns are the same, cluster by = distribute by + sort by.
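A minimal sketch of the difference, reusing the t_sz01 table that appears in the join examples below; the reducer count is only for illustration:

set mapred.reduce.tasks=3;

# each of the 3 reducers receives rows hashed on id and sorts only its own output; the result is not globally ordered
select * from t_sz01 distribute by id sort by id;

# equivalent to the statement above, since it distributes and sorts on the same column
select * from t_sz01 cluster by id;

# globally ordered output, produced by a single reducer
select * from t_sz01 order by id;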
The biggest benefit of a bucketed table is that it improves the efficiency of join operations.
(Consider this question:
select a.id,a.name,b.addr from a join b on a.id = b.id;
if tables a and b are already bucketed, and the bucketing column is id,
does this join still need a Cartesian product over the full tables?)
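A sketch of the idea: if both sides are bucketed on the join key id into compatible bucket counts, Hive can consider a bucket map join, where each bucket of one table is joined only against the matching bucket of the other instead of scanning the whole table. The table names a_buck and b_buck below are hypothetical, and the exact conditions under which the optimizer applies the bucket map join vary by Hive version:

# hypothetical tables bucketed on the join key
CREATE TABLE a_buck (id INT, name STRING)
  CLUSTERED BY (id) INTO 4 BUCKETS
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE b_buck (id INT, addr STRING)
  CLUSTERED BY (id) INTO 4 BUCKETS
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

set hive.enforce.bucketing=true;        # make INSERT ... SELECT respect the bucket definition (Hive 1.x)
set hive.optimize.bucketmapjoin=true;   # allow matching buckets to be joined pairwise

select a.id, a.name, b.addr
from a_buck a join b_buck b on a.id = b.id;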
5.1. Join
Two tables
SELECT a.* FROM a JOIN b ON (a.id = b.id);
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department);
SELECT a.* FROM a LEFT OUTER JOIN b ON (a.id <> b.id);

Example:
select * from t_sz01 a join t_sz05 b on a.id = b.id;
Three tables
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2);

Example:
select * from t_sz01 a join t_sz05 b on a.id = b.id join t_sz03_part c on a.id = c.id;
6. Update
Syntax
UPDATE tablename SET column = value [, column = value ...] [WHERE expression]
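UPDATE (and the DELETE below) only works on tables that support ACID transactions. A minimal sketch, assuming the transaction manager is configured; the table t_sz_acid and its values are hypothetical, and other ACID prerequisites (compactor settings on the metastore, hive.enforce.bucketing on Hive 1.x) are assumed to be in place:

# session settings assumed for ACID
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

# an ACID table must be bucketed, stored as ORC, and marked transactional
CREATE TABLE t_sz_acid (id INT, name STRING)
  CLUSTERED BY (id) INTO 2 BUCKETS STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

UPDATE t_sz_acid SET name = 'zhangsan_new' WHERE id = 1;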
7. Delete
Syntax
DELETE FROM tablename [WHERE expression]
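DELETE has the same ACID requirements as UPDATE. A minimal sketch, continuing the hypothetical t_sz_acid table from the UPDATE example above:

DELETE FROM t_sz_acid WHERE id = 2;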
8. User-Defined Functions (UDFs)
Official documentation: LanguageManual UDF
8.1. Usage example
hive (test_db)> select current_database();
OK
test_db
Time taken: 0.324 seconds, Fetched: 1 row(s)
hive (test_db)> create table dual (id string);    # create the table
OK
Time taken: 0.082 seconds

# the local file to upload
[yun@mini01 hive]$ ll /app/software/hive/dual.dat
-rw-rw-r-- 1 yun yun 2 Jul 16 20:54 /app/software/hive/dual.dat
[yun@mini01 hive]$ cat /app/software/hive/dual.dat
     # only a single space [the file must contain at least one character; it cannot be empty]
hive (test_db)> load data local inpath '/app/software/hive/dual.dat' overwrite into table dual;    # load the data
Loading data to table test_db.dual
Table test_db.dual stats: [numFiles=1, numRows=0, totalSize=2, rawDataSize=0]
OK
Time taken: 0.421 seconds

# function tests
hive (test_db)> select substr('zhangtest', 2, 3) from dual;    # test substr
OK
han
Time taken: 0.081 seconds, Fetched: 1 row(s)
hive (test_db)> select concat('zha', '---', 'kkk') from dual;    # test concat
OK
zha---kkk
Time taken: 0.118 seconds, Fetched: 1 row(s)
8.2. Transform implementation
Hive's TRANSFORM keyword lets you call your own scripts from within SQL.
It is suitable when you need functionality that Hive does not provide and do not want to write a UDF.
Usage example 1: the SQL below uses weekday_mapper.py to process the data.
CREATE TABLE u_data_new (
  movieid INT,
  rating INT,
  weekday INT,
  userid INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (movieid, rate, timestring, uid)
  USING 'python weekday_mapper.py'
  AS (movieid, rating, weekday, userid)
FROM t_rating;
The content of weekday_mapper.py is as follows:
#!/bin/python
import sys
import datetime

# read tab-separated rating records from standard input,
# convert the unix timestamp to a weekday, and write the row back out
for line in sys.stdin:
    line = line.strip()
    movieid, rating, unixtime, userid = line.split('\t')
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print '\t'.join([movieid, rating, str(weekday), userid])