Hive筆記——技術點彙總_ZenDei技術網路在線

目錄 · 概況 · 手工安裝 · 引言 · 創建HDFS目錄 · 創建元資料庫 · 配置文件 · 測試 · 原理 · 架構 · 與關係型資料庫對比 · API · WordCount · 命令 · 數據類型 · 文件存儲格式 · 數據格式 · 資料庫 ...

概況

1. 設計目標：讓精通SQL而Java技能較弱的數據分析師能利用Hadoop數據分析。

2. 使用率：實際開發中，80%操作使用Hive完成，20%使用MapReduce。

3. 與MapReduce相比，Hive特點：穩定、代碼精簡、易於維護。

4. HiveQL：未嚴格實現SQL-92標準。

5. 本質：將HiveQL轉化為一個或多個MapReduce作業併在集群上運行，但並不是所有HiveQL都會轉為MapReduce作業。

手工安裝

引言

1. 假設已成功安裝JDK、Hadoop集群及MySQL。

2. 本次手工安裝在Hadoop的Master節點上。

3. 創建HDFS目錄並分配許可權。

創建HDFS目錄

hadoop fs -mkdir /tmp
hadoop fs -mkdir /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse

創建元資料庫

2. 創建MySQL元資料庫。

create database hive character set latin1;

3. 創建用戶。

grant all on hive.* to 'hive'@'localhost' identified by 'hive';
grant all on hive.* to 'hive'@'%' identified by 'hive';
flush privileges;

配置文件

4. 創建MySQL元資料庫。

tar zxvf apache-hive-1.2.2-bin.tar.gz -C /opt/app
cp mysql-connector-java-5.1.42.jar /opt/app/apache-hive-1.2.2-bin/lib
cd /opt/app/apache-hive-1.2.2-bin
vi conf/hive-site.xml

<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value><![CDATA[jdbc:mysql://centos1:3306/hive?characterEncoding=latin1]]></value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>
</configuration>

測試

5. 配置環境變數。

vi /home/hadoop/.bash_profile

export HIVE_HOME=/opt/app/apache-hive-1.2.2-bin
export PATH=$PATH:$HIVE_HOME/bin
export HADOOP_USER_CLASSPATH_FIRST=true

source /home/hadoop/.bash_profile
env | grep HIVE_HOME

6. 啟動與測試。

hive

hive> show tables;

原理

架構

1. 訪問介面：CLI（命令行）、HWI（Hive網路界面）、Thrift Server提供的JDBC和ODBC。

2. Driver職責：通過訪問介面提交的命令，由Driver模塊解釋和編譯，對計算優化，生成執行計劃執行，執行計劃由若幹MapReduce作業組成。

3. Metastore：Hive元數據存儲，通常是關係型資料庫（例如MySQL、PostgreSQL、Oracle）。

4. 部署：Hive通過與YARN通信來執行MapReduce作業，所以Hive無需部署在YARN集群。

與關係型資料庫對比

	Hive	RDBMS
查詢語言	HiveQL	SQL
數據存儲位置	HDFS	Raw Device或本地文件系統
數據格式	用戶定義	系統決定
數據更新	不支持	支持
索引	無	有
執行	MapReduce	Executor
執行延遲	高	低
伸縮性	高	低
數據規模	大	小
事務	無	一般有

API

WordCount

create table wordcount_in (
  line string
);
load data local inpath '/opt/app/apache-hive-1.2.2-bin/LICENSE' overwrite into table wordcount_in;
create table wordcount_out as
select word, count(1) as count
from (
  select explode(split(line, ' ')) as word
  from wordcount_in
) words
group by word
order by word;
select * from wordcount_out;

命令

1. hive命令

a) 官方文檔

b) 示例

hive -e 'select * from wordcount_out'
hive -f hive-script.sql

2. beeline命令

a) 官方文檔

數據類型

1. 基本數據類型

數據類型	長度	示例
tinyint	1位元組，有符號整數	11
smallint	2位元組，有符號整數	11
int	4位元組，有符號整數	11
bigint	8位元組，有符號整數	11
float	4位元組，單精度浮點數	11.0
double	8位元組，雙精度浮點數	11.0
boolean	布爾值	TRUE
string	字元串	'hadoop'

2. 複雜數據類型

數據類型	長度	示例
struct	結構體，可通過“列名.屬性名”訪問，如struct(first_name string, last_name string)	struct('Harry', 'Potter')
map	鍵值對集合，可通過“列名.鍵”訪問	map('first_name', 'Harry', 'last_name', 'Potter')
array	數組，即相同數據類型的變數集合，可通過“列名[索引]”訪問（索引從0開始）	array('Harry', 'Potter')

3. 更多數據類型參考官方文檔。

文件存儲格式

1. textfile

a) 原理：平面文本文件，預設格式。

b) 特點：未壓縮，磁碟開銷大，數據解析開銷大。

2. sequencefile

a) 原理：二進位格式。

b) 特點：使用方便、可分割、可壓縮，數據按行切分。

3. rcfile

a) 原理：行列存儲結合的存儲方式。

b) 特點：數據按行分塊，保證同一行數據在一個塊上，避免讀取一行需要讀取多個塊。

4. parquet

a) 原理：面向分析性業務的列式存儲格式。

b) 特點：對Impala有顯著性能提升。

數據格式

1. 預設使用少用字元作為分隔符

分隔符	描述
\n	換行符，預設行分隔符
^A（Ctrl+A）	在文本中以八進位編碼\001表示，列分隔符
^B（Ctrl+B）	在文本中以八進位編碼\002表示，array、struct元素分隔符
^C（Ctrl+C）	在文本中以八進位編碼\003表示，map鍵值對分隔符

2. 自定義（此處仍然使用預設分隔符）

create table test_table (
  test_string string,
  test_int int,
  test_array array<string>,
  test_struct struct<test_struct_string:string, test_struct_int:int>,
  test_map map<string, int>
)
row format delimited
fields terminated by '\001'
collection items terminated by '\002'
map keys terminated by '\003'
lines terminated by '\n'
stored as textfile;

資料庫

1. 本質：表的目錄或命名空間。

2. 場景：生產環境如果表較多，則一般使用資料庫將表邏輯分組。

3. 預設資料庫：default。

4. 存儲結構

a) 庫：一個HDFS目錄，如test_database其他為/user/hive/warehouse/test_database.db。

c) default庫：HDFS目錄/user/hive/warehouse。

5. 查看

show databases;
describe database test_database;

6. 創建

create database test_database;
create database test_database location '/test/hive/test_database';

7. 切換

use test_database;

8. 刪除

drop database test_database;
drop database if exists test_database; -- 避免不存在引起的錯誤
drop database test_database cascade; -- 其中的表一起刪除（預設不允許刪除非空資料庫）

數據表

1. 管理表：managed table，數據完全由Hive管理。

2. 外部表：external table，數據除Hive管理外，還會有其他分析工具管理，如Pig。

3. 管理表與外部表區別

a) 刪除：刪除管理表會刪除元數據及其數據；刪除外部表僅刪除元數據。

b) 註意：創建表時是否指定location，管理表、外部表並無區別。

4. 分區表

a) 含義：將表水平切分，數據按某種規則存儲。

b) 分區欄位報錯：分區欄位不能與表定義欄位同名，否則報錯“Column repeated in partitioning columns”。

c) 存儲目錄

/user/hive/warehouse/test_order/year=1999/month=11
/user/hive/warehouse/test_order/year=1999/month=12
/user/hive/warehouse/test_order/year=2000/month=01
/user/hive/warehouse/test_order/year=2000/month=02
/user/hive/warehouse/test_order/year=2000/month=03

d) 查詢性能：查詢大數據集時，根據分區欄位查詢，僅掃描分區目錄下的數據，性能比非分區表的全表掃描有顯著提升。

select * from test_order where year = '2000' and month = '01';

e) 常用分區欄位：生產環境常按創建時間或修改時間分區。

f) strict模式：如果針對分區表的查詢未對分區限制，則禁止提交；修改hive-site.xml參數“hive.mapred.mode”為strict。

5. 查看

a) 表

show tables;
show tables in test_database; -- 指定資料庫
desc test_table; -- 表結構
desc extended test_table; -- 表級別信息
desc formatted test_table; -- 表級別信息

b) 分區

show partitions test_order;

6. 創建

a) 管理表

create table if not exists test_table (
  test_string string,
  test_int int
)
location '/test/hive/test_table';
create table test_table_bak1 like test_table; -- 複製表結構，不複製數據
create table test_table_bak2 as select test_string from test_table;

b) 外部表

create external table test_table (
  test_string string,
  test_int int
);

c) 分區表

create table test_order (
  order_no string,
  balance double
)
partitioned by (year string, month string);

7. 刪除

drop table test_table;
drop table if exists test_table;

8. 修改

a) 重命名

alter table test_table rename to test_table_new;

b) 增加分區

alter table test_order add partition (year='2000', month='04');
alter table test_order add partition (year='2000', month='05') location '/test/hive/test_order/year=2000/month=05';

c) 修改分區路徑

alter table test_order add partition (year='2000', month='05') set location '/test_order/year=2000/month=05';

d) 刪除分區

alter table test_order drop partition (year='2000', month='04');

e) 修改列

alter table test_table change column test_int test_double double comment 'new column' after test_string;

f) 增加列

alter table test_table add columns (
  test_long long,
  test_float float
);

g) 重定義列

alter table test_table replace columns (col1 string, col2 string, col3 string);

裝載數據

1. 以追加方式將HDFS文件移動到表

load data inpath '/test/data' into table test_table;

2. 以覆蓋方式將HDFS文件移動到表

load data inpath '/test/data' overwrite into table test_table;

3. 以追加方式將本地文件複製到表

load data local inpath '/opt/test/data' into table test_table;

4. 以覆蓋方式將本地文件複製到表

load data local inpath '/opt/test/data' overwrite into table test_table;

插入數據

1. 以覆蓋方式插入非分區表

insert overwrite table test_table select * from test_source;

2. 以覆蓋方式插入分區表

insert overwrite table test_order partition (year='2000', month='05') select * from test_source;

3. 以追加方式插入非分區表

insert into table test_table select * from test_source;

4. 以追加方式插入分區表

insert into table test_order partition (year='2000', month='05') select * from test_source;

5. multiple insert

from test_source
insert into table test_order partition (year='2000', month='05') select order_no, balance where order_no >= '0001' and order_no < '1000'
insert into table test_order partition (year='2000', month='06') select order_no, balance where order_no >= '1000' and order_no < '5000'
insert overwrite table test_order partition (year='2000', month='07') select order_no, balance where order_no >= '5000' and order_no < '9000';

6. 動態分區插入

a) 含義：Hive根據查詢結果最後幾個欄位作為分區欄位自動創建分區並插入數據，這些欄位需與分區欄位順序一致。

b) 參數

Configuration Property	Default	Note
hive.exec.dynamic.partition	true	Whether or not to allow dynamic partitions in DML/DDL.
hive.exec.dynamic.partition.mode	strict	In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions. In nonstrict mode all partitions are allowed to be dynamic.
hive.exec.max.dynamic.partitions.pernode	100	Maximum number of dynamic partitions allowed to be created in each mapper/reducer node.
hive.exec.max.dynamic.partitions	1000	Maximum number of dynamic partitions allowed to be created in total.
hive.exec.max.created.files	100000	Maximum number of HDFS files created by all mappers/reducers in a MapReduce job.
hive.error.on.empty.partition	false	Whether to throw an exception if dynamic partition insert generates empty results.

c) 示例

from page_view_stg pvs
insert overwrite table page_view partition(dt='2008-06-08', country)
select pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt;

導出數據

1. 導出到HDFS

insert overwrite directory '/test/hive/data' select * from test_table;

2. 導出到本地磁碟

insert overwrite local directory '/opt/test/data' select * from test_table;

查詢數據

1. select...from...

select test_string, test_int from test_table;

a) 表別名

select t.test_string, t.test_int from test_table t;

b) 嵌套查詢

select s.name, sc.course from (select id, name from student_info) s join (select id, course from student_class_info) sc on s.id = sc.id;

c) 限制返回行數

select * from test_table limit 100;

d) case...when...then...

select id, name, case when gender = 'm' then '男' when gender = 'f' then '女' else '保密' end
from student_info;

2. where

select * from student_info where gender = 'm';

3. group by與having

select gender, count(id) from student_info group by gender;

4. inner join

select s.id, s.name from student_info s inner join student_class_info sc on s.id = sc.id;
select *
from t1
join t2 on t1.id = t2.id
join t3 on t1.id = t3.id;

5. left outer join

select s.id, s.name from student_info s left outer join student_class_info sc on s.id = sc.id;

6. right outer join

select s.id, s.name from student_info s right outer join student_class_info sc on s.id = sc.id;

7. full outer join

select s.id, s.name from student_info s full outer join student_class_info sc on s.id = sc.id;

8. left-semi join

a) left-semi join效果：如果左表數據對於右表滿足on條件，則返回左表數據。

b) left-semi join場景：替代標準SQL的in操作。Hive不支持如下in操作：

select s.id, s.name from student_info s where s.id in (select sc.id from student_class_info sc);

c) left-semi join原理：當左表的一條數據一旦滿足右表的on條件，則停止掃描該左表數據對應的右表，所以left-semi join效率高於inner join。

d) 示例：效果相同，但前者效率高於後者。

select s.id, s.name from student_info s left semi join student_class_info sc on s.id = sc.id;
select s.id, s.name from student_info s inner join student_class_info sc on s.id = sc.id;

9. map-side join

a) map-side join原理：如果有一張小表，則將該小表載入到記憶體，直接在map階段進行join，效率大大提升。

b) 非map-side join原理：通過在reduce階段求笛卡爾積進行join，效率相對較低。

c) 示例

select /*+ mapjoin(s) */ s.name, sc.course from student_info s join student_class_info sc on s.id = sc.id;

10. order by與sort by

a) order by：全局有序。

b) sort by：每個Reduer局部有序。

c) 示例

select * from student_info order by id asc, name desc;
select * from student_info sort by id asc, name desc;

11. distribute by與cluster by

a) distribute by：相當於MapReduce的Partitioner，即控制中間結果分發的目的Reducer。

b) cluster by：如果distribute by和sort by的列相同且正序，可用cluster by替代。

c) 示例

select * from student_info distribute by id sort by id asc, name desc;

12. 抽樣

a) 公式： “bucket x out of y on z”，根據z列的哈希值將數據分為y桶，取其中第x桶。

b) 分桶抽樣

select * from student_info tablesample(bucket 3 out of 10 on id);

c) 隨機抽樣

select * from student_info tablesample(bucket 3 out of 10 on rand());

13. union all

a) 註意：Hive不支持直接union all，須嵌套查詢。

b) 示例

select s.id, s.name
from (
  select s1.id, s1.name from student_info s1
  union all
  select s2.id, s2.name from student_info s2
) s;

函數

1. 標準函數

a) 含義：操作一行數據，產生一個值。

b) 內置函數參考官方文檔。

2. 聚合函數

a) 含義：操作多行數據，產生一個值。

b) 內置函數參考官方文檔。

3. 表生成函數

a) 含義：操作一行數據，產生多行數據。

b) 內置函數參考官方文檔。

4. 自定義函數開發

a) Maven依賴

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.2</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.5</version>
</dependency>

b) 使用方法：註冊jar文件“add jar /opt/test/test-hive.jar;”；創建函數“create temporary function concat_udf as 'hive.ConcatUDF';”；查看“show functions”。

5. 自定義標準函數（UDF）

a) 方法：繼承UDF，並實現evaluate()方法（不是Override，因為無法預知參數）。

b) 示例

 1 package hive;
 2 
 3 import org.apache.hadoop.hive.ql.exec.UDF;
 4 import org.apache.hadoop.io.Text;
 5 
 6 public class ConcatUDF extends UDF {
 7     
 8     private Text result = new Text();
 9 
10     public Text evaluate(Text str1, Text str2) {
11         if (str1 == null || str2 == null) {
12