Hadoop Study Notes (7): Installing Hive, Using Its Command Line, and Accessing It from Java

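Hive maps files stored in HDFS onto database-style tables.

The SQL statements you run in Hive are not executed inside a real database, though: Hive translates them into MapReduce jobs that process the data in HDFS.

Only the table definitions are stored in a MySQL database, which does nothing more than record them.

So one machine in your cluster needs to have MySQL installed.

Hive itself can be installed on any machine.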

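Unpack the archive: tar -zxvf xxxxx -C apps

Then go into the conf directory under the unpacked directory.

Create a hive-site.xml file there.

In it, tell Hive where MySQL is, which JDBC driver to use, and the user name and password (the javax.jdo.option.ConnectionURL, ConnectionDriverName, ConnectionUserName, and ConnectionPassword properties).

Then go into the lib directory and put the MySQL JDBC driver jar there.

Next, start Hive.

Both Hadoop and Hive need their environment variables configured.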

echo $PATH          # shows the configured search path

echo $HADOOP_HOME   # shows one specific variable

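It is also best to start HDFS and YARN first.

Then run bin/hive from the installation directory to start Hive.

The default database is named default.

Creating a database or a table creates a corresponding directory in HDFS.

To load data into a table, prepare a file whose fields are separated by ^A (Ctrl-A) and upload it into the table's directory in HDFS: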

hadoop fs -put stu.info /user/hive/warehouse/t_big24/
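Typing ^A by hand is awkward, so one option is to generate the data file from code. The following is a minimal Java sketch rather than part of the original setup: it assumes a two-column table (id, name) and made-up sample rows, and writes them to stu.info with the Ctrl-A character (\u0001) between fields. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Sketch: write rows whose fields are separated by Ctrl-A (\u0001),
// the delimiter Hive assumes when a table has no "row format" clause.
class CtrlADelimitedWriter {
    // Hive's default field delimiter, usually displayed as ^A.
    static final char FIELD_DELIMITER = '\u0001';

    // Joins the fields of one row with the Ctrl-A delimiter.
    static String toLine(String... fields) {
        return String.join(String.valueOf(FIELD_DELIMITER), fields);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample rows (id, name); not taken from the article.
        List<String> lines = Arrays.asList(
                toLine("1", "zhangsan"),
                toLine("2", "lisi"));
        Path out = Paths.get("stu.info");
        Files.write(out, lines, StandardCharsets.UTF_8);
        System.out.println("wrote " + lines.size() + " rows to " + out);
    }
}
```

The resulting stu.info can then be uploaded with the hadoop fs -put command shown above.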

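Showing the current database in the Hive interactive shell

A few basic settings make Hive more convenient to use, for example:

1. Show the current database in the prompt: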

hive>set hive.cli.print.current.db=true;

2. Show column names in query results:

hive>set hive.cli.print.header=true;

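These settings only apply to the current session and are lost once the Hive session is restarted. To make them permanent:

In the current user's home directory on Linux, create a .hiverc file and put the settings in it: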

vi .hiverc

set hive.cli.print.header=true;

set hive.cli.print.current.db=true;

Configuring the Hive environment variables

For example, my Hive is unpacked at /root/apps/hive-1.2.1

vi /etc/profile

Then append the following at the end:

export HIVE_HOME=/root/apps/hive-1.2.1

export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin

Hive can also run as a service that clients connect to.

The service listens on port 10000.

Start the Hive service:

bin/hiveserver2

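You can then check on Linux that the port is being listened on: netstat -nltp

Once the service has started, you can connect to it from another node with beeline.

Start the client: bin/beeline

Then connect to the server:

!connect jdbc:hive2://hdp-01:10000

Then enter the user name root; there is no password.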

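Side note: redirecting standard output. On Linux, file descriptor 1 is standard output and 2 is standard error.

./your_script.sh 1>/path/to/output_file 2>/path/to/error_file &

This way nothing is printed to the terminal any more.

/dev/null is a "black hole": whatever is written to it is discarded.

Running bin/hiveserver2 as shown earlier starts the service in the foreground. To start it in the background, use: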

nohup bin/hiveserver2 1>/dev/null 2>&1 &

Prefixing the command with nohup keeps the process running even after the user logs out.

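hive -e "SQL statement"

This runs the statement directly without entering the Hive shell.

Going a step further, such commands can be put into a shell script, which makes it easy to run Hive jobs as scripts and to control and schedule many of them. For example: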

vi t_order_etl.sh

#!/bin/bash

hive -e "select * from db_order.t_order"

hive -e "select * from default.t_user"

hql="create table default.t_bash as select * from db_order.t_order"

hive -e "$hql"

If the HQL to execute is particularly complex, write the statements to a file:

vi x.hql

select * from db_order.t_order;

select count(1) from db_order.t_user;

Then run it with hive -f /root/x.hql

use db_order;

create table t_order(id string,create_time string,amount float,uid string);

Once the table is created, a table directory is generated under the directory of the database it belongs to:

/user/hive/warehouse/db_order.db/t_order

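However, a table created this way makes Hive assume that the field delimiter in the table's data files is ^A.

For comma-separated data files, the correct create statement is: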

create table t_order(id string,create_time string,amount float,uid string)

row format delimited

fields terminated by ',';

This specifies that the field delimiter in the table's data files is ","

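Internal table (MANAGED_TABLE): the table directory follows Hive's conventions and lives under the Hive warehouse directory /user/hive/warehouse

External table (EXTERNAL_TABLE): the table directory is specified by the user who creates the table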

create external table t_access(ip string,url string,access_time string)

row format delimited

fields terminated by ','

location '/access/log';

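Differences between external and internal tables:

1. An internal table's directory lives in the Hive warehouse directory, whereas an external table's directory is specified by the user

2. Dropping an internal table: Hive removes the metadata and deletes the table's data directory

3. Dropping an external table: Hive only removes the metadata; the data directory is left in place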

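A partitioned table essentially creates partition subdirectories for the data files inside the table directory, so that at query time the MR program can process only the data in the relevant partition subdirectories and read less data.

Take the browsing records a website produces every day: they belong in one table, but sometimes we only need to analyze the records of a single day.

1. Create a partitioned table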

create table t_access(ip string,url string,access_time string)

partitioned by(dt string)

row format delimited

fields terminated by ',';

Note: a partition column must not be a column that already exists in the table definition

Loading data into partitions

load data local inpath '/root/access.log.2017-08-04.log' into table t_access partition(dt='20170804');

load data local inpath '/root/access.log.2017-08-05.log' into table t_access partition(dt='20170805');

Querying partition data

Total PV for August 4:

select count(*) from t_access where dt='20170804';

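In essence, the partition column is used like an ordinary table column, so a where clause can select the partition.

A table can also have several partition columns. Creating the table: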

create table t_partition(id int,name string,age int)

partitioned by(department string,sex string,howold int)

row format delimited fields terminated by ',';

Loading data:

load data local inpath '/root/p1.dat' into table t_partition partition(department='xiangsheng',sex='male',howold=20);
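To make the "partition subdirectory" idea concrete, here is a small Java sketch of how a partition spec maps to a directory path. Hive names each partition directory key=value and nests one level per partition column. The sketch assumes the table sits in the default database under the default warehouse path /user/hive/warehouse; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch: derive the HDFS directory that holds one partition's data files.
class PartitionPath {
    // Appends one "column=value" directory level per partition column,
    // in the order the columns appear in the partition spec.
    static String partitionDir(String tableDir, Map<String, String> partitionSpec) {
        StringJoiner path = new StringJoiner("/");
        path.add(tableDir);
        for (Map.Entry<String, String> column : partitionSpec.entrySet()) {
            path.add(column.getKey() + "=" + column.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // Same partition values as the load statement for t_partition.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("department", "xiangsheng");
        spec.put("sex", "male");
        spec.put("howold", "20");
        // Assumed table location: default database, default warehouse directory.
        System.out.println(partitionDir("/user/hive/warehouse/t_partition", spec));
        // prints /user/hive/warehouse/t_partition/department=xiangsheng/sex=male/howold=20
    }
}
```

A query whose where clause fixes these partition columns only needs to read the files under that one directory.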

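You can also create a table from an existing table:

1. create table t_user_2 like t_user;

The new table t_user_2 has the same structure as the source table t_user, but contains no data

2. Create the table and insert data at the same time

create table t_access_user

as

select ip,url from t_access;

t_access_user is created with the columns of the select query, and the query results are inserted into the new table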

Exporting data from a Hive table to files at a given path

1. Export data from a Hive table to files in HDFS

insert overwrite directory '/root/access-data'

row format delimited fields terminated by ','

select * from t_access;

2. Export data from a Hive table to files on the local disk

insert overwrite local directory '/root/access-data'

row format delimited fields terminated by ','

select * from t_access limit 100000;

The data types in HQL are not much different from those in ordinary SQL

The array type

arrays: ARRAY<data_type> (Note: negative values and non-constant expressions are allowed as of Hive 0.14.)

Example: using the array type

Suppose the following data needs to be mapped with a Hive table:

戰狼2,吳京:吳剛:龍母,2017-08-16

三生三世十里桃花,劉亦菲:癢癢,2017-08-20

Idea: the leading-actor information is conveniently mapped with an array

Creating the table:

create table t_movie(moive_name string,actors array<string>,first_show date)

row format delimited fields terminated by ','

collection items terminated by ':';

Loading data:

load data local inpath '/root/movie.dat' into table t_movie;

Queries:

select * from t_movie;

select moive_name,actors[0] from t_movie;

select moive_name,actors from t_movie where array_contains(actors,'吳剛');

select moive_name,size(actors) from t_movie;

The map type

1) Suppose we have the following data:

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28

2,lisi,father:mayun#mother:huangyi#brother:guanyu,22

3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29

4,mayun,father:mayongzhen#mother:angelababy,26

A map type can describe the family members in the data above

2) Create statement:

create table t_person(id int,name string,family_members map<string,string>,age int)

row format delimited fields terminated by ','

collection items terminated by '#'

map keys terminated by ':';

3) Queries

select * from t_person;

## Get the value for a given key of the map column

select id,name,family_members['father'] as father from t_person;

## Get all keys of the map column

select id,name,map_keys(family_members) as relation from t_person;

## Get all values of the map column

select id,name,map_values(family_members) from t_person;

select id,name,map_values(family_members)[0] from t_person;

## Combined example: find the users who have a brother

select id,name,brother

from

(select id,name,family_members['brother'] as brother from t_person) tmp

where brother is not null;
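To see what the three delimiters in the t_person definition do, the following Java sketch splits one of the sample rows the same way: ',' between fields, '#' between map entries, and ':' between a key and its value. It is only an illustration of the row format, not how Hive is implemented, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: split one text row of t_person using the table's three delimiters.
class MapColumnParser {
    // Parses a map column such as "father:xiaoming#mother:xiaohuang".
    static Map<String, String> parseFamilyMembers(String column) {
        Map<String, String> members = new LinkedHashMap<>();
        for (String item : column.split("#")) {        // collection items terminated by '#'
            String[] keyValue = item.split(":", 2);    // map keys terminated by ':'
            members.put(keyValue[0], keyValue.length > 1 ? keyValue[1] : null);
        }
        return members;
    }

    public static void main(String[] args) {
        String row = "1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28";
        String[] fields = row.split(",");              // fields terminated by ','
        Map<String, String> family = parseFamilyMembers(fields[2]);

        System.out.println(family.get("father"));      // like family_members['father']
        System.out.println(family.keySet());           // like map_keys(family_members)
        System.out.println(family.get("sister"));      // missing key gives null
    }
}
```

A missing key yields null here, which mirrors why the brother query filters with "is not null".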

The struct type

1) Suppose we have the following data:

1,zhangsan,18:male:beijing

2,lisi,28:female:shanghai

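The user info consists of: age (integer), sex (string), address (string)

The idea is to describe the whole user info with a single column, which a struct can do

2) Creating the table: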

create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)

row format delimited fields terminated by ','

collection items terminated by ':';

3) Queries

select * from t_person_struct;

select id,name,info.age from t_person_struct;

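The other statements are basically the same as in SQL

Note: once a query has a group by clause, the select clause may contain nothing other than the grouping columns and aggregate functions

## Why must where come before group by, and why can conditions after group by only use having?

Because where filters the data before the query logic actually runs,

while having filters the results again after group by has aggregated them.

The execution logic of such a query is:

1. where filters out the rows that do not satisfy the condition

2. The aggregate functions and group by aggregate the remaining data to produce the aggregated results

3. having filters out the aggregated results that do not satisfy its condition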
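The three steps can be imitated in plain Java to show why the order matters. This is only a sketch of the logic, not how Hive executes a query; it emulates something like "select uid, sum(amount) from t_order where amount > 10 group by uid having sum(amount) > 100". The order rows, the thresholds, and the class and method names are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: where -> group by with an aggregate -> having, as three separate passes.
class WhereGroupByHaving {
    static final class Order {
        final String uid;
        final double amount;

        Order(String uid, double amount) {
            this.uid = uid;
            this.amount = amount;
        }
    }

    static Map<String, Double> totalsPerUser(List<Order> orders, double minAmount, double minTotal) {
        // Step 1 (where): drop individual rows before any aggregation happens.
        List<Order> filtered = new ArrayList<>();
        for (Order order : orders) {
            if (order.amount > minAmount) {
                filtered.add(order);
            }
        }
        // Step 2 (group by + sum): one running total per uid.
        Map<String, Double> totals = new TreeMap<>();
        for (Order order : filtered) {
            totals.merge(order.uid, order.amount, Double::sum);
        }
        // Step 3 (having): drop whole groups based on the aggregated value.
        totals.values().removeIf(total -> total <= minTotal);
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical rows: (uid, amount).
        List<Order> orders = Arrays.asList(
                new Order("u1", 5), new Order("u1", 50), new Order("u1", 60),
                new Order("u2", 20), new Order("u2", 30),
                new Order("u3", 200));
        System.out.println(totalsPerUser(orders, 10, 100)); // prints {u1=110.0, u3=200.0}
    }
}
```

Because the 5.0 order is removed before aggregation, u1 totals 110.0 rather than 115.0, and u2 is dropped only after its total of 50.0 is known. That is the difference between a where condition and a having condition.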

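Suppose we have the following data:

1,zhangsan,chemistry:physics:math:chinese

2,lisi,chemistry:math:biology:physiology:hygiene

3,wangwu,chemistry:chinese:english:pe:biology

Map it to a table: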

create table t_stu_subject(id int,name string,subjects array<string>)

row format delimited fields terminated by ','

collection items terminated by ':';

The explode function turns each element of an array into its own row; we can use its result to get the distinct set of courses:

select distinct tmp.sub

from

(select explode(subjects) as sub from t_stu_subject) tmp;
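For readers more at home in Java, the query behaves roughly like the nested loop below: the inner loop plays the role of explode (one output row per array element) and the set plays the role of distinct. This is a sketch of the semantics only, with the subject names of the three sample rows written in English; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: emulate "select distinct sub from (select explode(subjects) as sub ...)".
class ExplodeDistinct {
    static Set<String> distinctSubjects(List<List<String>> subjectArrays) {
        Set<String> distinct = new LinkedHashSet<>();
        for (List<String> subjects : subjectArrays) { // one array per table row
            for (String sub : subjects) {             // explode: one row per element
                distinct.add(sub);                    // distinct: repeats are ignored
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("chemistry", "physics", "math", "chinese"),
                Arrays.asList("chemistry", "math", "biology", "physiology", "hygiene"),
                Arrays.asList("chemistry", "chinese", "english", "pe", "biology"));
        Set<String> subjects = distinctSubjects(rows);
        System.out.println(subjects.size() + " distinct subjects: " + subjects);
    }
}
```

The sketch keeps first-seen order; Hive makes no ordering promise for the result unless the query adds an order by.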

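To work with Hive from Java code, first start the hiveserver2 service on the server; this works the same way as connecting to Hive with beeline above

The required jars are all included in the unpacked Hive distribution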

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class getConnection {
    public getConnection() {
    }

    // Opens a JDBC connection to hiveserver2 on hdp-02, port 10000, database "test".
    public static Connection getConnection() throws ClassNotFoundException, SQLException {
        // Load the Hive JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection connection = DriverManager.getConnection("jdbc:hive2://hdp-02:10000/test", "root", "123456");

        return connection;
    }
}

This gives you a connection

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class getData {
    public getData() {
    }

    // Runs a query through hiveserver2 and prints the first four columns of each row.
    public static void getdata() throws ClassNotFoundException, SQLException {
        Connection connection = getConnection.getConnection();
        Statement statement = connection.createStatement();
        String sql = "select * from people";
        ResultSet res = statement.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "  " + res.getString(2) + "  " + res.getString(3) + "  " + res.getString(4));
        }

        // Release the JDBC resources in reverse order of creation.
        res.close();
        statement.close();
        connection.close();
    }
}

This is really no different from connecting to an ordinary MySQL database.
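One possible variation, shown here only as a sketch: the getData class closes its resources by hand, so they stay open if the query throws. try-with-resources closes them in every case, and reading the column count from the result set metadata avoids hard-coding four columns. The host, database, credentials, and table name are the ones used above; the class and method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: query hiveserver2 over JDBC with automatic resource cleanup.
class HiveQuerySketch {
    // Builds a HiveServer2 JDBC URL of the form jdbc:hive2://host:port/database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:hive2://" + host + ":" + port + "/" + database;
    }

    // Runs a query and prints every column of every row, tab-separated.
    static void printQuery(String url, String user, String password, String sql)
            throws SQLException {
        try (Connection connection = DriverManager.getConnection(url, user, password);
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(sql)) {
            int columnCount = rows.getMetaData().getColumnCount();
            while (rows.next()) {
                StringBuilder line = new StringBuilder();
                for (int i = 1; i <= columnCount; i++) { // JDBC columns are 1-based
                    if (i > 1) {
                        line.append('\t');
                    }
                    line.append(rows.getString(i));
                }
                System.out.println(line);
            }
        } // rows, statement, and connection are closed here, even on an exception
    }

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
        } catch (ClassNotFoundException e) {
            System.err.println("Hive JDBC driver not found on the classpath");
            return;
        }
        printQuery(jdbcUrl("hdp-02", 10000, "test"), "root", "123456", "select * from people");
    }
}
```

Running it needs the Hive JDBC jars on the classpath and a reachable hiveserver2; without the driver, main only prints a message and exits.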