一.環境的搭建 1.安裝配置mysql rpm –ivh MySQL-server-5.6.14.rpm rpm –ivh MySQL-client-5.6.14.rpm 啟動mysql 創建hive用戶 grant all on *.* to hadoop@’%’ identified by ‘h ...
一.環境的搭建
1.安裝配置mysql
rpm –ivh MySQL-server-5.6.14.rpm
rpm –ivh MySQL-client-5.6.14.rpm
啟動mysql
創建hive用戶
grant all on *.* to hadoop@’%’ identified by ‘hadoop’;
grant all on *.* to hadoop@’localhost’ identified by ‘hadoop’;
grant all on *.* to hadoop@’master’ identified by ‘hadoop’;
創建hive資料庫
create database hive_1;
2. hive的安裝
tar –zxvf apache-hive-0.13-1-bin.tar.gz
3. hive的配置
vi /apache-hive-0.13-1-bin/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration>
<property>
<name>hive.metastore.local</name>
<value>true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://master:3306/hive_1?characterEncoding=UTF-8</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name> <value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name> <value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name> <value>hadoop</value>
</property>
</configuration>
4. tar -zxvf mysql-connector-java-5.1.27.tar.gz
5. 將java connector複製到依賴庫中
cp mysql-connector-java-5.1.27-bin.jar ~/apache-hive-0.13-1-bin/lib/
6. 配置環境變數
vi .bash_profile
HIVE_HOME = /home/gdou/apache-hive-0.13-1-bin
PATH=$PATH:$HIVE_HOME/bin
二.搜狗日誌數據分析
1. sogou.500w.utf8預處理
數據格式
訪問時間\t 用戶ID \t 關鍵詞 \t 排名\t \頁數 \t URL
2.查看數據
less sogou.500w.utf8
wc -l sogou.500w.utf8
head -100 sogou.500w.utf8 sogou.tmp
3. 數據擴展
將時間欄位拆分並拼接,添加年 月 日 小時欄位
bash sogou-log-extend.sh sogou.500w.utf8 sogou.500w.utf8.ext
4. 數據過濾
過濾第二欄位UID和第三個欄位關鍵字為空的行
bash sogou-log-filter.sh sogou.500w.utf8.ext sogou.500w.utf8 sogou.500w.utf8.flt
5.將文件sogou.500w.utf8和sogou.500w.utf8.flt上傳至HDFS上。
6.HiveQL
基本操作
hive > show databases;
hive > create database sogou;
hive > use sogou;
hive > show tables;
hive > create external table sogou.sogou20111230(timestamp string, uid string, keyword string , rank int, order int, url string)
> comment 'this is a sogou table'
> row format delimited
> fields terminated by '\t'
> stored as textfile
> location 'hdfs://master:9000/sogou/20111230';
hive>show create table sogou.sogou20111230;
hive>describe sogou.sogou20111230;
hive> select * from sogou.sogou20111230 limit 3;
select count(*) from sogou.sogou20111230;
select count(distinct uid) from sogou.sogou20111230;
7.用hiveQL完成下列查詢(寫出HiveQL語句)
1)統計關鍵字非空查詢的條數;
select count(*) from sogou.sogou20111230 where keyword is not null;
結果為:5000000
2)查詢頻度最高的前五十個關鍵字;
select keyword,count(keyword) as num from sogou.sogou20111230 group by keyword order by num desc limit 50;
3)統計每個uid的平均查詢次數
select avg(bb.num) from (select count(b.uid) as num from sogou.sogou20111230 b group by b.uid) as bb;
輸出結果:3.69
4)搜索關鍵字內容包含‘仙劍奇俠’超過三次的用戶id
select tt.uid,tt.num from(select t.uid,count(t.uid) as num from (select * from sogou.sogou20111230 where keyword like concat('%','仙劍奇俠','%')) as t group by t.uid order by num desc) tt where tt.num > 3 limit 50;
輸出結果:
5)查找直接輸入URL作為關鍵字的條目;
select * from sogou.sogou20111230 where keyword rlike '[a-zA-z]+://[^\s]*' limit 50;
6)統計不重覆的uid的行數;
select count(DISTINCT uid) from sogou.sogou20111230;
輸出結果:1352664
相關資料:
鏈接:http://pan.baidu.com/s/1dFD7mdr 密碼:xwu8