This tutorial covers a standalone and a pseudo-distributed Hadoop installation. The steps are written briefly; these are personal notes for my own study of Hadoop.
Environment
OS       | CentOS 6.5 (64-bit)
Hostname | hadoop001
IP       | 192.168.3.128
JDK      | jdk-8u40-linux-x64.rpm
Hadoop   | 2.7.3
Hadoop has two major release lines, Hadoop 1.x.y and Hadoop 2.x.y; older textbooks may use versions such as 0.20. Hadoop 2.x is continuously updated, and this tutorial applies to all 2.x releases. If you need to install a version such as 0.20 or 1.2.1, this tutorial can still serve as a reference; the main differences are in the configuration items, for which you should consult the official documentation or other tutorials.
Standalone Installation
1. Create a Hadoop user
To keep later steps tidy and avoid interfering with other users, first create a dedicated hadoop user and set its password:

[root@localhost ~]# useradd -m hadoop -s /bin/bash
[root@localhost ~]# passwd hadoop
Changing password for user hadoop.
New password:
BAD PASSWORD: it is based on a dictionary word
BAD PASSWORD: is too simple
Retype new password:
passwd: all authentication tokens updated successfully.
The hosts file also needs an entry for this machine:

[root@hadoop001 .ssh]# vim /etc/hosts
192.168.3.128 hadoop001
2. Set up passwordless SSH login
Both single-node and cluster setups rely on SSH, so configure passwordless login for unhindered access and communication.
[hadoop@hadoop001 .ssh]$ cd ~/.ssh/
[hadoop@hadoop001 .ssh]$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):    # press Enter
Enter passphrase (empty for no passphrase):                        # press Enter
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
97:75:b0:56:3b:57:8c:1f:b1:51:b6:d9:9f:77:f3:cf hadoop@hadoop001
The key's randomart image is:
+--[ RSA 2048]----+
|        .    .=*|
|            +.+O|
|           + +=+|
|          + . o+|
|        S o   o+|
|             . =|
|              . |
|             .. |
|               E|
+-----------------+
[hadoop@hadoop001 .ssh]$ cat ./id_rsa.pub >> ./authorized_keys
[hadoop@hadoop001 .ssh]$ ll
total 12
-rw-rw-r--. 1 hadoop hadoop  398 Mar 14 14:09 authorized_keys
-rw-------. 1 hadoop hadoop 1675 Mar 14 14:09 id_rsa
-rw-r--r--. 1 hadoop hadoop  398 Mar 14 14:09 id_rsa.pub
[hadoop@hadoop001 .ssh]$ chmod 644 authorized_keys
[hadoop@hadoop001 .ssh]$ ssh hadoop001
Last login: Tue Mar 14 14:11:52 2017 from hadoop001
With that, passwordless SSH login to the local machine is configured.
3. Install the JDK
# List all installed Java packages, then remove each one that appears
rpm -qa | grep java
rpm -e --nodeps java-x.x.x-gcj-compat-x.x.x.x-40jpp.115
# Install the jdk-8u40-linux-x64.rpm package; no PATH changes are needed,
# but JAVA_HOME must still be set
rpm -ivh jdk-8u40-linux-x64.rpm
echo "JAVA_HOME=/usr/java/latest/" >> /etc/environment
Verify the installation:
[hadoop@hadoop001 soft]$ java -version
java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
4. Install Hadoop
# Extract into the /opt directory
[root@hadoop001 soft]# tar -zxf hadoop-2.7.3.tar.gz -C /opt/
Change the ownership of the directory:
[root@hadoop001 opt]# ll
total 20
drwxr-xr-x. 9 root root 4096 Aug 17  2016 hadoop-2.7.3
[root@hadoop001 opt]# chown -R hadoop:hadoop hadoop-2.7.3/
[root@hadoop001 opt]# ll
total 20
drwxr-xr-x. 9 hadoop hadoop 4096 Aug 17  2016 hadoop-2.7.3
Add the environment variables:
[hadoop@hadoop001 bin]$ vim ~/.bash_profile

# hadoop
HADOOP_HOME=/opt/hadoop-2.7.3
PATH=$PATH:$HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export PATH
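Note that ~/.bash_profile is only read by new login shells; to make the new PATH take effect in the current session, reload it:

[hadoop@hadoop001 bin]$ source ~/.bash_profile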
Verify the installation:
[hadoop@hadoop001 bin]$ hadoop
Usage: hadoop [--config confdir] [COMMAND | CLASSNAME]
  CLASSNAME            run the class named CLASSNAME
 or
  where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
                       note: please use "yarn jar" to launch
                             YARN applications, not this command.
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  credential           interact with credential providers
  daemonlog            get/set the log level for each daemon
  trace                view and modify Hadoop tracing settings

Most commands print help when invoked w/o parameters.
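Another quick sanity check is hadoop version, which should report the release that was just unpacked (output abridged here):

[hadoop@hadoop001 bin]$ hadoop version
Hadoop 2.7.3
...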
Word Count
Create an input directory to hold the input file:
[root@hadoop001 /]# mkdir -p /data/input
# Create the test file word.txt
[root@hadoop001 /]# vim /data/input/word.txt
Hi, This is a test file.
Hi, I love hadoop and love you .
# Grant ownership to the hadoop user
[root@hadoop001 /]# chown hadoop:hadoop /data/input/word.txt
# Run the word-count example
[hadoop@hadoop001 hadoop-2.7.3]$ hadoop jar /opt/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /data/input /data/output
# ... intermediate log output omitted
17/03/14 15:22:44 INFO mapreduce.Job: Counters: 30
	File System Counters
		FILE: Number of bytes read=592316
		FILE: Number of bytes written=1165170
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=3
		Map output records=14
		Map output bytes=114
		Map output materialized bytes=127
		Input split bytes=90
		Combine input records=14
		Combine output records=12
		Reduce input groups=12
		Reduce shuffle bytes=127
		Reduce input records=12
		Reduce output records=12
		Spilled Records=24
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=0
		Total committed heap usage (bytes)=525336576
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=59
	File Output Format Counters
		Bytes Written=85
The job ran successfully; view the result in the output directory:
[hadoop@hadoop001 output]$ vim part-r-00000
.       1
Hi,     2
I       1
This    1
a       1
and     1
file.   1
hadoop  1
is      1
love    2
test    1
you     1
[The standalone installation is now complete]
Pseudo-Distributed Installation
Hadoop can run on a single node in pseudo-distributed mode: the Hadoop daemons run as separate Java processes, the node acts as both NameNode and DataNode, and files are read from HDFS.
The Hadoop configuration files live in $HADOOP_HOME/etc/hadoop/. Pseudo-distributed mode requires changes to two of them: core-site.xml and hdfs-site.xml.
The configuration files are XML; each setting is declared as a property with a name and a value.
Edit core-site.xml:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop-2.7.3/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop001:9000</value>
    </property>
</configuration>
Edit hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/data/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/data/dfs/data</value>
    </property>
</configuration>
Pseudo-distributed mode will run with only fs.defaultFS and dfs.replication configured (that is all the official tutorial sets). However, if hadoop.tmp.dir is left unset, the default temporary directory /tmp/hadoop-hadoop is used, and /tmp may be wiped by the system on reboot, forcing you to run the format step again. So we set it explicitly, and also specify dfs.namenode.name.dir and dfs.datanode.data.dir, without which the following steps may fail.
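Before formatting, it also helps to create these directories and hand them to the hadoop user up front. A minimal sketch as root, assuming the paths configured above:

# Paths taken from core-site.xml and hdfs-site.xml above
mkdir -p /opt/hadoop-2.7.3/tmp /data/dfs/name /data/dfs/data
chown -R hadoop:hadoop /opt/hadoop-2.7.3/tmp /data/dfs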
Format the NameNode:
[hadoop@hadoop001 hadoop]$ hdfs namenode -format
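If the format succeeds, the log should contain a line similar to the following (abridged; the path matches dfs.namenode.name.dir above):

INFO common.Storage: Storage directory /data/dfs/name has been successfully formatted.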
After formatting, start the NameNode and DataNode daemons with start-dfs.sh; an error is reported.
Fix this by editing hadoop-env.sh and replacing ${JAVA_HOME} with the absolute path:
[hadoop@hadoop001 hadoop-2.7.3]$ vim etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_40/
Now re-run start-dfs.sh.
The daemons now start successfully, which confirms the pseudo-distributed setup works.
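A quick way to confirm is jps, which lists the running Java processes; after start-dfs.sh you should see something like this (process IDs will differ):

[hadoop@hadoop001 ~]$ jps
2786 NameNode
2905 DataNode
3074 SecondaryNameNode
3190 Jps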
Remote access to http://192.168.3.128:50070 failed, although it worked locally on the server.
The cause turned out to be that the NameNode had not been re-formatted after hadoop-env.sh was changed. After re-formatting, however, the DataNode refused to start.
Finally, deleting the VERSION file under the DataNode's data directory, formatting again, and restarting resolved it.
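The likely reason the DataNode would not start is a clusterID mismatch: each run of hdfs namenode -format generates a new clusterID, while the DataNode's data directory still records the old one in its VERSION file, so the DataNode refuses to register. A sketch of the recovery steps, assuming dfs.datanode.data.dir = /data/dfs/data as configured above:

# Stop HDFS before touching the data directory
stop-dfs.sh
# Remove the stale VERSION file (discards the old clusterID)
rm /data/dfs/data/current/VERSION
# Re-format the NameNode and start again
hdfs namenode -format
start-dfs.sh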