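Samza is a distributed stream processing framework that builds on the Kafka message queue to provide near-real-time stream processing. (More precisely, Samza uses Kafka through a pluggable module, so it can be built on top of other message queue systems, but Kafka is its starting point and default implementation.)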
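Samza uses Apache Kafka for messaging.
Apache Hadoop YARN provides fault tolerance, processor isolation, security, and resource management.
This article explains how to install Samza on a 32-bit Ubuntu 14.04 system.
Prerequisites:
To install and configure Apache Samza, you need the following:
JDK 1.7
Maven
Kafka
YARN
ZooKeeper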
# apt-get install curl gem
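Before going further, it can help to confirm that the command-line tools this guide calls are actually on the PATH. The following is a minimal sketch, not part of the original instructions; the function name missing_tools is hypothetical, and the tool list (curl, git, tar, wget) is simply the set of commands invoked in the steps of this guide.

```shell
#!/bin/sh
# Print every command from the argument list that is not found on PATH.
# (missing_tools is a hypothetical helper name used only for this sketch.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Tools this guide invokes directly; adjust the list to your setup.
missing=$(missing_tools curl git tar wget)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools are present."
fi
```

Anything reported as missing can be installed with apt-get before continuing.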
Download the JDK and set its path:
We need to install the JDK and set its environment variables.
# mkdir -p /usr/java
# cd /usr/java
# wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/7u79-b15/jdk-7u79-linux-i586.tar.gz"
Extract the archive and set JAVA_HOME:
# tar -zxvf jdk-7u79-linux-i586.tar.gz
# JAVA_HOME=/usr/java/jdk1.7.0_79
# export JAVA_HOME
# PATH=$JAVA_HOME/bin:$PATH
# export PATH
To make these settings permanent, add the JAVA_HOME and PATH lines above to ~/.bashrc and /etc/bash.bashrc.
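Adding the exports by hand works, but running the same edit twice leaves duplicate lines in the profile. The sketch below shows one way to append them idempotently. The helper names append_once and persist_java_env are hypothetical, and the JDK path is the one used in this guide; adjust it if your JDK was unpacked elsewhere.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# (append_once and persist_java_env are hypothetical helper names.)
append_once() {
    line=$1
    file=$2
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Write both exports to the given profile file.
# Assumption: the JDK lives at the path used earlier in this guide.
persist_java_env() {
    profile=$1
    append_once 'export JAVA_HOME=/usr/java/jdk1.7.0_79' "$profile"
    append_once 'export PATH=$JAVA_HOME/bin:$PATH' "$profile"
}

# Example:
#   persist_java_env "$HOME/.bashrc"
```

Calling persist_java_env more than once on the same file leaves exactly one copy of each line, so it is safe to rerun.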
Install Maven:
Next, download and install Maven.
# wget https://launchpad.net/~bneijt/+archive/ubuntu/ppa/+build/2139203/+files/maven3_3.0.1-0~ppa2_all.deb
# dpkg -i maven3_3.0.1-0~ppa2_all.deb
Check the Maven version:
# mvn3 -version
Apache Maven 3.0.1 (r1038046; 2010-11-23 16:28:32+0530)
Java version: 1.7.0_79
Java home: /usr/java/jdk1.7.0_79/jre
Default locale: en_IN, platform encoding: UTF-8
OS name: "linux" version: "3.8.0-29-generic" arch: "i386" Family: "unix"
Install Hello-Samza:
We will install it under the /usr/local directory.
# cd /usr/local
Clone hello-samza into it:
# git clone git://git.apache.org/samza-hello-samza.git hello-samza
The project includes a script called "grid" that helps you set up everything hello-samza needs. You can use it to install Kafka, YARN, and ZooKeeper.
Run the following commands:
# cd /usr/local/hello-samza
root@dev:/usr/local/hello-samza# bin/grid install kafka
EXECUTING: install kafka
Downloading kafka_2.10-0.8.2.1.tgz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
15 15.4M 15 2406k 0 0 304k 0 0:00:51 0:00:07 0:00:44 443k
root@dev:/usr/local/hello-samza# bin/grid install yarn
EXECUTING: install yarn
Downloading hadoop-2.6.1.tar.gz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
77 187M 77 145M 0 0 239k 0 0:13:23 0:10:22 0:03:01 204k
root@dev:/usr/local/hello-samza# bin/grid install zookeeper
EXECUTING: install zookeeper
Downloading zookeeper-3.4.3.tar.gz...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
8 15.4M 8 1324k 0 0 212k 0 0:01:14 0:00:06 0:01:08 266k
You will now find all the packages in a folder named "deploy" under the hello-samza root directory.
root@dev:/usr/local/hello-samza# cd deploy
root@dev:/usr/local/hello-samza/deploy# ls
kafka yarn zookeeper
Run the bin/grid bootstrap command:
root@dev:/usr/local/hello-samza# bin/grid bootstrap
Download http://repo1.maven.org/maven2/org/fusesource/scalate/scalate-util_2.10/1.6.1/scalate-util_2.10-1.6.1.jar
:samza-yarn_2.10:processResources
:samza-yarn_2.10:classes
:samza-yarn_2.10:lesscss
....
....
BUILD SUCCESSFUL
Total time: 20 mins 32.855 secs
/usr/local/hello-samza
EXECUTING: install zookeeper
Using previously downloaded file /root/.samza/download/zookeeper-3.4.3.tar.gz
EXECUTING: install yarn
Using previously downloaded file /root/.samza/download/hadoop-2.6.1.tar.gz
EXECUTING: install kafka
Using previously downloaded file /root/.samza/download/kafka_2.10-0.8.2.1.tgz
EXECUTING: start zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
EXECUTING: start yarn
starting resourcemanager, logging to /usr/local/hello-samza/deploy/yarn/logs/yarn-root-resourcemanager-dev.out
starting nodemanager, logging to /usr/local/hello-samza/deploy/yarn/logs/yarn-root-nodemanager-dev.out
EXECUTING: start kafka
Once the grid command above has finished, you can verify that YARN is installed and running by visiting http://localhost:8088. You should see the YARN UI.
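On a machine without a browser, the same check can be made from the shell. The sketch below is one possible approach, not part of the original guide: url_is_up is a hypothetical helper, and the /ws/v1/cluster/info path assumes the ResourceManager REST API is served on the same port as the UI.

```shell
#!/bin/sh
# Return 0 if the URL answers with a successful HTTP status, non-zero otherwise.
# (url_is_up is a hypothetical helper name; it relies on curl being installed.)
url_is_up() {
    curl -sf --max-time 5 "$1" >/dev/null 2>&1
}

# Assumption: the ResourceManager REST API shares port 8088 with the UI.
YARN_INFO_URL="http://localhost:8088/ws/v1/cluster/info"

# Example:
#   url_is_up "$YARN_INFO_URL" && echo "YARN is up" || echo "YARN is not reachable"
```

If the check fails right after bootstrap, the ResourceManager may still be starting; waiting a few seconds and retrying is usually enough.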
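Build a Samza job package:
Before you can run a Samza job, you need to build a package for it. YARN uses this package to deploy the job on the grid.
Note: if you are building the latest version of the hello-samza project, remember to run the following command first.
root@dev:/usr/local/hello-samza# ./gradlew publishToMavenLocal
You can then build and unpack the package with these commands in the hello-samza project:
root@dev:/usr/local/hello-samza# mvn clean package
root@dev:/usr/local/hello-samza# mkdir -p deploy/samza
root@dev:/usr/local/hello-samza# tar -xvf ./target/hello-samza-0.10.0-dist.tar.gz -C deploy/samza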
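When rebuilding repeatedly, unpacking over an existing deploy/samza directory can leave stale files from an earlier build. The sketch below wraps the mkdir and tar steps above in a small function that starts from a clean directory each time. The name unpack_job_package is hypothetical, and removing the old directory first is a design choice of this sketch rather than something the original steps do.

```shell
#!/bin/sh
# Unpack a job tarball into a fresh target directory.
# (unpack_job_package is a hypothetical helper name.)
unpack_job_package() {
    tarball=$1
    target=$2
    # Refuse to run with an empty target so rm -rf can never hit the wrong path.
    [ -n "$target" ] || { echo "No target directory given" >&2; return 1; }
    [ -f "$tarball" ] || { echo "No such package: $tarball" >&2; return 1; }
    rm -rf "$target"
    mkdir -p "$target"
    tar -xzf "$tarball" -C "$target"
}

# Example (paths as used in this guide):
#   unpack_job_package ./target/hello-samza-0.10.0-dist.tar.gz deploy/samza
```

The function returns a non-zero status if the tarball is missing, which makes it easy to chain after mvn clean package with &&.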
Run a Samza job:
After building the Samza package, you can use the run-job.sh script to start a job on the grid.
root@dev:/usr/local/hello-samza # deploy/samza/bin/run-job.sh --config-factory=org.apache.samza.config.factories.PropertiesConfigFactory --config-path=file://$PWD/deploy/samza/config/wikipedia-feed.properties
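The job above consumes a feed of real-time edits from Wikipedia and writes them to a Kafka topic called "wikipedia-raw".
After the job has run for a few minutes, tail the Kafka topic to see the latest edits:
root@dev:/usr/local/hello-samza# deploy/kafka/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic wikipedia-raw
Visit the YARN UI again (http://localhost:8088). You should see the Samza job running normally, with no errors.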
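The --config-path argument passed to run-job.sh earlier is a file:// URI built from the current directory, which is why that command has to be run from the hello-samza root. As a small illustration, the sketch below builds the same URI from an explicit root directory and job name. The function name job_config_uri is hypothetical; the directory layout is the one created by the build steps in this guide.

```shell
#!/bin/sh
# Build the file:// URI that run-job.sh receives via --config-path.
# $1 is the hello-samza root directory, $2 is the job name without ".properties".
# (job_config_uri is a hypothetical helper name.)
job_config_uri() {
    printf 'file://%s/deploy/samza/config/%s.properties\n' "$1" "$2"
}

job_config_uri /usr/local/hello-samza wikipedia-feed
# prints file:///usr/local/hello-samza/deploy/samza/config/wikipedia-feed.properties
```

Because the root directory is absolute, the result has three slashes after "file:", matching what file://$PWD/... expands to.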
Shut down Samza:
Once you are done, you can use the grid script to shut down all the related services.
root@dev:/usr/local/hello-samza # bin/grid stop all
Sample output:
EXECUTING: stop all
EXECUTING: stop kafka
EXECUTING: stop yarn
stopping resourcemanager
stopping nodemanager
EXECUTING: stop zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Stopping zookeeper ... STOPPED
Start Samza:
Similarly, you can use the grid script to start all the services:
root@dev:/usr/local/hello-samza # bin/grid start all
Sample output:
EXECUTING: start all
EXECUTING: start zookeeper
JMX enabled by default
Using config: /usr/local/hello-samza/deploy/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
EXECUTING: start yarn
....
EXECUTING: start kafka
Reposted from: How to Install and Configure Apache Samza on Linux