Spark Introduction
Official site: http://spark.apache.org/
Apache Spark is a fast and general engine for large-scale data processing.
Speed
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Apache Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Ease of Use
Write applications quickly in Java, Scala, Python, R.
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.
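For example, a word count needs only a handful of these operators. The following is a minimal sketch typed into spark-shell (which already provides the SparkContext as sc); the HDFS input path is only a placeholder:
val lines = sc.textFile("hdfs://node1:9000/input.txt")
val counts = lines
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // pair each word with a count of 1
  .reduceByKey(_ + _)                 // sum the counts per word
counts.take(10).foreach(println)      // action: triggers the actual computation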
Generality
Combine SQL, streaming, and complex analytics.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
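As a rough illustration of mixing these libraries in one program, the sketch below builds a DataFrame from an ordinary RDD and queries it with SQL (Spark 1.6 API); it assumes an existing SparkContext sc, such as the one spark-shell provides, and the sample data is made up:
import org.apache.spark.sql.SQLContext
case class Person(name: String, age: Int)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// turn an ordinary RDD of case-class objects into a DataFrame
val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25))).toDF()
// register it as a temporary table and query it with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 26").show()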
Runs Everywhere
Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
Official documentation: http://spark.apache.org/docs/1.6.0/
Installation
Download:
Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark 1.6.0 uses Scala 2.10. You will need to use a compatible Scala version (2.10.x).
Spark version: spark-1.6.0-bin-hadoop2.6.tgz
Scala version: scala-2.10.4.tgz
Spark monitoring port (application web UI): 4040
Cluster installation:
tar -xzvf spark-1.6.0-bin-hadoop2.6.tgz -C /usr/
Edit the configuration file conf/spark-env.sh under the extracted directory:
export SPARK_MASTER_IP=node1
export SPARK_MASTER_PORT=7077
export SPARK_WORKER_CORES=1
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=1024m
export SPARK_LOCAL_DIRS=/data/spark/dataDir
export HADOOP_CONF_DIR=/usr/hadoop-2.5.1/etc/hadoop
Edit the slaves file to list the worker nodes:
node2
node3
The available cluster manager types are:
Standalone
Apache Mesos
Hadoop Yarn
This article covers two of them: Standalone and YARN.
The HADOOP_CONF_DIR entry in the configuration file above is there precisely to support running on YARN.
Standalone
Start Spark's own cluster manager:
spark/sbin/start-all.sh
Run the test script:
Standalone client mode
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://node1:7077 --executor-memory 512m --total-executor-cores 1 ./lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
Standalone cluster mode
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://node1:7077 --deploy-mode cluster --supervise --executor-memory 512M --total-executor-cores 1 ./lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
Check the result at http://node1:4040. In client mode the result cannot be found through port 4040; it is printed to the terminal after the script finishes.
YARN
Start the Hadoop cluster. Since YARN is used, there is no need to start Spark's own daemons:
start-all.sh
Run the test script:
YARN client mode
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --executor-memory 512M --num-executors 1 ./lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
YARN cluster mode
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-memory 512m --num-executors 1 ./lib/spark-examples-1.6.0-hadoop2.6.0.jar 100
Check the execution result at http://node1:8088. In client mode the result cannot be found through port 8088; it is printed to the terminal after the script finishes.
When to use YARN: if the cluster runs both MapReduce and Spark, YARN is recommended so both share the same resource scheduler.
Standalone is aimed at clusters that run only Spark applications.
Cluster Mode
The following table summarizes terms you’ll see used to refer to cluster concepts:
Term | Meaning |
---|---|
Application | User program built on Spark. Consists of a driver program and executors on the cluster. |
Application jar | A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however, these will be added at runtime. |
Driver program | The process running the main() function of the application and creating the SparkContext |
Cluster manager | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN) |
Deploy mode | Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster. |
Worker node | Any node that can run application code in the cluster |
Executor | A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. |
Task | A unit of work that will be sent to one executor |
Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save , collect ); you'll see this term used in the driver's logs. |
Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. |
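To relate these terms to code, here is a minimal sketch of a Spark 1.6 Scala application (the object name and HDFS paths are placeholders): submitting its jar creates one application; main() runs in the driver program and creates the SparkContext; each action (count, saveAsTextFile) spawns a job, which the scheduler splits into stages and tasks that run on the executors.
import org.apache.spark.{SparkConf, SparkContext}

object TermsDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TermsDemo")
    val sc = new SparkContext(conf)                    // the driver creates the SparkContext

    val counts = sc.textFile("hdfs://node1:9000/input.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)                              // the shuffle here starts a new stage

    println(counts.count())                            // action -> job 1
    counts.saveAsTextFile("hdfs://node1:9000/output")  // action -> job 2

    sc.stop()
  }
}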