
Spark: How to Submit Spark Jobs


How to submit Spark jobs:
1. spark-shell
   1.1 Overview
   1.2 Startup
   1.3 Application scenarios
2. spark-submit
   2.1 Overview
   2.2 Basic syntax
3. spark-shell vs. spark-submit

Spark jobs are submitted with the spark-shell and spark-submit commands.

For quick tests, use spark-shell, Spark's interactive command line.

To submit a packaged Spark program to a Spark cluster, use spark-submit.

1. spark-shell

spark-shell is the interactive shell program that ships with Spark. It is a convenient environment for interactive development: at its prompt you can write Spark programs directly in Scala.
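To make this concrete, the following kind of snippet can be typed directly at the spark-shell prompt. It relies on the sc (SparkContext) object that the shell creates automatically; the HDFS path is only a hypothetical placeholder.

// Typed at the spark-shell prompt; `sc` is the SparkContext the shell pre-creates.
// The input path below is a placeholder, not a path assumed to exist.
val lines = sc.textFile("hdfs:///tmp/input.txt")
// Split each line into words, pair each word with 1, then sum the counts per word.
val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Print a small sample of the result.
counts.take(10).foreach(println)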

1.1 Overview

1. spark-shell help output

(py27) [root@master ~]# cd /usr/local/src/spark-2.0.2-bin-hadoop2.6
(py27) [root@master spark-2.0.2-bin-hadoop2.6]# cd bin
(py27) [root@master bin]# ./spark-shell --help
Usage: ./bin/spark-shell [options]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories. The format for the coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while resolving the dependencies provided in --packages to avoid dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working directory of each executor.
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME           User to impersonate when submitting the application. This argument does not work with --principal / --keytab.
  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode)

 YARN-only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2). If dynamic allocation is enabled, the initial number of executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the principal specified above. This keytab will be copied to the node running the Application Master via the Secure Distributed Cache, for renewing the login tickets and the delegation tokens periodically.

The spark-shell source

(py27) [root@master bin]# cat spark-shell
function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're using the Mintty terminal emulator in Cygwin, may need to set the
    # "Backspace sends ^H" setting in "Keys" section of the Mintty options
    # (see https://github.com/sbt/sbt/issues/562).
    stty -icanon min 1 -echo > /dev/null 2>&1
    export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Djline.terminal=unix"
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
    stty icanon echo > /dev/null 2>&1
  else
    export SPARK_SUBMIT_OPTS
    "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
  fi
}

In the main function of spark-shell, the spark-submit script is executed with --name set to "Spark shell". This explains why an application started from spark-shell appears under the name "Spark shell" in the web UI. The main class passed to spark-submit is org.apache.spark.repl.Main, which is what creates the SparkContext for us; this is why a sc object is already available when spark-shell starts. In other words, the source shows that spark-shell ultimately just runs spark-submit.
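This can be checked from inside the shell itself. Assuming the default startup described above (Spark 2.x pre-creates both sc and the SparkSession spark), the following lines typed at the prompt report the values the launch script passed in:

// Inside spark-shell: sc and spark already exist, no setup needed.
sc.appName      // "Spark shell" -- the --name the launch script handed to spark-submit
sc.master       // the master URL the shell was started with, e.g. local[*]
spark.version   // the Spark version of the running shell, via the pre-created SparkSession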

1.2 Startup

1. Start spark-shell directly from the bin directory:

./spark-shell


By default this starts in local mode and launches a SparkSubmit process on the local machine.

You can also pass the --master option, for example:

spark-shell --master local[N]   run the current job locally, simulating N threads
spark-shell --master local[*]   use all resources available on the current machine

If no option is given, the default is:

spark-shell --master local[*]

To exit spark-shell, type :quit or press Ctrl+D.

2. ./spark-shell --master spark://master:7077

This starts spark-shell in standalone mode, i.e. against the cluster.

3. ./spark-shell --master yarn-client

This starts spark-shell in YARN client mode. Note that although spark-submit accepts YARN cluster mode, cluster deploy mode is not applicable to interactive shells, so spark-shell cannot actually run as yarn-cluster.

1.3 Application scenarios

spark-shell is mainly used for testing, so it is usually started simply as ./spark-shell and used in local mode.
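When the logic later moves from interactive testing into a standalone application, the master URL that was passed to spark-shell with --master can instead be set programmatically. A minimal sketch, assuming the Spark 2.x SparkSession API; the application name and the tiny computation are placeholders:

import org.apache.spark.sql.SparkSession

// The master URL ("local[*]", "spark://master:7077", "yarn", ...) can be set in code
// for quick local tests; for packaged jobs it is usually omitted here and supplied
// via spark-submit --master instead.
val spark = SparkSession.builder()
  .appName("LocalSmokeTest")   // hypothetical application name
  .master("local[*]")          // use all cores of the local machine
  .getOrCreate()

val sum = spark.sparkContext.parallelize(1 to 100).sum()  // tiny smoke test
println(s"sum = $sum")
spark.stop()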

2. spark-submit

Once a program has been packaged, it can be launched with the bin/spark-submit script. This script takes care of setting up the classpath and dependencies that Spark needs, and it supports the different cluster managers and deploy modes that Spark offers.

2.1 Overview

spark-submit is mainly used to submit a compiled and packaged jar to a cluster for execution. It is very similar to the hadoop jar command in Hadoop: hadoop jar submits a MapReduce task, while spark-submit submits a Spark job. The script can set up the Spark classpath and the application's dependencies, and it supports all of the cluster managers and deploy modes that Spark supports. Unlike spark-shell, it is not a REPL (interactive programming environment): before running, you have to specify the application's main class, the path to the jar, its arguments, and so on.
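For reference, a jar submitted this way typically contains an application of the following shape. This is a minimal, hedged sketch: the object name, input path handling, and word-count logic are placeholders, and the master is deliberately not hard-coded so that spark-submit --master decides where it runs.

import org.apache.spark.sql.SparkSession

// Hypothetical application packaged into a jar; its fully qualified name
// would be passed to spark-submit via --class.
object WordCountApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCountApp")   // shown in the web UI
      .getOrCreate()             // master/deploy mode come from spark-submit

    val input = args(0)          // input path passed as an application argument
    val counts = spark.sparkContext.textFile(input)
      .flatMap(_.split("\\s+"))  // split lines into words
      .map((_, 1))               // pair each word with a count of 1
      .reduceByKey(_ + _)        // sum counts per word

    counts.take(20).foreach(println)
    spark.stop()
  }
}

It would then be submitted with something like ./bin/spark-submit --class WordCountApp --master yarn app.jar /some/input/path (jar name and paths are placeholders).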

1. spark-submit help output

(py27) [root@master bin]# ./spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  (the option list is identical to the spark-shell help shown above, from --master through --keytab, since spark-shell simply forwards its options to spark-submit)

The spark-submit source

(py27) [root@master bin]# cat spark-submit
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# disable randomized hash for string in Python 3.3+
export PYTHONHASHSEED=0

exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

spark-submit itself is fairly simple: it just runs the spark-class script, specifying org.apache.spark.deploy.SparkSubmit as the main class and passing all of its arguments through to it.

2.2 Basic syntax

Example: submit a job to a Hadoop YARN cluster.

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --queue thequeue \
  examples/target/scala-2.11/jars/spark-examples*.jar 10

Explanation of the parameters:

Parameter                Description
--class                  The application's main class (Java/Scala applications only)
--master                 The master address, i.e. where the job is submitted to, e.g. spark://host:port, yarn, or local
--deploy-mode            Whether to launch the driver locally (client) or on the cluster (cluster); default is client
--name                   The application name, shown in Spark's web UI
--jars                   Comma-separated list of local jars; these jars are added to the driver and executor classpaths
--packages               Maven coordinates of jars to include on the driver and executor classpaths
--exclude-packages       Packages to exclude, to avoid dependency conflicts
--repositories           Additional remote repositories
--conf PROP=VALUE        Set a Spark configuration property, e.g. --conf spark.executor.extraJavaOptions="-XX:MaxPermSize=256m"
--properties-file        Configuration file to load; defaults to conf/spark-defaults.conf
--driver-memory          Driver memory; default 1G
--driver-java-options    Extra Java options passed to the driver
--driver-library-path    Extra library path entries passed to the driver
--driver-class-path      Extra classpath entries passed to the driver
--driver-cores           Number of driver cores; default 1; used with YARN or standalone
--executor-memory        Memory per executor; default 1G
--total-executor-cores   Total number of cores across all executors; Mesos and standalone only
--num-executors          Number of executors to launch; default 2; YARN only
--executor-cores         Number of cores per executor; used with YARN or standalone
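Many of these flags are shorthand for ordinary Spark configuration properties, so the same values can also be set in code through SparkConf (or passed with --conf). A hedged sketch of the correspondence for a few of the memory/core options, using standard property names:

import org.apache.spark.SparkConf

// Roughly equivalent to: --name ConfExample --executor-memory 2g --executor-cores 2
// Note: properties set directly on SparkConf take precedence over spark-submit flags
// and spark-defaults.conf, and spark.driver.memory set here has no effect in client
// mode because the driver JVM is already running by the time this code executes.
val conf = new SparkConf()
  .setAppName("ConfExample")              // same role as --name
  .set("spark.executor.memory", "2g")     // --executor-memory
  .set("spark.executor.cores", "2")       // --executor-cores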

3. spark-shell vs. spark-submit

1. What they have in common: both scripts live in the spark/bin directory.

2. Differences:

(1) spark-shell is interactive: it provides an IDE-like environment at the terminal in which developers can write code directly. When it runs, it calls spark-submit under the hood.
(2) spark-submit is not interactive: it is used to submit a jar, compiled and packaged in an IDE such as IntelliJ IDEA, to the cluster for execution.



      CopyRight 2018-2019 实验室设备网 版权所有