2016-03-10

Spark Deployment Docs: How to Submit Applications to Spark (Part 2)

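Descriptions:
I have recently been reading the official Spark documentation systematically and translating some of its chapters along the way. The original for this post is at http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management .
My ability is limited, so if any part of the translation is inaccurate, corrections are welcome at kylin27@outlook.com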

Submitting Applications

The spark-submit script in Spark's bin directory is used to launch applications on a cluster.

It can use all of Spark's supported cluster managers through a uniform interface, so you don't have to configure your application specially for each one. (Translator's note: for example Spark's own standalone manager, or third-party resource schedulers such as Mesos and YARN.)

Bundling Your Application's Dependencies

If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster.

To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies.

Both sbt and Maven have assembly plugins.

When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime.
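One way to confirm that Spark and Hadoop really were left out of an assembly jar is to scan its file listing. The sketch below is only an illustration: check_provided is a hypothetical helper, and it reads a listing from standard input (for instance the output of `jar tf`) rather than opening a jar itself.

```shell
# Hypothetical helper: reads a jar listing (one entry per line) on stdin
# and reports whether Spark or Hadoop classes were bundled by mistake.
check_provided() {
  if grep -E -q '^org/apache/(spark|hadoop)/'; then
    echo "bundled: mark Spark/Hadoop as provided"
  else
    echo "ok: Spark/Hadoop not bundled"
  fi
}

# Example with a made-up listing instead of a real jar.
printf '%s\n' 'com/example/MyApp.class' 'org/apache/spark/SparkContext.class' | check_provided
```

With a real jar the listing would come from something like `jar tf my-app-assembly.jar | check_provided`, where the jar name is a placeholder.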

Once you have an assembled jar you can call the bin/spark-submit script as shown here while passing your jar.

For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application.

If you depend on multiple Python files we recommend packaging them into a .zip or .egg.
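The --py-files option takes its files as a single comma-separated list. A minimal sketch of building that list follows; join_py_files and the file names deps.zip, util.py and main.py are hypothetical, and the command is only printed, not run.

```shell
# Join file names into the comma-separated list that --py-files expects.
join_py_files() {
  list=""
  for f in "$@"; do
    if [ -z "$list" ]; then list="$f"; else list="$list,$f"; fi
  done
  printf '%s\n' "$list"
}

# deps.zip, util.py and main.py are placeholder names.
PY_FILES=$(join_py_files deps.zip util.py)
echo "spark-submit --py-files $PY_FILES main.py"
```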

Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script.

This script takes care of setting up the classpath with Spark and its dependencies, and can support the different cluster managers and deploy modes that Spark supports:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the commonly used options are:

--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi).
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077).
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client). The default is client.
--conf: An arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any.
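To see how these options fit together, including a --conf value that contains spaces, here is a dry-run sketch. It only prints the argument list and launches nothing. The variable names and print_submit_args are hypothetical, and the JVM flags in the --conf value are just an example of a value with a space in it.

```shell
# Placeholder values; swap in your own class, master and jar.
MAIN_CLASS="org.apache.spark.examples.SparkPi"
MASTER_URL="spark://23.195.26.187:7077"
DEPLOY_MODE="client"
APP_JAR="/path/to/examples.jar"

# Prints the argument list one item per line (a dry run).
print_submit_args() {
  # Quoting keeps the key=value pair with spaces as a single argument.
  set -- \
    --class "$MAIN_CLASS" \
    --master "$MASTER_URL" \
    --deploy-mode "$DEPLOY_MODE" \
    --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
    "$APP_JAR" \
    1000
  printf '%s\n' "$@"
}

print_submit_args
```

To launch for real, the printf line would be replaced by `./bin/spark-submit "$@"`.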

A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. the master node in a standalone EC2 cluster).

In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster.

The input and output of the application are attached to the console.

Thus, this mode is especially suitable for applications that involve the REPL (e.g. the Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the driver and the executors.

Note that cluster mode is currently not supported for Mesos clusters.

Currently only YARN supports cluster mode for Python applications.

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

There are a few options available that are specific to the cluster manager that is being used.

For example, with a Spark standalone cluster with cluster deploy mode, you can also specify --supervise to make sure that the driver is automatically restarted if it fails with a non-zero exit code.

To enumerate all such options available to spark-submit, run it with --help. Here are a few examples of common options:

# Run application locally on 8 cores

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8]  \
  /path/to/examples.jar \
  100

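Translator's note: the spark-submit command above breaks down as follows.
--class org.apache.spark.examples.SparkPi
Names the entry-point class of the jar submitted to Spark, that is, the class that contains main() and creates the SparkContext.

--master local[8]
Sets the master URL. local[8] runs Spark locally with 8 worker threads; the accepted formats are listed under Master URLs below.

/path/to/examples.jar
The path to the jar being submitted.

100
The argument passed to main(String[] args) of the entry-point class. It can be omitted if the main method takes no arguments.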

# Run on a Spark standalone cluster in client deploy mode

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G  \
  --total-executor-cores 100 \
  /path/to/examples.jar  \
  1000

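Translator's note: the spark-submit command above is annotated below.
--class org.apache.spark.examples.SparkPi
The class that contains the main entry point.

--master spark://207.184.161.138:7077
The URL of the master node of the Spark standalone cluster.

--executor-memory 20G
The amount of memory given to each executor process.

--total-executor-cores 100
The total number of cores across all executors (not the number of executor processes).

/path/to/examples.jar
The path to the jar being submitted.

1000
The argument passed to the main method of the submitted jar.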

# Run on a Spark standalone cluster in cluster deploy mode with supervise

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster
# (--deploy-mode can be client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster

./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

Master URLs

The master URL passed to Spark can be in one of the following formats:

Master URL / Meaning

local
Run Spark locally with one worker thread (i.e. no parallelism at all).

local[K]
Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

local[*]
Run Spark locally with as many worker threads as logical cores on your machine.

spark://HOST:PORT
Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.

mesos://HOST:PORT
Connect to the given Mesos cluster. The port must be whichever one your Mesos master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://…. To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.

yarn
Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.

yarn-client
Equivalent to yarn with --deploy-mode client. The yarn form with --deploy-mode is preferred over 'yarn-client'.

yarn-cluster
Equivalent to yarn with --deploy-mode cluster. The yarn form with --deploy-mode is preferred over 'yarn-cluster'.
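The formats in the table can be told apart with a simple pattern match. The sketch below is only an illustration of the table, not how Spark itself parses the value; describe_master is a hypothetical helper.

```shell
# Classify a master URL according to the table above.
describe_master() {
  case "$1" in
    local)                         echo "local, 1 worker thread" ;;
    'local[*]')                    echo "local, one thread per logical core" ;;
    'local['*']')                  echo "local, fixed number of worker threads" ;;
    spark://*)                     echo "Spark standalone master" ;;
    mesos://*)                     echo "Mesos cluster" ;;
    yarn|yarn-client|yarn-cluster) echo "YARN cluster" ;;
    *)                             echo "unrecognized master URL" ;;
  esac
}

describe_master 'local[8]'
describe_master spark://207.184.161.138:7077
```

The quoted 'local[*]' pattern is listed before the general 'local['*']' pattern so that the literal asterisk form is matched first.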

Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application.

By default it will read options from conf/spark-defaults.conf in the Spark directory.

For more detail, see the section on loading default configurations.

Loading default Spark configurations this way can obviate the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit.

In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
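The precedence order can be pictured as "take the first value that is set". The sketch below imitates that rule for spark.master with a throwaway defaults file; it does not call Spark. read_default and resolve_master are hypothetical helpers, and the values in the file are placeholders.

```shell
# Return the value of a key from a spark-defaults.conf style file
# (whitespace-separated "key value" lines).
read_default() {
  awk -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# Pick the first non-empty value: SparkConf, then flag, then defaults file.
resolve_master() {
  conf_value=$1; flag_value=$2; defaults_file=$3
  if [ -n "$conf_value" ]; then echo "$conf_value"
  elif [ -n "$flag_value" ]; then echo "$flag_value"
  else read_default spark.master "$defaults_file"
  fi
}

# Write a throwaway defaults file for the demonstration.
DEFAULTS=$(mktemp)
cat > "$DEFAULTS" <<'EOF'
# Example defaults (placeholder values)
spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
EOF

resolve_master "" "" "$DEFAULTS"            # nothing set elsewhere: the file is used
resolve_master "" "local[4]" "$DEFAULTS"    # the --master flag wins over the file
```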

If you are ever unclear where configuration options are coming from, you can print out fine-grained debugging information by running spark-submit with the --verbose option.

Advanced Dependency Management

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster.

Spark uses the following URL schemes to allow different strategies for disseminating jars:

file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - These pull down files and JARs from the URI as expected.
local: - A URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
(Translator's note: storing large files redundantly on that many nodes has a cost of its own, much like the space/time trade-off in algorithms.)
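The three strategies above can be summarized as a lookup on the URI prefix. This is only a reading aid built from the list; describe_scheme is a hypothetical helper and the paths are made up.

```shell
# Map a jar/file URI to the distribution strategy described above.
describe_scheme() {
  case "$1" in
    file:*|/*)                   echo "served by the driver's HTTP file server" ;;
    hdfs:*|http:*|https:*|ftp:*) echo "pulled by executors from the URI" ;;
    local:*)                     echo "expected on every worker, no network IO" ;;
    *)                           echo "unrecognized scheme" ;;
  esac
}

describe_scheme /opt/jobs/dep.jar
describe_scheme hdfs://namenode/libs/dep.jar
describe_scheme local:/opt/libs/dep.jar
```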

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.

This can use up a significant amount of disk space over time and will need to be cleaned up.

With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.
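For the standalone case, a sketch of what such a setting might look like in a worker's conf/spark-env.sh follows. It assumes, based on the standalone-mode documentation rather than this page, that worker properties are passed through SPARK_WORKER_OPTS and that cleanup must also be switched on with spark.worker.cleanup.enabled; the three-day TTL is an arbitrary example value.

```shell
# Fragment for conf/spark-env.sh on each worker (values are illustrative).
TTL_SECONDS=$((3 * 24 * 3600))   # keep application directories for three days
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=$TTL_SECONDS"
export SPARK_WORKER_OPTS
echo "$SPARK_WORKER_OPTS"
```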

Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with --packages.

All transitive dependencies will be handled when using this command.

Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag --repositories.
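A small sketch of assembling the --packages value follows. It assumes coordinates take the usual groupId:artifactId:version form; build_packages is a hypothetical helper, and the org.example coordinates are placeholders rather than real artifacts.

```shell
# Join well-formed Maven coordinates (groupId:artifactId:version) with commas,
# skipping anything that does not have exactly three non-empty fields.
build_packages() {
  list=""
  for c in "$@"; do
    case "$c" in
      *:*:*:*|*,*) echo "skipping malformed coordinate: $c" >&2; continue ;;
      ?*:?*:?*)    ;;
      *)           echo "skipping malformed coordinate: $c" >&2; continue ;;
    esac
    list="${list:+$list,}$c"
  done
  printf '%s\n' "$list"
}

PACKAGES=$(build_packages org.example:lib-a:1.0 org.example:lib-b:2.1)
echo "spark-submit --packages $PACKAGES ..."
```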

These commands can be used with pyspark, spark-shell, and spark-submit to include Spark Packages.

For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.

More Information

Once you have deployed your application, the cluster mode overview describes the components involved in distributed execution, and how to monitor and debug applications.

end

Kylin's Blog

Kylin27@outlook.com