Spark On Execo
Apache Spark is a general-purpose cluster computing framework that provides significant performance gains over Hadoop MapReduce. Spark is a good alternative for executing parallel computations on the Grid5000 platform, but its deployment, configuration and tuning may be time-consuming. Spark on Execo provides an abstraction layer on top of Execo that makes it easy to manage Spark clusters. The script and classes that manage Spark clusters belong to the
hadoop_g5k project, available here.
More information about the installation and deployment of other Hadoop components with
hadoop_g5k can be found in the wiki page of Hadoop On Execo.
Spark Cluster Management
Hadoop_g5k provides support for Apache Spark both through a Python class,
SparkCluster, and through a command-line script, spark_g5k.
SparkCluster exposes several useful methods to manage Spark clusters and link them to other Hadoop components, such as its distributed filesystem HDFS or its resource manager YARN. The documentation about the API can be found at Read the Docs.
spark_g5k Script Basic Usage
spark_g5k is a script providing a command-line interface to manage a Spark cluster. As with the other scripts, the available options can be shown with the help command:
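The exact invocation may vary with the installed version; a typical way to display the options would be:

```shell
# Show spark_g5k's command-line options (flag name assumed: -h/--help)
spark_g5k -h
```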
Apache Spark can be deployed in three modes: standalone, on top of Apache Mesos, and on top of Hadoop YARN.
spark_g5k supports the standalone and Hadoop YARN deployments. Here we present a basic usage of the YARN mode.
First, we need to reserve a set of nodes in Grid5000:
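For instance, an interactive reservation of four nodes for two hours with oarsub might look like this; the exact options depend on your site and job type:

```shell
# Reserve 4 nodes for 2 hours, allowing direct SSH to the nodes
oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2:00:00
```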
As we are going to use Spark on top of Hadoop, we first deploy a Hadoop cluster. More information about how to deploy a Hadoop cluster can be found in Hadoop On Execo.
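A Hadoop deployment with hadoop_g5k's hg5k script might look like the following sketch; the flag names are assumptions based on the project's typical usage and should be checked against hg5k's help:

```shell
# Create the Hadoop cluster on the reserved nodes (OAR exports the
# node list in $OAR_NODEFILE), install the binaries, configure the
# cluster, and start the services (flags assumed)
hg5k --create $OAR_NODEFILE
hg5k --bootstrap /path/to/hadoop.tar.gz
hg5k --initialize --start
```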
Suppose that our Hadoop cluster has been assigned the id 1. Now we can create a Spark cluster linked to it. We use the
--hid option to refer to the Hadoop cluster. The nodes for the Spark cluster are going to be the same as those of the Hadoop cluster.
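With the id of the Hadoop cluster at hand, the creation command might look like this (the YARN argument and flag names are assumptions based on the option described above):

```shell
# Create a Spark cluster in YARN mode, linked to Hadoop cluster 1
spark_g5k --create YARN --hid 1
```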
Now we need to install Apache Spark on all the nodes of the cluster. We provide a path to the binaries; in this example we use a version publicly available in all the sites of Grid5000.
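Assuming a --bootstrap option analogous to hg5k's, the installation might be (the path shown is illustrative, not the actual Grid5000 location):

```shell
# Install the Spark binaries on all the nodes of the cluster
spark_g5k --bootstrap /path/to/spark.tgz
```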
Once installed, we need to initialize the cluster. This action will configure the nodes depending both on the parameters specified through the configuration file (if any) and the characteristics of the machines of the cluster. We can also start the services in the same command:
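Both actions can presumably be combined in a single invocation, e.g.:

```shell
# Configure the nodes and start the Spark services in one command
spark_g5k --initialize --start
```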
Now our cluster is ready to execute jobs or to be accessed through the shell. We start a Spark shell in Python with the following command:
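Assuming a --shell option that takes the shell flavour as an argument, this could be:

```shell
# Open a Python Spark shell on the cluster, using ipython (flags assumed)
spark_g5k --shell ipython
```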
Note that we specified
ipython to get the additional features provided by this shell.
When we are done, we should delete the clusters in order to remove all temporary files created during execution.
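A cleanup sketch, assuming --delete options symmetric to the creation commands:

```shell
# Remove the Spark cluster first, then the Hadoop cluster,
# deleting the temporary files created on the nodes (flags assumed)
spark_g5k --delete
hg5k --delete
```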