Spark On Execo

Apache Spark is a general-purpose cluster computing framework providing significant performance gains over Hadoop MapReduce. Spark is a good alternative for executing parallel computations on the Grid5000 platform, but its deployment, configuration and tuning may be time-consuming. Spark on Execo provides an abstraction layer, built on Execo, that makes it easy to manage Spark clusters. The script and classes that manage Spark clusters belong to the hadoop_g5k project, available here.

More information about the installation and deployment of other Hadoop components with hadoop_g5k can be found on the Hadoop On Execo wiki page.

Spark Cluster Management

Hadoop_g5k provides support for Apache Spark through both a Python class, SparkCluster, and a command-line script, spark_g5k.

SparkCluster exposes several useful methods to manage Spark clusters and link them to other Hadoop components, such as the HDFS distributed filesystem or the YARN resource manager. The documentation about the API can be found at Read the Docs.
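As a quick illustration, the command-line workflow shown later on this page can also be scripted directly in Python. The following is only a minimal sketch: the import paths, constructor parameters and method names are assumptions based on that workflow and should be checked against the API documentation above.

 from hadoop_g5k import HadoopV2Cluster                            # assumed import path
 from hadoop_g5k.ecosystem.spark import SparkCluster, YARN_MODE    # assumed import path

 hosts = ["node-1.site.grid5000.fr", "node-2.site.grid5000.fr"]    # hypothetical reserved nodes

 # Deploy a Hadoop 2.x cluster first (HDFS + YARN).
 hadoop = HadoopV2Cluster(hosts)
 hadoop.bootstrap("/home/mliroz/public/sw/hadoop/hadoop-2.4.0.tar.gz")
 hadoop.initialize()
 hadoop.start()

 # Create a Spark cluster on top of YARN, reusing the Hadoop nodes.
 spark = SparkCluster(YARN_MODE, hadoop_cluster=hadoop)            # assumed constructor signature
 spark.bootstrap("/home/mliroz/public/sw/spark/spark-1.2.0-bin-hadoop2.4.tgz")
 spark.initialize()
 spark.start()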

spark_g5k Script Basic Usage

spark_g5k is a script providing a command-line interface to manage a Spark cluster. As with the other hadoop_g5k scripts, the available options can be listed with the help option:

Terminal.png frontend:
spark_g5k -h

Apache Spark can be deployed in three modes: standalone, on top of Apache Mesos, and on top of Hadoop YARN. spark_g5k supports the standalone and Hadoop YARN deployments. Here we present a basic usage of the YARN mode.
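For reference, the deployment mode is selected with the --create option. A standalone cluster (which does not require an underlying Hadoop cluster) would be created roughly as shown below; the exact mode keyword and the way the node list is passed are assumptions here and should be checked with spark_g5k -h. The rest of this page uses the YARN mode.

Terminal.png node:
spark_g5k --create STANDALONE $OAR_FILE_NODES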

First, we need to reserve a set of nodes with the oarsub command.

Terminal.png frontend:
oarsub -I -t allow_classic_ssh -l nodes=4,walltime=2

As we are going to use Spark on top of Hadoop, we first deploy a Hadoop cluster. More information about how to deploy a Hadoop cluster can be found in Hadoop On Execo.

Terminal.png node:
hg5k --create $OAR_FILE_NODES --version 2
Terminal.png node:
hg5k --bootstrap /home/mliroz/public/sw/hadoop/hadoop-2.4.0.tar.gz
Terminal.png node:
hg5k --initialize --start

Suppose that our Hadoop cluster has been assigned id 1. Now we can create a Spark cluster linked to it, using the --hid option to refer to the Hadoop cluster. The Spark cluster will use the same nodes as the Hadoop cluster.

Terminal.png node:
spark_g5k --create YARN --hid 1

Now we need to install Apache Spark on all the nodes of the cluster. We provide a path to the binaries. In this example we are using a version publicly available on all Grid5000 sites.

Terminal.png node:
spark_g5k --bootstrap /home/mliroz/public/sw/spark/spark-1.2.0-bin-hadoop2.4.tgz

Once installed, we need to initialize the cluster. This action configures the nodes depending both on the parameters specified in the configuration file (if any) and on the characteristics of the cluster's machines. We can also start the services in the same command:

Terminal.png node:
spark_g5k --initialize --start

Now our cluster is ready to execute jobs or to be accessed through the shell. We start a Spark shell in Python with the following command:

Terminal.png node:
spark_g5k --shell ipython

Note that we specified ipython to get the additional features provided by this shell.
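As a quick sanity check inside the shell, the SparkContext created by the PySpark shell (available as sc) can be used to run a small computation on the cluster, for example a word count over an in-memory collection:

 # `sc` is the SparkContext pre-configured by the pyspark shell.
 lines = sc.parallelize(["spark on execo", "spark on grid5000", "execo"])
 counts = (lines.flatMap(lambda l: l.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
 print(counts.collect())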

When we are done, we should delete the clusters in order to remove all temporary files created during execution.

Terminal.png node:
spark_g5k --delete
Terminal.png node:
hg5k --delete