Run Hadoop On Grid'5000
From Grid5000
|
Running Hadoop on Grid'5000
The main goal with this tutorial is present possible setup cases of Apache Hadoop Framework on Grid'5000, which are distinguished by difficult levels and time of configuration. Also are goals, facilitating and encouraging the use of this MapReduce implementation, as well as providing a GNU/Linux environment preconfigured for Hadoop and two scripts to configure and/or installing it on this platform.
Through this tutorial are proposed two solutions for possible setup cases of Hadoop on Grid'5000, which follow:
| Setup Case | Description | Difficulty | Time |
|---|---|---|---|
| Setup Case 01 | Hadoop image + config.sh | Easy | 5min |
| Setup Case 02 | Hadoop image from scratch | Average | 50min |
Setup cases
In this tutorial, are proposed two ways to configure Hadoop on Grid'5000, which reach the same result: the framework installed and running on machines reserved for the job. Definitely, these aren't the only two ways to obtain this goal, but they are functional proposals that varying in time for the configuration and at the number of commands to be executed. The first setup case is the fastest to reach the result, in this way is the most suitable for rapid tests with little customization. The second allows the use of other environments available on Grid'5000, but also accomplishes the complete installation process of Hadoop, that means a little more time than in previous mode. Finally, the third mode is the most time consuming, but is also the most customizable, in which are listed all the commands needed for the Hadoop installation from scratch. Following are the cases of configuration and the commands needed for this task in each of them.
Setup case 01 - Hadoop image + config.sh
In this first setup case, is proposed the use of a GNU/Linux image, preconfigured to use Hadoop. This preconfiguration is nothing more than Hadoop's files allocated in their proper places and some basic configuration performed, as for example, the definition of mapred.job.tracker property in mapred-site.xml file. It's also proposed a script that finalizes the configuration, after the allocated machines being work with Hadoop environment. This script assigns the nodes to their functions (masters and slaves), and fills some XML files in the configuration folder of Hadoop. To deploy the proposed environment and configure the Hadoop with this script, follow the instructions below:
Connect to the access machine of your site.
Connect to the frontend machine of your site, if it exists.
Allocate, with OAR, the quantity of nodes you will need for Hadoop job, with determined walltime. Simply indicate the quantity by replacing NUM_HOSTS and walltime (HH:MM:SS) desired.
Deploy Hadoop environment and run the script at the nodes allocated for the job.
frontend:
| kadeploy3 -a ~pcosta/hadoop/0.21.0/squeeze-x64-nfs-hadoop.dsc3 -f $OAR_FILE_NODES -k ~/.ssh/id_rsa.pub -s ~pcosta/hadoop/0.21.0/config.sh |
or
frontend:
| kadeploy3 -a ~pcosta/hadoop/0.20.1/squeeze-x64-nfs-hadoop.dsc3 -f $OAR_FILE_NODES -k ~/.ssh/id_rsa.pub -s ~pcosta/hadoop/0.20.1/config.sh |
At the end, the script will return the master node host, where Hadoop is configured and running. Then you are able to connect at this node to run your Hadoop jobs.
Setup case 02 - Hadoop image from scratch
Connect to the access machine of your site.
Connect to the frontend machine of your site, if it exists.
Allocate, with OAR, one node to customize the environment which you will choose and walltime equal to three hours.
Deploy the environment of your choice.
Connect to the reserved node as root. In most cases, the root password is 'grid5000'.
Create your own user and/or group in this image, if it doesn't exist yet. Or if you prefer, create a hadoop user/group in this image.
| Note | |
|---|---|
If you want to create a local hadoop user, you can use '-d' or '--home' option to change the Hadoop user's home. E.g. | |
Install the version 1.6.X of Java in your environment. E.g. In debian like:
Connect in same node as your created user.
Create the Hadoop folder.
Get the last version of Hadoop.
Unpack Hadoop package in this folder.
Fill the /opt/hadoop-0.21.0/conf/core-site.xml file with hadoop.tmp.dir and fs.default.name properties.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-\${user.name}</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://node.site.grid5000.fr:54310</value>
</property>
</configuration>
Fill the /opt/hadoop-0.21.0/conf/hdfs-site.xml file, with dfs.replications property.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>NUMBER_OF_REPLICATIONS</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
Fill the /opt/hadoop-0.21.0/conf/mapred-site.xml file with mapred.job.tracker property.
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>mapred.job.tracker</name> <value>hdfs://node.site.grid5000.fr:54311</value> <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description> </property> </configuration>
Fill /opt/hadoop-0.21.0/conf/masters file with the host of master node.
node01.site.grid5000.fr
Fill /opt/hadoop-0.21.0/conf/slaves file with the list of deployed nodes.
node01.site.grid5000.frnode02.site.grid5000.frnode03.site.grid5000.fr ...nodeNN.site.grid5000.fr
At this point you may save your own environment, as showed in the tutorial Deploy environment-OAR2.
The new Hadoop environment is ready, however modifications are specifics for each oarsub requisition. It means, that for each reservation, you have to setup the properties at /opt/hadoop-0.21.0/conf files, as did previously, and all nodes have to be with same configured files.
After setup
After setup being ready, you can run your own applications or Hadoop examples. A quick example to execute a test is the wordcount. To run it just perform the following steps:
Connect to the master node.
Copy to HDFS some example book files.
Run the wordcount example JAR.
master:
| /opt/hadoop-0.21.0/bin/hadoop jar /opt/hadoop-0.21.0/hadoop-mapred-examples-0.21.0.jar wordcount gutenberg gutenberg-output |
Script config.sh
#!/bin/bash
echo "/**"
echo " * Beggining the Hadoop setup proccess with config.sh script"
echo " */"
USER_LOGIN_ID=$(id -u)
USER_LOGIN_NAME=$(id -un)
USER_GROUP_ID=$(id -g)
USER_GROUP_NAME=$(id -gn)
DATE=$(date +%s)
HADOOP_PATH='/opt/hadoop-0.21.0'
HADOOP_CONF_PATH=$HADOOP_PATH/'conf'
TMP_DIR=/tmp/$DATE'-'$$'-hadoop'
ROOT='root'
NR_REDUCE_TASKS=2
DEBUG_LEVEL='ALL'
HDFS_REPLICA=3
NUM_HOSTS=$(cat $OAR_FILE_NODES | uniq | wc -l)
SLAVES=$(cat $OAR_FILE_NODES | uniq)
MASTERS=$(head -n 1 $OAR_FILE_NODES)
mkdir $TMP_DIR
cat > $TMP_DIR/masters << EOF
$MASTERS
EOF
cat > $TMP_DIR/slaves << EOF
$SLAVES
EOF
#Create/Modify "hdfs-site.xml" file
cat > $TMP_DIR/hdfs-site.xml << EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>$HDFS_REPLICA</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>
EOF
#Create/Modify "core-site.xml" file
cat > $TMP_DIR/core-site.xml << EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-\${user.name}</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://$MASTERS:54310</value>
</property>
</configuration>
EOF
#Create/Modify "mapred-site.xml" file
cat > $TMP_DIR/mapred-site.xml << EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hdfs://$MASTERS:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>$NR_REDUCE_TASKS</value>
</property>
<property>
<name>mapred.local.dir</name>
<value>$HADOOP_PATH/temp</value>
</property>
</configuration>
EOF
if [ -d $HOME/.ssh ]
then
cd $HOME/.ssh/
else
mkdir $HOME/.ssh
cd $HOME/.ssh/
fi
eval $(/usr/bin/ssh-agent -s)
if [ -f id_rsa -a -f id_rsa.pub ]
then
echo "/**"
echo " * The user has a RSA key pair."
echo " */"
ssh-add id_rsa
echo $(cat id_rsa.pub) >> authorized_keys
echo "StrictHostKeyChecking no" > config
echo "VerifyHostKeyDNS yes" >> config
else
if [ -f id_dsa -a -f id_dsa.pub ]
then
echo "/**"
echo " * The user has a DSA key pair."
echo " */"
ssh-add id_dsa
echo $(cat id_dsa.pub) >> authorized_keys
echo "StrictHostKeyChecking no" > config
echo "VerifyHostKeyDNS yes" >> config
else
echo "/**"
echo " * Creating a RSA key pair."
echo " */"
ssh-keygen -t rsa -f $HOME/.ssh/id_rsa -P ""
ssh-add id_rsa
echo $(cat id_rsa.pub) >> authorized_keys
echo "StrictHostKeyChecking no" > config
echo "VerifyHostKeyDNS yes" >> config
fi
fi
for SLAVE in $SLAVES
do
scp $HOME/.ssh/config $ROOT@$SLAVE:/$ROOT/.ssh/
scp $HOME/.ssh/id_rsa.pub $ROOT@$SLAVE:/$ROOT/.ssh/
ssh $ROOT@$SLAVE chown -R $USER_LOGIN_NAME:$USER_GROUP_NAME $HADOOP_PATH
scp $TMP_DIR/* $USER_LOGIN_NAME@$SLAVE:$HADOOP_CONF_PATH
done
rm -rf $TMP_DIR/
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/stop-dfs.sh
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/stop-mapred.sh
ssh $USER_LOGIN_NAME@$MASTERS rm -rf $HADOOP_PATH/logs/*
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/hadoop namenode -format
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/start-dfs.sh
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/start-mapred.sh
echo 'Masternode: '$MASTERS;
echo 'Slavenode: '$SLAVES;
