Run Hadoop On Grid'5000


Running Hadoop on Grid'5000

The main goal of this tutorial is to present possible setup cases of the Apache Hadoop framework on Grid'5000, distinguished by difficulty level and configuration time. Further goals are to facilitate and encourage the use of this MapReduce implementation, and to provide a GNU/Linux environment preconfigured for Hadoop, along with two scripts to configure and/or install it on this platform.

This tutorial proposes two solutions for setting up Hadoop on Grid'5000:

Setup Case    | Description               | Difficulty | Time
Setup Case 01 | Hadoop image + config.sh  | Easy       | 5 min
Setup Case 02 | Hadoop image from scratch | Average    | 50 min

Setup cases

In this tutorial, two ways to configure Hadoop on Grid'5000 are proposed, both reaching the same result: the framework installed and running on the machines reserved for the job. These are certainly not the only ways to reach this goal, but they are functional proposals that vary in configuration time and in the number of commands to be executed. The first setup case is the fastest, which makes it the most suitable for rapid tests with little customization. The second is more time consuming, but also the most customizable: it allows the use of other environments available on Grid'5000 and lists all the commands needed to install Hadoop from scratch. The setup cases, and the commands needed in each of them, follow.

Setup case 01 - Hadoop image + config.sh

In this first setup case, the use of a GNU/Linux image preconfigured for Hadoop is proposed. This preconfiguration is nothing more than Hadoop's files placed in their proper locations and some basic configuration already performed, for example the definition of the mapred.job.tracker property in the mapred-site.xml file. A script is also provided that finalizes the configuration once the allocated machines are running the Hadoop environment. This script assigns the nodes to their roles (masters and slaves) and fills in some XML files in Hadoop's configuration folder (the full script is listed in the Script config.sh section below). To deploy the proposed environment and configure Hadoop with this script, follow the instructions below:

Connect to the access machine of your site.

Terminal.png outside:
ssh login@access.site.grid5000.fr

Connect to the frontend machine of your site, if it exists.

Terminal.png access:
ssh frontend

Allocate, with OAR, the number of nodes you will need for the Hadoop job, with the desired walltime. Simply replace NUM_HOSTS with the number of nodes and HH:MM:SS with the desired walltime.

Terminal.png frontend:
oarsub -I -t deploy -l nodes=NUM_HOSTS,walltime=HH:MM:SS
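
For example, to reserve four nodes for two hours:

Terminal.png frontend:
oarsub -I -t deploy -l nodes=4,walltime=02:00:00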

Deploy the Hadoop environment and run the script on the nodes allocated for the job.

Terminal.png frontend:
kadeploy3 -a ~pcosta/hadoop/0.21.0/lenny-x64-nfs-hadoop.dsc3 -f $OAR_FILE_NODES -k ~/.ssh/id_rsa.pub -s ~pcosta/hadoop/0.21.0/config.sh

or

Terminal.png frontend:
kadeploy3 -a ~pcosta/hadoop/0.20.1/lenny-x64-nfs-hadoop.dsc3 -f $OAR_FILE_NODES -k ~/.ssh/id_rsa.pub -s ~pcosta/hadoop/0.20.1/config.sh
Warning.png Warning

~pcosta/hadoop/0.21.0/config.sh is buggy: extra spaces at the beginning of the first lines break the script (invalid shebang)

At the end, the script will print the master node's hostname, where Hadoop is configured and running. You can then connect to this node to run your Hadoop jobs.

Note.png Note

Alternatively, you can run the script by hand after the kadeploy command completes. Please note that the OAR_FILE_NODES environment variable must be defined in the shell where you run the script. Export it if you are not in the shell created by your oarsub command.
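
For example, assuming the job is still running, you can reattach to its shell (where OAR_FILE_NODES is defined), replacing JOB_ID with the job number printed by oarsub:

Terminal.png frontend:
oarsub -C JOB_ID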

Warning.png Warning

Please note that the script overwrites the content of your ~/.ssh/ directory...

Setup case 02 - Hadoop image from scratch

Connect to the access machine of your site.

Terminal.png outside:
ssh login@access.site.grid5000.fr

Connect to the frontend machine of your site, if it exists.

Terminal.png access:
ssh frontend

Allocate, with OAR, one node on which to customize the environment of your choice, with a walltime of three hours.

Terminal.png frontend:
oarsub -I -t deploy -l nodes=1,walltime=03:00:00

Deploy the environment of your choice.

Terminal.png frontend:
kadeploy3 -e environment-name -f $OAR_FILE_NODES -k ~/.ssh/id_rsa.pub
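
For example, assuming the lenny-x64-nfs reference environment is available on your site:

Terminal.png frontend:
kadeploy3 -e lenny-x64-nfs -f $OAR_FILE_NODES -k ~/.ssh/id_rsa.pub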

Connect to the reserved node as root. In most cases, the root password is 'grid5000'.

Terminal.png frontend:
ssh root@node.site.grid5000.fr

Create your own user and/or group in this image, if they don't exist yet; or, if you prefer, create a dedicated hadoop user/group in this image.

Terminal.png node:
addgroup --gid GROUP_ID GROUP_NAME
Terminal.png node:
adduser --uid USER_ID --ingroup GROUP_NAME USER_NAME
Note.png Note

If you want to create a local hadoop user, you can use the '--home' option (or useradd's '-d') to change the Hadoop user's home. E.g. adduser --home /var/hadoop/ hadoop

Install version 1.6.x of Java in your environment, e.g. on a Debian-like system:

Terminal.png node:
apt-get update

apt-get upgrade

apt-get install sun-java6-jdk
Note.png Note

You can, at this point, install other packages if you want. E.g. subversion, pssh
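
To check which Java the system will pick up (if several JDKs are installed, update-alternatives --config java lets you select the Sun one):

Terminal.png node:
java -version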

Connect to the same node as your newly created user.

Terminal.png frontend:
ssh login@node.site.grid5000.fr

Create the Hadoop folder (as /opt is usually owned by root, you may need to create it as root and chown it to your user).

Terminal.png node:
mkdir /opt/hadoop-0.21.0

Get the latest version of Hadoop.

Unpack the Hadoop package into this folder (the tarball contains a top-level hadoop-X.XX.X directory, hence the --strip-components option).

Terminal.png node:
tar -xzf /opt/hadoop-0.21.0/hadoop-X.XX.X.tar.gz --strip-components=1 -C /opt/hadoop-0.21.0
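
Hadoop's startup scripts need JAVA_HOME to be set. A minimal sketch, assuming the Sun JDK installed above lives under /usr/lib/jvm/java-6-sun (the usual location on Debian):

Terminal.png node:
echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> /opt/hadoop-0.21.0/conf/hadoop-env.sh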

Fill the /opt/hadoop-0.21.0/conf/core-site.xml file with the hadoop.tmp.dir and fs.default.name properties.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
 </property>
 <property>
  <name>fs.default.name</name>
  <value>hdfs://node.site.grid5000.fr:54310</value>
 </property>
</configuration>

Fill the /opt/hadoop-0.21.0/conf/hdfs-site.xml file with the dfs.replication property.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property> 
  <name>dfs.replication</name>
  <value>NUMBER_OF_REPLICATIONS</value>
  <description>Default block replication.
   The actual number of replications can be specified when the file is created.
   The default is used if replication is not specified in create time.
  </description>
 </property>
</configuration>

Fill the /opt/hadoop-0.21.0/conf/mapred-site.xml file with the mapred.job.tracker property.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
  <name>mapred.job.tracker</name>
  <value>hdfs://node.site.grid5000.fr:54311</value>
  <description>The host and port that the MapReduce job tracker runs
   at.  If "local", then jobs are run in-process as a single map
   and reduce task.
  </description>
 </property>
</configuration>

Fill the /opt/hadoop-0.21.0/conf/masters file with the hostname of the master node.

node01.site.grid5000.fr

Fill the /opt/hadoop-0.21.0/conf/slaves file with the list of deployed nodes.

node01.site.grid5000.fr
node02.site.grid5000.fr
node03.site.grid5000.fr
...
nodeNN.site.grid5000.fr

At this point you may save your own environment, as shown in the Deploy environment-OAR2 tutorial.

The new Hadoop environment is now ready; however, some settings are specific to each oarsub reservation. This means that for each reservation you have to set up the properties in the /opt/hadoop-0.21.0/conf files as done previously, and all nodes must have the same configuration files.
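
For example, a minimal sketch that pushes the configuration files from the node where you edited them to every node listed in the slaves file (assuming passwordless SSH between the nodes, as set up by config.sh):

for NODE in $(cat /opt/hadoop-0.21.0/conf/slaves); do
   scp /opt/hadoop-0.21.0/conf/*.xml /opt/hadoop-0.21.0/conf/masters /opt/hadoop-0.21.0/conf/slaves $NODE:/opt/hadoop-0.21.0/conf/
done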

After setup

Once the setup is ready, you can run your own applications or the Hadoop examples. A quick test is the wordcount example. To run it, just perform the following steps:

Connect to the master node.

Terminal.png frontend:
ssh login@node01.site.grid5000.fr

Copy some example book files to HDFS.

Terminal.png master:
/opt/hadoop-0.21.0/bin/hdfs dfs -copyFromLocal ~pcosta/hadoop/0.21.0/gutenberg gutenberg

Run the wordcount example JAR.

Terminal.png master:
/opt/hadoop-0.21.0/bin/hadoop jar /opt/hadoop-0.21.0/hadoop-mapred-examples-0.21.0.jar wordcount gutenberg gutenberg-output

Fetch the result files from HDFS.

Terminal.png master:
/opt/hadoop-0.21.0/bin/hdfs dfs -copyToLocal gutenberg-output /tmp
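
You can then glance at the word counts (the part file name below follows the usual MapReduce output naming and may differ between Hadoop versions):

Terminal.png master:
head /tmp/gutenberg-output/part-r-00000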

Script config.sh

#!/bin/bash
 
echo "/**"
echo " * Beggining the Hadoop setup proccess with config.sh script"
echo " */"

USER_LOGIN_ID=$(id -u)
USER_LOGIN_NAME=$(id -un)
USER_GROUP_ID=$(id -g)
USER_GROUP_NAME=$(id -gn)

DATE=$(date +%s)

HADOOP_PATH='/opt/hadoop-0.21.0'
HADOOP_CONF_PATH=$HADOOP_PATH/'conf'
TMP_DIR=/tmp/$DATE'-'$$'-hadoop'

ROOT='root'
NR_REDUCE_TASKS=2
DEBUG_LEVEL='ALL'
HDFS_REPLICA=3

# Build the node lists from the OAR node file (one line per core, hence uniq)
NUM_HOSTS=$(cat $OAR_FILE_NODES | uniq | wc -l)
SLAVES=$(cat $OAR_FILE_NODES | uniq)
MASTERS=$(head -n 1 $OAR_FILE_NODES)

mkdir $TMP_DIR

cat > $TMP_DIR/masters << EOF
$MASTERS
EOF

cat > $TMP_DIR/slaves << EOF
$SLAVES
EOF

#Create/Modify "hdfs-site.xml" file
cat > $TMP_DIR/hdfs-site.xml << EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property> 
    <name>dfs.replication</name>
    <value>$HDFS_REPLICA</value>
    <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
    </description>
  </property>
</configuration>
EOF

#Create/Modify "core-site.xml" file
cat > $TMP_DIR/core-site.xml << EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-\${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$MASTERS:54310</value>
  </property>
</configuration>
EOF

#Create/Modify "mapred-site.xml" file
cat  > $TMP_DIR/mapred-site.xml << EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://$MASTERS:54311</value>
    <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.
    </description>
  </property>

  <property>
        <name>mapred.reduce.tasks</name>
        <value>$NR_REDUCE_TASKS</value>
   </property>
  <property>
        <name>mapred.local.dir</name>
        <value>$HADOOP_PATH/temp</value>
  </property>

</configuration>
EOF

# Make sure ~/.ssh exists, then start an SSH agent
if [ -d $HOME/.ssh ]
   then
      cd $HOME/.ssh/
   else
      mkdir $HOME/.ssh 
      cd $HOME/.ssh/
fi
eval $(/usr/bin/ssh-agent -s)

# Reuse an existing RSA or DSA key pair, or generate a new RSA one;
# authorize it and disable strict host key checking for node-to-node SSH
if [ -f id_rsa -a -f id_rsa.pub ]
   then
      echo "/**"
      echo " * The user has a RSA key pair."
      echo " */"
      ssh-add id_rsa 
      echo $(cat id_rsa.pub) >> authorized_keys
      echo "StrictHostKeyChecking no" > config
      echo "VerifyHostKeyDNS yes" >> config
else
      if [ -f id_dsa -a -f id_dsa.pub ]
         then
            echo "/**"
            echo " * The user has a DSA key pair."
            echo " */"
            ssh-add id_dsa
            echo $(cat id_dsa.pub) >> authorized_keys
            echo "StrictHostKeyChecking no" > config
            echo "VerifyHostKeyDNS yes" >> config
         else
            echo "/**"
            echo " * Creating a RSA key pair."
            echo " */"
            ssh-keygen -t rsa -f $HOME/.ssh/id_rsa -P ""
            ssh-add id_rsa 
            echo $(cat id_rsa.pub) >> authorized_keys
            echo "StrictHostKeyChecking no" > config
            echo "VerifyHostKeyDNS yes" >> config
      fi
fi

# Push the SSH config and the Hadoop configuration files to every node,
# and give the user ownership of the Hadoop tree
for SLAVE in $SLAVES
do
   scp $HOME/.ssh/config $ROOT@$SLAVE:/$ROOT/.ssh/
   scp $HOME/.ssh/id_rsa.pub $ROOT@$SLAVE:/$ROOT/.ssh/
   ssh $ROOT@$SLAVE chown -R $USER_LOGIN_NAME:$USER_GROUP_NAME $HADOOP_PATH

   scp $TMP_DIR/* $USER_LOGIN_NAME@$SLAVE:$HADOOP_CONF_PATH
done

rm -rf $TMP_DIR/

# Restart Hadoop from a clean state: stop the daemons, clear the logs,
# format the namenode, then start DFS and MapReduce
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/stop-dfs.sh
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/stop-mapred.sh
ssh $USER_LOGIN_NAME@$MASTERS rm -rf $HADOOP_PATH/logs/*
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/hadoop namenode -format
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/start-dfs.sh
ssh $USER_LOGIN_NAME@$MASTERS $HADOOP_PATH/bin/start-mapred.sh

echo 'Masternode: '$MASTERS;
echo 'Slavenodes: '$SLAVES;