Running MPI on Grid'5000
When attempting to run MPI on Grid'5000 you will be faced with a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at walking you through the most common use cases, which are:
- setting up and starting OpenMPI on a default environment using allow_classic_ssh
- setting up and starting OpenMPI on a default environment using oarsh
- setting up and starting OpenMPI on a kadeploy image
- setting up and starting OpenMPI to use high performance interconnect
Prerequisites
- Basic knowledge of MPI; if you don't know MPI, you can read: Grid_computation
- Get OpenMPI here: http://www.open-mpi.org/software/ompi/v1.4/
Overview
Currently, the default environment is not the same on every site, so you don't have the same version of OpenMPI everywhere. If you want to use OpenMPI for a grid experiment, you will have to install your own MPI version. You have two options:
- install OpenMPI in your home directory (but you should recompile and install it on all the sites you want to use; simply copying the compiled files may work, but it is not guaranteed)
- use the same kadeploy image and deploy it on the sites you want to use
If you are only interested in a single-site experiment, you may use the version provided by the default environment.
Using OpenMPI on a default environment
Compilation
- Make an interactive reservation and compile OpenMPI on a node (a sketch of the whole sequence follows this list):
Unarchive OpenMPI
Configure and compile:
- Install it in your home directory (in $HOME/openmpi/)
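A minimal sketch of this sequence, assuming OpenMPI 1.4.1 (matching the download URL above) was saved as a .tar.bz2 archive in your home directory; the walltime and build options are up to you:

# On the frontend: get an interactive reservation on one node
oarsub -I
# On the node: unarchive, configure and compile OpenMPI, then install it in $HOME/openmpi
tar xjf $HOME/openmpi-1.4.1.tar.bz2 -C /tmp
cd /tmp/openmpi-1.4.1
./configure --prefix=$HOME/openmpi
make -j4
make install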
Create a sample MPI program
- We will use a very basic MPI program to test OAR/MPI; create a file $HOME/src/mpi/tp.c and copy the following source code:
#include <stdio.h>
#include <unistd.h>   /* for gethostname() */
#include <mpi.h>

int main (int argc, char *argv [])
{
       char hostname[257];
       int size, rank;
       int bcast_value = 1;

       gethostname (hostname, sizeof hostname);
       MPI_Init (&argc, &argv);
       MPI_Comm_rank (MPI_COMM_WORLD, &rank);
       MPI_Comm_size (MPI_COMM_WORLD, &size);
       /* only the process of rank 0 changes the value before the broadcast */
       if (!rank) {
            bcast_value = 42;
       }
       MPI_Bcast (&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
       /* every process prints its hostname, rank, the process count and the received value */
       printf ("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush (stdout);

       MPI_Barrier (MPI_COMM_WORLD);
       MPI_Finalize ();
       return 0;
}
This program uses MPI to communicate between processes; the MPI process of rank 0 broadcasts an integer (value 42) to all the other processes. Then, each process prints its rank, the total number of processes, and the value it got from process zero.
Setting up and starting OpenMPI on a default environment using allow_classic_ssh
- Submit a job with the allow_classic_ssh type
- Compile your code
- Use this script to launch it (a sketch of the whole sequence follows this list)
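A sketch of these three steps; the two-node reservation is only an example, and a direct mpirun call is shown in place of a separate launch script:

# On the frontend: submit an interactive job with the allow_classic_ssh type
oarsub -I -t allow_classic_ssh -l nodes=2
# On the first node: compile the example with the OpenMPI installed in $HOME/openmpi
$HOME/openmpi/bin/mpicc $HOME/src/mpi/tp.c -o $HOME/src/mpi/tp
# Launch one process per core listed in the OAR node file, over plain ssh
$HOME/openmpi/bin/mpirun -machinefile $OAR_NODEFILE $HOME/src/mpi/tp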
Setting up and starting OpenMPI on a default environment using oarsh
oarsh is the default connector used when you reserve a node. To be able to use this connector, you need to add the option --mca plm_rsh_agent "oarsh" to mpirun.
node : $HOME/openmpi/bin/mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE $HOME/src/mpi/tp
multiple sites
In this practical session, we will do multi-site MPI with kadeploy. If you want to do it with the default environment, the following steps are required (a sketch follows this list):
- recompile OpenMPI on all the sites you want to use, in the same directory ($HOME/openmpi)
- recompile your MPI application on all the sites using this OpenMPI
- use oargridsub to reserve nodes on several sites
- build a node file using oargridstat -l
- launch mpirun from the first node of your nodefile, using this nodefile instead of $OAR_NODEFILE
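A hedged sketch of this workflow; the cluster names are examples, <grid_job_id> is the identifier printed by oargridsub, and the exact oargridstat invocation may differ between versions:

# On a frontend: reserve 2 nodes on each of two sites
oargridsub cluster1:rdef="nodes=2",cluster2:rdef="nodes=2"
# Build a node file from the grid reservation
oargridstat -l <grid_job_id> | sort -u > $HOME/gridnodes
# Then, from the first node of that file, launch:
$HOME/openmpi/bin/mpirun --mca plm_rsh_agent "oarsh" -machinefile $HOME/gridnodes $HOME/src/mpi/tp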
Setting up and starting OpenMPI on a kadeploy image
Building a kadeploy image
Note: you can skip this section and directly use the environment lenny-x64-openmpi available at sophia.
The default OpenMPI version available in Debian-based distributions is not compiled with high-performance libraries like Myrinet/MX, therefore we must recompile OpenMPI from source. Fortunately, the default images (lenny-x64-XXX) include all the libraries for high-performance interconnects, and OpenMPI will find them at compile time.
We will create a kadeploy image based on an existing one.
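A possible way to get a node running the base image; the environment name lenny-x64-nfs comes from the description file used below, while the kadeploy3 invocation is an assumption (older kadeploy versions use a different syntax):

# On the frontend: reserve one node of type deploy
oarsub -I -t deploy -l nodes=1
# Deploy the base environment and copy your ssh key to the node's root account
kadeploy3 -e lenny-x64-nfs -f $OAR_NODEFILE -k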
Then connect to the deployed node as root and install OpenMPI (a sketch of these steps follows):
Unarchive OpenMPI
Add gfortran, f2c and the BLAS library
Configure and compile
Create the image using tgz-g5k
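A sketch of the node-side steps, assuming OpenMPI 1.4.1 and Debian package names (gfortran, f2c, libblas-dev); the install prefix and the tgz-g5k destination path are also examples:

# On the deployed node, as root: unarchive OpenMPI
tar xjf openmpi-1.4.1.tar.bz2
cd openmpi-1.4.1
# Add the Fortran compiler, f2c and the BLAS library
apt-get install gfortran f2c libblas-dev
# Configure and compile; OpenMPI should pick up the MX/InfiniBand libraries of the image
./configure --prefix=/usr/local
make -j4 && make install
# Create the image with tgz-g5k
tgz-g5k /dev/shm/lenny-openmpi.tgz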
Copy the image on the frontend:
Copy the description file of lenny-x64-nfs
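A sketch of the two copy steps, run from the frontend; the node name placeholder, the image path on the node and the location of the lenny-x64-nfs description file are assumptions:

# Copy the image created on the node back to your home directory
scp root@<node>:/dev/shm/lenny-openmpi.tgz $HOME/lenny-openmpi.tgz
# Copy the description file of lenny-x64-nfs and rename it
cp /grid5000/descriptions/lenny-x64-nfs-2.0.dsc $HOME/lenny-openmpi.dsc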
Change the image name in the description file:
frontend : perl -i -pe "s@/grid5000/images/lenny-x64-nfs-2.0.tgz@$HOME/lenny-openmpi.tgz@" $HOME/lenny-openmpi.dsc
Using a kadeploy image
single site
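A hedged sketch of a single-site run with the image; the node count, paths and kadeploy3 invocation are examples, and it assumes OpenMPI from the image is in the PATH and that root can ssh between the deployed nodes:

# On the frontend: reserve nodes of type deploy and deploy your image
oarsub -I -t deploy -l nodes=2
kadeploy3 -a $HOME/lenny-openmpi.dsc -f $OAR_NODEFILE -k
# Build a node file with a single entry per node
uniq $OAR_NODEFILE > $HOME/mpinodes
# Connect to the first deployed node as root and run the test program
ssh root@$(head -1 $HOME/mpinodes)
mpirun -machinefile $HOME/mpinodes $HOME/src/mpi/tp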
multiple sites
Choose three clusters from 3 different sites.
frontend : oargridsub -t deploy cluster1:rdef="nodes=2",cluster2:rdef="nodes=2",cluster3:rdef="nodes=2"
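After this grid reservation, the deployment has to be done on each site; a hedged sketch, where <site_node_file> and <grid_job_id> are placeholders and the image and description file are assumed to have been copied to every site beforehand (home directories are not shared between sites):

# On the frontend of each site: deploy the image on that site's reserved nodes
kadeploy3 -a $HOME/lenny-openmpi.dsc -f <site_node_file> -k
# Build a grid-wide node file and launch mpirun from its first node, as root,
# using this file as the machinefile
oargridstat -l <grid_job_id> | sort -u > $HOME/gridnodes
mpirun -machinefile $HOME/gridnodes $HOME/src/mpi/tp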
Setting up and starting OpenMPI to use high performance interconnect
By default, OpenMPI tries to use any high-performance interconnect it can find. This only works if OpenMPI found the corresponding libraries at compile time (when OpenMPI itself was compiled, not your application). This should be the case if you built OpenMPI on a lenny-x64 environment.
We will use the NetPIPE tool to check whether the high-performance interconnect is really used; download it from this URL: http://www.scl.ameslab.gov/netpipe/code/NetPIPE-3.7.1.tar.gz
Unarchive NetPIPE
Change your PATH
Compile
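A sketch of these steps, assuming the NetPIPE 3.7.1 tarball from the URL above is in your home directory and OpenMPI is installed in $HOME/openmpi:

# Unarchive NetPIPE
tar xzf $HOME/NetPIPE-3.7.1.tar.gz
cd NetPIPE-3.7.1
# Put your OpenMPI first in the PATH so NetPIPE is built with it
export PATH=$HOME/openmpi/bin:$PATH
# Compile the MPI flavour of NetPIPE (produces the NPmpi binary)
make mpi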
Myrinet hardware:
Myrinet hardware is available on several sites (see the Hardware page):
- sophia (2G)
- rennes
- orsay (10G)
- lille (10G)
- bordeaux (2G and 10G)
- lyon (2G and 10G)
To reserve one core on two nodes with a Myrinet interconnect, either Myrinet 2G or Myrinet 10G:
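A hedged sketch; the OAR property names (myri2g, myri10g) are assumptions to check against the properties actually exposed on your site:

# Reserve one core on two nodes with Myrinet 2G...
oarsub -I -l /nodes=2/core=1 -p "myri2g='YES'"
# ...or with Myrinet 10G
oarsub -I -l /nodes=2/core=1 -p "myri10g='YES'"
# Then run NetPIPE between the two cores
$HOME/openmpi/bin/mpirun --mca plm_rsh_agent "oarsh" -machinefile $OAR_NODEFILE $HOME/NetPIPE-3.7.1/NPmpi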
You should get something like this:
  0:       1 bytes   4080 times -->      0.31 Mbps in      24.40 usec
  1:       2 bytes   4097 times -->      0.63 Mbps in      24.36 usec
...
122: 8388608 bytes      3 times -->    896.14 Mbps in   71417.13 usec
123: 8388611 bytes      3 times -->    896.17 Mbps in   71414.83 usec
The minimum latency is given by the last column for a 1 byte message; the maximum throughput is given by the last line, 896.17 Mbps in this case. Here a latency of 24 usec is far too high, which means Myrinet was not used as expected. This can happen if OpenMPI did not find the MX libraries during its compilation. You can check this with:
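A likely form of this check, using the ompi_info tool shipped with your OpenMPI build:

# List the MX-related components known to OpenMPI; no output means no MX support
$HOME/openmpi/bin/ompi_info | grep mx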
If the output is empty, there is no MX support built in.
Let's deploy our image then. Exit the current job:
Create a nodefile with a single entry per node:
Connect to the first node:
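A hedged sketch of these steps; the Myrinet property name and the kadeploy3 invocation are assumptions, and NPmpi may need to be recompiled against the OpenMPI of the deployed image so that its MX support is actually used:

# Exit the current job, then reserve two nodes of type deploy with Myrinet
oarsub -I -t deploy -l nodes=2 -p "myri2g='YES'"
# Deploy the image built earlier
kadeploy3 -a $HOME/lenny-openmpi.dsc -f $OAR_NODEFILE -k
# Create a nodefile with a single entry per node
uniq $OAR_NODEFILE > $HOME/mpinodes
# Connect to the first node as root and run NetPIPE again
ssh root@$(head -1 $HOME/mpinodes)
mpirun -machinefile $HOME/mpinodes $HOME/NetPIPE-3.7.1/NPmpi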
This time we have:
  0:       1 bytes  23865 times -->      2.03 Mbps in       3.77 usec
  1:       2 bytes  26549 times -->      4.05 Mbps in       3.77 usec
...
122: 8388608 bytes      3 times -->   1773.88 Mbps in   36079.17 usec
123: 8388611 bytes      3 times -->   1773.56 Mbps in   36085.69 usec
This time we have 3.77 usec, which is good, and almost 1.8 Gbps: we are using the Myrinet interconnect!
InfiniBand hardware:
InfiniBand hardware is available on several sites (see the Hardware page):
- rennes (10G)
- nancy (10G or 20G ??? FIXME)
- bordeaux (10G)
- grenoble (20G)
To reserve one core on two nodes with a 10G or 20G InfiniBand interconnect:
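A hedged sketch; the OAR property names (ib10g, ib20g) are assumptions to check against the properties exposed on your site:

# Reserve one core on two nodes with 10G InfiniBand...
oarsub -I -l /nodes=2/core=1 -p "ib10g='YES'"
# ...or with 20G InfiniBand
oarsub -I -l /nodes=2/core=1 -p "ib20g='YES'"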
Do exactly the same thing as for the Myrinet interconnect. To check whether support for InfiniBand is available in OpenMPI, run:
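A likely form of this check (the grep pattern matches the component name shown below):

# Look for the openib BTL component in the OpenMPI build
$HOME/openmpi/bin/ompi_info | grep openib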
you should see something like this:
MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)