Run MPI On Grid'5000
Note: This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.
Introduction
MPI is a programming interface that enables communication between processes of a distributed memory system. This tutorial focuses on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorial on MPI, see the IDRIS course on MPI. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the Open MPI implementation.
Before following this tutorial you should already have some basic knowledge of OAR (see the Getting Started tutorial). For the second part of this tutorial, you should also know the basics about OARGRID (see the Advanced OAR tutorial) and Kadeploy (see the Getting Started tutorial).
Running MPI on Grid'5000
When attempting to run MPI on Grid'5000 you will face a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims at driving you through the most common use cases, which are:
- Setting up and starting Open MPI on a default environment using oarsh.
- Setting up and starting Open MPI on a default environment using the allow_classic_ssh option.
- Setting up and starting Open MPI to use high performance interconnect.
- Setting up and starting Open MPI to run on several sites using oargridsub.
- Setting up and starting Open MPI in a kadeploy image.
Using Open MPI on a default environment
The default Grid'5000 environment provides Open MPI 1.6.5 (see ompi_info).
Creating a sample MPI program
For the purposes of this tutorial, we create a simple MPI program where the MPI process of rank 0 broadcasts an integer (42) to all the other processes. Then, each process prints its rank, the total number of processes, and the value it received from process 0.
In your home directory, create a file ~/mpi/tp.c and copy the source code:
#include <stdio.h>
#include <unistd.h> /* for gethostname */
#include <mpi.h>
#include <time.h> /* for the work function only */
int main (int argc, char *argv []) {
char hostname[257];
int size, rank;
int i, pid;
int bcast_value = 1;
gethostname(hostname, sizeof hostname);
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (!rank) {
bcast_value = 42;
}
        MPI_Bcast(&bcast_value, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
fflush(stdout);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
return 0;
}
You can then compile your code:
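A minimal compile sketch, assuming the source is in ~/mpi/tp.c as above and the default Open MPI wrapper compiler is used:

mpicc ~/mpi/tp.c -o ~/mpi/tp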
Setting up and starting Open MPI on a default environment using oarsh
Submit a job:
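For example, an interactive reservation of three nodes (adjust the resource request to your needs):

oarsub -I -l nodes=3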
oarsh is the remote shell connector of the OAR batch scheduler. It is a wrapper around the ssh command that handles the configuration of the SSH environment. You can connect to the reserved nodes using oarsh from the submission frontend of the cluster or from any node. As Open MPI defaults to using ssh for the remote startup of processes, you need to add the option --mca orte_rsh_agent "oarsh" to your mpirun command line. Open MPI will then use oarsh in place of ssh.
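For example, assuming the binary built earlier at ~/mpi/tp:

mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp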
You can also set an environment variable (usually in your .bashrc):
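Any --mca parameter can also be set through an OMPI_MCA_* environment variable; a sketch for your .bashrc:

export OMPI_MCA_orte_rsh_agent=oarsh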
Open MPI also provides a configuration file for --mca parameters. In your home directory, create a file ~/.openmpi/mca-params.conf with the following content:
orte_rsh_agent=oarsh
filem_rsh_agent=oarcp
You should have something like:
helios-52 - 4 - 12 - 42
helios-51 - 0 - 12 - 42
helios-52 - 5 - 12 - 42
helios-51 - 2 - 12 - 42
helios-52 - 6 - 12 - 42
helios-51 - 1 - 12 - 42
helios-51 - 3 - 12 - 42
helios-52 - 7 - 12 - 42
helios-53 - 8 - 12 - 42
helios-53 - 9 - 12 - 42
helios-53 - 10 - 12 - 42
helios-53 - 11 - 12 - 42
You may get (lots of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important, as we will see in greater detail below how to select clusters with a high performance interconnect. The messages might look like this:
[[2616,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: helios-8.sophia.grid5000.fr
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
or like this:
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04866] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_btl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04865] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
[griffon-80.nancy.grid5000.fr:04867] mca: base: component_find: unable to open /usr/lib/openmpi/lib/openmpi/mca_mtl_mx: perhaps a missing symbol, or compiled for a different version of Open MPI? (ignored)
...
You can follow FAQ#How_to_use_MPI_in_Grid5000.3F to avoid these warnings.
Setting up and starting Open MPI on a default environment using allow_classic_ssh
When your reservation only includes entire nodes (i.e. you are not making reservations at the core level), you can use ssh as a connector instead of oarsh.
Submit a job with the allow_classic_ssh type:
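For example:

oarsub -I -t allow_classic_ssh -l nodes=3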
Launch your parallel job:
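For example, using the binary built earlier:

mpirun -machinefile $OAR_NODEFILE ~/mpi/tp

Note: The oarsh connector uses the Linux resource confinement mechanism cpuset to restrict jobs to their assigned resources. Therefore, the allow_classic_ssh option cannot be used when nodes are shared between users (i.e. for reservations at the core level).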
Setting up and starting Open MPI to use high performance interconnect
By default, Open MPI tries to use any high performance interconnect it can find. However, this only works if the related libraries were found when Open MPI itself was compiled (not when your application was compiled). It should work if you built Open MPI in a jessie-x64 environment, and it also works correctly with the default environment.
Options can be used to either select or disable an interconnect.
MCA parameters (--mca) can be used to select the drivers that are used at run-time by Open MPI. To learn more about the MCA parameters, see also:
- The Open MPI FAQ about tuning parameters
- How do I tell Open MPI which IP interfaces / networks to use?
- The Open MPI documentation about OpenFabrics (ie: Infiniband)
- The Open MPI documentation about Myrinet
We will use NetPIPE to check the performance of the high performance interconnects.
To download, extract and compile NetPIPE, do:
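A sketch, assuming NetPIPE 3.7.1; the download URL below is only an example mirror and may need adjusting:

wget http://bitspjoule.org/netpipe/code/NetPIPE-3.7.1.tar.gz
tar -xzf NetPIPE-3.7.1.tar.gz
cd NetPIPE-3.7.1
make mpi    # builds the NPmpi binary using the mpicc found in the PATH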
As NetPIPE only works between two MPI processes, we will reserve one core on each of two distinct nodes. If your reservation includes more resources, you will have to create an MPI machine file (--machinefile) with only two entries, as follows:
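A minimal sketch (the machine file location is arbitrary):

uniq $OAR_NODEFILE | head -n 2 > ~/machinefile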
Infiniband hardware is available on several sites. For example, you will find clusters with Infiniband interconnect in Rennes (20G), Nancy (20G) and Grenoble (20G & 40G). Myrinet hardware is available at Lille (10G) (see Hardware page).
To reserve one core on each of two distinct nodes with one of the following interconnects, see the sample commands after the list:
- a 20G InfiniBand interconnect,
- a 40G InfiniBand interconnect,
- a 10G Myrinet interconnect.
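A sketch of the corresponding reservations, assuming the OAR properties ib20g, ib40g and myri10g (check the site's OAR properties if these names have changed):

oarsub -I -l /nodes=2/core=1 -p "ib20g='YES'"    # 20G InfiniBand
oarsub -I -l /nodes=2/core=1 -p "ib40g='YES'"    # 40G InfiniBand
oarsub -I -l /nodes=2/core=1 -p "myri10g='YES'"  # 10G Myrinet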
To test the network:
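For example, assuming NetPIPE was built in ~/NetPIPE-3.7.1 and using the two-entry machine file created above (use $OAR_NODEFILE directly if your reservation already has exactly two entries):

mpirun --mca orte_rsh_agent "oarsh" -machinefile ~/machinefile ~/NetPIPE-3.7.1/NPmpi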
To check if the support for InfiniBand is available in Open MPI, run:
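For example:

ompi_info | grep openib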
you should see something like this:
MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1)
To check if the support for Myrinet is available in Open MPI, run:
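For example (the MX components are named mtl_mx and btl_mx):

ompi_info | grep mx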
(If the output is empty, there is no built-in MX support.)
Without a high performance interconnect, the results look like this:
0: 1 bytes 4080 times --> 0.31 Mbps in 24.40 usec
1: 2 bytes 4097 times --> 0.63 Mbps in 24.36 usec
...
122: 8388608 bytes 3 times --> 896.14 Mbps in 71417.13 usec
123: 8388611 bytes 3 times --> 896.17 Mbps in 71414.83 usec
The latency is given by the last column for a 1 byte message; the maximum throughput is given by the last line (896.17 Mbps in that case).
With a Myrinet2G network, a typical result looks like this:
0: 1 bytes 23865 times --> 2.03 Mbps in 3.77 usec
1: 2 bytes 26549 times --> 4.05 Mbps in 3.77 usec
...
122: 8388608 bytes 3 times --> 1773.88 Mbps in 36079.17 usec
123: 8388611 bytes 3 times --> 1773.56 Mbps in 36085.69 usec
In this example, the latency is 3.77 µs and the bandwidth is almost 1.8 Gbit/s.
With InfiniBand 40G (QDR), you should get much better performance than with Ethernet, Myrinet 2G or InfiniBand 20G:
0: 1 bytes 30716 times --> 4.53 Mbps in 1.68 usec
1: 2 bytes 59389 times --> 9.10 Mbps in 1.68 usec
...
121: 8388605 bytes 17 times --> 25829.13 Mbps in 2477.82 usec
122: 8388608 bytes 20 times --> 25841.35 Mbps in 2476.65 usec
123: 8388611 bytes 20 times --> 25823.40 Mbps in 2478.37 usec
Less than 2 µs of latency and almost 26 Gbit/s of bandwidth!
More advanced use cases
Running MPI on several sites at once
In this tutorial, we use the following sites: Rennes, Sophia and Grenoble. To make a reservation on multiple sites, we will use oargrid. See the Grid_jobs_management tutorial for more information.
Note: For multiple sites, we should only use TCP, since there is no MX or InfiniBand network between sites. Therefore, we add this option to mpirun: --mca btl self,sm,tcp
The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do that with rsync. Suppose that you are connected to the Sophia frontend and that you want to copy Sophia's mpi/ directory to Grenoble and Rennes.
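A sketch, assuming the Rennes and Grenoble frontends are reachable as rennes and grenoble from inside Grid'5000:

rsync -avz ~/mpi/ rennes:mpi/
rsync -avz ~/mpi/ grenoble:mpi/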
(You can also add the --delete option to remove extraneous files from the mpi directories of Rennes and Grenoble.)
Reserve nodes in each site from any frontend with oargridsub (you can also add options to reserve nodes from specific clusters if you want to):
frontend: oargridsub -w 02:00:00 rennes:rdef="nodes=2",grenoble:rdef="nodes=2",sophia:rdef="nodes=2" > oargrid.out
Get the oargrid Id and Job key from the output of oargridsub:
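A sketch that relies on the oargrid.out format of the time (adjust the grep patterns if your output differs):

export OAR_JOB_KEY_FILE=$(grep "SSH KEY" oargrid.out | cut -f2 -d: | tr -d " ")
export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)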
Get the node list using oargridstat and copy the list to the first node:
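A sketch (the oargridstat filtering may need adjusting to your output):

oargridstat -w -l $OARGRID_JOB_ID | grep grid5000 > ~/gridnodes
oarcp ~/gridnodes $(head -1 ~/gridnodes):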
Connect to the first node:
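For example:

oarsh $(head -1 ~/gridnodes)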
And run your MPI application:
node: mpirun -machinefile ~/gridnodes --mca orte_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.160.0/24\;192.168.14.0/23" --mca btl_tcp_if_exclude ib0,lo,myri0 --mca btl self,sm,tcp tp
Compilation of Open MPI
If you want to use a custom version of Open MPI, you can compile it in your home directory. Make an interactive reservation and compile Open MPI from a node. This prevents overloading the site frontend:
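For example, a two-hour interactive reservation of a single node:

oarsub -I -l nodes=1,walltime=2:00:00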
Get Open MPI from the official website:
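A sketch; the version below is only an example, so pick a current release from www.open-mpi.org:

wget https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.2.tar.bz2
tar -xjf openmpi-1.10.2.tar.bz2
cd openmpi-1.10.2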
Run configure, compile, and install it in your home directory (in $HOME/openmpi/), as sketched below.
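A minimal sketch, run from the extracted source tree:

./configure --prefix=$HOME/openmpi
make -j4
make install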
To use your version of Open MPI, use $HOME/openmpi/bin/mpicc and $HOME/openmpi/bin/mpirun, or add the following to your configuration:
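For example, in your ~/.bashrc (the paths assume the $HOME/openmpi prefix used above):

export PATH=$HOME/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$HOME/openmpi/lib:$LD_LIBRARY_PATH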
You should recompile your program before trying to use the new runtime environment.
Setting up and starting Open MPI in a kadeploy image
Warning: This part of the tutorial is known to be buggy. It is strongly recommended that you skip it, unless you have an important need for Myrinet networking; in that case, be prepared to fix it.
Building a kadeploy image
The default Open MPI version available in Debian-based distributions is not compiled with the libraries for high performance networks like Myrinet/MX, so we must recompile Open MPI from source if we want to use Myrinet networks. Fortunately, every default image (jessie-x64-XXX) except the min variant includes the libraries for high performance interconnects, and Open MPI will find them at compile time.
We will create a kadeploy image based on an existing one.
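A sketch of the reservation and initial deployment, assuming one node and the jessie-x64-base environment:

oarsub -I -t deploy -l nodes=1,walltime=2:00:00
kadeploy3 -e jessie-x64-base -f $OAR_NODEFILE -k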
Connect to the first node as root and install Open MPI:
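For example:

ssh root@$(head -1 $OAR_NODEFILE)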
Download Open MPI:
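For example (again, the version is only an example):

wget https://www.open-mpi.org/software/ompi/v1.10/downloads/openmpi-1.10.2.tar.bz2
tar -xjf openmpi-1.10.2.tar.bz2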
Install g++, make, gfortran, f2c and the BLAS library:
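A sketch using the assumed Debian package names:

apt-get update && apt-get -y install g++ make gfortran f2c libblas-dev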
Configure and compile:
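A minimal sketch, installing to the default prefix inside the image:

cd openmpi-1.10.2
./configure
make -j4
make install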
To run an MPI application, we will create a dedicated user named mpi. We add it to the rdma group for InfiniBand. We also copy the ~root/.ssh/authorized_keys file so that we can log in as user mpi from the frontend, and we create an SSH key for the mpi user (needed by Open MPI).
useradd -m -g rdma mpi -d /var/mpi
echo "* hard memlock unlimited" >> /etc/security/limits.conf
echo "* soft memlock unlimited" >> /etc/security/limits.conf
mkdir ~mpi/.ssh
cp ~root/.ssh/authorized_keys ~mpi/.ssh
chown -R mpi ~mpi/.ssh
su - mpi
ssh-keygen -N "" -P "" -f /var/mpi/.ssh/id_rsa
cat .ssh/id_rsa.pub >> ~/.ssh/authorized_keys
echo "    StrictHostKeyChecking no" >> ~/.ssh/config
exit # exit the session as the mpi user
exit # exit the root connection to the node
# You can then copy your files from the frontend to the mpi home directory:
rsync -avz ~/mpi/ mpi@$(head -1 $OAR_NODEFILE):mpi/ # copy the tutorial
You can save the newly created disk image by using tgz-g5k:
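For example, writing the archive to /dev/shm on the node (run as root; the path is only an example):

tgz-g5k /dev/shm/image.tgz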
Disconnect from the node (exit). From the frontend, copy the image to the public directory:
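For example, assuming the archive path used above:

scp root@$(head -1 $OAR_NODEFILE):/dev/shm/image.tgz $HOME/public/jessie-openmpi.tgz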
Copy the description file of jessie-x64-base:
frontend: grep -v visibility /grid5000/descriptions/jessie-x64-base-2016011914.dsc > $HOME/public/jessie-openmpi.dsc
Change the image name in the description file; we will use an http URL for multi-site deployment:
perl -i -pe "s@server:///grid5000/images/jessie-x64-base-2016011914.tgz@http://public.$(hostname | cut -d. -f2).grid5000.fr/~$USER/jessie-openmpi.tgz@" $HOME/public/jessie-openmpi.dsc
Now you can terminate the job:
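If the reservation was interactive, simply exit the oarsub shell; otherwise, delete the job explicitly (the job id below is a placeholder):

oardel 123456    # replace 123456 with your OAR job id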
Using a kadeploy image
Single site
Connect to the first node:
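A sketch, assuming the image has been deployed on your reserved nodes (e.g. with oarsub -t deploy and kadeploy3 -a $HOME/public/jessie-openmpi.dsc -f $OAR_NODEFILE -k) and that you work as the mpi user created in the image:

ssh mpi@$(head -1 $OAR_NODEFILE)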
Single site with Myrinet hardware
Create a nodefile with a single entry per node, copy it to the first node, and connect to that node, as sketched below.
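A minimal sketch of these three steps (the nodefile path and the mpi user are assumptions carried over from the image built above):

uniq $OAR_NODEFILE > ~/nodes
scp ~/nodes mpi@$(head -1 ~/nodes):
ssh mpi@$(head -1 ~/nodes)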
This time we have:
0: 1 bytes 23865 times --> 2.03 Mbps in 3.77 usec
1: 2 bytes 26549 times --> 4.05 Mbps in 3.77 usec
...
122: 8388608 bytes 3 times --> 1773.88 Mbps in 36079.17 usec
123: 8388611 bytes 3 times --> 1773.56 Mbps in 36085.69 usec
This time we have 3.77 µs of latency, which is good, and almost 1.8 Gbit/s of bandwidth: the Myrinet interconnect is being used!
Multiple sites
Choose three clusters from 3 different sites.
frontend: oargridsub -t deploy -w 02:00:00 cluster1:rdef="nodes=2",cluster2:rdef="nodes=2",cluster3:rdef="nodes=2" > oargrid.out
Get the node list using oargridstat:
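For example, reusing the OARGRID_JOB_ID extracted from oargrid.out as in the previous section:

oargridstat -w -l $OARGRID_JOB_ID | grep grid5000 > ~/gridnodes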
Deploy on all sites using the --multi-server option:
frontend: kadeploy3 -f gridnodes -a $HOME/public/jessie-openmpi.dsc -k --multi-server -o ~/nodes.deployed
Connect to the first node:
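For example, as the mpi user created in the image:

ssh mpi@$(head -1 ~/gridnodes)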
MPICH2
Warning: This documentation is about using MPICH2 with the MPD process manager, but the default process manager for MPICH2 is now Hydra. See also: the MPICH documentation.
If you want or need to use MPICH2 on Grid'5000, proceed as follows.
First, do this once on each site (MPD requires a secret word stored in your home directory):
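A sketch, based on MPD's requirement of a ~/.mpd.conf file containing a secret word and readable only by you (choose your own secret):

echo "MPD_SECRETWORD=some-secret-word" > ~/.mpd.conf
chmod 600 ~/.mpd.conf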
Then you can use a script like this to launch mpd/mpirun:
# number of distinct nodes and total number of cores in the reservation
NODES=$(uniq < $OAR_NODEFILE | wc -l | tr -d ' ')
NPROCS=$(wc -l < $OAR_NODEFILE | tr -d ' ')
# start one mpd daemon per node, using oarsh as the remote shell
mpdboot --rsh=oarsh --totalnum=$NODES --file=$OAR_NODEFILE
sleep 1
# replace mpich2binary with the path to your MPICH2-compiled binary
mpirun -n $NPROCS mpich2binary