Run MPI On Grid'5000



Note.png Note

This page is actively maintained by the Grid'5000 team. If you encounter problems, please report them (see the Support page). Additionally, as it is a wiki page, you are free to make minor corrections yourself if needed. If you would like to suggest a more fundamental change, please contact the Grid'5000 team.

Introduction

MPI is a programming interface that enables communication between processes of a distributed memory system. This tutorial focuses on setting up MPI environments on Grid'5000 and only requires a basic understanding of MPI concepts. For instance, you should know that standard MPI processes live in their own memory space and communicate with other processes by calling library routines to send and receive messages. For a comprehensive tutorial on MPI, see the IDRIS course on MPI. There are several freely-available implementations of MPI, including Open MPI, MPICH2, MPICH, LAM, etc. In this practical session, we focus on the Open MPI implementation.

Before following this tutorial you should already have some basic knowledge of OAR (see the Getting Started tutorial). For the second part of this tutorial, you should also know the basics about OARGRID (see the Advanced OAR tutorial).

Running MPI on Grid'5000

When attempting to run MPI on Grid'5000 you will face a number of challenges, ranging from classical setup problems for MPI software to problems specific to Grid'5000. This practical session aims to guide you through the most common use cases, which are:

  • Setting up and starting Open MPI on a default environment using oarsh.
  • Setting up and starting Open MPI on a default environment using the allow_classic_ssh option.
  • Setting up and starting Open MPI to use high performance interconnect.
  • Setting up and starting the latest Open MPI library version.
  • Setting up and starting Open MPI to run on several sites using oargridsub.


Using Open MPI on a default environment

The default Grid'5000 environment provides Open MPI 4.1.0 (see ompi_info).

Creating a sample MPI program

For the purposes of this tutorial, we create a simple MPI program where the MPI process of rank 0 broadcasts an integer (42) to all the other processes. Then, each process prints its rank, the total number of processes and the value it received from process 0.

In your home directory, create a file ~/mpi/tp.c and copy the source code:

Terminal.png frontend:
mkdir ~/mpi
Terminal.png frontend:
vi ~/mpi/tp.c
#include <stdio.h>
#include <mpi.h>
#include <time.h> /* for the work function only */
#include <unistd.h>

int main (int argc, char *argv []) {
       char hostname[257];
       int size, rank;
       int bcast_value = 1;

       gethostname(hostname, sizeof hostname);
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       if (!rank) {
            bcast_value = 42;
       }
       MPI_Bcast(&bcast_value,1 ,MPI_INT, 0, MPI_COMM_WORLD );
       printf("%s\t- %d - %d - %d\n", hostname, rank, size, bcast_value);
       fflush(stdout);

       MPI_Barrier(MPI_COMM_WORLD);
       MPI_Finalize();
       return 0;
}

You can then compile your code:

Terminal.png frontend:
mpicc ~/mpi/tp.c -o ~/mpi/tp

Setting up and starting Open MPI on a default environment using oarsh

Submit a job:

Terminal.png frontend:
oarsub -I -l nodes=3

The OAR batch scheduler provides the $OAR_NODEFILE file, which contains the list of the job's nodes (one line per core). It also uses oarsh as its remote shell connector: oarsh is a wrapper around the ssh command that handles the configuration of the SSH environment. You can connect to the reserved nodes using oarsh from the submission frontend of the cluster or from any node. As Open MPI defaults to using ssh for the remote startup of processes, you need to add the option --mca orte_rsh_agent "oarsh" to your mpirun command line; Open MPI will then use oarsh in place of ssh.

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp

You can also set an environment variable (usually in your .bashrc):

Terminal.png bashrc:
export OMPI_MCA_orte_rsh_agent=oarsh
Terminal.png node:
mpirun -machinefile $OAR_NODEFILE ~/mpi/tp

Open MPI also provides a configuration file for --mca parameters. In your home directory, create a file ~/.openmpi/mca-params.conf with the following content:

orte_rsh_agent=oarsh
filem_rsh_agent=oarcp

Running the sample program should produce an output like:

helios-52       - 4 - 12 - 42
helios-51       - 0 - 12 - 42
helios-52       - 5 - 12 - 42
helios-51       - 2 - 12 - 42
helios-52       - 6 - 12 - 42
helios-51       - 1 - 12 - 42
helios-51       - 3 - 12 - 42
helios-52       - 7 - 12 - 42
helios-53       - 8 - 12 - 42
helios-53       - 9 - 12 - 42
helios-53       - 10 - 12 - 42
helios-53       - 11 - 12 - 42

You may get (lots of) warning messages if Open MPI cannot take advantage of any high performance hardware. At this point of the tutorial, this is not important, as we will learn how to select clusters with a high performance interconnect in greater detail below. The error messages might look like this:

[1637577186.512697] [taurus-10:2765 :0]      rdmacm_cm.c:638  UCX  ERROR rdma_create_event_channel failed: No such device
[1637577186.512752] [taurus-10:2765 :0]     ucp_worker.c:1432 UCX  ERROR failed to open CM on component rdmacm with status Input/output error
[taurus-10.lyon.grid5000.fr:02765] ../../../../../../ompi/mca/pml/ucx/pml_ucx.c:273  Error: Failed to create UCP worker

To tell Open MPI not to try to use the high performance hardware and to avoid those warning messages, use the following options:

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" --mca pml ^ucx -machinefile $OAR_NODEFILE $HOME/mpi_program


Setting up and starting Open MPI on a default environment using allow_classic_ssh

If you prefer using ssh as a connector instead of oarsh, submit a job with the allow_classic_ssh type:

Terminal.png frontend:
oarsub -I -t allow_classic_ssh -l nodes=3

Launch your parallel job:

Terminal.png node:
mpirun -machinefile $OAR_NODEFILE ~/mpi/tp
Note.png Note

The allow_classic_ssh option bypasses OAR's resource confinement mechanism (cpuset), which restricts jobs to their assigned resources. Therefore, allow_classic_ssh cannot be used with jobs sharing nodes between users (i.e. for reservations at the core level).

Setting up and starting Open MPI to use high performance interconnect

Open MPI provides several alternative components to use high performance interconnect hardware (such as Infiniband or Omni-Path).

MCA parameters (--mca) can be used to select the components that are used at run time by Open MPI. To learn more about the MCA parameters, see also:

  • The Open MPI FAQ about tuning parameters
  • The Open MPI FAQ entry "How do I tell Open MPI which IP interfaces / networks to use?"
  • The Open MPI documentation about OpenFabrics (i.e. Infiniband) and about Omni-Path

The Open MPI packaged in Grid'5000's debian11 environment includes the ucx, ofi and openib components to make use of an Infiniband network, and psm2 and ofi to make use of an Omni-Path network.
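You can check which of these components are actually available in the Open MPI installation you are using with ompi_info; the grep pattern below is only an example filter on the pml, mtl and btl frameworks:

Terminal.png node:
ompi_info | grep -E "pml|mtl|btl"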

If you want to disable some of these components, you can for example use --mca pml ^ucx --mca mtl ^psm2,ofi --mca btl ^ofi,openib. This disables all the high performance components mentioned above and forces Open MPI to use its TCP backend.
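For example, a complete command line combining these flags with the sample program from this tutorial could look like the following (a sketch; adapt the path to your own binary):

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" --mca pml ^ucx --mca mtl ^psm2,ofi --mca btl ^ofi,openib -machinefile $OAR_NODEFILE ~/mpi/tp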


Infiniband network

By default, Open MPI tries to use the Infiniband high performance interconnect through the UCX component. When an Infiniband network is available, this gives the best results in most cases (UCX can even use multiple interfaces at a time when they are available).
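Conversely, if you want to make sure UCX is used rather than silently falling back to TCP, you can explicitly select the UCX component; a minimal sketch based on the mpirun invocations used above:

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" --mca pml ucx -machinefile $OAR_NODEFILE ~/mpi/tp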

Omni-Path network

For Open MPI to work with Omni-Path network hardware, the PSM2 component must be used. It is necessary to explicitly disable the other components:

Terminal.png node:
mpirun -machinefile $OAR_NODEFILE -mca mtl psm2 -mca pml ^ucx,ofi -mca btl ^ofi,openib ~/mpi/tp

IP over Infiniband or Omni-Path

Nodes with Infiniband or Omni-Path network interfaces also provide an IP over Infiniband interface (these interfaces are named ibX). The TCP backend of Open MPI will try to use them by default.

You can explicitly select the interfaces used by the TCP backend, using for instance --mca btl_tcp_if_exclude ib0,lo (to avoid using IP over Infiniband and local interfaces) or --mca btl_tcp_if_include eno2 (to force using the 'regular' Ethernet interface eno2).
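For instance, to force the plain TCP backend over a specific Ethernet interface, you could combine these options as follows (a sketch: eno2 is only an example name, check the actual interface names on the reserved nodes, e.g. with ip addr):

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" --mca pml ob1 --mca btl self,tcp --mca btl_tcp_if_include eno2 -machinefile $OAR_NODEFILE ~/mpi/tp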


Benchmarking

We will be using the OSU micro benchmarks to check the performance of high performance interconnects.

To download, extract and compile our benchmark, do:

Terminal.png frontend:
cd ~/mpi
Terminal.png frontend:
wget https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.8.tgz
Terminal.png frontend:
tar xf osu-micro-benchmarks-5.8.tgz
Terminal.png frontend:
cd osu-micro-benchmarks-5.8/
Terminal.png frontend:
./configure CC=$(which mpicc) CXX=$(which mpicxx)
Terminal.png frontend:
make

As we will benchmark two MPI processes, reserve two distinct nodes (only one process will run per node):

Terminal.png frontend:
oarsub -I -l nodes=2

To start the network benchmark, use:

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE -npernode 1 ~/mpi/osu-micro-benchmarks-5.8/mpi/pt2pt/osu_latency

The -npernode 1 option tells mpirun to spawn only one process on each node, as the benchmark requires only two processes to communicate.
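The same suite also contains a bandwidth benchmark; assuming the build layout from the steps above, it can be started the same way:

Terminal.png node:
mpirun --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE -npernode 1 ~/mpi/osu-micro-benchmarks-5.8/mpi/pt2pt/osu_bw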

You can then try to compare the performance of the various network hardware available on Grid'5000. See for instance the Network interface models section of the Hardware page (Hardware#Network_interface_models).
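For instance, to run the benchmark on a given cluster, you can reserve nodes of that cluster with the cluster OAR property (taurus in Lyon is just an example; pick any cluster listed in the Hardware page):

Terminal.png frontend:
oarsub -I -l nodes=2 -p "cluster='taurus'"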

OAR can select nodes according to properties related to network performance. For example:

  • To reserve one core of two distinct nodes with a 56Gbps InfiniBand interconnect:
Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "ib_rate=56"
  • To reserve one core of two distinct nodes with a 100Gbps Omni-Path interconnect:
Terminal.png frontend:
oarsub -I -l /nodes=2/core=1 -p "opa_rate=100"


Use a newer Open MPI version using modules

If you need a more recent Open MPI version than the one provided by the default environment, you should use the module command.

Terminal.png frontend:
module av openmpi
[...]
openmpi/4.1.1_gcc-8.3.0 (D)
[...]
Terminal.png frontend:
module load openmpi
Terminal.png frontend:
mpirun --version

You must recompile the simple MPI example on the frontend with this new version.

Terminal.png frontend:
mpicc ~/mpi/tp.c -o ~/mpi/tp

From your job, you must ensure that the same Open MPI version is used on every node:

Terminal.png frontend:
oarsub -I -l nodes=3
Terminal.png node:
module load openmpi
Terminal.png node:
$(which mpirun) --mca orte_rsh_agent "oarsh" -machinefile $OAR_NODEFILE ~/mpi/tp

Note that the $(which mpirun) command is used in this last step to ensure that the mpirun binary from the module environment is used (the module environment is lost through the ssh connection to the other nodes).


More advanced use cases

Running MPI on several sites at once

In this section, we are going to run an MPI program over several Grid'5000 sites. In this example we will use the following sites: Nancy, Sophia and Grenoble, using oargrid to make the reservation (see the Advanced OAR tutorial for more information).

Warning.png Warning

Open MPI tries to figure out the best network interface to use at run time. However, the selected network is not always the "production" Grid'5000 network, which is the one routed between sites. In addition, only the TCP implementation will work between sites, as high performance networks are only available from inside a site. To ensure the correct network is selected, add the options --mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo --mca btl self,sm,tcp to mpirun

Warning.png Warning

By default, Open MPI may use only the short names of the nodes specified in the nodes file; but to reach Grid'5000 nodes located on different sites, we must use their FQDNs. For Open MPI to correctly use the FQDNs of the nodes, you must add the following option to mpirun: --mca orte_keep_fqdn_hostnames t

The MPI program must be available on each site you want to use. From the frontend of one site, copy the mpi/ directory to the two other sites. You can do that with rsync. Suppose that you are connected in Sophia and that you want to copy Sophia's mpi/ directory to Nancy and Grenoble.

Terminal.png fsophia:
rsync -avz ~/mpi/ nancy.grid5000.fr:mpi/
Terminal.png fsophia:
rsync -avz ~/mpi/ grenoble.grid5000.fr:mpi/

(you can also add the --delete option to remove extraneous files from the mpi directory of Nancy and Grenoble).

Reserve nodes in each site from any frontend with oargridsub (you can also add options to reserve nodes from specific clusters if you want to):

Terminal.png frontend:
oargridsub -w 02:00:00 nancy:rdef="nodes=2",grenoble:rdef="nodes=2",sophia:rdef="nodes=2" > oargrid.out

Get the oargrid Id and Job key from the output of oargridsub:

Terminal.png frontend:
export OAR_JOB_KEY_FILE=$(grep "SSH KEY" oargrid.out | cut -f2 -d: | tr -d " ")
Terminal.png frontend:
export OARGRID_JOB_ID=$(grep "Grid reservation id" oargrid.out | cut -f2 -d=)

Get the node list using oargridstat and copy the list to the first node:

Terminal.png frontend:
oargridstat -w -l $OARGRID_JOB_ID | grep -v ^$ > ~/gridnodes
Terminal.png frontend:
oarcp ~/gridnodes $(head -1 ~/gridnodes):

Connect to the first node:

Terminal.png frontend:
oarsh $(head -1 ~/gridnodes)

And run your MPI application:

Terminal.png node:
cd ~/mpi/
Terminal.png node:
mpirun -machinefile ~/gridnodes --mca orte_rsh_agent "oarsh" --mca opal_net_private_ipv4 "192.168.0.0/16" --mca btl_tcp_if_exclude ib0,lo --mca btl self,sm,tcp --mca orte_keep_fqdn_hostnames t tp

FAQ

Passing environment variables to nodes

While some batch schedulers (e.g. Slurm) transparently pass environment variables from the head node shell to all the execution nodes given to mpirun, OAR does not (OAR provides no more than what OpenSSH does, whether it is used directly when oarsub is called with -t allow_classic_ssh or through the oarsh wrapper). OAR thus leaves the responsibility of passing environment variables to mpirun.

Therefore, in order to have more than the default environment variables (the OMPI_* variables) passed/set on the execution nodes, one has several options:

  • Use the -x VAR option of mpirun, possibly repeated for each variable to pass (warning: -x is deprecated). Example:
mpirun -machinefile $OAR_NODE_FILE --mca orte_rsh_agent "oarsh" -x MY_ENV1 -x MY_ENV2 -x MY_ENV3="value3" ~/bin/mpi_test
  • Use the --mca mca_base_env_list "ENV[;...]" option of mpirun. Example:
mpirun -machinefile $OAR_NODE_FILE --mca orte_rsh_agent "oarsh" --mca mca_base_env_list "MY_ENV1;MY_ENV2;MY_ENV3=value3" ~/bin/mpi_test
  • Set the mca_base_env_list "ENV[;...]" option in the ~/.openmpi/mca-params.conf file. This way, passing variables becomes transparent to the mpirun command line, which becomes:
mpirun -machinefile $OAR_NODE_FILE --mca orte_rsh_agent "oarsh" ~/bin/mpi_test

Remarks:

  • orte_rsh_agent="oarsh" can be set in the ~/.openmpi/mca-params.conf configuration file as well (but only if using oarsh as the connector)
  • -x and --mca mca_base_env_list cannot coexist.

This could especially be useful to pass OpenMP variables, such as OMP_NUM_THREADS.
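For example, a hybrid MPI+OpenMP run could pass the number of OpenMP threads per process this way (a sketch reusing the options above; OMP_NUM_THREADS=4 and ~/bin/mpi_test are placeholders):

mpirun -machinefile $OAR_NODE_FILE --mca orte_rsh_agent "oarsh" --mca mca_base_env_list "OMP_NUM_THREADS=4" ~/bin/mpi_test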

More info in OpenMPI manual pages.