Grid computation-OAR2
| Note |
|---|
| The source code for the experiments is available at: https://www.grid5000.fr/documents/TPCalcul.tgz. Please copy and unzip it in your working area. |
Public
- Anyone in the G5K community who wants to learn how to use MPI in a grid environment
- MPI knowledge is not required, but the experiments will quickly evolve from a simple "hello world" to deployment and communication in a grid environment
Contents
Reviewing MPI Principles
Introduction
- Message passing principles
- processes roles and ranks (SPMD programming)
Basic Skeleton of an MPI Application
A minimal MPI application must include the MPI header and start and finish the MPI environment. Almost every MPI call made before MPI_Init or after MPI_Finalize is erroneous and will typically fail at run time.
#include <stdio.h>
#include "mpi.h"

int main (int argc, char* argv[])
{
    MPI_Init(&argc, &argv);   /* start the MPI environment */
    // your code
    MPI_Finalize();           /* shut the MPI environment down */
    return 0;
}
When the source code does not distinguish between process roles, running it starts n identical instances of your application.
Fundamental MPI instructions
- The total number of MPI processes in an execution can be obtained with the MPI_Comm_size function. It takes as parameters the communicator (MPI_COMM_WORLD if we want to count all processes) and a pointer to the integer variable size where the number of processes will be stored:
MPI_Comm_size (MPI_COMM_WORLD, &size);
- Similarly, the rank of a process (relative to the communicator in use) can be obtained through a call to MPI_Comm_rank:
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
- Finally, the elapsed time of an operation can be obtained through successive calls to MPI_Wtime(), which returns a double-precision floating-point number of seconds representing the wall-clock time elapsed since some arbitrary time in the past. Its precision varies with the MPI distribution and architecture, but in most cases it is precise to the order of milliseconds.
double start = MPI_Wtime();
...
double end = MPI_Wtime();
double elapsed = end - start;
MPI basic datatypes
In order to transmit data between different processes, MPI defines "standard" datatypes that can be used independently of the programming language (C, C++, Fortran, ...) or computer architecture (big-endian versus little-endian, etc.). The following table presents some of these basic datatypes and their C equivalents.
| MPI Datatype | C type |
|---|---|
| MPI_CHAR | char |
| MPI_SHORT | signed short int |
| MPI_INT | signed int |
| MPI_LONG | signed long int |
| MPI_UNSIGNED_CHAR | unsigned char |
| MPI_UNSIGNED_SHORT | unsigned short int |
| MPI_UNSIGNED | unsigned int |
| MPI_UNSIGNED_LONG | unsigned long int |
| MPI_FLOAT | float |
| MPI_DOUBLE | double |
| MPI_LONG_DOUBLE | long double |
| MPI_BYTE | (none) |
| MPI_PACKED | (none) |
Nevertheless, these types cover only the basic datatypes that can be sent between processes. When specific data structures are needed (such as C structs or C++ objects), we must create derived datatypes before sending them. If time allows, we will see how to declare and use derived datatypes.
Communication among processes
Point-to-Point Communication
- Blocking Calls - Represented by the MPI_Send and MPI_Recv functions. A blocking call only returns once the communication buffer can be safely modified (the message has really been sent, or has been copied to intermediate buffers in the MPI communication layer).
MPI_Send (void *buffer, int sendcount, MPI_Datatype dtype, int dest, int tag, MPI_Comm communicator);
MPI_Recv (void *buffer, int recvcount, MPI_Datatype dtype, int source, int tag, MPI_Comm communicator, MPI_Status *stat);
Note: MPI_Recv accepts the special values MPI_ANY_SOURCE and MPI_ANY_TAG to accept messages coming from any source or carrying any tag.
- Non-blocking Calls - Represented by the MPI_Isend and MPI_Irecv functions. Non-blocking calls return almost immediately; it is up to the developer to verify that the message was really sent (allowing the buffer to be reused). We can wait until the transmission is done (using wait functions such as MPI_Wait), or simply test whether a message was transmitted (using the MPI_Test function). The latter option is interesting if we have some computation that can be done in parallel, but it requires a careful evaluation of the message status to avoid inconsistencies.
MPI_Isend (void *buffer, int sendcount, MPI_Datatype dtype, int dest, int tag, MPI_Comm communicator, MPI_Request *req);
MPI_Irecv (void *buffer, int recvcount, MPI_Datatype dtype, int source, int tag, MPI_Comm communicator, MPI_Request *req);
MPI_Wait (MPI_Request *req, MPI_Status *stat);
MPI_Test (MPI_Request *req, int *flag, MPI_Status *stat);
Collective Communications
Sometimes we need to transmit data to several processes, or to collect data from different peers. For this purpose, MPI provides several collective operations that cover almost all data exchange patterns between 1->n, n->1 and n->n processes. The following table lists the most common collective operations:
| Operation | Description |
|---|---|
| MPI_Bcast() | Broadcasts the same message to all processes |
| MPI_Scatter() | Sends personalized messages to the processes |
| MPI_Gather() | Collects messages from different processes |
| MPI_Reduce() | Collects messages from different processes, while applying arithmetic/logical operations to the data |
| MPI_Alltoall() | Each process sends personalized messages to the other processes |
| MPI_Barrier() | Synchronizes different processes |
MPI collective communication operations are not matched by MPI_Send and MPI_Recv calls on the other processes. Instead, all involved processes must call the same collective operation, as in the following example. The parameters of the call (here, the root rank) determine which process is the originator or the destination of the messages, if needed.
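#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main (int argc, char* argv[])
{
    int parameter = 0;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only the root reads the value from the command line */
    if (rank == 0 && argc > 1)
        parameter = atoi(argv[1]);

    /* Every process calls MPI_Bcast; rank 0 is the originator */
    MPI_Bcast(&parameter, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}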
Launching MPI
MPI Distributions
- MPICH - Argonne National Laboratory
- LAM/MPI - Indiana University
- OpenMPI - new distribution - very promising
Compiling an MPI application
mpicc myprogram.c -o myprogram
Deploying MPI processes over Grid5000
Nodes reservation
Use oarsub or oargridsub (according to starter-level practices: Cluster experiment-OAR2 and Grid experiment-OAR2).
Deploying MPI processes
To launch an MPI application over several machines, we need a file containing the machine list. The OAR batch scheduler provides an environment variable, $OAR_FILE_NODES, that can be used for this purpose, but it is sometimes useful to adapt this list to our needs. The syntax depends on the MPI distribution in use. MPICH uses the following syntax:
machinename[:nprocs]
machinename[:nprocs]
...
While LAM-MPI uses a slightly different syntax:
machinename [cpu=nprocs]
machinename [cpu=nprocs]
...
The machine list is quite simple: the optional :nprocs part indicates a machine with more than one CPU, so that the deployment fills all available resources on that machine. This syntax is equivalent to listing the same machine name several times.
Depending on the MPI distribution in use, the machine list is either passed directly as a parameter of the mpirun call, or through a boot application:
MPICH
angelo@node-10:/~$ mpirun -np 10 -machinefile <file> toto
Hello world 0!
Hello world 1!
...
Hello world 9!
LAM
angelo@node-10:/~$ lamboot <file>
LAM 7.2b1svn03012006/ROMIO - Indiana University
angelo@node-10:/~$ mpirun -np 10 toto
Hello world 0!
Hello world 1!
...
Hello world 9!
angelo@node-10:/~$ lamhalt
Heterogeneity problems
Heterogeneity may cause problems when working in a computational grid. Most problems are due to the different architectures in use at Grid5000 (Xeon, Opteron, Itanium, ...), which force applications to be compiled separately on each cluster (fortunately, we don't have a fully shared filesystem).
The second type of problem comes from the software available at each site, and from the compatibility between different versions of MPI. For instance, LAM-MPI does not tolerate different versions when running an application. On the other hand, MPICH imposes no such restriction, and an application compiled with version 1.2.5 can "talk" with another compiled with MPICH 1.2.7. Nevertheless, it is not wise to mix different versions, as some changes (such as new collective communication algorithms) may make versions partially incompatible.
Exercises
TP #1 - Warming up
We start our exercise session with some basic MPI operations. These operations are fundamental to most MPI programs, as they are used to initialize the environment and "locate" a process.
Hello World
Write a "Hello World" application from the skeleton presented before. Try to use the following operations:
- MPI_Comm_size() - to print the total number of running processes
- MPI_Comm_rank() - each process can obtain its rank among all processes
- Optionally, call MPI_Get_processor_name() to print the machine name
Logical Ring
Write an MPI application that simulates the passing of a token in a logical ring among the processes. Try to use the blocking communication operations MPI_Send and MPI_Recv.
- Advanced exercise: Send useful data between processes. For example, try to calculate Pi using Euler's ζ(2) solution to the Basel problem[1].
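For reference, the relation behind the advanced exercise can be written as follows. The partial-sum bound N and the round-robin split of the terms over p processes are assumptions chosen for illustration; other ways of distributing the terms are equally valid.

```latex
\zeta(2) = \sum_{k=1}^{\infty} \frac{1}{k^{2}} = \frac{\pi^{2}}{6}
\quad\Longrightarrow\quad
\pi \approx \sqrt{6 \sum_{k=1}^{N} \frac{1}{k^{2}}}

\sum_{k=1}^{N} \frac{1}{k^{2}}
  = \sum_{r=0}^{p-1} S_{r},
\qquad
S_{r} = \sum_{\substack{1 \le k \le N \\ k \equiv r+1 \ (\mathrm{mod}\ p)}} \frac{1}{k^{2}}
```

Here S_r is the partial sum computed by the process of rank r, and the partial sums are accumulated as they travel around the ring.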
Collective Communications
Write an algorithm to broadcast a message (for example, one value chosen by the root process). All processes then modify the value (for example, adding their rank to it) and send the result back to the root process.
- Initially, try to use only MPI_Send and MPI_Recv. What difficulties did you find?
- Rewrite the application, this time using collective communication operations. Please use MPI_Bcast() to broadcast a message and MPI_Gather() to collect the results.
TP #2 - Network performance
As we are working with a computational grid, it is important to be aware of how performance varies with the location of the processes.
| SMP Machine <-> Cluster <-> Computational Grid |
Furthermore, communication performance depends on the application and transport layers, which may sometimes give us more modest results than expected.
To evaluate the communication performance, we will measure the latency and the throughput of a link using the approach from NWS[2].
- Latency - obtained from the round-trip time of an empty message (sndcount = 0)
Latency = RTT(0)/2
- Throughput - obtained from the time needed to send a relatively large message (64KB for NWS) and the acknowledgement of that transmission
TPut = msgsize/(RTT(m)-2*Latency)
Try initially to use blocking calls (MPI_Send and MPI_Recv), but if time allows, compare with the performance of non-blocking calls.
Try to build a reference table with the average latency and throughput between processes on the same SMP machine, processes in the same cluster, and processes on different sites of the grid.
TP #3 - Communication cost and granularity
In this experiment we will try to observe the impact of communications on the cost of parallel algorithms. We will work with an algorithm that multiplies two square matrices A and B. The experiment consists in identifying the best trade-off between the number of parallel processes and the data granularity.
Conclusions
TBD

