# Grid computation-OAR1

Note The source code for the experiments is available at: https://www.grid5000.fr/documents/TPCalcul.tgz. Please copy and unzip it in your working area: tar -xvzf TPCalcul.tgz
Warning This practice was designed for the old OAR1 infrastructure of Grid'5000. Now Grid'5000 uses OAR2, please refer to the corresponding practice.

# Public

• anyone in the G5K community who wants to learn how to use MPI in a grid environment
• MPI knowledge is not required, but the experiments will evolve quickly from a simple "hello world" to deployment and communication in a grid environment

# Contents

## Reviewing MPI Principles

### Introduction

• Message-passing principles
• process roles and ranks (SPMD programming)

### Basic Skeleton of an MPI Application

A minimal MPI application must include the MPI header and initialize/finalize the MPI environment. Indeed, any MPI operation called outside the region delimited by MPI_Init and MPI_Finalize will cause a runtime error.

    #include <stdio.h>
    #include "mpi.h"

    int main (int argc, char* argv[])
    {
        MPI_Init(&argc, &argv);

        MPI_Finalize();
        return 0;
    }


When the source code does not distinguish between process roles, MPI will execute n identical instances of your application.

### Fundamental MPI instructions

• The total number of MPI processes in an execution can be obtained through the MPI_Comm_size function. It receives as parameters the communicator (MPI_COMM_WORLD if we want to count all processes) and a reference to the integer variable size where the number of processes will be stored:

    MPI_Comm_size (MPI_COMM_WORLD, &size);

• Similarly, the rank of the calling process (relative to the communicator in use) can be obtained through a call to MPI_Comm_rank:

    MPI_Comm_rank (MPI_COMM_WORLD, &rank);

• Finally, the elapsed time of an operation can be obtained through successive calls to MPI_Wtime(), which returns a floating-point number of seconds representing the elapsed wall-clock time since some moment in the past. Its precision varies according to the MPI distribution and architecture, but in most cases it is accurate to the order of milliseconds. Note that MPI_Wtime() returns a double:

    double start = MPI_Wtime();
    ...
    double end = MPI_Wtime();
    double elapsed = end - start;


### MPI basic datatypes

In order to transmit data between different processes, MPI defines "standard" datatypes that can be used independently of the programming language (C, C++, Fortran, ...) or computer architecture (big-endian versus little-endian, etc.). The following table presents some of these basic datatypes and their equivalents in the C language.

| MPI Datatype | C type |
|---|---|
| MPI_CHAR | signed char |
| MPI_SHORT | signed short int |
| MPI_INT | signed int |
| MPI_LONG | signed long int |
| MPI_UNSIGNED_CHAR | unsigned char |
| MPI_UNSIGNED_SHORT | unsigned short int |
| MPI_UNSIGNED | unsigned int |
| MPI_UNSIGNED_LONG | unsigned long int |
| MPI_FLOAT | float |
| MPI_DOUBLE | double |
| MPI_LONG_DOUBLE | long double |
| MPI_BYTE | (none) |
| MPI_PACKED | (none) |

Nevertheless, these types cover only the basic datatypes that can be sent between processes. When specific data structures are needed (such as C structs or C++ objects), we must create derived datatypes before sending them. If time allows, we will see how to declare and use derived datatypes.

### Communication among processes

#### Point-to-Point Communication

• Blocking Calls - Represented by the MPI_Send and MPI_Recv functions. A blocking call only returns once the communication buffer can be safely modified (the message was really sent or copied to an intermediate buffer of the MPI communication layer).
    MPI_Send (void *buffer, int sendcount, MPI_Datatype dtype, int dest, int tag, MPI_Comm communicator);

    MPI_Recv (void *buffer, int recvcount, MPI_Datatype dtype, int source, int tag, MPI_Comm communicator, MPI_Status *stat);


PS: MPI_Recv accepts the special values MPI_ANY_SOURCE and MPI_ANY_TAG to accept messages coming from any source or with any identifying tag.

• Non-blocking Calls - Represented by the MPI_Isend and MPI_Irecv functions. Non-blocking calls return almost immediately; it is up to the developer to verify whether the message was really sent (allowing the buffer to be reused). We can wait until the transmission is done (using functions such as MPI_Wait), or simply test whether a message was transmitted (using the MPI_Test function). The latter option is interesting if we have computation that can be done in parallel, but it requires a careful evaluation of the message status to avoid inconsistencies.
    MPI_Isend (void *buffer, int sendcount, MPI_Datatype dtype, int dest, int tag, MPI_Comm communicator, MPI_Request *req);

    MPI_Irecv (void *buffer, int recvcount, MPI_Datatype dtype, int source, int tag, MPI_Comm communicator, MPI_Request *req);

    MPI_Wait (MPI_Request *req, MPI_Status *stat);

    MPI_Test (MPI_Request *req, int *flag, MPI_Status *stat);


#### Collective Communications

Sometimes we need to transmit data to several processes, or to collect data from different peers. Hence, MPI provides several collective operations that cover almost all data exchange patterns among processes: 1->n, n->1 and n->n. The following table lists the most common collective operations:

| Operation | Description |
|---|---|
| MPI_Bcast() | Broadcasts the same message to all processes |
| MPI_Scatter() | Sends personalized messages to the processes |
| MPI_Gather() | Collects messages from different processes |
| MPI_Reduce() | Collects messages from different processes, executing arithmetic/logical operations on the data |
| MPI_Alltoall() | Each process sends personalized messages to the other processes |
| MPI_Barrier() | Synchronizes different processes |

MPI collective communication operations are not built from MPI_Send and MPI_Recv calls. Instead, all involved processes must call the same collective operation, as in the following example. The parameters of the call determine which process is the source or the destination of the messages, when applicable.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    int main (int argc, char* argv[])
    {
        int parameter;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            parameter = atoi(argv[1]);

        MPI_Bcast (&parameter, 1, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }


# Launching MPI

## MPI Distributions

• MPICH - Argonne National Laboratory
• LAM MPI - Indiana University
• OpenMPI - a new and very promising distribution
Todo Characteristics of each distribution

## Compiling an MPI application

    mpicc myprogram.c -o myprogram


## Deploying MPI processes over GRID5000

### Nodes reservation

Use oarsub or oargridsub (according to starter-level practices: Cluster experiment and Grid experiment).

### Deploying MPI processes

To launch an MPI application over several machines we need a file containing the machine list. The OAR batch scheduler provides the environment variable $OAR_FILE_NODES with such a list, but it is sometimes useful to adapt it to our needs. The syntax depends on the MPI distribution we use. MPICH expects:

    machinename[:nprocs] machinename[:nprocs] ...

while LAM-MPI uses a slightly different syntax:

    machinename[cpu=nprocs] machinename[cpu=nprocs] ...

The machine list is quite simple; the :nprocs part indicates a machine with more than one CPU (so that the deployment fills all resources of the machine). This syntax is equivalent to listing the same machine name several times. Depending on the MPI distribution in use, the machine list is passed directly as a parameter of the mpirun call, or through a boot application:

MPICH

    angelo@node-10:~$ mpirun -np 10 -machinefile <file> toto
    Hello world 0!
    Hello world 1!
    ...
    Hello world 9!

Note Sometimes (at Sophia, for example) we need to use the option -nolocal with MPICH.

LAM

    angelo@node-10:~$ lamboot <file>
    LAM 7.2b1svn03012006/ROMIO - Indiana University
    angelo@node-10:~$ mpirun -np 10 toto
    Hello world 0!
    Hello world 1!
    ...
    Hello world 9!
    angelo@node-10:~$ lamhalt


### Heterogeneity problems

Heterogeneity may cause problems when working in a computational grid. Most problems are due to the different architectures in use at Grid5000 (Xeon, Opteron, Itanium, ...), which forces applications to be compiled at each different cluster (especially since we do not have a fully shared filesystem).

The second type of problem comes from the software available at each site, and the compatibility between different versions of MPI. For instance, LAM-MPI does not tolerate mixing different versions when running an application. On the other hand, MPICH does not enforce any restriction, and an application compiled with version 1.2.5 can "talk" with another compiled with MPICH 1.2.7. Nevertheless, it is not wise to mix different versions, as some evolutions (such as new collective communication algorithms) may make versions partially incompatible.

# Exercises

## TP #1 - Warming up

To start our exercise session, we'll begin with some basic MPI operations. These operations are fundamental to most MPI programs, as they are used to initialize the environment and "locate" a process.

### Hello World

Write a "Hello World" application from the skeleton presented before. Try to use the following operations:

• MPI_Comm_size() - to print the total number of running processes
• MPI_Comm_rank() - each process can obtain its rank among all processes
• Optionally, call MPI_Get_processor_name() to print the machine name

### Logical Ring

Write an MPI application that simulates the passing of a token in a logical ring among the processes. Try to use blocking communication operations MPI_Send and MPI_Recv.

• Advanced exercise: send useful data between processes. For instance, try to calculate Pi using Euler's solution to the Basel problem, the $\zeta(2)$ series[1].

### Collective Communications

Write an algorithm to broadcast a message (for example, one value chosen by the root process). All processes will modify the value (for example, by adding their rank to it) and then transmit their results back to the root process.

• Initially try to use only MPI_Send and MPI_Recv. What difficulties did you encounter?
• Rewrite the application using this time collective communication operations. Please use MPI_Bcast() to broadcast a message and MPI_Gather() to collect the results.

## TP #2 - Network performance

As we are working with a computational grid, it is important to be aware of the different performance levels obtained according to the location of the processes.

 SMP Machine <-> Cluster <-> Computational Grid

Furthermore, communication performance depends on the application and transport layers, which may sometimes give us more modest results than expected.

To evaluate the communication performance, we will measure the latency and the throughput of a link using the approach from NWS[2].

• Latency - obtained from the round-trip time of an empty message (sndcount = 0)
Latency = RTT(0)/2

• Throughput - obtained from the time needed to send a relatively large message (64KB for NWS) and the acknowledgement of that transmission
TPut = msgsize/(RTT(m)-2*Latency)


Try initially to use blocking calls (MPI_Send and MPI_Recv), but if time allows, compare with the performance of non-blocking calls.

Try to compose a reference table with the average latency and throughput between SMP processes, cluster processes and processes on distant grid sites.

## TP #3 - Communication cost and granularity

In this experiment we will try to observe the impact of communication on the cost of parallel algorithms. We will work with an algorithm that multiplies two square matrices A and B. The experiment consists in identifying the best trade-off between the number of parallel processes and the data granularity.

TBD