The Message Passing Interface (MPI) is a standard for communication between a group of processes working on the same problem. It allows the computational work to be distributed over multiple processors and/or computers, resulting in an overall faster solution of the problem. Two implementations of MPI are installed on the Beowulf cluster: lam-mpi and mpich. It is strongly recommended to use mpich, because it is the only one compatible with the Sun Grid Engine. For the historical reason that lam-mpi was installed first, mpich is not located in the default directories /usr/bin, /usr/include and /usr/lib, so the search paths have to be updated (see below). More information (e.g. troubleshooting) on lam-mpi can be found here.
The MPICH implementation is stricter than lam-mpi and requires that MPI::Init is called in C and C++ programs with the arguments argc and argv. Note that before the call to MPI::Init the string array argv holds information on the MPI process and NOT the command line arguments supplied to the program. Only after the call can they be accessed in the standard way. This does not apply to Fortran programs.
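A minimal C++ sketch illustrating this behaviour (the input-file argument is only an example and not part of the installation):

#include <mpi.h>
#include <iostream>

int main(int argc, char* argv[])
{
    // Before MPI::Init argv holds information on the MPI process,
    // not the command line arguments supplied by the user.
    MPI::Init(argc, argv);

    // After the call argv can be accessed in the standard way,
    // e.g. to read the name of an input file.
    if (argc > 1)
        std::cout << "Input file: " << argv[1] << std::endl;

    MPI::Finalize();
    return 0;
}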
The MPICH implementation is located at /usr/local/mpich with the available compilers in the subdirectory bin:
mpicc  | C compiler
mpicxx | C++ compiler
mpif77 | Fortran77 compiler
mpif90 | Fortran90 compiler
Note that executables with the same names from the lam-mpi installation are located in /usr/bin, which is part of the default search path. To ensure that the MPICH compilers are invoked, the full path should be used for compilation, e.g.
/usr/local/mpich/bin/mpicxx
for the C++ compiler. The compilers automatically add the correct include path of the MPICH implementation (located at /usr/local/mpich/include), so it is not necessary to specify it explicitly with the -I option. MPICH does not require any specific library to be linked against. The compiler options are the same as those of the underlying compiler, which on the Beowulf cluster are the GNU compilers gcc, g++ and g77.
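As an illustration, a C++ source file could be compiled with a command along the lines of (the file name mpi_prog.cpp and the output name mpi_prog are placeholders):

/usr/local/mpich/bin/mpicxx -o mpi_prog mpi_prog.cpp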
Parallel jobs are started with the mpirun command, located at /usr/local/mpich/bin/mpirun, which typically requires only the number of processes. This is specified with the -np ## option. Note that the requested number of processes is not limited to the number of available hosts. If it is larger than the number of hosts/processors, multiple processes are spawned on the nodes, which then compete for CPU time on the node.
By default mpirun uses all nodes except harbinge, as defined in the file /usr/local/mpich/util/machines/machines.LINUX. However, the default set of nodes requested for the calculation can be overwritten with the -machinefile <file with list of hosts> option of mpirun. Alternatively, the option -exclude <list of colon delimited hosts> will exclude hosts from the default list. To request all available nodes/cpus the -np ## option can be replaced by the -allcpus option.
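For illustration, assuming a hand-written host file named myhosts in the current directory (one hostname per line; the file name and the number of processes are only examples), a run restricted to those hosts could look like:

/usr/local/mpich/bin/mpirun -np 4 -machinefile myhosts /home/reiche/bin/mpich_spur test.xml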
As an example, to start the MPI version of the program mpich_spur the required command (one line!) is:
/usr/local/mpich/bin/mpirun -allcpus /home/reiche/bin/mpich_spur test.xml
The file test.xml is an input file for mpich_spur and should be located in the current directory.
Sun Grid Engine supports parallel computation using a parallel environment. In the case of MPICH the parallel environment is enabled with the option -pe mpich of the qsub command or the corresponding line in the shell script. The option requires an additional argument, which is the number of requested nodes. It can be a single number (e.g. 8) or a range (e.g. 2-8). Grid Engine tries to maximize this number. In the case that only 6 nodes are available for a job, -pe mpich 8 will keep the job in the queue until 8 nodes become available, while -pe mpich 2-6 will start the job with 6 nodes. The range -8 corresponds to 0-8 and 8- to 8-infinity. However, 'infinity' is only a request and is limited to the number of slots defined in the MPICH parallel environment (currently 14).
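For example, a job script could be submitted with a request for between 2 and 8 slots like this (job.sh is a placeholder for the actual submission script):

qsub -pe mpich 2-8 job.sh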
Prior to starting the MPI job, Grid Engine analyses the requested number of nodes. The result is stored in the environment variable $NSLOTS. It also creates a file with the list of the hosts, located at $TMPDIR/machines, where $TMPDIR is the path to a temporary directory created by Grid Engine for the specific job.
A typical script looks like
#!/bin/bash
#$ -S /bin/bash
#
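# request the MPICH parallel environment with 8 slots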
#$ -pe mpich 8
#
echo "Got $NSLOTS slots"
/usr/local/mpich/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /home/reiche/bin/mpich_spur test.xml
echo "Job Done"
Sample scripts can be found at /opt/sge/mpi.
Note that Grid Engine spawns one additional job, which corresponds to the actual call of mpirun. A look with qstat while the parallel job is running shows that it has no job-id assigned and does not consume any CPU time.