MPI Exercise

  1. Log in to the workshop machine

    Workshops differ in how this is done. The instructor will go over this beforehand.

  2. Copy the example files

    1. In your home directory, create a subdirectory for the MPI test codes and cd to it.
      mkdir ~/mpi
      cd  ~/mpi

    2. Copy either the Fortran or the C version of the parallel MPI exercise files to your mpi subdirectory:

      C:
      cp  /usr/global/docs/training/blaise/mpi/C/*   ~/mpi
      Fortran:
      cp  /usr/global/docs/training/blaise/mpi/Fortran/*   ~/mpi
      

    3. Some of the example codes have serial versions for comparison. If you are interested in comparing/running the serial versions of the exercise codes, use the appropriate command below to copy those files to your mpi subdirectory also.

      C:
      cp  /usr/global/docs/training/blaise/mpi/Serial/C/*   ~/mpi
      
      Fortran:
      cp  /usr/global/docs/training/blaise/mpi/Serial/Fortran/*   ~/mpi 
      

  3. List the contents of your MPI subdirectory

    You should notice quite a few files. The parallel MPI versions have names which begin with or include mpi_. The serial versions have names which begin with or include ser_. Makefiles are also included.

    Note: These are example files, and as such, are intended to demonstrate the basics of how to parallelize a code. Most execute in a second or two. The serial codes will be faster because the problem sizes are so small and there is none of the overhead associated with parallel setup and execution.

    Array Decomposition
      C:       mpi_array.c, ser_array.c
      Fortran: mpi_array.f, ser_array.f

    Matrix Multiply
      C:       mpi_mm.c, ser_mm.c
      Fortran: mpi_mm.f, ser_mm.f

    pi Calculation - point-to-point communications
      C:       mpi_pi_send.c, dboard.c, ser_pi_calc.c
      Fortran: mpi_pi_send.f, dboard.f, ser_pi_calc.f

    pi Calculation - collective communications
      C:       mpi_pi_reduce.c, dboard.c, ser_pi_calc.c
      Fortran: mpi_pi_reduce.f, dboard.f, ser_pi_calc.f

    Concurrent Wave Equation
      C:       mpi_wave.c, draw_wave.c, ser_wave.c
      Fortran: mpi_wave.f, mpi_wave.h, draw_wave.c, ser_wave.f

    2D Heat Equation
      C:       mpi_heat2D.c, draw_heat.c, ser_heat2D.c
      Fortran: mpi_heat2D.f, mpi_heat2D.h, draw_heat.c, ser_heat2D.f

    Round Trip Latency Timing Test
      C:       mpi_latency.c
      Fortran: mpi_latency.f

    Bandwidth Timing Test
      C:       mpi_bandwidth.c
      Fortran: mpi_bandwidth.f

    Prime Number Generation
      C:       mpi_prime.c, ser_prime.c
      Fortran: mpi_prime.f, ser_prime.f

    2D FFT
      C:       mpi_2dfft.c, mpi_2dfft.h, ser_2dfft.c
      Fortran: mpi_2dfft.f, timing_fgettod.c, ser_2dfft.f

    From the tutorial...
      Blocking send-receive:        mpi_ping.c, mpi_ping.f
      Non-blocking send-receive:    mpi_ringtopo.c, mpi_ringtopo.f
      Collective communications:    mpi_scatter.c, mpi_scatter.f
      Contiguous derived datatype:  mpi_contig.c, mpi_contig.f
      Vector derived datatype:      mpi_vector.c, mpi_vector.f
      Indexed derived datatype:     mpi_indexed.c, mpi_indexed.f
      Structure derived datatype:   mpi_struct.c, mpi_struct.f
      Groups/Communicators:         mpi_group.c, mpi_group.f
      Cartesian Virtual Topology:   mpi_cartesian.c, mpi_cartesian.f

    Makefiles
      C:       Makefile.MPI.c, Makefile.Ser.c
      Fortran: Makefile.MPI.f, Makefile.Ser.f

    Programs with bugs
      C:       mpi_bug1.c, mpi_bug2.c, mpi_bug3.c, mpi_bug4.c, mpi_bug5.c, mpi_bug6.c
      Fortran: mpi_bug1.f, mpi_bug2.f, mpi_bug3.f, mpi_bug4.f, mpi_bug5.f, mpi_bug6.f

  4. Review the array decomposition example code

    Depending upon your preference, take a look at either mpi_array.c or mpi_array.f. The comments explain how MPI is used to implement a parallel data decomposition on an array. You may also wish to compare this parallel version with its corresponding serial version, either ser_array.c or ser_array.f.
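
    The heart of the example is a master/worker data decomposition: the master initializes the full array, hands each task a contiguous chunk, every task updates its own chunk, and the master collects the results. The code below is only a rough sketch of that general pattern, not the actual mpi_array source; the array size, message tags, and update step are made up for illustration.

      /* Illustrative master/worker array decomposition sketch (not mpi_array itself). */
      #include <stdio.h>
      #include "mpi.h"

      #define ARRAYSIZE 1600    /* assumed size; must divide evenly among the tasks */
      #define MASTER    0

      int main(int argc, char *argv[]) {
          int   numtasks, taskid, chunksize, i, dest, source;
          float data[ARRAYSIZE];
          MPI_Status status;

          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
          MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
          chunksize = ARRAYSIZE / numtasks;

          if (taskid == MASTER) {
              for (i = 0; i < ARRAYSIZE; i++) data[i] = (float) i;
              /* send each worker its chunk, then update the master's own chunk */
              for (dest = 1; dest < numtasks; dest++)
                  MPI_Send(&data[dest * chunksize], chunksize, MPI_FLOAT, dest, 1, MPI_COMM_WORLD);
              for (i = 0; i < chunksize; i++) data[i] = data[i] + i;
              /* collect the updated chunks back from the workers */
              for (source = 1; source < numtasks; source++)
                  MPI_Recv(&data[source * chunksize], chunksize, MPI_FLOAT, source, 2,
                           MPI_COMM_WORLD, &status);
              printf("Master: all chunks received; data[%d] = %f\n",
                     ARRAYSIZE - 1, data[ARRAYSIZE - 1]);
          } else {
              /* each worker receives a chunk, updates it, and returns it */
              float chunk[ARRAYSIZE];
              MPI_Recv(chunk, chunksize, MPI_FLOAT, MASTER, 1, MPI_COMM_WORLD, &status);
              for (i = 0; i < chunksize; i++) chunk[i] = chunk[i] + i;
              MPI_Send(chunk, chunksize, MPI_FLOAT, MASTER, 2, MPI_COMM_WORLD);
          }

          MPI_Finalize();
          return 0;
      }

    The point of the exercise code is the same idea: each task works only on its own chunksize-sized piece of the array.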

  5. Compile the array decomposition example code

    Invoke the appropriate IBM compiler command:

    C:
    mpxlc -blpdata -q64 -O2 mpi_array.c  -o mpi_array
    
    Fortran:
    mpxlf -blpdata -q64 -O2 mpi_array.f -o mpi_array 

  6. Set up your execution environment

    In this step you'll set a few POE environment variables, specifically those that answer three questions: how many MPI tasks to run, how many nodes to use, and which pool of nodes to run them on.

    Set the following environment variables as shown:

    Environment Variable       Description
    setenv MP_PROCS 4          Request 4 MPI tasks
    setenv MP_NODES 1          Specify the number of nodes to use
    setenv MP_RMPOOL pclass    Select the interactive node pool
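
    If you want to confirm that these settings took effect, a tiny throwaway program such as the one below (not part of the exercise files) can be compiled with mpxlc and run; with MP_PROCS set to 4 it should start 4 tasks, each printing its rank.

      /* Illustrative check program (not part of the exercise files): prints one
         line per MPI task so you can see how many tasks were actually started. */
      #include <stdio.h>
      #include "mpi.h"

      int main(int argc, char *argv[]) {
          int ntasks, rank;
          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &ntasks);   /* should match MP_PROCS */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          printf("Task %d of %d is running\n", rank, ntasks);
          MPI_Finalize();
          return 0;
      }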

  7. Run the executable

    Now that your execution environment has been set up, run the array decomposition executable:

    mpi_array

  8. Compare other serial codes to their parallel version

    Compare any of the other serial codes to their parallel versions to see how they were parallelized. If time permits, you might even try starting with a serial code or two and creating your own parallel version.

  9. Try any/all of the other example codes

    The included Makefiles can be used to compile any or all of the exercise codes. For example, to compile all of the parallel MPI codes and all of the serial codes:

    C:
    make -f Makefile.MPI.c
    make -f Makefile.Ser.c
    Fortran:
    make -f Makefile.MPI.f 
    make -f Makefile.Ser.f 

    You can also compile selected example codes individually - see the Makefile for details. For example, to compile just the matrix multiply example code:

    C:
    make -f Makefile.MPI.c  mpi_mm
    
    Fortran:
    make -f Makefile.MPI.f  mpi_mm 

    In either case, be sure to examine the makefile to understand the actual compile command used.

    Most of the executables require 4 or fewer MPI tasks. Exceptions are noted below.

    mpi_array
      Requires that MP_PROCS be evenly divisible by 4.

    mpi_group, mpi_cartesian
      mpi_group requires 8 MPI tasks and mpi_cartesian requires 16 MPI tasks. You can
      accomplish this with a combination of the MP_PROCS, MP_NODES and MP_TASKS_PER_NODE
      environment variables.

    mpi_wave, mpi_heat2D
      These examples attempt to generate an X Windows display. Make sure your X Windows
      environment and software are set up correctly. Ask the instructor if you have any
      questions.

    mpi_latency, mpi_bandwidth
      The mpi_latency example requires only 2 MPI tasks, and the mpi_bandwidth example
      requires an even number of tasks. Setting MP_USE_BULK_XFER to "yes" will demonstrate
      the difference in performance between RDMA and non-RDMA communications. Also try
      comparing communications bandwidth when both tasks are on the same node versus on
      different nodes.
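
    To see what the latency test is measuring, note that the essence of a round-trip timing is a two-task ping-pong bracketed by MPI_Wtime calls. The sketch below is illustrative only; the message size, repetition count, and output are assumptions, not the actual mpi_latency code.

      /* Illustrative ping-pong latency sketch (not the actual mpi_latency code).
         Run with exactly 2 MPI tasks. */
      #include <stdio.h>
      #include "mpi.h"

      #define REPS 1000    /* assumed number of round trips to average over */

      int main(int argc, char *argv[]) {
          int    rank, i;
          char   msg = 'x';                /* 1-byte message */
          double t1, t2;
          MPI_Status status;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          t1 = MPI_Wtime();
          for (i = 0; i < REPS; i++) {
              if (rank == 0) {             /* send, then wait for the echo */
                  MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                  MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
              } else if (rank == 1) {      /* echo each message straight back */
                  MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                  MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
              }
          }
          t2 = MPI_Wtime();

          if (rank == 0)
              printf("Average round-trip time: %f microseconds\n",
                     (t2 - t1) * 1.0e6 / REPS);
          MPI_Finalize();
          return 0;
      }

    A bandwidth test typically follows the same pattern, just with much larger messages.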

  10. When things go wrong...

    There are many things that can go wrong when developing MPI programs. The mpi_bugX series of programs demonstrates just a few. See if you can figure out what the problem is in each case and then fix it.

    Use mpxlc or mpxlf to compile each code as appropriate.

    The buggy behavior will differ for each example. Some hints are provided below.

    Code        Behavior and Hints/Notes
    mpi_bug1    Hangs
    mpi_bug2    Seg fault / core dump / abnormal termination
    mpi_bug3    Error message
    mpi_bug4    Hangs and gives the wrong answer. Compare to mpi_array; the number of
                MPI tasks must be divisible by 4.
    mpi_bug5    Dies or hangs, depending upon the AIX and PE versions
    mpi_bug6    Terminates (under AIX). Requires 4 MPI tasks.
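
    As one concrete illustration of a failure mode to watch for (this is not the source of any particular mpi_bugX program), two tasks that each post a blocking receive before the matching send will hang:

      /* Illustrative deadlock (not one of the mpi_bugX programs): both tasks block
         in MPI_Recv, waiting for a message the other task has not sent yet.
         Run with 2 MPI tasks. */
      #include <stdio.h>
      #include "mpi.h"

      int main(int argc, char *argv[]) {
          int rank, other, in = 0, out;
          MPI_Status status;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          other = (rank == 0) ? 1 : 0;
          out   = rank;

          MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);   /* hangs here */
          MPI_Send(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

          printf("Task %d received %d\n", rank, in);
          MPI_Finalize();
          return 0;
      }

    Reordering the send and receive on one of the tasks, or switching to non-blocking MPI_Isend/MPI_Irecv calls followed by MPI_Wait, removes the deadlock.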


This completes the exercise.

Evaluation Form: Please complete the online evaluation form if you have not already done so for this tutorial.
