IA32 Linux Clusters Overview Exercise

Preparation:

  1. Login to the workshop machine

    Workshops differ in how this is done. The instructor will go over this beforehand.

    For this workshop, we will be using a small, non-production IA32 Linux cluster called "pengra". Pengra's configuration varies from day to day depending upon what it is being used for. During the workshop, we will use the 12 compute nodes, divided into two partitions: pdebug and pbatch.

    Pengra is equipped with a Quadrics switch, so the exercises reflect usage as it would be on most of LC's production IA32 Linux clusters. Things are done a little differently on clusters without a switch (ACE and ILX).

  2. Review the login banner. Specifically notice its various sections.

  3. Copy the example files

    First create a linux_clusters subdirectory, then copy the files, and then cd into the linux_clusters subdirectory:

    mkdir linux_clusters
    cp -R /usr/global/docs/training/blaise/ia32linux_clusters/*   ~/linux_clusters
    cd linux_clusters

  4. Verify your exercise files

    Issue the ls -ld * command. Your output should look something like the following:

    drwx------    2 class01  class01      4096 Jun 23 12:45 benchmarks
    drwx------    5 class01  class01      4096 Jun 23 12:44 bugs
    drwx------    4 class01  class01      4096 Jun 23 12:44 mpi
    drwx------    4 class01  class01      4096 Jun 23 12:44 openMP
    drwx------    2 class01  class01      4096 Jun 23 12:44 pthreads
    drwx------    4 class01  class01      4096 Jun 23 12:44 serial
    

Job Usage and Configuration Information:

  1. Before we attempt to actually compile and run anything, let's get familiar with some basic usage and configuration commands. For the most part, these commands can be used on any parallel LC system with a high-speed interconnect.

  2. Try each of the following commands, comparing and contrasting them to each other. Most have a man page if you need more information.

    NOTE: most of these commands won't show much on the workshop machine, where there isn't much going on. The story is very different, of course, on one of the production Linux cluster machines.

    ju
        Basic job usage information for each partition. An LC-developed command.
    spjstat
        More job usage information, with more detail per job; shows running jobs only. An LC-developed command ported from the IBM SPs.
    spj
        Like the previous command, but also shows non-running (queued) jobs. An LC-developed command ported from the IBM SPs.
    squeue
        Another job usage display. A SLURM command.
    sinfo
        Basic partition configuration information. A SLURM command.
    pstat -m pengra
        An LCRM command to show all LCRM jobs on the specified machine. You probably won't see anything running on pengra right now.
    pstat -A
        An LCRM command to show all LCRM jobs in the system.
    pstat -f jobid
        Display the full details for a specified job. For the jobid parameter, select any jobid from the output of the pstat -A command; jobids are in the first column. Because the workshop is being conducted on a non-production LC machine, you may not see many jobs to choose from. If you do see some, it might be more interesting to pick a job with a RUN status. An LCRM command.

Building and Running Serial Applications:

  1. Go to either the C or Fortran versions of the serial applications:
    cd serial/c
       or 
    cd serial/fortran
    Review the Makefile, noting the compiler being used and its options. See the compiler man page for an explanation of the options.

    When you are ready, issue the make command to build the examples.

  2. Run any of the example codes by typing the executable name at the command prompt. For example: ser_array

  3. Time the untuned code - make a note of its timing:
    time untuned
  4. Edit the Makefile. Comment out the compiler option assignment line that begins with 'COPT=' or 'FOPT=', depending on whether you are using C or Fortran. Do this by inserting a '#' in front of the line. Then, uncomment the line beneath it that begins with '#COPT=' or '#FOPT='.
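    For illustration, the edit has the general shape shown below. The option values here are placeholders borrowed from elsewhere in this exercise, not necessarily the actual flags in the workshop Makefile:

      Before the edit (C version):
        COPT = -O0
        #COPT = -O3 -tpp7 -xW
      After the edit:
        #COPT = -O0
        COPT = -O3 -tpp7 -xW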

  5. Build the untuned example with the new compiler options:
    make clean
    make untuned
    Examine the compiler output produced by the new options. What do you notice?

  6. Now time the new untuned executable:
    time untuned
    How does it perform compared to the previous run? What changes in compiler options account for the differences?

    Note: if you try both C and Fortran, the result differences are due to loop index variables - C starts at 0 and Fortran at 1.

MPI Runs:
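The MPI example codes are in your ~/linux_clusters/mpi subdirectory. They are built with the mpiicc (C) or mpiifort (Fortran) compiler wrappers and launched with srun, just like the benchmarks later in this exercise. As a point of reference, a minimal ping-pong code in the spirit of mpi_ping.c might look like the hypothetical C sketch below (not the workshop's actual source):

    /* Hypothetical minimal MPI ping-pong sketch. Build with something
     * like "mpiicc mpi_ping.c -o mpi_ping" and launch with, for example,
     * "srun -n2 -ppdebug mpi_ping". */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, buf = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* task 0: send, then wait for the echo */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("Task 0 received the message back\n");
        } else if (rank == 1) {     /* task 1: echo the message back */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }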

Resolving Unresolved Externals:

  1. Deliberately create a situation with missing externals by re-linking any of the above MPI codes using icc or ifort instead of mpiicc or mpiifort. For instance:
    icc -o mpi_ping mpi_ping.o
    ifort -o mpi_ping mpi_ping.o
  2. The linker will indicate a number of unresolved externals that prevent you from linking successfully. Select one of these symbol names and use it as an argument to findentry. For example, if you are using the C version, try:
    findentry MPI_Recv

  3. Notice the output of the findentry utility, such as the list of library directories it searches, and any possible matches to what you are looking for.

  4. With a real application, you could now attempt to link to a relevant library path and library to resolve the undefined reference. No need to do so here though...
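    For illustration only, such a link line has the general form shown below. The directory and library names are placeholders, not pengra's actual MPI installation paths; in practice the mpiicc/mpiifort wrappers are preferred, since they supply the correct paths and libraries automatically:

      icc -o mpi_ping mpi_ping.o -L/path/to/mpi/lib -lmpi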

Bad Floating-point Stack Pointer:

The etime_call and badfpstack codes demonstrate two examples of how the x87 floating-point stack pointer can get out of sync. The etime_call example is Fortran and the badfpstack example is C. Try either or both.
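To see the underlying mechanism, consider the minimal C sketch below. It is hypothetical, not the workshop's actual badfpstack.c, but it shows the same failure mode: on IA32, a float or double function result is returned in x87 register st(0), and a caller that declares the function with the wrong return type never pops that result. After eight leaked values the x87 register stack is full, and subsequent floating-point operations produce NaNs.

    /* Hypothetical x87 stack-leak sketch (not the workshop's badfpstack.c). */
    #include <stdio.h>

    /* Returns a double; on IA32 the result comes back in x87 register st(0). */
    double twice(double x) { return x * 2.0; }

    int main(void)
    {
        /* Deliberately call twice() through a pointer typed as returning
         * void, so the caller never pops the result off the x87 stack. */
        void (*leaky)(double) = (void (*)(double)) twice;
        int i;

        for (i = 0; i < 8; i++)   /* the x87 stack has only 8 registers */
            leaky(1.0);           /* each call leaves one value behind  */

        /* The x87 stack is now full; further floating-point arithmetic
         * yields NaNs instead of the expected 4.0. */
        printf("2.0 doubled = %f\n", twice(2.0));
        return 0;
    }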

Pthreads:

  1. cd to your ~/linux_clusters/pthreads subdirectory. You will see several C files written with pthreads. There are no Fortran files because a standardized Fortran API for pthreads was never adopted.

  2. If you are already familiar with pthreads, you can review the files to see what is intended. If you are not familiar with pthreads, this part of the exercise will probably not be of interest.

  3. Compiling with pthreads is easy: just add the -pthread option to your compile command. For example:
    icc -pthread hello.c -o hello
    Compile any/all of the example codes. (A minimal pthreads program is sketched at the end of this section.)

  4. To run, just enter the name of the executable.
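For reference, a minimal pthreads program in the spirit of the hello.c example might look like this hypothetical sketch (the workshop's actual file may differ):

    /* Hypothetical pthreads "hello" sketch; build with: icc -pthread hello.c -o hello */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_THREADS 4

    /* Thread start routine: each thread just announces itself. */
    void *hello(void *arg)
    {
        printf("Hello from thread %ld\n", (long) arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NUM_THREADS];
        long t;

        for (t = 0; t < NUM_THREADS; t++)
            if (pthread_create(&threads[t], NULL, hello, (void *) t) != 0) {
                fprintf(stderr, "pthread_create failed\n");
                exit(1);
            }

        for (t = 0; t < NUM_THREADS; t++)
            pthread_join(threads[t], NULL);   /* wait for all threads to finish */

        return 0;
    }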

OpenMP:

  1. Depending upon your preference, cd to your ~/linux_clusters/openMP/c/   or   ~/linux_clusters/openMP/fortran/ subdirectory. You will see several OpenMP codes.

  2. If you are already familiar with OpenMP, you can review the files to see what is intended. If you are not familiar with OpenMP, this part of the exercise will probably not be of interest.

  3. Compiling with OpenMP is easy: just add the -openmp option to your compile command. For example:
    icc -openmp omp_hello.c -o hello
    ifort -openmp omp_reduction.f -o reduction
    Compile any/all of the example codes.

  4. To run, just enter the name of the executable.

  5. Note: by default, the number of OpenMP threads created will be equal to the number of CPUs on a node. You can override this by setting the OMP_NUM_THREADS environment variable to a value of your choice.
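As an illustration of the note above, here is a hypothetical OpenMP sketch similar in spirit to omp_hello.c. Each thread reports its id, and thread 0 reports the team size; setting OMP_NUM_THREADS before running (for example, setenv OMP_NUM_THREADS 4 under csh, or export OMP_NUM_THREADS=4 under sh/bash) changes the reported team size:

    /* Hypothetical OpenMP "hello" sketch; build with: icc -openmp omp_hello.c -o hello */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel          /* fork a team of threads */
        {
            int tid = omp_get_thread_num();
            printf("Hello from thread %d\n", tid);

            if (tid == 0)             /* only thread 0 reports the team size */
                printf("Number of threads = %d\n", omp_get_num_threads());
        }
        return 0;
    }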

Run a Few Benchmarks/Tests:

  1. Run the STREAM memory bandwidth benchmark

    1. cd ~/linux_clusters/benchmarks

    2. Compile either the C or the Fortran version of the code, depending on your preference. You'll see some informational messages about OpenMP parallelization.

      C
      icc -O3 -tpp7 -xW -openmp stream.c -o stream 
      Fortran
      icc -O3 -tpp7 -xW -DUNDERSCORE -c mysecond.c
      ifort -O3 -tpp7 -xW -openmp stream.f mysecond.o  -o stream

    3. Then run the code on a single node:
      srun -n1 -ppdebug stream
    4. Note the timings when it completes. Compare them to the theoretical peak memory-to-CPU bandwidth of 3.2 GB/s for the E7500 chipset. Note that we are running this as a 2-way OpenMP threaded job, using both CPUs on the node.

    5. For more information on this benchmark, see http://www.streambench.org

  2. Run an MPI message passing test, which shows the bandwidth depending upon number of nodes used and type of MPI routine used. This isn't an official benchmark - just a local test.

    1. Assuming you are still in your ~/linux_clusters/benchmarks subdirectory, compile the code (sorry, only a C version at this time):
      mpiicc -O3 -tpp7 -xW mpi_multibandwidth.c -o mpitest

    2. Run it using both CPUs on 2 different nodes. Also be sure to redirect the output to a file instead of letting it go to stdout:
      srun -N2 -n4 -ppdebug mpitest > mpitest.output4

    3. After the test runs, check the output file for the results. Notice in particular how bandwidth improves with message size and how much variation there can be between the 4 tasks at any given message size.

    4. To find the best OVERALL average, do something like this:
      grep OVERALL mpitest.output4 | sort
      You can then search within your output file for the case that had the best performance.

    5. Now repeat the run, but instead, only use 1 task on each of 2 nodes and send the output to a new file:
      srun -N2 -ppdebug mpitest > mpitest.output2

    6. Find the best OVERALL average again for this run:
      grep OVERALL mpitest.output2 | sort

    7. Which case (using both CPUs per node or just one) performs better? Why?

Online Machine Status Information:

  1. Go to the main LC web page by clicking on the link below. It will open a new window so you can follow along with the rest of the instructions.
    www.llnl.gov/computing

  2. Under the "High Performance Computing" section, you'll notice the little green/red arrows for "OCF Machine Status". Click there.

  3. When prompted for your user name and password, use your class## userid and the PIN + OTP token for your password. Ask the instructor if you're not sure what this means.

  4. You will then be taken to the "LC OCF Machines Status" web matrix. Find one of the Linux cluster machines and note what info is displayed.

  5. Now click on the hyperlinked name of that machine. You will be taken to much more information about it, including links to still more information, which you can follow if you like.


This completes the exercise.

Evaluation Form: Please complete the online evaluation form if you have not already done so for this tutorial.
