IA32 Linux Clusters Overview Exercise
Preparation:
- Login to the workshop machine
Workshops differ in how this is done. The instructor will go over this
beforehand.
For this workshop, we will be using a small, non-production IA32 Linux
cluster called "pengra". Pengra's configuration varies from day to day
depending upon what it is being used for. During the workshop, we will
use the 12 compute nodes, divided into two partitions: pdebug and pbatch.
Pengra is equipped with a Quadrics switch, so the exercises reflect
usage as it would be on most of LC's production IA32 Linux clusters.
Things are done a little differently on clusters without a switch
(ACE and ILX).
- Review the login banner. Specifically notice the various sections:
- Welcome section - where to get more info, help
- Announcements - for all LC machines, plus Pengra-specific ones
- Any unread news items
- Copy the example files
First create a linux_clusters subdirectory, then copy the files, and
then cd into the linux_clusters subdirectory:
mkdir linux_clusters
cp -R /usr/global/docs/training/blaise/ia32linux_clusters/* ~/linux_clusters
cd linux_clusters
- Verify your exercise files
Issue the ls -ld * command. Your output
should show something like below:
drwx------ 2 class01 class01 4096 Jun 23 12:45 benchmarks
drwx------ 5 class01 class01 4096 Jun 23 12:44 bugs
drwx------ 4 class01 class01 4096 Jun 23 12:44 mpi
drwx------ 4 class01 class01 4096 Jun 23 12:44 openMP
drwx------ 2 class01 class01 4096 Jun 23 12:44 pthreads
drwx------ 4 class01 class01 4096 Jun 23 12:44 serial
Job Usage and Configuration Information:
- Before we attempt to actually compile and run anything, let's get
familiar with some basic usage and configuration commands.
For the most part, these commands can be used on any parallel LC
system with a high-speed interconnect.
- Try each of the following commands, comparing and contrasting them to
each other. Most have a man page if you need more information.
NOTE: most of these commands won't show a whole lot on the workshop
machine where there isn't much going on. The story is very different
of course on one of the production Linux cluster machines.
  ju               Basic job usage information for each partition. An
                   LC-developed command.

  spjstat          More job usage information, with more detail per job.
                   Shows running jobs only. An LC-developed command ported
                   from the IBM SPs.

  spj              Like the previous command, but also shows non-running
                   (queued) jobs. An LC-developed command ported from the
                   IBM SPs.

  squeue           Another job usage display. A SLURM command.

  sinfo            Basic partition configuration information. A SLURM
                   command.

  pstat -m pengra  An LCRM command to show all LCRM jobs on the specified
                   machine. You probably won't see anything running on
                   pengra right now.

  pstat -A         An LCRM command to show all LCRM jobs in the system.

  pstat -f jobid   An LCRM command to display the full details for a
                   specified job. For the jobid parameter, select any
                   jobid from the output of the pstat -A command; jobids
                   are in the first column. Note that because the workshop
                   is being conducted on a non-production LC machine, you
                   may not see many jobs to choose from. If you do see
                   some, it might be more interesting to pick a job with
                   a RUN status.
Building and Running Serial Applications:
- Go to either the C or Fortran versions of the serial applications:
cd serial/c
or
cd serial/fortran
Review the Makefile, noting the compiler being used and its options.
See the compiler man page for an explanation of the options.
When you are ready, issue the make command to build the examples.
- Run any of the example codes by typing the executable name at the
command prompt. For example: ser_array
- Time the untuned code - make a note of its timing:
time untuned
- Edit the Makefile. Comment out the compiler option assignment
line that begins with 'COPT=' or 'FOPT=', depending on whether you are
using C or Fortran. Do this by inserting a '#' in front of the line.
Then, uncomment the line beneath it that begins with '#COPT=' or '#FOPT='.
- Build the untuned example with the new compiler options:
make clean
make untuned
Examine the compiler output produced by the new options. What do you
notice?
- Now time the new untuned executable:
time untuned
How does it perform compared to the previous run?
Which changes in the compiler options account for the difference?
Note: if you try both C and Fortran, any differences in results between
the two are due to loop index variables: C arrays start at 0 and
Fortran arrays start at 1.
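- For reference, below is a hedged sketch (in C) of the kind of loop such
an untuned code typically spends its time in; the actual untuned source in
the exercise directory may differ. Built without optimization, the loop
runs as written; built with -O3 plus vectorization options, the compiler
can unroll and vectorize it, and usually reports which loops it optimized.
      /* untuned_sketch.c - hypothetical stand-in for the "untuned" example.
         Build unoptimized, then with -O3 and vectorization options, and
         compare the timings with the time command. */
      #include <stdio.h>

      #define N    1000000
      #define REPS 100

      static double a[N], b[N], c[N];

      int main(void)
      {
          int i, iter;

          for (i = 0; i < N; i++) {      /* initialize the arrays */
              b[i] = i * 0.5;
              c[i] = i * 2.0;
          }

          for (iter = 0; iter < REPS; iter++)
              for (i = 0; i < N; i++)    /* candidate loop for vectorization */
                  a[i] = b[i] + 1.5 * c[i];

          printf("a[%d] = %f\n", N - 1, a[N - 1]);
          return 0;
      }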
MPI Runs:
- Interactive Runs:
- Go to either the C or Fortran versions of the MPI applications:
cd ~/linux_clusters/mpi/c
or
cd ~/linux_clusters/mpi/fortran
As before, review the Makefile, noting the compiler used and its options.
When you are ready, type make to build the example codes.
- Run the codes directly using srun on the pdebug partition. For example:
srun -n4 -ppdebug code-of-choice
- Try running the codes with different srun options, for example:
srun -n4 -ppdebug -vvv -m cyclic code-of-choice
srun -n4 -ppdebug -l code-of-choice
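- For reference, a minimal MPI code looks something like the hedged
sketch below (hypothetical - the actual codes in the mpi directory are
more involved). Each task prints its rank, so running it under srun
with the -l option makes the per-task output easy to identify.
      /* mpi_hello_sketch.c - build with mpiicc, run with srun as above */
      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char *argv[])
      {
          int rank, size;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this task's rank */
          MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of tasks */

          printf("Hello from task %d of %d\n", rank, size);

          MPI_Finalize();
          return 0;
      }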
- Batch Runs:
- From the same directory that you ran your mpi codes interactively, open
the psub_script file in a UNIX editor, such as vi
(aliased to vim) or emacs.
- Review this very simple LCRM script. LCRM details are covered in the
LCRM tutorial. Some important notes:
- The executable that will be run is mpi_latency. Make sure that
you have created this - if in doubt, just run make again.
- Make sure that you edit the path specification line in the
script to reflect the directory where your mpi_latency executable
is located - it will differ between C and Fortran.
- Submit the script to the batch partition, which is the default. So you
should not specify -ppdebug as you did in the interactive runs. For
example:
psub psub_script
- Monitor the job's status by using any/all of the commands: ju,
spjstat, pstat, sinfo. The sleep command in the script
should allow you enough time to do so.
- After you are convinced that your job has completed, review the batch
log file. It should be named something like mpi_latency.oNNNNN.
Resolving Unresolved Externals:
- Deliberately create a situation with missing externals by re-linking
any of the above MPI codes using icc or ifort instead of
mpiicc or mpiifort. For instance:
icc -o mpi_ping mpi_ping.o
ifort -o mpi_ping mpi_ping.o
- The linker will indicate a number of unresolved externals that prevent you
from linking successfully. Select one of these symbol names and use it
as an argument to findentry. For example, if you are using the C version,
try:
findentry MPI_Recv
- Notice the output of the findentry utility, such as the list of library
directories it searches, and any possible matches to what you are
looking for.
- With a real application, you could now attempt to link to a relevant
library path and library to resolve the undefined reference. No need
to do so here though...
Bad Floating-point Stack Pointer:
The etime_call and badfpstack codes
demonstrate two examples of how the x87 floating-point stack
pointer can get out of sync. The etime_call example is Fortran
and the badfpstack example is C. Try either or both.
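The bug pattern both codes share can be shown in miniature. The hedged
two-file sketch below borrows the double_val name from the badfpstack
example but is not the actual exercise source: the callee really returns
a double, which the IA32 calling convention leaves on the x87 register
stack. If the caller has no prototype in scope, C's old implicit-declaration
rules assume an int return value, the double is never popped, and the
floating-point stack pointer drifts one slot per call.
      /* val.c - the callee really returns a double */
      double double_val(void)
      {
          return 0.5;
      }

      /* main.c - no prototype for double_val(), so the compiler assumes an
         int return (most compilers only warn about this).  The double left
         on the x87 register stack is never popped, so the floating-point
         stack pointer is off by one after the call.  Uncommenting the
         declaration below is the fix.
         Build: icc -c val.c ; icc main.c val.o -o sketch */
      #include <stdio.h>

      /* extern double double_val(void); */   /* the correct declaration */

      int main(void)
      {
          int x = double_val();   /* return value read from the wrong place */
          printf("x = %d\n", x);
          return 0;
      }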
- etime_call
- cd to your ~/linux_clusters/bugs/etime_call subdirectory.
- Review the Makefile and note the options being used. Then review the
comments in the code - they help explain the issue.
- When you are ready, type make to build the executable.
- First run this serial code interactively: etime_call. It
should appear to run normally.
- Now run the program under gdb to diagnose the problem:
- Load the program: gdb etime_call
- List the source code until you find the statement that does the
call to etime. This is done by repeatedly issuing
the l (lower case letter L) command.
- Set a breakpoint on the line that makes the etime call:
b 52
- Run the program to the breakpoint: r
- Now examine the floating-point registers: info float.
Your registers should look something like below - note where the
arrow (floating-point stack pointer) is located. This is the "top" of
the floating-point stack and should be R0.
R7: Empty 0x4003b3334d0000000000
R6: Empty 0x3ffce560628f5c28f71b
R5: Empty 0x00000000000000000000
R4: Empty 0x00000000000000000000
R3: Empty 0x00000000000000000000
R2: Empty 0x00000000000000000000
R1: Empty 0x00000000000000000000
=>R0: Empty 0x00000000000000000000
- Execute the next statement (the call to the etime routine):
n
- Check the floating-point registers again: info float.
They should look something
like below. Note where the "top" of the stack is this time.
=>R7: Valid 0x3ffea147ae2666666800 +0.6300000041723251565
R6: Empty 0x4005c800000000000000
R5: Empty 0x00000000000000000000
R4: Empty 0x00000000000000000000
R3: Empty 0x00000000000000000000
R2: Empty 0x00000000000000000000
R1: Empty 0x00000000000000000000
R0: Empty 0x00000000000000000000
The top of the stack is R7; however, it should be R0 upon return from
a routine call. This indicates that the floating-point stack pointer
has gotten out of sync.
- Having located the problem, quit the debugger: q
- To fix the problem, edit the source file. You will need to do three
things:
- Uncomment the interface declaration lines (lines 15-19)
- Comment-out the erroneous etime call on line 52
- Uncomment the correct etime call on line 55
- Remake the code: make clean; make
Then try running it as before under
gdb and prove that the floating-point stack pointer behaves
correctly.
- badfpstack
- cd to your ~/linux_clusters/bugs/badfpstack subdirectory.
- Review the Makefile and both code files. In particular, determine
what value should be returned from the double_val function.
- When you are ready, type make to build the executable.
Notice the compiler warning message.
- First run this serial code interactively: badfpstack.
Is the value of x correct?
- To help diagnose the problem, we'll use the gdb debugger.
- Load the program: gdb badfpstack
- List the source code: l (lower case letter L)
- Set a breakpoint on the line that makes the double_val call:
b 10
- Run the program to the breakpoint: r
- Now examine the floating-point registers: info float.
Your registers should look something like below - note where the
arrow (floating-point stack pointer) is located. This is the "top" of
the floating-point stack and should be R0.
R7: Empty 0x00000000000000000000
R6: Empty 0x00000000000000000000
R5: Empty 0x00000000000000000000
R4: Empty 0x00000000000000000000
R3: Empty 0x00000000000000000000
R2: Empty 0x00000000000000000000
R1: Empty 0x00000000000000000000
=>R0: Empty 0x00000000000000000000
- Execute the next statement (the call to the double_val
routine): n
- Check the floating-point registers again. They should look something
like below. Note where the "top" of the stack is this time.
=>R7: Valid 0x3ffe8000000000000000 +0.5
R6: Empty 0x3fff8000000000000000
R5: Empty 0x00000000000000000000
R4: Empty 0x00000000000000000000
R3: Empty 0x00000000000000000000
R2: Empty 0x00000000000000000000
R1: Empty 0x00000000000000000000
R0: Empty 0x00000000000000000000
The top of the stack is R7; however, it should be R0 upon return from
a routine call. This indicates that the floating-point stack pointer
has gotten out of sync.
- Having located the problem, quit the debugger: q
- It should be obvious how to fix the code in badfpstack.c (uncomment the
external routine declaration). Try
fixing the code and re-making it. Then try running it as before under
gdb and prove that the floating-point stack pointer behaves
correctly. Also, compare the result. Is it correct?
Pthreads:
- cd to your ~/linux_clusters/pthreads subdirectory. You
will see several C files written with pthreads. There are no Fortran
files because there is no standardized Fortran API for pthreads.
- If you are already familiar with pthreads, you can review the files
to see what each is intended to demonstrate. If you are not familiar
with pthreads, this part of the exercise will probably not be of interest.
- Compiling with pthreads is easy: just add the -pthread
option to your compile command. For example:
icc -pthread hello.c -o hello
Compile any/all of the example codes.
- To run, just enter the name of the executable.
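- For reference, here is a hedged sketch of a minimal pthreads program
of the sort you'll find in this directory (hypothetical - the actual
hello.c may differ). Each created thread runs the hello() routine, and
the main thread waits for all of them to finish.
      /* pthreads_sketch.c - build: icc -pthread pthreads_sketch.c -o hello */
      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define NTHREADS 4

      void *hello(void *arg)
      {
          long tid = (long)arg;              /* thread ID passed by main() */
          printf("Hello from thread %ld\n", tid);
          return NULL;
      }

      int main(void)
      {
          pthread_t threads[NTHREADS];
          long t;

          for (t = 0; t < NTHREADS; t++)     /* create the worker threads */
              if (pthread_create(&threads[t], NULL, hello, (void *)t) != 0) {
                  fprintf(stderr, "pthread_create failed\n");
                  exit(1);
              }

          for (t = 0; t < NTHREADS; t++)     /* wait for all threads */
              pthread_join(threads[t], NULL);

          return 0;
      }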
OpenMP:
- Depending upon your preference, cd to your
~/linux_clusters/openMP/c/ or
~/linux_clusters/openMP/fortran/
subdirectory. You
will see several OpenMP codes.
- If you are already familiar with OpenMP, you can review the files
to see what each is intended to demonstrate. If you are not familiar
with OpenMP, this part of the exercise will probably not be of interest.
- Compiling with OpenMP is easy: just add the -openmp
option to your compile command. For example:
icc -openmp omp_hello.c -o hello
ifort -openmp omp_reduction.f -o reduction
Compile any/all of the example codes.
- To run, just enter the name of the executable.
- Note: by default, the number of OpenMP threads created will be equal
to the number of cpus on a node. You can override this by setting
the OMP_NUM_THREADS environment variable to a value of your choice.
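- For reference, a minimal OpenMP code looks something like the hedged
sketch below (hypothetical - the actual omp_hello.c may differ). Each
thread in the parallel region prints its own ID; the thread count comes
from OMP_NUM_THREADS or, by default, the number of cpus on the node.
      /* omp_sketch.c - build: icc -openmp omp_sketch.c -o hello_sketch */
      #include <omp.h>
      #include <stdio.h>

      int main(void)
      {
      #pragma omp parallel                   /* fork a team of threads */
          {
              int tid = omp_get_thread_num();
              printf("Hello from OpenMP thread %d of %d\n",
                     tid, omp_get_num_threads());
          }                                  /* threads join here */
          return 0;
      }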
Run a Few Benchmarks/Tests:
- Run the STREAM memory bandwidth benchmark
- cd ~/linux_clusters/benchmarks
- Depending on whether you like C or Fortran, compile the code. You'll
see some informational messages about OpenMP parallelization.
      C:       icc -O3 -tpp7 -xW -openmp stream.c -o stream

      Fortran: icc -O3 -tpp7 -xW -DUNDERSCORE -c mysecond.c
               ifort -O3 -tpp7 -xW -openmp stream.f mysecond.o -o stream
- Then run the code on a single node:
srun -n1 -ppdebug stream
- Note the timings when it completes. Compare them to the theoretical
peak memory-to-CPU bandwidth of 3.2 GB/s for the E7500 chipset. Note
that we are running this as a 2-way OpenMP threaded job, using both
CPUs on the node.
- For more information on this benchmark, see
http://www.streambench.org
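- For reference, the "Triad" kernel that STREAM times looks like the
sketch below (the real stream.c also times Copy, Scale, and Add
kernels, repeats each one several times, and verifies the results).
      /* STREAM "Triad": a[i] = b[i] + q*c[i].  The benchmark reports the
         sustained memory bandwidth achieved while running kernels like
         this, threaded with OpenMP across the cpus on the node. */
      #define N 2000000
      static double a[N], b[N], c[N];

      void triad(double q)
      {
          int i;
      #pragma omp parallel for
          for (i = 0; i < N; i++)
              a[i] = b[i] + q * c[i];
      }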
- Run an MPI message passing test that shows how bandwidth varies with
the number of nodes used and the type of MPI routine used. This isn't an
official benchmark - just a local test.
- Assuming you are still in your ~/linux_clusters/benchmarks
subdirectory, compile the code (sorry, only a C version at this time):
mpiicc -O3 -tpp7 -xW mpi_multibandwidth.c -o mpitest
- Run it using both CPUs on 2 different nodes. Also be sure to redirect
the output to a file rather than letting it go to stdout:
srun -N2 -n4 -ppdebug mpitest > mpitest.output4
- After the test runs, check the output file for the results. Notice
in particular how bandwidth improves with message size and how
much variation there can be between the 4 tasks at any given message
size.
- To find the best OVERALL average do something like this:
grep OVERALL mpitest.output4 | sort
You can then search within your output file for the case that had
the best performance.
- Now repeat the run, but this time use only 1 task on each of the 2
nodes, and send the output to a new file:
srun -N2 -ppdebug mpitest > mpitest.output2
- Find the best OVERALL average again for this run:
grep OVERALL mpitest.output2 | sort
- Which case (using both CPUs per node or just one) performs better?
Why?
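- For reference, the basic technique behind such a test can be sketched
as a simple two-task "ping-pong" (hedged - this is not the actual
mpi_multibandwidth.c): bounce a fixed-size message between two tasks
many times, then divide the bytes moved by the elapsed time.
      /* pingpong_sketch.c - build with mpiicc, run with 2 tasks:
         srun -N2 -n2 -ppdebug pingpong_sketch */
      #include <mpi.h>
      #include <stdio.h>

      #define NBYTES (1024*1024)   /* message size: 1 MB */
      #define REPS   100

      static char buf[NBYTES];

      int main(int argc, char *argv[])
      {
          int rank, i;
          double t1, t2;
          MPI_Status status;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          t1 = MPI_Wtime();
          for (i = 0; i < REPS; i++) {
              if (rank == 0) {             /* send, then wait for the echo */
                  MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                  MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                           &status);
              } else if (rank == 1) {      /* echo the message back */
                  MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                           &status);
                  MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
              }
          }
          t2 = MPI_Wtime();

          if (rank == 0)                   /* bytes moved / elapsed time */
              printf("Average bandwidth: %.1f MB/sec\n",
                     (2.0 * NBYTES * REPS) / (t2 - t1) / 1.0e6);

          MPI_Finalize();
          return 0;
      }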
Online Machine Status Information:
Finally, there are a number of places to get additional information on
the IA32 Linux cluster (and other) machines. From a user's perspective,
one of the most useful sources is the set of OCF Status web pages
(LLNL internal) linked from the main LC web page. Try the following:
- Go to the main LC web page by clicking on the link below. It will open a
new window so you can follow along with the rest of the instructions.
www.llnl.gov/computing
- Under the "High Performance Computing" section, you'll notice the little
green/red arrows for "OCF Machine Status". Click there.
- When prompted for your user name and password, use your
class## userid and the PIN + OTP token for your password.
Ask the instructor if you're not sure what this means.
- You will then be taken to the "LC OCF Machines Status" web matrix. Find one
of the Linux cluster machines and note what info is displayed.
- Now actually click on the hyperlinked name of that machine and you will
be taken to lots of additional information about it, including links to
yet more information, which you can follow if you like.
This completes the exercise.
Please complete the online evaluation form if you have not already
done so for this tutorial.