Preparation:
- Log in to the workshop machine
The instructor will demonstrate how to log in to Thunder. Once logged in,
we will be using a special partition on the actual Thunder machine. This
partition is reserved exclusively for this workshop and will be removed
after the workshop is over.
- Review the login banner. Specifically, notice the various sections:
- Welcome section - where to get more information and help
- Announcements - those for all LC machines and Thunder-specific ones
- Any unread news items
- Copy the example files
From your home directory, first create a thunder subdirectory, then copy
the example files into it, and then cd into the thunder subdirectory:
mkdir thunder
cp -R /usr/global/docs/training/blaise/thunder/* ~/thunder
cd thunder
- Verify your exercise files
Issue the ls -l command. Your output
should show something like the listing below:
drwx------ 2 class10 class10 4096 Apr 16 10:51 benchmarks
drwx------ 2 class10 class10 4096 Apr 16 10:51 bugs
drwx------ 4 class10 class10 4096 Apr 16 10:51 mpi
drwx------ 4 class10 class10 4096 Apr 16 10:51 openMP
drwx------ 2 class10 class10 4096 Apr 16 10:51 pthreads
drwx------ 4 class10 class10 4096 Apr 16 10:51 serial
Job Usage and Configuration Information:
- Before we attempt to actually compile and run anything, let's get
familiar with some basic usage and configuration information for Thunder.
For the most part, these commands can be used on any parallel LC
system with a high-speed interconnect.
- Try each of the following commands, comparing and contrasting them to
each other. Most have a man page if you need more information.
Command                Description
-------                -----------
ju                     Basic job usage information for each partition. This
                       is an LC-developed command.
spjstat                More job usage information, with more detail per job;
                       shows running jobs only. This is an LC-developed
                       command ported from the IBM SPs.
spj                    Like the previous command, but also shows non-running
                       (queued) jobs. This is an LC-developed command ported
                       from the IBM SPs.
squeue                 Another job usage display. This is a SLURM command.
sinfo                  Basic partition configuration information. This is a
                       SLURM command.
pstat -m thunder       An LCRM command that shows all LCRM jobs on Thunder.
pstat -f jobid         Displays the full details for a specified job. For
                       the jobid parameter, select any jobid from the output
                       of the previous command; jobids are in the first
                       column. It may be more interesting to pick a job with
                       a RUN status. This is an LCRM command.
news job.lim.thunder   Shows the job limits on Thunder.
Building and Running Serial Applications:
- Go to either the C or Fortran versions of the serial applications:
cd serial/c
or
cd serial/fortran
Review the Makefile, noting the compiler being used and its options.
See the compiler man page for an explanation of the options.
When you are ready, issue the make command to build the examples.
- Run any of the example codes by typing the executable name at the
command prompt. For example: ser_array
- Time the untuned code - make a note of its timing:
time untuned
- Edit the Makefile. Comment out the compiler option assignment
line that begins with 'COPT=' or 'FOPT=', depending on whether you are
using C or Fortran. Do this by inserting a '#' in front of the line.
Then, uncomment the line beneath it that begins with '#COPT=' or '#FOPT='.
- Build the untuned example with the new compiler options:
make clean
make untuned
- Now time the new untuned executable:
time untuned
How does it perform compared to the previous run?
What changes in compiler options account for the difference?
Note: if you try both C and Fortran, the differences in results are due
to loop index variables - C indices start at 0 and Fortran indices at 1.
A Bit More About Optimization:
- As you undoubtedly noticed from the unoptimized vs. optimized versions
of the untuned program, the Intel compilers do a good job of
optimizing code. Additionally (and optionally), they can provide highly
detailed reports that specify which optimizations were/were not performed
and why. This information can be helpful for further tuning efforts on
critical portions of code.
- To see an example of this, compile the untuned code with a
new option that specifies an optimization (non-parallel) report.
Depending on whether you are using C or Fortran:
icc -O2 -tpp2 -opt_report_file optreport untuned.c
or
ifort -O2 -tpp2 -opt_report_file optreport untuned.f
- After the compilation, review the optreport file. Granted, much of
the information assumes an understanding of optimization techniques, but
for those who wish to dig deeper, know that you can. The icc and ifort
documentation provides information on other optimization reporting
options. A short sketch of the kinds of loops such a report comments on
appears below.
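For reference, the sketch below (a made-up file, not one of the exercise
codes) illustrates the two situations an optimization report most commonly
comments on: a loop whose iterations are independent, which the compiler can
typically vectorize or unroll, and a loop with a carried dependence, which
it typically cannot.

/* opt_demo.c - a hypothetical example (not one of the exercise files)
 * showing the kinds of loops an optimization report comments on.      */
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    int i;

    /* Independent iterations: the compiler can typically vectorize
     * and/or unroll this loop, and the report usually says so.        */
    for (i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + c[i];

    /* Loop-carried dependence: a[i] needs a[i-1] from the previous
     * iteration, so vectorization is normally not possible, and the
     * report usually explains why.                                    */
    for (i = 1; i < N; i++)
        a[i] = a[i-1] + b[i];

    printf("a[N-1] = %f\n", a[N-1]);
    return 0;
}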
MPI Runs:
- Interactive Runs:
- Go to either the C or Fortran versions of the MPI applications:
cd ~/thunder/mpi/c
or
cd ~/thunder/mpi/fortran
As before, review the Makefile, noting the compiler used and its options.
When you are ready, type make to build the example codes.
NOTE: You may see the following warning as each file is compiled.
Please ignore it.
/usr/local/intel/compiler81/lib/libimf.so.6: warning: log2l is not implemented and will always fail
- Run the codes directly using srun in the workshop partition. For example:
srun -n4 -ppclass code-of-choice
- Try running the codes with different srun options, for example:
srun -n4 -ppclass -vvv -m cyclic code-of-choice
srun -n4 -ppclass -l code-of-choice
- Batch Runs:
- From the same directory where you ran your MPI codes interactively, open
the psub_script file in a UNIX editor, such as vi
(aliased to vim) or emacs.
- Review this very simple LCRM script. LCRM details are covered in the
LCRM tutorial. Some important notes:
- The executable that will be run is mpi_latency. Make sure that
you have created this - if in doubt, just run make again.
- Make sure that you edit the path specification line in the
script to reflect the directory where your mpi_latency executable
is located - it will differ between C and Fortran.
- Submit the script to the batch partition, which is the default, so you
should not specify -ppclass as you did in the interactive runs. For
example:
psub psub_script
- Monitor the job's status by using the commands covered
previously, such as: ju, spjstat, spj, pstat, squeue.
The sleep command in the script should allow you enough time
to do so.
- After you are convinced that your job has completed, review the batch
log file. It should be named something like mpi_latency.oNNNNN.
At the very end of the file are the results from this latency test, which
describe the message-passing latency between MPI tasks on different nodes
across the Quadrics switch. (A sketch of how a simple latency test works
appears below.)
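In case you are curious how such a test typically works, the sketch below
shows a minimal ping-pong latency measurement. It is not the actual
mpi_latency source (which may be organized differently), and the file name
and srun line in the comments are only examples. Two tasks bounce a one-byte
message back and forth many times, and half of the average round-trip time
approximates the one-way latency.

/* latency_sketch.c - a minimal ping-pong latency sketch (hypothetical;
 * the workshop's mpi_latency code may be structured differently).
 * Build with: mpiicc latency_sketch.c -o latency_sketch
 * Run with:   srun -N2 -n2 -ppclass latency_sketch                     */
#include <stdio.h>
#include <mpi.h>

#define REPS 1000

int main(int argc, char *argv[])
{
    int rank, i;
    char msg = 'x';
    double t1, t2;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t1 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            /* Task 0 sends a 1-byte message and waits for the echo.   */
            MPI_Send(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&msg, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            /* Task 1 echoes each message back to task 0.              */
            MPI_Recv(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&msg, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t2 = MPI_Wtime();

    if (rank == 0)
        /* Half of the average round trip approximates one-way latency. */
        printf("average one-way latency: %g microseconds\n",
               (t2 - t1) / (2.0 * REPS) * 1.0e6);

    MPI_Finalize();
    return 0;
}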
Resolving Unresolved Externals:
- Deliberately create a situation with missing externals by re-linking
any of the above MPI codes using icc or ifort instead of
mpiicc or mpiifort. For instance:
icc -o mpi_ping mpi_ping.o
ifort -o mpi_ping mpi_ping.o
- The linker will indicate a number of unresolved externals that prevent you
from linking successfully. Select one of these symbol names and use it
as an argument to findentry. For example, if you are using the C version,
try:
findentry MPI_Recv
- Notice the output of the findentry utility, such as the list of library
directories it searches, and any possible matches to what you are
looking for.
- With a real application, you could now attempt to link to a relevant
library path and library to resolve the undefined reference. No need
to do so here though...
Pthreads:
- cd to your ~/thunder/pthreads subdirectory. You
will see several C files written with pthreads. There are no Fortran
files because a standardized Fortran API for pthreads was never adopted.
- If you are already familiar with pthreads, you can review the files
to see what is intended. If you are not familiar with pthreads, this
part of the exercise will probably not be of interest.
- Compiling with pthreads is easy: just add the -pthread
option to your compile command. For example:
icc -pthread hello.c -o hello
Compile any/all of the example codes (a minimal Pthreads example is
sketched after this list).
- To run, just enter the name of the executable.
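For those who want to see what such a file looks like, here is a minimal
Pthreads sketch. It is not one of the workshop files (the file and function
names are made up), just an illustration of the kind of code the -pthread
option builds.

/* pthread_hello.c - a minimal Pthreads sketch (hypothetical; the
 * workshop's hello.c may differ).
 * Build with: icc -pthread pthread_hello.c -o pthread_hello        */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Each thread runs this function; the argument carries its id. */
void *hello(void *arg)
{
    long id = (long) arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    long t;

    for (t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, hello, (void *) t);

    /* Wait for all threads to finish before exiting. */
    for (t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}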
OpenMP:
- Depending upon your preference, cd to your
~/thunder/openMP/c/ or
~/thunder/openMP/fortran/
subdirectory. You
will see several OpenMP codes.
- If you are already familiar with OpenMP, you can review the files
to see what is intended. If you are not familiar with OpenMP, this
part of the exercise will probably not be of interest.
- Compiling with OpenMP is easy: just add the -openmp
option to your compile command. For example:
icc -openmp omp_hello.c -o hello
ifort -openmp omp_reduction.f -o reduction
Compile any/all of the example codes (a minimal OpenMP example is
sketched after this list).
- To run, just enter the name of the executable.
- Note: by default, the number of OpenMP threads created will be equal
to the number of cpus on a node. You can override this by setting
the OMP_NUM_THREADS environment variable to a value of your choice.
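As a reference point, here is a minimal OpenMP sketch. It is not one of the
workshop files (the file name is made up). Each thread in the parallel
region reports its id, so running it with different OMP_NUM_THREADS settings
makes the note above easy to verify; set the variable with setenv (csh/tcsh)
or export (sh/bash) before running.

/* omp_hello_sketch.c - a minimal OpenMP sketch (hypothetical; the
 * workshop's omp_hello code may differ).
 * Build with: icc -openmp omp_hello_sketch.c -o omp_hello_sketch
 * The thread count follows OMP_NUM_THREADS if it is set; otherwise
 * the runtime default (typically one thread per CPU) is used.      */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Each thread in the parallel region prints its own id. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}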
Run a Few Benchmarks/Tests:
- Run the STREAM memory bandwidth benchmark
- cd ~/thunder/benchmarks
- Depending on whether you like C or Fortran, compile the code. You'll
see some informational messages about OpenMP parallelization.
C:
icc -O3 -tpp2 -openmp stream.c -o stream

Fortran:
icc -O3 -tpp2 -DUNDERSCORE -c mysecond.c
ifort -O3 -tpp2 -openmp stream.f mysecond.o -o stream
- Then run the code on a single node:
srun -n1 -ppclass stream
- Note the timings when it completes. Compare to the theoretical peak
memory-to-CPU bandwidth for the E8870 chipset of 6.4 GB/s. Note that
we are running this as a 4-way OpenMP threaded job, using all 4 CPUs
on the node. (A sketch of the kind of kernel STREAM times appears after
this list.)
- For more information on this benchmark, see
http://www.streambench.org
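If you are wondering what STREAM actually times: it runs four simple kernels
(Copy, Scale, Add, Triad) over large arrays and converts the elapsed time
into a sustained memory bandwidth. The sketch below shows a Triad-style
kernel and the bandwidth arithmetic. It is a simplified illustration with a
made-up file name, not the actual stream.c.

/* triad_sketch.c - a Triad-style kernel illustrating what STREAM
 * measures (hypothetical; not the actual stream.c).
 * Build with: icc -openmp triad_sketch.c -o triad_sketch           */
#include <omp.h>
#include <stdio.h>

#define N 2000000

static double a[N], b[N], c[N];

int main(void)
{
    double t, scalar = 3.0;
    int i;

    /* Initialize the source arrays. */
    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t = omp_get_wtime();
    /* Triad: one store and two loads per iteration, split across
     * the OpenMP threads.                                           */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    t = omp_get_wtime() - t;

    /* Bytes moved = 3 arrays * N elements * 8 bytes per double. */
    printf("triad bandwidth: %.1f MB/s  (check: a[0] = %f)\n",
           3.0 * N * sizeof(double) / t / 1.0e6, a[0]);
    return 0;
}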
- Run an MPI message passing test, which shows the bandwidth depending upon
number of nodes used and type of MPI routine used. This isn't an official
benchmark - just a local test.
- Assuming you are still in your ~/thunder/benchmarks
subdirectory, compile the code (sorry, only a C version at this time):
mpiicc -O3 -tpp2 mpi_multibandwidth.c -o mpitest
- Run it using all 4 CPUs on 2 different nodes. Also
be sure to redirect the output to a file instead of letting it go to stdout:
srun -N2 -n8 -ppclass mpitest > mpitest.output8
- After the test runs, check the output file for the results. Notice
in particular how bandwidth improves with message size and how
much variation there can be between the 8 tasks at any given message
size.
- To find the best OVERALL average, do something like this:
grep OVERALL mpitest.output8 | sort
You can then search within your output file for the case that had
the best performance.
- Now repeat the run, but instead, only use 1 task on each of 2 nodes
and send the output to a new file:
srun -N2 -ppclass mpitest > mpitest.output2
- Find the best OVERALL average again for this run:
grep OVERALL mpitest.output2 | sort
- Notice the large difference in performance? Why? If you're curious,
ask the instructor.
A Few Bugs:
There are many things that can go wrong when porting and running codes on
any architecture, Thunder included. A few example codes are provided to
demonstrate several problems related to commonly encountered issues.
cd ~/thunder/bugs
- bug1
- Look at the bug1 files. You'll notice bug1.c
and bug1.32bit.output. Review bug1.32bit.output to
see what this example does on an IA32 machine.
- Compile and run the program, and notice the incorrect output:
icc bug1.c -o bug1
bug1
- If you have time and interest, see if you can find the source of the
problem and fix it. Otherwise, review, compile, and run the solution file,
bug1.fix.c, which documents the problem. (A sketch of one typical bug of
this kind appears below.)
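As a point of reference, a very common class of bug in IA32-to-64-bit ports
is assuming that pointers and ints have the same size. The sketch below
shows that pattern; it is hypothetical (made-up file name), and the actual
bug1.c may illustrate a different problem.

/* ptr_truncate.c - a sketch of a classic 32-/64-bit portability bug
 * (hypothetical; the actual bug1.c may demonstrate something else).
 * On IA32 both int and pointers are 4 bytes, so the code "works";
 * on a 64-bit system pointers are 8 bytes and the cast truncates.   */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double *p, *q;
    int addr;

    p = malloc(10 * sizeof(double));

    /* BUG: storing a 64-bit pointer in a 32-bit int loses the
     * upper half of the address.                                    */
    addr = (int) p;
    q = (double *) addr;

    /* On a 64-bit machine the two values can differ, and
     * dereferencing q may give garbage or a segmentation fault.     */
    printf("original pointer:     %p\n", (void *) p);
    printf("after int round trip: %p\n", (void *) q);

    free(p);
    return 0;
}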
- bug2
- Look at the bug2 files. You'll notice bug2.c
and bug2.32bit.output. Review bug2.32bit.output to
see what this example does on an IA32 machine.
- Compile and run the program.
icc bug2.c -o bug2
bug2
- Notice what happens. Why? This is a very simple program. See if you
can figure out why and how to fix it.
- As with the bug1 problem, there is a "fix" file which explains
the problem and provides a workaround solution.
- bug3
- Review bug3.c or bug3.f, depending
upon whether you like C or Fortran. These are very simple OpenMP
programs.
- Compile and run the program.
C:
icc -openmp bug3.c -o bug3
bug3

Fortran:
ifort -openmp bug3.f -o bug3
bug3
- It should seg fault. The lecture notes discuss the reason why under
the memory constraints section, but it could take the average
new programmer quite a while to figure out why and what to do about it.
(A sketch of the kind of code that typically triggers this appears after
this list.)
- For the solution, read the bug3.fix file and
"source" it:
source bug3.fix
- Now try executing your bug3 program again. It
should work fine.
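For the curious: seg faults like this in OpenMP codes are very often caused
by large automatic (stack) arrays inside a parallel region exceeding the
per-thread stack limit. The usual remedy is to raise the shell's stack limit
and the OpenMP thread stack size (for Intel's runtime, the KMP_STACKSIZE
environment variable), which is presumably the sort of thing sourcing
bug3.fix does. The sketch below shows the pattern; it is hypothetical, and
the actual bug3 code may differ.

/* stack_sketch.c - the kind of code that commonly seg faults under
 * default per-thread stack limits (hypothetical; the actual bug3
 * code may differ). Each thread gets its own copy of the large
 * automatic array on its stack, which is often too small for it.   */
#include <omp.h>
#include <stdio.h>

#define N (4 * 1024 * 1024)    /* 4M doubles = 32 MB per thread */

int main(void)
{
    #pragma omp parallel
    {
        double big[N];         /* allocated on each thread's stack */
        int i;

        for (i = 0; i < N; i++)
            big[i] = (double) i;

        printf("thread %d: big[N-1] = %f\n",
               omp_get_thread_num(), big[N-1]);
    }
    return 0;
}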
Online Thunder Status Information:
Finally, there are a number of places to get additional information on
Thunder. Probably among the most useful sources, from a user's perspective,
are the OCF Status web pages (LLNL internal) linked to from the main LC
web page. Try the following:
- Go to the main LC web page by clicking on the link below. It will open a
new window so you can follow along with the rest of the instructions.
www.llnl.gov/computing
- Under the "High Performance Computing" section, you'll notice the little
green/red arrows for "OCF Machine Status". Click there.
- When prompted for your user name and password, use your
class## userid and the PIN + OTP token for your password.
Ask the instructor if you're not sure what this means.
- You will then be taken to the "LC OCF Machines Status" web matrix. Find the
line for Thunder and note what info is displayed.
- Now actually click on the hyperlinked word "Thunder" and you will be taken
to lots of additional information about Thunder, including links to yet
more information, which you can follow if you like.
This completes the exercise.
Please complete the online evaluation form if you haven't already done so
for this tutorial.