uP is an actual LC production system, which normally has a 99-node pbatch
pool and a 2-node pdebug pool. For this workshop, a special pool has been
configured to prevent competition with real users.
- Log in to uP
Workshops differ in how this is done. The instructor will go over this
beforehand.
- Review the login banner information
Notice any announcements and news items. Try reading a news announcement,
such as news large_pages or news dat_up.
- Check uP's configuration and job information
Try any/all of the following commands. See the respective man pages if you
have questions.
spjstat
squeue
sinfo
ju
pstat -m up
news job.lim.up
- Copy the lab exercise files
- In your home directory, create a subdirectory for the lab exercise codes
and then cd to it.
mkdir purple
cd purple
- Then copy the exercise files to your purple subdirectory:
cp /usr/global/docs/training/blaise/purple/* ~/purple
- List the contents of your purple subdirectory
You should have the following files:
- Determine which pool you will be using for the workshop
Use either the spjstat or ju command as
done previously to display the available pools. Which pool looks like it
should be used for the class? Remember the name of this pool for later.
- Compile the hello program
Depending upon your language preference, use one of the IBM parallel
compilers to compile the hello program. Notice that we're using a very simple
compilation and explicitly using large pages (-blpdata), 64-bit (-q64) and
level 2 optimization (-O2).
C:       mpxlc -blpdata -q64 -O2 -o hello hello.c
Fortran: mpxlf -blpdata -q64 -O2 -o hello hello.f
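For reference, a hello program of this sort is typically just a few MPI
calls. The sketch below is a plausible reconstruction, not the actual
hello.c from the exercise files; it simply mirrors the output format shown
in the next step.

/* Minimal sketch of a hello-style MPI program (not the actual
   exercise code). Each task reports its rank and host name. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ntasks, rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    if (rank == 0)
        printf("Total number of tasks = %d\n", ntasks);
    printf("Hello! From task %d on host %s\n", rank, host);

    MPI_Finalize();
    return 0;
}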
- Set up your POE environment
In this step you'll set a few POE environment variables. Specifically,
those which answer the three questions:
- How many nodes do I need?
- How many tasks do I need?
- Which pool should I use?
Set the following environment variables as shown. We'll accept the default
POE settings for everything else.
Environment Variable Setting       Description
setenv MP_NODES 2                  Request 2 nodes
setenv MP_PROCS 8                  Request 8 MPI tasks (processes)
setenv MP_RMPOOL pclass            The workshop pool, which you determined
                                   previously with the spjstat or ju command
- Run your hello executable
- This is the simple part. Just issue the command:
hello
- Provided that everything is working and set up correctly, you should
receive output that looks something like the sample below (your node names
may vary, of course).
0:Total number of tasks = 8
0:Hello! From task 0 on host up037
4:Hello! From task 4 on host up040
1:Hello! From task 1 on host up037
5:Hello! From task 5 on host up040
2:Hello! From task 2 on host up037
6:Hello! From task 6 on host up040
3:Hello! From task 3 on host up037
7:Hello! From task 7 on host up040
- Maximize your use of all 8 CPUs on a node
The previous step only used 4 CPUs on each of 2 nodes. To make better use of
the SMP nodes, try the following:
- Run 8 hello tasks on each of 2 nodes. Three different
ways to do this are shown below, all of which use
command line flags. The corresponding environment variables could be
used instead. See the POE man page for details.
Method 1: Specify POE flags for number of nodes and number of tasks:
hello -nodes 2 -procs 16
Method 2: Specify POE flags for number of tasks per node and
number of tasks:
hello -tasks_per_node 8 -procs 16
Method 3: Specify POE flags for number of nodes and number of
tasks per node:
unsetenv MP_PROCS
hello -nodes 2 -tasks_per_node 8
- Try the bandwidth exercise code
- Depending upon your language preference, compile the bandwidth
source file as shown:
C:       mpxlc -blpdata -q64 -O2 -o bandwidth bandwidth.c
Fortran: mpxlf -blpdata -q64 -O2 -o bandwidth bandwidth.f
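For context, a bandwidth test of this kind is usually a ping-pong loop:
task 0 sends a buffer to task 1, task 1 echoes it back, and the effective
bandwidth is total bytes moved divided by elapsed time. The sketch below
illustrates the idea; it is not the actual bandwidth.c, though the message
sizes and round-trip count match the sample output shown later.

/* Sketch of a two-task ping-pong bandwidth test (not the actual
   exercise code). Message sizes step from 100000 to 2000000 bytes. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define START    100000
#define FINISH   2000000
#define INCR     100000
#define RNDTRIPS 1000

int main(int argc, char *argv[])
{
    int rank, size, i;
    char *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = (char *) malloc(FINISH);

    for (size = START; size <= FINISH; size += INCR) {
        double t1, t2, bw;

        MPI_Barrier(MPI_COMM_WORLD);
        t1 = MPI_Wtime();
        for (i = 0; i < RNDTRIPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t2 = MPI_Wtime();

        /* 2 * size bytes cross the network per round trip */
        bw = (2.0 * size * RNDTRIPS) / (t2 - t1);
        if (rank == 0)
            printf("%12d   %e\n", size, bw);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}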
- This example only uses two tasks, but we want them to be on different
nodes to test internode bandwidth. So:
setenv MP_PROCS 2
setenv MP_NODES 2
- Run the executable:
bandwidth
Note: It is very possible that when
you try this step, you will get an error message that looks something like
the one below. This is because there are others in the workshop using
nodes in the small workshop pool at the same time as you.
If you get this error message, just try running again in a few moments
when the nodes are free.
SLURM ERROR: slurm_allocate_resources: Requested nodes are busy
ERROR: 0031-362 Unexpected return code -5 from ll_request
As the program runs, it will display the effective communications bandwidth
between two nodes over the HPS switch fabric. The output should look
something like that below:
Sample output from bandwidth example (C version)
0:
0:****** MPI/POE Bandwidth Test ******
0:Message start size= 100000 bytes
0:Message finish size= 2000000 bytes
0:Incremented by 100000 bytes per iteration
0:Roundtrips per iteration= 1000
0:Task 0 running on: up037
0:Task 1 running on: up040
0:
0:Message Size Bandwidth (bytes/sec)
0: 100000 1.284070e+09
0: 200000 1.502403e+09
0: 300000 1.573937e+09
0: 400000 1.628769e+09
0: 500000 1.665738e+09
0: 600000 1.673687e+09
0: 700000 1.687317e+09
0: 800000 1.701772e+09
0: 900000 1.717159e+09
0: 1000000 1.726645e+09
0: 1100000 1.734018e+09
0: 1200000 1.734382e+09
0: 1300000 1.744496e+09
0: 1400000 1.744378e+09
0: 1500000 1.749541e+09
0: 1600000 1.756001e+09
0: 1700000 1.757546e+09
0: 1800000 1.758167e+09
0: 1900000 1.758724e+09
0: 2000000 1.764081e+09
- Try the bandwidth code with RDMA
- Now, try running the executable again, but this time explicitly
specify use of RDMA communications.
setenv MP_USE_BULK_XFER yes
bandwidth
- Notice the output. You should see a significant increase in bandwidth.
Sample output from bandwidth example with RDMA (C version)
0:
0:****** MPI/POE Bandwidth Test ******
0:Message start size= 100000 bytes
0:Message finish size= 2000000 bytes
0:Incremented by 100000 bytes per iteration
0:Roundtrips per iteration= 1000
0:Task 0 running on: up037
0:Task 1 running on: up040
0:
0:Message Size Bandwidth (bytes/sec)
0: 100000 1.216868e+09
0: 200000 1.640740e+09
0: 300000 2.356462e+09
0: 400000 2.525593e+09
0: 500000 2.638849e+09
0: 600000 2.708294e+09
0: 700000 2.765728e+09
0: 800000 2.809319e+09
0: 900000 2.846926e+09
0: 1000000 2.871639e+09
0: 1100000 2.893212e+09
0: 1200000 2.913451e+09
0: 1300000 2.923798e+09
0: 1400000 2.937274e+09
0: 1500000 2.948080e+09
0: 1600000 2.955056e+09
0: 1700000 2.960913e+09
0: 1800000 2.963961e+09
0: 1900000 2.971846e+09
0: 2000000 2.976140e+09
- Determine per-task communication bandwidth behavior
In this exercise, pairs of tasks, located on two different nodes,
will communicate with each other.
- Compile the code:
C:       mpxlc -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.c
Fortran: mpxlf -blpdata -q64 -O2 -o smp_bandwidth smp_bandwidth.f
- Then use the smp_bandwidth code to determine per-task bandwidth
characteristics on an SMP node:
smp_bandwidth -nodes 2 -procs 2
smp_bandwidth -nodes 2 -procs 4
smp_bandwidth -nodes 2 -procs 8
smp_bandwidth -nodes 2 -procs 16
What happens to the average per-task bandwidth as the number of tasks
increases? How about the aggregate bandwidth per node?
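As a hint for interpreting the results, recall from the hello output above
that POE places tasks on nodes in blocks (tasks 0-3 on one node, tasks 4-7
on the other). The sketch below shows one plausible pairing scheme, where
rank i exchanges messages with rank i + n/2 so that every pair spans the
two nodes. This is only an illustration; the actual smp_bandwidth code may
pair tasks differently.

/* Illustration of a cross-node pairing scheme (not the actual
   smp_bandwidth code). With block task placement, ranks 0..n/2-1
   sit on the first node and ranks n/2..n-1 on the second. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, ntasks, partner;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Pair each task on the first node with its opposite number
       on the second node */
    partner = (rank < ntasks / 2) ? rank + ntasks / 2
                                  : rank - ntasks / 2;
    printf("Task %d exchanges messages with task %d\n", rank, partner);

    /* ...each pair would then run a ping-pong timing loop, as in
       the bandwidth sketch earlier, and report its own bandwidth... */

    MPI_Finalize();
    return 0;
}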
- Optimize intranode communication bandwidth
When all of the task communications occur "on-node", it is
possible to optimize the effective per-task bandwidth by utilizing
shared memory instead of the network.
- First use shared memory and note the per-task bandwidth:
setenv MP_SHARED_MEMORY yes
smp_bandwidth -nodes 1 -procs 8
- Now try it without shared memory (using the switch network):
setenv MP_SHARED_MEMORY no
smp_bandwidth -nodes 1 -procs 8
What differences do you notice?
- Generate diagnostic/statistical information for your run.
- POE provides several environment variables / command flags that collect
diagnostic and statistical information about a job's run. Three of the more
useful ones are shown below. Try running a job after setting these as shown.
Direct stdout to a file so that you can easily read the output after the job
runs.
setenv MP_SAVEHOSTFILE myhosts
setenv MP_PRINTENV yes
setenv MP_STATISTICS print
bandwidth -nodes 2 -procs 2 > myoutput
- After the job completes, examine both the myhosts file and
myoutput file. The MP_PRINTENV environment variable can be
particularly useful for troubleshooting since it tells you all of the POE
environment variable settings. See the POE man page if you have any
questions.
- Be sure to unset these variables when you're done to prevent cluttering
your screen with their output for the remaining exercises.
unsetenv MP_SAVEHOSTFILE
unsetenv MP_PRINTENV
unsetenv MP_STATISTICS
- Compile and run a job using parallel I/O. Then copy its output to
HPSS storage
- First, you will need to edit par_io.c and change the line
that reads:
static char filename[] = "/p/gup1/class01/par_io.output";
Instead of using class01, use your workshop userid, which appears on your
OTP token.
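For orientation, here is a hedged sketch of how a program like par_io.c
might produce its 32MB file with MPI-IO; the actual exercise code is not
shown here and may take a different approach. Eight tasks each write a
4,000,000-byte block at their own offset into the same GPFS file.

/* Sketch of a parallel I/O program in the spirit of par_io.c (not
   the actual exercise code). Each of 8 tasks writes 4,000,000 bytes
   at its own offset, yielding a 32,000,000 byte file. */
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define BLOCKSIZE 4000000   /* bytes written per task */

/* As in the real code, replace class01 with your workshop userid */
static char filename[] = "/p/gup1/class01/par_io.output";

int main(int argc, char *argv[])
{
    int rank;
    char *buf;
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = (char *) malloc(BLOCKSIZE);
    memset(buf, 'x', BLOCKSIZE);

    /* All tasks open the same file; each writes at its own offset */
    MPI_File_open(MPI_COMM_WORLD, filename,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    offset = (MPI_Offset) rank * BLOCKSIZE;
    MPI_File_write_at(fh, offset, buf, BLOCKSIZE, MPI_CHAR,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}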
- Compile the file and then run it:
mpxlc -blpdata -q64 -O2 -o par_io par_io.c
par_io -nodes 1 -procs 8
- After it finishes, check your GPFS parallel file system directory for
the output file whose name you edited above. You should have a 32MB file.
Transfer your output file to storage and then delete your GPFS file.
A sample session to accomplish this is shown below; the commands you
type follow the shell and ftp prompts.
up041{class01}61: cd /p/gup1/class01
/p/gup1/class01
up041{class01}62: ls -l
total 62720
-rw------- 1 class01 class01 32000000 Jun 30 15:09 par_io.output
up041{class01}63: ftp storage
Connected to toofast43.llnl.gov.
220-NOTICE TO USERS
220-
[ blah blah blah removed ]
220-
220 toofast43 FTP server (HPSS 6.2 PFTPD V1.1.37 Thu Jun 15 10:09:51 PDT 2006) ready.
Name (toofast43.llnl.gov:class01): (just hit return)
230 User class01 logged in as class01
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put par_io.output
200 Command Complete (32000000, "par_io.output", 0, 1, 8388608, 0).
200 Command Complete.
150 Transfer starting.
226 Transfer Complete.(moved = 32000000).
32000000 bytes sent in 0.836 seconds (38.298 mbytes/s)
200 Command Complete.
ftp> dir
200 PORT command successful.
150 Opening ASCII mode data connection for directory list.
-rw-r----- 1 class01 class01 32000000 Jun 30 15:47 par_io.output
226 Transfer complete.
69 bytes received in 0.019 seconds (3.585 kbytes/s)
ftp> quit
221 Goodbye.
up041{class01}64: rm par_io.output
up041{class01}65:
- Finally, cd back to your purple subdirectory to continue with the exercises:
cd ~/purple
- Try using POE's Multiple Program Multiple Data (MPMD) mode
POE allows you to load and run different executables on different nodes.
This is controlled by the MP_PGMMODEL environment variable.
- First, set some environment variables:
Environment Variable Setting       Description
setenv MP_PGMMODEL mpmd            Specify MPMD mode
setenv MP_PROCS 4                  Use 4 tasks again
setenv MP_NODES 1                  Use one node for all four tasks
setenv MP_STDOUTMODE ordered       Sort the output by task
- Then, simply issue the poe command.
- After a moment, you will be prompted to enter your executables one
at a time. Notice that the machine name where the executable will
run is displayed as part of the prompt. In any order you choose,
enter these four program names, one per prompt. For example:
up041% poe
0031-503 Enter program name and flags for each node
0:up040> prog1
1:up040> prog2
2:up040> prog3
3:up040> prog4
- After the last program name is entered, POE will run all four executables.
Observe their different outputs. Note: these four programs are just
simple shell scripts used to demonstrate how to use the MPMD programming
model.
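If you are curious, an MPMD "program" in this demonstration can be as
trivial as a shell script. A hypothetical example is shown below; the
provided prog1 through prog4 scripts may do something different.

#!/bin/csh
# Hypothetical stand-in for one of the MPMD demo programs:
# report which script is speaking and from which host
echo "This is prog1 speaking from host `hostname`"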
- Create an LCRM job control script
- Using your favorite text editor, create a file (name it whatever you like)
that will be used to run a batch job. Your job control script should
specify the following:
- executes on the host "up"
- runs within the workshop pool
- uses two nodes
- uses two tasks
- has a time limit of 5 minutes
- combines stdout and stderr
- gives the job a name chosen by you
- lists the hosts used
- reports POE communication statistics
- lists all of your POE environment variables
- runs the executable bandwidth (which you created earlier)
- See the LCRM tutorial, in particular its Building a Job Control Script
section, for help with most of the above. If you need more assistance, see
the jobscript.example file provided with your other exercise files, or the
rough sketch below.
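As a rough sketch only, a csh job script along the following lines would
cover the requirements above. The #PSUB directive spellings here are
assumptions from memory of LCRM and must be verified against the LCRM
tutorial or jobscript.example; the POE environment variables are the same
ones used earlier in this exercise.

#!/bin/csh
#PSUB -c up             # run on host "up" (verify flag spelling)
#PSUB -tM 5m            # 5 minute time limit
#PSUB -eo               # combine stdout and stderr
#PSUB -r myjob          # job name of your choosing

# POE settings covered earlier in this exercise
setenv MP_RMPOOL pclass          # the workshop pool
setenv MP_NODES 2                # use two nodes
setenv MP_PROCS 2                # use two tasks
setenv MP_SAVEHOSTFILE myhosts   # list the hosts used
setenv MP_STATISTICS print       # report POE communication statistics
setenv MP_PRINTENV yes           # list all POE environment variables

cd ~/purple
bandwidth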
- Run your batch job
- Use the LCRM psub command to submit your job. For example:
psub myjobscript
Note the job id that is returned.
- Check the status of your job as it queues, waits and eventually runs. Use
the pstat command (several times) for this. See the pstat
man page if you have questions about its output.
- Try the pstat -f jobid command for more info on your
job.
- Try the pstat -m up command to view other jobs on the
system. See the job detail on any job by using the
pstat -f jobid command.
- After your job completes, check its output file. Does it show communication
statistics and POE environment variables? Note the bandwidth report: how
does it compare to the numbers you obtained interactively earlier?
That is, does it match the RDMA performance or not?
- Debug a job command script
- Submit the exercise code batchbugs.
- Monitor its progress (or lack thereof) with the pstat command.
Also use the spjstat or ju commands to
verify that adequate nodes are available.
- Figure out why it won't run and fix it. There are three problems with
the script. The output file (when you get that far) should help diagnose
two of them.
- Compare your solution to
batchbugs.fix.
- Debug a batch job
The primary purpose of this trivial exercise is to demonstrate that you can
log in to a batch node while your job is running there and then start a
debugging session. This is just one way to debug in batch.
- Submit the job batchhang. You may also want to review the
script to make sure you understand what it is doing.
- Use the pstat command to monitor your job. When it starts
to RUN, proceed to the next step.
- Find the node where your job is running. The squeue
command can be used for this. For example:
up041% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8565 pbatch cdp_if-l eiur6 R 4:31:46 8 up[061-068]
9616 pbatch checkHag 38kdcz R 4:08:25 1 up047
9035 pbatch cdp_if-l 88dj6 R 3:59:44 8 up[042,069-075]
9330 pbatch Rep200_N 3kdh R 3:56:10 8 up[033-036,105-108]
10395 pbatch batchhan thedude R 26:23 1 up096
- ssh to the node where your job is running. After you log in, use the
following command to verify that your job is running there:
ps -A | grep hangme
- cd into your purple subdirectory. This is needed so that
TotalView can find the source code for the hung program (hangme.c).
- Start TotalView with the totalview command. Two new
TotalView windows will then appear.
- In the larger "New Program" window, select "Attach to an existing process"
and then click on the "hangme" process. Then click OK.
- After a few moments, a large window will open showing you that the
"Thread is running". Click the Halt button.
You can then see the hung program source code.
- In the real world, you could now begin debugging your hung program.
However, this isn't a debugger workshop, so just click the
Kill button to terminate the hung process.
- Quit TotalView: File --> Exit
- Finally, familiarize yourself with the LC website
- Go to
http://www.llnl.gov/computing/.
- Notice the High Performance Computing section and the links found there.
- In particular, try the following links:
- Important Notices and News - look for any news items regarding uP
in the "Latest LC IBM AIX News" section.
- OCF Machine Status - enter your workshop userid and PIN + 6 digit OTP
when prompted for userid/password. Then find uP in the list of machines.
Review uP's status information, and then click on the uP link for
additional detailed information.
- Computing Resources - find uP's hardware configuration information
- Code Development - find out which compilers are installed on uP
- Running Jobs - find the current job limits for uP
- Training - find the "Using ASC Purple" tutorial. Notice what else is
available.
- Search (upper left corner) - look up "ASC Purple"
This completes the exercise.
Please complete the online evaluation form if you have not already done
so for this tutorial.
Return to the tutorial