presented by
Blaise Barney
Livermore Computing
Abstract |
This tutorial is intended to be an introduction to using LC's IA64 Thunder Linux
cluster. It begins by providing a brief historical background of Linux clusters
at LC, noting their success and adoption as a production, high performance
computing platform. The primary hardware components of Thunder are then
presented, including a summary of Thunder's overall configuration,
Intel's IA64 Itanium 2 processor, the E8870 Chipset and the Quadrics
interconnect switch.
After covering the hardware-related topics, a brief discussion of how to obtain an account and access Thunder follows. Software topics are then discussed, including the LC development environment, compiling with the Intel compilers, Quadrics MPI, and how to run both batch and interactive parallel jobs. Special attention is paid to IA64 issues in each of these areas as relevant. Available debuggers and performance-related tools and topics are briefly discussed; detailed usage, however, is beyond the scope of this presentation. The tutorial concludes with a brief listing of known issues and problems and where to go for more information. A lab exercise using the IA64 Thunder Linux cluster follows the presentation.
Level/Prerequisites: Intended for those who are new to developing
parallel programs in LC's Intel IA64 cluster environment. A basic
understanding of parallel programming in C or Fortran is assumed.
The material covered by EC3501 - Introduction to Livermore Computing Resources would
also be useful.
Background of Linux Clusters at LLNL |
The Linux Project:
Alpha Linux Clusters:
PCR Clusters:
MCR Cluster...and More:
ALC | OCF | 960 nodes |
ILX | OCF | 67 nodes |
PVC | OCF | 64 nodes |
LILAC | SCF | 768 nodes |
ACE | SCF | 160 nodes |
GVIZ | SCF | 64 nodes |
Which Led To...
Hardware Overview |
Summary:
Additional Details:
thunder0: cat /proc/cpuinfo
processor  : 0
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 1
revision   : 5
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 1396.196998
itc MHz    : 1396.196998
BogoMIPS   : 1071.64
...
...
processor  : 3
vendor     : GenuineIntel
arch       : IA-64
family     : Itanium 2
model      : 1
revision   : 5
archrev    : 0
features   : branchlong
cpu number : 0
cpu regs   : 4
cpu MHz    : 1396.196998
itc MHz    : 1396.196998
BogoMIPS   : 2084.56

thunder0: cat /proc/pal/cpu0/cache_info
Data Cache level 1:
        Size           : 16384 bytes
        Attributes     : WriteThrough
        Associativity  : 4
        Line size      : 64 bytes
        Stride         : 128 bytes
        Store latency  : 3 cycle(s)
        Load latency   : 1 cycle(s)
        Store hints    :
        Load hints     : [Temporal, level 1]
        Alias boundary : 4096 byte(s)
        Tag LSB        : 12
        Tag MSB        : 49
Instruction Cache level 1:
        Size           : 16384 bytes
        Attributes     :
        Associativity  : 4
...
...
Data/Instruction Cache level 3:
        Size           : 4194304 bytes
        Attributes     : Unified WriteBack
        Associativity  : 16
        Line size      : 128 bytes
        Stride         : 128 bytes
        Store latency  : 7 cycle(s)
        Load latency   : 14 cycle(s)
        Store hints    : [Reserved]
        Load hints     : [Non-temporal, level 1]
        Alias boundary : 4096 byte(s)
        Tag LSB        : 18
        Tag MSB        : 49
Hardware Overview |
Itanium 2 Block Diagram:
Description:
L1 Data Cache / L1 Instruction Cache | L2 Cache | L3 Cache |
---|---|---|
Hardware Overview |
Block Diagram:
Components:
Hardware Overview |
Primary components:
Topology:
Features:
Performance:
Hardware Overview |
Accounts and Access |
Accounts:
Program | Allocation |
---|---|
D&NT | 10.9% |
P&AT | 5.7% |
C&MS | 8.2% |
BBRP | 0.9% |
E&E | 2.3% |
Engineering | 3.1% |
Computations | 2.3% |
Q Div-M&NT | 0.9% |
Access:
ssh -p922 thunder.llnl.gov
Software and Development Environment |
Note: Like the IA32 Linux clusters, Thunder's software and development
environment is very similar to that described in the
Introduction to LC Resources tutorial. Only highlights and items specific
to Thunder are discussed below.
CHAOS Operating System:
Batch System:
File Systems:
Compilers:
MKL - Intel Math Kernel Library
Debuggers and Performance Analysis Tools:
Man Pages:
Intel Compilers |
Optimizing Compilers:
Compiler Invocation Commands:
icc | serial/OpenMP C |
icpc | serial/OpenMP C++ |
ifort | serial/OpenMP Fortran 77 and 90 |
mpiicc | script for C with Quadrics MPI |
mpiicpc | script for C++ with Quadrics MPI |
mpiifort | script for Fortran with Quadrics MPI |
Versions:
Compiler | Shell | Command |
---|---|---|
C/C++ | bsh/ksh | . /usr/local/intel/compiler80/bin/iccvars.sh |
C/C++ | csh/tcsh | source /usr/local/intel/compiler80/bin/iccvars.csh |
Fortran | bsh/ksh | . /usr/local/intel/compiler80/bin/ifortvars.sh |
Fortran | csh/tcsh | source /usr/local/intel/compiler80/bin/ifortvars.csh |
Common / Useful Options:
Option | Description | C/C++ | Fortran |
---|---|---|---|
-align keyword | Align data as specified by keyword. See man page for details. | ![]() |
|
-ansi_alias[-] | Can help performance. Directs the compiler to assume the
following:
C/C++ Default = -ansi_alias- (off) Fortran Default = -ansi_alias (on) |
![]() |
![]() |
-assume keyword
-assume buffered_io |
Specifies assumptions made by the compiler. One option that may improve I/O performance is buffered_io, which causes sequential file I/O to be buffered rather than being written to disk immediately. See the ifort man page for details. | ![]() |
|
-auto
-automatic -nosave -save
|
Places variables, except those declared as SAVE, on the run-time stack.
The default is -auto_scalar (local scalar of types INTEGER, REAL,
COMPLEX, or LOGICAL are automatic). However, if you specify -recursive
or -openmp, the default is -auto.
Places variables, except those declared as AUTOMATIC, in static memory. However, if you specify -recursive or -openmp, the default is -auto. |
![]() |
|
-autodouble | Defines real variables to be REAL(KIND=8). Same as specifying -r8. | ![]() |
|
-c | Stop the compilation after an object file has been produced - creates a *.o file and does not link. | ![]() |
![]() |
-check keyword | Enable runtime error checking actions according to keyword. | ![]() |
|
-convert keyword | Specifies the format for unformatted files, such as big endian, little endian, IBM 370, Cray, etc. | ![]() |
|
-Dname[=value] | Defines a macro name and associates it with a specified value. Equivalent to a #define preprocessor directive. | ![]() |
![]() |
-fast | Shorthand for several combined optimization options: -O3, -ipo and -static | ![]() |
![]() |
-fpp
-cpp |
Invoke Fortran preprocessor. -fpp and -cpp are equivalent. | ![]() |
|
-fpe[n] | Specifies the run-time floating-point exception handling behavior:
|
![]() |
|
-ftz | Flush denormal values to zero. May improve performance in some codes. By default, this option is turned off, except with -O3 which turns it on. To turn it off with -O3, use -ftz-. | ![]() |
![]() |
-g | Build with debugging symbols. Note that -g does not imply -O0 in the Intel compilers; -O0 must be specified explicitly to turn all optimizations off. | ![]() |
![]() |
-help | Print compiler options summary | ![]() |
![]() |
-Idirectory | Add directory to include file search path | ![]() |
![]() |
-ip | Enable single-file interprocedural optimizations. | ![]() |
![]() |
-ipo | Enable multi-file interprocedural optimizations. | ![]() |
![]() |
-Ldirectory | Add directory to library search path | ![]() |
![]() |
-mcpu=itanium2 | Optimize for Itanium 2 processor (default) | ![]() |
|
-module directory | Specifies the directory where module (.mod) files should be placed when created and where they should be searched for in USE statements. | ![]() |
|
-mp | 'Maintain precision' - favor conformance to IEEE 754 standards for floating-point arithmetic. | ![]() |
![]() |
-mp1 | Improve floating-point precision - less speed impact than -mp. | ![]() |
![]() |
-o name | Name the output file (executable or object file) name. | ![]() |
![]() |
-O0 | Turn off optimizer - recommended if using -g for debugging. | ![]() |
![]() |
-O, -O1, -O2, -O3 | Optimization levels. (O,O1,O2 are essentially equivalent). -O3 is the most aggressive optimization level. | ![]() |
![]() |
-openmp | Turns on OpenMP. Supports OpenMP 2.0. | ![]() |
![]() |
-opt_report
-opt_report_file filename -opt_report_level [min|med|max] -openmp_report[0|1|2] -par_report[0|1|2|3] |
Various reporting options on optimization, OpenMP, or auto-parallelization. See man pages for more information. | ![]() |
![]() |
-p | Enables function profiling with the gprof tool. Same as -qp. | ![]() |
![]() |
-parallel | Enable auto-parallelizer to generate multi-threaded code for eligible loops. | ![]() |
![]() |
-prof_gen
-prof_file -prof_use |
Used for profile guided optimization. | ![]() |
![]() |
-pthread, -lpthread | Link with Pthreads library | ![]() |
|
-r8
-r16 -real_size 64 -real_size 128 |
Different ways to specify the default size of real and/or double-precision numbers. | ![]() |
|
-recursive | Compiles all functions for possible recursion. | ![]() |
|
-reentrancy keyword | Specifies how to compile for multithreaded code. | ![]() |
|
-shared | Create a shared object (.so) | ![]() |
![]() |
-static | Prevents linking with shared libraries (.so); all libraries are linked statically. | ![]() |
![]() |
-tpp2 | Optimize for Itanium 2 (default) | ![]() |
![]() |
-unroll[n] | Set maximum number of times to unroll loops. Omit n to use default heuristics. Use n=0 to disable loop unroller. | ![]() |
![]() |
-V | Display compiler version information | ![]() |
![]() |
-w | Disable all warning messages | ![]() |
![]() |
-w[0|1|2] | Increasing levels of warning message reporting. Default=1 | ![]() |
![]() |
-Wall (C/C++)
-warn (Fortran) |
Enable all warning messages | ![]() |
![]() |
-Wp64 | Print diagnostic messages for 64-bit porting. | ![]() |
|
-Zp[n] | Align structures at n (1,2,4,8,16) byte boundaries. | ![]() |
![]() |
GNU Compatibility:
Caveats:
#include <stdio.h>
int main()
{
  int i = 2;
  i /= 0;
  printf("i = %d\n",i);
}

#include <stdio.h>
int main()
{
  float i = 2;
  i /= 0;
  printf("i = %f\n",i);
}
Quadrics MPI |
Quadrics MPI:
MPI Build Scripts:
Script Name | Underlying Compiler |
---|---|
mpicc | gcc |
mpiCC | g++ |
mpiicc | icc |
mpiicpc | icpc |
mpif77 | f77->g77 |
mpif90 | f77->g77 |
mpiifort | ifort |
Static Linking:
/usr/lib/mpi/mpi_intel/lib (for most codes)
/usr/lib/mpi/mpi_intel/lib_i8r8 (for fortran codes compiled with 8-byte integers and 8-byte reals)
/usr/lib/mpi/mpi_intel/lib_i8 (pointer to lib_i8r8)
/usr/lib/mpi/mpi_intel/lib_r8 (pointer to lib)
/usr/lib/mpi/mpi_gnu/lib

/usr/lib/mpi/mpi_intel/include (for most codes)
/usr/lib/mpi/mpi_intel/include_i8r8 (for fortran codes compiled with 8-byte integers and 8-byte reals)
/usr/lib/mpi/mpi_intel/include_i8 (for fortran codes compiled with 8-byte integers)
/usr/lib/mpi/mpi_intel/include_r8 (for fortran codes compiled with 8-byte reals)
/usr/lib/mpi/mpi_gnu/lib

-lmpifarg (fortran only)
-lmpi

-lmpifarg (fortran only)
-lmpi
-lelan
-lelan4
-lrmscall
-lelf
Libelan Environment Variables:
Performance:
Known Problems:
Running on Thunder |
A Few General Notes First:
To apply for DAT time on Thunder, see: https://www.llnl.gov/lcforms/thunder.html
Interactive Jobs:
Batch Jobs:
psub myjobscript
# Sample LCRM script to be submitted with psub
#PSUB -c thunder,pbatch          # explicitly say where to run
#PSUB -r t2d22                   # sets job name
#PSUB -tM 1:00                   # sets maximum total CPU time
#PSUB -b micphys                 # sets bank account
#PSUB -ln 2                      # uses 2 nodes
#PSUB -x                         # export current env var settings
#PSUB -o /g/g0/db/t2d22.log      # sets output log name
#PSUB -e /g/g0/db/t2d22.err      # sets error log name
#PSUB -nr                        # do NOT rerun job after system reboot
#PSUB -ro                        # write stdout immediately (no spooling)
#PSUB -re                        # write stderr immediately (no spooling)
#PSUB -mb                        # send email at execution start
#PSUB -me                        # send email at execution finish
#PSUB                            # no more psub commands
# job commands start here

# Display job information for possible diagnostic use
set echo
hostname
echo LCRM job id = $PSUB_JOBID
sinfo
squeue

# Run job
cd /p/gt1/db/t2d22
srun -n 4 ./my_mpiprog
echo 'ALL DONE'
Quick Summary of Common Batch Commands:
Command | Description |
---|---|
psub | Submits a job to LCRM |
pstat | LCRM job status command |
prm | Remove a running or queued job |
phold | Place a queued job on hold |
prel | Release a held job |
palter | Modify job attributes (limited subset) |
lrmmgr | Show host configuration information |
pshare | Queries the LCRM database for bank share allocations, usage statistics, and priorities. |
defbank | Set default bank for interactive sessions |
newbank | Change interactive session bank |
Running on Thunder |
srun [option list] [executable] [args]
Note that srun options must precede your executable.
srun -n4 -ppdebug my_app | 4 process job run interactively in pdebug partition |
srun -n2 -c2 my_threaded_app | 2 process job with 2 threads per process. Assumes pbatch partition. |
srun -N8 my_app | Request that 8 nodes be used for job (total of 32 CPUs). Assumes pbatch partition. |
srun -n4 -o my_app.out my_app | 4 process job that redirects stdout to file my_app.out. Assumes pbatch partition. |
srun -n4 -ppdebug -i all my_app | 4 process interactive job; each process accepts input from stdin |
Option | Description |
---|---|
-c [#cpus/task] | The number of CPUs used by each process. Use this option if each process in your code spawns multiple POSIX or OpenMP threads. |
--core=light | Specifies creation of lightweight core files. May be useful for very large process jobs which are crashing and filling disk space with core files. Note double dashes before "core" in this option. The default is --core=normal, which may actually be limited by your shell corefilesize setting. |
-d | Specify a debug level - integer value between 0 and 5 |
-i [file]
-o [file] |
Redirect input/output to file specified |
-I | Allocate CPUs immediately or fail. By default, srun blocks until resources become available. |
-J | Specify a name for the job |
-l | Label - prepend task number to lines of stdout/err |
-m block|cyclic | Specifies whether to use block (the default) or cyclic distribution of processes over nodes |
-n [#processes] | Number of processes that the job requires |
-N [#nodes] | Number of nodes on which to run job |
-O | Overcommit - srun will refuse to allocate more than one process per CPU unless this option is also specified |
-p [partition] | Specify a partition on which to run job |
-s | Print usage stats as job exits |
-v -vv -vvv | Increasing levels of verbosity |
-V | Display version information |
Terminating Jobs:
thunder0: squeue | grep test1
  24688  pbatch  test110  qmtang  R  1:49:09  24  thunder[79-102]
  68865  pdebug  test1    blaise  R     0:11   1  thunder1008
thunder0% scancel 68865
thunder0% srun: error: thunder1008: task[0-3]: Killed
[1]    Exit 137    srun -n4 -ppdebug test1
thunder0%
thunder0% pstat
  25156  t1.cmd  blaise  000000  cs  RUN      thunder  N
  25157  t1.cmd  blaise  000000  cs  STAGING  thunder  N
thunder0% prm 25156
remove running job 25156 (blaise, 000000, cs)? [y/n] y
thunder0%
Running on Thunder |
thunder0: ju
Partition  total  down  used  avail  cap  Jobs
pbatch       986     2   961     23  98%  m33ng-24, mmwth5-192, ssope-256 ...
pdebug        16     0     1     15   6%  pnbrown-1
thunder0: spjstat
Scheduling pool data:
--------------------------------------------------------
Pool     Memory  Cpus  Nodes  Usable  Free  Other traits
--------------------------------------------------------
pbatch   7500Mb     4    986     984    23
pdebug   7500Mb     4     16      16    15

Running job data:
-------------------------------------------------------
Job ID   User Name  Nodes  Pool     Status
-------------------------------------------------------
18680    jwen          16  pbatch   Running
18789    eess          24  pbatch   Running
18688    jwen          16  pbatch   Running
18670    gcero         32  pbatch   Running
17916    ssope        256  pbatch   Running
16316    jkbita         8  pbatch   Running
18224    jkbita       256  pbatch   Running
19041    mtang         24  pbatch   Running
16321    jkbita         8  pbatch   Running
15547    mmwth5       192  pbatch   Running
19166    eess          24  pbatch   Running
19148    cong           8  pbatch   Running
19152    cong           8  pbatch   Running
19169    eess          24  pbatch   Running
19243    mmrin         65  pbatch   Running
68245    pnrown         1  pdebug   Running
thunder0: sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
pbatch*    up     infinite       1  down*  thunder57
pbatch*    up     infinite     969  alloc  thunder[22-56,58-206,208-494 ...]
pdebug     up     30:00         10  alloc  thunder[1008-1017]
pbatch*    up     infinite      15  idle   thunder[207,619,691,754-755 ...]
pbatch*    up     infinite       1  down   thunder495
pdebug     up     30:00          6  idle   thunder[1018-1023]
thunder0: squeue
JOBID  PARTITION  NAME      USER     ST  TIME     NODES  NODELIST
22684  pbatch     Expandin  yowen    R   7:17:49     64  thunder[175-188,197-209 ...]
23014  pbatch     1BBL_swo  sope3    R   6:21:46    256  thunder[314-391,497-534 ...]
24640  pbatch     test.thu  mrrin    R   3:56:35     65  thunder[220-284]
23594  pbatch     bbatch2H  eess     R   2:39:31     24  thunder[464-472,646-654 ...]
24637  pbatch     test110   mmang    R   2:12:23     24  thunder[79-102]
24689  pbatch     psub_thu  mmhee    R   2:11:28      4  thunder[52-55]
24691  pbatch     psub_thu  mmhee    R   2:09:59      4  thunder[720-723]
24696  pbatch     s52-1     gwecero  R   2:01:20     64  thunder[71-78,423-452 ...]
24725  pbatch     sisop11   gwecero  R   1:33:54     16  thunder[805-820]
24750  pbatch     batch.sh  b4nlii   R     46:27    128  thunder[58-70,103-108 ...]
24783  pbatch     s32-1     gwecero  R     20:52     32  thunder[22-29,643-645 ...]
24839  pbatch     rh315_10  wgitsu   R     16:55     32  thunder[416-422,473-482 ...]
thunder0: pstat -m thunder
JID    NAME             USER      ACCOUNT  BANK      STATUS     EXEHOST  CL
8942   mo_108.0         3ood      000000   squeeze   *WCPU      thunder  N
8949   u2.psub          uyang     477530   micphys   *WCPU      thunder  N
16346  do800            kers      000000   micphys   *WCPU      thunder  N
17873  b4f              lkwggner  000000   micphys   *WCPU      thunder  N
17874  b4f              lkwggner  000000   micphys   *DEPEND    thunder  N
22678  valduc3d01       kbbbta    000000   cms       *MULTIPLE  thunder  N
22684  ExpandingTube-3  jwen      529004   axcode    RUN        thunder  N
22685  ExpandingTube-3  jwen      529004   axcode    *DEPEND    thunder  N
22879  mo4.psub         uyang     477530   squeeze   *WCPU      thunder  N
22991  vlcc_8.10.16     m55rath5  000000   fph2o     *DEPEND    thunder  N
24640  test.thunder     lirin     530001   clchange  RUN        thunder  N
24653  do90             kers      000000   micphys   RUN        thunder  N
24655  do70             kers      000000   micphys   RUN        thunder  N
...    ...              ...
24656  amr100gp         kgitsu    000000   chemd     RUN        thunder  N
24839  rh315_100gp      kgitsu    000000   micphys   RUN        thunder  N
24840  origi            htang     530001   lines     *WCPU      thunder  N
24841  rh315_100gp      kgitsu    000000   micphys   *TOOLONG   thunder  N
24842  dpd              ggee      000000   cms       RUN        thunder  N
24873  methanol         bmundy    000000   fph2o     RUN        thunder  N
24879  new-ExpandingTu  jwen      529004   axcode    RUN        thunder  N
24880  amr100gp         kgitsu    000000   chemd     *TOOLONG   thunder  N
43344  sspex_test2.ksh  qitera1   000000   folding   *DEPEND    thunder  N
43345  sspex_test2.ksh  qitera1   000000   folding   *DEPEND    thunder  N
43346  sspex_test2.ksh  qitera1   000000   folding   *DEPEND    thunder  N
Running on Thunder |
Task     Node    CPU
------   -----   -----
task 0   node0   cpu0
task 1   node0   cpu1
task 2   node0   cpu2
task 3   node0   cpu3
task 4   node1   cpu0
task 5   node1   cpu1
task 6   node1   cpu2
task 7   node1   cpu3
This may or may not be what you want.
# | Example | Interactive | Batch |
---|---|---|---|
1 | You have a 16-task MPI job. You want each MPI task to have 1 CPU and you don't want to waste CPUs. This would be typical for non-threaded MPI tasks which don't require more than 1/4 of the node's total memory. | srun -n16 -ppdebug a.out | #PSUB -ln 4, followed by: srun -n16 a.out |
2 | You want to run 4 simultaneous instances of a non-MPI process that uses POSIX or OpenMP threads. Typically, you would allocate an equal number of processes and nodes for your job and also specify the number of CPUs to use per node. | srun -n4 -c4 -ppdebug a.out or srun -N4 -ppdebug a.out | #PSUB -ln 4, followed by: srun -n4 -c4 a.out or srun -N4 a.out |
3 | You have an 8-task MPI job that uses POSIX or OpenMP threads. Typically, you would run 1 MPI task per node, which would then spawn 4 threads. Don't forget that Quadrics MPI is not thread-safe, so the master thread should perform all MPI calls. | srun -n8 -c4 -ppdebug a.out or srun -N8 -ppdebug a.out | #PSUB -ln 8, followed by: srun -n8 -c4 a.out or srun -N8 a.out |
4 | You have a 32-task, non-threaded MPI job. Each MPI task requires approx. 3 GB of memory. Thunder nodes have 8 GB memory, so putting 4 tasks on a node would exhaust memory and cause paging. It would be better to use 2 CPUs per task in this case, even though one of them would be "wasted". | srun -n32 -c2 -ppdebug a.out or srun -N16 -n32 -ppdebug a.out | #PSUB -ln 16, followed by: srun -n32 -c2 a.out or srun -N16 -n32 a.out |
Task     Block   Cyclic
------   -----   ------
task 0   node0   node0
task 1   node0   node1
task 2   node0   node0
task 3   node0   node1
task 4   node1   node0
task 5   node1   node1
task 6   node1   node0
task 7   node1   node1
#PSUB -ln 20
srun -N10 -n40 myjob
srun -N11 -n44 myjob
srun -N12 -n48 myjob
....
srun -N20 -n80 myjob
Running on Thunder |
12:41:01 up 8 days, 3:10, 1 user, load average: 0.05, 0.22, 0.56
257 processes: 256 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0.4%    0.0%    0.4%   0.0%     0.0%    0.0%  398.4%
           cpu00    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
           cpu01    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
           cpu02    0.1%    0.0%    0.2%   0.0%     0.0%    0.0%   99.5%
           cpu03    0.3%    0.0%    0.2%   0.0%     0.0%    0.0%   99.2%
Mem:  8246928k av, 1216608k used, 7030320k free, 0k shrd, 108960k buff
Process Stack vs. Heap Memory:
pdebug vs. pbatch | csh/tcsh | bsh/ksh |
---|---|---|
pdebug | % limit cputime unlimited filesize unlimited datasize unlimited stacksize unlimited coredumpsize 16 kbytes memoryuse unlimited vmemoryuse unlimited descriptors 1024 memorylocked unlimited maxproc 1024 |
$ ulimit -a time(cpu-seconds) unlimited file(blocks) unlimited coredump(blocks) 32 data(kbytes) unlimited stack(kbytes) unlimited lockedmem(kbytes) unlimited memory(kbytes) unlimited nofiles(descriptors) 1024 processes 1024 |
pbatch | $ limit cputime unlimited filesize unlimited datasize unlimited stacksize unlimited coredumpsize unlimited memoryuse unlimited vmemoryuse unlimited descriptors 1024 memorylocked unlimited maxproc 16339 |
$ ulimit -a time(cpu-seconds) unlimited file(blocks) unlimited coredump(blocks) unlimited data(kbytes) unlimited stack(kbytes) unlimited lockedmem(kbytes) unlimited memory(kbytes) unlimited nofiles(descriptors) 1024 processes 16339 |
Pass:
#include <stdio.h>
#define N 16000
void dowork()
{
  double A[N][N],sum;
  int i,j;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = 1.0+i*j;
  printf("Sample result = %e\n",A[N-1][N-1]);
}
int main(int argc, char *argv[])
{
  dowork();
}

Fail (seg fault):
#include <stdio.h>
#define N 17000
void dowork()
{
  double A[N][N],sum;
  int i,j;
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = 1.0+i*j;
  printf("Sample result = %e\n",A[N-1][N-1]);
}
int main(int argc, char *argv[])
{
  dowork();
}

Pass:
#include <stdio.h>
#define N 17000
int main(int argc, char *argv[])
{
  int i, j;
  static double A[N][N];
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = 1.0+i*j;
  printf("Sample result = %e\n",A[N-1][N-1]);
}

Fail (seg fault):
#include <stdio.h>
#define N 17000
int main(int argc, char *argv[])
{
  int i, j;
  double A[N][N];
  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      A[i][j] = 1.0+i*j;
  printf("Sample result = %e\n",A[N-1][N-1]);
}

Pass:
#include <stdio.h>
#include <stdlib.h>
#define N 300000000
int main(int argc, char *argv[])
{
  long int i, *A;
  if ((A = (long int *) malloc(N*sizeof(long int))) == NULL) {
    printf("Malloc failed!. Exiting.\n");
    exit(0);
  }
  for (i=0; i<N; i++)
    A[i] = i;
  printf("Sample result = %li\n",A[N-1]);
}

Fail (seg fault):
#include <stdio.h>
#include <stdlib.h>
#define N 300000000
int main(int argc, char *argv[])
{
  long int i, A[N];
  for (i=0; i<N; i++)
    A[i] = i;
  printf("Sample result = %li\n",A[N-1]);
}
Pthreads Stack Limits:
System | Architecture | #CPUs | Memory (GB) | Interactive Default Stack Size | Batch Default Stack Size |
---|---|---|---|---|---|
THUNDER | Intel IA64 | 4 | 8 | 33554432 | 33554432 |
MCR(OCF) / LILAC(SCF) | Intel IA32 | 2 | 4 | 2097152 / 67108864 | 2097152 |
UM,UV | IBM Power4 | 8 | 16 | 196608 | 196608 |
FROST,WHITE | IBM Power3 | 16 | 16 | 98304 | 98304 |
pthread_attr_getstacksize (&attr, &stacksize)
pthread_attr_setstacksize (&attr, stacksize)
Batch System Maximum Thread Stack Size (MB) by #Threads/Node:
System | Architecture | #CPUs | Memory (GB) | 2 | 4 | 8 | 16 | 32 |
---|---|---|---|---|---|---|---|---|
THUNDER | Intel IA64 | 4 | 8 | 25000 | 25000 | 25000 | 25000 | 25000 |
MCR,ALC,LILAC | Intel IA32 | 2 | 4 | 1072 | 712 | 352 | 182 | 92 |
UM,UV | IBM Power4 | 8 | 16 | 260 | 260 | 260 | 260 | 260 |
FROST,WHITE | IBM Power3 | 16 | 16 | 260 | 260 | 260 | 260 | 260 |
For example, if your thread uses 20MB of local data, then you should create it with 40+ MB of stack size. How much greater? Start with an extra megabyte and see what happens.
This matter has been reported to Red Hat, and it doesn't look like it's going to change. It's a design "feature".
OpenMP Stack Limits:
setenv KMP_STACKSIZE 12000000
In Conclusion:
Running on Thunder |
Environment Variable | Description |
---|---|
MALLOC_MMAP_MAX_ | Default value of 0. Forces malloc to use sbrk() rather than mmap() to allocate memory. Improves performance of MPI collectives because it prevents aggressive reclaiming of pages mapped on the Elan card DMA memory. |
MALLOC_TRIM_THRESHOLD_ | Default value of -1. Used in conjunction with MALLOC_MMAP_MAX_ (see above) |
Compiler Hints:
Local MPI Test Results on Thunder:
Web Documentation:
Debugging |
Available Debuggers:
Debugger | More Info |
---|---|
TotalView |
|
DDT |
|
GDB |
|
DDD |
|
IDB |
|
TotalView:
totalview srun -a -n processes -ppdebug prog [prog args]
DDT:
ddt prog
srun -ppdebug -N4 ddt prog
GDB:
gdb a.out
gdb a.out core.1234
gdb a.out 12345
Command | Action |
---|---|
b,break N | Set breakpoint at line N |
b,break funcname|line | Set breakpoint at function named funcname or at specified line |
bt | Print a stack backtrace |
c,cont | Continue after breakpoint |
h,help | Print list of help topics |
i,info registers | Show registers |
i,info float | Show floating-point registers |
l,list N | List source lines around line N (10 lines by default) |
n,next | Execute next program line; step over function calls |
q,quit | Quit |
r,run | Run program |
s,step | Execute next program line; step into function |
DDD:
ddd a.out
ddd a.out core.1234
ddd a.out 12345
IDB:
idb -gdb a.out
idb a.out core.1234
idb -gdb a.out -pid 12345
set path = ($path /usr/local/intel/idb_80/bin)
setenv IDB_HOME /usr/local/intel/idb_80/bin
Debugging in Batch: batchxterm:
batchxterm display machine #nodes #minutes
Where:
cd ~/projects totalview srun -a -n8 myprog
A Few Additional Useful Debugging Hints:
srun -N12 -x "thunder1008 thunder1009" -ppdebug myjob
csh/tcsh | limit coredumpsize 64 |
---|---|
ksh/bsh | ulimit -c 64 |
Tools |
We Need a Book!
setenv GMON_OUT_PREFIX 'gmon.out.'`/bin/uname -n`
Known Problems/Issues |
Just Getting Started...
What We Have So Far:
Data Type | IA32 Size | IA64 Size |
---|---|---|
int | 4 bytes | 4 bytes |
unsigned int | 4 bytes | 4 bytes |
long | 4 bytes | 8 bytes |
unsigned long | 4 bytes | 8 bytes |
*int (pointer to int) | 4 bytes | 8 bytes |
float | 4 bytes | 4 bytes |
*float (pointer to float) | 4 bytes | 8 bytes |
double | 8 bytes | 8 bytes |
*double (pointer to double) | 4 bytes | 8 bytes |
References and More Information |
This completes the tutorial.
Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise. |