presented by
Blaise Barney
Livermore Computing
Table of Contents
- Abstract
- ASC Purple Background
- Hardware
- Configuration
- POWER5 Processor
- p5 575 Node and Frame
- High Performance Switch (HPS) Network
- GPFS Parallel File System
- Accounts
- Access
- User Environment Topics
- Software and Development Environment
- Parallel Operating Environment (POE) Overview
- Compilers
- MPI
- Running on Purple Systems
- Important Differences
- Understanding Your System Configuration
- Setting POE Environment Variables
- Invoking the Executable
- Monitoring Job Status
- Interactive Job Specifics
- Batch Job Specifics
- More On SLURM
- Optimizing CPU Usage
- Large Pages
- RDMA
- Debugging With TotalView
- Misc - Recommendations, Known Problems, Etc.
- References and More Information
- Exercise
This tutorial provides an introduction to using Livermore Computing's (LC)
ASC Purple systems. The intended audience is primarily those who are new to
using the IBM POWER architecture and computing in LC's HPC environment.
Those who are already knowledgeable with computing in LC's HPC environment,
especially users of LC's POWER based systems (such as ASC White), will already
be familiar with a substantial portion of these materials.
The tutorial begins by providing a brief background of ASC Purple and the
configuration of LC's Purple systems. The primary hardware components
of Purple are then presented, including IBM's POWER5 processor, p5 575 node
and frame, HPS switch, and GPFS parallel I/O architecture.
After covering the hardware related topics, a brief discussion on how to
obtain an account and access the Purple systems follows. Software topics are
then discussed, including the LC development environment, IBM's Parallel
Operating Environment (POE), compilers, MPI implementations, and how to run
both batch and interactive parallel jobs.
Debugging and performance related tools/topics are briefly discussed,
however detailed usage of these tools is beyond the scope of this presentation
and is covered in other tutorials and LC documentation. The tutorial
concludes with several LC specific and miscellaneous topics. A lab exercise
using LC's unclassified Purple system follows the presentation.
Level/Prerequisites: Intended for those who are new to developing
parallel programs in LC's IBM POWER environment. A basic understanding of
parallel programming in C or Fortran is assumed. The material covered by
EC3501 - Introduction to
Livermore Computing Resources would also be useful.
The history of ASC Purple is really the history of two separate, but
interrelated timelines: the evolution of IBM's POWER architecture, and the
NNSA's ASC program.
IBM's POWER Architectures:
NNSA's ASC Program:
- 1992: President Bush signed into law the FY1993 Energy and Water
Authorization Bill that established a moratorium on U.S. nuclear testing.
- 1993: President Clinton extended the moratorium.
- 1995: President Clinton announced the United States' intention to pursue
a Comprehensive Test Ban Treaty for nuclear weapons: "we can meet the
challenge of maintaining our nuclear deterrent under a [comprehensive test
ban] through a science-based stockpile stewardship program without nuclear
testing."
- 1996: Department of Energy (DOE) established the science-based Stockpile
Stewardship Program to ensure the safety, reliability and performance of
the U.S. stockpile in an era of no nuclear weapons testing/development.
- 1996: The Accelerated Strategic Computing Initiative (ASCI) was created
as an essential part of the Stockpile Stewardship Program. ASC's mission
is to help ensure the performance, reliability and safety of the U.S.
nuclear stockpile through leading-edge computational modeling and
simulation. Aggressive investment in leading-edge HPC platforms and
related hardware/software technologies provides a critical resource
for this mission.
- 2000: National Nuclear Security Administration (NNSA) established to carry
out the national nuclear security responsibilities of the DOE. The
Stockpile Stewardship Program and ASCI fall under this umbrella.
- 2002: ASCI is renamed to Advanced Simulation and Computing (ASC).
- 2005: ASC enters its second decade.
- Since its inception, the ASCI/ASC program has funded the following
world-class HPC systems. Several of these (ASCI Red, ASCI White and
BlueGene/L) have held the status of the world's most powerful computer:
- ASCI Red - Sandia
- ASCI Blue Mountain - Los Alamos
- ASCI Blue Pacific - Livermore
- ASCI White - Livermore
- ASCI Q - Los Alamos
- Lightning - Los Alamos
- ASC Purple - Livermore
- BlueGene/L - Livermore
- Red Storm - Sandia
ASC Purple Timeline:
- One of the ASC Program's key goals was to implement a 100 Teraflop
system by 2005. ASC Purple is the realization of this goal.
- 2/22/02: Purple RFP issued
- 11/19/02: DOE awards IBM a $290 million contract to build ASC
Purple and BlueGene/L.
- Late CY03 thru CY04: implementation of early delivery technology
vehicles (EDTV) - UM and UV POWER4 systems.
- Late CY04 thru CY05: phased implementation of classified Purple
systems. Unclassified Purple system (UP) became generally available (GA)
on 7/19/05. Intermediate systems PU and Purpura available on the SCF.
- 2Q06: Limited availability of classified Purple system. General
availability targeted for early 3Q06.
- 6/06: ASC Purple ranked #3 on the June 2006 Top500 list of the world's
most powerful supercomputers.
Configuration
Primary Components:
- Like LC's other IBM POWER systems, ASC Purple systems are
comprised of five primary components, described briefly below and in more
detail later:
- Nodes
- Frames
- Switch Network
- Parallel File Systems
- Hardware Management Console
- Nodes: Comprise the heart of a system. Nodes are rack mounted
in a frame and directly connected to the switch network. The
majority of nodes are dedicated as compute nodes to run user jobs.
Other nodes serve as file servers and login machines.
- Frames: The containment units that physically house nodes, switch
hardware, and other control/supporting hardware.
- Switch Network: The internal network fabric that enables high-speed
communication between nodes. Also called the High Performance Switch (HPS).
- Parallel File Systems: Each Purple system mounts one or more GPFS
parallel file system(s).
- Hardware Management Console: A stand-alone workstation that possesses
the hardware and software required to monitor and control the frames, nodes
and switches of an entire system by one person from a single point. With
Purple, the Hardware Management Console function is actually distributed
over a cluster of 68 PCs running Linux, with a single management console.
- Additionally, Purple systems are connected to external networks and NFS file
systems.
Topology:
- The schematic below shows the general topology used for both Purple systems.
The key characteristics are:
- The majority of nodes are dedicated as parallel batch/interactive
compute nodes.
- Some nodes are dedicated as GPFS servers for the parallel file
systems.
- A small number of nodes are dedicated as login nodes.
- Large, global parallel GPFS file systems are present.
- Nodes are connected to the HPS switch network.
- The entire system is connected by GigE to HPSS archival storage,
visualization systems, and other LC systems.
LC's Purple Systems:
- The two primary, production Purple systems are uP and Purple.
They share the following characteristics:
- Processor type: IBM POWER5 @1.9 GHz
- Node type: p5 575
- 8 processors per node
- 64-bit architecture and address space
- High Performance Switch (HPS) interconnect
- Parallel GPFS file system(s)
- Purple:
- Classified
- 1532 nodes
- Login nodes differ from compute nodes: they consist of two 32-way
POWER5 machines partitioned to look like four 16-way machines.
- Theoretical peak performance: 93 TFLOPS
- Memory:
- Login nodes: 64 GB
- All other nodes: 32 GB
- 2000 TB parallel GPFS file system(s)
- Configuration:
- Login nodes: 4
- Batch nodes: 1336
- Visualization nodes: 64
- Server nodes: 128
- uP:
- "unclassified Purple"
- 108 nodes
- Theoretical peak performance: 6.6 TFLOPS
- Memory: 32 GB per node
- 140 TB parallel GPFS file system
- Configuration:
- Login nodes: 1
- Debug nodes: 2
- Batch nodes: 99
- Server nodes: 6
POWER5 Processor
POWER5 Basics:
- The heart of LC's ASC Purple systems is the POWER5 processor. This same
processor is used by IBM for a variety of machines in its p5 family of
products. Models differ widely in the number of processors, memory, disk,
I/O configuration, etc.
- Architecturally, the POWER5 processor is very similar to its predecessor,
the POWER4 processor.
- POWER5 primary features:
- Dual-core chip (2 cpus per chip)
- 64-bit architecture
- Clock speeds of 1.65 to 1.9 GHz
- Superscalar, out of order execution with multiple functional units -
including two fixed point and two floating point units
- L1 data cache: 32KB per processor, 128 byte line, 4-way associative
- L1 instruction cache: 64KB per processor, 128 byte line,
2-way associative
- L2 cache: 1.9MB per chip (shared between dual processors), 128 byte
line, 10-way associative
- L3 cache: 36MB per chip (shared), extension of L2 cache, 256 byte line,
12-way associative, 30.4 GB/sec bandwidth to L2
- On-chip memory controller and L3 cache directory
- 1 GB - 256 GB memory
- 12.4 GB/sec cpu-memory bandwidth
- Can be combined with other dual-core chips to make up to 64-way SMPs
- Simultaneous multi-threading support - makes each cpu appear as two logical cpus
- Virtualization/logical partitioning - each cpu can support 10 partitions
(operating systems) simultaneously
- Dynamic power management - adjusts core power according to demand
Chip Modules:
- Dual-core POWER5 chips are combined with other components to form modules.
IBM produces the following types of POWER5 modules:
- Dual-chip Module (DCM): includes one dual-core POWER5 processor chip
and one L3 cache chip (2-way SMP).
- Quad-core Module (QCM): includes two dual-core POWER5 processor chips
and two L3 cache chips (4-way SMP).
- Multi-chip Module (MCM): includes four dual-core POWER5 processor chips
and four L3 cache chips (8-way SMP).
- Several diagrams and pictures of POWER5 modules are shown below.
Multiple Modules:
- Modules can be combined to form larger SMPs. For example, a 16-way SMP
can be constructed from two MCMs, and is called a "book" building block.
Four books can be used to make a 64-way SMP. Diagrams demonstrating both
of these are shown below.
ASC Purple Chips and Modules:
- ASC Purple compute nodes are p5 575 nodes, which differ from standard p5
nodes in having only one active core per dual-core chip.
- With only one active cpu in a chip, the entire L2 and L3 cache is dedicated.
This design benefits scientific HPC applications by providing better
cpu-memory bandwidth.
- ASC Purple nodes are built from Dual-chip Modules (DCMs). Each node has a
total of eight DCMs. A photo showing these appears in the next section
below.
p5 575 Node and Frame
p5 575 Node Characteristics:
- As mentioned previously, the p5 575 node is different from most other p5
nodes in that only one cpu is active in the dual-core processor.
- "Innovative" 2U packaging - a novel design to minimize space requirements
and achieve "ultra dense" CPU distribution. Up to 192 CPUs per frame
(12 nodes * 16 CPUs/node).
- Eight Dual-chip Modules (DCMs) with associated memory
- Comprised of 4 "field swappable" component modules:
- I/O subsystem
- DC power converter/lid
- processor and memory planar
- cooling system
- I/O: standard configuration of two hot-swappable SCSI disk drives.
Expansion via an I/O drawer to 16 additional SCSI bays with a maximum
of 1.17 TB of disk storage.
- Adapters: standard configuration of four 10/100/1000 Mb/s ethernet
ports and two HMC ports. Expansion up to 4 internal PCI-X slots and
20 external PCI-X slots via the I/O drawer.
- Support for the High Performance Switch and InfiniBand
- Supported operating systems:
- AIX 5L
- SUSE Linux
- Red Hat Linux
- i5/OS
- Many built-in RAS features
ASC Purple Frames:
- Like LC's other POWER systems, ASC Purple nodes and switch hardware are
housed in frames. An example frame used for the p5 575 compute nodes is
shown at right.
- Frame characteristics:
- Redundant frame power supply
- Air cooling
- Concurrent (hot swappable) node maintenance
- Monitoring and control from a single point via the Hardware Management
Console (HMC)
- ASC Purple frames can hold up to twelve p5 575 nodes.
- Some frames are used solely for switch hardware - stages 2 and 3.
- Managed via the Hardware Management Console/cluster
High Performance Switch (HPS) Network
Quick Intro:
- The HPS network provides the internal, high performance, communication
fabric that connects individual nodes together to form an entire system.
Any node can communicate with any other node via multiple pathways in
the network.
- The switch has evolved along with the POWER architecture, and has been
called by various names along the way:
- High Performance Switch (HiPS)
- SP Switch
- Colony / SP Switch2
- Federation / HPS (current)
- For the interested, a history of the SP switch (and much more) is
presented in the IBM Redbook "An Introduction to the New IBM
eServer pSeries High Performance Switch".
Currently, this publication is available in PDF format at:
www.redbooks.ibm.com/redbooks/pdfs/sg246978.pdf.
- The discussion here is limited (more or less) to a user's view of the
switch network and is highly simplified. In reality, the switch network
is quite complicated. The same IBM redbook mentioned above covers in much
greater detail the "real" switch network, for the curious.
Topology:
- Technically, the HPS network can be classified as a bidirectional,
Omega-based variety of Multistage Interconnect Network (MIN).
- Bidirectional: Each point-to-point
connection between nodes is comprised of two channels
(full duplex) that can carry data in opposite directions simultaneously.
- Multistage: Additional intermediate switches are required
to scale the system upwards.
For example, with ASC Purple, there are 3 levels of switches
required in order for every node to communicate with every other
node.
Switch Network Characteristics:
- Packet-switched network (versus circuit-switched). Messages are broken
into discrete packets and sent to their final destination, possibly
following different routes and arriving out of sequence. All of this is
invisible to the user.
- Low latency, high bandwidth
- Support for multi-user environment - multiple jobs may run
simultaneously over the switch (one user does not monopolize switch)
- Path redundancy - multiple routings between any two points. Permits
routes to be generated even when there are faulty components in
the system.
- Built-in error detection
- Hardware redundancy for reliability - the switch board (discussed below)
actually uses twice as many hardware components as it minimally requires,
for RAS purposes.
- Architected for expansion to thousands of ports. ASC Purple is the
first production system to demonstrate this scale.
- Hardware components: in reality, the HPS switch network is a very
sophisticated system with many complex components. From a user's
perspective however, there are only a few hardware components worth
mentioning:
- Switch drawers: house the switch boards and other support hardware.
Mounted in a system frame.
- Switch boards: the heart of the switch network
- Switch Network Interface (SNI): an adapter which plugs into a node
- Cables: to connect nodes to switch boards (copper), and
switchboards to other switchboards (copper or fiber).
Switch Drawer:
- The HPS switch drawer (4U by 24") fits into a slot in a frame. For
frames that contain nodes, this is usually the bottom slot of the frame.
For systems requiring intermediate switches, there are frames dedicated
to housing only switch drawers.
- The switch drawer contains most of the components that comprise the
HPS network, including but not limited to:
- Switchboard with switch chips
- Power supply
- Fans for cooling
- Switch port connector cards (riser cards)
Switch Board:
- The switch board is really the heart of the HPS network. The main features
of the switch board are listed below.
- There are 8 logical Switch Chips, each of which is connected to 4 other
Switch Chips to form an internal 4x4 crossbar switch.
- A total of 32 ports, controlled by Link Driver Chips on riser cards, are
used to connect to nodes and/or other switch boards.
- Depending upon how the Switch Board is used, it will be called a
Node Switch Board (NSB) or Intermediate Switch Board (ISB):
- NSB: 16 ports are configured for node connections. The other 16 ports
are configured for connections to switch boards in other frames.
- ISB: all ports are used to cascade to other switch boards.
- Practically speaking, the distinction between an NSB and ISB is
only one of topology. An ISB is just located higher up in the
network hierarchy.
- Switch-node connections are by copper cable. Switch-switch connections can
be either copper or optical fiber cable.
- Minimal hardware latency: approximately 59 nanoseconds to cross each Switch
Chip.
- Some simple example configurations using both NSB and ISB switch boards
are shown below. The number "4" refers to the number of ports connecting
each ISB to each NSB.
Switch Network Interface (SNI):
- The Switch Network Interface (SNI) is an adapter card that plugs into a
node's GX bus slot, allowing it to use the switch to communicate with other
nodes in the system. Every node that is connected to the switch must have
at least one switch adapter.
- There are different types of SNI cards for p5 nodes, and some types of p5
nodes support more than one SNI per node. p5 575 nodes use a single,
2-link adapter card.
- A key feature of the SNI is that it allows a process to
communicate via Direct Memory Access (DMA). Using DMA for communications
eliminates additional copies of data to system buffers; a process can
directly read/write to another process's memory.
- The node's adapter is directly cabled via a rather bulky copper
cable into a corresponding port on the switch board.
- There is much more to say about SNIs, but we'll leave that to the curious
to pursue in the previously mentioned (and other) IBM documentation.
Switch Application Performance:
- An application's communication performance over the switch is dependent
upon a complex mix of at least several factors:
- Node type
- Switch and switch adapter type
- Communications protocol used
- On-node vs. off-node communications
- Application communication patterns/characteristics
- Network tuning parameters
- Competing network traffic
- Hardware latency: in practical terms, the switch hardware latency is almost
negligible (about 59 nanoseconds per chip crossed)
when compared to the software latency involved in sending data.
Between any two nodes, hardware latency is in the range of hundreds of
nanoseconds.
- Software latency: comprises most of the delay in sending a message between
processes. To send MPI messages through the software stack over the switch
incurs a latency of ~5 microseconds.
- Theoretical peak bi-directional performance is 4 GB/sec per link.
GPFS Parallel File System
Overview:
- GPFS is IBM's General Parallel File System product.
- As with other LC production IBM systems, ASC Purple systems have at least
one parallel GPFS file system.
- "Looks and feels" like any other UNIX file system from a user's
perspective.
- Architecture:
- Most nodes in a system are application/compute nodes
where programs actually run. A subset of the system's nodes are
dedicated to serve as storage nodes for conducting
I/O activities between the compute nodes and physical disk. Storage
nodes are the interface to disk resources.
- For performance reasons, data transfer between the application nodes
and storage nodes typically occurs over the internal switch network.
- Individual files are stored as a series of "blocks" that are striped
across the disks of different storage nodes. This permits concurrent
access by a multi-task application when tasks read/write to different
segments of a common file.
- Internally, GPFS's file striping is set to a specific block
size that is configurable. At LC, the most efficient use of GPFS is
with large files. The use of many small files in a GPFS file system
is not advised if performance is important.
- IBM's implementation of MPI-IO routines depends upon an underlying GPFS
system to accomplish parallel I/O within MPI programs.
- GPFS Parallelism:
- Simultaneous reads/writes to non-overlapping regions of the same file
by multiple tasks
- Concurrent reads and writes to different files by multiple tasks
- I/O will be serialized if tasks attempt to use the same stripe of a file
simultaneously.
- Additional information:
http://www-1.ibm.com/servers/eserver/pseries/library/sp_books
LC Configuration Details:
- Naming scheme: /p/gM#/username where:
- M = one or two character abbreviation for the machine. For example,
"up".
- # = one digit number (1 or 2)
- username = your user name on that machine. Established automatically.
- Symbolic links allow for standardized naming in scripts, etc.:
/p/glocal1 links to /p/gM1
and /p/glocal2 links to /p/gM2
- Configurations:
- A machine may have more than one GPFS file system.
- Sizes of GPFS file systems vary between systems and change from time
to time; df -k will show the current configuration.
- GPFS file systems are not global; they are local to a specific system.
- At LC, GPFS file systems are configured optimally for use with large
data files.
- Temporary location:
- No backup
- Purge policies are in effect, since a full file system reduces
performance.
- Not reliable for long term storage
Note: This section represents a subset of the information available on LC's HPC
accounts web pages located at www.llnl.gov/computing/hpc/accounts. Please consult those
pages for forms and additional details.
How to Obtain an Account on uP and/or Purple:
- Getting an account on LC's ASC Purple systems is similar to obtaining
accounts on other LC ASC systems:
- The same procedures, authorizations, policies and considerations apply
- Accounts are handled by the LC Hotline
- Forms are available on the LC accounts web pages referenced above
- The flow chart below summarizes the general process for obtaining
unclassified (OCF) and classified (SCF) accounts.
- What happens after you submit your account application?
- After your account request has been processed, the LC Hotline will
notify you by email or phone on what to do next.
- For OCF accounts, you will also receive a One-Time Password (OTP)
token via US mail. Instructions on how to activate and use this
token are included with your account notification email.
- For SCF accounts, you will be asked to visit the LC Hotline in
person to set up your initial password. If you are not physically at
LLNL, as would be the case for SecureNet users, you will receive your
password and instructions via US mail.
- Other types of accounts may be required. For example:
- Non-Tri-lab offsite users will need to ensure that they
have an Internet Access Service account set up.
- Lab employees who wish to do unclassified work from home will need to
establish an Internet Access Service Account also.
- A SecureNet account is needed for non-Tri-lab classified use
- Questions? Contact the LC Hotline
Capability Computing:
- ASC Purple is designated as a Capability System for ASC stockpile
stewardship work. This means that resource allocation and scheduling
will promote large jobs performing ASC related work.
- Currently, a Tri-lab team is formulating a "Governance Model" that will
define how Purple resources will be allocated and scheduled. Although
this model is still under development, expected implementation is
in 3Q06.
- Stay tuned...
Note: This section represents a subset of the information available on LC's HPC
access web pages located at www.llnl.gov/computing/hpc/access. Please consult those
pages for additional details.
Summary:
- NOTE: Tri-lab (Sandia, Los Alamos) access to Purple machines is treated
differently than described here. See the discussion below on "Tri-lab
Login Exceptions".
- Accessing LC's Purple systems is similar to accessing other LC systems.
- You must have a valid account setup by the LC Hotline for the machine you
wish to access.
- Unclassified access to uP requires SSH (version 2) to
up.llnl.gov and One Time Password (OTP)
authentication generated by an LC OTP token. Access is via the LLNL
unclassified network or remotely over the Internet.
- LLNL classified access to Purple requires SSH (version 2) to
purple.llnl.gov and either OTP or DCE password
authentication. Access is via the LLNL classified network or
remotely over SecureNet.
- An Internet Access Service account may also be required for remote access
(external to the LLNL network) to uP and other unclassified resources
(discussed later).
One Time Passwords (OTP):
- Are single-use passwords that are mandatory on all OCF machines including
unclassified Purple. They are currently optional on classified Purple
and all other SCF machines.
- Based upon a "two factor" authentication:
- 4-8 character alphanumeric, static DCE PIN for every user
- 6-digit random number generated by an RSA SecurID token
device (similar to a cryptocard, but smaller and more durable).
- OTP is also in effect for other OCF services besides logins:
- Access to internal web pages
- Internet Access Services such as VPN
- Under certain circumstances, the OTP server and your token may get out
of sync. In such cases it is necessary to enter two consecutive token
codes so the server can resynchronize itself. This can be done via
the "Token Diagnostics" link on the OTP home page listed below.
- OTP home page:
https://access.llnl.gov/otp
DCE Passwords:
- One static password for all SCF machines. No longer
available on unclassified systems and will eventually be phased out on
the SCF.
- DCE passwords must be compliant with DOE Guide
205.3 which requires that passwords do not have common names or dictionary
words of 4 or more characters in them, spelled forward or backwards.
- Expires every six months. Notifications of expiring DCE passwords appear
in the login message and are sent via SCF email.
- Passwords can be changed on the classified web:
https://lc.llnl.gov/bin/passwd
- Lockouts can occur when the password is entered incorrectly too many
times. The lock is released after 15 minutes. If multiple lockouts occur,
then your account may be permanently locked.
- Must obtain new password from LC Hotline (walk-in or certified mail)
once your password has expired or you become permanently locked out.
Login Nodes:
Tri-lab Login Exceptions:
SSH:
- Secure Shell (SSH) is required for access between all LC systems, whether
you are internal to LC or external, whether you are on the OCF or the SCF.
- SSH version 1 is no longer supported at LC. SSH clients attempting to access
LC systems should be version 2 compatible.
- ASC Purple systems and all other LC HPC systems use OpenSSH software, which
is compatible with most other common SSH software.
- OpenSSH supports both RSA and DSA key authentication for extra security via
a passphrase, or the convenience of passwordless access.
Internet Access Services:
- LLNL's Open LabNet offers several different remote access services for
unclassified internal/restricted resources. Using one of these services is
usually required if you are coming from a non-LLNL site, and requires
setting up an account first.
- The types of services available depend upon whether or not you are an LLNL
employee, or coming from Sandia or LANL.
- LLNL Employee:
- VPN: Virtual Private Network (preferred)
- OTS: Open Terminal Server (dial up)
- WPS: Web Proxy Service (web pages only)
- ISDN: Integrated Services Digital Network
- IPA: IP Port Allow Service (currently limited/restricted)
- Non-LLNL Employee (Alliances, LLNL Collaborators, Others):
- VPN-C: Virtual Private Network for collaborators (preferred)
- WPS-C: Web Proxy Service for collaborators (web pages only)
- IPA: IP Port Allow Service (currently limited/restricted)
- VIP: Vouch IP Access (uncommon)
- Sandia or Los Alamos User:
- Internet Access Service account not required if coming from a
Sandia/LANL internal, restricted (yellow) network.
- Other types of access will require a non-LLNL employee account
- For details about any/all of these services, including setting up an
account, see
access.llnl.gov.
Web Page Access:
- The majority of LC's user oriented web pages are publicly available
without restriction over the Internet. Accessing these pages does not
require any special account or password authentication. These pages are
on LLNL's unrestricted ("green") network.
- Web pages on LLNL's unrestricted network have been approved for public
access after passing through a Review and Release process and receiving
a UCRL number.
- However, some user web pages are considered "internal" and may only be
viewed by those who have the necessary authentication. These pages are
on the restricted (yellow) network.
- Pages on the restricted network may have vendor confidential information,
site confidential information, or just simply have no general interest to
non-LLNL people, and have not gone through the Review and Release process.
- Accessing restricted web pages requires being on-site at LLNL, or having
an appropriate Internet Service Account such as VPN, VPN-C or OTS.
- When attempting to access an internal web page, you will typically see
a rerouting message and password dialog box, such as shown at right.
SecureNet:
- SecureNet is the network that provides access between classified
systems at DOE national laboratories and facilities:
- LLNL
- LANL
- Sandia (New Mexico)
- Sandia (California)
- Honeywell Kansas City Plant
- Pantex Plant
- Westinghouse Savannah River Site
- Y-12 National Security Complex
- Purple and all other LC classified systems must be accessed over
SecureNet from non-LLNL systems.
- Non-Tri-Lab users who wish to access LLNL classified
resources require a SecureNet account in addition to an SCF account.
- For more information on SecureNet, including an account application form,
see www.llnl.gov/computing/securenet_info.html
This section briefly covers a number of topics that will be of interest to users
who are new to LC's HPC environment and Purple systems in particular. Existing
LC users will already be familiar with most of these topics. Additional details
can be found by searching LC's computing web pages at
www.llnl.gov/computing
and also by consulting the
LC Resources tutorial.
Topics covered include:
- Login Files
- Home Directories
- Temporary File Systems
- Archival Storage
- File Transfer and Sharing
- File Interchange System (FIS)
- Mail
Login Files:
Home Directories:
- Scheme: /g/g#/user_name
- Global - one home directory file system is shared by all OCF hosts.
Another one is shared by all SCF hosts.
- Best user file space:
- Backed up regularly
- Safe from purge
- Automatic on-line backups
- NFS mounted - access is slower than local or parallel file systems
- Quota in effect - currently 16 GB per user. Use the quota -v
command to check.
- Recommended use: executables and source code.
- Not recommended for parallel I/O - can easily hurt access to home
directories by other users. Can also cause the NFS server to crash.
- Automatically backed up twice a day into your hidden
.snapshot subdirectory:
- It is not listed by the ls command but you can cd to .snapshot
- Contains multiple subdirectories called hourly.#
- Each contains a full backup from the past 48 hours
- hourly.0 is most recent, hourly.1 second most, etc.
- Can use cp to copy any .snapshot file to another directory, as shown in
the example below
- .snapshot is a read-only directory
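- For example, a minimal sketch (the file name is hypothetical) of recovering
an older copy of a file from a .snapshot backup:
    cd ~/.snapshot/hourly.0            # most recent hourly backup
    cp myprog.c ~/myprog.c.recovered   # copy the older version to a writable location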
Temporary File Systems:
- Temporary file systems are NOT backed up and are subject to purging. Not
recommended for long term storage. Data can also be lost due to hardware
problems.
- Purge policies:
- Vary by machine and are subject to change
- See news PurgePolicy.ibm for details
- Only GPFS file systems should be used for parallel I/O.
- Several temporary file systems are available. These are described below.
- /nfs/tmpN
- N = 0, 1, 2, ...
- Global temporary file systems on both the OCF and SCF
- Shared by all users
- Size varies but is large - in the multi-TB range
- NFS mounted - do NOT use for parallel I/O as the NFS server may hang
- Quotas:
If tmpN is < 40 TB the quota is 100 GB per user
If tmpN is >= 40 TB the quota is 400 GB per user
- /var/tmp and /usr/tmp
- Same file system - /usr/tmp is a link to /var/tmp
- Temporary file system local to each individual node.
- Faster than NFS
- Size is only in the GB range - much smaller than /nfs/tmpN
- /tmp
- Local to each individual node
- Very small and meant for system usage only
- GPFS parallel file systems
- GPFS is IBM's General Parallel File System product
- Purple systems mount at least one GPFS parallel file system
- "Look and feel" like any other UNIX file system from a user's
perspective
- Naming scheme: /p/gM#/username where:
- M = one or two character abbreviation for the machine, such as "up"
for uP or "p" for Purple
- # = one digit number (1 or 2)
- username = your user name on that machine. Established automatically
- Examples: /p/gup1/smith on uP. /p/gp1/jones on Purple.
- Only file systems recommended for parallel I/O
Archival Storage:
- High Performance Storage System (HPSS) is available on both the OCF and
SCF. Provides "virtually unlimited" tape archive storage with GigE
connectivity to all production clusters. It can also be accessed from
Tri-lab and other remote sites.
- Primary components:
- Server machines
- RAID disk cache
- Magnetic tape libraries
- Jumbo frame GigE network
- Storage account for each user. Most easily accessed by simply issuing the
command ftp storage from another LLNL machine (see the example below).
- Virtually unlimited storage = petabyte range. Both capacity and
performance are continually increasing to keep up with increased usage
over the years. Performance increase of 285x between 1999 and 2005:
9 MB/s vs. 2,573 MB/s aggregate throughput.
- No backup, no purge
- There are several different ways to access storage (discussed below):
- hopper GUI - replaces xftp and xdir
- pftp/ftp command - parallelized at LC
- pftp2 scripts for Tri-lab use
- htar command
- nft utility - persistent file transfer
- xftp (old)
- xdir (old)
- Additional information: see the
Data Storage Group (LLNL internal) web page.
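- For example, a hedged sketch of a simple storage session using ftp (the
directory and file names are hypothetical):
    ftp storage
    ftp> mkdir run42                   (create a directory in your storage area)
    ftp> cd run42
    ftp> put results.tar               (archive a local file)
    ftp> get old_results.tar           (retrieve a previously stored file)
    ftp> quit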
File Transfer:
File Sharing:
File Interchange System (FIS):
- Use of LC's File Interchange System (FIS) is required to move files
between the OCF and the SCF
- Requires that an FIS account be setup first - contact the LC Hotline
- To move file(s) from the OCF to the SCF:
- ftp fis from any OCF host
- login using OCF OTP password
- cd TO, then use ftp-type commands to put file(s) (see the example
session below)
- To retrieve file on SCF side:
- It will take 30-60 minutes for file to be read onto SCF side
- ftp fis from any SCF host
- login using SCF DCE password
- cd FROM, then use ftp-type commands to get file(s)
- To move file(s) from SCF to OCF:
- Requires review by Authorized Derivative Classifier (ADC) from your
department/program (see
https://www-r.llnl.gov/class_off to find yours).
- Notes: Be sure to complete your transfer on the FROM side, as files are
periodically purged from the TO and FROM directories. 2 GB file size
limit.
- More information:
http://www.llnl.gov/LCdocs/fis/.
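- Example: a sketch of moving a single (hypothetical) file from the OCF to
the SCF using the steps above:
    On the OCF:
    ftp fis                            (login with your OCF OTP password)
    ftp> cd TO
    ftp> put mydata.tar
    ftp> quit
    30-60 minutes later, on the SCF:
    ftp fis                            (login with your SCF DCE password)
    ftp> cd FROM
    ftp> get mydata.tar
    ftp> quit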
Mail:
- Local mail options should be utilized - please don't use LC production
hosts for your e-mail.
- Use $HOME/.forward file to direct your e-mail to your local machine:
- Example: .forward contains: jsmith@mymachine.llnl.gov
- May miss system & job e-mail if you don't
- SCF e-mail:
- SCF users have an automatic POP email account which is accessible
from all SCF machines as username@pop.llnl.gov.
- SCF POP server e-mail can be read with fetchmail, mailx, Eudora or
Netscape.
- Note that you cannot forward mail directly between SCF systems.
Help and Documentation:
- LC Hotline
- Walk-in, phone and email assistance weekdays 7:30am - 4:45pm
- Location: Building 453, Room 1103. (Q-clearance area)
- Phone: 925-422-4531 (Main number)
- OCF email: lc-hotline@llnl.gov
- SCF email: lc-hotline@pop.scf.cln
- LC Home Page
- Portal to most of LC's user information and documentation
- OCF: www.llnl.gov/computing
- SCF: https://lc.llnl.gov/computing
- Time Critical Information
- Login messages
- News items
- Technical bulletins
- Machine status email lists
- LC Home Page "Important Notices and News" section
- Miscellaneous
- /usr/local/doc - archive of files covering a wide range of topics.
Note that some files may be out of date.
- /gadmin/docs - another archive of files covering a wide range of
topics; postscript versions of LC manuals
- LC User Meetings - Agenda and viewgraphs available in
"Important Notices and News" section of the LC Home Page.
Software and Development Environment
The software and development environment for ASC Purple systems is very similar
to that shared by other LC systems. Topics relevant to Purple are discussed
below. For more information about topics shared by all LC systems, see the
Introduction to LC Resources tutorial
and search the LC Home
Page.
AIX Operating System:
- As with all other LC IBM POWER systems, Purple systems
run IBM's AIX operating system. AIX is IBM's proprietary version of UNIX.
- Every node runs under a single copy of the AIX OS, which is
threaded for all CPUs.
- Beginning with POWER5 and AIX 5.3, simultaneous multi-threading is
supported.
Micro-partitioning (multiple operating systems on a single processor)
is also supported.
- AIX product information and documentation are available from IBM at
www-03.ibm.com/servers/aix
Parallel Environment:
- IBM's Parallel Environment is a collection of software tools and
libraries designed for developing, executing, debugging and profiling
parallel C, C++ and Fortran applications on POWER systems running AIX.
- The Parallel Environment consists of:
- Parallel Operating Environment (POE) software for submitting and
managing jobs
- IBM's MPI library
- A parallel debugger (pdbx) for debugging parallel programs
- Parallel utilities for simplified file manipulation
- PE Benchmarker performance analysis toolset
- Parallel Environment documentation can be found in
IBM's Parallel Environment manuals.
Parallel Environment topics are also discussed in the
POE section below.
Compilers:
IBM Math Libraries:
- ESSL - IBM's Engineering Scientific Subroutine Library.
See IBM's online
ESSL manuals for information.
- PESSL - IBM's Parallel Engineering Scientific Subroutine Library.
A subset of ESSL that has been parallelized. Documentation is located with
ESSL documentation mentioned above.
- MASS - Math Acceleration Subsystem. High performance versions of
most math intrinsic functions. Scalar versions and vector versions. See
/usr/local/lpp/mass or search IBM's web pages for more information.
Batch System:
- LCRM - Livermore Computing Resource Management system - LC's
cross-platform batch system. Covered in depth in the
LCRM Tutorial.
- LoadLeveler - IBM's native batch system. LoadLeveler is used as the
native scheduling system on LC's non-Purple IBM systems.
- SLURM - LC's Simple Linux Utility for Resource Management.
Originally developed for LC's Linux systems, but has now been ported to
AIX for use on the Purple systems. Replaces the function of LoadLeveler.
More information available at: www.llnl.gov/linux/slurm.
Software Tools:
- In addition to compilers, LC's Development Environment Group (DEG)
supports a wide variety of software tools including:
- Debuggers
- Memory tools
- Profilers
- Tracing and instrumentation tools
- Correctness tools
- Performance analysis tools
- Various utilities
- Most of these tools are simply listed below. For detailed information,
see LC's software and development environment web pages.
- Debugging/Memory Tools:
- TotalView
- dbx
- pdbx
- gdb
- decor
- Tracing, Profiling, Performance Analysis and Other Tools:
- prof
- gprof
- PE Benchmarker
- IBM HPC Toolkit
- TAU
- VampirGuideView (VGV)
- Paraver
- mpiP
- Xprofiler
- mpi_trace
- PAPI
- PMAPI
- Jumpshot
- Dimemas
- Assure
- Umpire
- DPCL
Video and Graphics Services:
- LC's Information Management and Graphics Group (IMGG) provides a range of
visualization hardware, software and services including:
- Parallel visualization clusters
- PowerWalls
- Video production
- Consulting for scientific visualization issues
- Installation and support of visualization and graphics software
- Support for LLNL Access Grid nodes
- Contacts and more information:
Parallel Operating Environment (POE) Overview
Most of what you'll do on any parallel IBM AIX POWER system will be under IBM's Parallel Operating Environment (POE) software. This section provides a quick
overview. Other sections provide the details for actually using POE.
PE vs POE:
- IBM's Parallel Environment (PE) software product encompasses a
collection of software tools designed to provide a complete
environment for developing, executing, debugging and profiling
parallel C, C++ and Fortran programs.
- As previously mentioned, PE's primary components include:
- Parallel compiler scripts
- Facilities to manage your parallel execution environment (environment
variables and command line flags)
- Message Passing Interface (MPI) library
- Low-level API (LAPI) communication library
- Parallel file management utilities
- Authentication utilities
- pdbx parallel debugger
- PE Benchmarker performance analysis toolset
- Technically, the Parallel Operating Environment (POE) is a subset
of PE that actually contains the majority of the PE product.
- However, to the user this distinction is not really necessary
and probably serves more to confuse than enlighten. Consequently,
this tutorial will consider PE and POE synonymous.
Types of Parallelism Supported:
- POE is primarily designed for process level (MPI) parallelism, but fully
supports threaded and hybrid (MPI + threads) parallel programs also.
- Process level MPI parallelism is directly managed by POE from
compilation through execution.
- Thread level parallelism is "handed off" to the compiler, threads
library and OS.
- For hybrid programs, POE manages the MPI tasks, and
lets the compiler, threads library and OS manage the threads.
- POE fully supports the Single Program Multiple Data (SPMD) and
Multiple Program Multiple Data (MPMD) models for parallel programming.
- For more information about parallel programming, MPI, OpenMP and
POSIX threads, see the tutorials listed on the
LC Training web page.
Interactive and Batch:
- POE can be used both interactively and within a batch scheduler
system to compile, load and run parallel jobs.
- There are many similarities between interactive and batch POE usage.
There are also important differences. These will
be pointed out later as appropriate.
Typical Usage Progression:
- The typical progression of steps for POE usage is outlined below (a
brief command sketch follows the list) and discussed in more detail in
following sections.
- Understand your system's configuration (always changing?)
- Establish POE authorization on all nodes that you will use (one-time
event for some. Not even required at LC.)
- Compile and link the program using one of the POE parallel
compiler scripts. Best to do this on the actual platform you want
to run on.
- Set up your execution environment by setting the necessary POE
environment variables. Of course, depending upon your application,
and whether you are running interactively or batch, you may need
to do a lot more than this. But we're only talking about POE here...
- Invoke the executable (with or w/o POE options)
and watch it hum! (hopefully)
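- A minimal sketch of this progression on a generic POE system (csh syntax;
the program name is hypothetical, and LC-specific interactive/batch details
are covered in later sections):
    mpcc_r -O2 -o myprog myprog.c      # compile/link with a POE compiler script
    setenv MP_PROCS 4                  # request 4 parallel tasks
    poe ./myprog                       # POE starts the tasks on your partition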
A Few Miscellaneous Words About POE:
- POE is unique to the IBM AIX environment. It runs only on the IBM POWER
platforms under AIX.
- Much of what POE does is designed to be transparent to the user.
Some of these tasks include:
- Linking to the necessary parallel libraries during compilation (via
parallel compiler scripts)
- Finding and acquiring requested machine resources for your parallel job
- Loading and starting parallel tasks
- Handling all stdin, stderr and stdout for each parallel task
- Signal handling for parallel jobs
- Providing parallel communications support
- Managing the use of processor and network resources
- Retrieving system and job status information
- Error detection and reporting
- Providing support for run-time profiling and analysis tools
- POE can also be used to run serial jobs and shell commands concurrently
across a network of machines. For example, issuing the command
poe hostname
will cause each machine in your partition
to tell you its name. Run just about any other shell command or
serial job under poe and it will work the same way.
- POE limits (number of tasks, message sizes, etc.) can be found in the
IBM Parallel Environment "MPI Programming Guide" manual (see the chapter
on Limits).
Some POE Terminology:
Before learning how to use POE, understanding some basic definitions may
be useful. Note that some of these terms are common to parallel programming
in general while others are unique or tailored to POE.
- Node
- Within POE, a node usually refers to single machine, running
its own copy of the AIX operating system. A node has a unique network
name/address. All current model IBM nodes are SMPs (next).
- SMP
- Symmetric Multi-Processor. A computer (single machine/node) with
multiple CPUs that share a common memory. Different types of SMP nodes may
vary in the number of CPUs they possess and the manner in which the
shared memory is accessed.
- Process / Task
- Under POE, an executable (a.out) that may be scheduled to run
by AIX on any available physical processor as a UNIX process is
considered a task. Task and process are synonymous. For MPI applications,
each MPI process is referred to as a "task" with a unique identifier
starting at zero up to the number of processes minus one.
- Job
- A job refers to the entire parallel application and typically consists
of multiple processes/tasks.
- Interprocess
- Between different processes/tasks. For example, interprocess
communications can refer to the exchange of data between different
MPI tasks executing on different physical processors. The processors
can be on the same node (SMP) or on different nodes, but with POE, are
always part of the same job.
- Pool
- A pool is an arbitrary collection of nodes assigned by system managers.
Pools are typically used to
separate nodes into disjoint groups, each of which is used for specific
purposes. For example, on a given system, some nodes may be designated
as "login" nodes, while others are reserved for "batch" or "testing"
use only.
- Partition
- The group of nodes used to run a parallel job is
called a partition. Across a system, there is one discrete
partition for each user's job. Typically, the nodes in a
partition are used exclusively by a single user for the
duration of a job. (Technically though, POE allows multiple users to
share a partition, but in practice, this is not common, for obvious
reasons.) After a job completes, the nodes may be
allocated for other users' partitions.
- Partition Manager
- The Partition Manager, also known as the poe daemon,
is a process that is automatically started for each parallel job.
The Partition Manager is responsible for overseeing the parallel
execution of the job by communicating with daemon processes on each node
in the partition and with the system scheduler. It operates transparently
to the user and terminates after the job completes.
- Home Node / Remote Node
- The home node is the node where the parallel job is initiated
and where the Partition Manager process lives.
The home node may or may not be considered part of your partition
depending upon how the system is configured, interactive vs. batch, etc.
A Remote Node is any other node in your partition.
Note: Because of LC's use of the SLURM scheduler, there are some differences
from what is shown above. Specifically, SLURM daemons are part of the process.
Compilers and Compiler Scripts:
- In IBM's Parallel Environment, there are a number of compiler invocation
commands, depending upon what you want to do. However, underlying all
of these commands are the same AIX C/C++ and Fortran compilers.
- The POE parallel compiler commands are actually scripts that
automatically link to the necessary Parallel Environment libraries,
include files, etc. and then call the appropriate native AIX compiler.
- For the most part, the native IBM compilers and their parallel
compiler scripts support a common command line syntax.
- See the References and More Information section
for links to IBM compiler documentation. Versions change frequently and
downloading the relevant documentation from IBM is probably the best
source of information for the version of compiler you are using.
Compiler Syntax:
[compiler] [options] [source_files]
For example:
mpxlf -g -O3 -qlistopt -o myprog myprog.f
Common Compiler Invocation Commands:
- Note that all of the IBM compiler invocation commands are not shown.
Other compiler commands are available to select IBM compiler extensions
and features. Consult the appropriate IBM compiler man page and compiler
manuals for details. Man pages are linked below for convenience.
- Also note that, since POE version 4, all parallel compiler
commands actually use the _r (thread-safe) version of the command. In
other words, even though you compile with mpxlc, you will really get the
mpxlc_r thread-safe version of the compiler. Example compile lines appear
after the listing below.
IBM Compiler Invocation Commands:
- Serial:
  - xlc: ANSI C compiler
  - cc: Extended C compiler (not strict ANSI)
  - xlC: C++ compiler
  - xlf / f77: Extended Fortran, Fortran 77 compatible (f77 is an xlf alias)
  - xlf90 / f90: Full Fortran 90 with IBM extensions (f90 is an xlf90 alias)
  - xlf95 / f95: Full Fortran 95 with IBM extensions (f95 is an xlf95 alias)
- Threads (OpenMP, Pthreads, IBM threads):
  - xlc_r / cc_r: xlc / cc for use with threaded programs
  - xlC_r: xlC for use with threaded programs
  - xlf_r: xlf for use with threaded programs
  - xlf90_r: xlf90 for use with threaded programs
  - xlf95_r: xlf95 for use with threaded programs
- MPI:
  - mpxlc / mpcc: Parallel xlc / cc compiler scripts
  - mpCC: Parallel xlC compiler script
  - mpxlf: Parallel xlf compiler script
  - mpxlf90: Parallel xlf90 compiler script
  - mpxlf95: Parallel xlf95 compiler script
- MPI with Threads (OpenMP, Pthreads, IBM threads):
  - mpxlc_r / mpcc_r: Parallel xlc / cc compiler scripts for hybrid MPI/threads programs
  - mpCC_r: Parallel xlC compiler script for hybrid MPI/threads programs
  - mpxlf_r: Parallel xlf compiler script for hybrid MPI/threads programs
  - mpxlf90_r: Parallel xlf90 compiler script for hybrid MPI/threads programs
  - mpxlf95_r: Parallel xlf95 compiler script for hybrid MPI/threads programs
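- For example, hedged compile lines using the commands above (the source
file names are hypothetical):
    mpcc_r -O2 -q64 -o mpi_app mpi_app.c                # MPI C program
    mpxlf90_r -qsmp=omp -O2 -q64 -o hybrid hybrid.f90   # hybrid MPI + OpenMP Fortran 90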
Compiler Options:
- IBM compilers include many options - too
numerous to be covered here. For a full discussion, consult the IBM
compiler documentation. An abbreviated summary of some common/useful
options are listed in the table below.
- -blpdata: Enable executable for large pages. Linker option.
- -c: Compile only, producing a ".o" file. Does not link object files.
- -g: Produce information required by debuggers and some profiler tools.
- -I: Names directories for additional include files.
- -L: Specifies pathnames where additional libraries reside. Directories
will be searched in the order of their occurrence on the command line.
- -l: Names additional libraries to be searched.
- -O -O2 -O3 -O4 -O5: Various levels of optimization. See discussion below.
- -o: Specifies the name of the executable (a.out by default).
- -p -pg: Generate profiling support code. -p is required for use with
the prof utility and -pg is required for use with the gprof utility.
- -q32, -q64: Specifies generation of 32-bit or 64-bit objects. See
discussion below.
- -qflttrap=enable:options: Generates instructions to detect and trap
run-time floating-point exceptions. The available options (separated by a
colon) are: overflow, underflow, zerodivide, invalid, inexact, enable,
imprecise, nanq and nonanq. Note that the keyword "enable" is required in
addition to the options selected. Default is off for all exception trapping.
- -qhot: Determines whether or not to perform high-order transformations
on loops and array language during optimization, and whether or not to pad
array dimensions and data objects to avoid cache misses.
- -qipa: Specifies interprocedural analysis optimizations.
- -qarch=arch, -qtune=arch: Permits maximum optimization for the SP
processor architecture being used. Can improve performance at the expense
of portability. It's probably best to use auto and let the compiler
optimize for the platform where you actually compile. See the man page for
other options.
- -qautodbl=setting: Automatic conversion of single precision to double
precision, or double precision to extended precision. See the man page for
correct setting options.
- -qreport: Displays information about loop transformations if -qhot or
-qsmp are used.
- -qsmp=omp: Specifies OpenMP compilation.
- -qstrict: Turns off aggressive optimizations which have the potential to
alter the semantics of a user's program.
- -qlist, -qlistopt, -qsource, -qxref: Compiler listing/reporting options.
-qlistopt may be of use if you want to know the setting of ALL options.
- -qwarn64: Aids in porting code from a 32-bit environment to a 64-bit
environment by detecting the truncation of an 8 byte integer to 4 bytes.
Statements which may cause problems will be identified through
informational messages.
- -v -V: Display verbose information about the compilation.
- -w: Suppress informational, language-level, and warning messages.
- -bmaxdata:bytes: Historical. This is actually a loader (ld) flag required
for use on 32-bit objects that exceed the default data segment size, which
is only 256 MB, regardless of the machine's actual memory. At LC, this
option would not normally be used because all of its SP systems are now
64-bit since the retirement of the ASC Blue systems. Codes that link to old
libraries compiled in 32-bit mode may still need this option, however.
32-bit versus 64-bit:
- In the past, LC operated the 32-bit ASC Blue machines (now
retired). When the 64-bit ASC White machines arrived, this created two
different environments for IBM executables.
- Because 32-bit and 64-bit executables are incompatible, users needed to
be aware of the compilation mode for all files in their application. An
executable had to be entirely 32-bit or entirely 64-bit.
- At LC, POWER3 compilers continue to default to 32-bit, but the POWER4 and
POWER5 machines default to 64-bit. Therefore, it is still possible for users
to mistakenly try to mix 32-bit and 64-bit executables. This usually isn't
much more than an inconvenience because the compiler/linker will complain.
- It is possible for users to build libraries that contain both 32-bit and
64-bit objects - to aid development on different platforms. The compiler
will choose the appropriate mode file at build time. Note that an
application must still be entirely 32-bit or 64-bit - not mixed.
See /usr/local/docs/AIX_32bit_64bit_libs for details.
- Recommendation: explicitly specify your compilations with either -q32 or
-q64 to avoid any problems encountered by accepting the defaults.
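- For example (hypothetical file names), compiling and linking explicitly in
64-bit mode:
    mpxlf90 -q64 -c solver.f90         # compile a 64-bit object file
    mpxlf90 -q64 -o solver solver.o    # link; all objects must also be 64-bit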
Optimization:
- Default is no optimization
- Without the correct -O option specified, the defaults for
-qarch and -qtune are not optimal!
Only -O4 and -O5 automatically select the best
architecture related optimizations.
- -O4 and -O5 can perform optimizations specific to L1
and L2 caches on a given platform. Use the -qlistopt flag
with either of these and then look at the listing file for this
information.
- Any level of optimization above -O2 can be aggressive and change
the semantics of your program, possibly reducing performance or causing
wrong results. You can use the -qstrict flag with the
higher levels of optimization to restrict semantic-changing optimizations.
- The compiler uses a default amount of memory to perform optimizations.
If it thinks it needs more memory to do a better job, you may get a
warning message about setting MAXMEM to a higher value. If you specify
-qmaxmem=-1 the compiler is free to use as much memory as it needs
for its optimization efforts.
- Optimizations may cause the compiler to relax conformance to the IEEE
Floating-Point Standard.
- Attempting to debug optimized code is generally not a good idea, since
optimization "rewrites" your code and the object file will not match
the source code.
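- As a hedged illustration (the file name is hypothetical), a fairly
aggressive but semantics-preserving compile line based on the options
discussed above:
    mpxlf -O3 -qstrict -qarch=auto -qtune=auto -qmaxmem=-1 -o sim sim.f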
Miscellaneous:
- Conformance to IEEE Standard Floating-Point Arithmetic: the IBM C/C++
and Fortran compilers "mostly" follow the standard, however, the
exceptions and discussions are too involved to cover here.
- According to the IBM product information, the latest C/C++ and Fortran
compilers support the OpenMP API version 2.5.
- All of the IBM compiler commands have default options, which can
be configured by a site's system administrators. It may be useful to
review the files /etc/*cfg* to learn exactly what the
defaults are for the system you're using.
- Static Linking and POE: POE executables that use MPI are
dynamically linked with the appropriate communications library at
run time. Beginning with POE version 4, there is no support for
building statically bound executables.
- The IBM C/C++ compilers automatically support POSIX threads - no
special compile flag(s) are needed. Additionally, the IBM Fortran
compiler provides an API and support for pthreads even though there
is no POSIX API standard for Fortran.
See the IBM Documentation - Really!
IBM's MPI Library:
- The only supported implementation of MPI on Purple machines is IBM's.
- MPICH is not supported because it is not thread-safe.
- IBM's MPI library conforms to the MPI-2 Standard, with the exception of the chapter on "Process Creation and Management", which is not implemented.
- The MPI library includes support for both 32-bit and 64-bit applications.
- Message passing is supported on the main thread and on user-created threads.
- The MPI library uses hidden AIX kernel threads as well as the users' threads to move data into and out of message buffers.
- IBM's MPI library is built upon IBM's LAPI, a low-level communication
protocol.
Usage Notes:
- IBM MPI compiler invocation commands are listed below.
Because only thread-safe libraries are used, the two commands on each
line are equivalent.
- mpcc, mpcc_r
- mpxlc, mpxlc_r
- mpCC, mpCC_r
- mpxlC, mpxlC_r
- mpxlf, mpxlf_r
- mpxlf90, mpxlf90_r
- mpxlf95, mpxlf95_r
- All MPI compiler commands are actually "scripts" that automatically link
in the necessary MPI libraries, include files, etc. and then call the
appropriate native AIX compiler (a simple build-and-run example follows
these notes).
- IBM's MPI behavior is largely determined by POE environment variable
settings. The defaults, according to IBM documentation, optimize for
one task per processor with User Space protocol and blocking communications.
Some applications that use non-blocking routines may benefit from changing
the default settings.
- LLNL sets several MPI related POE environment variables in /etc/environment.
- Documentation for the IBM implementation is available
from IBM.
- LC's MPI tutorial describes
how to create MPI programs.
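As a simple, hedged example (the program name is hypothetical), an MPI code might be built and run interactively like this:
# Build with one of the MPI compiler scripts listed above
mpxlc_r -q64 -O2 -o hello_mpi hello_mpi.c
# Run 4 tasks interactively in the pdebug pool (csh settings shown)
setenv MP_PROCS 4
setenv MP_RMPOOL pdebug
poe ./hello_mpi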
Programming and Performance Considerations:
- IBM's "MPI Programming Guide" manual lists a number of programming
considerations. These are only listed here to advise the reader.
- POE user limits
- Exit status
- POE job step function
- POE additions to the user executable
- Signal handlers
- Do not hard-code file descriptors
- Termination of a parallel job
- Do not run your program as root
- AIX function limitations
- Shell execution
|
- Do not rewind STDIN, STDOUT, or STDERR
- Do not match blocking and non-blocking collectives
- Passing string arguments to your program correctly
- POE argument limits
- Network tuning considerations
- Standard I/O requires special attention
- Reserved environment variables
- AIX message catalog considerations
- Language bindings
- Available virtual memory segments
|
- The "MPI Programming Guide"
also discusses several performance considerations.
Again these are listed here to advise the reader.
- Message transport mechanisms
- MPI point-to-point communications
- Polling and single thread considerations
- LAPI send side copy
- Striping
- Remote Direct Memory Access (RDMA) considerations
Running on Purple Systems
|
Important Differences
For those who are familiar with LC's other IBM systems, note that there are
a few very important differences between those systems and Purple systems.
These differences are briefly discussed here for visibility, and covered
in more detail later as needed.
Large Pages:
- Purple systems are configured for use with large pages.
- Large pages refer to 16 MB page size versus the standard AIX 4 KB page
size. They are intended to improve application memory-to-cpu
performance.
- Approximately 85% of a Purple compute node's memory is dedicated to
large pages. The remainder is available for standard pages.
- Applications need to be specifically enabled for Large Page use. It is
not automatic.
- If an application is NOT enabled for large pages, then it is very
possible that it will perform poorly and may even hang. The reason is
that it will only have the smaller amount of memory configured for
standard AIX pages, and will end up paging excessively.
- A full discussion is presented in the Large Pages
section of this tutorial.
SLURM:
- Unlike other LC IBM systems, Purple systems use LC's SLURM scheduler
instead of IBM's LoadLeveler.
The SLURM scheduler is also used on all of LC's production Linux clusters.
- SLURM significantly alters the way POE behaves, especially in batch.
- There are differences between the SLURM implementation on Purple systems
and the SLURM implementation on LC's Linux systems.
- SLURM on Purple systems is discussed in more detail in the
More On Slurm section of this tutorial.
RDMA
- RDMA = Remote Direct Memory Access. A data transfer protocol that
allows one process to directly access the memory of another process
over the network without involving the CPU.
- Purple systems are enabled for RDMA. Actually using RDMA, however, is up
to the user.
- RDMA on Purple machines is covered in the RDMA section
of this tutorial.
Simultaneous Multi-Threading (SMT)
- SMT is a combination of POWER5 hardware and AIX software that creates
and manages two independent instruction streams (threads) on the same
physical CPU.
- SMT makes one processor appear as two processors.
- The primary intent of SMT is to improve performance by overlapping
instructions so that idle execution units are used.
- Under AIX, both instruction streams share resources equally.
- Purple systems have been enabled for SMT.
- Users don't need to do anything specific to take advantage of SMT,
though there are a few considerations:
- Use no more than one MPI task per physical CPU. Regard the second (virtual)
CPU as available for auxiliary threads or system daemons.
- Some applications will note a marked improvement in performance.
For example, the LLNL benchmark codes sPPM and UMT2K both realized
a 20-22% performance gain (according to IBM).
- Some applications may experience a performance degradation.
- IBM documentation states that performance can range from -20% to +60%.
POE Co-Scheduler
- Under normal execution, every process will be interrupted periodically
in order to allow system daemons and other processes to use the CPU.
- On multi-CPU nodes, CPU interruptions are not synchronized. Furthermore,
CPU interruptions across the nodes of a system are not synchronized.
- For MPI programs, the non-synchronized CPU interruptions can significantly
affect performance, particularly for collective operations. Some tasks will
be executing while other tasks will be waiting for a CPU being used by
system processes.
- IBM has enabled POE to elevate user task priority and to force system
tasks into a common time slice. This is accomplished by the POE
co-scheduler daemon.
- Configuring and enabling the POE co-scheduler is performed by Purple
system administrators. Two primary components:
- /etc/poe.priority file specifies priority "class" configurations
- /etc/environment sets the MP_PRIORITY environment variable to the
desired priority class.
- Note that if you unset or change the setting of MP_PRIORITY you may
defeat the co-scheduler's purpose.
Running on Purple Systems
|
Understanding Your System Configuration
First Things First:
- Before building and running your parallel application, it is important
to know a few details regarding the system you intend to use.
This is especially important if you use multiple systems, as they will
be configured differently.
- Also, things at LC are in a continual state of flux. Machines change,
software changes, etc.
- Several information sources and simple configuration commands
are available for understanding LC's IBM systems.
System Configuration/Status Information:
- LC's Home Page - the High Performance Computing section:
- OCF: www.llnl.gov/computing
- SCF: lc.llnl.gov
- Important Notices and News - timely information for every system
- OCF Machine Status (LLNL internal) - shows if the machine is up or
down and then links to the following information for each machine:
- Message of the Day
- Announcements
- Load Information
- Machine Configuration (and other) information
- Job Limits
- Purge Policy
- Computing Resources - detailed machine configuration information
for all machines.
- When you login, be sure to check the login banner & news items
- Machine status email lists.
- Each machine has a status list which provides the most timely
status information and plans for upcoming system maintenance or
changes. For example:
up-status@llnl.gov
purple-status@llnl.gov
uv-status@llnl.gov
um-status@llnl.gov
...
- LC support initially adds people to the list. If you find you aren't on
a particular list (or want to get off one), use the usual majordomo
commands in an email sent to
Majordomo@lists.llnl.gov.
LC Configuration Commands:
- spjstat: Displays a summary of pool information followed
by a listing of all running jobs, one job per line.
Sample output, partially truncated, shown below.
up041% spjstat
Scheduling pool data:
--------------------------------------------------------
Pool Memory Cpus Nodes Usable Free Other traits
--------------------------------------------------------
pbatch 31616Mb 8 99 99 8
pdebug 31616Mb 8 2 1 1
systest 31616Mb 8 4 4 4
Running job data:
-------------------------------------------------------
Job ID User Name Nodes Pool Status
-------------------------------------------------------
11412 dael 6 pbatch Running
28420 hlbac 8 pbatch Running
28040 rtyg 6 pbatch Running
30243 kubii 16 pbatch Running
...
...
|
- sinfo: Displays a summary of the node/pool
configuration. Sample output below.
up041% sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
pbatch* up infinite 91 alloc up[001-016,021-036,042-082,089-106]
systest up 2:00:00 4 idle up[085-088]
pdebug up 2:00:00 1 idle up037
pdebug up 2:00:00 1 comp up040
pbatch* up infinite 8 idle up[017-020,083-084,107-108]
|
- squeue: Shows information about jobs located in the
SLURM scheduling queue.
up041% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
39854 pbatch R6015 ddkekel R 9:31:08 32 up[057-079,095-103]
36267 pbatch RSRM_2nd ew2edler R 9:03:32 30 up[009-016,025-036,080-084,104-108]
40813 pbatch B15S21-t ggady R 7:54:24 15 up[002-008,017-024]
41167 pbatch upFiller ugee R 4:53:24 1 up001
41257 pbatch R2L0S25 theou1 R 2:22:03 1 up054
41277 pbatch R1L0S25 theou1 R 2:21:51 1 up055
41106 pbatch Rch001b iert R 1:05:02 16 up[042-053,089-092]
|
- ju: Displays a summary of node availability
and usage within each pool. Note that while still available on Purple
machines, this command is a carry-over from previous LoadLeveler
(non-SLURM) IBM systems and may be discontinued in the future.
Sample output, partially truncated, shown below.
uv006% ju
Partition total down used avail cap Jobs
systest 4 0 0 4 0%
pdebug 2 0 1 1 50% degrt-2
pbatch 99 0 92 7 93% halbac-8, fdy-8, fkdd-8, kuba-16, dael-6
|
IBM Configuration Commands:
- Several standard IBM AIX and Parallel Environment commands that may prove
useful are shown below. See the man pages for usage details.
- lsdev: This command lists all of the available
physical devices (disk, memory, adapters, etc.)
on a single machine. Beyond providing the list, its most useful purpose
is to tell you the names of devices you can use with the lsattr
command for detailed information.
- lsattr: Allows you to obtain detailed information for
a specific device on a single machine. The catch is you need to
know the device name, and
for that, the previously mentioned lsdev command is used.
- lslpp -al | grep poe: The lslpp command is used
to check on installed software. If you grep the output for poe
it will show you which version of POE is installed on that machine.
- Note that the serial lsattr, lsdev, and lslpp commands
can be used in parallel simply by calling them via poe. For example,
if the following command was put in your batch script, it would show
the memory configuration for every node in your partition:
poe lsattr -El mem0
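A hedged illustration of how these commands fit together on a single node (the mem0 device name follows the example above):
lsdev -C | grep mem      # list devices; note the memory device name (e.g., mem0)
lsattr -El mem0          # detailed attributes for that device
lslpp -al | grep poe     # which POE filesets are installed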
Running on Purple Systems
|
Setting POE Environment Variables
In General:
- Application behavior under POE is very much determined by a number of POE
environment variables. They control important factors in how a program
runs.
- POE environment variables fall into several categories:
- Partition Manager control
- Job specification
- I/O control
- Diagnostic information generation
- MPI behavior
- Corefile generation
- Miscellaneous
- There are over 50 POE environment variables. Most of them also
have a corresponding command line flag that temporarily overrides the
variable's setting.
- A complete discussion and list of the POE environment variables and
their corresponding command line flags can be found in the Parallel
Environment Operation and Use Volume 1
manual. They can also be reviewed (in less detail) in the
POE man page.
- Different versions of POE software are not identical in the environment
variables they support. Things change.
 |
At LC, POE does not behave exactly as documented by IBM. This is mostly
due to LC's use of LCRM and SLURM. |
How to Set POE Environment Variables:
POE, LCRM and SLURM:
- LC's use of LCRM and SLURM significantly alters the way POE behaves.
LCRM affects batch jobs and SLURM affects both batch and interactive
jobs.
- Both LCRM and SLURM override or usurp several basic, required POE
environment variables. This is especially true for the LCRM batch
system.
- Because of this, knowing how to use POE at LC requires also knowing how
to use LCRM and SLURM.
- The LCRM batch system is covered in the
LCRM tutorial. SLURM usage for Purple
systems is covered in the More On SLURM section
of this tutorial, and elsewhere as relevant.
Basic Interactive POE Environment Variables:
- Although there are many POE environment variables, you really only need
to be familiar with a few basic ones at LC. This is because LC
automatically sets some of them, and LCRM and SLURM set or replace the
functionality of several.
- The basic environment variables all users need to be familiar with
are listed below.
Their corresponding command line flags are shown in parentheses.
- Note that their usage as shown here applies only to interactive jobs at LC.
- MP_PROCS (-procs)
- The total number of MPI processes/tasks for your parallel job.
May be used alone or
in conjunction with MP_NODES and/or MP_TASKS_PER_NODE to specify how many
tasks are loaded onto a physical node. The maximum value for
MP_PROCS is dependent upon the version of POE software installed. For
version 4.2 the limit is 8192 tasks. The default is 1.
- MP_NODES (-nodes)
- Specifies the number of physical nodes on which to run the parallel
tasks. May be used alone or in conjunction with MP_TASKS_PER_NODE
and/or MP_PROCS.
- MP_TASKS_PER_NODE (-tasks_per_node)
- Specifies the number of tasks to be run on each of the physical nodes.
May be used in conjunction with MP_NODES and/or MP_PROCS.
- MP_RMPOOL (-rmpool)
- Specifies the system pool where your job should run. For Purple
systems, the interactive pool is pdebug.
Example Basic Interactive Environment Variable Settings:
- The example below demonstrates how to set the basic POE environment
variables for an interactive job which will:
- Use 16 tasks on 2 nodes
- Allocate nodes automatically (non-specific allocation) from
a pool called "pdebug"
csh / tcsh |
ksh / bsh |
setenv MP_PROCS 16
setenv MP_NODES 2
setenv MP_RMPOOL pdebug
|
export MP_PROCS=16
export MP_NODES=2
export MP_RMPOOL=pdebug
|
Other Common/Useful POE Environment Variables
- A list of some commonly used and potentially useful POE environment
variables appears below.
A complete list of the POE environment variables can be
viewed quickly in the POE
man page. A much fuller discussion is available in the
Parallel Environment Operation and Use Volume 1
manual.
- These POE environment variables may be used both interactively and in
batch.
- For ASC Purple systems using LCRM and SLURM, some POE variables
(not shown) are ignored, such as MP_RETRY, MP_RETRYCOUNT and MP_EUIDEVICE.
Variable |
Description |
MP_SHARED_MEMORY |
Allows MPI programs with more than one task on a node to use shared
memory instead of the switch for communications. Can significantly
improve on-node communication bandwidth. Valid values are "yes" and
"no". Default is "yes" at LC. |
MP_LABELIO |
Determines whether or not output from the parallel tasks is labeled
by task id. Valid values are yes or no. The default is yes at LC. |
MP_PRINTENV |
Can be used to generate a report of your job's parallel environment
setup information, which may be useful for diagnostic purposes. The default
value is "no"; set it to "yes" to produce the report. The report goes to
stdout. MP_PRINTENV can also be set to the name of a user-specified script,
whose output is added to the report. |
MP_STATISTICS |
Allows you to obtain certain statistical information about your job's
communications. The default setting is "no". Set to "print" and the
statistics will appear on stdout after your job finishes. Note that
there may be a slight impact on your job's performance if you use
this feature. |
MP_INFOLEVEL |
Determines the level of message reporting. Default is 1. Valid values are:
0 = error
1 = warning and error
2 = informational, warning, and error
3 = informational, warning, and error. Also
reports diagnostic messages for use by the
IBM Support Center.
4,5,6 = Informational, warning, and error. Also
reports high- and low-level diagnostic
messages for use by the IBM Support Center.
|
MP_COREDIR
MP_COREFILE_FORMAT
MP_COREFILE_SIGTERM |
Allow you to control how, when and where core files are created. See
the POE man page and/or IBM documentation for details. Note that
LC currently sets MP_COREFILE_FORMAT to "core.light" by default,
which may or may not be what you want for debugging purposes. |
MP_STDOUTMODE |
Enables you to manage the STDOUT from your parallel tasks. If set to
"unordered" all tasks write output data to STDOUT asynchronously.
If set to "ordered" output data from each parallel task is written to
its own buffer. Later, all buffers are flushed in task order to
stdout. If a task id is specified, only the task indicated writes
output data to stdout. The default is unordered. Warning: use
"unordered" if your interactive program prompts for input - otherwise
your prompts may not appear. |
MP_SAVEHOSTFILE |
Specifies the file name where POE should record the hosts used by your job.
Can be used to "save" the names of the execution nodes. |
MP_CHILD |
Is an undocumented, "read-only" variable set by POE. Each task will
have this variable set to its unique task id (0 through MP_PROCS-1).
Can be queried in scripts or batch jobs to determine "who I am" - see
the example following this table. |
MP_PGMMODEL |
Determines the programming model you are using. Valid
values are "spmd" or "mpmd". The default is "spmd". If set to "mpmd"
you will be enabled to load different executables individually on the
nodes of your partition. |
MP_CMDFILE |
Is generally used when MP_PGMMODEL=mpmd, but doesn't have to be.
It specifies the name of a file that lists the commands that are to be
run by your job. Nodes are loaded with these commands in the order they
are listed in the file. If set, POE will read the commands file rather
than try to use STDIN - such as in a batch job. |
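As a hedged illustration of MP_CHILD from the table above, a small (hypothetical) per-task script might use it like this:
#!/bin/csh
# per_task.csh (hypothetical): each MPI task identifies itself and
# redirects its output to a file named by its task id
echo "I am task $MP_CHILD"
./a.out > output.$MP_CHILD
Launched with poe ./per_task.csh -procs 4, each of the four tasks would see its own MP_CHILD value (0 through 3).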
LLNL Preset POE Environment Variables:
- LC automatically sets several POE environment variables for all
users. In most cases, these are the "best" settings. For the most
current settings, check the /etc/environment file.
Note that these will vary by machine. An example is shown below.
# POE default environment variables
MP_COREFILE_SIGTERM=NO
MP_CPU_USE=unique
MP_EUILIB=us
MP_HOSTFILE=NULL
MP_LABELIO=yes
MP_RESD=yes
MP_SHARED_MEMORY=yes
....
# Set Poe Environment Variables
MP_COREFILE_FORMAT=core.light
MP_INFOLEVEL=1
MP_PRIORITY=normalprio
MP_PRIORITY_NTP=yes
MP_RMLIB=/admin/llnl/lib/slurm_ll_api.so
# Constrains AIX MPI tasks to physical CPUs
MP_S_POE_AFFINITY=YES
|
Running on Purple Systems
|
Invoking the Executable
Syntax:
Multiple Program Multiple Data (MPMD) Programs:
- By default, POE follows the Single Program Multiple Data parallel
programming model: all parallel tasks execute the same program but may use
different data.
- For some applications, parallel tasks may need to run different programs
as well as use different data. This parallel programming model is
called Multiple Program Multiple Data (MPMD).
- For MPMD programs, the following steps must be performed:
Interactive:
- Set the MP_PGMMODEL environment variable to "mpmd". For example:
setenv MP_PGMMODEL mpmd
export MP_PGMMODEL=mpmd
- Enter poe at the Unix prompt. You will then
be prompted to enter the executable which should be loaded on each
node. The example below
loads a "master" program on the first node and 4 "worker" tasks
on the remaining four nodes.
0:node1> master
1:node2> worker
2:node3> worker
3:node4> worker
4:node5> worker
- Execution starts automatically after the last node has been loaded.
- Note that if you don't want to type each command in line by line,
you can put the commands in a file, one per line, to match the
number of MPI tasks, and then set MP_CMDFILE to the name of that file.
POE will then read that file instead of prompting you to input each
executable.
Batch:
- Create a file which contains a list of the program names,
one per line, that must be loaded onto your nodes. There should be
one command per MPI task that you will be using.
- Set the MP_PGMMODEL environment variable to "mpmd" - usually done in
your batch submission script
- Set the environment variable MP_CMDFILE to the name of the file
you created in step 1 above - usually done in your batch submission
script also.
- When your application is invoked within the batch system, POE will
automatically load the nodes as specified by your file.
- Execution starts automatically after the last node has been loaded.
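A minimal sketch of those batch steps as they might appear in a job script (the command-file path is hypothetical):
# ~/myjob/cmdfile contains one program name per line, one line per MPI task,
# e.g. "master" on the first line followed by "worker" on each remaining line
setenv MP_PGMMODEL mpmd
setenv MP_CMDFILE ~/myjob/cmdfile
poe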
Using POE with Serial Programs:
POE Error Messages:
Running on Purple Systems
|
Monitoring Job Status
- POE does not provide any basic commands for monitoring your job.
- LC, however, provides several commands to accomplish this.
Some of these commands were previously discussed in the
Understanding Your System Configuration
section, as they serve more than one purpose.
- The most useful LC commands for monitoring your job's status are:
- spjstat / spj - running jobs
- squeue - SLURM command for running jobs
- pstat - LCRM command for both running and queued jobs
- ju - succinct display of running jobs
- Examples are shown below, some with truncated output for readability.
See the man pages for more information.
up041% spjstat
Scheduling pool data:
--------------------------------------------------------
Pool Memory Cpus Nodes Usable Free Other traits
--------------------------------------------------------
pbatch 31616Mb 8 99 99 8
pdebug 31616Mb 8 2 1 1
systest 31616Mb 8 4 4 4
Running job data:
-------------------------------------------------------
Job ID User Name Nodes Pool Status
-------------------------------------------------------
11412 dael 6 pbatch Running
28420 hlbac 8 pbatch Running
28040 rtyg 6 pbatch Running
30243 kubii 16 pbatch Running
34087 dhrrtel 32 pbatch Running
34433 gddy 3 pbatch Running…
...
...
up041% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
28420 pbatch uu4r elbac R 11:25:18 8 up[001-008]
28040 pbatch rr5t16b danl R 8:43:30 6 up[095-100]
30243 pbatch www34r14 jj4ota R 8:38:49 16 up[064-079]
28058 pbatch rr5t16c danl R 8:38:37 6 up[101-106]
33257 pbatch www34r14 jj4ota R 7:49:20 1 up036
33882 pbatch BH44T2-t fdy R 5:13:34 8 up[009-016]
34087 pbatch YY015 nekel R 4:46:09 32 up[029-035,042-047,054-063,080-082,089-094]
34433 pbatch ruunra6 ekay R 3:18:51 3 up[023-025]
34664 pbatch BHJRL01k fdy R 1:34:27 8 up[017-022,026-027]
34640 pbatch RUUE80 edeski R 54:24 2 up[083-084]
up041% pstat
JID NAME USER BANK STATUS EXEHOST CL
28033 33er16a danl a_phys *WPRIO up N
28040 phjje6b danl a_phys RUN up N
28420 rsr22erv10 haslr4 illinois RUN up N
29267 B7eee0ms fddt illinois *DEPEND up N
30243 wol8une409 wwiota a_cms RUN up N
33675 phrr33a qqw3el a_phys *DEPEND up N
34071 inhiyyrrr76 weertler illinois *WCPU up N
34087 RTT15 dqrtkel axicf RUN up N
34433 runwww6 robr a_engr RUN up N
34435 runwww6 robr a_engr *DEPEND up N
34640 RTTT80 lssgh axicf RUN up N
34653 RTT081 lssgh axicf *WPRIO up N
34661 B7eee0ms fddt illinois *DEPEND up N
34749 rsff450v10 haslr4 illinois *DEPEND up N
35221 nm-hhj99.inp cnbvvdy bdivp *WPRIO up N
...
...
up041% ju
Partition total down used avail cap Jobs
systest 4 0 0 4 0%
pdebug 2 0 0 2 0%
pbatch 99 0 90 9 91% hac-8, fdy-8, fdy-8, gky-3,
dhkel-32, kuta-1, kuta-16, lski-2, danl-6, danl-6
|
Running on Purple Systems
|
Interactive Job Specifics
The pdebug Interactive Pool/Partition:
Insufficient Resources:
Killing Interactive Jobs:
- You can use CTRL-C to terminate a running, interactive
POE job that has not been put in the background. POE will propagate the
termination signal to all tasks.
- The scancel command can also be used to kill an interactive
job:
up041% squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
18352 pbatch 4ye_Conv 4enollin R 1:00:37 1 up100
18354 pbatch 4ye_Conv yeioan R 59:47 1 up074
18378 pbatch 4ye_Conv yeioan R 48:14 1 up101
66004 pdebug poe blaise R 0:13 1 up037
up041% scancel 66004
up041% ERROR: 0031-250 task 3: Terminated
ERROR: 0031-250 task 2: Terminated
ERROR: 0031-250 task 0: Terminated
ERROR: 0031-250 task 1: Terminated
SLURMERROR: slurm_complete_job: Job/step already completed
[1] Exit 143 poe hangme
up041% |
- Another way to kill your interactive job is to kill the poe
process on the node where you started the job (usually your login node).
For example:
up041% ps
PID TTY TIME CMD
3223668 pts/68 0:00 ps
3952660 pts/68 0:00 -tcsh
4313092 pts/68 0:00 poe
up041% kill 4313092
up041% ERROR: 0031-250 task 2: Terminated
ERROR: 0031-250 task 0: Terminated
ERROR: 0031-250 task 3: Terminated
ERROR: 0031-250 task 1: Terminated
up041% |
- Yet another alternative is to use the poekill command to kill the
poe process. For example:
up041% poekill poe
Terminating process 4173960 program poe
ERROR: 0031-250 task 0: Terminated
ERROR: 0031-250 task 1: Terminated
ERROR: 0031-250 task 2: Terminated
ERROR: 0031-250 task 3: Terminated
up041% |
Running on Purple Systems
|
Batch Job Specifics
LCRM is used for running batch jobs on all LC production systems. LCRM is
covered in detail in the LCRM
Tutorial. This section only provides a quick summary of LCRM usage.
Submitting Batch Jobs:
- LC production systems allocate the majority of their nodes for batch use.
Batch nodes are configured into the pbatch pool/partition.
This is also the default pool for batch jobs.
- The first step in running a batch job is to create an LCRM job control
script. A sample job control script appears below.
# Sample LCRM script to be submitted with psub
#PSUB -c up # which machine to use
#PSUB -pool pbatch # which pool to use
#PSUB -r myjob # specify job name
#PSUB -tM 1:00 # set maximum total CPU time
#PSUB -b micphys # set bank account
#PSUB -ln 4 # use 4 nodes
#PSUB -g 16 # use 16 tasks
#PSUB -x # export current env var settings
#PSUB -o myjob.log # set output log name
#PSUB -e myjob.err # set error log name
#PSUB -nr # do not rerun job after system reboot
#PSUB -mb # send email at execution start
#PSUB -me # send email at execution finish
# no more psub commands
# job commands start here
set echo
setenv MP_INFOLEVEL 4
setenv MP_SAVEHOSTFILE myhosts
setenv MP_PRINTENV yes
echo LCRM job id = $PSUB_JOBID
cd ~/db/myjob
./my_mpiprog
rm -rf tempfiles
echo 'ALL DONE'
|
- Submit your LCRM job control script using the psub
command. For example, if the above script were named run.cmd,
it would be submitted as:
psub run.cmd
- You may then check your job's progress as discussed in the
Monitoring Job Status section above.
Quick Summary of Common LCRM Batch Commands:
Command |
Description |
psub |
Submits a job to LCRM |
pstat |
LCRM job status command |
prm |
Remove a running or queued job |
phold |
Place a queued job on hold |
prel |
Release a held job |
palter |
Modify job attributes (limited subset) |
lrmmgr |
Show host configuration information |
pshare |
Queries the LCRM database for bank share allocations, usage statistics,
and priorities. |
defbank |
Set default bank for interactive sessions |
newbank |
Change interactive session bank |
Batch Jobs and POE Environment Variables:
- Certain POE environment variables will affect batch jobs just as they
do interactive jobs. For example, MP_INFOLEVEL and
MP_PGMMODEL. These can be placed in your batch job command
script.
- However, other POE environment variables are ignored by the batch
scheduler for obvious reasons. For example, the
following POE variables will have no effect if used in a batch job
control script:
MP_ADAPTER_USE
MP_CPU_USE
MP_EUIDEVICE
MP_EUILIB
MP_HOSTFILE
MP_NODES
MP_PMDSUFFIX
|
MP_PROCS
MP_RESD
MP_RETRY
MP_RETRYCOUNT
MP_RMPOOL
MP_TASKS_PER_NODE
|
- Be aware that POE environment variables in your .login, .cshrc,
.profile, etc. files may also affect your batch job.
Logging Into Batch Nodes:
- Being able to login to a batch node is very useful for debugging
purposes.
- You are permitted to login to a batch node only when you already have a
batch job executing on that node.
- The squeue command can be used to determine which node(s)
your job is running on. For parallel jobs, the first node in the list is
where the master POE process runs. A brief example follows the note below.
 |
For serial and other non-MPI jobs, you will need to put a "dummy" POE
command in your batch script if you want to be able to login while it
is running on a batch node. Something as simple as poe true
or poe hostname will work. Put this command as the first
executable command in your job script.
|
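A hedged illustration of the sequence (the node name shown is illustrative):
squeue -u $USER     # find which node(s) your batch job is running on
ssh up042           # then login to one of those nodes (cross-cell users: use rsh)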
Killing Batch Jobs:
- The best way to kill batch jobs is to use the LCRM prm
command. For example:
% pstat
JID NAME USER BANK STATUS EXEHOST CL
42991 batch_run3 joe33 cs RUN up N
42999 batch_run4 joe33 cs *DEPEND up N
% prm 42991
remove running job 42991 (joe33, cs)? [y/n] y
% pstat
JID NAME USER BANK STATUS EXEHOST CL
42999 batch_run4 joe33 cs *DEPEND up N
|
- Note: Currently, users are advised to NOT use the SLURM
scancel command to kill batch jobs.
Running on Purple Systems
|
More On SLURM
SLURM and a few SLURM commands have been briefly discussed in
several places already in this tutorial. This section provides a concise
summary of useful/important SLURM information for Purple systems.
SLURM Architecture:
- LC production systems that provide batch usage all require running a
low-level native scheduler:
- Linux systems use SLURM
- Non-Purple IBM systems use LoadLeveler
- Purple systems use SLURM
- Compaq/HP systems use RMS
- LCRM "talks" to the low-level native scheduler on each system. This
insulates users from many (but not all) of the unique details associated
with the different native schedulers.
- SLURM is implemented with two daemons:
- slurmctld - central management daemon.
Monitors all other SLURM daemons and resources, accepts work
(jobs), and allocates resources to those jobs. Given the
critical functionality of slurmctld, there may be a backup
daemon to assume these functions in the event that the
primary daemon fails.
- slurmd - compute node daemon. Monitors all
tasks running on the compute node, accepts work (tasks),
launches tasks, and kills running tasks upon request.
SLURM Commands:
- SLURM provides six user-level commands. Note that Purple systems do
not support all six commands. See the corresponding man pages for
details.
SLURM Command |
Description |
Supported on Purple? |
scancel |
Cancel or signal a job |
INTERACTIVE ONLY |
scontrol |
Administration tool; configuration |
YES |
sinfo |
Reports general system information |
YES |
smap |
Displays an ASCII-graphical version of squeue |
YES |
squeue |
Reports job information |
YES |
srun |
Submits/initiates a job |
NO |
SLURM Environment Variables:
- The srun man page describes a
number of SLURM environment variables. However, under AIX, only a few
of these are supported (described below).
SLURM Environment Variable |
Description |
SLURM_JOBID |
Use with "echo" to display the jobid. |
SLURM_NETWORK |
Specifies switch and adapter settings such as communication protocol, RDMA
and number of adapter ports. Replaces the use
of the POE MP_EUILIB and MP_EUIDEVICE environment
variables. In most cases, users should not modify the default settings, but
if needed, they can. For example to run with IP protocol over a single
switch adapter port:
setenv SLURM_NETWORK ip,sn_single
The default setting is to use User Space protocol over both switch adapter
ports and to permit RDMA.
setenv SLURM_NETWORK us,bulk_xfer,sn_all |
SLURM_NNODES |
Specify the number of nodes to use. Currently not working
on Purple systems. |
SLURM_NPROCS |
Specify the number of processes to run. The default is one process
per node. |
Miscellaneous:
- A nice feature of SLURM is that it will permit batch jobs to
run with a number of tasks different than their original specification.
For example, the batch script below specifies 4 tasks with the
#PSUB -g setting, but then runs the same executable
three times with a different number of tasks each time.
#PSUB -c up
#PSUB -pool pbatch
#PSUB -eo
#PSUB -tM 15
#PSUB -ln 4
#PSUB -g 4
cd ~/myjobs
echo 'Running with 4 tasks'
./a.out
setenv SLURM_NPROCS 16
echo 'Running with 16 tasks'
./a.out
setenv SLURM_NPROCS 32
echo 'Running with 32 tasks'
./a.out
|
- It is often useful for debugging purposes to obtain a list of the machines
used to execute your job. This can be easily done by using the following
SLURM command and SLURM environment variable in your batch script:
squeue -j $SLURM_JOBID
Example output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12174 pbatch g1x560rr joeuser R 1:31:17 8 up[016-023]
|
Additional Information:
Running on Purple Systems
|
Optimizing CPU Usage
SMP Nodes:
- ASC Purple compute nodes are shared memory SMPs. Each SMP node has
eight active CPUs and is thus capable of running multiple tasks
simultaneously.
- Optimizing CPU usage on these nodes means using the available CPUs as
fully as possible.
Effectively Using Available CPUs:
When Not to Use All CPUs:
- For MPI codes that use OpenMP or Pthread threads, you probably do not
want to place an MPI task on each CPU, as the threads will need someplace
to run (see the sketch after this list).
- For tasks that use a substantial portion of a node's memory, you may
likewise not want to put a task on every CPU if it will lead to memory
exhaustion.
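For instance, a hedged interactive setup for a mixed MPI/OpenMP code on 8-CPU nodes (the task and thread counts are illustrative only):
setenv MP_NODES 2            # two 8-CPU nodes
setenv MP_TASKS_PER_NODE 4   # leave half the CPUs free for threads
setenv MP_PROCS 8            # 4 tasks/node x 2 nodes
setenv OMP_NUM_THREADS 2     # each MPI task runs 2 OpenMP threads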
Running on Purple Systems
|
Large Pages
Large Page Overview:
- IBM AIX page size has typically been 4KB. Beginning with POWER4 and AIX 5L
version 5.1, 16MB large page support was implemented.
- The primary purpose of large pages is to provide performance improvements
for memory intensive HPC applications. The performance improvement
results from:
- Reducing translation look-aside buffer (TLB) misses through mapping
more virtual memory into the TLB. TLB memory coverage for large pages is
16 GB vs. 4 MB for small pages.
- Improved memory prefetching by eliminating the need to restart prefetch
operations on 4KB boundaries. Large pages hold 131,072 cache lines vs.
32 cache lines for 4KB pages.
- AIX treats large pages as pinned memory - an application's data remains
in physical memory until the application completes. AIX does not provide
paging support for large pages.
- According to IBM, memory bandwidth can be increased up to 3x for some
applications when using large pages. In practice this may translate to
an overall application speedup of 5-20%.
- However, some applications may demonstrate a marked decrease in
performance with large pages:
- Short running applications (measured in minutes)
- Applications that perform fork() and exec() operations
- Shell scripts
- Compilers
- Graphics tools
- GUIs for other tools (such as TotalView)
- If large pages are exhausted, enabled applications silently fail over to
use small pages, with possible ramifications to performance. However,
the converse is not true: applications that use only small pages cannot
access large-page memory.
- Large page configuration is controlled by system managers. Using this
configuration is entirely up to the user. It is not automatic.
- More information: AIX Support For
Large Pages whitepaper.
Large Pages and Purple:
- Purple systems are configured to allocate the maximum AIX permitted
amount (85%) of a machine's memory for large pages. This means that there
is a relatively small amount of memory available for regular 4KB pages.
- IMPORTANT: Because LC has allocated most of memory for large pages,
applications which aren't enabled for large pages will default to using
the limited 4KB page pool. It is quite likely in such cases that excessive
paging will occur and the job will have to be terminated to prevent
it from hanging or crashing the system.
How to Enable Large Pages:
- As mentioned, even though a system has large pages configured, making
use of them is up to the user.
- Use any of three ways to enable an application for large page use:
- At build time: link with the -blpdata flag.
Recommended.
- After build: use the ldedit -blpdata executable
command on your executable. Recommended.
- At runtime: set the LDR_CNTRL environment variable to
LARGE_PAGE_DATA=Y. For example:
setenv LDR_CNTRL LARGE_PAGE_DATA=Y
Note that if you forget to unset this environment variable after your
application runs, it will affect all other tasks in your
login session. Routine, non-application tasks will probably be
very slow, so this method is NOT recommended for interactive sessions
where you are using other tools/utilities.
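A hedged summary of the three methods (the executable name myprog is hypothetical):
# 1. At build time (recommended)
xlf90_r -q64 -O3 -blpdata -o myprog myprog.f90
# 2. After the build (recommended)
ldedit -blpdata myprog
# 3. At runtime only - affects every process started in the session afterwards
setenv LDR_CNTRL LARGE_PAGE_DATA=Y
./myprog
unsetenv LDR_CNTRL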
When NOT to Use Large Pages:
- In most cases, large pages should not be used for non-application tasks
such as editing, compiling, running scripts, debugging, using
GUIs or running non-application tools. Using large pages for these tasks
will cause them to perform poorly in most cases.
- Using the LDR_CNTRL=LARGE_PAGE_DATA=Y environment variable will cause
all tasks to use large pages, not just your executable. For this reason,
it is not recommended.
- Of special note: do not run TotalView under large pages. Discussed
later in the Debugging section of this
tutorial.
- To change a large page executable to not use large pages, use the
command ldedit -bnolpdata executable
Miscellaneous Large Page Info:
- To see if the large page bit is set on an executable, use the command
shown below. Note that the example executable's name is "bandwidth"
and that the output will contain LPDATA if the bit is set.
% dump -Xany -ov bandwidth | grep "Flags"
Flags=( EXEC DYNLOAD LPDATA DEP_SYSTEM )
|
- On AIX 5.3 and later, "ps -Z" will show 16M in the DPGSZ column (data
page size) for jobs using large pages. The SPGSZ (stack) and TPGSZ
(text) columns will remain at 4K regardless. For example:
% ps -Z
PID TTY TIME DPGSZ SPGSZ TPGSZ CMD
135182 pts/68 0:00 4K 4K 4K my_small_page_job
177982 pts/68 0:00 16M 4K 4K my_large_page_job
|
- The sysconf(_SC_LARGE_PAGE_SIZE) function call will return the large
page size on systems that have large pages.
- The vmgetinfo() function returns information about large page pools
size and other large page related information.
Running on Purple Systems
|
RDMA
What is RDMA?
- One definition:
A communications protocol that provides transmission of data from the
memory of one computer to the memory of another without involving the
CPU, cache or context switches.
- Since there is no CPU involvement, data transfer can occur in parallel
with other system operations. Overlapping computation with communication
is one of the major benefits of RDMA.
- Zero-copy data transport reduces memory subsystem load.
- Implemented in hardware on the network adapter (NIC). Transmission occurs
over the High Performance Switch.
- Other advantages:
- Offload message fragmentation and reassembly to the adapter
- Reduced packet arrival interrupts
- One-sided shared memory programming model
- IBM also refers to RDMA as "Bulk Transfer".
How to Use RDMA:
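In practice, RDMA usage on Purple is requested through environment variables described elsewhere in this tutorial; a hedged sketch is shown below (the message-size value is illustrative, and the SLURM_NETWORK setting shown is already the LC default):
# SLURM network setting: User Space protocol, bulk transfer (RDMA), both adapter ports
setenv SLURM_NETWORK us,bulk_xfer,sn_all
# POE variables related to bulk transfer (RDMA); see the performance variable
# list in the Misc section and IBM's documentation for details
setenv MP_USE_BULK_XFER yes
setenv MP_BULK_MIN_MSG_SIZE 65536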
Debugging With TotalView
|
- TotalView remains the debugger of choice when working with parallel programs
on LC's IBM AIX machines.
- TotalView is a complex and sophisticated tool which requires much more than
a few paragraphs of description before it can be used effectively. This
section serves only as a quick and convenient "getting started" summary.
- Using TotalView is covered in great detail in LC's
Totalview tutorial.
The Very Basics:
- Be sure to compile your program with the -g option
- When starting TotalView, specify the poe process and then
use TotalView's -a option for your program and any other
arguments (including POE arguments). For example:
totalview poe -a myprog -procs 4
- TotalView will then load the poe process and open its Root and Process
windows as usual. Note that the poe process appears in the Process Window.
- Use the Go command in the Process Window to start poe with your executable.
- TotalView will then attempt to acquire your partition and load your job.
When it is ready to run your job, you
will be prompted about stopping your parallel job. In most cases,
answering yes is the right thing to do.
- Your executable should then appear in the Process Window. You are now
ready to begin debugging your parallel program.
- For debugging in batch, see
Batch System
Debugging in LC's TotalView tutorial.
A Couple of LC-Specific Notes:
- For non-MPI jobs on LC's IBMs, you will need to put at least one
poe command in your batch script if you plan to login to
a batch node where your job is running. Something as simple as
poe date or poe hostname will do the trick.
Otherwise you will be prompted for a password, which will never be
recognized.
- For Tri-lab cross-cell authentication users: instead of using
ssh for connecting to a batch node, use
rsh. The above note for non-MPI jobs also applies.
TotalView and Large Pages:
- TotalView will perform poorly if run with
large pages.
- Do not set the LDR_CNTRL=LARGE_PAGE_DATA=Y environment variable when you
are using TotalView. Instead, enable your application for large pages
with the -blpdata flag at build time, or by using the
ldedit -blpdata executable_name command before starting
it with TotalView.
- This will allow your application to use large pages, but keep TotalView
using standard AIX 4 KB pages, where it performs best.
- TotalView will warn you if you are trying to run it with Large Pages, as
shown below:
************************************************************************
* WARNING: This TotalView session may run SLOWLY because this *
* machine has a large page pool, and you have set LDR_CNTRL, but *
* it is not set to LARGE_PAGE_DATA=N . *
* *
* TotalView will run at its normal speed if you exit this session, *
* unsetenv LDR_CNTRL, and flag your executable to use large pages *
* by issuing ``ldedit -blpdata <your executable>''. Later, you *
* may unflag it with ``ldedit -bnolpdata <your executable>''. To *
* list the flag, run ``dump -ov <your executable> | grep LPDATA''. *
************************************************************************
Misc - Recommendations, Known Problems, Etc.
|
Performance Related POE Environment Variables:
- POE provides a number of environment variables that can
have a direct effect on an application's performance.
- However, there are few, if any, hard-and-fast rules on their use. Their
effects are highly application dependent and can vary significantly.
- Furthermore, IBM's documentation for these variables, how they interact
with each other, and how to optimally exploit them, is minimal at best.
- Interested users will have to "kick the tires" and experiment with these
environment variables themselves.
- The more relevant performance related POE environment variables are listed
below. Consult the POE man page
and IBM's Parallel Environment documentation for
details.
MP_EAGER_LIMIT |
MP_BUFFER_MEM |
MP_USE_BULK_XFER |
MP_BULK_MIN_MSG_SIZE |
MP_CSS_INTERRUPT |
MP_RXMIT_BUF_SIZE |
MP_SHARED_MEMORY |
MP_RETRANSMIT_INTERVAL |
MP_RXMIT_BUF_CNT |
MP_POLLING_INTERVAL |
MP_WAIT_MODE |
LAPI_DEBUG_BULK_XFER_SIZE |
MP_SINGLE_THREAD |
MP_TASK_AFFINITY |
MP_PRIORITY |
LAPI_DEBUG_ZC_CONN_RECV |
DAT Times on uP:
- Some weekends on the uP machine will be used for Dedicated
Access Time (DAT). Users can apply for DAT time on the web:
https://www.llnl.gov/lcforms/ASC_dat_form.html.
- DAT time will be charged to a separate set of banks, so use
of DAT time will not affect your regular bank.
- DAT time on Purple is anticipated after Purple becomes GA (generally
available).
Parallel I/O Warnings:
- Use only the GPFS parallel file system(s) on uP and Purple for parallel
I/O. Do not use your home directory or any other NFS mounted directories.
- If MP_COREFILE_FORMAT is not set, then full sized AIX corefiles will be
produced. For parallel jobs, you want to make sure that you launch your
executable from a parallel file system and not your home directory or
other NFS mounted file system. This is because, should your job dump
core, the parallel tasks will all attempt to write a core file to a
non-parallel file system, and possibly hang/crash the server.
- Currently, LC is advising users to verify their GPFS data files (for
example, by using a checksum). See news gpfs.data
after logging into uP or Purple for details.
This completes the tutorial.
|
Please complete the online evaluation form - unless you are doing the exercise,
in which case please complete it at the end of the exercise. |
References and More Information
|
- Author: Blaise Barney, Livermore
Computing, LLNL.
- IBM Parallel Environment (PE), ESSL/PESSL and GPFS documentation - follow
the links located at:
www-03.ibm.com/servers/eserver/pseries/library/sp_books.
- IBM Compiler Documentation:
- Numerous web pages at IBM's web site:
ibm.com.
- Presentation materials from LC's IBM seminar on Power4/5 Tools and
Technologies. To access these materials, see
www.llnl.gov/computing/mpi/news_events.html
and scroll down to the Thursday, July 8-9, 2004 event.
- Photos/Graphics: Permission to use photos/graphics from IBM sources
has been obtained by the author and is on file. Other photos/graphics
have been created by the author, created by other LLNL employees,
obtained from non-copyrighted sources, or used with the permission of
authors from other presentations and web pages.