Livermore Computing Resource Management System (LCRM)
Table of Contents
- Abstract
- LCRM Overview
- Resource Allocation & Control System (RAC)
- LCRM Bank Structure
- Bank Shares
- User RAC Utilities
- defbank
- newbank
- pshare
- bac
- brlim
- pquota
- lrmusage
- LCRM Usage GUI
- Production Workload Scheduler (PWS)
- LCRM Job Scheduling
- Batch Job Limits
- Building a Job Control Script
- Optimizing CPU Usage
- Batch Utilities and Commands
- psub: Submitting a Job
- pstat, spjstat, ju: Displaying Job Status
- prm: Cancelling a Job
- phold, prel: Holding and Releasing Jobs
- palter: Changing a Job's Attributes
- pexp: Expediting a Job
- phist: Job Memory Statistics and History
- phstat: Showing a Host's Attributes
- plim: Showing a Machine's Job Limits
- lrmmgr: Obtaining Configuration Information
- Batch Debugging, I/O and Miscellaneous Considerations
- References and More Information
- Exercise
The Livermore Computing Resource Management System (LCRM) is a product of LLNL's
Livermore Computing Center (LC). Its primary purpose is to allocate computer
resources, according to resource delivery goals, for LC's production computer
systems. It is the batch system that LC users employ to submit, monitor, and
interact with their production computing jobs.
This tutorial begins with a brief overview of LCRM and its two primary
functional components, the Resource Allocation and Control System and
the Production Workload Scheduler. Each of these components is then
further explored, with a practical focus on describing commands and
utilities that are provided for the user's interaction with LCRM. Building job
command scripts, running parallel jobs, and job scheduling policies
are also included. The lecture is followed by a lab exercise.
Note: LCRM was formerly known as the Distributed Production Control System
(DPCS).
Level/Prerequisites: Beginner. The material covered by the following tutorials
would also be useful:
EC3501: Introduction to Livermore
Computing Resources
EC3503: IBM POWER Systems Overview
EC3516: IA32 Linux Clusters Overview
- The Livermore Computing Resource Management System (LCRM) is LC's
production batch system, used on all OCF and SCF systems.
- Short History:
- The LCRM project began in 1991 when LC started to convert all of its
production computer systems to UNIX platforms.
LCRM entered production in October 1992 and has continued to develop
since then.
- Until 9/03, LCRM was known as the Distributed Production Control
System (DPCS).
- Purpose:
The primary purpose of LCRM is to manage LC's production computing
resources. This is done according to predefined resource delivery
goals.
Resource Delivery Goals:
- Determine who may use what percentage of a given compute resource
- Defined by LC management in coordination with program managers.
- Program managers oversee their group's access to production computer
system resources
- Resource delivery goals are "programmed" into the LCRM system which
then attempts to meet them
- LCRM accomplishes resource management through a complex hierarchy of:
- computer bank-share accounts
- time and usage monitoring tools
- run-control mechanisms
- Delivery of resources to users is based upon their usage rate
according to a
"Fair Share with Half-Life Decay of Usage" algorithm
(discussed later). There are no "service units" per se.
Architecture:
- LCRM is organized as a hierarchy consisting of domains, partitions,
and individual machines.
- Domain: Highest level; consists of partitions.
- Partition: A group of similar machines, which usually maps to a
particular system. For example: up_part, pengra_part, white_part,
thunder_part, mcr_part, alc_part, gps_part and ice_part.
- Machine: Member of one partition. Actual compute resource.
- On the OCF, the primary
production partitions are all part of the same domain. On the SCF,
partitions are split into two domains: one domain includes the IBM
partitions running LoadLeveler, and the other domain includes all
other SCF partitions.
- The primary importance of a domain is that LCRM is only aware of the
machines within a domain. This means you can't submit a job on
ASC White which is supposed to run on Lilac, and vice-versa.
- Management of this system requires control machines and a network of
daemon processes that reside both on the control machines and compute
machines.
- LCRM includes two major subsystems that work together to deliver and account
for the usage of LC computer resources:
- Resource Allocation & Control System (RAC) - allocates,
reports and controls computing resources to organizations and
individuals.
- Production Workload Scheduler (PWS) - schedules
production (batch) computing jobs. Scheduling dependent upon RAC
information.
For the most part, these subsystems are transparent to the user.
- LCRM provides a consistent interface across the different computing
partitions it manages. Both managers and users can interact with both the
RAC and PWS subsystems on any partition through a set of common command
line utilities.
Resource Allocation & Control System (RAC)
- The Resource Allocation & Control System (RAC) provides:
- Accounting for resource usage
- Mechanisms for allocating machine resources among diverse users and
groups
- Utilities for users/managers to query/set resource usage
- The RAC system manages delivery of computer resources to user jobs
primarily through:
- Banks - hierarchical allocations of pools/groups
- User allocations within the banks
- User actual resource usage
- Banks and shares:
- A bank is a pool of "shares" that can be
thought of as the bank's percentage of a partition's computing
resources. A bank never "runs out of shares" because they reflect
a percentage, not a decrementable service unit.
- A bank typically includes a group of users and/or sub-banks that
have permission to invoke its shares.
- Users within a bank will have an allocation - their percentage of the
bank's shares.
- Most of the RAC system commands/utilities are primarily for use
by system managers. Several (covered later) are useful for the general
user.
- Banks are organized into a hierarchy, and have parent/child relationships
with other banks.
- You may have access to more than one bank.
- Your default bank is the same on all resource partitions.
- The bank hierarchy can differ across partitions, and is also subject
to change.
- Every bank has a specified number of "shares".
- Your normalized shares represent your percentage of the entire partition.
- The bank hierarchy strongly influences batch job scheduling because
shares are assigned and their effects enforced in layers.
- If you have access to multiple banks, you may find one bank gives
better service due to the hierarchy and share assignments.
The following commands enable you to query/set Resource Allocation & Control
System (RAC) parameters. Only a brief description of each is provided.
Additional detailed information (man page) can be obtained by clicking on
the hyperlinked command names.
defbank
- Sets your default bank for batch or interactive sessions.
Without arguments your valid banks will be displayed with a
prompt for your new default bank.
- Examples:
defbank       Show and prompt for your new default bank
defbank -l    List your available banks
Sample output:
% defbank
Valid banks:
cs (current) (default)
guests
Enter the new default bank:
newbank
- Sets the bank for a single interactive session (only).
Can be used to override the default bank. Without arguments
your valid banks will be displayed and you will be prompted
for your new current bank.
- To set a bank for batch jobs, use the
psub -b bankname option in your
batch script file (discussed later).
- Examples:
newbank       Show and then prompt for your new current bank
newbank -l    List your current bank
newbank -d    Set your default bank as current bank
Sample output:
% newbank
Valid banks:
cs (current) (default)
guests
Enter the new bank you want to draw resources from:
pshare
- Queries the LCRM database for bank share allocations and usage
statistics. Without options only your bank information is displayed.
- Examples:
pshare           Show information on your banks
pshare -T root   Show entire bank hierarchy and usage for current partition
pshare -r bank   Show bank hierarchy from the specified bank up through root
Sample output:
% pshare
USERNAME BANKNAME -------SHARES------- -------USAGE-------
ALLOCATED NORMALIZED NORMALIZED
joeuser cs 1.00000 0.0604% 0.002%
joeuser guests 1.00000 0.0000% 0.000%
% pshare -r cs
BANKNAME PARENT -------SHARES------- -------USAGE-------
ALLOCATED NORMALIZED NORMALIZED
cs lc 1.00000 50.0000% 5.941%
lc overhead 1.00000 50.0000% 5.941%
overhead root 5.00000 50.0000% 5.941%
root (NULL) 1.00000 100.0000% 100.000%
bac
- Displays your bank access permissions. Without options only
your bank access information is displayed.
- Examples:
bac           Display your available banks
bac -T bank   Show bank hierarchy for specified bank
bac -m \*     Show your available banks across all LCRM partitions
Sample output:
% bac
USERNAME BANKNAME DEFAULT ACCESS EXPED EXEMP FXPRI
joeuser cs default U 0 0 0
joeuser guests U 0 0 0
% bac -T overhead
BANKNAME PARENT
overhead root
lc overhead
infosec lc
cm lc
cs lc
problem cs
guests cs
co lc
sa overhead
sa_jr sa
- Output columns represent:
- DEFAULT: shows which bank is the default
- ACCESS: "U" for use (most users). C, E and M are administrative.
- EXPED: number of days of special expedite permission remaining.
- EXEMP: number of days of exempt access remaining.
- FXPRI: the number of days a user may set their jobs' priority to a
constant value
brlim
- Displays resource partition specific limits and current usage.
Primarily designed to show the limits and usage for jobs, nodes and
node-time at the user or bank level.
- Examples:
brlim -u user   Show resource usage and limit information for the specified user.
                The default is to show only your own limit and usage information.
brlim -b bank   Show resource usage and limit information for the specified bank.
Sample output:
% brlim -u joeuser
USERNAME BANKNAME +----JOBS---+---NODES---+----NODE-TIME----+
|LIMIT- USED|LIMIT- USED|LIMIT - USED|
joeuser micphys |none - 0|none - 0|none - 0:00|
joeuser a_phys |none - 0|none - 0|none - 0:00|
joeuser positron |none - 0|none - 0|none - 0:00|
joeuser axcode |none - 3|none - 3|none - 2:37|
% brlim -b alliance
BANKNAME PARENT +----JOBS---+---NODES---+----NODE-TIME----+
|LIMIT- USED|LIMIT- USED|LIMIT - USED|
alliance root |none - 1|none - 453|none -52110:5|
pquota
- Displays information about any resource quotas or
usage limits which may be in effect. Without options only
your quota information is shown.
- Note: this feature is not currently being used by LC
- Examples:
pquota           Show any quotas for your userid
pquota -t bank   Show quota information for all users of the specified bank
Sample output:
% pquota
USERNAME BANKNAME -------REFRESH-------
QUOTA USED PERIOD BASE
joeguy cs unlim N/A N/A N/A
joeguy guests unlim N/A N/A N/A
lrmusage / pcsusage
- This utility reports information about how time and
memory are being consumed for particular banks, machines,
and users. The report may be based on a number of criteria
such as bank, machine, user and time period.
- pcsusage is an older name for the utility, which is now aliased to
lrmusage.
- Examples:
lrmusage -bu                       Show the usage for your bank(s) on the default machine
lrmusage -bu -user all -host mcr   Show usage for users on MCR in your bank(s)
lrmusage -pad2                     Show departmental usage on the default machine
Sample output:
% lrmusage -bu -user all -host mcr
********************************
* LRMUSAGE *
********************************
Report for : mcr
Start Date : Jun 07 2005
End Date : Jun 07 2005
Minimum time : 0.0 minutes
Time units : minutes
********************************
Bank: cs
User Time Used
-------------------- ------------
jsikku 0.00
toxd 0.00
gptut 746.10
yynney 46.15
32nvan 1.05
jgg776 4.10
e3wr 0.00
r2oing 0.05
rsahhrnz 0.00
sanqqa 0.02
spomx 233.48
wotrr 1.00
zztpo 0.58
--------
1032.56
--------
LCRM Usage GUI
- LC provides a web-based GUI that includes statistics
on a number of LCRM usage parameters, such as:
- lrmusage
- job status
- usage reports
- For more information, see the
LCRM Usage web pages (LLNL internal link).
Production Workload Scheduler (PWS)
- The Production Workload Scheduler (PWS):
- Schedules batch (production) jobs on eligible machines
- Gathers statistics about the load on production hosts
- Uses resource accounting information from the RAC system
- Provides utilities for users to manage their batch jobs
- Supports user commands to:
- Submit a job
- Display status information about a job
- Modify a job's attributes
- Place a job on hold
- Release a held job
- Remove a job
- Obtain a limited amount of system configuration information
- "Behind the scenes", the PWS component of LCRM coordinates with other
(native) batch schedulers within the LCRM domain.
For example, on the ASC Purple machines, LCRM actually "hands over"
management of the job's execution to the native SLURM scheduler.
Fair Share with Half-Life Decay of Usage:
- This is the primary mechanism used to determine job scheduling. It is
based upon a dynamically calculated priority for your job that reflects
your share allocation within a bank and your actual usage.
- Use more than your share, your service degrades; use less than your
share, your service improves
- Your priority can become very low, but you never "run out of time" (unless
a quota has been applied)
- Half-Life Decay: Without new usage, your current usage value
decays to half its value in one week. For example, an accumulated usage
value of 8 would count as 4 after one week of inactivity, 2 after two
weeks, and so on.
- Resources are not wasted: Even though your allocation and/or job
priority may be small your job will run if machine resources are
sitting idle.
- LCRM's scheduling is dynamic with job priorities and usage information
being recalculated every minute (according to the LCRM Reference Manual).
- The details of the Fair Share with Half-Life Decay algorithm are
a bit more complex than presented here. See the
LCRM Reference Manual for a full explanation.
Other Considerations:
In General:
- The majority of nodes on LC's production systems are designated
for batch use.
- There are defined limits for batch use, based upon metrics such
as the number of nodes required, execution time, number of
running jobs and memory use.
- No two LC systems have the same batch limits
- Dedicated Application Time (DAT) jobs are very common on some LC
systems, especially during weekends.
- Job limits can and do change!
How Do I Find Out What the Limits Are?
- The most up-to-date information can be found by logging in and
issuing the command news job.lim.[system].
For example:
news job.lim.thunder
news job.lim.white
news job.lim.mcr
news job.lim.gps
news job.lim.ilx
news job.lim.um
If you're not sure of the actual command to use, try
news job.limits - it usually provides helpful hints.
- Job limits can also be found by consulting the
"LC Job Limits for All OCF Production Machines"
(LLNL internal) web page. Essentially, this is a listing of the job.lim
output for all OCF machines on a single page.
- Use the tables below. They cover every LC production machine, with the
caveat that they are not necessarily as up to date as the previous
two methods. The tables below reflect job limits as of 6/05.
ASC IBM Systems:
- Batch nodes comprise a homogeneous pool within a system. Users do not
have to be concerned with using a particular node. The batch
pool is called pbatch.
- Interactive nodes, if present, are assigned to the pdebug
pool. They can be used interactively or through the batch system.
- Login nodes are limited and are not associated with either of the
pbatch or pdebug pools.
Batch Limits for ASC IBM Systems

System           Batch Pool   Shift        Max Time   Max Nodes   Max Jobs
PURPLE (SCF) 1   pbatch       All shifts   250 hr     1336        5
                 viz          All shifts   TBD        TBD         TBD
                 pdebug       Not currently configured
ICE (SCF)        pbatch       Week         12 hr      21          4
                 pbatch       Weekend      24 hr      21          4
                 pdebug       All shifts   2 hr       1           1
UP (OCF)         pbatch       All shifts   12 hr      100         TBD
                 pdebug       All shifts   2 hr       2           TBD
UM (OCF)         pbatch       Week         6 hr       32          3
                 pbatch       Weekend      24 hr      32          3
                 pdebug       All shifts   2 hr       2           2
UV (OCF)         pbatch       Week         12 hr      32          3
                 pbatch       Weekend      24 hr      32          3
                 pdebug       All shifts   2 hr       2           2
TEMPEST (SCF)    4-way        All shifts   12 hr      7           4
                 16-way       All shifts   12 hr      3           4

Notes:
1 Purple's configuration is pre-general availability and subject to change
at any time.
Intel Systems:
- The Intel Linux systems represent a heterogeneous mix with regard to batch
limits and configurations.
- Note for ILX, ACE and QUEEN that the nodes which show both interactive
and batch limits are actually intended for batch use.
Batch Limits for Intel Systems

System        Nodes            Interactive/Batch   Max Memory      Max Time          Max Jobs per Node
                                                   (int / batch)   (int / batch)     User   Total
ACE (SCF)     ace1-ace8        Interactive         400 MB          30 min            n/a    n/a
              ace9-ace152      Interactive/Batch   400 MB / 4 GB   30 min / 200 hr   2      2
              ace153-ace160    Interactive/Batch   400 MB / 4 GB   30 min / 72 hr    2      2
              ace161-ace176    ICF Use Only
QUEEN (SCF)   queen1-queen4    Interactive         400 MB          30 min            n/a    n/a
              queen5-queen63   Interactive/Batch   400 MB / 4 GB   30 min / 200 hr   2      2
ILX (OCF)     ilx1-ilx4        Interactive         400 MB          30 min            n/a    n/a
              ilx5-ilx18       Interactive/Batch   400 MB / 3 GB   30 min / 50 hr    1      2
              ilx19-ilx61      Interactive/Batch   400 MB / 3 GB   30 min / 200 hr   1      2
              ilx62-ilx67      Interactive/Batch   400 MB / 3 GB   30 min / 12 hr    1      2

System          Pool     Interactive/Batch   Max Nodes                      Max Time                                         Max Jobs per Pool
ALC (OCF)       pbatch   Batch               TBD                            8 hr (weekday), 24 hr (weekend)                  TBD
                pdebug   Interactive         8                              30 min (weekdays), 2 hr (off-hours)              TBD
LILAC (SCF)     pbatch   Batch               64                             6 hr (weekday), 24 hr or 768 node-hr (weekend)   4
                pdebug   Interactive         TBD                            30 min                                           16
MCR (OCF)       pbatch   Batch               513 (weekday), 768 (weekend)   12 hr (weekday), until 8am Monday (weekend)      n/a
                pdebug   Interactive         16                             30 min (weekday), unlimited (weekend)            n/a
PENGRA (OCF)    pbatch   Batch               11                             2 hr or 8 node-hr (8am-5pm),
                                                                            12 hr or 108 node-hr (5pm-8am and weekends)      6
                pdebug   Interactive         4                              1 hr                                             n/a
THUNDER (OCF)   pbatch   Batch               986                            12 hr (weekday), 24 hr (weekend)                 n/a
                pdebug   Interactive         16                             30 min                                           n/a
Compaq Systems:
- The Compaq systems represent a highly heterogeneous environment.
Within a system, there may be multiple architectures with different
configurations. The batch limits reflect this.
- Note for GPS that the nodes which show both interactive
and batch limits are actually intended for batch use.
Batch Limits for Compaq Systems

System      Nodes         Interactive/Batch   Max Memory         Max Time             Max Jobs per Node
                                              (int / batch)      (int / batch)        User   Total
SC (SCF)    sc1-sc2       Interactive         4 GB               30 min               n/a    n/a
            sc3-sc4       Interactive/Batch   4 GB / unlimited   30 min / unlimited   1      4
            sc5-sc6       A division privileged use (-c sc_apriv)
            sc7-sc8       B division privileged use (-c sc_bpriv)
GPS (OCF)   gps1-gps4     Interactive/Batch   400 MB / 3 GB      30 min / 200 hr      1      4
            gps5-gps9     Interactive/Batch   800 MB / 6 GB      30 min / 200 hr      1      4
            gps10-gps11   Interactive/Batch   1.6 GB / 12 GB     30 min / 200 hr      1      4
            gps12-gps14   Interactive/Batch   3.2 GB / 24 GB     30 min / 200 hr      1      4
            gps15-gps16   Interactive         3.2 GB             30 min               n/a    n/a
            gps17-gps19   Interactive         400 MB             30 min               n/a    n/a
            gps20-gps32   Interactive/Batch   400 MB / 1.5 GB    30 min / 200 hr      1      2
            gps320        Interactive/Batch   3.2 GB / 16 GB     30 min / 200 hr      2      32
Building a Job Control Script
- The first step in running a batch job under LCRM is to create a
job control script. A job control script is nothing more than a
shell script that includes LCRM instructions.
- A job control script must be a plain text file which you
create with your favorite text editor. It can reside anywhere you
choose within your UNIX home directory.
- A job control script typically includes the following components:
- Shell commands, which are interpreted at run time
- Shell comment lines
- LCRM statement lines. These begin with #PSUB
and will therefore be interpreted by the shell as comments - but
will be understood by LCRM. They tell LCRM details about your job's
environment and requirements. These are interpreted by LCRM when
you submit your job - they are not executed at run time.
- References to LCRM environment variables within shell commands
- A call to your executable(s)
- A simple LCRM job control script appears below:
##### These lines are for LCRM
#PSUB -ln 4 -g 16
#PSUB -c up,pbatch
#PSUB -tM 30
#PSUB -eo
#PSUB -o /g/g18/me/job.out
##### These are shell commands
echo $PSUB_JOBID
cd /u/jdoe/job1
./a.out
echo 'Done'
date
LCRM Job Control Options:
- The psub man
page describes in detail the numerous (>30) options available.
The more common/important ones are described below.
PSUB option          Description

-A date-time         Run the job after the specified date/time. See the man page for
                     acceptable ways to specify the date-time parameter. Preemption
                     does not override this parameter.

-b bank              Bank from which allocated resources are to be drawn. If you do
                     not use this option, resources used by the job will be drawn
                     from your default bank.

-c mcr               Represents a "constraint" that can include system, pool, machine,
-c pbatch            memory, etc. Constraints differ with each partition. This option
-c pdebug            permits multiple constraints - they can be grouped, separated by
-c alc,pbatch        commas with no spaces.
-c 2300Mb,gps320

-d jobid             Run this job after the specified jobid has completed (see the
                     example after this table).

-e errorfile         Send stderr to the specified file.

-eo                  Send stderr to the same file as stdout.

-g numtasks[ip]      IBMs only. This special option is implemented at LLNL to take
                     advantage of SMP nodes. Specifies the number of tasks to start
                     and works with the -ln option. Discussed in detail later. For
                     IBMs, specify ip if IP communications are desired instead of
                     the default User Space communications.

-ln nodes            Number of nodes to use. If not specified, the job is a one-node
                     serial job. If nodes=1, it is a one-node parallel job.

-mb  -me             Send mail when the job begins (-mb) and/or ends (-me) execution.

-np                  Number of processors to use. Primarily for single-node computing
                     platforms - Linux and Compaq/HP clusters without a switch.

-nr                  This job is not rerunnable.

-o outputfile        Send stdout to the specified file.

-pool poolname       Pool (usually pbatch) in which to run the job. Preferred over
                     using the -c option.

-r jobname           Assigns the specified job name to the job.

SESSARGS             LCRM environment variable - passes arguments from the psub
                     command to your job control script. See the psub man page for
                     usage details.

-standby             Exempts a job from the user and bank resource limits. However,
                     the job will be removed (or otherwise stopped) when a non-standby
                     job is eligible to run on the machine and needs the resources
                     being used by this job.

-tM HH:MM:SS         Specifies the maximum cpu-time-per-task for this job. Default
-tM minutes          units are minutes. If not specified, the default is currently
                     30 minutes. *** See discussion below ***

-tW HH:MM:SS         Specifies the maximum wall clock time for this job. Default
-tW minutes          units are minutes. *** See discussion below ***
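As an illustration of combining a few of these options, the hedged sketch below
submits one job and then a second job that runs only after the first completes
(the -d option). The script names, bank name and job IDs shown are hypothetical.
% psub -b mybank -o pre.out pre.cmd
Job 50001 submitted to batch
% psub -b mybank -d 50001 -o post.out post.cmd
Job 50002 submitted to batch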
-tM versus -tW ?
- The elapsed (wall clock) run-time limit (-tW) and cpu-time-per-task
limit (-tM) behave differently depending on where your job runs.
- Behavior on parallel systems:
- Parallel machines are those that have a switch interconnect and
a pbatch pool configured, such as alc, white, mcr and thunder.
Only one user may run on a node at a time (dedicated).
- Although -tM is called cpu-time-per-task, it behaves as plain old
wall clock time on these machines, regardless of the real CPU time
or the physical number of CPUs used by a task.
- Even if a task sleeps the entire time and uses almost no real
CPU time, it will only be permitted to run until a wall clock
limit equal to -tM.
- Equally true is the case of a multi-threaded task that uses
multiple CPUs and is very CPU intensive - it likewise
will run until the wall clock equivalent of -tM.
- Behavior on non-parallel systems:
- Non-parallel machines are those that don't have a pbatch pool or
switch interconnect, such as GPS, SC, ACE, QUEEN and ILX. These
machines allow more than one user to run on a node simultaneously
(shared).
- On these machines, -tM really does mean actual CPU time.
- If your job doesn't use much CPU, it can run longer than the
wall clock equivalent of -tM.
- If your job uses multiple CPUs (threads, multi-process) and is CPU
intensive, it will run shorter than the wall clock equivalent of -tM.
- More confusing details:
- The default -tM is 30 minutes. If you forget to specify either
-tW or -tM, then your job will be terminated:
- After 30 minutes of wall clock time on parallel systems
- After 30 minutes of CPU time on non-parallel systems
- You can (but why would you want to?) use both -tM and -tW in the
same job script. This is not advised since the behavior is not
documented, not consistent across LC systems, and may even cause
your job to hang in the wait queue forever.
- Recommendations (a short example sketch follows this list):
- For parallel machines: Because -tW and -tM behave the same, use
either one of the two and just pretend it's wall clock time.
- For non-parallel machines: Set -tM to reflect your best
estimate of the actual, total CPU time that your job will require,
for all threads/processes.
- Always use one or the other to assist accurate job scheduling by
LCRM and to prevent the default limit from terminating your job early.
- Do not use both -tM and -tW in the same job script.
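A minimal sketch of these recommendations; the node counts, pool name and time
limits are hypothetical placeholders.
##### Parallel machine (e.g. thunder): -tM and -tW behave alike, so think wall clock
#PSUB -ln 8 -pool pbatch
#PSUB -tW 4:00:00
##### Non-parallel machine (e.g. gps): estimate total CPU time over all threads/processes
#PSUB -np 4
#PSUB -tM 8:00:00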
Other Notes:
- LCRM statement options can be used in either the job control script
or with the psub command (discussed later).
- There are several environment variables that are set by LCRM, and
which may be queried from within your job control script (a small
example appears after this list). See the
psub man page
for details.
- Restrictions:
- Job names cannot include the % (percent) character.
- Output/input file names should not include blank characters.
- Object files/executables cannot be submitted to LCRM directly. You
must submit a job command script which then calls the executable.
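Below is a small sketch of referencing an LCRM environment variable (PSUB_JOBID,
also shown in the earlier example script) from within a job control script; the
scratch directory path is a hypothetical placeholder.
#PSUB -ln 1 -tW 1:00:00
echo "Running as LCRM job $PSUB_JOBID"
mkdir -p /nfs/tmp2/$USER/run.$PSUB_JOBID    # hypothetical per-job scratch directory
cd /nfs/tmp2/$USER/run.$PSUB_JOBID
/u/jdoe/job1/a.out > run.log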
Optimizing CPU Usage
ASC IBMs:
- The ASC IBMs are shared memory SMP nodes. Each SMP node
has multiple CPUs, and is thus capable of running multiple tasks
simultaneously. However, by default, POE uses only one task per node.
For example, a 4-task MPI job on ASC Purple nodes having 8 CPUs each
would, by default, be spread across 4 nodes, using one CPU on each.
- In order to maximize the use of CPUs on these machines within LCRM, the
special -g (Geometry) option is used.
This option only applies to the IBM systems.
- For example, when the requested number of nodes is one and -g is
set to four, all four of the job's tasks are loaded onto CPUs of that
single ASC Purple node.
- The syntax for the -g option is:
-g task_count        task_count is the total number of tasks to use, usually
-g task_count[ip]    equated with the number of CPUs to use. Append ip if IP
                     communications are desired instead of the default User
                     Space communications (for example, 32ip).
-g @tpn#             Specifies "tasks per node", where # is the number of
                     tasks per node. Note: there must be no space character
                     between any of the characters.
- The -g option can be used with #PSUB in a job command
script or as a psub command line flag.
Examples for both are shown below.
Command (as #PSUB in your job control script,
or as psub on the command line)                 Task Distribution (assuming an 8-CPU node)

#PSUB -ln 3 -g 20                               nodeA: 0 1 2 3 4 5 6
psub -ln 3 -g 20 scriptfile                     nodeB: 7 8 9 10 11 12 13
                                                nodeC: 14 15 16 17 18 19

#PSUB -ln 3 -g @tpn3                            nodeA: 0 1 2
psub -ln 3 -g @tpn3 scriptfile                  nodeB: 3 4 5
                                                nodeC: 6 7 8

#PSUB -ln 2 -g 24                               Job will be rejected because of
psub -ln 2 -g 24 scriptfile                     over-allocation - more tasks per node
                                                than CPUs is not permitted.

#PSUB -ln 4 -g 3                                Job rejected because one node is
psub -ln 4 -g 3 scriptfile                      left unused.

Note that for threaded processes, having "unused" CPUs is actually
the right thing to do, since the threads will need to execute on them.
Linux clusters with a Quadrics switch:
- These machines differ from the IBMs in that tasks are
"packed" onto a node to fill all available CPUs by default. For
example:
#PSUB -ln 4
srun -n 4 myjob
Will result in 4 processes being packed onto 2 nodes (2 CPUs each)
and 2 unused nodes:
task 0 node 0 cpu 0
task 1 node 0 cpu 1
task 2 node 1 cpu 0
task 3 node 1 cpu 1
node 2 cpu 0
node 2 cpu 1
node 3 cpu 0
node 3 cpu 1
- In the above example, if only 4 processes are required, the job should
be specified with: #PSUB -ln 2
- If the processes are threaded, then each process should probably use
both CPUs and the job should be specified as:
#PSUB -ln 4
srun -n 4 -c 2 myjob
The task distribution will then look like:
task 0 thread 0 node 0 cpu 0
task 0 thread 1 node 0 cpu 1
task 1 thread 0 node 1 cpu 0
task 1 thread 1 node 1 cpu 1
task 2 thread 0 node 2 cpu 0
task 2 thread 1 node 2 cpu 1
task 3 thread 0 node 3 cpu 0
task 3 thread 1 node 3 cpu 1
Compaqs and Linux clusters without a Quadrics switch:
- Since these machines are primarily for serial and single node parallel
jobs, and because multiple users may be using the same node simultaneously,
optimizing CPU usage is looked at a little differently.
- To assist LCRM with effectively scheduling jobs, users with processes that
require more than one CPU should include the -np (number of processors)
option to specify how many CPUs their job will actually require.
This is particularly true for threaded codes.
- For example, if a single process code on GPS spawns 4 threads, then the job
command script should contain:
#PSUB -np 4
- Users who fail to do this may be penalized by having their jobs
"re-niced" to a lower priority. This is because parallel jobs that
fail to use -np cause LCRM to over-subscribe the CPUs, and other jobs
suffer unfairly.
- These machines may also have memory limits. To ensure that LCRM does
not place your job on a machine with inadequate memory, be sure to specify
the actual amount of memory your job requires with the -c
flag (a combined example follows this list). For example:
#PSUB -c 15000Mb
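A sketch combining the -np and -c options above for a hypothetical threaded code
on a shared machine such as GPS; the thread count, memory figure and paths are
placeholders.
##### Single-process code spawning 4 CPU-intensive threads, needing ~6000Mb of memory
#PSUB -np 4
#PSUB -c 6000Mb
#PSUB -tM 10:00:00
cd /u/jdoe/threaded_run
./threaded.exe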
Batch Utilities and Commands
LCRM provides the following utilities/commands for managing your batch job. A
brief description of each is provided. Additional detailed information can be
reviewed in each command's man page by clicking on the hyperlinked command
name.
psub
- Used to submit your job control script to LCRM. The psub command
has many options (see man page) which can be used in your job control
script or from the command line.
- Upon successful submission LCRM returns the job's ID and spools it for
execution. For example:
% psub t4.cmd
Job 42976 submitted to batch
% psub script33 -ln 4 -r script33.out
Job 3458 submitted to batch
- After you submit your job control script, changes to the contents of
the script file will have no effect on your job (LCRM has already
spooled it to system file space).
- Users may submit and queue as many jobs as they would like. However,
there are defined limits for the number of active jobs, number of nodes
and length of run. See Batch Job Limits
for details.
- If your command line psub option conflicts with the same option
in your script file, the command line option will override what is
specified in the script (in most cases). See the example below.
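For example (a hedged sketch reusing the t4.cmd script from above; the returned
job ID is illustrative), a time limit given on the command line overrides a
#PSUB -tM line inside the script:
% psub -tM 2:00:00 t4.cmd
Job 42999 submitted to batch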
pstat
- Used to display the attributes of selected jobs under
LCRM control. The command line options are used to delimit or qualify
the set of jobs to display. When pstat is invoked with no arguments
the output is restricted to your jobs.
- Example usages and output are shown below. See the man page
for explanations of status codes and other fields.
- Also, the LC Hotline has set up some useful aliases for the pstat command.
See /usr/local/docs/LCRM_pstat_tips for details.
pstat              All of your running/queued jobs
pstat -f jobid     Detailed information about the specified job
pstat -A           All jobs in the LCRM system
pstat -m lilac     Detailed listing of all jobs on lilac
pstat -T           Your completed jobs (within the past 5 days)
pstat -M jobid     Particularly useful to see the details of a job that won't run
                   because of MULTIPLE status
pstat -o outspec   Customized output - see the man page for options
Sample output:
% pstat -m mcr
JID NAME USER ACCOUNT BANK STATUS EXEHOST CL
33153 batch user1 000000 adivp RUN mcr N
33169 batch user1 000000 adivp RUN mcr N
38811 tomkj08 user4 000000 a_phys RUN mcr N
39191 psub_sc user3 529055 wndivp *DEPEND mcr N
39422 lc128_u user3 529029 bdivp *WCPU mcr N
40004 z0256 user4 000000 bdivp RUN mcr N
40550 womkj08 user88 000000 a_phys *ELIG mcr N
40617 ueans_s user88 000000 chicago *ELIG mcr N
up041% pstat -o jid,sid,user,status,prio -s prio -m up
JID SID USER STATUS PRIORITY
16453 up015.139840 icdy RUN 0.630
14743 up023.119122 u8lheou1 RUN 0.629
16392 up012.38006 geey7 RUN 0.626
16393 up003.73894 geey7 RUN 0.626
16639 N/A p8cvic2 *WCPU 0.624
16440 up098.25610 icdy RUN 0.606
16366 up042.103618 icdy RUN 0.606
16454 up051.106498 icdy RUN 0.606
16500 N/A icdy *MULTIPLE 0.606
15695 N/A uiieta *WCPU 0.606
14210 N/A o9snsk *WPRIO 0.422
15734 N/A uiieta *DEPEND N/A
14744 N/A u8lheou1 *DEPEND N/A
13945 N/A ggtral *HELDs 0.000
15731 N/A uiieta *DEPEND N/A
% pstat -f 13544
-------------------------------------------------------------------------------
LCRM BATCH JOB ID 13544 user: joeeda
-------------------------------------------------------------------------------
job name: job_run bank: cms
batch ID: 13544 account: 000000
session ID: mcr1010.28368 dependency: none
executing host: mcr job status: RUN
expedited: no standby: no
priority: 0.389 preempted by: N/A
cpn: 2 min/max nodes: 12
node distribution: 12 geometry: N/A
constraint: mcr
submitted at: 04/01/05 10:20:43 earliest start time: N/A
must stop at: N/A estimated completion: 04/02/05 23:04:19
elapsed run time limit: 12:00 time limit per task: 12:00
elapsed run time: 11:57 tasks: 24
time used: 285:48 time used per task: 11:54
time charged: 285:52
resident memory integral: 14653Mbh physical memory integral: 429434Mbh
largest process size: 682Mb process size limit: unlimited
max resident set size: 101Mb max physical size: 2742Mb
job size: 32894Mb
% pstat
9882 gps.script.abin userl2 000000 micphys *MULTIPLE N
% pstat -M 9882
JID NAME USER ACCOUNT BANK STATUS EXEHOST CL
9882 gps.script.abin userl2 000000 micphys *NOTIME mcr N
9882 gps.script.abin userl2 000000 micphys *TOOLONG mcr N
9882 gps.script.abin userl2 000000 micphys *QTOTLIMU gps320 N
9882 gps.script.abin userl2 000000 micphys *QTOTLIMU gps10 N
9882 gps.script.abin userl2 000000 micphys *WCPU gps10 N
9882 gps.script.abin userl2 000000 micphys *QTOTLIMU gps11 N
9882 gps.script.abin userl2 000000 micphys *WCPU gps11 N
9882 gps.script.abin userl2 000000 micphys *QTOTLIMU gps12 N
9882 gps.script.abin userl2 000000 micphys *QTOTLIMU gps13 N
9882 gps.script.abin userl2 000000 micphys *WCPU gps14 N
spjstat & spj
- Not really LCRM commands.
- spjstat displays statistics for running jobs on the IBMs and Intel
LINUX systems with a Quadrics switch.
- On the IBMs, spjstat does not show the LCRM job id, but instead
displays the job id of the native batch system (LoadLeveler).
- spj is similar to spjstat, but also includes more information and
jobs that aren't running yet. It also shows the LCRM job id on the
IBMs.
Sample output:
% spjstat
Scheduling pool data:
--------------------------------------------------------
Pool Memory Cpus Nodes Usable Free Other traits
--------------------------------------------------------
pbatch 15360Mb 8 116 116 30 pbatch, RDMA
pdebug 15360Mb 8 8 8 8 pdebug, RDMA
Running job data:
---------------------------------------------------------------
LL Batch ID User Nodes Pool Class Status Master
Name Used Node
---------------------------------------------------------------
uv006.1543.0 user22 20 pbatch normal R uv021
uv006.1542.0 uskra2 16 pbatch normal R uv091
uv006.1541.0 usey22y 20 pbatch normal R uv019
uv006.1540.0 3ser24y 14 pbatch normal R uv001
uv006.1536.0 eser12 16 pbatch normal R uv038
% spj
Scheduling pool data:
--------------------------------------------------------
Pool Memory Cpus Nodes Usable Free Other traits
--------------------------------------------------------
pbatch 3300Mb 2 1048 1044 9
pdebug 3300Mb 2 64 64 64
Job data:
--------------------------------------------------------------------------------
LCRM
JID BATCHID USER NODES STATUS TIMELEFT MASTER POOL CLASS
--------------------------------------------------------------------------------
20441 20441 wwwuion1 430 RUN 22:47 pbatch
17239 17239 iisjris 4 RUN 10:12 pbatch
20503 20503 f44gi 120 RUN 3:10 pbatch
20627 20627 rewrs 128 RUN 11:59 pbatch
35029 N/A u8kuo 144 *WCPU 12:00 N/A mcr:N/A
17607 N/A u8kuo 144 *WCPU 12:00 N/A mcr:N/A
19517 19517 hhimholz 32 RUN 8:18 pbatch
20603 20603 u8kuo 80 RUN 11:42 pbatch
19520 19520 hhumholz 16 RUN 8:03 pbatch
16779 N/A cdi8ero 128 *WCPU 5:00 N/A mcr:N/A
20591 20591 oo33ega 10 RUN 3:32 pbatch
20296 20296 hhopmas 32 RUN 9:36 pbatch
20576 20576 mmkdmas 32 RUN 11:06 pbatch
20534 20534 mmkdmas 32 RUN 11:01 pbatch
ju
- Not really an LCRM command.
- Displays a summary of node availability and usage within each pool
on the ASC IBMs and Intel LINUX systems with a Quadrics switch.
- Sample output (some truncation due to length):
% ju
Pool total down used avail cap Jobs
0 pdebug.general 12 0 5 7 41% avrrril-4, xxxn-1
1 pbatch.batch 243 1 241 1 99% eeee-1, ctttillo-1, mbbbtea-36, [...]
prm
- Removes a running or queued job from the LCRM system. Without the
proper authority, you can only remove jobs you own.
- Sample output:
% pstat
JID NAME USER ACCOUNT BANK STATUS EXEHOST CL XCT
42991 t5 user24 000000 cs RUN N 0
42993 t9 user24 000000 cs *WCPU N 0
% prm 42991
remove (unknown status) job 42991 (user24, 000000, cs)? [y/n] y
% pstat
JID NAME USER ACCOUNT BANK STATUS EXEHOST CL XCT
42993 t9 user24 000000 cs *WCPU N 0
phold
& prel
- These two utilities are used to place your LCRM jobs on user hold and to
release jobs which are already on user HELD status. Without the
proper authority, you can only hold/release jobs you own.
- Sample output:
% pstat
JID NAME USER ACCOUNT BANK STATUS EXEHOST CL XCT
42992 t5 user38 000000 cs *ELIG N 0
% phold 42992
hold 42992 (user38, 000000, cs)? [y/n] y
% pstat
JID NAME USER ACCOUNT BANK STATUS EXEHOST CL XCT
42992 t5 user38 000000 cs *HELDu N 0
% prel 42992
release 42992 (user38, 000000, cs)? [y/n] y
% pstat
JID NAME USER ACCOUNT BANK STATUS EXEHOST CL XCT
42992 t5 user38 000000 cs *ELIG N 0
palter
- May be used to change certain attributes of an LCRM job.
The attributes of a job can only be changed by the job owner,
by a coordinator of its bank, or by a PCS manager.
- Examples of attributes which can be changed:
- Date and time before which the job is not permitted to run
- Account to charge
- Which bank to use
- Dependency on specified job
- Maximum cpu-time-per-task time (can only be increased for non-running
jobs)
- Expedite and exempting privileges (authorized users only)
- Scheduling priority (authorized users only)
- Examples:
palter -n 14299 -A time       Set the time after which the specified job is permitted to run
palter -n 14299 -b bank       Use the specified bank for this job's resource requirements
                              (not permitted on a running job)
palter -n 14299 -tM time      Set the maximum execution time for the specified job
palter -n 14299 -expedite     Allow the specified job to preempt other jobs. Only authorized
                              users and administrative staff may expedite jobs.
pexp
- Allows authorized users to "expedite" a batch job so that it competes
favorably against jobs funded from other PCS banks.
- Mostly used by hotline staff and LCRM managers
phist
- Lists job memory size statistics and history for (up to) your last 5 jobs
- Sample output:
% phist
HOST MEAN_SIZE STD_DEV HISTORY
mcr 779Mb 1176Mb 3131Mb 2956Kb 1956Mb 2276Kb 2050Mb
phstat
- Can be used to show various dynamic attributes of hosts being
scheduled by LCRM. The hosts shown can be delimited by a constraint
by using the -c option. If the -c option is not used, attributes of
all hosts are shown.
- See the man page, or use the phstat -h command for an
explanation of what the columns mean.
- Sample output:
% phstat -c thunder
HOSTNAME+--POOL--+--ID-+--AVAILABILITY-+-------MEMORY--------+-CPU|NODE-+SCHTIME
thunder | | 100 | N-______:____ | NA/ NA NA | NA/4024 | 59s
thunder |pbatch | 100 | N-______:____ | NA/7500Mb NA | 796/ 986 | 59s
thunder |pdebug | 100 | N-______:____ | NA/7500Mb NA | 1/ 16 | 59s
% phstat -c mcr
HOSTNAME+--POOL--+--ID-+--AVAILABILITY-+-------MEMORY--------+-CPU|NODE-+SCHTIME
mcr | | 40 | N-______:____ | NA/ NA NA | NA/2236 | 49s
mcr |pbatch | 40 | N-______:____ | NA/3300Mb NA |1033/1048 | 49s
mcr |pdebug | 40 | N-______:____ | NA/3300Mb NA | 0/ 64 | 49s
% phstat -c gps320
HOSTNAME+--POOL--+--ID-+--AVAILABILITY-+-------MEMORY--------+-CPU|NODE-+SCHTIME
gps320 | | 82 | C-______:____ | 28Gb/ 31Gb 91.9% | 31/ 32 | 188s
plim
- This LCRM utility reports several seldom-changed system default job
limits, including:
- maximum run time for batch jobs
- maximum allowed job size
- default time limit your job gets if you do not specify one.
- All times are unlabeled, but all have the form hh:mm - not mm:ss, so 0:30
is 30 minutes.
- Sample output:
% plim -m mcr
(-ct) Maximum time in run slot: 0:30
(-gt) Grace time before holding or suspending a session: 0:01
(-ln) Max. allowable nodes for running batch jobs: 768
(-nh) Max. allowable node-hours for running batch jobs: unlimited
(-mr) Maximum cpu time for batch jobs: 45:00
(-mR) Maximum run time for batch jobs: 45:00
(-ms) Max. allowable size for running batch jobs: unlimited
(-tM) Default cpu time limit for a batch job: 0:30
(-tW) Default elapsed time limit for a batch job: 45:00
lrmmgr
- This LCRM utility is primarily for use by PCS managers and coordinators
to create, update and delete RAC data base records. A few
lrmmgr "show" commands, for displaying host configuration
information, may be useful for general users.
- Note: the commands shown
below are issued after starting the lrmmgr utility with the
lrmmgr command and receiving the lrmmgr>
prompt.
lrmmgr> show host white              Show configuration for the host "white"
lrmmgr> show config thunder_config   Show configuration for the host "thunder"
lrmmgr> show part asci_part          Show information about the ASC partition
lrmmgr> show part gps_part           Show information about the GPS partition
lrmmgr> show part *                  List all partitions
lrmmgr> help show                    Get help on other show commands
Batch Debugging, I/O and Miscellaneous Considerations
Batch Debugging
- Typically, jobs are debugged interactively, using whatever choice
of debugger you prefer, or is available on the system you are
using.
- LC permits you to login to parallel batch nodes ONLY when your job is
actively running. This means you can treat batch jobs like interactive
jobs for the purpose of debugging. In most cases, all you need to do
is login to the node which is running your job and attach to it via
the debugger (a sketch appears after this list). For parallel jobs,
you'll need to attach to the "manager/master" task.
- LC's TotalView tutorial includes a section on this
subject. Definitely worth checking out.
- Also, some LC systems (mostly the Linux clusters), offer an LC
utility called batchxterm that can be used to make
batch debugging a bit more convenient. There isn't a man page for
the utility (if it is present), but if it is installed, find out
more by viewing the script:
more `which batchxterm`
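A hedged sketch of the login-and-attach approach mentioned above, assuming a Linux
cluster where gdb is available; the host name, job ID and PID are placeholders.
% pstat -f 42976        # note the "executing host" field for your running job
% ssh mcr123            # login to that node (allowed only while your job runs there)
% ps -u $USER           # find the PID of your code's master task
% gdb -p 12345          # attach a debugger (gdb here) to that PID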
I/O Issues
- Most of LC's parallel production machines have very large parallel
I/O file systems (GPFS or Lustre). These should be used for parallel
(multi-task) I/O.
- Never run anything but the smallest of parallel jobs that do
intensive or concurrent I/O to an NFS-mounted file system, such as
your home directory. This degrades performance and can crash the
NFS server, making the file system unavailable to yourself and others.
- Most nodes have a reasonable amount of local (non-NFS, non-parallel)
disk space. On the IBMs and Compaqs it is /var/tmp.
On the Linux clusters it is /tmp (a sketch using local disk appears at
the end of this list). Do not use
/tmp on the IBMs - it is a different and very small file system.
- All production systems mount large, NFS mounted /nfs/tmp#
file system(s), which are shared by all users.
- Remember to clean up your files when your job is done!!!
- Be aware of exceeding your disk quota. Here are three reasons for
not writing to a file space which has exceeded its disk quota:
- You will get truncated or zero-length files and waste the
run whose output is lost as a result.
- If your home directory remains filled after the job finishes, you
will lose further data whenever you attempt to copy, write or edit files.
- Multiple attempts to write to a directory over quota create a
significant burden on the NFS server, causing interactive delays
and performance problems. In some cases, user jobs create hundreds
of thousands of write attempts, quota checks and refusals to write
within a few minutes.
- To find out more details on the various LC file systems, see the
Introduction to Livermore
Computing Resources tutorial. There are a number of topics regarding
file systems and their usage.
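As a sketch of the local-disk advice above, the job script fragment below stages
work to /tmp on a Linux cluster node and cleans up afterwards; the file names and
directories are hypothetical.
#PSUB -ln 1 -tW 2:00:00
mkdir -p /tmp/$USER.$PSUB_JOBID
cp /g/g18/me/input.dat /tmp/$USER.$PSUB_JOBID
cd /tmp/$USER.$PSUB_JOBID
/g/g18/me/a.out
cp results.out /g/g18/me/             # copy results back to permanent file space
rm -rf /tmp/$USER.$PSUB_JOBID         # remember to clean up local and temporary files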
Miscellaneous
This completes the tutorial.
Please complete the online evaluation form - unless you are doing the
exercise, in which case please complete it at the end of the
exercise.
References and More Information