© 2008 by the Rector and Visitors of the University of Virginia.

The information contained on the University of Virginia’s Department of Information Technology and Communication (ITC) website is provided as a public service with the understanding that ITC makes no representations or warranties, either expressed or implied, concerning the accuracy, completeness, reliability or suitability of the information, including warranties of title, non-infringement of copyright or patent rights of others. These pages are expected to represent the University of Virginia community and the State of Virginia in a professional manner in accordance with the University of Virginia’s Computing Policies.

Getting Started on the Aspen Linux Cluster

This tutorial is designed for researchers who are new to the Aspen Linux cluster. It covers basic information about the cluster, as well as how to create and submit batch jobs using the PBS resource management software. It also contains sample job command files that can be used as templates for running jobs under PBS.

The Aspen Linux Cluster at UVA

The Aspen Linux Cluster is a 48-node, distributed-memory multiprocessor system. Each node of the cluster contains two 1.53 GHz AMD Athlon K7 MP processors with 256 KB of cache (per CPU) and 1 GB of RAM (per node). The nodes are interconnected with Gigabit Ethernet (60-110 Mbytes/sec bandwidth, 50-200 microseconds latency).

The Aspen Linux cluster uses Red Hat Linux version 7.2 as its operating system and the Portable Batch System (PBS) software to distribute the computational workload across the nodes. PBS is a batch job scheduling application that provides the facility for building, submitting and processing batch jobs on the cluster.

Jobs are submitted to the cluster by creating a PBS job command file that specifies certain attributes of the job, such as how long the job is expected to run and how many nodes of the cluster are needed (e.g. for parallel programs). PBS then schedules when the job is to start running on the cluster (based in part on those attributes), runs and monitors the job at the scheduled time, and returns any output to the user once the job completes.

Logging on to the Cluster

To log in to the Linux cluster, connect to the machine aspen.itc.virginia.edu with slogin or ssh. This places you on the head node of the cluster, which acts as the control console for interactive work such as source code editing, compilation, and submitting jobs through PBS. Applications such as Matlab or Mathematica should not be run interactively on aspen.itc; use other machines for that purpose. When you log on to the cluster you should be in your blue.unix home directory.

Important notice for Windows users: do not use a standard Windows editor such as Notepad to edit files that will be used on the Linux or other Unix systems. The two systems use different sequences of control characters to mark the end of line (EOL). If you are using the clusters from a Windows system, a number of alternatives are available.

More information about using Unix systems from Windows machines can be found at www.itc.virginia.edu/research/unix/windows_tools.
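If a file has already been saved with Windows line endings, the extra carriage-return characters can be removed on the cluster itself. The following is a minimal sketch using the standard tr utility; the file names job_dos.sh and job_unix.sh are placeholders, and the demonstration file is created in a temporary directory only so that the example is self-contained.

```sh
#!/bin/sh
# Build a small demonstration file with Windows (CRLF) line endings.
# The directory and file names here are placeholders.
demo_dir=$(mktemp -d)
printf '#!/bin/sh\r\necho hello\r\n' > "$demo_dir/job_dos.sh"

# A literal carriage-return character, used as the search pattern.
cr=$(printf '\r')

# Count the lines containing a carriage return; grep exits non-zero when
# the count is 0, hence the "|| true".
before=$(grep -c "$cr" "$demo_dir/job_dos.sh" || true)

# Delete every carriage return and write a Unix-format copy.
tr -d '\r' < "$demo_dir/job_dos.sh" > "$demo_dir/job_unix.sh"
after=$(grep -c "$cr" "$demo_dir/job_unix.sh" || true)

echo "lines with a carriage return: before=$before after=$after"
rm -r "$demo_dir"
```

For a real file, only the tr line is needed, with your own input and output file names.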

Configuring Your Account

Use of the Aspen Cluster assumes familiarity with the Unix/Linux software environment. In order to use PBS for batch job submission, it may be necessary to configure some of your Unix account startup files. General information about the Unix operating system can be found at the URL www.itc.virginia.edu/research/unixbasics.html.

When a job is submitted to the cluster through PBS, a new login to your account is initiated, and any initialization commands in your startup files (.profile, .variables.ksh, .kshrc, etc.) are executed. In this case (running in batch mode) it is necessary to disable interactive commands such as tset and stty. If these precautions are not taken, error messages will be written to the batch job's error file and your program may not run.

The recommended procedure to disable the interactive sections of the startup files is to test the environment variable PBS_ENVIRONMENT, which is set when PBS runs. If the variable has been set, meaning a PBS job has initiated the login, the interactive parts of the startup files are skipped.

Below is an example of a .profile file configured for use with PBS on the Aspen cluster.

# The following command exports variables set here to your user shell.
set -a

# This command runs your ".variables.ksh" file.
. ${HOME}/.variables.ksh

# Run the interactive commands below only when PBS_ENVIRONMENT is not set,
# i.e. skip them when the login was initiated by a PBS batch job.
if [ -z "$PBS_ENVIRONMENT" ] ; then

    # Make /home/$USER the initial Linux command prompt directory
    cd /home/$USER

    # Interactive lines such as control key and terminal settings go here

# Close exclusion of interactive section (PBS batch job requirement)
fi

The following link shows a complete .profile modified to run PBS jobs using K-shell. If you are using the shell tcsh, the following link shows a .login modified to run PBS jobs. You should also make sure any stty commands are done inside the PBS exclusion test in the .profile or .login.

Note: if you have trouble using the man command on Aspen, in your .variables.ksh file replace the line

PAGER=/usr/bin/more

with

PAGER=more

This should work on all systems since more is normally in the path automatically.

Note that csh (tcsh) users may get the warning "Warning: no access to tty, thus no job control in this shell" as part of their PBS job output. This is documented on page 18 of the PBSPro User's Guide and should not affect the job itself.

To allow access to the PBS commands and manual pages, the appropriate paths have been added to the system PATH and MANPATH environment variables. Users should make sure they are including the system PATH and MANPATH variables as part of their account PATH and MANPATH variables (e.g. in .variables.ksh, PATH=${HOME}/bin:${PATH}:/home/loadl/bin:.).
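One way to confirm that the PBS commands are reachable from your account is to test whether their directory appears in PATH. The sketch below defines a small helper, in_path, for that purpose. The directory /usr/pbs/bin is an assumption taken from the sample qstat -f output later in this tutorial and may differ on your system.

```sh
#!/bin/sh
# in_path DIR: succeed if DIR is one of the colon-separated entries of PATH.
in_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Assumed location of the PBS commands; adjust if your system differs.
PBS_BIN=/usr/pbs/bin

if in_path "$PBS_BIN"; then
    echo "$PBS_BIN is on PATH"
else
    echo "$PBS_BIN is missing from PATH; check your .variables.ksh file"
fi
```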

Users may need to modify their PAGER variable (typically in the .variables.ksh file) to be /bin/more so that the man command will work correctly on the cluster.


Using Modules to Load Software

The Aspen cluster uses modules to manage the setting of paths and other environment variables for particular software packages. Aspen offers more than one compiler, as well as an MPICH environment, and at least one module must be loaded in order to use a compiler or its libraries. For example, the command

module load pgi

loads the current version of the PGI compiler suite, while the command

module load pgi/4.0

loads the older version of the PGI compilers.

The module command has a number of subcommands, some of which are synonyms. For example, module add is synonymous with module load.

A listing of the available modules, with a one-line description of each, can be obtained by typing

module whatis

Executing module whatis on Aspen at a particular time yields

icc/7.0              : loads the Intel C++ Compiler Environment
ifc/7.0              : loads the Intel Fortran Compiler Environment
imsl/5.0             : loads the IMSL scientific library
java/1.5             : loads the Sun JDK Environment
mpich-eth-gnu/1.2.4  : loads the mpich environment for Gnu over Ethernet
mpich-eth-intel/1.2.4: loads the mpich environment for Intel over Ethernet
mpich-eth-pgi/1.2.5  : loads the mpich environment for PGI over Ethernet
pgi/4.0              : loads the PGI Compiler Environment
pgi/5.0              : loads the PGI Compiler Environment

Compilers

Programs for which the user has written the source code must first be compiled on aspen.itc to run on the cluster. Currently, three sets of compilers are supported on Aspen: the Portland Group (PGI), the Intel, and the Gnu compilers. PGI and Intel offer C, C++, and Fortran 95 compilers; the Gnu compilers include C, C++, and Fortran 77.

The Portland Group (PGI) Compilers are licensed by ITC to run on Linux platforms at the University. The PGI compilers available on the Aspen Cluster are:

pgcc [options] file.c                     (C)
pgCC [options] file.C                     (C++)
pgf77 [options] file.f                    (Fortran 77)
pgf90 [options] file.f                    (Fortran 90)
pghpf [options] file.f                    (High Performance Fortran)

For a complete list of options consult the relevant compiler man page, e.g. man pgf77 from your account on aspen.itc. More detailed information about the PGI compilers can be found in the documentation on the webpage,

www.pgroup.com/doc/index.htm

Information about installing these compilers on your own Linux workstation can be found on the webpage,

www.itc.virginia.edu/research/pgi/

The Intel compilers are licensed by ITC to run on Linux platforms at the University. The Intel compilers available on the Aspen Cluster are:

icc [options] file.c                     (C)
icc [options] file.C file.cc file.cpp    (C++)
ifc [options] file.f                     (Fortran 77)
ifc [options] file.f90                   (Fortran 90/95)

For a complete list of options consult the relevant compiler man page, e.g. man ifc on Aspen. More detailed information about the Intel compilers can be found in the documentation on the Fortran and C/C++ Web pages.

To compile parallel programs, the open source MPI (Message Passing Interface) libraries MPICH have been provided. A module corresponding to the compiler you wish to use must be loaded in order to set up the correct environment, because each MPICH build is specific to a compiler and to a networking protocol. For example, to use an MPICH compiled with the Intel compiler over the Ethernet networking protocol, which is the only protocol available on Aspen, the command would be

module load mpich-eth-intel

Once the module is loaded the following commands should be used to compile programs that use MPI code:

mpicc [options] file.c                     (C)
mpiCC [options] file.C                     (C++)
mpif77 [options] file.f                    (Fortran 77)
mpif90 [options] file.f                    (Fortran 90)

The following webpage provides information on using the MPI libraries.

www.itc.Virginia.EDU/research/mpi/

Once you have an executable version of a program you want to run, whether it's source code you've compiled yourself or a third party software package such as Matlab or Mathematica, you must use the PBS resource management software to run the code on the cluster.

Portable Batch System (PBS)

The PBS resource management system handles the management and monitoring of the computational workload on the Aspen cluster. Users submit "jobs" to the resource management system where they are queued up until the system is ready to run them. PBS selects which jobs to run, when, and where, according to a predetermined site policy meant to balance competing user needs and to maximize efficient use of the cluster resources.

To use PBS, you create a batch job command file which you submit to the PBS server to run on the Aspen cluster. A batch job file is simply a shell script containing the set of commands you want run on some set of cluster compute nodes. It also contains directives which specify the characteristics (attributes), and resource requirements (e.g. number of compute nodes and maximum runtime) that your job needs. Once you create your PBS job file, you can reuse it if you wish or modify it for subsequent runs.

PBS also provides a special kind of batch job called interactive-batch. An interactive-batch job is treated just like a regular batch job, in that it is placed into the queue and must wait for resources to become available before it can run. Once it is started, however, the user's terminal input and output are connected to the job in what appears to be an rlogin session to one of the compute nodes. Many users find this useful for debugging their applications or for computational steering.

PBS provides two user interfaces for batch job submission: a command line interface (CLI) and a graphical user interface (GUI). Both interfaces provide the same functionality and you can use either one to interact with PBS. The CLI lets you type commands at the system prompt. The GUI is a graphical point-and-click interface.

The PBS graphical interface is invoked with the command xpbs. The xpbs interface is composed of three windows. The first is the "Hosts Panel", which displays the hostnames of the machines running PBS servers to which jobs can be submitted. In the case of the Aspen cluster, the PBS server is running on the front-end login host aspen.itc.virginia.edu and is labeled lc0. The second window is the "Queues Panel", which displays information about the queues managed by the server host selected in the "Hosts Panel"; on the Aspen cluster it shows the single queue "workq". The third window is the "Jobs Panel", which displays information about jobs that are found in the queue(s) selected from the Queues listbox.

Further information about how to configure and use the xpbs interface can be found in Chapter 5 of the PBS Pro User Guide. The remainder of this tutorial focuses on the PBS command line interface; more detailed information about using PBS can be found in the PBS Pro User Guide.

PBS Job Command Files

To submit a job to run on the Aspen cluster, a PBS job command file must be created. The job command file is a shell script that contains PBS directives; these directives are preceded by #PBS. The following is an example of a PBS command file to run a serial job, which would only require 1 processor on one node. In this example, the executable to be run is named serial_executable.

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -o output_filename
#PBS -j oe
#PBS -m bea
#PBS -M userid@virginia.edu

cd $PBS_O_WORKDIR
./serial_executable

The first line identifies this file as a shell script. The next several lines are PBS directives that must precede any commands to be executed by the shell (e.g. the last two lines). The PBS directives illustrated are explained in the table below:

PBS Directive                         Function   

#PBS -l nodes=1:ppn=1          Specifies a PBS resource requirement of
                               1 compute node and 1 processor per node.

#PBS -l walltime=12:00:00      Specifies a PBS resource requirement of 
                               12 hours of wall clock time to run the job.

#PBS -o output_filename        Specifies the name of the file where job
                               output is to be saved. May be omitted to
                               generate filename appended with jobid number.

#PBS -j oe                     Specifies that job output and error messages
                               are to be joined in one file.

#PBS -m bea                    Specifies that PBS send email notification
                               when the job begins (b), ends (e), or 
                               aborts (a). 

#PBS -M userid@virginia.edu    Specifies the email address where PBS
                               notification is to be sent.

#PBS -V                        Specifies that all environment variables
                               are to be exported to the batch job.

It is not necessary to use the -j (join) directive; sometimes it is helpful to keep the output and error files separate. If the -o or -e directives are not specified, PBS will assign a name to each file consisting of the name of the script followed by .o (for output) or .e (for error) and the job ID number. Because the job ID number differs for every submission, several runs can write their standard output and standard error files without overwriting one another's results.
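As an illustration of the default naming, the sketch below computes the file names PBS would be expected to use for a given job name and job identifier. The helper default_names is hypothetical, and the pattern is inferred from the Error_Path entry (job16x2.e1363) in the sample qstat -f output later in this tutorial, so treat it as an assumption rather than a specification.

```sh
#!/bin/sh
# default_names JOBNAME JOBID
# Print the default output and error file names for a job.
default_names() {
    num=${2%%.*}                 # keep only the number: "1187.lc0" -> "1187"
    echo "$1.o$num $1.e$num"
}

default_names script.sh 1187.lc0
```

With -j oe in the job file only the first of the two names would be created, since the error stream is joined to the output stream.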

The following is an example of a PBS email notification to the user at the end of the job:

Date: Mon, 21 Oct 2002 23:06:47 -0400
From: adm 
To: userid@virginia.edu
Subject: PBS JOB 1187.lc0

PBS Job Id: 1187.lc0
Job Name:   script.sh
Execution terminated
Exit_status=0
resources_used.cpupercent=88
resources_used.cput=00:00:52
resources_used.mem=64248kb
resources_used.ncpus=1
resources_used.vmem=81036kb
resources_used.walltime=01:02:14

Note that the walltime-used information in the email should be used to accurately estimate the walltime resource requirement in the PBS job command file for future job submissions so that PBS can more effectively schedule the job. When submitting a particular PBS job for the first time, the walltime requirement should be overestimated to prevent premature job termination. The walltime measurement corresponds closely to the job cpu time since each job is allocated its own processor for execution.
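The walltime reported in the notification email can be turned into a request for the next run by adding a safety margin. The following is a minimal sketch; the helper pad_walltime and the 25 percent margin are assumptions chosen for illustration, not a site recommendation.

```sh
#!/bin/sh
# pad_walltime USED MARGIN
# USED is a walltime in HH:MM:SS form (as in resources_used.walltime) and
# MARGIN is a percentage to add. Prints the padded time as HH:MM:SS.
pad_walltime() {
    echo "$1" | awk -F: -v pct="$2" '{
        # convert to seconds, add the margin, and drop any fraction
        s = int(($1 * 3600 + $2 * 60 + $3) * (100 + pct) / 100)
        printf "%02d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
    }'
}

# Walltime from the sample email above, padded by an assumed 25 percent.
pad_walltime 01:02:14 25
```

The printed value can be used in the #PBS -l walltime= directive of the next submission.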

After the PBS directives in the command file, the shell executes a change directory command to $PBS_O_WORKDIR, a PBS variable indicating the directory where the PBS job was submitted. Normally this will also be where the program executable is located. Other shell commands can be executed as well. In the last line, the executable itself is invoked.

If your program was compiled with the PGI compiler or uses any of its libraries, you will probably need to add the lines

source /opt/Modules/default/init/sh
module add pgi

before or after the cd into the working directory.

If the executable is a parallel program using the Message Passing Interface (MPI), then it will require multiple processors of the cluster to run, and this is specified in the PBS nodes resource requirement. The script mpiexec is used to invoke the parallel executable. The following is an example of a PBS command file to run a parallel (MPI) job:

#!/bin/sh
#PBS -l nodes=4:ppn=2
#PBS -l walltime=12:00:00
#PBS -o output_filename
#PBS -j oe
#PBS -m abe
#PBS -M userid@virginia.edu

cd $PBS_O_WORKDIR

mpiexec -comm mpich-p4 executable_parallel

In this case the PBS nodes resource requirement specifies 2 processors per node on 4 nodes for a total of 8 processors. By default, mpiexec automatically uses this number of processors. If the code was compiled with the Intel compiler, the corresponding MPICH module (mpich-eth-intel) must be loaded before the run begins; the module commands are not shown in this example.

Parallel jobs should usually specify a nodes requirement of 2 processors per node to efficiently partition the compute nodes for these jobs.

The PBS job command file can be given any name, although it is usually appended with a .sh extension to indicate that it is a shell script. The link pbs_script.sh is an example PBS job script that runs the High Performance Linpack benchmark across 4 nodes using the input file HPL.dat. You can download these to your cluster account and use them to test PBS job submission described below. Remember to change the userid placeholder in the PBS email directive to your own.
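Since job command files differ mainly in a few values, they can be produced from a template. The sketch below is one hypothetical way to do so for a serial job: the function make_job and its argument order are illustrative assumptions, and the directives it writes are the ones described in the table above.

```sh
#!/bin/sh
# make_job NODES PPN WALLTIME USERID COMMAND
# Write a PBS job command file to standard output.
make_job() {
    cat <<EOF
#!/bin/sh
#PBS -l nodes=$1:ppn=$2
#PBS -l walltime=$3
#PBS -j oe
#PBS -m bea
#PBS -M $4@virginia.edu

cd \$PBS_O_WORKDIR
$5
EOF
}

# Print a job file for the serial example; "userid" is a placeholder.
make_job 1 1 12:00:00 userid ./serial_executable
```

Redirecting the output to a file (for example with > serial_job.sh) gives a job command file that can be submitted as described in the next section.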

Submitting a Job

The PBS qsub command is used to submit job command files for scheduling and execution. For example, to submit your job with a PBS command file called "pbs_script.sh", the syntax would be
lc0: /home/uconsult $ qsub pbs_script.sh
 
1354.lc0

lc0: /home/uconsult $ 
Notice that upon successful submission of a job, PBS returns a job identifier of the form jobid.lc0, where jobid is an integer number assigned by PBS to that job. You'll need the job identifier for any actions involving the job, such as checking job status, deleting the job, or specifying job dependencies as described below.

There are many options to the qsub command, as can be seen by typing man qsub at the Linux command prompt on lc0.itc or by consulting the PBS Pro User Guide. Three of the more useful ones are the -W option for allowing specification of additional job attributes, the -I option, which declares that the job is to be run "interactively", and the -l option, which allows resource requirements to be listed as part of the qsub command. These are discussed below.

Specifying Job Dependencies

The -W option allows for the specification of additional job attributes. In particular, the "-W depend=dependency_list" option to qsub defines the dependency between multiple jobs, which is useful if the jobs need to execute in a certain order. For example, if pbs_script2.sh should not start executing until pbs_script1.sh successfully completes because it needs a file that pbs_script1.sh creates, then these two jobs should be submitted to PBS in the following manner:

lc0: /home/uconsult $ qsub pbs_script1.sh

543.lc0

lc0: /home/uconsult $ qsub -W depend=afterok:543 pbs_script2.sh

544.lc0
After pbs_script1.sh is submitted, PBS returns the job identifier number which is then used as part of the dependence argument list when pbs_script2.sh is submitted. The "afterok" argument in the dependency list indicates that the job identified as 543 must complete successfully before pbs_script2.sh will start.
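The two submissions can also be chained in a script by capturing the identifier that the first qsub prints. The following is a minimal sketch under that assumption; the helper depend_arg is hypothetical, and the script names are the ones used in the example above. The submission itself is attempted only where qsub is installed.

```sh
#!/bin/sh
# depend_arg JOBID
# Build the argument for "qsub -W" from the identifier printed by an
# earlier qsub, e.g. "543.lc0" gives "depend=afterok:543".
depend_arg() {
    echo "depend=afterok:${1%%.*}"
}

if command -v qsub >/dev/null 2>&1; then
    # Submit the first job, then the second one only if that succeeded.
    first=$(qsub pbs_script1.sh) &&
        qsub -W "$(depend_arg "$first")" pbs_script2.sh
else
    # No PBS here: just show the argument that would be passed.
    depend_arg 543.lc0
fi
```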

Other options for arguments of the dependency list are detailed in Chapter 8 of the PBS Pro User Guide and in the online manual page for qsub, which can be viewed by typing man qsub at the Linux command prompt.

Submitting an Interactive Job

The -I option of qsub declares that a job has to be run "interactively". The job will be queued and scheduled as any PBS batch job, but when executed, the standard input, output, and error streams of the job are connected through qsub to the terminal session in which qsub is running. Interactive jobs with PBS should be used only for the purposes of testing/debugging the user's code, e.g. in cases using the PGI or TotalView debuggers.

Once the PBS interactive job is executed, the terminal session will be logged into one of the compute nodes allocated by PBS. The executable can then be invoked manually from the Linux command prompt.

As will be discussed in the next section, the PBS scheduler is configured to favor jobs with shorter walltime and smaller node resource requirements. To ensure that a PBS interactive job is executed quickly, these reduced resource requirements can be listed as arguments of qsub with the -l option.

The following is an example of running the High Performance Linpack Benchmark as an interactive PBS job using 4 processors (2 nodes with 2 processors each) and requesting 10 minutes of walltime. Note that the terminal session is actually logged into node compute-0-4.

lc0: /home/uconsult $ qsub -I -l nodes=2:ppn=2 -l walltime=00:10:00

qsub: waiting for job 1352.lc0 to start

qsub: job 1352.lc0 ready

localstorage is in /jobtmp/pbstmp.1352.lc0

compute-0-4: /home/uconsult $ mpiexec -comm mpich-p4 \
 /opt/hpl-eth/bin/xhpl 

============================================================================
HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27, 2000
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

   [further output not shown]

compute-0-4: /home/uconsult $ exit

lc0: /home/uconsult $
An interactive PBS job submission should require no more than 4 processors (2 nodes, 2 processors each) for testing/debugging purposes. In addition, an interactive PBS job will not terminate until the user exits the terminal session. The allocated nodes will remain reserved as long as the terminal session is open, up to the walltime limit, so it is extremely important that users exit their interactive sessions as soon as their debugging is done so that their nodes are returned to the available pool of processors.

Job Submission Policies

Users of the Aspen Cluster may submit as many jobs to PBS as they like. The PBS scheduler will dynamically determine a user's priority based on the number of jobs of other users and the number of available nodes, in order to maximize cluster usage in an equitable fashion. Any jobs in excess of the allowed upper limit on resources (such as cpus) will wait in the queue until a slot opens when one or more of the user's other jobs finishes.

All PBS jobs submitted by users of the cluster will go to one execution queue called workq; the scheduler will first sort them by giving jobs requiring shorter walltime and smaller node resource requirements higher run priorities. The scheduler further modifies these priorities based on a fair-share algorithm which tries to guarantee that, on average, all users will get an equal amount of computing time. Finally, jobs waiting for more than 24 hours to run will be considered "starving" and given higher priority.

PBS is currently configured to limit the maximum amount of walltime a single job can use to 168 hours. When that time limit is reached, the job will be terminated whether it has completed or not. This ensures that no one job can monopolize cluster compute nodes indefinitely and underscores the need for users to implement some type of save-restart mechanism in their code so they can restart the job close to where it was stopped and not lose all the work done up to that point. The following URL provides some guidelines for implementing save-restart in your code:
www.scd.ucar.edu/docs/chinook/save.html

PBS also imposes a limit on the number of processors users can require, based on how busy the cluster is. A user can request up to 36 processors for a parallel job, though all must become available for such a job to start; in practice this is unlikely to occur. A maximum of 48 cpus aggregated over all jobs may be in use by a single user.
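A resource request can be checked against these limits before it is submitted. The sketch below is a hypothetical helper, check_request, that uses the 36-processor and 168-hour per-job figures quoted above; it does not consult PBS, so it will not reflect any later change to the site configuration.

```sh
#!/bin/sh
MAX_PROCS=36        # per-job processor limit quoted above
MAX_HOURS=168       # per-job walltime limit quoted above

# check_request NODES PPN WALLTIME
# WALLTIME is in HH:MM:SS form. Prints "ok" or the reason for failure,
# and returns non-zero when a limit is exceeded.
check_request() {
    procs=$(( $1 * $2 ))
    secs=$(echo "$3" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')
    if [ "$procs" -gt "$MAX_PROCS" ]; then
        echo "too many processors: $procs > $MAX_PROCS"
        return 1
    fi
    if [ "$secs" -gt $(( MAX_HOURS * 3600 )) ]; then
        echo "walltime exceeds $MAX_HOURS hours"
        return 1
    fi
    echo "ok"
}

check_request 4 2 12:00:00
check_request 20 2 00:28:00 || echo "request exceeds the limits quoted above"
```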

The PBS configuration and scheduling policies used on the cluster will be periodically reviewed and modified as needed to ensure efficient and equitable use of this high performance computing resource.

Researchers with extraordinary needs for the cluster, either in terms of extended compute time or number of nodes, should contact the Research Computing Support Group at res-consult@virginia.edu to discuss making special arrangements to meet those needs.


Displaying Job Status

The qstat -a command is used to obtain status information about jobs submitted to PBS.

lc0: /home/uconsult $ qstat -a

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1363.lc0        uconsult workq    job16x2     19094  16  32    --  00:20 R 00:02
1364.lc0        teh1m    workq    job12x2      7149  12  24    --  00:16 R 00:01
1365.lc0        teh1m    workq    job8x2       4166   8  16    --  00:12 R 00:00
1366.lc0        uconsult workq    job20x2       --   20  40    --  00:28 Q   -- 
1368.lc0        uconsult workq    STDIN       30942   2   4    --  00:10 R 00:02

lc0: /home/uconsult $ 
The first five fields of the display are self-explanatory. Note that job ID 1368 has a jobname of STDIN, which is short for standard input, indicating that it's an interactive job. The sixth and seventh fields, titled NDS and TSK in the above display, indicate the total number of nodes and processors, respectively, required by each job. The ninth field indicates the required walltime (hrs:min), the tenth field, titled S, indicates the state of the job, and the last field shows the elapsed runtime. The job state can have the following values:

State              Definition

E          Job is exiting after having run
H          Job is held
Q          Job is queued, eligible to run or be routed
R          Job is Running
T          Job is in transition (being moved to a new location)
W          Job is waiting for its requested execution time to be reached
S          Job is suspended
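The state column makes it easy to summarize a listing in a script. The following is a minimal sketch that counts the jobs in a given state; the function count_state is hypothetical and relies on the state letter being the next-to-last field of each job line, as in the display above. The sample lines are copied from that display.

```sh
#!/bin/sh
# count_state LETTER
# Read "qstat -a" output on standard input and print the number of jobs
# whose state column equals LETTER.
count_state() {
    awk -v want="$1" '
        # job lines start with a numeric job ID such as 1363.lc0
        $1 ~ /^[0-9]+\./ && $(NF - 1) == want { n++ }
        END { print n + 0 }'
}

# Count the running jobs in the sample listing shown earlier.
count_state R <<'EOF'
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1363.lc0        uconsult workq    job16x2     19094  16  32    --  00:20 R 00:02
1364.lc0        teh1m    workq    job12x2      7149  12  24    --  00:16 R 00:01
1365.lc0        teh1m    workq    job8x2       4166   8  16    --  00:12 R 00:00
1366.lc0        uconsult workq    job20x2       --   20  40    --  00:28 Q   --
1368.lc0        uconsult workq    STDIN       30942   2   4    --  00:10 R 00:02
EOF
```

On the cluster the same function could be fed directly, for example with qstat -a | count_state Q.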

The following example shows how to use the qstat -f command to get detailed information on a specific job using its job identification number.

lc0: /home/uconsult $ qstat -f 1363
Job Id: 1363.lc0
Job_Name = job16x2
Job_Owner = uconsult@lc0
resources_used.cpupercent = 82
resources_used.cput = 00:01:59
resources_used.mem = 83384kb
resources_used.ncpus = 32
resources_used.vmem = 124920kb
resources_used.walltime = 00:02:33
job_state = R
queue = workq
server = lc0
Checkpoint = u
ctime = Fri Oct 25 03:00:41 2002
Error_Path = lc0:/h1/u/uc/uconsult/linux_cluster/job16x2.e1363
exec_host = compute-1-0/0+compute-0-15/0+compute-0-14/0+compute-0-13/0+comp
ute-0-12/0+compute-0-11/0+compute-0-10/0+compute-0-9/0+compute-0-8/0+co
mpute-0-7/0+compute-0-6/0+compute-0-5/0+compute-0-4/0+compute-0-3/0+com
pute-0-2/0+compute-0-1/0+compute-1-0/1+compute-0-15/1+compute-0-14/1+co
mpute-0-13/1+compute-0-12/1+compute-0-11/1+compute-0-10/1+compute-0-9/1
+compute-0-8/1+compute-0-7/1+compute-0-6/1+compute-0-5/1+compute-0-4/1+
compute-0-3/1+compute-0-2/1+compute-0-1/1
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = e
mtime = Fri Oct 25 03:00:42 2002
Output_Path = lc0:/h1/u/uc/uconsult/linux_cluster/16x2
Priority = 0
qtime = Fri Oct 25 03:00:41 2002
Rerunable = True
Resource_List.ncpus = 32
Resource_List.neednodes = 16:ppn=2
Resource_List.nodect = 16
Resource_List.nodes = 16:ppn=2
Resource_List.walltime = 20:00:00
session_id = 19094
Variable_List = PBS_O_HOME=/home/uconsult,PBS_O_LANG=en_US,
PBS_O_LOGNAME=uconsult,
PBS_O_PATH=/home/uconsult/bin:/usr/pbs/bin:/usr/share/mpi/bin:/uva/bin
:/usr/pgi/linux86/bin:/bin:/usr/bin:/usr/local/bin:/usr/bin/X11:/usr/X1
1R6/bin:.,PBS_O_MAIL=/var/spool/mail/uconsult,PBS_O_SHELL=/bin/ksh,
PBS_O_HOST=lc0,PBS_O_WORKDIR=/h1/u/uc/uconsult/linux_cluster,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq
comment = Job run at started on Fri Oct 25 at 03:00
etime = Fri Oct 25 03:00:41 2002

For further information about the qstat command, type man qstat on the cluster front-end machine aspen.itc or see the PBS Pro User Guide.
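When only one attribute of a job is needed, it can be picked out of the qstat -f display with a short filter. The sketch below defines a hypothetical helper, job_field, for that purpose. It assumes each attribute is printed as "name = value" on a single line, so it does not reassemble values that wrap across several lines, such as exec_host in the display above.

```sh
#!/bin/sh
# job_field NAME
# Read "qstat -f" output on standard input and print the value of the
# attribute called NAME (for example job_state).
job_field() {
    awk -v name="$1" '
        { sub(/^[ \t]+/, "") }                  # drop any leading indentation
        index($0, name " = ") == 1 {            # line starts with "NAME = "
            print substr($0, length(name) + 4)  # text after "NAME = "
            exit
        }'
}

# Extract the job state from a few lines of the sample display.
job_field job_state <<'EOF'
Job Id: 1363.lc0
Job_Name = job16x2
job_state = R
queue = workq
EOF
```

On the cluster this could be combined with the real command, for example qstat -f 1363 | job_field Resource_List.walltime.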



Canceling a Job

PBS provides the qdel command for deleting jobs from the system using the job identification number, as shown below.

lc0: /home/uconsult/linux_cluster $ qstat -a

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1361.lc0        uconsult workq    job16x2     18136  16  32    --  48:00 R 00:01

lc0: /home/uconsult/linux_cluster $ qdel 1361
lc0: /home/uconsult/linux_cluster $ qstat -a

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1361.lc0        uconsult workq    job16x2     18136  16  32    --  48:00 E 00:01

For further information about the qdel command, type man qdel on the cluster front-end machine lc0.itc or see the PBS Pro User Guide.



Sample PBS Command Scripts

In this section are a number of sample PBS command files for different types of jobs.

Large Scratch/Output Files

A Perl script called tmpsync has been installed on the Aspen Linux Cluster to allow users to deal with programs that generate large scratch or output files without exceeding their home directory disk space quota. The PBS command file below shows how tmpsync can be used with the scatter/collect options to distribute/collect files associated with a parallel program to/from disk space on the cluster compute nodes. Once the PBS job has completed, all output files from the master compute node will be copied to /bigtmp/pbstmp.$PBS_JOBID on the front-end node (lc0.itc). The variable $PBS_JOBID is assigned when the job begins and contains the job ID number, so users should make a note of all their job ID numbers. Files older than 72 hours are removed from /bigtmp, so users should copy their output files to their own longer-term storage.

File transfer to and from the cluster front-end should be done using a secure method such as scp or rsync. The following are examples of transferring files from /bigtmp on the cluster front-end node lc0.itc to a remote host, initiating the transfer either from lc0.itc or from the remote host. These examples use the ksh line continuation character \ immediately followed by a newline.

Transfer from lc0.itc (local source and remote destination):

/uva/bin/scp /bigtmp/pbstmp.jobid.lc0/* \
userid@remote_host.virginia.edu:/home/userid/pbs_output/.

/uva/bin/rsync -e ssh -a /bigtmp/pbstmp.jobid.lc0/. \
 userid@remote_host.virginia.edu:/home/userid/pbs_output/.

Transfer to remote_host (remote source and local destination):

/uva/bin/scp2 userid@lc0.itc.virginia.edu:/bigtmp/pbstmp.jobid.lc0/* \
/home/userid/pbs_output/.

/uva/bin/rsync -e ssh -a  \
userid@lc0.itc.virginia.edu:/bigtmp/pbstmp.jobid.lc0/. \
/home/userid/pbs_output/.
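
In the examples above, jobid stands for the numeric part of the job ID. As a rough sketch, the source directory and a matching scp command line can be generated from the job number. The function names bigtmp_dir and print_scp_cmd are hypothetical, and the pbstmp.<number>.lc0 naming is inferred from the examples above rather than from the tmpsync documentation. The command is only printed, not executed, so it can be inspected first.

```shell
#!/bin/sh
# bigtmp_dir: hypothetical helper. Prints the /bigtmp directory name
# for a numeric job ID, following the pbstmp.jobid.lc0 pattern used
# in the transfer examples above (an inferred convention).
bigtmp_dir() {
    printf '/bigtmp/pbstmp.%s.lc0\n' "$1"
}

# print_scp_cmd: hypothetical helper. Prints (does not run) an scp
# command that copies the job's output files to destination $2.
print_scp_cmd() {
    printf '/uva/bin/scp %s/* %s\n' "$(bigtmp_dir "$1")" "$2"
}

# Example:
#   print_scp_cmd 1361 userid@remote_host.virginia.edu:/home/userid/pbs_output/.
```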

Note that if the program is serial rather than parallel, the scatter and collect operations of tmpsync would not be needed since there would only be one execution node.

#!/bin/sh
#PBS -l nodes=2:ppn=2
#PBS -l walltime=00:02:00
#PBS -j oe
#PBS -m ea
#PBS -M uconsult@virginia.edu

# Define variable for local storage on compute nodes associated with the job
LS="/jobtmp/pbstmp.$PBS_JOBID"

# Copy executable (e.g. xhpl) and data files (e.g. HPL.dat) from your 
# home directory to local storage on the master compute node
cd $LS
/bin/cp $HOME/xhpl .
/bin/cp $HOME/HPL.dat .

# If parallel program, synchronize local storage from master compute node 
# to slave compute nodes
/usr/bin/tmpsync -scatter

# Run parallel program 
mpiexec -comm mpich-p4 ./xhpl > xhpl_out

# If parallel program, synchronize local storage from slave compute nodes 
# to master compute node
/usr/bin/tmpsync -collect

Note: in this script, there should be no spaces around the equals sign in the line LS="/jobtmp/pbstmp.$PBS_JOBID".
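
The following sketch illustrates the point about the equals sign and adds a guard around the cd into local storage. The fallback job ID 1361.lc0 is a placeholder so the fragment can be tried outside PBS, and the function name enter_dir is hypothetical. The guard is an optional addition, not something the script above requires.

```shell
#!/bin/sh
# Placeholder job ID for trying this outside of PBS; inside a job,
# PBS sets PBS_JOBID itself and this default is not used.
PBS_JOBID=${PBS_JOBID:-1361.lc0}

# Correct: no spaces around "=", so the shell sees a variable assignment.
LS="/jobtmp/pbstmp.$PBS_JOBID"

# Incorrect: with spaces, the shell would try to run a command named LS
# with "=" and the path as its arguments:
#   LS = "/jobtmp/pbstmp.$PBS_JOBID"

# enter_dir: hypothetical helper. Changes to directory $1, or prints a
# message and returns non-zero so the caller can stop before copying
# files into the wrong place.
enter_dir() {
    cd "$1" 2>/dev/null || { echo "cannot cd to $1" >&2; return 1; }
}

# In a job script:
#   enter_dir "$LS" || exit 1
```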

Matlab

This is a PBS job command file to run a Matlab batch job. The Matlab program commands are in the file matlab_script.m (note the .m extension is not included in the command syntax) and the output of the program will go to the file matlab_output1 at the end of the job and to matlab_output2 while the job is running.

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:02:00
#PBS -o matlab_output1
#PBS -j oe
#PBS -m ea
#PBS -M userid@virginia.edu

cd $PBS_O_WORKDIR
matlab -nojvm -nodesktop -r "matlab_script;exit" -logfile matlab_output2
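
Since the .m extension must be left off in the -r argument, a small wrapper can strip it automatically. This is only a sketch: the function name matlab_batch_cmd is hypothetical, and it prints the command line used in the script above rather than running Matlab.

```shell
#!/bin/sh
# matlab_batch_cmd: hypothetical helper. Prints the matlab command line
# for script $1 (with or without a trailing .m) and log file $2.
matlab_batch_cmd() {
    script=${1%.m}    # remove a trailing ".m" if present
    printf 'matlab -nojvm -nodesktop -r "%s;exit" -logfile %s\n' "$script" "$2"
}

# Example:
#   matlab_batch_cmd matlab_script.m matlab_output2
```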

Mathematica

This is a PBS job command file to run a Mathematica batch job. The Mathematica program commands are in the file math_script and the output of the program will go to the file math_output. These file names are arbitrary and other names could be used.

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:02:00
#PBS -j oe
#PBS -m ea
#PBS -M userid@virginia.edu

cd $PBS_O_WORKDIR
math < math_script > math_output 

It is sometimes useful to make the first line in the math_script file the command AppendTo[$Echo, "stdout"] so that the Mathematica input lines will also be included in the output file.
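
One way to add that line without editing the file by hand is sketched below. The function name ensure_echo_line is hypothetical. It inserts the command as the first line only if it is not already there, so it is safe to run more than once.

```shell
#!/bin/sh
# ensure_echo_line: hypothetical helper. Makes sure the first line of
# the Mathematica input file $1 is the AppendTo[$Echo, "stdout"] command.
ensure_echo_line() {
    file=$1
    line='AppendTo[$Echo, "stdout"]'   # single quotes keep $Echo literal
    if [ "$(head -n 1 "$file")" != "$line" ]; then
        tmp=$(mktemp) || return 1
        { printf '%s\n' "$line"; cat "$file"; } > "$tmp" || return 1
        # Write back in place so the file keeps its owner and permissions.
        cat "$tmp" > "$file" && rm -f "$tmp"
    fi
}

# Example:
#   ensure_echo_line math_script
```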

If you have Mathematica commands stored in a notebook that you would like to transfer to your math_script file, you can use one of Mathematica's front end features to help you.

  1. Select the cell or set of cells that contain the commands you wish to be written to the math_script text file.
  2. The Mathematica commands in the selected cells should be converted to Input Form by clicking on Cell -> Convert To -> InputForm in the menu bar.
  3. These cells should also be defined as initialization cells by clicking on Cell -> Cell Properties -> Initialization Cell in the menu bar.
  4. Now Mathematica can generate a text file by clicking on File -> Save As Special... -> Package Format.

A dialog box will appear prompting you to give the file a name and location. You can use this Package Format file as the input file for your Mathematica batch job.

Ansys

This is a PBS job command file to run an Ansys batch job. The Ansys program input is in the file ansys.in and the output of the program will go to the file ansys.out. Output from PBS is saved in the file ansys.msg.

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=160:00:00
#PBS -o ansys.msg
#PBS -j oe

# Copy Ansys input file to compute node scratch space
LS="/jobtmp/pbstmp.$PBS_JOBID"
cd $LS
/bin/cp /home/mst3k/ansys/ansys.in .

ansys < $LS/ansys.in > $LS/ansys.out

Gaussian 98

This is a PBS job command file to run a Gaussian 98 batch job. The Gaussian 98 program input is in the file gaussian.in and the output of the program will go to the file gaussian.out. Output from PBS is saved in the file gaussian.msg.

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=160:00:00
#PBS -o gaussian.msg
#PBS -j oe

# Copy Gaussian input file to compute node scratch space
LS="/jobtmp/pbstmp.$PBS_JOBID"
cd $LS
/bin/cp /home/userid/gaussian/gaussian.in .

# Define Gaussian scratch directory as compute node scratch space
export GAUSS_SCRDIR=$LS

# Load PGI module needed by the binary
. /opt/Modules/default/init/sh
module load pgi

g98 < $LS/gaussian.in > $LS/gaussian.out

SAS

This is a PBS job command file to run a SAS batch job. The SAS program commands are in the file myfile.sas and the output of the program will go to the file myfile.out. The log file will be myfile.log.

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -m bea
#PBS -M userid@virginia.edu

cd $PBS_O_WORKDIR
sas myfile.sas
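
After the job finishes, the log file is the first place to look for problems. The sketch below counts the lines in a SAS log that begin with ERROR, which is how SAS normally marks error messages in its log. The function name sas_log_errors is hypothetical.

```shell
#!/bin/sh
# sas_log_errors: hypothetical helper. Prints the number of lines in
# log file $1 that begin with "ERROR" (prints 0 if there are none).
sas_log_errors() {
    awk '/^ERROR/ { n++ } END { print n + 0 }' "$1"
}

# Example:
#   sas_log_errors myfile.log
```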

File Transfer to and from the Cluster

Disk space in the home directory is extremely limited, and space on /bigtmp is temporary. Once your jobs have run, you will need to transfer your files to your local system for permanent storage. File transfer to and from the cluster should be done using a secure method such as scp or rsync.

If you are transferring to and from a Unix system (this includes Linux), the following are examples of transferring files from a directory mydirectory on the cluster front-end node aspen.itc to a remote host, initiating the transfer either from aspen.itc or from the remote host. These examples use the ksh line continuation character \ immediately followed by a newline.

Transfer from aspen.itc (local source and remote destination):

/uva/bin/scp mydirectory/* \
userid@remote_host.virginia.edu:/home/userid/myoutput/.

Note: userid@ may be omitted if the user's id is the same on both systems. The colon after the hostname is essential, however. Also, if you are using Linux on your local workstation and are running OpenSSH rather than UVa's commercial SSH, you should use sftp to transfer from the workstation to the clusters (scp will work in the opposite direction); sftp takes exactly the same form and commands as insecure ftp.

/uva/bin/rsync -e ssh -a mydirectory/. \
 userid@remote_host.virginia.edu:/home/userid/myoutput/.

Transfer to remote_host (remote source and local destination):

/uva/bin/scp2 userid@aspen.itc.virginia.edu:mydirectory/* \
/home/userid/myoutput/.

/uva/bin/rsync -e ssh -a  \
userid@aspen.itc.virginia.edu:mydirectory/. \
/home/userid/myoutput/.
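
The colon after the hostname matters because scp treats an argument without a colon as a local file name, so the copy would silently stay on the local machine. A minimal sketch of checking an argument before running a transfer follows. The function name is_remote_spec is hypothetical, and it tests only for the presence of a colon, not for a valid host name.

```shell
#!/bin/sh
# is_remote_spec: hypothetical helper. Succeeds if $1 contains a colon,
# i.e. has the [userid@]host:path form that scp and rsync treat as remote.
is_remote_spec() {
    case $1 in
        *:*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example:
#   dest=userid@remote_host.virginia.edu:/home/userid/myoutput/.
#   is_remote_spec "$dest" || echo "destination has no colon" >&2
```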

Mac OS X with Darwin includes scp and rsync, so these commands can be run inside the Terminal application exactly as in the Unix examples above.

From a Windows system, use SecureFX, a commercial product available to students, faculty, and staff. The cluster runs ssh2; it does not run an ftp daemon, so sftp is the correct protocol for file transfers to the cluster frontend.