4.2. HPC Services

4.2.1. Login Nodes

Several of the clusters have High Performance Computing (HPC) services installed. Access to these services is provided via a Linux login node on each of the clusters where they are installed.

To access the login nodes you need an FG resource account and an SSH public key that you have uploaded to FutureGrid (this process is described in the section about Project and Account Management). After you are part of a valid project and have a FutureGrid account, you can log into the FutureGrid resources with ssh. The resources include the following login nodes:

  • alamo.futuregrid.org
  • bravo.futuregrid.org
  • foxtrot.futuregrid.org
  • hotel.futuregrid.org
  • india.futuregrid.org
  • sierra.futuregrid.org
  • xray.futuregrid.org

Todo

what are the login nodes for delta and echo?

For example, assume your portal name is “portalname”; then you can log into sierra as follows:

$ ssh portalname@sierra.futuregrid.org
Welcome to sierra.futuregrid.org
Last login: Thu Aug 12 19:19:22 2010 from ....

4.2.1.1. SSH Add

Sometimes you may wish to log into several machines repeatedly without retyping your key passphrase each time. To do that, you can use ssh-agent and ssh-add. First start the agent:

eval `ssh-agent`

Then add your key with:

ssh-add

Follow the instructions on the screen. If you start the agent and add your key before you ssh into the resources, you will not have to repeatedly type in your key passphrase.
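A complete session might then look like the following sketch (the key path ~/.ssh/id_rsa is only an example; use the location of the key you uploaded to FutureGrid):

$ eval `ssh-agent`
$ ssh-add ~/.ssh/id_rsa
$ ssh portalname@india.futuregrid.org
$ ssh portalname@sierra.futuregrid.org

After the key has been added once, the subsequent ssh commands no longer ask for the key passphrase as long as the agent is running.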

4.2.1.2. SSH Config

You may also want to set up your ~/.ssh/config file to create shortcuts for the usernames and hosts you log into. Let us assume your username is albert; then add the following lines to the .ssh/config file:

Host india
      Hostname india.futuregrid.org
      User albert

This will allow you to log into the machine by simply typing:

ssh india
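You can add one such entry per machine you use frequently and, if you wish, point each entry to the key you uploaded. The following sketch is only an example; the IdentityFile path depends on where you stored your key:

Host india
      Hostname india.futuregrid.org
      User albert
      IdentityFile ~/.ssh/id_rsa

Host sierra
      Hostname sierra.futuregrid.org
      User albert
      IdentityFile ~/.ssh/id_rsa

With these entries, ssh india and ssh sierra both work as shortcuts.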

4.2.1.3. Modules

The login nodes have the Modules package installed. It provides a convenient way to adapt your environment, letting you dynamically load and unload packages and services depending on your specific needs while keeping your working environment coherent. Loading a module ensures that your $PATH, $LD_LIBRARY_PATH, $LD_PRELOAD, and other environment variables are properly set, and that you can access the programs and libraries you need. For additional information about the Modules package you can consult its manual page.

To display the list of available modules:

$ module avail

To display the list of currently loaded modules:

$ module list

To add and remove packages from your environment you can use the module load and module unload commands:

$ module load <package name>/<optional package version>
$ module unload <package name>/<optional package version>

The available commands are listed in the following table:

Module commands

Command                          Description
module avail                     List all software packages available on the system.
module avail package             List all versions of package available on the system.
module list                      List all packages currently loaded in your environment.
module load package/version      Add the specified version of the package to your environment.
module unload package            Remove the specified package from your environment.
module swap package_A package_B  Swap the loaded package (package_A) with another package (package_B).
module show package              Shows what changes will be made to your environment (e.g. paths to
                                 libraries and executables) by loading the specified package.
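For example, to exchange one loaded version of a package for another and to inspect what a module would change, you could use the following sketch (the package versions are taken from the sierra listing shown below and are only meant as an illustration):

$ module swap openmpi/1.4.2 openmpi/1.4.3-gnu
$ module show intel/11.1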

Example - List the currently loaded modules on sierra after login:

$ module list

Currently Loaded Modulefiles:
  1) torque/2.4.8   2) moab/5.4.0

Example - List the available modules on sierra:

$ module avail

----------------- /opt/Modules/3.2.8/modulefiles/applications ------------------
R/2.11.1(default)      hpcc/1.3.1(default)    velvet/1.0.15
git/1.7.10             ncbi/2.2.23(default)   wgs/6.1
gromacs/4.0.7(default) soapdenovo/1.04

------------------- /opt/Modules/3.2.8/modulefiles/compilers -------------------
cmake/2.8.1(default)       java/1.6.0-i586
intel/10.1                 java/1.6.0-x86_64(default)
intel/11.1(default)

------------------- /opt/Modules/3.2.8/modulefiles/debuggers -------------------
null                       totalview/8.8.0-2(default)

------------------- /opt/Modules/3.2.8/modulefiles/libraries -------------------
intelmpi/4.0.0.028(default)  openmpi/1.4.3-intel
mkl/10.2.5.035(default)      otf/1.7.0(default)
openmpi/1.4.2(default)       unimci/1.0.1(default)
openmpi/1.4.3-gnu            vampirtrace/intel-11.1/5.8.2

--------------------- /opt/Modules/3.2.8/modulefiles/tools ---------------------
cinderclient/1.0.4(default)   moab/5.4.0(default)
cloudmesh/0.8(default)        myhadoop/0.2a
euca2ools/1.2                 novaclient/2.13.0(default)
euca2ools/1.3.1               precip/0.1(default)
euca2ools/2.0.2(default)      python/2.7(default)
genesisII/2.7.0               python/2.7.2
glanceclient/0.9.0(default)   torque/2.4.8(default)
keystoneclient/0.2.3(default) vim/7.2
marmot/2.4.0(default)

Example - load a default module (in this case cloudmesh):

$ module load cloudmesh

Please note that for loading the default you do not have to specify the version number.
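If you need a version other than the default, specify it explicitly. For example, using the python versions from the listing above:

$ module load python/2.7.2
$ module list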

4.2.1.4. List of Available Modules on Various Machines

Module           hotel              india          sierra
R                                   2.11.1         2.11.1
atlas            3.9.35             3.10.1
cbench           20110407-openmpi
cinderclient                                       1.0.4
cloudmesh                                          0.8
cmake            2.8.4              2.8.1          2.8.1
ctool            2.12
euca2ools                           2.1.2          2.0.2
fftw             3.2.2
glanceclient                                       0.9.0
globus           5.0.3
goto2            1.13
gromacs          4.5.4              4.0.7          4.0.7
gsl              1.14
hadoop           0.20.203.0
hdf5             1.8.7
hostlists        0.2
hpcc                                1.3.1          1.3.1
intel            11.1               11.1           11.1
intelmpi         4.0.0.028          4.0.0.028      4.0.0.028
java             1.6.0_31-x86_64    1.6.0-x86_64   1.6.0-x86_64
keystoneclient                                     0.2.3
lapack           3.3.0
marmot           2.4.0              2.4.0          2.4.0
mkl              10.2.5.035         10.2.5.035     10.2.5.035
moab                                5.4.0          5.4.0
myhadoop         0.2a
ncbi                                2.2.23         2.2.23
novaclient                                         2.13.0
openmpi          1.4.5              1.4.3-gnu      1.4.2
otf              1.7.1              1.7.0          1.7.0
precip                              0.1            0.1
python           2.7                2.7            2.7
szip             2.1
taktuk           3.7.3
torque                              2.5.5          2.4.8
totalview                           8.8.0-2        8.8.0-2
unimci           1.0.1              1.0.1          1.0.1
vampirtrace      5.9
zookeeper        3.3.5

4.2.1.5. Filesystem Layout

Home directories:
Home directories are accessible through the $HOME shell variable and are located at /N/u/<username>. This is where users are encouraged to keep source files, configuration files, and executables. Users should not run code from their $HOME directories. Please note that this is an NFS file system, which may result in slower access for some applications. We also advise users to keep backups at their home institution or in an external code repository; for example, we recommend that you use git or svn to back up changes to your code. Also make sure you back up your data: as a testbed, we cannot guarantee against data loss.
Scratch directories:
Scratch directories are located at different locations on the systems. To find out more about the file layout, please see the section Storage Services
System software directories:
System software directories are located at /N/soft. System and community software are typically installed here. Table Storage mountpoints on the Clusters provides a summary of the various mount points.
Storage mountpoints on the Clusters

Clustername (site)    Mountpoint               Size     Type                Backups                     Use
Sierra (UCSD/SDSC)    /N/u/portalname          40.6TB   ZFS (RAID2)         Yes (nightly incremental)   Home dir
Sierra (UCSD/SDSC)    /N/scratch/portalname    5.44TB   ZFS (RAID0)         No                          Scratch
Sierra (UCSD/SDSC)    /N/soft                  50GB     ZFS (RAID2)         Yes (nightly incremental)   Software installs
Sierra (UCSD/SDSC)    /N/images                6TB      ZFS (RAID2)         Yes (nightly incremental)   VM images
India (IU)            /N/u/portalname          15TB     NFS (RAID5)         Yes (nightly incremental)   Home dir
India (IU)            /share/project           14TB     NFS (RAID5)         Yes (nightly incremental)   Shared/group folders
India (IU)            /tmp                     77GB     local disk          No                          Scratch
Bravo (IU)            /N/u/portalname          15TB     NFS (RAID5)         Yes (nightly incremental)   Home dir
Bravo (IU)            /share/project           14TB     NFS (RAID5)         Yes (nightly incremental)   Shared/group folders
Delta (IU)            /N/u/portalname          15TB     NFS (RAID5)         Yes (nightly incremental)   Home dir
Delta (IU)            /share/project           14TB     NFS (RAID5)         Yes (nightly incremental)   Shared/group folders
Hotel (UC)            /gpfs/home               15TB     GPFS (RAID6)        No                          Home dir
Hotel (UC)            /gpfs/scratch            57TB     GPFS (RAID6)        No                          Scratch
Hotel (UC)            /gpfs/software           7.1GB    GPFS (RAID6)        No                          Software installs
Hotel (UC)            /gpfs/images             7.1TB    GPFS (RAID6)        No                          VM images
Hotel (UC)            /scratch/local           862GB    ext3 (local disk)   No                          Local scratch
Foxtrot (UFL)         /N/u/portalname          16TiB    NFS (RAID5)         No                          Home dir

Notes:
  Sierra:  By default quotas on home directories are 50 GB and quotas on scratch directories are 100 GB.
  India:   At the moment we do not have any quota implemented on India and we use the local /tmp (77 GB) as scratch space.
  Bravo:   The same NFS shares as on India are mounted on Bravo (users do not log in here; jobs are submitted through India). There are two local partitions which are used for HDFS and swift tests.
  Delta:   Same as Bravo. The NFS shares are mounted for user and group shares (users do not log in directly here; jobs are submitted through India).
  Hotel:   By default quotas on home directories are 10 GB.
  Foxtrot: At the moment we do not have any quota implemented on Foxtrot.

4.2.2. Message Passing Interface (MPI)

The Message Passing Interface Standard (MPI) is the de facto standard communication library for nearly all HPC systems and is available in a variety of implementations. It has been created through consensus of the MPI Forum, which has dozens of participating organizations, including vendors, researchers, software library developers, and users. The goal of the Message Passing Interface is to provide a portable, efficient, and flexible standard for programs using message passing. For more information about MPI, please consult the MPI Forum documentation.

4.2.2.1. MPI Libraries

Several FutureGrid systems support MPI as part of their HPC services. An up-to-date status can be retrieved from our Inca status pages.

Todo

this table is outdated.

MPI versions installed on FutureGrid HPC services

System   MPI version     Compiler     Infiniband Support   Module
Alamo    OpenMPI 1.4.5   Intel 11.1   yes                  openmpi
Bravo    OpenMPI 1.4.2   Intel 11.1   no                   openmpi
         OpenMPI 1.4.3   gcc 4.4.6    no                   openmpi/1.4.3-gnu
         OpenMPI 1.4.3   Intel 11.1   no                   openmpi/1.4.3-intel
         OpenMPI 1.5.4   gcc 4.4.6    no                   openmpi/1.5.4-[gnu intel]
Hotel    OpenMPI 1.4.3   gcc 4.1.2    yes                  openmpi
India    OpenMPI 1.4.2   Intel 11.1   yes                  openmpi
Sierra   OpenMPI 1.4.2   Intel 11.1   no                   openmpi
Xray     N/A

Loading the OpenMPI module adds the MPI compiler wrappers to your $PATH environment variable and the OpenMPI shared library directory to your $LD_LIBRARY_PATH. This is an important step to ensure that MPI applications will compile and run successfully. In cases where OpenMPI is compiled with the Intel compilers, loading the OpenMPI module will automatically load the Intel compilers as a dependency. To load the default OpenMPI module and associated compilers, just use:

$ module load openmpi
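To verify that the compiler wrappers are now available in your environment, you can, for example, check their location and the version of the back-end compiler they invoke (the reported paths and versions will differ from system to system):

$ which mpicc
$ mpicc --version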

4.2.2.2. Compiling MPI Applications

To compile MPI applications, users can simply use the available MPI compiler wrapper commands:

mpicc:
To compile C programs with the CC/icc/gcc compilers
mpicxx:
To compile C++ programs with the CXX/icpc/g++ compilers
mpif90:
To compile Fortran programs with the F90/F77/FC/ifort/gfortran compilers

To see in detail what these commands do, you can add the -show option. Thus the following commands:

$ mpicc -show
$ mpicxx -show
$ mpif90 -show

will show you the details of each of them. The resulting output can be used as a template to adapt the compile flags in case the default settings are not suitable for you.

Assuming you have loaded the OpenMPI module into your environment, you can compile a simple MPI application easily by executing:

$ mpicc -o ring ring.c

Users MUST NOT run jobs on the login or head nodes. These nodes are reserved for editing and compiling programs. Furthermore, running your programs there will not provide meaningful results, as the login node is not a standard compute node of the cluster.

4.2.2.3. Batch Jobs

Once your MPI application is compiled, you run it on the compute nodes of a cluster via batch processing. With the help of a batch processing service, a job is run on the cluster through a job queue without user intervention. The user does not have to worry much about the internal details of the job queue, but must provide the scheduler with some guidance about the job so it can be scheduled efficiently on the system.

To run jobs on resources with the HPC services, users must first activate their environment to use the job scheduler:

$ module load torque

A complete manual for the Torque scheduler can be found in the Torque manual.

Next we need to create a script so that we can run the program on the cluster. We will be using our simple ring example to illustrate some of the parameters you need to adjust. Please save the following content to a file called ring.pbs:

#! /bin/bash

# OPTIONS FOR THE SCRIPT
#PBS -M username@example.com
#PBS -N ring_test
#PBS -o ring_$PBS_JOBID.out
#PBS -e ring_$PBS_JOBID.err
#PBS -q batch
#PBS -l nodes=4:ppn=8
#PBS -l walltime=00:20:00


# make sure MPI is in the environment
module load openmpi

# change to the directory from which the job was submitted
cd $PBS_O_WORKDIR

# launch the parallel application with the correct number of processes
# Typical usage: mpirun -np <number of processes> <executable> <arguments>
mpirun -np 32 ./ring -t 1000

echo "Nodes allocated to this job: "
cat $PBS_NODEFILE

This file can be used to submit a job to the queueing system by calling the command:

qsub ring.pbs

In the job script, lines that begin with #PBS are directives to the job scheduler. You can disable any of these lines by adding an extra # character at the beginning of the line, as ## is interpreted to be a comment. Common options include:

  • -M: specify a mail address that is notified upon completion
  • -N: To specify a job name
  • -o: The name of the file to write stdout to
  • -e: The name of the file to write stderr to
  • -q: The queue to submit the job to
  • -l: Resources specifications to execute the job

The first parameters are rather obvious, so let us focus on the -q option. Each batch service is configured with a number of queues that target different classes of jobs in order to schedule them more efficiently. These queues can be switched on or off, be modified, or new queues can be added to the system. It is useful to get a list of the queues available on the system where you would like to submit your jobs, and to inspect which queue is most suitable for your purpose, with the qstat command on the appropriate login node:

$ qstat -q

Currently we have the following queues:

HPC Job Queue Information:

Resource   Queue name   Default Wallclock Limit   Max Wallclock Limit   NOTES
india      batch        4 hours                   24 hours
           long         8 hours                   168 hours
           scalemp      8 hours                   168 hours             restricted access
           b534         none                      none                  restricted access
           ajyounge     none                      none                  restricted access
sierra     batch        4 hours                   24 hours
           long         8 hours                   168 hours
hotel      extended     none                      none
alamo      shortq       none                      24 hours
           longq        none                      24 hours
foxtrot    batch        1 hour                    none                  not for general use
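The -q option can also be given on the qsub command line instead of (or in addition to) the #PBS -q directive in the script. For example, to submit the ring job to india's long queue listed above (a small sketch):

$ qsub -q long ring.pbs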

Todo

remove the queue ajyounge from the system, can this be done while preserving the logs?

Todo

remove the queue b534 from the system, can this be done while preserving the logs?

Next we focus on the -l option that specifies the resources. The term:

nodes=4

means that we specify 4 servers on which we execute the job. The term:

ppn=8

means that we use 8 virtual processors per node, where a virtual processor is typically executed on a core of the server. Thus it is advisable not to exceed the number of cores per server. For some programs, choosing the best performing number of servers and cores may depend on factors such as memory needs, I/O access, and other resource-bound properties; you may have to experiment with the parameters. To identify the number of servers and cores available, please see Tables Overview of the Clusters and Selected Details of the Clusters. For example, Alamo, Hotel, India, and Sierra have 8 cores per node, thus 4 servers would provide you access to 32 processing units.
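For instance, to run the ring example with 16 processes on 2 nodes with 8 cores each, the resource request and the mpirun call could look like the following sketch. Since $PBS_NODEFILE contains one line per requested virtual processor, the process count can also be derived from it instead of being hard-coded:

#PBS -l nodes=2:ppn=8

module load openmpi
# $PBS_NODEFILE has one entry per virtual processor, i.e. 16 in this example
NPROCS=$(wc -l < $PBS_NODEFILE)
mpirun -np $NPROCS ./ring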

Often you may just want to have stdout and stderr in one file. In that case, simply replace the line containing -e with:

#PBS -j oe

which simply means that you join stdout and stderr. Here j stands for join, o for stdout and e for stderr. In case you would like to have an e-mail sent to you based on the status of the job, you can add:

#PBS -m ae

to your script. It will send you a mail when the job aborts (indicated by a), or when the job ends (indicated by e).

4.2.3. Job Management

A list of all available scheduler commands can be found on the Torque manual page. Next we describe some typical interactions for managing your jobs in the batch queue.

4.2.3.1. Job Submission

Once you have created a submission script, you can then use the qsub command to submit this job to be executed on the compute nodes:

$ qsub ring.pbs
20311.i136

The qsub command outputs either a job identifier or an error message describing why the scheduler would not accept your job. Alternatively, you can also use the msub command, which is very similar to the qsub command. For differences we ask you to consult the manual pages.
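Because qsub prints the job identifier on success, you can capture it in a shell variable and reuse it for monitoring or deletion; a small sketch:

$ JOBID=$(qsub ring.pbs)
$ qstat $JOBID
$ qdel $JOBID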

4.2.3.2. Job Deletion

Sometimes you may want to delete a job from the queue, which can be easily done with the qdel command, followed by the id:

$ qdel 20311

4.2.3.3. Job Monitoring

If your job is submitted successfully, you can track its execution using the qstat or showq commands. Both commands will show you the state of the jobs submitted to the scheduler. The difference is mostly in their output format.

showq:

Divides the output into three sections: active jobs, eligible jobs, and blocked jobs:

$ showq

active jobs ------------------------
JOBID    USERNAME       STATE      PROCS    REMAINING            STARTTIME
20311    yourusername   Running       16      3:59:59  Tue Aug 17 09:02:40

1 active job    16 of 264 processors in use by local jobs (6.06%)
                  2 of 33 nodes active (6.06%)

eligible jobs ----------------------
JOBID    USERNAME       STATE      PROCS    REMAINING            STARTTIME

0 eligible jobs

blocked jobs -----------------------
JOBID    USERNAME       STATE      PROCS    REMAINING            STARTTIME

0 blocked jobs

Total job:  1
Legend:
Active jobs:
are jobs that are currently running on resources.
Eligible jobs:
are jobs that are waiting for nodes to become available before they can run. As a general rule, jobs are listed in the order that they will be scheduled, but scheduling algorithms may change the order over time.
Blocked jobs:
are jobs that the scheduler cannot run for some reason. Usually a job becomes blocked because it is requesting something that is impossible, such as more nodes than those which currently exist, or more processors per node than are installed.
qstat:

provides a single table view, where the status of each job is added via a status column called S:

$ qstat
Job id                             Name               User          Time Use S Queue
------------------------- --------------------- ------------------- -------- - -----
1981.i136                       sub19327.sub      inca               00:00:00 C batch
20311.i136                      testjob           yourusername              0 R batch
Legend:
Job id:
is the identifier assigned to your job.
Name:
is the name that you assigned to your job.
User:
is the username of the person who submitted the job.
Time:
is the amount of time the job has been running.
S:
shows the job state. Common job states are R for a running job, Q for a job that is queued and waiting to run, C for a job that has completed, and H for a job that is being held.
Queue:
is the name of the job queue where your job will run.

If you are interested in only your job use grep:

$ qstat | grep 20311
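Alternatively, qstat can list only the jobs of a particular user:

$ qstat -u yourusername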

4.2.3.4. Job Output

If you gave your job a name with the #PBS -N <jobname> directive in your job script or by specifying the job name on the command line, your job output will be available in a file named jobname.o######, where the ###### is the job number assigned by the job manager. You can type ls jobname.o* to see all output files from the same job name.

If you explicitly name an output file with the #PBS -o <outfile> directive in your job script or by specifying the output file on the command line, your output will be in the file you specified. If you run the job again, the output file will be overwritten.

If you don’t specify any output file, your job output will have the same name as your job script, and will be numbered in the same manner as if you had specified a job name (jobname.o######).
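For example, for a job named ring_test that was submitted without an explicit -o directive and received job number 20311, the output could be inspected as follows (the file names are only illustrative):

$ ls ring_test.o*
$ less ring_test.o20311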

4.2.4. Xray HPC Services

To log into the login node of xray please use the command:

ssh portalname@xray.futuregrid.org

Extensive documentation about the user environment of the Cray can be found in Cray's user documentation.

For MPI jobs, use cc (pgcc). For best performance, add the xtpe-barcelona module:

% module add xtpe-barcelona

Currently there is only one queue (batch) available to users on the Cray, and all jobs are automatically routed to that queue. You can use the same commands as introduced in the previous sections. Thus, to list the queues please use:

qstat -Q

To obtain details of running jobs and available processors, use the showq command:

/opt/moab/default/bin/showq

4.2.4.1. Submitting a Job on xray

To execute an MPI program on xray we use a special command called aprun in the submit script. Additionally, there are some special resource specifications that we can pass along, such as mppwidth and mppnppn. The following example script will use 16 processors on 2 nodes:

$ cat job.pbs
#! /bin/sh

#PBS -l mppwidth=16
#PBS -l mppnppn=8
#PBS -N hpcc-16
#PBS -j oe
#PBS -l walltime=7:00:00

#cd to directory where job was submitted from
cd $PBS_O_WORKDIR
export MPICH_FAST_MEMCPY=1
export MPICH_PTL_MATCH_OFF=1
aprun -n 16 -N 8 -ss -cc cpu hpcc

$ qsub job.pbs

The XT5m is a 2D mesh of nodes. Each node has two sockets, and each socket has four cores. The batch scheduler interfaces with a Cray resource scheduler called ALPS. When you submit a job, the batch scheduler talks to ALPS to find out what resources are available, and ALPS then makes the reservation.

Currently ALPS is a “gang scheduler” and only allows one “job” per node. If a user submits a job in the format aprun -n 1 a.out, ALPS will put that job on one core of one node and leave the other seven cores empty. When the next job comes in, either from the same user or a different one, it will schedule that job to the next node.

If the user submits a job with aprun -n 10 a.out, then the scheduler will put the first eight tasks on the first node and the next two tasks on the second node, again leaving six empty cores on the second node. The user can modify the placement with -N, -S, and -cc.

A user might also run a single job with multiple threads, as with OpenMP. If a user runs the job as aprun -n 1 -d 8 a.out, the job will be scheduled to one node and have eight threads running, one on each core.
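As an illustration, a purely threaded job of this kind could be requested with the mppdepth resource and launched with aprun -d. This is only a sketch, assuming mppdepth is available on the system and that the application reads OMP_NUM_THREADS:

#! /bin/sh
#PBS -l mppwidth=1
#PBS -l mppnppn=1
#PBS -l mppdepth=8
#PBS -N omp-test
#PBS -j oe
#PBS -l walltime=1:00:00

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
aprun -n 1 -d 8 ./a.out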

You can run multiple, different binaries at the same time on the same node, but only from one submission. Submitting a script like this will not work:

OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 0 ./my-binary
OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 1 ./my-binary
OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 2 ./my-binary
OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 3 ./my-binary
OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 4 ./my-binary
OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 5 ./my-binary
OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 6 ./my-binary
OMP_NUM_THREADS=1 aprun -n 1 -d 1 -cc 7 ./my-binary

This will run a job on each core, but not at the same time. To run all jobs at the same time, put all the binaries into a single script, launch them in the background, and start that script with one aprun command:

$ cat run-all.pbs
#! /bin/sh
# start all binaries in the background so that they run concurrently
./my-binary1 &
./my-binary2 &
./my-binary3 &
./my-binary4 &
./my-binary5 &
./my-binary6 &
./my-binary7 &
./my-binary8 &
# wait until all of them have finished
wait
$ aprun -n 1 run-all.pbs

Alternatively, use the command aprun -n 1 -d 8 run-all.pbs. To run multiple serial jobs, you must build a batch script that divides the jobs into groups of eight and runs each group in this fashion before starting the next one.