energy – OS X server

Summary

Server name: energy.ccny.cuny.edu

Off-campus access: No

Configuration: 25 node Linux cluster, 4 TB Storage

Function: Primarily for MPI parallel jobs

How to use

Use ssh (requires VPN)

ssh username@energy.ccny.cuny.edu

Job scheduler

To use the Energy Institute Xserve cluster efficiently and fairly, the slave nodes are being added to the Sun Grid Engine (SGE) pool. New jobs should be submitted through SGE. If you have any questions about the user policy or SGE, please contact the system administrator. The full SGE documentation is available at http://docs.sun.com/app/docs/doc/820-0699

Submit a simple job

qsub
submits batch jobs to the Grid Engine queuing system.

SGE accepts jobs as shell scripts, such as bash or csh scripts. An example job script looks like this:

#!/bin/sh

# request Bourne shell as shell for job
#$ -S /bin/sh
# use the current working directory for paths
#$ -cwd

#
# print date and time
date
# Sleep for 20 seconds
sleep 20
# print date and time again
date

If the above script is called “simple.sh”, run this command to submit the job:

qsub simple.sh

When the job is finished, SGE will write two files in the working directory:

simple.sh.ejobid
simple.sh.ojobid

Here jobid is the job ID assigned by SGE. The two files contain the standard error and the standard output of the script “simple.sh”, respectively.
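As a minimal sketch of how the job ID ties the submission to its output files: the helper functions below (job_id_from_qsub and output_files_for, both hypothetical names) pull the ID out of the message qsub prints on submission and build the two file names described above. The sample message is an assumption about the usual SGE wording, not output captured from this cluster.

```shell
# Extract the numeric job ID from the message qsub prints on submission.
# Assumes the message has the form: Your job 4321 ("simple.sh") has been submitted
job_id_from_qsub() {
    printf '%s\n' "$1" | sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Print the names of the standard output and standard error files
# that SGE writes for a given script name and job ID.
output_files_for() {
    script=$1
    jobid=$2
    printf '%s.o%s\n%s.e%s\n' "$script" "$jobid" "$script" "$jobid"
}

# Sample message for illustration only (not real output from the cluster).
msg='Your job 4321 ("simple.sh") has been submitted'
id=$(job_id_from_qsub "$msg")
output_files_for simple.sh "$id"
```

On the cluster, the message would come from the real command, for example msg=$(qsub simple.sh).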

To submit a serial job that runs a user-compiled program, simply wrap the program in a script.
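A wrapper for a compiled program might look like the sketch below. The program name ./my_program, its input file input.dat, the job name serial_test, and the file name serial_job.sh are all placeholders chosen for illustration; substitute your own. The snippet writes the wrapper script and submits it only if qsub is available.

```shell
#!/bin/sh
# Write a wrapper job script around a hypothetical compiled program.
cat > serial_job.sh <<'EOF'
#!/bin/sh
# request Bourne shell as shell for job
#$ -S /bin/sh
# run from the directory the job was submitted from
#$ -cwd
# name the job (placeholder name)
#$ -N serial_test

# ./my_program and input.dat are placeholders for your own program and data
./my_program input.dat > result.txt
EOF

# Submit the wrapper only when qsub is on the PATH (i.e. on the cluster).
if command -v qsub >/dev/null 2>&1; then
    qsub serial_job.sh
else
    echo "qsub not found; serial_job.sh was written but not submitted"
fi
```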

 

Submit a parallel job

Submitting a parallel job is the same as submitting a serial job, except that you must define a parallel environment and request CPUs with the line “#$ -pe mpich cpunumber”. A parallel environment is a pre-built set of resource and instruction configurations that makes a particular type of parallel computing possible. In addition, keep the options -np $NSLOTS -machinefile $TMPDIR/machines on the mpirun command line, as shown in the next example, because SGE decides the number of CPUs and the names of the nodes.

### Begin embedded Grid Engine arguments
#   (name the job)
#$ -N test
# request 16 CPUs and pe named mpich
#$ -pe mpich 16
#   Force a shell
#$ -S /bin/sh
#   (assume current working directory for paths)
#$ -cwd
### End embedded Grid Engine arguments

echo "I have $NSLOTS slots to run on!"

/common/mpich-mx10g/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /Users/jmao/cpi/cpi_mpich
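To see which nodes SGE assigned, a job script can summarize the machine file before calling mpirun. The sketch below assumes the file lists one hostname per line, repeated once per slot, which is the usual layout for an mpich parallel environment but has not been verified on this cluster. The function name slots_per_host is hypothetical.

```shell
# Summarize how many slots each node contributes, given a machine file
# with one hostname per line (assumed: one line per slot).
slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}

# Inside a job script, call it on the file SGE prepares for the job:
#   slots_per_host "$TMPDIR/machines"
```

The output then appears in the job's standard output file, next to the "I have $NSLOTS slots" line.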

IMPORTANT:

Because of the parallel nature of the program, a parallel job spawns child processes on the slave nodes, and terminating the parent job does not guarantee that the child processes are removed. Accumulated stray child processes will use up all the endpoints and prevent the slave nodes from accepting future MPI jobs.

A system script runs every 10 minutes to clean up stray user jobs on the slave nodes.

Control SGE in the script file

Special comment lines that start with “#$” define qsub options.

Here are some commonly used options:

  • -N [name]: The name of the job.
  • -pe [type] [num]: Request [num] CPUs in the parallel environment [type]. Only mpich is currently supported.
  • -cwd: Place the output files in the current working directory. The default location is the user's home directory.
  • -o [path]: Place the standard output in the specified path.
  • -e [path]: Place the standard error in the specified path.
  • -S [shell path]: Specify the shell to use when running the job script.
  • -V: Export the environment variables of the submitting shell to the job.
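The options above can be combined in one script, as in this sketch. The job name options_demo and the two output file names are placeholders. JOB_NAME and JOB_ID are environment variables that SGE sets for a running job; the script falls back to "unset" so it can also be run by hand outside SGE.

```shell
#!/bin/sh
#$ -N options_demo
#$ -S /bin/sh
#$ -cwd
#$ -o options_demo.out
#$ -e options_demo.err
#$ -V

# Report what SGE set for this job; print "unset" when run outside SGE.
describe_job() {
    echo "job name: ${JOB_NAME:-unset}"
    echo "job id:   ${JOB_ID:-unset}"
    echo "node:     $(uname -n)"
}

describe_job
```

With -o and -e given, the output goes to the named files instead of the default files that carry the job ID in their names.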

Monitor and control the job status

  • qstat: show your own jobs in the waiting or running state.
  • qstat -u "*": show all users' jobs in the waiting or running state.
  • qstat -j jobid: show the details of a job, including why it is in its current state.
  • qstat -s z: show recently finished jobs.
  • qhost: show the nodes currently in the SGE pool.
  • qdel jobid: delete the job with the given job ID.
  • qdel -u userid: delete all jobs belonging to the given user ID.
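The monitoring commands can be scripted. The sketch below polls qstat -j until the job is gone. It assumes that qstat -j returns a non-zero exit status once the job no longer exists, which should be checked on this cluster before relying on it. The function name wait_for_job and the QSTAT override variable are hypothetical and exist only to make the sketch easy to try without SGE.

```shell
# Poll until a job no longer appears in qstat, then return.
# Usage: wait_for_job JOBID [INTERVAL_SECONDS]
wait_for_job() {
    jobid=$1
    interval=${2:-30}
    # QSTAT may be set to another command for testing; defaults to qstat.
    # Assumption: "qstat -j JOBID" fails once the job has left the queue.
    while "${QSTAT:-qstat}" -j "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
    echo "job $jobid is no longer queued or running"
}

# Example on the cluster (4321 is a placeholder job ID):
#   wait_for_job 4321 60
```

A long polling interval keeps the load on the scheduler low; qstat -s z can then confirm that the job finished.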