Server name: energy.ccny.cuny.edu
Off-campus access: No
Configuration: 25 node Linux cluster, 4 TB Storage
Function: Primarily for MPI parallel jobs
How to use
Use ssh (requires VPN)
To ensure efficient and fair use of the Energy Institute Xserve cluster, slave nodes are being added to the Sun Grid Engine (SGE) pool. New jobs should be submitted through SGE. If you have any questions about the user policy or SGE, please contact the system administrator. The full SGE documentation can be found at http://docs.sun.com/app/docs/doc/820-0699
Submit a simple job
The qsub command submits batch jobs to the Grid Engine queuing system.
SGE accepts jobs as shell scripts, such as bash or csh scripts. An example job script looks like this:
    #!/bin/sh
    # request Bourne shell as shell for job
    #$ -S /bin/sh
    # assume current working directory as paths
    #$ -cwd
    #
    # print date and time
    date
    # Sleep for 20 seconds
    sleep 20
    # print date and time again
    date
If the above script is called “simple.sh”, run this command to submit the job:
    qsub simple.sh
When the job is finished, SGE will write two files in the working directory:
    simple.sh.ejobid
    simple.sh.ojobid
where jobid is the job ID assigned by SGE. These files hold the standard error and standard output of the script “simple.sh”, respectively.
To submit a serial job that runs a user-compiled program, simply wrap the program call in a job script.
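As a rough sketch of such a wrapper, the snippet below writes a job script named serial.sh around a user-compiled program. The names serial.sh, serial_test, my_program, input.dat and result.out are placeholders chosen for illustration, not files that exist on the cluster.

```shell
# Create a wrapper script for a serial job.
# my_program, input.dat and result.out are hypothetical names.
cat > serial.sh <<'EOF'
#!/bin/sh
# Run the job under the Bourne shell
#$ -S /bin/sh
# Use the submission directory as the working directory
#$ -cwd
# Name the job
#$ -N serial_test

# Run the compiled program from the working directory
./my_program input.dat > result.out
EOF

# Then submit it from the head node:
#   qsub serial.sh
```

Because the script uses -cwd, the program and its input file are expected to sit in the directory from which qsub is run.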
Submit a parallel job
Submitting a parallel job is the same as submitting a serial job, except that one has to define a parallel environment and request CPUs with the line “#$ -pe mpich cpunumber”. A parallel environment is a pre-built set of resource and instruction configurations that makes a particular type of parallel computing possible. In addition, keep the arguments -np $NSLOTS -machinefile $TMPDIR/machines exactly as shown in the next example, because the number of CPUs and the names of the nodes are decided by SGE.
    ### Begin embedded Grid Engine arguments
    # (name the job)
    #$ -N test
    # request 16 CPUs and pe named mpich
    #$ -pe mpich 16
    # Force a shell
    #$ -S /bin/sh
    # (assume current working directory for paths)
    #$ -cwd
    ### End embedded Grid Engine commands
    echo "I have $NSLOTS slots to run on!"
    /common/mpich-mx10g/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /Users/jmao/cpi/cpi_mpich
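When the same program is run with different CPU counts, the job script can be generated instead of edited by hand. The following is a minimal sketch under that assumption; the helper name make_mpi_job, its arguments, and the job name cpi8 are hypothetical, while the mpirun path and arguments are taken from the example above.

```shell
# make_mpi_job NAME NCPU PROGRAM
# Writes NAME.sh, a job script requesting NCPU slots in the mpich
# parallel environment. The function name and arguments are hypothetical.
make_mpi_job() {
    name=$1
    ncpu=$2
    program=$3
    cat > "$name.sh" <<EOF
#!/bin/sh
#\$ -N $name
#\$ -pe mpich $ncpu
#\$ -S /bin/sh
#\$ -cwd
# \$NSLOTS and \$TMPDIR are filled in by SGE when the job starts
/common/mpich-mx10g/bin/mpirun -np \$NSLOTS -machinefile \$TMPDIR/machines $program
EOF
}

# Example: generate an 8-CPU job for the cpi example program
make_mpi_job cpi8 8 /Users/jmao/cpi/cpi_mpich

# Then submit it from the head node:
#   qsub cpi8.sh
```

Only the job name, the CPU count and the program path are substituted when the file is written; $NSLOTS and $TMPDIR are left literal in the generated script so that SGE supplies them at run time.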
Because of the parallel nature of the program, a parallel job spawns child processes on the slave nodes, and terminating the parent job does not guarantee that the child processes are removed. Accumulated stray child processes can use up all the endpoints, leaving the slave nodes unable to accept further MPI jobs.
A system script runs every 10 minutes to clean up stray user jobs on the slave nodes.
Control SGE in the script file
A special comment line that starts with “#$” defines a qsub option.
Here are some commonly used options:
- -N [name] — The name of the job.
- -pe [type] [num] — Request [num] CPUs in the parallel environment [type]. Only mpich is supported for the time being.
- -cwd — Place the output files in the current working directory. The default location is the user’s home directory.
- -o [path] — Place the standard output in the specified path.
- -e [path] — Place the standard error in the specified path.
- -S [shell path] — Specify the shell to use when running the job script.
- -V — Export all current environment variables to the job.
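To illustrate how these options fit together, here is a small sketch that writes a job script using each of them except -pe. The script name options_demo.sh and the output file names are made up for this example.

```shell
# Write a job script that combines the common "#$" options.
# options_demo.sh and the .out/.err file names are hypothetical.
cat > options_demo.sh <<'EOF'
#!/bin/sh
# Job name
#$ -N options_demo
# Shell used to run the script
#$ -S /bin/sh
# Run in, and write output to, the submission directory
#$ -cwd
# Pass the submitting shell's environment variables to the job
#$ -V
# Standard output and standard error files
#$ -o options_demo.out
#$ -e options_demo.err

echo "Running on $(hostname)"
EOF

# Then submit it from the head node:
#   qsub options_demo.sh
```

The same options can also be given on the qsub command line, for example qsub -N options_demo -cwd options_demo.sh.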
Monitor and control the job status
- qstat — show your own jobs in the waiting or running state.
- qstat -u "*" — show all users’ jobs in the waiting or running state.
- qstat -j jobid — show the details of the job including why the job is in that state.
- qstat -s z — show recently finished jobs.
- qhost — show current nodes in the SGE pool.
- qdel jobid — delete the job with the given job ID.
- qdel -u userid — delete all jobs belonging to the given user ID.
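These commands can be combined in small helper scripts. The sketch below waits for a job to leave the queue by polling qstat. It assumes that “qstat -j jobid” exits with a non-zero status once SGE no longer knows the job; the function name wait_for_job and the job ID 12345 are made up for illustration.

```shell
# wait_for_job JOBID [INTERVAL]
# Poll qstat until the job is no longer known to SGE.
# Assumption: "qstat -j JOBID" exits non-zero once the job has left the queue.
wait_for_job() {
    jobid=$1
    interval=${2:-30}   # seconds between checks, 30 by default
    while qstat -j "$jobid" >/dev/null 2>&1; do
        sleep "$interval"
    done
}

# Example use on the head node (12345 is a made-up job ID):
#   wait_for_job 12345 && cat simple.sh.o12345
```

A polling interval of 30 seconds or more keeps the load on the scheduler low.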