Our new 128-processor cluster consists of 32 machines, each with two sockets and two cores per socket. The processors are 64-bit 2.66 GHz Core 2 Duo Xeons, each with 4 MB of second-level cache.
The machines are named acggrid01.seas.upenn.edu through acggrid32.seas.upenn.edu. We can log into any of the machines directly if needed (they just look like standard CETS-supported Linux boxes).
The cluster uses the open-source Sun Grid Engine (SGE) system for cluster job scheduling. Codex-l.cis.upenn.edu is currently running the SGE job scheduler.
Before talking about SGE, here is some information about files and file permissions.
Each of you should have a directory for your files in the shared /mnt/eclipse/acg file space. For example, my directory is /mnt/eclipse/acg/users/milom/. All research-related files should be kept in /mnt/eclipse/acg. All private files should be kept in your account's home directory.
On the ACG grid, SGE runs all jobs as the user "acgsge", which is a member of the "acg" unix group. This means that all directories you want your job to read or write must have the correct permissions. This is probably the number one source of problems you'll encounter when first using our SGE setup.
To keep files and directories read/write-able by the acg group, you'll want to do a few things.
Set the "set group ID" on directories. You can do this by:
chgrp acg directoryname
chmod g+rwXs directoryname
This will make sure that all files (and subdirectories) created in this directory inherit the acg group. Note: the acg directory you start out with should already have this set, so you shouldn't need to change anything.
Set your "umask". I set mine using "umask 7". This will make all files you create read/write-able by you and the group (but not other). Works for me. I set this in my .cshrc or bash equivalent.
Use "cp" instead of "mv". If you "cp" to copy files, it will inherent the permissions correctly. If you use a "mv" it won't. Thus, avoid using "mv" in favor of "cp".
Finally, you might want to make setting the file permissions part of your submit script. This is what I did when I was in graduate school, and it prevented lots of mistakes on my part.
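For example, a wrapper script along these lines (just a sketch; the directory and job script names are placeholders) can fix up permissions before calling qsub:

#!/bin/sh
# sketch: fix group permissions on the output directory, then submit
OUTDIR=/mnt/eclipse/acg/users/yourname/results
mkdir -p $OUTDIR
chgrp acg $OUTDIR
chmod g+rwXs $OUTDIR
cd $OUTDIR
qsub -cwd ~/myjob.sh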
Acgsge user quota problems: As the scheduling software runs as the acgsge user, all files it creates are owned by that user. If your programs write files into any CETS home directory, those files will count toward the quota of the acgsge user. This can cause quota problems. To avoid this, always write files to shared space such as /mnt/eclipse/acg/. Quota for such space is handled differently, so the problem is avoided.
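For example, you can point a job's standard output and error at the shared space when you submit (the paths and script name here are just placeholders):

qsub -cwd -o /mnt/eclipse/acg/users/yourname/logs -e /mnt/eclipse/acg/users/yourname/logs myjob.sh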
To get started, first log into codex-l.cis.upenn.edu using SSH:
ssh codex-l.cis.upenn.edu
Next, if you are a csh/tcsh user, type the following command (or add it to your .cshrc):
source /home1/a/acgsge/sge/default/common/settings.csh
If you are a sh/ksh user:
. /home1/a/acgsge/sge/default/common/settings.sh
This will set or expand the following environment variables:
- $SGE_ROOT (always necessary)
- $SGE_CELL (if you are using a cell other than >default<)
- $SGE_QMASTER_PORT (if you haven't added the service >sge_qmaster<)
- $SGE_EXECD_PORT (if you haven't added the service >sge_execd<)
- $PATH/$path (to find the Grid Engine binaries)
- $MANPATH (to access the manual pages)
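A quick way to check that the settings took effect:

echo $SGE_ROOT
which qsub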
To see the jobs in the queue:
qstat -f
To see the machines in the cluster:
qhost
To see which machines are running which jobs:
qhost -j
To see who has been using the cluster:
qacct -o
For more information on accounting, see the man page for "accounting".
To submit a test job, first change to a directory writeable by anyone in the acg unix group. Then, run the following command:
qsub -cwd ~acgsge/sge/examples/jobs/sleeper.sh
It should say something like:
Your job 17 ("Sleeper") has been submitted
You can then check the queue:
qstat -f
After a minute or so, you will have some output files in the current directory (one each for standard output and error), owned by the user acgsge:
-rw-r--r-- 1 acgsge acg  0 2006-11-01 09:41 Sleeper.e17
-rw-r--r-- 1 acgsge acg 95 2006-11-01 09:42 Sleeper.o17
If you want to test out submitting a bunch of jobs, just run qsub multiple times and then watch the jobs queue up and execute.
The directory ~acgsge/sge/examples/jobs/ has several example submission scripts. Looking at the various man pages and -help flags is useful. There are also lots of pages on the web you might find helpful.
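For reference, a minimal submission script looks something like this (a sketch only; the job name and commands are placeholders, so check the examples directory for real ones):

#!/bin/sh
#$ -N MyTestJob   # job name shown by qstat
#$ -cwd           # run in the directory you submitted from
#$ -S /bin/sh     # shell used to interpret the script
echo "Running on `hostname`"
sleep 60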
Sun Grid Engine documentation: http://gridengine.sunsource.net/documentation.html (you want the docs for "N1 Grid Engine 6").
It seems you can also tell SGE that you want a specific set of resources before you start your job. To see the requestable resources:
qconf -sc
qstat -t -F
It is also helpful to look at various man pages:
man complex
man host_conf
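Once you know the resource names from "qconf -sc", you can request them at submit time with -l. For example (the resource names and values here are only illustrative; what is actually requestable depends on the complex configuration):

qsub -cwd -l hostname=acggrid20 myjob.sh
qsub -cwd -l mem_free=2G myjob.sh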
Some other related links:
Sun Grid Engine doesn't directly support checkpointing, but it does have hooks to let you use an automated checkpointing library or application-level checkpointing. One reasonable option is Condor's checkpointing library. Setting this up requires mucking with the SGE config, but doesn't look impossible.
GridEngine scheduling shares and such ("share tree" == average over time; "functional" is an instantaneous priority). At least one web page advocates the functional schedule, as it is easiest to explain (if many users have jobs pending, the cluster will be divided up proportionally among them). They said that in their experience, "share tree" sounds good in theory, but in practice the history it includes makes it harder for end users to reason about:
- http://bioteam.net/dag/sge6-funct-share-dept.html
- http://bioteam.net/dag/sge6-funcshare-1.jpg
- http://bioteam.net/dag/gridengine-6-features.html
- http://gridengine.sunsource.net/howto/geee.html
- http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/source/daemons/schedd/schedd.html
- http://gridengine.info/articles/2005/09/30/pretty-pictures-explain-functional-vs-sharetree-scheduling
- http://gridengine.info/articles/2005/09/19/resource-allocation-overview
Other docs:
To restart the daemons, log in as the user "acgsge". To start the qmaster and scheduler, type the following on codex-l:
/home1/a/acgsge/sge/default/common/sgemaster
On each of the remaining cluster nodes, launch execd with the following command:
/home1/a/acgsge/sge/default/common/sgeexecd
After you run this, it should list the "execd" process running as user acgsge.
To set the number of slots for queue all.q on a machine (for example, acggrid20) to zero:
qconf -rattr queue slots 0 all.q@acggrid20
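To bring the machine back, set the slot count back to its normal value the same way (four here, assuming one slot per core):

qconf -rattr queue slots 4 all.q@acggrid20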
Some tips: http://gridengine.sunsource.net/howto/troubleshooting.html
Problem: a user is able to submit jobs, but they stay in the "pending" state for an indefinite amount of time.
The actual error you will get when you submit your job is like this:
Unable to run job: warning: your_username your job is not allowed to run in any queue
Your job your_jobid ("your_jobname") has been submitted.
Exiting.
(for each job you submit)
Solution: the submitting user may need to be added to the group of acg users using the "qmon" tool.
Problem: a job is stuck on the queue with status Eqw, which means that the job's directory was not given the correct group permissions. Solution: Just fixing the permissions will not solve the problem. You must kill the jobs, fix the permissions (chgrp acg your_dir; chmod g+rwx your_dir), then submit them again. This time they should work. See the section on group permissions above for how to avoid this.
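The sequence looks something like this (the job ID, directory, and script names are placeholders):

qdel your_jobid
chgrp acg your_dir
chmod g+rwx your_dir
qsub -cwd your_job_script.sh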
Problem: jobs don't start but instead spit out a seemingly infinite error stream that says "tset: standard error: Inappropriate ioctl for device". Solution: check your .login file for terminal setting problems. For example, consider the following .login code:
loop:
## If modem dialup or vt100, use vt100
if ($TERM == network || $TERM =~ *[vV][tT]*100) eval `tset -QIs vt100`
## If don't know, ask (default to vt100). Otherwise, use it.
if ($TERM == '' || $TERM == unknown) then   # don't know?
    eval `tset -QIs \?vt100`                # then ask (default vt100)
else
    eval `tset -QIs $TERM`                  # know? use it then
endif
if ($TERM == unknown || $TERM == '') goto loop
This code could cause problems because tset runs even when the job has no terminal attached.
Solution: wrap this code so it only runs when there is an interactive terminal:
if ( { [ -t ] } ) then
    (do the interactive-only stuff here)
endif
Unlike Condor's standard universe, SGE does not take a snapshot of the executable you specify when submitting jobs. This means that if half your jobs start running and the other half are queued up and you then change your executable, the jobs that have not yet started will execute the updated executable, not the original one.
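If that matters for an experiment, one workaround (just a sketch; the binary name and directory are placeholders) is to copy the executable to a job-specific location before submitting and run the copy:

#!/bin/sh
# sketch: snapshot the binary so later rebuilds don't affect queued jobs
RUNDIR=/mnt/eclipse/acg/users/yourname/run_$$
mkdir -p $RUNDIR
cp ./mysimulator $RUNDIR/
cd $RUNDIR
qsub -b y -cwd $RUNDIR/mysimulator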