Our new 128-processor cluster consists of 32 machines, each with two sockets and two cores per socket. The processors are 64-bit 2.66 GHz Core 2 Duo Xeons, each with 4 MB of second-level cache.
The machines are named acggrid01.seas.upenn.edu through acggrid32.seas.upenn.edu. We can log into any of the machines directly if needed (they just look like standard CETS-supported Linux boxes).
The cluster uses the open-source Sun Grid Engine (SGE) system for cluster job scheduling. Codex-l.cis.upenn.edu is currently running the SGE job scheduler.
Before talking about SGE, here is some information about files and file permissions.
Each of you should have a directory for your files in the shared /mnt/eclipse/acg file space. For example, my directory is /mnt/eclipse/acg/users/milom/. All research-related files should be kept in /mnt/eclipse/acg. All private files should be kept in your account's home directory.
On the ACG grid, SGE runs all jobs as the user "acgsge", which is a member of the "acg" unix group. This means that all directories you want your job to read or write must have the correct permissions. This is probably the number one source of problems you'll encounter when first using our SGE setup.
To keep files and directories read/write-able by the acg group, you'll want to do a few things.
Set the "set group ID" on directories. You can do this by:
chgrp acg directoryname
chmod g+rwXs directoryname
This will make sure that all files (and subdirectories) created in this directory inherit the acg group. Note: the acg directory you start out with should already have this set, so you shouldn't need to change anything.
Set your "umask". I set mine using "umask 7". This will make all files you create read/write-able by you and the group (but not other). Works for me. I set this in my .cshrc or bash equivalent.
Use "cp" instead of "mv". If you "cp" to copy files, it will inherent the permissions correctly. If you use a "mv" it won't. Thus, avoid using "mv" in favor of "cp".
Finally, you might want to make setting the file permissions part of your submit script. This is what I did when I was in graduate school, and it prevented lots of mistakes on my part.
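For example, a wrapper script along these lines (just a sketch; the directory and job script names are placeholders) can fix up permissions before calling qsub:

#!/bin/sh
# sketch: fix group permissions on the output directory, then submit
OUTDIR=/mnt/eclipse/acg/users/yourname/results
mkdir -p $OUTDIR
chgrp acg $OUTDIR
chmod g+rwXs $OUTDIR
cd $OUTDIR
qsub -cwd ~/myjob.sh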
Acgsge user quota problems: As the scheduling software runs as the acgsge user, all files it creates are owned by that user. If your programs write files into any CETS home directory, those files will count toward the quota of the acgsge user. This can cause quota problems. To avoid this, always write files to shared space such as /mnt/eclipse/acg/. Quota for such space is handled differently, so the problem is avoided.
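For example, you can point a job's standard output and error at the shared space when you submit (the paths and script name here are just placeholders):

qsub -cwd -o /mnt/eclipse/acg/users/yourname/logs -e /mnt/eclipse/acg/users/yourname/logs myjob.sh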
To get started, first log into codex-l.cis.upenn.edu using SSH:
ssh codex-l.cis.upenn.edu
Next, if you are a csh/tcsh user, type the following command (or add it to your .cshrc):
source /home1/a/acgsge/sge/default/common/settings.csh
If you are a sh/ksh user:
. /home1/a/acgsge/sge/default/common/settings.sh
This will set or expand the following environment variables:
- $SGE_ROOT (always necessary)
- $SGE_CELL (if you are using a cell other than >default<)
- $SGE_QMASTER_PORT (if you haven't added the service >sge_qmaster<)
- $SGE_EXECD_PORT (if you haven't added the service >sge_execd<)
- $PATH/$path (to find the Grid Engine binaries)
- $MANPATH (to access the manual pages)
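A quick way to check that the settings took effect:

echo $SGE_ROOT
which qsub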
To see the jobs in the queue:
qstat -f
To see the machines in the cluster:
qhost
To see which machines are running which jobs:
qhost -j
To see who has been using the cluster:
qacct -o
For more information on accounting, see the man page for "accounting".
To submit a test job, first change to a directory writeable by anyone in the acg unix group. Then, run the following command:
qsub -cwd ~acgsge/sge/examples/jobs/sleeper.sh
It should say something like:
Your job 17 ("Sleeper") has been submitted
You can then check the queue:
qstat -f
After a minute or so, you will have some output files in the current directory (one each for standard output and error), owned by the user acgsge:
-rw-r--r-- 1 acgsge acg  0 2006-11-01 09:41 Sleeper.e17
-rw-r--r-- 1 acgsge acg 95 2006-11-01 09:42 Sleeper.o17
If you want to test out submitting a bunch of jobs, just run qsub multiple times and then watch the jobs queue up and execute.
The directory ~acgsge/sge/examples/jobs/ has several example submission scripts. Looking at the various man pages and -help flags is useful. There are also lots of pages on the web you might find helpful.
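For reference, a minimal submission script looks something like this (a sketch only; the job name and commands are placeholders, so check the examples directory for real ones):

#!/bin/sh
#$ -N MyTestJob   # job name shown by qstat
#$ -cwd           # run in the directory you submitted from
#$ -S /bin/sh     # shell used to interpret the script
echo "Running on `hostname`"
sleep 60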
Sun Grid Engine documentation: http://gridengine.sunsource.net/documentation.html (you want the docs for "N1 Grid Engine 6").
It seems you can also tell SGE that you want a specific set of resources before you start your job. To see the requestable resources:
qconf -sc
qstat -t -F
It is also helpful to look at various man pages:
man complex
man host_conf
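Once you know the resource names from "qconf -sc", you can request them at submit time with -l. For example (the resource names and values here are only illustrative; what is actually requestable depends on the complex configuration):

qsub -cwd -l hostname=acggrid20 myjob.sh
qsub -cwd -l mem_free=2G myjob.sh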
Some other related links:
Sun Grid Engine doesn't directly support checkpointing, but it does have hooks to let you use an automated checkpointing library or application-level checkpointing. One reasonable option is Condor's checkpointing library. Setting this up requires mucking with the SGE config, but doesn't look impossible.
GridEngine scheduling shares and such ("share tree" == average over time; "functional" is an instantaneous priority). At least one web page advocates the functional schedule, as it is easiest to explain (if many users have jobs pending, the cluster will be divided up proportionally among them). They said that in their experience, "share tree" sounds good in theory, but in practice the history it includes makes it harder for end users to reason about:
- http://bioteam.net/dag/sge6-funct-share-dept.html
- http://bioteam.net/dag/sge6-funcshare-1.jpg
- http://bioteam.net/dag/gridengine-6-features.html
- http://gridengine.sunsource.net/howto/geee.html
- http://gridengine.sunsource.net/nonav/source/browse/~checkout~/gridengine/source/daemons/schedd/schedd.html
- http://gridengine.info/articles/2005/09/30/pretty-pictures-explain-functional-vs-sharetree-scheduling
- http://gridengine.info/articles/2005/09/19/resource-allocation-overview
Other docs:
To restart the daemons, log in as the user "acgsge". To start the qmaster and scheduler, type the following on codex-l:
/home1/a/acgsge/sge/default/common/sgemaster
On each of the remaining cluster nodes, launch execd with the following command:
/home1/a/acgsge/sge/default/common/sgeexecd
After you run this, it should list the "execd" process running as user acgsge.
To set the number of slots for queue all.q on a machine (for example, acggrid20) to zero:
qconf -rattr queue slots 0 all.q@acggrid20
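To bring the machine back, set the slot count back to its normal value the same way (four here, assuming one slot per core):

qconf -rattr queue slots 4 all.q@acggrid20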
Some tips: http://gridengine.sunsource.net/howto/troubleshooting.html
Problem: a user is able to submit jobs, but they stay in the "pending" state for an indefinite amount of time.
The actual error you will get when you submit your job is like this:
Unable to run job: warning: your_username your job is not allowed to run in any queue
Your job your_jobid ("your_jobname") has been submitted.
Exiting.
(for each job you submit)
Solution: the submitting user may need to be added to the group of acg users using the "qmon" tool.
Problem: a job is stuck on the queue with status Eqw, which means that the job's directory was not given the correct group permissions. Solution: Just fixing the permissions will not solve the problem. You must kill the jobs, fix the permissions (chgrp acg your_dir; chmod g+rwx your_dir), then submit them again. This time they should work. See the section on group permissions above for how to avoid this.
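The sequence looks something like this (the job ID, directory, and script names are placeholders):

qdel your_jobid
chgrp acg your_dir
chmod g+rwx your_dir
qsub -cwd your_job_script.sh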
Problem: jobs don't start but instead spit out a seemingly infinite error stream that says "tset: standard error: Inappropriate ioctl for device". Solution: check your .login file for terminal setting problems. For example, consider the following .login code:
loop:
## If modem dialup or vt100, use vt100
if ($TERM == network || $TERM =~ *[vV][tT]*100) eval `tset -QIs vt100`
## If don't know, ask (default to vt100). Otherwise, use it.
if ($TERM == '' || $TERM == unknown) then   # don't know?
    eval `tset -QIs \?vt100`                # then ask (default vt100)
else
    eval `tset -QIs $TERM`                  # know? use it then
endif
if ($TERM == unknown || $TERM == '') goto loop
This code could cause problems because tset runs even when the job has no terminal attached.
Solution: wrap this code so it only runs when there is an interactive terminal:
if ( { [ -t ] } ) then
    (do the interactive-only stuff here)
endif
Unlike Condor's standard universe, SGE does not take a snapshot of the executable you specify when submitting jobs. This means that if half your jobs start running and the other half are queued up and you then change your executable, the jobs that have not yet started will execute the updated executable, not the original one.
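If that matters for an experiment, one workaround (just a sketch; the binary name and directory are placeholders) is to copy the executable to a job-specific location before submitting and run the copy:

#!/bin/sh
# sketch: snapshot the binary so later rebuilds don't affect queued jobs
RUNDIR=/mnt/eclipse/acg/users/yourname/run_$$
mkdir -p $RUNDIR
cp ./mysimulator $RUNDIR/
cd $RUNDIR
qsub -b y -cwd $RUNDIR/mysimulator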