Using Slurm
The resource allocation and scheduling software on the B4F Cluster is SLURM, the Simple Linux Utility for Resource Management.
Submitting jobs: sbatch
Example
Consider this simple Python 3 script that should calculate Pi to 1 million digits:
<source lang='python'>
from decimal import *

D = Decimal
# set the working precision of the Decimal context
getcontext().prec = 10000000
# Bailey-Borwein-Plouffe series for Pi
p = sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6)) for k in range(411))
print(str(p)[:10000002])
</source>
Loading modules
For this script to run, Python 3, which is not the default Python version on the cluster, first needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:
module avail
In the list you should see that Python 3 is indeed available; it can then be loaded with the following command:
module load python/3.3.3
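To verify that the module took effect, you can check which interpreter is now found on your PATH (a minimal check; the exact version string depends on the module installed on the cluster):
<source lang='bash'>
which python3       # should point to the module-provided interpreter
python3 --version   # should report the loaded Python 3 version
</source>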
Batch script
The following shell/Slurm script can then be used to schedule the job with the sbatch command:
<source lang='bash'>
#!/bin/bash
# Slurm directives: wall-time limit (in minutes), number of tasks, files for
# stdout/stderr (%j expands to the job id), job name, and partition
#SBATCH --time=100
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --partition=research

time python3 calc_pi.py
</source>
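By default sbatch passes the environment of the submitting shell on to the job, which is normally why loading the module before submitting is enough here. If you prefer the job not to depend on that, the module can also be loaded inside the batch script itself (a sketch, assuming the 'module' command is available in batch jobs, as it usually is):
<source lang='bash'>
#!/bin/bash
#SBATCH --time=100
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --partition=research

# load the interpreter on the compute node instead of inheriting it
module load python/3.3.3
time python3 calc_pi.py
</source>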
Submitting
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'> sbatch run_calc_pi.sh </source>
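On successful submission, sbatch prints a message of the form 'Submitted batch job <jobid>'. If you want to reuse that job id in follow-up commands, it can be captured, for example like this (a small convenience sketch, not required for submission):
<source lang='bash'>
# capture the job id from sbatch's "Submitted batch job <jobid>" message
jobid=$(sbatch run_calc_pi.sh | awk '{print $4}')
echo "Submitted as job $jobid"
</source>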
monitoring submitted jobs: squeue
Once a job is submitted, the status can be monitored using the 'squeue' command:
squeue
You should then get a list of the jobs running on the cluster at that time. For the 'sbatch' example above, it may look like this:
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 3396      ABGC BOV-WUR- megen002  R      27:26      1 node004
 3397      ABGC BOV-WUR- megen002  R      27:26      1 node005
 3398      ABGC BOV-WUR- megen002  R      27:26      1 node006
 3399      ABGC BOV-WUR- megen002  R      27:26      1 node007
 3400      ABGC BOV-WUR- megen002  R      27:26      1 node008
 3401      ABGC BOV-WUR- megen002  R      27:26      1 node009
 3385  research BOV-WUR- megen002  R      44:38      1 node049
 3386  research BOV-WUR- megen002  R      44:38      1 node050
 3387  research BOV-WUR- megen002  R      44:38      1 node051
 3388  research BOV-WUR- megen002  R      44:38      1 node052
 3389  research BOV-WUR- megen002  R      44:38      1 node053
 3390  research BOV-WUR- megen002  R      44:38      1 node054
 3391  research BOV-WUR- megen002  R      44:38      3 node[049-051]
 3392  research BOV-WUR- megen002  R      44:38      3 node[052-054]
 3393  research BOV-WUR- megen002  R      44:38      1 node001
 3394  research BOV-WUR- megen002  R      44:38      1 node002
 3395  research BOV-WUR- megen002  R      44:38      1 node003
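By default 'squeue' lists all jobs on the cluster; a few standard options narrow the listing down, for example:
<source lang='bash'>
squeue -u $USER      # show only your own jobs
squeue -j 3396       # show a single job by job id (taken from the listing above)
squeue -p research   # show only jobs in the 'research' partition
</source>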
removing jobs from a list: scancel
If for some reason you want to delete a job that is either waiting in the queue or already running, you can remove it using the 'scancel' command, which takes the job id as a parameter. For the example above, removing job 3396 would be done as follows: <source lang='bash'> scancel 3396 </source>
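Besides a single job id, 'scancel' can also select jobs by user or by job name; for example:
<source lang='bash'>
scancel -u $USER             # cancel all of your own jobs
scancel --name=calc_pi.py    # cancel jobs with the given job name
</source>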
allocating resources interactively: salloc
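As a minimal, hypothetical sketch (the exact options depend on what you need), an interactive allocation can be requested with 'salloc', after which 'srun' runs commands on the allocated resources:
<source lang='bash'>
# request an interactive allocation of one task; salloc starts a new shell in
# which Slurm environment variables such as SLURM_JOB_ID are set
salloc --ntasks=1
# within that allocation, start an interactive shell on the allocated node
srun --pty bash
</source>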
running MPI jobs on the B4F cluster
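As a rough sketch only (the program name and task count are placeholders, and how srun integrates with the cluster's MPI installation may differ), an MPI job is typically submitted with a batch script that requests several tasks and launches the program with 'srun':
<source lang='bash'>
#!/bin/bash
#SBATCH --ntasks=16
#SBATCH --partition=research
#SBATCH --output=mpi_output_%j.txt

# srun starts one process per requested task; './my_mpi_app' is a placeholder
srun ./my_mpi_app
</source>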
Understanding which resources are available to you: sinfo
By using the 'sinfo' command you can retrieve information on which 'partitions' are available to you. A partition in SLURM is similar to a 'queue' in the Sun Grid Engine (the system behind 'qsub'). The different partitions grant different levels of resource allocation, and not all defined partitions are available to every user. For example, Master's students only have the 'student' partition available, while researchers at the ABGC have the 'student', 'research', and 'ABGC' partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default partition is the 'student' partition. A full list of partitions can be found on the Bright Cluster Manager webpage.
<source lang='bash'> sinfo </source>
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
student*     up   infinite     12  down* node[043-048,055-060]
student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]
research     up   infinite     12  down* node[043-048,055-060]
research     up   infinite     50   idle fat[001-002],node[001-042,049-054]
ABGC         up   infinite     12  down* node[043-048,055-060]
ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]
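A job is directed to a particular partition with the '--partition' option, as in the batch script above; sinfo can likewise be limited to a single partition. For example (standard options, shown as a sketch):
<source lang='bash'>
sinfo -p research                       # show only the 'research' partition
sbatch --partition=ABGC run_calc_pi.sh  # choose the partition at submission time
</source>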