Using Slurm

The resource allocation / scheduling software on Anunna is SLURM: Simple Linux Utility for Resource Management.


Queues and defaults

Quality of Service

When submitting a job, you may optionally assign a different Quality of Service to it. You can do this with:

#SBATCH --qos=std

By default, jobs will use std, the standard quality.

Optionally, you may elect to reduce the priority of your jobs to low. This comes with a limit on how long each job may run (8 hours), to prevent the cluster from being locked up entirely by low-priority jobs.

The high QoS gives jobs a higher priority (20) than std (10) or low (1). It is, naturally, more expensive.

The highest priority goes to jobs with the interactive QoS (100), but you may not submit many jobs, or many large jobs, at this quality. It is reserved exclusively for jobs that run immediately and have a hands-on user behind them.

Jobs may be restarted and rescheduled if a job with higher priority needs cluster resources, but at the moment this does not happen.
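
For example, a job that opts into the low QoS could declare it, together with a run time within the 8-hour limit, like this (the job name and command are placeholders):

#!/bin/bash
#SBATCH --qos=low            # reduced priority (1); jobs limited to 8 hours
#SBATCH --time=08:00:00      # stay within the low-QoS time limit
#SBATCH --job-name=low_priority_example

./my_analysis.sh             # hypothetical workload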

Queues

The cluster consists of multiple partitions of nodes that you can submit to. The primary one is 'main'. Other partitions exist for specific hardware, such as 'gpu' for the GPU nodes described below.

You can see the partitions available with `sinfo`:
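
For example:

 sinfo

The output lists each partition with its availability, time limit, node count, state and node list; the default partition is marked with an asterisk (*).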

Defaults

The default partition is 'main'. This will work for most jobs.

The default QoS is 'std'.

The default CPU count is 1.

The default run time for a job is 1 hour.

The default memory limit is 100 MB per node.
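
These defaults can be overridden per job by setting the corresponding options explicitly in your job script, for example (the values here are only illustrative):

#SBATCH --partition=main
#SBATCH --qos=std
#SBATCH --cpus-per-task=2
#SBATCH --time=04:00:00      # 4 hours instead of the 1-hour default
#SBATCH --mem=4096           # 4 GB per node instead of the 100 MB default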


Submitting jobs: sbatch

Example

Consider this simple python3 script that should calculate Pi to 1 million digits:

from decimal import *

D = Decimal
getcontext().prec = 10000000  # set the working precision

# Bailey-Borwein-Plouffe-style series for pi
p = sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6)) for k in range(411))
print(str(p)[:10000002])

Loading modules

For this script to run, Python 3 (which is not the default Python version on the cluster) must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:

 module avail

The list should show that python3 is indeed available; it can then be loaded with the following command:

 module load python/3.3.3
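
To check which modules are currently loaded in your environment, the module system also provides:

 module list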

Batch script

Main article: Creating an sbatch script

The following shell/SLURM script can then be used to schedule the job using the sbatch command:

#!/bin/bash
#SBATCH --comment=773320000              # arbitrary comment string (here a project number)
#SBATCH --time=1200                      # maximum run time, here 1200 minutes
#SBATCH --mem=2048                       # memory limit per node, in MB
#SBATCH --cpus-per-task=1                # number of CPU cores for the task
#SBATCH --output=output_%j.txt           # stdout; %j is replaced by the job ID
#SBATCH --error=error_output_%j.txt      # stderr; %j is replaced by the job ID
#SBATCH --job-name=calc_pi.py            # name shown when querying the queue
#SBATCH --mail-type=ALL                  # email on BEGIN, END, FAIL and REQUEUE
#SBATCH --mail-user=email@org.nl         # address to notify

time python3 calc_pi.py

Submitting

The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:

sbatch run_calc_pi.sh
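
sbatch replies with the ID assigned to the job, with output similar to:

 Submitted batch job 740338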

Submitting multiple jobs (simple)

Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted with the following line of shell code:

for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done

Submitting multiple jobs (complex)

Let's say you have three job scripts that depend on each other:

job_1.sh #A simple initialisation script
job_2.sh #An array task
job_3.sh #Some finishing script, single run, after everything previous has finished

You can create a script that submits all three at once, each with a dependency on the previous one:

#!/bin/bash
# Capture the job ID: the last space-separated word of "Submitted batch job <id>"
JOB1=$(sbatch job_1.sh | rev | cut -d ' ' -f 1 | rev)

if ! [ "z$JOB1" == "z" ] ; then
  echo "First job submitted as jobid $JOB1"
  JOB2=$(sbatch --dependency=afterany:$JOB1 job_2.sh | rev | cut -d ' ' -f 1 | rev)

  if ! [ "z$JOB2" == "z" ] ; then
    echo "Second job submitted as jobid $JOB2, following $JOB1"
    JOB3=$(sbatch --dependency=afterany:$JOB2 job_3.sh | rev | cut -d ' ' -f 1 | rev)

    if ! [ "z$JOB3" == "z" ] ; then
      echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"
    fi
  fi
fi

This ensures that each subsequent job starts only after the previous one has finished in some way (even if it failed).

Please see the sbatch documentation (https://slurm.schedmd.com/sbatch.html#OPT_dependency) for the other dependency options available to you. Note that aftercorr makes each array element of a subsequent array job start only after the correspondingly numbered element of the previous array job has completed successfully.
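
For instance, with the job-ID variables from the script above, a follow-up array job whose elements should each wait for the matching element of $JOB2 could be submitted like this (job_2b.sh is a hypothetical second array script):

 sbatch --dependency=aftercorr:$JOB2 job_2b.sh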

Submitting array jobs

#SBATCH --array=0-10%4

SLURM allows you to submit many similar jobs from a single template; the directive above requests array tasks 0 through 10, with at most 4 of them running at the same time (that is the %4 part). Further information about array jobs can be found on the Array jobs page.
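
A minimal sketch of such an array job script, assuming hypothetical numbered input files and a processing script:

#!/bin/bash
#SBATCH --array=0-10%4               # tasks 0..10, at most 4 at a time
#SBATCH --time=60
#SBATCH --mem=1024
#SBATCH --output=array_%A_%a.txt     # %A = array job ID, %a = array task index

# Each task processes the input file matching its own index
python3 process_chunk.py input_${SLURM_ARRAY_TASK_ID}.dat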

Using /tmp

Attached to each node there is a local disk of ~300 GB that can be used to temporarily stage some of your workload. It is free to use, but please remember to clean up your data after use.

To be sure that you will actually have space available in /tmp, you can add

#SBATCH --tmp=<required size>

To your sbatch script. This prevents your job from being scheduled on nodes where there is not enough free space, or where the space has already been claimed by another job.
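
A minimal sketch of a job that stages its data on the node-local disk; file names, sizes and the analysis command are placeholders:

#!/bin/bash
#SBATCH --tmp=50G                    # require 50 GB of free space in /tmp
#SBATCH --time=120
#SBATCH --mem=2048

WORKDIR=/tmp/$SLURM_JOB_ID           # job-specific scratch directory
mkdir -p "$WORKDIR"
cp input.dat "$WORKDIR"/             # stage the input onto the local disk
cd "$WORKDIR"
./my_analysis input.dat > results.out
cp results.out "$SLURM_SUBMIT_DIR"/  # copy results back to the submission directory
rm -rf "$WORKDIR"                    # clean up /tmp when done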

Using GPU

There are six GPU nodes. In order to run a job that uses a GPU on one of these nodes, you can add

#SBATCH --gres=gpu:<num gpus>
#SBATCH --partition=gpu

To your sbatch script. Without the partition parameter, your job won't run on one of these nodes. Be sure to also add the gres line; otherwise your job will either fail or run on the CPU instead of the GPU.

As we have different flavours of GPUs, you might want to select a specific type or manufacturer. If you don't, you will get whichever type is available.

To see which types are available, run:

scontrol show -o node | grep -o -e "NodeName=\w*" -e "ActiveFeatures=[[:alnum:][:punct:]]*" | paste - - | column -t | grep gpu

To select a certain type, use the flag:

#SBATCH --constraint

Example:

# This will limit this job to the A100 GPUs
#SBATCH --constraint='nvidia&A100'

A rough estimate is that the A100/80G cards are about twice as fast as the A6000/48G or the V100/16G, but this all depends on whether your analyses actually need the RAM and can completely fill the GPU. We have set up the scheduler in such a way that the A100s are chosen first, then the A6000s, and lastly the V100s. The pricing for all of them is the same.

Please use the nvidia constraint if your jobs only run on NVIDIA hardware: we will put AMD GPUs online in the future, and without the constraint your jobs could then be scheduled onto those and will probably break.
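
Putting this together, a GPU job script could start like this (the job name and the final command are placeholders):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1                 # request a single GPU
#SBATCH --constraint=nvidia          # stay on NVIDIA hardware
#SBATCH --time=240
#SBATCH --mem=16384
#SBATCH --job-name=gpu_example

nvidia-smi                           # show which GPU was allocated
python3 train_model.py               # hypothetical GPU workload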


Monitoring submitted jobs

Once a job has been submitted, its status can be monitored using the squeue command. squeue has a number of parameters for monitoring specific properties of a job, such as its time limit.

Generic monitoring of all running jobs

  squeue

You should then get a list of the jobs running on the cluster at that time. For the 'sbatch' example above, it may look like this:

   JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
  3396      ABGC BOV-WUR- megen002   R      27:26      1 node004
  3397      ABGC BOV-WUR- megen002   R      27:26      1 node005
  3398      ABGC BOV-WUR- megen002   R      27:26      1 node006
  3399      ABGC BOV-WUR- megen002   R      27:26      1 node007
  3400      ABGC BOV-WUR- megen002   R      27:26      1 node008
  3401      ABGC BOV-WUR- megen002   R      27:26      1 node009
  3385  research BOV-WUR- megen002   R      44:38      1 node049
  3386  research BOV-WUR- megen002   R      44:38      1 node050
  3387  research BOV-WUR- megen002   R      44:38      1 node051
  3388  research BOV-WUR- megen002   R      44:38      1 node052
  3389  research BOV-WUR- megen002   R      44:38      1 node053
  3390  research BOV-WUR- megen002   R      44:38      1 node054
  3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]
  3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]
  3393  research BOV-WUR- megen002   R      44:38      1 node001
  3394  research BOV-WUR- megen002   R      44:38      1 node002
  3395  research BOV-WUR- megen002   R      44:38      1 node003

Monitoring time limit set for a specific job

The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a specific job can be inspected with the squeue command.

squeue -l -j 3532

Information similar to the following should appear:

 Fri Nov 29 15:41:00 2013
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
  3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054

Query a specific active job: scontrol

scontrol shows all the details of a currently active job (not of a completed job).

login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
   UserId=megen002(16795409) GroupId=domain users(16777729)
   Priority=1 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=research AllocNode:Sid=login0:21799
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node023
   BatchHost=node023
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/scratch/WUR/ABGC/...
   WorkDir=/lustre/scratch/WUR/ABGC/...

Check on a pending job

A submitted job can end up in a pending state when there are not enough resources available for it. In this example I submit a job, check its status, and, after finding out that it is pending, check when it will probably start.

[@login jobs]$ sbatch hpl_student.job
 Submitted batch job 740338

[@login jobs]$ squeue -l -j 740338
 Fri Feb 21 15:32:31 2014
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
 740338 ABGC_Stud HPLstude bohme999  PENDING       0:00 1-00:00:00      1 (ReqNodeNotAvail)

[@login jobs]$ squeue --start -j 740338
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)
 740338 ABGC_Stud HPLstude bohme999  PD  2014-02-22T15:31:48      1 (ReqNodeNotAvail)

So it seems this job will probably start the next day, but that is no guarantee that it actually will.

Removing jobs from a list: scancel

If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:

scancel 3401

Allocating resources interactively: sinteractive

sinteractive is a tiny wrapper around srun for creating interactive jobs quickly and easily. It gives you a shell on one of the compute nodes, with similar limits to those of a normal job. To use it, simply run:

sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>

You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code interactively as need be.

Be advised, though, that not filling in the options above will get you a shell with 1 CPU and 100 MB of RAM for 1 hour. This is still useful for quick testing.
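
For example, to get a 2-core shell with 4 GB of memory for two hours on the main partition (the values are illustrative):

 sinteractive -c 2 --mem 4096 --time 120 -p main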

sinteractive source

#!/bin/bash
# -I60: give up if no allocation is free within 60 seconds; -N 1 -n 1: one task on one node; --pty bash -i: interactive shell
srun "$@" -I60 -N 1 -n 1 --pty bash -i

Interactive Slurm - using salloc

If you don't want to be transported to a compute node but still want a resource allocation to run commands in, do:

salloc -p main $SHELL

Now your shell will stay on the login node, but you can do:

srun <command> &

To launch tasks inside this new allocation!

Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, you need to request more time explicitly. See: https://computing.llnl.gov/linux/slurm/salloc.html
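
For example, to request a four-hour allocation instead (the partition and time are illustrative):

 salloc -p main --time=04:00:00 $SHELL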

Get overview of past and current jobs: sacct

To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:

sacct

This should provide information similar to the following:

        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
 ------------ ---------- ---------- ---------- ---------- ---------- -------- 
 3385         BOV-WUR-58   research                    12  COMPLETED      0:0 
 3385.batch        batch                                1  COMPLETED      0:0 
 3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 
 3386.batch        batch                                1  CANCELLED     0:15 
 3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 
 3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0

Or in more detail for a specific job:

sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220

This should provide information about job id 4220:

      JobID    JobName    Comment   Partition   NTasks  AllocCPUS    Elapsed      State ExitCode 
 ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- 
 4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 
 4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0
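
sacct can also report how much memory a finished job actually used, which helps when choosing a value for --mem (the reported MaxRSS is in KB):

 sacct -o MaxRSS -j <jobid>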

Job Status Codes

Typically your job will be in either the Running or the PenDing state. However, here is a breakdown of all the states that your job could be in.

 Code  State        Description
 CA    CANCELLED    Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
 CD    COMPLETED    Job has terminated all processes on all nodes.
 CF    CONFIGURING  Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
 CG    COMPLETING   Job is in the process of completing. Some processes on some nodes may still be active.
 F     FAILED       Job terminated with a non-zero exit code or other failure condition.
 NF    NODE_FAIL    Job terminated due to failure of one or more allocated nodes.
 PD    PENDING      Job is awaiting resource allocation.
 R     RUNNING      Job currently has an allocation.
 S     SUSPENDED    Job has an allocation, but execution has been suspended.
 TO    TIMEOUT      Job terminated upon reaching its time limit.

Running MPI jobs on Anunna

Main article: MPI on Anunna

See also

Costs associated with resource usage
Anunna
BCM on Anunna
SLURM compared to other common schedulers
Setting up and using a virtual environment for Python3

External links

SLURM on Wikipedia: http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management
sbatch documentation: https://slurm.schedmd.com/sbatch.html#OPT_dependency
salloc documentation: https://computing.llnl.gov/linux/slurm/salloc.html