Scheduler Overview (Slurm): Difference between revisions

From HPCwiki
Bohme001 (talk | contribs)
No edit summary
Phase 1 § 4 P1.4.1: trim to overview only — content split into Partitions / Queues, Choosing a node (constraints), Batch Jobs, Interactive Jobs, Cancelling Jobs, Monitoring Jobs (separate pages). This page is now the entry point with QoS overview and topic index. (via update-page on MediaWiki MCP Server)
 
(57 intermediate revisions by 8 users not shown)
Line 1: Line 1:
The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.
The resource allocation and scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement. This page is the entry point — most topics have their own page; below is a short summary plus links.


== What's on which page ==


== Queues and defaults ==
* [[Partitions / Queues]] — list of partitions (<code>main</code>, <code>gpu</code>, <code>gpu_amd</code>) and how to choose one.
* [[Choosing a node (constraints)]] — defaults, hardware constraints, GPU selection.
* [[Batch Jobs]] — writing sbatch scripts and submitting them, including multi-job submissions and dependencies.
* [[Interactive Jobs]] — <code>sinteractive</code> and <code>salloc</code> for live shell sessions on a compute node.
* [[Array jobs]] — running the same script many times with a varying parameter.
* [[Monitoring Jobs]] — <code>squeue</code>, <code>scontrol</code>, <code>sstat</code>, <code>sacct</code>, <code>node_usage_graph</code>.
* [[Cancelling Jobs]] — <code>scancel</code>.
* [[Reservations]] — booking nodes in advance for events.


=== Queues ===
== Quality of Service ==
Every organization has two queues (called partitions in Slurm): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).
To find out which queues your account has been authorized for, type sinfo:
<source lang='bash'>
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
ABGC_Production    up  infinite    12  down* node[043-048,055-060]
ABGC_Production    up  infinite      6    mix fat[001-002],node[002-005]
ABGC_Production    up  infinite    44  idle node[001,006-042,049-054]
ABGC_Research      up  infinite    12  down* node[043-048,055-060]
ABGC_Research      up  infinite      6    mix fat[001-002],node[002-005]
ABGC_Research      up  infinite    44  idle node[001,006-042,049-054]
ABGC_Student      up  infinite    12  down* node[043-048,055-060]
ABGC_Student      up  infinite      6    mix fat[001-002],node[002-005]
ABGC_Student      up  infinite    44  idle node[001,006-042,049-054]
</source>
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.


=== Defaults ===
When submitting a job, you may optionally assign a different Quality of Service (QoS) to it:
There is no default queue, so you need to specify which queue to use when submitting a job.
The default run time for a job is 1 hour!


<syntaxhighlight lang="bash">
#SBATCH --qos=std
</syntaxhighlight>


== Submitting jobs: sbatch ==
The QoS values configured on Anunna:


=== Example ===
* '''std''' (priority 10) — the default. Use this unless you have a specific reason to pick another.
Consider this simple python3 script that should calculate Pi to 1 million digits:
* '''low''' (priority 1) — reduced priority, but limited to 8 hours per job so a flood of low-priority jobs cannot lock up the cluster.
<source lang='python'>
* '''high''' (priority 20) — higher priority than <code>std</code>. More expensive — see [[Tariffs]].
from decimal import *
* '''interactive''' (priority 100) — the highest priority, exclusively for immediate-running interactive jobs. You may not submit many or large jobs at this QoS.
D=Decimal
getcontext().prec=10000000
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))
print(str(p)[:10000002])
</source>


=== Loading modules ===
Jobs can in principle be restarted and rescheduled if a higher-priority job needs cluster resources, but at the time of writing this preemption is not actually configured.
For this script to run, Python 3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of software (and of its different versions) can be checked with the following command:
  module avail


The list should show that Python 3 is available; it can then be loaded with the following command:
== Running MPI jobs ==
  module load python/3.3.3


=== Batch script ===
For multi-node MPI workloads see [[MPI on B4F cluster | MPI on Anunna]].
The following shell/slurm script can then be used to schedule the job using the sbatch command:
<source lang='bash'>
#!/bin/bash
#SBATCH --account=773320000
#SBATCH --time=1200
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --partition=ABGC
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl


== See also ==


time python3 calc_pi.py
* [[Partitions / Queues]]
</source>
* [[Choosing a node (constraints)]]
Explanation of used SBATCH parameters:
* [[Batch Jobs]]
<source lang='bash'>
* [[Interactive Jobs]]
#SBATCH --account=773320000
* [[Array jobs]]
</source>
* [[Monitoring Jobs]]
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. WUR users are advised to use a project number or KTP number.
* [[Cancelling Jobs]]
<source lang='bash'>
* [[Reservations]]
#SBATCH --time=1200
* [[Tariffs | Costs associated with resource usage]]
</source>
A time limit of zero requests that no time limit be imposed. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". So in this example the job will run for a maximum of 1200 minutes.
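To make the accepted formats concrete, the following is a minimal sketch in bash that converts a time-limit string into seconds. It is not part of Slurm: the function name <code>slurm_time_to_seconds</code> is hypothetical, and it assumes the input is one of the six formats listed above.

```bash
#!/bin/bash
# Hypothetical helper (not a Slurm command): convert a --time value to seconds.
slurm_time_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0 rest
    if [[ $spec == *-* ]]; then
        # Forms with a day prefix: days-hours, days-hours:minutes, days-hours:minutes:seconds
        days=${spec%%-*}
        rest=${spec#*-}
        IFS=: read -r h m s <<< "$rest"
    else
        rest=$spec
        case $rest in
            *:*:*) IFS=: read -r h m s <<< "$rest" ;;  # hours:minutes:seconds
            *:*)   IFS=: read -r m s <<< "$rest" ;;    # minutes:seconds
            *)     m=$rest ;;                           # minutes
        esac
    fi
    # 10# forces base 10 so that fields such as "08" are not read as octal;
    # missing fields default to 0.
    echo $(( 10#${days:-0}*86400 + 10#${h:-0}*3600 + 10#${m:-0}*60 + 10#${s:-0} ))
}

slurm_time_to_seconds 1200         # prints 72000 (1200 minutes)
slurm_time_to_seconds 3-08:00:00   # prints 288000 (3 days and 8 hours)
```

Note that a bare number is read as minutes, not seconds, which is why <code>--time=1200</code> means 20 hours.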
<source lang='bash'>
#SBATCH --ntasks=1
</source>
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most the given number of tasks, so that sufficient resources can be provided. The default is one task per node, but note that the --cpus-per-task option will change this default.
<source lang='bash'>
#SBATCH --output=output_%j.txt
</source>
Instruct SLURM to connect the batch script's standard output directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.
<source lang='bash'>
#SBATCH --error=error_output_%j.txt
</source>
Instruct SLURM to connect the batch script's standard error directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.
<source lang='bash'>
#SBATCH --job-name=calc_pi.py
</source>
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just "sbatch" if the script is read on sbatch's standard input.
<source lang='bash'>
#SBATCH --partition=ABGC
</source>
Request a specific partition for the resource allocation. It is preferred to use your organization's partition.
<source lang='bash'>
#SBATCH --mail-type=ALL
</source>
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.
<source lang='bash'>
#SBATCH --mail-user=email@org.nl
</source>
Email address to use.


=== Submitting ===
== External links ==
The script, assuming it is named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'>
sbatch run_calc_pi.sh
</source>
 
=== Submitting multiple jobs ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:
<source lang='bash'>for i in $(seq 1 10); do echo "$i"; sbatch "runscript_$i.sh"; done
</source>
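A slightly more defensive variant of the same loop can be sketched as a function. This is an illustration only: the function name <code>submit_all</code> and the <code>SUBMIT_CMD</code> variable are hypothetical and were introduced here so that the loop can be tried as a dry run before anything is sent to the scheduler.

```bash
#!/bin/bash
# Submit runscript_<first>.sh through runscript_<last>.sh, skipping missing files.
# SUBMIT_CMD is a hypothetical override for dry runs; it defaults to sbatch.
submit_all() {
    local first=$1 last=$2 i script
    for i in $(seq "$first" "$last"); do
        script="runscript_${i}.sh"
        if [[ ! -f $script ]]; then
            echo "skipping $script: file not found" >&2
            continue
        fi
        "${SUBMIT_CMD:-sbatch}" "$script"
    done
}
```

Running <code>SUBMIT_CMD=echo submit_all 1 10</code> only prints the script names that would be submitted; <code>submit_all 1 10</code> calls <code>sbatch</code> for each script that exists.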
 
== Monitoring submitted jobs: squeue ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.
 
=== Generic monitoring of all running jobs ===
<source lang='bash'>
  squeue
</source>
 
This lists the jobs that are currently running on the cluster. For jobs submitted with the 'sbatch' command as in the example, the output may look like this:
    JOBID PARTITION    NAME    USER  ST      TIME  NODES NODELIST(REASON)
  3396      ABGC BOV-WUR- megen002  R      27:26      1 node004
  3397      ABGC BOV-WUR- megen002  R      27:26      1 node005
  3398      ABGC BOV-WUR- megen002  R      27:26      1 node006
  3399      ABGC BOV-WUR- megen002  R      27:26      1 node007
  3400      ABGC BOV-WUR- megen002  R      27:26      1 node008
  3401      ABGC BOV-WUR- megen002  R      27:26      1 node009
  3385  research BOV-WUR- megen002  R      44:38      1 node049
  3386  research BOV-WUR- megen002  R      44:38      1 node050
  3387  research BOV-WUR- megen002  R      44:38      1 node051
  3388  research BOV-WUR- megen002  R      44:38      1 node052
  3389  research BOV-WUR- megen002  R      44:38      1 node053
  3390  research BOV-WUR- megen002  R      44:38      1 node054
  3391  research BOV-WUR- megen002  R      44:38      3 node[049-051]
  3392  research BOV-WUR- megen002  R      44:38      3 node[052-054]
  3393  research BOV-WUR- megen002  R      44:38      1 node001
  3394  research BOV-WUR- megen002  R      44:38      1 node002
  3395  research BOV-WUR- megen002  R      44:38      1 node003
 
=== Monitoring time limit set for a specific job ===
The default time limit is one hour, so an estimated run time needs to be specified when submitting a job. The time limit set for a particular job can be shown with the <code>squeue</code> command:
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
  Fri Nov 29 15:41:00 2013
  JOBID PARTITION    NAME    USER    STATE      TIME TIMELIMIT  NODES NODELIST(REASON)
  3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054
 
== Query a specific active job: scontrol ==
Show all the details of a currently active job. This does not cover completed jobs.
<source lang='bash'>
nfs01 ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
  UserId=megen002(16795409) GroupId=domain users(16777729)
  Priority=1 Account=(null) QOS=normal
  JobState=RUNNING Reason=None Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
  RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
  SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
  StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
  PreemptTime=None SuspendTime=None SecsPreSuspend=0
  Partition=research AllocNode:Sid=nfs01:21799
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList=node023
  BatchHost=node023
  NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
  MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
  Features=(null) Gres=(null) Reservation=(null)
  Shared=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=/lustre/scratch/WUR/ABGC/...
  WorkDir=/lustre/scratch/WUR/ABGC/...
</source>
 
== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
<source lang='bash'>
scancel 3401
</source>
 
== Allocating resources interactively: salloc ==
< text here>
 
== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<source lang='bash'>
sacct
</source>
This should provide information similar to the following:
 
        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
  ------------ ---------- ---------- ---------- ---------- ---------- --------
  3385        BOV-WUR-58  research                    12  COMPLETED      0:0
  3385.batch        batch                                1  COMPLETED      0:0
  3386        BOV-WUR-59  research                    12 CANCELLED+      0:0
  3386.batch        batch                                1  CANCELLED    0:15
  3528        BOV-WUR-59      ABGC                    16    RUNNING      0:0
  3529        BOV-WUR-60      ABGC                    16    RUNNING      0:0
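When the list is long, it can help to filter out the jobs that did not run to completion. The following is a minimal sketch, assuming the default column layout shown above, in which State is the second-to-last column; the function name <code>not_completed</code> is hypothetical.

```bash
#!/bin/bash
# Hypothetical filter: read sacct's default table on stdin and print the
# job ID and state of every entry whose state is not COMPLETED.
not_completed() {
    awk '
        # Skip blank lines, the header line and the ruler of dashes.
        NF < 2 || $1 == "JobID" || $1 ~ /^-+$/ { next }
        # State is assumed to be the second-to-last column.
        $(NF-1) != "COMPLETED" { print $1, $(NF-1) }
    '
}
```

It would be used as <code>sacct | not_completed</code>. If the columns are changed with <code>--format</code>, the position of the State column changes too and the filter has to be adapted.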
 
Or in more detail for a specific job:
<source lang='bash'>
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</source>
This should provide information about job id 4220:
 
      JobID    JobName    Account  Partition  NTasks  AllocCPUS    Elapsed      State ExitCode
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
  4220        PreProces+              research                  3  00:30:52  COMPLETED      0:0
  4220.batch        batch                              1          1  00:30:52  COMPLETED      0:0
 
== Running MPI jobs on B4F cluster ==


[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]
< text here >
== Understanding which resources are available to you: sinfo ==
The 'sinfo' command shows which partitions are available to you. A partition in SLURM is similar to a queue in Sun Grid Engine (where jobs are submitted with 'qsub'). The different partitions grant different levels of resource allocation, and not every partition is available to every user. For example, Master students only have the 'student' partition available, while researchers at the ABGC have the 'student', 'research', and 'ABGC' partitions available. The higher the level of resource allocation, the higher the cost per compute-hour. The default partition is the 'student' partition. A full list of partitions can be found on the Bright Cluster Manager webpage.
<source lang='bash'>
sinfo
</source>
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  student*    up  infinite    12  down* node[043-048,055-060]
  student*    up  infinite    50  idle fat[001-002],node[001-042,049-054]
  research    up  infinite    12  down* node[043-048,055-060]
  research    up  infinite    50  idle fat[001-002],node[001-042,049-054]
  ABGC        up  infinite    12  down* node[043-048,055-060]
  ABGC        up  infinite    50  idle fat[001-002],node[001-042,049-054]
== See also ==
* [[B4F_cluster | B4F Cluster]]
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]
== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]

Latest revision as of 09:48, 16 June 2026

The resource allocation and scheduling software on Anunna is SLURM: Simple Linux Utility for Resource Management. This page is the entry point — most topics have their own page; below is a short summary plus links.

What's on which page

Quality of Service

When submitting a job, you may optionally assign a different Quality of Service (QoS) to it:

#SBATCH --qos=std

The QoS values configured on Anunna:

  • std (priority 10) — the default. Use this unless you have a specific reason to pick another.
  • low (priority 1) — reduced priority, but limited to 8 hours per job so a flood of low-priority jobs cannot lock up the cluster.
  • high (priority 20) — higher priority than std. More expensive — see Tariffs.
  • interactive (priority 100) — the highest priority, exclusively for immediate-running interactive jobs. You may not submit many or large jobs at this QoS.

Jobs can in principle be restarted and rescheduled if a higher-priority job needs cluster resources, but at the time of writing this preemption is not actually configured.
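
The QoS names and the 8-hour limit listed above can be checked before a job is submitted. The following is a minimal sketch in bash; the helper name check_qos is hypothetical, it takes the requested run time in whole hours, and the scheduler itself remains the authority on what is accepted.

```bash
#!/bin/bash
# Hypothetical pre-submission check based only on the QoS list on this page.
# Usage: check_qos <qos-name> <requested-hours>
check_qos() {
    local qos=$1 hours=$2
    case $qos in
        std|high|interactive)
            return 0 ;;
        low)
            # The low QoS is limited to 8 hours per job.
            (( hours <= 8 )) && return 0
            echo "low QoS allows at most 8 hours, requested $hours" >&2
            return 1 ;;
        *)
            echo "unknown QoS: $qos" >&2
            return 1 ;;
    esac
}
```

For example, check_qos low 6 succeeds, while check_qos low 12 fails with a message on standard error.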

Running MPI jobs

For multi-node MPI workloads see MPI on Anunna.

See also