Scheduler Overview (Slurm): Difference between revisions

From HPCwiki
Phase 1 § 4 P1.4.1: trim to overview only — content split into Partitions / Queues, Choosing a node (constraints), Batch Jobs, Interactive Jobs, Cancelling Jobs, Monitoring Jobs (separate pages). This page is now the entry point with QoS overview and topic index. (via update-page on MediaWiki MCP Server)
 
(116 intermediate revisions by 9 users not shown)
The resource allocation and scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement. This page is the entry point — most topics have their own page; below is a short summary plus links.


== What's on which page ==


* [[Partitions / Queues]] — list of partitions (<code>main</code>, <code>gpu</code>, <code>gpu_amd</code>) and how to choose one.
* [[Choosing a node (constraints)]] — defaults, hardware constraints, GPU selection.
* [[Batch Jobs]] — writing sbatch scripts and submitting them, including multi-job submissions and dependencies.
* [[Interactive Jobs]] — <code>sinteractive</code> and <code>salloc</code> for live shell sessions on a compute node.
* [[Array jobs]] — running the same script many times with a varying parameter.
* [[Monitoring Jobs]] — <code>squeue</code>, <code>scontrol</code>, <code>sstat</code>, <code>sacct</code>, <code>node_usage_graph</code>.
* [[Cancelling Jobs]] — <code>scancel</code>.
* [[Reservations]] — booking nodes in advance for events.


== Quality of Service ==


When submitting a job, you may optionally assign a different Quality of Service (QoS) to it:


<syntaxhighlight lang="bash">
#SBATCH --qos=std
</syntaxhighlight>

The QoS values configured on Anunna:


* '''std''' (priority 10) — the default. Use this unless you have a specific reason to pick another.
* '''low''' (priority 1) — reduced priority, but limited to 8 hours per job so a flood of low-priority jobs cannot lock up the cluster.
* '''high''' (priority 20) — higher priority than <code>std</code>. More expensive — see [[Tariffs]].
* '''interactive''' (priority 100) — the highest priority, reserved for interactive jobs that need to start immediately. You may not submit many or large jobs at this QoS.
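
As a minimal sketch (not an official Anunna template), the batch script below requests the <code>low</code> QoS and keeps <code>--time</code> well inside its 8-hour limit. It uses the default partition. The job name, output file name and the commented-out <code>./my_analysis</code> command are hypothetical placeholders, and the script assumes your Slurm version sets <code>SLURM_JOB_QOS</code> in the job environment; when run outside Slurm it simply prints <code>none</code> for both values.

<syntaxhighlight lang="bash">
#!/bin/bash
# Request the low-priority QoS (jobs at this QoS are limited to 8 hours).
#SBATCH --qos=low
# Keep the requested wall time inside that limit (HH:MM:SS).
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --job-name=low_qos_example
#SBATCH --output=output_%j.txt

# Print the job ID and the QoS that Slurm actually assigned. Outside Slurm
# (e.g. when run with plain `bash`) both variables are unset and
# "none" is printed instead.
report_qos() {
  echo "job=${SLURM_JOB_ID:-none} qos=${SLURM_JOB_QOS:-none}"
}

report_qos

# Hypothetical workload; replace with your own program:
# ./my_analysis input.dat
</syntaxhighlight>

Options given on the <code>sbatch</code> command line take precedence over the matching <code>#SBATCH</code> lines, so <code>sbatch --qos=high low_qos_example.sh</code> (file name hypothetical) would submit the same script at <code>high</code> instead. To inspect the configured QoS levels yourself, <code>sacctmgr show qos format=Name,Priority,MaxWall</code> should list them, assuming your account is allowed to query the accounting database.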


In principle, a running job can be preempted (stopped and rescheduled) when a higher-priority job needs its resources; at the time of writing, however, preemption is not configured on Anunna.


== Running MPI jobs ==


For multi-node MPI workloads see [[MPI on B4F cluster | MPI on Anunna]].


== See also ==


* [[Partitions / Queues]]
* [[Choosing a node (constraints)]]
* [[Batch Jobs]]
* [[Interactive Jobs]]
* [[Array jobs]]
* [[Monitoring Jobs]]
* [[Cancelling Jobs]]
* [[Reservations]]
* [[Tariffs | Costs associated with resource usage]]


== External links ==
 
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]

Latest revision as of 09:48, 16 June 2026
