Scheduler Overview (Slurm): Difference between revisions

From HPCwiki
Phase 1 § 4 P1.4.1: trim to overview only — content split into Partitions / Queues, Choosing a node (constraints), Batch Jobs, Interactive Jobs, Cancelling Jobs, Monitoring Jobs (separate pages). This page is now the entry point with QoS overview and topic index. (via update-page on MediaWiki MCP Server)
 
(117 intermediate revisions by 9 users not shown)
The resource allocation and scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement. This page is the entry point — most topics have their own page; below is a short summary plus links.

== What's on which page ==

* [[Partitions / Queues]] — list of partitions (<code>main</code>, <code>gpu</code>, <code>gpu_amd</code>) and how to choose one.
* [[Choosing a node (constraints)]] — defaults, hardware constraints, GPU selection.
* [[Batch Jobs]] — writing sbatch scripts and submitting them, including multi-job submissions and dependencies.
* [[Interactive Jobs]] — <code>sinteractive</code> and <code>salloc</code> for live shell sessions on a compute node.
* [[Array jobs]] — running the same script many times with a varying parameter.
* [[Monitoring Jobs]] — <code>squeue</code>, <code>scontrol</code>, <code>sstat</code>, <code>sacct</code>, <code>node_usage_graph</code>.
* [[Cancelling Jobs]] — <code>scancel</code>.
* [[Reservations]] — booking nodes in advance for events.

== submitting jobs: sbatch ==

<source lang='python'>
# Sum the first 411 terms of the Bailey-Borwein-Plouffe (BBP) series for pi.
from decimal import Decimal as D, getcontext

getcontext().prec = 10000000  # working precision, in significant digits
p = sum(D(1)/16**k * (D(4)/(8*k+1) - D(2)/(8*k+4) - D(1)/(8*k+5) - D(1)/(8*k+6)) for k in range(411))
print(str(p)[:10000002])
</source>


<source lang='bash'>
#!/bin/bash
# #SBATCH --time=100
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --partition=research

time python3 calc_pi.py
</source>

  JOBID PARTITION    NAME    USER  ST      TIME  NODES NODELIST(REASON)
  3347  research calc_pi. megen002  R      0:03      1 node049

== Quality of Service ==

When submitting a job, you may optionally assign a different Quality of Service (QoS) to it:

<syntaxhighlight lang="bash">
#SBATCH --qos=std
</syntaxhighlight>


The QoS values configured on Anunna:


* '''std''' (priority 10) — the default. Use this unless you have a specific reason to pick another.
* '''low''' (priority 1) — reduced priority, but limited to 8 hours per job so a flood of low-priority jobs cannot lock up the cluster.
* '''high''' (priority 20) — higher priority than <code>std</code>. More expensive — see [[Tariffs]].
* '''interactive''' (priority 100) — the highest priority, exclusively for immediate-running interactive jobs. You may not submit many or large jobs at this QoS.


Jobs can in principle be restarted and rescheduled if a higher-priority job needs cluster resources, but at the time of writing this preemption is not actually configured.
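
The QoS rules above can be checked before a job is submitted. The following is a minimal Python sketch, not an Anunna tool: the <code>QOS</code> table copies the priorities and the 8-hour cap from the list above, while the function name <code>sbatch_args</code> and the script name <code>job.sh</code> are hypothetical. It only builds an argument list for <code>sbatch</code>, using its standard <code>--qos</code> and <code>--time</code> options, and does not submit anything.

```python
# Sketch only: QoS names, priorities and the 8-hour cap are copied from the
# list above; sbatch_args and job.sh are hypothetical names.
# max_hours=None means this page states no per-job limit for that QoS.
QOS = {
    "std": {"priority": 10, "max_hours": None},
    "low": {"priority": 1, "max_hours": 8},
    "high": {"priority": 20, "max_hours": None},
    "interactive": {"priority": 100, "max_hours": None},
}


def sbatch_args(script, qos="std", hours=1):
    """Return an sbatch argument list for a job of `hours` whole hours.

    Raises ValueError for an unknown QoS or a run time above its cap.
    """
    if qos not in QOS:
        raise ValueError(f"unknown QoS: {qos}")
    limit = QOS[qos]["max_hours"]
    if limit is not None and hours > limit:
        raise ValueError(f"QoS {qos} allows at most {limit} hours per job")
    # --time is given here in hours:minutes:seconds form
    return ["sbatch", f"--qos={qos}", f"--time={int(hours)}:00:00", script]


if __name__ == "__main__":
    print(" ".join(sbatch_args("job.sh", qos="low", hours=8)))
```

Run as a script, the sketch prints <code>sbatch --qos=low --time=8:00:00 job.sh</code>. Asking for 9 hours at <code>low</code> raises a <code>ValueError</code> instead of producing a command the scheduler would reject.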


== Running MPI jobs ==


For multi-node MPI workloads see [[MPI on B4F cluster | MPI on Anunna]].


== See also ==
 
* [[Partitions / Queues]]
* [[Choosing a node (constraints)]]
* [[Batch Jobs]]
* [[Interactive Jobs]]
* [[Array jobs]]
* [[Monitoring Jobs]]
* [[Cancelling Jobs]]
* [[Reservations]]
* [[Tariffs | Costs associated with resource usage]]
 
== External links ==
 
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]

Latest revision as of 09:48, 16 June 2026
