Batch Jobs

From HPCwiki
Revision as of 09:28, 16 June 2026 by Haars0011 (talk | contribs) (Phase 1 § 4 P1.4.4: merge Creating sbatch script + Using Slurm § Submitting into Batch Jobs (via create-page on MediaWiki MCP Server))

A batch job is the standard way to run unattended work on Anunna. You write a shell script that declares the resources you need (CPUs, memory, wall time) and the commands you want to run, then hand the script to sbatch. SLURM queues the job, runs it on a compute node when resources are free, and writes the output to a file you specify.

This page covers the mechanics: how a batch script is laid out, what each #SBATCH directive does, and how to submit one or many jobs at once. For node selection (partition, GPU, hardware constraints) see Choosing a node (constraints); for tracking submitted jobs see Monitoring jobs.

A worked example

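Suppose you have a Python script calc_pi.py that approximates π with the Bailey-Borwein-Plouffe series at a working precision of ten million digits (with 411 terms, only about the first 495 digits are accurate):

from decimal import Decimal, getcontext

getcontext().prec = 10000000  # working precision, in significant digits
# Bailey-Borwein-Plouffe series: each term adds about 1.2 correct digits,
# so 411 terms give roughly 495 correct digits of pi.
p = sum(Decimal(1)/16**k * (Decimal(4)/(8*k+1) - Decimal(2)/(8*k+4) - Decimal(1)/(8*k+5) - Decimal(1)/(8*k+6)) for k in range(411))
print(str(p)[:10000002])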

To run it on the cluster you need to (1) load the right Python module and (2) wrap the script in a batch submission file.

Loading modules

Check which Python versions are available:

module avail Python

Pick one and load it from the appropriate year bucket — see Modules for the bucket-and-version convention.

Batch script

Save the following as run_calc_pi.sh:

#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --cpus-per-task=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

module load 2023
module load Python/3.11.3

time python3 calc_pi.py

Submitting

Submit the job:

sbatch run_calc_pi.sh

SLURM responds with a job ID. Track progress with squeue and sacct, or cancel with scancel.

Skeleton script

A reasonable starting template:

#!/bin/bash

#-----------------------------Mail address-----------------------------
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#-----------------------------Output files-----------------------------
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#-----------------------------Other information------------------------
#SBATCH --comment=
#SBATCH --qos=
#-----------------------------Required resources-----------------------
#SBATCH --time=0-0:0:0
#SBATCH --ntasks=
#SBATCH --cpus-per-task=
#SBATCH --mem-per-cpu=

#-----------------------------Environment, Operations and Job steps----
# load modules

# export variables

# your job

Common #SBATCH directives

For partition, hardware constraints, and GPU requests see Choosing a node (constraints). The directives below are the ones you use to size the job and control its bookkeeping.

Time limit

#SBATCH --time=1200

Acceptable time formats: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds. A time limit of zero disables the limit. The example above caps the job at 1200 minutes (20 hours).
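To see how the accepted formats relate to each other, the following sketch converts a --time value to minutes. The function name slurm_time_to_minutes is hypothetical and not part of SLURM, and rounding leftover seconds up to a full minute is an assumption of this sketch, not a statement about how the scheduler applies the limit.

```shell
#!/bin/bash
# Convert a Slurm --time value to whole minutes.
# Assumption of this sketch: leftover seconds are rounded up to a full minute.
slurm_time_to_minutes() {
  local spec=$1 days=0 hours=0 minutes=0 seconds=0
  if [[ $spec == *-* ]]; then
    # days-hours, days-hours:minutes or days-hours:minutes:seconds
    days=${spec%%-*}
    IFS=: read -r hours minutes seconds <<< "${spec#*-}"
  else
    case $spec in
      *:*:*) IFS=: read -r hours minutes seconds <<< "$spec" ;;  # hours:minutes:seconds
      *:*)   IFS=: read -r minutes seconds <<< "$spec" ;;        # minutes:seconds
      *)     minutes=$spec ;;                                    # minutes
    esac
  fi
  # 10# forces base 10 so that values such as "08" are not read as octal.
  local total=$(( 10#${days:-0} * 1440 + 10#${hours:-0} * 60 + 10#${minutes:-0} ))
  if (( 10#${seconds:-0} > 0 )); then
    total=$(( total + 1 ))
  fi
  echo "$total"
}
```

For example, 1200, 20:00:00 and 0-20 all describe the same 20-hour limit, so the function returns 1200 for each of them.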

Memory

#SBATCH --mem=2048

SLURM enforces a memory limit per job. The default is deliberately small (100 MB per node), so most real work needs an explicit value. --mem sets the memory required per node; the argument is in MB unless you add a suffix (K, M, G, T).

Pick a value slightly larger than what your job actually needs. Too low and the job is killed; too high and it is harder for SLURM to find a place to run it. A starting heuristic is 4000 MB per core; tune from there. To see what a finished job actually used:

sacct -o MaxRSS -j <jobid>

The value is in KB; divide by 1024 to get MB. If the job finished a while ago, broaden the search with -S YYYY-MM-DD.

For multi-node jobs the value is the maximum used on any one node; uneven task distribution can make this fluctuate between runs.
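The KB-to-MB conversion, plus some headroom, can be sketched as a small shell helper. The function name maxrss_to_mem_mb and the 20 percent default margin are illustrative choices, and the sketch assumes sacct printed the value with a K suffix (or as a bare number of KB).

```shell
#!/bin/bash
# Turn a MaxRSS value such as "1523400K" into a --mem value in MB,
# adding a safety margin (default 20 percent, an arbitrary choice).
maxrss_to_mem_mb() {
  local kb=${1%K}            # strip a trailing K if present
  local margin_pct=${2:-20}
  # Integer arithmetic; the final division rounds up to a whole MB.
  echo $(( (kb * (100 + margin_pct) / 100 + 1023) / 1024 ))
}
```

Calling maxrss_to_mem_mb 1523400K prints 1786, which could then be used as #SBATCH --mem=1786 for the next run.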

Number of tasks and CPUs

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

sbatch does not launch tasks itself; it reserves resources and runs the batch script. --ntasks tells SLURM the maximum number of tasks that job steps inside the allocation will launch, so that enough resources are reserved; --cpus-per-task reserves that many CPUs for each task. The default is one task per node, but setting --cpus-per-task changes that default.
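Reserving CPUs does not by itself make a program use them. A minimal sketch for a multithreaded program follows, assuming it honours OMP_NUM_THREADS (an assumption about your software, not something SLURM enforces). SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is given; the fallback to 1 only exists so the script can also be tried outside the scheduler.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Match the thread count to the CPUs reserved for the task.
threads=${SLURM_CPUS_PER_TASK:-1}
export OMP_NUM_THREADS=$threads
echo "Running with ${OMP_NUM_THREADS} thread(s)"
```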

To pin the job to a single node:

#SBATCH --nodes=1

If only one number is given for --nodes, SLURM treats it as both minimum and maximum.

Local disk on the compute node

#SBATCH --tmp=<required size>

Each compute node has a local disk of around 300 GB at /tmp that you can use for staging. The disk is shared with other jobs on the node, so request the amount you need so SLURM places your job on a node where it is actually available. Clean up after yourself when the job ends.
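One way to keep the clean-up reliable is to do the staging inside a private directory that is removed when the work finishes. The sketch below is an illustration only: run_in_scratch is a hypothetical helper name, and the directory naming scheme is an arbitrary choice. It does not cover a job that is killed before the function returns.

```shell
#!/bin/bash
# Run a command inside a private scratch directory under /tmp and
# remove the directory afterwards, whether the command succeeded or not.
run_in_scratch() {
  local scratch rc
  scratch=$(mktemp -d "/tmp/${USER:-user}_${SLURM_JOB_ID:-nojob}_XXXXXX") || return 1
  (
    cd "$scratch" && "$@"
  )
  rc=$?
  rm -rf -- "$scratch"
  return "$rc"
}
```

Inside a batch script this could be used as run_in_scratch python3 "$PWD/calc_pi.py", with any results copied back to permanent storage by the command itself before it exits.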

Output and error files

#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt

Redirect the batch script's standard output and standard error to files. The default for both is slurm-%j.out. The pattern %j is replaced with the job ID.

Job name

#SBATCH --job-name=calc_pi.py

A short label that appears next to the job ID in squeue. Defaults to the script name, or sbatch if the script came from standard input.

Email notifications

#SBATCH --mail-type=ALL
#SBATCH --mail-user=yourname001@wur.nl

Commonly used --mail-type values are BEGIN, END, FAIL, REQUEUE, and ALL; others, such as NONE and TIME_LIMIT, are listed in the sbatch documentation. Combine with --mail-user to choose the recipient.

Comment / project accounting

#SBATCH --comment=773320000

The comment is an arbitrary string that ends up in accounting records. For WUR users this is the place to put a project number or KTP number. The comment can be changed after submission with scontrol.
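Changing the comment of a queued job could look like the sketch below. The wrapper name set_job_comment is hypothetical; the scontrol update JobId=... Comment=... form is the part that does the work. The sketch assumes a comment without spaces, such as a project number.

```shell
#!/bin/bash
# Replace the comment on an already submitted job.
# Usage: set_job_comment <jobid> <comment>
set_job_comment() {
  local jobid=$1 comment=$2
  scontrol update JobId="$jobid" Comment="$comment"
}
```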

Submitting many jobs at once

Simple loop

To submit ten independent scripts runscript_1.sh through runscript_10.sh:

for i in $(seq 1 10); do
  echo "$i"
  sbatch "runscript_${i}.sh"
done

Job dependencies

To make one job wait for another to finish, capture the job ID returned by sbatch and pass it to the next submission via --dependency:

#!/bin/bash
JOB1=$(sbatch --parsable job_1.sh)
echo "Submitted $JOB1"

JOB2=$(sbatch --parsable --dependency=afterany:$JOB1 job_2.sh)
echo "Submitted $JOB2 (waits for $JOB1)"

JOB3=$(sbatch --parsable --dependency=afterany:$JOB2 job_3.sh)
echo "Submitted $JOB3 (waits for $JOB2)"

The --parsable flag makes sbatch print just the job ID, which is convenient for capturing into a shell variable. afterany triggers the next job whether the previous one succeeded or failed; afterok only triggers on success. For array job dependencies, aftercorr matches array elements one-to-one.

See the sbatch documentation for the full list of dependency types.
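For longer chains, the same pattern can be wrapped in a loop. The following is a sketch under stated assumptions: submit_chain is a hypothetical helper, the dependency type is passed as the first argument, and each script depends only on the one submitted just before it.

```shell
#!/bin/bash
# Submit scripts in order, each depending on the previous one.
# Usage: submit_chain <afterok|afterany> script1.sh script2.sh ...
submit_chain() {
  local dep_type=$1
  shift
  local prev="" jobid script
  for script in "$@"; do
    if [ -z "$prev" ]; then
      jobid=$(sbatch --parsable "$script") || return 1
    else
      jobid=$(sbatch --parsable --dependency="${dep_type}:${prev}" "$script") || return 1
    fi
    # --parsable may print "jobid;cluster"; keep only the job ID.
    prev=${jobid%%;*}
    echo "$script -> job $prev"
  done
}
```

Calling submit_chain afterok job_1.sh job_2.sh job_3.sh submits the three scripts so that each one starts only if its predecessor succeeded.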

Array jobs

If you need to run the same script many times with a varying parameter, an array job is far more efficient than a submission loop. Add:

#SBATCH --array=0-10%4

This submits 11 array tasks (indices 0–10) with at most 4 running at once. See Array jobs for details.
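Inside the script, each array task can read its own index from SLURM_ARRAY_TASK_ID and use it to pick its parameter. A minimal sketch follows, assuming input files named input_0.txt through input_10.txt (a hypothetical naming scheme, as is the helper input_for_task). In the output file name, %A is replaced with the array's job ID and %a with the task index.

```shell
#!/bin/bash
#SBATCH --array=0-10%4
#SBATCH --output=output_%A_%a.txt

# Map an array index to an input file name (hypothetical naming scheme).
input_for_task() {
  printf 'input_%d.txt\n' "$1"
}

# SLURM sets SLURM_ARRAY_TASK_ID for each array task; default to 0 so the
# script can also be tried outside the scheduler.
task_id=${SLURM_ARRAY_TASK_ID:-0}
echo "Array task ${task_id} processes $(input_for_task "$task_id")"
```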

See also