Monitoring Jobs
Once you have submitted a job, you usually want to keep an eye on it: is it queued or running, on which node, using how much memory, producing the output you expected. SLURM exposes this through a small collection of commands; each one answers a different question, and together they cover everything from "is it running" to "what did it cost me last month".
This page is a reference for the four common needs:
- Is the job in the queue, and where in the queue? squeue and scontrol.
- Is the running job doing what I want? The output streams from your script, plus sstat or top for live resource usage.
- What did finished jobs actually use? sacct.
- How busy is the cluster overall? node_usage_graph.
For submitting jobs see Batch Jobs and Interactive Jobs; to stop a job, see Cancelling Jobs.
Listing your jobs
squeue shows every job currently known to the scheduler. The most common form is just:
squeue
To restrict it to your own jobs:
squeue -u $USER
The -l flag adds the time limit set on each job (useful for spotting a misconfigured time request):
squeue -l -j <jobid>
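When you have many jobs in the queue, a per-state count is often easier to read than the full listing. The sketch below assumes one state code per line on standard input, which is what squeue prints with -h (no header) and -o '%t' (compact state only). The function name count_states is made up for this example.

```shell
# Summarise state codes read from stdin as CODE=COUNT lines.
# count_states is a hypothetical helper name, not a SLURM command.
count_states() {
  LC_ALL=C sort | uniq -c | awk '{ printf "%s=%d\n", $2, $1 }'
}

# Intended use on the cluster (needs SLURM, so it is left commented out here):
#   squeue -h -u "$USER" -o '%t' | count_states
```

The codes it prints are the same ones that appear in the ST column of squeue.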
Detailed information for one job
scontrol show jobid dumps everything SLURM knows about a job — start time, allocated nodes, command line, partition, account, exit code:
scontrol show jobid <jobid>
This works for jobs that are currently in the system (running, pending, completing). Once a job has cleared the queue, use sacct (below) to get its accounting record.
When will a pending job start?
If squeue shows your job as PD (pending) with a reason like ReqNodeNotAvail, SLURM has an estimate for when it can start:
squeue --start -j <jobid>
The START_TIME column is an estimate — it can shift forward or backward as other jobs finish or new jobs arrive — but it is the best signal SLURM has.
Tracking output streams
The output and error files declared in your batch script (see Batch Jobs) are written in real time. The most useful tool to follow them is tail:
tail -f output_<jobid>.txt
This keeps the file open and prints new lines as your job writes them. Press Ctrl-C to stop following. To see only the last N lines instead of following:
tail -n <N> output_<jobid>.txt
For a long output file you want to page through, use less:
less output_<jobid>.txt
Press q to quit.
Live resource usage
The output streams tell you what your code is logging; they don't tell you what it is actually using. For CPU load and memory while a job is running, you have two options.
sstat
sstat reports instantaneous resource usage. To use it your batch script needs to launch the work step under srun (so that SLURM tracks it as a job step). Example final line of the script:
srun python3 calc_pi.py
Then, with the job ID from the submission, query a few useful fields:
sstat --format=AveCPU,AveRSS,MaxRSS -P -j <jobid>
The full list of available fields is in the sstat manual.
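The RSS fields usually come back with a unit suffix (for example 2097152K), which is awkward to compare against a memory request in MiB or GiB. The helper below is a minimal sketch for converting one such value. The name rss_to_mib is hypothetical, and treating an unsuffixed number as bytes is an assumption you should check against the output on your own cluster.

```shell
# Convert a single RSS value such as 2097152K, 512M or 1.5G to MiB.
# rss_to_mib is a hypothetical helper; an unsuffixed value is ASSUMED to be bytes.
rss_to_mib() {
  LC_ALL=C awk -v v="$1" 'BEGIN {
    n = v + 0                            # leading numeric part of the value
    s = toupper(substr(v, length(v)))    # last character, the unit suffix if any
    if (s == "K")      n = n / 1024
    else if (s == "M") n = n
    else if (s == "G") n = n * 1024
    else if (s == "T") n = n * 1024 * 1024
    else               n = n / (1024 * 1024)   # assumption: plain bytes
    printf "%.1f\n", n
  }'
}

# Intended use on the cluster (commented out because it needs a running job):
#   rss_to_mib "$(sstat -n -P --format=MaxRSS -j <jobid>)"
```

The usage line assumes sstat returns a single row. If the job has several steps, you would need to loop over the rows instead.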
Logging top inside the job
If you want a continuous record of what the OS sees while the job runs, run top alongside your work and let its output land in the job's stdout. Add this near the start of the script (replace the username):
srun --overcommit --ntasks=1 top -b -u user001 &
python3 calc_pi.py
The & backgrounds top so the real workload starts immediately. top -b runs in batch mode (no terminal control) and writes one snapshot every three seconds. The result in your output file looks like:
top - 18:09:12 up 53 days, 22:53, 0 users, load average: 27,04, 27,63, 26,53
Tasks: 1068 total, 4 running, 1064 sleeping, 0 stopped, 0 zombie
%Cpu(s): 41,9 us, 0,2 sy, 0,0 ni, 57,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29260 user001 20 0 15,6g 15,2g 17392 R 1777 1,5 4014:16 R
29447 user001 20 0 178604 12900 1676 R 1,6 0,0 6:33.17 top
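A log like the one above grows quickly, so it helps to reduce it to one number per process. The sketch below pulls the peak %CPU for a given command name out of such a log. It assumes the default top column layout shown above (%CPU in column 9, COMMAND in column 12) and copes with the comma decimal separator seen in the sample. The function name top_peak_cpu is made up for this example.

```shell
# Print the highest %CPU value logged for one command name.
# top_peak_cpu is a hypothetical helper; it ASSUMES the default top columns
# (%CPU is field 9, COMMAND is field 12).
# Usage: top_peak_cpu COMMAND [FILE...]   (reads stdin when no file is given)
top_peak_cpu() {
  cmd=$1
  shift
  LC_ALL=C awk -v cmd="$cmd" '
    $12 == cmd {
      v = $9
      sub(",", ".", v)                 # 1,6 -> 1.6 for locales with comma decimals
      if (v + 0 > max) max = v + 0
    }
    END { print max + 0 }
  ' "$@"
}

# Intended use: top_peak_cpu python3 output_<jobid>.txt
```

top normally reports 100 per fully used core, so a value such as 1777 corresponds to roughly 18 busy cores.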
Historical accounting with sacct
sacct reads from the accounting database rather than the live queue, so it covers finished jobs as well as jobs that are still running, including ones from months back.
The simplest form lists jobs since midnight today; the -a flag includes the jobs of all users, where the site allows it:
sacct -a
For a real report — for instance an entire month, with the fields you actually care about, written out as CSV — pass --format, a date range, and the parsable flags:
sacct -P -X --delimiter=',' \
-S 2026-01-01 -E 2026-02-01 \
--format=comment%15,User,Partition%20,JobID,JobName,ncpus,nnodes,NodeList,Start,alloccpus,cputime%12,cputimeraw,state \
> usage_report_$(date -I).csv
The flags break down as:
- -P writes parsable output (no column alignment, just delimited fields).
- -X hides the per-step rows and keeps just one row per job.
- -S / -E set the start and end of the window. Without these sacct only looks at today.
- --format selects the columns; the %NN suffix sets a width.
- cputimeraw is the field you usually want for cost analysis: total CPU-seconds.
The resulting CSV opens directly in Excel or any spreadsheet. The full field list is in the sacct manual.
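If you would rather stay in the terminal, the same report can be totalled with awk. The sketch below sums cputimeraw per user and prints core-hours. It finds the User and CPUTimeRAW columns by header name and assumes that no field contains a comma. A NodeList such as node[001,003] would break that assumption, in which case a different --delimiter is the safer choice. The function name cpu_hours_by_user is hypothetical.

```shell
# Sum CPU-seconds per user from a sacct CSV report and print core-hours.
# cpu_hours_by_user is a hypothetical helper; it ASSUMES a header row and
# that no field contains the delimiter.
# Usage: cpu_hours_by_user [FILE...]   (reads stdin when no file is given)
cpu_hours_by_user() {
  LC_ALL=C awk -F',' '
    NR == 1 {
      # Locate the columns by name so the --format order does not matter.
      for (i = 1; i <= NF; i++) {
        name = tolower($i)
        if (name == "user")       u = i
        if (name == "cputimeraw") c = i
      }
      next
    }
    u && c { sec[$u] += $c }
    END { for (k in sec) printf "%s %.1f\n", k, sec[k] / 3600 }
  ' "$@" | LC_ALL=C sort
}

# Intended use: cpu_hours_by_user usage_report_2026-02-01.csv
```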
Cluster-wide view: node_usage_graph
To see how busy the cluster as a whole is — which nodes are crunching, which are drained, which are reserved — use node_usage_graph. It is a small wrapper around sacct that renders a per-node bar chart in your terminal.
module load legacy
module load anunna
node_usage_graph
The output is one row per node. Each row contains up to two strips of characters: the top strip shows CPU activity, the bottom shows memory:
node: |0% 100%|
fat002: CCCCCCCCC
MMMMMmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
node003:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
MM
node010:
node040:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
node042:RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
Letters mean:
- C: CPU reserved and in use
- c: CPU reserved but idle
- M: Memory reserved and in use
- m: Memory reserved but unused
- D: Drained node (unavailable for new jobs)
- R: Reserved node (held for a specific user or event; see Reservations)
- P: Powered off (energy saving)
node_usage_graph shows current allocation but not the queue depth. For "how busy is the queue?" stick with squeue.
Job status codes
squeue shows a short code of one or two letters in the ST column. The most common are R (running), PD (pending), and CG (completing). The codes you are most likely to encounter:
| Code | State | Description |
|---|---|---|
| CA | CANCELLED | Job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes. |
| CF | CONFIGURING | Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting). |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with a non-zero exit code or other failure condition. |
| NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
| PD | PENDING | Job is awaiting resource allocation. |
| R | RUNNING | Job currently has an allocation. |
| S | SUSPENDED | Job has an allocation, but execution has been suspended. |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |
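In a script that polls squeue, the usual question is whether a code means the job is over. The sketch below groups the codes from the table into "finished" and "not finished". The function name is_finished is made up for this example, and the grouping covers only the codes listed on this page.

```shell
# Return success (0) when a state code from the table means the job has ended.
# is_finished is a hypothetical helper covering only the codes listed above.
is_finished() {
  case "$1" in
    CA|CD|F|NF|TO) return 0 ;;   # cancelled, completed, failed, node fail, timeout
    *)             return 1 ;;   # pending, running, configuring, completing, suspended, other
  esac
}

# Example: report on a code obtained elsewhere, e.g. from squeue -h -j <jobid> -o '%t'
code=TO
if is_finished "$code"; then
  echo "job has ended ($code)"
fi
```

Treating CG as not finished is a design choice: some processes may still be active while a job is completing.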