Monitoring Jobs

From HPCwiki

Once you have submitted a job, you usually want to keep an eye on it: is it queued or running, on which node, using how much memory, producing the output you expected. SLURM exposes this through a small collection of commands; each one answers a different question, and together they cover everything from "is it running" to "what did it cost me last month".

This page is a reference for the four common needs:

  • Is the job in the queue, and where in the queue? Use squeue and scontrol.
  • Is the running job doing what I want? Check the output streams from your script, plus sstat or top for live resource usage.
  • What did finished jobs actually use? Use sacct.
  • How busy is the cluster overall? Use node_usage_graph.

For submitting jobs see Batch Jobs and Interactive Jobs; to stop a job, see Cancelling Jobs.

Listing your jobs

squeue shows every job currently known to the scheduler. The most common form is just:

squeue

To restrict it to your own jobs:

squeue -u $USER

The -l flag adds the time limit set on each job (useful for spotting a misconfigured time request):

squeue -l -j <jobid>
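
When you have many jobs in the queue, a per-state count is often easier to read than the full listing. The sketch below assumes squeue's -h (no header) and -o "%T" (full state name) options; count_states is a hypothetical helper name, not part of SLURM.

```shell
# Summarise job states read from stdin (one state name per line),
# printing "STATE COUNT" pairs in alphabetical order.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Only query the scheduler where squeue is actually available.
# -h drops the header line, -o "%T" prints just the full state name.
if command -v squeue >/dev/null 2>&1; then
    squeue -h -u "$USER" -o "%T" | count_states
fi
```

The counting is kept in its own function so that it can be fed from any list of state names, not only from a live squeue call.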

Detailed information for one job

scontrol show jobid dumps everything SLURM knows about a job — start time, allocated nodes, command line, partition, account, exit code:

scontrol show jobid <jobid>

This works for jobs that are currently in the system (running, pending, completing). Once a job has cleared the queue, use sacct (below) to get its accounting record.
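
The scontrol output is a block of Key=Value pairs separated by whitespace, which makes single fields easy to pull out in a script. A minimal sketch, assuming that layout; job_field is a hypothetical helper, and values that themselves contain spaces (such as a command line with arguments) are cut at the first space.

```shell
# Print the value of one Key=Value field from `scontrol show job` output
# read on stdin. Only the first match is printed, and a value containing
# spaces is truncated at the first space.
job_field() {
    tr -s ' \t' '\n\n' | sed -n "s/^$1=//p" | head -n 1
}

# Inside a running job SLURM sets SLURM_JOB_ID, so a job can inspect itself.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_ID:-}" ]; then
    scontrol show job "$SLURM_JOB_ID" | job_field JobState
fi
```

Because the key is anchored at the start of each token, asking for NodeList does not accidentally match ReqNodeList.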

When will a pending job start?

If squeue shows your job as PD (pending) with a reason such as Priority or Resources, SLURM can usually estimate when it will start:

squeue --start -j <jobid>

The START_TIME column is an estimate — it can shift forward or backward as other jobs finish or new jobs arrive — but it is the best signal SLURM has.

Tracking output streams

The output and error files declared in your batch script (see Batch Jobs) are written in real time. The most useful tool to follow them is tail:

tail -f output_<jobid>.txt

This keeps the file open and prints new lines as your job writes them. Press Ctrl-C to stop following. To see only the last N lines instead of following:

tail -n <N> output_<jobid>.txt

For a long output file you want to page through, use less:

less output_<jobid>.txt

Press q to quit.
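
For a quick health check it can help to scan only the most recent part of the output for trouble. The sketch below combines tail with grep; recent_errors is a hypothetical helper, and the search words "error" and "traceback" are assumptions that you should adapt to whatever your own program prints.

```shell
# Print lines mentioning "error" or "traceback" (case-insensitive)
# from the last N lines of a file; N defaults to 200.
# Usage: recent_errors FILE [N]
recent_errors() {
    tail -n "${2:-200}" "$1" | grep -i -E 'error|traceback'
}
```

Called as recent_errors output_<jobid>.txt 500, it prints the matching lines and, like grep, returns a non-zero exit status when nothing matched.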

Live resource usage

The output streams tell you what your code is logging; they don't tell you what it is actually using. For CPU load and memory while a job is running, you have two options.

sstat

sstat reports the current resource usage of a running job step. For it to work, your batch script must launch the workload under srun, so that SLURM tracks it as a job step. For example, as the final line of the script:

srun python3 calc_pi.py

Then, with the job ID from the submission, query a few useful fields:

sstat --format=AveCPU,AveRSS,MaxRSS -P -j <jobid>

The full list of available fields is in the sstat manual.
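
Memory fields such as MaxRSS usually come back with a unit suffix, which makes them awkward to compare. The following is a rough sketch of a converter to MiB; to_mib is a hypothetical helper, and it assumes the suffixes K, M, G and T are powers of 1024 and that a value without a suffix is a byte count.

```shell
# Convert a memory value such as "2048K" or "1.5G" to MiB, one decimal.
# Assumption: suffixes are powers of 1024; no suffix means plain bytes.
to_mib() {
    LC_ALL=C awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)   # last character, possibly a suffix
        num  = v + 0                     # awk keeps the leading numeric part
        if      (unit == "K") mib = num / 1024
        else if (unit == "M") mib = num
        else if (unit == "G") mib = num * 1024
        else if (unit == "T") mib = num * 1024 * 1024
        else                  mib = num / 1024 / 1024
        printf "%.1f\n", mib
    }'
}
```

A single value taken from the sstat output can then be passed in directly, for example to_mib 2048K.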

Logging top inside the job

If you want a continuous record of what the OS sees while the job runs, run top alongside your work and let its output land in the job's stdout. Add this near the start of the script (replace the username):

srun --overcommit --ntasks=1 top -b -u user001 &
python3 calc_pi.py

The & backgrounds top so the real workload starts immediately. top -b runs in batch mode (no terminal control) and writes one snapshot every three seconds. The result in your output file looks like:

top - 18:09:12 up 53 days, 22:53,  0 users,  load average: 27,04, 27,63, 26,53
Tasks: 1068 total,   4 running, 1064 sleeping,   0 stopped,   0 zombie
%Cpu(s): 41,9 us,  0,2 sy,  0,0 ni, 57,9 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
29260 user001   20   0   15,6g  15,2g  17392 R  1777  1,5   4014:16 R
29447 user001   20   0  178604  12900   1676 R   1,6  0,0   6:33.17 top
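
Because the snapshots are mixed in with your program's own output, a small filter helps to pull out only the process rows. This is a sketch that assumes the default top column order shown above (PID, USER, ..., COMMAND); top_rows is a hypothetical helper name.

```shell
# From a job log on stdin, print PID, RES, %CPU and COMMAND for every
# top process row owned by the given user, skipping top itself.
# Caveat: any other 12-field log line whose second word is the user name
# would also match.
top_rows() {
    awk -v user="$1" 'NF >= 12 && $2 == user && $12 != "top" { print $1, $6, $9, $12 }'
}
```

For example, top_rows user001 < output_<jobid>.txt yields one short line per snapshot for each of your processes, which is easy to plot or scan by eye.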

Historical accounting with sacct

sacct reports accounting data for your jobs. It reads from the accounting database rather than from the live queue, so it covers still-running jobs as well as jobs that left the queue long ago, including ones from months back.

The simplest form lists jobs since midnight today. The -a flag includes the jobs of all users, where you are permitted to see them:

sacct -a

For a real report — for instance an entire month, with the fields you actually care about, written out as CSV — pass --format, a date range, and the parsable flags:

sacct -P -X --delimiter=',' \
      -S 2026-01-01 -E 2026-02-01 \
      --format=comment%15,User,Partition%20,JobID,JobName,ncpus,nnodes,NodeList,Start,alloccpus,cputime%12,cputimeraw,state \
      > usage_report_$(date -I).csv

The flags break down as:

  • -P writes parsable output (no column alignment, just delimited fields).
  • -X hides the per-step rows and keeps just one row per job.
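  • -S / -E set the start and end of the window. Without them, sacct only covers jobs since midnight today.
  • --format selects the columns; the %NN suffix sets a column width, which only matters for the aligned (non-parsable) output and is ignored with -P.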
  • cputimeraw is the field you usually want for cost analysis — total CPU-seconds.

The resulting CSV opens directly in Excel or any spreadsheet. The full field list is in the sacct manual.
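
If you would rather stay on the command line, the same CSV can be summarised with awk. The sketch below totals CPU-hours per user; cpu_hours_by_user is a hypothetical helper, it locates the User and CPUTimeRAW columns by their header names, and it assumes that no field contains a comma. A NodeList such as node[001,003] breaks that assumption, in which case a different --delimiter (for example '|') is the safer choice.

```shell
# Sum the CPUTimeRAW column (CPU-seconds) per user from a comma-separated
# sacct report and print "user hours", sorted by user name.
# Usage: cpu_hours_by_user REPORT.csv
cpu_hours_by_user() {
    LC_ALL=C awk -F',' '
        NR == 1 {
            # Find the two columns by header name, ignoring case.
            for (i = 1; i <= NF; i++) {
                name = tolower($i)
                if (name == "user")       ucol = i
                if (name == "cputimeraw") ccol = i
            }
            if (!ucol || !ccol) {
                print "missing User or CPUTimeRAW column" > "/dev/stderr"
                exit 1
            }
            next
        }
        { secs[$ucol] += $ccol }
        END {
            for (u in secs) printf "%s %.2f\n", u, secs[u] / 3600
        }
    ' "$1" | sort
}
```

Looking the columns up by name rather than by position means the function keeps working if you change the order of the fields in --format.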

Cluster-wide view: node_usage_graph

To see how busy the cluster as a whole is — which nodes are crunching, which are drained, which are reserved — use node_usage_graph. It is a small wrapper around sacct that renders a per-node bar chart in your terminal.

module load legacy
module load anunna
node_usage_graph

The output is one row per node. Each row contains up to two strips of characters: the top strip shows CPU activity, the bottom shows memory:

node:   |0%                                                                            100%|
fat002: CCCCCCCCC
        MMMMMmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
node003:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
        MM
node010:
node040:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
node042:RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR

Letters mean:

  • C — CPU reserved and in use
  • c — CPU reserved but idle
  • M — Memory reserved and in use
  • m — Memory reserved but unused
  • D — Drained node (unavailable for new jobs)
  • R — Reserved node (held for a specific user or event — see Reservations)
  • P — Powered off (energy saving)

node_usage_graph shows current allocation but not the queue depth. For "how busy is the queue?" stick with squeue.

Job status codes

squeue shows a one- or two-letter code in the ST column. The most common are R (running), PD (pending), and CG (completing). The full list:

Code  State        Description
CA    CANCELLED    Job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated.
CD    COMPLETED    Job has terminated all processes on all nodes.
CF    CONFIGURING  Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
CG    COMPLETING   Job is in the process of completing. Some processes on some nodes may still be active.
F     FAILED       Job terminated with a non-zero exit code or other failure condition.
NF    NODE_FAIL    Job terminated due to failure of one or more allocated nodes.
PD    PENDING      Job is awaiting resource allocation.
R     RUNNING      Job currently has an allocation.
S     SUSPENDED    Job has an allocation, but execution has been suspended.
TO    TIMEOUT      Job terminated upon reaching its time limit.
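
In scripts, these codes are mostly useful for deciding whether a job is still in progress. The sketch below treats PD, CF, R, S and CG as "still active" and everything else as finished; is_active_state and wait_for_job are hypothetical helper names, the one-minute polling interval is an arbitrary choice, and the loop assumes a single job rather than a job array.

```shell
# Succeed (exit status 0) if the squeue ST code names a state in which
# the job is still queued, starting, running, suspended or finishing.
is_active_state() {
    case "$1" in
        PD|CF|R|S|CG) return 0 ;;
        *)            return 1 ;;
    esac
}

# Block until the given job is no longer active, polling once a minute.
# -h drops the header and -o '%t' prints only the compact state code.
# An empty answer means squeue no longer lists the job at all.
wait_for_job() {
    while is_active_state "$(squeue -h -j "$1" -o '%t' 2>/dev/null)"; do
        sleep 60
    done
}
```

Once wait_for_job returns, the job has left the active states, and sacct is the place to look up how it ended.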

See also