<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.anunna.wur.nl/index.php?action=history&amp;feed=atom&amp;title=Monitoring_Jobs</id>
	<title>Monitoring Jobs - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.anunna.wur.nl/index.php?action=history&amp;feed=atom&amp;title=Monitoring_Jobs"/>
	<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Monitoring_Jobs&amp;action=history"/>
	<updated>2026-06-20T01:30:30Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Monitoring_Jobs&amp;diff=2744&amp;oldid=prev</id>
		<title>Haars0011: Phase 1 § 4 P1.4.7: merge Using Slurm § Monitoring + Monitoring job execution + SACCT + Node usage graph into Monitoring Jobs (via create-page on MediaWiki MCP Server)</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Monitoring_Jobs&amp;diff=2744&amp;oldid=prev"/>
		<updated>2026-06-16T09:41:49Z</updated>

		<summary type="html">&lt;p&gt;Phase 1 § 4 P1.4.7: merge Using Slurm § Monitoring + Monitoring job execution + SACCT + Node usage graph into Monitoring Jobs (via create-page on MediaWiki MCP Server)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Once you have submitted a job, you usually want to keep an eye on it: is it queued or running, on which node, using how much memory, producing the output you expected. SLURM exposes this through a small collection of commands; each one answers a different question, and together they cover everything from &amp;quot;is it running&amp;quot; to &amp;quot;what did it cost me last month&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
This page is a reference for the four common needs:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Is the job in the queue, and where in the queue?&amp;#039;&amp;#039;&amp;#039; — &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt;.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;Is the running job doing what I want?&amp;#039;&amp;#039;&amp;#039; — the output streams from your script, plus &amp;lt;code&amp;gt;sstat&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;top&amp;lt;/code&amp;gt; for live resource usage.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;What did finished jobs actually use?&amp;#039;&amp;#039;&amp;#039; — &amp;lt;code&amp;gt;sacct&amp;lt;/code&amp;gt;.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;How busy is the cluster overall?&amp;#039;&amp;#039;&amp;#039; — &amp;lt;code&amp;gt;node_usage_graph&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For submitting jobs see [[Batch Jobs]] and [[Interactive Jobs]]; to stop a job, see [[Cancelling Jobs]].&lt;br /&gt;
&lt;br /&gt;
== Listing your jobs ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; shows every job currently known to the scheduler. The most common form is just:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
squeue&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To restrict it to your own jobs:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;-l&amp;lt;/code&amp;gt; flag adds the time limit set on each job (useful for spotting a misconfigured time request):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
squeue -l -j &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
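&lt;br /&gt;
For a quick overview of how many of your jobs are in each state, the output of &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; can be piped through standard text tools. The following is a minimal sketch rather than a site-provided tool: &amp;lt;code&amp;gt;count_states&amp;lt;/code&amp;gt; is a hypothetical helper name, and it assumes one state name per line on standard input, which is what &amp;lt;code&amp;gt;squeue -h -o &amp;quot;%T&amp;quot;&amp;lt;/code&amp;gt; should produce.&lt;br /&gt;
&lt;br /&gt;
```bash
# Summarise job states: reads one state name per line on stdin
# and prints "STATE COUNT" pairs, sorted by state name.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Intended use on the cluster (not run here):
#   squeue -h -u "$USER" -o "%T" | count_states
```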
&lt;br /&gt;
== Detailed information for one job ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;scontrol show jobid&amp;lt;/code&amp;gt; dumps everything SLURM knows about a job — start time, allocated nodes, command line, partition, account, exit code:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
scontrol show jobid &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This works for jobs that are currently in the system (running, pending, completing). Once a job has cleared the queue, use &amp;lt;code&amp;gt;sacct&amp;lt;/code&amp;gt; (below) to get its accounting record.&lt;br /&gt;
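&lt;br /&gt;
The &amp;lt;code&amp;gt;scontrol&amp;lt;/code&amp;gt; output is a block of &amp;lt;code&amp;gt;Key=Value&amp;lt;/code&amp;gt; pairs, so a single field can be pulled out with a short filter. This is a sketch under stated assumptions: &amp;lt;code&amp;gt;job_field&amp;lt;/code&amp;gt; is a hypothetical helper, and it only handles values without spaces.&lt;br /&gt;
&lt;br /&gt;
```bash
# Print the value of one Key=Value field from `scontrol show job` output
# read on stdin. Assumption: the value contains no spaces (true for fields
# such as JobState or Partition, but not for every field).
job_field() {
    tr -s ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Intended use on the cluster (not run here), with jobid set to a job number:
#   scontrol show job "$jobid" | job_field JobState
```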
&lt;br /&gt;
== When will a pending job start? ==&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; shows your job as &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; (pending) with a reason like &amp;lt;code&amp;gt;ReqNodeNotAvail&amp;lt;/code&amp;gt;, SLURM can often provide an estimate of when it will start:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
squeue --start -j &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
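The &amp;lt;code&amp;gt;START_TIME&amp;lt;/code&amp;gt; column is only an estimate: it can move earlier or later as other jobs finish or new jobs arrive, and it shows &amp;lt;code&amp;gt;N/A&amp;lt;/code&amp;gt; when the scheduler has not computed one yet. It is still the best signal SLURM has.&lt;br /&gt;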
&lt;br /&gt;
== Tracking output streams ==&lt;br /&gt;
&lt;br /&gt;
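The output and error files declared in your batch script (see [[Batch Jobs]]) are written while the job runs, although your program may buffer its output, so new lines can lag slightly behind what the code has done. The most useful tool to follow them is &amp;lt;code&amp;gt;tail&amp;lt;/code&amp;gt;:&lt;br /&gt;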
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
tail -f output_&amp;lt;jobid&amp;gt;.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This keeps the file open and prints new lines as your job writes them. Press &amp;lt;code&amp;gt;Ctrl-C&amp;lt;/code&amp;gt; to stop following. To see only the last N lines instead of following:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
tail -n &amp;lt;N&amp;gt; output_&amp;lt;jobid&amp;gt;.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a long output file you want to page through, use &amp;lt;code&amp;gt;less&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
less output_&amp;lt;jobid&amp;gt;.txt&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Press &amp;lt;code&amp;gt;q&amp;lt;/code&amp;gt; to quit.&lt;br /&gt;
&lt;br /&gt;
== Live resource usage ==&lt;br /&gt;
&lt;br /&gt;
The output streams tell you what your code is logging; they don&amp;#039;t tell you what it is actually using. For CPU load and memory while a job is running, you have two options.&lt;br /&gt;
&lt;br /&gt;
=== sstat ===&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sstat&amp;lt;/code&amp;gt; reports instantaneous resource usage. To use it your batch script needs to launch the work step under &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; (so that SLURM tracks it as a job step). Example final line of the script:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
srun python3 calc_pi.py&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then, with the job ID from the submission, query a few useful fields:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sstat --format=AveCPU,AveRSS,MaxRSS -P -j &amp;lt;jobid&amp;gt;&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The full list of available fields is in [https://slurm.schedmd.com/sstat.html the sstat manual].&lt;br /&gt;
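&lt;br /&gt;
The memory fields reported by &amp;lt;code&amp;gt;sstat&amp;lt;/code&amp;gt; usually carry a unit suffix, which makes them awkward to compare. The sketch below normalises such a value to MiB. It is an illustration, not part of SLURM: &amp;lt;code&amp;gt;rss_to_mib&amp;lt;/code&amp;gt; is a hypothetical helper, and it assumes that a &amp;lt;code&amp;gt;K&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;M&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;G&amp;lt;/code&amp;gt; suffix means KiB, MiB or GiB and that a bare number is a byte count.&lt;br /&gt;
&lt;br /&gt;
```bash
# Convert a memory value such as "1536K", "200M" or "2G" to MiB.
# Assumption: K/M/G mean KiB/MiB/GiB; a value without suffix is in bytes.
rss_to_mib() {
    awk -v v="$1" 'BEGIN {
        n = v + 0                      # leading numeric part of the value
        u = substr(v, length(v), 1)    # last character (unit suffix, if any)
        if (u == "K") n = n / 1024
        else if (u == "M") n = n
        else if (u == "G") n = n * 1024
        else n = n / 1048576
        printf "%.1f\n", n
    }'
}

# Intended use on the cluster (not run here), with jobid set to a job number:
#   rss_to_mib "$(sstat -n -P --format=MaxRSS -j "$jobid" | head -n 1)"
```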
&lt;br /&gt;
=== Logging top inside the job ===&lt;br /&gt;
&lt;br /&gt;
If you want a continuous record of what the OS sees while the job runs, run &amp;lt;code&amp;gt;top&amp;lt;/code&amp;gt; alongside your work and let its output land in the job&amp;#039;s stdout. Add this near the start of the script (replace the username):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
srun --overcommit --ntasks=1 top -b -u user001 &amp;amp;&lt;br /&gt;
python3 calc_pi.py&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;&amp;amp;&amp;lt;/code&amp;gt; backgrounds &amp;lt;code&amp;gt;top&amp;lt;/code&amp;gt; so the real workload starts immediately. &amp;lt;code&amp;gt;top -b&amp;lt;/code&amp;gt; runs in batch mode (no terminal control) and writes one snapshot every three seconds. The result in your output file looks like:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
top - 18:09:12 up 53 days, 22:53,  0 users,  load average: 27,04, 27,63, 26,53&lt;br /&gt;
Tasks: 1068 total,   4 running, 1064 sleeping,   0 stopped,   0 zombie&lt;br /&gt;
%Cpu(s): 41,9 us,  0,2 sy,  0,0 ni, 57,9 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st&lt;br /&gt;
&lt;br /&gt;
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND&lt;br /&gt;
29260 user001   20   0   15,6g  15,2g  17392 R  1777  1,5   4014:16 R&lt;br /&gt;
29447 user001   20   0  178604  12900   1676 R   1,6  0,0   6:33.17 top&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
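&lt;br /&gt;
Because the snapshots end up mixed in with your program&amp;#039;s own output, it can help to filter out just the process rows afterwards. The following is a minimal sketch, assuming the default &amp;lt;code&amp;gt;top&amp;lt;/code&amp;gt; column layout shown above; &amp;lt;code&amp;gt;top_rows_for_user&amp;lt;/code&amp;gt; is a hypothetical helper name.&lt;br /&gt;
&lt;br /&gt;
```bash
# From `top -b` output on stdin, print PID, resident memory, %CPU and
# command for one user's processes, skipping the top process itself.
# Assumption: default column order (PID USER PR NI VIRT RES SHR S %CPU ...).
top_rows_for_user() {
    awk -v user="$1" '$2 == user { if ($12 != "top") print $1, $6, $9, $12 }'
}

# Intended use (not run here), with jobid set to a job number:
#   cat "output_${jobid}.txt" | top_rows_for_user "$USER"
```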
&lt;br /&gt;
== Historical accounting with sacct ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;sacct&amp;lt;/code&amp;gt; reports from the accounting database rather than from the live queue, so it covers finished jobs as well as running ones, including jobs from months back.&lt;br /&gt;
&lt;br /&gt;
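With no date range, &amp;lt;code&amp;gt;sacct&amp;lt;/code&amp;gt; lists jobs since midnight today. Adding &amp;lt;code&amp;gt;-a&amp;lt;/code&amp;gt; includes other users&amp;#039; jobs, where the site permits it:&lt;br /&gt;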
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sacct -a&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For a real report — for instance an entire month, with the fields you actually care about, written out as CSV — pass &amp;lt;code&amp;gt;--format&amp;lt;/code&amp;gt;, a date range, and the parsable flags:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sacct -P -X --delimiter=&amp;#039;,&amp;#039; \&lt;br /&gt;
      -S 2026-01-01 -E 2026-02-01 \&lt;br /&gt;
      --format=comment%15,User,Partition%20,JobID,JobName,ncpus,nnodes,NodeList,Start,alloccpus,cputime%12,cputimeraw,state \&lt;br /&gt;
      &amp;gt; usage_report_$(date -I).csv&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The flags break down as:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;-P&amp;lt;/code&amp;gt; writes parsable output (no column alignment, just delimited fields).&lt;br /&gt;
* &amp;lt;code&amp;gt;-X&amp;lt;/code&amp;gt; hides the per-step rows and keeps just one row per job.&lt;br /&gt;
* &amp;lt;code&amp;gt;-S&amp;lt;/code&amp;gt; / &amp;lt;code&amp;gt;-E&amp;lt;/code&amp;gt; set the start and end of the window. Without these &amp;lt;code&amp;gt;sacct&amp;lt;/code&amp;gt; only looks at today.&lt;br /&gt;
* &amp;lt;code&amp;gt;--format&amp;lt;/code&amp;gt; selects the columns; the &amp;lt;code&amp;gt;%NN&amp;lt;/code&amp;gt; suffix sets a width.&lt;br /&gt;
* &amp;lt;code&amp;gt;cputimeraw&amp;lt;/code&amp;gt; is the field you usually want for cost analysis: allocated CPU time in seconds (elapsed time multiplied by the number of allocated CPUs).&lt;br /&gt;
&lt;br /&gt;
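The resulting CSV opens directly in Excel or any spreadsheet. Fields such as &amp;lt;code&amp;gt;NodeList&amp;lt;/code&amp;gt; can themselves contain commas (for example &amp;lt;code&amp;gt;node[003,010]&amp;lt;/code&amp;gt;), so switch to another delimiter if columns come out misaligned. The full field list is in [https://slurm.schedmd.com/sacct.html the sacct manual].&lt;br /&gt;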
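&lt;br /&gt;
For a quick total without a spreadsheet, the &amp;lt;code&amp;gt;cputimeraw&amp;lt;/code&amp;gt; column can be summed per user directly in the shell. This is a sketch, not a site-provided report: &amp;lt;code&amp;gt;cpu_hours_by_user&amp;lt;/code&amp;gt; is a hypothetical helper, and it assumes two-column &amp;lt;code&amp;gt;sacct -P&amp;lt;/code&amp;gt; output with a header row and the default &amp;lt;code&amp;gt;|&amp;lt;/code&amp;gt; delimiter.&lt;br /&gt;
&lt;br /&gt;
```bash
# Sum CPUTimeRAW (CPU-seconds) per user and print "user hours", sorted by user.
# Assumption: stdin is `sacct -P --format=User,CPUTimeRAW` output, that is,
# a header line followed by "user|seconds" rows.
cpu_hours_by_user() {
    awk -F'|' 'NR > 1 { secs[$1] += $2 }
               END { for (u in secs) printf "%s %.1f\n", u, secs[u] / 3600 }' | sort
}

# Intended use on the cluster (not run here):
#   sacct -P -X -S 2026-01-01 -E 2026-02-01 --format=User,CPUTimeRAW | cpu_hours_by_user
```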
&lt;br /&gt;
== Cluster-wide view: node_usage_graph ==&lt;br /&gt;
&lt;br /&gt;
To see how busy the cluster as a whole is — which nodes are crunching, which are drained, which are reserved — use &amp;lt;code&amp;gt;node_usage_graph&amp;lt;/code&amp;gt;. It is a small wrapper around &amp;lt;code&amp;gt;sacct&amp;lt;/code&amp;gt; that renders a per-node bar chart in your terminal.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
module load legacy&lt;br /&gt;
module load anunna&lt;br /&gt;
node_usage_graph&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The output is one row per node. Each row contains up to two strips of characters: the top strip shows CPU activity, the bottom shows memory:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;text&amp;quot;&amp;gt;&lt;br /&gt;
node:   |0%                                                                            100%|&lt;br /&gt;
fat002: CCCCCCCCC&lt;br /&gt;
        MMMMMmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm&lt;br /&gt;
node003:CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC&lt;br /&gt;
        MM&lt;br /&gt;
node010:&lt;br /&gt;
node040:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD&lt;br /&gt;
node042:RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Letters mean:&lt;br /&gt;
&lt;br /&gt;
* &amp;lt;code&amp;gt;C&amp;lt;/code&amp;gt; — CPU reserved and in use&lt;br /&gt;
* &amp;lt;code&amp;gt;c&amp;lt;/code&amp;gt; — CPU reserved but idle&lt;br /&gt;
* &amp;lt;code&amp;gt;M&amp;lt;/code&amp;gt; — Memory reserved and in use&lt;br /&gt;
* &amp;lt;code&amp;gt;m&amp;lt;/code&amp;gt; — Memory reserved but unused&lt;br /&gt;
* &amp;lt;code&amp;gt;D&amp;lt;/code&amp;gt; — Drained node (unavailable for new jobs)&lt;br /&gt;
* &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt; — Reserved node (held for a specific user or event — see [[Reservations]])&lt;br /&gt;
* &amp;lt;code&amp;gt;P&amp;lt;/code&amp;gt; — Powered off (energy saving)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;node_usage_graph&amp;lt;/code&amp;gt; shows current allocation but not the queue depth. For &amp;quot;how busy is the queue?&amp;quot; stick with &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
== Job status codes ==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; shows a one- or two-letter code in the &amp;lt;code&amp;gt;ST&amp;lt;/code&amp;gt; column. The most common are &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt; (running), &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; (pending), and &amp;lt;code&amp;gt;CG&amp;lt;/code&amp;gt; (completing). The full list:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Code !! State !! Description&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;CA&amp;lt;/code&amp;gt; || CANCELLED || Job was explicitly cancelled by the user or a system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;CD&amp;lt;/code&amp;gt; || COMPLETED || Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;CF&amp;lt;/code&amp;gt; || CONFIGURING || Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;CG&amp;lt;/code&amp;gt; || COMPLETING || Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;F&amp;lt;/code&amp;gt; || FAILED || Job terminated with a non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;NF&amp;lt;/code&amp;gt; || NODE_FAIL || Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;PD&amp;lt;/code&amp;gt; || PENDING || Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;R&amp;lt;/code&amp;gt; || RUNNING || Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;S&amp;lt;/code&amp;gt; || SUSPENDED || Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
| &amp;lt;code&amp;gt;TO&amp;lt;/code&amp;gt; || TIMEOUT || Job terminated upon reaching its time limit.&lt;br /&gt;
|}&lt;br /&gt;
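&lt;br /&gt;
When scripting around &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt;, it can be convenient to expand these codes into their full state names. The sketch below encodes only the table above; &amp;lt;code&amp;gt;state_name&amp;lt;/code&amp;gt; is a hypothetical helper, and any code not in the table is reported as &amp;lt;code&amp;gt;UNKNOWN&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
```bash
# Map a squeue state code to its full state name, using the table above.
# Prints UNKNOWN and returns 1 for codes that are not in the table.
state_name() {
    case "$1" in
        CA) echo CANCELLED ;;
        CD) echo COMPLETED ;;
        CF) echo CONFIGURING ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        NF) echo NODE_FAIL ;;
        PD) echo PENDING ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        TO) echo TIMEOUT ;;
        *)  echo UNKNOWN; return 1 ;;
    esac
}

# Intended use on the cluster (not run here):
#   squeue -h -u "$USER" -o "%t" | while read -r code; do state_name "$code"; done
```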
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Batch Jobs]]&lt;br /&gt;
* [[Interactive Jobs]]&lt;br /&gt;
* [[Cancelling Jobs]]&lt;br /&gt;
* [[Reservations]]&lt;br /&gt;
* [[Using Slurm | Scheduler overview]]&lt;br /&gt;
* [https://slurm.schedmd.com/squeue.html squeue documentation]&lt;br /&gt;
* [https://slurm.schedmd.com/scontrol.html scontrol documentation]&lt;br /&gt;
* [https://slurm.schedmd.com/sstat.html sstat documentation]&lt;br /&gt;
* [https://slurm.schedmd.com/sacct.html sacct documentation]&lt;/div&gt;</summary>
		<author><name>Haars0011</name></author>
	</entry>
</feed>