<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.anunna.wur.nl/index.php?action=history&amp;feed=atom&amp;title=Debugging_Jobs</id>
	<title>Debugging Jobs - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.anunna.wur.nl/index.php?action=history&amp;feed=atom&amp;title=Debugging_Jobs"/>
	<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Debugging_Jobs&amp;action=history"/>
	<updated>2026-06-19T08:17:54Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.45.3</generator>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Debugging_Jobs&amp;diff=2867&amp;oldid=prev</id>
		<title>Haars0011: IA migration §8: new Debugging Jobs page (via create-page on MediaWiki MCP Server)</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Debugging_Jobs&amp;diff=2867&amp;oldid=prev"/>
		<updated>2026-06-18T14:22:05Z</updated>

		<summary type="html">&lt;p&gt;IA migration §8: new Debugging Jobs page (via create-page on MediaWiki MCP Server)&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;When a job fails, exits early, or produces the wrong results, a few systematic checks usually find the cause. This page covers how to work out what went wrong; for the monitoring commands themselves, see [[Monitoring Jobs]].&lt;br /&gt;
&lt;br /&gt;
== Start with the logs ==&lt;br /&gt;
&lt;br /&gt;
Slurm writes your job&amp;#039;s standard output and error to the files you set with &amp;lt;code&amp;gt;--output&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;--error&amp;lt;/code&amp;gt; in your batch script. Read these first: most failures (a missing file, a typo, an out-of-memory message, a module that was not loaded) show up there.&lt;br /&gt;
&lt;br /&gt;
If you did not set those options, both standard output and standard error go to &amp;lt;code&amp;gt;slurm-&amp;lt;jobid&amp;gt;.out&amp;lt;/code&amp;gt; in the directory you submitted from.&lt;br /&gt;
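&lt;br /&gt;
As a minimal sketch, the top of a batch script that keeps the two streams in separate, per-job files could look like the following. The job name and the &amp;lt;code&amp;gt;logs/&amp;lt;/code&amp;gt; directory are assumptions for illustration; the directory has to exist before you submit, because the scheduler does not create it.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --job-name=myjob          # hypothetical name, substituted for %x below&lt;br /&gt;
#SBATCH --output=logs/%x-%j.out   # standard output; %x = job name, %j = job ID&lt;br /&gt;
#SBATCH --error=logs/%x-%j.err    # standard error, kept in its own file&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;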
&lt;br /&gt;
== Check how the job ended ==&lt;br /&gt;
&lt;br /&gt;
For a finished job, &amp;lt;code&amp;gt;sacct&amp;lt;/code&amp;gt; shows its final state (for example COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, or CANCELLED) and its exit code:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
sacct -j &amp;lt;jobid&amp;gt; --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,ReqMem&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Common patterns:&lt;br /&gt;
&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;TIMEOUT&amp;#039;&amp;#039;&amp;#039;: the job hit its &amp;lt;code&amp;gt;--time&amp;lt;/code&amp;gt; limit. Request more time, or make the work faster or smaller.&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;OUT_OF_MEMORY&amp;#039;&amp;#039;&amp;#039;: the job needed more memory than it requested. Increase &amp;lt;code&amp;gt;--mem&amp;lt;/code&amp;gt; (or &amp;lt;code&amp;gt;--mem-per-cpu&amp;lt;/code&amp;gt;); to size the new request, compare &amp;lt;code&amp;gt;MaxRSS&amp;lt;/code&amp;gt; (the peak memory a job step actually used) against &amp;lt;code&amp;gt;ReqMem&amp;lt;/code&amp;gt; (the amount requested).&lt;br /&gt;
* &amp;#039;&amp;#039;&amp;#039;FAILED&amp;#039;&amp;#039;&amp;#039; with a non-zero exit code: the program itself exited with an error; check the logs.&lt;br /&gt;
&lt;br /&gt;
See [[Monitoring Jobs]] for more on &amp;lt;code&amp;gt;sacct&amp;lt;/code&amp;gt; and the other monitoring tools.&lt;br /&gt;
&lt;br /&gt;
== Reproduce it interactively ==&lt;br /&gt;
&lt;br /&gt;
If the logs are not enough, reproduce the problem in an [[Interactive Jobs|interactive job]]. Request the same resources, then run the commands from your script by hand and watch what happens. This is the quickest way to debug module loading, paths, and input files.&lt;br /&gt;
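&lt;br /&gt;
One way to do this with plain Slurm is sketched below. It assumes that starting a shell through &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt; is permitted on this cluster, and the resource values are placeholders to be copied from your own batch script; the [[Interactive Jobs]] page remains the reference for the recommended method.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Ask for an interactive shell with the same resources as the failing job.&lt;br /&gt;
# The time, memory and CPU values here are placeholders, not recommendations.&lt;br /&gt;
srun --time=01:00:00 --mem=4G --cpus-per-task=2 --pty bash -i&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Once the shell opens on a compute node, load the same modules and run the commands from the script one at a time, so the first failing step is easy to spot.&lt;br /&gt;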
&lt;br /&gt;
== Check a running job ==&lt;br /&gt;
&lt;br /&gt;
For a job that is running but behaving oddly (too slow, using too much memory), inspect its live resource use with &amp;lt;code&amp;gt;sstat&amp;lt;/code&amp;gt;, or connect to the node it runs on — see [[Compute Nodes]] and [[Monitoring Jobs]].&lt;br /&gt;
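&lt;br /&gt;
A minimal sketch of such a check, assuming the work runs in the batch step of the job; the selection of fields is only a suggestion:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;bash&amp;quot;&amp;gt;&lt;br /&gt;
# Live memory, CPU and write I/O figures for the batch step of a running job.&lt;br /&gt;
sstat -j &amp;lt;jobid&amp;gt;.batch --format=JobID,MaxRSS,AveCPU,MaxDiskWrite&lt;br /&gt;
&lt;br /&gt;
# Show which node(s) the job is running on.&lt;br /&gt;
squeue -j &amp;lt;jobid&amp;gt; -o %N&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;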
&lt;br /&gt;
== Common causes ==&lt;br /&gt;
&lt;br /&gt;
* Software not available because a [[Environment Modules|module]] (or its bucket) was not loaded in the script.&lt;br /&gt;
* Relative paths that worked interactively but not from the job&amp;#039;s working directory.&lt;br /&gt;
* Running I/O-heavy work on, or writing large amounts of data to, the wrong filesystem; see [[Storage Systems Overview]].&lt;br /&gt;
* Asking for resources the partition cannot provide; see [[Partitions / Queues]].&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Monitoring Jobs]]&lt;br /&gt;
* [[Interactive Jobs]]&lt;br /&gt;
* [[Batch Jobs]]&lt;br /&gt;
* [[Scheduler Overview (Slurm)]]&lt;/div&gt;</summary>
		<author><name>Haars0011</name></author>
	</entry>
</feed>