Debugging Jobs

From HPCwiki
Revision as of 14:22, 18 June 2026 by Haars0011 (talk | contribs) (IA migration §8: new Debugging Jobs page (via create-page on MediaWiki MCP Server))

When a job fails, exits early, or produces the wrong results, a few systematic checks usually find the cause. This page covers how to work out what went wrong; for the monitoring commands themselves, see Monitoring Jobs.

Start with the logs

SLURM writes your job's standard output and error to the files you set with --output and --error in your batch script. Read these first — most failures (a missing file, a typo, an out-of-memory message, a module that was not loaded) show up there.

If you did not set those options, both standard output and standard error go to slurm-<jobid>.out in the directory you submitted from.
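
When a log is long, a quick search for typical failure messages can save time. The sketch below is one way to do that; the function name scan_log and the pattern list are illustrative assumptions, not a tool provided on the cluster, so extend the patterns for your own software.

```shell
# scan_log FILE: print the numbered lines of FILE that match common
# failure messages. The pattern list is an assumption; adjust as needed.
scan_log() {
    grep -n -i -E 'error|killed|no such file|command not found|out of memory|time limit' "$1"
}

# Example (the job ID is a placeholder):
#   scan_log slurm-<jobid>.out
```

grep returns a non-zero status when nothing matches, so an empty result here only means that none of these patterns occur, not that the job was healthy.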

Check how the job ended

For a finished job, sacct shows its exit code and state (COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, CANCELLED):

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,ReqMem

Common patterns:

  • TIMEOUT — the job hit its --time limit. Request more time, or make the work faster or smaller.
  • OUT_OF_MEMORY — the job needed more memory than it requested. Increase --mem (or --mem-per-cpu); compare MaxRSS against ReqMem to size it.
  • FAILED with a non-zero exit code — the program itself errored; check the logs.
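
The patterns above can be sketched as a small lookup that turns a job state into a next step. The function name state_hint is hypothetical and the hint wording is only a summary of this page; the sacct flags in the usage comment (-X, -n, -P) restrict the output to the allocation line, drop the header, and use parsable output.

```shell
# state_hint STATE: print a suggested next step for a sacct job state.
state_hint() {
    case "$1" in
        COMPLETED)     echo "finished normally; if results are wrong, check the logs and inputs" ;;
        TIMEOUT)       echo "hit the --time limit; request more time or reduce the work" ;;
        OUT_OF_MEMORY) echo "needed more memory than requested; raise --mem or --mem-per-cpu" ;;
        CANCELLED*)    echo "cancelled before it finished" ;;   # sacct may print "CANCELLED by <uid>"
        FAILED)        echo "program exited with an error; read the job's output and error files" ;;
        *)             echo "no hint for state: $1" ;;
    esac
}

# Example (the job ID is a placeholder):
#   state_hint "$(sacct -j <jobid> -X -n -P --format=State)"
```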

See Monitoring Jobs for more on sacct and the other monitoring tools.
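
To compare MaxRSS against ReqMem, both values need to be in the same unit, and sacct usually prints them with different suffixes (for example 2097152K and 4G). The helper below is a minimal sketch that converts such a value to whole MiB. The name to_mib is hypothetical, and it assumes the value carries a K, M, G, or T suffix, optionally followed by the n or c that older SLURM versions append to ReqMem.

```shell
# to_mib VALUE: convert a sacct memory value such as "2097152K" or "4G"
# to whole MiB. Assumes a K/M/G/T suffix; anything else is an error.
to_mib() {
    printf '%s\n' "$1" | awk '{
        value = $0 + 0                 # leading number, e.g. 1.5 from "1.5G"
        unit = toupper($0)
        gsub(/[0-9.]/, "", unit)       # drop the digits, keep the letters
        unit = substr(unit, 1, 1)      # "GN" or "MC" (older ReqMem style) -> "G" or "M"
        if (unit == "K")      value /= 1024
        else if (unit == "G") value *= 1024
        else if (unit == "T") value *= 1024 * 1024
        else if (unit != "M") { print "unknown unit in: " $0 > "/dev/stderr"; exit 1 }
        printf "%d\n", value
    }'
}

# Example: headroom between the request and the observed peak
#   echo "$(( $(to_mib 4G) - $(to_mib 2097152K) )) MiB unused"
```

MaxRSS is normally reported on the job's step lines (such as <jobid>.batch) rather than on the allocation line, so read the value from the step that ran your program.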

Reproduce it interactively

If the logs are not enough, reproduce the problem in an interactive job. Request the same resources, then run the commands from your script by hand and watch what happens. This is the quickest way to debug module loading, paths, and input files.
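
One way to request the same resources is to reuse the #SBATCH lines from the batch script. The sketch below prints an srun command built from them, leaving out the options that only name the job or its log files. It assumes that interactive sessions on this cluster are started with srun --pty, that each #SBATCH line holds one option with no trailing comment, and that every remaining option is also valid for srun; the function name interactive_cmd is hypothetical. Review the printed command before running it.

```shell
# interactive_cmd SCRIPT: print an srun command that requests the
# resources named on the #SBATCH lines of SCRIPT.
interactive_cmd() {
    local script=$1 opts
    opts=$(sed -n 's/^#SBATCH[[:space:]][[:space:]]*//p' "$script" |
           grep -v -E '^(--output|--error|--job-name|-o|-e|-J)([= ]|$)' |
           tr '\n' ' ')
    printf 'srun %s--pty bash -i\n' "$opts"
}

# Example (job.sh is a placeholder for your batch script):
#   interactive_cmd job.sh
```

Printing the command instead of running it keeps the step inspectable: you can remove options that do not apply to an interactive session, such as job arrays, before you start it.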

Check a running job

For a job that is running but behaving oddly (too slow, using too much memory), inspect its live resource use with sstat, or connect to the node it runs on — see Compute Nodes and Monitoring Jobs.
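
As a minimal sketch of the sstat check, the wrapper below queries the batch step of a running job. It assumes the program runs directly in the batch script; work launched with srun inside the script appears under numbered steps (<jobid>.0, <jobid>.1, and so on) instead. The function name live_usage and the chosen columns are illustrative.

```shell
# live_usage JOBID: show peak memory and average CPU time so far for
# the batch step of a running job, in parsable form without a header.
live_usage() {
    sstat -j "$1.batch" --noheader --parsable2 --format=JobID,MaxRSS,AveCPU
}

# Example (the job ID is a placeholder):
#   live_usage <jobid>
```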

Common causes

  • Software not available because a module (or its bucket) was not loaded in the script.
  • Relative paths that worked interactively but not from the job's working directory.
  • Running heavy work on, or writing large amounts of data to, the wrong filesystem; see Storage Systems Overview.
  • Asking for resources the partition cannot provide; see Partitions / Queues.
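
Several of these causes can be caught early by having the job report its environment before the real work starts. The function below is a sketch of such a check, not a site-provided tool: the name preflight is hypothetical, and it assumes the module command is a shell function, as it is with common environment-module systems.

```shell
# preflight FILE...: report where the job runs, list loaded modules if
# the module command exists, and fail if any named input is missing.
preflight() {
    local f
    echo "node:    $(uname -n)"
    echo "workdir: $(pwd)"
    if type module >/dev/null 2>&1; then
        module list 2>&1          # module systems usually print to stderr
    fi
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing input: $f" >&2
            return 1
        fi
    done
}

# Example near the top of a batch script (input.dat is a placeholder):
#   preflight input.dat || exit 1
```

The working directory and module list then appear at the top of the job's log, which makes a missing module or an unexpected relative path visible without rerunning the job.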

See also