Debugging Jobs
When a job fails, exits early, or produces the wrong results, a few systematic checks usually find the cause. This page covers how to work out what went wrong; for the monitoring commands themselves, see Monitoring Jobs.
Start with the logs
SLURM writes your job's standard output and error to the files you set with --output and --error in your batch script. Read these first — most failures (a missing file, a typo, an out-of-memory message, a module that was not loaded) show up there.
If you did not set those options, the output goes to slurm-<jobid>.out in the directory you submitted from.
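As a minimal sketch, the top of a batch script that sets both options might look like the following. The job name and file names are placeholders rather than site conventions; %j is the SLURM filename pattern that expands to the job ID.

```shell
#!/bin/bash
# Placeholder job name and log file names; %j expands to the job ID.
#SBATCH --job-name=example
#SBATCH --output=example-%j.out
#SBATCH --error=example-%j.err

# SLURM_JOB_ID is set by SLURM inside a job; fall back to "unknown" elsewhere.
job_label="${SLURM_JOB_ID:-unknown}"

echo "job ${job_label} starting"          # ends up in the --output file
echo "job ${job_label} diagnostics" >&2   # ends up in the --error file
```

Keeping the job ID in the file name stops a rerun from overwriting the logs of the run you are trying to debug.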
Check how the job ended
For a finished job, sacct shows its exit code and state (COMPLETED, FAILED, TIMEOUT, OUT_OF_MEMORY, CANCELLED):
sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,ReqMem
Common patterns:
- TIMEOUT: the job hit its --time limit. Request more time, or make the work faster or smaller.
- OUT_OF_MEMORY: the job needed more memory than it requested. Increase --mem (or --mem-per-cpu); compare MaxRSS against ReqMem to size it.
- FAILED with a non-zero exit code: the program itself errored; check the logs.
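The patterns above can be turned into a small lookup, sketched here. The function names diagnose_state and diagnose_job are hypothetical helpers, not part of SLURM; the sacct options used (-X for the allocation line only, -n for no header, -P for unpadded output) are standard.

```shell
# Hypothetical helper: map a sacct State value to a first thing to try.
diagnose_state() {
    case "$1" in
        TIMEOUT)        echo "hit the time limit: raise --time or reduce the work" ;;
        OUT_OF_MEMORY)  echo "ran out of memory: raise --mem or --mem-per-cpu" ;;
        FAILED)         echo "program exited non-zero: read the job's logs" ;;
        CANCELLED*)     echo "job was cancelled before it finished" ;;
        COMPLETED)      echo "job finished normally" ;;
        *)              echo "no hint for state: $1" ;;
    esac
}

# Hypothetical wrapper: look up the state of a finished job and explain it.
# Needs sacct, so it only works on a cluster login or compute node.
diagnose_job() {
    diagnose_state "$(sacct -j "$1" -X -n -P --format=State)"
}
```

On a cluster, you would call diagnose_job with the job ID. The CANCELLED* pattern is there because sacct can report that state with a suffix such as "CANCELLED by 1001".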
See Monitoring Jobs for more on sacct and the other monitoring tools.
Reproduce it interactively
If the logs are not enough, reproduce the problem in an interactive job. Request the same resources, then run the commands from your script by hand and watch what happens. This is the quickest way to debug module loading, paths, and input files.
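One way to do this is sketched below, assuming your site allows interactive jobs through srun --pty (some sites prefer salloc or require a --partition option). The resource values are placeholders to be replaced with the ones from your batch script, and both function names are hypothetical.

```shell
# Hypothetical wrapper: request the same resources as the batch script.
# The values are placeholders; copy the real ones from your #SBATCH lines.
start_debug_session() {
    srun --cpus-per-task=4 --mem=8G --time=01:00:00 --pty bash -l
}

# Hypothetical helper to run inside the session: report every input path
# that is missing or unreadable, and return non-zero if any was found.
check_inputs() {
    local rc=0 path
    for path in "$@"; do
        if [ ! -r "$path" ]; then
            echo "missing or unreadable: $path" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Inside the session, running check_inputs on the files your script reads, before running the script's commands one at a time, quickly separates path problems from program errors.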
Check a running job
For a job that is running but behaving oddly (too slow, using too much memory), inspect its live resource use with sstat, or connect to the node it runs on — see Compute Nodes and Monitoring Jobs.
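A minimal sketch of the sstat check follows. The wrapper name job_usage is hypothetical, and the field list is only one reasonable choice; whether sstat returns data also depends on how accounting is configured on your cluster.

```shell
# Hypothetical wrapper: show live memory and CPU use for every step of a
# running job. Call it with the job ID as the only argument.
job_usage() {
    local jobid="${1:?usage: job_usage JOBID}"
    sstat --allsteps -j "$jobid" --format=JobID,MaxRSS,AveCPU,AveRSS
}
```

If MaxRSS is already close to the memory you requested while the job is still running, the job is a likely candidate for an OUT_OF_MEMORY failure later.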
Common causes
- Software not available because a module (or its bucket) was not loaded in the script.
- Relative paths that worked interactively but not from the job's working directory.
- Running heavy work on, or writing large amounts of I/O to, the wrong filesystem — see Storage Systems Overview.
- Asking for resources the partition cannot provide — see Partitions / Queues.
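Several of these causes are easier to spot if the job records its own context at startup. The sketch below is one possible preamble for a batch script; log_job_context is a hypothetical name, and the module check assumes an environment-modules setup that provides a module command.

```shell
# Hypothetical preamble for a batch script: write the job's context to the log.
log_job_context() {
    echo "host: $(uname -n)"
    echo "working directory: $(pwd)"
    # SLURM_SUBMIT_DIR is set by SLURM inside a job.
    echo "submit directory: ${SLURM_SUBMIT_DIR:-not set}"
    # `module` exists only where a modules system has been initialised.
    if command -v module >/dev/null 2>&1; then
        module list 2>&1
    else
        echo "module command not available"
    fi
}

log_job_context
```

Comparing the working directory and module list in the log with what you had interactively often explains a missing program or a relative path that no longer resolves.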