Monitoring job execution

Duque004 (talk | contribs)
Phase 1 § 4 redirect: content merged into Monitoring Jobs (P1.4.7) (via update-page on MediaWiki MCP Server)
#REDIRECT [[Monitoring Jobs]]
== Output stream redirection ==
 
The primary way to monitor job execution is through the stdout and stderr streams. These are redirected to [https://wiki.hpcagrogenomics.wur.nl/index.php/Creating_sbatch_script#output_.28stderr.2Cstdout.29_directed_to_file text files specified in the SLURM script].
 
The <code>tail</code> command is particularly useful for this purpose. To follow the output continuously as it is written to a text file, use:
 
<code>tail -f output_987654.txt</code>
 
To obtain the last X lines of a text file use:
 
<code>tail -n X output_987654.txt</code>
 
Replace X with the desired number of lines.
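
As a small worked example of the <code>tail -n</code> form, the sketch below writes a five-line stand-in for a job output file and prints its last two lines. The file name <code>output_demo.txt</code> and its contents are made up for illustration; on the cluster you would point <code>tail</code> at your real output file instead.

```shell
# Create a small stand-in for a job output file (hypothetical name and content).
out_file="output_demo.txt"
for i in 1 2 3 4 5; do
    echo "step $i"
done > "$out_file"

# Keep only the last 2 lines of the file.
last_lines=$(tail -n 2 "$out_file")
echo "$last_lines"
```

Here the two lines printed are <code>step 4</code> and <code>step 5</code>.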
 
If the output file gets too long and you wish to read it from the beginning, open it with the pager <code>less</code>:
 
<code>less output_987654.txt</code>
 
Use the Q key to exit less.
 
== Monitoring resource usage ==
 
While the output streams may suffice in most cases, certain programmes might not provide much feedback. This could be the case with a programme that relies on modules that are not verbose. In such situations it is best to monitor resource usage to gauge job execution. Two possible options are described below.
 
=== Using sstat ===
 
<code>sstat</code> is a SLURM tool that can be used to obtain instantaneous information on the resource usage of a running job: CPU load, memory, etc. To use it, start by changing your SLURM script so that your programme is launched with the <code>srun</code> command (this should be the last line in the script):
 
<code>srun python3 calc_pi.py</code>
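
For context, a complete batch script built around that last line might look like the sketch below. The job name, the output and error file patterns and the task count are assumptions chosen for illustration, not values prescribed by this page; <code>%j</code> is the SLURM file name pattern that expands to the job number. The snippet writes the script to a hypothetical file <code>job_sstat.sh</code> and only checks its syntax, so it can be tried anywhere.

```shell
# Write a minimal batch script to a file (all names and values are illustrative).
cat > job_sstat.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=calc_pi
#SBATCH --output=output_%j.txt
#SBATCH --error=error_%j.txt
#SBATCH --ntasks=1

# Launch the programme as a job step so that sstat can report on it.
srun python3 calc_pi.py
EOF

# Check the syntax locally without running anything.
bash -n job_sstat.sh && echo "syntax OK"

# On the cluster, submit it with: sbatch job_sstat.sh
```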
 
Note down the job number. During execution you can then use <code>sstat</code>, passing the job number with the <code>-j</code> flag:
 
<code>sstat --format=AveCPU,AveRSS,MaxRSS -P -j 987654</code>
 
<code>sstat</code> can provide information on many different variables; for more details, check [https://slurm.schedmd.com/sstat.html the manual].
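
Because <code>sstat</code> reports a single snapshot, it can be convenient to call it repeatedly and keep the samples. The following is a minimal sketch of such a polling loop, not a recommended tool: the job number, the interval, the number of samples and the log file name are all assumptions for illustration. It uses the same fields as above, adds <code>-n</code> to suppress the header line, and prefixes each sample with the time at which it was taken.

```shell
job_id=987654        # hypothetical job number, as in the example above
interval=10          # seconds between samples (illustrative)
samples=3            # number of samples to take (illustrative)
log_file="sstat_${job_id}.log"

# Take one sample: pipe-delimited (-P), without header (-n).
sample_usage() {
    sstat --format=AveCPU,AveRSS,MaxRSS -P -n -j "$1"
}

if command -v sstat >/dev/null 2>&1; then
    i=1
    while [ "$i" -le "$samples" ]; do
        # Append "time|AveCPU|AveRSS|MaxRSS" to the log file.
        printf '%s|%s\n' "$(date +%H:%M:%S)" "$(sample_usage "$job_id")" >> "$log_file"
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
        i=$((i + 1))
    done
else
    echo "sstat not found: run this on a node where the SLURM tools are available"
fi
```

The guard around the loop only makes the sketch safe to run on a machine without SLURM; on the cluster the loop itself is the relevant part.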
 
=== Using top ===
 
Another way is to log the output of the [http://man7.org/linux/man-pages/man1/top.1.html <code>top</code>] command. This requires adding an extra command to your SLURM script, again using <code>srun</code> (where "user001" should be replaced by your user name):
 
<code>srun --overcommit --ntasks=1 top -b -u user001 &
python3 calc_pi.py</code>
 
This will log the output of <code>top</code> to the stdout stream file specified in the SLURM script every 3 seconds (the default refresh interval of <code>top</code>). Using the <code>tail</code> command you will be able to see logs like:
 
<pre>top - 18:09:12 up 53 days, 22:53,  0 users,  load average: 27,04, 27,63, 26,53
Tasks: 1068 total,  4 running, 1064 sleeping,  0 stopped,  0 zombie
%Cpu(s): 41,9 us,  0,2 sy,  0,0 ni, 57,9 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem : 10439453+total, 18650486+free, 43690281+used, 42053763+buff/cache
KiB Swap: 26214400+total, 26214393+free,      52 used. 54182425+avail Mem
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
29260 user001  20  0  15,6g  15,2g  17392 R  1777  1,5  4014:16 R
29447 user001  20  0  178604  12900  1676 R  1,6  0,0  6:33.17 top
28627 user001  20  0  113184  1520  1260 S  0,0  0,0  0:00.01 slurm_scri+
29253 user001  20  0  245096  4792  1976 S  0,0  0,0  0:00.66 srun
29357 user001  20  0  36124    688    16 S  0,0  0,0  0:00.00 srun</pre>
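
Since every snapshot lists all of your processes, it can help to pull out only the line for the programme of interest. The sketch below does this with <code>awk</code> on a shortened copy of the sample log above, saved under the hypothetical name <code>top_demo.txt</code>. It assumes the 12-column layout shown above, where %CPU is the 9th column, RES the 6th and the command name the last, and it remembers the time stamp from each <code>top -</code> header line.

```shell
# A shortened copy of the sample log above, saved under a hypothetical name.
log_file="top_demo.txt"
cat > "$log_file" <<'EOF'
top - 18:09:12 up 53 days, 22:53,  0 users,  load average: 27,04, 27,63, 26,53
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
29260 user001  20  0  15,6g  15,2g  17392 R  1777  1,5  4014:16 R
29447 user001  20  0  178604  12900  1676 R  1,6  0,0  6:33.17 top
EOF

# Print "time %CPU RES" for each snapshot of the command named in cmd.
usage=$(awk -v cmd="R" '
    $1 == "top" && $2 == "-" { stamp = $3 }   # time stamp of the current snapshot
    $1 ~ /^[0-9]+$/ && NF == 12 && $NF == cmd { print stamp, $9, $6 }
' "$log_file")
echo "$usage"
```

For the sample above this prints <code>18:09:12 1777 15,2g</code>. To use it on a real log, replace <code>top_demo.txt</code> with your stdout file and <code>cmd</code> with the name shown in the COMMAND column for your programme.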

Latest revision as of 09:54, 16 June 2026
