Monitoring job execution
Latest revision as of 14:26, 27 August 2018
Output stream redirection
The primary way to monitor job execution is through the stdout and stderr streams. These are redirected to text files specified in the SLURM script.
For this purpose the tail command is particularly useful. To continuously follow the output as it is written to a text file use:
tail -f output_987654.txt
To obtain the last X lines of a text file use:
tail -n X output_987654.txt
Replace X with the desired number of lines.
If the output file gets too long and you wish to read it from the beginning, you may combine the commands cat and less:
cat output_987654.txt | less
Use the Q key to exit less.
Monitoring resource usage
While the output streams may suffice in most cases, certain programmes might not provide much feedback. This could be the case with a programme that relies on modules that are not verbose. In such situations it is best to monitor resource usage to gauge job execution. Two possible options are described below.
Using sstat
sstat is a SLURM tool that can be used to obtain instantaneous information on resource usage, CPU load, memory, etc. To use it, start by changing your SLURM script so that your programme is launched with the srun command (this should be the last line in the script):
srun python3 calc_pi.py
Note down the job number. During execution you can then use sstat, passing the job number with the -j flag:
sstat --format=AveCPU,AveRSS,MaxRSS -P -j 987654
sstat can provide information on many different variables; for more details check the manual.
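Because the -P flag makes sstat print pipe-delimited output (a header line followed by one line per job step), the result is easy to post-process. The following is a minimal sketch, not part of the original instructions: the helper name sstat_to_kv is hypothetical, and the sample values piped into it are invented purely so the snippet runs without a live job.

```shell
#!/bin/sh
# Hypothetical helper: turn pipe-delimited `sstat -P` output
# (header line + data lines) into one "field=value" pair per line.
sstat_to_kv() {
    awk -F'|' '
        NR == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }  # remember column names
        { for (i = 1; i <= NF; i++) printf "%s=%s\n", name[i], $i }
    '
}

# Demonstration with invented sample values (assumption, not real measurements).
printf 'AveCPU|AveRSS|MaxRSS\n00:02.000|1234K|5678K\n' | sstat_to_kv

# With a running job, the real command would be piped in instead:
# sstat --format=AveCPU,AveRSS,MaxRSS -P -j 987654 | sstat_to_kv
```

Reading from standard input keeps the helper independent of SLURM, so it can be tried on saved output before being used against a running job.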
Using top
Another way is to log the output of the top command. This requires adding an extra command to your SLURM script, again using srun (where "user001" should be replaced by your user name):
srun --overcommit --ntasks=1 top -b -u user001 &
python3 calc_pi.py
This will log the output of top to the stdout stream file specified in the SLURM script every 3 seconds. Using the tail command you will be able to see logs like:
top - 18:09:12 up 53 days, 22:53, 0 users, load average: 27,04, 27,63, 26,53
Tasks: 1068 total, 4 running, 1064 sleeping, 0 stopped, 0 zombie
%Cpu(s): 41,9 us, 0,2 sy, 0,0 ni, 57,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 10439453+total, 18650486+free, 43690281+used, 42053763+buff/cache
KiB Swap: 26214400+total, 26214393+free, 52 used. 54182425+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29260 user001 20 0 15,6g 15,2g 17392 R 1777 1,5 4014:16 R
29447 user001 20 0 178604 12900 1676 R 1,6 0,0 6:33.17 top
28627 user001 20 0 113184 1520 1260 S 0,0 0,0 0:00.01 slurm_scri+
29253 user001 20 0 245096 4792 1976 S 0,0 0,0 0:00.66 srun
29357 user001 20 0 36124 688 16 S 0,0 0,0 0:00.00 srun
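Since top repeats this block every few seconds, the log grows quickly, and it can help to extract only the rows for the process of interest. The sketch below assumes the default 12-column layout shown in the log above (RES in column 6, %CPU in column 9, %MEM in column 10). The function name top_usage_for_pid and the temporary demo file are hypothetical; the two sample rows are copied from the log above.

```shell
#!/bin/sh
# Hypothetical helper: print RES, %CPU and %MEM for one PID
# from a file containing logged `top -b` output.
# $1: PID to look for, $2: log file
top_usage_for_pid() {
    awk -v pid="$1" '$1 == pid { print $6, $9, $10 }' "$2"
}

# Demonstration on a temporary file holding rows taken from the log above.
demo_log=$(mktemp)
printf '%s\n' \
    'PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND' \
    '29260 user001 20 0 15,6g 15,2g 17392 R 1777 1,5 4014:16 R' \
    '29447 user001 20 0 178604 12900 1676 R 1,6 0,0 6:33.17 top' > "$demo_log"

top_usage_for_pid 29260 "$demo_log"
rm -f "$demo_log"

# On a real job, pass the stdout file named in the SLURM script instead:
# top_usage_for_pid 29260 output_987654.txt
```

Each matching line in the output corresponds to one top refresh, so the sequence gives a rough time series of memory and CPU usage for that process.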