Using Slurm
The resource allocation / scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.
== Queues and defaults ==
=== Quality of Service ===
When submitting a job, you may optionally assign a different Quality of Service to it. You can do this with:
<pre>
#SBATCH --qos=std
</pre>
By default, jobs will use std, the standard quality.
Optionally, you may elect to reduce the priority of your jobs to low. This comes with a limit on how long each job can run (8h), to prevent the cluster from being locked up entirely by low-priority jobs.
The high quality gives jobs a higher priority (20) than std (10) or low (1). It is naturally more expensive.
The highest priority goes to jobs in the interactive quality (100), but you may not submit many jobs, or very large jobs, at this quality. It is exclusively for jobs that need to run immediately and have hands-on users behind them.
Jobs may be restarted and rescheduled if a job with higher priority needs cluster resources, but as of right now, this is not occurring.
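The QoS can also be chosen on the command line at submission time instead of inside the script; a minimal sketch (the script name is just an example):
<pre>
sbatch --qos=low myjob.sh
</pre>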
=== Queues ===
The cluster consists of multiple partitions of nodes that you can submit to. The primary one is 'main'. There are other partitions as needed - current plans include 'gpu'.
You can see the partitions available with <code>sinfo</code>:
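Running this on the login node prints, for each partition, its availability, time limit, node count, node state and node list:
<pre>
sinfo
</pre>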
=== Defaults ===
* The default partition is 'main'. This will work for most jobs.
* The default qos is 'std'.
* The default cpu count is 1.
* The default run time for a job is '''1 hour'''.
* The maximum run time for a job is '''3 weeks'''.
* The default memory limit is '''100MB per node'''.
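Any of these defaults can be overridden with the corresponding #SBATCH directives in your job script; a sketch with purely illustrative values (the time format here is days-hours:minutes:seconds):
<pre>
#SBATCH --partition=main
#SBATCH --qos=std
#SBATCH --cpus-per-task=4
#SBATCH --time=1-12:00:00
#SBATCH --mem=4G
</pre>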
== Submitting jobs: sbatch ==
=== Example ===
Consider this simple python3 script that should calculate Pi to 1 million digits:
<pre>
from decimal import *
D=Decimal
getcontext().prec=10000000
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))
print(str(p)[:10000002])
</pre>
=== Loading modules ===
In order for this script to run, Python3, which is not the default Python version on the cluster, first needs to be loaded into your environment. Availability of (different versions of) software can be checked with the following command:
<pre>
module avail
</pre>
In the list you should note that python3 is indeed available to be loaded, which can then be done with the following command:
<pre>
module load python/3.3.3
</pre>
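To check which modules are currently active in your environment, you can run:
<pre>
module list
</pre>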
=== Batch script ===
[[Creating_sbatch_script | Main Article: Creating a sbatch script]]
The following shell/SLURM script can then be used to schedule the job using the sbatch command:
<pre>
#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --cpus-per-task=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl
time python3 calc_pi.py
</pre>
=== Submitting ===
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<pre>
sbatch run_calc_pi.sh
</pre>
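If the submission succeeds, sbatch reports the job ID that was assigned to the job; this ID is what the monitoring commands described below operate on (the number shown here is just an illustration):
<pre>
$ sbatch run_calc_pi.sh
Submitted batch job 12345
</pre>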
=== Submitting multiple jobs (simple) ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code:
<pre>for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh; done
</pre>
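In bash, a brace expansion achieves the same without calling seq; a minimal equivalent sketch:
<pre>
for i in {1..10}; do sbatch runscript_$i.sh; done
</pre>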
=== Submitting multiple jobs (complex) ===
Let's say you have three job scripts that depend on each other:
<pre>job_1.sh #A simple initialisation script</pre>
<pre>job_2.sh #An array task</pre>
<pre>job_3.sh #Some finishing script, single run, after everything previous has finished</pre>
You can create a script that submits all of them at once, each with a dependency on the previous job:
<pre>#!/bin/bash
JOB1=$(sbatch job_1.sh| rev | cut -d ' ' -f 1 | rev) #Get me the last space-separated element, i.e. the job ID
if ! [ "z$JOB1" == "z" ] ; then
  echo "First job submitted as jobid $JOB1"
  JOB2=$(sbatch --dependency=afterany:$JOB1 job_2.sh| rev | cut -d ' ' -f 1 | rev)
  if ! [ "z$JOB2" == "z" ] ; then
    echo "Second job submitted as jobid $JOB2, following $JOB1"
    JOB3=$(sbatch --dependency=afterany:$JOB2 job_3.sh| rev | cut -d ' ' -f 1 | rev)
    if ! [ "z$JOB3" == "z" ] ; then
      echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"
    fi
  fi
fi
</pre>
This ensures that each subsequent job only starts after the previous one has finished in any way (even if it failed).
Please see [https://slurm.schedmd.com/sbatch.html#OPT_dependency the sbatch documentation] for the other dependency options available to you. Note that aftercorr makes each array element of a subsequent array job start after the correspondingly numbered element of the previous job has finished. An alternative way to capture the job IDs is shown in the sketch below.
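Recent Slurm versions also provide sbatch --parsable, which prints only the job ID (followed by the cluster name on multi-cluster setups), so the rev/cut pipeline above can be avoided; a minimal sketch, assuming the flag is available on Anunna:
<pre>
#!/bin/bash
# Chain the same three jobs, capturing each job ID directly via --parsable
JOB1=$(sbatch --parsable job_1.sh)
JOB2=$(sbatch --parsable --dependency=afterany:$JOB1 job_2.sh)
JOB3=$(sbatch --parsable --dependency=afterany:$JOB2 job_3.sh)
echo "Submitted chain: $JOB1 -> $JOB2 -> $JOB3"
</pre>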
=== Submitting array jobs ===
<pre>
#SBATCH --array=0-10%4
</pre>
SLURM allows you to submit multiple jobs using the same template. The directive above creates array elements with indices 0 through 10, of which at most 4 run simultaneously. Further information about this can be found [[Array_jobs|here]].
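Inside the job script, the environment variable SLURM_ARRAY_TASK_ID holds the index of the array element that is currently running; a minimal sketch (the output file naming is purely illustrative):
<pre>
#!/bin/bash
#SBATCH --array=0-10%4
#SBATCH --time=60
# Each array element writes its own result file, selected by the array index
echo "Running array element $SLURM_ARRAY_TASK_ID"
python3 calc_pi.py > result_${SLURM_ARRAY_TASK_ID}.txt
</pre>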
=== Using /tmp ===
Each node has a local disk of ~300G attached that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after usage.
In order to be sure that you're able to use space in /tmp, you can add
<pre>
#SBATCH --tmp=<required size>
</pre>
to your sbatch script. This prevents your job from being scheduled on nodes where there is not enough free space, or where the space is already claimed by another job running at the same time.
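A sketch of a job that stages its data through /tmp and cleans up afterwards (the requested size, paths and commands are purely illustrative):
<pre>
#!/bin/bash
#SBATCH --tmp=50G
#SBATCH --time=120
# Create a private scratch directory on the node-local disk, and remove it when the job exits
SCRATCH=/tmp/${USER}_${SLURM_JOB_ID}
mkdir -p "$SCRATCH"
trap 'rm -rf "$SCRATCH"' EXIT
cp "$HOME/project/input.dat" "$SCRATCH/"
cd "$SCRATCH"
python3 "$HOME/project/analyse.py" input.dat > output.dat
cp output.dat "$HOME/project/results/"
</pre>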
=== Using GPU ===
There are six GPU nodes. In order to run a job that uses a GPU on one of these nodes, you can add
<pre>
#SBATCH --gres=gpu:<num gpus>
#SBATCH --partition=gpu
</pre>
to your sbatch script. Without these parameters, your job won't run on one of these nodes.
Be sure to add the gres line, otherwise your job will either fail, or it will run on the CPU instead of on the GPU.
As we have different flavours of GPUs, you might want to select a type/manufacturer.
If you don't, you will get one that is available.
To see which types are available, run this:
<pre>
scontrol show -o node | grep -o -e "NodeName=\w*" -e "ActiveFeatures=[[:alnum:][:punct:]]*" | paste - - | column -t | grep gpu
</pre>
To select a certain type, use the flag:
<pre>
#SBATCH --constraint
</pre>
Example:
<pre>
# This will limit this job to the A100 GPUs
#SBATCH --constraint='nvidia&A100'
</pre>
A rough estimate is that the A100/80G cards are about twice as fast as the A6000/48G or the V100/16G, but this all depends on whether your analyses actually need the RAM and can completely fill the GPU.
We have set up the scheduler in such a way that the A100s are chosen first, then the A6000s, and lastly the V100s.
The pricing for all of them is the same.
Please use the nvidia constraint if your jobs are limited to those cards, as we will put the AMD GPUs online in the future, which would then probably break your analyses.
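Putting these pieces together, a GPU batch script header could look like this sketch (the resource amounts and the module/program names are illustrative, not actual Anunna module names):
<pre>
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --constraint='nvidia&A100'
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=240
module load cuda          # hypothetical module name; check 'module avail' for the real one
python3 train_model.py    # hypothetical GPU workload
</pre>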
== Monitoring submitted jobs ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs, such as the time limit.
=== Generic monitoring of all running jobs ===
<pre>
squeue
</pre>
You should then get a list of jobs that are running at that time on the cluster. For jobs submitted with the 'sbatch' command as in the example above, it may look like this:
<pre>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3396 ABGC BOV-WUR- megen002 R 27:26 1 node004
3397 ABGC BOV-WUR- megen002 R 27:26 1 node005
3398 ABGC BOV-WUR- megen002 R 27:26 1 node006
3399 ABGC BOV-WUR- megen002 R 27:26 1 node007
3400 ABGC BOV-WUR- megen002 R 27:26 1 node008
3401 ABGC BOV-WUR- megen002 R 27:26 1 node009
3385 research BOV-WUR- megen002 R 44:38 1 node049
3386 research BOV-WUR- megen002 R 44:38 1 node050
3387 research BOV-WUR- megen002 R 44:38 1 node051
3388 research BOV-WUR- megen002 R 44:38 1 node052
3389 research BOV-WUR- megen002 R 44:38 1 node053
3390 research BOV-WUR- megen002 R 44:38 1 node054
3391 research BOV-WUR- megen002 R 44:38 3 node[049-051]
3392 research BOV-WUR- megen002 R 44:38 3 node[052-054]
3393 research BOV-WUR- megen002 R 44:38 1 node001
3394 research BOV-WUR- megen002 R 44:38 1 node002
3395 research BOV-WUR- megen002 R 44:38 1 node003
</pre>
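To list only your own jobs instead of everything running on the cluster, you can filter on user name:
<pre>
squeue -u $USER
</pre>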
=== Monitoring time limit set for a specific job ===
The default time limit is set at one hour, so estimated run times need to be specified when submitting jobs. To see the time limit that is set for a certain job, you can again use the <code>squeue</code> command:
<pre>
squeue -l -j 3532
</pre>
Information similar to the following should appear:
<pre>
Fri Nov 29 15:41:00 2013
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
3532 ABGC BOV-WUR- megen002 RUNNING 2:47:03 3-08:00:00 1 node054
</pre>
=== Query a specific active job: scontrol ===
scontrol shows all the details of a currently active job (not a completed job):
<pre>
login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
UserId=megen002(16795409) GroupId=domain users(16777729)
Priority=1 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=research AllocNode:Sid=login0:21799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node023
BatchHost=node023
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/scratch/WUR/ABGC/...
WorkDir=/lustre/scratch/WUR/ABGC/...
</pre>
=== Check on a pending job ===
A submitted job can end up in a pending state when there are not enough resources available for it.
In this example I submit a job, check its status, and after finding out it is '''pending''' I check when it will probably start.
<pre>
[@login jobs]$ sbatch hpl_student.job
Submitted batch job 740338
[@login jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)
[@login jobs]$ squeue --start -j 740338
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
</pre>
So it seems this job will probably start the next day, but there is no guarantee that it actually will.
== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command, which takes the jobid as a parameter. For the example above, this would be done as follows:
<pre>
scancel 3401
</pre>
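scancel can also remove jobs in bulk; for example, to cancel all of your own jobs at once:
<pre>
scancel -u $USER
</pre>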
== Allocating resources interactively: sinteractive ==
sinteractive is a tiny wrapper around srun to create interactive jobs quickly and easily. It allows you to get a shell on one of the nodes, with limits similar to those of a normal job. To use it, simply run:
<pre>
sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>
</pre>
You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code interactively as needed.
Be advised though - not filling in the above fields will get you a shell with 1 CPU and 100MB of RAM for 1 hour. This is useful for quick testing, however.
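For example, to get an interactive shell with 4 CPUs and 8GB of memory for two hours on the main partition (adjust the values to your needs):
<pre>
sinteractive -c 4 --mem 8G --time 120 -p main
</pre>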
=== sinteractive source ===
<pre>
#!/bin/bash
# Pass all user-supplied options on to srun; -I60 gives up if no allocation is granted within 60 seconds,
# -N 1 -n 1 requests a single task on a single node, and --pty bash -i starts an interactive shell
srun "$@" -I60 -N 1 -n 1 --pty bash -i
</pre>
=== Interactive Slurm - using salloc ===
If you don't want your shell to be moved to a compute node, but instead want a new shell inside an allocation, do:
<pre>
salloc -p ABGC_Low $SHELL
</pre>
Your shell will stay on the login node, but you can now run:
<pre>
srun <command> &
</pre>
to submit tasks to this allocation!
Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, you need to request a longer time limit when allocating (see the sketch below). See: https://computing.llnl.gov/linux/slurm/salloc.html
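For example, to keep the allocation for four hours instead of the default one hour (the partition name mirrors the example above):
<pre>
salloc -p ABGC_Low --time=4:00:00 $SHELL
</pre>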
== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<pre>
sacct
</pre>
This should provide information similar to the following:
<pre>
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3385 BOV-WUR-58 research 12 COMPLETED 0:0
3385.batch batch 1 COMPLETED 0:0
3386 BOV-WUR-59 research 12 CANCELLED+ 0:0
3386.batch batch 1 CANCELLED 0:15
3528 BOV-WUR-59 ABGC 16 RUNNING 0:0
3529 BOV-WUR-60 ABGC 16 RUNNING 0:0
</pre>
Or in more detail for a specific job:
<pre>
sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</pre>
This should provide information about job id 4220:
<pre>
JobID JobName Comment Partition NTasks AllocCPUS Elapsed State ExitCode
------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
4220 PreProces+ research 3 00:30:52 COMPLETED 0:0
4220.batch batch 1 1 00:30:52 COMPLETED 0:0
</pre>
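By default, sacct only shows jobs that started since midnight; to look further back in time you can pass an explicit start time, for example (the date is illustrative):
<pre>
sacct --starttime=2024-05-01 --format=jobid,jobname,elapsed,state,exitcode
</pre>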
'''Job Status Codes'''
Typically your job will be either in the RUNNING state or the PENDING state. However, here is a breakdown of all the states that your job could be in.
{| class="wikitable"
|-
!Code!!State!!Description
|-
|CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
|-
|CD|| COMPLETED|| Job has terminated all processes on all nodes.
|-
|CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
|-
|CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
|-
|F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
|-
|NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
|-
|PD|| PENDING|| Job is awaiting resource allocation.
|-
|R|| RUNNING|| Job currently has an allocation.
|-
|S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
|-
|TO|| TIMEOUT|| Job terminated upon reaching its time limit.
|-
|}
== Running MPI jobs on Anunna ==
[[MPI_on_B4F_cluster | Main article: MPI on Anunna]]
== See also ==
* [[Tariffs | Costs associated with resource usage]]
* [[B4F_cluster | Anunna]]
* [[BCM_on_B4F_cluster | BCM on Anunna]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]
== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]