|
|
| (38 intermediate revisions by 8 users not shown) |
| Line 1: |
Line 1: |
| The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement. | | The resource allocation and scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement. This page is the entry point — most topics have their own page; below is a short summary plus links. |
|
| |
|
| | == What's on which page == |
|
| |
|
| == Queues and defaults ==
| | * [[Partitions / Queues]] — list of partitions (<code>main</code>, <code>gpu</code>, <code>gpu_amd</code>) and how to choose one. |
| | * [[Choosing a node (constraints)]] — defaults, hardware constraints, GPU selection. |
| | * [[Batch Jobs]] — writing sbatch scripts and submitting them, including multi-job submissions and dependencies. |
| | * [[Interactive Jobs]] — <code>sinteractive</code> and <code>salloc</code> for live shell sessions on a compute node. |
| | * [[Array jobs]] — running the same script many times with a varying parameter. |
| | * [[Monitoring Jobs]] — <code>squeue</code>, <code>scontrol</code>, <code>sstat</code>, <code>sacct</code>, <code>node_usage_graph</code>. |
| | * [[Cancelling Jobs]] — <code>scancel</code>. |
| | * [[Reservations]] — booking nodes in advance for events. |
|
| |
|
| === Queues === | | == Quality of Service == |
| Every organization has three queues (called partitions in Slurm): a high, a standard and a low priority queue.<br>
| |
| The High queue gives jobs the highest priority (20), followed by the Standard queue (10). In the Low priority queue (0)<br>
| |
| jobs will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Low queue jobs.
| |
| To find out which queues your account has been authorized for, type sinfo:
| |
| <source lang='bash'>
| |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
| |
| ABGC_High up infinite 12 down* node[043-048,055-060]
| |
| ABGC_High up infinite 6 mix fat[001-002],node[002-005]
| |
| ABGC_High up infinite 44 idle node[001,006-042,049-054]
| |
| ABGC_Std up infinite 12 down* node[043-048,055-060]
| |
| ABGC_Std up infinite 6 mix fat[001-002],node[002-005]
| |
| ABGC_Std up infinite 44 idle node[001,006-042,049-054]
| |
| ABGC_Low up infinite 12 down* node[043-048,055-060]
| |
| ABGC_Low up infinite 6 mix fat[001-002],node[002-005]
| |
| ABGC_Low up infinite 44 idle node[001,006-042,049-054]
| |
| </source>
| |
|
| |
|
| === Defaults ===
| | When submitting a job, you may optionally assign a different Quality of Service (QoS) to it: |
| There is no default queue, so you need to specify which queue to use when submitting a job.<br>
| |
| '''The default run time for a job is 1 hour!''' <br>
| |
| '''Default memory limit is 1024MB per node!'''
| |
|
| |
|
| == Submitting jobs: sbatch == | | <syntaxhighlight lang="bash"> |
| | #SBATCH --qos=std |
| | </syntaxhighlight> |
|
| |
|
| === Example ===
| | The QoS values configured on Anunna: |
| Consider this simple Python 3 script, which approximates Pi by summing 411 terms of a series with the decimal precision set to 10 million digits:
| |
| <source lang='python'>
| |
| from decimal import *
| |
| D=Decimal
| |
| getcontext().prec=10000000
| |
| p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))
| |
| print(str(p)[:10000002])
| |
| </source>
| |
|
| |
|
| === Loading modules ===
| | * '''std''' (priority 10) — the default. Use this unless you have a specific reason to pick another. |
| For this script to run, Python 3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of software (and of different versions of it) can be checked with the following command:
| | * '''low''' (priority 1) — reduced priority, but limited to 8 hours per job so a flood of low-priority jobs cannot lock up the cluster. |
| module avail
| | * '''high''' (priority 20) — higher priority than <code>std</code>. More expensive — see [[Tariffs]]. |
| | * '''interactive''' (priority 100) — the highest priority, exclusively for immediate-running interactive jobs. You may not submit many or large jobs at this QoS. |
|
| |
|
| The list should show that Python 3 is indeed available; it can then be loaded with the following command:
| | Jobs can in principle be restarted and rescheduled if a higher-priority job needs cluster resources, but at the time of writing this preemption is not actually configured. |
| module load python/3.3.3
| |
|
| |
|
| === Batch script === | | == Running MPI jobs == |
| The following shell/slurm script can then be used to schedule the job using the sbatch command:
| |
| <source lang='bash'>
| |
| #!/bin/bash
| |
| #SBATCH --account=773320000
| |
| #SBATCH --time=1200
| |
| #SBATCH --mem=2048
| |
| #SBATCH --ntasks=1
| |
| #SBATCH --output=output_%j.txt
| |
| #SBATCH --error=error_output_%j.txt
| |
| #SBATCH --job-name=calc_pi.py
| |
| #SBATCH --partition=ABGC_Std
| |
| #SBATCH --mail-type=ALL
| |
| #SBATCH --mail-user=email@org.nl
| |
|
| |
|
| | For multi-node MPI workloads see [[MPI on B4F cluster | MPI on Anunna]]. |
|
| |
|
| time python3 calc_pi.py
| | == See also == |
| </source>
| |
| Explanation of used SBATCH parameters:
| |
| <source lang='bash'>
| |
| #SBATCH --account=773320000
| |
| </source>
| |
| Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users, a project number or KTP number is advisable.
| |
| <source lang='bash'>
| |
| #SBATCH --time=1200
| |
| </source>
| |
| A time limit of zero requests that no time limit be imposed. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". So in this example the job will run for a maximum of 1200 minutes.
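The time formats listed above can be made concrete with a small helper. The following is a minimal sketch in Python and not part of Slurm itself; the function name <code>slurm_time_to_seconds</code> is made up for this illustration, and it assumes only the six formats named above.

```python
def slurm_time_to_seconds(spec):
    """Convert a Slurm time-limit string to a number of seconds.

    Handles "minutes", "minutes:seconds", "hours:minutes:seconds",
    "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds".
    """
    days = 0
    if "-" in spec:
        # With a day count, the remaining fields are hours[:minutes[:seconds]].
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        fields = [int(f) for f in rest.split(":")]
        if len(fields) > 3:
            raise ValueError("too many fields in time limit: " + spec)
        fields += [0] * (3 - len(fields))
        hours, minutes, seconds = fields
    else:
        fields = [int(f) for f in spec.split(":")]
        if len(fields) == 1:        # minutes
            hours, minutes, seconds = 0, fields[0], 0
        elif len(fields) == 2:      # minutes:seconds
            hours, minutes, seconds = 0, fields[0], fields[1]
        elif len(fields) == 3:      # hours:minutes:seconds
            hours, minutes, seconds = fields
        else:
            raise ValueError("too many fields in time limit: " + spec)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# The example job uses --time=1200, i.e. 1200 minutes.
print(slurm_time_to_seconds("1200"))
```

A bare number is read as minutes, so <code>--time=1200</code> corresponds to 20 hours rather than 1200 seconds.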
| |
|
| |
|
| ----
| | * [[Partitions / Queues]] |
| | * [[Choosing a node (constraints)]] |
| | * [[Batch Jobs]] |
| | * [[Interactive Jobs]] |
| | * [[Array jobs]] |
| | * [[Monitoring Jobs]] |
| | * [[Cancelling Jobs]] |
| | * [[Reservations]] |
| | * [[Tariffs | Costs associated with resource usage]] |
|
| |
|
| <source lang='bash'>
| | == External links == |
| #SBATCH --mem=2048
| |
| </source>
| |
| SLURM imposes a memory limit on each job. By default, it is deliberately relatively small: 1024 MB per node. If your job uses more than that, it fails with the error <code>Exceeded job memory limit</code>. To set a larger limit, add to your job submission:
| |
| <source lang='bash'>
| |
| #SBATCH --mem X
| |
| </source>
| |
| | |
| where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:
| |
| <source lang='bash'>
| |
| $ sacct -o MaxRSS -j JOBID
| |
| </source>
| |
| where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with <code>--mem</code> (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with <code>-S YYYY-MM-DD</code>. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with <code>--ntasks-per-node</code>), the same job could have very different values when run at different times.
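The KB-to-MB arithmetic described above can be sketched as follows. This is only an illustration: the helper name <code>suggest_mem_mb</code> and the 20% headroom are assumptions made for this example, not site policy, and the trailing <code>K</code> on the input is assumed to be how the MaxRSS value is printed.

```python
def suggest_mem_mb(max_rss, headroom_percent=20):
    """Turn a MaxRSS value in KB into a suggested --mem value in MB.

    max_rss may be an integer or a string such as "2097152K".
    The headroom (20% by default) is an arbitrary illustrative margin.
    """
    text = str(max_rss).strip()
    if text.upper().endswith("K"):
        text = text[:-1]                       # drop the unit suffix
    kb = int(text)
    mb = -(-kb // 1024)                        # ceiling division: KB -> MB
    # Add the headroom and round up to a whole number of MB.
    return -(-mb * (100 + headroom_percent) // 100)


# A job that peaked at 2097152 KB (2048 MB) would get 20% on top of that.
print("#SBATCH --mem=%d" % suggest_mem_mb("2097152K"))
```

Rounding up at both steps keeps the suggested limit at or above the measured peak.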
| |
| | |
| ----
| |
| | |
| <source lang='bash'>
| |
| #SBATCH --ntasks=1
| |
| </source>
| |
| sbatch does not launch tasks, it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch a maximum of ''number'' tasks and to provide for sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.
| |
| | |
| When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the minimum number of nodes using the <code>-N</code> or <code>--nodes</code> flag. If you provide only one number, it is used as both the minimum and the maximum. For instance:
| |
| <source lang='bash'>
| |
| #SBATCH --nodes=1
| |
| </source>
| |
| This should force your job to be scheduled to a single node.
| |
| | |
| Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done by using the <code>-C</code> or <code>--constraint</code> flag.
| |
| <source lang='bash'>
| |
| #SBATCH --constraint=normalmem
| |
| </source>
| |
| The example above will result in jobs being scheduled to the regular compute nodes. With <code>largemem</code> as the option, the job will be scheduled specifically to one of the fat nodes.
| |
| | |
| <source lang='bash'>
| |
| #SBATCH --output=output_%j.txt
| |
| </source>
| |
| Instruct SLURM to connect the batch script's standard output directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.
| |
| <source lang='bash'>
| |
| #SBATCH --error=error_output_%j.txt
| |
| </source>
| |
| Instruct SLURM to connect the batch script's standard error directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.
| |
| <source lang='bash'>
| |
| #SBATCH --job-name=calc_pi.py
| |
| </source>
| |
| Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just "sbatch" if the script is read on sbatch's standard input.
| |
| <source lang='bash'>
| |
| #SBATCH --partition=ABGC_Std
| |
| </source>
| |
| Request a specific partition for the resource allocation. It is preferred to use your organization's partition.
| |
| <source lang='bash'>
| |
| #SBATCH --mail-type=ALL
| |
| </source>
| |
| Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.
| |
| <source lang='bash'>
| |
| #SBATCH --mail-user=email@org.nl
| |
| </source>
| |
| Email address to use.
| |
| | |
| === Submitting ===
| |
| The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
| |
| <source lang='bash'>
| |
| sbatch run_calc_pi.sh
| |
| </source>
| |
| | |
| === Submitting multiple jobs ===
| |
| Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code:
| |
| <source lang='bash'>for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done
| |
| </source>
| |
| | |
| === Interactive X11/GUI jobs ===
| |
| Slurm will forward your X11 credentials to the first node (or even all nodes) of a job with the (undocumented) --x11 option.
| |
| For example, an interactive session for 1 hour with HPL using eight cores can be started with:
| |
| <source lang='bash'>module load hpl/2.1
| |
| srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl
| |
| </source>
| |
| | |
| == Monitoring submitted jobs ==
| |
| Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.
| |
| | |
| === Generic monitoring of all running jobs ===
| |
| <source lang='bash'>
| |
| squeue
| |
| </source>
| |
| | |
| This lists the jobs that are running on the cluster at that time. For jobs submitted with the 'sbatch' command as described above, the output may look like this:
| |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
| |
| 3396 ABGC BOV-WUR- megen002 R 27:26 1 node004
| |
| 3397 ABGC BOV-WUR- megen002 R 27:26 1 node005
| |
| 3398 ABGC BOV-WUR- megen002 R 27:26 1 node006
| |
| 3399 ABGC BOV-WUR- megen002 R 27:26 1 node007
| |
| 3400 ABGC BOV-WUR- megen002 R 27:26 1 node008
| |
| 3401 ABGC BOV-WUR- megen002 R 27:26 1 node009
| |
| 3385 research BOV-WUR- megen002 R 44:38 1 node049
| |
| 3386 research BOV-WUR- megen002 R 44:38 1 node050
| |
| 3387 research BOV-WUR- megen002 R 44:38 1 node051
| |
| 3388 research BOV-WUR- megen002 R 44:38 1 node052
| |
| 3389 research BOV-WUR- megen002 R 44:38 1 node053
| |
| 3390 research BOV-WUR- megen002 R 44:38 1 node054
| |
| 3391 research BOV-WUR- megen002 R 44:38 3 node[049-051]
| |
| 3392 research BOV-WUR- megen002 R 44:38 3 node[052-054]
| |
| 3393 research BOV-WUR- megen002 R 44:38 1 node001
| |
| 3394 research BOV-WUR- megen002 R 44:38 1 node002
| |
| 3395 research BOV-WUR- megen002 R 44:38 1 node003
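The NODELIST column uses a compact bracket notation such as <code>node[049-051]</code>. As a rough sketch of how that notation can be read, the Python helper below expands it into individual host names. The function name <code>expand_nodelist</code> is hypothetical, and the helper only covers the simple forms that appear in the listings on this page.

```python
import re


def expand_nodelist(nodelist):
    """Expand a compact node list such as 'fat[001-002],node[001,006-007]'."""
    names = []
    # Each group is a name optionally followed by one [...] block;
    # commas inside the brackets do not separate groups.
    for group in re.findall(r"[^,\[]+(?:\[[^\]]*\])?", nodelist):
        match = re.fullmatch(r"([^\[]+)\[([^\]]*)\]", group)
        if not match:
            names.append(group)            # plain host name, e.g. node004
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            if "-" in part:
                first, last = part.split("-")
                width = len(first)         # keep the zero padding
                for n in range(int(first), int(last) + 1):
                    names.append("%s%0*d" % (prefix, width, n))
            else:
                names.append(prefix + part)
    return names


print(expand_nodelist("node[049-051]"))
```

On a system with Slurm installed, the scheduler's own tools are the authoritative way to expand such lists; the sketch is only meant to explain the notation.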
| |
| | |
| === Monitoring time limit set for a specific job ===
| |
| The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a particular job can be shown with the <code>squeue</code> command:
| |
| <source lang='bash'>
| |
| squeue -l -j 3532
| |
| </source>
| |
| Information similar to the following should appear:
| |
| Fri Nov 29 15:41:00 2013
| |
| JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
| |
| 3532 ABGC BOV-WUR- megen002 RUNNING 2:47:03 3-08:00:00 1 node054
| |
| | |
| === Query a specific active job: scontrol ===
| |
| Show all the details of a currently active job, not of a completed one.
| |
| <source lang='bash'>
| |
| nfs01 ~]$ scontrol show jobid 4241
| |
| JobId=4241 Name=WB20F06
| |
| UserId=megen002(16795409) GroupId=domain users(16777729)
| |
| Priority=1 Account=(null) QOS=normal
| |
| JobState=RUNNING Reason=None Dependency=(null)
| |
| Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
| |
| RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
| |
| SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
| |
| StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
| |
| PreemptTime=None SuspendTime=None SecsPreSuspend=0
| |
| Partition=research AllocNode:Sid=nfs01:21799
| |
| ReqNodeList=(null) ExcNodeList=(null)
| |
| NodeList=node023
| |
| BatchHost=node023
| |
| NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
| |
| MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
| |
| Features=(null) Gres=(null) Reservation=(null)
| |
| Shared=OK Contiguous=0 Licenses=(null) Network=(null)
| |
| Command=/lustre/scratch/WUR/ABGC/...
| |
| WorkDir=/lustre/scratch/WUR/ABGC/...
| |
| </source>
| |
| | |
| === Check on a pending job ===
| |
| A submitted job could result in a pending state when there are not enough resources available to this job.
| |
| In this example I submit a job, check its status and, after finding out that it is '''pending''', check when it will probably start.
| |
| <source lang='bash'>
| |
| [@nfs01 jobs]$ sbatch hpl_student.job
| |
| Submitted batch job 740338
| |
| | |
| [@nfs01 jobs]$ squeue -l -j 740338
| |
| Fri Feb 21 15:32:31 2014
| |
| JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
| |
| 740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)
| |
| | |
| [@nfs01 jobs]$ squeue --start -j 740338
| |
| JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
| |
| 740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
| |
| </source>
| |
| So it seems this job will probably start the next day, but that is no guarantee that it actually will.
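To put a number on the expected wait, the START_TIME value can be compared with the current time. The snippet below is a small sketch in Python; the function name <code>time_until_start</code> is made up here, and treating <code>N/A</code> as "no estimate yet" is an assumption about what the scheduler prints when it has no prediction.

```python
from datetime import datetime


def time_until_start(start_time, now):
    """Return the expected wait as a timedelta, or None if no estimate exists.

    start_time is the START_TIME column of 'squeue --start',
    for example '2014-02-22T15:31:48'.
    """
    if start_time.strip().upper() == "N/A":
        return None                        # assumed marker for "unknown"
    return datetime.fromisoformat(start_time) - now


# Values taken from the example session above.
wait = time_until_start("2014-02-22T15:31:48", datetime(2014, 2, 21, 15, 32, 31))
print(wait)
```

The result is an estimate only, since it inherits all the uncertainty of the scheduler's predicted start time.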
| |
| | |
| == Removing jobs from a list: scancel ==
| |
| If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
| |
| <source lang='bash'>
| |
| scancel 3401
| |
| </source>
| |
| | |
| == Allocating resources interactively: salloc ==
| |
| < text here>
| |
| | |
| == Get overview of past and current jobs: sacct ==
| |
| To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
| |
| <source lang='bash'>
| |
| sacct
| |
| </source>
| |
| This should provide information similar to the following:
| |
| | |
| JobID JobName Partition Account AllocCPUS State ExitCode
| |
| ------------ ---------- ---------- ---------- ---------- ---------- --------
| |
| 3385 BOV-WUR-58 research 12 COMPLETED 0:0
| |
| 3385.batch batch 1 COMPLETED 0:0
| |
| 3386 BOV-WUR-59 research 12 CANCELLED+ 0:0
| |
| 3386.batch batch 1 CANCELLED 0:15
| |
| 3528 BOV-WUR-59 ABGC 16 RUNNING 0:0
| |
| 3529 BOV-WUR-60 ABGC 16 RUNNING 0:0
| |
| | |
| Or in more detail for a specific job:
| |
| <source lang='bash'>
| |
| sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
| |
| </source>
| |
| This should provide information about job id 4220:
| |
| | |
| JobID JobName Account Partition NTasks AllocCPUS Elapsed State ExitCode
| |
| ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
| |
| 4220 PreProces+ research 3 00:30:52 COMPLETED 0:0
| |
| 4220.batch batch 1 1 00:30:52 COMPLETED 0:0
| |
| | |
| '''Job Status Codes'''
| |
| | |
| Typically your job will be in either the Running state or the Pending state. However, here is a breakdown of all the states that your job could be in.
| |
| | |
| {| class="wikitable"
| |
| |-
| |
| !Code!!State!!Description
| |
| |-
| |
| |CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
| |
| |-
| |
| |CD|| COMPLETED|| Job has terminated all processes on all nodes.
| |
| |-
| |
| |CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
| |
| |-
| |
| |CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
| |
| |-
| |
| |F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
| |
| |-
| |
| |NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
| |
| |-
| |
| |PD|| PENDING|| Job is awaiting resource allocation.
| |
| |-
| |
| |R|| RUNNING|| Job currently has an allocation.
| |
| |-
| |
| |S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
| |
| |-
| |
| |TO|| TIMEOUT|| Job terminated upon reaching its time limit.
| |
| |-
| |
| |-
| |
| |}
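The two-letter codes in the table are the ones shown in the ST column of <code>squeue</code>. As an illustration, the Python sketch below counts the jobs per state in captured <code>squeue</code> output. The function name <code>count_states</code> is hypothetical, and the parsing assumes that no field (for example the job name) contains spaces.

```python
from collections import Counter

# Codes and state names as listed in the table above.
STATE_NAMES = {
    "CA": "CANCELLED", "CD": "COMPLETED", "CF": "CONFIGURING",
    "CG": "COMPLETING", "F": "FAILED", "NF": "NODE_FAIL",
    "PD": "PENDING", "R": "RUNNING", "S": "SUSPENDED", "TO": "TIMEOUT",
}


def count_states(squeue_output):
    """Count jobs per state in the default (short) squeue listing."""
    lines = squeue_output.strip().splitlines()
    st_column = lines[0].split().index("ST")   # position of the ST column
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > st_column:
            code = fields[st_column]
            # Unknown codes are kept as they are.
            counts[STATE_NAMES.get(code, code)] += 1
    return counts


sample = """JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3396 ABGC BOV-WUR- megen002 R 27:26 1 node004
3397 ABGC BOV-WUR- megen002 R 27:26 1 node005
740338 ABGC_Stud HPLstude bohme999 PD 0:00 1 (ReqNodeNotAvail)"""
print(dict(count_states(sample)))
```

Looking up the column position from the header line keeps the sketch independent of the exact column order.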
| |
|
| |
|
| == Running MPI jobs on B4F cluster ==
| |
|
| |
| [[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]
| |
| < text here >
| |
|
| |
| == Understanding which resources are available to you: sinfo ==
| |
| By using the 'sinfo' command you can retrieve information on which 'Partitions' are available to you. A 'Partition' using SLURM is similar to the 'queue' when submitting using the Sun Grid Engine ('qsub'). The different Partitions grant different levels of resource allocation. Not all defined Partitions will be available to any given person. E.g., Master students will only have the Student Partition available, researchers at the ABGC will have 'student', 'research', and 'ABGC' partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the 'student' partition. A full list of Partitions can be found from the Bright Cluster Manager webpage.
| |
|
| |
| <source lang='bash'>
| |
| sinfo
| |
| </source>
| |
|
| |
| PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
| |
| student* up infinite 12 down* node[043-048,055-060]
| |
| student* up infinite 50 idle fat[001-002],node[001-042,049-054]
| |
| research up infinite 12 down* node[043-048,055-060]
| |
| research up infinite 50 idle fat[001-002],node[001-042,049-054]
| |
| ABGC up infinite 12 down* node[043-048,055-060]
| |
| ABGC up infinite 50 idle fat[001-002],node[001-042,049-054]
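The listing above can also be summarised per partition. The following Python sketch adds up the NODES column for each partition and state. The function name <code>nodes_by_partition</code> is made up for this example, and it assumes the default six-column layout shown above; the trailing <code>*</code> on a partition name is stripped, while the state is kept exactly as printed.

```python
def nodes_by_partition(sinfo_output):
    """Sum the NODES column of sinfo output per partition and state."""
    summary = {}
    for line in sinfo_output.strip().splitlines()[1:]:   # skip the header
        fields = line.split()
        if len(fields) < 6:
            continue                                     # not a data line
        partition, _avail, _limit, nodes, state = fields[:5]
        partition = partition.rstrip("*")                # strip the '*' marker
        states = summary.setdefault(partition, {})
        states[state] = states.get(state, 0) + int(nodes)
    return summary


sample = """PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
student* up infinite 12 down* node[043-048,055-060]
student* up infinite 50 idle fat[001-002],node[001-042,049-054]
research up infinite 12 down* node[043-048,055-060]
research up infinite 50 idle fat[001-002],node[001-042,049-054]"""
print(nodes_by_partition(sample)["student"])
```

Because the same nodes appear under several partitions, the totals should be read per partition and not added across partitions.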
| |
|
| |
| == See also ==
| |
| * [[B4F_cluster | B4F Cluster]]
| |
| * [[BCM_on_B4F_cluster | BCM on B4F cluster]]
| |
| * [[SLURM_Compare | SLURM compared to other common schedulers]]
| |
| * [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]
| |
|
| |
| == External links ==
| |
| * [http://slurm.schedmd.com Slurm official documentation] | | * [http://slurm.schedmd.com Slurm official documentation] |
| * [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia] | | * [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia] |
| * [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]
| |