Using Slurm
Revision as of 14:05, 19 April 2018
The resource allocation / scheduling software on the B4F Cluster is SLURM: Simple Linux Utility for Resource Management.
Queues and defaults
Queues
Every organization has three queues (called partitions in SLURM): a high, a standard and a low priority queue.
The High queue gives jobs the highest priority (20), followed by the Standard queue (10) and the Low queue (0).
A job in the Low queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by the Low queue job.
To find out which queues your account has been authorized for, type sinfo:
<source lang='bash'>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
ABGC_High up infinite 12 down* node[043-048,055-060]
ABGC_High up infinite 6 mix fat[001-002],node[002-005]
ABGC_High up infinite 44 idle node[001,006-042,049-054]
ABGC_Std up infinite 12 down* node[043-048,055-060]
ABGC_Std up infinite 6 mix fat[001-002],node[002-005]
ABGC_Std up infinite 44 idle node[001,006-042,049-054]
ABGC_Low up infinite 12 down* node[043-048,055-060]
ABGC_Low up infinite 6 mix fat[001-002],node[002-005]
ABGC_Low up infinite 44 idle node[001,006-042,049-054]
</source>
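Because sinfo prints one line per partition and node state, each partition name is repeated several times. The following is a minimal sketch for condensing that listing into unique partition names; the helper name `list_partitions` is hypothetical, and it assumes the default column layout shown above, with the partition name in the first column.

```shell
# Print the unique partition names found in `sinfo` output read from stdin.
# Assumes the default sinfo layout: one header line, partition in column 1.
# A trailing '*' (sinfo's marker for a default partition) is stripped.
list_partitions() {
    awk 'NR > 1 { sub(/\*$/, "", $1); print $1 }' | sort -u
}

# Usage on the cluster:
#   sinfo | list_partitions
```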
Defaults
There is no default queue, so you need to specify which queue to use when submitting a job.
The default run time for a job is 1 hour!
Default memory limit is 100MB per node!
Submitting jobs: sbatch
Example
Consider this simple python3 script that should calculate Pi to 1 million digits: <source lang='python'>
from decimal import *
D = Decimal
getcontext().prec = 10000000
p = sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6)) for k in range(411))
print(str(p)[:10000002])
</source>
Loading modules
In order for this script to run, Python3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:
module avail
The list should show that python3 is indeed available; it can then be loaded with the following command:
module load python/3.3.3
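Before submitting a job it can be worth confirming that the interpreter really is on your PATH after loading the module. This is only a sketch; the helper name `have_cmd` is hypothetical and not part of the module system.

```shell
# Return success if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd python3; then
    python3 --version
else
    echo "python3 not found; try 'module load python/3.3.3' first" >&2
fi
```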
Batch script
Main Article: Creating a sbatch script
The following shell/slurm script can then be used to schedule the job using the sbatch command: <source lang='bash'>
#!/bin/bash
#SBATCH --account=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --partition=ABGC_Std
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl
time python3 calc_pi.py
</source>
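A bare number given to --time, such as the 1200 in the script above, is interpreted by SLURM as minutes. Tools such as squeue report limits in days-hours:minutes:seconds form instead. The sketch below converts between the two so the values can be compared; the function name `minutes_to_slurm` is hypothetical.

```shell
# Convert a number of minutes into SLURM's D-HH:MM:SS notation.
minutes_to_slurm() {
    printf '%d-%02d:%02d:00\n' \
        "$(( $1 / 1440 ))" "$(( ($1 % 1440) / 60 ))" "$(( $1 % 60 ))"
}

minutes_to_slurm 1200   # the limit requested in the batch script above
```

For 1200 minutes this yields a limit of 20 hours.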
Submitting
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command: <source lang='bash'> sbatch run_calc_pi.sh </source>
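On success sbatch prints a line of the form "Submitted batch job <jobid>". When the job id is needed later, for instance for squeue or scancel, it can be captured from that line. This is a minimal sketch with a hypothetical helper name, `parse_jobid`; sbatch also has a --parsable option that prints the id in a script-friendly form.

```shell
# Extract the job id from sbatch's "Submitted batch job <id>" message (stdin).
parse_jobid() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Usage on the cluster:
#   jobid=$(sbatch run_calc_pi.sh | parse_jobid)
#   squeue -j "$jobid"
```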
Submitting multiple jobs (simple)
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code: <source lang='bash'>for i in $(seq 1 10); do echo $i; sbatch runscript_$i.sh; done </source>
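A slightly more defensive variant of the same loop is sketched below. It skips scripts that do not exist and lets the submit command be swapped out for a dry run. The function name `submit_all` and the variable `SUBMIT_CMD` are hypothetical and not part of SLURM.

```shell
# Submit every script given as an argument, skipping files that are missing.
# SUBMIT_CMD defaults to sbatch; set it to 'echo' for a dry run.
submit_all() {
    for script in "$@"; do
        if [ -f "$script" ]; then
            "${SUBMIT_CMD:-sbatch}" "$script"
        else
            echo "skipping missing script: $script" >&2
        fi
    done
}

# Usage on the cluster:
#   submit_all runscript_*.sh
#   SUBMIT_CMD=echo submit_all runscript_*.sh   # dry run
```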
Submitting array jobs
<source lang='bash'>
#SBATCH --array=0-10%4
</source> SLURM allows you to submit multiple jobs using the same template. Further information about this can be found here.
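The directive --array=0-10%4 creates 11 tasks with indices 0 through 10, of which at most 4 run at the same time. Inside each task SLURM sets the environment variable SLURM_ARRAY_TASK_ID, which the script can use to pick its own input. The sketch below assumes a hypothetical naming scheme (input_0.txt, input_1.txt, ...) and a hypothetical helper `input_for_task`.

```shell
#!/bin/bash
#SBATCH --array=0-10%4

# Map an array index to an input file name (hypothetical naming scheme).
input_for_task() {
    printf 'input_%s.txt\n' "$1"
}

# SLURM_ARRAY_TASK_ID is set by SLURM inside each array task; fall back to 0
# so the script can also be tried outside the scheduler.
task_id=${SLURM_ARRAY_TASK_ID:-0}
echo "task ${task_id} would process $(input_for_task "$task_id")"
```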
Using /tmp
Each node has a local disk of ~300G attached that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after usage.
To make sure that space in /tmp is available to your job, you can add <source lang='bash'>
#SBATCH --tmp=<required size>
</source> to your sbatch script. This prevents your job from being run on nodes that do not have enough free space in /tmp, or where that space is already claimed by another job running at the same time.
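One way to make the cleanup automatic is to give each job its own directory under /tmp and remove it when the script exits. This is a sketch rather than a site policy; the directory naming is an assumption, and SLURM_JOB_ID is only set when the script runs as a job.

```shell
# Create a job-private scratch directory under /tmp.
# The name pattern (user, job id, random suffix) is an illustrative choice.
scratch=$(mktemp -d "/tmp/${USER:-user}_${SLURM_JOB_ID:-nojob}_XXXXXX")

# Remove the directory when the script exits, whether it succeeds or fails.
trap 'rm -rf "$scratch"' EXIT

# Stage data and work inside the scratch directory.
echo "staged data" > "$scratch/input.txt"
ls "$scratch"
```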
Monitoring submitted jobs
Once a job is submitted, the status can be monitored using the squeue command. The squeue command has a number of parameters for monitoring specific properties of the jobs such as time limit.
Generic monitoring of all running jobs
<source lang='bash'>
squeue
</source>
You should then get a list of the jobs that are running on the cluster at that time. For jobs submitted using the 'sbatch' command as in the example above, it may look like so:
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
 3396      ABGC BOV-WUR- megen002  R 27:26     1 node004
 3397      ABGC BOV-WUR- megen002  R 27:26     1 node005
 3398      ABGC BOV-WUR- megen002  R 27:26     1 node006
 3399      ABGC BOV-WUR- megen002  R 27:26     1 node007
 3400      ABGC BOV-WUR- megen002  R 27:26     1 node008
 3401      ABGC BOV-WUR- megen002  R 27:26     1 node009
 3385  research BOV-WUR- megen002  R 44:38     1 node049
 3386  research BOV-WUR- megen002  R 44:38     1 node050
 3387  research BOV-WUR- megen002  R 44:38     1 node051
 3388  research BOV-WUR- megen002  R 44:38     1 node052
 3389  research BOV-WUR- megen002  R 44:38     1 node053
 3390  research BOV-WUR- megen002  R 44:38     1 node054
 3391  research BOV-WUR- megen002  R 44:38     3 node[049-051]
 3392  research BOV-WUR- megen002  R 44:38     3 node[052-054]
 3393  research BOV-WUR- megen002  R 44:38     1 node001
 3394  research BOV-WUR- megen002  R 44:38     1 node002
 3395  research BOV-WUR- megen002  R 44:38     1 node003
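With many jobs in the list, a quick summary per state can be easier to read than the full table. The sketch below counts jobs by the ST column; the helper name `count_states` is hypothetical, and it assumes the default squeue layout shown above, with the state in column 5 and no spaces in job names.

```shell
# Count jobs per state code from default `squeue` output read on stdin.
# Assumes one header line and the state code (ST) in the fifth column.
count_states() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage on the cluster:
#   squeue | count_states
```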
Monitoring time limit set for a specific job
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit that is set for a particular job can be checked using the squeue command.
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
Fri Nov 29 15:41:00 2013
JOBID PARTITION     NAME     USER    STATE    TIME  TIMELIMIT NODES NODELIST(REASON)
 3532      ABGC BOV-WUR- megen002  RUNNING 2:47:03 3-08:00:00     1 node054
Query a specific active job: scontrol
The scontrol command shows all the details of a currently active job, so not of a completed job. <source lang='bash'>
nfs01 ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
   UserId=megen002(16795409) GroupId=domain users(16777729)
   Priority=1 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=research AllocNode:Sid=nfs01:21799
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node023
   BatchHost=node023
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/lustre/scratch/WUR/ABGC/...
   WorkDir=/lustre/scratch/WUR/ABGC/...
</source>
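The scontrol output is a series of Key=Value pairs, so a single field can be picked out with standard text tools. This is a minimal sketch with a hypothetical helper name, `job_field`. It splits on spaces, so it will not handle values that themselves contain spaces (such as the GroupId above).

```shell
# Print the value of one Key=Value field from `scontrol show job` output (stdin).
# Splits on whitespace, so values containing spaces are not supported.
job_field() {
    tr -s ' ' '\n' | awk -F= -v key="$1" \
        '$1 == key { print substr($0, length(key) + 2) }'
}

# Usage on the cluster:
#   scontrol show jobid 4241 | job_field TimeLimit
```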
Check on a pending job
A submitted job can end up in a pending state when there are not enough resources available for it. In this example I submit a job, check its status and, after finding out that it is pending, check when it will probably start. <source lang='bash'>
[@nfs01 jobs]$ sbatch hpl_student.job
Submitted batch job 740338
[@nfs01 jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
 JOBID PARTITION     NAME     USER    STATE TIME  TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999  PENDING 0:00 1-00:00:00     1 (ReqNodeNotAvail)
[@nfs01 jobs]$ squeue --start -j 740338
 JOBID PARTITION     NAME     USER ST          START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48     1 (ReqNodeNotAvail)
</source> So it seems this job will probably start the next day, but there is no guarantee that it actually will.
Removing jobs from a list: scancel
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code: <source lang='bash'> scancel 3401 </source>
Allocating resources interactively: salloc
It's possible to set up an interactive session using salloc. Run salloc as follows: <source lang='bash'> salloc -p <partition, say, ABGC_Low> </source> And because of the magic of SallocDefaultCommand, you will immediately be transported to a new prompt.
Here, run 'hostname' to see which node your shell has been transported to.
If you don't want your shell to be transported but want a new remote shell, do: <source lang='bash'> salloc -p ABGC_Low $SHELL </source> Now your shell will stay on nfs01, but you can do: <source lang='bash'> srun <command> & </source> to submit tasks to this new shell.
Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, you need to change the time limit, for example with salloc's --time option. See: https://computing.llnl.gov/linux/slurm/salloc.html
Get overview of past and current jobs: sacct
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do: <source lang='bash'> sacct </source> This should provide information similar to the following:
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3385         BOV-WUR-58   research                    12  COMPLETED      0:0
3385.batch        batch                                1  COMPLETED      0:0
3386         BOV-WUR-59   research                    12 CANCELLED+      0:0
3386.batch        batch                                1  CANCELLED     0:15
3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0
3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0
Or in more detail for a specific job: <source lang='bash'> sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220 </source> This should provide information about job id 4220:
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode
------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0
4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0
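To spot jobs that did not run to completion, the sacct output can be filtered. The sketch below assumes sacct is called with its --parsable2 and --noheader options and the three fields jobid, state and exitcode, which gives pipe-separated lines that are easy to process. The helper name `not_completed` is hypothetical.

```shell
# Read lines of the form JobID|State|ExitCode on stdin and print every job
# whose state is neither COMPLETED nor RUNNING.
not_completed() {
    awk -F'|' '$2 != "COMPLETED" && $2 != "RUNNING" { print $1, $2, $3 }'
}

# Usage on the cluster:
#   sacct --noheader --parsable2 --format=jobid,state,exitcode | not_completed
```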
Job Status Codes
Typically your job will be in either the Running state or the Pending state. However, here is a breakdown of all the states that your job could be in.
Code | State | Description |
---|---|---|
CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
CD | COMPLETED | Job has terminated all processes on all nodes. |
CF | CONFIGURING | Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting). |
CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
F | FAILED | Job terminated with non-zero exit code or other failure condition. |
NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
PD | PENDING | Job is awaiting resource allocation. |
R | RUNNING | Job currently has an allocation. |
S | SUSPENDED | Job has an allocation, but execution has been suspended. |
TO | TIMEOUT | Job terminated upon reaching its time limit. |
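When scripting around squeue output, the short codes in the table above can be translated back to their full state names. A minimal sketch covering exactly the codes listed in the table; the function name `state_name` is hypothetical.

```shell
# Translate a short job state code into its full name.
# Returns a non-zero status for codes that are not in the table above.
state_name() {
    case "$1" in
        CA) echo CANCELLED ;;
        CD) echo COMPLETED ;;
        CF) echo CONFIGURING ;;
        CG) echo COMPLETING ;;
        F)  echo FAILED ;;
        NF) echo NODE_FAIL ;;
        PD) echo PENDING ;;
        R)  echo RUNNING ;;
        S)  echo SUSPENDED ;;
        TO) echo TIMEOUT ;;
        *)  echo "UNKNOWN($1)"; return 1 ;;
    esac
}

state_name PD
```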
Running MPI jobs on B4F cluster
Main article: MPI on B4F Cluster
Understanding which resources are available to you: sinfo
By using the 'sinfo' command you can retrieve information on which 'Partitions' are available to you. A 'Partition' in SLURM is similar to a 'queue' when submitting with the Sun Grid Engine ('qsub'). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. For example, Master students will only have the Student Partition available, while researchers at the ABGC will have the 'student', 'research', and 'ABGC' partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the 'student' partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.
<source lang='bash'> sinfo </source>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
student* up infinite 12 down* node[043-048,055-060]
student* up infinite 50 idle fat[001-002],node[001-042,049-054]
research up infinite 12 down* node[043-048,055-060]
research up infinite 50 idle fat[001-002],node[001-042,049-054]
ABGC up infinite 12 down* node[043-048,055-060]
ABGC up infinite 50 idle fat[001-002],node[001-042,049-054]
See also
- B4F Cluster
- BCM on B4F cluster
- SLURM compared to other common schedulers
- Setting up and using a virtual environment for Python3