<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.anunna.wur.nl/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Bohme001</id>
	<title>HPCwiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.anunna.wur.nl/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Bohme001"/>
	<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php/Special:Contributions/Bohme001"/>
	<updated>2026-04-18T01:52:57Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.43.1</generator>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=User:Bohme001&amp;diff=1348</id>
		<title>User:Bohme001</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=User:Bohme001&amp;diff=1348"/>
		<updated>2014-08-28T12:11:01Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Former Linux system administrator at FB-IT Infrastructure&amp;lt;br&amp;gt;&lt;br /&gt;
Our team maintains more than 200 Linux servers, running mainly on Red Hat Enterprise Linux with some special cases on Ubuntu Server.&amp;lt;br&amp;gt;&lt;br /&gt;
All HPC AgroGenomics hosts are running Scientific Linux version 6.&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1347</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1347"/>
		<updated>2014-08-12T15:03:29Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has three queues (called partitions in Slurm): a high, a standard and a low priority queue.&amp;lt;br&amp;gt;&lt;br /&gt;
The high queue gives jobs the highest priority (20), followed by the standard queue (10). In the low priority queue (0),&amp;lt;br&amp;gt;&lt;br /&gt;
jobs will be resubmitted if a job with higher priority needs cluster resources that are occupied by low queue jobs.&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_High      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_High      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_High      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Std       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Std       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Std       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Low       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Low       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Low       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 100MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script, which prints 10 million decimal places of an approximation of Pi (computed with the Bailey-Borwein-Plouffe series):&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
Before this script can run, Python 3 (which is not the default Python version on the cluster) needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that Python 3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC_Std&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
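If needed, the account of a job that has already been submitted can be changed with scontrol; a sketch (the job id 12345 is hypothetical):&lt;br /&gt;

```bash
# Hypothetical job id; charge an already-submitted job to another account
scontrol update jobid=12345 account=773320000
```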
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes (20 hours).&lt;br /&gt;
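As a sanity check, the 1200 minutes from the example can be rewritten in the other accepted formats with plain shell arithmetic (nothing SLURM-specific is assumed here):&lt;br /&gt;

```bash
# Convert a limit given in minutes into two of the other formats sbatch accepts
minutes=1200
printf '%d:%02d:00\n' $((minutes / 60)) $((minutes % 60))      # hours:minutes:seconds -> 20:00:00
printf '%d-%d\n' $((minutes / 1440)) $((minutes % 1440 / 60))  # days-hours -> 0-20
```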
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 100 MB per node. If your job uses more than that, it will fail with the error &amp;quot;Exceeded job memory limit&amp;quot;. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could have very different values when run at different times.&lt;br /&gt;
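For example, converting a hypothetical MaxRSS of 1536000 KB into a --mem value with roughly 10% headroom:&lt;br /&gt;

```bash
# MaxRSS is reported in KB; divide by 1024 for MB, then add some headroom
maxrss_kb=1536000               # hypothetical value reported by sacct
mem_mb=$((maxrss_kb / 1024))
echo "$mem_mb"                  # 1500 MB actually used
echo $((mem_mb + mem_mb / 10))  # request about 1650 MB with --mem
```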
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it acts as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
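For instance, a sketch of requesting eight tasks spread evenly over exactly two nodes (four per node):&lt;br /&gt;

```bash
#SBATCH --ntasks=8
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # force an even distribution over the two nodes
```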
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; instead, the job will be scheduled specifically to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=ABGC_Std&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in $(seq 1 10); do echo $i; sbatch runscript_$i.sh; done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) of a job&#039;s nodes with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that time. For the example submitted with the &#039;sbatch&#039; command above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
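squeue can also filter the listing; for instance, to show only your own jobs, or only your jobs that are still pending (these commands of course require access to the cluster):&lt;br /&gt;

```bash
squeue -u $USER              # only your own jobs
squeue -u $USER -t PENDING   # only your jobs still waiting for resources
```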
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a specific job can be checked with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
=== Query a specific active job: scontrol ===&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Check on a pending job ===&lt;br /&gt;
A submitted job can end up in a pending state when there are not enough resources available for it.&lt;br /&gt;
In this example I submit a job, check its status, and after finding out it is &#039;&#039;&#039;pending&#039;&#039;&#039;, check when it will probably start.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
[@nfs01 jobs]$ sbatch hpl_student.job&lt;br /&gt;
 Submitted batch job 740338&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue -l -j 740338&lt;br /&gt;
 Fri Feb 21 15:32:31 2014&lt;br /&gt;
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PENDING       0:00 1-00:00:00      1 (ReqNodeNotAvail)&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue --start -j 740338&lt;br /&gt;
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PD  2014-02-22T15:31:48      1 (ReqNodeNotAvail)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
So it seems this job will probably start the next day, but that is no guarantee that it will.&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
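scancel can also match jobs by attributes instead of a single job id; for instance (user name hypothetical):&lt;br /&gt;

```bash
# Cancel all pending jobs of one user in a single command
scancel --user=bohme999 --state=PENDING
```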
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
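A minimal sketch of an interactive allocation with salloc (partition name taken from the sbatch example above):&lt;br /&gt;

```bash
# Request 4 tasks for 30 minutes; salloc starts a shell inside the allocation
salloc --ntasks=4 --time=30 --partition=ABGC_Std
srun hostname   # job steps started in that shell run on the allocated node(s)
exit            # leaving the shell releases the allocation
```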
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be either in the RUNNING or the PENDING state. However, here is a breakdown of all the states your job can be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation. Not all defined Partitions will be available to any given person. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1345</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1345"/>
		<updated>2014-06-26T13:21:27Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Compute] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs in the field of genetics and genomics research by creating a joint facility that will generate benefits of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows retaining vital and often confidential data sources in a controlled environment, something that cloud services such as Amazon Cloud or others usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state of the art Parallel File System (PFS), head nodes, compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes with 16 cores and 64GB RAM each, and 2 fat nodes with 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and expanding the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of many separate machines, each with its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which is currently at version 6. SL therefore follows the versioning scheme of RHEL. &lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration, which means that if one crashes, the other takes over seamlessly. Various other nodes exist to support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker or compute nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, which each have 16 cores and 64GB of RAM (named &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, which each have 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on June 26, 2014:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname		State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
  node001..node002	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
  node049..node054	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
  master1 master2	UP	67.5 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1	custom&lt;br /&gt;
  mds01, mds02		UP	16.8 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		5	1	mds&lt;br /&gt;
  storage01..storage06	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		5	1	oss&lt;br /&gt;
  nfs01			UP	67.6 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		7	1	login&lt;br /&gt;
  fat001 fat002		UP	1.0 TiB		64	AMD Opteron(tm) Processor 6376	2300 MHz		5	1	fat&lt;br /&gt;
  &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration, which also will share some applications and databases with machines in the cluster, for which the parallel file system is not the ideal solution.&lt;br /&gt;
* NFS server: a Dell PowerEdge R720XD. This node also acts as the login node, where users log in, compile applications, and submit jobs; it exports each home directory via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled on the compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, considered very stable, and open source; it is nowadays the default PFS option in Dell clusters as well as clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: total I/O is designed to reach up to 15GB/s. With a very large number of compute nodes, and very high volumes of data, the high read/write speeds that the PFS provides are necessary. The Lustre file system is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in each user&#039;s $HOME directory.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user has their own home directory. The path of the home directory is: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speed, latency, etc.), so it is not meant to store large data volumes that require high throughput or low latency. Compared to the Lustre PFS (600TB), the NFS is small: only 20TB. The /home partition is backed up daily. The amount of space per user is limited to a 200GB soft and 210GB hard quota; personal quota and current use can be checked with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
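The numbers reported by quota can be checked against the limits with a small script. A minimal sketch, using invented sample output shaped like that of `quota -s` (the filesystem label and figures are assumptions, not real cluster output):&lt;br /&gt;

```shell
# Sample output in the shape produced by `quota -s` (values invented;
# the real limits on /home are 200G soft / 210G hard)
printf '%s\n' \
  'Disk quotas for user jdoe (uid 1234):' \
  '     Filesystem   space   quota   limit   grace   files   quota   limit   grace' \
  '  nfs01:/home      150G    200G    210G              12k       0       0' \
  > quota_sample.txt

# Report how much of the soft quota is used; awk evaluates "150G" + 0
# as 150, which conveniently strips the unit suffix
awk '$1 == "nfs01:/home" {
    printf "%.0f%% of soft quota used\n", 100 * ($2 + 0) / ($3 + 0)
}' quota_sample.txt
# prints: 75% of soft quota used
```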
&lt;br /&gt;
The NFS is supported through the NFS server (nfs01) that also serves as access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and most importantly, the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-IT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it once served as a potato storage facility), but inside is a modern server centre that includes, among other things, emergency backup power systems and automated fire extinguishers. Many of the server facilities provided by FB-IT that are used daily by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR,FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR,FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
The Steering Group ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication. &lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT_AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It was involved in the design of the HPC and the selection of the supplier. It supports the technical management of the HPC and shares experiences to ensure that the HPC meets the needs of its users. The IT Workgroup advises the Steering Group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it comprises the users for whom the infrastructure was built. In addition, successful use of the cluster relies on an active community of users willing to share knowledge and best practices, including maintaining and expanding this wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
Access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted actively (for non-WUR accounts, by creation of an account on the cluster by FB-IT). Use of resources is limited by the scheduler: priority to the system&#039;s resources is regulated by the queues (&#039;partitions&#039;) granted to a user. &lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request to access the cluster needs to be directed to one of the following persons (please refer to appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are done by [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
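Frequent users can save typing by adding a host entry to their SSH client configuration. A sketch of a ~/.ssh/config entry; the hostname and username below are placeholders, not actual cluster addresses - use the details provided when your account is created:&lt;br /&gt;

```
# Entry for ~/.ssh/config -- both values below are placeholders
Host b4f
    HostName LOGIN-NODE-ADDRESS-HERE
    User YOUR-USERNAME-HERE
```

With such an entry in place, `ssh b4f` opens a session and `scp data.txt b4f:` copies a file to your home directory.&lt;br /&gt;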
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
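Jobs are handed to Slurm as batch scripts whose `#SBATCH` comment lines declare the requested resources. A minimal sketch of such a script; the resource values are illustrative and should be adapted to the partitions and limits that apply to your account:&lt;br /&gt;

```shell
# Write a minimal Slurm job script (resource values are examples only)
printf '%s\n' \
  '#!/bin/bash' \
  '#SBATCH --job-name=example' \
  '#SBATCH --ntasks=1' \
  '#SBATCH --cpus-per-task=4' \
  '#SBATCH --mem=8G' \
  '#SBATCH --time=01:00:00' \
  '#SBATCH --output=example_%j.out' \
  '' \
  'echo "running on $(hostname) with $SLURM_CPUS_PER_TASK cores"' \
  > myjob.sh

# On the cluster, submit with:   sbatch myjob.sh
# and monitor with:              squeue -u $USER
```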
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
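Several of the pages above involve redirecting where software writes its temporary files. A minimal sketch of how the TMPDIR variable behaves; the directory used here is just a local example, and on the cluster you would point it at a suitable scratch location:&lt;br /&gt;

```shell
# Create a private temporary directory and point TMPDIR at it
mkdir -p "$PWD/mytmp"
export TMPDIR="$PWD/mytmp"

# Programs that honour TMPDIR, such as mktemp and sort, now use it
tmpfile=$(mktemp)
echo "$tmpfile"   # the path now starts with .../mytmp
```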
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | using SACCT to see your costs]]&lt;br /&gt;
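The cost of a job is tied to the resources it held. A sketch of turning accounting records into core-hours; the pipe-delimited sample below is invented for illustration, in the shape of `sacct -X -P --format=JobID,JobName,AllocCPUS,ElapsedRaw` output:&lt;br /&gt;

```shell
# Invented sample shaped like pipe-delimited sacct accounting output
printf '%s\n' \
  'JobID|JobName|AllocCPUS|ElapsedRaw' \
  '1001|align|16|3600' \
  '1002|sort|4|1800' \
  > sacct_sample.txt

# Core-hours = sum over jobs of (allocated CPUs x elapsed seconds) / 3600
awk -F'|' 'NR > 1 { s += $3 * $4 }
           END    { printf "total core-hours: %.1f\n", s / 3600 }' sacct_sample.txt
# prints: total core-hours: 18.0
```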
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/Our-facilities/Show/High-Performance-Computing-Cluster-HPC.htm CATAgroFood offers an HPC facility]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1344</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1344"/>
		<updated>2014-06-26T13:14:37Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Compute] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates economies of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows retaining vital and often confidential data sources in a controlled environment, something that cloud services such as Amazon Cloud usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes with 16 cores and 64GB RAM each, and 2 fat nodes with 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and enlarging the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of many separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which is currently at version 6; SL follows the versioning scheme of RHEL. &lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration: if one crashes, the other takes over seamlessly. Various other nodes support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker (compute) nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, each with 16 cores and 64GB of RAM (named &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, each with 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on June 26, 2014:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname		State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
  node001..node002	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
  node049..node054	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
  master1 master2	UP	67.5 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1	custom&lt;br /&gt;
  mds01, mds02		UP	16.8 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		5	1	mds&lt;br /&gt;
  storage01..storage06	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		5	1	oss&lt;br /&gt;
  nfs01			UP	67.6 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		7	1	login&lt;br /&gt;
  fat001 fat002		UP	1.0 TiB		64	AMD Opteron(tm) Processor 6376	2300 MHz		5	1	fat&lt;br /&gt;
  &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 Dell PowerEdge R720 servers in a failover configuration. They also serve some applications and databases to machines in the cluster for which the parallel file system is not the ideal solution.&lt;br /&gt;
* NFS server: a Dell PowerEdge R720XD. This node also acts as the login node, where users log in, compile applications, and submit jobs; it exports each home directory via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled on the compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, considered very stable, and open source; it is nowadays the default PFS option in Dell clusters as well as clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: total I/O is designed to reach up to 15GB/s. With a very large number of compute nodes, and very high volumes of data, the high read/write speeds that the PFS provides are necessary. The Lustre file system is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in each user&#039;s $HOME directory.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user has their own home directory. The path of the home directory is: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speed, latency, etc.), so it is not meant to store large data volumes that require high throughput or low latency. Compared to the Lustre PFS (600TB), the NFS is small: only 20TB. The /home partition is backed up daily. The amount of space per user is limited to a 200GB soft and 210GB hard quota; personal quota and current use can be checked with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
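The numbers reported by quota can be checked against the limits with a small script. A minimal sketch, using invented sample output shaped like that of `quota -s` (the filesystem label and figures are assumptions, not real cluster output):&lt;br /&gt;

```shell
# Sample output in the shape produced by `quota -s` (values invented;
# the real limits on /home are 200G soft / 210G hard)
printf '%s\n' \
  'Disk quotas for user jdoe (uid 1234):' \
  '     Filesystem   space   quota   limit   grace   files   quota   limit   grace' \
  '  nfs01:/home      150G    200G    210G              12k       0       0' \
  > quota_sample.txt

# Report how much of the soft quota is used; awk evaluates "150G" + 0
# as 150, which conveniently strips the unit suffix
awk '$1 == "nfs01:/home" {
    printf "%.0f%% of soft quota used\n", 100 * ($2 + 0) / ($3 + 0)
}' quota_sample.txt
# prints: 75% of soft quota used
```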
&lt;br /&gt;
The NFS is supported through the NFS server (nfs01) that also serves as access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and most importantly, the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it once served as a potato storage facility), but inside is a modern server centre that includes, among other things, emergency backup power systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used daily by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR,FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR,FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
The Steering Group ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication. &lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT_AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It was involved in the design of the HPC and the selection of the supplier. It supports the technical management of the HPC and shares experiences to ensure that the HPC meets the needs of its users. The IT Workgroup advises the Steering Group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it comprises the users for whom the infrastructure was built. In addition, successful use of the cluster relies on an active community of users willing to share knowledge and best practices, including maintaining and expanding this wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
Access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted actively (by creation of an account on the cluster by FB-ICT). Use of resources is limited by the scheduler: priority to the system&#039;s resources is regulated by the queues (&#039;partitions&#039;) granted to a user. &lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request to access the cluster needs to be directed to one of the following persons (please refer to appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are done by [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
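Frequent users can save typing by adding a host entry to their SSH client configuration. A sketch of a ~/.ssh/config entry; the hostname and username below are placeholders, not actual cluster addresses - use the details provided when your account is created:&lt;br /&gt;

```
# Entry for ~/.ssh/config -- both values below are placeholders
Host b4f
    HostName LOGIN-NODE-ADDRESS-HERE
    User YOUR-USERNAME-HERE
```

With such an entry in place, `ssh b4f` opens a session and `scp data.txt b4f:` copies a file to your home directory.&lt;br /&gt;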
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
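Jobs are handed to Slurm as batch scripts whose `#SBATCH` comment lines declare the requested resources. A minimal sketch of such a script; the resource values are illustrative and should be adapted to the partitions and limits that apply to your account:&lt;br /&gt;

```shell
# Write a minimal Slurm job script (resource values are examples only)
printf '%s\n' \
  '#!/bin/bash' \
  '#SBATCH --job-name=example' \
  '#SBATCH --ntasks=1' \
  '#SBATCH --cpus-per-task=4' \
  '#SBATCH --mem=8G' \
  '#SBATCH --time=01:00:00' \
  '#SBATCH --output=example_%j.out' \
  '' \
  'echo "running on $(hostname) with $SLURM_CPUS_PER_TASK cores"' \
  > myjob.sh

# On the cluster, submit with:   sbatch myjob.sh
# and monitor with:              squeue -u $USER
```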
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
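Several of the pages above involve redirecting where software writes its temporary files. A minimal sketch of how the TMPDIR variable behaves; the directory used here is just a local example, and on the cluster you would point it at a suitable scratch location:&lt;br /&gt;

```shell
# Create a private temporary directory and point TMPDIR at it
mkdir -p "$PWD/mytmp"
export TMPDIR="$PWD/mytmp"

# Programs that honour TMPDIR, such as mktemp and sort, now use it
tmpfile=$(mktemp)
echo "$tmpfile"   # the path now starts with .../mytmp
```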
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | using SACCT to see your costs]]&lt;br /&gt;
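The cost of a job is tied to the resources it held. A sketch of turning accounting records into core-hours; the pipe-delimited sample below is invented for illustration, in the shape of `sacct -X -P --format=JobID,JobName,AllocCPUS,ElapsedRaw` output:&lt;br /&gt;

```shell
# Invented sample shaped like pipe-delimited sacct accounting output
printf '%s\n' \
  'JobID|JobName|AllocCPUS|ElapsedRaw' \
  '1001|align|16|3600' \
  '1002|sort|4|1800' \
  > sacct_sample.txt

# Core-hours = sum over jobs of (allocated CPUs x elapsed seconds) / 3600
awk -F'|' 'NR > 1 { s += $3 * $4 }
           END    { printf "total core-hours: %.1f\n", s / 3600 }' sacct_sample.txt
# prints: total core-hours: 18.0
```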
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1343</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1343"/>
		<updated>2014-06-26T13:14:04Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Compute] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates benefits of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows vital and often confidential data sources to be retained in a controlled environment, something that cloud services such as Amazon Cloud usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes of 16 cores and 64GB RAM each, and 2 fat nodes of 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and enlarging the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of many separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which is currently at version 6; SL follows the versioning scheme of RHEL. &lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration: if one crashes, the other takes over seamlessly. Various other nodes support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker (compute) nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, each with 16 cores and 64GB of RAM (named &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, each with 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on June 26, 2014:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname		State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
  node001..node002	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
  node049..node054	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
  master1 master2	UP	67.5 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1	custom&lt;br /&gt;
  mds01, mds02		UP	16.8 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		5	1	mds&lt;br /&gt;
  storage01..storage06	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		5	1	oss&lt;br /&gt;
  nfs01			UP	67.6 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		7	1	login&lt;br /&gt;
  fat001 fat002		UP	1.0 TiB	64	AMD Opteron(tm) Processor 6376	2300 MHz		5	1	fat&lt;br /&gt;
  &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration. These also share some applications and databases with machines in the cluster for which the parallel file system is not the ideal solution.&lt;br /&gt;
* The NFS server is a PowerEdge R720XD. The NFS node also acts as a login node, where users log in, compile applications, and submit jobs; it shares the home directories via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled in compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, deemed very stable, and Open Source. Lustre is nowadays the default PFS option in Dell clusters as well as clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: the total I/O should be up to 15GB/s by design. With a very large number of compute nodes - and very high volumes of data - the high read-write speeds that the PFS provides are necessary. The Lustre filesystem is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in the $HOME directory of each user.&lt;br /&gt;
&lt;br /&gt;
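Lustre-aware tools can show usage and file layout on the PFS. A minimal sketch using the standard &#039;lfs&#039; utility that ships with Lustre clients (the file path is a hypothetical example, not an actual cluster path):&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  # Show free space per Lustre target in human-readable form&lt;br /&gt;
  lfs df -h&lt;br /&gt;
  # Inspect how a file is striped across the object storage targets&lt;br /&gt;
  lfs getstripe /lustre/shared/example/datafile&lt;br /&gt;
  &amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;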
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user will have his/her own home directory. The path of the home directory will be: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speed, latency, etc.) than the PFS. This means it is not meant to store large data volumes that require high transfer rates or low latency. Compared to the Lustre PFS (600TB), the NFS is small: only 20TB. The /home partition is backed up daily. The amount of space that can be allocated is limited per user (200GB soft and 210GB hard limit). Personal quota and current use can be checked with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The NFS is supported through the NFS server (nfs01) that also serves as access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and most importantly the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it used to serve as potato storage), but inside is a modern server centre that includes, among other things, emergency power backup systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used daily by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR, FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
The Steering Group ensures that the HPC generates enough revenue and meets the needs of the users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication. &lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT-AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. It will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users. The IT Workgroup will advise the steering group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. In addition, successful use of the cluster will rely on an active community of users willing to share knowledge and best practices, including maintaining and expanding this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
Access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted actively (by creation of an account on the cluster by FB-ICT). Use of resources is limited by the scheduler: priority to the system&#039;s resources is regulated by the queues (&#039;partitions&#039;) granted to a user. &lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request to access the cluster needs to be directed to one of the following persons (please refer to appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are done by [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
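As an illustration, logging in and transferring a file could look like this (the hostname and username are placeholders, not the actual cluster address; see the page above for the real connection details):&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  # Log in to the login node (replace host and user with your own)&lt;br /&gt;
  ssh username@login.example.org&lt;br /&gt;
  # Copy a local file to your home directory on the cluster&lt;br /&gt;
  scp mydata.txt username@login.example.org:~/&lt;br /&gt;
  &amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;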
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
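For orientation, a minimal Slurm batch script might look as follows (the resource values and program name are hypothetical examples; consult the pages above for the cluster&#039;s actual partitions and limits):&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  #!/bin/bash&lt;br /&gt;
  #SBATCH --job-name=example&lt;br /&gt;
  #SBATCH --ntasks=1&lt;br /&gt;
  #SBATCH --cpus-per-task=4&lt;br /&gt;
  #SBATCH --mem=8G&lt;br /&gt;
  #SBATCH --time=01:00:00&lt;br /&gt;
  # Run the (hypothetical) analysis program on the allocated cores&lt;br /&gt;
  srun my_analysis --threads 4&lt;br /&gt;
  &amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Submit the script with &#039;sbatch script.sh&#039; and monitor it with &#039;squeue -u $USER&#039;.&lt;br /&gt;
&lt;br /&gt;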
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
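A brief sketch of typical environment-modules usage (the module name is a hypothetical example; the page above lists the modules actually installed):&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  # List the modules available on the cluster&lt;br /&gt;
  module avail&lt;br /&gt;
  # Load a module and verify which modules are active&lt;br /&gt;
  module load example-tool/1.0&lt;br /&gt;
  module list&lt;br /&gt;
  &amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;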
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | Using SACCT to see your costs]]&lt;br /&gt;
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1342</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1342"/>
		<updated>2014-06-26T13:12:59Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Compute] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates benefits of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows vital and often confidential data sources to be retained in a controlled environment, something that cloud services such as Amazon Cloud usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes of 16 cores and 64GB RAM each, and 2 fat nodes of 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and enlarging the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of many separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which is currently at version 6; SL follows the versioning scheme of RHEL. &lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration: if one crashes, the other takes over seamlessly. Various other nodes support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker (compute) nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, each with 16 cores and 64GB of RAM (named &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, each with 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on June 26, 2014:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname	State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
  node001..node002	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
  node049..node054	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
  master1 master2	UP	67.5 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1	custom&lt;br /&gt;
  mds01, mds02	UP	16.8 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		5	1	mds&lt;br /&gt;
  storage01..storage06	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		5	1	oss&lt;br /&gt;
  nfs01	UP	67.6 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		7	1	login&lt;br /&gt;
  fat001 fat002	UP	1.0 TiB	64	AMD Opteron(tm) Processor 6376	2300 MHz		5	1	fat&lt;br /&gt;
  &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration. These also share some applications and databases with machines in the cluster for which the parallel file system is not the ideal solution.&lt;br /&gt;
* The NFS server is a PowerEdge R720XD. The NFS node also acts as a login node, where users log in, compile applications, and submit jobs; it shares the home directories via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled in compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, deemed very stable, and Open Source. Lustre is nowadays the default PFS option in Dell clusters as well as clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: the total I/O should be up to 15GB/s by design. With a very large number of compute nodes - and very high volumes of data - the high read-write speeds that the PFS provides are necessary. The Lustre filesystem is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in the $HOME directory of each user.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user will have his/her own home directory. The path of the home directory will be: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speed, latency, etc.) than the PFS. This means it is not meant to store large data volumes that require high transfer rates or low latency. Compared to the Lustre PFS (600TB), the NFS is small: only 20TB. The /home partition is backed up daily. The amount of space that can be allocated is limited per user (200GB soft and 210GB hard limit). Personal quota and current use can be checked with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The NFS is supported through the NFS server (nfs01) that also serves as access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and most importantly the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it used to serve as potato storage), but inside is a modern server centre that includes, among other things, emergency power backup systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used daily by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR, FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
The Steering Group ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication.&lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT-AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. It will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users. The IT Workgroup will also advise the steering group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. Successful use of the cluster will also rely on an active community of users willing to share knowledge and best practices, including by maintaining and expanding this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
The access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted explicitly (by creation of an account on the cluster by FB-ICT). Use of resources is controlled by the scheduler: the queues (&#039;partitions&#039;) granted to a user determine that user&#039;s priority to the system&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request for access to the cluster needs to be directed to one of the following persons (please refer to the appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are handled by [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
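As a quick sketch of the usual ssh workflow (the login node address and user name below are placeholders, not the real values; the actual host name is provided by FB-IT when your account is created):

```shell
# Hypothetical login node address and user name; substitute your own.
LOGIN_NODE="b4f.wur.nl"    # placeholder, not the real host name
USER_NAME="yourusername"   # placeholder
TARGET="$USER_NAME@$LOGIN_NODE"

# Interactive login:
#   ssh "$TARGET"
# Upload a file to your NFS home directory:
#   scp mydata.tar.gz "$TARGET":~
echo "$TARGET"
```

See the page linked above for the cluster-specific details.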
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
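For illustration, a minimal Slurm batch script might look like the sketch below; the partition name is an assumption (list the real partitions with `sinfo` on the cluster), and the resource values are examples only.

```shell
# Write a minimal Slurm job script; the partition name "research" is
# hypothetical - check available partitions with `sinfo`.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=research
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
echo "Running on $(hostname)"
EOF
# Submit with:       sbatch myjob.sh
# Check status with: squeue -u "$USER"
```

The pages linked above cover submission and monitoring in detail.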
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | using SACCT to see your costs]]&lt;br /&gt;
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1341</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1341"/>
		<updated>2014-06-26T13:12:10Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Compute] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates benefits of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows retaining vital and often confidential data sources in a controlled environment, something that cloud services such as Amazon Cloud usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes with 16 cores and 64GB RAM each, and 2 fat nodes with 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and expanding the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of many separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which currently is at version 6; SL follows the versioning scheme of RHEL.&lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration: if one crashes, the other takes over seamlessly. Various other nodes support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker (compute) nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, each with 16 cores and 64GB of RAM (named &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, each with 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on June 26, 2014:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration. They also share with machines in the cluster some applications and databases for which the parallel file system is not the ideal solution.&lt;br /&gt;
* NFS server: a PowerEdge R720XD. The NFS node also acts as the login node, where users log in, compile applications, and submit jobs; it shares the home directories via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled in compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, deemed very stable, and Open Source. Lustre is nowadays the default PFS option in Dell clusters as well as clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: total I/O is designed to reach up to 15GB/s. With a very large number of compute nodes - and very high volumes of data - the high read-write speeds that the PFS provides are necessary. The Lustre filesystem is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in each user&#039;s $HOME directory.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user will have their own home directory. The path of the home directory will be: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and offers far lower I/O performance (read/write speed, latency, etc.) than the PFS. This means it is not meant to store large data volumes that require high throughput or low latency. Compared to the Lustre PFS (600TB in size), the NFS is small: only 20TB. The /home partition will be backed up daily. The amount of space per user is limited (200GB soft and 210GB hard limit); personal quota and current usage can be found using:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The NFS is served by the NFS server (nfs01), which also serves as the access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and, most importantly, the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it formerly served as a potato storage facility), but inside is a modern server centre that includes, among other things, emergency power backup systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used daily by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR, FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
The Steering Group ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication.&lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT-AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. It will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users. The IT Workgroup will also advise the steering group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. Successful use of the cluster will also rely on an active community of users willing to share knowledge and best practices, including by maintaining and expanding this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
The access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted explicitly (by creation of an account on the cluster by FB-ICT). Use of resources is controlled by the scheduler: the queues (&#039;partitions&#039;) granted to a user determine that user&#039;s priority to the system&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request for access to the cluster needs to be directed to one of the following persons (please refer to the appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are handled by [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
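As a quick sketch of the usual ssh workflow (the login node address and user name below are placeholders, not the real values; the actual host name is provided by FB-IT when your account is created):

```shell
# Hypothetical login node address and user name; substitute your own.
LOGIN_NODE="b4f.wur.nl"    # placeholder, not the real host name
USER_NAME="yourusername"   # placeholder
TARGET="$USER_NAME@$LOGIN_NODE"

# Interactive login:
#   ssh "$TARGET"
# Upload a file to your NFS home directory:
#   scp mydata.tar.gz "$TARGET":~
echo "$TARGET"
```

See the page linked above for the cluster-specific details.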
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
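For illustration, a minimal Slurm batch script might look like the sketch below; the partition name is an assumption (list the real partitions with `sinfo` on the cluster), and the resource values are examples only.

```shell
# Write a minimal Slurm job script; the partition name "research" is
# hypothetical - check available partitions with `sinfo`.
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=research
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
echo "Running on $(hostname)"
EOF
# Submit with:       sbatch myjob.sh
# Check status with: squeue -u "$USER"
```

The pages linked above cover submission and monitoring in detail.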
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | using SACCT to see your costs]]&lt;br /&gt;
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1340</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1340"/>
		<updated>2014-06-26T13:10:46Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Compute] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates benefits of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows retaining vital and often confidential data sources in a controlled environment, something that cloud services such as Amazon Cloud usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes with 16 cores and 64GB RAM each, and 2 fat nodes with 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and expanding the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of many separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which currently is at version 6; SL follows the versioning scheme of RHEL.&lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration: if one crashes, the other takes over seamlessly. Various other nodes support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker (compute) nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, each with 16 cores and 64GB of RAM (named &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, each with 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on June 26, 2014:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname		State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
node001..node002	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&amp;lt;br&amp;gt;&lt;br /&gt;
node049..node054	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&amp;lt;br&amp;gt;&lt;br /&gt;
master1 master2	UP	67.5 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1&amp;lt;br&amp;gt;	&lt;br /&gt;
mds01, mds02	UP	16.8 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		5	1	mds&amp;lt;br&amp;gt;&lt;br /&gt;
storage01..storage06	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		5	1	oss&amp;lt;br&amp;gt;&lt;br /&gt;
nfs01	UP	67.6 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		7	1	login&amp;lt;br&amp;gt;&lt;br /&gt;
fat001 fat002	UP	1.0 TiB	64	AMD Opteron(tm) Processor 6376	2300 MHz		5	1	fat&lt;br /&gt;
  &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration, which will also share some applications and databases with machines in the cluster for which the parallel file system is not the ideal solution.&lt;br /&gt;
* The NFS server is a PowerEdge R720XD. The NFS node will also act as a login node, where users log in, compile applications, and submit jobs; it also shares the home directories via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled on the compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, considered very stable, and Open Source. Lustre is nowadays the default PFS option in Dell clusters as well as clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: the total I/O is designed to reach up to 15GB/s. With a very large number of compute nodes - and very high volumes of data - the high read-write speeds that the PFS provides are necessary. The Lustre file system is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in the $HOME directory of each user.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user will have his/her own home directory. The path of the home directory will be: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speeds, latency, etc.) than the PFS. This means it is not meant to store large data volumes that require high transfer rates or low latency. Compared to the Lustre PFS (600TB in size), the NFS is small: only 20TB. The /home partition will be backed up daily. The amount of space that can be allocated is limited per user (200GB soft limit, 210GB hard limit). Personal quota and total use can be checked with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
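Besides &#039;&#039;quota&#039;&#039;, standard tools can show what is actually taking up space in a home directory. A minimal sketch using &#039;&#039;du&#039;&#039; (the use of $HOME here is illustrative; any path can be substituted):&lt;br /&gt;

```shell
#!/bin/bash
# Check how much space the home directory uses.
total=$(du -sh "$HOME" | awk '{print $1}')
echo "home directory usage: $total"
# List the largest first-level subdirectories, biggest first:
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -rh | head -n 10
```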
&lt;br /&gt;
The NFS is served by the NFS server (nfs01), which also acts as the access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and, most importantly, the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it used to serve as a potato storage facility), but inside it is a modern server centre that includes, among other things, emergency backup power systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used daily by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR,FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR,FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
The Steering Group ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication. &lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT_AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. It will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users. The IT Workgroup will also advise the Steering Group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. In addition, successful use of the cluster will rely on an active community of users willing to share knowledge and best practices, including maintenance and expansion of this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
The access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted actively (by creation of an account on the cluster by FB-ICT). Use of resources is limited by the scheduler: priority to the system&#039;s resources is regulated by the queues (&#039;partitions&#039;) granted to a user. &lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request for access to the cluster should be directed to one of the following persons (please contact the appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are handled via [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
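As a quick illustration of the Slurm workflow covered by the pages above, work is typically wrapped in a small batch script and handed to the scheduler with &#039;&#039;sbatch&#039;&#039;. A minimal sketch; the job name and resource values below are placeholders, not site defaults:&lt;br /&gt;

```shell
#!/bin/bash
# Write a minimal Slurm batch script (all values are illustrative placeholders).
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
srun hostname
EOF
# On the cluster this would be submitted with:  sbatch myjob.sh
# and monitored with:                           squeue -u $USER
```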
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
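For example, pointing &#039;&#039;TMPDIR&#039;&#039; at a custom scratch location (the directory name below is an arbitrary example) makes standard tools such as &#039;&#039;mktemp&#039;&#039; create their temporary files there:&lt;br /&gt;

```shell
#!/bin/bash
# Redirect temporary files to a custom scratch directory (path is an example).
export TMPDIR="$HOME/scratch_tmp"
mkdir -p "$TMPDIR"
tmpfile=$(mktemp)   # mktemp honours $TMPDIR for its default template
echo "temporary file created at: $tmpfile"
```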
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | using SACCT to see your costs]]&lt;br /&gt;
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1339</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1339"/>
		<updated>2014-06-26T13:10:27Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Compute] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates benefits of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows vital and often confidential data sources to be retained in a controlled environment, something that cloud services such as Amazon Cloud or others usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes with 16 cores and 64GB RAM each, and 2 fat nodes with 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and enlarging the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of many separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which currently is at version 6. SL therefore follows the versioning scheme of RHEL. &lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration, which means that if one crashes, the other will take over seamlessly. Various other nodes exist to support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker nodes or compute nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, each with 16 cores and 64GB of RAM (called &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, each with 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on June 26, 2014:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname		State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
  &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration, which will also share some applications and databases with machines in the cluster for which the parallel file system is not the ideal solution.&lt;br /&gt;
* The NFS server is a PowerEdge R720XD. The NFS node will also act as a login node, where users log in, compile applications, and submit jobs; it also shares the home directories via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled on the compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, considered very stable, and Open Source. Lustre is nowadays the default PFS option in Dell clusters as well as clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: the total I/O is designed to reach up to 15GB/s. With a very large number of compute nodes - and very high volumes of data - the high read-write speeds that the PFS provides are necessary. The Lustre file system is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in the $HOME directory of each user.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user will have his/her own home directory. The path of the home directory will be: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speeds, latency, etc.) than the PFS. This means it is not meant to store large data volumes that require high transfer rates or low latency. Compared to the Lustre PFS (600TB in size), the NFS is small: only 20TB. The /home partition will be backed up daily. The amount of space that can be allocated is limited per user (200GB soft limit, 210GB hard limit). Personal quota and total use can be checked with:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
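Besides &#039;&#039;quota&#039;&#039;, standard tools can show what is actually taking up space in a home directory. A minimal sketch using &#039;&#039;du&#039;&#039; (the use of $HOME here is illustrative; any path can be substituted):&lt;br /&gt;

```shell
#!/bin/bash
# Check how much space the home directory uses.
total=$(du -sh "$HOME" | awk '{print $1}')
echo "home directory usage: $total"
# List the largest first-level subdirectories, biggest first:
du -h --max-depth=1 "$HOME" 2>/dev/null | sort -rh | head -n 10
```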
&lt;br /&gt;
The NFS is served by the NFS server (nfs01), which also acts as the access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and, most importantly, the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it used to serve as a potato storage facility), but inside it is a modern server centre that includes, among other things, emergency backup power systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used daily by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR,FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR,FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
The Steering Group ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication. &lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT_AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. It will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users. The IT Workgroup will also advise the Steering Group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. In addition, successful use of the cluster will rely on an active community of users willing to share knowledge and best practices, including maintenance and expansion of this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
The access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted actively (by creation of an account on the cluster by FB-ICT). Use of resources is limited by the scheduler: priority to the system&#039;s resources is regulated by the queues (&#039;partitions&#039;) granted to a user. &lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request for access to the cluster should be directed to one of the following persons (please contact the appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are handled via [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
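As a quick illustration of the Slurm workflow covered by the pages above, work is typically wrapped in a small batch script and handed to the scheduler with &#039;&#039;sbatch&#039;&#039;. A minimal sketch; the job name and resource values below are placeholders, not site defaults:&lt;br /&gt;

```shell
#!/bin/bash
# Write a minimal Slurm batch script (all values are illustrative placeholders).
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
srun hostname
EOF
# On the cluster this would be submitted with:  sbatch myjob.sh
# and monitored with:                           squeue -u $USER
```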
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
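For example, pointing &#039;&#039;TMPDIR&#039;&#039; at a custom scratch location (the directory name below is an arbitrary example) makes standard tools such as &#039;&#039;mktemp&#039;&#039; create their temporary files there:&lt;br /&gt;

```shell
#!/bin/bash
# Redirect temporary files to a custom scratch directory (path is an example).
export TMPDIR="$HOME/scratch_tmp"
mkdir -p "$TMPDIR"
tmpfile=$(mktemp)   # mktemp honours $TMPDIR for its default template
echo "temporary file created at: $tmpfile"
```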
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | using SACCT to see your costs]]&lt;br /&gt;
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1338</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1338"/>
		<updated>2014-06-26T13:05:52Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Computing] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates economies of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at the Wageningen University campus, allows vital and often confidential data sources to be retained in a controlled environment, something that cloud services such as Amazon Cloud or others usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes of 16 cores and 64GB RAM each, and 2 fat nodes of 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and extending the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of a number of separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which is currently at version 6; SL therefore follows the versioning scheme of RHEL. &lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration: if one crashes, the other takes over seamlessly. Various other nodes support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker nodes, or compute nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, which each have 16 cores and 64GB of RAM (named &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, which each have 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on November 23, 2013:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname	State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
node001, node004, node006..node008, node010, node012..node015, node017, node020, node022, node027..node035, node037, node039, node050, node051, node053	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
node002, node003, node005, node009, node016, node018, node019, node021, node023..node026, node036, node038, node040..node042, node049, node052, node054	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		3	1	default&lt;br /&gt;
node011	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
master1 master2	UP	67.5 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1	&lt;br /&gt;
mds01, mds02	UP	16.8 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		5	1	mds&lt;br /&gt;
storage01..storage06	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		5	1	oss&lt;br /&gt;
nfs01	UP	67.6 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		7	1	login&lt;br /&gt;
fat001 fat002	UP	1.0 TiB	64	AMD Opteron(tm) Processor 6376	2300 MHz		5	1	fat&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration; these also serve some applications and databases to machines in the cluster for which the parallel file system is not the ideal solution.&lt;br /&gt;
* The NFS server is a PowerEdge R720XD. The NFS node also acts as the login node, where users log in, compile applications, and submit jobs; it shares each home directory via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled in compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is called [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, considered very stable, and open source. Lustre is nowadays the default PFS option in clusters sold by Dell as well as by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: total I/O should reach up to 15GB/s by design. With a very large number of compute nodes - and very high volumes of data - the high read/write speeds that the PFS can provide are necessary. The Lustre file system is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in each user&#039;s $HOME directory.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user will have his/her own home directory. The path of the home directory will be: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speed, latency, etc.) than the PFS, which means it is not meant to store large data volumes that require high transfer rates or low latency. Compared to the Lustre PFS (600TB in size), the NFS is small: only 20TB. The /home partition is backed up daily. The amount of space that can be allocated is limited per user (200GB soft and 210GB hard limit). Personal quota and total use per user can be found using:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
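The home-directory layout described above can be sketched as follows; the partner and user names below are made-up examples, not real accounts.&lt;br /&gt;

```shell
# Build a path following the /home/[name partner]/[username] pattern
# described above; "ABGC" and "jdoe001" are hypothetical examples.
partner="ABGC"
username="jdoe001"
homedir="/home/${partner}/${username}"
echo "$homedir"
```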
&lt;br /&gt;
The NFS is served by the NFS server (nfs01), which also acts as the access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and most importantly, the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it used to serve as a potato storage facility), but inside is a modern server centre that includes, among other things, emergency power backup systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used on a daily basis by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is evidently highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR, FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
The Steering Group ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication. &lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT-AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. It will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users, and it will advise the Steering Group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. In addition, successful use of the cluster will rely on an active community of users willing to share knowledge and best practices, including by maintaining and expanding this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
Access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted actively (by creation of an account on the cluster by FB-ICT). Use of resources is limited by the scheduler: depending on which queues (&#039;partitions&#039;) a user has been granted, priority of access to the system&#039;s resources is regulated. &lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request for access to the cluster should be directed to one of the following persons (please refer to the appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are handled via [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
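As a minimal sketch of the Slurm workflow covered in the pages above, a batch script might look like this; the job name and resource requests are hypothetical examples, not cluster defaults.&lt;br /&gt;

```shell
# Create a minimal Slurm batch script; the job name and resource requests
# below are hypothetical examples, not cluster defaults.
cat > example_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=01:00:00
echo "Running on $(hostname)"
EOF
# Submit with:  sbatch example_job.sh
# Inspect with: squeue -u $USER
```

sbatch and squeue are standard Slurm commands; partition names and limits on this cluster are documented on the Slurm pages linked above.&lt;br /&gt;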
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
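As an illustration of the environment-variable pages listed above, a custom temporary directory can be set along these lines; the location under $HOME is an illustrative choice, not the documented recommendation.&lt;br /&gt;

```shell
# Point TMPDIR at a user-owned directory; the location under $HOME is an
# illustrative choice (see the Setting_TMPDIR page for the recommended one).
export TMPDIR="$HOME/tmp"
mkdir -p "$TMPDIR"
echo "TMPDIR is set to $TMPDIR"
```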
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | using SACCT to see your costs]]&lt;br /&gt;
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1337</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1337"/>
		<updated>2014-06-26T13:01:51Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Computing] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates economies of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at the Wageningen University campus, allows vital and often confidential data sources to be retained in a controlled environment, something that cloud services such as Amazon Cloud or others usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by superfast network connections (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes of 16 cores and 64GB RAM each, and 2 fat nodes of 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and extending the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of a number of separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which is currently at version 6; SL therefore follows the versioning scheme of RHEL. &lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration: if one crashes, the other takes over seamlessly. Various other nodes support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker nodes, or compute nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, which each have 16 cores and 64GB of RAM (named &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, which each have 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on November 23, 2013:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname	State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
node001, node004, node006..node008, node010, node012..node015, node017, node020, node022, node027..node035, node037, node039, node050, node051, node053	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
node002, node003, node005, node009, node016, node018, node019, node021, node023..node026, node036, node038, node040..node042, node049, node052, node054	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		3	1	default&lt;br /&gt;
node011	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		3	1	default&lt;br /&gt;
master1 master2	UP	67.5 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1	&lt;br /&gt;
mds01, mds02	UP	16.8 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		5	1	mds&lt;br /&gt;
storage01..storage06	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		5	1	oss&lt;br /&gt;
nfs01	UP	67.6 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2399 MHz		7	1	login&lt;br /&gt;
fat001 fat002	UP	1.0 TiB	64	AMD Opteron(tm) Processor 6376	2300 MHz		5	1	fat&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration; these also serve some applications and databases to machines in the cluster for which the parallel file system is not the ideal solution.&lt;br /&gt;
* The NFS server is a PowerEdge R720XD. The NFS node also acts as the login node, where users log in, compile applications, and submit jobs; it shares each home directory via NFS.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled in compute nodes.&lt;br /&gt;
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is called [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, considered very stable, and open source. Lustre is nowadays the default PFS option in clusters sold by Dell as well as by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: total I/O should reach up to 15GB/s by design. With a very large number of compute nodes - and very high volumes of data - the high read/write speeds that the PFS can provide are necessary. The Lustre file system is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in each user&#039;s $HOME directory.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user will have his/her own home directory. The path of the home directory will be: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speed, latency, etc.) than the PFS, which means it is not meant to store large data volumes that require high transfer rates or low latency. Compared to the Lustre PFS (600TB in size), the NFS is small: only 20TB. The /home partition is backed up daily. The amount of space that can be allocated is limited per user (200GB soft and 210GB hard limit). Personal quota and total use per user can be found using:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota -s&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The NFS is served by the NFS server (nfs01), which also acts as the access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head nodes, worker nodes, and most importantly, the Lustre PFS - are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] topology.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it used to serve as a potato storage facility), but inside is a modern server centre that includes, among other things, emergency power backup systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used on a daily basis by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is evidently highly restricted and can only be granted in the presence of a representative of FB-IT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR, FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
Ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication.&lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT-AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. It will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users, and it will advise the steering group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. In addition, successful use of the cluster will rely on an active community of users willing to share knowledge and best practices, including maintaining and expanding this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
Access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access must be granted explicitly (through creation of an account on the cluster by FB-ICT). Use of resources is limited by the scheduler: the queues (&#039;partitions&#039;) granted to a user determine that user&#039;s priority to the system&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request for access to the cluster needs to be directed to one of the following persons (please contact the appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are both done via [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
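Both logging in and transferring files go over SSH, so a host entry in ~/.ssh/config saves retyping the connection details. A sketch (the alias, hostname, and username below are placeholders, not actual cluster addresses; the wiki notes that the NFS node nfs01 acts as the login node, so use the address provided with your account):

```text
# ~/.ssh/config -- HostName and User are placeholders
Host b4f
    HostName login.example.wur.nl
    User your_username
```

With this entry in place, ssh b4f opens a session and scp results.txt b4f:~/ copies a file, without repeating the full address each time.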
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | Using SACCT to see your costs]]&lt;br /&gt;
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Mailinglist&amp;diff=1336</id>
		<title>Mailinglist</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Mailinglist&amp;diff=1336"/>
		<updated>2014-06-05T11:40:20Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Available mailing lists ==&lt;br /&gt;
To improve collaboration between HPC users and to provide an easy-to-use platform for exchanging ideas, the HPC offers a public mailing list service.&amp;lt;br&amp;gt;&lt;br /&gt;
More information about the underlying software, Mailman, can be found on its [http://www.list.org/ website].&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Information about which lists are hosted on the HPC is available [https://master1.hpcagrogenomics.wur.nl/mailman/listinfo here].&amp;lt;br&amp;gt;&lt;br /&gt;
We recommend that HPC users subscribe to the [https://master1.hpcagrogenomics.wur.nl/mailman/listinfo/hpcag hpcag] list.&amp;lt;br&amp;gt;&lt;br /&gt;
The subscription process can be started via that website.&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Mailinglist&amp;diff=1335</id>
		<title>Mailinglist</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Mailinglist&amp;diff=1335"/>
		<updated>2014-06-05T11:38:50Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: Created page with &amp;quot;== Available mailling lists == In order to improve collaboration between HPC users and to provide an easy to use platform in order to exchange ideas, the HPC has public access...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Available mailing lists ==&lt;br /&gt;
To improve collaboration between HPC users and to provide an easy-to-use platform for exchanging ideas, the HPC has a publicly accessible mailing list service available. More information about the underlying software, Mailman, can be found on its [http://www.list.org/ website].&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Information about which lists are hosted on the HPC can be found [https://master1.hpcagrogenomics.wur.nl/mailman/listinfo here].&amp;lt;br&amp;gt;&lt;br /&gt;
We recommend that HPC users subscribe to the [https://master1.hpcagrogenomics.wur.nl/mailman/listinfo/hpcag hpcag] list. The subscription process can be started via that website.&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1334</id>
		<title>Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Main_Page&amp;diff=1334"/>
		<updated>2014-06-05T11:24:16Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* See also */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Compute] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS]. &lt;br /&gt;
&lt;br /&gt;
== Rationale and Requirements for a new cluster ==&lt;br /&gt;
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]&lt;br /&gt;
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs of genetics and genomics research by creating a joint facility that generates benefits of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers - academic and applied - can benefit from each other&#039;s know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows retaining vital and often confidential data sources in a controlled environment, something that cloud services such as Amazon Cloud usually cannot guarantee.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Process of acquisition and financing ==&lt;br /&gt;
&lt;br /&gt;
[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]&lt;br /&gt;
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.&lt;br /&gt;
&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Architecture of the cluster ==&lt;br /&gt;
&lt;br /&gt;
[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]&lt;br /&gt;
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying &#039;size&#039;), all connected by an ultrafast network (InfiniBand). Implementation of the cluster will be done in stages. The initial stage includes a 600TB PFS, 48 slim nodes with 16 cores and 64GB RAM each, and 2 fat nodes with 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and enlarging the PFS. The cluster management software is designed to accommodate a heterogeneous and evolving cluster.&lt;br /&gt;
{{-}}&lt;br /&gt;
=== Nodes ===&lt;br /&gt;
The cluster consists of many separate machines, each running its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which is currently at version 6; SL follows the versioning scheme of RHEL.&lt;br /&gt;
&lt;br /&gt;
The cluster has two master nodes in a redundant configuration, which means that if one crashes, the other takes over seamlessly. Various other nodes exist to support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker nodes or compute nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called &#039;slim nodes&#039;, each with 16 cores and 64GB of RAM (called &#039;node001&#039; through &#039;node060&#039;; note that not all node names map to physical nodes), and two so-called &#039;fat nodes&#039;, each with 64 cores and 1TB of RAM (&#039;fat001&#039; and &#039;fat002&#039;).&lt;br /&gt;
&lt;br /&gt;
Information from the Cluster Management Portal, as it appeared on November 23, 2013:&lt;br /&gt;
  &amp;lt;code&amp;gt;DEVICE INFORMATION&lt;br /&gt;
  Hostname	State	Memory	Cores	CPU	Speed	GPU	NICs	IB	Category&lt;br /&gt;
  master1, master2	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1	&lt;br /&gt;
  node001..node042, node049..node054	UP	67.6 GiB	16	Intel(R) Xeon(R) CPU E5-2660 0+	1200 MHz		3	1	default&lt;br /&gt;
  node043..node048, node055..node060	DOWN	N/A	N/A	N/A	N/A	N/A	N/A	N/A	default&lt;br /&gt;
  mds01, mds02	UP	16.8 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2400 MHz		5	1	mds&lt;br /&gt;
  storage01	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2200 MHz		5	1	oss&lt;br /&gt;
  storage02..storage06	UP	67.6 GiB	32	Intel(R) Xeon(R) CPU E5-2660 0+	2199 MHz		5	1	oss&lt;br /&gt;
  nfs01	UP	67.6 GiB	8	Intel(R) Xeon(R) CPU E5-2609 0+	2400 MHz		7	1	login&lt;br /&gt;
  fat001, fat002	UP	1.0 TiB	64	AMD Opteron(tm) Processor 6376	2299 MHz		5	1	fat &amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Main cluster node configuration:&lt;br /&gt;
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration&lt;br /&gt;
* The NFS server is a PowerEdge R720XD, which shares with the machines in the cluster some applications and databases for which the parallel file system is not the ideal solution. The NFS node also acts as the login node, where users log in, compile applications, and submit jobs.&lt;br /&gt;
* 50 compute nodes&lt;br /&gt;
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes&lt;br /&gt;
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each&lt;br /&gt;
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each&lt;br /&gt;
Hyperthreading is disabled in compute nodes.&lt;br /&gt;
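Once logged in, the advertised core counts are easy to verify with standard Linux tools; a small sketch (it simply reports whatever machine executes it):

```shell
# Number of logical CPUs visible to the operating system.
nproc

# The same count straight from the kernel; with hyperthreading disabled
# this equals the number of physical cores (16 on slim nodes, 64 on fat nodes).
grep -c ^processor /proc/cpuinfo
```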
&lt;br /&gt;
=== Filesystems ===&lt;br /&gt;
&lt;br /&gt;
[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]&lt;br /&gt;
The B4F Cluster has two primary file systems, each with different properties and purposes.&lt;br /&gt;
==== Parallel File System: Lustre ====&lt;br /&gt;
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is called [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, considered very stable, and open source. Lustre is nowadays the default PFS option in Dell clusters as well as clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing seamless integration between compute and data infrastructure. The strength of a PFS is speed: by design, total I/O should reach up to 15GB/s. With a very large number of compute nodes - and very high volumes of data - the high read-write speeds that the PFS provides are necessary. The Lustre filesystem is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in the $HOME directory of each user.&lt;br /&gt;
&lt;br /&gt;
The hardware components of the PFS:&lt;br /&gt;
* 2x Dell PowerEdge R720&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
* 6x Dell PowerEdge R620&lt;br /&gt;
* 6x Dell PowerVault MD3260&lt;br /&gt;
&lt;br /&gt;
==== Network File System (NFS): $HOME dirs ====&lt;br /&gt;
Each user will have his/her own home directory. The path of the home directory will be: &lt;br /&gt;
&lt;br /&gt;
  /home/[name partner]/[username]&lt;br /&gt;
&lt;br /&gt;
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and far more limited in I/O (read/write speed, latency, etc.) than the PFS. It is therefore not meant to store large data volumes that require high throughput or low latency. At 20TB, the NFS is also small compared to the 600TB Lustre PFS. The /home partition is backed up daily, and the amount of space that can be allocated is limited per user. Personal quota and total use per user can be found using:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
quota&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The NFS is served by the NFS server, which also acts as the access point to the cluster.&lt;br /&gt;
&lt;br /&gt;
Hardware components of the NFS:&lt;br /&gt;
* 1x Dell PowerEdge R720XD&lt;br /&gt;
* 1x Dell PowerVault MD3220&lt;br /&gt;
&lt;br /&gt;
=== Network ===&lt;br /&gt;
The various components - head-nodes, worker nodes, and most importantly, the Lustre PFS - are all interconnected by an ultra-high speed network connection called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] configuration.&lt;br /&gt;
&lt;br /&gt;
== Housing at Theia ==&lt;br /&gt;
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]&lt;br /&gt;
The B4F Cluster is housed at the main server centre of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it used to serve as a potato storage facility), but inside is a modern server centre that includes, among other things, emergency power backup systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that WUR personnel and students use on a daily basis are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-ICT.&lt;br /&gt;
{{-}}&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;10%&amp;quot; |&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]&lt;br /&gt;
| width=&amp;quot;70%&amp;quot; |&lt;br /&gt;
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]&lt;br /&gt;
|}&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
== Management ==&lt;br /&gt;
&lt;br /&gt;
=== Project Leader ===&lt;br /&gt;
* Stephen Janssen (Wageningen UR, FB-IT, Service Management)&lt;br /&gt;
&lt;br /&gt;
=== Daily Project Management ===&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)&lt;br /&gt;
&lt;br /&gt;
[[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
&lt;br /&gt;
=== Steering Group ===&lt;br /&gt;
Ensures that the HPC generates sufficient revenue and meets the needs of its users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication.&lt;br /&gt;
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)&lt;br /&gt;
* Petra Caessens (CAT-AgroFood)&lt;br /&gt;
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)&lt;br /&gt;
* Edda Neuteboom (CAT-AgroFood, secretariat)&lt;br /&gt;
* Johan van Arendonk (Wageningen UR, chair).&lt;br /&gt;
&lt;br /&gt;
=== IT Workgroup ===&lt;br /&gt;
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]&lt;br /&gt;
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. It will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users, and it will advise the steering group on investments in software and hardware.&lt;br /&gt;
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]&lt;br /&gt;
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]&lt;br /&gt;
* [[User:Barris01 | Wes Barris (Cobb)]]&lt;br /&gt;
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen (Topigs)]]&lt;br /&gt;
* Harry Dijkstra (CRV)&lt;br /&gt;
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]&lt;br /&gt;
{{-}}&lt;br /&gt;
&lt;br /&gt;
=== User Group ===&lt;br /&gt;
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. In addition, successful use of the cluster will rely on an active community of users willing to share knowledge and best practices, including maintaining and expanding this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.&lt;br /&gt;
&lt;br /&gt;
* [[List_of_users | List of users (alphabetical order)]]&lt;br /&gt;
&lt;br /&gt;
== Access Policy ==&lt;br /&gt;
Access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access must be granted explicitly (through creation of an account on the cluster by FB-ICT). Use of resources is limited by the scheduler: the queues (&#039;partitions&#039;) granted to a user determine that user&#039;s priority to the system&#039;s resources.&lt;br /&gt;
&lt;br /&gt;
=== Contact Persons ===&lt;br /&gt;
A request for access to the cluster needs to be directed to one of the following persons (please contact the appropriate partner):&lt;br /&gt;
&lt;br /&gt;
==== Cobb-Vantress ====&lt;br /&gt;
* Wes Barris&lt;br /&gt;
* Jun Chen&lt;br /&gt;
&lt;br /&gt;
==== ABGC ====&lt;br /&gt;
===== Animal Breeding and Genetics =====&lt;br /&gt;
* [[User:Hulze001 |Alex Hulzebosch]]&lt;br /&gt;
* [[User:Megen002 | Hendrik-Jan Megens]]&lt;br /&gt;
&lt;br /&gt;
===== Wageningen Livestock Research =====&lt;br /&gt;
* Mario Calus&lt;br /&gt;
* Ina Hulsegge&lt;br /&gt;
==== CRV ====&lt;br /&gt;
* Frido Hamoen&lt;br /&gt;
* Chris Schrooten&lt;br /&gt;
==== Hendrix Genetics ==== &lt;br /&gt;
* Ton Dings&lt;br /&gt;
* Abe Huisman&lt;br /&gt;
* Addie Vereijken&lt;br /&gt;
==== Topigs ====&lt;br /&gt;
* [[User:dongen01 | Henk van Dongen]]&lt;br /&gt;
* Egiel Hanenbarg&lt;br /&gt;
* Naomi Duijvensteijn&lt;br /&gt;
&lt;br /&gt;
== Using the B4F Cluster ==&lt;br /&gt;
=== Gaining access to the B4F Cluster ===&lt;br /&gt;
Access to the cluster and file transfer are both done via [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].&lt;br /&gt;
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]&lt;br /&gt;
&lt;br /&gt;
=== Cluster Management Software and Scheduler ===&lt;br /&gt;
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.&lt;br /&gt;
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]&lt;br /&gt;
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]&lt;br /&gt;
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]&lt;br /&gt;
&lt;br /&gt;
=== Installation of software by users ===&lt;br /&gt;
&lt;br /&gt;
* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Installed software ===&lt;br /&gt;
&lt;br /&gt;
* [[Globally_installed_software | Globally installed software]]&lt;br /&gt;
* [[ABGC_modules | ABGC specific modules]]&lt;br /&gt;
&lt;br /&gt;
=== Being in control of Environment parameters ===&lt;br /&gt;
&lt;br /&gt;
* [[Using_environment_modules | Using environment modules]]&lt;br /&gt;
* [[Setting local variables]]&lt;br /&gt;
* [[Setting_TMPDIR | Set a custom temporary directory location]]&lt;br /&gt;
* [[Installing_R_packages_locally | Installing R packages locally]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
=== Controlling costs ===&lt;br /&gt;
&lt;br /&gt;
* [[SACCT | Using SACCT to see your costs]]&lt;br /&gt;
&lt;br /&gt;
== Miscellaneous ==&lt;br /&gt;
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[Maintenance_and_Management | Maintenance and Management]]&lt;br /&gt;
* [[Mailinglist | Electronic mail discussion lists]]&lt;br /&gt;
* [[About_ABGC | About ABGC]]&lt;br /&gt;
* [[Computer_cluster | High Performance Computing @ABGC]]&lt;br /&gt;
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
{| width=&amp;quot;90%&amp;quot;&lt;br /&gt;
|- valign=&amp;quot;top&amp;quot;&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]&lt;br /&gt;
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]&lt;br /&gt;
* [http://www.cobb-vantress.com Cobb-Vantress homepage]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [https://www.crv4all.nl CRV homepage]&lt;br /&gt;
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]&lt;br /&gt;
* [http://www.topigs.com TOPIGS homepage]&lt;br /&gt;
| width=&amp;quot;30%&amp;quot; |&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Maintenance_and_Management&amp;diff=1333</id>
		<title>Maintenance and Management</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Maintenance_and_Management&amp;diff=1333"/>
		<updated>2014-06-05T11:21:31Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Maintenance and Management */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Maintenance and Management ==&lt;br /&gt;
&lt;br /&gt;
As of April 2014, questions should be directed to the Service Desk IT.&lt;br /&gt;
&lt;br /&gt;
This can be done by e-mail: servicedesk.it@wur.nl &lt;br /&gt;
or by telephone: +31 317 488888&lt;br /&gt;
Please give your name and phone number, mention that your mail/call is about the HPC for Agrogenomics, and state which company you work for.&lt;br /&gt;
When you call the service desk, please also give your e-mail address.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Maintenance June 11th 2014 ==&lt;br /&gt;
There will be mainly firmware maintenance between 8:00 and 13:00 CET. Because network controller and storage controller firmware will be upgraded, all servers need to be rebooted and will experience network hiccups. To prevent job and/or data corruption, the HPC will be shut down during this maintenance window. Running jobs will be killed!&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1320</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1320"/>
		<updated>2014-04-03T14:07:55Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Batch script */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 3 queues (in Slurm called partitions): a high, a standard and a low priority queue.&amp;lt;br&amp;gt;&lt;br /&gt;
The High queue gives jobs the highest priority (20), followed by the standard queue (10) and the Low queue (0).&amp;lt;br&amp;gt;&lt;br /&gt;
Jobs in the Low queue will be requeued if a job with higher priority needs cluster resources that are occupied by Low queue jobs.&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_High      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_High      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_High      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Std       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Std       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Std       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Low       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Low       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Low       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script, a compute-intensive example that evaluates the Bailey-Borwein-Plouffe series for Pi at very high precision:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
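The sum above is the Bailey-Borwein-Plouffe (BBP) series for Pi. As a quick sanity check, separate from the batch script itself, the same series at modest precision already reproduces the familiar leading digits:

```python
from decimal import Decimal as D, getcontext

# Same BBP series as the batch script, but at modest precision:
# 30 significant digits and only 20 terms of the sum.
getcontext().prec = 30
p = sum(
    D(1) / 16**k * (D(4) / (8*k + 1) - D(2) / (8*k + 4)
                    - D(1) / (8*k + 5) - D(1) / (8*k + 6))
    for k in range(20)
)
print(str(p)[:16])  # prints 3.14159265358979
```

Each extra term of the series contributes roughly 1.2 more correct decimal digits, which is why the full script needs both a large precision and many terms.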
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
For this script to run, the first thing needed is to load Python 3, which is not the default Python version on the cluster, into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC_Std&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string and may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
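For instance, the following directives all request the same 20-hour (1200-minute) limit; this is a sketch of the accepted formats, and only one such line belongs in a script:

```shell
#SBATCH --time=1200        # minutes
#SBATCH --time=20:00:00    # hours:minutes:seconds
#SBATCH --time=0-20        # days-hours
```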
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. The default is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an &amp;quot;Exceeded job memory limit&amp;quot; error. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could have very different values when run at different times.&lt;br /&gt;
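A minimal sketch of that conversion, with a hypothetical MaxRSS value, using plain shell arithmetic to turn sacct's KB figure into an --mem value in MB with about 20% headroom:

```shell
# Hypothetical MaxRSS as reported by 'sacct -o MaxRSS -j JOBID', in KB
maxrss_kb=1536000

# Convert KB to MB, then add roughly 20% headroom,
# since --mem defines a hard upper limit
mem_mb=$(( maxrss_kb / 1024 ))
mem_mb=$(( mem_mb + mem_mb / 5 ))

echo "#SBATCH --mem=${mem_mb}"   # prints #SBATCH --mem=1800
```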
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be split across multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it is used as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done by using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as option the job will specifically be scheduled to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=ABGC_Std&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in $(seq 1 10); do echo $i; sbatch runscript_$i.sh; done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
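When the individual jobs differ only in an index, a Slurm job array is an alternative that needs a single script and a single submission. This is a sketch; it assumes the runscript_N.sh files from above are executable shell scripts:

```shell
#!/bin/bash
#SBATCH --array=1-10
# Slurm starts ten array tasks; each one sees its own index
# in the SLURM_ARRAY_TASK_ID environment variable.
./runscript_"$SLURM_ARRAY_TASK_ID".sh
```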
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) nodes of a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that moment; for the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a specific job can be inspected using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
=== Query a specific active job: scontrol ===&lt;br /&gt;
Show all details of a currently active (i.e. not yet completed) job.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Check on a pending job ===&lt;br /&gt;
A submitted job can end up in a pending state when there are not enough resources available for it.&lt;br /&gt;
In this example I submit a job, check its status and, after finding that it is &#039;&#039;&#039;pending&#039;&#039;&#039;, check when it will probably start.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
[@nfs01 jobs]$ sbatch hpl_student.job&lt;br /&gt;
 Submitted batch job 740338&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue -l -j 740338&lt;br /&gt;
 Fri Feb 21 15:32:31 2014&lt;br /&gt;
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PENDING       0:00 1-00:00:00      1 (ReqNodeNotAvail)&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue --start -j 740338&lt;br /&gt;
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PD  2014-02-22T15:31:48      1 (ReqNodeNotAvail)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
So it seems this job will probably start the next day, but that is no guarantee that it actually will.&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
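A minimal sketch of interactive allocation with the salloc command; the partition name and limits are illustrative, so substitute ones you are authorized for:

```shell
# Request one task for 30 minutes; salloc opens a shell
# inside the allocation once the resources are granted
salloc --partition=ABGC_Std --ntasks=1 --time=30:00

# Inside that shell, run commands on the allocated node(s) via srun
srun hostname

# Leaving the shell ends the allocation
exit
```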
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING or the PENDING state. However, here is a breakdown of all the states your job can be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. E.g., Master students will only have the Student Partition available, while researchers at ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1319</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1319"/>
		<updated>2014-04-03T14:06:33Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Queues */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 3 queues (in Slurm called partitions): a high, a standard and a low priority queue.&amp;lt;br&amp;gt;&lt;br /&gt;
The High queue gives jobs the highest priority (20), followed by the standard queue (10) and the Low queue (0).&amp;lt;br&amp;gt;&lt;br /&gt;
Jobs in the Low queue will be requeued if a job with higher priority needs cluster resources that are occupied by Low queue jobs.&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_High      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_High      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_High      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Std       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Std       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Std       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Low       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Low       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Low       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script, a compute-intensive example that evaluates the Bailey-Borwein-Plouffe series for Pi at very high precision:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
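The sum above is the Bailey-Borwein-Plouffe (BBP) series for Pi. As a quick sanity check, separate from the batch script itself, the same series at modest precision already reproduces the familiar leading digits:

```python
from decimal import Decimal as D, getcontext

# Same BBP series as the batch script, but at modest precision:
# 30 significant digits and only 20 terms of the sum.
getcontext().prec = 30
p = sum(
    D(1) / 16**k * (D(4) / (8*k + 1) - D(2) / (8*k + 4)
                    - D(1) / (8*k + 5) - D(1) / (8*k + 6))
    for k in range(20)
)
print(str(p)[:16])  # prints 3.14159265358979
```

Each extra term of the series contributes roughly 1.2 more correct decimal digits, which is why the full script needs both a large precision and many terms.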
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
For this script to run, the first thing needed is to load Python 3, which is not the default Python version on the cluster, into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string and may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
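For instance, the following directives all request the same 20-hour (1200-minute) limit; this is a sketch of the accepted formats, and only one such line belongs in a script:

```shell
#SBATCH --time=1200        # minutes
#SBATCH --time=20:00:00    # hours:minutes:seconds
#SBATCH --time=0-20        # days-hours
```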
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. The default is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an &amp;quot;Exceeded job memory limit&amp;quot; error. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could have very different values when run at different times.&lt;br /&gt;
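A minimal sketch of that conversion, with a hypothetical MaxRSS value, using plain shell arithmetic to turn sacct's KB figure into an --mem value in MB with about 20% headroom:

```shell
# Hypothetical MaxRSS as reported by 'sacct -o MaxRSS -j JOBID', in KB
maxrss_kb=1536000

# Convert KB to MB, then add roughly 20% headroom,
# since --mem defines a hard upper limit
mem_mb=$(( maxrss_kb / 1024 ))
mem_mb=$(( mem_mb + mem_mb / 5 ))

echo "#SBATCH --mem=${mem_mb}"   # prints #SBATCH --mem=1800
```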
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be split across multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it is used as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done with the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. Using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; instead will schedule the job specifically to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The email address to send notifications to.&lt;br /&gt;
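Putting the directives above together, a minimal job script might look like this (a sketch only; the memory value, email address, and script name are placeholders):

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --mem=2048
#SBATCH --constraint=normalmem
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --partition=research
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

# The actual work of the job goes here
python calc_pi.py
```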
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) node for a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that time. For the &#039;sbatch&#039; submission example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
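Output like the above can be filtered with standard text tools. For example, counting how many of the listed jobs are in the running state (the two sample lines below are illustrative stand-ins for real squeue output):

```shell
# Count jobs in the RUNNING state (column 5) from squeue-style output.
printf '%s\n' \
  "3396 ABGC BOV-WUR- megen002 R 27:26 1 node004" \
  "3385 research BOV-WUR- megen002 R 44:38 1 node049" |
  awk '$5 == "R" { n++ } END { print n }'
# prints 2
```

In practice you would pipe the output of squeue itself into awk instead of the sample lines.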
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour. Estimated run times need to be specified when submitting jobs. The time limit set for a certain job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
=== Query a specific active job: scontrol ===&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Check on a pending job ===&lt;br /&gt;
A submitted job will end up in a pending state when there are not enough resources available for it.&lt;br /&gt;
In this example I submit a job, check its status, and after finding out it is &#039;&#039;&#039;pending&#039;&#039;&#039; I check when it will probably start.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
[@nfs01 jobs]$ sbatch hpl_student.job&lt;br /&gt;
 Submitted batch job 740338&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue -l -j 740338&lt;br /&gt;
 Fri Feb 21 15:32:31 2014&lt;br /&gt;
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PENDING       0:00 1-00:00:00      1 (ReqNodeNotAvail)&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue --start -j 740338&lt;br /&gt;
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PD  2014-02-22T15:31:48      1 (ReqNodeNotAvail)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
So it seems this job will probably start the next day, but that is no guarantee that it actually will.&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING or the PENDING state. However, here is a breakdown of all the states your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation. Not all defined Partitions will be available to any given person. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=User:Bohme001&amp;diff=1318</id>
		<title>User:Bohme001</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=User:Bohme001&amp;diff=1318"/>
		<updated>2014-04-02T11:07:50Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Linux system administrator at FB-IT Infrastructure&amp;lt;br&amp;gt;&lt;br /&gt;
Our team is maintaining +200 Linux servers running mainly on RedHat Enterprise Server and some specials on Ubuntu Server.&amp;lt;br&amp;gt;&lt;br /&gt;
All HPC AgroGenomics hosts are running Scientific Linux version 6&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=User:Bohme001&amp;diff=1317</id>
		<title>User:Bohme001</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=User:Bohme001&amp;diff=1317"/>
		<updated>2014-04-02T11:05:27Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: Created page with &amp;quot;Linux system administrator at FB-IT Infrastructure Our team is maintaining +200 Linux servers running mainly on RedHat Enterprise Server and some specials on Ubuntu Server. Al...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Linux system administrator at FB-IT Infrastructure&lt;br /&gt;
Our team is maintaining +200 Linux servers running mainly on RedHat Enterprise Server and some specials on Ubuntu Server.&lt;br /&gt;
All HPC AgroGenomics hosts are running Scientific Linux version 6&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=MPI_on_B4F_cluster&amp;diff=1305</id>
		<title>MPI on B4F cluster</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=MPI_on_B4F_cluster&amp;diff=1305"/>
		<updated>2014-03-28T13:10:45Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* A mvapich2 sbatch example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== A simple &#039;Hello World&#039; example ==&lt;br /&gt;
Consider the following simple MPI version, in C, of the &#039;Hello World&#039; example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;cpp&#039;&amp;gt;&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
#include &amp;lt;mpi.h&amp;gt;&lt;br /&gt;
int main(int argc, char ** argv) {&lt;br /&gt;
  int size,rank,namelen;&lt;br /&gt;
  char processor_name[MPI_MAX_PROCESSOR_NAME];&lt;br /&gt;
  MPI_Init(&amp;amp;argc, &amp;amp;argv);&lt;br /&gt;
  MPI_Comm_rank(MPI_COMM_WORLD,&amp;amp;rank);&lt;br /&gt;
  MPI_Comm_size(MPI_COMM_WORLD,&amp;amp;size);&lt;br /&gt;
  MPI_Get_processor_name(processor_name, &amp;amp;namelen);&lt;br /&gt;
  printf(&amp;quot;Hello MPI! Process %d of %d on %s\n&amp;quot;, rank, size, processor_name);&lt;br /&gt;
  MPI_Finalize();&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Before compiling, make sure that the required compilers are available.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module list&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid conflicts between libraries, the safest approach is to purge all modules first:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module purge&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then load both the gcc and openmpi modules. If modules were purged, slurm needs to be reloaded too.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module load gcc/4.8.1 openmpi/gcc/64/1.6.5 slurm/2.5.7&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Compile the &amp;lt;code&amp;gt;hello_mpi.c&amp;lt;/code&amp;gt; code.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
mpicc hello_mpi.c -o test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If desired, a list of libraries compiled into the executable can be viewed:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
ldd test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  linux-vdso.so.1 =&amp;gt;  (0x00002aaaaaacb000)&lt;br /&gt;
  libmpi.so.1 =&amp;gt; /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1 (0x00002aaaaaccd000)&lt;br /&gt;
  libdl.so.2 =&amp;gt; /lib64/libdl.so.2 (0x00002aaaab080000)&lt;br /&gt;
  libm.so.6 =&amp;gt; /lib64/libm.so.6 (0x00002aaaab284000)&lt;br /&gt;
  libnuma.so.1 =&amp;gt; /usr/lib64/libnuma.so.1 (0x0000003e29400000)&lt;br /&gt;
  librt.so.1 =&amp;gt; /lib64/librt.so.1 (0x00002aaaab509000)&lt;br /&gt;
  libnsl.so.1 =&amp;gt; /lib64/libnsl.so.1 (0x00002aaaab711000)&lt;br /&gt;
  libutil.so.1 =&amp;gt; /lib64/libutil.so.1 (0x00002aaaab92a000)&lt;br /&gt;
  libpthread.so.0 =&amp;gt; /lib64/libpthread.so.0 (0x00002aaaabb2e000)&lt;br /&gt;
  libc.so.6 =&amp;gt; /lib64/libc.so.6 (0x00002aaaabd4b000)&lt;br /&gt;
  /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)&lt;br /&gt;
&lt;br /&gt;
Running the executable on two nodes, with four tasks per node, can be done like this:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
srun --nodes=2 --ntasks-per-node=4 --partition=ABGC --mpi=openmpi ./test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will result in the following output:&lt;br /&gt;
  Hello MPI! Process 4 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 1 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 7 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 6 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 5 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 2 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 0 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 3 of 8 on node010&lt;br /&gt;
&lt;br /&gt;
== A mvapich2 sbatch example ==&lt;br /&gt;
An MPI job using mvapich2 on 32 cores, using the normal compute nodes and the fast InfiniBand interconnect for RDMA traffic.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ module load mvapich2/gcc&lt;br /&gt;
$ vim batch.sh&lt;br /&gt;
 #!/bin/sh&lt;br /&gt;
 #SBATCH --account=projectx&lt;br /&gt;
 #SBATCH --time=0&lt;br /&gt;
 #SBATCH  -n 32&lt;br /&gt;
 #SBATCH --constraint=normalmem&lt;br /&gt;
 #SBATCH --output=output_%j.txt&lt;br /&gt;
 #SBATCH --error=error_output_%j.txt&lt;br /&gt;
 #SBATCH --job-name=MPItest&lt;br /&gt;
 #SBATCH --partition=ABGC_Production&lt;br /&gt;
 #SBATCH --mail-type=ALL&lt;br /&gt;
 #SBATCH --mail-user=user@wur.nl&lt;br /&gt;
 &lt;br /&gt;
 echo &amp;quot;Starting at `date`&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on hosts: $SLURM_NODELIST&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on $SLURM_NNODES nodes.&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on $SLURM_NPROCS processors.&amp;quot;&lt;br /&gt;
 echo &amp;quot;Current working directory is `pwd`&amp;quot;&lt;br /&gt;
 # echo &amp;quot;Env var MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE is $MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE&amp;quot;&lt;br /&gt;
 # export MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE=ib0&lt;br /&gt;
&lt;br /&gt;
 mpirun -iface ib0 -np 32 ./tmf_par.out -NX 480 -NY 240 -alpha  11 -chi 1.3 -psi_b 5e-2  -beta  0.0 -zeta 3.5 -kT 0.10 &lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Program finished with exit code $? at: `date`&amp;quot;&lt;br /&gt;
&lt;br /&gt;
$ sbatch batch.sh&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=MPI_on_B4F_cluster&amp;diff=1304</id>
		<title>MPI on B4F cluster</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=MPI_on_B4F_cluster&amp;diff=1304"/>
		<updated>2014-03-28T13:10:18Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* A mvapich2 sbatch example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== A simple &#039;Hello World&#039; example ==&lt;br /&gt;
Consider the following simple MPI version, in C, of the &#039;Hello World&#039; example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;cpp&#039;&amp;gt;&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
#include &amp;lt;mpi.h&amp;gt;&lt;br /&gt;
int main(int argc, char ** argv) {&lt;br /&gt;
  int size,rank,namelen;&lt;br /&gt;
  char processor_name[MPI_MAX_PROCESSOR_NAME];&lt;br /&gt;
  MPI_Init(&amp;amp;argc, &amp;amp;argv);&lt;br /&gt;
  MPI_Comm_rank(MPI_COMM_WORLD,&amp;amp;rank);&lt;br /&gt;
  MPI_Comm_size(MPI_COMM_WORLD,&amp;amp;size);&lt;br /&gt;
  MPI_Get_processor_name(processor_name, &amp;amp;namelen);&lt;br /&gt;
  printf(&amp;quot;Hello MPI! Process %d of %d on %s\n&amp;quot;, rank, size, processor_name);&lt;br /&gt;
  MPI_Finalize();&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Before compiling, make sure that the required compilers are available.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module list&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid conflicts between libraries, the safest approach is to purge all modules first:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module purge&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then load both the gcc and openmpi modules. If modules were purged, slurm needs to be reloaded too.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module load gcc/4.8.1 openmpi/gcc/64/1.6.5 slurm/2.5.7&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Compile the &amp;lt;code&amp;gt;hello_mpi.c&amp;lt;/code&amp;gt; code.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
mpicc hello_mpi.c -o test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If desired, a list of libraries compiled into the executable can be viewed:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
ldd test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  linux-vdso.so.1 =&amp;gt;  (0x00002aaaaaacb000)&lt;br /&gt;
  libmpi.so.1 =&amp;gt; /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1 (0x00002aaaaaccd000)&lt;br /&gt;
  libdl.so.2 =&amp;gt; /lib64/libdl.so.2 (0x00002aaaab080000)&lt;br /&gt;
  libm.so.6 =&amp;gt; /lib64/libm.so.6 (0x00002aaaab284000)&lt;br /&gt;
  libnuma.so.1 =&amp;gt; /usr/lib64/libnuma.so.1 (0x0000003e29400000)&lt;br /&gt;
  librt.so.1 =&amp;gt; /lib64/librt.so.1 (0x00002aaaab509000)&lt;br /&gt;
  libnsl.so.1 =&amp;gt; /lib64/libnsl.so.1 (0x00002aaaab711000)&lt;br /&gt;
  libutil.so.1 =&amp;gt; /lib64/libutil.so.1 (0x00002aaaab92a000)&lt;br /&gt;
  libpthread.so.0 =&amp;gt; /lib64/libpthread.so.0 (0x00002aaaabb2e000)&lt;br /&gt;
  libc.so.6 =&amp;gt; /lib64/libc.so.6 (0x00002aaaabd4b000)&lt;br /&gt;
  /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)&lt;br /&gt;
&lt;br /&gt;
Running the executable on two nodes, with four tasks per node, can be done like this:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
srun --nodes=2 --ntasks-per-node=4 --partition=ABGC --mpi=openmpi ./test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will result in the following output:&lt;br /&gt;
  Hello MPI! Process 4 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 1 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 7 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 6 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 5 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 2 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 0 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 3 of 8 on node010&lt;br /&gt;
&lt;br /&gt;
== A mvapich2 sbatch example ==&lt;br /&gt;
An MPI job using mvapich2 on 32 cores, using the normal compute nodes and the fast InfiniBand interconnect for RDMA traffic.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ module load mvapich2/gcc&lt;br /&gt;
$ vim batch.sh&lt;br /&gt;
 #!/bin/sh&lt;br /&gt;
 #SBATCH --account=projectx&lt;br /&gt;
 #SBATCH --time=0&lt;br /&gt;
 #SBATCH  -n 32&lt;br /&gt;
 #SBATCH --constraint=normalmem&lt;br /&gt;
 #SBATCH --output=output_%j.txt&lt;br /&gt;
 #SBATCH --error=error_output_%j.txt&lt;br /&gt;
 #SBATCH --job-name=MPItest&lt;br /&gt;
 #SBATCH --partition=ABGC_Production&lt;br /&gt;
 #SBATCH --mail-type=ALL&lt;br /&gt;
 #SBATCH --mail-user=user@wur.nl&lt;br /&gt;
 &lt;br /&gt;
 echo &amp;quot;Starting at `date`&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on hosts: $SLURM_NODELIST&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on $SLURM_NNODES nodes.&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on $SLURM_NPROCS processors.&amp;quot;&lt;br /&gt;
 echo &amp;quot;Current working directory is `pwd`&amp;quot;&lt;br /&gt;
 echo &amp;quot;Env var MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE is $MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE&amp;quot;&lt;br /&gt;
&lt;br /&gt;
 # export MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE=ib0&lt;br /&gt;
&lt;br /&gt;
 mpirun -iface ib0 -np 32 ./tmf_par.out -NX 480 -NY 240 -alpha  11 -chi 1.3 -psi_b 5e-2  -beta  0.0 -zeta 3.5 -kT 0.10 &lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Program finished with exit code $? at: `date`&amp;quot;&lt;br /&gt;
&lt;br /&gt;
$ sbatch batch.sh&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=MPI_on_B4F_cluster&amp;diff=1303</id>
		<title>MPI on B4F cluster</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=MPI_on_B4F_cluster&amp;diff=1303"/>
		<updated>2014-03-28T13:08:02Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* A simple &amp;#039;Hello World&amp;#039; example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== A simple &#039;Hello World&#039; example ==&lt;br /&gt;
Consider the following simple MPI version, in C, of the &#039;Hello World&#039; example:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;cpp&#039;&amp;gt;&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
#include &amp;lt;mpi.h&amp;gt;&lt;br /&gt;
int main(int argc, char ** argv) {&lt;br /&gt;
  int size,rank,namelen;&lt;br /&gt;
  char processor_name[MPI_MAX_PROCESSOR_NAME];&lt;br /&gt;
  MPI_Init(&amp;amp;argc, &amp;amp;argv);&lt;br /&gt;
  MPI_Comm_rank(MPI_COMM_WORLD,&amp;amp;rank);&lt;br /&gt;
  MPI_Comm_size(MPI_COMM_WORLD,&amp;amp;size);&lt;br /&gt;
  MPI_Get_processor_name(processor_name, &amp;amp;namelen);&lt;br /&gt;
  printf(&amp;quot;Hello MPI! Process %d of %d on %s\n&amp;quot;, rank, size, processor_name);&lt;br /&gt;
  MPI_Finalize();&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Before compiling, make sure that the required compilers are available.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module list&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid conflicts between libraries, the safest approach is to purge all modules first:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module purge&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then load both the gcc and openmpi modules. If modules were purged, slurm needs to be reloaded too.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
module load gcc/4.8.1 openmpi/gcc/64/1.6.5 slurm/2.5.7&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Compile the &amp;lt;code&amp;gt;hello_mpi.c&amp;lt;/code&amp;gt; code.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
mpicc hello_mpi.c -o test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
If desired, the list of shared libraries the executable is dynamically linked against can be viewed:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
ldd test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  linux-vdso.so.1 =&amp;gt;  (0x00002aaaaaacb000)&lt;br /&gt;
  libmpi.so.1 =&amp;gt; /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1 (0x00002aaaaaccd000)&lt;br /&gt;
  libdl.so.2 =&amp;gt; /lib64/libdl.so.2 (0x00002aaaab080000)&lt;br /&gt;
  libm.so.6 =&amp;gt; /lib64/libm.so.6 (0x00002aaaab284000)&lt;br /&gt;
  libnuma.so.1 =&amp;gt; /usr/lib64/libnuma.so.1 (0x0000003e29400000)&lt;br /&gt;
  librt.so.1 =&amp;gt; /lib64/librt.so.1 (0x00002aaaab509000)&lt;br /&gt;
  libnsl.so.1 =&amp;gt; /lib64/libnsl.so.1 (0x00002aaaab711000)&lt;br /&gt;
  libutil.so.1 =&amp;gt; /lib64/libutil.so.1 (0x00002aaaab92a000)&lt;br /&gt;
  libpthread.so.0 =&amp;gt; /lib64/libpthread.so.0 (0x00002aaaabb2e000)&lt;br /&gt;
  libc.so.6 =&amp;gt; /lib64/libc.so.6 (0x00002aaaabd4b000)&lt;br /&gt;
  /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)&lt;br /&gt;
&lt;br /&gt;
Running the executable on two nodes, with four tasks per node, can be done like this:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
srun --nodes=2 --ntasks-per-node=4 --partition=ABGC --mpi=openmpi ./test_hello_world&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This will result in output similar to the following (the order of the lines may vary between runs):&lt;br /&gt;
  Hello MPI! Process 4 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 1 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 7 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 6 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 5 of 8 on node011&lt;br /&gt;
  Hello MPI! Process 2 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 0 of 8 on node010&lt;br /&gt;
  Hello MPI! Process 3 of 8 on node010&lt;br /&gt;
&lt;br /&gt;
== A mvapich2 sbatch example ==&lt;br /&gt;
An MPI job using mvapich2 on 32 cores, using the fast InfiniBand interconnect for RDMA traffic.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ module load mvapich2/gcc&lt;br /&gt;
$ vim batch.sh&lt;br /&gt;
 #!/bin/sh&lt;br /&gt;
 #SBATCH --account=projectx&lt;br /&gt;
 #SBATCH --time=0&lt;br /&gt;
 #SBATCH  -n 32&lt;br /&gt;
 #SBATCH --constraint=normalmem&lt;br /&gt;
 #SBATCH --output=output_%j.txt&lt;br /&gt;
 #SBATCH --error=error_output_%j.txt&lt;br /&gt;
 #SBATCH --job-name=MPItest&lt;br /&gt;
 #SBATCH --partition=ABGC_Production&lt;br /&gt;
 #SBATCH --mail-type=ALL&lt;br /&gt;
 #SBATCH --mail-user=user@wur.nl&lt;br /&gt;
 &lt;br /&gt;
 echo &amp;quot;Starting at `date`&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on hosts: $SLURM_NODELIST&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on $SLURM_NNODES nodes.&amp;quot;&lt;br /&gt;
 echo &amp;quot;Running on $SLURM_NPROCS processors.&amp;quot;&lt;br /&gt;
 echo &amp;quot;Current working directory is `pwd`&amp;quot;&lt;br /&gt;
 echo &amp;quot;Env var MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE is $MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE&amp;quot;&lt;br /&gt;
&lt;br /&gt;
 # export MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE=ib0&lt;br /&gt;
&lt;br /&gt;
 mpirun -iface ib0 -np 32 ./tmf_par.out -NX 480 -NY 240 -alpha  11 -chi 1.3 -psi_b 5e-2  -beta  0.0 -zeta 3.5 -kT 0.10 &lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Program finished with exit code $? at: `date`&amp;quot;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1137</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1137"/>
		<updated>2014-02-21T14:47:39Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Check on a pending job */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (in Slurm called partitions): a production and a research queue. The production queue provides a higher priority to jobs (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
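The series in this script is the Bailey-Borwein-Plouffe (BBP) formula; each extra term contributes roughly 1.2 decimal digits. A quick way to sanity-check the series at small precision (a sketch in plain Python 3, not part of the job script):&lt;br /&gt;

```python
# Sanity check of the BBP series at small precision (sketch only).
from decimal import Decimal, getcontext

D = Decimal
getcontext().prec = 50  # digits of working precision

# Same series as the job script above, truncated to 30 terms.
p = sum(D(1) / 16**k * (D(4)/(8*k+1) - D(2)/(8*k+4)
                        - D(1)/(8*k+5) - D(1)/(8*k+6))
        for k in range(30))

print(str(p)[:12])  # prints 3.1415926535
```
Note that the job script's sum is truncated at 411 terms, so only roughly the first 500 digits of its much longer output are actually converged.&lt;br /&gt;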
=== Loading modules ===&lt;br /&gt;
In order for this script to run, Python 3, which is not the default Python version on the cluster, first needs to be loaded into your environment. Availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available, and it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
&lt;br /&gt;
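Because raw minute counts are easy to misread for long jobs, the day-hour forms can be safer. A small helper (illustrative only, not part of Slurm) that converts minutes into the days-hours:minutes:seconds form:&lt;br /&gt;

```python
# Convert a minute count into Slurm's "days-hours:minutes:seconds" form.
# Illustrative helper; Slurm itself also accepts plain minutes directly.
def to_slurm_time(minutes):
    days, rem = divmod(minutes, 24 * 60)   # whole days
    hours, mins = divmod(rem, 60)          # remaining hours and minutes
    return f"{days}-{hours:02d}:{mins:02d}:00"

print(to_slurm_time(1200))  # the example above: prints 0-20:00:00
```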
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default, it is deliberately relatively small — 1024 MB per node. If your job uses more than that, you’ll get an error that your job Exceeded job memory limit. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could have very different values when run at different times.&lt;br /&gt;
&lt;br /&gt;
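The KB-to-MB arithmetic described above can be sketched as follows; the 20% headroom is an illustrative choice, not a site rule:&lt;br /&gt;

```python
# Turn a MaxRSS value from sacct (reported in KB) into a --mem request
# (in MB), rounding up and adding headroom since --mem is a hard limit.
def suggest_mem_mb(maxrss_kb, headroom_pct=20):
    used_mb = maxrss_kb // 1024                      # KB -> whole MB
    return used_mb * (100 + headroom_pct) // 100 + 1  # headroom, round up

print(suggest_mem_mb(1536000))  # 1500 MB used -> prints 1801
```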
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most the specified number of tasks, and provides sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the minimum number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it will be both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done by using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as option the job will specifically be scheduled to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) node for a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs that are running at that time on the cluster. For the example submitted with the &#039;sbatch&#039; command above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is set at one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a specific job can be checked using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
=== Query a specific active job: scontrol ===&lt;br /&gt;
Show all the details of a currently active (not completed) job.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Check on a pending job ===&lt;br /&gt;
A submitted job can end up in a pending state when not enough resources are available for it.&lt;br /&gt;
In this example I submit a job, check the status, and after finding out it is &#039;&#039;&#039;pending&#039;&#039;&#039;, I check when it will probably start.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
[@nfs01 jobs]$ sbatch hpl_student.job&lt;br /&gt;
 Submitted batch job 740338&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue -l -j 740338&lt;br /&gt;
 Fri Feb 21 15:32:31 2014&lt;br /&gt;
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PENDING       0:00 1-00:00:00      1 (ReqNodeNotAvail)&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue --start -j 740338&lt;br /&gt;
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PD  2014-02-22T15:31:48      1 (ReqNodeNotAvail)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
So it seems this job will probably start the next day, but that&#039;s no guarantee it will.&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be either in the RUNNING or the PENDING state. However, here is a breakdown of all the states your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|}&lt;br /&gt;
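When post-processing squeue or sacct output in scripts, the short codes above can be mapped back to full state names with a simple lookup table (a sketch):&lt;br /&gt;

```python
# Map Slurm short state codes (as shown by squeue) to full state names,
# matching the table above.
STATE_CODES = {
    "CA": "CANCELLED", "CD": "COMPLETED", "CF": "CONFIGURING",
    "CG": "COMPLETING", "F": "FAILED", "NF": "NODE_FAIL",
    "PD": "PENDING", "R": "RUNNING", "S": "SUSPENDED", "TO": "TIMEOUT",
}

print(STATE_CODES["PD"])  # prints PENDING
```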
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation. Not all defined Partitions will be available to any given person. E.g., Master students will only have the Student Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1136</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1136"/>
		<updated>2014-02-21T14:45:22Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Query a specific active job: scontrol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (in Slurm called partitions): a production and a research queue. The production queue provides a higher priority to jobs (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
In order for this script to run, Python 3, which is not the default Python version on the cluster, first needs to be loaded into your environment. Availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available, and it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string, and the account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. A bare number is interpreted as minutes, so in this example the job will run for a maximum of 1200 minutes (20 hours).&lt;br /&gt;
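The days-hours:minutes:seconds form is often more readable than a plain minutes value. A minimal shell sketch of the conversion (the helper name is illustrative, not part of Slurm):

```shell
#!/bin/bash
# Convert a plain minutes value into Slurm's days-hours:minutes:seconds form.
# (minutes_to_slurm is an illustrative helper name, not a Slurm command.)
minutes_to_slurm() {
  local total=$1
  local days=$(( total / 1440 ))           # 1440 minutes per day
  local hours=$(( (total % 1440) / 60 ))
  local mins=$(( total % 60 ))
  printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
}

minutes_to_slurm 1200   # the example above -> 0-20:00:00
```

So `--time=1200` and `--time=0-20:00:00` request the same limit.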
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an &amp;quot;Exceeded job memory limit&amp;quot; error. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the job you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it somewhat larger than that, since you’re defining a hard upper limit). If your job completed long ago you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
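The KB-to-MB conversion with some headroom can be sketched in shell; this is a minimal example (the helper name is illustrative), assuming a MaxRSS value as sacct prints it, possibly with a trailing 'K':

```shell
#!/bin/bash
# Turn a MaxRSS value from sacct (KB, often with a trailing 'K') into a
# --mem value in MB with roughly 20% headroom.
# (maxrss_to_mem is an illustrative helper name, not a Slurm command.)
maxrss_to_mem() {
  local kb=${1%K}                        # strip a trailing 'K' if present
  local mb=$(( (kb + 1023) / 1024 ))     # KB -> MB, rounded up
  echo $(( mb * 12 / 10 ))               # add ~20% safety margin
}

maxrss_to_mem 1846532K   # -> 2164, i.e. use #SBATCH --mem=2164
```

Integer arithmetic is good enough here, since --mem only takes whole MB anyway.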
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be spread across multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide a single number, it is used as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as option the job will specifically be scheduled to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
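In scripts it is handy to capture the job id that sbatch reports, so it can be passed to squeue or scancel later. A minimal sketch, with sbatch's &quot;Submitted batch job ...&quot; output mocked since it is only produced on the cluster:

```shell
#!/bin/bash
# Capture the job id from sbatch's "Submitted batch job <id>" message.
# The output is mocked here for illustration; on the cluster you would do:
#   out=$(sbatch run_calc_pi.sh)
out="Submitted batch job 740338"
jobid=${out##* }      # keep the last whitespace-separated field
echo "$jobid"         # -> 740338

# The id can then be used for monitoring or cancellation, e.g.:
#   squeue -l -j "$jobid"
#   scancel "$jobid"
```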
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted with the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) node for a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that time. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a certain job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
=== Query a specific active job: scontrol ===&lt;br /&gt;
Show all details of a currently active job (not of a completed job).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Check on a pending job ===&lt;br /&gt;
A submitted job will end up in a pending state when there are not enough resources available for it.&lt;br /&gt;
In this example a job is submitted, its status is checked, and since it turns out to be pending, its estimated start time is queried.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
[@nfs01 jobs]$ sbatch hpl_student.job&lt;br /&gt;
 Submitted batch job 740338&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue -l -j 740338&lt;br /&gt;
 Fri Feb 21 15:32:31 2014&lt;br /&gt;
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PENDING       0:00 1-00:00:00      1 (ReqNodeNotAvail)&lt;br /&gt;
&lt;br /&gt;
[@nfs01 jobs]$ squeue --start -j 740338&lt;br /&gt;
  JOBID PARTITION     NAME     USER  ST           START_TIME  NODES NODELIST(REASON)&lt;br /&gt;
 740338 ABGC_Stud HPLstude bohme999  PD  2014-02-22T15:31:48      1 (ReqNodeNotAvail)&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING or the PENDING state. Here is a breakdown of all the states your job can be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|}&lt;br /&gt;
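In monitoring scripts, the two-letter ST codes that squeue prints can be expanded into the full state names from the table above with a simple lookup; a minimal sketch (the helper name is illustrative):

```shell
#!/bin/bash
# Expand squeue's two-letter ST codes into full state names,
# covering the codes listed in the table above.
# (state_name is an illustrative helper name, not a Slurm command.)
state_name() {
  case $1 in
    CA) echo CANCELLED ;;   CD) echo COMPLETED ;;
    CF) echo CONFIGURING ;; CG) echo COMPLETING ;;
    F)  echo FAILED ;;      NF) echo NODE_FAIL ;;
    PD) echo PENDING ;;     R)  echo RUNNING ;;
    S)  echo SUSPENDED ;;   TO) echo TIMEOUT ;;
    *)  echo UNKNOWN ;;
  esac
}

state_name PD   # -> PENDING
state_name R    # -> RUNNING
```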
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting via Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions are available to every user. E.g., Master students will only have the Student Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1135</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1135"/>
		<updated>2014-02-21T14:41:01Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Monitoring submitted and quering jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (in slurm called partitions) : a production and a research queue. The production queue provides a higher priority to jobs (20) then the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple python3 script that should calculate Pi to 1 million digits:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
For this script to run, Python3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available, and it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string, and the account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. A bare number is interpreted as minutes, so in this example the job will run for a maximum of 1200 minutes (20 hours).&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an &amp;quot;Exceeded job memory limit&amp;quot; error. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the job you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it somewhat larger than that, since you’re defining a hard upper limit). If your job completed long ago you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be spread across multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide a single number, it is used as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as option the job will specifically be scheduled to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) node for a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of the jobs currently running on the cluster. For the &#039;sbatch&#039; submission example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a specific job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
=== Query a specific active job: scontrol ===&lt;br /&gt;
Show all the details of a currently active job (not a completed job).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from the queue: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
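This section is still to be written; as a minimal sketch (assuming the standard &#039;salloc&#039; command and an example &#039;research&#039; partition name), an interactive allocation could look like this:&lt;br /&gt;

```shell
# Sketch only: request an interactive allocation of 4 CPUs for 30 minutes
# on the 'research' partition (the partition name is an example), then open
# a shell inside it; typing 'exit' releases the allocation.
cmd="salloc --ntasks=1 --cpus-per-task=4 --time=30 --partition=research"
if command -v salloc >/dev/null 2>&1; then
    $cmd bash
else
    # Not on a Slurm cluster: just show the command that would run.
    echo "would run: $cmd bash"
fi
```

Unlike sbatch, salloc blocks until the resources are granted, so it is only suitable for interactive work.&lt;br /&gt;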
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING (R) or PENDING (PD) state. However, here is a breakdown of all the states your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1134</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1134"/>
		<updated>2014-02-21T14:40:33Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Query a specific active job: scontrol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (in Slurm called partitions): a production and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python3 script that calculates Pi to a large number of digits:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
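Before submitting, the series can be sanity-checked locally at much lower precision. This is a sketch, not part of the wiki&#039;s script:&lt;br /&gt;

```python
# Sketch: the same BBP-style series as above, truncated to 50 terms at
# 60 digits of working precision; enough to verify the leading digits of Pi.
from decimal import Decimal, getcontext

D = Decimal
getcontext().prec = 60

p = sum(D(1)/16**k * (D(4)/(8*k+1) - D(2)/(8*k+4) - D(1)/(8*k+5) - D(1)/(8*k+6))
        for k in range(50))

print(str(p)[:15])  # → 3.1415926535897
```

Each term of this series contributes roughly 1.2 correct decimal digits, which is why the full script needs a large range to justify its precision setting.&lt;br /&gt;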
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
For this script to run, Python3, which is not the default Python version on the cluster, first needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
In the list you should note that python3 is indeed available to be loaded, which then can be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge the resources used by this job to the specified account. The account is an arbitrary string, and the account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job allocation. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
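To illustrate these formats (a sketch using plain shell arithmetic, not Slurm itself), the same 1200-minute limit can be written in several equivalent ways:&lt;br /&gt;

```shell
# Sketch: express the 1200-minute limit in sbatch's other accepted formats.
minutes=1200
hours=$(( minutes / 60 ))        # 20
days=$(( hours / 24 ))           # 0
rem_hours=$(( hours % 24 ))      # 20
echo "--time=${minutes}"                  # minutes        -> --time=1200
echo "--time=${hours}:00:00"              # hours:min:sec  -> --time=20:00:00
echo "--time=${days}-${rem_hours}:00"     # days-hours:min -> --time=0-20:00
```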
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an &amp;quot;Exceeded job memory limit&amp;quot; error. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number, the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that&#039;s much larger than needed for most jobs) and then use sacct to look at how much memory your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the job you&#039;re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you&#039;re defining a hard upper limit). If your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you&#039;re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
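For example (a sketch using a made-up MaxRSS value of 3500000 KB), the conversion and headroom could be computed as:&lt;br /&gt;

```shell
# Sketch: convert a MaxRSS value reported by sacct (in KB) into an
# approximate --mem request in MB, adding ~20% headroom.
# 3500000 KB is an example value, not real job data.
maxrss_kb=3500000
used_mb=$(( maxrss_kb / 1024 ))          # ~3417 MB actually used
request_mb=$(( used_mb + used_mb / 5 ))  # ~20% headroom -> 4100
echo "#SBATCH --mem=${request_mb}"
```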
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the minimum number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it is treated as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; instead, the job will specifically be scheduled to one of the fat nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) node for a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring and querying submitted jobs ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of the jobs currently running on the cluster. For the &#039;sbatch&#039; submission example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a specific job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
=== Query a specific active job: scontrol ===&lt;br /&gt;
Show all the details of a currently active job (not a completed job).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from the queue: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
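This section is still to be written; as a minimal sketch (assuming the standard &#039;salloc&#039; command and an example &#039;research&#039; partition name), an interactive allocation could look like this:&lt;br /&gt;

```shell
# Sketch only: request an interactive allocation of 4 CPUs for 30 minutes
# on the 'research' partition (the partition name is an example), then open
# a shell inside it; typing 'exit' releases the allocation.
cmd="salloc --ntasks=1 --cpus-per-task=4 --time=30 --partition=research"
if command -v salloc >/dev/null 2>&1; then
    $cmd bash
else
    # Not on a Slurm cluster: just show the command that would run.
    echo "would run: $cmd bash"
fi
```

Unlike sbatch, salloc blocks until the resources are granted, so it is only suitable for interactive work.&lt;br /&gt;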
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING (R) or PENDING (PD) state. However, here is a breakdown of all the states your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1133</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1133"/>
		<updated>2014-02-21T14:40:02Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* monitoring submitted jobs: squeue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has two queues (called partitions in Slurm): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple python3 script that should calculate Pi to 1 million digits:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
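Before submitting anything to the cluster, the series can be sanity-checked locally at a much smaller precision; this sketch uses the same formula with only 20 terms (the precision value below is chosen purely for this quick check):&lt;br /&gt;

```python
from decimal import Decimal as D, getcontext

# Small precision: enough for a quick local check of the series.
getcontext().prec = 30
p = sum(D(1)/16**k * (D(4)/(8*k+1) - D(2)/(8*k+4)
                      - D(1)/(8*k+5) - D(1)/(8*k+6))
        for k in range(20))
print(str(p)[:10])  # prints 3.14159265
```

Each term of this series contributes roughly one extra correct decimal digit, so 20 terms already reproduce the leading digits of Pi.&lt;br /&gt;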
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
For this script to run, Python3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number would be advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
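As an illustration of these formats, the same 20-hour (1200-minute) limit can be written in several equivalent ways; only one such line should appear in a real script:&lt;br /&gt;

```bash
#SBATCH --time=1200          # minutes
#SBATCH --time=20:00:00      # hours:minutes:seconds
#SBATCH --time=0-20          # days-hours
```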
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an error stating that your job &#039;Exceeded job memory limit&#039;. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number, the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that is much larger than most jobs need) and then use sacct to look at how much memory your job is actually using or has used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the job you are interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you are defining a hard upper limit). If your job completed long ago, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you are not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
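A minimal sketch of that calculation (the MaxRSS value below is hypothetical, not taken from a real job):&lt;br /&gt;

```bash
# Suppose sacct reported a MaxRSS of 1843200 (KB) for the job.
maxrss_kb=1843200
mem_mb=$(( maxrss_kb / 1024 ))        # 1800 MB actually used
suggested=$(( mem_mb + mem_mb / 5 ))  # add ~20% headroom for the hard limit
echo "--mem=${suggested}"             # prints --mem=2160
```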
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it is treated as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done by using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. Using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as the option instead will schedule the job specifically to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
Assuming the script was named &#039;run_calc_pi.sh&#039;, it can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
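sbatch prints a line of the form &#039;Submitted batch job &amp;lt;id&amp;gt;&#039;. If you want to feed that id to squeue or scancel later, it can be extracted as sketched below (the output line is simulated here, since this example does not actually submit anything):&lt;br /&gt;

```bash
# Simulated sbatch output; a real call would be: sbatch run_calc_pi.sh
submit_output="Submitted batch job 4321"
jobid=$(echo "$submit_output" | awk '{print $4}')
echo "$jobid"   # prints 4321
```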
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted with the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
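If the installed Slurm version supports job arrays (an assumption; check the sbatch man page on the cluster), the same ten runs can also be expressed as a single array job, where SLURM_ARRAY_TASK_ID takes the values 1 through 10 inside the script:&lt;br /&gt;

```bash
#SBATCH --array=1-10
# Inside the batch script, select the per-task input, e.g.:
# ./myprogram input_${SLURM_ARRAY_TASK_ID}.dat
```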
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) node for a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring and querying submitted jobs ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that time. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a certain job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all details of a currently active job (this does not work for completed jobs).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING or the PENDING state. However, here is a breakdown of all the states your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1132</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1132"/>
		<updated>2014-02-21T14:37:49Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Queues */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has two queues (called partitions in Slurm): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple python3 script that should calculate Pi to 1 million digits:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
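Before submitting anything to the cluster, the series can be sanity-checked locally at a much smaller precision; this sketch uses the same formula with only 20 terms (the precision value below is chosen purely for this quick check):&lt;br /&gt;

```python
from decimal import Decimal as D, getcontext

# Small precision: enough for a quick local check of the series.
getcontext().prec = 30
p = sum(D(1)/16**k * (D(4)/(8*k+1) - D(2)/(8*k+4)
                      - D(1)/(8*k+5) - D(1)/(8*k+6))
        for k in range(20))
print(str(p)[:10])  # prints 3.14159265
```

Each term of this series contributes roughly one extra correct decimal digit, so 20 terms already reproduce the leading digits of Pi.&lt;br /&gt;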
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
For this script to run, Python3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number would be advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
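As an illustration of these formats, the same 20-hour (1200-minute) limit can be written in several equivalent ways; only one such line should appear in a real script:&lt;br /&gt;

```bash
#SBATCH --time=1200          # minutes
#SBATCH --time=20:00:00      # hours:minutes:seconds
#SBATCH --time=0-20          # days-hours
```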
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an error stating that your job &#039;Exceeded job memory limit&#039;. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the job you are interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you are defining a hard upper limit). If your job completed some time ago, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you are not enforcing an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
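To make the conversion concrete, here is a small sketch; the MaxRSS value is hypothetical, and the 20% headroom is an arbitrary safety margin:

```shell
# Hypothetical MaxRSS value as reported by sacct, in KB
maxrss_kb=1536000
# Convert to MB, then add ~20% headroom since --mem is a hard upper limit
mem_mb=$(( maxrss_kb / 1024 ))
mem_request_mb=$(( mem_mb + mem_mb / 5 ))
echo "--mem=${mem_request_mb}"    # prints --mem=1800
```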
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and allocates sufficient resources accordingly. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
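As a sketch of what this means in practice (assuming a standard Slurm setup), a command launched with srun inside the batch script is started once per requested task:

```bash
#SBATCH --ntasks=4

srun hostname    # started 4 times, once per allocated task
```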
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be spread across multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide a single number, it is used as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
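If a range of nodes is acceptable, the same flag also takes a minimum-maximum pair:

```bash
#SBATCH --nodes=2-4    # at least 2 and at most 4 nodes
```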
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as an option, the job will specifically be scheduled to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
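If the jobs must run one after another instead of concurrently, the --dependency option of sbatch can chain them. This is only a sketch; it assumes a Slurm version whose sbatch supports the --parsable flag for extracting the job id:

```bash
# Submit the first job, then make each subsequent job wait for
# the previous one to complete successfully
jobid=$(sbatch --parsable runscript_1.sh)
for i in `seq 2 10`; do
  jobid=$(sbatch --parsable --dependency=afterok:${jobid} runscript_$i.sh)
done
```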
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first node (or even all nodes) of a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Monitoring submitted jobs: squeue ===&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
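squeue also accepts filters to narrow the listing; for example (flags from the standard squeue interface):

```bash
squeue -u $USER        # only your own jobs
squeue -p research     # only jobs in the research partition
squeue -t PD           # only pending jobs
```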
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs that are running on the cluster at that moment. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a specific job can be inspected using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either queued or already running, you can remove it using the &#039;scancel&#039; command, which takes the job id as a parameter. For the example above, this would be done as follows:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
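scancel can also act on groups of jobs; for example (standard scancel flags):

```bash
scancel -u $USER                 # cancel all of your own jobs
scancel -t PENDING -u $USER      # cancel only your pending jobs
```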
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING or the PENDING state. However, here is a breakdown of all the states your job can be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions are available to any given person. E.g., Master students will only have the Student Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1131</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1131"/>
		<updated>2014-02-21T14:37:27Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Queues */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has two queues (called partitions in SLURM): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account is authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple python3 script that calculates many digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
In order for this script to run, Python 3, which is not the default Python version on the cluster, first needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string and may be changed after job submission using the scontrol command. For WUR users, a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. A bare number is interpreted as minutes, so in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an &amp;quot;Exceeded job memory limit&amp;quot; error. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the job you are interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you are defining a hard upper limit). If your job completed some time ago, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you are not enforcing an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and allocates sufficient resources accordingly. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be spread across multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide a single number, it is used as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as an option, the job will specifically be scheduled to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first node (or even all nodes) of a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Monitoring submitted jobs: squeue ===&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs that are running on the cluster at that moment. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a specific job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active (not completed) job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
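A minimal sketch of an interactive allocation with &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; (the partition name below is only an example):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# request 4 tasks for 30 minutes and start a shell in the allocation&lt;br /&gt;
salloc --ntasks=4 --time=30 --partition=research&lt;br /&gt;
# run commands inside the allocation with srun, then exit to release it&lt;br /&gt;
srun hostname&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;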
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the Running or PenDing state. However, here is a breakdown of all the states that your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions are available to every user. For example, Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager web page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1130</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1130"/>
		<updated>2014-02-21T14:34:35Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* monitoring submitted jobs: squeue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has two queues (in Slurm called partitions): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
In order for this script to run, Python3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string and may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
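For instance, the following are equivalent ways of writing the same 1200-minute (20-hour) limit; a script would contain only one of them:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200        # minutes&lt;br /&gt;
#SBATCH --time=20:00:00    # hours:minutes:seconds&lt;br /&gt;
#SBATCH --time=0-20        # days-hours&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;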
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with an &#039;Exceeded job memory limit&#039; error. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you&#039;re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you&#039;re defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you&#039;re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could have very different values when run at different times.&lt;br /&gt;
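The KB-to-MB conversion and headroom step described above can be sketched in plain bash (the MaxRSS figure below is a made-up example, not real sacct output):&lt;br /&gt;

```shell
# Suppose 'sacct -o MaxRSS -j JOBID' reported 3276800K (hypothetical value).
maxrss_kb=3276800
# Convert KB to MB, since --mem is specified in MB.
mem_mb=$(( maxrss_kb / 1024 ))
# Add roughly 20% headroom, because --mem defines a hard upper limit.
mem_request=$(( mem_mb + mem_mb / 5 ))
echo "$mem_request"   # value to use with: #SBATCH --mem=...
```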
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and ensures sufficient resources are provided. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it is treated as both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as option the job will specifically be scheduled to one of the fat nodes. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted with the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
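An alternative to such a submit loop is a Slurm job array, where one script is submitted once and &amp;lt;code&amp;gt;$SLURM_ARRAY_TASK_ID&amp;lt;/code&amp;gt; distinguishes the tasks (the script and file names below are only examples):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# submit 10 array tasks, numbered 1..10&lt;br /&gt;
sbatch --array=1-10 runscript.sh&lt;br /&gt;
# inside runscript.sh, select input by task id, e.g.:&lt;br /&gt;
# input=data_${SLURM_ARRAY_TASK_ID}.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;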
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) nodes of a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive one-hour session running HPL on eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== monitoring submitted jobs: squeue ===&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs that are currently running on the cluster. For the example on submitting with the &#039;sbatch&#039; command, the output may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a specific job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active (not completed) job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the Running or PenDing state. However, here is a breakdown of all the states that your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions are available to every user. For example, Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager web page.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1129</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1129"/>
		<updated>2014-02-21T13:55:45Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* = Interactive X11/GUI jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (in Slurm called partitions): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple python3 script that computes a high-precision approximation of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
In order for this script to run, Python3, which is not the default Python version on the cluster, must first be loaded into your environment. Availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
In the list you should see that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. A plain number is interpreted as minutes, so in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
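For instance, the following specifications are equivalent ways of requesting a 20-hour limit, using the formats listed above:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200        # minutes&lt;br /&gt;
#SBATCH --time=20:00:00    # hours:minutes:seconds&lt;br /&gt;
#SBATCH --time=0-20        # days-hours&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;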
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default, it is deliberately relatively small — 1024 MB per node. If your job uses more than that, you’ll get an error that your job Exceeded job memory limit. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could have very different values when run at different times.&lt;br /&gt;
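For example, to check the memory use of job 3385 (a job id that also appears in the sacct examples further on), including a start time in case the job finished a while ago:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j 3385 -S 2014-01-01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
If MaxRSS reports e.g. 2097152 (KB), that is 2048 MB, so setting --mem a little above 2048 would be a reasonable choice.&lt;br /&gt;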
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it will be both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done by using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; as option the job will specifically be scheduled to one of the fat nodes. &lt;br /&gt;
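Correspondingly, a job targeted at the fat nodes would carry this directive:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=largemem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;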
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) node for a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session for 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that moment. For the job submitted with the &#039;sbatch&#039; command in the example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a specific job can be checked with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
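scancel can also select jobs by attribute instead of by job id. For instance, all of your own jobs (pending and running) can be cancelled at once with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel -u $USER&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;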
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
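As a minimal sketch of standard Slurm usage (not specific to this cluster): &#039;salloc&#039; obtains a resource allocation and starts a shell within it; job steps can then be launched on the allocated resources with &#039;srun&#039;, and the allocation is released when the shell is exited.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
salloc --ntasks=4 --time=1:00:00&lt;br /&gt;
srun hostname&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;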
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be either in the Running (R) or PenDing (PD) state. However, here is a breakdown of all the states that your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1128</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1128"/>
		<updated>2014-02-21T13:55:20Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Submitting multiple jobs */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (in Slurm called partitions): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple python3 script that computes a high-precision approximation of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
In order for this script to run, Python3, which is not the default Python version on the cluster, must first be loaded into your environment. Availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
In the list you should see that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. A plain number is interpreted as minutes, so in this example the job will run for a maximum of 1200 minutes.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default, it is deliberately relatively small — 1024 MB per node. If your job uses more than that, you’ll get an error that your job Exceeded job memory limit. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could have very different values when run at different times.&lt;br /&gt;
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&lt;br /&gt;
When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the number of nodes using the &amp;lt;code&amp;gt;-N&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--nodes&amp;lt;/code&amp;gt; flag. If you provide only one number, it will be both the minimum and the maximum. For instance:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --nodes=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should force your job to be scheduled to a single node.&lt;br /&gt;
&lt;br /&gt;
Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done by using the &amp;lt;code&amp;gt;-C&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;--constraint&amp;lt;/code&amp;gt; flag.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --constraint=normalmem&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
The example above will result in jobs being scheduled to the regular compute nodes. By using &amp;lt;code&amp;gt;largemem&amp;lt;/code&amp;gt; instead, the job will specifically be scheduled to one of the fat nodes.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Interactive X11/GUI jobs ===&lt;br /&gt;
Slurm will forward your X11 credentials to the first (or even all) node for a job with the (undocumented) --x11 option.&lt;br /&gt;
For example, an interactive session of 1 hour with HPL using eight cores can be started with:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;module load hpl/2.1&lt;br /&gt;
srun --ntasks=1 --cpus-per-task=8 --time=1:00:00 --pty --x11=first xhpl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, its status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of jobs, such as the time limit.&lt;br /&gt;
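Two commonly useful filters, for instance (using squeue&#039;s standard -u/--user and -t/--states options):

```shell
squeue -u $USER             # show only your own jobs
squeue -u $USER -t RUNNING  # show only your running jobs
```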
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that time. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a particular job can be checked with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
[nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either queued or already running, you can remove it using the &#039;scancel&#039; command, which takes the job id as a parameter. For the example above, this would be done as follows:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
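Besides a single job id, scancel can also match jobs by attribute; for example, using its standard --user and --name options:

```shell
scancel -u $USER           # cancel all of your own jobs
scancel --name=calc_pi.py  # cancel jobs with a given job name
```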
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
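As a minimal sketch of the interactive workflow (resource values are arbitrary): salloc obtains an allocation and opens a shell in it, srun launches tasks inside the allocation, and exiting the shell releases the resources.

```shell
salloc --ntasks=1 --time=1:00:00  # request an allocation; opens a subshell
srun hostname                     # run a command on the allocated node
exit                              # leave the subshell and release the allocation
```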
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING or the PENDING state. Here is a breakdown of all the states your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|-&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. E.g., Master students will only have the Student Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1126</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1126"/>
		<updated>2014-01-16T08:25:03Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Get overview of past and current jobs: sacct */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (in SLURM called partitions): a production queue and a research queue. The production queue provides a higher priority to jobs (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be resubmitted if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple python3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
For this script to run, Python3, which is not the default Python version on the cluster, first needs to be loaded into your environment. Availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
In the list you should see that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string; the account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes (20 hours).&lt;br /&gt;
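The equivalence between the plain-minutes form and the hours:minutes:seconds form can be checked with a little shell arithmetic:

```shell
# 1200 minutes written as hours:minutes:seconds
minutes=1200
hms=$(printf '%d:%02d:00' $(( minutes / 60 )) $(( minutes % 60 )))
echo "--time=${hms}"  # the same limit as --time=1200
```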
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default, it is deliberately relatively small: 1024 MB per node. If your job uses more than that, you will get an &amp;quot;Exceeded job memory limit&amp;quot; error. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number, the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that is much larger than most jobs need) and then use sacct to look at how much memory your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the job you are interested in. The number is reported in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you are defining a hard upper limit). If your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you are not enforcing an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
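The arithmetic above can be sketched as a small shell snippet; the MaxRSS value is a hypothetical sacct reading, not taken from this cluster:

```shell
# Hypothetical MaxRSS reported by sacct, in KB
maxrss_kb=3565248
# Convert to MB (integer division)
mem_mb=$(( maxrss_kb / 1024 ))
# Add roughly 20% headroom, since --mem is a hard upper limit
mem_req=$(( mem_mb + mem_mb / 5 ))
echo "MaxRSS is about ${mem_mb} MB; request --mem=${mem_req}"
```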
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources accordingly. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, its status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of jobs, such as the time limit.&lt;br /&gt;
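Two commonly useful filters, for instance (using squeue&#039;s standard -u/--user and -t/--states options):

```shell
squeue -u $USER             # show only your own jobs
squeue -u $USER -t RUNNING  # show only your running jobs
```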
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that time. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times should be specified when submitting jobs. The time limit set for a particular job can be checked with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
[nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either queued or already running, you can remove it using the &#039;scancel&#039; command, which takes the job id as a parameter. For the example above, this would be done as follows:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
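Besides a single job id, scancel can also match jobs by attribute; for example, using its standard --user and --name options:

```shell
scancel -u $USER           # cancel all of your own jobs
scancel --name=calc_pi.py  # cancel jobs with a given job name
```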
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
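As a minimal sketch of the interactive workflow (resource values are arbitrary): salloc obtains an allocation and opens a shell in it, srun launches tasks inside the allocation, and exiting the shell releases the resources.

```shell
salloc --ntasks=1 --time=1:00:00  # request an allocation; opens a subshell
srun hostname                     # run a command on the allocated node
exit                              # leave the subshell and release the allocation
```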
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING or the PENDING state; however, here is a breakdown of all the states your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA	||CANCELLED||	Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||	COMPLETED||	Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||	CONFIGURING||	Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||	COMPLETING||	Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||	FAILED||	Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||	NODE_FAIL||	Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||	PENDING||	Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||	RUNNING||	Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||	SUSPENDED||	Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||	TIMEOUT||	Job terminated upon reaching its time limit.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; Partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1125</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1125"/>
		<updated>2014-01-16T08:21:46Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Get overview of past and current jobs: sacct */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (called partitions in Slurm): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue does (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are currently occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
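Because there is no default queue and both the run-time and memory defaults are deliberately small, it is usually worth setting all three explicitly when submitting. A minimal sketch; the partition name is taken from the sinfo listing above, so replace it with a queue your account is authorized for, and treat the program name as a placeholder:&lt;br /&gt;

```shell
#!/bin/bash
# Hypothetical queue name from the sinfo example above;
# replace with a partition your account is authorized for.
#SBATCH --partition=ABGC_Research
# Override the 1-hour default run time (value in minutes)
#SBATCH --time=240
# Override the 1024 MB per-node default memory limit (value in MB)
#SBATCH --mem=4096

./my_program   # placeholder for the actual work
```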
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple python3 script, which computes a large number of digits of Pi using the Bailey-Borwein-Plouffe formula:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
In order for this script to run, Python3, which is not the default Python version on the cluster, must first be loaded into your environment. Availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
In the list you should see that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;; a time limit of zero requests that no time limit be imposed. In this example the job will therefore run for a maximum of 1200 minutes.&lt;br /&gt;
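As an illustration of the accepted formats, the following directives all request the same 20-hour limit; a real script would contain only one of them:&lt;br /&gt;

```shell
# Equivalent ways to request a 20-hour time limit
#SBATCH --time=1200          # minutes
#SBATCH --time=20:00:00      # hours:minutes:seconds
#SBATCH --time=0-20          # days-hours
```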
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, you’ll get an error stating that your job &#039;Exceeded job memory limit&#039;. To set a larger limit, add to your job submission: &lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number, the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, which is much more than most jobs need) and then use sacct to look at how much memory your job is actually using or has used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
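The KB-to-MB conversion and head-room step can be sketched in plain shell; the MaxRSS value below is a made-up example standing in for a real sacct reading:&lt;br /&gt;

```shell
# Hypothetical MaxRSS reading from sacct, in KB
maxrss_kb=3000000

# sacct reports KB; --mem expects MB, so divide by 1024
maxrss_mb=$(( maxrss_kb / 1024 ))

# Add roughly 20% head-room, since --mem defines a hard upper limit
mem_request=$(( maxrss_mb + maxrss_mb / 5 ))

echo "#SBATCH --mem=${mem_request}"
```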
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and ensures that sufficient resources are provided. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in $(seq 1 10); do echo $i; sbatch runscript_$i.sh; done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
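As an alternative, assuming the installed Slurm version supports job arrays (available from Slurm 2.6 onwards), the ten submissions can be collapsed into a single one. This is a hypothetical sketch of one combined batch script that selects its work by array task id:&lt;br /&gt;

```shell
#!/bin/bash
#SBATCH --array=1-10
#SBATCH --job-name=runscript

# Slurm sets SLURM_ARRAY_TASK_ID to 1..10, one value per array task
echo "Task ${SLURM_ARRAY_TASK_ID}: processing input set ${SLURM_ARRAY_TASK_ID}"
```

The script is submitted once with a plain &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; call, and Slurm schedules the ten tasks independently.&lt;br /&gt;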
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs running on the cluster at that moment. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a certain job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either waiting in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the job id as a parameter. For the example above, this would be done as follows:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Job Status Codes&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Typically your job will be in either the RUNNING or the PENDING state; however, here is a breakdown of all the states your job could be in.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!Code!!State!!Description&lt;br /&gt;
|-&lt;br /&gt;
|CA||CANCELLED||Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.&lt;br /&gt;
|-&lt;br /&gt;
|CD||COMPLETED||Job has terminated all processes on all nodes.&lt;br /&gt;
|-&lt;br /&gt;
|CF||CONFIGURING||Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).&lt;br /&gt;
|-&lt;br /&gt;
|CG||COMPLETING||Job is in the process of completing. Some processes on some nodes may still be active.&lt;br /&gt;
|-&lt;br /&gt;
|F||FAILED||Job terminated with non-zero exit code or other failure condition.&lt;br /&gt;
|-&lt;br /&gt;
|NF||NODE_FAIL||Job terminated due to failure of one or more allocated nodes.&lt;br /&gt;
|-&lt;br /&gt;
|PD||PENDING||Job is awaiting resource allocation.&lt;br /&gt;
|-&lt;br /&gt;
|R||RUNNING||Job currently has an allocation.&lt;br /&gt;
|-&lt;br /&gt;
|S||SUSPENDED||Job has an allocation, but execution has been suspended.&lt;br /&gt;
|-&lt;br /&gt;
|TO||TIMEOUT||Job terminated upon reaching its time limit.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; Partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Setting_TMPDIR&amp;diff=1080</id>
		<title>Setting TMPDIR</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Setting_TMPDIR&amp;diff=1080"/>
		<updated>2013-12-29T11:03:24Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Note: all nodes are partitioned with a large /tmp (approximately 400 GB).&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Many programs require writing intermediary or temporary information. A process may even require writing to a temporary location without a user knowing it. For instance, the command &amp;lt;code&amp;gt;sort&amp;lt;/code&amp;gt;, part of the &amp;lt;code&amp;gt;bash&amp;lt;/code&amp;gt; toolkit, requires a lot of temporary file space when sorting large volumes of data. Often programs take the system default, which is usually &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; on Linux systems. The &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; partition is however often too limited in size and can get filled up. When this happens all users that require write access to &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; will experience problems in running jobs, which can range from unexpected quitting of processes, to erroneous output. &lt;br /&gt;
&lt;br /&gt;
== Overriding the system default - works for many applications (but not all) ==&lt;br /&gt;
&lt;br /&gt;
Users can set a custom temporary folder location by setting the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; environment variable. A custom temporary folder location is best placed in the &#039;scratch&#039; space on the /lustre filesystem, because this will ensure periodic tidying up of the custom temporary directory thereby reducing the opportunity for very large temporary files that have gone unnoticed to remain on the filesystem. In addition, this is in line with protocols for the default system temporary directory &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; as this directory is also periodically purged of older files.&lt;br /&gt;
&lt;br /&gt;
First, create a temporary directory in your user-designated space on the &#039;scratch&#039; partition:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
mkdir /lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Set the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; variable to point at the custom directory in either &amp;lt;code&amp;gt;~/.bash_profile&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; using a command-line editor of choice (vi, emacs, nano, etc), adding the following line of code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
export TMPDIR=/lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The temporary directory will be set to the custom location once you log in. For the current open terminal to set the new location immediately, without need to log out and in again:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
source ~/.bash_profile&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
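To check that a program honours the new setting, one quick sanity test uses mktemp, which builds its default template from $TMPDIR; the path below is a hypothetical stand-in for the scratch location above:&lt;br /&gt;

```shell
# Hypothetical custom location; on the cluster this would live under
# the /lustre scratch space described above
export TMPDIR=/tmp/tmpdir-demo
mkdir -p "$TMPDIR"

# GNU mktemp consults $TMPDIR for its default template
f=$(mktemp)

# The temporary file should now live under the custom directory
case "$f" in
  "$TMPDIR"/*) echo "TMPDIR respected" ;;
  *)           echo "TMPDIR ignored" ;;
esac

rm -f "$f"
```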
&lt;br /&gt;
Note that certain (compiled) applications may in fact ignore the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; environment variable. These can include Java applications and C/C++ applications compiled with gcc/g++.&lt;br /&gt;
&lt;br /&gt;
== Setting custom temporary directory for Java applications ==&lt;br /&gt;
&lt;br /&gt;
Java applications often do not respond properly to re-setting of the system default by manipulating the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; environment variable. Java behaviour can be modified by setting the &amp;lt;code&amp;gt;_JAVA_OPTIONS&amp;lt;/code&amp;gt; environment variable:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Note that this does not, strictly speaking, manipulate a temporary-directory-specific environment variable; rather, it sets a Java-specific environment variable that allows options to be passed to Java at runtime.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
[[Main_Page#Being_in_control_of_Environment_parameters | Controlling the environment on the Agrogenomics cluster]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Setting_TMPDIR&amp;diff=1079</id>
		<title>Setting TMPDIR</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Setting_TMPDIR&amp;diff=1079"/>
		<updated>2013-12-29T11:03:01Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Note: all nodes are partitioned with a large /tmp (approximately 400 GB); /local is much smaller!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Many programs require writing intermediary or temporary information. A process may even require writing to a temporary location without a user knowing it. For instance, the command &amp;lt;code&amp;gt;sort&amp;lt;/code&amp;gt;, part of the &amp;lt;code&amp;gt;bash&amp;lt;/code&amp;gt; toolkit, requires a lot of temporary file space when sorting large volumes of data. Often programs take the system default, which is usually &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; on Linux systems. The &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; partition is however often too limited in size and can get filled up. When this happens all users that require write access to &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; will experience problems in running jobs, which can range from unexpected quitting of processes, to erroneous output. &lt;br /&gt;
&lt;br /&gt;
== Overriding the system default - works for many applications (but not all) ==&lt;br /&gt;
&lt;br /&gt;
Users can set a custom temporary folder location by setting the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; environment variable. A custom temporary folder location is best placed in the &#039;scratch&#039; space on the /lustre filesystem, because this will ensure periodic tidying up of the custom temporary directory thereby reducing the opportunity for very large temporary files that have gone unnoticed to remain on the filesystem. In addition, this is in line with protocols for the default system temporary directory &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; as this directory is also periodically purged of older files.&lt;br /&gt;
&lt;br /&gt;
First, create a temporary directory in user-designated space on the &#039;scratch&#039; partition:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
mkdir /lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Set the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; variable to point at the custom directory in either &amp;lt;code&amp;gt;~/.bash_profile&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; using a command-line editor of choice (vi, emacs, nano, etc.), adding the following line:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
export TMPDIR=/lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The temporary directory will be set to the custom location once you log in. To apply the new location immediately in the currently open terminal, without logging out and back in:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
source ~/.bash_profile&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
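&lt;br /&gt;
To verify that the new location is active in the current shell:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
echo $TMPDIR&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;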
&lt;br /&gt;
Note that certain (compiled) applications may in fact ignore the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; environment variable. These include some Java applications and some compiled C/C++ applications.&lt;br /&gt;
&lt;br /&gt;
== Setting custom temporary directory for Java applications ==&lt;br /&gt;
&lt;br /&gt;
Java applications often do not respond properly to re-setting of the system default by manipulating the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; environment variable. Java behaviour can be modified by setting the &amp;lt;code&amp;gt;_JAVA_OPTIONS&amp;lt;/code&amp;gt; environment variable:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Note that this does not, strictly speaking, set a temporary-directory-specific environment variable; rather, it sets a Java-specific environment variable that passes options to the Java runtime.&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
[[Main_Page#Being_in_control_of_Environment_parameters | Controlling the environment on the Agrogenomics cluster]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1047</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1047"/>
		<updated>2013-12-27T14:56:50Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has two queues (called partitions in Slurm): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
To run this script, Python 3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that Python 3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users, a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. A bare number is interpreted as minutes, so in this example the job may run for at most 1200 minutes (20 hours).&lt;br /&gt;
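&lt;br /&gt;
The same 20-hour limit can equivalently be written in any of the other accepted formats, e.g.:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=20:00:00     # hours:minutes:seconds&lt;br /&gt;
#SBATCH --time=0-20         # days-hours&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;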
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with the error &amp;quot;Exceeded job memory limit&amp;quot;. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be; but the smaller the number, the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that is much larger than most jobs need) and then use sacct to look at how much memory your job actually used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you&#039;re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you&#039;re defining a hard upper limit). If your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you&#039;re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
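&lt;br /&gt;
As a worked example (with a made-up MaxRSS value): if sacct reports a MaxRSS of 1536000 KB, that is 1536000 / 1024 = 1500 MB, so a request with a little headroom above that would look like:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;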
&lt;br /&gt;
----&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources accordingly. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, they can all be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh; done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of the jobs running on the cluster at that time. For the sbatch example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
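&lt;br /&gt;
To restrict the listing to your own jobs, squeue accepts a -u/--user option:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -u $USER&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;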
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour; estimated run times should be specified when submitting jobs. The time limit set for a specific job can be inspected with the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active job (not a completed job).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either queued or already running, you can remove it using the &#039;scancel&#039; command, which takes the job id as a parameter. For the example above, this would be done as follows:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
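&lt;br /&gt;
scancel can also cancel several jobs at once; for example, all of your own jobs via the -u/--user option:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel -u $USER&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;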
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions are available to any given person: Master students will only have the Student Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1046</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1046"/>
		<updated>2013-12-27T14:55:12Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Defaults */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has two queues (called partitions in Slurm): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type &amp;lt;code&amp;gt;sinfo&amp;lt;/code&amp;gt;:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
To run this script, Python 3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that Python 3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users, a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. A bare number is interpreted as minutes, so in this example the job may run for at most 1200 minutes (20 hours).&lt;br /&gt;
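&lt;br /&gt;
The same 20-hour limit can equivalently be written in any of the other accepted formats, e.g.:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=20:00:00     # hours:minutes:seconds&lt;br /&gt;
#SBATCH --time=0-20         # days-hours&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;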
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with the error &amp;quot;Exceeded job memory limit&amp;quot;. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be; but the smaller the number, the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that is much larger than most jobs need) and then use sacct to look at how much memory your job actually used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you&#039;re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you&#039;re defining a hard upper limit). If your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you&#039;re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
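&lt;br /&gt;
As a worked example (with a made-up MaxRSS value): if sacct reports a MaxRSS of 1536000 KB, that is 1536000 / 1024 = 1500 MB, so a request with a little headroom above that would look like:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;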
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources accordingly. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
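If the scripts differ only in that index, a SLURM job array is an alternative: a single script is submitted once, and each array task reads its index from the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch, assuming a hypothetical combined script &#039;runscript.sh&#039;:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# submit 10 array tasks; inside runscript.sh, use $SLURM_ARRAY_TASK_ID&lt;br /&gt;
sbatch --array=1-10 runscript.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;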
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of the jobs running at that time on the cluster. For the earlier &#039;sbatch&#039; submission example, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
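&lt;br /&gt;
To narrow the list down, &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; accepts filters; for example, to show only the jobs of a single user (the username is illustrative):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -u megen002&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;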
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a specific job can be inspected using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all details of a currently active job (not a completed job).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
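&amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; can also select jobs by attribute; for example, to cancel all of your own pending and running jobs at once (the username is illustrative):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel -u megen002&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;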
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
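A minimal sketch of an interactive allocation (resource values are illustrative): &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; obtains an allocation, runs the given command inside it, and releases the allocation when that command exits.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# request 1 task and 2048 MB for 60 minutes, then start a shell inside the allocation&lt;br /&gt;
salloc --ntasks=1 --mem=2048 --time=60 bash&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;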
&lt;br /&gt;
== Getting an overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. For example, Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; Partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1045</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1045"/>
		<updated>2013-12-27T14:54:53Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Defaults */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (called partitions in SLURM): a production and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&#039;&#039;&#039; &amp;lt;br&amp;gt;&lt;br /&gt;
&#039;&#039;&#039;Default memory limit is 1024MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
In order for this script to run, Python3, which is not the default Python version on the cluster, first needs to be loaded into your environment. Availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string, and the account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job; a time limit of zero requests that no limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes (20 hours).&lt;br /&gt;
&lt;br /&gt;
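For example, the following forms all request the same 20-hour limit (use only one per script):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200        # minutes&lt;br /&gt;
#SBATCH --time=20:00:00    # hours:minutes:seconds&lt;br /&gt;
#SBATCH --time=0-20:00     # days-hours:minutes&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;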
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with the error &amp;quot;Exceeded job memory limit&amp;quot;. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you’re not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could have very different values when run at different times.&lt;br /&gt;
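For example, to include a job that finished well in the past, add a start time to the same query (the job id and date are illustrative):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct -o MaxRSS -j 4220 -S 2013-11-01&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;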
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and provides sufficient resources for them. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=research&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
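If the scripts differ only in that index, a SLURM job array is an alternative: a single script is submitted once, and each array task reads its index from the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch, assuming a hypothetical combined script &#039;runscript.sh&#039;:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# submit 10 array tasks; inside runscript.sh, use $SLURM_ARRAY_TASK_ID&lt;br /&gt;
sbatch --array=1-10 runscript.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;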
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of the jobs running at that time on the cluster. For the earlier &#039;sbatch&#039; submission example, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
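&lt;br /&gt;
To narrow the list down, &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; accepts filters; for example, to show only the jobs of a single user (the username is illustrative):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -u megen002&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;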
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a specific job can be inspected using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all details of a currently active job (not a completed job).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
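&amp;lt;code&amp;gt;scancel&amp;lt;/code&amp;gt; can also select jobs by attribute; for example, to cancel all of your own pending and running jobs at once (the username is illustrative):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel -u megen002&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;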
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
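A minimal sketch of an interactive allocation (resource values are illustrative): &amp;lt;code&amp;gt;salloc&amp;lt;/code&amp;gt; obtains an allocation, runs the given command inside it, and releases the allocation when that command exits.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# request 1 task and 2048 MB for 60 minutes, then start a shell inside the allocation&lt;br /&gt;
salloc --ntasks=1 --mem=2048 --time=60 bash&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;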
&lt;br /&gt;
== Getting an overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; when submitting with the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions will be available to any given person. For example, Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; Partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1044</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=1044"/>
		<updated>2013-12-27T14:54:15Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has 2 queues (called partitions in SLURM): a production and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&lt;br /&gt;
&#039;&#039;&#039;The default run time for a job is 1 hour!&lt;br /&gt;
The default memory limit is 1024 MB per node!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
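This series is the Bailey-Borwein-Plouffe formula; each term contributes roughly 1.2 decimal digits, so 411 terms yield only a few hundred correct digits regardless of the precision setting. A small self-contained check (not part of the original page, scaled down so it runs quickly) illustrates this:

```python
from decimal import Decimal, getcontext

# Same series as the cluster example, at a smaller, testable scale.
getcontext().prec = 60
D = Decimal
pi = sum(
    D(1) / 16**k * (D(4) / (8*k + 1) - D(2) / (8*k + 4)
                    - D(1) / (8*k + 5) - D(1) / (8*k + 6))
    for k in range(40)  # ~40 * 1.2 = ~48 correct decimal digits
)
print(str(pi)[:32])  # 3.141592653589793238462643383279
```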
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
Before this script can run, Python 3, which is not the default Python version on the cluster, must be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that Python 3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes (20 hours).&lt;br /&gt;
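As an illustration of these formats (the helper below is not part of Slurm; its name is made up for this example), the accepted strings can be converted to minutes like so:

```python
def slurm_time_to_minutes(s: str) -> float:
    """Convert a Slurm time string to minutes.

    Accepted forms: M, M:S, H:M:S, D-H, D-H:M, D-H:M:S.
    """
    days = 0
    if "-" in s:
        d, s = s.split("-", 1)
        days = int(d)
        parts = [int(p) for p in s.split(":")]
        # After a day part, the remaining fields are H[:M[:S]].
        h, m, sec = (parts + [0, 0, 0])[:3]
    else:
        parts = [int(p) for p in s.split(":")]
        if len(parts) == 1:      # minutes
            h, m, sec = 0, parts[0], 0
        elif len(parts) == 2:    # minutes:seconds
            h, (m, sec) = 0, parts
        else:                    # hours:minutes:seconds
            h, m, sec = parts
    return days * 24 * 60 + h * 60 + m + sec / 60

print(slurm_time_to_minutes("1200"))        # 1200.0
print(slurm_time_to_minutes("3-08:00:00"))  # 4800.0
```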
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem=2048&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
SLURM imposes a memory limit on each job. By default it is deliberately small: 1024 MB per node. If your job uses more than that, it will fail with the error &amp;quot;Exceeded job memory limit&amp;quot;. To set a larger limit, add to your job submission:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mem X&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots have on average about 4000 MB per core, which is much more than most jobs need) and then use sacct to see how much memory your job actually used:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
$ sacct -o MaxRSS -j JOBID&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
where JOBID is the job you are interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you are defining a hard upper limit). If your job completed long ago, you may have to tell sacct to look further back in time by adding a start time with -S YYYY-MM-DD. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you are not setting an even distribution of tasks per node (e.g. with --ntasks-per-node), the same job could report very different values when run at different times.&lt;br /&gt;
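The KB-to-MB conversion with headroom can be sketched as a small helper (hypothetical, for illustration only; the name and the 20% headroom are choices for this example):

```python
import math

def suggest_mem_mb(max_rss_kb: int, headroom: float = 1.2) -> int:
    """Turn sacct's MaxRSS (KB) into a suggested --mem value (MB).

    Adds some headroom and rounds up, since --mem is a hard limit.
    """
    return math.ceil(max_rss_kb / 1024 * headroom)

print(suggest_mem_mb(1048576))  # a 1 GB MaxRSS suggests --mem=1229
```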
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch a maximum of that number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of the jobs running on the cluster at that moment. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
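Because squeue prints plain text, its output is easy to post-process. A small sketch (not part of the original page) counts running jobs per partition, using a few of the sample lines above:

```python
from collections import Counter

# A few lines of squeue output, as shown above (header stripped).
sample = """\
 3396      ABGC BOV-WUR- megen002   R      27:26      1 node004
 3385  research BOV-WUR- megen002   R      44:38      1 node049
 3386  research BOV-WUR- megen002   R      44:38      1 node050"""

# The partition is the second whitespace-separated column.
per_partition = Counter(line.split()[1] for line in sample.splitlines())
print(per_partition["research"], per_partition["ABGC"])  # 2 1
```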
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a specific job can be checked using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
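The Key=Value pairs that scontrol prints are straightforward to turn into a dictionary; a minimal sketch (using a shortened, single-line sample of the output above; note that some real values contain spaces, such as &quot;domain users&quot;, so this naive split only works for simple fields):

```python
# A shortened, single-line sample of 'scontrol show jobid' output.
sample = ("JobId=4241 Name=WB20F06 JobState=RUNNING "
          "TimeLimit=3-08:00:00 Partition=research NodeList=node023")

# Split on whitespace, then split each pair on the first '='.
info = dict(pair.split("=", 1) for pair in sample.split())
print(info["JobState"], info["Partition"])  # RUNNING research
```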
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
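sacct can also produce machine-readable output (e.g. with its --parsable2 option, which prints pipe-separated fields); parsing that is then trivial. A sketch with a made-up sample in that style:

```python
# Made-up sample of pipe-separated sacct output (--parsable2 style).
sample = """JobID|JobName|Partition|State|ExitCode
4220|PreProces+|research|COMPLETED|0:0
4220.batch|batch||COMPLETED|0:0"""

# First line is the header; zip it with each data row into a dict.
header, *rows = (line.split("|") for line in sample.splitlines())
records = [dict(zip(header, row)) for row in rows]
print(records[0]["State"], records[1]["JobID"])  # COMPLETED 4220.batch
```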
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions are available to every user. For example, Master students will only have the Student Partition available, while researchers at the ABGC have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=HPCwiki:About&amp;diff=901</id>
		<title>HPCwiki:About</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=HPCwiki:About&amp;diff=901"/>
		<updated>2013-12-20T13:25:54Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This Wiki is intended for members and affiliates of the HPC Agrogenomics of Wageningen UR only.&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=887</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=887"/>
		<updated>2013-12-16T14:40:41Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has two queues (in Slurm called partitions): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
=== Defaults ===&lt;br /&gt;
There is no default queue, so you need to specify which queue to use when submitting a job.&lt;br /&gt;
The default run time for a job is 1 hour!&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python 3 script that calculates digits of Pi:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
Before this script can run, Python 3, which is not the default Python version on the cluster, must be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that Python 3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job. A time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. So in this example the job will run for a maximum of 1200 minutes (20 hours).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch a maximum of that number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of the jobs running on the cluster at that moment. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit set for a specific job can be checked using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all the details of a currently active job (this does not work for completed jobs).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from a list: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the &#039;scancel&#039; command. The &#039;scancel&#039; command takes the jobid as a parameter. For the example above, this would be done using the following code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions are available to every user. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=886</id>
		<title>Using Slurm</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Using_Slurm&amp;diff=886"/>
		<updated>2013-12-16T14:28:24Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;The resource allocation / scheduling software on the B4F Cluster is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: &#039;&#039;&#039;S&#039;&#039;&#039;imple &#039;&#039;&#039;L&#039;&#039;&#039;inux &#039;&#039;&#039;U&#039;&#039;&#039;tility for &#039;&#039;&#039;R&#039;&#039;&#039;esource &#039;&#039;&#039;M&#039;&#039;&#039;anagement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Queues and defaults ==&lt;br /&gt;
&lt;br /&gt;
=== Queues ===&lt;br /&gt;
Every organization has two queues (called partitions in Slurm): a production queue and a research queue. The production queue gives jobs a higher priority (20) than the research queue (10).&lt;br /&gt;
To find out which queues your account has been authorized for, type sinfo:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
PARTITION       AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
ABGC_Production    up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Production    up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Production    up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Research      up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Research      up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Research      up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
ABGC_Student       up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
ABGC_Student       up   infinite      6    mix fat[001-002],node[002-005]&lt;br /&gt;
ABGC_Student       up   infinite     44   idle node[001,006-042,049-054]&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
WUR organizations also have a Student queue. Jobs in this queue will be requeued if a job with higher priority needs cluster resources that are occupied by Student queue jobs.&lt;br /&gt;
&lt;br /&gt;
== Submitting jobs: sbatch ==&lt;br /&gt;
&lt;br /&gt;
=== Example ===&lt;br /&gt;
Consider this simple Python3 script that calculates Pi to high precision:&lt;br /&gt;
&amp;lt;source lang=&#039;python&#039;&amp;gt;&lt;br /&gt;
from decimal import *&lt;br /&gt;
D=Decimal&lt;br /&gt;
getcontext().prec=10000000&lt;br /&gt;
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))&lt;br /&gt;
print(str(p)[:10000002])&lt;br /&gt;
&amp;lt;/source&amp;gt; &lt;br /&gt;
&lt;br /&gt;
=== Loading modules ===&lt;br /&gt;
Before this script can run, Python3, which is not the default Python version on the cluster, needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:&lt;br /&gt;
  module avail&lt;br /&gt;
&lt;br /&gt;
The list should show that python3 is indeed available; it can then be loaded with the following command:&lt;br /&gt;
  module load python/3.3.3&lt;br /&gt;
&lt;br /&gt;
=== Batch script ===&lt;br /&gt;
The following shell/slurm script can then be used to schedule the job using the sbatch command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#!/bin/bash&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
time python3 calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Explanation of used SBATCH parameters:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --account=773320000&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Charge resources used by this job to the specified account. The account is an arbitrary string; the account name may be changed after job submission using the scontrol command. For WUR users a project number or KTP number is advisable.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --time=1200&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Set a limit on the total run time of the job; a time limit of zero requests that no time limit be imposed. Acceptable time formats include &amp;quot;minutes&amp;quot;, &amp;quot;minutes:seconds&amp;quot;, &amp;quot;hours:minutes:seconds&amp;quot;, &amp;quot;days-hours&amp;quot;, &amp;quot;days-hours:minutes&amp;quot; and &amp;quot;days-hours:minutes:seconds&amp;quot;. In this example the job will run for a maximum of 1200 minutes (20 hours).&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --ntasks=1&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --output=output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard output directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --error=error_output_%j.txt&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Instruct SLURM to connect the batch script&#039;s standard error directly to the file name specified in the &amp;quot;filename pattern&amp;quot;. By default both standard output and standard error are directed to a file of the name &amp;quot;slurm-%j.out&amp;quot;, where the &amp;quot;%j&amp;quot; is replaced with the job allocation number. See the --input option for filename specification options.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --job-name=calc_pi.py&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just &amp;quot;sbatch&amp;quot; if the script is read on sbatch&#039;s standard input.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --partition=ABGC&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Request a specific partition for the resource allocation. It is preferred to use your organization&#039;s partition.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-type=ALL&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
#SBATCH --mail-user=email@org.nl&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Email address to use.&lt;br /&gt;
&lt;br /&gt;
=== Submitting ===&lt;br /&gt;
The script, assuming it was named &#039;run_calc_pi.sh&#039;, can then be submitted using the following command:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sbatch run_calc_pi.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Submitting multiple jobs ===&lt;br /&gt;
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
for i in $(seq 1 10); do echo $i; sbatch runscript_$i.sh; done&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
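More recent Slurm releases (2.6 and later) also support job arrays, which achieve the same with a single submission; a sketch, assuming a single script &#039;runscript.sh&#039; that selects its input via the array index:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# Submit 10 instances of one script; each instance sees its&lt;br /&gt;
# own index in the $SLURM_ARRAY_TASK_ID environment variable&lt;br /&gt;
sbatch --array=1-10 runscript.sh&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;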
&lt;br /&gt;
== Monitoring submitted jobs: squeue ==&lt;br /&gt;
Once a job is submitted, the status can be monitored using the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command. The &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command has a number of parameters for monitoring specific properties of the jobs such as time limit.&lt;br /&gt;
&lt;br /&gt;
=== Generic monitoring of all running jobs ===&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
  squeue&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should then get a list of jobs currently running on the cluster. For the &#039;sbatch&#039; example above, it may look like this:&lt;br /&gt;
    JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)&lt;br /&gt;
   3396      ABGC BOV-WUR- megen002   R      27:26      1 node004&lt;br /&gt;
   3397      ABGC BOV-WUR- megen002   R      27:26      1 node005&lt;br /&gt;
   3398      ABGC BOV-WUR- megen002   R      27:26      1 node006&lt;br /&gt;
   3399      ABGC BOV-WUR- megen002   R      27:26      1 node007&lt;br /&gt;
   3400      ABGC BOV-WUR- megen002   R      27:26      1 node008&lt;br /&gt;
   3401      ABGC BOV-WUR- megen002   R      27:26      1 node009&lt;br /&gt;
   3385  research BOV-WUR- megen002   R      44:38      1 node049&lt;br /&gt;
   3386  research BOV-WUR- megen002   R      44:38      1 node050&lt;br /&gt;
   3387  research BOV-WUR- megen002   R      44:38      1 node051&lt;br /&gt;
   3388  research BOV-WUR- megen002   R      44:38      1 node052&lt;br /&gt;
   3389  research BOV-WUR- megen002   R      44:38      1 node053&lt;br /&gt;
   3390  research BOV-WUR- megen002   R      44:38      1 node054&lt;br /&gt;
   3391  research BOV-WUR- megen002   R      44:38      3 node[049-051]&lt;br /&gt;
   3392  research BOV-WUR- megen002   R      44:38      3 node[052-054]&lt;br /&gt;
   3393  research BOV-WUR- megen002   R      44:38      1 node001&lt;br /&gt;
   3394  research BOV-WUR- megen002   R      44:38      1 node002&lt;br /&gt;
   3395  research BOV-WUR- megen002   R      44:38      1 node003&lt;br /&gt;
&lt;br /&gt;
=== Monitoring time limit set for a specific job ===&lt;br /&gt;
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. To see the time limit set for a certain job, use the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command.&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
squeue -l -j 3532&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
Information similar to the following should appear:&lt;br /&gt;
  Fri Nov 29 15:41:00 2013&lt;br /&gt;
   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)&lt;br /&gt;
   3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054&lt;br /&gt;
&lt;br /&gt;
== Query a specific active job: scontrol ==&lt;br /&gt;
Show all details of a currently active job (not a completed job):&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
nfs01 ~]$ scontrol show jobid 4241&lt;br /&gt;
JobId=4241 Name=WB20F06&lt;br /&gt;
   UserId=megen002(16795409) GroupId=domain users(16777729)&lt;br /&gt;
   Priority=1 Account=(null) QOS=normal&lt;br /&gt;
   JobState=RUNNING Reason=None Dependency=(null)&lt;br /&gt;
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0&lt;br /&gt;
   RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A&lt;br /&gt;
   SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29&lt;br /&gt;
   StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29&lt;br /&gt;
   PreemptTime=None SuspendTime=None SecsPreSuspend=0&lt;br /&gt;
   Partition=research AllocNode:Sid=nfs01:21799&lt;br /&gt;
   ReqNodeList=(null) ExcNodeList=(null)&lt;br /&gt;
   NodeList=node023&lt;br /&gt;
   BatchHost=node023&lt;br /&gt;
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*&lt;br /&gt;
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0&lt;br /&gt;
   Features=(null) Gres=(null) Reservation=(null)&lt;br /&gt;
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)&lt;br /&gt;
   Command=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
   WorkDir=/lustre/scratch/WUR/ABGC/...&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Removing jobs from the queue: scancel ==&lt;br /&gt;
If for some reason you want to delete a job that is either queued or already running, you can remove it with the &#039;scancel&#039; command, which takes the jobid as a parameter. For the example above, this would be:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
scancel 3401&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Allocating resources interactively: salloc ==&lt;br /&gt;
&amp;lt; text here&amp;gt;&lt;br /&gt;
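A minimal sketch (the option values are illustrative, not site defaults): the &#039;salloc&#039; command obtains an interactive resource allocation and starts a shell, after which job steps can be launched inside the allocation with &#039;srun&#039;:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
# Request 4 tasks for 30 minutes (illustrative values)&lt;br /&gt;
salloc --ntasks=4 --time=30&lt;br /&gt;
# Inside the allocation, run a command as a job step&lt;br /&gt;
srun hostname&lt;br /&gt;
# Leave the shell to release the allocation&lt;br /&gt;
exit&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;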
&lt;br /&gt;
== Get overview of past and current jobs: sacct ==&lt;br /&gt;
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information similar to the following:&lt;br /&gt;
&lt;br /&gt;
         JobID    JobName  Partition    Account  AllocCPUS      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- ---------- ---------- -------- &lt;br /&gt;
  3385         BOV-WUR-58   research                    12  COMPLETED      0:0 &lt;br /&gt;
  3385.batch        batch                                1  COMPLETED      0:0 &lt;br /&gt;
  3386         BOV-WUR-59   research                    12 CANCELLED+      0:0 &lt;br /&gt;
  3386.batch        batch                                1  CANCELLED     0:15 &lt;br /&gt;
  3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0 &lt;br /&gt;
  3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0&lt;br /&gt;
&lt;br /&gt;
Or in more detail for a specific job:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sacct --format=jobid,jobname,account,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
This should provide information about job id 4220:&lt;br /&gt;
&lt;br /&gt;
       JobID    JobName    Account  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode &lt;br /&gt;
  ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- &lt;br /&gt;
  4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0 &lt;br /&gt;
  4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0&lt;br /&gt;
&lt;br /&gt;
== Running MPI jobs on B4F cluster ==&lt;br /&gt;
&lt;br /&gt;
[[MPI_on_B4F_cluster | Main article: MPI on B4F Cluster]]&lt;br /&gt;
&amp;lt; text here &amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Understanding which resources are available to you: sinfo ==&lt;br /&gt;
By using the &#039;sinfo&#039; command you can retrieve information on which &#039;Partitions&#039; are available to you. A &#039;Partition&#039; in SLURM is similar to a &#039;queue&#039; in the Sun Grid Engine (&#039;qsub&#039;). The different Partitions grant different levels of resource allocation, and not all defined Partitions are available to every user. E.g., Master students will only have the &#039;student&#039; Partition available, while researchers at the ABGC will have the &#039;student&#039;, &#039;research&#039;, and &#039;ABGC&#039; Partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the &#039;student&#039; partition. A full list of Partitions can be found on the Bright Cluster Manager webpage.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
sinfo&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST&lt;br /&gt;
  student*     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  student*     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  research     up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  research     up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
  ABGC         up   infinite     12  down* node[043-048,055-060]&lt;br /&gt;
  ABGC         up   infinite     50   idle fat[001-002],node[001-042,049-054]&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
* [[B4F_cluster | B4F Cluster]]&lt;br /&gt;
* [[BCM_on_B4F_cluster | BCM on B4F cluster]]&lt;br /&gt;
* [[SLURM_Compare | SLURM compared to other common schedulers]]&lt;br /&gt;
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]&lt;br /&gt;
&lt;br /&gt;
== External links ==&lt;br /&gt;
* [http://slurm.schedmd.com Slurm official documentation]&lt;br /&gt;
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]&lt;br /&gt;
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=File:BCM.png&amp;diff=885</id>
		<title>File:BCM.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=File:BCM.png&amp;diff=885"/>
		<updated>2013-12-16T13:52:19Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: Bohme001 uploaded a new version of &amp;amp;quot;File:BCM.png&amp;amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;BCM portal&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Setting_TMPDIR&amp;diff=884</id>
		<title>Setting TMPDIR</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Setting_TMPDIR&amp;diff=884"/>
		<updated>2013-12-13T14:51:35Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&#039;&#039;&#039;Note: all nodes are partitioned with a large /local (approx. 400 GB); /tmp is much smaller!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Many programs need to write intermediary or temporary data. A process may even write to a temporary location without the user knowing it. For instance, the command &amp;lt;code&amp;gt;sort&amp;lt;/code&amp;gt;, part of the GNU coreutils, needs a lot of temporary file space when sorting large volumes of data. Programs often use the system default, which is usually &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; on Linux systems. The &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; partition, however, is often limited in size and can fill up. When this happens, all users that need write access to &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; will experience problems with their jobs, ranging from processes quitting unexpectedly to erroneous output.&lt;br /&gt;
&lt;br /&gt;
Users can set a custom temporary folder location through the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; environment variable. A custom temporary folder is best placed in the &#039;scratch&#039; space on the /lustre filesystem, because scratch is tidied up periodically, which prevents large temporary files from lingering unnoticed on the filesystem. This also matches the policy for the default system temporary directory &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt;, which is likewise periodically purged of older files.&lt;br /&gt;
&lt;br /&gt;
First, create a temporary directory in your designated space on the &#039;scratch&#039; partition:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
mkdir /lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Set the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; variable to point at the custom directory in either &amp;lt;code&amp;gt;~/.bash_profile&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; using a command-line editor of choice (vi, emacs, nano, etc), adding the following line of code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
export TMPDIR=/lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The temporary directory will be set to the custom location the next time you log in. To apply the new location immediately in the current terminal, without logging out and back in:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
source ~/.bash_profile&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=Setting_TMPDIR&amp;diff=883</id>
		<title>Setting TMPDIR</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=Setting_TMPDIR&amp;diff=883"/>
		<updated>2013-12-13T14:51:05Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Many programs need to write intermediary or temporary data. A process may even write to a temporary location without the user knowing it. For instance, the command &amp;lt;code&amp;gt;sort&amp;lt;/code&amp;gt;, part of the GNU coreutils, needs a lot of temporary file space when sorting large volumes of data. Programs often use the system default, which is usually &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; on Linux systems. The &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; partition, however, is often limited in size and can fill up. When this happens, all users that need write access to &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt; will experience problems with their jobs, ranging from processes quitting unexpectedly to erroneous output.&lt;br /&gt;
&lt;br /&gt;
&#039;&#039;&#039;Note: all nodes are partitioned with a large /local (approx. 400 GB); /tmp is much smaller!&#039;&#039;&#039;&lt;br /&gt;
&lt;br /&gt;
Users can set a custom temporary folder location through the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; environment variable. A custom temporary folder is best placed in the &#039;scratch&#039; space on the /lustre filesystem, because scratch is tidied up periodically, which prevents large temporary files from lingering unnoticed on the filesystem. This also matches the policy for the default system temporary directory &amp;lt;code&amp;gt;/tmp&amp;lt;/code&amp;gt;, which is likewise periodically purged of older files.&lt;br /&gt;
&lt;br /&gt;
First, create a temporary directory in your designated space on the &#039;scratch&#039; partition:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
mkdir /lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Set the &amp;lt;code&amp;gt;TMPDIR&amp;lt;/code&amp;gt; variable to point at the custom directory in either &amp;lt;code&amp;gt;~/.bash_profile&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; using a command-line editor of choice (vi, emacs, nano, etc), adding the following line of code:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
export TMPDIR=/lustre/scratch/WUR/ABGC/[user]/tmp&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The temporary directory will be set to the custom location the next time you log in. To apply the new location immediately in the current terminal, without logging out and back in:&lt;br /&gt;
&amp;lt;source lang=&#039;bash&#039;&amp;gt;&lt;br /&gt;
source ~/.bash_profile&lt;br /&gt;
&amp;lt;/source&amp;gt;&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=SLURM_Compare&amp;diff=869</id>
		<title>SLURM Compare</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=SLURM_Compare&amp;diff=869"/>
		<updated>2013-12-10T11:56:57Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Rosetta Stone of Workload Managers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Rosetta Stone of Workload Managers ===&lt;br /&gt;
&lt;br /&gt;
PBS/Torque, Slurm, LSF, SGE and LoadLeveler [http://slurm.schedmd.com/rosetta.html Rosetta Stone]&lt;br /&gt;
&lt;br /&gt;
This table lists the most common commands, environment variables, and job specification options used by the major workload management systems: PBS/Torque, Slurm, LSF, SGE and LoadLeveler. Each of these workload managers has unique features, but the most commonly used functionality is available in all of them, as listed in the table. This should be considered a work in progress; contributions to improve the document are welcome.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!User Commands!!PBS/Torque!!Slurm!!LSF!!SGE!!LoadLeveler&lt;br /&gt;
|-&lt;br /&gt;
||Job submission|| qsub [script_file]|| sbatch [script_file]|| bsub [script_file]|| qsub [script_file]|| llsubmit [script_file] &lt;br /&gt;
|-&lt;br /&gt;
||Job deletion ||qdel [job_id]|| scancel [job_id]|| bkill [job_id]|| qdel [job_id]|| llcancel [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job status (by job)|| qstat [job_id]|| squeue [job_id]|| bjobs [job_id]|| qstat -u \* [-j job_id]|| llq -u [username] &lt;br /&gt;
|-&lt;br /&gt;
||Job status (by user)|| qstat -u [user_name]|| squeue -u [user_name]|| bjobs -u [user_name]|| qstat [-u user_name]|| llq -u [user_name] &lt;br /&gt;
|-&lt;br /&gt;
||Job hold ||qhold [job_id]|| scontrol hold [job_id]|| bstop [job_id]|| qhold [job_id]|| llhold -r [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job release|| qrls [job_id]|| scontrol release [job_id]|| bresume [job_id]|| qrls [job_id]|| llhold -r [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Queue list|| qstat -Q|| squeue ||bqueues ||qconf -sql|| llclass &lt;br /&gt;
|-&lt;br /&gt;
||Node list ||pbsnodes -l|| sinfo -N OR scontrol show nodes|| bhosts|| qhost|| llstatus -L machine &lt;br /&gt;
|-&lt;br /&gt;
||Cluster status|| qstat -a|| sinfo|| bqueues|| qhost -q|| llstatus -L cluster &lt;br /&gt;
|-&lt;br /&gt;
||GUI|| xpbsmon|| sview|| xlsf OR xlsbatch|| qmon|| xload &lt;br /&gt;
|-&lt;br /&gt;
|-&lt;br /&gt;
||&#039;&#039;&#039;Environment&#039;&#039;&#039;||&#039;&#039;&#039;PBS/Torque&#039;&#039;&#039;||&#039;&#039;&#039;Slurm&#039;&#039;&#039;||&#039;&#039;&#039;LSF&#039;&#039;&#039;||&#039;&#039;&#039;SGE&#039;&#039;&#039;||&#039;&#039;&#039;LoadLeveler&#039;&#039;&#039; &lt;br /&gt;
|-&lt;br /&gt;
||Job ID|| $PBS_JOBID|| $SLURM_JOBID|| $LSB_JOBID|| $JOB_ID|| $LOAD_STEP_ID &lt;br /&gt;
|-&lt;br /&gt;
||Submit Directory|| $PBS_O_WORKDIR|| $SLURM_SUBMIT_DIR|| $LSB_SUBCWD|| $SGE_O_WORKDIR|| $LOADL_STEP_INITDIR &lt;br /&gt;
|-&lt;br /&gt;
||Submit Host|| $PBS_O_HOST|| $SLURM_SUBMIT_HOST|| $LSB_SUB_HOST|| $SGE_O_HOST ||&lt;br /&gt;
|-&lt;br /&gt;
||Node List|| $PBS_NODEFILE|| $SLURM_JOB_NODELIST|| $LSB_HOSTS/$LSB_MCPU_HOSTS|| $PE_HOSTFILE|| $LOADL_PROCESSOR_LIST &lt;br /&gt;
|-&lt;br /&gt;
||Job Array Index|| $PBS_ARRAYID|| $SLURM_ARRAY_TASK_ID|| $LSB_JOBINDEX|| $SGE_TASK_ID ||&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=SLURM_Compare&amp;diff=868</id>
		<title>SLURM Compare</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=SLURM_Compare&amp;diff=868"/>
		<updated>2013-12-10T11:55:44Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Rosetta Stone of Workload Managers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Rosetta Stone of Workload Managers ===&lt;br /&gt;
&lt;br /&gt;
PBS/Torque, Slurm, LSF, SGE and LoadLeveler [http://slurm.schedmd.com/rosetta.html Rosetta Stone]&lt;br /&gt;
&lt;br /&gt;
This table lists the most common commands, environment variables, and job specification options used by the major workload management systems: PBS/Torque, Slurm, LSF, SGE, and LoadLeveler. Each of these workload managers has unique features, but the functionality listed in the table is available in all of them. This should be considered a work in progress; contributions to improve the document are welcome.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!User Commands!!PBS/Torque!!Slurm!!LSF!!SGE!!LoadLeveler&lt;br /&gt;
|-&lt;br /&gt;
||Job submission|| qsub [script_file]|| sbatch [script_file]|| bsub [script_file]|| qsub [script_file]|| llsubmit [script_file] &lt;br /&gt;
|-&lt;br /&gt;
||Job deletion ||qdel [job_id]|| scancel [job_id]|| bkill [job_id]|| qdel [job_id]|| llcancel [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job status (by job)|| qstat [job_id]|| squeue -j [job_id]|| bjobs [job_id]|| qstat -u \* [-j job_id]|| llq [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job status (by user)|| qstat -u [user_name]|| squeue -u [user_name]|| bjobs -u [user_name]|| qstat [-u user_name]|| llq -u [user_name] &lt;br /&gt;
|-&lt;br /&gt;
||Job hold ||qhold [job_id]|| scontrol hold [job_id]|| bstop [job_id]|| qhold [job_id]|| llhold [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job release|| qrls [job_id]|| scontrol release [job_id]|| bresume [job_id]|| qrls [job_id]|| llhold -r [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Queue list|| qstat -Q|| squeue ||bqueues ||qconf -sql|| llclass &lt;br /&gt;
|-&lt;br /&gt;
||Node list ||pbsnodes -l|| sinfo -N OR scontrol show nodes|| bhosts|| qhost|| llstatus -L machine &lt;br /&gt;
|-&lt;br /&gt;
||Cluster status|| qstat -a|| sinfo|| bqueues|| qhost -q|| llstatus -L cluster &lt;br /&gt;
|-&lt;br /&gt;
||GUI|| xpbsmon|| sview|| xlsf OR xlsbatch|| qmon|| xloadl &lt;br /&gt;
|-&lt;br /&gt;
||&#039;&#039;&#039;Environment&#039;&#039;&#039;||&#039;&#039;&#039;PBS/Torque&#039;&#039;&#039;||&#039;&#039;&#039;Slurm&#039;&#039;&#039;||&#039;&#039;&#039;LSF&#039;&#039;&#039;||&#039;&#039;&#039;SGE&#039;&#039;&#039;||&#039;&#039;&#039;LoadLeveler&#039;&#039;&#039; &lt;br /&gt;
|-&lt;br /&gt;
||Job ID|| $PBS_JOBID|| $SLURM_JOBID|| $LSB_JOBID|| $JOB_ID|| $LOADL_STEP_ID &lt;br /&gt;
|-&lt;br /&gt;
||Submit Directory|| $PBS_O_WORKDIR|| $SLURM_SUBMIT_DIR|| $LSB_SUBCWD|| $SGE_O_WORKDIR|| $LOADL_STEP_INITDIR &lt;br /&gt;
|-&lt;br /&gt;
||Submit Host|| $PBS_O_HOST|| $SLURM_SUBMIT_HOST|| $LSB_SUB_HOST|| $SGE_O_HOST ||&lt;br /&gt;
|-&lt;br /&gt;
||Node List|| $PBS_NODEFILE|| $SLURM_JOB_NODELIST|| $LSB_HOSTS/$LSB_MCPU_HOSTS|| $PE_HOSTFILE|| $LOADL_PROCESSOR_LIST &lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=SLURM_Compare&amp;diff=867</id>
		<title>SLURM Compare</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=SLURM_Compare&amp;diff=867"/>
		<updated>2013-12-10T11:47:37Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Rosetta Stone of Workload Managers ===&lt;br /&gt;
&lt;br /&gt;
PBS/Torque, Slurm, LSF, SGE and LoadLeveler [http://slurm.schedmd.com/rosetta.html Rosetta Stone]&lt;br /&gt;
&lt;br /&gt;
This table lists the most common commands, environment variables, and job specification options used by the major workload management systems: PBS/Torque, Slurm, LSF, SGE, and LoadLeveler. Each of these workload managers has unique features, but the functionality listed in the table is available in all of them. This should be considered a work in progress; contributions to improve the document are welcome.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!User Commands!!PBS/Torque!!Slurm!!LSF!!SGE!!LoadLeveler&lt;br /&gt;
|-&lt;br /&gt;
||Job submission|| qsub [script_file]|| sbatch [script_file]|| bsub [script_file]|| qsub [script_file]|| llsubmit [script_file] &lt;br /&gt;
|-&lt;br /&gt;
||Job deletion ||qdel [job_id]|| scancel [job_id]|| bkill [job_id]|| qdel [job_id]|| llcancel [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job status (by job)|| qstat [job_id]|| squeue -j [job_id]|| bjobs [job_id]|| qstat -u \* [-j job_id]|| llq [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job status (by user)|| qstat -u [user_name]|| squeue -u [user_name]|| bjobs -u [user_name]|| qstat [-u user_name]|| llq -u [user_name] &lt;br /&gt;
|-&lt;br /&gt;
||Job hold ||qhold [job_id]|| scontrol hold [job_id]|| bstop [job_id]|| qhold [job_id]|| llhold [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job release|| qrls [job_id]|| scontrol release [job_id]|| bresume [job_id]|| qrls [job_id]|| llhold -r [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Queue list|| qstat -Q|| squeue ||bqueues ||qconf -sql|| llclass &lt;br /&gt;
|-&lt;br /&gt;
||Node list ||pbsnodes -l|| sinfo -N OR scontrol show nodes|| bhosts|| qhost|| llstatus -L machine &lt;br /&gt;
|-&lt;br /&gt;
||Cluster status|| qstat -a|| sinfo|| bqueues|| qhost -q|| llstatus -L cluster &lt;br /&gt;
|-&lt;br /&gt;
||GUI|| xpbsmon|| sview|| xlsf OR xlsbatch|| qmon|| xloadl &lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
	<entry>
		<id>https://wiki.anunna.wur.nl/index.php?title=SLURM_Compare&amp;diff=866</id>
		<title>SLURM Compare</title>
		<link rel="alternate" type="text/html" href="https://wiki.anunna.wur.nl/index.php?title=SLURM_Compare&amp;diff=866"/>
		<updated>2013-12-10T11:41:07Z</updated>

		<summary type="html">&lt;p&gt;Bohme001: /* Rosetta Stone of Workload Managers */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=== Rosetta Stone of Workload Managers ===&lt;br /&gt;
&lt;br /&gt;
PBS/Torque, Slurm, LSF, SGE and LoadLeveler&lt;br /&gt;
&lt;br /&gt;
This table lists the most common commands, environment variables, and job specification options used by the major workload management systems: PBS/Torque, Slurm, LSF, SGE, and LoadLeveler. Each of these workload managers has unique features, but the functionality listed in the table is available in all of them. This should be considered a work in progress; contributions to improve the document are welcome.&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
!User Commands!!PBS/Torque!!Slurm!!LSF!!SGE!!LoadLeveler&lt;br /&gt;
|-&lt;br /&gt;
||Job submission|| qsub [script_file]|| sbatch [script_file]|| bsub [script_file]|| qsub [script_file]|| llsubmit [script_file] &lt;br /&gt;
|-&lt;br /&gt;
||Job deletion ||qdel [job_id]|| scancel [job_id]|| bkill [job_id]|| qdel [job_id]|| llcancel [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job status (by job)|| qstat [job_id]|| squeue -j [job_id]|| bjobs [job_id]|| qstat -u \* [-j job_id]|| llq [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job status (by user)|| qstat -u [user_name]|| squeue -u [user_name]|| bjobs -u [user_name]|| qstat [-u user_name]|| llq -u [user_name] &lt;br /&gt;
|-&lt;br /&gt;
||Job hold ||qhold [job_id]|| scontrol hold [job_id]|| bstop [job_id]|| qhold [job_id]|| llhold [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Job release|| qrls [job_id]|| scontrol release [job_id]|| bresume [job_id]|| qrls [job_id]|| llhold -r [job_id] &lt;br /&gt;
|-&lt;br /&gt;
||Queue list|| qstat -Q|| squeue ||bqueues ||qconf -sql|| llclass &lt;br /&gt;
|-&lt;br /&gt;
||Node list ||pbsnodes -l|| sinfo -N OR scontrol show nodes|| bhosts|| qhost|| llstatus -L machine &lt;br /&gt;
|-&lt;br /&gt;
||Cluster status|| qstat -a|| sinfo|| bqueues|| qhost -q|| llstatus -L cluster &lt;br /&gt;
|-&lt;br /&gt;
||GUI|| xpbsmon|| sview|| xlsf OR xlsbatch|| qmon|| xloadl &lt;br /&gt;
|-&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Bohme001</name></author>
	</entry>
</feed>