HPCwiki - User contributions [en]

Main Page

2022-06-02T13:44:51Z

Dawes001: /* Events */

Anunna is a [http://en.wikipedia.org/wiki/High-performance_computing High Performance Computer] (HPC) infrastructure hosted by [http://www.wageningenur.nl/nl/activiteit/Opening-High-Performance-Computing-cluster-HPC.htm Wageningen University & Research Centre]. It is open for use for all WUR research groups as well as other organizations, including companies, that have collaborative projects with WUR.

= Using Anunna =
* [[Tariffs | Costs associated with resource usage]]

== Gaining access to Anunna==
Access to the cluster and file transfer are traditionally done via [http://en.wikipedia.org/wiki/Secure_Shell SSH and SFTP].
* [[log_in_to_B4F_cluster | Logging into cluster using ssh]]
* [[file_transfer | File transfer options]]
* [[Services | Alternative access methods, and extra features and services on Anunna]]
* [[Filesystems | Data storage methods on Anunna]]

== Access Policy ==
[[Access_Policy | Main Article: Access Policy]]

Access must be granted explicitly: FB-IT creates an account on the cluster for you. Use of resources is limited by the scheduler, and your priority for the system's resources depends on which queues ('partitions') you have been granted. Note that the use of Anunna is not free of charge. The list price of CPU time and storage, and any discounts on that list price for your organisation, can be obtained from Shared Research Facilities or FB-IT.

= Events =
* Upcoming courses on 23rd + 30th June!

* Linux Basic - 23rd June

* HPC Basic - 30th June

* [[Courses]] that have happened and are happening
* [[Downtime]] that will affect all users
* [[Meetings]] that may affect the policies of Anunna

= Other Software =

== Cluster Management Software and Scheduler ==
Anunna uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]
* [[Using_Slurm | Submit jobs with Slurm]]
* [[node_usage_graph | Be aware of how much work the cluster is under right now with 'node_usage_graph']]
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]

== Installation of software by users ==

* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]
* [[Setting local variables]]
* [[Installing_R_packages_locally | Installing R packages locally]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]
* [[Virtual_environment_Python_3.4_or_higher | Setting up and using a virtual environment for Python3.4 or higher ]]
* [[Installing WRF and WPS]]
* [[Running scripts on a fixed timeschedule (cron)]]

== Installed software ==

* [[Globally_installed_software | Globally installed software]]
* [[ABGC_modules | ABGC specific modules]]

= Useful Notes =

== Being in control of Environment parameters ==

* [[Using_environment_modules | Using environment modules]]
* [[Setting local variables]]
* [[Setting_TMPDIR | Set a custom temporary directory location]]
* [[Installing_R_packages_locally | Installing R packages locally]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== Controlling costs ==

* [[SACCT | using SACCT to see your costs]]
* [[get_my_bill | using the "get_my_bill" script to estimate costs]]

== Management ==
The Product Owner of Anunna is Alexander van Ittersum (Wageningen UR, FB-IT, C&PS). [[User:dawes001 | Gwen Dawes (Wageningen UR, FB-IT, C&PS)]] and [[User:haars001 | Jan van Haarst (Wageningen UR, FB-IT, C&PS)]] are responsible for [[Maintenance_and_Management | Maintenance and Management]] of the cluster.

* [[Roadmap | Ambitions regarding innovation, support and administration of Anunna ]]

= Miscellaneous =
* [[Mailinglist | Electronic mail discussion lists]]
* [[History_of_the_Cluster | Historical information on the startup of Anunna]]
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]
* [[Parallel_R_code_on_SLURM | Running parallel R code on SLURM]]
* [[Convert_between_MediaWiki_and_other_formats | Convert between MediaWiki format and other formats]]
* [[Manual GitLab | GitLab: Create projects and add scripts]]
* [[Monitoring_executions | Monitoring job execution]]
* [[Shared_folders | Working with shared folders in the Lustre file system]]

= See also =
* [[Maintenance_and_Management | Maintenance and Management]]
* [[BCData | BCData]]
* [[Mailinglist | Electronic mail discussion lists]]
* [[About_ABGC | About ABGC]]
* [[Computer_cluster | High Performance Computing @ABGC]]
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]

= External links =
{| width="90%"
|- valign="top"
| width="30%" |
* [https://www.wur.nl/en/Value-Creation-Cooperation/Facilities/Wageningen-Shared-Research-Facilities/Our-facilities/Show/High-Performance-Computing-Cluster-HPC-Anunna.htm SRF offers an HPC facility]
| width="30%" |
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]
|}

Spark

2020-10-05T12:23:46Z

Dawes001: /* SPARK in Jupyter */

Apache Spark is a framework for distributing computation across multiple worker machines. It is widely seen as the successor to Hadoop's MapReduce engine, and allows a wider variety of code to be executed on the clustered resources. The only requirement for Spark to operate is that all workers can reach one another via TCP, so it can run on very simple resources, provided the code itself can be translated into the MapReduce paradigm.

== SPARK on HPC ==
In order to create a personal SPARK cluster, you must first request resources on the HPC. Use this example submission script to initialise your cluster:

<nowiki>#!/bin/bash
#SBATCH --time=<length>
#SBATCH --mem-per-cpu=4000
#SBATCH --nodes=<number of nodes>
#SBATCH --tasks-per-node=<number of workers per node>
#SBATCH --job-name="my spark cluster"
#SBATCH --qos=QOS

module load spark/3.0.1-2.7
module load python/3.8.5

source $SPARK_HOME/wur/start-spark

tail -f /dev/null</nowiki>

This will spawn a new cluster of your desired dimensions once resources are available. This spark module has been written to output its logs to your home directory, at:

/home/WUR/yourid/.spark/<jobid>/

In this folder you will find the raw logs of the master and all worker threads. By default the master will consume 1 GB of memory from the first process, so a single 4 GB 'cluster' will be provided with one 3 GB worker. You can adjust the CPU and memory use by changing the parameters in your batch script.
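The memory arithmetic above can be sketched in a few lines of Python. This is only an illustration under assumptions the page does not confirm: that every task becomes one worker holding the full <code>--mem-per-cpu</code> amount, and that the master's share (taken here as 1000 MB) comes out of the first task only. The function name <code>worker_memory_mb</code> is made up for this example.

```python
def worker_memory_mb(mem_per_cpu_mb, nodes, tasks_per_node, master_mb=1000):
    """Return the memory (in MB) assumed to be left for each worker.

    Assumption: one worker per task, and the master shares the first task,
    taking master_mb away from that task's worker.
    """
    total_tasks = nodes * tasks_per_node
    if total_tasks < 1:
        raise ValueError("at least one task is required")
    memory = [mem_per_cpu_mb] * total_tasks
    memory[0] -= master_mb  # the master runs inside the first task
    return memory


# One node, one task, 4000 MB per CPU: a single worker with 3000 MB left.
print(worker_memory_mb(4000, nodes=1, tasks_per_node=1))
# Two nodes with two tasks each: only the first worker is reduced.
print(worker_memory_mb(4000, nodes=2, tasks_per_node=2))
```

If the real module divides memory differently, the logs in the job folder are the authoritative source.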

Within the log folder you will find two special files: master and master-console. master always contains the URI of the current Spark cluster master access point, and master-console contains the URL of its web console.
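If you want to pick up these two values from a script rather than by hand, something like the following should do. It is a minimal sketch that assumes each file holds a single line of text; the helper name <code>read_spark_endpoints</code> is hypothetical.

```python
from pathlib import Path


def read_spark_endpoints(log_dir):
    """Read the master URI and console URL from a Spark job's log folder."""
    log_dir = Path(log_dir)
    # Both files are assumed to contain one line; strip the trailing newline.
    master_uri = (log_dir / "master").read_text().strip()
    console_url = (log_dir / "master-console").read_text().strip()
    return master_uri, console_url


# Example call, with the job id filled in by you:
# master_uri, console_url = read_spark_endpoints("/home/WUR/yourid/.spark/<jobid>")
```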

To access the web console, the easiest solution is to use the text-mode browser links, pointing it at the URL from master-console:

<nowiki>links http://myspark:8081</nowiki>

This will nicely render the page for you in the console. Ctrl-R reloads the page, q to quit.

There are two caveats to remember with this:

* The cluster exists (and consumes resources) until you cancel it with scancel <jobid>
* There is no security at all: any user of the HPC can access both the master and its console at any point if they know the host and port.

== Instant SPARK ==

You can also spin up clusters solely to execute scripts. Simply replace the last line from the example above:
<nowiki>tail -f /dev/null</nowiki>
with
<nowiki>spark-submit myscript.py</nowiki>

After the script has executed, the cluster will terminate automatically.

== SPARK in Jupyter ==

There is a kernel available for using Spark from Jupyter. All this does is to set up the correct path to the python version and the spark binaries for you. In order to set up your Context, your first cell for each notebook should be:
<nowiki>import os, pyspark
# read the master URI that start-spark wrote for the current cluster
with open(os.environ['HOME']+'/.spark/current/master') as f:
    conf = (pyspark.SparkConf()
            .setMaster(f.read().strip())
            .setAppName("MyName"))

sc = pyspark.SparkContext(conf=conf)</nowiki>

This uses the cluster master name from the master file in your job output, as described above. Subsequent cells will then have sc defined. Run this cell only once: attempting to reconnect will throw an error. The application will run until the kernel is terminated, and in the meantime it prevents other applications from being executed, so you may wish to terminate your kernel manually from the top bar in Jupyter to free resources.

As a teacher, you might want to put that master file somewhere else, such as /lustre/shared, so that students can connect to your cluster.
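One way students could then find the right cluster is to try the shared file first and fall back to their own. This is a rough sketch: the course path under /lustre/shared is invented for illustration, and <code>read_master_uri</code> is a hypothetical helper, not part of the Spark module.

```python
import os


def read_master_uri(candidates):
    """Return the master URI from the first candidate file that exists."""
    for path in candidates:
        if os.path.isfile(path):
            with open(path) as f:
                return f.read().strip()
    raise FileNotFoundError(
        "no Spark master file found in: " + ", ".join(candidates))


# Hypothetical locations: a course-wide file first, then the student's own cluster.
candidates = [
    "/lustre/shared/my_course/spark-master",
    os.path.join(os.path.expanduser("~"), ".spark", "current", "master"),
]

# In the first notebook cell this would replace the open(...) call:
# conf = pyspark.SparkConf().setMaster(read_master_uri(candidates))
```

Bear in mind the caveat about security: anyone who can read the shared file can connect to the cluster.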

Spark

2020-10-05T12:21:30Z

Dawes001: /* SPARK on HPC */


Conda for teaching

2020-05-27T09:03:03Z

Dawes001:

You are going to give a teaching course, and you need a specific code environment.

== Setup ==
First - find a good location that everyone can read (and not write). I'd suggest somewhere under <code>/cm/shared/apps/SHARED/</code> as a starting point - this allows everyone to access this location. It's important not to put anything secret there - it's a public resource, so please bear that in mind.

Next - create a folder for your environment:

<source lang='bash'>
mkdir /cm/shared/apps/SHARED/my_conda_env
# directories need both read and execute permission before others can browse them
chmod a+rx /cm/shared/apps/SHARED/my_conda_env
</source>
You may want to manipulate the permissions for this folder if someone is going to set this up with you. Consider the commands in [[Shared folders]].

Then, install Anaconda into it:

<source lang='bash'>
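wget https://repo.anaconda.com/archive/Anaconda3-YEAR.MONTH-Linux-x86_64.sh
# run the installer through bash, since the downloaded file is not executable;
# -f is added on the assumption that the installer refuses an existing prefix otherwise
bash Anaconda3-YEAR.MONTH-Linux-x86_64.sh -s -b -f -p /cm/shared/apps/SHARED/my_conda_env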
</source>

Now you have a working conda environment in this folder. You can manage it by running <code>/cm/shared/apps/SHARED/my_conda_env/bin/conda</code> directly or, as I would recommend, create a modulefile so that you can use it as your default.

Create the following example modulefile in a matching <code>/cm/shared/modulefiles/SHARED/my_conda_env</code>:

<source lang='tcl'>
#%Module -*- tcl -*-
##
## conda environment modulefile
##

set loadedmodules [split $::env(LOADEDMODULES) ":"]
# derive the environment path from the location of this modulefile
set modulepath [split $ModulesCurrentModulefile "/"]
set envpath [join [lrange $modulepath 4 end] "/"]

set root /cm/shared/apps/$envpath

proc ModulesHelp { } {
    # envpath is set at file level, so it must be declared global here
    global envpath

    puts stderr "\tThis module provides the conda environment at $envpath"
}

if { [module-info mode] != "whatis" } {
    puts stderr "[module-info mode] environment $envpath ."
}

module-whatis "Provides environment $envpath"

prepend-path PATH $root/bin
prepend-path LD_LIBRARY_PATH $root/lib
prepend-path LIBRARY_PATH $root/lib
prepend-path CPATH $root/include
prepend-path MANPATH $root/share/man
</source>

This will allow you to <code>module load SHARED/my_conda_env</code>, which puts this environment's <code>conda</code> first on your PATH.
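The modulefile works out where the environment lives from its own location, which is why it has to sit in a "matching" path. The same logic, mirrored in Python purely as an illustration (the function name is made up, and the hard-coded index 4 assumes modulefiles live directly under /cm/shared/modulefiles):

```python
def env_root_from_modulefile(modulefile, apps_root="/cm/shared/apps"):
    """Mirror the modulefile's Tcl logic for locating the environment."""
    # '/cm/shared/modulefiles/SHARED/my_conda_env' splits into
    # ['', 'cm', 'shared', 'modulefiles', 'SHARED', 'my_conda_env']
    parts = modulefile.split("/")
    env_parts = parts[4:]  # same as: lrange $modulepath 4 end
    return apps_root + "/" + "/".join(env_parts)


print(env_root_from_modulefile("/cm/shared/modulefiles/SHARED/my_conda_env"))
# prints /cm/shared/apps/SHARED/my_conda_env
```

So a modulefile stored anywhere else would point at the wrong folder unless the index is changed.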

== Jupyter Kernel ==

In order for students to be able to use this environment in jupyter, they will need a kernel definition.

Kernel definitions are usually a separate folder containing, in particular, a file called <code>kernel.json</code>, plus an icon that represents the kernel and any other helper code.

Create the following folder for students to copy from:

<source lang='bash'>
mkdir -p /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env
# a+rX makes everything readable and keeps the directories browsable
chmod -R a+rX /cm/shared/apps/SHARED/my_conda_env/kernel
</source>

Students will need to copy this folder into their home directory, specifically into <code>$HOME/.local/share/jupyter/kernels/</code>

Inside this folder, create the following <code>kernel.json</code> file. Make sure the paths match your actual environment path if you're using a different location!

=== Python Kernel ===

<source lang='bash'>
vim /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env/kernel.json
</source>
<source lang='json'>
{
    "env": {
        "PATH": "/cm/shared/apps/SHARED/my_conda_env/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
    },
    "language": "python",
    "argv": [
        "/cm/shared/apps/SHARED/my_conda_env/bin/python",
        "-m",
        "ipykernel",
        "-f",
        "{connection_file}"
    ],
    "display_name": "my_conda_env"
}
</source>

For Python, that's it. ipykernel is included in the Anaconda distribution, so nothing extra needs to be installed.
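Since the paths in <code>kernel.json</code> have to match the environment location, it can be less error-prone to generate the file from the prefix rather than edit it by hand. A minimal sketch, with hypothetical helper names, that reproduces the layout of the file above:

```python
import json
import os


def python_kernel_spec(env_prefix, display_name):
    """Build a Jupyter kernel spec dict for the Python inside env_prefix."""
    bin_dir = os.path.join(env_prefix, "bin")
    system_path = "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
    return {
        "env": {"PATH": bin_dir + ":" + system_path},
        "language": "python",
        "argv": [
            os.path.join(bin_dir, "python"),
            "-m", "ipykernel",
            "-f", "{connection_file}",
        ],
        "display_name": display_name,
    }


def write_kernel_spec(kernel_dir, spec):
    """Write spec as kernel.json inside kernel_dir, creating it if needed."""
    os.makedirs(kernel_dir, exist_ok=True)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as f:
        json.dump(spec, f, indent=4)
    return path


# Example call for the environment used on this page:
# prefix = "/cm/shared/apps/SHARED/my_conda_env"
# write_kernel_spec(prefix + "/kernel/my_conda_env",
#                   python_kernel_spec(prefix, "my_conda_env"))
```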

=== R Kernel ===

For an R kernel, you need to make sure that the IRkernel package is installed. This is the package that Jupyter uses to communicate with your running R kernel.
<source lang='bash'>
# Untested suggestion: the conda package for IRkernel is named r-irkernel
/cm/shared/apps/SHARED/my_conda_env/bin/conda install r-irkernel
</source>

You'll also need to create two files: <code>kernel.json</code> and <code>kernel.js</code>. The <code>kernel.js</code> file is a helper script that allows Jupyter to communicate with R effectively:

<source lang='bash'>
vim /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env/kernel.json
</source>
<source lang='json'>
{
    "env": {
        "LD_LIBRARY_PATH": "/cm/shared/apps/SHARED/my_conda_env/lib:/cm/shared/apps/SHARED/my_conda_env/lib64"
    },
    "argv": ["/cm/shared/apps/SHARED/my_conda_env/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
    "display_name": "MAE50806-AdvMolEcol/Sandbox_R",
    "language": "R"
}
</source>

<source lang='bash'>
vim /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env/kernel.js
</source>
<source lang='javascript'>
const cmd_key = /Mac/.test(navigator.platform) ? 'Cmd' : 'Ctrl'

const edit_actions = [
    {
        name: 'R Assign',
        shortcut: 'Alt--',
        icon: 'fa-long-arrow-left',
        help: 'R: Inserts the left-assign operator (<-)',
        handler(cm) {
            cm.replaceSelection(' <- ')
        },
    },
    {
        name: 'R Pipe',
        shortcut: `Shift-${cmd_key}-M`,
        icon: 'fa-angle-right',
        help: 'R: Inserts the magrittr pipe operator (%>%)',
        handler(cm) {
            cm.replaceSelection(' %>% ')
        },
    },
    {
        name: 'R Help',
        shortcut: 'F1',
        icon: 'fa-book',
        help: 'R: Shows the manpage for the item under the cursor',
        handler(cm, cell) {
            const {anchor, head} = cm.findWordAt(cm.getCursor())
            const word = cm.getRange(anchor, head)

            const callbacks = cell.get_callbacks()
            const options = {silent: false, store_history: false, stop_on_error: true}
            cell.last_msg_id = cell.notebook.kernel.execute(`help(\`${word}\`)`, callbacks, options)
        },
    },
]

const prefix = 'irkernel'

function add_edit_shortcut(notebook, actions, keyboard_manager, edit_action) {
    const {name, shortcut, icon, help, handler} = edit_action

    const action = {
        icon, help,
        help_index : 'zz',
        handler: () => {
            const cell = notebook.get_selected_cell()
            handler(cell.code_mirror, cell)
        },
    }

    const full_name = actions.register(action, name, prefix)

    Jupyter.keyboard_manager.edit_shortcuts.add_shortcut(shortcut, full_name)
}

function render_math(pager, html) {
    if (!html) return
    const $container = pager.pager_element.find('#pager-container')
    $container.find('p[style="text-align: center;"]').map((i, e) =>
        e.outerHTML = `\\[${e.querySelector('i').innerHTML}\\]`)
    $container.find('i').map((i, e) =>
        e.outerHTML = `\$${e.innerHTML}\$`)
    MathJax.Hub.Queue(['Typeset', MathJax.Hub, $container[0]])
}

define(['base/js/namespace'], ({
    notebook,
    actions,
    keyboard_manager,
    pager,
}) => ({
    onload() {
        edit_actions.forEach(a => add_edit_shortcut(notebook, actions, keyboard_manager, a))

        pager.events.on('open_with_text.Pager', (event, {data: {'text/html': html}}) =>
            render_math(pager, html))
    },
}))
</source>

== Last Steps ==

To help your students get their kernel definitions into <code>$HOME/.local/share/jupyter/kernels/</code>, it's probably a good idea to write a small, simple notebook that does this when executed, or else instruct them to run:

<source lang='bash'>
# create the target folder first, then copy using absolute paths so this works from any directory
mkdir -p "$HOME/.local/share/jupyter/kernels"
cp -rv /cm/shared/apps/SHARED/my_conda_env/kernel/* "$HOME/.local/share/jupyter/kernels/"
</source>
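For the notebook route, a single cell along these lines would probably be enough. It is a sketch rather than a tested recipe: <code>install_kernels</code> is a hypothetical helper, and <code>dirs_exist_ok</code> needs Python 3.8 or newer.

```python
import os
import shutil


def install_kernels(source_dir, target_dir=None):
    """Copy every kernel folder from source_dir into the user's kernels folder."""
    if target_dir is None:
        target_dir = os.path.join(
            os.path.expanduser("~"), ".local", "share", "jupyter", "kernels")
    os.makedirs(target_dir, exist_ok=True)
    installed = []
    for name in sorted(os.listdir(source_dir)):
        src = os.path.join(source_dir, name)
        if os.path.isdir(src):
            # dirs_exist_ok lets students re-run the cell to pick up updates
            shutil.copytree(src, os.path.join(target_dir, name), dirs_exist_ok=True)
            installed.append(name)
    return installed


# In the notebook, students would run:
# install_kernels("/cm/shared/apps/SHARED/my_conda_env/kernel")
```

The new kernel should then show up in Jupyter's kernel list; a page reload may be needed first.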

Conda for teaching

2020-05-27T09:01:02Z

Dawes001:


Conda for teaching

2020-05-27T08:59:13Z

Dawes001: /* Jupyter Kernel */

You are going to give a teaching course, and you need a specific code environment.

== Setup ==
First - find a good location that everyone can read (and not write). I'd suggest somewhere under /cm/shared/apps/SHARED/ as a starting point - this allows everyone to access this location. It's important not to put anything secret there - it's a public resource, so please bear that in mind.

Next - create a folder for your environment:

<source lang='bash'>
mkdir /cm/shared/apps/SHARED/my_conda_env
chmod +r /cm/shared/apps/SHARED/my_conda_env
</source>
You may want to manipulate the permissions for this folder if someone is going to set this up with you. Consider the commands in [[Shared Folders]].

Then, install Anaconda into it:

<source lang='bash'>
wget https://repo.anaconda.com/archive/Anaconda3-YEAR.MONTH-Linux-x86_64.sh
./Anaconda3-YEAR.MONTH-Linux-x86_64.sh -s -b -p /cm/shared/apps/SHARED/my_conda_env
</source>

Now you have a working conda environment in this folder. You can manipulate this here by running /cm/shared/apps/SHARED/my_conda_env/bin/conda , or, I would recommend creating a modulefile so that you can use it as default.

Create the following example modulefile in a matching /cm/shared/modulefiles/SHARED/my_conda_env :

<source lang='bash'>
#%Module -*- tcl -*-
##
## conda environment modulefile
##

set loadedmodules [split $::env(LOADEDMODULES) ":"]
set modulepath [split $ModulesCurrentModulefile "/"]
set envpath [lrange $modulepath 4 end]

set root /cm/shared/apps/[join $envpath "/"]

proc ModulesHelp { } {
global version

puts stderr "\tThis module provides the conda environment at $envpath"
}

if { [module-info mode] != "whatis" } {
puts stderr "[module-info mode] environent $envpath ."
}

module-whatis "Provides environment $envpath"

prepend-path PATH $root/bin
prepend-path LD_LIBRARY_PATH $root/lib
prepend-path LIBRARY_PATH $root/lib
prepend-path CPATH $root/include
prepend-path MANPATH $root/share/man
</source>

This will allow you to `module load SHARED/my_conda_env` and thus have `conda` pathed to the currently active environment.

== Jupyter Kernel ==

In order for students to be able to use this environment in jupyter, they will need a kernel definition.

Kernel definitions are usually a separate folder containing, in particular, a file called `kernel.json`, plus an icon that is displayed that represents this kernel, and other helper code.

The setup for this is that you should create the following folder for their access:

<source lang='bash'>
mkdir -p /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env
chmod +r /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env
</source>

This folder is something they will need to copy in place to their home directory, specifically $HOME/.local/share/jupyter/kernels/

Inside this folder, create the following kernel.json file. Watch out that the paths will need to match the current environment path if you're using a different location!

=== Python Kernel ===

<source lang='bash'>
vim /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env/kernel.json
</source>
<source lang='bash'>
{
"env": {
"PATH":

"/cm/shared/apps/SHARED/my_conda_env/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
},
"language": "python",
"argv": [
"/cm/shared/apps/SHARED/my_conda_env/bin/python",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"display_name": "my_conda_env"
}
</source>

For Python, that's it. Ipykernel is installed automatically on conda initialisation.

=== R Kernel ===

For an R kernel, you need to make sure that the IRkernel package is installed. This is the package that is used to communicate from Jupyter to your running R kernel.<source lang='bash'>
#MISSING EXAMPLE CODE
#PROBABLY /cm/shared/apps/SHARED/my_conda_env/bin/conda install R_irkernel
</source>

You'll also need to create two files. the kernel.js is a helper script to allow jupyter to communicate to R effectively:

<source lang='bash'>
vim /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env/kernel.json
</source>
<source lang='bash'>
{
"env": {
"LD_LIBRARY_PATH":
"/cm/shared/apps/SHARED/my_conda_env/lib:/cm/shared/apps/SHARED/my_conda_env/lib64"
},
"argv": ["/cm/shared/apps/SHARED/my_conda_env/bin/R", "--slave", "-e", "IRkernel::main()", "--args", "{connection_file}"],
"display_name": "MAE50806-AdvMolEcol/Sandbox_R",
"language": "R"
}
</source>

<source lang='bash'>
vim /cm/shared/apps/SHARED/my_conda_env/kernel/my_conda_env/kernel.js
</source>
<source lang='bash'>
const cmd_key = /Mac/.test(navigator.platform) ? 'Cmd' : 'Ctrl'

const edit_actions = [
{
name: 'R Assign',
shortcut: 'Alt--',
icon: 'fa-long-arrow-left',
help: 'R: Inserts the left-assign operator (<-)',
handler(cm) {
cm.replaceSelection(' <- ')
},
},
{
name: 'R Pipe',
shortcut: `Shift-${cmd_key}-M`,
icon: 'fa-angle-right',
help: 'R: Inserts the magrittr pipe operator (%>%)',
handler(cm) {
cm.replaceSelection(' %>% ')
},
},
{
name: 'R Help',
shortcut: 'F1',
icon: 'fa-book',
help: 'R: Shows the manpage for the item under the cursor',
handler(cm, cell) {
const {anchor, head} = cm.findWordAt(cm.getCursor())
const word = cm.getRange(anchor, head)

const callbacks = cell.get_callbacks()
const options = {silent: false, store_history: false, stop_on_error: true}
cell.last_msg_id = cell.notebook.kernel.execute(`help(\`${word}\`)`, callbacks, options)
},
},
]

const prefix = 'irkernel'

function add_edit_shortcut(notebook, actions, keyboard_manager, edit_action) {
const {name, shortcut, icon, help, handler} = edit_action

const action = {
icon, help,
help_index : 'zz',
handler: () => {
const cell = notebook.get_selected_cell()
handler(cell.code_mirror, cell)
},
}

const full_name = actions.register(action, name, prefix)

Jupyter.keyboard_manager.edit_shortcuts.add_shortcut(shortcut, full_name)
}

function render_math(pager, html) {
if (!html) return
const $container = pager.pager_element.find('#pager-container')
$container.find('p[style="text-align: center;"]').map((i, e) =>
e.outerHTML = `\\[${e.querySelector('i').innerHTML}\\]`)
$container.find('i').map((i, e) =>
e.outerHTML = `\$${e.innerHTML}\$`)
MathJax.Hub.Queue(['Typeset', MathJax.Hub, $container[0]])
}

define(['base/js/namespace'], ({
notebook,
actions,
keyboard_manager,
pager,
}) => ({
onload() {
edit_actions.forEach(a => add_edit_shortcut(notebook, actions, keyboard_manager, a))

pager.events.on('open_with_text.Pager', (event, {data: {'text/html': html}}) =>
render_math(pager, html))
},
}))
</source>

== Last Steps ==

To get the kernel definitions into each student's <code>$HOME/.local/share/jupyter/kernels/</code>, either write a small, simple notebook that copies them when executed, or instruct the students to run:

<source lang='bash'>
mkdir -p ~/.local/share/jupyter/kernels/
cp -rv /cm/shared/apps/SHARED/my_conda_env/kernel/* ~/.local/share/jupyter/kernels/
</source>
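
The copy step could also be done from a notebook cell. The following is only a minimal Python sketch of that idea: the function name <code>install_kernels</code> is hypothetical, the shared path is the example path used on this page, and <code>dirs_exist_ok</code> assumes Python 3.8 or newer.

```python
import shutil
from pathlib import Path

# Shared kernel definitions; this is the example location used on this page.
SHARED_KERNELS = Path("/cm/shared/apps/SHARED/my_conda_env/kernel")


def install_kernels(source=SHARED_KERNELS, home=None):
    """Copy every kernel directory under `source` into the user's Jupyter kernels folder.

    `install_kernels` is a hypothetical helper, not part of Jupyter itself.
    """
    home = Path(home) if home is not None else Path.home()
    target = home / ".local" / "share" / "jupyter" / "kernels"
    target.mkdir(parents=True, exist_ok=True)
    installed = []
    for kernel_dir in sorted(Path(source).iterdir()):
        if kernel_dir.is_dir():
            # dirs_exist_ok lets a student re-run the cell without errors (Python 3.8+).
            shutil.copytree(kernel_dir, target / kernel_dir.name, dirs_exist_ok=True)
            installed.append(kernel_dir.name)
    return installed
```

A student would then call <code>install_kernels()</code> once and reload the Jupyter page.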

Conda for teaching

2020-05-27T08:45:06Z

Dawes001:

Conda for teaching

2020-05-27T08:40:47Z

Dawes001: /* Jupyter Kernel */

Conda for teaching

2020-05-27T08:38:31Z

Dawes001: Created page with "You are going to give a teaching course, and you need a specific code environment. == Setup == First - find a good location that everyone can read (and not write). I'd sugges..."

Tariffs

2020-02-17T08:46:03Z

Dawes001:

== Computing: Calculations (cores)==
{| class="wikitable"
!Queue
!CPU core hour
!GB memory hour
|-
|Standard queue
|€ 0.0150
|€ 0.0015
|-
|High priority queue
|€ 0.0200
|€ 0.0020
|-
|Low priority queue
|€ 0.0100
|€ 0.0010
|}

== Computing: GPU Use==
{| class="wikitable"
!Tariff per device per hour (gpu/hour)
|-
|€ 0.3000
|}

== Storage ==
Tariffs per year per TB
{| class="wikitable"
!Lustre Backup
!Lustre Nobackup
!Lustre Scratch
!Home-dir
!Archive
|-
|€ 175
|€ 125
|€ 125
|€ 175
|€ 125
|}

== Reservations ==
{| class="wikitable"
!Tariff per node per day (node/day)
|-
|€ 30
|}

== Notes==

If you are a member of a group with a commitment, then these costs get deducted from that commitment. Typically we are fairly lax with enforcing limits - only once you get to around 150% of your commitment will we consider taking action (mainly coming to discuss things).

== Example ==

You are running a job that needs 4 cores and 32G of RAM, and runs for 90 minutes in the std quality. To run this, you over-request resources slightly, and submit a job that requests 4 CPUs, 40G of RAM and a time limit of 3 hours. Your job terminates early, after 1.5 hours, so the requested CPUs and memory are charged for those 1.5 hours. Thus, your costs are:

4 * 0.015 * 1.5 = 0.09 EUR for the CPU

40 * 0.0015 * 1.5 = 0.09 EUR for the memory

Total: 0.18 EUR
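
The same arithmetic as a small Python sketch. The tariffs are copied from the tables above; the short keys <code>std</code>, <code>high</code> and <code>low</code> and the function name <code>job_cost</code> are assumptions made for this illustration only.

```python
# Tariffs in EUR, copied from the tables above.
# The keys "std", "high" and "low" are shorthand chosen for this sketch.
CPU_CORE_HOUR = {"std": 0.0150, "high": 0.0200, "low": 0.0100}
GB_MEMORY_HOUR = {"std": 0.0015, "high": 0.0020, "low": 0.0010}


def job_cost(cpus, mem_gb, hours, qos="std"):
    """Return (cpu_cost, memory_cost, total) in EUR for the requested resources."""
    cpu_cost = cpus * CPU_CORE_HOUR[qos] * hours
    mem_cost = mem_gb * GB_MEMORY_HOUR[qos] * hours
    return cpu_cost, mem_cost, cpu_cost + mem_cost


# The example above: 4 CPUs and 40G requested, job ran for 1.5 hours.
cpu, mem, total = job_cost(cpus=4, mem_gb=40, hours=1.5)
print(f"CPU: {cpu:.2f} EUR, memory: {mem:.2f} EUR, total: {total:.2f} EUR")
```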

Setting up Python virtualenv

2019-12-11T15:32:51Z

Dawes001: /* Virtualenv kernels in Jupyter */

With many Python packages available, which often conflict or require different versions depending on the application, installing and controlling packages and versions is not always easy. In addition, many packages are used only occasionally, so it is questionable whether a system administrator of a centralized server system or a High Performance Compute (HPC) infrastructure can be expected to resolve all issues posed by users of the infrastructure. Even on a local system with full administrative rights, managing versions, dependencies, and package collisions is often very difficult. The solution is to use a virtual environment, in which a specific set of packages can be installed. As many virtual environments as necessary can be created and used side by side.

NOTE: as of Python 3.3 virtual environment support is built-in. See this page for an [[virtual_environment_Python_3.4_or_higher | alternative set-up of your virtual environment if using Python 3.4 or higher]].

== Creating a new virtual environment ==
It is assumed that the appropriate <code>virtualenv</code> executable for the Python version of choice is installed. A new virtual environment, in this case called <code>newenv</code>, is created like so:
<source lang='bash'>
module load python/my-favourite-version   # e.g. python/2.7.12
virtualenv newenv
# OR, for Python 3.4 and later
# (pyvenv was deprecated in Python 3.6; "python3 -m venv newenv" is the equivalent):
pyvenv newenv
</source>

When the new environment is created, one will see a message similar to this:
<nowiki> New python executable in newenv/bin/python3
Also creating executable in newenv/bin/python
Installing Setuptools.........................................................................done.
Installing Pip................................................................................done.</nowiki>

== Activating a virtual environment ==
Once the environment is created, it needs to be activated each time you want to use it, with the following command:
<source lang='bash'>
source newenv/bin/activate
</source>
This assumes that the folder that contains the virtual environment files (in this case called <code>newenv</code>) is in the present working directory.
When working in the virtual environment, its name is shown in parentheses in front of the <code>user-host-prompt</code> string.
<nowiki> (newenv)user@host:~$</nowiki>

== Installing modules on the virtual environment ==
Installing modules is the same as usual. The difference is that modules end up in <code>/path/to/virtenv/lib</code>, which may live somewhere in your home directory. When working from the virtual environment, the default <code>pip</code> belongs to the Python version that is currently active, because the executables in <code>/path/to/virtenv/bin</code> come first in the <code>$PATH</code>.
<source lang='bash'>
pip install numpy
</source>
Similarly, installing packages from source works exactly the same as usual.
<source lang='bash'>
python setup.py install
</source>

== Deactivating a virtual environment ==
Quitting a virtual environment can be done by using the command <code>deactivate</code>, which was loaded using the <code>source</code> command upon activating the virtual environment.
<source lang='bash'>
deactivate
</source>

== Virtualenv kernels in Jupyter ==
Want your own virtualenv kernel in a notebook? This can be done by making your own kernel specifications:

* Make sure you have the ipykernel module in your venv. Activate it and pip install it:
<nowiki>source ~/path/to/my/virtualenv/bin/activate && pip install ipykernel</nowiki>
* Create the following directory path in your homedir if it doesn't already exist:
<nowiki>mkdir -p ~/.local/share/jupyter/kernels/</nowiki>
* Think of a nice descriptive name that doesn't clash with one of the already present kernels. I'll use 'testing'. Create this folder:
<nowiki>mkdir ~/.local/share/jupyter/kernels/testing/</nowiki>
* Add this file to this folder:
<nowiki>vi ~/.local/share/jupyter/kernels/testing/kernel.json
{
"language": "python",
"argv": [
"/home/myhome/path/to/my/virtualenv/bin/python",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"display_name": "testing"
}</nowiki>
* Reload the JupyterHub page. <code>testing</code> should now exist in your kernels list.
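
The steps above can also be scripted. This is a minimal sketch, not an official Jupyter tool: the function name <code>write_kernel_spec</code> is hypothetical, and the interpreter path is the placeholder from the example above.

```python
import json
from pathlib import Path


def write_kernel_spec(venv_python, name, kernels_dir=None):
    """Write a kernel.json that starts ipykernel with the given interpreter.

    `write_kernel_spec` is a hypothetical helper written for this page.
    """
    if kernels_dir is None:
        kernels_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels"
    kernel_dir = Path(kernels_dir) / name
    kernel_dir.mkdir(parents=True, exist_ok=True)
    spec = {
        "language": "python",
        # Use an absolute path: argv is not passed through a shell, so "~" is not expanded.
        "argv": [str(venv_python), "-m", "ipykernel", "-f", "{connection_file}"],
        "display_name": name,
    }
    spec_file = kernel_dir / "kernel.json"
    spec_file.write_text(json.dumps(spec, indent=1))
    return spec_file
```

For example, <code>write_kernel_spec("/home/myhome/path/to/my/virtualenv/bin/python", "testing")</code> writes the same file as shown above. Alternatively, running <code>python -m ipykernel install --user --name testing</code> from inside the activated virtualenv should create an equivalent kernel definition.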

You can do more complex things with this, such as construct your own Spark environment. This relies on having the module findspark installed:
<nowiki> vi ~/.local/share/jupyter/kernels/mysparkkernel/kernel.json
{
"language": "python",
"env": {
"SPARK_HOME":
"/cm/shared/apps/spark/my-spark-version"
},
"argv": [
"/home/myhome/my/spark/venv/bin/python",
"-m",
"ipykernel",
"-c", "import findspark; findspark.init()",
"-f",
"{connection_file}"
],
"display_name": "My Spark kernel"
}</nowiki>
(You'll want to make sure your spark cluster has the same environment - start it after activating this venv inside your sbatch script)

== Make IPython work under virtualenv ==
IPython may not work initially under a virtual environment. It may produce an error message like below:

<nowiki> File "/usr/bin/ipython", line 11
print "Could not start qtconsole. Please install ipython-qtconsole"
^</nowiki>

This can be resolved by adding a soft link with the name <code>ipython</code> to the <code>bin</code> directory in the virtual environment folder.
<source lang='bash'>
ln -s /path/to/virtenv/bin/ipython3 /path/to/virtenv/bin/ipython
</source>

== External links ==
* [https://pypi.python.org/pypi/virtualenv Python3 documentation for virtualenv]
* [http://cemcfarland.wordpress.com/2013/03/09/getting-ipython3-working-inside-your-virtualenv/ Solving the IPython hiccup under virtual environment]

Setting up Python virtualenv

2019-12-11T15:32:23Z

Dawes001: /* Virtualenv kernels in Jupyter */

Using Slurm

2019-10-17T15:19:26Z

Dawes001: /* Batch script */

The resource allocation / scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.

== Queues and defaults ==

=== Quality of Service ===
When submitting a job, you may optionally assign a different Quality of Service to it. You can do this with:

<source lang='bash'>
#SBATCH --qos=std
</source>

By default, jobs will use std, the standard quality.

Optionally, you may elect to reduce the priority of your jobs to low. This comes with a limit of how long each job can be (8h) to prevent the cluster from being locked up entirely with low priority jobs.

The high quality provides a higher priority to jobs (20) than std (10), or low (1). It is naturally more expensive.

The highest priority goes to jobs in the interactive quality (100), but you may not submit many jobs, or many large jobs, in this quality. It is intended exclusively for jobs that need to start immediately and have a hands-on user behind them.

Jobs may be restarted and rescheduled if a job with higher priority needs cluster resources, but as of right now, this is not occurring.

=== Queues ===
The cluster consists of multiple partitions of nodes that you can submit to. The primary one is 'main'. There are other partitions as needed - current plans include 'gpu'.

You can see the available partitions with <code>sinfo</code>.

=== Defaults ===
The default partition is 'main'. This will work for most jobs.

The default qos is 'std'.

The default cpu count is 1.

The default run time for a job is '''1 hour'''.

The default memory limit is '''100MB per node'''.

== Submitting jobs: sbatch ==

=== Example ===
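Consider this simple Python 3 script, which evaluates the Bailey-Borwein-Plouffe series for Pi with a working precision of 10 million digits. Note that 411 terms of this series are only accurate to roughly the first 490 digits; the very high precision mostly makes the script a CPU-heavy test job:
<source lang='python'>
from decimal import Decimal, getcontext

D = Decimal
# Working precision in digits; every division below is carried out at this precision.
getcontext().prec = 10000000
# Sum the first 411 terms of the Bailey-Borwein-Plouffe series for Pi.
p = sum(D(1)/16**k * (D(4)/(8*k+1) - D(2)/(8*k+4) - D(1)/(8*k+5) - D(1)/(8*k+6)) for k in range(411))
# Print the result, truncated to at most 10,000,002 characters.
print(str(p)[:10000002])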
</source>

=== Loading modules ===
For this script to run, Python 3, which is not the default Python version on the cluster, first needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:
module avail

The list should show that Python 3 is indeed available. It can then be loaded with the following command:
module load python/3.3.3

=== Batch script ===
[[Creating_sbatch_script | Main Article: Creating a sbatch script]]

The following shell/slurm script can then be used to schedule the job using the sbatch command:
<source lang='bash'>
#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --cpus-per-task=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

time python3 calc_pi.py
</source>

=== Submitting ===
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'>
sbatch run_calc_pi.sh
</source>

=== Submitting multiple jobs (simple) ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:
<source lang='bash'>for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done
</source>

=== Submitting multiple jobs (complex) ===
Let's say you have three job scripts that depend on each other:

<source lang='bash'>job_1.sh #A simple initialisation script</source>
<source lang='bash'>job_2.sh #An array task</source>
<source lang='bash'>job_3.sh #Some finishing script, single run, after everything previous has finished</source>

You can create a script that submits all three jobs at once, each with a dependency on the previous one:

<source lang='bash'>#!/bin/bash
# sbatch prints "Submitted batch job <jobid>"; keep only the last space-separated element
JOB1=$(sbatch job_1.sh | rev | cut -d ' ' -f 1 | rev)

if [ -n "$JOB1" ]; then
    echo "First job submitted as jobid $JOB1"
    JOB2=$(sbatch --dependency=afterany:$JOB1 job_2.sh | rev | cut -d ' ' -f 1 | rev)

    if [ -n "$JOB2" ]; then
        echo "Second job submitted as jobid $JOB2, following $JOB1"
        JOB3=$(sbatch --dependency=afterany:$JOB2 job_3.sh | rev | cut -d ' ' -f 1 | rev)

        if [ -n "$JOB3" ]; then
            echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"
        fi
    fi
fi
</source>

This ensures that each subsequent job starts only after the previous one has finished, whether it succeeded or failed.

Please see [https://slurm.schedmd.com/sbatch.html#OPT_dependency the sbatch documentation] for the other dependency options available to you. Note that <code>aftercorr</code> makes each element of a subsequent array job start after the correspondingly numbered element of the previous array job has completed successfully.
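
The same chain can be driven from Python instead of bash. This is a rough sketch under the assumption that <code>sbatch</code> answers with a line of the form "Submitted batch job <jobid>", as in the examples on this page; the function names are hypothetical.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the job ID from a line such as 'Submitted batch job 740338'."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return match.group(1)


def sbatch_command(script, after=None, kind="afterany"):
    """Build the sbatch argument list, optionally depending on job `after`."""
    command = ["sbatch"]
    if after is not None:
        command.append(f"--dependency={kind}:{after}")
    command.append(script)
    return command


def submit_chain(scripts, kind="afterany"):
    """Submit the scripts in order, each one depending on the previous job."""
    job_ids = []
    previous = None
    for script in scripts:
        result = subprocess.run(
            sbatch_command(script, previous, kind),
            check=True, capture_output=True, text=True,
        )
        previous = parse_job_id(result.stdout)
        job_ids.append(previous)
    return job_ids
```

Calling <code>submit_chain(["job_1.sh", "job_2.sh", "job_3.sh"])</code> on the login node would submit the three example scripts. Because <code>check=True</code> is set, the chain stops with an exception as soon as one submission fails.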

=== Submitting array jobs ===
<source lang='bash'>
#SBATCH --array=0-10%4
</source>
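SLURM allows you to submit multiple jobs using the same template. The example above creates array tasks with indices 0 through 10, of which at most 4 run at the same time. Further information about this can be found [[Array_jobs|here]].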
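
Inside each array task, Slurm sets the environment variable <code>SLURM_ARRAY_TASK_ID</code> to the index of that task. A minimal sketch of how a Python script could use it to pick its own input; the file names and the function <code>pick_input</code> are made up for this illustration.

```python
import os

# Hypothetical inputs: one per array index 0..10, matching --array=0-10%4.
INPUTS = [f"sample_{i:02d}.txt" for i in range(11)]


def pick_input(inputs, environ=os.environ):
    """Return the input that belongs to this array task."""
    # Slurm sets SLURM_ARRAY_TASK_ID for every task of an array job.
    task_id = int(environ["SLURM_ARRAY_TASK_ID"])
    return inputs[task_id]
```

The job script would call <code>pick_input(INPUTS)</code> and process only that file, so all 11 tasks can share one template.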

=== Using /tmp ===
Each node has a local disk of ~300G attached that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after usage.

In order to be sure that you're able to use space in /tmp, you can add
<source lang='bash'>
#SBATCH --tmp=<required size>
</source>
to your sbatch script. This prevents your job from being run on nodes where there is not enough free space, or where that space is set to be used by another job at the same time.
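
Cleaning up is easy to forget, so it can help to let the job do it. A minimal sketch, assuming your workload can be wrapped in a Python function; <code>run_with_local_scratch</code> is a hypothetical helper, not something provided on the cluster.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(input_file, work, scratch_root="/tmp"):
    """Stage input_file in a private directory under scratch_root and run work() on the copy.

    The directory is removed afterwards, even if work() raises an exception.
    """
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        local_copy = Path(scratch) / Path(input_file).name
        shutil.copy2(input_file, local_copy)
        return work(local_copy)
```

Here <code>work</code> is any function that takes the path of the staged copy; results that need to be kept should be written back to shared storage before it returns.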

=== Using GPU ===
There are two GPU nodes. To run a job that uses a GPU on one of these nodes, you can add

<source lang='bash'>
#SBATCH --gres=gpu:<num gpus>
#SBATCH --constraint=<gpu flavour e.g. K80, V100>
#SBATCH --partition=gpu
</source>
to your sbatch script. Without these parameters, your job won't run on one of these nodes.

== Monitoring submitted jobs ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.

=== Generic monitoring of all running jobs ===
<source lang='bash'>
squeue
</source>

You should then get a list of the jobs that are running on the cluster at that time. It may look like so:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3396 ABGC BOV-WUR- megen002 R 27:26 1 node004
3397 ABGC BOV-WUR- megen002 R 27:26 1 node005
3398 ABGC BOV-WUR- megen002 R 27:26 1 node006
3399 ABGC BOV-WUR- megen002 R 27:26 1 node007
3400 ABGC BOV-WUR- megen002 R 27:26 1 node008
3401 ABGC BOV-WUR- megen002 R 27:26 1 node009
3385 research BOV-WUR- megen002 R 44:38 1 node049
3386 research BOV-WUR- megen002 R 44:38 1 node050
3387 research BOV-WUR- megen002 R 44:38 1 node051
3388 research BOV-WUR- megen002 R 44:38 1 node052
3389 research BOV-WUR- megen002 R 44:38 1 node053
3390 research BOV-WUR- megen002 R 44:38 1 node054
3391 research BOV-WUR- megen002 R 44:38 3 node[049-051]
3392 research BOV-WUR- megen002 R 44:38 3 node[052-054]
3393 research BOV-WUR- megen002 R 44:38 1 node001
3394 research BOV-WUR- megen002 R 44:38 1 node002
3395 research BOV-WUR- megen002 R 44:38 1 node003

=== Monitoring time limit set for a specific job ===
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit that is set for a certain job can be shown using the <code>squeue</code> command.
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
Fri Nov 29 15:41:00 2013
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
3532 ABGC BOV-WUR- megen002 RUNNING 2:47:03 3-08:00:00 1 node054
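
To work out how much time a job has left, the TIME and TIMELIMIT columns can be converted to seconds. A small sketch that only handles the forms squeue prints in the examples on this page (<code>[days-]hours:minutes:seconds</code> and <code>minutes:seconds</code>); the function name is hypothetical.

```python
def slurm_time_to_seconds(text):
    """Convert a time as printed by squeue to seconds.

    Handles [days-]hours:minutes:seconds and minutes:seconds only.
    """
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    fields = [int(field) for field in text.split(":")]
    while len(fields) < 3:
        # Pad missing leading fields (hours) with zeros.
        fields.insert(0, 0)
    hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Values taken from the squeue output above.
limit = slurm_time_to_seconds("3-08:00:00")
used = slurm_time_to_seconds("2:47:03")
print(f"{(limit - used) / 3600:.1f} hours left")  # prints: 77.2 hours left
```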

=== Query a specific active job: scontrol ===
Show all the details of a currently active job (this does not work for a completed job).
<source lang='bash'>
login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
UserId=megen002(16795409) GroupId=domain users(16777729)
Priority=1 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=research AllocNode:Sid=login0:21799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node023
BatchHost=node023
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/scratch/WUR/ABGC/...
WorkDir=/lustre/scratch/WUR/ABGC/...
</source>

=== Check on a pending job ===
A submitted job can end up in a pending state when there are not enough resources available for it.
In this example I submit a job, check its status and, after finding out it is '''pending''', check when it will probably start.
<source lang='bash'>
[@login jobs]$ sbatch hpl_student.job
Submitted batch job 740338

[@login jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)

[@login jobs]$ squeue --start -j 740338
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
</source>
So it seems this job will probably start the next day, but that is no guarantee that it actually will.

== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
<source lang='bash'>
scancel 3401
</source>

== Allocating resources interactively: sinteractive ==
sinteractive is a tiny wrapper around srun to create interactive jobs quickly and easily. It gives you a shell on one of the nodes, with limits similar to those of a normal job. To use it, simply run:
<source lang='bash'>
sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>
</source>
You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code in an interactive fashion as needs be.

Be advised though: not filling in the above fields will get you a shell with 1 CPU and 100MB of RAM for 1 hour. This is still useful for quick testing.

=== sinteractive source ===
<source lang='bash'>
#!/bin/bash
srun "$@" -I60 -N 1 -n 1 --pty bash -i
</source>

=== interactive Slurm - using salloc ===
If you don't want your shell to be moved to a compute node, but want an allocation that you can launch tasks into from the login node, do:
<source lang='bash'>
salloc -p ABGC_Low $SHELL
</source>
Now your shell will stay on the login node, but you can do:
<source lang='bash'>
srun <command> &
</source>
to submit tasks to this allocation.

Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, you need to request a longer limit (for example with <code>--time</code>). See: https://computing.llnl.gov/linux/slurm/salloc.html

== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<source lang='bash'>
sacct
</source>
This should provide information similar to the following:

JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3385 BOV-WUR-58 research 12 COMPLETED 0:0
3385.batch batch 1 COMPLETED 0:0
3386 BOV-WUR-59 research 12 CANCELLED+ 0:0
3386.batch batch 1 CANCELLED 0:15
3528 BOV-WUR-59 ABGC 16 RUNNING 0:0
3529 BOV-WUR-60 ABGC 16 RUNNING 0:0

Or in more detail for a specific job:
<source lang='bash'>
sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</source>
This should provide information about job id 4220:

JobID JobName Comment Partition NTasks AllocCPUS Elapsed State ExitCode
------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
4220 PreProces+ research 3 00:30:52 COMPLETED 0:0
4220.batch batch 1 1 00:30:52 COMPLETED 0:0

'''Job Status Codes'''

Typically your job will be in either the Running state or the PenDing state. However, here is a breakdown of all the states that your job could be in.

{| class="wikitable"
|-
!Code!!State!!Description
|-
|CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
|-
|CD|| COMPLETED|| Job has terminated all processes on all nodes.
|-
|CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
|-
|CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
|-
|F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
|-
|NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
|-
|PD|| PENDING|| Job is awaiting resource allocation.
|-
|R|| RUNNING|| Job currently has an allocation.
|-
|S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
|-
|TO|| TIMEOUT|| Job terminated upon reaching its time limit.
|}
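
The table can be used to summarise <code>squeue</code> output. A minimal sketch, assuming the default column layout shown earlier on this page and job names without spaces; <code>count_states</code> is a hypothetical helper.

```python
from collections import Counter

# Codes and state names copied from the table above.
JOB_STATES = {
    "CA": "CANCELLED", "CD": "COMPLETED", "CF": "CONFIGURING", "CG": "COMPLETING",
    "F": "FAILED", "NF": "NODE_FAIL", "PD": "PENDING", "R": "RUNNING",
    "S": "SUSPENDED", "TO": "TIMEOUT",
}


def count_states(squeue_output):
    """Count jobs per state in default squeue output (state code is in the ST column)."""
    lines = squeue_output.strip().splitlines()
    # Locate the ST column from the header line.
    column = lines[0].split().index("ST")
    counts = Counter()
    for line in lines[1:]:
        code = line.split()[column]
        # Unknown codes are kept as-is rather than dropped.
        counts[JOB_STATES.get(code, code)] += 1
    return dict(counts)
```

Feeding it the text printed by <code>squeue</code> gives a dictionary such as <code>{"RUNNING": 17}</code> for the listing shown earlier.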

== Running MPI jobs on Anunna ==

[[MPI_on_B4F_cluster | Main article: MPI on Anunna]]

== See also ==
* [[Tariffs | Costs associated with resource usage]]
* [[B4F_cluster | Anunna]]
* [[BCM_on_B4F_cluster | BCM on Anunna]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]

Using Slurm

2019-10-02T08:00:14Z

Dawes001: /* Using GPU */

The resource allocation / scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.

== Queues and defaults ==

=== Quality of Service ===
When submitting a job, you may optionally assign a different Quality of Service to it. You can do this with:

<source lang='bash'>
#SBATCH --qos=std
</source>

By default, jobs will use std, the standard quality.

Optionally, you may elect to reduce the priority of your jobs to low. This comes with a limit of how long each job can be (8h) to prevent the cluster from being locked up entirely with low priority jobs.

The high quality provides a higher priority to jobs (20) than std (10), or low (1). It is naturally more expensive.

The highest priority goes to jobs in the interactive quality (100), but you may not submit many jobs, or many large jobs, in this quality. It is intended exclusively for jobs that need to start immediately and have a hands-on user behind them.

Jobs may be restarted and rescheduled if a job with higher priority needs cluster resources, but as of right now, this is not occurring.

=== Queues ===
The cluster consists of multiple partitions of nodes that you can submit to. The primary one is 'main'. There are other partitions as needed - current plans include 'gpu'.

You can see the available partitions with <code>sinfo</code>.

=== Defaults ===
The default partition is 'main'. This will work for most jobs.

The default qos is 'std'.

The default cpu count is 1.

The default run time for a job is '''1 hour'''.

The default memory limit is '''100MB per node'''.

== Submitting jobs: sbatch ==

=== Example ===
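Consider this simple Python 3 script, which evaluates the Bailey-Borwein-Plouffe series for Pi with a working precision of 10 million digits. Note that 411 terms of this series are only accurate to roughly the first 490 digits; the very high precision mostly makes the script a CPU-heavy test job:
<source lang='python'>
from decimal import Decimal, getcontext

D = Decimal
# Working precision in digits; every division below is carried out at this precision.
getcontext().prec = 10000000
# Sum the first 411 terms of the Bailey-Borwein-Plouffe series for Pi.
p = sum(D(1)/16**k * (D(4)/(8*k+1) - D(2)/(8*k+4) - D(1)/(8*k+5) - D(1)/(8*k+6)) for k in range(411))
# Print the result, truncated to at most 10,000,002 characters.
print(str(p)[:10000002])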
</source>

=== Loading modules ===
For this script to run, Python 3, which is not the default Python version on the cluster, first needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:
module avail

The list should show that Python 3 is indeed available. It can then be loaded with the following command:
module load python/3.3.3

=== Batch script ===
[[Creating_sbatch_script | Main Article: Creating a sbatch script]]

The following shell/slurm script can then be used to schedule the job using the sbatch command:
<source lang='bash'>
#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

time python3 calc_pi.py
</source>

=== Submitting ===
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'>
sbatch run_calc_pi.sh
</source>

=== Submitting multiple jobs (simple) ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted using the following line of shell code:
<source lang='bash'>for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done
</source>

=== Submitting multiple jobs (complex) ===
Let's say you have three job scripts that depend on each other:

<source lang='bash'>job_1.sh #A simple initialisation script</source>
<source lang='bash'>job_2.sh #An array task</source>
<source lang='bash'>job_3.sh #Some finishing script, single run, after everything previous has finished</source>

You can create a script that submits all three jobs at once, each depending on the previous one:

<source lang='bash'>#!/bin/bash
# sbatch prints "Submitted batch job <jobid>"; keep the last space-separated element
JOB1=$(sbatch job_1.sh | rev | cut -d ' ' -f 1 | rev)

if [ -n "$JOB1" ]; then
    echo "First job submitted as jobid $JOB1"
    JOB2=$(sbatch --dependency=afterany:"$JOB1" job_2.sh | rev | cut -d ' ' -f 1 | rev)

    if [ -n "$JOB2" ]; then
        echo "Second job submitted as jobid $JOB2, following $JOB1"
        JOB3=$(sbatch --dependency=afterany:"$JOB2" job_3.sh | rev | cut -d ' ' -f 1 | rev)

        if [ -n "$JOB3" ]; then
            echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"
        fi
    fi
fi
</source>

This ensures that each subsequent job starts only after the previous one has finished, even if it failed.

Please see [https://slurm.schedmd.com/sbatch.html#OPT_dependency the sbatch documentation] for other options available to you. Note that aftercorr makes each element of a subsequent array job start after the correspondingly numbered element of the previous array job has completed successfully.
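
As a minimal sketch of how another dependency type could be plugged into the chaining script above, the helper below builds the <code>--dependency</code> option from a dependency type and one or more job IDs. The function name <code>build_dependency</code> is hypothetical and not part of Slurm; <code>afterok</code> is the sbatch dependency type that lets a job start only after the listed jobs have finished with exit code zero.

<source lang='bash'>
# build_dependency TYPE JOBID [JOBID...]
# Prints an sbatch option such as --dependency=afterok:101:102.
# (Hypothetical helper, not a Slurm command.)
build_dependency() {
    local dep_type=$1
    shift
    # Refuse to build an option without at least one job ID
    [ "$#" -gt 0 ] || return 1
    # With IFS set to ':', "$*" joins all job IDs with colons
    local IFS=:
    printf '%s\n' "--dependency=${dep_type}:$*"
}

# Possible use in the chaining script:
#   JOB2=$(sbatch "$(build_dependency afterok "$JOB1")" job_2.sh | rev | cut -d ' ' -f 1 | rev)
</source>

With <code>afterok</code> instead of <code>afterany</code>, a failed first job would leave the later jobs pending rather than letting them run on incomplete results.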

=== Submitting array jobs ===
<source lang='bash'>
#SBATCH --array=0-10%4
</source>
SLURM allows you to submit multiple jobs using the same template. Further information about this can be found [[Array_jobs|here]].
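
In the example above, <code>--array=0-10%4</code> requests 11 tasks with indices 0 to 10, of which at most 4 run at the same time. Slurm passes the index to each task in the environment variable <code>SLURM_ARRAY_TASK_ID</code>. The sketch below shows one possible way to turn that index into an input file; the function name <code>input_for_task</code> and the list file <code>inputs.txt</code> are assumptions made for illustration only.

<source lang='bash'>
# input_for_task LIST_FILE TASK_ID
# Prints the line of LIST_FILE that belongs to a zero-based array index,
# so that index 0 selects the first line.
# (Hypothetical helper, not a Slurm command.)
input_for_task() {
    local list_file=$1
    local task_id=$2
    sed -n "$((task_id + 1))p" "$list_file"
}

# Inside the job script this could be called as:
#   input=$(input_for_task inputs.txt "$SLURM_ARRAY_TASK_ID")
</source>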

=== Using /tmp ===
Each node has a local disk of ~300G attached that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after use.

To make sure that your job gets the space it needs in /tmp, add
<source lang='bash'>
#SBATCH --tmp=<required size>
</source>
to your sbatch script. This prevents your job from running on nodes where that space is not free, or is already claimed by another job running at the same time.
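
One way to keep /tmp clean is to give every job its own directory and remove it when the work is done. The following is only a sketch under that assumption: the function name <code>run_in_scratch</code> and the variable <code>SCRATCH</code> are made up for this example, while <code>SLURM_JOB_ID</code> is set by Slurm inside a job.

<source lang='bash'>
# run_in_scratch COMMAND [ARGS...]
# Creates a private directory under /tmp, runs COMMAND with its path in
# $SCRATCH, and removes the directory afterwards.
# (Hypothetical helper, not a Slurm command.)
run_in_scratch() {
    local scratch status
    scratch=$(mktemp -d "/tmp/job_${SLURM_JOB_ID:-manual}_XXXXXX") || return 1
    SCRATCH=$scratch "$@"
    status=$?
    # Clean up whether or not the command succeeded
    rm -rf "$scratch"
    return "$status"
}

# Possible use in a job script:
#   run_in_scratch sh -c 'cp input.dat "$SCRATCH" && cd "$SCRATCH" && ./analyse input.dat'
</source>

Note that this cleanup does not run if the job is killed, for example when it reaches its time limit, so it is still worth checking /tmp by hand now and then.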

=== Using GPU ===
There are two GPU nodes. To run a job that uses a GPU on one of these nodes, add

<source lang='bash'>
#SBATCH --gres=gpu:<num gpus>
#SBATCH --constraint=<gpu flavour e.g. K80, V100>
#SBATCH --partition=gpu
</source>
to your sbatch script. Without these parameters, your job will not run on one of these nodes.

== Monitoring submitted jobs ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.

=== Generic monitoring of all running jobs ===
<source lang='bash'>
squeue
</source>

You should then get a list of the jobs that are running on the cluster at that time. It may look like this:
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    3396      ABGC BOV-WUR- megen002  R      27:26      1 node004
    3397      ABGC BOV-WUR- megen002  R      27:26      1 node005
    3398      ABGC BOV-WUR- megen002  R      27:26      1 node006
    3399      ABGC BOV-WUR- megen002  R      27:26      1 node007
    3400      ABGC BOV-WUR- megen002  R      27:26      1 node008
    3401      ABGC BOV-WUR- megen002  R      27:26      1 node009
    3385  research BOV-WUR- megen002  R      44:38      1 node049
    3386  research BOV-WUR- megen002  R      44:38      1 node050
    3387  research BOV-WUR- megen002  R      44:38      1 node051
    3388  research BOV-WUR- megen002  R      44:38      1 node052
    3389  research BOV-WUR- megen002  R      44:38      1 node053
    3390  research BOV-WUR- megen002  R      44:38      1 node054
    3391  research BOV-WUR- megen002  R      44:38      3 node[049-051]
    3392  research BOV-WUR- megen002  R      44:38      3 node[052-054]
    3393  research BOV-WUR- megen002  R      44:38      1 node001
    3394  research BOV-WUR- megen002  R      44:38      1 node002
    3395  research BOV-WUR- megen002  R      44:38      1 node003

=== Monitoring time limit set for a specific job ===
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit that is set for a specific job can be checked with the <code>squeue</code> command:
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
 Fri Nov 29 15:41:00 2013
   JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
    3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054
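
The TIME and TIMELIMIT columns use the form <code>[days-]hours:minutes:seconds</code>, whereas a bare number given to <code>--time</code>, such as the <code>--time=1200</code> in the batch script, is read as minutes (so 1200 means 20 hours). To compare the two columns it can help to convert them to seconds. The helper below is a sketch; the name <code>slurm_time_to_seconds</code> is hypothetical, and it only handles the forms shown in the squeue output above.

<source lang='bash'>
# slurm_time_to_seconds TIME
# Converts D-HH:MM:SS, HH:MM:SS or MM:SS to a number of seconds.
# Other values (for example UNLIMITED) make the function return non-zero.
# (Hypothetical helper, not a Slurm command.)
slurm_time_to_seconds() {
    echo "$1" | awk -F'[-:]' '
        /[^0-9:-]/ { exit 1 }   # reject anything that is not digits, ":" or "-"
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4; next }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3; next }
        NF == 2 { print $1 * 60 + $2; next }
        { exit 1 }'
}

# Example: seconds left for job 3532 in the output above
#   echo $(( $(slurm_time_to_seconds 3-08:00:00) - $(slurm_time_to_seconds 2:47:03) ))
</source>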

=== Query a specific active job: scontrol ===
Show all the details of a currently active job. This does not work for a job that has already completed.
<source lang='bash'>
[@login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
UserId=megen002(16795409) GroupId=domain users(16777729)
Priority=1 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=research AllocNode:Sid=login0:21799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node023
BatchHost=node023
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/scratch/WUR/ABGC/...
WorkDir=/lustre/scratch/WUR/ABGC/...
</source>

=== Check on a pending job ===
A submitted job can end up in a pending state when there are not enough resources available for it.
In this example I submit a job, check its status and, after finding out that it is '''pending''', check when it will probably start.
<source lang='bash'>
[@login jobs]$ sbatch hpl_student.job
Submitted batch job 740338

[@login jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)

[@login jobs]$ squeue --start -j 740338
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
</source>
So it seems this job will probably start the next day, but that is no guarantee that it actually will.

== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
<source lang='bash'>
scancel 3401
</source>

== Allocating resources interactively: sinteractive ==
sinteractive is a small wrapper around srun that creates interactive jobs quickly and easily. It gives you a shell on one of the nodes, with limits similar to those of a normal job. To use it, simply run:
<source lang='bash'>
sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>
</source>
You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code interactively as needed.

Be advised, though: if you do not fill in the above fields, you will get a shell with 1 CPU and 100MB of RAM for 1 hour. This is still useful for quick testing.

=== sinteractive source ===
<source lang='bash'>
#!/bin/bash
srun "$@" -I60 -N 1 -n 1 --pty bash -i
</source>

=== interactive Slurm - using salloc ===
If you want to keep your shell on the login node and only allocate resources, use salloc:
<source lang='bash'>
salloc -p ABGC_Low $SHELL
</source>
Your shell stays on the login node, but you can now run
<source lang='bash'>
srun <command> &
</source>
to launch tasks inside the new allocation.

Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, you need to request a longer time limit. See: https://computing.llnl.gov/linux/slurm/salloc.html

== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<source lang='bash'>
sacct
</source>
This should provide information similar to the following:

        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
 ------------ ---------- ---------- ---------- ---------- ---------- --------
 3385         BOV-WUR-58   research                    12  COMPLETED      0:0
 3385.batch        batch                                1  COMPLETED      0:0
 3386         BOV-WUR-59   research                    12 CANCELLED+      0:0
 3386.batch        batch                                1  CANCELLED     0:15
 3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0
 3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0

Or in more detail for a specific job:
<source lang='bash'>
sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</source>
This should provide information about job id 4220:

        JobID    JobName    Comment  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode
 ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
 4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0
 4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0

'''Job Status Codes'''

Typically your job will be in either the Running or the Pending state. However, here is a breakdown of all the states that your job could be in.

{| class="wikitable"
|-
!Code!!State!!Description
|-
|CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
|-
|CD|| COMPLETED|| Job has terminated all processes on all nodes.
|-
|CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
|-
|CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
|-
|F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
|-
|NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
|-
|PD|| PENDING|| Job is awaiting resource allocation.
|-
|R|| RUNNING|| Job currently has an allocation.
|-
|S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
|-
|TO|| TIMEOUT|| Job terminated upon reaching its time limit.
|-
|-
|}

== Running MPI jobs on Anunna ==

[[MPI_on_B4F_cluster | Main article: MPI on Anunna]]

== See also ==
* [[Tariffs | Costs associated with resource usage]]
* [[B4F_cluster | Anunna]]
* [[BCM_on_B4F_cluster | BCM on Anunna]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]

Using Slurm

2019-10-01T08:08:46Z

Dawes001: /* Using GPU */

The resource allocation / scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.

== Queues and defaults ==

=== Quality of Service ===
When submitting a job, you may optionally assign a different Quality of Service to it. You can do this with:

<source lang='bash'>
#SBATCH --qos=std
</source>

By default, jobs will use std, the standard quality.

Optionally, you may elect to reduce the priority of your jobs to low. This comes with a limit of how long each job can be (8h) to prevent the cluster from being locked up entirely with low priority jobs.

The high quality provides a higher priority to jobs (20) than std (10), or low (1). It is naturally more expensive.

The highest priority goes to jobs in interactive quality (100), but you may not submit many jobs or many large jobs as this quality. This is exclusively for the use of immediate running jobs, ones that are going to have hands-on users behind them.

Jobs may be restarted and rescheduled if a job with higher priority needs cluster resources, but as of right now, this is not occurring.

=== Queues ===
The cluster consists of multiple partitions of nodes that you can submit to. The primary one is 'main'. There are other partitions as needed - current plans include 'gpu'.

You can see the available partitions with <code>sinfo</code>.

=== Defaults ===
The default partition is 'main'. This will work for most jobs.

The default qos is 'std'.

The default cpu count is 1.

The default run time for a job is '''1 hour'''.

The default memory limit is '''100MB per node'''.

== Submitting jobs: sbatch ==

=== Example ===
Consider this simple python3 script that should calculate Pi to 1 million digits:
<source lang='python'>
from decimal import *
D=Decimal
getcontext().prec=10000000
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))
print(str(p)[:10000002])
</source>

=== Loading modules ===
For this script to run, Python 3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of software, including different versions, can be checked with the following command:
 module avail

The list should show that python3 is available. It can then be loaded with the following command:
 module load python/3.3.3

=== Batch script ===
[[Creating_sbatch_script | Main Article: Creating a sbatch script]]

The following shell/slurm script can then be used to schedule the job using the sbatch command:
<source lang='bash'>
#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

time python3 calc_pi.py
</source>

=== Submitting ===
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'>
sbatch run_calc_pi.sh
</source>

=== Submitting multiple jobs (simple) ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted with the following line of shell code:
<source lang='bash'>for i in $(seq 1 10); do echo "$i"; sbatch "runscript_$i.sh"; done
</source>

=== Submitting multiple jobs (complex) ===
Let's say you have three job scripts that depend on each other:

<source lang='bash'>job_1.sh #A simple initialisation script</source>
<source lang='bash'>job_2.sh #An array task</source>
<source lang='bash'>job_3.sh #Some finishing script, single run, after everything previous has finished</source>

You can create a script that submits all three jobs at once, each depending on the previous one:

<source lang='bash'>#!/bin/bash
# sbatch prints "Submitted batch job <jobid>"; keep the last space-separated element
JOB1=$(sbatch job_1.sh | rev | cut -d ' ' -f 1 | rev)

if [ -n "$JOB1" ]; then
    echo "First job submitted as jobid $JOB1"
    JOB2=$(sbatch --dependency=afterany:"$JOB1" job_2.sh | rev | cut -d ' ' -f 1 | rev)

    if [ -n "$JOB2" ]; then
        echo "Second job submitted as jobid $JOB2, following $JOB1"
        JOB3=$(sbatch --dependency=afterany:"$JOB2" job_3.sh | rev | cut -d ' ' -f 1 | rev)

        if [ -n "$JOB3" ]; then
            echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"
        fi
    fi
fi
</source>

This ensures that each subsequent job starts only after the previous one has finished, even if it failed.

Please see [https://slurm.schedmd.com/sbatch.html#OPT_dependency the sbatch documentation] for other options available to you. Note that aftercorr makes each element of a subsequent array job start after the correspondingly numbered element of the previous array job has completed successfully.

=== Submitting array jobs ===
<source lang='bash'>
#SBATCH --array=0-10%4
</source>
SLURM allows you to submit multiple jobs using the same template. Further information about this can be found [[Array_jobs|here]].

=== Using /tmp ===
Each node has a local disk of ~300G attached that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after use.

To make sure that your job gets the space it needs in /tmp, add
<source lang='bash'>
#SBATCH --tmp=<required size>
</source>
to your sbatch script. This prevents your job from running on nodes where that space is not free, or is already claimed by another job running at the same time.

=== Using GPU ===
There are two GPU nodes. To run a job that uses a GPU on one of these nodes, add

<source lang='bash'>
#SBATCH --gres=gpu:<num gpus>
#SBATCH --constraint=<gpu flavour e.g. K80, V100>
#SBATCH --partition=GPU
</source>
to your sbatch script. Without these parameters, your job will not run on one of these nodes.

== Monitoring submitted jobs ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.

=== Generic monitoring of all running jobs ===
<source lang='bash'>
squeue
</source>

You should then get a list of the jobs that are running on the cluster at that time. It may look like this:
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    3396      ABGC BOV-WUR- megen002  R      27:26      1 node004
    3397      ABGC BOV-WUR- megen002  R      27:26      1 node005
    3398      ABGC BOV-WUR- megen002  R      27:26      1 node006
    3399      ABGC BOV-WUR- megen002  R      27:26      1 node007
    3400      ABGC BOV-WUR- megen002  R      27:26      1 node008
    3401      ABGC BOV-WUR- megen002  R      27:26      1 node009
    3385  research BOV-WUR- megen002  R      44:38      1 node049
    3386  research BOV-WUR- megen002  R      44:38      1 node050
    3387  research BOV-WUR- megen002  R      44:38      1 node051
    3388  research BOV-WUR- megen002  R      44:38      1 node052
    3389  research BOV-WUR- megen002  R      44:38      1 node053
    3390  research BOV-WUR- megen002  R      44:38      1 node054
    3391  research BOV-WUR- megen002  R      44:38      3 node[049-051]
    3392  research BOV-WUR- megen002  R      44:38      3 node[052-054]
    3393  research BOV-WUR- megen002  R      44:38      1 node001
    3394  research BOV-WUR- megen002  R      44:38      1 node002
    3395  research BOV-WUR- megen002  R      44:38      1 node003

=== Monitoring time limit set for a specific job ===
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. The time limit that is set for a specific job can be checked with the <code>squeue</code> command:
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
 Fri Nov 29 15:41:00 2013
   JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
    3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054

=== Query a specific active job: scontrol ===
Show all the details of a currently active job. This does not work for a job that has already completed.
<source lang='bash'>
[@login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
UserId=megen002(16795409) GroupId=domain users(16777729)
Priority=1 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=research AllocNode:Sid=login0:21799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node023
BatchHost=node023
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/scratch/WUR/ABGC/...
WorkDir=/lustre/scratch/WUR/ABGC/...
</source>
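
Because scontrol prints its details as <code>Key=Value</code> pairs, a single field can be picked out with standard text tools. The helper below is a minimal sketch of that idea; the name <code>scontrol_field</code> is hypothetical, and it assumes that the value contains no spaces (so it would not work for a field such as <code>GroupId=domain users(...)</code> in the output above).

<source lang='bash'>
# scontrol_field KEY
# Reads "scontrol show job" output on standard input and prints the value
# of the first Key=Value pair whose key equals KEY.
# (Hypothetical helper, not a Slurm command.)
scontrol_field() {
    # Put every pair on its own line, then match on the part before "="
    tr ' ' '\n' | awk -F= -v key="$1" '
        !found && $1 == key { sub(/^[^=]*=/, ""); print; found = 1 }'
}

# Possible use:
#   scontrol show jobid 4241 | scontrol_field JobState
</source>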

=== Check on a pending job ===
A submitted job can end up in a pending state when there are not enough resources available for it.
In this example I submit a job, check its status and, after finding out that it is '''pending''', check when it will probably start.
<source lang='bash'>
[@login jobs]$ sbatch hpl_student.job
Submitted batch job 740338

[@login jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)

[@login jobs]$ squeue --start -j 740338
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
</source>
So it seems this job will probably start the next day, but that is no guarantee that it actually will.

== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
<source lang='bash'>
scancel 3401
</source>

== Allocating resources interactively: sinteractive ==
sinteractive is a small wrapper around srun that creates interactive jobs quickly and easily. It gives you a shell on one of the nodes, with limits similar to those of a normal job. To use it, simply run:
<source lang='bash'>
sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>
</source>
You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code interactively as needed.

Be advised, though: if you do not fill in the above fields, you will get a shell with 1 CPU and 100MB of RAM for 1 hour. This is still useful for quick testing.

=== sinteractive source ===
<source lang='bash'>
#!/bin/bash
srun "$@" -I60 -N 1 -n 1 --pty bash -i
</source>

=== interactive Slurm - using salloc ===
If you want to keep your shell on the login node and only allocate resources, use salloc:
<source lang='bash'>
salloc -p ABGC_Low $SHELL
</source>
Your shell stays on the login node, but you can now run
<source lang='bash'>
srun <command> &
</source>
to launch tasks inside the new allocation.

Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, you need to request a longer time limit. See: https://computing.llnl.gov/linux/slurm/salloc.html

== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<source lang='bash'>
sacct
</source>
This should provide information similar to the following:

        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
 ------------ ---------- ---------- ---------- ---------- ---------- --------
 3385         BOV-WUR-58   research                    12  COMPLETED      0:0
 3385.batch        batch                                1  COMPLETED      0:0
 3386         BOV-WUR-59   research                    12 CANCELLED+      0:0
 3386.batch        batch                                1  CANCELLED     0:15
 3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0
 3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0

Or in more detail for a specific job:
<source lang='bash'>
sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</source>
This should provide information about job id 4220:

        JobID    JobName    Comment  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode
 ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
 4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0
 4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0

'''Job Status Codes'''

Typically your job will be in either the Running or the Pending state. However, here is a breakdown of all the states that your job could be in.

{| class="wikitable"
|-
!Code!!State!!Description
|-
|CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
|-
|CD|| COMPLETED|| Job has terminated all processes on all nodes.
|-
|CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
|-
|CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
|-
|F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
|-
|NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
|-
|PD|| PENDING|| Job is awaiting resource allocation.
|-
|R|| RUNNING|| Job currently has an allocation.
|-
|S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
|-
|TO|| TIMEOUT|| Job terminated upon reaching its time limit.
|-
|-
|}
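
The states in this table can also be used to filter sacct output automatically. The following sketch assumes the output was requested in machine-readable form with <code>sacct -X -n -P --format=jobid,state,exitcode</code> (one line per job, no header, fields separated by <code>|</code>); the function name <code>problem_jobs</code> is hypothetical.

<source lang='bash'>
# problem_jobs
# Reads "JobID|State|ExitCode" lines on standard input and prints the IDs
# of all jobs that are not COMPLETED, RUNNING or PENDING, for example
# FAILED, TIMEOUT, NODE_FAIL or CANCELLED jobs.
# (Hypothetical helper, not a Slurm command.)
problem_jobs() {
    awk -F'|' '$2 !~ /^(COMPLETED|RUNNING|PENDING)/ { print $1 }'
}

# Possible use:
#   sacct -X -n -P --format=jobid,state,exitcode | problem_jobs
</source>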

== Running MPI jobs on Anunna ==

[[MPI_on_B4F_cluster | Main article: MPI on Anunna]]

== See also ==
* [[Tariffs | Costs associated with resource usage]]
* [[B4F_cluster | Anunna]]
* [[BCM_on_B4F_cluster | BCM on Anunna]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]

Using Slurm

2019-10-01T08:07:53Z

Dawes001: /* Using GPU */

The resource allocation / scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.

== Queues and defaults ==

=== Quality of Service ===
When submitting a job, you may optionally assign a different Quality of Service to it. You can do this with:

<source lang='bash'>
#SBATCH --qos=std
</source>

By default, jobs will use std, the standard quality.

Optionally, you may elect to reduce the priority of your jobs to low. This comes with a limit of how long each job can be (8h) to prevent the cluster from being locked up entirely with low priority jobs.

The high quality provides a higher priority to jobs (20) than std (10), or low (1). It is naturally more expensive.

The highest priority goes to jobs in interactive quality (100), but you may not submit many jobs or many large jobs as this quality. This is exclusively for the use of immediate running jobs, ones that are going to have hands-on users behind them.

Jobs may be restarted and rescheduled if a job with higher priority needs cluster resources, but as of right now, this is not occurring.

=== Queues ===
The cluster consists of multiple partitions of nodes that you can submit to. The primary one is 'main'. There are other partitions as needed - current plans include 'gpu'.

You can see the available partitions with <code>sinfo</code>.

=== Defaults ===
The default partition is 'main'. This will work for most jobs.

The default qos is 'std'.

The default cpu count is 1.

The default run time for a job is '''1 hour'''.

The default memory limit is '''100MB per node'''.

== Submitting jobs: sbatch ==

=== Example ===
Consider this simple python3 script that should calculate Pi to 1 million digits:
<source lang='python'>
from decimal import *
D=Decimal
getcontext().prec=10000000
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))
print(str(p)[:10000002])
</source>

=== Loading modules ===
For this script to run, Python 3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of software, including different versions, can be checked with the following command:
 module avail

The list should show that python3 is available. It can then be loaded with the following command:
 module load python/3.3.3

=== Batch script ===
[[Creating_sbatch_script | Main Article: Creating a sbatch script]]

The following shell/slurm script can then be used to schedule the job using the sbatch command:
<source lang='bash'>
#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

time python3 calc_pi.py
</source>

=== Submitting ===
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'>
sbatch run_calc_pi.sh
</source>

=== Submitting multiple jobs (simple) ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all of them can be submitted with the following line of shell code:
<source lang='bash'>for i in $(seq 1 10); do echo "$i"; sbatch "runscript_$i.sh"; done
</source>
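
If some of the numbered scripts might be missing, the loop can be wrapped in a small function that skips them instead of passing a non-existent file to sbatch. This is only a sketch: the function name <code>submit_all</code> is hypothetical, and the submit command is passed as the first argument so that the function can be tried out with <code>echo</code> before using <code>sbatch</code>.

<source lang='bash'>
# submit_all SUBMIT_COMMAND SCRIPT [SCRIPT...]
# Runs SUBMIT_COMMAND once for every script that exists and reports the
# ones that do not exist on standard error.
# (Hypothetical helper, not a Slurm command.)
submit_all() {
    local submit_cmd=$1
    shift
    local script
    for script in "$@"; do
        if [ -f "$script" ]; then
            "$submit_cmd" "$script"
        else
            echo "Skipping missing script: $script" >&2
        fi
    done
}

# Dry run:      submit_all echo runscript_*.sh
# Real submit:  submit_all sbatch runscript_*.sh
</source>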

=== Submitting multiple jobs (complex) ===
Let's say you have three job scripts that depend on each other:

<source lang='bash'>job_1.sh #A simple initialisation script</source>
<source lang='bash'>job_2.sh #An array task</source>
<source lang='bash'>job_3.sh #Some finishing script, single run, after everything previous has finished</source>

You can create a script that submits all three jobs at once, each depending on the previous one:

<source lang='bash'>#!/bin/bash
# sbatch prints "Submitted batch job <jobid>"; keep the last space-separated element
JOB1=$(sbatch job_1.sh | rev | cut -d ' ' -f 1 | rev)

if [ -n "$JOB1" ]; then
    echo "First job submitted as jobid $JOB1"
    JOB2=$(sbatch --dependency=afterany:"$JOB1" job_2.sh | rev | cut -d ' ' -f 1 | rev)

    if [ -n "$JOB2" ]; then
        echo "Second job submitted as jobid $JOB2, following $JOB1"
        JOB3=$(sbatch --dependency=afterany:"$JOB2" job_3.sh | rev | cut -d ' ' -f 1 | rev)

        if [ -n "$JOB3" ]; then
            echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"
        fi
    fi
fi
</source>

This ensures that each subsequent job starts only after the previous one has finished, even if it failed.

Please see [https://slurm.schedmd.com/sbatch.html#OPT_dependency the sbatch documentation] for other options available to you. Note that aftercorr makes each element of a subsequent array job start after the correspondingly numbered element of the previous array job has completed successfully.

=== Submitting array jobs ===
<source lang='bash'>
#SBATCH --array=0-10%4
</source>
SLURM allows you to submit multiple jobs using the same template. Further information about this can be found [[Array_jobs|here]].

=== Using /tmp ===
Each node has a local disk of ~300G attached that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after use.

To make sure that your job gets the space it needs in /tmp, add
<source lang='bash'>
#SBATCH --tmp=<required size>
</source>
to your sbatch script. This prevents your job from running on nodes where that space is not free, or is already claimed by another job running at the same time.

=== Using GPU ===
There are two GPU nodes. To run a job that uses a GPU on one of these nodes, add

<source lang='bash'>
#SBATCH --gres=gpu:<num gpus>
#SBATCH --partition=GPU
</source>
to your sbatch script. Without these parameters, your job will not run on one of these nodes.

== Monitoring submitted jobs ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.

=== Generic monitoring of all running jobs ===
<source lang='bash'>
squeue
</source>

You should then get a list of jobs that are running at that time on the cluster, for the example on how to submit using the 'sbatch' command, it may look like so:
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    3396      ABGC BOV-WUR- megen002  R      27:26      1 node004
    3397      ABGC BOV-WUR- megen002  R      27:26      1 node005
    3398      ABGC BOV-WUR- megen002  R      27:26      1 node006
    3399      ABGC BOV-WUR- megen002  R      27:26      1 node007
    3400      ABGC BOV-WUR- megen002  R      27:26      1 node008
    3401      ABGC BOV-WUR- megen002  R      27:26      1 node009
    3385  research BOV-WUR- megen002  R      44:38      1 node049
    3386  research BOV-WUR- megen002  R      44:38      1 node050
    3387  research BOV-WUR- megen002  R      44:38      1 node051
    3388  research BOV-WUR- megen002  R      44:38      1 node052
    3389  research BOV-WUR- megen002  R      44:38      1 node053
    3390  research BOV-WUR- megen002  R      44:38      1 node054
    3391  research BOV-WUR- megen002  R      44:38      3 node[049-051]
    3392  research BOV-WUR- megen002  R      44:38      3 node[052-054]
    3393  research BOV-WUR- megen002  R      44:38      1 node001
    3394  research BOV-WUR- megen002  R      44:38      1 node002
    3395  research BOV-WUR- megen002  R      44:38      1 node003
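
A listing like this can be summarised with standard text tools. The sketch below counts running jobs per partition; it assumes the default column order (PARTITION second, ST fifth) and job names without spaces, and <code>jobs_per_partition</code> is a hypothetical helper name.

```bash
# Count running jobs per partition from squeue-style output on stdin.
# Assumes the default column order: PARTITION is column 2, ST is column 5.
jobs_per_partition() {
    awk 'NR > 1 && $5 == "R" { count[$2]++ }
         END { for (p in count) print p, count[p] }' | sort
}

# Sample rows copied from the listing; on the cluster, pipe in `squeue` instead.
jobs_per_partition <<'EOF'
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3396 ABGC BOV-WUR- megen002 R 27:26 1 node004
3397 ABGC BOV-WUR- megen002 R 27:26 1 node005
3385 research BOV-WUR- megen002 R 44:38 1 node049
EOF
```

On a login node the same function would be used as <code>squeue | jobs_per_partition</code>.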

=== Monitoring time limit set for a specific job ===
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. To see the time limit that is set for a certain job, use the <code>squeue</code> command with the <code>-l</code> option.
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
 Fri Nov 29 15:41:00 2013
   JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
    3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054
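
To work out how much time a job has left, the TIME and TIMELIMIT columns can be converted to seconds. The following is a sketch that handles only the display forms seen in this output (<code>MM:SS</code>, <code>H:MM:SS</code> and <code>D-HH:MM:SS</code>); <code>timelimit_to_seconds</code> is a hypothetical helper name.

```bash
# Convert a squeue time value ([D-]HH:MM:SS or MM:SS) to seconds.
timelimit_to_seconds() {
    local spec=$1 days=0 h=0 m=0 s=0
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}
        spec=${spec#*-}
    fi
    local -a parts
    IFS=: read -r -a parts <<< "$spec"
    case ${#parts[@]} in
        3) h=${parts[0]} m=${parts[1]} s=${parts[2]} ;;
        2) m=${parts[0]} s=${parts[1]} ;;
        *) echo "unsupported time format: $1" >&2; return 1 ;;
    esac
    # The 10# prefix forces base 10 so that values such as "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

limit=$(timelimit_to_seconds "3-08:00:00")
used=$(timelimit_to_seconds "2:47:03")
echo "seconds left: $(( limit - used ))"
```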

=== Query a specific active job: scontrol ===
Show all the details of a currently active job, so not a completed job.
<source lang='bash'>
login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
UserId=megen002(16795409) GroupId=domain users(16777729)
Priority=1 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=research AllocNode:Sid=login0:21799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node023
BatchHost=node023
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/scratch/WUR/ABGC/...
WorkDir=/lustre/scratch/WUR/ABGC/...
</source>
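
When only one or two of these fields are needed, they can be picked out of the output. This is a minimal sketch with a hypothetical helper, <code>scontrol_field</code>; it splits on spaces, so values that themselves contain spaces (such as GroupId above) are cut short.

```bash
# Print the value of one Key=Value field from `scontrol show job` output on stdin.
# Limitation: values containing spaces are cut at the first space.
scontrol_field() {
    tr -s ' ' '\n' | awk -F= -v key="$1" \
        '$1 == key && !found { print substr($0, length(key) + 2); found = 1 }'
}

# Three lines taken from the output above, used here as sample input.
sample='JobId=4241 Name=WB20F06
JobState=RUNNING Reason=None Dependency=(null)
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A'

echo "$sample" | scontrol_field JobState
echo "$sample" | scontrol_field TimeLimit
```

On the cluster the input would come from <code>scontrol show jobid 4241</code> instead of the sample text.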

=== Check on a pending job ===
A submitted job could result in a pending state when there are not enough resources available to this job.
In this example I submit a job, check its status and, after finding out that it is '''pending''', check when it will probably start.
<source lang='bash'>
[@login jobs]$ sbatch hpl_student.job
Submitted batch job 740338

[@login jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)

[@login jobs]$ squeue --start -j 740338
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
</source>
So it seems this job will probably start the next day, but there is no guarantee that it actually will.

== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
<source lang='bash'>
scancel 3401
</source>
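
<code>scancel</code> can also select jobs by user and state rather than by job ID. The sketch below wraps that in a hypothetical helper with a dry-run switch (<code>DRY_RUN</code> is an assumption of this example, not a Slurm feature), so the command can be inspected before anything is cancelled.

```bash
# Cancel all of your own pending jobs, leaving running ones untouched.
# With DRY_RUN=1 the command is only printed, which is a safe way to check it first.
cancel_my_pending() {
    local user=${USER:-$(id -un)}
    local cmd=(scancel --user="$user" --state=PENDING)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

DRY_RUN=1 cancel_my_pending
```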

== Allocating resources interactively: sinteractive ==
sinteractive is a tiny wrapper around srun to create interactive jobs quickly and easily. It gives you a shell on one of the nodes, with limits similar to those of a normal job. To use it, simply run:
<source lang='bash'>
sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>
</source>
You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code interactively as needed.

Be advised though: not filling in the above fields will get you a shell with 1 CPU and 100 MB of RAM for 1 hour. This is still useful for quick testing.

=== sinteractive source ===
<source lang='bash'>
#!/bin/bash
srun "$@" -I60 -N 1 -n 1 --pty bash -i
</source>

=== interactive Slurm - using salloc ===
If you would rather keep your shell on the login node and only reserve resources, do:
<source lang='bash'>
salloc -p ABGC_Low $SHELL
</source>
Your shell will stay on the login node, but you can now do:
<source lang='bash'>
srun <command> &
</source>
to run tasks on the allocated resources.

Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, request more time (for example with <code>--time</code>). See: https://computing.llnl.gov/linux/slurm/salloc.html

== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<source lang='bash'>
sacct
</source>
This should provide information similar to the following:

        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
 ------------ ---------- ---------- ---------- ---------- ---------- --------
 3385         BOV-WUR-58   research                    12  COMPLETED      0:0
 3385.batch        batch                                1  COMPLETED      0:0
 3386         BOV-WUR-59   research                    12 CANCELLED+      0:0
 3386.batch        batch                                1  CANCELLED     0:15
 3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0
 3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0

Or in more detail for a specific job:
<source lang='bash'>
sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</source>
This should provide information about job id 4220:

        JobID    JobName    Comment  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode
 ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
 4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0
 4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0
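
For scripting, the fixed-width layout is awkward to parse; <code>sacct</code> can print pipe-separated fields instead with <code>--parsable2</code>. The sketch below filters out everything that did not complete. The column selection and the helper name <code>unfinished_jobs</code> are assumptions of this example, and the sample rows are modelled on the first listing in this section.

```bash
# List job steps whose state is not COMPLETED, from `sacct --parsable2` output on stdin.
# Expected columns (an assumption of this sketch): JobID|JobName|State|ExitCode
unfinished_jobs() {
    awk -F'|' '$3 != "COMPLETED" { print $1, $3, $4 }'
}

# On the cluster you would run, for example:
#   sacct --parsable2 --noheader --format=jobid,jobname,state,exitcode | unfinished_jobs
unfinished_jobs <<'EOF'
3385|BOV-WUR-58|COMPLETED|0:0
3386|BOV-WUR-59|CANCELLED|0:0
3386.batch|batch|CANCELLED|0:15
3528|BOV-WUR-59|RUNNING|0:0
EOF
```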

'''Job Status Codes'''

Typically your job will be in either the Running or the PenDing state. However, here is a breakdown of all the states that your job could be in.

{| class="wikitable"
|-
!Code!!State!!Description
|-
|CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
|-
|CD|| COMPLETED|| Job has terminated all processes on all nodes.
|-
|CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
|-
|CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
|-
|F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
|-
|NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
|-
|PD|| PENDING|| Job is awaiting resource allocation.
|-
|R|| RUNNING|| Job currently has an allocation.
|-
|S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
|-
|TO|| TIMEOUT|| Job terminated upon reaching its time limit.
|-
|-
|}

== Running MPI jobs on Anunna ==

[[MPI_on_B4F_cluster | Main article: MPI on Anunna]]

== See also ==
* [[Tariffs | Costs associated with resource usage]]
* [[B4F_cluster | Anunna]]
* [[BCM_on_B4F_cluster | BCM on Anunna]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]

Using Slurm

2019-10-01T08:06:55Z

Dawes001: /* Using GPU */

The resource allocation / scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.

== Queues and defaults ==

=== Quality of Service ===
When submitting a job, you may optionally assign a different Quality of Service to it. You can do this with:

<source lang='bash'>
#SBATCH --qos=std
</source>

By default, jobs will use std, the standard quality.

Optionally, you may elect to reduce the priority of your jobs to low. This comes with a limit of how long each job can be (8h) to prevent the cluster from being locked up entirely with low priority jobs.

The high quality provides a higher priority to jobs (20) than std (10), or low (1). It is naturally more expensive.

The highest priority goes to jobs in the interactive quality (100), but you may not submit many jobs, or many large jobs, in this quality. It is intended exclusively for jobs that need to start immediately and have a hands-on user behind them.

Jobs may be restarted and rescheduled if a job with higher priority needs cluster resources, but as of right now, this is not occurring.
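
As an illustration, a job that opts into the low quality could carry the following header. This is a sketch: the QoS name <code>low</code> is taken from the description above and is assumed to be the literal name configured on the cluster.

```bash
#!/bin/bash
#SBATCH --qos=low
#SBATCH --time=0-08:00:00   # the 8h ceiling mentioned above for low-priority jobs
```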

=== Queues ===
The cluster consists of multiple partitions of nodes that you can submit to. The primary one is 'main'. There are other partitions as needed - current plans include 'gpu'.

You can see the partitions available with <code>sinfo</code>.
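
A minimal sketch of such a query follows. <code>list_partitions</code> is a hypothetical wrapper that prints one summary line per partition and reports a hint when run on a machine without Slurm.

```bash
# Print one summary line per partition; fails with a hint when Slurm is absent.
list_partitions() {
    if command -v sinfo >/dev/null 2>&1; then
        sinfo --summarize
    else
        echo "sinfo not found: run this on a cluster login node" >&2
        return 1
    fi
}

# Usage on a login node:
#   list_partitions
```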

=== Defaults ===
The default partition is 'main'. This will work for most jobs.

The default qos is 'std'.

The default cpu count is 1.

The default run time for a job is '''1 hour'''.

The default memory limit is '''100MB per node'''.

== Submitting jobs: sbatch ==

=== Example ===
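Consider this simple Python 3 script, which approximates Pi with the Bailey–Borwein–Plouffe series at a working precision of 10 million digits. Because the series is truncated after 411 terms, only roughly the first 490 digits are accurate, so treat it as a CPU-heavy test workload rather than a real million-digit calculation: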
<source lang='python'>
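from decimal import Decimal as D, getcontext

# Working precision in significant digits; this is what makes the job slow.
getcontext().prec = 10000000
# Bailey-Borwein-Plouffe series, truncated after 411 terms.
p = sum(D(1) / 16**k * (D(4) / (8*k + 1) - D(2) / (8*k + 4) - D(1) / (8*k + 5) - D(1) / (8*k + 6))
        for k in range(411))
print(str(p)[:10000002])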
</source>

=== Loading modules ===
For this script to run, Python 3, which is not the default Python version on the cluster, must first be loaded into your environment. The availability of (different versions of) software can be checked with the following command:
 module avail

The list should show that python3 is indeed available; it can then be loaded with the following command:
 module load python/3.3.3

=== Batch script ===
[[Creating_sbatch_script | Main Article: Creating a sbatch script]]

The following shell/slurm script can then be used to schedule the job using the sbatch command:
<source lang='bash'>
#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

time python3 calc_pi.py
</source>

=== Submitting ===
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'>
sbatch run_calc_pi.sh
</source>

=== Submitting multiple jobs (simple) ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code:
<source lang='bash'>for i in $(seq 1 10); do echo "$i"; sbatch "runscript_$i.sh"; done
</source>

=== Submitting multiple jobs (complex) ===
Let's say you have three job scripts that depend on each other:

<source lang='bash'>job_1.sh #A simple initialisation script</source>
<source lang='bash'>job_2.sh #An array task</source>
<source lang='bash'>job_3.sh #Some finishing script, single run, after everything previous has finished</source>

You can create a script that submits all the jobs at once, each with a dependency on the previous one:

<source lang='bash'>#!/bin/bash
JOB1=$(sbatch job_1.sh | rev | cut -d ' ' -f 1 | rev) # Get the last space-separated element, i.e. the job ID

if [ -n "$JOB1" ]; then
    echo "First job submitted as jobid $JOB1"
    JOB2=$(sbatch --dependency=afterany:"$JOB1" job_2.sh | rev | cut -d ' ' -f 1 | rev)

    if [ -n "$JOB2" ]; then
        echo "Second job submitted as jobid $JOB2, following $JOB1"
        JOB3=$(sbatch --dependency=afterany:"$JOB2" job_3.sh | rev | cut -d ' ' -f 1 | rev)

        if [ -n "$JOB3" ]; then
            echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"
        fi
    fi
fi
</source>

This ensures that each subsequent job starts only after the previous one has finished, even if it failed.

Please see [https://slurm.schedmd.com/sbatch.html#OPT_dependency the sbatch documentation] for other options available to you. Note that <code>aftercorr</code> makes each element of a subsequent array job start after the correspondingly numbered element of the previous array job has completed successfully.
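
The same chaining idea can be written as a loop. The sketch below is a variation, not the script above: it uses <code>sbatch --parsable</code> to obtain the job ID directly and <code>afterok</code> so that a job only starts when its predecessor succeeded. The function name <code>submit_chain</code> and the <code>SBATCH</code> override variable are hypothetical.

```bash
SBATCH=${SBATCH:-sbatch}   # override to try the function without a scheduler (hypothetical knob)

# Submit the given scripts in order, each one depending on the previous job.
submit_chain() {
    local dep="" jobid script
    for script in "$@"; do
        if [ -n "$dep" ]; then
            jobid=$("$SBATCH" --parsable --dependency="afterok:$dep" "$script") || return 1
        else
            jobid=$("$SBATCH" --parsable "$script") || return 1
        fi
        jobid=${jobid%%;*}   # --parsable may append ";cluster" on multi-cluster setups
        echo "$script -> $jobid"
        dep=$jobid
    done
}

# Usage on the cluster:
#   submit_chain job_1.sh job_2.sh job_3.sh
```

With <code>afterok</code>, a failure early in the chain leaves the later jobs pending with an unsatisfied dependency, so they may need to be cancelled by hand.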

=== Submitting array jobs ===
<source lang='bash'>
#SBATCH --array=0-10%4
</source>
SLURM allows you to submit multiple jobs using the same template. Further information about this can be found [[Array_jobs|here]].
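
Within an array job, each task typically uses its task ID to decide which input to work on. The following is a minimal sketch; the input file names are hypothetical, and the fallback to task 0 only exists so the script can be tried outside Slurm.

```bash
#!/bin/bash
#SBATCH --array=0-10%4   # 11 tasks (IDs 0..10), at most 4 running at once

# Hypothetical input files, one per array task.
inputs=(sample_{00..10}.txt)

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 0 for a dry run.
task_id=${SLURM_ARRAY_TASK_ID:-0}
input=${inputs[task_id]}

echo "task $task_id of ${#inputs[@]} would process $input"
```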

=== Using /tmp ===
Each node has a local disk of ~300G attached that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after usage.

To be sure that you're able to use space in /tmp, add the following to your sbatch script:
<source lang='bash'>
#SBATCH --tmp=<required size>
</source>
This will prevent your job from being run on nodes where there is not enough free space, or where that space is already claimed by another job running at the same time.

=== Using GPU ===
There are two GPU nodes. To run a job that uses a GPU on one of these nodes, add the following to your sbatch script:

<source lang='bash'>
#SBATCH --partition=GPU
</source>
Without this parameter, your job won't run on one of these nodes.

== Monitoring submitted jobs ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.

=== Generic monitoring of all running jobs ===
<source lang='bash'>
squeue
</source>

You should then get a list of the jobs on the cluster at that time. For the example on how to submit using the 'sbatch' command, it may look like this:
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    3396      ABGC BOV-WUR- megen002  R      27:26      1 node004
    3397      ABGC BOV-WUR- megen002  R      27:26      1 node005
    3398      ABGC BOV-WUR- megen002  R      27:26      1 node006
    3399      ABGC BOV-WUR- megen002  R      27:26      1 node007
    3400      ABGC BOV-WUR- megen002  R      27:26      1 node008
    3401      ABGC BOV-WUR- megen002  R      27:26      1 node009
    3385  research BOV-WUR- megen002  R      44:38      1 node049
    3386  research BOV-WUR- megen002  R      44:38      1 node050
    3387  research BOV-WUR- megen002  R      44:38      1 node051
    3388  research BOV-WUR- megen002  R      44:38      1 node052
    3389  research BOV-WUR- megen002  R      44:38      1 node053
    3390  research BOV-WUR- megen002  R      44:38      1 node054
    3391  research BOV-WUR- megen002  R      44:38      3 node[049-051]
    3392  research BOV-WUR- megen002  R      44:38      3 node[052-054]
    3393  research BOV-WUR- megen002  R      44:38      1 node001
    3394  research BOV-WUR- megen002  R      44:38      1 node002
    3395  research BOV-WUR- megen002  R      44:38      1 node003

=== Monitoring time limit set for a specific job ===
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. To see the time limit that is set for a certain job, use the <code>squeue</code> command with the <code>-l</code> option.
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
 Fri Nov 29 15:41:00 2013
   JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
    3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054

=== Query a specific active job: scontrol ===
Show all the details of a currently active job, so not a completed job.
<source lang='bash'>
login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
UserId=megen002(16795409) GroupId=domain users(16777729)
Priority=1 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=research AllocNode:Sid=login0:21799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node023
BatchHost=node023
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/scratch/WUR/ABGC/...
WorkDir=/lustre/scratch/WUR/ABGC/...
</source>

=== Check on a pending job ===
A submitted job could result in a pending state when there are not enough resources available to this job.
In this example I submit a job, check its status and, after finding out that it is '''pending''', check when it will probably start.
<source lang='bash'>
[@login jobs]$ sbatch hpl_student.job
Submitted batch job 740338

[@login jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)

[@login jobs]$ squeue --start -j 740338
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
</source>
So it seems this job will probably start the next day, but there is no guarantee that it actually will.

== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
<source lang='bash'>
scancel 3401
</source>

== Allocating resources interactively: sinteractive ==
sinteractive is a tiny wrapper around srun to create interactive jobs quickly and easily. It gives you a shell on one of the nodes, with limits similar to those of a normal job. To use it, simply run:
<source lang='bash'>
sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>
</source>
You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code interactively as needed.

Be advised though: not filling in the above fields will get you a shell with 1 CPU and 100 MB of RAM for 1 hour. This is still useful for quick testing.

=== sinteractive source ===
<source lang='bash'>
#!/bin/bash
srun "$@" -I60 -N 1 -n 1 --pty bash -i
</source>

=== interactive Slurm - using salloc ===
If you would rather keep your shell on the login node and only reserve resources, do:
<source lang='bash'>
salloc -p ABGC_Low $SHELL
</source>
Your shell will stay on the login node, but you can now do:
<source lang='bash'>
srun <command> &
</source>
to run tasks on the allocated resources.

Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, request more time (for example with <code>--time</code>). See: https://computing.llnl.gov/linux/slurm/salloc.html

== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<source lang='bash'>
sacct
</source>
This should provide information similar to the following:

        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
 ------------ ---------- ---------- ---------- ---------- ---------- --------
 3385         BOV-WUR-58   research                    12  COMPLETED      0:0
 3385.batch        batch                                1  COMPLETED      0:0
 3386         BOV-WUR-59   research                    12 CANCELLED+      0:0
 3386.batch        batch                                1  CANCELLED     0:15
 3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0
 3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0

Or in more detail for a specific job:
<source lang='bash'>
sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</source>
This should provide information about job id 4220:

        JobID    JobName    Comment  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode
 ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
 4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0
 4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0

'''Job Status Codes'''

Typically your job will be in either the Running or the PenDing state. However, here is a breakdown of all the states that your job could be in.

{| class="wikitable"
|-
!Code!!State!!Description
|-
|CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
|-
|CD|| COMPLETED|| Job has terminated all processes on all nodes.
|-
|CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
|-
|CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
|-
|F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
|-
|NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
|-
|PD|| PENDING|| Job is awaiting resource allocation.
|-
|R|| RUNNING|| Job currently has an allocation.
|-
|S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
|-
|TO|| TIMEOUT|| Job terminated upon reaching its time limit.
|-
|-
|}

== Running MPI jobs on Anunna ==

[[MPI_on_B4F_cluster | Main article: MPI on Anunna]]

== See also ==
* [[Tariffs | Costs associated with resource usage]]
* [[B4F_cluster | Anunna]]
* [[BCM_on_B4F_cluster | BCM on Anunna]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]

Ssh without password

2019-07-26T16:29:57Z

Dawes001: /* Step 1: create a public key and copy to remote computer */

Secure shell (ssh) can be configured to work without passwords. This is particularly helpful for machines that are used often.

== Configuring ssh without password from a POSIX-compliant terminal ==

=== Step 1: create a public key and copy to remote computer ===
* Log into a local Linux or MacOSX computer
* Type the following to generate the ssh key:
<source lang='bash'>
ssh-keygen
</source>
* Accept the default key location by pressing <code>Enter</code>.
* Secure permission of your authentication keys by closing permission to your home directory, .ssh directory, and authentication files
<source lang='bash'>
chmod go-wx $HOME
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/*
</source>
* Type the following to copy the key to the remote server (this will prompt for a password).
<source lang='bash'>
ssh-copy-id remote_username@remote_host
</source>
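
Key-based login tends to fail silently when the permissions are too open, so it can help to check them. The sketch below reports anything under the .ssh directory that group or others can access, matching the strict settings above. It relies on GNU find's <code>-perm /077</code> syntax, which is an assumption (BSD/macOS find may differ), and <code>check_ssh_perms</code> is a hypothetical helper name.

```bash
# Report files under an .ssh directory that group or others can access.
# Uses GNU find's "-perm /077" syntax (an assumption; BSD/macOS find may differ).
check_ssh_perms() {
    local dir=${1:-$HOME/.ssh}
    local loose
    loose=$(find "$dir" -perm /077 2>/dev/null)
    if [ -n "$loose" ]; then
        echo "too permissive:"
        echo "$loose"
        return 1
    fi
    echo "permissions look fine in $dir"
}

# Usage:
#   check_ssh_perms            # checks $HOME/.ssh
#   check_ssh_perms /some/dir  # checks another directory
```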

== Configuring ssh without password for Anunna ==

* Create a public key as in Step 1 of the previous section and copy it to Anunna. Note that a public/private key pair needs to be made only once per machine.
* Similar to step 2 of the previous section, add the public key to the <code>$HOME/.ssh/authorized_keys2</code> file. There is already a <code>$HOME/.ssh/authorized_keys</code> present. You may append the key to this file as an alternative, but take care not to remove content that is already there. The cluster is configured so that passwordless communication with all other nodes is the default.

== Configuring ssh without password using PuTTY ==
Use PuTTYgen to generate a local key pair and Pageant to hold it (see http://the.earth.li/~sgtatham/putty/0.58/htmldoc/Chapter9.html). You'll want to have a copy of the public key in plaintext available.

Make sure to paste that plaintext string into ~/.ssh/authorized_keys as one single line. Chmod the file to 600 (so it shows -rw------- in ls -l) and the directory .ssh to 700 (drwx------).

Now PuTTY will log in without a password whenever Pageant is running.

Finally, get Pageant to load on startup: http://blog.shvetsov.com/2010/03/making-pageant-automatically-load-keys.html

== See also ==
* [[log_in_to_Anunna | Logging into cluster using ssh and file transfer]]

== External Links ==

Ssh without password

2019-07-26T16:29:12Z

Dawes001: /* Step 1: create a public key and copy to remote computer */

Secure shell (ssh) can be configured to work without passwords. This is particularly helpful for machines that are used often.

== Configuring ssh without password from a POSIX-compliant terminal ==

=== Step 1: create a public key and copy to remote computer ===
* Log into a local Linux or MacOSX computer
* Type the following to generate the ssh key:
<source lang='bash'>
ssh-keygen
</source>
* Accept the default key location by pressing <code>Enter</code>.
* Secure permission of your authentication keys by closing permission to your home directory, .ssh directory, and authentication files
<source lang='bash'>
chmod go-w $HOME
chmod 700 $HOME/.ssh
chmod go-rwx $HOME/.ssh/*
</source>
* Type the following to copy the key to the remote server (this will prompt for a password).
<source lang='bash'>
ssh-copy-id remote_username@remote_host
</source>

== Configuring ssh without password for Anunna ==

* Create a public key as in Step 1 of the previous section and copy it to Anunna. Note that a public/private key pair needs to be made only once per machine.
* Similar to step 2 of the previous section, add the public key to the <code>$HOME/.ssh/authorized_keys2</code> file. There is already a <code>$HOME/.ssh/authorized_keys</code> present. You may append the key to this file as an alternative, but take care not to remove content that is already there. The cluster is configured so that passwordless communication with all other nodes is the default.

== Configuring ssh without password using PuTTY ==
Use PuTTYgen to generate a local key pair and Pageant to hold it (see http://the.earth.li/~sgtatham/putty/0.58/htmldoc/Chapter9.html). You'll want to have a copy of the public key in plaintext available.

Make sure to paste that plaintext string into ~/.ssh/authorized_keys as one single line. Chmod the file to 600 (so it shows -rw------- in ls -l) and the directory .ssh to 700 (drwx------).

Now PuTTY will log in without a password whenever Pageant is running.

Finally, get Pageant to load on startup: http://blog.shvetsov.com/2010/03/making-pageant-automatically-load-keys.html

== See also ==
* [[log_in_to_Anunna | Logging into cluster using ssh and file transfer]]

== External Links ==

Ssh without password

2019-07-26T16:28:37Z

Dawes001:

Secure shell (ssh) can be configured to work without passwords. This is particularly helpful for machines that are used often.

== Configuring ssh without password from a POSIX-compliant terminal ==

=== Step 1: create a public key and copy to remote computer ===
* Log into a local Linux or MacOSX computer
* Type the following to generate the ssh key:
<source lang='bash'>
ssh-keygen
</source>
* Accept the default key location by pressing <code>Enter</code>.
* Secure permission of your authentication keys by closing permission to your home directory, .ssh directory, and authentication files
<source lang='bash'>
chmod go-w $HOME
chmod 700 $HOME/.ssh
chmod go-rwx $HOME/.ssh/*
</source>
* Type the following to copy the key to the remote server (this will prompt for a password).
<source lang='bash'>
cd ~/.ssh
ssh-copy-id remote_username@remote_host
</source>

== Configuring ssh without password for Anunna ==

* Create a public key as in Step 1 of the previous section and copy it to Anunna. Note that a public/private key pair needs to be made only once per machine.
* Similar to step 2 of the previous section, add the public key to the <code>$HOME/.ssh/authorized_keys2</code> file. There is already a <code>$HOME/.ssh/authorized_keys</code> present. You may append the key to this file as an alternative, but take care not to remove content that is already there. The cluster is configured so that passwordless communication with all other nodes is the default.

== Configuring ssh without password using PuTTY ==
Use PuTTYgen to generate a local key pair and Pageant to hold it (see http://the.earth.li/~sgtatham/putty/0.58/htmldoc/Chapter9.html). You'll want to have a copy of the public key in plaintext available.

Make sure to paste that plaintext string into ~/.ssh/authorized_keys as one single line. Chmod the file to 600 (so it shows -rw------- in ls -l) and the directory .ssh to 700 (drwx------).

Now PuTTY will log in without a password whenever Pageant is running.

Finally, get Pageant to load on startup: http://blog.shvetsov.com/2010/03/making-pageant-automatically-load-keys.html

== See also ==
* [[log_in_to_Anunna | Logging into cluster using ssh and file transfer]]

== External Links ==

Creating sbatch script

2019-07-15T15:05:38Z

Dawes001:

== A skeleton Slurm script ==
<source lang='bash'>
#!/bin/bash
#-----------------------------Mail address-----------------------------
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#-----------------------------Output files-----------------------------
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#-----------------------------Other information------------------------
#SBATCH --comment=
#SBATCH --qos=
#-----------------------------Required resources-----------------------
#SBATCH --time=0-0:0:0
#SBATCH --ntasks=
#SBATCH --cpus-per-task=
#SBATCH --mem-per-cpu=

#-----------------------------Environment, Operations and Job steps----
#load modules

#export variables

#your job

</source>

==Explanation of used SBATCH parameters==
===partition for resource allocation===
<source lang='bash'>
#SBATCH --partition=ABGC_Std
</source>
Request a specific partition for the resource allocation. It is preferred to use your organization's partition.

=== Adding accounting information or project number ===
<source lang='bash'>
#SBATCH --comment=773320000
</source>
Use the comment field to record which account or project the resources used by this job should be charged to. The comment is an arbitrary string and may be changed after job submission using the <tt>scontrol</tt> command. For WUR users, a project number or KTP number is advisable.

===time limit===
<source lang='bash'>
#SBATCH --time=1200
</source>
A time limit of zero requests that no time limit be imposed. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". In this example the job will run for a maximum of 1200 minutes (20 hours).

===memory limit===
<source lang='bash'>
#SBATCH --mem=2048
</source>
SLURM imposes a memory limit on each job. By default it is deliberately relatively small: 100 MB per node. If your job uses more than that, it fails with an "Exceeded job memory limit" error. To set a larger limit, add to your job submission:
<source lang='bash'>
#SBATCH --mem X
</source>

where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:

<source lang='bash'>
$ sacct -o MaxRSS -j JOBID
</source>
where JOBID is the one you're interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with <code>--mem</code> (set it to something a little larger than that, since you're defining a hard upper limit). If your job completed long in the past you may have to tell sacct to look further back in time by adding a start time with <code>-S YYYY-MM-DD</code>. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you're not setting an even distribution of tasks per node (e.g. with <code>--ntasks-per-node</code>), the same job could have very different values when run at different times.
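
The conversion can be sketched as a small helper. <code>suggest_mem_mb</code> is a hypothetical name, the 20% headroom is an assumed margin rather than a site recommendation, and only values in KB (with or without a trailing "K") are handled.

```bash
# Turn a MaxRSS value from sacct (in KB, optionally with a trailing "K")
# into a --mem suggestion in MB, with 20% headroom (the margin is an assumption).
suggest_mem_mb() {
    local kb=${1%K}
    # Integer arithmetic: add 20%, convert KB to MB, then round up by one.
    echo $(( kb * 12 / 10 / 1024 + 1 ))
}

suggest_mem_mb 1536000K
```

The printed number can be used directly as the value of <code>#SBATCH --mem</code>.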

===number of tasks===
<source lang='bash'>
#SBATCH --ntasks=1
</source>
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch at most this number of tasks, so that sufficient resources are provided. The default is one task per node, but note that the --cpus-per-task option will change this default.

When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the minimum number of nodes using the <code>-N</code> or <code>--nodes</code> flag. If you provide only one number, it is used as both the minimum and the maximum. For instance:
<source lang='bash'>
#SBATCH --nodes=1
</source>
This should force your job to be scheduled to a single node.

Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done by using the <code>-C</code> or <code>--constraint</code> flag.

===constraints: selecting by feature===
<source lang='bash'>
#SBATCH --constraint=4gpercpu
</source>
The HPC nodes have features associated with them, such as Intel CPUs, or the amount of memory per node. If you know that your job requires a specific architecture or memory size, you can elect to constrain your job to nodes with these features.

The example above will result in jobs being scheduled to the compute nodes with 4GB of memory per CPU. With <code>12gpercpu</code> as the option, the job will instead be scheduled to one of the larger nodes with 12GB per CPU.

All features can be seen using:
<source lang='bash'>
scontrol show nodes | grep ActiveFeatures | sort | uniq
</source>
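The command above prints whole <code>ActiveFeatures=</code> lines, so combinations of features show up as separate entries. A small filter such as the sketch below lists each feature name once. The function name <code>list_features</code> is an assumption for this example, and the exact layout of the <code>scontrol</code> output may differ between Slurm versions.

<source lang='bash'>
#!/bin/bash
# list_features: read `scontrol show nodes` output on stdin and print every
# distinct feature name on its own line.
list_features() {
    grep -o 'ActiveFeatures=[^ ]*' | cut -d= -f2 | tr ',' '\n' | LC_ALL=C sort -u
}
</source>

It could be used as <code>scontrol show nodes | list_features</code>.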

===requesting specific resources===
<source lang='bash'>
#SBATCH --gres=gpu:1
</source>
In order to be able to use specific hardware resources, you need to request a Generic Resource. Once you do this, one of the resources will be allocated to your job when they are available. In the above example, one GPU is requested for use.

===output (stderr,stdout) directed to file===
<source lang='bash'>
#SBATCH --output=output_%j.txt
</source>
Instruct SLURM to connect the batch script's standard output directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.
<source lang='bash'>
#SBATCH --error=error_output_%j.txt
</source>
Instruct SLURM to connect the batch script's standard error directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.

===adding a job name===
<source lang='bash'>
#SBATCH --job-name=calc_pi.py
</source>
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just "sbatch" if the script is read on sbatch's standard input.

===receiving mailed updates===
<source lang='bash'>
#SBATCH --mail-type=ALL
</source>
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.
<source lang='bash'>
#SBATCH --mail-user=yourname001@wur.nl
</source>
Email address to use.
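
Putting the options from this page together, a batch script might look like the sketch below. The job name, the chosen values and the commented-out <code>my_program</code> are placeholders for this example. The <code>job_summary</code> helper is likewise hypothetical; it only echoes environment variables that Slurm sets inside a job, with fallbacks so that the script also runs outside one.

<source lang='bash'>
#!/bin/bash
#SBATCH --job-name=example_job
#SBATCH --ntasks=4
#SBATCH --nodes=1
#SBATCH --mem=4000
#SBATCH --constraint=4gpercpu
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --mail-type=ALL
#SBATCH --mail-user=yourname001@wur.nl

# Print a one-line summary of the allocation. The SLURM_* variables are set by
# Slurm inside a job; the fallbacks are used when the script runs outside one.
job_summary() {
    echo "job ${SLURM_JOB_ID:-none}: ${SLURM_NTASKS:-1} task(s) on ${SLURM_JOB_NUM_NODES:-1} node(s)"
}

job_summary
# srun ./my_program   # hypothetical executable; uncomment to launch the tasks
</source>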

== See also ==
* [[Anunna | Anunna]]
* [[Using_Slurm#Batch_script | Submitting jobs to Slurm]]
* [[Array_jobs|Array job hints]]

Tariffs

2019-07-15T15:04:20Z

Dawes001:

== Computing: Calculations (cores)==
{| class="wikitable"
!Queue
!CPU core hour
!GB memory hour
|-
|Standard queue
|€ 0.0150
|€ 0.0015
|-
|High priority queue
|€ 0.0200
|€ 0.0020
|-
|Low priority queue
|€ 0.0100
|€ 0.0010
|}

== Computing: GPU Use==
{| class="wikitable"
!Tariff per device per hour (gpu/hour)
|-
|€ 0.3000
|}

== Storage ==
Tariffs per year per TB
{| class="wikitable"
!Lustre Nobackup
!Lustre Backup
!Home-dir
!Archive
|-
|€ 150
|€ 200
|€ 200
|€ 100
|}

== Reservations ==
{| class="wikitable"
!Tariff per node per day (node/day)
|-
|€ 30
|}

== Notes==

If you are a member of a group with a commitment, then these costs get deducted from that commitment. Typically we are fairly lax with enforcing limits - only once you get to around 150% of your commitment will we consider taking action (mainly coming to discuss things).

== Example ==

You run a job in the standard queue that needs 4 cores and 32G of RAM and runs for 90 minutes. To be safe, you over-request resources slightly and submit a job that requests 4 CPUs, 40G of RAM and a time limit of 3 hours. Your job finishes early, after 90 minutes, so you are charged for the requested cores and memory over the 1.5 hours it actually ran. Thus, your costs are:

4 * 0.015 * 1.5 = 0.09 EUR for the CPU

40 * 0.0015 * 1.5 = 0.09 EUR for the memory

Total: 0.18 EUR
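
The same arithmetic can be scripted. The sketch below assumes, as in the example above, that cost is the requested cores and requested memory multiplied by the actual run time; the function name <code>job_cost</code> is hypothetical and the default rates are the standard queue tariffs from the table on this page.

<source lang='bash'>
#!/bin/bash
# job_cost CORES MEM_GB HOURS [CPU_RATE] [MEM_RATE]
# Print the estimated cost in EUR. The rates default to the standard queue
# tariffs (0.0150 per core hour, 0.0015 per GB memory hour).
job_cost() {
    local cores=$1 mem_gb=$2 hours=$3
    local cpu_rate=${4:-0.0150} mem_rate=${5:-0.0015}
    LC_ALL=C awk -v c="$cores" -v m="$mem_gb" -v h="$hours" \
        -v cr="$cpu_rate" -v mr="$mem_rate" \
        'BEGIN { printf "%.2f\n", (c * cr + m * mr) * h }'
}
</source>

Here <code>job_cost 4 40 1.5</code> prints 0.18, matching the total above; passing 0.02 and 0.002 as the last two arguments gives the high priority queue price instead.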

Spark

2019-07-15T15:03:35Z

Dawes001: /* SPARK on HPC */

Apache Spark is a means of distributing compute resources across multiple worker machines. It is the successor to Hadoop, and allows for a wider distribution of code to be executed on the clustered resources. The only requirement for Spark to operate is that the workers can reach one another via TCP; it can therefore run on very simple resources, provided the code itself can be translated into the MapReduce paradigm.

== SPARK on HPC ==
In order to create a personal SPARK cluster, you must first request resources on the HPC. Use this example submission script to initialise your cluster:

<nowiki>#!/bin/bash
#SBATCH --time=<length>
#SBATCH --mem-per-cpu=4000
#SBATCH --nodes=<number of nodes>
#SBATCH --ntasks-per-node=<number of workers per node>
#SBATCH --job-name="my spark cluster"
#SBATCH --qos=QOS

module load python
module load spark

source $SPARK_HOME/wur/start-spark

tail -f /dev/null</nowiki>

This will spawn a new cluster of your desired dimensions once resources are available. This spark module has been written to output its logs to your home directory, at:

/home/WUR/yourid/.spark/<jobid>/

In this folder you will find the raw logs of the master and all workers. By default the master takes 1 GB of memory from the first process, so a single 4 GB 'cluster' will provide one 3 GB worker. You can change the CPU and memory use by adjusting the parameters in your batch script.

Within the log folder you will find two special files: master and master-console. master always contains the URI of the current Spark cluster master access point, and master-console the URL of its web console.

To access the web console, the easiest solution is to use links:

<nowiki>links http://myspark:8081</nowiki>

This will nicely render the page for you in the console. Ctrl-R reloads the page, q to quit.

There are several caveats to remember with this:

* The cluster exists (and consumes resources) until you cancel it with scancel <jobid>
* There is no security at all - any user of the HPC can access both the master and the web console at any point if they know the port and host.
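
Since the master file holds the URI of the running cluster, it can be read from a script rather than copied by hand. The following is a minimal sketch; the function name <code>spark_master_uri</code> is hypothetical, and it assumes that <code>$HOME/.spark</code> is the log location described above.

<source lang='bash'>
#!/bin/bash
# spark_master_uri JOBID [LOGROOT]
# Print the master URI recorded for a Spark cluster job. LOGROOT defaults to
# $HOME/.spark, assumed here to be the log location of the spark module.
spark_master_uri() {
    local jobid=$1
    local logroot=${2:-$HOME/.spark}
    cat "$logroot/$jobid/master"
}
</source>

A script could then be sent to the running cluster with <code>spark-submit --master "$(spark_master_uri JOBID)" myscript.py</code>, where JOBID is the Slurm job id of the cluster.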

== Instant SPARK ==

You can also spin up clusters solely to execute scripts. Simply replace the last line from the example above:
<nowiki>tail -f /dev/null</nowiki>
with
<nowiki>spark-submit myscript.py</nowiki>

And after the script has executed, the cluster will automatically terminate.

== SPARK in Jupyter ==

There is a kernel available for using Spark from Jupyter. All this does (for now) is to set up the correct path to the python version and the spark binaries for you. In order to set up your Context, your first cell for each notebook should be:
<nowiki>import pyspark
conf = (pyspark.SparkConf()
.setMaster("spark://mysparkcluster:7077")
.setAppName("MyName"))

sc = pyspark.SparkContext(conf=conf)</nowiki>

Use the cluster master URI from the master file in your job's log folder, as described above. Subsequent cells will then have sc defined. Run this cell only once; attempting to reconnect will throw an error. The application runs until the kernel is terminated and prevents other applications from being executed in the meantime, so you may wish to terminate your kernel manually from the top bar in Jupyter to free resources.

MPI on B4F cluster

2019-07-15T15:02:51Z

Dawes001:

== A simple 'Hello World' example ==
Consider the following simple MPI version, in C, of the 'Hello World' example:

<source lang='cpp'>
#include <stdio.h>
#include <mpi.h>
int main(int argc, char ** argv) {
    int size, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(processor_name, &namelen);   /* name of the host */
    printf("Hello MPI! Process %d of %d on %s\n", rank, size, processor_name);
    MPI_Finalize();
    return 0;
}
</source>

Before compiling, make sure that the required compilers are available.
<source lang='bash'>
module list
</source>

To avoid conflicts between libraries, the safest way is purging all modules:
<source lang='bash'>
module purge
</source>

Then load both the gcc and openmpi modules. If modules were purged, then slurm needs to be reloaded too.
<source lang='bash'>
module load gcc/4.8.1 openmpi/gcc/64/1.6.5 slurm/2.5.7
</source>

Compile the <code>hello_mpi.c</code> code.
<source lang='bash'>
mpicc hello_mpi.c -o test_hello_world
</source>

If desired, the shared libraries that the executable is linked against can be listed:
<source lang='bash'>
ldd test_hello_world
</source>

linux-vdso.so.1 => (0x00002aaaaaacb000)
libmpi.so.1 => /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1 (0x00002aaaaaccd000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab080000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaaab284000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003e29400000)
librt.so.1 => /lib64/librt.so.1 (0x00002aaaab509000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaab711000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaab92a000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaabb2e000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaabd4b000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

Running the executable on two nodes, with four tasks per node, can be done like this:
<source lang='bash'>
srun --nodes=2 --ntasks-per-node=4 --mpi=openmpi ./test_hello_world
</source>

This will result in the following output:
Hello MPI! Process 4 of 8 on node011
Hello MPI! Process 1 of 8 on node010
Hello MPI! Process 7 of 8 on node011
Hello MPI! Process 6 of 8 on node011
Hello MPI! Process 5 of 8 on node011
Hello MPI! Process 2 of 8 on node010
Hello MPI! Process 0 of 8 on node010
Hello MPI! Process 3 of 8 on node010
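
The ranks run concurrently, so the order of the lines can differ from run to run. If an ordered listing is preferred, the output can be sorted on the rank number. The helper name <code>sort_by_rank</code> below is an assumption for this example.

<source lang='bash'>
#!/bin/bash
# sort_by_rank: read "Hello MPI! Process R of N on HOST" lines on stdin and
# print them ordered by the rank R (the fourth whitespace-separated field).
sort_by_rank() {
    sort -k4,4n
}
</source>

For example: <code>srun --nodes=2 --ntasks-per-node=4 --mpi=openmpi ./test_hello_world | sort_by_rank</code>.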

== A mvapich2 sbatch example ==
An MPI job using mvapich2 on 32 cores, using the normal compute nodes and the fast InfiniBand interconnect for RDMA traffic.
<source lang='bash'>
$ module load mvapich2/gcc
$ vim batch.sh
#!/bin/sh
#SBATCH --comment=projectx
#SBATCH --time=30-0
#SBATCH -n 32
#SBATCH --constraint=4gpercpu
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=MPItest
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@wur.nl

echo "Starting at `date`"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running on $SLURM_NPROCS processors."
echo "Current working directory is `pwd`"
# echo "Env var MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE is $MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE"
# export MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE=ib0

mpirun -iface ib0 -np 32 ./tmf_par.out -NX 480 -NY 240 -alpha 11 -chi 1.3 -psi_b 5e-2 -beta 0.0 -zeta 3.5 -kT 0.10

echo "Program finished with exit code $? at: `date`"

$ sbatch batch.sh

</source>

MPI on B4F cluster

2019-07-15T15:02:32Z

Dawes001:

== A simple 'Hello World' example ==
Consider the following simple MPI version, in C, of the 'Hello World' example:

<source lang='cpp'>
#include <stdio.h>
#include <mpi.h>
int main(int argc, char ** argv) {
    int size, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(processor_name, &namelen);   /* name of the host */
    printf("Hello MPI! Process %d of %d on %s\n", rank, size, processor_name);
    MPI_Finalize();
    return 0;
}
</source>

Before compiling, make sure that the required compilers are available.
<source lang='bash'>
module list
</source>

To avoid conflicts between libraries, the safest way is purging all modules:
<source lang='bash'>
module purge
</source>

Then load both the gcc and openmpi modules. If modules were purged, then slurm needs to be reloaded too.
<source lang='bash'>
module load gcc/4.8.1 openmpi/gcc/64/1.6.5 slurm/2.5.7
</source>

Compile the <code>hello_mpi.c</code> code.
<source lang='bash'>
mpicc hello_mpi.c -o test_hello_world
</source>

If desired, the shared libraries that the executable is linked against can be listed:
<source lang='bash'>
ldd test_hello_world
</source>

linux-vdso.so.1 => (0x00002aaaaaacb000)
libmpi.so.1 => /cm/shared/apps/openmpi/gcc/64/1.6.5/lib64/libmpi.so.1 (0x00002aaaaaccd000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaab080000)
libm.so.6 => /lib64/libm.so.6 (0x00002aaaab284000)
libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x0000003e29400000)
librt.so.1 => /lib64/librt.so.1 (0x00002aaaab509000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00002aaaab711000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaab92a000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaabb2e000)
libc.so.6 => /lib64/libc.so.6 (0x00002aaaabd4b000)
/lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)

Running the executable on two nodes, with four tasks per node, can be done like this:
<source lang='bash'>
srun --nodes=2 --ntasks-per-node=4 --partition=ABGC --mpi=openmpi ./test_hello_world
</source>

This will result in the following output:
Hello MPI! Process 4 of 8 on node011
Hello MPI! Process 1 of 8 on node010
Hello MPI! Process 7 of 8 on node011
Hello MPI! Process 6 of 8 on node011
Hello MPI! Process 5 of 8 on node011
Hello MPI! Process 2 of 8 on node010
Hello MPI! Process 0 of 8 on node010
Hello MPI! Process 3 of 8 on node010

== A mvapich2 sbatch example ==
An MPI job using mvapich2 on 32 cores, using the normal compute nodes and the fast InfiniBand interconnect for RDMA traffic.
<source lang='bash'>
$ module load mvapich2/gcc
$ vim batch.sh
#!/bin/sh
#SBATCH --comment=projectx
#SBATCH --time=30-0
#SBATCH -n 32
#SBATCH --constraint=4gpercpu
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=MPItest
#SBATCH --mail-type=ALL
#SBATCH --mail-user=user@wur.nl

echo "Starting at `date`"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running on $SLURM_NPROCS processors."
echo "Current working directory is `pwd`"
# echo "Env var MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE is $MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE"
# export MPIR_CVAR_NEMESIS_TCP_NETWORK_IFACE=ib0

mpirun -iface ib0 -np 32 ./tmf_par.out -NX 480 -NY 240 -alpha 11 -chi 1.3 -psi_b 5e-2 -beta 0.0 -zeta 3.5 -kT 0.10

echo "Program finished with exit code $? at: `date`"

$ sbatch batch.sh

</source>

Provean Sus scrofa

2019-07-15T15:02:03Z

Dawes001:

This page describes the procedure for mapping all known variants (batch of first 150 pigs, wild boar re-sequencing) at the ABGC.

== Pre-requisites ==
From Variant Effect Predictor output, select only protein altering variants and sort by transcript:
<source lang='bash'>
cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt
</source>

Protein models for Sus scrofa:
/lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa

== Automated procedure for mapping ==
The Provean analysis is somewhat involved because of an apparent bug in the program that results in conflict in temporary files. This is particularly problematic when farming out thousands of individual searches (i.e. per protein sequence) on the cluster. The cluster nodes need periodic 'cleaning' of those remaining temporary directories.

=== Master script to control the submission of jobs and cleaning ===
The following script will add 300 runs every hour. Note that it will kill remaining Provean processes and, importantly, clean the <code>/tmp</code> dirs of all nodes of remaining Provean-related temporary folders. This prevents the error in which Provean fails to create its temporary folders.
<source lang='bash'>
#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
#cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt
# TELLER ("counter") is never changed inside the loop, so the loop runs until the script is stopped.
TELLER=100
echo $TELLER;
let TELLER+=1;
echo $TELLER;
while [ $TELLER -gt 99 ]; do
    # Cancel every job named Provean that is still listed by squeue.
    # Note: if this script itself runs as a job named Provean, it is matched here as well.
    PROVS=`squeue | grep Provean | sed 's/^ \+//' | sed 's/ \+/\t/' | cut -f1`;
    for PROV in $PROVS; do scancel $PROV; done;
    sleep 10;
    # Remove leftover Provean temporary folders on all fat and normal nodes.
    for i in `seq 1 2`; do ssh fat00$i 'rm -rf /tmp/provean*'; done;
    for i in `seq 10 60`; do ssh node0$i 'rm -rf /tmp/provean*'; done;
    for i in `seq 1 9`; do ssh node00$i 'rm -rf /tmp/provean*'; done;
    # Submit up to 300 transcripts that do not have a supporting set (.sss) file yet.
    TRANS=`cat prot_alt.txt | cut -f6 | sort | uniq`;
    TELLER2=0;
    for TRAN in $TRANS; do
        if [ $TELLER2 -lt 300 ]; then
            echo "transcript: $TRAN";
            echo "counter before: $TELLER2";
            PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRAN | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`;
            echo "protein: $PROT";
            if [ -f $PROT.sss ];
            then
                echo "$PROT $TRAN already done";
            else
                echo "will do sbatch runProvean_sub.sh $TRAN";
                sbatch runProvean_sub.sh $TRAN;
                let TELLER2+=1;
                echo "counter after: $TELLER2";
            fi;
        fi;
    done;
    # Wait an hour before the next round of cancelling, cleaning and submitting.
    sleep 3600;
done

</source>
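The master script needs three separate <code>ssh</code> loops only because the node numbers are zero-padded to three digits. As an alternative sketch, the host names can be generated with <code>printf</code>. The function name <code>node_names</code> is hypothetical; the prefixes and ranges are the ones used in the script above.

<source lang='bash'>
#!/bin/bash
# node_names PREFIX FIRST LAST
# Print host names with a three-digit, zero-padded number, e.g. node009.
node_names() {
    local prefix=$1 first=$2 last=$3 i
    for (( i = first; i <= last; i++ )); do
        printf '%s%03d\n' "$prefix" "$i"
    done
}
</source>

The cleaning step could then be written as a single loop: <code>for host in $(node_names fat 1 2) $(node_names node 1 60); do ssh "$host" 'rm -rf /tmp/provean*'; done</code>.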

=== The slave script that does the actual submission ===
The 'runProvean_sub.sh' script referred to in the above script consists of the following code:
<source lang='bash'>
#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
TRANS=$1
PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRANS | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`
cat prot_alt.txt | grep $TRANS | awk '{print $11,$12}' | sed 's/ \+/\t/' | sed 's/\//\t/' | awk '{OFS=","; print $1,$2,$3}' | sed 's/\t//g' | sed 's/ \+//g' >$TRANS.var;
cat prot_alt.txt | grep $TRANS | awk -v prot=$PROT '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$8,prot, $11,$12,$13,$14,$15}' >$PROT.var.info;
faOneRecord /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa $PROT >$PROT.fa;
mv $TRANS.var $PROT.var;
provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
</source>

== Alternative: submission per transcript - no cleaning ==
Individual transcripts can also be submitted using the following script:
<source lang='bash'>
#!/bin/bash
#SBATCH --time=4800
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=16000
#SBATCH --nice=1000
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=Provean
#SBATCH --partition=ABGC_Research
#cat outVEP_*.txt | awk '$11~/\//' | sed 's/:/\t/' | sort -k6 >prot_alt.txt
TRANS=$1
PROT=`cat /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa | grep $TRANS | sed 's/ \+/\t/g' | sed 's/^>//' | cut -f1`
if [ -f $PROT.sss ];
then
echo "$PROT $TRANS already done.";
else
cat prot_alt.txt | grep $TRANS | awk '{print $11,$12}' | sed 's/ \+/\t/' | sed 's/\//\t/' | awk '{OFS=","; print $1,$2,$3}' | sed 's/\t//g' | sed 's/ \+//g' >$TRANS.var;
cat prot_alt.txt | grep $TRANS | awk -v prot=$PROT '{OFS="\t"; print $1,$2,$3,$5,$6,$7,$8,prot, $11,$12,$13,$14,$15}' >$PROT.var.info;
faOneRecord /lustre/nobackup/WUR/ABGC/shared/public_data_store/genomes/pig/Ensembl74/pep/Sus_scrofa.Sscrofa10.2.74.pep.all.fa $PROT >$PROT.fa;
mv $TRANS.var $PROT.var;
provean.sh -q $PROT.fa -v $PROT.var --save_supporting_set $PROT.sss >$PROT.result.txt 2>$PROT.error;
fi;

</source>

== See also ==
[[Provean_1.1.3 | Provean on Anunna]]

Maker protocols Pmajor

2019-07-15T15:01:44Z

Dawes001:

This page describes the various rounds of [http://www.yandell-lab.org/software/maker.html Maker]-based annotations for the [http://en.wikipedia.org/wiki/Parus_major ''Parus major'' (Great Tit)] genome.

== Round 1 ==
=== Rationale ===
For this round no P. major-based ESTs were available. Zebra finch (T. guttata) is the closest relative for which a reasonably complete gene-model set is available. As a first pass, it was decided to let gene predictions be driven by ab initio predictions rather than by zebra finch ESTs.

=== Invoking maker script ===
Do not forget to load the <code>maker</code> module:
<source lang='bash'>
module load maker/2.28
</source>

Script submitted to SLURM (with the <code>sbatch</code> command):
<source lang='bash'>
#!/bin/bash
#SBATCH --time=48000
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=test_maker
#SBATCH --mail-type=ALL
#SBATCH --mail-user=hendrik-jan.megens@wur.nl
maker
</source>
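The batch script requests 16 tasks, and <code>maker_opts.ctl</code> (shown further down this page) sets <code>cpus=16</code>. To check that the two stay in step, a value can be read back from a control file with a small helper. This is only a sketch: the function name <code>ctl_value</code> is hypothetical, and it assumes the <code>key=value #comment</code> layout used in the control files on this page.

<source lang='bash'>
#!/bin/bash
# ctl_value FILE KEY
# Print the value assigned to KEY in a MAKER control file, with the trailing
# "#comment" and any surrounding whitespace removed.
ctl_value() {
    local file=$1 key=$2
    sed -n "s/^${key}=\([^#]*\).*/\1/p" "$file" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//'
}
</source>

Inside the batch script, a check such as <code>[ "$(ctl_value maker_opts.ctl cpus)" = "$SLURM_NTASKS" ] || echo "cpus and --ntasks differ"</code> could then be placed before the <code>maker</code> call.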

=== Maker settings ===
==== content of <code>maker_opts.ctl</code> ====
#-----Genome (these are always required)
genome=Pam.fa #genome sequence (fasta file or fasta embedded in GFF3 file)
organism_type=eukaryotic #eukaryotic or prokaryotic. Default is eukaryotic

#-----Re-annotation Using MAKER Derived GFF3
maker_gff= #MAKER derived GFF3 file
est_pass=0 #use ESTs in maker_gff: 1 = yes, 0 = no
altest_pass=0 #use alternate organism ESTs in maker_gff: 1 = yes, 0 = no
protein_pass=0 #use protein alignments in maker_gff: 1 = yes, 0 = no
rm_pass=0 #use repeats in maker_gff: 1 = yes, 0 = no
model_pass=0 #use gene models in maker_gff: 1 = yes, 0 = no
pred_pass=0 #use ab-initio predictions in maker_gff: 1 = yes, 0 = no
other_pass=0 #passthrough anything else in maker_gff: 1 = yes, 0 = no

#-----EST Evidence (for best results provide a file for at least one)
est= #set of ESTs or assembled mRNA-seq in fasta format
altest= Taeniopygia_guttata.taeGut3.2.4.74.cdna.all.fa #EST/cDNA sequence file in fasta format from an alternate organism
est_gff= #aligned ESTs or mRNA-seq from an external GFF3 file
altest_gff= #aligned ESTs from a closely related species in GFF3 format

#-----Protein Homology Evidence (for best results provide a file for at least one)
protein= Taeniopygia_guttata.taeGut3.2.4.74.pep.all.fa #protein sequence file in fasta format (i.e. from multiple organisms)
protein_gff= #aligned protein homology evidence from an external GFF3 file

#-----Repeat Masking (leave values blank to skip repeat masking)
model_org=Metazoa #select a model organism for RepBase masking in RepeatMasker
rmlib= #provide an organism specific repeat library in fasta format for RepeatMasker
repeat_protein= #provide a fasta file of transposable element proteins for RepeatRunner
rm_gff= #pre-identified repeat elements from an external GFF3 file
prok_rm=0 #forces MAKER to repeatmask prokaryotes (no reason to change this), 1 = yes, 0 = no
softmask=1 #use soft-masking rather than hard-masking in BLAST (i.e. seg and dust filtering)

#-----Gene Prediction
snaphmm= /cm/shared/apps/WUR/ABGC/snap/snap-2013-11-29/HMM/mam54.hmm #SNAP HMM file
gmhmm= #GeneMark HMM file
augustus_species= chicken #Augustus gene prediction species model
fgenesh_par_file= #FGENESH parameter file
pred_gff= #ab-initio predictions from an external GFF3 file
model_gff= #annotated gene models from an external GFF3 file (annotation pass-through)
est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no
protein2genome=0 #infer predictions from protein homology, 1 = yes, 0 = no
unmask=0 #also run ab-initio prediction programs on unmasked sequence, 1 = yes, 0 = no

#-----Other Annotation Feature Types (features MAKER doesn't recognize)
other_gff= #extra features to pass-through to final MAKER generated GFF3 file

#-----External Application Behavior Options
alt_peptide=C #amino acid used to replace non-standard amino acids in BLAST databases
cpus=16 #max number of cpus to use in BLAST and RepeatMasker (not for MPI, leave 1 when using MPI)

#-----MAKER Behavior Options
max_dna_len=100000 #length for dividing up contigs into chunks (increases/decreases memory usage)
min_contig=1 #skip genome contigs below this length (under 10kb are often useless)

pred_flank=200 #flank for extending evidence clusters sent to gene predictors
pred_stats=0 #report AED and QI statistics for all predictions as well as models
AED_threshold=1 #Maximum Annotation Edit Distance allowed (bound by 0 and 1)
min_protein=0 #require at least this many amino acids in predicted proteins
alt_splice=0 #Take extra steps to try and find alternative splicing, 1 = yes, 0 = no
always_complete=0 #extra steps to force start and stop codons, 1 = yes, 0 = no
map_forward=0 #map names and attributes forward from old GFF3 genes, 1 = yes, 0 = no
keep_preds=0 #Concordance threshold to add unsupported gene prediction (bound by 0 and 1)

split_hit=10000 #length for the splitting of hits (expected max intron size for evidence alignments)
single_exon=0 #consider single exon EST evidence when generating annotations, 1 = yes, 0 = no
single_length=250 #min length required for single exon ESTs if 'single_exon is enabled'
correct_est_fusion=0 #limits use of ESTs in annotation to avoid fusion genes

tries=2 #number of times to try a contig if there is a failure for some reason
clean_try=1 #remove all data from previous run before retrying, 1 = yes, 0 = no
clean_up=1 #removes theVoid directory with individual analysis files, 1 = yes, 0 = no
TMP= #specify a directory other than the system default temporary directory for temporary files

==== content of <code>maker_exe.ctl</code> ====
#-----Location of Executables Used by MAKER/EVALUATOR
makeblastdb=/cm/shared/apps/WUR/ABGC/blast/ncbi-blast-2.2.28+/bin/makeblastdb #location of NCBI+ makeblastdb executable
blastn=/cm/shared/apps/WUR/ABGC/blast/ncbi-blast-2.2.28+/bin/blastn #location of NCBI+ blastn executable
blastx=/cm/shared/apps/WUR/ABGC/blast/ncbi-blast-2.2.28+/bin/blastx #location of NCBI+ blastx executable
tblastx=/cm/shared/apps/WUR/ABGC/blast/ncbi-blast-2.2.28+/bin/tblastx #location of NCBI+ tblastx executable
formatdb= #location of NCBI formatdb executable
blastall= #location of NCBI blastall executable
xdformat= #location of WUBLAST xdformat executable
blasta= #location of WUBLAST blasta executable
RepeatMasker=/cm/shared/apps/WUR/ABGC/RepeatMasker/RepeatMasker-4-0-3/RepeatMasker #location of RepeatMasker executable
exonerate=/cm/shared/apps/WUR/ABGC/exonerate/exonerate-2.2.0-x86_64/bin/exonerate #location of exonerate executable

#-----Ab-initio Gene Prediction Algorithms
snap=/cm/shared/apps/WUR/ABGC/snap/snap-2013-11-29/snap #location of snap executable
gmhmme3= #location of eukaryotic genemark executable
gmhmmp= #location of prokaryotic genemark executable
augustus=/cm/shared/apps/WUR/ABGC/augustus/augustus.2.7/src/augustus #location of augustus executable
fgenesh= #location of fgenesh executable

#-----Other Algorithms
probuild= #location of probuild executable (required for genemark)

==== contents of <code>maker_bopts.ctl</code>====
#-----BLAST and Exonerate Statistics Thresholds
blast_type=ncbi+ #set to 'ncbi+', 'ncbi' or 'wublast'

pcov_blastn=0.8 #Blastn Percent Coverage Threshold EST-Genome Alignments
pid_blastn=0.85 #Blastn Percent Identity Threshold EST-Genome Alignments
eval_blastn=1e-10 #Blastn eval cutoff
bit_blastn=40 #Blastn bit cutoff
depth_blastn=0 #Blastn depth cutoff (0 to disable cutoff)

pcov_blastx=0.5 #Blastx Percent Coverage Threshold Protein-Genome Alignments
pid_blastx=0.4 #Blastx Percent Identity Threshold Protein-Genome Alignments
eval_blastx=1e-06 #Blastx eval cutoff
bit_blastx=30 #Blastx bit cutoff
depth_blastx=0 #Blastx depth cutoff (0 to disable cutoff)

pcov_tblastx=0.8 #tBlastx Percent Coverage Threshold alt-EST-Genome Alignments
pid_tblastx=0.85 #tBlastx Percent Identity Threshold alt-EST-Genome Alignments
eval_tblastx=1e-10 #tBlastx eval cutoff
bit_tblastx=40 #tBlastx bit cutoff
depth_tblastx=0 #tBlastx depth cutoff (0 to disable cutoff)

pcov_rm_blastx=0.5 #Blastx Percent Coverage Threshold For Transposable Element Masking
pid_rm_blastx=0.4 #Blastx Percent Identity Threshold For Transposable Element Masking
eval_rm_blastx=1e-06 #Blastx eval cutoff for transposable element masking
bit_rm_blastx=30 #Blastx bit cutoff for transposable element masking

ep_score_limit=20 #Exonerate protein percent of maximal score threshold
en_score_limit=20 #Exonerate nucleotide percent of maximal score threshold

== See also ==
[[Maker_2.2.8 | Maker pipeline as installed on Anunna]]

== External links ==
* [http://www.yandell-lab.org/software/maker.html Maker homepage]
* [http://gmod.org/wiki/MAKER_Tutorial_2013 Maker tutorial]

Migration from ESG HPC

2019-07-15T15:01:07Z

Dawes001:

== Folders ==
home folder (200GB limit):
/home/WUR/<user>/
lustre backup folder:
/lustre/backup/WUR/ESG/<user>/
lustre no-backup folder:
/lustre/nobackup/WUR/ESG/<user>/
/DATA folder:
/lustre/backup/WUR/ESG/data/

== Example of a job script ==
#!/bin/bash
#SBATCH --comment=99999999
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name="test slurm"
#SBATCH --nodes=1
#SBATCH --mail-type=ALL
#SBATCH --mail-user=wietse.franssen@wur.nl

./executable

Assemble mitochondrial genomes from short read data

2019-07-15T15:00:46Z

Dawes001:

A simple procedure for assembling mitochondrial genomes based on whole-genome re-sequencing data. The first step is to extract reads from the sequence library based on a closely related entirely assembled genome (e.g., for pig, the MT genome as present in the genome build, but could also be of a related species). The genome is then assembled using SOAPdenovo.

Requirements:
* a reference genome of a closely related population or species
* a bowtie2 index of that reference (made with <code>bowtie2-build</code>)
* a blastable database of the reference mitochondrial genome
* a SOAPdenovo configuration file:

soapdenovo.config
[LIB]
avg_ins=450
reverse_seq=0
asm_flags=1
rank=3
q1=fq1.fq
q2=fq2.fq
Note that the appropriate avg_ins value (the average insert size) may vary between libraries, and that it may affect assembly efficiency.
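
Because avg_ins differs between libraries, it can be convenient to generate the configuration file per library instead of editing it by hand. The sketch below reproduces the configuration shown above with the insert size as a parameter; the function name <code>write_soap_config</code> is an assumption for this example.

<source lang='bash'>
#!/bin/bash
# write_soap_config AVG_INS [FQ1] [FQ2]
# Print a SOAPdenovo configuration with the given average insert size. The
# fastq names default to the ones used in the configuration on this page.
write_soap_config() {
    local avg_ins=$1 fq1=${2:-fq1.fq} fq2=${3:-fq2.fq}
    cat <<EOF
[LIB]
avg_ins=$avg_ins
reverse_seq=0
asm_flags=1
rank=3
q1=$fq1
q2=$fq2
EOF
}
</source>

For a library with 450 bp inserts this would be called as <code>write_soap_config 450 > soapdenovo.config</code>.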

<source lang='bash'>
#!/bin/bash
#SBATCH --time=1000
#SBATCH --mem=16000
#SBATCH --ntasks=8
#SBATCH --nodes=1
#SBATCH --constraint=4gpercpu
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=assemble_mito
#SBATCH --mail-type=ALL
#SBATCH --mail-user=hendrik-jan.megens@wur.nl
module load bowtie/2-2.2.1 SOAPdenovo2/r240 BLAST+/2.2.28 MUMmer/3.23

bowtie2 --phred$2 --local -p 8 -x mt_pig.fa -1 $3 -2 $4 | head -2 >$1_mito_align.sam
bowtie2 --phred$2 --local -p 8 -x mt_pig.fa -1 $3 -2 $4 | awk '$5>0' | head -10000 >>$1_mito_align.sam

java7 -jar /cm/shared/apps/SHARED/picard-tools/picard-tools-1.109/SamToFastq.jar I=$1_mito_align.sam F=fq1.fq F2=fq2.fq INCLUDE_NON_PF_READS=True

SOAPdenovo-63mer all -K 63 -p 4 -s soapdenovo.config -o $1_mito_assembly.fa

blastn -query $1_mito_assembly.fa.scafSeq -db mt_pig.fa -outfmt 6

mummer -mum -b -c mt_pig.fa $1_mito_assembly.fa.scafSeq > mummer.mums
mummerplot -postscript -p mummer mummer.mums
</source>

Invoke like this:
<source lang='bash'>
sbatch do_mtalign_bowtie_pig.sh MA01F18 33 \
/lustre/nobackup/WUR/ABGC/shared/Pig/ABGSA/ABGSA0071/ABGSA0071_MA01F18_R1.PF.fastq.gz \
/lustre/nobackup/WUR/ABGC/shared/Pig/ABGSA/ABGSA0071/ABGSA0071_MA01F18_R2.PF.fastq.gz
</source>

Calculate corrected theta from resequencing data

2019-07-15T15:00:06Z

Dawes001:

This procedure will estimate theta (nucleotide diversity) based on re-sequencing data. The method is described in [http://www.biomedcentral.com/1471-2164/14/148 Esteve-Codina et al.]

<source lang ='bash'>
#!/bin/bash
#SBATCH --time=10000
#SBATCH --mem=4000
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --constraint=4gpercpu
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=ngstheta
module load samtools/0.1.19
VAR=`gunzip -c /lustre/nobackup/WUR/ABGC/shared/Pig/vars_hjm_newbuild10_2/vars-flt_$1-final.txt.gz | cut -f8 | head -1000000 | sort | uniq -c | sed 's/^ \+//' | sed 's/ \+/\t/' | sort -k1 -nr | head -1 | cut -f2`
let MAX=2*VAR
echo "$1 max_depth is $MAX"
MIN=$(( $VAR / 3 ))
if [ $MIN -lt 5 ]; then MIN=4; fi

echo "$1 min_depth is $MIN"
samtools mpileup -uf /lustre/nobackup/WUR/ABGC/shared/Pig/Sscrofa_build10_2/FASTA/Ssc10_2_chromosomes.fa /lustre/nobackup/WUR/ABGC/shared/Pig/BAM_files_hjm_newbuild10_2/$1_rh.bam | bcftools view -bvcg - > $1.mig.bcf
bcftools view $1.mig.bcf | vcfutils.pl varFilter -d$MIN -D$MAX > $1.mig.vcf
awk '$6 >= 20' $1.mig.vcf > $1.miguel.vcf
samtools mpileup -Bq 20 -d 50000 /lustre/nobackup/WUR/ABGC/shared/Pig/BAM_files_hjm_newbuild10_2/$1_rh.bam | perl covXwin-v3.1.pl -v $1.miguel.vcf -w 50000 -d $MIN -m $MAX -b /lustre/nobackup/WUR/ABGC/shared/Pig/BAM_files_hjm_newbuild10_2/$1_rh.bam | ./ngs_theta -d $MIN -m $MAX > $1.wintheta
</source>

The script can be submitted with <code>sbatch</code> as follows, assuming that it is saved as <code>nucdiv_pipeline.sh</code> and that the names of the individuals are listed in a file called <code>individuals.txt</code>.
<source lang='bash'>
INDS=`cat individuals.txt`
for IND in $INDS; do sbatch nucdiv_pipeline.sh $IND; done
</source>
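
The depth cut-offs in the pipeline script above are derived from the most frequent depth value. The arithmetic can be sketched in isolation as follows, assuming one integer depth per line on standard input. The function names <code>modal_value</code> and <code>depth_bounds</code> are hypothetical and not part of the pipeline.

```bash
#!/bin/bash
# Print the most frequent value among the lines read from standard input.
modal_value() {
    sort | uniq -c | sort -k1,1nr | head -n 1 | awk '{print $2}'
}

# Print "MIN MAX" for a given modal depth, following the rule in the script above:
# MAX is twice the mode, MIN is a third of it (integer division), never below 4.
depth_bounds() {
    local mode=$1
    local max=$(( 2 * mode ))
    local min=$(( mode / 3 ))
    if [ "$min" -lt 5 ]; then min=4; fi
    echo "$min $max"
}

# Small illustrative input; real depths would come from the variant file.
mode=$(printf '%s\n' 12 15 15 9 15 12 | modal_value)
depth_bounds "$mode"
```

Checking the cut-offs on a handful of values like this before submitting many jobs can help catch an unexpected column or an empty input file.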

Average values for theta were then extracted with the following R script:
<source lang = 'rsplus'>
files=list.files(pattern="wintheta")
a <- data.frame("file" = character(), "theta_het" = numeric())
for (file1 in files){
x <- read.table(file1,header=T);
mn=mean(x$THETA_HET[x$BP>20000 & x$CHR != 'chrUN_nr' & x$CHR != 'Ssc10_2_X']);
print(paste(file1,mn,sep=" "));
a<- rbind(a,data.frame("file"=file1,"theta_het"=mn))
}
write.table(x=a,file="theta_het_results.txt")
</source>

RNA-seq analysis

2019-07-15T14:59:37Z

Dawes001:

=== Typical commands used for analyzing RNA-seq with Tophat (including Bowtie2 as aligner).===

* Examples are RNA-seq (stranded) from pig aligned against the pig reference genome (''S. scrofa'' - 10.2)

* Tophat, Bowtie2, Picard and GATK need to be in PATH or loaded as modules (e.g.: module load SHARED/bowtie/2-2.2.1; module load SHARED/tophat/2.0.11)
* Bowtie2 index of the reference genome (made with bowtie2-build; needs to be made only once)
* PCR duplicates removed with Picard
* For allelic expression and RNA-editing analysis: re-alignment with GATK

===For expression analysis===
* Allowing multiple hits (up to 20 per read)
<source lang='bash'>
#!/bin/bash
#SBATCH --time=2-12:00:00
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=AG_tophat
#SBATCH --mem=20000

#Setting java tmp for slurm
export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/madse001/tmp

#Brain_fontalloop
#Tophat alignment
tophat -p 16 -G ../ssc10_2_ens/Sscrofa10.2.68_ens_Tophat.gtf -M --rg-id agLW_BRfl --rg-sample ag_LW_BRfl --rg-platform Illumina --keep-fasta-order --read-realign-edit-dist 0 -r 120 --library-type fr-firststrand \
-o tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M --mate-std-dev 250 ../ssc10_2_ens/ssc10_2_ens agLW_BRfl_read1.trimmed.final.fastq.gz agLW_BRfl_read2.trimmed.final.fastq.gz

#Rename Tophat alignment file
mv tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/accepted_hits.bam tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/agLW_BRfl_g20_peO_reAln_ens_RG_M.bam

#Remove PCR duplicates with Picard MarkDuplicates
java -Xms16g -jar ~/bin/MarkDuplicates.jar I=tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/agLW_BRfl_g20_peO_reAln_ens_RG_M.bam O=tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/agLW_BRfl_g20_peO_reAln_ens_RG_M_RDp.bam \
M=tophat2_agLW_BRfl_g20_peO_reAln_ens_RG_M/agLW_BRfl_g20_peO_reAln_ens_RG_M.bam.out REMOVE_DUPLICATES=true

</source>
===For allelic expression and rna-editing analysis===
* Unique alignment only
<source lang='bash'>
#!/bin/bash
#SBATCH --time=2-12:00:00
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=AG_tophat
#SBATCH --mem=20000

#Setting java tmp for slurm
export _JAVA_OPTIONS=-Djava.io.tmpdir=/lustre/scratch/WUR/ABGC/madse001/tmp

#Brain_fontalloop
#Tophat alignment
tophat -p 16 -g 1 -G ../ssc10_2_ens/Sscrofa10.2.68_ens_Tophat.gtf -M --rg-id agLW_BRfl --rg-sample ag_LW_BRfl --rg-platform Illumina --keep-fasta-order --read-realign-edit-dist 0 -r 120 --library-type fr-firststrand \
-o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M --mate-std-dev 250 ../ssc10_2_ens/ssc10_2_ens agLW_BRfl_read1.trimmed.final.fastq.gz agLW_BRfl_read2.trimmed.final.fastq.gz

#Rename Tophat alignment file
mv tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/accepted_hits.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam

#Remove PCR duplicates with Picard
java -Xms16g -jar ~/bin/MarkDuplicates.jar I=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam O=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
M=tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M.bam.out REMOVE_DUPLICATES=true

#Index BAM file for GATK analysis
samtools index tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bai

#Re-align with GATK
java -Xms16g -jar ~/bin/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../ssc10_2_ens/ssc10_2_ens.fa -I tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
-o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/realigner.intervals
java -Xms16g -jar ~/bin/GenomeAnalysisTK.jar -T IndelRealigner -R ../ssc10_2_ens/ssc10_2_ens.fa -I tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam \
-targetIntervals tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/realigner.intervals -o tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bam

#Index final BAM file
samtools index tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bam tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp_reG.bai

#Remove tmp bam (bai) file
rm tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bam
rm tophat2_agLW_BRfl_g1_peO_reAln_ens_RG_M/agLW_BRfl_g1_peO_reAln_ens_RG_M_RDp.bai

</source>
===Some notes on Tophat options:===
* -g: maximum number of alignments reported per read (default 20)
* -M: prefilter multi-mapping reads against the genome (--prefilter-multihits)
* --rg-id, --rg-sample, --keep-fasta-order: needed for many downstream analyses, such as those with Picard and GATK
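
The scripts above spell out the same long directory and file names many times, which makes copy-and-paste slips easy. A sketch of deriving them from a few variables, assuming the naming pattern used above, could look like this. The variable names <code>SAMPLE</code> and <code>MAXHITS</code> and the helper <code>bam_prefix</code> are illustrative only.

```bash
#!/bin/bash
# Build the common path prefix for one sample and one -g setting.
# Arguments: sample id (e.g. agLW_BRfl) and the maximum number of multihits (e.g. 1 or 20).
bam_prefix() {
    local sample=$1 maxhits=$2
    local tag="${sample}_g${maxhits}_peO_reAln_ens_RG_M"
    echo "tophat2_${tag}/${tag}"
}

SAMPLE=agLW_BRfl
MAXHITS=20
PREFIX=$(bam_prefix "$SAMPLE" "$MAXHITS")

# The three names used around the Picard step, all derived from one prefix.
echo "aligned BAM:      ${PREFIX}.bam"
echo "deduplicated BAM: ${PREFIX}_RDp.bam"
echo "Picard metrics:   ${PREFIX}.bam.out"
```

With such a prefix, switching between the expression run and the unique-alignment run only requires changing <code>MAXHITS</code>.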

Array jobs

2019-07-15T14:58:59Z

Dawes001:

SLURM can simplify your efforts if you are planning on submitting multiple independent jobs in parallel. Rather than having to use sbatch multiple times, you can instead use an array job to run your job.

Take the following example:

<source lang='bash'>
#!/bin/bash
#SBATCH --output=output_%A.%a.txt
#SBATCH --error=error_%A.%a.txt
#SBATCH --time=10
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4000
#SBATCH --array=0-9%4

echo $SLURM_ARRAY_TASK_ID

</source>
Let's break this down step by step:

<source lang='bash'>
#SBATCH --output=output_%A.%a.txt
#SBATCH --error=error_%A.%a.txt
</source>
This makes sure your job outputs to a file called output_<Jobnumber>.<Arrayid>.txt, allowing you to track which array ID returned what.

<source lang='bash'>
#SBATCH --array=0-9%4
</source>
This defines the array job itself: it runs ten tasks, with array IDs 0 to 9, and allows no more than 4 of them to run at once. The syntax also lets you specify exactly which IDs to use, for example:
<source lang='bash'>
#SBATCH --array=3,7-11
</source>
will only run array tasks with IDs 3, 7, 8, 9, 10 and 11.

<source lang='bash'>
echo $SLURM_ARRAY_TASK_ID
</source>
This will print to stdout (and get redirected to output_%A.%a.txt) the environment variable set by SLURM that indicates which Array ID this process has.

So, once this job is run, we will end up with ten files, all called output_<jobid>.<n>.txt, containing the number n.
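
A common use of <code>SLURM_ARRAY_TASK_ID</code> is to give each array task its own input. The following is a minimal sketch, assuming a hypothetical file <code>inputs.txt</code> with one input per line and a zero-based array such as <code>--array=0-9</code>. The helper name <code>nth_line</code> is an illustration, not an existing tool.

```bash
#!/bin/bash
# Print line (index + 1) of a file, so that index 0 maps to the first line.
nth_line() {
    local index=$1 file=$2
    sed -n "$(( index + 1 ))p" "$file"
}

# Inside an array job SLURM sets SLURM_ARRAY_TASK_ID; default to 0 for a dry run elsewhere.
TASK_ID=${SLURM_ARRAY_TASK_ID:-0}
LIST=${1:-inputs.txt}   # hypothetical list of inputs, one per line

if [ -r "$LIST" ]; then
    INPUT=$(nth_line "$TASK_ID" "$LIST")
    echo "Task $TASK_ID would process: $INPUT"
fi
```

Because the task ID defaults to 0 outside SLURM, the script can be tried on the login node before the array is submitted.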

== Two dimensional arrays? ==
Running an array such as above will result in a one dimensional string of jobs, for example, with --array=0-9, then

SLURM_ARRAY_TASK_ID=[ 0 1 2 3 4 5 6 7 8 9 ]

for each job. What if you need two variables to change instead of one?

Integer division and the modulo (remainder) operation can solve this. As an example, take the number 93 and a divisor of 10:

<source lang='bash'>
A=$((93 / 10)) # A = 9
B=$((93 % 10)) # B = 3
</source>

As you can see, this splits the number into its tens digit and its units digit, so a job array of 0-99 can be turned into two variables that traverse a 10 by 10 grid. Bear in mind that both variables always start at 0, so if you need, say, A to be 1-5 and B to be 3-8, then:

<source lang='bash'>
#SBATCH --array=0-29 ## 5*6 = 30 combinations, numbered 0-29
A=$((SLURM_ARRAY_TASK_ID/6+1)) # A = [0-4]+1 = [1-5]
B=$((SLURM_ARRAY_TASK_ID%6+3)) # B = [0-5]+3 = [3-8]
mywork $A $B
</source>
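
One way to check such a mapping before submitting is to enumerate it locally. The sketch below assumes that A should cover 1-5 and B should cover 3-8, so the task ID is divided by the number of B values (6) to obtain A, and the remainder gives B. The helper <code>map_task</code> is hypothetical.

```bash
#!/bin/bash
# Map a task ID in 0..29 to a pair (A, B) with A in 1..5 and B in 3..8.
# The quotient by the number of B values selects A; the remainder selects B.
map_task() {
    local id=$1
    local n_b=6                      # number of distinct B values (3..8)
    echo "$(( id / n_b + 1 )) $(( id % n_b + 3 ))"
}

# Enumerate all 30 task IDs and count how many distinct pairs come out.
for id in $(seq 0 29); do
    map_task "$id"
done | sort -u | wc -l
```

If the count equals the number of task IDs, every combination of A and B is visited exactly once.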

Creating sbatch script

2019-07-15T14:58:38Z

Dawes001:

== A skeleton Slurm script ==
<source lang='bash'>
#!/bin/bash
#-----------------------------Mail address-----------------------------
#SBATCH --mail-user=
#SBATCH --mail-type=ALL
#-----------------------------Output files-----------------------------
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#-----------------------------Other information------------------------
#SBATCH --comment=
#SBATCH --qos=
#-----------------------------Required resources-----------------------
#SBATCH --time=0-0:0:0
#SBATCH --ntasks=
#SBATCH --cpus-per-task=
#SBATCH --mem-per-cpu=

#-----------------------------Environment, Operations and Job steps----
#load modules

#export variables

#your job

</source>

==Explanation of used SBATCH parameters==
===partition for resource allocation===
<source lang='bash'>
#SBATCH --partition=ABGC_Std
</source>
Request a specific partition for the resource allocation. It is preferred to use your organization's partition.

=== Adding accounting information or project number ===
<source lang='bash'>
#SBATCH --comment=773320000
</source>
Attach accounting information to the job, so that the resources it uses can be charged to the right project. The comment is an arbitrary string and may be changed after job submission using the <tt>scontrol</tt> command. For WUR users, a project number or KTP number is advisable.

===time limit===
<source lang='bash'>
#SBATCH --time=1200
</source>
A time limit of zero requests that no time limit be imposed. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". So in this example the job will run for a maximum of 1200 minutes.
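
To make the relation between the formats concrete, the following sketch converts a number of minutes into the "days-hours:minutes:seconds" form. The function name <code>minutes_to_slurm_time</code> is made up for this illustration.

```bash
#!/bin/bash
# Convert a number of minutes to the days-hours:minutes:seconds form.
minutes_to_slurm_time() {
    local total=$1
    local days=$(( total / 1440 ))           # 1440 minutes per day
    local hours=$(( (total % 1440) / 60 ))
    local mins=$(( total % 60 ))
    printf '%d-%02d:%02d:00\n' "$days" "$hours" "$mins"
}

minutes_to_slurm_time 1200    # the example above, i.e. 20 hours
minutes_to_slurm_time 10000
```

Both notations request the same limit; the longer form is mainly easier to read for multi-day jobs.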

===memory limit===
<source lang='bash'>
#SBATCH --mem=2048
</source>
SLURM imposes a memory limit on each job. By default, it is deliberately small: 100 MB per node. If your job uses more than that, it will fail with the error "Exceeded job memory limit". To set a larger limit, add to your job submission:
<source lang='bash'>
#SBATCH --mem X
</source>

where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:

<source lang='bash'>
$ sacct -o MaxRSS -j JOBID
</source>
where JOBID is the one you are interested in. The number is in KB, so divide by 1024 to get a rough idea, in MB, of what to use with <code>--mem</code> (set it to something a little larger than that, since you are defining a hard upper limit). If your job completed long in the past, you may have to tell sacct to look further back in time by adding a start time with <code>-S YYYY-MM-DD</code>. Note that for parallel jobs spanning multiple nodes, this is the maximum memory used on any one node; if you are not setting an even distribution of tasks per node (e.g. with <code>--ntasks-per-node</code>), the same job could have very different values when run at different times.
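
The conversion described above can be sketched as a small helper. It assumes that the MaxRSS value is given in KB, optionally with a trailing "K", as stated in the text; other unit suffixes are not handled. The function name <code>maxrss_to_mem</code> and the default margin of 20 percent are arbitrary choices for this illustration.

```bash
#!/bin/bash
# Turn a MaxRSS value in KB (with or without a trailing "K") into a --mem value in MB,
# rounded up and padded with a safety margin given in percent (default 20).
maxrss_to_mem() {
    local kb=${1%K}
    local margin=${2:-20}
    local mb=$(( (kb + 1023) / 1024 ))                # KB to MB, rounded up
    echo $(( (mb * (100 + margin) + 99) / 100 ))      # add the margin, rounded up
}

maxrss_to_mem 1500000K
```

The printed number can then be used as the value of <code>--mem</code> in the next submission of a similar job.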

===number of tasks===
<source lang='bash'>
#SBATCH --ntasks=1
</source>
sbatch does not launch tasks; it requests an allocation of resources and submits a batch script. This option advises the SLURM controller that job steps run within the allocation will launch a maximum of ''number'' tasks, and asks it to provide sufficient resources. The default is one task per node, but note that the --cpus-per-task option will change this default.

When requesting multiple tasks, you may or may not want the job to be partitioned among multiple nodes. You can specify the minimum number of nodes using the <code>-N</code> or <code>--nodes</code> flag. If you provide only one number, this will be minimum and maximum at the same time. For instance:
<source lang='bash'>
#SBATCH --nodes=1
</source>
This should force your job to be scheduled to a single node.

Because the cluster has a hybrid configuration, i.e. normal and fat nodes, it may be prudent to schedule your job specifically for one or the other node type, depending for instance on memory requirements. This can be done by using the <code>-C</code> or <code>--constraint</code> flag.

===constraints: selecting by feature===
<source lang='bash'>
#SBATCH --constraint=normalmem
</source>
The HPC nodes have features associated with them, such as Intel CPUs, or the amount of memory per node. If you know that your job requires a specific architecture or memory size, you can elect to constrain your job to only these features.

The example above will result in jobs being scheduled to the regular compute nodes. By using <code>largemem</code> as option the job will specifically be scheduled to one of the fat nodes.

All features can be seen using:
<source lang='bash'>
scontrol show nodes | grep ActiveFeatures | sort | uniq
</source>

===requesting specific resources===
<source lang='bash'>
#SBATCH --gres=gpu:1
</source>
In order to be able to use specific hardware resources, you need to request a Generic Resource. Once you do this, one of the resources will be allocated to your job when they are available. In the above example, one GPU is requested for use.

===output (stderr,stdout) directed to file===
<source lang='bash'>
#SBATCH --output=output_%j.txt
</source>
Instruct SLURM to connect the batch script's standard output directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.
<source lang='bash'>
#SBATCH --error=error_output_%j.txt
</source>
Instruct SLURM to connect the batch script's standard error directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.

===adding a job name===
<source lang='bash'>
#SBATCH --job-name=calc_pi.py
</source>
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just "sbatch" if the script is read on sbatch's standard input.

===receiving mailed updates===
<source lang='bash'>
#SBATCH --mail-type=ALL
</source>
Notify user by email when certain event types occur. Valid type values are BEGIN, END, FAIL, REQUEUE, and ALL (any state change). The user to be notified is indicated with --mail-user.
<source lang='bash'>
#SBATCH --mail-user=yourname001@wur.nl
</source>
Email address to use.

== See also ==
* [[Anunna | Anunna]]
* [[Using_Slurm#Batch_script | Submitting jobs to Slurm]]
* [[Array_jobs|Array job hints]]

Using Slurm

2019-07-15T14:57:48Z

Dawes001:

The resource allocation / scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.

== Queues and defaults ==

=== Quality of Service ===
When submitting a job, you may optionally assign a different Quality of Service to it. You can do this with:

<source lang='bash'>
#SBATCH --qos=std
</source>

By default, jobs will use std, the standard quality.

Optionally, you may elect to reduce the priority of your jobs to low. This comes with a limit of how long each job can be (8h) to prevent the cluster from being locked up entirely with low priority jobs.

The high quality provides a higher priority to jobs (20) than std (10), or low (1). It is naturally more expensive.

The highest priority goes to jobs in the interactive quality (100), but you may not submit many jobs or many large jobs in this quality. It is meant exclusively for jobs that need to run immediately because a user is working with them hands-on.

Jobs may be restarted and rescheduled if a job with higher priority needs cluster resources, but as of right now, this is not occurring.

=== Queues ===
The cluster consists of multiple partitions of nodes that you can submit to. The primary one is 'main'. There are other partitions as needed - current plans include 'gpu'.

You can see the available partitions with <code>sinfo</code>.

=== Defaults ===
The default partition is 'main'. This will work for most jobs.

The default qos is 'std'.

The default cpu count is 1.

The default run time for a job is '''1 hour'''.

The default memory limit is '''100MB per node'''.

== Submitting jobs: sbatch ==

=== Example ===
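Consider this simple Python 3 script, which is meant to calculate Pi to a very large number of digits and serves here as an example workload. Note that the working precision is set to 10 million digits while the series is truncated after 411 terms, so only the leading digits of the printed result are accurate: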
<source lang='python'>
from decimal import *
D=Decimal
getcontext().prec=10000000
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))
print(str(p)[:10000002])
</source>

=== Loading modules ===
In order for this script to run, Python 3, which is not the default Python version on the cluster, first needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:
module avail

In the list you should note that python3 is indeed available to be loaded, which then can be loaded with the following command:
module load python/3.3.3

=== Batch script ===
[[Creating_sbatch_script | Main Article: Creating a sbatch script]]

The following shell/slurm script can then be used to schedule the job using the sbatch command:
<source lang='bash'>
#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

time python3 calc_pi.py
</source>

=== Submitting ===
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'>
sbatch run_calc_pi.sh
</source>

=== Submitting multiple jobs (simple) ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code:
<source lang='bash'>for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done
</source>

=== Submitting multiple jobs (complex) ===
Let's say you have three job scripts that depend on each other:

<source lang='bash'>job_1.sh #A simple initialisation script</source>
<source lang='bash'>job_2.sh #An array task</source>
<source lang='bash'>job_3.sh #Some finishing script, single run, after everything previous has finished</source>

You can create a script that submits all three jobs at once, each with a dependency on the previous one:

<source lang='bash'>#!/bin/bash
JOB1=$(sbatch job_1.sh| rev | cut -d ' ' -f 1 | rev) #Get me the last space-separated element

if ! [ "z$JOB1" == "z" ] ; then
echo "First job submitted as jobid $JOB1"
JOB2=$(sbatch --dependency=afterany:$JOB1 job_2.sh| rev | cut -d ' ' -f 1 | rev)

if ! [ "z$JOB2" == "z" ] ; then
echo "Second job submitted as jobid $JOB2, following $JOB1"
JOB3=$(sbatch --dependency=afterany:$JOB2 job_3.sh| rev | cut -d ' ' -f 1 | rev)

if ! [ "z$JOB3" == "z" ] ; then
echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"

fi
fi
fi
</source>

This ensures that each subsequent job starts only after the previous one has finished, whether it succeeded or failed.

Please see [https://slurm.schedmd.com/sbatch.html#OPT_dependency the sbatch documentation] for the other options available to you. Note that aftercorr makes each element of a subsequent array job start after the correspondingly numbered element of the previous array job.
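
The job ID extraction used in the submission script can also be written as a small reusable helper. The sketch below assumes that sbatch reports success with a line of the form "Submitted batch job <id>", as in the examples on this page. The function names <code>jobid_from_sbatch_output</code> and <code>submit_after</code> are hypothetical. sbatch also offers a <code>--parsable</code> option that prints only the job ID, which may make the parsing unnecessary.

```bash
#!/bin/bash
# Extract the job ID from the line that sbatch prints on success,
# e.g. "Submitted batch job 740338". Prints nothing if the line does not match.
jobid_from_sbatch_output() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Submit a script, optionally after another job, and print the new job ID.
# Usage: submit_after <script> [<jobid to wait for>]
submit_after() {
    local script=$1 previous=${2:-}
    if [ -n "$previous" ]; then
        sbatch --dependency=afterany:"$previous" "$script" | jobid_from_sbatch_output
    else
        sbatch "$script" | jobid_from_sbatch_output
    fi
}

# Only attempt a submission where sbatch and the job script are actually available.
if command -v sbatch >/dev/null 2>&1 && [ -f job_1.sh ]; then
    JOB1=$(submit_after job_1.sh)
    echo "First job submitted as jobid $JOB1"
fi
```

An empty result from <code>submit_after</code> indicates that the submission failed, which can be tested before submitting the next job in the chain.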

=== Submitting array jobs ===
<source lang='bash'>
#SBATCH --array=0-10%4
</source>
SLURM allows you to submit multiple jobs using the same template. Further information about this can be found [[Array_jobs|here]].

=== Using /tmp ===
Each node has a local disk of ~300G that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after usage.

In order to be sure that you're able to use space in /tmp, you can add
<source lang='bash'>
#SBATCH --tmp=<required size>
</source>
to your sbatch script. This prevents your job from being run on nodes that do not have that much free space in /tmp, or where the space has already been claimed by another job running at the same time.
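
Cleaning up is easier when each job works in its own directory under /tmp. The following is a sketch of that idea, not an Anunna-specific tool. The helper name <code>with_scratch</code> is made up, and it assumes <code>mktemp</code> is available on the node.

```bash
#!/bin/bash
# Run a command inside a private scratch directory and remove the directory afterwards.
# Usage: with_scratch <base dir> <command> [args...]
with_scratch() {
    local base=$1; shift
    local dir status
    dir=$(mktemp -d "${base%/}/job_XXXXXX") || return 1
    ( cd "$dir" && "$@" )     # subshell, so the caller's working directory is untouched
    status=$?
    rm -rf "$dir"             # clean up whether or not the command succeeded
    return "$status"
}

# Example: the command sees the scratch directory as its working directory.
with_scratch "${TMPDIR:-/tmp}" pwd
```

The exit status of the wrapped command is passed on, so a failing job step is still reported as failed after the cleanup.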

=== Using GPU ===
There are two GPU nodes. To run a job that uses a GPU on one of these nodes, you can add

<source lang='bash'>
#SBATCH --reservation='GPU'
</source>
to your sbatch script. Without this parameter, your job will not run on one of these nodes.

THIS IS PRONE TO CHANGE SHORTLY! [[User:Dawes001|Dawes001]] ([[User talk:Dawes001|talk]]) 14:57, 15 July 2019 (UTC)

== Monitoring submitted jobs ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.

=== Generic monitoring of all running jobs ===
<source lang='bash'>
squeue
</source>

You should then get a list of the jobs that are running on the cluster at that time. For the example on how to submit using the 'sbatch' command, it may look like this:
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    3396      ABGC BOV-WUR- megen002  R      27:26      1 node004
    3397      ABGC BOV-WUR- megen002  R      27:26      1 node005
    3398      ABGC BOV-WUR- megen002  R      27:26      1 node006
    3399      ABGC BOV-WUR- megen002  R      27:26      1 node007
    3400      ABGC BOV-WUR- megen002  R      27:26      1 node008
    3401      ABGC BOV-WUR- megen002  R      27:26      1 node009
    3385  research BOV-WUR- megen002  R      44:38      1 node049
    3386  research BOV-WUR- megen002  R      44:38      1 node050
    3387  research BOV-WUR- megen002  R      44:38      1 node051
    3388  research BOV-WUR- megen002  R      44:38      1 node052
    3389  research BOV-WUR- megen002  R      44:38      1 node053
    3390  research BOV-WUR- megen002  R      44:38      1 node054
    3391  research BOV-WUR- megen002  R      44:38      3 node[049-051]
    3392  research BOV-WUR- megen002  R      44:38      3 node[052-054]
    3393  research BOV-WUR- megen002  R      44:38      1 node001
    3394  research BOV-WUR- megen002  R      44:38      1 node002
    3395  research BOV-WUR- megen002  R      44:38      1 node003
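
Output like the listing above is easy to summarise with standard tools. As a sketch, the following counts jobs per partition, assuming the default column layout shown here, with the partition in the second column. The function name <code>jobs_per_partition</code> is hypothetical.

```bash
#!/bin/bash
# Count jobs per partition from squeue-style output on standard input.
# Assumes a header line followed by rows with JOBID first and PARTITION second.
jobs_per_partition() {
    awk 'NR > 1 { count[$2]++ } END { for (p in count) print p, count[p] }' | sort
}

# On the cluster this would typically be fed by: squeue | jobs_per_partition
# Here a few rows from the listing above are used instead.
printf '%s\n' \
    'JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)' \
    '3396 ABGC BOV-WUR- megen002 R 27:26 1 node004' \
    '3385 research BOV-WUR- megen002 R 44:38 1 node049' \
    '3386 research BOV-WUR- megen002 R 44:38 1 node050' | jobs_per_partition
```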

=== Monitoring time limit set for a specific job ===
The default time limit is one hour, so estimated run times need to be specified when submitting jobs. To see the time limit that is set for a certain job, use the <code>squeue</code> command:
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
 Fri Nov 29 15:41:00 2013
   JOBID PARTITION     NAME     USER    STATE       TIME  TIMELIMIT  NODES NODELIST(REASON)
    3532      ABGC BOV-WUR- megen002  RUNNING    2:47:03 3-08:00:00      1 node054

=== Query a specific active job: scontrol ===
Show all the details of a currently active job (this does not work for a completed job).
<source lang='bash'>
login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
UserId=megen002(16795409) GroupId=domain users(16777729)
Priority=1 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=research AllocNode:Sid=login0:21799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node023
BatchHost=node023
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/scratch/WUR/ABGC/...
WorkDir=/lustre/scratch/WUR/ABGC/...
</source>

=== Check on a pending job ===
A submitted job could result in a pending state when there are not enough resources available to this job.
In this example a job is submitted and its status is checked; after finding out that it is '''pending''', the expected start time is queried.
<source lang='bash'>
[@login jobs]$ sbatch hpl_student.job
Submitted batch job 740338

[@login jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)

[@login jobs]$ squeue --start -j 740338
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
</source>
So this job will probably start the next day, but that is no guarantee that it actually will.

== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
<source lang='bash'>
scancel 3401
</source>

== Allocating resources interactively: sinteractive ==
sinteractive is a tiny wrapper around srun that creates interactive jobs quickly and easily. It gives you a shell on one of the nodes, with limits similar to those of a normal job. To use it, simply run:
<source lang='bash'>
sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>
</source>
You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code in an interactive fashion as needs be.

Be advised though: not filling in the above fields will get you a shell with 1 CPU and 100 MB of RAM for 1 hour. This can still be useful for quick testing.

=== sinteractive source ===
<source lang='bash'>
#!/bin/bash
srun "$@" -I60 -N 1 -n 1 --pty bash -i
</source>

=== interactive Slurm - using salloc ===
If you do not want your shell to move to a compute node, but do want an allocation to run commands in, do:
<source lang='bash'>
salloc -p ABGC_Low $SHELL
</source>
Your shell now stays on the login node, but you can do:
<source lang='bash'>
srun <command> &
</source>
to submit tasks to this allocation.

Be aware that the default time limit of salloc is 1 hour. If you intend to run jobs for longer than this, you need to change this setting. See: https://computing.llnl.gov/linux/slurm/salloc.html

== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<source lang='bash'>
sacct
</source>
This should provide information similar to the following:

        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
 ------------ ---------- ---------- ---------- ---------- ---------- --------
 3385         BOV-WUR-58   research                    12  COMPLETED      0:0
 3385.batch        batch                                1  COMPLETED      0:0
 3386         BOV-WUR-59   research                    12 CANCELLED+      0:0
 3386.batch        batch                                1  CANCELLED     0:15
 3528         BOV-WUR-59       ABGC                    16    RUNNING      0:0
 3529         BOV-WUR-60       ABGC                    16    RUNNING      0:0

Or in more detail for a specific job:
<source lang='bash'>
sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</source>
This should provide information about job id 4220:

        JobID    JobName    Comment  Partition   NTasks  AllocCPUS    Elapsed      State ExitCode
 ------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
 4220         PreProces+              research                   3   00:30:52  COMPLETED      0:0
 4220.batch        batch                              1          1   00:30:52  COMPLETED      0:0

'''Job Status Codes'''

Typically your job will be in either the Running state or the Pending state. However, here is a breakdown of all the states that your job could be in.

{| class="wikitable"
|-
!Code!!State!!Description
|-
|CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
|-
|CD|| COMPLETED|| Job has terminated all processes on all nodes.
|-
|CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
|-
|CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
|-
|F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
|-
|NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
|-
|PD|| PENDING|| Job is awaiting resource allocation.
|-
|R|| RUNNING|| Job currently has an allocation.
|-
|S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
|-
|TO|| TIMEOUT|| Job terminated upon reaching its time limit.
|-
|}

== Running MPI jobs on Anunna ==

[[MPI_on_B4F_cluster | Main article: MPI on Anunna]]

== See also ==
* [[Tariffs | Costs associated with resource usage]]
* [[B4F_cluster | Anunna]]
* [[BCM_on_B4F_cluster | BCM on Anunna]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]

Courses

2019-06-27T08:29:50Z

Dawes001: /* HPC CUDA/AI Course - 2019-06-21 */

== Upcoming: HPC Basic Course - 2019-06-28==

A course for beginners will be organised on the 28th of June, aiming to help absolute beginners to begin to use the main job scheduler, SLURM. You can register [https://oneschool.wur.nl/Lists/Cursus/DispForm.aspx?ID=100 here].

== Upcoming Linux Basic Course - 2019-06-27 ==

A course for beginners will be organised on the 13th of June, to help beginner Linux users gain some skills in using Linux. You can register for this course [https://www.wur.nl/en/activity/Linux-basic-course-on-13-June-2019.htm here]

== HPC CUDA/AI Course - 2019-06-21 ==

A course on deep learning and neural networks, combined with some deep-level manipulation of graphics cards, was given by Dell for interested users on the 21st of June.

[[File:WUR_CUDA_210619.pdf|WUR CUDA Course]]
[[File:WUR_AI_101_210619.pdf|WUR AI Course 101]]
[[File:WUR_AI_201_210619.pdf|WUR AI Course 201]]

[[File:WUR_CUDA_2_210619.pdf|WUR CUDA Course]]
[[File:WUR_Deep_Learning_Frameworks_210619.pdf|WUR Deep Learning Frameworks Primer]]
[[File:WUR_Deep_Learning_Lab_210619.pdf|WUR Deep Learning Lab]]

== HPC Advanced Course - 2019-05-28 ==

A course for experienced users was organised on the 28th of May, aiming to brush up users on techniques for submitting unusual jobs, and help provide some more helpful hints and techniques.

[[File:HPC_advanced_course_20190506.pdf|Advanced Course 1]]
[[File:HPC_advanced_slides_20190528.pdf|Advanced Course 2]]

== HPC Basic Course - 2019-05-07 ==

A course for beginners was organised on the 7th of May, aiming to help absolute beginners to begin to use the main job scheduler, SLURM.

[[File:HPC basic course 20190506.pdf|Basic Course]]

== Linux Basic Course - 2019-04-16 ==

A course for beginners was organised on the 16th of April, to help beginner Linux users gain some skills in using Linux.

== HPC Advanced Course - 2018-10-16 ==

A course for experienced users was organised on the 16th of October, aiming to brush up users on techniques for submitting unusual jobs, and help provide some more helpful hints and techniques.

[[File:HPC_Advanced_Slides_20181016.pdf|Advanced Course (Gwen)]]

[[File:HPC_advanced_course_20181008.pdf|Advanced Course (Jeremie)]]

== HPC Basic Course - 2018-10-11 ==

A course for beginners was organised on the 11th of October, aiming to help absolute beginners to begin to use the main job scheduler, SLURM.

[[File:HPC_basic_course_20181008.pdf|Basic Course]]

== Basic Linux Course - 2018-10-02 ==

A course on basic Linux usage was organised on the 2nd of October, to help beginner Linux users gain some skills in using Linux.

[https://etherpad.lug.wur.nl/p/UpkF2KXDVh]

== HPC Advanced Course - 2018-05-18 ==

A course for experienced users was organised on the 18th of May, aiming to brush up users on techniques for submitting unusual jobs, and help provide some more helpful hints and techniques.

[[File:HPC_Advanced_20180518-GD.pdf|Advanced Course (Gwen)]]

== HPC Basic Course - 2018-05-17 ==

A course for beginners was organised on the 17th of May, aiming to help absolute beginners to begin to use the main job scheduler, SLURM.

== Basic Linux Course - 2018-04-19 ==

A course on basic Linux usage was organised on the 19th of April, to help beginner Linux users gain some skills in using Linux.

== HPC Advanced Course - 2017-11-09 ==

A course for experienced users was organised on the 9th of November, aiming to brush up users on techniques for submitting unusual jobs, and help provide some more helpful hints and techniques.

[[File:HPC_Advanced_course_2017-11-08-JV.pdf|Advanced Course (Jeremie)]]

[[File:HPC_Advanced_course_2017-11-08-GD.pdf|Advanced Course (Gwen)]]

[[File:Checkpointing_2017-11-08.pdf|Checkpointing]]

== HPC Basic Course - 2017-10-30 ==

A course for beginners was organised on the 30th of October, aiming to help absolute beginners to enhance their ability to use the main job scheduler, SLURM.

The slides for this course can be found here:

[[File:HPC_basic_course_20171025.pdf | Basic introduction to Linux]]

== HPC Teaching - 2017-06-07 ==

A course was organised on the 7th of June, aiming to help absolute beginners (and moderately experienced users) to enhance their ability to use the main job scheduler, SLURM.

The slides for this course can be found here:

[[File:Connecting_with_Secure_Shell_to_the_HPC_20170606.pdf | Basic introduction to Linux]]

[[File:Submitting_and_monitoring_jobs_on_the_HPC_20170602.pdf | Submitting and Monitoring Jobs]]

== Old Courses ==
* [http://www.basgen.nl/sdac/ Sequence Data Analysis Course (Dec. 2012)]

File:WUR Deep Learning Lab 210619.pdf

2019-06-27T08:28:39Z

Dawes001:

File:WUR Deep Learning Frameworks 210619.pdf

2019-06-27T08:28:23Z

Dawes001:

File:WUR CUDA 2 210619.pdf

2019-06-27T08:27:57Z

Dawes001:

Courses

2019-06-24T16:04:26Z

Dawes001:

== Upcoming: HPC Basic Course - 2019-06-28==

A course for beginners will be organised on the 28th of June, aiming to help absolute beginners to begin to use the main job scheduler, SLURM. You can register [https://oneschool.wur.nl/Lists/Cursus/DispForm.aspx?ID=100 here].

== Upcoming Linux Basic Course - 2019-06-27 ==

A course for beginners will be organised on the 13th of June, to help beginner Linux users gain some skills in using Linux. You can register for this course [https://www.wur.nl/en/activity/Linux-basic-course-on-13-June-2019.htm here]

== HPC CUDA/AI Course - 2019-06-21 ==

A course on deep learning and neural networks, combined with some deep-level manipulation of graphics cards, was given by Dell for interested users on the 21st of June.

[[File:WUR_CUDA_210619.pdf|WUR CUDA Course]]
[[File:WUR_AI_101_210619.pdf|WUR AI Course 101]]
[[File:WUR_AI_201_210619.pdf|WUR AI Course 201]]

== HPC Advanced Course - 2019-05-28 ==

A course for experienced users was organised on the 28th of May, aiming to brush up users on techniques for submitting unusual jobs, and help provide some more helpful hints and techniques.

[[File:HPC_advanced_course_20190506.pdf|Advanced Course 1]]
[[File:HPC_advanced_slides_20190528.pdf|Advanced Course 2]]

== HPC Basic Course - 2019-05-07 ==

A course for beginners was organised on the 7th of May, aiming to help absolute beginners to begin to use the main job scheduler, SLURM.

[[File:HPC basic course 20190506.pdf|Basic Course]]

== Linux Basic Course - 2019-04-16 ==

A course for beginners was organised on the 16th of April, to help beginner Linux users gain some skills in using Linux.

== HPC Advanced Course - 2018-10-16 ==

A course for experienced users was organised on the 16th of October, aiming to brush up users on techniques for submitting unusual jobs, and help provide some more helpful hints and techniques.

[[File:HPC_Advanced_Slides_20181016.pdf|Advanced Course (Gwen)]]

[[File:HPC_advanced_course_20181008.pdf|Advanced Course (Jeremie)]]

== HPC Basic Course - 2018-10-11 ==

A course for beginners was organised on the 11th of October, aiming to help absolute beginners to begin to use the main job scheduler, SLURM.

[[File:HPC_basic_course_20181008.pdf|Basic Course]]

== Basic Linux Course - 2018-10-02 ==

A course on basic Linux usage was organised on the 2nd of October, to help beginner Linux users gain some skills in using Linux.

[https://etherpad.lug.wur.nl/p/UpkF2KXDVh]

== HPC Advanced Course - 2018-05-18 ==

A course for experienced users was organised on the 18th of May, aiming to brush up users on techniques for submitting unusual jobs, and help provide some more helpful hints and techniques.

[[File:HPC_Advanced_20180518-GD.pdf|Advanced Course (Gwen)]]

== HPC Basic Course - 2018-05-17 ==

A course for beginners was organised on the 17th of May, aiming to help absolute beginners to begin to use the main job scheduler, SLURM.

== Basic Linux Course - 2018-04-19 ==

A course on basic Linux usage was organised on the 19th of April, to help beginner Linux users gain some skills in using Linux.

== HPC Advanced Course - 2017-11-09 ==

A course for experienced users was organised on the 9th of November, aiming to brush up users on techniques for submitting unusual jobs, and help provide some more helpful hints and techniques.

[[File:HPC_Advanced_course_2017-11-08-JV.pdf|Advanced Course (Jeremie)]]

[[File:HPC_Advanced_course_2017-11-08-GD.pdf|Advanced Course (Gwen)]]

[[File:Checkpointing_2017-11-08.pdf|Checkpointing]]

== HPC Basic Course - 2017-10-30 ==

A course for beginners was organised on the 30th of October, aiming to help absolute beginners to enhance their ability to use the main job scheduler, SLURM.

The slides for this course can be found here:

[[File:HPC_basic_course_20171025.pdf | Basic introduction to Linux]]

== HPC Teaching - 2017-06-07 ==

A course was organised on the 7th of June, aiming to help absolute beginners (and moderately experienced users) to enhance their ability to use the main job scheduler, SLURM.

The slides for this course can be found here:

[[File:Connecting_with_Secure_Shell_to_the_HPC_20170606.pdf | Basic introduction to Linux]]

[[File:Submitting_and_monitoring_jobs_on_the_HPC_20170602.pdf | Submitting and Monitoring Jobs]]

== Old Courses ==
* [http://www.basgen.nl/sdac/ Sequence Data Analysis Course (Dec. 2012)]

File:WUR CUDA 210619.pdf

2019-06-24T16:01:56Z

Dawes001:

File:WUR AI 101 210619.pdf

2019-06-24T16:01:22Z

Dawes001:

File:WUR AI 201 210619.pdf

2019-06-24T16:00:59Z

Dawes001:

File:Anunna Flyer 2019.svg

2019-05-06T08:48:54Z

Dawes001:

Tariffs

2019-04-23T08:30:08Z

Dawes001:

== Computing: Calculations (cores)==
{| class="wikitable"
!Queue
!CPU core hour
!GB memory hour
|-
|Standard queue
|€ 0.0150
|€ 0.0015
|-
|High priority queue
|€ 0.0200
|€ 0.0020
|-
|Low priority queue
|€ 0.0100
|€ 0.0010
|}

== Computing: GPU Use==
{| class="wikitable"
!Tariff per device per hour (gpu/hour)
|-
|€ 0.3000
|}

== Storage ==
Tariffs per year per TB
{| class="wikitable"
!Lustre Nobackup
!Lustre Backup
!Home-dir
!Archive
|-
|€ 150
|€ 200
|€ 200
|€ 100
|}

== Reservations ==
{| class="wikitable"
!Tariff per node per day (node/day)
|-
|€ 50
|}

== Notes==

If you are a member of a group with a commitment, then these costs get deducted from that commitment. Typically we are fairly lax with enforcing limits - only once you get to around 150% of your commitment will we consider taking action (mainly coming to discuss things).

== Example ==

Suppose your job needs 4 cores and 32G of RAM, and runs for 90 minutes in the Std partition. To be safe you over-request slightly and submit a job that asks for 4 CPUs, 40G of RAM and a time limit of 3 hours. The job terminates early, after 1.5 hours, so you are charged for the requested CPUs and memory over the time the job actually ran. Thus, your costs are:

4 * 0.015 * 1.5 = 0.09 EUR for the CPU

40 * 0.0015 * 1.5 = 0.09 EUR for the memory

Total: 0.18 EUR
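
The same arithmetic can be written as a small Python sketch, which may help when estimating other jobs. The rates are the Standard queue tariffs from the tables above; the function name <code>job_cost_eur</code> and its parameters are made up for this illustration and are not part of any cluster tool.

```python
# Standard queue tariffs copied from the tables above (EUR per hour).
CPU_CORE_HOUR_EUR = 0.0150
GB_MEMORY_HOUR_EUR = 0.0015


def job_cost_eur(cpus, memory_gb, elapsed_hours,
                 cpu_rate=CPU_CORE_HOUR_EUR, mem_rate=GB_MEMORY_HOUR_EUR):
    """Return (cpu_cost, memory_cost, total) in EUR for a single job.

    cpus and memory_gb are the *requested* amounts; elapsed_hours is the
    time the job actually ran, as in the worked example above.
    """
    cpu_cost = cpus * cpu_rate * elapsed_hours
    memory_cost = memory_gb * mem_rate * elapsed_hours
    return cpu_cost, memory_cost, cpu_cost + memory_cost


if __name__ == "__main__":
    cpu, mem, total = job_cost_eur(cpus=4, memory_gb=40, elapsed_hours=1.5)
    print(f"CPU: {cpu:.2f} EUR, memory: {mem:.2f} EUR, total: {total:.2f} EUR")
```

For a job in the High or Low priority queue, pass the corresponding rates from the table as <code>cpu_rate</code> and <code>mem_rate</code>.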

Tariffs

2019-04-23T08:25:47Z

Dawes001:

Using Slurm

2019-04-04T08:44:14Z

Dawes001:

The resource allocation / scheduling software on Anunna is [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management SLURM]: '''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement.

== Queues and defaults ==

=== Queues ===
Every organization has 3 queues (called partitions in Slurm): a high, a standard and a low priority queue.<br>
The High queue gives jobs the highest priority (20), followed by the Standard queue (10) and the Low queue (0).<br>
Jobs in the Low queue will be resubmitted if a job with higher priority needs cluster resources and those resources are occupied by Low queue jobs.
To find out which queues your account has been authorized for, type sinfo:
<source lang='bash'>
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
ABGC_High up infinite 12 down* node[043-048,055-060]
ABGC_High up infinite 6 mix fat[001-002],node[002-005]
ABGC_High up infinite 44 idle node[001,006-042,049-054]
ABGC_Std up infinite 12 down* node[043-048,055-060]
ABGC_Std up infinite 6 mix fat[001-002],node[002-005]
ABGC_Std up infinite 44 idle node[001,006-042,049-054]
ABGC_Low up infinite 12 down* node[043-048,055-060]
ABGC_Low up infinite 6 mix fat[001-002],node[002-005]
ABGC_Low up infinite 44 idle node[001,006-042,049-054]
</source>

=== Defaults ===
There is no default queue, so you need to specify which queue to use when submitting a job.<br>
'''The default run time for a job is 1 hour!''' <br>
'''Default memory limit is 100MB per node!'''

== Submitting jobs: sbatch ==

=== Example ===
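Consider this simple python3 script, which approximates Pi with the Bailey-Borwein-Plouffe series at a working precision of 10 million digits. It is meant as an example workload: with only 411 terms summed, just the first few hundred printed digits are correct.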
<source lang='python'>
from decimal import *
D=Decimal
getcontext().prec=10000000
p=sum(D(1)/16**k*(D(4)/(8*k+1)-D(2)/(8*k+4)-D(1)/(8*k+5)-D(1)/(8*k+6))for k in range(411))
print(str(p)[:10000002])
</source>

=== Loading modules ===
In order for this script to run, Python3, which is not the default Python version on the cluster, first needs to be loaded into your environment. The availability of (different versions of) software can be checked with the following command:
module avail

The list should show that python3 is indeed available; it can then be loaded with the following command:
module load python/3.3.3

=== Batch script ===
[[Creating_sbatch_script | Main Article: Creating a sbatch script]]

The following shell/slurm script can then be used to schedule the job using the sbatch command:
<source lang='bash'>
#!/bin/bash
#SBATCH --comment=773320000
#SBATCH --time=1200
#SBATCH --mem=2048
#SBATCH --ntasks=1
#SBATCH --output=output_%j.txt
#SBATCH --error=error_output_%j.txt
#SBATCH --job-name=calc_pi.py
#SBATCH --partition=ABGC_Std
#SBATCH --mail-type=ALL
#SBATCH --mail-user=email@org.nl

time python3 calc_pi.py
</source>

=== Submitting ===
The script, assuming it was named 'run_calc_pi.sh', can then be submitted using the following command:
<source lang='bash'>
sbatch run_calc_pi.sh
</source>

=== Submitting multiple jobs (simple) ===
Assuming there are 10 job scripts, named runscript_1.sh through runscript_10.sh, all these scripts can be submitted using the following line of shell code:
<source lang='bash'>for i in `seq 1 10`; do echo $i; sbatch runscript_$i.sh;done
</source>

=== Submitting multiple jobs (complex) ===
Let's say you have three job scripts that depend on each other:

<source lang='bash'>job_1.sh #A simple initialisation script</source>
<source lang='bash'>job_2.sh #An array task</source>
<source lang='bash'>job_3.sh #Some finishing script, single run, after everything previous has finished</source>

You can create a script that submits all three jobs at once, each with a dependency on the previous one:

<source lang='bash'>#!/bin/bash
JOB1=$(sbatch job_1.sh| rev | cut -d ' ' -f 1 | rev) #Get me the last space-separated element

if ! [ "z$JOB1" == "z" ] ; then
echo "First job submitted as jobid $JOB1"
JOB2=$(sbatch --dependency=afterany:$JOB1 job_2.sh| rev | cut -d ' ' -f 1 | rev)

if ! [ "z$JOB2" == "z" ] ; then
echo "Second job submitted as jobid $JOB2, following $JOB1"
JOB3=$(sbatch --dependency=afterany:$JOB2 job_3.sh| rev | cut -d ' ' -f 1 | rev)

if ! [ "z$JOB3" == "z" ] ; then
echo "Third job submitted as jobid $JOB3, following after every element of $JOB2"

fi
fi
fi
</source>

This ensures that each subsequent job starts only after the previous job has finished, whether it succeeded or failed.

Please see [https://slurm.schedmd.com/sbatch.html#OPT_dependency the sbatch documentation] for the other dependency options available to you. Note that aftercorr makes each element of a subsequent array job start after the correspondingly numbered element of the previous array job.
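
The chained submission above can also be sketched in Python, for those who prefer to drive sbatch from a script. This is only an illustration: the helper names <code>parse_job_id</code>, <code>sbatch_command</code> and <code>submit_chain</code> are invented here, and the sketch assumes that sbatch reports a submission as "Submitted batch job <id>", as in the example output elsewhere on this page.

```python
import subprocess


def parse_job_id(sbatch_output):
    """Return the job id from a line such as 'Submitted batch job 740338'."""
    fields = sbatch_output.split()
    if not fields or not fields[-1].isdigit():
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return fields[-1]


def sbatch_command(script, after=None):
    """Build the sbatch argument list, optionally depending on job id `after`."""
    command = ["sbatch"]
    if after is not None:
        # afterany: start once the earlier job has ended, failed or not.
        command.append(f"--dependency=afterany:{after}")
    command.append(script)
    return command


def submit_chain(scripts, run=subprocess.run):
    """Submit scripts in order; each job waits for the previous one to end."""
    job_ids = []
    previous = None
    for script in scripts:
        result = run(sbatch_command(script, previous),
                     capture_output=True, text=True, check=True)
        previous = parse_job_id(result.stdout)
        job_ids.append(previous)
    return job_ids
```

On a login node, <code>submit_chain(["job_1.sh", "job_2.sh", "job_3.sh"])</code> would submit the three scripts from the example and return their job ids. The <code>run</code> parameter exists only so that the function can be tried out without a scheduler.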

=== Submitting array jobs ===
<source lang='bash'>
#SBATCH --array=0-10%4
</source>
SLURM allows you to submit multiple jobs using the same template. Further information about this can be found [[Array_jobs|here]].
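
With <code>--array=0-10%4</code>, Slurm creates eleven array elements (indices 0 through 10) and runs at most four of them at the same time. Each element can find its own index in the <code>SLURM_ARRAY_TASK_ID</code> environment variable. The sketch below shows one possible way to use that index to pick an input; the function name <code>pick_input</code> and the sample file names are hypothetical.

```python
import os


def pick_input(inputs, environ=os.environ):
    """Return the input that belongs to this array element."""
    task_id = int(environ["SLURM_ARRAY_TASK_ID"])
    return inputs[task_id]


if __name__ == "__main__":
    # Hypothetical input names, one per index of --array=0-10.
    samples = [f"sample_{i:02d}.txt" for i in range(11)]
    # Outside of an array job the variable is not set, so do nothing.
    if "SLURM_ARRAY_TASK_ID" in os.environ:
        print(pick_input(samples))
```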

=== Using /tmp ===
Attached to each node is a local disk of ~300G that can be used to temporarily stage some of your workload. This is free to use, but please remember to clean up your data after usage.

In order to be sure that you're able to use space in /tmp, you can add
<source lang='bash'>
#SBATCH --tmp=<required size>
</source>
to your sbatch script. This prevents your job from being scheduled on nodes where too little free space remains, or where that space is already claimed by another job running at the same time.
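
Cleaning up /tmp is easy to forget, so it can help to let the job do it automatically. The following Python sketch copies an input file to node-local scratch, runs some work on the local copy and removes the scratch directory afterwards. The function name <code>run_with_local_scratch</code> and its parameters are assumptions made for this example.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_local_scratch(source, work, scratch_root="/tmp"):
    """Copy `source` to node-local scratch, call work(local_copy), clean up."""
    # TemporaryDirectory removes the directory and its contents when the
    # with-block is left, also when `work` raises an exception.
    with tempfile.TemporaryDirectory(dir=scratch_root) as scratch:
        local_copy = Path(scratch) / Path(source).name
        shutil.copy2(source, local_copy)
        return work(local_copy)
```

Here <code>work</code> is any function that takes the path of the local copy, for example one that reads the file and returns a result to be written back to /lustre.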

== Monitoring submitted jobs ==
Once a job is submitted, the status can be monitored using the <code>squeue</code> command. The <code>squeue</code> command has a number of parameters for monitoring specific properties of the jobs such as time limit.

=== Generic monitoring of all running jobs ===
<source lang='bash'>
squeue
</source>

You should then get a list of the jobs that are running on the cluster at that time. For jobs submitted with the 'sbatch' command as in the example, it may look like so:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3396 ABGC BOV-WUR- megen002 R 27:26 1 node004
3397 ABGC BOV-WUR- megen002 R 27:26 1 node005
3398 ABGC BOV-WUR- megen002 R 27:26 1 node006
3399 ABGC BOV-WUR- megen002 R 27:26 1 node007
3400 ABGC BOV-WUR- megen002 R 27:26 1 node008
3401 ABGC BOV-WUR- megen002 R 27:26 1 node009
3385 research BOV-WUR- megen002 R 44:38 1 node049
3386 research BOV-WUR- megen002 R 44:38 1 node050
3387 research BOV-WUR- megen002 R 44:38 1 node051
3388 research BOV-WUR- megen002 R 44:38 1 node052
3389 research BOV-WUR- megen002 R 44:38 1 node053
3390 research BOV-WUR- megen002 R 44:38 1 node054
3391 research BOV-WUR- megen002 R 44:38 3 node[049-051]
3392 research BOV-WUR- megen002 R 44:38 3 node[052-054]
3393 research BOV-WUR- megen002 R 44:38 1 node001
3394 research BOV-WUR- megen002 R 44:38 1 node002
3395 research BOV-WUR- megen002 R 44:38 1 node003

=== Monitoring time limit set for a specific job ===
The default time limit is set at one hour. Estimated run times need to be specified when running jobs. The time limit that is set for a certain job can be checked with the <code>squeue</code> command.
<source lang='bash'>
squeue -l -j 3532
</source>
Information similar to the following should appear:
Fri Nov 29 15:41:00 2013
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
3532 ABGC BOV-WUR- megen002 RUNNING 2:47:03 3-08:00:00 1 node054
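
The TIME and TIMELIMIT columns use a days-hours:minutes:seconds notation. As a rough sketch, the Python function below converts the forms that appear in the output on this page (<code>3-08:00:00</code>, <code>2:47:03</code>, <code>27:26</code>) to seconds, so that the remaining time can be computed. The function name is made up for this example, and other values that Slurm may print, such as <code>infinite</code>, are deliberately not handled and raise a <code>ValueError</code>.

```python
def slurm_time_to_seconds(value):
    """Convert 'D-HH:MM:SS', 'HH:MM:SS' or 'MM:SS' to a number of seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 2:
        # MM:SS, as shown for jobs that have run for less than an hour.
        parts = [0] + parts
    if len(parts) != 3:
        raise ValueError(f"unsupported time format: {value!r}")
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    limit = slurm_time_to_seconds("3-08:00:00")
    used = slurm_time_to_seconds("2:47:03")
    print(f"{limit - used} seconds left before the time limit")
```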

=== Query a specific active job: scontrol ===
Show all the details of a currently active job, so not a completed job.
<source lang='bash'>
login ~]$ scontrol show jobid 4241
JobId=4241 Name=WB20F06
UserId=megen002(16795409) GroupId=domain users(16777729)
Priority=1 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=02:55:25 TimeLimit=3-08:00:00 TimeMin=N/A
SubmitTime=2013-12-09T13:37:29 EligibleTime=2013-12-09T13:37:29
StartTime=2013-12-09T13:37:29 EndTime=2013-12-12T21:37:29
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=research AllocNode:Sid=login0:21799
ReqNodeList=(null) ExcNodeList=(null)
NodeList=node023
BatchHost=node023
NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/lustre/scratch/WUR/ABGC/...
WorkDir=/lustre/scratch/WUR/ABGC/...
</source>

=== Check on a pending job ===
A submitted job can end up in the pending state when there are not enough resources available for it.
In this example I submit a job, check its status and, after finding out that it is '''pending''', check when it will probably start.
<source lang='bash'>
[@login jobs]$ sbatch hpl_student.job
Submitted batch job 740338

[@login jobs]$ squeue -l -j 740338
Fri Feb 21 15:32:31 2014
JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PENDING 0:00 1-00:00:00 1 (ReqNodeNotAvail)

[@login jobs]$ squeue --start -j 740338
JOBID PARTITION NAME USER ST START_TIME NODES NODELIST(REASON)
740338 ABGC_Stud HPLstude bohme999 PD 2014-02-22T15:31:48 1 (ReqNodeNotAvail)
</source>
So it seems this job will probably start the next day, but there is no guarantee that it actually will.

== Removing jobs from a list: scancel ==
If for some reason you want to delete a job that is either in the queue or already running, you can remove it using the 'scancel' command. The 'scancel' command takes the jobid as a parameter. For the example above, this would be done using the following code:
<source lang='bash'>
scancel 3401
</source>

== Allocating resources interactively: sinteractive ==
sinteractive is a tiny wrapper on srun to create interactive jobs quickly and easily. It allows you to get a shell on one of the nodes, with limits similar to those of a normal job. To use it, simply run:
<source lang='bash'>
sinteractive -c <num_cpus> --mem <amount_mem> --time <minutes> -p <partition>
</source>
You will then be presented with a new shell prompt on one of the compute nodes (run 'hostname' to see which!). From here, you can test out code in an interactive fashion as needs be.

Be advised, though: if you do not fill in the above fields, you will get a shell with 1 CPU and 100MB of RAM for 1 hour. This can still be useful for quick testing.

=== sinteractive source ===
<source lang='bash'>
#!/bin/bash
srun "$@" -I60 -N 1 -n 1 --pty bash -i
</source>

=== interactive Slurm - using salloc ===
If you don't want your shell to be moved to a compute node, but want a new shell that has resources allocated to it, do:
<source lang='bash'>
salloc -p ABGC_Low $SHELL
</source>
Now your shell will stay on the login node, but you can do:
<source lang='bash'>
srun <command> &
</source>
to submit tasks to this new allocation.

Be aware that the time limit of salloc defaults to 1 hour. If you intend to run jobs for longer than this, you need to change this setting. See: https://computing.llnl.gov/linux/slurm/salloc.html

== Get overview of past and current jobs: sacct ==
To do some accounting on past and present jobs, and to see whether they ran to completion, you can do:
<source lang='bash'>
sacct
</source>
This should provide information similar to the following:

JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
3385 BOV-WUR-58 research 12 COMPLETED 0:0
3385.batch batch 1 COMPLETED 0:0
3386 BOV-WUR-59 research 12 CANCELLED+ 0:0
3386.batch batch 1 CANCELLED 0:15
3528 BOV-WUR-59 ABGC 16 RUNNING 0:0
3529 BOV-WUR-60 ABGC 16 RUNNING 0:0

Or in more detail for a specific job:
<source lang='bash'>
sacct --format=jobid,jobname,comment,partition,ntasks,alloccpus,elapsed,state,exitcode -j 4220
</source>
This should provide information about job id 4220:

JobID JobName Comment Partition NTasks AllocCPUS Elapsed State ExitCode
------------ ---------- ---------- ---------- -------- ---------- ---------- ---------- --------
4220 PreProces+ research 3 00:30:52 COMPLETED 0:0
4220.batch batch 1 1 00:30:52 COMPLETED 0:0

'''Job Status Codes'''

Typically your job will be in either the Running or the Pending state. However, here is a breakdown of all the states that your job could be in.

{| class="wikitable"
|-
!Code!!State!!Description
|-
|CA ||CANCELLED|| Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
|-
|CD|| COMPLETED|| Job has terminated all processes on all nodes.
|-
|CF|| CONFIGURING|| Job has been allocated resources, but is waiting for them to become ready for use (e.g. booting).
|-
|CG|| COMPLETING|| Job is in the process of completing. Some processes on some nodes may still be active.
|-
|F|| FAILED|| Job terminated with non-zero exit code or other failure condition.
|-
|NF|| NODE_FAIL|| Job terminated due to failure of one or more allocated nodes.
|-
|PD|| PENDING|| Job is awaiting resource allocation.
|-
|R|| RUNNING|| Job currently has an allocation.
|-
|S|| SUSPENDED|| Job has an allocation, but execution has been suspended.
|-
|TO|| TIMEOUT|| Job terminated upon reaching its time limit.
|-
|-
|}
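
When checking job states from a script, the table above can be turned into a small lookup. The Python sketch below maps the short codes to the long state names and tells whether a job has reached a final state. Which states count as final is an interpretation of the descriptions in the table, and the names <code>JOB_STATES</code>, <code>FINISHED</code>, <code>normalise_state</code> and <code>is_finished</code> are invented for this example.

```python
# Codes and states copied from the table above.
JOB_STATES = {
    "CA": "CANCELLED",
    "CD": "COMPLETED",
    "CF": "CONFIGURING",
    "CG": "COMPLETING",
    "F": "FAILED",
    "NF": "NODE_FAIL",
    "PD": "PENDING",
    "R": "RUNNING",
    "S": "SUSPENDED",
    "TO": "TIMEOUT",
}

# Assumption: in these states the job will not run any further.
FINISHED = {"CANCELLED", "COMPLETED", "FAILED", "NODE_FAIL", "TIMEOUT"}


def normalise_state(value):
    """Accept a short code ('PD') or a long state, possibly cut off ('CANCELLED+')."""
    value = value.strip().rstrip("+").upper()
    return JOB_STATES.get(value, value)


def is_finished(value):
    """Return True if the code or state describes a job that has ended."""
    return normalise_state(value) in FINISHED
```

For example, <code>is_finished("PD")</code> is false, while <code>is_finished("CANCELLED+")</code>, as printed by sacct in the output above, is true.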

== Running MPI jobs on Anunna ==

[[MPI_on_B4F_cluster | Main article: MPI on Anunna]]

== Understanding which resources are available to you: sinfo ==
By using the 'sinfo' command you can retrieve information on which 'Partitions' are available to you. A 'Partition' using SLURM is similar to the 'queue' when submitting using the Sun Grid Engine ('qsub'). The different Partitions grant different levels of resource allocation. Not all defined Partitions will be available to any given person. E.g., Master students will only have the Student Partition available, researchers at the ABGC will have 'student', 'research', and 'ABGC' partitions available. The higher the level of resource allocation, though, the higher the cost per compute-hour. The default Partition is the 'student' partition. A full list of Partitions can be found from the Bright Cluster Manager webpage.

<source lang='bash'>
sinfo
</source>

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
student* up infinite 12 down* node[043-048,055-060]
student* up infinite 50 idle fat[001-002],node[001-042,049-054]
research up infinite 12 down* node[043-048,055-060]
research up infinite 50 idle fat[001-002],node[001-042,049-054]
ABGC up infinite 12 down* node[043-048,055-060]
ABGC up infinite 50 idle fat[001-002],node[001-042,049-054]

== See also ==
* [[Tariffs | Costs associated with resource usage]]
* [[B4F_cluster | Anunna]]
* [[BCM_on_B4F_cluster | BCM on Anunna]]
* [[SLURM_Compare | SLURM compared to other common schedulers]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== External links ==
* [http://slurm.schedmd.com Slurm official documentation]
* [http://en.wikipedia.org/wiki/Simple_Linux_Utility_for_Resource_Management Slurm on Wikipedia]
* [http://www.youtube.com/watch?v=axWffyrk3aY Slurm Tutorial on Youtube]

Filesystems

2019-04-04T08:43:24Z

Dawes001:

Anunna currently has multiple filesystem mounts that are available cluster-wide:

== Global ==
* /home - This mount uses NFS to mount the home directories directly from nfs01. Each user has a 200G quota for this filesystem, as it is regularly backed up to tape, and can reliably be restored from up to a week's history.

* /cm/shared - This mount provides a consistent set of binaries for the entire cluster.

* /lustre - This large mount uses the Lustre filesystem to provide files from multiple redundant servers. Access is provided per group, thus:
/lustre/[level]/[partner]/[unit]
e.g.
/lustre/backup/WUR/ABGC/
It comprises three major parts (and some minor ones):
* /lustre/backup - In case of disaster, this data is stored a second time on a separate machine. Whilst this backup is purely in case of complete tragedy (such as some immense filesystem error, or multiple component failure), it can potentially be used to revert mistakes if you are very fast about reporting them. There is however no guarantee of this service.
* /lustre/nobackup - This is the 'normal' filesystem for Lustre - no backups, just stored on the filesystem. Because no backup is kept, storing data here costs less than under /lustre/backup, but in case of disaster the data cannot be recovered.
* /lustre/scratch - Files here may be removed after some time if the filesystem gets too full (Typically 30 days). You should tidy up this data yourself once work is complete.
* /lustre/shared - Same as /lustre/backup, except publicly available. This is where truly shared data lives that isn't assigned to a specific group.

=== Private shared directories ===
If you are working with a group of users on a similar project, you might consider making a [[Shared_folders|Shared directory]] to coordinate. Information on how to do so is in the linked article.

== Local ==
Specific to certain machines are some other filesystems that are available to you:
* /archive - an archive mount only accessible from the login nodes. Files here are sent to the Isilon for deeper storage. The cost of storing data here is much less than on the Lustre, but it cannot be used for compute work. This location is only available to WUR users. Files are able to be reverted via snapshot, and there is a separated backup, however this only comes in fortnightly (14 day) intervals.

* /tmp - On each worker node there is a /tmp mount that can be used for temporary local caching. Be advised that you should clean this up, lest your files become a hindrance to other users. You can request a node with free space in your sbatch script like so:
<source lang='bash'>
#SBATCH --tmp=<required space>
</source>

* /dev/shm - On each worker you may also create a virtual filesystem directly into memory, for extremely fast data access. Be advised that this will count against the memory used for your job, but it is also the fastest available filesystem if needed.

== See also ==
* [[Tariffs | Costs associated with resource usage]]

== External links ==
* [http://wiki.lustre.org/index.php/Main_Page Lustre website]

Main Page

2019-04-04T08:42:41Z

Dawes001:

Anunna is a [http://en.wikipedia.org/wiki/High-performance_computing High Performance Computer] (HPC) infrastructure hosted by [http://www.wageningenur.nl/nl/activiteit/Opening-High-Performance-Computing-cluster-HPC.htm Wageningen University & Research Centre]. It is open for use for all WUR research groups as well as other organizations, including companies, that have collaborative projects with WUR.

= Using Anunna =
== Gaining access to Anunna==
Access to the cluster and file transfer are traditionally done via [http://en.wikipedia.org/wiki/Secure_Shell SSH and SFTP].
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]
* [[Services | Alternative access methods, and extra features and services on Anunna]]
* [[Filesystems | Accessible storage methods on Anunna]]
* [[Tariffs | Costs associated with resource usage]]

== Access Policy ==
[[Access_Policy | Main Article: Access Policy]]

Access needs to be granted actively (by creation of an account on the cluster by FB-IT). Use of resources is limited by the scheduler. Depending on availability of queues ('partitions') granted to a user, priority to the system's resources is regulated. Note that the use of Anunna is not free of charge. List price of CPU time and storage, and possible discounts on that list price for your organisation, can be retrieved from CAT-AGRO or FB-ICT.

= Events =
* [[Courses]] that have happened and are happening
* [[Downtime]] that will affect all users
* [[Meetings]] that may affect the policies of Anunna

= Other Software =

== Cluster Management Software and Scheduler ==
Anunna uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]
* [[Using_Slurm | Submit jobs with Slurm]]
* [[node_usage_graph | Be aware of how much work the cluster is under right now with 'node_usage_graph']]
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]

== Installation of software by users ==

* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]
* [[Setting local variables]]
* [[Installing_R_packages_locally | Installing R packages locally]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]
* [[Virtual_environment_Python_3.4_or_higher | Setting up and using a virtual environment for Python3.4 or higher ]]
* [[Installing WRF and WPS]]

== Installed software ==

* [[Globally_installed_software | Globally installed software]]
* [[ABGC_modules | ABGC specific modules]]

= Useful Notes =

== Being in control of Environment parameters ==

* [[Using_environment_modules | Using environment modules]]
* [[Setting local variables]]
* [[Setting_TMPDIR | Set a custom temporary directory location]]
* [[Installing_R_packages_locally | Installing R packages locally]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]

== Controlling costs ==

* [[SACCT | using SACCT to see your costs]]
* [[get_my_bill | using the "get_my_bill" script to estimate costs]]

== Management ==
The Project Leader of Anunna is Stephen Janssen (Wageningen UR, FB-IT, Service Management). [[User:lith010 | Jan van Lith (Wageningen UR, FB-IT, Infrastructure)]], [[User:dawes001 | Gwen Dawes (Wageningen UR, FB-IT, Infrastructure)]] and [[User:vaend001 | Catharina Vaendel (Wageningen UR, FB-IT, Infrastructure)]] are responsible for [[Maintenance_and_Management | Maintenance and Management]].

* [[Roadmap | Ambitions regarding innovation, support and administration of Anunna]]

= Miscellaneous =
* [[Mailinglist | Electronic mail discussion lists]]
* [[History_of_the_Cluster | Historical information on the startup of Anunna]]
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]
* [[Parallel_R_code_on_SLURM | Running parallel R code on SLURM]]
* [[Convert_between_MediaWiki_and_other_formats | Convert between MediaWiki format and other formats]]
* [[Manual GitLab | GitLab: Create projects and add scripts]]
* [[Monitoring_executions | Monitoring job execution]]
* [[Shared_folders | Working with shared folders in the Lustre file system]]

= See also =
* [[Maintenance_and_Management | Maintenance and Management]]
* [[BCData | BCData]]
* [[Mailinglist | Electronic mail discussion lists]]
* [[About_ABGC | About ABGC]]
* [[Computer_cluster | High Performance Computing @ABGC]]
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]

= External links =
{| width="90%"
|- valign="top"
| width="30%" |
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/Our-facilities/Show/High-Performance-Computing-Cluster-HPC.htm CATAgroFood offers an HPC facility]
* [http://www.cobb-vantress.com Cobb-Vantress homepage]

| width="30%" |
* [https://www.crv4all.nl CRV homepage]
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]
* [http://www.topigs.com TOPIGS homepage]
| width="30%" |
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]
|}

Tariffs

2019-04-04T08:41:47Z

Dawes001: Created page with "== Computing: Calculations (cores)== {| class="wikitable" !Queue !CPU core hour !GB memory hour |- |Standard queue |€ 0.0150 |€ 0.0015 |- |High priority queue |€ 0.0200..."