Main Page: Difference between revisions

From HPCwiki
 
(91 intermediate revisions by 16 users not shown)
Line 1: Line 1:
The [http://www.breed4food.com/en/breed4food.htm Breed4Food] (B4F) cluster is a joint [http://en.wikipedia.org/wiki/High-performance_computing High Performance Computing] (HPC) infrastructure of the [[About_ABGC | Animal Breeding and Genomics Centre]] (WU-Animal Breeding and Genomics and Wageningen Livestock Research) and four major breeding companies: [http://www.cobb-vantress.com Cobb-Vantress], [https://www.crv4all.nl CRV], [http://www.hendrix-genetics.com Hendrix Genetics], and [http://www.topigs.com TOPIGS].
Anunna is a [http://en.wikipedia.org/wiki/High-performance_computing High Performance Computing] (HPC) infrastructure hosted by [https://www.wur.nl/en/show/supercomputer-anunna-opens-up-more-opportunities-for-data-storage-and-artificial-intelligence-applications.htm Wageningen University & Research Centre]. It is open to all WUR research groups and to other organizations, including companies, that have collaborative projects with WUR.


== Rationale and Requirements for a new cluster ==
== Access Policy ==
[[File:Breed4food-logo.jpg|thumb|right|200px|The Breed4Food logo]]
[[Access_Policy | Main Article: Access Policy]]
The B4F Cluster is, in a way, the 7th pillar of the [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]. While the other six pillars revolve around specific research themes, the Cluster represents a joint infrastructure. The rationale behind the cluster is to meet the increasing computational needs in the field of genetics and genomics research by creating a joint facility that provides economies of scale, thereby reducing cost. In addition, the joint infrastructure is intended to facilitate cross-organisational knowledge transfer. In that capacity, the B4F Cluster acts as a joint (virtual) laboratory where researchers, both academic and applied, can benefit from each other's know-how. Lastly, the joint cluster, housed at Wageningen University campus, allows vital and often confidential data sources to be retained in a controlled environment, something that cloud services such as Amazon Cloud usually cannot guarantee.
 
 
{{-}}
 
== Process of acquisition and financing ==


[[File:Signing_CatAgro.png|thumb|left|300px|Petra Caessens, manager operations of CAT-AgroFood, signs the contract of the supplier on August 1st, 2013. Next to her Johan van Arendonk on behalf of Breed4Food.]]
Access needs to be granted actively (by creation of an account on the cluster by FB-IT). Use of resources is limited by the scheduler. Note that the use of Anunna is not free of charge.  
The B4F cluster was financed by [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood]. The [[B4F_cluster#IT_Workgroup | IT-Workgroup]] formulated a set of requirements that in the end were best met by an offer from [http://www.dell.com/learn/nl/nl/rc1078544/hpcc Dell]. [http://www.clustervision.com ClusterVision] was responsible for installing the cluster at the Theia server centre of FB-ICT.


{{-}}
= Using Anunna =
* [[Tariffs | Costs associated with resource usage]]


== Architecture of the cluster ==
== Gaining access to Anunna==
Access to the cluster and file transfer are traditionally done via [http://en.wikipedia.org/wiki/Secure_Shell SSH and SFTP].
* [[log_in_to_B4F_cluster | Logging into cluster using ssh]]
* [[file_transfer | File transfer options]]
* [[Services | Alternative access methods, and extra features and services on Anunna]]
* [[Filesystems | Data storage methods on Anunna]]


[[File:Cluster_scheme.png|thumb|right|600px|Schematic overview of the cluster.]]
== Using Anunna for courses (mainly Jupyter notebooks) ==
The new B4F HPC has a classic cluster architecture: a state-of-the-art Parallel File System (PFS), head nodes, and compute nodes (of varying 'size'), all connected by high-speed InfiniBand network connections. The cluster will be implemented in stages. The initial stage includes a 600TB PFS, 48 slim nodes of 16 cores and 64GB RAM each, and 2 fat nodes of 64 cores and 1TB RAM each. The overall architecture, which includes two head nodes in a failover configuration and an InfiniBand network backbone, can easily be expanded by adding nodes and expanding the PFS. The cluster management software is designed to facilitate a heterogeneous and evolving cluster.
* [[steps_for_courses | Steps involved to run a course on Anunna]]
{{-}}
=== Nodes ===
= Events =
The cluster consists of a number of separate machines, each with its own operating system. The default operating system throughout the cluster is [https://www.scientificlinux.org Scientific Linux] version 6. Scientific Linux (SL) is based on [http://en.wikipedia.org/wiki/Red_Hat_Enterprise_Linux Red Hat Enterprise Linux (RHEL)], which is currently at version 6; SL follows the versioning scheme of RHEL.


The cluster has two master nodes in a redundant configuration, which means that if one crashes, the other will take over seamlessly. Various other nodes exist to support the two main file systems (the Lustre parallel file system and the NFS file system). The actual computations are done on the worker nodes, also called compute nodes. The cluster is configured in a heterogeneous fashion: it consists of 48 so-called 'slim nodes' that each have 16 cores and 64GB of RAM (named 'node001' through 'node060'; note that not all node names map to physical nodes), and two so-called 'fat nodes' that each have 64 cores and 1TB of RAM ('fat001' and 'fat002').
* [[Courses]] that have happened and are happening
* [[Downtime]] that will affect all users
* [[Meetings]] that may affect the policies of Anunna


Information from the Cluster Management Portal, as it appeared on June 26, 2014:
= Software =
  <code>DEVICE INFORMATION
* [[Modules]]
  Hostname State Memory Cores CPU Speed GPU NICs IB Category
* [[Apptainer]]
node001..node002 UP 67.6 GiB 16 Intel(R) Xeon(R) CPU E5-2660 0+ 2200 MHz 3 1 default<br>
* [[Python]]
node049..node054 UP 67.6 GiB 16 Intel(R) Xeon(R) CPU E5-2660 0+ 2200 MHz 3 1 default<br>
* [[R]]
master1 master2 UP 67.5 GiB 16 Intel(R) Xeon(R) CPU E5-2660 0+ 2199 MHz 5 1<br>
* [[Julia]]
mds01, mds02 UP 16.8 GiB 8 Intel(R) Xeon(R) CPU E5-2609 0+ 2399 MHz 5 1 mds<br>
storage01..storage06 UP 67.6 GiB 32 Intel(R) Xeon(R) CPU E5-2660 0+ 2200 MHz 5 1 oss<br>
nfs01 UP 67.6 GiB 8 Intel(R) Xeon(R) CPU E5-2609 0+ 2399 MHz 7 1 login<br>
fat001 fat002 UP 1.0 TiB 64 AMD Opteron(tm) Processor 6376 2300 MHz 5 1 fat
  </code>


Main cluster node configuration:
=Web Apps=
* Master nodes: 2 PowerEdge R720 master nodes in a failover configuration, which also will share some applications and databases with machines in the cluster, for which the parallel file system is not the ideal solution.
* The NFS server is a PowerEdge R720XD. The NFS node also acts as a login node, where users log in, compile applications, and submit jobs; it shares each home directory with the rest of the cluster via NFS.
* 50 compute nodes
** 12x Dell PowerEdge C6000 enclosures, each containing four nodes
** 48x Dell PowerEdge C6220; 16 Intel Xeon cores, 64GB RAM each
** 2x Dell R815; 64 AMD Opteron cores, 1TB RAM each
Hyperthreading is disabled in compute nodes.
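The node counts listed above imply the total compute capacity of the initial stage. The following is a minimal sketch of that arithmetic; the variable names are illustrative only, and counting 1TB as 1024GB is an assumption made for the calculation.

```shell
#!/usr/bin/env bash
# Node counts and per-node sizes as listed in the node configuration above.
slim_nodes=48; slim_cores=16; slim_ram_gb=64
fat_nodes=2;   fat_cores=64;  fat_ram_gb=1024   # assumption: 1TB counted as 1024GB

# Sum cores and RAM over both node types.
total_cores=$(( slim_nodes * slim_cores + fat_nodes * fat_cores ))
total_ram_gb=$(( slim_nodes * slim_ram_gb + fat_nodes * fat_ram_gb ))

echo "compute nodes: $(( slim_nodes + fat_nodes ))"
echo "total cores:   ${total_cores}"
echo "total RAM:     ${total_ram_gb} GB"
```

Because hyperthreading is disabled, the core count equals the number of hardware threads available to jobs.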


=== Filesystems ===
*[[Jupyter]]


[[File:Storage_pic.png|thumb|right|300px|Schematic overview of storage components of the B4F cluster.]]
= Other Software =
The B4F Cluster has two primary file systems, each with different properties and purposes.
==== Parallel File System: Lustre ====
At the base of the cluster is an ultrafast file system, a so-called [http://en.wikipedia.org/wiki/Parallel_file_system Parallel File System] (PFS). The current size of the PFS is around 600TB. The PFS implemented in the B4F Cluster is called [http://en.wikipedia.org/wiki/Lustre_(file_system) Lustre]. Lustre has become very popular in recent years because it is feature-rich, deemed very stable, and Open Source. Lustre is nowadays the default option for a PFS in Dell clusters as well as in clusters sold by other vendors. The PFS is mounted on all head nodes and worker nodes of the cluster, providing a seamless integration between compute and data infrastructure. The strength of a PFS is speed: by design, the total I/O should be up to 15GB/s. With a very large number of compute nodes, and with very high volumes of data, the high read-write speeds that the PFS provides are necessary. The Lustre filesystem is divided into [[Lustre_PFS_layout | several partitions]], each differing in persistence and backup features. The Lustre PFS is meant to store (shared) data that is likely to be used for analysis in the near future. Personal analysis scripts, software, or additional small data files can be stored in the $HOME directory of each user.


The hardware components of the PFS:
== Cluster Scheduler ==
* 2x Dell PowerEdge R720
Anunna uses Slurm as job scheduler.
* 1x Dell PowerVault MD3220
* [[Using_Slurm | Submit jobs with Slurm]]
* 6x Dell PowerEdge R620
* [[node_usage_graph | Be aware of how much work the cluster is under right now with 'node_usage_graph']]
* 6x Dell PowerVault MD3260
 
==== Network File System (NFS): $HOME dirs ====
Each user will have his/her own home directory. The path of the home directory will be:
 
  /home/[name partner]/[username]
 
/home lives on a so-called [http://en.wikipedia.org/wiki/Network_File_System Network File System], or NFS. The NFS is separate from the PFS and is far more limited in I/O (read/write speeds, latency, etc.) than the PFS. This means that it is not meant to store large data volumes that require high data transfer rates or low latency. Compared to the Lustre PFS (600TB), the NFS is small: only 20TB. The /home partition will be backed up daily. The amount of space that can be allocated is limited per user, with a 200GB soft limit and a 210GB hard limit. Personal quota and total use per user can be found using:
 
<source lang='bash'>
quota -s
</source>
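As a rough complement to <code>quota -s</code>, the sketch below classifies a usage figure against the 200GB soft and 210GB hard limits mentioned above. The function <code>quota_status</code> is a hypothetical helper, not something installed on the cluster, and how the quota system treats usage exactly at the soft limit is an assumption here.

```shell
#!/usr/bin/env bash
# Limits as stated above; quota_status is a hypothetical helper.
SOFT_LIMIT_GB=200
HARD_LIMIT_GB=210

# Print a status string for a usage figure given in whole GB.
quota_status() {
    local used_gb=$1
    if (( used_gb >= HARD_LIMIT_GB )); then
        echo "hard limit reached"
    elif (( used_gb > SOFT_LIMIT_GB )); then
        echo "over soft limit"
    else
        echo "ok"
    fi
}

# Example: feed in the current size of $HOME, rounded up to whole GiB (GNU du):
# quota_status "$(du -s --block-size=1G "$HOME" | cut -f1)"
```

The <code>du</code> call is left commented out because scanning a large home directory over NFS can take a while.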
 
The NFS is supported through the NFS server (nfs01) that also serves as access point to the cluster.
 
Hardware components of the NFS:
* 1x Dell PowerEdge R720XD
* 1x Dell PowerVault MD3220
 
=== Network ===
The various components (head nodes, worker nodes, and most importantly, the Lustre PFS) are all interconnected by an ultra-high-speed network called [http://en.wikipedia.org/wiki/Infiniband InfiniBand]. A total of 7 InfiniBand switches are configured in a [http://en.wikipedia.org/wiki/Fat_tree fat tree] configuration.
 
== Housing at Theia ==
[[File:Map_Theia.png|thumb|left|200px|Location of Theia, just outside of Wageningen campus]]
The B4F Cluster is housed at one of the two main server centres of WUR-FB-ICT, near Wageningen Campus. The building (Theia) may not look like much from the outside (it used to function as a potato storage facility), but inside is a modern server centre that includes, among other things, emergency power backup systems and automated fire extinguishers. Many of the server facilities provided by FB-ICT that are used on a daily basis by WUR personnel and students are located there, as is the B4F Cluster. Access to Theia is highly restricted and can only be granted in the presence of a representative of FB-IT.
{{-}}
{| width="90%"
|- valign="top"
| width="10%" |
 
| width="30%" |
[[File:Cluster2_pic.png|thumb|left|220px|Some components of the cluster after unpacking.]]
| width="70%" |
[[File:Cluster_pic.png|thumb|right|400px|The final configuration after installation.]]
|}
{{-}}
 
== Management ==
 
=== Project Leader ===
* Stephen Janssen (Wageningen UR, FB-IT, Service Management)
 
=== Daily Project Management ===
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]
* Andre ten Böhmer (Wageningen UR, FB-ICT, Infrastructure)
 
[[Maintenance_and_Management | Maintenance and Management]]
 
=== Steering Group ===
The Steering Group ensures that the HPC generates enough revenue and meets the needs of the users. This includes setting fees, developing contracts, attracting new users, deciding on investments in the HPC, and communication.
* Frido Hamoen (CRV, on behalf of Breed4Food industrial partners, replaced Alfred de Vries in August)
* Petra Caessens (CAT-AgroFood)
* Wojtek Sablik (Wageningen UR, FB-IT, Infrastructure)
* Edda Neuteboom (CAT-AgroFood, secretariat)
* Johan van Arendonk (Wageningen UR, chair).
 
=== IT Workgroup ===
[[File:Image_(1).jpeg|thumb|right|380px|(part of) the IT working group in front of the B4F Cluster]]
The IT Workgroup is responsible for the technical performance of the HPC. It has been involved in the design of the HPC and the selection of the supplier. Its members will support the technical management of the HPC and share experiences to ensure that the HPC meets the needs of its users. The IT Workgroup will advise the steering group on investments in software and hardware.
* [[User:Janss115 | Stephen Janssen (Wageningen UR, FB-IT, Service Management)]]
* [[User:pollm001 | Koen Pollmann (Wageningen UR, FB-IT, Infrastructure)]]
* [[User:Bohme001 | Andre ten Böhmer (Wageningen UR, FB-IT, Infrastructure)]]
* [[User:Barris01 | Wes Barris (Cobb)]]
* [[User:Vereij01 | Addie Vereijken (Hendrix Genetics)]]
* [[User:dongen01 | Henk van Dongen (Topigs)]]
* Harry Dijkstra (CRV)
* [[User:Calus001 | Mario Calus (ABGC-WLR)]]
* [[User:Megen002 | Hendrik-Jan Megens (ABGC-ABG)]]
{{-}}
 
=== User Group ===
The User Group is ultimately the most important of all groups, because it encompasses the users for whom the infrastructure was built. In addition, successful use of the cluster will rely on an active community of users that is willing to share knowledge and best practices, including maintenance and expansion of this Wiki. Regular User Group meetings will be held in the future [frequency to be determined] to facilitate this process.
 
* [[List_of_users | List of users (alphabetical order)]]
 
== Access Policy ==
Access policy is still a work in progress. In principle, all staff and students of the five main partners will have access to the cluster. Access needs to be granted actively (by creation of an account on the cluster by FB-ICT). Use of resources is limited by the scheduler. A user's priority for the system's resources is regulated by the queues ('partitions') that have been granted to that user.
 
=== Contact Persons ===
A request to access the cluster needs to be directed to one of the following persons (please refer to appropriate partner):
 
==== Cobb-Vantress ====
* Wes Barris
* Jun Chen
 
==== ABGC ====
===== Animal Breeding and Genetics =====
* [[User:Hulze001 |Alex Hulzebosch]]
* [[User:Megen002 | Hendrik-Jan Megens]]
 
===== Wageningen Livestock Research =====
* Mario Calus
* Ina Hulsegge
==== CRV ====
* Frido Hamoen
* Chris Schrooten
==== Hendrix Genetics ====
* Ton Dings
* Abe Huisman
* Addie Vereijken
==== Topigs ====
* [[User:dongen01 | Henk van Dongen]]
* Egiel Hanenbarg
* Naomi Duijvensteijn
 
== Using the B4F Cluster ==
=== Gaining access to the B4F Cluster ===
Access to the cluster and file transfer are done by [http://en.wikipedia.org/wiki/Secure_Shell ssh-based protocols].
* [[log_in_to_B4F_cluster | Logging into cluster using ssh and file transfer]]
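A minimal sketch of what logging in and transferring a file over SSH could look like is given below. The host name, partner name, username, and file name are all placeholders invented for illustration; the real values come with the account. The commands are printed rather than executed, so the sketch can be run anywhere without a network connection.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions): replace with the details of your own account.
CLUSTER_HOST="login.example.org"
PARTNER="partner"
CLUSTER_USER="user001"

# Home directories follow the pattern /home/[name partner]/[username].
remote_home() { printf '/home/%s/%s\n' "$PARTNER" "$CLUSTER_USER"; }
ssh_target()  { printf '%s@%s\n' "$CLUSTER_USER" "$CLUSTER_HOST"; }

# Print the commands instead of running them.
echo "ssh $(ssh_target)"                                 # interactive login
echo "sftp $(ssh_target)"                                # interactive file transfer
echo "scp input.dat $(ssh_target):$(remote_home)/"       # copy one file to $HOME
```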
 
=== Cluster Management Software and Scheduler ===
The B4F cluster uses Bright Cluster Manager software for overall cluster management, and Slurm as job scheduler.
* [[BCM_on_B4F_cluster | Monitor cluster status with BCM]]
* [[SLURM_on_B4F_cluster | Submit jobs with Slurm]]
* [[SLURM_Compare | Rosetta Stone of Workload Managers]]
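To give an idea of what a Slurm job looks like, here is a minimal batch script sketch. The job name, resource requests, and log file name are illustrative assumptions, and no partition is specified, which assumes a default partition is available to the user; the linked Slurm page remains the reference for the actual cluster settings.

```shell
#!/bin/bash
#SBATCH --job-name=example        # hypothetical job name
#SBATCH --ntasks=1                # a single task
#SBATCH --cpus-per-task=4         # four cores for that task
#SBATCH --mem=8G                  # memory for the whole job
#SBATCH --time=00:10:00           # wall-clock limit (HH:MM:SS)
#SBATCH --output=example_%j.log   # %j expands to the job ID

# Slurm sets SLURM_CPUS_PER_TASK inside the job; fall back to 1 elsewhere.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(hostname) with ${threads} thread(s)"
```

Saved as, for example, <code>example.sh</code>, the script would be submitted with <code>sbatch example.sh</code>, and <code>squeue -u "$USER"</code> lists the user's pending and running jobs.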


=== Installation of software by users ===
== Installation of software by users ==


* [[Domain_specific_software_on_B4Fcluster_installation_by_users | Installing domain specific software: installation by users]]
Line 179: Line 50:
* [[Installing_R_packages_locally | Installing R packages locally]]
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]
* [[Virtual_environment_Python_3.4_or_higher | Setting up and using a virtual environment for Python3.4 or higher ]]
* [[Installing WRF and WPS]]
* [[Running scripts on a fixed timeschedule (cron)]]


=== Installed software ===
== Installed software ==


* [[Globally_installed_software | Globally installed software]]
* [[ABGC_modules | ABGC specific modules]]


=== Being in control of Environment parameters ===
= Useful Notes =
 
== Being in control of Environment parameters ==


* [[Using_environment_modules | Using environment modules]]
* [[Aliases and local variables]]
* [[Setting local variables]]
* [[Setting_TMPDIR | Set a custom temporary directory location]]
Line 193: Line 70:
* [[Setting_up_Python_virtualenv | Setting up and using a virtual environment for Python3 ]]


=== Controlling costs ===
== Controlling costs ==


* [[SACCT | using SACCT to see your costs]]
* [[get_my_bill | using the "get_my_bill" script to estimate costs]]
== Management ==
The Product Owner of Anunna is Alexander van Ittersum (Wageningen UR, FB-IT, C&PS). [[User: prins089 | Fons Prinsen (Wageningen UR, FB-IT, C&PS)]] is responsible for [[Maintenance_and_Management | Maintenance and Management]] of the cluster.
* [[Roadmap | Ambitions regarding innovation, support and administration of Anunna ]]


== Miscellaneous ==
= Miscellaneous =
* [[Mailinglist | Electronic mail discussion lists]]
* [[History_of_the_Cluster | Historical information on the startup of Anunna]]
* [[Bioinformatics_tips_tricks_workflows | Bioinformatics tips, tricks, and workflows]]
* [[Parallel_R_code_on_SLURM | Running parallel R code on SLURM]]
* [[Convert_between_MediaWiki_and_other_formats | Convert between MediaWiki format and other formats]]
* [[Manual GitLab | GitLab: Create projects and add scripts]]
* [[Monitoring_executions | Monitoring job execution]]
* [[Shared_folders | Working with shared folders in the Lustre file system]]
* [[Old_binaries | Running older binaries on the updated OS]]


== See also ==
= See also =
* [[Maintenance_and_Management | Maintenance and Management]]
* [[BCData | BCData]]
* [[Mailinglist | Electronic mail discussion lists]]
* [[About_ABGC | About ABGC]]
Line 207: Line 99:
* [[Lustre_PFS_layout | Lustre Parallel File System layout]]


== External links ==
= External links =
{| width="90%"
{| width="90%"
|- valign="top"
|- valign="top"
| width="30%" |
| width="30%" |
* [http://www.breed4food.com/en/show/Breed4Food-initiative-reinforces-the-Netherlands-position-as-an-innovative-country-in-animal-breeding-and-genomics.htm Breed4Food programme]
* [https://www.wur.nl/en/Value-Creation-Cooperation/Facilities/Wageningen-Shared-Research-Facilities/Our-facilities/Show/High-Performance-Computing-Cluster-HPC-Anunna.htm SRF offers an HPC facility]
* [http://www.wageningenur.nl/en/Expertise-Services/Facilities/CATAgroFood-3/CATAgroFood-3/News-and-agenda/Show/CATAgroFood-invests-in-a-High-Performance-Computing-cluster.htm CATAgroFood invests in HPC]
* [http://www.cobb-vantress.com Cobb-Vantress homepage]
 
 
| width="30%" |
* [https://www.crv4all.nl CRV homepage]
* [http://www.hendrix-genetics.com Hendrix Genetics homepage]
* [http://www.topigs.com TOPIGS homepage]
| width="30%" |
| width="30%" |
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]
* [http://en.wikipedia.org/wiki/Scientific_Linux Scientific Linux]
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]
* [http://en.wikipedia.org/wiki/Help:Cheatsheet Help with editing Wiki pages]
|}
|}

Latest revision as of 16:19, 2 December 2024

Anunna is a High Performance Computing (HPC) infrastructure hosted by Wageningen University & Research Centre. It is open to all WUR research groups and to other organizations, including companies, that have collaborative projects with WUR.

Access Policy

Main Article: Access Policy

Access needs to be granted actively (by creation of an account on the cluster by FB-IT). Use of resources is limited by the scheduler. Note that the use of Anunna is not free of charge.

Using Anunna

Gaining access to Anunna

Access to the cluster and file transfer are traditionally done via SSH and SFTP.

Using Anunna for courses (mainly Jupyter notebooks)

Events

  • Courses that have happened and are happening
  • Downtime that will affect all users
  • Meetings that may affect the policies of Anunna

Software

Web Apps

Other Software

Cluster Scheduler

Anunna uses Slurm as job scheduler.

Installation of software by users

Installed software

Useful Notes

Being in control of Environment parameters

Controlling costs

Management

The Product Owner of Anunna is Alexander van Ittersum (Wageningen UR, FB-IT, C&PS). Fons Prinsen (Wageningen UR, FB-IT, C&PS) is responsible for Maintenance and Management of the cluster.

Miscellaneous

See also

External links