Workflow Engines (Snakemake, Nextflow)

Author: Carolina Pita Barros <br/>
Workflow engines let you describe a multi-step analysis as a set of rules — which steps depend on which, and how to run each — and then execute the whole pipeline reproducibly, submitting the individual steps to the scheduler for you. The two most common on Anunna are [https://snakemake.github.io/ Snakemake] and [https://www.nextflow.io/ Nextflow].


Using a workflow engine has real advantages on an HPC cluster: steps run as SLURM jobs with the right resources, only the parts that need to run are rerun, and the same pipeline can be shared and reproduced by others.


== Snakemake ==


Snakemake describes a pipeline as a set of rules in a <code>Snakefile</code>. It can submit each rule's work to SLURM and manage the dependencies between steps.
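
As a rough illustration of what such a set of rules looks like, the sketch below writes a two-rule Snakefile into a directory. The helper name <code>write_example_snakefile</code>, the rule names and the <code>results/hello.txt</code> path are made up for this example and do not come from any pipeline on Anunna.

```bash
# Hypothetical helper: write a minimal two-rule Snakefile into a directory.
write_example_snakefile() {
    local dir="${1:-.}"
    mkdir -p "$dir"
    # Quoted 'EOF' stops the shell from expanding anything inside the heredoc.
    cat > "$dir/Snakefile" <<'EOF'
# "all" is the first rule, so it is the default target: it asks for the final file.
rule all:
    input:
        "results/hello.txt"

# Snakemake runs "hello" because its output is what "all" needs as input.
rule hello:
    output:
        "results/hello.txt"
    shell:
        "echo hello > {output}"
EOF
}
```

Calling <code>write_example_snakefile demo</code> and then running <code>snakemake -np</code> inside <code>demo</code> should list the <code>hello</code> step as the only work to do.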


=== Set up ===


Snakemake is usually installed in a conda environment. If you do not have conda/Miniforge yet, see [[Python]]. Create an environment containing Snakemake and your pipeline's dependencies:


<syntaxhighlight lang="bash">
conda create --name my-pipeline --file requirements.txt
conda activate my-pipeline
</syntaxhighlight>


Giving the environment the same name as the pipeline makes it easy to find later.


=== SLURM profile ===


To let Snakemake submit jobs to SLURM, create a profile. Make a directory for it:


<syntaxhighlight lang="bash">
mkdir -p ~/.config/snakemake/my-pipeline
</syntaxhighlight>


and create a <code>config.yaml</code> inside it that tells Snakemake how to submit jobs, for example:


<syntaxhighlight lang="yaml">
jobs: 10
cluster: "sbatch -t 1:0:0 --mem=16000 -c 16 --job-name={rule} --output=logs_slurm/{rule}.out --error=logs_slurm/{rule}.err"
use-conda: true
</syntaxhighlight>


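Adjust the resources in the <code>sbatch</code> command to what your rules need: <code>-t</code> is the time limit (here 1 hour), <code>--mem</code> is the memory in MB (here 16000, about 16 GB) and <code>-c</code> is the number of CPU cores per job (here 16).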
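
If you keep several profiles with different resources, the file can also be generated. The sketch below is one possible way to do that, not an official Anunna tool: the helper name <code>write_slurm_profile</code> and its argument order are assumptions made for this example, and the defaults simply repeat the values shown above.

```bash
# Hypothetical helper: write a cluster-style (Snakemake <8) profile config.yaml.
# Arguments: profile directory, time limit, memory in MB, cores per job.
write_slurm_profile() {
    local dir="$1" time_limit="${2:-1:0:0}" mem_mb="${3:-16000}" cores="${4:-16}"
    mkdir -p "$dir"
    # Unquoted EOF: the shell fills in $time_limit, $mem_mb and $cores.
    # {rule} has no dollar sign, so it is left for Snakemake to fill in.
    cat > "$dir/config.yaml" <<EOF
jobs: 10
cluster: "sbatch -t $time_limit --mem=$mem_mb -c $cores --job-name={rule} --output=logs_slurm/{rule}.out --error=logs_slurm/{rule}.err"
use-conda: true
EOF
}
```

For example, <code>write_slurm_profile ~/.config/snakemake/my-pipeline 4:0:0 8000 4</code> would request 4 hours, 8000 MB and 4 cores for every job.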


<!-- TODO: confirm the recommended way to run Snakemake on the current cluster. The cluster-command profile shown here is the older (Snakemake <8) style; Snakemake 8+ uses the SLURM executor plugin (--executor slurm). Document whichever is installed/recommended on Anunna. -->
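The <code>cluster:</code> key used in the profile above belongs to Snakemake versions before 8. From version 8 on, jobs are submitted through executor plugins such as the SLURM one (<code>--executor slurm</code>). Which version is recommended on Anunna is not stated here, so a small check like the sketch below can at least tell which style an installed version expects. The function name <code>snakemake_submit_style</code> and its output words are invented for this example.

```bash
# Hypothetical helper: map a Snakemake version string to the submission style.
snakemake_submit_style() {
    local version="$1"
    local major="${version%%.*}"   # text before the first dot, e.g. "7" from "7.32.4"
    case "$major" in
        # Empty or non-numeric major version: refuse to guess.
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if [ "$major" -ge 8 ]; then
        echo "executor"   # Snakemake 8+: SLURM executor plugin (--executor slurm)
    else
        echo "cluster"    # Snakemake <8: cluster-command profile as shown above
    fi
}
```

With the conda environment active, <code>snakemake_submit_style "$(snakemake --version)"</code> reports the style for the installed version.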


=== Configure and run ===


Open the pipeline's own <code>config.yaml</code> and set the input and output paths, keeping the variable names already in the file:


<syntaxhighlight lang="yaml">
OUTDIR: /path/to/output
READS_DIR: /path/to/reads/
ASSEMBLY: /path/to/assembly
PREFIX: output_name
</syntaxhighlight>
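
A mistyped path in this file usually only shows up once Snakemake starts, so it can help to check the inputs first. The sketch below assumes a flat config file with one <code>KEY: value</code> pair per line, as in the example above, and checks only <code>READS_DIR</code> and <code>ASSEMBLY</code>. The helper names <code>config_value</code> and <code>check_config_inputs</code> are hypothetical, and a pipeline with other variable names would need a different key list.

```bash
# Hypothetical helper: print the value of a top-level "KEY: value" line
# in a simple, flat YAML file (no nesting, no quoting).
config_value() {
    local key="$1" file="$2"
    awk -v key="$key" -F': *' '
        $1 == key { sub(/[ \t]+$/, "", $2); print $2; exit }
    ' "$file"
}

# Hypothetical helper: check that the input paths named in the config exist.
check_config_inputs() {
    local file="$1" status=0 key path
    for key in READS_DIR ASSEMBLY; do
        path="$(config_value "$key" "$file")"
        if [ -z "$path" ] || [ ! -e "$path" ]; then
            echo "$key: '$path' does not exist" >&2
            status=1
        fi
    done
    return "$status"
}
```

Running <code>check_config_inputs config.yaml</code> in the pipeline directory prints one line per missing input and returns a non-zero status if any is missing.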
 


Because pipelines can take a long time, run Snakemake inside a persistent session ([https://linuxize.com/post/how-to-use-linux-screen/ screen] or tmux) so it keeps running if your connection drops. First do a dry run to check what will happen:


<syntaxhighlight lang="bash">
snakemake -np
</syntaxhighlight>


If the steps and commands look right, run the pipeline with your profile:


<syntaxhighlight lang="bash">
snakemake --profile my-pipeline
</syntaxhighlight>


The jobs are submitted to SLURM and you can follow the progress in your terminal and with the usual tools — see [[Monitoring Jobs]].
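
The dry run and the real run can be chained so that nothing is submitted when the dry run fails. The wrapper below is a sketch under two assumptions: the name <code>run_pipeline</code> is made up, and jobs are assumed to start in the current directory, which is why it creates <code>logs_slurm</code> there (<code>sbatch</code> does not create the directory given to <code>--output</code> or <code>--error</code>). If the Snakefile sets a different working directory, the log directory would have to be created there instead.

```bash
# Hypothetical wrapper: dry run first, then the real run with a profile.
run_pipeline() {
    local profile="$1"
    # sbatch does not create the directory for --output/--error itself,
    # so make sure the log directory used in the profile exists.
    mkdir -p logs_slurm
    # Stop here if the dry run reports a problem.
    snakemake -np || { echo "dry run failed, not submitting" >&2; return 1; }
    snakemake --profile "$profile"
}
```

Inside a screen or tmux session with the conda environment active, <code>run_pipeline my-pipeline</code> then replaces the two separate commands.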


== Nextflow ==


[https://www.nextflow.io/ Nextflow] is another widely used workflow engine, popular in bioinformatics (for example the nf-core pipelines).


<!-- TODO: add a Nextflow section for Anunna — how to load or install Nextflow, the SLURM executor configuration (nextflow.config: process.executor = 'slurm'), and a minimal example. -->
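Until site-specific instructions are added, the sketch below shows only the one setting that makes Nextflow submit its processes through SLURM. Nextflow reads <code>nextflow.config</code> from the directory it is launched in. The helper name <code>write_nextflow_config</code> is hypothetical, and queue names, resource limits and how to load or install Nextflow on Anunna are deliberately left out because they are not documented here.

```bash
# Hypothetical helper: write a minimal nextflow.config that selects SLURM.
write_nextflow_config() {
    local dir="${1:-.}"
    mkdir -p "$dir"
    # Quoted 'EOF' keeps the file content exactly as written.
    cat > "$dir/nextflow.config" <<'EOF'
// Submit every process as a SLURM job instead of running it locally.
process.executor = 'slurm'
EOF
}
```

After <code>write_nextflow_config .</code> in the pipeline directory, a <code>nextflow run</code> started from there would pick up the setting.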


== See also ==
* [[Python]]
* [[Environment Modules]]
* [[Monitoring Jobs]]
* [[Scheduler Overview (Slurm)]]

Latest revision as of 14:14, 18 June 2026
