Workflow Engines (Snakemake, Nextflow)
Workflow engines let you describe a multi-step analysis as a set of rules (which steps depend on which, and how to run each) and then execute the whole pipeline reproducibly, submitting the individual steps to the scheduler for you. The two most common on Anunna are [https://snakemake.github.io/ Snakemake] and [https://www.nextflow.io/ Nextflow].
Using a workflow engine has real advantages on an HPC cluster: steps run as SLURM jobs with the right resources, only the parts that need to run are rerun, and the same pipeline can be shared and reproduced by others.
== Snakemake ==
Snakemake describes a pipeline as a set of rules in a <code>Snakefile</code>. It can submit each rule's work to SLURM and manage the dependencies between steps.
=== Set up ===
Snakemake is usually installed in a conda environment. If you do not have conda/Miniforge yet, see [[Python]]. Create an environment containing Snakemake and your pipeline's dependencies:
<syntaxhighlight lang="bash">
conda create --name my-pipeline --file requirements.txt
conda activate my-pipeline
</syntaxhighlight>
Giving the environment the same name as the pipeline makes it easy to find later.
=== SLURM profile ===
To let Snakemake submit jobs to SLURM, create a profile. Make a directory for it:
<syntaxhighlight lang="bash">
mkdir -p ~/.config/snakemake/my-pipeline
</syntaxhighlight>
and create a <code>config.yaml</code> inside it that tells Snakemake how to submit jobs, for example:
<syntaxhighlight lang="yaml">
jobs: 10
cluster: "sbatch -t 1:0:0 --mem=16000 -c 16 --job-name={rule} --output=logs_slurm/{rule}.out --error=logs_slurm/{rule}.err"
use-conda: true
</syntaxhighlight>
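Adjust the resources (time, memory, cores) to what your rules need. The <code>logs_slurm</code> directory must already exist in the directory you run Snakemake from (<code>mkdir -p logs_slurm</code>), because <code>sbatch</code> does not create missing directories for its output and error files.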
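Rather than hard-coding one set of resources for every job, the submit command can take them from the rule being submitted. The following is a minimal sketch for the cluster-command style, assuming your rules declare <code>threads</code> and the resources <code>runtime</code> (in minutes) and <code>mem_mb</code>; the default values are illustrative, not Anunna recommendations:
<syntaxhighlight lang="yaml">
jobs: 10
# {threads} and {resources.*} are filled in per job from the rule being submitted
cluster: "sbatch -t {resources.runtime} --mem={resources.mem_mb} -c {threads} --job-name={rule} --output=logs_slurm/{rule}.out --error=logs_slurm/{rule}.err"
# Applied to any rule that does not set its own values (assumed defaults)
default-resources:
  - runtime=60
  - mem_mb=16000
use-conda: true
</syntaxhighlight>
A rule that needs more can then declare, for example, <code>threads: 16</code> and <code>resources: mem_mb=64000</code> in the <code>Snakefile</code>, and only that rule's jobs request the larger allocation.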
<!-- TODO: confirm the recommended way to run Snakemake on the current cluster. The cluster-command profile shown here is the older (Snakemake <8) style; Snakemake 8+ uses the SLURM executor plugin (--executor slurm). Document whichever is installed/recommended on Anunna. -->
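Snakemake 8 and later no longer accept the <code>cluster:</code> key. A roughly equivalent profile for those versions might look like the sketch below. It assumes the <code>snakemake-executor-plugin-slurm</code> package is installed in the same environment as Snakemake; the resource values are carried over from the example above, and whether this setup is the recommended one on Anunna has not been confirmed:
<syntaxhighlight lang="yaml">
executor: slurm
jobs: 10
default-resources:
  runtime: 60          # minutes, corresponds to -t 1:0:0
  mem_mb: 16000        # corresponds to --mem=16000
  cpus_per_task: 16    # corresponds to -c 16
software-deployment-method: conda   # Snakemake 8+ name for use-conda
</syntaxhighlight>
With this style the plugin builds the <code>sbatch</code> call itself, so job names and log locations are chosen by the plugin rather than set in the profile.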
=== Configure and run ===
Open the pipeline's own <code>config.yaml</code> and set the input and output paths, keeping the variable names already in the file:
<syntaxhighlight lang="yaml">
OUTDIR: /path/to/output
READS_DIR: /path/to/reads/
ASSEMBLY: /path/to/assembly
PREFIX: output_name
</syntaxhighlight>
Because pipelines can take a long time, run Snakemake inside a persistent session ([https://linuxize.com/post/how-to-use-linux-screen/ screen] or tmux) so it keeps running if your connection drops. First do a dry run to check what will happen:
<syntaxhighlight lang="bash">
snakemake -np
</syntaxhighlight>
If the steps and commands look right, run the pipeline with your profile:
<syntaxhighlight lang="bash">
snakemake --profile my-pipeline
</syntaxhighlight>
The jobs are submitted to SLURM. You can follow their progress in your terminal and with the usual tools; see [[Monitoring Jobs]].
== Nextflow ==
[https://www.nextflow.io/ Nextflow] is another widely used workflow engine, popular in bioinformatics (for example the nf-core pipelines).
<!-- TODO: add a Nextflow section for Anunna: how to load or install Nextflow, the SLURM executor configuration (nextflow.config: process.executor = 'slurm'), and a minimal example. -->
== See also ==
* [[Python]]
* [[Environment Modules]]
* [[Monitoring Jobs]]
* [[Scheduler Overview (Slurm)]]
Latest revision as of 14:14, 18 June 2026