Workflow Engines (Snakemake, Nextflow)

From HPCwiki

Workflow engines let you describe a multi-step analysis as a set of rules — which steps depend on which, and how to run each — and then execute the whole pipeline reproducibly, submitting the individual steps to the scheduler for you. The two most common on Anunna are Snakemake and Nextflow.

Using a workflow engine has real advantages on an HPC cluster: steps run as SLURM jobs with the right resources, only the parts that need to run are rerun, and the same pipeline can be shared and reproduced by others.

Snakemake

Snakemake describes a pipeline as a set of rules in a Snakefile. It can submit each rule's work to SLURM and manage the dependencies between steps.
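
To make the idea of rules concrete, the sketch below writes a small Snakefile into a scratch directory. The directory name snakemake-demo, the rule names and the file names are all hypothetical and only serve as illustration. The second rule depends on the first because its input is the first rule's output.

```shell
# Create a scratch directory holding a small Snakefile (all names are hypothetical).
mkdir -p snakemake-demo
cat > snakemake-demo/Snakefile <<'EOF'
# "all" comes first, so it is the default target; it only lists the final output.
rule all:
    input:
        "results/greeting.upper.txt"

# Produces a small text file.
rule write_greeting:
    output:
        "results/greeting.txt"
    shell:
        "echo hello > {output}"

# Depends on write_greeting because its input is that rule's output.
rule uppercase:
    input:
        "results/greeting.txt"
    output:
        "results/greeting.upper.txt"
    shell:
        "tr a-z A-Z < {input} > {output}"
EOF
```

Running snakemake -np inside snakemake-demo should list the jobs Snakemake would run, without executing anything.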

Set up

Snakemake is usually installed in a conda environment. If you do not have conda/Miniforge yet, see Python. Create an environment from the pipeline's requirements.txt, which should list Snakemake along with the pipeline's other dependencies:

conda create --name my-pipeline --file requirements.txt
conda activate my-pipeline

Giving the environment the same name as the pipeline makes it easy to find later.

SLURM profile

To let Snakemake submit jobs to SLURM, create a profile. Make a directory for it:

mkdir -p ~/.config/snakemake/my-pipeline

and create a config.yaml inside it that tells Snakemake how to submit jobs, for example:

jobs: 10
cluster: "sbatch -t 1:0:0 --mem=16000 -c 16 --job-name={rule} --output=logs_slurm/{rule}.out --error=logs_slurm/{rule}.err"
use-conda: true

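Adjust the resources (time, memory, cores) to what your rules need. The cluster key is only understood by Snakemake 7 and earlier; Snakemake 8 replaced it with executor plugins, so check snakemake --version before relying on this profile.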
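
The profile above requests the same time, memory and cores for every rule. One possible variant, sketched below, lets each rule's own threads and resources fill in the sbatch command. It assumes a Snakemake version that still supports the cluster key. The resource name time_min is a hypothetical custom resource (in minutes), and the PROFILE_DIR variable is introduced here only for illustration. Note that this overwrites any existing config.yaml in the profile directory.

```shell
# Assumption: Snakemake 7 or earlier, where the profile's "cluster" key is supported.
PROFILE_DIR="${PROFILE_DIR:-$HOME/.config/snakemake/my-pipeline}"
mkdir -p "$PROFILE_DIR"

# The quoted 'EOF' stops the shell from expanding anything, so the text is written verbatim.
cat > "$PROFILE_DIR/config.yaml" <<'EOF'
jobs: 10
# {threads} and {resources.NAME} are filled in per rule by Snakemake when it submits a job.
cluster: "sbatch -t {resources.time_min} --mem={resources.mem_mb} -c {threads} --job-name={rule} --output=logs_slurm/{rule}.out --error=logs_slurm/{rule}.err"
# Fallback values for rules that do not declare their own resources.
# time_min is a hypothetical custom resource name, interpreted by sbatch -t as minutes.
default-resources:
  - time_min=60
  - mem_mb=16000
use-conda: true
EOF
```

A rule can then override the defaults with its own threads: and resources: entries in the Snakefile. Rules that set neither are submitted with one core and the fallback values.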


Configure and run

Open the pipeline's own config.yaml (the one shipped with the pipeline, not the profile file in ~/.config/snakemake) and set the input and output paths, keeping the variable names already in the file:

OUTDIR: /path/to/output
READS_DIR: /path/to/reads/
ASSEMBLY: /path/to/assembly
PREFIX: output_name

Because pipelines can take a long time, run Snakemake inside a persistent session (screen or tmux) so it keeps running if your connection drops. First do a dry run to check what will happen:

snakemake -np

If the steps and commands look right, run the pipeline with your profile:

snakemake --profile my-pipeline

The jobs are submitted to SLURM and you can follow the progress in your terminal and with the usual tools — see Monitoring Jobs.
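
The dry run and the real run can be wrapped in a small shell function, sketched below. The function name run_pipeline is hypothetical. It also creates the logs_slurm directory used by the example profile, because SLURM does not create the directory named in --output and --error, and jobs fail if it is missing.

```shell
# Sketch of a helper (run_pipeline is a hypothetical name) wrapping the two commands above.
# Call it from the directory that contains the Snakefile.
run_pipeline() {
    # SLURM does not create the log directory itself, so make sure it exists.
    mkdir -p logs_slurm
    # Dry run: print the planned jobs and shell commands; stop here if it fails.
    snakemake -np || return 1
    # Real run: submit through the profile named in the first argument.
    snakemake --profile "$1"
}
```

A typical session might start with tmux new -s my-pipeline, then conda activate my-pipeline, then run_pipeline my-pipeline. Detach with Ctrl-b d and come back later with tmux attach -t my-pipeline.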

Nextflow

Nextflow is another widely used workflow engine, popular in bioinformatics (for example the nf-core pipelines).


See also