These best practices are split into two parts: one for Nextflow users, and one for Nextflow developers. They are focused on our needs at SciLifeLab, where we primarily use HPC and local compute.
Nextflow user
Use a parameter file
When running Nextflow, params are often supplied on the command line:

```bash
nextflow run --reads '/path/to/reads' --reference '/path/to/reference'
```

A more reproducible way is to supply params in a `params.yml` and pass it to `-params-file`:

params.yml
```yaml
reads: '/path/to/reads'
reference: '/path/to/reference'
```

```bash
nextflow run main.nf -params-file params.yml ...
```

`nf-core pipelines launch` can write a `params.json` for you.
Use a launch script to run Nextflow
It’s common practice to run tools directly on the command line, but it can be quite helpful to batch commands in a script that also sets up the environment and cleans up after runs to save disk space.
run_nextflow.sh
```bash
#! /usr/bin/env bash

# Exit script when you encounter an error or undefined variable
set -euo pipefail

# Define some environment variables
PROJECT_STORAGE="/proj/naiss-..."

# Activate shared Nextflow environment unless in a Pixi env
if [ -z "${PIXI_ENVIRONMENT_NAME:-}" ]; then
    set +u
    eval "$(conda shell.bash hook)"
    conda activate "${PROJECT_STORAGE}/conda/nextflow-env"
    set -u
fi

# Store apptainer images in a shared location (exported so Nextflow can see it)
export NXF_APPTAINER_CACHEDIR="${PROJECT_STORAGE}/nobackup/nxf-apptainer-image-cache"

# Run nextflow
nextflow run \
    -ansi-log false \
    -profile uppmax \
    -params-file params.yml \
    -r 3.22.0 \
    nf-core/rnaseq

# Clean up work directory
nextflow clean -f -before last && find work -type d -empty -delete

# Remind myself to commit files
git status
```

- Using `NXF_APPTAINER_CACHEDIR` is the equivalent of setting `apptainer.cacheDir` in the `nextflow.config`.
- `-ansi-log false` ensures a readable log in the SLURM output.
Run Nextflow with a fixed revision
Use the `-r` flag to provide a revision (a version tag, a branch, or a commit ID).

```bash
nextflow run -r 3.22.0 nf-core/rnaseq ...
```

Override resources using a custom config
Nextflow is able to layer configuration files. If you place a `nextflow.config` in your launch directory, Nextflow will automatically load it and combine it with the pipeline’s existing configuration. This can be used to override existing settings, such as process resources like `cpus`, `memory`, or `time`.
nextflow.config
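A minimal sketch of such an override — the full process name and the resource values here are illustrative, not taken from a real pipeline:

```groovy
process {
    // Copy the full process name from the error message (this one is made up)
    withName: 'NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:STAR_ALIGN' {
        cpus   = 12
        memory = 72.GB
        time   = 16.h
    }
    // Cap what any process may request; a lower maximum time improves queue priority
    resourceLimits = [ cpus: 16, memory: 128.GB, time: 24.h ]
}
```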
- Copy the full process name for `withName` from the error message.
- Increase scheduling priority by limiting the maximum allocatable time using `process.resourceLimits`.

Using the process full name 'WORKFLOW:SUBWORKFLOW:PROCESS_NAME' makes sure the setting has the highest selector priority, whereas using the process simple name 'PROCESS_NAME' may mean your config is not applied due to configuration priority.
Use local disks for computation when available
Use node-local disk space on HPC systems when available to improve speed, alleviate network bandwidth usage, and keep intermediate files out of the work directory.
nextflow.config
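A minimal sketch, assuming an UPPMAX-style cluster where `SNIC_TMP` points to node-local storage; the account ID is a placeholder:

```groovy
process {
    // Run each task on the node-local scratch disk
    // (single quotes so the variable is resolved on the compute node)
    scratch = '$SNIC_TMP'
    // Submit tasks to SLURM
    executor = 'slurm'
    // Submit jobs using your project account ID
    clusterOptions = '-A naiss...'
}
```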
- Use `process.scratch` to specify a path on the node’s local storage. The `SNIC_TMP` environment variable is described in the cluster documentation.
- Submit to SLURM with `process.executor`.
- Submit jobs using your project account ID (`-A naiss...`) via `clusterOptions`.
For nf-core workflows, use an existing nf-core config profile:
- Uppmax (Pelle / Bianca): https://nf-co.re/configs/uppmax/
- PDC-KTH (Dardel): https://nf-co.re/configs/pdc_kth/
Small vs large processes
When workflows submit hundreds or thousands of small, fast jobs, the scheduler overhead becomes the bottleneck. Each job submission has ~5-15 seconds of overhead (queueing, starting, cleanup). Instead of submitting each task as a separate job, book an entire node and let Nextflow manage task execution locally as subprocesses from an sbatch script.
run_nextflow.sh
```bash
#! /usr/bin/env bash
#SBATCH -A naiss2026-...
#SBATCH -N 1
#SBATCH -n 50
#SBATCH --mem 300GB
#SBATCH -t 2-00:00:00
#SBATCH -o slurm-%j-nextflow.out

# Exit script when you encounter an error or undefined variable
set -euo pipefail

# Define some environment variables
PROJECT_STORAGE="/proj/naiss-..."

# Activate shared Nextflow environment unless in a Pixi env
if [ -z "${PIXI_ENVIRONMENT_NAME:-}" ]; then
    set +u
    eval "$(conda shell.bash hook)"
    conda activate "${PROJECT_STORAGE}/conda/nextflow-env"
    set -u
fi

# Store apptainer images in a shared location (exported so Nextflow can see it)
export NXF_APPTAINER_CACHEDIR="${PROJECT_STORAGE}/nobackup/nxf-apptainer-image-cache"

# Run nextflow
nextflow run \
    -profile singularity \
    -params-file params.yml \
    -r 3.22.0 \
    nf-core/rnaseq

# Clean up work directory
nextflow clean -f -before last && find work -type d -empty -delete

# Remind myself to commit files
git status
```

- Include SBATCH headers.
- Use a local profile (here `singularity`) if one exists, so tasks run on the allocated node instead of being submitted to SLURM.
Debug inside the work directory
Each task work directory acts as an isolated folder. When a process fails, that directory won’t be used further, so you’re free to explore it.
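A typical session might look like this (the work directory hash and the editor are placeholders):

```bash
cd work/3c/f98a2b...   # (1)
ls -la                 # (2)
less .command.log      # (3)
nano .command.sh       # (4)
bash .command.run      # (5)
```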
1. Change to the task work directory.
2. List the files in the directory, including hidden files.
3. View the process log.
4. Edit the shell file to try and fix the error.
5. Run the `.command.sh` in the process environment (`.command.run` is the wrapper that sets the environment up).
Repeat until you’ve figured out the issue, and then go back and fix the workflow itself.
Finding the process work directory
Use `nextflow log` to find the work directory for a process.
For example:
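(A sketch — the run keyword, the fields, and the process name `ALIGN` are illustrative.)

```bash
nextflow log last -f 'process,status,exit,workdir' -F "process == 'ALIGN'"
```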
- Use `-f` to list the fields you want to see.
- Use `-F` to limit the output with a Groovy expression.
- Use the Nextflow run name, or the keyword `last`, to select which run to examine.
`nextflow log -l` lists the fields available for display.
Nextflow developer
Use a test data set for development
When developing a pipeline, use a test data set that you’ve heavily subsampled. This allows you to iterate quickly. It’s also beneficial for CI tests, and for users who want to quickly try out your workflow.
A development test data set only needs to run the workflow to the end. The final output generally doesn’t need to be coherent or make sense.
Use the correct object input
Files need to be “staged” in a work directory (for `script:` processes). Staging means files are part of the `input:` declaration for checkpointing, and are symlinked into the work directory. Generally, `path` type `input:` declarations should receive `Path` class objects. `Path` objects are created by processes when emitting outputs of type `path`, by channel factories like `channel.fromPath` and `channel.fromFilePairs`, and by Nextflow functions such as `file` or `files`. Passing plain `String` objects instead can lead to portability and reproducibility issues.
main.nf
```groovy
workflow {
    ALIGN (
        channel.fromPath("/path/to/*.dat", checkIfExists: true),
        file("/path/to/reference", checkIfExists: true)
    )
}
```

Using and fetching databases in processes
Use a separate process to fetch a flat-file database needed by your workflow, then pass the output of that process to the process that needs the database. This means the database need only be fetched once, rather than in each process. When combined with the process directive `storeDir`, it can be stored in a central cache for reuse, and as long as it’s present, Nextflow will skip execution of that fetch-database process. Using a parameter to define the `storeDir` location means you can reuse the database across independent workflow runs.
main.nf
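A minimal sketch, assuming a hypothetical `FETCH_DB` process, a downstream `ANALYZE` process, a `params.db_cachedir` parameter, and a placeholder database URL:

```groovy
process FETCH_DB {
    // Outputs are kept here; if they already exist, the process is skipped
    storeDir params.db_cachedir

    output:
    path 'db'

    script:
    """
    mkdir db
    wget -qO- https://example.com/db.tar.gz | tar -xz -C db
    """
}

workflow {
    // No `if` gating needed: storeDir decides whether FETCH_DB actually runs
    FETCH_DB()
    ANALYZE(channel.fromPath(params.input, checkIfExists: true), FETCH_DB.out)
}
```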
- `FETCH_DB` does not need to be gated behind an `if` statement.
- The `storeDir` directive checks whether the database is already stored at that path; otherwise the process runs and fetches it. `params.db_cachedir` provides a means for independent workflow runs to use the same database.
Logging best practices
- Write messages to standard error.
- Capture error messages from tools in logs through file redirection.
- `tee` captures output from stdin, and writes to both stdout and a file.
- When running on a `scratch` disk and an error occurs, the output will still be in `.command.err`/`.command.log`, and not lost to a relinquished allocation.
```groovy
process ANALYZE {
    input:
    tuple val(meta), path(input)

    script:
    """
    echo "[TASK] Starting analysis for ${meta.id}" >&2
    echo "[TASK] Using ${task.cpus} CPUs and ${task.memory}" >&2
    analyze.sh ${input} > output.txt 2> >(tee -a analysis.log >&2)
    echo "[TASK] Analysis complete for ${meta.id}" >&2
    """
}
```

Formatting and linting
- Formatting: Restructure code appearance.
- Linting: Analyse code to detect errors.
Nextflow recently released a linter and formatter.

```bash
nextflow lint [--format] -exclude .pixi -exclude results .
```

This is also included in the Nextflow extension for various IDEs.
Batch short tasks in processes
As previously mentioned, many short tasks increase scheduling overhead.
If you have short tasks, try to batch them into existing processes.
main.nf
```groovy
process BAM_STATS_SAMTOOLS {
    input:
    tuple val(meta), path(bam), path(index)

    output:
    tuple val(meta), path("*.stats")   , emit: stats
    tuple val(meta), path("*.flagstat"), emit: flagstat
    tuple val(meta), path("*.idxstats"), emit: idxstats
    tuple val(task.process), val('samtools'), eval("samtools --version |& sed 's/^.*samtools //; s/Using.*\$//'"), emit: versions, topic: versions

    script:
    """
    samtools stats --threads $task.cpus $bam > ${bam}.stats
    samtools flagstat --threads $task.cpus $bam > ${bam}.flagstat
    samtools idxstats $bam > ${bam}.idxstats
    """
}
```

Alternatively, implement the processes such that they process files in batches.
main.nf
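A minimal sketch, assuming a hypothetical `CHECKSUM` process; the glob and batch size are illustrative:

```groovy
process CHECKSUM {
    input:
    path batch

    output:
    path 'checksums.md5'

    script:
    """
    # Run serially with a for loop; xargs or parallel could parallelise this
    for f in ${batch}; do
        md5sum "\$f" >> checksums.md5
    done
    """
}

workflow {
    channel.fromPath('data/*.txt')
        .buffer(size: 50, remainder: true)  // batch files in groups of 50
        | CHECKSUM
}
```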
- Input lists of files using `collect()`, `buffer()`, or similar channel operators. Alternatively, file paths can be gathered with `collectFile()` and staged as an input file (a file of filenames).
- Commands can be run serially with `for` or `while`, or parallelised using `xargs` or `parallel`.
Summary
- Users:
  - Use a parameter file.
  - Use a launch script.
  - Run with a fixed revision.
  - Override configuration with a custom config.
  - Use scratch disks.
  - Decide to schedule or run locally based on task timing.
  - Debug processes using the work directory.
  - Use `nextflow log` and `nextflow clean`.
- Developers:
  - Include a test data set.
  - Be mindful of object types.
  - Use `storeDir` for databases or other stable files.
  - Log stdout and stderr with `tee` to a file and their respective output streams.
  - Format and lint your code.
  - Batch tasks appropriately for performance.