Nextflow is a workflow manager written in Groovy. In this post, we’ll cover the fundamental concepts you need to get started.
Core concepts
1. Processes
A process is the basic building block of a Nextflow workflow. It represents a single computational task.
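For example, here is a minimal process. This is an illustrative sketch (the COUNT_LINES name and wc command are not from any particular pipeline; the process is reused in the workflow example in section 3), with numbered comments matching the notes below:

process COUNT_LINES {
    input:
    path input_file    // (1)

    script:
    // (2), (3): the script below is Bash, with ${input_file} interpolated by Nextflow
    """
    wc -l < ${input_file}
    """

    output:
    stdout             // (4)
}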
1. Each line under input: corresponds to a channel, and describes the input structure.
2. The default script language is Bash.
3. Variables are used with ${var_name} notation inside scripts.
4. Each line under output: corresponds to an output channel, and describes the captured content structure.
The full description, including all available directives, can be found in Nextflow Documentation - Process Reference.
2. Channels
Channels are the “pipes” that connect processes together, allowing data to flow between them. They can contain single values or multiple items.
workflow {
    main:
    // Create a channel with multiple values
    ch_numbers = channel.of( 1, 2, 3, 4, 5 )

    // Use the channel in a process
    ADD_ONE( ch_numbers )
}

channel.of is a channel factory; channel factories are used to populate channels with data. See Nextflow Documentation - Channel factories for a full list of channel factories.
There are two types of channels: queue channels and value channels. Queue channels “consume” their data (first in, first out), whereas value channels can be read repeatedly.
For example:
process ALIGN_SEQUENCES {
    input:
    tuple val(sample_name), path(reads)
    path reference

    ...
}

workflow {
    ch_reads = channel.fromPath('/path/to/*.reads')
        .map { file -> tuple(file.baseName, file) }
    ch_reference = channel.fromPath('/path/to/reference')
    ALIGN_SEQUENCES (
        ch_reads,               // queue channel
        ch_reference.collect()  // queue channel converted to a value channel using the `collect` operator
    )
}

If ch_reference were not converted, ALIGN_SEQUENCES would only run for the first set of reads, because there would be only one reference file to pair with it.
The Nextflow Documentation - Operators page shows the type of channel each channel operator (see below) returns. Returns: channel means it produces a queue channel, whereas Returns: dataflow value means it produces a value channel.
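For example, a small sketch of the distinction:

workflow {
    ch_items = channel.of( 'a', 'b', 'c' )   // queue channel

    // count() returns a value channel, so ch_total can be
    // read repeatedly by any number of downstream processes
    ch_total = ch_items.count()
    ch_total.view()   // Output: 3
}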
3. Workflows
A workflow orchestrates processes and channels to define the logic of your pipeline.
workflow {
    main:
    // Define your pipeline logic here
    input_ch = channel.fromPath( ['file1.txt', 'file2.txt'] )
    COUNT_LINES( input_ch )
}

4. Channel operators
Operators transform and manipulate channels, often through the use of closures (anonymous functions). A closure is fenced by { }, takes an input variable, and maps it (->) to an expression, e.g. { num -> num * 2 }. Common channel operators include:
- map: Transform each element
  channel.of(1, 2, 3).map { num -> num * 2 }.view() // Output: 2, 4, 6
- filter: Select elements matching a condition
  channel.of(1, 2, 3, 4).filter { num -> num > 2 }.view() // Output: 3, 4
- collect: Gather all elements into a single list
  channel.of(1, 2, 3).collect().view() // Output: [1, 2, 3]
- view: View the contents of the channel
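Operators can also be chained. A small illustrative sketch:

channel.of(1, 2, 3, 4, 5)
    .filter { num -> num % 2 == 1 }   // keep the odd numbers: 1, 3, 5
    .map { num -> num * 10 }          // scale each element: 10, 30, 50
    .collect()                        // gather into a single list
    .view()                           // Output: [10, 30, 50]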
See the full list of operators at Nextflow Documentation - Operators.
A simple example
Here’s a complete, minimal workflow:
workflow {
    ch_in = channel.of( 1, 'A', true, [ 2, "B" ] )   // (1)
    TASK( ch_in )
    TASK.out.view()
}

process TASK {
    input:
    val thing

    script:
    """
    echo "${thing}"
    """

    output:
    stdout
}

1. Channel data does not have to be the same type (Object class). Here is a channel with a number, string, boolean, and a list.
Key files and folders
- main.nf: Your main workflow script.
- nextflow.config: Configuration file for default parameters, resource settings, and execution profiles.
- work/: Each instance of a process has its own folder in the work directory. It's used for caching and troubleshooting.
- .nextflow/: Hidden directory containing Nextflow metadata, such as run names, work directory paths, and exit statuses.
- .nextflow.log: A hidden log file which records what was happening as the workflow was run. It can be an essential file for troubleshooting.
.
├── .nextflow
│   ├── cache
│   │   └── 5db3b1a2-8b3a-4289-805f-983f2c645401
│   │       ├── db
│   │       │   ├── 000003.log
│   │       │   ├── CURRENT
│   │       │   ├── LOCK
│   │       │   └── MANIFEST-000002
│   │       └── index.loving_lovelace
│   ├── history
│   └── plr
├── .nextflow.log
├── README.md
├── main.nf
├── nextflow.config
└── work
    └── ec
        └── eb1c4f518284c5c0e591d6388360ae
            ├── .command.begin
            ├── .command.err
            ├── .command.log
            ├── .command.out
            ├── .command.run
            ├── .command.sh
            └── .exitcode
Running a workflow
# Run a basic workflow
nextflow run main.nf

Input/output
Inputs
Processes receive data through input channels, typically created by channel factories or by previous processes:
workflow {
    main:
    ch_samples = channel.fromPath('/path/to/*.samples')
    ch_threshold = channel.value(500)
    ANALYZE(
        ch_samples,
        ch_threshold
    )
}

process ANALYZE {
    input:
    path sample_file   // (1)
    val threshold

    script:
    """
    analyze_tool ${sample_file} --threshold ${threshold}
    """
}

1. Input files are “staged” in the work directory using symlinks.
Outputs
Processes emit results that can feed into other processes:
process GENERATE_DATA {
    script:
    """
    echo "Generated data" > output.txt
    """

    output:
    path "output.txt", emit: txt
}

They feed into other processes in the workflow block by using the .out keyword.
workflow {
    main:
    GENERATE_DATA()
    CONSUME_DATA( GENERATE_DATA.out.txt )
}

Publishing outputs
Output files from processes live in their work directories. To make them accessible in a user-friendly way, they should be published. Since Nextflow version 25, the method of publishing files has changed. Previously, outputs were published using the publishDir process directive, defined either in the process script or in the Nextflow config. Now, output files are published via the channels they're emitted in.
workflow {
    main:
    GENERATE_DATA()
    CONSUME_DATA( GENERATE_DATA.out.txt )

    publish:
    my_data = GENERATE_DATA.out.txt
}

output {
    my_data {
        path "my_data"
    }
}

This publishes the txt files to the folder results/my_data. See Nextflow Documentation - Workflow outputs for more information on how to publish outputs.
Configuration
Nextflow configuration is stored in the file nextflow.config. The full range of scopes is documented in Nextflow Documentation - Configuration Options.
Default parameters
Nextflow scripts can have parameters. These are defined in the params scope of the config.
params {
    sequences = '/path/to/data/sequences'
    reference = '/path/to/reference'
}

This could also be written as:

params.sequences = '/path/to/data/sequences'
params.reference = '/path/to/reference'
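Inside main.nf, these parameters are then available through the params object. A minimal sketch, reusing the illustrative ALIGN_SEQUENCES process from earlier:

workflow {
    main:
    // params.sequences and params.reference come from nextflow.config
    ch_sequences = channel.fromPath( params.sequences )
        .map { file -> tuple(file.baseName, file) }
    ch_reference = channel.fromPath( params.reference )
    ALIGN_SEQUENCES( ch_sequences, ch_reference.collect() )
}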
The default values can be overridden by using a double dash followed by the parameter name when running a workflow.

nextflow run main.nf --sequences '/new/path/to/sequences/*.fastx' --reference '/path/to/alternate/reference'

Resuming
By default, Nextflow starts from the beginning of a workflow. To continue from where you last left off, either use -resume with nextflow run or enable it in the config.
resume = true

Executor
By default, Nextflow runs locally using the local executor. You can reconfigure it to use a job submission system like SLURM using:
executor {
    name = 'slurm'
}

Using software
If no package manager is used with Nextflow, any command-line tool you call is expected to be available on your PATH environment variable.
Conda
Nextflow supports conda as a package manager, where you can define conda environments per process using the conda process directive. This needs to be enabled in the config to activate the environments. Simply having the conda process directive is not enough to activate an environment.
process SAMTOOLS_VERSION {
    // directives
    conda "bioconda::samtools=1.22.1"

    script:
    """
    samtools --version
    """
}

nextflow.config:
conda {
    enabled = true
}

Containers
Nextflow supports multiple container platforms, such as Docker and Apptainer. Images are defined using the container process directive, and the appropriate config scope then enables the use of that image with the container platform.
process SAMTOOLS_VERSION {
    // directives
    container "community.wave.seqera.io/library/samtools:1.22.1--eccb42ff8fb55509"

    script:
    """
    samtools --version
    """
}

nextflow.config:

docker {
    enabled = true
}

Seqera Containers is a useful place to quickly build container images for tools hosted on Conda and PyPI.
Programming in Nextflow
In many languages we’re used to logic control like if, for, and while. Nextflow, however, is designed around a streaming paradigm, and the use of logic control is different.
It’s best to think of nextflow run as a two-step process: compile time and run time. In the compile-time step, if statements determine which processes get included in the workflow to execute; for and while loops are never used. There is no data flow at this point, just the building of the Directed Acyclic Graph (DAG), the internal representation of the workflow.

Once the DAG is generated, the run-time step happens, and the data flows through the graph. Channel operators are the logic control structures now. Since if statements cannot test conditions on the data in channels (or on channels themselves), one should use operators like filter or branch, which test whether conditions on the data are true. The equivalent of for or while loops is simply putting data into channels: when data is put into a channel, it’s passed as input to a process to spawn a task. If a process in the DAG doesn’t receive input, it simply doesn’t spawn a task.
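As a small sketch of run-time logic control (the channel contents and names are illustrative):

workflow {
    ch_numbers = channel.of( 1, 15, 42, 7 )

    // branch acts as the run-time "if": each element is routed
    // to the first branch whose condition it satisfies
    ch_routed = ch_numbers.branch { num ->
        small: num < 10
        large: num >= 10
    }

    ch_routed.small.view { num -> "small: ${num}" }
    ch_routed.large.view { num -> "large: ${num}" }
}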
Useful resources
- Nextflow training material: https://training.nextflow.io/latest/hello_nextflow/01_hello_world/
- Nextflow documentation: https://www.nextflow.io/docs/latest/index.html
- Nextflow Slack help: https://www.nextflow.io/slack-invite.html
- Nextflow Community forum: https://community.seqera.io/