29-Nov-2024
Goal: Create a workflow to trim and compress FASTQ files
Using a bash script:
trimfastq.sh
for input in *.fastq
do
    # Strip the .fastq suffix to get the sample name
    sample="${input%.fastq}"
    # 1. Trim fastq file (trim 5 bp from left, 10 bp from right)
    seqtk trimfq -b 5 -e 10 "$input" > "${sample}.trimmed.fastq"
    # 2. Compress fastq file
    gzip -c "${sample}.trimmed.fastq" > "${sample}.trimmed.fastq.gz"
    # 3. Remove intermediate files
    rm "${sample}.trimmed.fastq"
done
Using Snakemake rules:
Snakefile
$ snakemake -c 1 a.trimmed.fastq.gz b.trimmed.fastq.gz
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count
---------- -------
gzip 2
trim_fastq 2
total 4
Select jobs to execute...
Execute 1 jobs...
[Tue Nov 19 23:09:00 2024]
localrule trim_fastq:
input: b.fastq
output: b.trimmed.fastq
jobid: 3
reason: Missing output files: b.trimmed.fastq
wildcards: sample=b
resources: tmpdir=/var/folders/wb/jf9h8kw11b734gd98s6174rm0000gp/T
[Tue Nov 19 23:09:01 2024]
Finished job 3.
1 of 4 steps (25%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Nov 19 23:09:01 2024]
localrule gzip:
input: b.trimmed.fastq
output: b.trimmed.fastq.gz
jobid: 2
reason: Missing output files: b.trimmed.fastq.gz; Input files updated by another job: b.trimmed.fastq
wildcards: sample=b
resources: tmpdir=/var/folders/wb/jf9h8kw11b734gd98s6174rm0000gp/T
[Tue Nov 19 23:09:02 2024]
Finished job 2.
2 of 4 steps (50%) done
Removing temporary output b.trimmed.fastq.
Select jobs to execute...
Execute 1 jobs...
[Tue Nov 19 23:09:02 2024]
localrule trim_fastq:
input: a.fastq
output: a.trimmed.fastq
jobid: 1
reason: Missing output files: a.trimmed.fastq
wildcards: sample=a
resources: tmpdir=/var/folders/wb/jf9h8kw11b734gd98s6174rm0000gp/T
[Tue Nov 19 23:09:02 2024]
Finished job 1.
3 of 4 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
[Tue Nov 19 23:09:02 2024]
localrule gzip:
input: a.trimmed.fastq
output: a.trimmed.fastq.gz
jobid: 0
reason: Missing output files: a.trimmed.fastq.gz; Input files updated by another job: a.trimmed.fastq
wildcards: sample=a
resources: tmpdir=/var/folders/wb/jf9h8kw11b734gd98s6174rm0000gp/T
[Tue Nov 19 23:09:03 2024]
Finished job 0.
4 of 4 steps (100%) done
Removing temporary output a.trimmed.fastq.
Complete log: .snakemake/log/2024-11-19T230900.634412.snakemake.log
From the Snakemake documentation:
“A Snakemake workflow is defined by specifying rules in a Snakefile.”
“Rules decompose the workflow into small steps.”
“Snakemake automatically determines the dependencies between the rules by matching file names.”
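The file-name matching described in the last quote can be illustrated with a small Python sketch. This is not Snakemake's implementation: the RULES table and the helpers match_output and plan are hypothetical names that only mimic how a requested target such as a.trimmed.fastq.gz is matched against output patterns to fill in {sample} and find the jobs that must run.

```python
import re

# Hypothetical, simplified rule table mirroring the trim/gzip example:
# each rule maps one output pattern to one input pattern.
RULES = {
    "gzip": {"output": "{sample}.trimmed.fastq.gz", "input": "{sample}.trimmed.fastq"},
    "trim_fastq": {"output": "{sample}.trimmed.fastq", "input": "{sample}.fastq"},
}


def match_output(pattern, filename):
    """Return the wildcard values if filename fits pattern, else None."""
    # Split "{sample}.trimmed.fastq" into literal text and wildcard names;
    # odd positions of the split result hold the wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    m = re.fullmatch(regex, filename)
    return m.groupdict() if m else None


def plan(target, existing):
    """List (rule, input, output) jobs needed to build target, upstream jobs first."""
    if target in existing:
        return []  # file is already there, nothing to do
    for name, rule in RULES.items():
        wildcards = match_output(rule["output"], target)
        if wildcards is not None:
            source = rule["input"].format(**wildcards)
            return plan(source, existing) + [(name, source, target)]
    raise FileNotFoundError(f"No rule produces {target}")


if __name__ == "__main__":
    for job in plan("a.trimmed.fastq.gz", existing={"a.fastq"}):
        print(job)
```

Requesting a.trimmed.fastq.gz when only a.fastq exists yields a trim_fastq job followed by a gzip job, the same order as in the log above.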
$ snakemake -c 1 a.trimmed.fastq.gz b.trimmed.fastq.gz
$ snakemake -c 1 a.trimmed.fastq.gz
Example from the practical tutorial
make_supplementary:
$ snakemake -c 1 results/supplementary.html
$ touch results/bowtie2/NCTC8325.1.bt2
$ snakemake -c 1 results/supplementary.html
threads directive: specifies the maximum number of threads for a rule
resources directive: specifies requirements such as disk/memory and runtime
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    log: "logs/{sample}.trim_fastq.log"
    params:
        leftTrim=5,
        rightTrim=10
    threads: 8
    resources:
        mem_mb=64,
        runtime=120
    shell:
        """
        seqtk trimfq -t {threads} -b {params.leftTrim} -e {params.rightTrim} {input} > {output} 2> {log}
        """
conda or container directive: defines the software environment in which a rule runs
rule trim_fastq:
    output: temp("{sample}.trimmed.fastq")
    input: "{sample}.fastq"
    log: "logs/{sample}.trim_fastq.log"
    params:
        leftTrim=5,
        rightTrim=10
    threads: 8
    resources:
        mem_mb=64,
        runtime=120
    conda: "envs/seqtk.yaml"
    container: "docker://quay.io/biocontainers/seqtk"
    shell:
        """
        seqtk trimfq -t {threads} -b {params.leftTrim} -e {params.rightTrim} {input} > {output} 2> {log}
        """
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html