Variant calling workflows

Making your work reproducible

Per Unneberg

Motivation

Manual variant calling

bwa index ref/M_aurantiacus_v1.fasta
samtools faidx ref/M_aurantiacus_v1.fasta
bwa mem -R "@RG\tID:SRR9309790\tSM:PUN-Y-INJ\tPL:ILLUMINA" -t 4 -M \
    ref/M_aurantiacus_v1.fasta \
    fastq/PUN-Y-INJ_R1.fastq.gz \
    fastq/PUN-Y-INJ_R2.fastq.gz | \
    samtools sort - | \
    samtools view --with-header --output bam/PUN-Y-INJ.bam

This quickly becomes complex - not to mention tedious and boring. With larger sample sizes it becomes difficult to keep track which commands need updating should input data change.

Solution

Workflow managers!

Of inputs and outputs

  • Inputs are connected to outputs
  • Arrows correspond to dependencies and transformation/generation of new data
  • Workflow managers define rules that link inputs to outputs with an action
rule markdup:
    input: map/PUN-Y-INJ.bam
    output: markdup/PUN-Y-INJ.bam
    action: run picard MarkDuplicates

Workflow managers

Nextflow

process GATK4_MARKDUPLICATES {

    input:
    path  bam
    path  fasta
    path  fasta_fai

    output:
    tuple val(meta), path("*bam"),      emit: bam
    tuple val(meta), path("*.bai"),     emit: bai
    tuple val(meta), path("*.metrics"), emit: metrics

    script:
    -- snip --

    gatk MarkDuplicates $input_list ...
  • Groovy syntax
  • becoming a standard in production settings

Snakemake Snakemake

rule mark_duplicates:
    input:
        bam = "map/{prefix}.bam",
        bai = "map/{prefix}.bai",
        fasta = "ref/M_aurantiacus_v1.fasta"
    output:
        bam = "markdup/{prefix}.bam",
    shell:
        "gatk MarkDuplicates..."
  • Python-like syntax

nf-core and Sarek

nf-core logo

A global community effort to collect a curated set of open‑source analysis pipelines built using Nextflow.

  • huge community base
  • lots of curated workflows

nf-core sarek logo

Analysis pipeline to detect germline or somatic variants (pre-processing, variant calling and annotation) from WGS / targeted sequencing

  • focusses on variant calling in human
  • does not produce all-sites variant file!

snpArcher - a Snakemake workflow for nonmodel organisms

snpArcher logotype

snpArcher is a reproducible workflow optimized for nonmodel organisms and comparisons across datasets, built on the Snakemake workflow management system. It provides a streamlined approach to dataset acquisition, variant calling, quality control, and downstream analysis.

Our Snakemake workflow

  1. raw read QC
  2. read mapping
    1. mapping QC
  3. duplicate marking
  4. raw variant calling
  5. base quality score recalibration (bqsr)
  6. bqsr variant calling
  7. genotyping
    1. variant statistics
  8. QC report generation
snakemake_dag 0 all 1 multiqc 1->0 2 fastqc 2->1 3 bcftools_stats 3->1 4 gatk_genotype_gvcfs 4->3 5 gatk_combine_gvcfs 5->4 6 gatk_haplotypecaller_bqsr 6->5 7 gatk_apply_bqsr 7->6 8 picard_mark_duplicates 8->1 8->7 11 gatk_base_recalibrator 8->11 12 gatk_haplotypecaller_raw 8->12 9 bwa_mem 9->8 15 qualimap_bamqc 9->15 10 bwa_index 10->9 11->7 12->11 13 picard_create_sequence_dictionary 13->4 13->6 13->12 14 samtools_faidx 14->4 14->6 14->12 15->1

:::

:::

::::