E. coli genome assembly project

Exercise

This is a team project, so split up the workload as you see fit.

You have three datasets from the E. coli K-12 substrain MG1655, sequenced using Illumina, PacBio, and Nanopore. The data are in:

/proj/sllstore2017027/workshop-GA2018/data/QC_files/Escherichia_coli

The aim is for you to explore different assemblers and compare what you get. It is not feasible to evaluate every combination of assembler and dataset, so choose your tasks wisely. Document your commands and share them with each other.
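
To compare what the different assemblers produce, a few summary numbers per assembly are often enough for a first look. The sketch below is one possible way to get contig count, total length, and N50 from a FASTA file using only awk and sort. The function name `fasta_stats` is made up for this example, and it is not a replacement for a full evaluation tool.

```shell
# Print contig count, total length, and N50 for a FASTA file.
# fasta_stats is a hypothetical helper name, not part of any assembler.
fasta_stats() {
    # First awk: one line per contig with its length (handles wrapped sequences).
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
    sort -rn |
    # Second awk: lengths arrive longest first; N50 is the length of the
    # contig at which the running sum reaches half of the total.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { n50 = lens[i]; break }
             }
             printf "contigs=%d total=%d N50=%d\n", NR, total, n50
         }'
}
```

Usage: `fasta_stats assembly.fasta`, where `assembly.fasta` stands for whichever contig file an assembler wrote.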

Working directories have been created for each team:

/proj/sllstore2017027/nobackup_GA2018/team_Turtle
/proj/sllstore2017027/nobackup_GA2018/team_Wolf
/proj/sllstore2017027/nobackup_GA2018/team_Rhino
/proj/sllstore2017027/nobackup_GA2018/team_Rooster

Illumina data.

Running SPAdes example:

spades.py -k 21,33,55 --careful --pe1-1 "$READ1" --pe1-2 "$READ2" -o "${PREFIX}-spades_assembly"
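
The SPAdes and MaSuRCA examples use the shell variables READ1, READ2, and PREFIX, which have to be set before the commands are run. A minimal sketch of one way to set them follows. The file names are taken from the ABySS example, while `READ_DIR` and the value chosen for PREFIX are assumptions made for illustration. Full paths are used because MaSuRCA needs them.

```shell
# Assumed location and file names; adjust to where your cleaned reads are.
# READ_DIR is a hypothetical variable; it defaults to the current directory.
READ_DIR="${READ_DIR:-$PWD}"
PREFIX="SRR492065"
READ1="$READ_DIR/${PREFIX}_cleaned_R1.fastq.gz"
READ2="$READ_DIR/${PREFIX}_cleaned_R2.fastq.gz"

# Warn early if a read file is missing, before starting a long assembly.
for f in "$READ1" "$READ2"; do
    [ -e "$f" ] || echo "warning: $f not found" >&2
done
```

If you set `READ_DIR` yourself, give it as an absolute path so that READ1 and READ2 are absolute too.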

Running ABySS example:

abyss-pe name=abyss_k35_cleaned k=35 in='SRR492065_cleaned_R1.fastq.gz SRR492065_cleaned_R2.fastq.gz'

Running MaSuRCA:

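# MaSuRCA needs a config file. You can write it in an editor such as nano instead of using cat.
# MaSuRCA needs the full path to the reads, so $READ1 and $READ2 must be absolute paths.
# PE line format: PE= <two-letter prefix> <mean insert size> <insert size stdev> <forward reads> <reverse reads>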
cat <<-EOF > "${PREFIX}_masurca.cfg"
DATA
PE= pe 500 50 $READ1 $READ2
END

PARAMETERS
GRAPH_KMER_SIZE = auto
END
EOF
# Generate the assembly script (assemble.sh) from the config file, then run it
masurca "${PREFIX}_masurca.cfg"
bash assemble.sh

Running Pilon:

# Pilon polishes an existing assembly; the BAM file must be coordinate-sorted and indexed
java -jar "$PILON_HOME/pilon.jar" --genome genome.fasta --frags reads_aligned_to_assembly.bam
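
Pilon takes reads that are already aligned to the assembly, so an alignment step comes first. The sketch below shows one possible way to produce the BAM file and run a single polishing round. It assumes that bwa and samtools (version 1.3 or later) are on your PATH and that PILON_HOME is set. The function name `pilon_round`, the thread counts, and the output names are placeholders chosen for this example.

```shell
# Sketch: align paired-end reads to a draft assembly, then run one Pilon round.
# pilon_round is a hypothetical helper; bwa, samtools, and PILON_HOME are assumed.
pilon_round() {
    local assembly=$1 read1=$2 read2=$3 out=$4
    bwa index "$assembly" &&
    bwa mem -t 8 "$assembly" "$read1" "$read2" |
        samtools sort -@ 4 -o "$out.bam" - &&
    samtools index "$out.bam" &&
    java -jar "$PILON_HOME/pilon.jar" --genome "$assembly" \
        --frags "$out.bam" --output "$out"
}
```

Usage: `pilon_round genome.fasta "$READ1" "$READ2" genome_pilon1`. The block only defines the function; nothing runs until you call it.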

PacBio data.

Running Canu:

canu -p ecoli -d ecoli-pacbio useGrid=false genomeSize=4.6m -pacbio-raw pacbio.fastq

Running Minimap2 and Miniasm:

# Overlap PacBio reads against themselves (use "-x ava-ont" for Nanopore reads)
minimap2 -x ava-pb -t 8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm -f reads.fq reads.paf.gz > reads.gfa
# GFA to FASTA
awk ' /^S/ {print ">seq" $2 "\n" $3 } ' reads.gfa > miniasm_assembly.fasta
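
To see what the GFA to FASTA step does, it can help to run the same awk program on a tiny hand-made GFA file. In the sketch below the segment names and sequences are invented for illustration and are not real miniasm output. The wrapper name `gfa_to_fasta` is also made up for this example.

```shell
# Convert GFA segment (S) lines to FASTA records: column 2 is the segment
# name, column 3 is its sequence. Other line types (e.g. L for links) are skipped.
gfa_to_fasta() {
    awk -F'\t' '/^S/ { print ">seq" $2 "\n" $3 }' "$1"
}

# Tiny invented GFA: two segments and one link between them.
demo=$(mktemp)
printf 'S\tutg000001l\tACGTACGT\nL\tutg000001l\t+\tutg000002l\t+\t0M\nS\tutg000002l\tGGCC\n' > "$demo"
gfa_to_fasta "$demo"
rm -f "$demo"
```

Only the two S lines end up in the output, one FASTA record each; the L line is dropped.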

Running Wtdbg2:

# assemble long reads (wtdbg2 is found via PATH, see "How to load the tools")
wtdbg2 -t 16 -i reads.fa.gz -fo prefix -L 5000
# derive consensus
wtpoa-cns -t 16 -i prefix.ctg.lay -fo prefix.ctg.lay.fa

Running Racon:

racon [options ...] <sequences> <overlaps> <target sequences>

    <sequences>
        (reads) input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences used for correction
    <overlaps>
        (reads aligned to assembly) input file in MHAP/PAF/SAM format (can be compressed with gzip)
        containing overlaps between sequences and target sequences
    <target sequences>
        (assembly) input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences which will be corrected
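
As a concrete instance of the usage above, one polishing round can be sketched as follows: map the reads to the assembly with minimap2 to get the overlaps in PAF format, then pass reads, overlaps, and assembly to racon. The function name `racon_round`, the thread count, and the file names are assumptions made for this example. The preset defaults to `map-pb` for PacBio reads; pass `map-ont` for Nanopore reads.

```shell
# Sketch: one Racon polishing round.
# racon_round is a hypothetical helper; minimap2 and racon are assumed to be on PATH.
racon_round() {
    local reads=$1 assembly=$2 out=$3 preset=${4:-map-pb}
    # Overlaps: reads aligned to the assembly, in PAF format
    minimap2 -x "$preset" -t 8 "$assembly" "$reads" > "$out.paf" &&
    # Argument order: <sequences> <overlaps> <target sequences>
    racon -t 8 "$reads" "$out.paf" "$assembly" > "$out.fasta"
}
```

Usage: `racon_round reads.fq assembly.fasta assembly_racon1`. To run a second round, call it again with `assembly_racon1.fasta` as the assembly.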

Nanopore data.

Running Canu:

canu -p ecoli -d ecoli-oxford useGrid=false genomeSize=4.6m -nanopore-raw oxford.fasta

Running Medaka:

# -o names an output directory; the polished assembly is written to consensus.fasta inside it
medaka_consensus -i reads.fasta -d assembly.fasta -o medaka_consensus -t 10

How to load the tools.

You have already used most of the tools needed for this task. Here is how to load the ones you have not encountered yet.

SPAdes:

module load bioinfo-tools spades/3.12.0

ABySS:

module load bioinfo-tools abyss/2.0.2

MaSuRCA:

module load bioinfo-tools MaSuRCA/3.2.3

Canu:

module load bioinfo-tools canu/1.7

Pilon:

module load bioinfo-tools Pilon/1.22

Racon:

conda activate GA2018

Minimap2:

conda activate GA2018

Miniasm:

conda activate GA2018

Wtdbg2:

export PATH="$PATH:/proj/sllstore2017027/workshop-GA2018/tools/wtdbg2"

Medaka:

source /proj/sllstore2017027/workshop-GA2018/tools/medaka/venv/bin/activate
# To leave the virtual environment again, run:
deactivate
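
After loading modules and activating environments, it can be worth checking that the commands are actually found before starting a long job. The sketch below is one way to do that. The function name `missing_tools` is made up for this example, and the command names are the ones used in the examples above. Pilon is not included because it is run through java rather than as a command on PATH.

```shell
# Print every given command that is not found on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Anything printed here still needs its module, conda environment, or PATH entry.
missing_tools spades.py abyss-pe masurca canu minimap2 miniasm racon \
    wtdbg2 wtpoa-cns medaka_consensus
```

Tools from the conda environment and the Medaka virtual environment will be listed as missing until the corresponding environment is active.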