This is a team project, so split up the workload as you see fit.
You have three datasets from the E. coli K-12 substrain MG1655, sequenced using Illumina, PacBio, and Nanopore. They are located in:
/proj/sllstore2017027/workshop-GA2018/data/QC_files/Escherichia_coli
The aim is for you to explore different assemblers and compare the results. It will be impossible to evaluate every combination, so choose your tasks wisely. Document your commands and share them with each other.
Working directories have been created for each team:
/proj/sllstore2017027/nobackup_GA2018/team_Turtle
/proj/sllstore2017027/nobackup_GA2018/team_Wolf
/proj/sllstore2017027/nobackup_GA2018/team_Rhino
/proj/sllstore2017027/nobackup_GA2018/team_Rooster
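The example commands that follow refer to the shell variables READ1, READ2, and PREFIX without defining them. A minimal sketch of how they might be set is shown here. The read file names and the prefix are hypothetical placeholders, so list the data directory and substitute the real names. Absolute paths are used because MaSuRCA needs the full path to the reads.

```shell
# Data directory given in the introduction.
DATA_DIR="/proj/sllstore2017027/workshop-GA2018/data/QC_files/Escherichia_coli"

# List the available read files (skipped if the directory is not reachable).
if [ -d "$DATA_DIR" ]; then
    ls -lh "$DATA_DIR"
fi

# Hypothetical file names: replace them with the names printed by ls.
READ1="$DATA_DIR/illumina_R1.fastq.gz"
READ2="$DATA_DIR/illumina_R2.fastq.gz"

# Hypothetical prefix used to name output files and directories.
PREFIX="ecoli_illumina"

echo "READ1=$READ1"
echo "READ2=$READ2"
echo "PREFIX=$PREFIX"
```

Run this from inside your team directory so that the output of each assembler ends up there.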
Running SPAdes example:
spades.py -k 21,33,55 --careful --pe1-1 "$READ1" --pe1-2 "$READ2" -o "${PREFIX}-spades_assembly"
Running ABySS example:
abyss-pe name=abyss_k35_cleaned k=35 in='SRR492065_cleaned_R1.fastq.gz SRR492065_cleaned_R2.fastq.gz'
Running MaSuRCA:
# MaSuRCA needs a config file. You can write it with nano instead of cat.
# MaSuRCA needs the full (absolute) path to the reads.
cat <<-EOF > "${PREFIX}_masurca.cfg"
DATA
PE= pe 500 50 $READ1 $READ2
END
PARAMETERS
GRAPH_KMER_SIZE = auto
END
EOF
# Generate the assemble.sh script from the config file, then run it
masurca "${PREFIX}_masurca.cfg"
bash assemble.sh
Running Pilon:
java -jar "$PILON_HOME/pilon.jar" --genome genome.fasta --frags reads_aligned_to_assembly.bam
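Pilon expects the reads given to --frags as a coordinate-sorted, indexed BAM file. One possible way to produce reads_aligned_to_assembly.bam is sketched here, assuming bwa and samtools (version 1.3 or later) are available on your PATH. Neither tool is named in this exercise, and the read file names and thread count are placeholders.

```shell
# Placeholders: adjust to your own files.
ASSEMBLY="genome.fasta"
READ1="${READ1:-reads_R1.fastq.gz}"
READ2="${READ2:-reads_R2.fastq.gz}"
BAM="reads_aligned_to_assembly.bam"
THREADS=8

make_pilon_bam() {
    # Index the assembly so bwa can align against it.
    bwa index "$ASSEMBLY"
    # Align the paired reads and write a coordinate-sorted BAM.
    bwa mem -t "$THREADS" "$ASSEMBLY" "$READ1" "$READ2" \
        | samtools sort -@ "$THREADS" -o "$BAM" -
    # Pilon needs the BAM index as well.
    samtools index "$BAM"
}

# Only run when the tools and the assembly are actually present.
if command -v bwa >/dev/null 2>&1 \
    && command -v samtools >/dev/null 2>&1 \
    && [ -s "$ASSEMBLY" ]; then
    make_pilon_bam
else
    echo "bwa, samtools, or $ASSEMBLY not found; nothing done" >&2
fi
```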
Running Canu (PacBio reads):
canu -p ecoli -d ecoli-pacbio useGrid=false genomeSize=4.6m -pacbio-raw pacbio.fastq
Running Minimap2 and Miniasm:
# Overlap for PacBio reads (or use "-x ava-ont" for Nanopore read overlapping)
minimap2 -x ava-pb -t8 pb-reads.fq pb-reads.fq | gzip -1 > reads.paf.gz
# Layout (use the same read file that was used for the overlap step)
miniasm -f pb-reads.fq reads.paf.gz > reads.gfa
# GFA to Fasta
awk ' /^S/ {print ">seq" $2 "\n" $3 } ' reads.gfa
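To see what the awk one-liner does, it can be tried on a tiny hand-made GFA file. The segment names and sequences here are invented for illustration. The sketch relies only on the GFA layout in which S (segment) lines carry the name in column 2 and the sequence in column 3.

```shell
# Build a toy GFA with two segments and one link (invented data).
printf 'S\tutg1\tACGT\nL\tutg1\t+\tutg2\t-\t0M\nS\tutg2\tGGCC\n' > toy.gfa

# Keep only segment lines and print them as FASTA records.
awk '/^S/ {print ">seq" $2 "\n" $3}' toy.gfa > toy.fasta

cat toy.fasta
```

Each S line becomes one FASTA record and the L (link) line is dropped. On real data, redirect the output to a file in the same way so that the contigs can be passed on to a polishing step.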
Running Wtdbg2:
# Assemble long reads (wtdbg2 is on your PATH after the export shown in the tool-loading section)
wtdbg2 -t 16 -i reads.fa.gz -fo prefix -L 5000
# Derive consensus
wtpoa-cns -t 16 -i prefix.ctg.lay -fo prefix.ctg.lay.fa
Running Racon:
racon [options ...] <sequences> <overlaps> <target sequences>
    <sequences>
        (reads) input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences used for correction
    <overlaps>
        (reads aligned to assembly) input file in MHAP/PAF/SAM format (can be compressed with gzip)
        containing overlaps between sequences and target sequences
    <target sequences>
        (assembly) input file in FASTA/FASTQ format (can be compressed with gzip)
        containing sequences which will be corrected
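As a concrete instance of the usage above, one round of polishing might look like the sketch here. It assumes PacBio reads and uses minimap2 to produce the overlaps in PAF format. The file names and thread count are placeholders, and for Nanopore reads the minimap2 preset would be map-ont instead of map-pb.

```shell
# Placeholders: adjust to your own files.
READS="pb-reads.fq"
ASSEMBLY="assembly.fasta"
OVERLAPS="reads_vs_assembly.paf"
POLISHED="assembly.racon.fasta"
THREADS=8

polish_with_racon() {
    # <overlaps>: map the reads back to the assembly (PAF output).
    minimap2 -x map-pb -t "$THREADS" "$ASSEMBLY" "$READS" > "$OVERLAPS"
    # racon <sequences> <overlaps> <target sequences>; result goes to stdout.
    racon -t "$THREADS" "$READS" "$OVERLAPS" "$ASSEMBLY" > "$POLISHED"
}

# Only run when the tools and the input files are actually present.
if command -v minimap2 >/dev/null 2>&1 \
    && command -v racon >/dev/null 2>&1 \
    && [ -s "$READS" ] && [ -s "$ASSEMBLY" ]; then
    polish_with_racon
else
    echo "minimap2, racon, or input files not found; nothing done" >&2
fi
```

To run a second round, repeat the two steps with the polished file as the new assembly.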
Running Canu (Nanopore reads):
canu -p ecoli -d ecoli-oxford useGrid=false genomeSize=4.6m -nanopore-raw oxford.fasta
Running Medaka:
medaka_consensus -i reads.fasta -d assembly.fasta -o assembly.consensus.fasta -t 10
You have already used most of the tools needed for this task. Here is how to load the tools you have not encountered already.
SPAdes:
module load bioinfo-tools spades/3.12.0
ABySS:
module load bioinfo-tools abyss/2.0.2
MaSuRCA:
module load bioinfo-tools MaSuRCA/3.2.3
Canu:
module load bioinfo-tools canu/1.7
Pilon:
module load bioinfo-tools Pilon/1.22
Racon:
conda activate GA2018
Minimap2:
conda activate GA2018
Miniasm:
conda activate GA2018
Wtdbg2:
export PATH="$PATH:/proj/sllstore2017027/workshop-GA2018/tools/wtdbg2"
Medaka:
source /proj/sllstore2017027/workshop-GA2018/tools/medaka/venv/bin/activate
# To leave the virtual environment again, use:
deactivate
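After loading modules or activating an environment, it can help to confirm that the executables are actually on your PATH before starting a long job. The helper below is a small sketch; the function name check_tools is made up for this example, and the list contains the executable names used in the commands on this page.

```shell
# Report which of the given executables can be found on PATH.
# Returns the number of missing tools (0 means everything was found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

check_tools spades.py abyss-pe masurca canu racon minimap2 miniasm \
    wtdbg2 wtpoa-cns medaka_consensus \
    || echo "Load the missing tools as described above before running them."
```

Because the conda environment and the Medaka virtual environment are separate, some tools will be reported as missing until the matching environment is active.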