The following exercises are intended to introduce you to the tools involved in assembly validation.
Many of the commands below are wrapped in shell functions to make them easier to run.
You can copy and paste the functions into your terminal window, or save them in a file
and source that file (e.g. source functions.sh).
A function looks like the following:
Once the function has been copied into your terminal, you can call it to run the whole series of commands:
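As a minimal sketch of the pattern, a shell function that bundles a few commands and takes a file as its argument might look like the following. The function `fasta_summary` is hypothetical and is not one of the functions used in these exercises.

```shell
# Hypothetical helper, for illustration only: summarise a FASTA file.
fasta_summary () {
    local fasta="$1"    # path to a FASTA file
    local n_seqs total_bp
    # Count header lines, i.e. the number of sequences.
    n_seqs=$(grep -c '^>' "$fasta")
    # Sum the lengths of all non-header lines to get the total number of bases.
    total_bp=$(grep -v '^>' "$fasta" | awk '{ bp += length($0) } END { print bp + 0 }')
    echo "${fasta}: ${n_seqs} sequences, ${total_bp} bp"
}
```

After pasting the definition, a call such as `fasta_summary spades_k21-55_full.fasta` would print the number of sequences and the total number of bases in that assembly.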
Exercises
Some assemblies have been provided for you to assess. They can be found here:
This is how the assemblies were made:
| Assembly | Description |
| --- | --- |
| spades_k21-55_full.fasta | spades using k=21,33,55 and SRR492065_{1,2}.fastq.gz |
| spades_k21-127_full.fasta | spades using k=21,33,55,77,99,127 and SRR492065_{1,2}.fastq.gz |
| spades_k21-55_normalized.fasta | spades using k=21,33,55 and SRR492065normalized{1,2}.fastq.gz |
| spades_k21-127_normalized.fasta | spades using k=21,33,55,77,99,127 and SRR492065normalized{1,2}.fastq.gz |
| spades_k21-55_cleaned.fasta | spades using k=21,33,55 and SRR492065cleaned{1,2}.fastq.gz |
| spades_k21-127_cleaned.fasta | spades using k=21,33,55,77,99,127 and SRR492065cleaned{1,2}.fastq.gz |
| shovill_full_spades.fasta | shovill using assembler=spades and SRR492065_{1,2}.fastq.gz |
| shovill_full_megahit.fasta | shovill using assembler=megahit and SRR492065_{1,2}.fastq.gz |
| masurca_cleaned.fasta | masurca using SRR492065cleaned{1,2}.fastq.gz |
| abyss_k35_cleaned.fasta | abyss using k=35 and SRR492065cleaned{1,2}.fastq.gz |
Task 1.
QUAST is a good starting point to help evaluate the quality of assemblies. It provides many helpful contiguity statistics.
Run QUAST on all the assemblies at once and generate a report.
Which assembly looks the best?
Solution
First, run QUAST on all the assemblies.
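A minimal sketch of this step, assuming QUAST is installed as `quast.py` and the assemblies are in the current directory. The function name `run_quast`, the output directory `quast_results`, and the `CPUS` variable are illustrative choices, not taken from the exercise material.

```shell
# Sketch: run QUAST on several assemblies at once.
# run_quast, quast_results and CPUS are illustrative names (assumptions).
run_quast () {
    if [ "$#" -lt 2 ]; then
        echo "Usage: run_quast <output_dir> <assembly.fasta> [more assemblies...]" >&2
        return 1
    fi
    local outdir="$1"
    shift
    # -o sets the output directory, -t the number of threads;
    # every remaining argument is an assembly to evaluate.
    quast.py -o "$outdir" -t "${CPUS:-2}" "$@"
}

# Example call (commented out so that pasting the block only defines the function):
# run_quast quast_results *.fasta
```

QUAST writes a combined summary for all assemblies to `report.txt` and `report.html` in the output directory, which makes it easy to compare statistics such as N50 and total length side by side.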
Task 2.
Select three assemblies of your choice for the remaining tasks.
Read congruency is an important measure of assembly accuracy. Clusters of read pairs that align incorrectly are
strong indicators of misassembly.
How well do the reads align back to the draft assemblies? Use bwa and samtools to assess the basic alignment statistics.
Make a folder for your results.
Then copy this function into your terminal.
To run the function, call it in the following way:
This will then run bwa index, bwa mem, samtools sort, samtools index, and samtools flagstat in the correct
order and with the correct parameters.
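A minimal sketch of such a function, covering the five steps just listed. The name `align_reads`, the argument order, the output file names, and the `CPUS` variable are assumptions made for illustration and may differ from the function provided in the course.

```shell
# Sketch: align paired-end reads to a draft assembly and report basic statistics.
# align_reads, its argument order, the output names and CPUS are assumptions.
align_reads () {
    if [ "$#" -ne 4 ]; then
        echo "Usage: align_reads <assembly.fasta> <reads_1.fastq.gz> <reads_2.fastq.gz> <prefix>" >&2
        return 1
    fi
    local assembly="$1" read1="$2" read2="$3" prefix="$4"
    local cpus="${CPUS:-2}"
    # 1. Index the assembly, 2. align the read pairs, 3. sort the alignments,
    # 4. index the sorted BAM, 5. write summary statistics.
    # Each step runs only if the previous one succeeded.
    bwa index "$assembly" &&
    bwa mem -t "$cpus" "$assembly" "$read1" "$read2" |
        samtools sort -@ "$cpus" -o "${prefix}.bam" - &&
    samtools index "${prefix}.bam" &&
    samtools flagstat "${prefix}.bam" > "${prefix}.flagstat.txt"
}

# Example call (commented out; the prefix is an arbitrary label):
# align_reads spades_k21-55_full.fasta SRR492065_1.fastq.gz SRR492065_2.fastq.gz spades_k21-55_full
```

In the flagstat output, the percentage of mapped reads and of properly paired reads are the most useful numbers to compare between assemblies.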
Solution
Task 3.
Use the alignments to search for signals of misassembly.
Plot the FRC curves (${PREFIX}_FRC.txt) together in a single plot using Gnuplot. Which assembly has the best feature response curve?
How do these results compare to the QUAST results?
Solution
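One way to overlay the curves is to generate a small Gnuplot script from the shell. This is a sketch under assumptions: each `*_FRC.txt` file is taken to hold two whitespace-separated columns (feature threshold, then coverage), and the function names, axis labels, and output image name are illustrative.

```shell
# Sketch: build a Gnuplot script that overlays several FRC curves.
# Assumes two columns per file (feature threshold, coverage); names are illustrative.
frc_gnuplot_script () {
    local out="$1"
    shift
    local plot_cmd="" f
    for f in "$@"; do
        # Add one curve per file, labelled with the file name minus "_FRC.txt".
        plot_cmd="${plot_cmd}${plot_cmd:+, }'${f}' using 1:2 with lines title '${f%_FRC.txt}'"
    done
    printf '%s\n' \
        "set terminal png size 1000,700" \
        "set output '${out}'" \
        "set xlabel 'Feature threshold'" \
        "set ylabel 'Coverage'" \
        "set key bottom right noenhanced" \
        "plot ${plot_cmd}"
}

# Generate the script and hand it to gnuplot in one step.
plot_frc () {
    frc_gnuplot_script "$@" | gnuplot
}

# Example call (commented out so that pasting the block only defines the functions):
# plot_frc frc_curves.png *_FRC.txt
```

A curve that rises steeply covers more of the genome with fewer suspicious features, so the assembly whose curve lies furthest to the upper left is generally the better one.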
Task 4.
Use IGV to load an assembly, BAM file, and the corresponding FRC GFF file.
Note that running a graphical application over the network is a very slow process. If you have
IGV installed on your computer, please download the files locally and view them there. Otherwise
IGV can be started on your node using:
In order for IGV to load the BAM file, you also need to download its index file.
Task 5.
KAT is a useful tool for high-accuracy sequence data. The spectra-cn (copy number spectra) graph shows a
decomposition of k-mers in the assembly vs k-mers in the reads.
The black portion represents k-mers not present in the assembly, the red portion k-mers found once in the assembly, and so on.
This shows the completeness of an assembly, i.e. whether the assembled contigs represent all of the sequence data in the reads.
Use KAT to compare the assembly against the reads.
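A minimal sketch of the comparison using `kat comp`. The function name `run_kat_comp`, the argument order, and the `CPUS` variable are assumptions for illustration. Depending on the KAT version, gzipped FASTQ files may need to be decompressed first.

```shell
# Sketch: compare the k-mers in a read set with the k-mers in an assembly.
# run_kat_comp, its argument order and CPUS are illustrative names (assumptions).
run_kat_comp () {
    if [ "$#" -ne 4 ]; then
        echo "Usage: run_kat_comp <reads_1.fastq> <reads_2.fastq> <assembly.fasta> <prefix>" >&2
        return 1
    fi
    local read1="$1" read2="$2" assembly="$3" prefix="$4"
    # Passing both read files as ONE quoted argument makes KAT treat them
    # as a single read set; the assembly is the second input.
    kat comp -t "${CPUS:-2}" -o "$prefix" "${read1} ${read2}" "$assembly"
}

# Example call (commented out; the read file names and prefix are placeholders):
# run_kat_comp reads_1.fastq reads_2.fastq spades_k21-55_full.fasta kat_spades_k21-55_full
```

KAT writes its k-mer matrix and a spectra-cn plot using the given prefix. A large black area under the main peak suggests read content missing from the assembly.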