What does the gene space look like for each assembly?
Task 2.
Use Mauve to compare the assemblies to the reference.
The reference is here:
Reorder the assemblies with respect to the reference (Tools > Move contigs), and then make an alignment (align with ProgressiveMauve).
Mauve will first ask which directory to store the results in, and then which files to align.
Hint: The reordered assemblies are written to an alignment folder. You can use grep ">" assembly.fasta | less -S to check whether the contigs have been reordered.
Task 3.
Let’s change datasets now and re-circularize a different bacterial assembly.
The assembly was made from PacBio RSII data using PacBio’s HGAP assembler. The result is in FASTQ format.
Use seqtk to convert it to FASTA format in your working directory.
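As a minimal sketch of the conversion step, the commands below run on a toy one-record FASTQ file. The file names are hypothetical, and the awk fallback is an assumption added only so the example runs without seqtk; it handles strict 4-line FASTQ records and is not a general replacement.

```shell
# Toy single-record FASTQ standing in for the HGAP output (hypothetical file name).
printf '@contig1\nACGTACGT\n+\nIIIIIIII\n' > toy_assembly.fastq

if command -v seqtk >/dev/null 2>&1; then
    # seqtk seq -a writes FASTA and discards the quality lines.
    seqtk seq -a toy_assembly.fastq > toy_assembly.fasta
else
    # Fallback for strict 4-line FASTQ records only: keep header and sequence.
    awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' \
        toy_assembly.fastq > toy_assembly.fasta
fi

cat toy_assembly.fasta
```

For the real data, the same seqtk command would be pointed at the assembly FASTQ file instead of the toy file.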
Circular assemblies are written out as linear contigs in which a piece at the end overlaps the beginning.
Use MUMmer to find the coordinates of the overlap.
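One way to read the overlap off a self-alignment is sketched below. It assumes a tab-delimited show-coords table whose first four columns are the start and end in the reference followed by the start and end in the query. The table here is a made-up toy (a 1000 bp contig whose last 100 bp repeat its first 100 bp), and all file names are hypothetical.

```shell
# Toy stand-in for a show-coords table from aligning an assembly to itself,
# e.g. one produced by commands such as (not run here):
#   nucmer --maxmatch --nosimplify -p self assembly.fasta assembly.fasta
#   show-coords -lrcTH self.delta > self.coords
# All values below are invented for illustration.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    1 1000 1 1000 1000 1000 100.00 1000 1000 100.00 100.00 contig1 contig1 \
    1 100 901 1000 100 100 99.00 1000 1000 10.00 10.00 contig1 contig1 \
    901 1000 1 100 100 100 99.00 1000 1000 10.00 10.00 contig1 contig1 \
    > toy_self.coords

# Drop the trivial full-length self-match; what remains is the end overlap.
awk -F'\t' '!($1 == $3 && $2 == $4) { print $1 "-" $2, "matches", $3 "-" $4 }' \
    toy_self.coords > toy_overlap.txt

cat toy_overlap.txt
```

Each overlap appears twice, once from each side, so either remaining row gives the coordinates needed.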
Task 4.
We would like to start this assembly somewhere at the origin of replication, between the genes rpmH and dnaA.
Use Prokka to annotate the assembly and find a point to break the assembly near the origin of replication.
Use Bedtools to check that there are no other genes within this region. Use the start of the rpmH gene as $START
and the end of the dnaA gene as $END. The GFF file written by Prokka is not
strictly formatted as GFF and contains other data. The awk command retains only the needed lines of the file.
The output lists all the genes in this region.
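The region check can be illustrated on a toy Prokka-style GFF file. This sketch uses awk alone rather than Bedtools, so it is a stand-in for the idea and not the exercise's command. The file name, coordinates, and the gene toyX are invented, and it assumes the annotation ends at a ##FASTA line followed by the sequence.

```shell
# Toy Prokka-like GFF: header lines, three features, then the appended FASTA section.
{
    printf '##gff-version 3\n'
    printf '##sequence-region contig1 1 5000\n'
    printf 'contig1\tProdigal\tCDS\t100\t240\t.\t-\t0\tID=toy_1;gene=rpmH\n'
    printf 'contig1\tProdigal\tCDS\t900\t2300\t.\t+\t0\tID=toy_2;gene=dnaA\n'
    printf 'contig1\tProdigal\tCDS\t3000\t3500\t.\t+\t0\tID=toy_3;gene=toyX\n'
    printf '##FASTA\n>contig1\nACGT\n'
} > toy_prokka.gff

START=100   # start of rpmH in the toy data
END=2300    # end of dnaA in the toy data

# Keep only 9-column feature lines before ##FASTA that overlap START..END.
awk -F'\t' -v start="$START" -v end="$END" '
    /^##FASTA/ { exit }    # stop before the sequence section
    /^#/       { next }    # skip header and comment lines
    NF == 9 && $4 <= end && $5 >= start { print $4, $5, $7, $9 }
' toy_prokka.gff > toy_region_genes.txt

cat toy_region_genes.txt
```

If only rpmH and dnaA are listed, as in this toy case, there are no other features in the region.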
Task 5.
Select a point between the genes (not within a gene) to use as a break point.
Use samtools to break the polished assembly at this point. Redirect the output
of both commands to a file called Ecoli_broken.fasta. Modify the commands
as necessary to get a section to overlap at the ends of the contigs.
Merge the broken pieces again using AMOS.
Solution:
The overlap found earlier with MUMmer lies near the end of the contig (4642500-4660550) but does not reach the end (4681865).
To reassemble successfully on the overlap, we need to trim off the part at the end that
does not overlap by leaving it out of the selected region.
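The coordinate arithmetic behind this trimming can be sketched as a dry run that only prints the samtools faidx commands. The overlap end (4660550) comes from the text above. The contig name, the break point, and the input file name are hypothetical placeholders to be replaced with your own values.

```shell
CONTIG="contig1"                  # hypothetical contig name
TRIM_END=4660550                  # end of the overlap reported above
BREAK=2000000                     # hypothetical break point between rpmH and dnaA
ASSEMBLY="polished_assembly.fasta"  # hypothetical input file name

# First piece: from the break point to the end of the overlap,
# leaving out the non-overlapping tail of the contig.
REGION_A="${CONTIG}:${BREAK}-${TRIM_END}"
# Second piece: from the start of the contig to just before the break point.
REGION_B="${CONTIG}:1-$((BREAK - 1))"

# Print the commands instead of running them (dry run).
echo "samtools faidx ${ASSEMBLY} ${REGION_A} >  Ecoli_broken.fasta"
echo "samtools faidx ${ASSEMBLY} ${REGION_B} >> Ecoli_broken.fasta"
```

The end of the first piece then overlaps the start of the second, which is what the merging step relies on.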
Task 6.
Polish the assembly again to reduce errors in the overlap region. Use the reads in your Ecoli_pb.subreads.bam file.
Using the entire dataset takes a long time to run. For this example, subsample the reads using samtools to make the remaining
tools run quicker.
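A possible way to subsample is samtools view with the -s option, which takes a value of the form SEED.FRACTION. The sketch below builds that value and runs the command only if samtools and the BAM file are present. The seed, the fraction, and the output file name are assumptions chosen for illustration.

```shell
SEED=42        # assumed seed, any integer gives a reproducible subsample
FRACTION=0.1   # assumed fraction of reads to keep (10%)

# samtools view -s expects SEED.FRACTION, e.g. 42.1 for seed 42 and 10% of reads.
SUBSAMPLE_ARG="${SEED}${FRACTION#0}"
echo "subsample argument: ${SUBSAMPLE_ARG}"

if command -v samtools >/dev/null 2>&1 && [ -f Ecoli_pb.subreads.bam ]; then
    # -b writes BAM output; the output file name is hypothetical.
    samtools view -b -s "${SUBSAMPLE_ARG}" Ecoli_pb.subreads.bam \
        > Ecoli_pb.subsampled.bam
fi
```

A smaller fraction shortens the run time further, at the cost of lower coverage for polishing.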