NanoPlot and FastQC produce HTML files as output. These can be opened on the login node using firefox,
but running graphical applications across a network is slow. Alternatively, you can download the
HTML files to your computer using either scp or rsync, and then open the downloaded files with your
own web browser.
In a terminal, on your computer (not the login node) type the following:
The command is similar using rsync:
Task 1.
Run NanoPlot on your PacBio data. The results can be opened using firefox.
What are the average and median lengths of the long reads?
Solution - click to expand
The average read length of the PacBio data is 8.6kb, and the median read length is 6.7kb.
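To make the two statistics concrete, here is a minimal Python sketch of how mean and median read length can be computed from FASTQ records. It is not how NanoPlot is implemented; the function name `read_lengths` and the toy records are made up for illustration. It also shows why the mean tends to sit above the median for long-read data: a few very long reads pull the mean upwards.

```python
import statistics

def read_lengths(fastq_lines):
    """Return the sequence length of each record in 4-line FASTQ text."""
    lengths = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record holds the bases
            lengths.append(len(line.strip()))
    return lengths

# Toy records (hypothetical, not taken from the PacBio data set).
toy = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGT", "+", "IIII",
    "@read3", "ACGTACGTACGTACGTACGT", "+", "IIIIIIIIIIIIIIIIIIII",
]

lengths = read_lengths(toy)
mean_len = statistics.mean(lengths)
median_len = statistics.median(lengths)
print(f"mean = {mean_len:.1f} bp, median = {median_len} bp")
```

In the toy data the single 20 bp read raises the mean above the median, the same pattern as the 8.6 kb mean and 6.7 kb median reported for the PacBio reads.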
Task 2.
Run FastQC on your fastq data. Then open the results using firefox or fastqc.
How many sequences are in each fastq file?
Solution - click to expand
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz: 5354356 each
Escherichia_coli/ERR022075_{1,2}.fastq.gz: 22720100 each
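The totals reported by FastQC can be cross-checked independently. The sketch below counts records in a gzipped FASTQ file, assuming the usual layout of exactly four lines per record. The function name `count_fastq_records` and the temporary demo file are hypothetical; on the real data you would pass the path of each `.fastq.gz` file instead.

```python
import gzip
import os
import tempfile

def count_fastq_records(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4

# Self-contained demo on a tiny temporary file (hypothetical data).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fastq.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")

n_records = count_fastq_records(demo_path)
print(n_records)
```

For paired-end data the two files of a pair should give the same count, which is why the solution reports a single number for each pair.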
Solution - click to expand
An expected error rate of 1 in 100 bases (this corresponds to Q20).
Task 6.
What does a quality score of 40 (Q40) mean?
Solution - click to expand
An expected error rate of 1 in 10,000 bases.
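These answers follow from the definition of the Phred quality score, Q = -10 log10(p), where p is the probability that the base call is wrong. The short sketch below converts in both directions; the function names are chosen here for illustration.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: p = {p:g}, about 1 error in {round(1 / p)} bases")
```

Every increase of 10 in the quality score corresponds to a tenfold drop in the expected error rate.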
Task 7.
What distribution should the per base sequence content plot follow?
Solution - click to expand
A uniform distribution.
Task 8.
What value should the per base GC distribution be centered on?
Solution - click to expand
Average GC content.
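As a rough illustration of what "centered on the average GC content" means, the sketch below computes the GC fraction of each read and then the mean over all reads. The reads are made-up toy sequences and `gc_fraction` is a name chosen here; FastQC computes its own plot internally.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# Toy reads (hypothetical).
reads = ["ACGTACGT", "GGCCGGAT", "ATATATGC"]

per_read_gc = [gc_fraction(r) for r in reads]
mean_gc = sum(per_read_gc) / len(per_read_gc)
print(per_read_gc, mean_gc)
```

For a sample from a single organism, the per-read values are expected to scatter around the genome's average GC content. A second peak or a strong shoulder can hint at contamination.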
Task 9.
How much duplication is present in each fastq file?
Solution - click to expand
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz: 29.4% and 17.24%
Escherichia_coli/ERR022075_{1,2}.fastq.gz: 61.71% and 27.87%
Escherichia_coli/Ecoli_pacbio.fastq.gz: 0.12%, but this value is uninformative for PacBio data because of the high error rate.
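The idea behind a duplication percentage can be sketched with exact string matching. This is a simplified stand-in, not FastQC's method: FastQC estimates duplication from a subset of the file, so its numbers will differ from a naive count. The function name and the toy reads are hypothetical.

```python
from collections import Counter

def duplication_percent(seqs):
    """Percentage of reads that are exact copies of another read in seqs."""
    if not seqs:
        return 0.0
    n_unique = len(Counter(seqs))
    return 100.0 * (len(seqs) - n_unique) / len(seqs)

# Toy reads (hypothetical): "ACGT" occurs three times.
reads = ["ACGT", "ACGT", "GGCC", "TTAA", "ACGT"]
dup = duplication_percent(reads)
print(f"{dup:.1f}% duplicated")
```

Because the comparison is an exact match, a single sequencing error makes two copies of the same molecule look different. This is why the value is uninformative for reads with a high error rate.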
Task 10.
What is adapter read through?
Solution - click to expand
When the insert is shorter than the read length, sequencing continues past the end of the insert and into the adapter sequence at the other end.
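A small simulation can make read-through concrete. The sketch below models a fragment as an insert followed by an adapter and "sequences" a fixed number of cycles from its start. The adapter string is the widely used Illumina adapter prefix, included only as an illustrative value; the inserts, read length and function names are made up.

```python
ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter prefix, an assumption here

def simulate_read(insert, adapter, read_length):
    """Sequence read_length cycles from a fragment laid out as insert + adapter."""
    template = insert + adapter
    return template[:read_length]

def adapter_bases_in_read(insert, read_length):
    """Number of read bases that come from the adapter rather than the insert."""
    return max(0, read_length - len(insert))

short_insert = "ACGTTGCAAC"           # 10 bp, shorter than the read length
long_insert = "ACGTTGCAACGGTTAACCGG"  # 20 bp, longer than the read length

short_read = simulate_read(short_insert, ADAPTER, read_length=15)
long_read = simulate_read(long_insert, ADAPTER, read_length=15)

print(short_read, adapter_bases_in_read(short_insert, 15))
print(long_read, adapter_bases_in_read(long_insert, 15))
```

Only the read from the short insert ends in adapter bases. Those trailing bases do not come from the sample, so they need to be trimmed before downstream analysis.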
Task 11.
Let’s look at the adapter sequence in the Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz files. Illumina uses different adapters
for different library types, so it is important to know which adapter sequence was used. For public data, it is often difficult to
find out what the adapters were. Use bbmerge to discover the adapter sequence.
Solution - click to expand
Task 12.
Use the command below to view the reads that have matching adapter sequence in your files.
In the next step we will use Trimmomatic to trim adapters. It needs the correct adapter file. Use
grep to identify the necessary adapter file to use. Trimmomatic’s adapter files can be found in
$TRIMMOMATIC_HOME/adapters/.
Which adapter file should be used?
Solution - click to expand
Therefore the file to use is: `/sw/apps/bioinfo/trimmomatic/0.36/rackham/adapters/TruSeq3-PE-2.fa`.
Since we have paired-end data, we only use the files with PE in their name. Also, as we are looking to remove adapter
read-through, we are searching for the reverse complement of the adapter (`*_rc`). These two sequences are found together
in only one file, so we will use that one.
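For reference, the reverse complement of a sequence is obtained by complementing each base and reversing the result. A minimal Python sketch follows; the function name and the example string are chosen here for illustration and are not taken from the adapter files.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

example = "AGATCGGAAG"  # illustrative sequence only
rc = reverse_complement(example)
print(example, "->", rc)
```

Applying the function twice returns the original sequence, which is a quick sanity check when comparing an adapter against its `_rc` entry.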
Task 13.
Run Trimmomatic on Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz to remove adapters only. How many reads were trimmed for adapters?
Solution - click to expand
Task 14.
Using the bacterial database, run Kraken on Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz.
How many sequences are classified? What are they classified as?
It takes a very long time to build the database, so we have provided a build for you above.
This is how we built the database for you. See the Kraken2 homepage for how to build more comprehensive databases.
Solution - click to expand
To make an image, open the HTML file and click on the snapshot button. Then save the resulting image as SRR492065_kraken_krona.svg.
The Kraken analysis shows at least three organisms in the sample: Enterococcus, Staphylococcus, and Cutibacterium. Enterococcus is more abundant than either Staphylococcus or Cutibacterium, which are present in similar proportions.
Task 15.
Using the references, filter out the reads that align to Staphylococcus and Cutibacterium.