Change directory to the exercise data folder (/proj/sllstore2017027/workshop-GA2018/data/QC_files). Use md5sum to calculate the checksum of all the data files in the exercise data folder. Redirect the checksum values to a file called checksums.md5 in your working directory. Then change directory to your working directory. Your working directory is the directory where you will conduct your analyses.
Solution
Simple solution:
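A minimal sketch of what the simple approach could look like, assuming the data files sit exactly one directory level below the data folder (as the Enterococcus_faecalis/ and Escherichia_coli/ paths used in later tasks suggest). The helper name `checksum_simple` and its two arguments are illustrative only.

```shell
# Hypothetical helper: checksum every file one level below DATA_DIR and
# write the list to OUT_FILE. OUT_FILE must be an absolute path, because
# the redirection happens after the cd.
checksum_simple() (
    data_dir=$1
    out_file=$2
    cd "$data_dir" || exit 1
    # The glob */* matches the files inside each subdirectory.
    md5sum */* > "$out_file"
)
```

From your working directory, the call might be `checksum_simple /proj/sllstore2017027/workshop-GA2018/data/QC_files "$PWD/checksums.md5"`. The paths recorded in checksums.md5 are relative to the data folder, which is what lets `md5sum -c` check a copy later.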
Advanced solution (this is a more generally applicable solution):
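One way to make the command independent of the directory layout is to let `find` locate the files. This is a sketch under that assumption; the helper name `checksum_all` is illustrative.

```shell
# Hypothetical helper: checksum every regular file below DATA_DIR, at any
# depth, and write the list to OUT_FILE (absolute path, outside DATA_DIR
# so the checksum file does not end up listing itself).
checksum_all() (
    data_dir=$1
    out_file=$2
    cd "$data_dir" || exit 1
    # -type f selects regular files only; "{} +" passes many files to
    # each md5sum call instead of starting one process per file.
    find . -type f -exec md5sum {} + > "$out_file"
)
```

Unlike a fixed glob, this keeps working if files are nested more deeply or sit directly in the data folder.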
Task 2.
Copy the exercise files to your working directory, but interrupt the transfer with Ctrl+C.
Use the -c option of md5sum to check whether the files are complete.
Solution
Transfer the files again, this time making sure the files are complete.
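The check itself can be sketched as follows, assuming checksums.md5 holds paths relative to the data folder and the files were copied into the working directory with the same relative layout. The helper name `verify_copy` is illustrative.

```shell
# Hypothetical helper: verify the files in DEST_DIR against CHECKSUM_FILE
# (absolute path). md5sum -c prints OK or FAILED per file and returns a
# non-zero exit status if any file is missing or differs.
verify_copy() (
    dest_dir=$1
    checksum_file=$2
    cd "$dest_dir" || exit 1
    md5sum -c "$checksum_file"
)
```

After an interrupted copy, `verify_copy "$PWD" "$PWD/checksums.md5"` should report at least one FAILED or unreadable file. After a complete transfer (for example `cp -r /proj/sllstore2017027/workshop-GA2018/data/QC_files/* .`), every line should end in OK.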
Task 3.
The PacBio data has been converted from the RS II platform's HDF5 files to the Sequel platform's unaligned BAM format using
the bax2bam tool in the SMRT tools package. Use SMRT tools to extract the FASTQ from the BAM file.
Solution
Only the subreads BAM file needs to be given as an argument. The scraps file contains poor-quality sequence and adapters.
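A possible form of the extraction, assuming the `bam2fastq` program from SMRT tools is on your PATH. The helper name, the output prefix, and the BAM file name are placeholders, since the actual file names are not given here.

```shell
# Hypothetical helper: convert a PacBio subreads BAM file to FASTQ.
# Only the subreads BAM is passed; the scraps BAM is left out on purpose.
subreads_to_fastq() {
    subreads_bam=$1
    out_prefix=$2
    # -o sets the output prefix; bam2fastq adds the file extension itself.
    bam2fastq -o "$out_prefix" "$subreads_bam"
}
```

A call would then look like `subreads_to_fastq <movie>.subreads.bam Ecoli_pacbio`, where `<movie>` stands for the real file name. Depending on the SMRT tools version, the output is typically written as a gzipped FASTQ file.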
Task 4.
What does each tool in this command do?
Solution - click to expand
Task 5.
Load seqtk using:
How many bases in total are in these files?
a. Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz:
b. Escherichia_coli/ERR022075_{1,2}.fastq.gz:
c. Escherichia_coli/Ecoli_pacbio.fastq.gz:
d. Escherichia_coli/Ecoli_nanopore.fasta:
Solution
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz
1070871200 (nucleotides)
Escherichia_coli/ERR022075_{1,2}.fastq.gz
4589460200 (nucleotides)
Escherichia_coli/Ecoli_pacbio.fastq.gz
748508361 (nucleotides)
Escherichia_coli/Ecoli_nanopore.fasta
410782292 (nucleotides)
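Totals like these can be obtained with a short pipeline. The sketch below is one way to do it, assuming seqtk is on your PATH; the helper name `count_bases` is illustrative. It relies on `seqtk seq -A`, which writes every record as FASTA and reads gzipped input directly.

```shell
# Hypothetical helper: total number of bases in one or more FASTA/FASTQ
# files (plain or gzipped).
count_bases() {
    for f in "$@"; do
        # -A forces FASTA output, so quality lines are dropped.
        seqtk seq -A "$f"
    done |
        grep -v '^>' |   # drop FASTA header lines
        tr -d '\n' |     # remove line breaks so only bases remain
        wc -c            # count the remaining characters
}
```

For a paired dataset both files are passed at once, for example `count_bases Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz`.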
Task 6.
How many bases in Escherichia_coli/Ecoli_pacbio.fastq.gz are contained in reads 10kb or longer?
Solution
The -L <int> option of seqtk seq drops sequences shorter than <int> bases.
510546352 (nucleotides)
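The length filter can be combined with a base count as in the sketch below, assuming seqtk is on your PATH. The helper name `count_bases_min_len` is illustrative.

```shell
# Hypothetical helper: count the bases contained in reads of at least
# MIN_LEN bases.
count_bases_min_len() {
    min_len=$1
    reads=$2
    # -L drops reads shorter than MIN_LEN; -A writes FASTA so that only
    # header and sequence lines are left to filter.
    seqtk seq -A -L "$min_len" "$reads" |
        grep -v '^>' |
        tr -d '\n' |
        wc -c
}
```

For this task the call would be `count_bases_min_len 10000 Escherichia_coli/Ecoli_pacbio.fastq.gz`.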
Task 7.
What is the depth of coverage of these datasets?
a. Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz:
b. Escherichia_coli/ERR022075_{1,2}.fastq.gz:
c. Escherichia_coli/Ecoli_pacbio.fastq.gz:
d. Escherichia_coli/Ecoli_nanopore.fasta:
Solution
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz
Searching for the Enterococcus faecalis genome size gives an approximate value of 3.22 Mb. Dividing the total number of bases by the genome size gives the depth of coverage.
Approximately 332x depth of coverage.
Escherichia_coli/ERR022075_{1,2}.fastq.gz
Searching for the Escherichia coli genome size gives an approximate value of 4.6 Mb.
Approximately 998x depth of coverage.
Escherichia_coli/Ecoli_pacbio.fastq.gz
Approximately 163x depth of coverage.
Escherichia_coli/Ecoli_nanopore.fasta
Approximately 89x depth of coverage.
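The arithmetic behind these figures is total bases divided by genome size. A small sketch using awk, with the base counts from Task 5 and the approximate genome sizes quoted above (the helper name `coverage` is illustrative):

```shell
# Depth of coverage = total sequenced bases / genome size in bases.
coverage() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.1f\n", bases / genome }'
}

coverage 1070871200 3220000   # E. faecalis Illumina: 332.6
coverage 4589460200 4600000   # E. coli Illumina:     997.7
coverage 748508361 4600000    # E. coli PacBio:       162.7
coverage 410782292 4600000    # E. coli Nanopore:     89.3
```

Because the genome sizes are themselves approximate, the results are best read as rough estimates rather than exact values.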
Task 8.
Use seqtk to subsample Escherichia_coli/ERR022075_{1,2}.fastq.gz to approximately 100x coverage.
Solution
The dataset has roughly 998x coverage, so 100x corresponds to approximately 10% of the reads. We therefore use a value of 0.1 as the fraction of reads to sample. The output is
piped to gzip to compress the file again.
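A sketch of the subsampling step, assuming seqtk is on your PATH. The helper name, the seed value, and the output prefix are illustrative. The important detail is that both files of a pair are sampled with the same seed, so the same reads are selected from each and the pairing is preserved.

```shell
# Hypothetical helper: subsample a read pair to a given fraction.
subsample_pair() {
    fraction=$1
    seed=$2
    reads_1=$3
    reads_2=$4
    out_prefix=$5
    # Same seed for both files keeps read 1 and read 2 in sync.
    seqtk sample -s "$seed" "$reads_1" "$fraction" | gzip -c > "${out_prefix}_1.fastq.gz"
    seqtk sample -s "$seed" "$reads_2" "$fraction" | gzip -c > "${out_prefix}_2.fastq.gz"
}
```

For this task a call could be `subsample_pair 0.1 100 Escherichia_coli/ERR022075_1.fastq.gz Escherichia_coli/ERR022075_2.fastq.gz ERR022075_100x`.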
Task 9.
Use bbmap to normalize the Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz data.
Solution
As coverage is relatively high, we aim for a target coverage of 100x.
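Normalization in the BBMap package is done with the `bbnorm.sh` script. The sketch below assumes it is on your PATH; the helper name and the output prefix are illustrative.

```shell
# Hypothetical helper: normalize a read pair to a target depth.
normalize_pair() {
    target=$1
    reads_1=$2
    reads_2=$3
    out_prefix=$4
    # in/in2 are the paired inputs, out/out2 the normalized outputs,
    # and target is the desired depth of coverage.
    bbnorm.sh in="$reads_1" in2="$reads_2" \
        out="${out_prefix}_1.fastq.gz" out2="${out_prefix}_2.fastq.gz" \
        target="$target"
}
```

For this task a call could be `normalize_pair 100 Enterococcus_faecalis/SRR492065_1.fastq.gz Enterococcus_faecalis/SRR492065_2.fastq.gz SRR492065_norm`.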