Data Quality Assessment

Exercises

For information about Sequencing technologies and file formats see our Wiki page.

NBIS Genome Assembly Wiki

Task 1.

Change directory to the exercise data folder (/proj/sllstore2017027/workshop-GA2018/data/QC_files). Use md5sum to calculate the checksum of all the data files in the exercise data folder. Redirect the checksum values to a file called checksums.md5 in your working directory. Then change directory to your working directory. Your working directory is the directory where you will conduct your analyses.

Solution - click to expand Simple solution:
# Change to exercise data directory
cd /proj/sllstore2017027/workshop-GA2018/data/QC_files
# Check contents of folder
ls -R
# Use the **relative** paths of the files with md5sum and redirect STDOUT to a file in your working directory
md5sum */* > "$WORKDIR/checksums.md5"
Advanced solution (this is a more generally applicable solution):
cd /proj/sllstore2017027/workshop-GA2018/data/QC_files
# Get the **relative** paths of the files and use md5sum. Redirect the output to a file and screen
find -type "f" -exec md5sum {} \; | tee $WORKDIR/checksums.md5 # redirected output

Task 2.

Copy the exercise files to your working directory, but interrupt transfer with ctrl + c.

echo "$WORKDIR" # to check your $WORKDIR variable is set, otherwise type the path manually.
cd $WORKDIR # Change directory to my working directory
cp -vr /proj/sllstore2017027/workshop-GA2018/data/QC_files/* . # Copy all the files to my working directory

Use the -c option of md5sum to check the files are complete.

Solution - click to expand
md5sum -c checksums.md5

Transfer the files again, this time making sure the files are complete.

Task 3.

The PacBio data has been converted from RSII platforms’ hd5 files to the Sequel platforms’ unaligned BAM format using the bax2bam tool in the SMRT tools package. Use SMRT tools to extract the fastq from the BAM file.

module load bioinfo-tools SMRT/5.0.1
bam2fastq --help
Solution - click to expand Only the subreads BAM file needs to be given as an argument. The scraps file contains poor quality sequence and adapters.
module load SMRT/5.0.1
bam2fastq -o Ecoli_pacbio Ecoli_pb.subreads.bam

Task 4.

What does each tool in this command do?

zcat <fastq.gz> | seqtk seq -A - | grep -v “^>” | tr -dc “ACGTNacgtn” | wc -m
Solution - click to expand
zcat <fastq.gz >     # concatenates compressed files to one output stream
seqtk seq -A -       # seqtk is a toolkit for manipulating sequence data. The -A converts input to fasta output.
grep -v "^>"         # grep searches for lines beginning (^) with the string > and excludes them (-v).
tr -dc "ACGTNacgtn"  # tr translates characters from one set to another. The -dc deletes characters not in the "ACGTNacgtn" set.
wc -m                # wc is the word count tool. wc -m counts characters.

Task 5.

Load seqtk using:

module load bioinfo-tools seqtk/1.2-r101

How many bases in total are in these files?

a. Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz:

b. Escherichia_coli/ERR022075_{1,2}.fastq.gz:

c. Escherichia_coli/Ecoli_pacbio.fastq.gz:

d. Escherichia_coli/Ecoli_nanopore.fasta:

Solution - click to expand

Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz

zcat Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m

1070871200 (nucleotides)

Escherichia_coli/ERR022075_{1,2}.fastq.gz

zcat Escherichia_coli/ERR022075_{1,2}.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m

4589460200 (nucleotides)

Escherichia_coli/Ecoli_pacbio.fastq.gz

zcat Escherichia_coli/Ecoli_pacbio.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m

748508361 (nucleotides)

Escherichia_coli/Ecoli_nanopore.fasta

grep -v "^>" Escherichia_coli/Ecoli_nanopore.fasta | tr -dc "ACGTNacgtn" | wc -m

410782292 (nucleotides)

Task 6.

How many bases in Escherichia_coli/Ecoli_pacbio.fastq.gz are contained in reads 10kb or longer?

Solution - click to expand

The -L <int> option in seqtk drops sequences smaller than <int> bases.

zcat Escherichia_coli/Ecoli_pacbio.fastq.gz | seqtk seq -A -L 10000 - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m

510546352 (nucleotides)

Task 7.

What is the depth of coverage of these datasets?

a. Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz:

b. Escherichia_coli/ERR022075_{1,2}.fastq.gz:

c. Escherichia_coli/Ecoli_pacbio.fastq.gz:

d. Escherichia_coli/Ecoli_nanopore.fasta:

Solution - click to expand

Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz

Searching for the Enterococcus faecalis genome size gives and approximate value of 3.22 Mb.

echo "1070871200 / 3220000" | bc -l

Approximately 332x depth of coverage

Escherichia_coli/ERR022075_{1,2}.fastq.gz

Searching for the Escherichia coli genome size gives and approximate value of 4.6 Mb.

echo "4589460200 / 4600000" | bc -l

Approximately 998x depth of coverage.

Escherichia_coli/Ecoli_pacbio.fastq.gz

echo "748508361 / 4600000" | bc -l

Approximately 163x depth of coverage.

Escherichia_coli/Ecoli_nanopore.fasta

echo "410782292 / 4600000" | bc -l

Approximately 89x depth of coverage.

Task 8.

Use seqtk to subsample Escherichia_coli/ERR022075_{1,2}.fastq.gz to approximately 100x coverage.

Solution - click to expand Since we want approximately 10% of the reads, we use a value of 0.1 as the fraction of reads to sample. The output is piped to gzip to compress the file again.
seqtk sample -s100 Escherichia_coli/ERR022075_1.fastq.gz 0.1 | gzip -c > Escherichia_coli/ERR022075_100x_1.fastq.gz
seqtk sample -s100 Escherichia_coli/ERR022075_2.fastq.gz 0.1 | gzip -c > Escherichia_coli/ERR022075_100x_2.fastq.gz

Task 9.

Use bbmap to normalize the Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz data.

module load bioinfo-tools bbmap/38.08
bbnorm.sh --help
Solution - click to expand As coverage is relatively high, we aim for a target coverage of 100x.
bbnorm.sh in=Enterococcus_faecalis/SRR492065_1.fastq.gz in2=Enterococcus_faecalis/SRR492065_2.fastq.gz \
 out=Enterococcus_faecalis/SRR492065_normalized_1.fastq.gz out2=Enterococcus_faecalis/SRR492065_normalized_2.fastq.gz \
 target=100 min=5