For information about Sequencing technologies and file formats see our Wiki page.
Change directory to the exercise data folder (/proj/sllstore2017027/workshop-GA2018/data/QC_files
). Use md5sum
to calculate the checksum of all the data files in the exercise data folder. Redirect the checksum values to a file called checksums.md5 in your working directory. Then change directory to your working directory. Your working directory is the directory where you will conduct your analyses.
# Change to exercise data directory
cd /proj/sllstore2017027/workshop-GA2018/data/QC_files
# Check contents of folder
ls -R
# Use the **relative** paths of the files with md5sum and redirect STDOUT to a file in your working directory
md5sum */* > "$WORKDIR/checksums.md5"
cd /proj/sllstore2017027/workshop-GA2018/data/QC_files
# Get the **relative** paths of the files and use md5sum. Redirect the output to a file and screen
find -type "f" -exec md5sum {} \; | tee $WORKDIR/checksums.md5 # redirected output
Copy the exercise files to your working directory, but interrupt transfer with ctrl + c
.
echo "$WORKDIR" # to check your $WORKDIR variable is set, otherwise type the path manually.
cd $WORKDIR # Change directory to my working directory
cp -vr /proj/sllstore2017027/workshop-GA2018/data/QC_files/* . # Copy all the files to my working directory
Use the -c
option of md5sum
to check the files are complete.
md5sum -c checksums.md5
Transfer the files again, this time making sure the files are complete.
The PacBio data has been converted from RSII platforms’ hd5 files to the Sequel platforms’ unaligned BAM format using
the bax2bam
tool in the SMRT tools package. Use SMRT tools to extract the fastq from the BAM file.
module load bioinfo-tools SMRT/5.0.1
bam2fastq --help
module load SMRT/5.0.1
bam2fastq -o Ecoli_pacbio Ecoli_pb.subreads.bam
What does each tool in this command do?
zcat <fastq.gz> | seqtk seq -A - | grep -v “^>” | tr -dc “ACGTNacgtn” | wc -m
zcat <fastq.gz > # concatenates compressed files to one output stream
seqtk seq -A - # seqtk is a toolkit for manipulating sequence data. The -A converts input to fasta output.
grep -v "^>" # grep searches for lines beginning (^) with the string > and excludes them (-v).
tr -dc "ACGTNacgtn" # tr translates characters from one set to another. The -dc deletes characters not in the "ACGTNacgtn" set.
wc -m # wc is the word count tool. wc -m counts characters.
Load seqtk
using:
module load bioinfo-tools seqtk/1.2-r101
How many bases in total are in these files?
a. Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz:
b. Escherichia_coli/ERR022075_{1,2}.fastq.gz:
c. Escherichia_coli/Ecoli_pacbio.fastq.gz:
d. Escherichia_coli/Ecoli_nanopore.fasta:
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz
zcat Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m
1070871200 (nucleotides)
Escherichia_coli/ERR022075_{1,2}.fastq.gz
zcat Escherichia_coli/ERR022075_{1,2}.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m
4589460200 (nucleotides)
Escherichia_coli/Ecoli_pacbio.fastq.gz
zcat Escherichia_coli/Ecoli_pacbio.fastq.gz | seqtk seq -A - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m
748508361 (nucleotides)
Escherichia_coli/Ecoli_nanopore.fasta
grep -v "^>" Escherichia_coli/Ecoli_nanopore.fasta | tr -dc "ACGTNacgtn" | wc -m
410782292 (nucleotides)
How many bases in Escherichia_coli/Ecoli_pacbio.fastq.gz are contained in reads 10kb or longer?
The -L <int>
option in seqtk
drops sequences smaller than <int>
bases.
zcat Escherichia_coli/Ecoli_pacbio.fastq.gz | seqtk seq -A -L 10000 - | grep -v "^>" | tr -dc "ACGTNacgtn" | wc -m
510546352 (nucleotides)
What is the depth of coverage of these datasets?
a. Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz:
b. Escherichia_coli/ERR022075_{1,2}.fastq.gz:
c. Escherichia_coli/Ecoli_pacbio.fastq.gz:
d. Escherichia_coli/Ecoli_nanopore.fasta:
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz
Searching for the Enterococcus faecalis genome size gives and approximate value of 3.22 Mb.
echo "1070871200 / 3220000" | bc -l
Approximately 332x depth of coverage
Escherichia_coli/ERR022075_{1,2}.fastq.gz
Searching for the Escherichia coli genome size gives and approximate value of 4.6 Mb.
echo "4589460200 / 4600000" | bc -l
Approximately 998x depth of coverage.
Escherichia_coli/Ecoli_pacbio.fastq.gz
echo "748508361 / 4600000" | bc -l
Approximately 163x depth of coverage.
Escherichia_coli/Ecoli_nanopore.fasta
echo "410782292 / 4600000" | bc -l
Approximately 89x depth of coverage.
Use seqtk
to subsample Escherichia_coli/ERR022075_{1,2}.fastq.gz to approximately 100x coverage.
seqtk sample -s100 Escherichia_coli/ERR022075_1.fastq.gz 0.1 | gzip -c > Escherichia_coli/ERR022075_100x_1.fastq.gz
seqtk sample -s100 Escherichia_coli/ERR022075_2.fastq.gz 0.1 | gzip -c > Escherichia_coli/ERR022075_100x_2.fastq.gz
Use bbmap
to normalize the Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz data.
module load bioinfo-tools bbmap/38.08
bbnorm.sh --help
bbnorm.sh in=Enterococcus_faecalis/SRR492065_1.fastq.gz in2=Enterococcus_faecalis/SRR492065_2.fastq.gz \
out=Enterococcus_faecalis/SRR492065_normalized_1.fastq.gz out2=Enterococcus_faecalis/SRR492065_normalized_2.fastq.gz \
target=100 min=5