Nanoplot and FastQC produce html files as output. These can be opened on the login node using firefox
,
however running graphical applications across a network is slow. Alternatively you can download the
html files to your computer using either scp
or rsync
. You can then open the downloaded files with your
own html browser.
In a terminal, on your computer (not the login node) type the following:
scp username@rackham.uppmax.uu.se:/path/to/files/filename /path/to/download
The command is similar using rsync
:
rsync -av username@rackham.uppmax.uu.se:/path/to/files/filename /path/to/download
Run NanoPlot on your PacBio data. The results can be opened using firefox
.
What are the average and median length of the long reads.
source /proj/sllstore2017027/workshop-GA2018/tools/anaconda/miniconda2/etc/profile.d/conda.sh
conda activate GA2018
NanoPlot --help
NanoPlot --fastq Ecoli_pacbio.fastq.gz
Run FastQC on your fastq data. Then open the results using firefox
or fastqc
.
module load bioinfo-tools FastQC/0.11.5
fastqc --help
How many sequences are in each fastq file?
fastqc -t 6 */*.fastq.gz
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz: 5354356 each
Escherichia_coli/ERR022075_{1,2}.fastq.gz: 22720100 each
Escherichia_coli/Ecoli_pacbio.fastq.gz: 87225
What is the average GC% in each data set?
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz: 40%
Escherichia_coli/ERR022075_{1,2}.fastq.gz: 49%
Escherichia_coli/Ecoli_pacbio.fastq.gz: 49%
Which quality score encoding is used?
What does a quality score of 20 (Q20) mean?
What does a quality score of 40 (Q40) mean?
What distribution should the per base sequence plot follow?
What value should the per base GC distribution be centered on?
How much duplication is present in each fastq file?
Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz: 29.4% and 17.24%
Escherichia_coli/ERR022075_{1,2}.fastq.gz: 61.71% and 27.87%
Escherichia_coli/Ecoli_pacbio.fastq.gz: 0.12% but this value is uninformative for pacbio due to the error rate.
What is adapter read through?
Let’s look at the adapter sequence in the Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz fastq files. Illumina uses different adapters
for different libraries. It is important to know which adapter sequence it is. Since this is public data, it is sometimes difficult to
find out what the adapters were. Use bbmerge
to discover the adapter sequence.
module load bioinfo-tools bbmap/38.08
bbmerge.sh --help
bbmerge.sh in=Enterococcus_faecalis/SRR492065_1.fastq.gz in2=Enterococcus_faecalis/SRR492065_2.fastq.gz outa=adapters.fa
more adapters.fa
>Read1_adapter
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATNTCGTATGCCGTCTTNTGNTT
>Read2_adapter
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
Use the command below to view the reads that have matching adapter sequence in your files.
paste <( zcat Enterococcus_faecalis/SRR492065_1.fastq.gz ) <( zcat Enterococcus_faecalis/SRR492065_2.fastq.gz ) \
| grep -A2 -B1 --colour=always "AGATCGGAAGAGC" | less -SR
In the next step we will use Trimmomatic to trim adapters. It needs the correct adapter file. Use
grep
to identify the necessary adapter file to use. Trimmomatic’s adapter files can be found in
$TRIMMOMATIC_HOME/adapters/
.
module load bioinfo-tools trimmomatic/0.36
ls $TRIMMOMATIC_HOME
ls $TRIMMOMATIC_HOME/adapters
grep -B1 --colour=always "AGATCGGAAGAGCACACGTCTGAACTCC" $TRIMMOMATIC_HOME/adapters/*PE*.fa
grep -B1 --colour=always "AGATCGGAAGAGCGTCGTGTAGGGAAAG" $TRIMMOMATIC_HOME/adapters/*PE*.fa
Which adapter file should be used?
$ grep -B1 --colour=always "AGATCGGAAGAGCACACGTCTGAACTCC" $TRIMMOMATIC_HOME/adapters/*PE*.fa
/sw/apps/bioinfo/trimmomatic/0.36/rackham/adapters/TruSeq3-PE-2.fa->PE2_rc
/sw/apps/bioinfo/trimmomatic/0.36/rackham/adapters/TruSeq3-PE-2.fa:AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
$ grep -B1 --colour=always "AGATCGGAAGAGCGTCGTGTAGGGAAAG" $TRIMMOMATIC_HOME/adapters/*PE*.fa
/sw/apps/bioinfo/trimmomatic/0.36/rackham/adapters/TruSeq2-PE.fa->PCR_Primer1_rc
/sw/apps/bioinfo/trimmomatic/0.36/rackham/adapters/TruSeq2-PE.fa:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
--
/sw/apps/bioinfo/trimmomatic/0.36/rackham/adapters/TruSeq3-PE-2.fa->PE1_rc
/sw/apps/bioinfo/trimmomatic/0.36/rackham/adapters/TruSeq3-PE-2.fa:AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
Run Trimmomatic on Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz to only remove adapters. How many reads were trimmed for adapters?
java -jar $TRIMMOMATIC_HOME/trimmomatic-0.36.jar PE Enterococcus_faecalis/SRR492065_1.fastq.gz Enterococcus_faecalis/SRR492065_2.fastq.gz \
Enterococcus_faecalis/SRR492065_clean_paired_1.fastq.gz Enterococcus_faecalis/SRR492065_clean_unparied_1.fastq.gz \
Enterococcus_faecalis/SRR492065_clean_paired_2.fastq.gz Enterococcus_faecalis/SRR492065_clean_unparied_2.fastq.gz \
ILLUMINACLIP:$TRIMMOMATIC_HOME/adapters/TruSeq3-PE-2.fa:2:30:10
Input Read Pairs: 5354356 Both Surviving: 5339516 (99.72%) Forward Only Surviving: 11947 (0.22%) Reverse Only Surviving: 1888 (0.04%) Dropped: 1005 (0.02%)
Using the bacterial database, run Kraken on Enterococcus_faecalis/SRR492065_{1,2}.fastq.gz. How many sequences are classified? What are they classified as?
module load bioinfo-tools Kraken2/2.0.7-beta-bc14b13 Krona/2.7
KRAKEN2DB=$SNIC_TMP/kraken_bacterial_db
rsync -av "/proj/sllstore2017027/workshop-GA2018/databases/kraken_bacterial_db/" "$KRAKEN2DB"
kraken2 --threads "$CPUS" --db "$KRAKEN2DB" --report "${PREFIX}_kraken.rpt" --gzip-compressed --paired "$READ1" "$READ2" > "${PREFIX}_kraken.tsv"
ktImportTaxonomy <( cut -f2,3 "${PREFIX}_kraken.tsv" ) -o "${PREFIX}_kraken_krona.html"
It takes a very long time to build the database, so we have provided a build for you above.
CPUS=${SLURM_NPROCS:-10}
TMPDB=$(mktemp -u)
kraken2-build --download-taxonomy --db "$SNIC_TMP/$TMPDB"
kraken2-build --download-library bacteria --db "$SNIC_TMP/$TMPDB"
kraken2-build --build --threads "$CPUS" --db "$SNIC_TMP/$TMPDB"
rsync -av "$SNIC_TMP/$TMPDB/" kraken_bacterial_db
CPUS=${SLURM_NPROCS:-10}
READ1=Enterococcus_faecalis/SRR492065_1.fastq.gz
READ2=Enterococcus_faecalis/SRR492065_2.fastq.gz
PREFIX=$(basename "$READ1" _1.fastq.gz )
KRAKEN2DB=$SNIC_TMP/kraken_bacterial_db
rsync -av "/proj/sllstore2017027/workshop-GA2018/databases/kraken_bacterial_db/" "$KRAKEN2DB"
kraken2 --threads "$CPUS" --db "$KRAKEN2DB" --report "${PREFIX}_kraken.rpt" --gzip-compressed --paired "$READ1" "$READ2" > "${PREFIX}_kraken.tsv"
ktImportTaxonomy <( cut -f2,3 "${PREFIX}_kraken.tsv" ) -o "${PREFIX}_kraken_krona.html"
5354356 sequences (1070.87 Mbp) processed in 42.233s (7606.9 Kseq/m, 1521.39 Mbp/m).
4466191 sequences classified (83.41%)
888165 sequences unclassified (16.59%)
To make an image, open the html file and click on the snapshot button. Then save the resulting image to SRR492065_kraken_krona.svg
.
The Kraken analysis shows at least three organisms in the sample; Enterococcus, Staphylococcus, and Cutibacterium. Enterococcus also shows a higher abundance than both Staphylococcus and Cutibacterium, which are both in similar proportions.
Using the references, filter the reads that align to Staphylococcus and Cutibacterium.
/proj/sllstore2017027/workshop-GA2018/data/Illumina_SRR492065/Staphylococcus_aureus.fasta
/proj/sllstore2017027/workshop-GA2018/data/Illumina_SRR492065/Cutibacterium_avidum.fasta
# Load modules
module load bioinfo-tools bwa/0.7.17 samtools/1.8
# Align to the references.
PREFIX=$( basename "$REFERENCE" .fasta ) # Make a PREFIX from the reference name without .fasta on the end
bwa index "$REFERENCE"
bwa mem -t "$CPUS" "$REFERENCE" "$READ1" "$READ2" | samtools sort -@ "$CPUS" -T "$SNIC_TMP/$PREFIX" -O BAM -o "${PREFIX}_bwa_alignment.bam" -
samtools index "${PREFIX}_bwa_alignment.bam"
samtools flagstat "${PREFIX}_bwa_alignment.bam" > "${PREFIX}_bwa_alignment.bam.stats"
# Extract read names that align to reference
samtools view -@ "$CPUS" -F 4 "${PREFIX}_bwa_alignment.bam" | cut -f1 | sort -u -o "${PREFIX}_aligned_reads.tsv"
# Extract all read names
samtools view -@ "$CPUS" "${PREFIX}_bwa_alignment.bam" | cut -f1 | sort -u -o "${PREFIX}_all_reads.tsv"
# Extract read names unique to list1 between list1 (*.tsv) and list2 (*.tsv)
sort --parallel="$CPUS" "${LIST_1_TSV}" "${LIST_2_TSV}" "${LIST_2_TSV}" | uniq -u > "${LIST_3_TSV}"
# Use the list to extract the reads unique to list3
join -t ' ' <( zcat "$READ1" | paste - - - - | sort -k1,1 ) <( sed 's/^/@/' "${LIST_3_TSV}" ) | tr '\t' '\n' | pigz -c > "${LIST_3_TSV%%_*}_cleaned_R1.fastq.gz"
join -t ' ' <( zcat "$READ2" | paste - - - - | sort -k1,1 ) <( sed 's/^/@/' "${LIST_3_TSV}" ) | tr '\t' '\n' | pigz -c > "${LIST_3_TSV%%_*}_cleaned_R2.fastq.gz"
cp /proj/sllstore2017027/workshop-GA2018/data/Illumina_SRR492065/Staphylococcus_aureus.fasta .
cp /proj/sllstore2017027/workshop-GA2018/data/Illumina_SRR492065/Cutibacterium_avidum.fasta .
CPUS=${SLURM_NPROCS:-10}
READ1=Enterococcus_faecalis/SRR492065_1.fastq.gz
READ2=Enterococcus_faecalis/SRR492065_2.fastq.gz
# Reads aligned to Staphylococcus
REFERENCE=Staphylococcus_aureus.fasta
PREFIX=$( basename "$REFERENCE" .fasta )
bwa index "$REFERENCE"
bwa mem -t "$CPUS" "$REFERENCE" "$READ1" "$READ2" | samtools sort -@ "$CPUS" -T "$SNIC_TMP/$PREFIX" -O BAM -o "${PREFIX}_bwa_alignment.bam" -
samtools index "${PREFIX}_bwa_alignment.bam"
samtools flagstat "${PREFIX}_bwa_alignment.bam" > "${PREFIX}_bwa_alignment.bam.stats"
# List of reads aligned to Staphylococcus
samtools view -@ "$CPUS" -F 4 "${PREFIX}_bwa_alignment.bam" | cut -f1 | sort -u -o "${PREFIX}_aligned_reads.tsv"
# List of all reads (not just aligned to Staphylococcus)
samtools view -@ "$CPUS" "${PREFIX}_bwa_alignment.bam" | cut -f1 | sort -u -o "${PREFIX}_all_reads.tsv"
# Reads aligned to Cutibacterium
REFERENCE=Cutibacterium_avidum.fasta
PREFIX=$( basename "$REFERENCE" .fasta )
bwa index "$REFERENCE"
bwa mem -t "$CPUS" "$REFERENCE" "$READ1" "$READ2" | samtools sort -@ "$CPUS" -T "$SNIC_TMP/$PREFIX" -O BAM -o "${PREFIX}_bwa_alignment.bam" -
samtools index "${PREFIX}_bwa_alignment.bam"
samtools flagstat "${PREFIX}_bwa_alignment.bam" > "${PREFIX}_bwa_alignment.bam.stats"
# List of reads aligned to Cutibacterium
samtools view -@ "$CPUS" -F 4 "${PREFIX}_bwa_alignment.bam" | cut -f1 | sort -u -o "${PREFIX}_aligned_reads.tsv"
# List of read names not mapped to Staphylococcus
LIST_1_TSV=Staphylococcus_aureus_all_reads.tsv
LIST_2_TSV=Staphylococcus_aureus_aligned_reads.tsv
sort --parallel="$CPUS" "${LIST_1_TSV}" "${LIST_2_TSV}" "${LIST_2_TSV}" | uniq -u > "Staphylococcus_aureus_removed_reads.tsv"
# List of read names not mapped to Cutibacterium
LIST_1_TSV=Staphylococcus_aureus_removed_reads.tsv
LIST_2_TSV=Cutibacterium_avidum_aligned_reads.tsv
sort --parallel="$CPUS" "${LIST_1_TSV}" "${LIST_2_TSV}" "${LIST_2_TSV}" | uniq -u > "SRR492065_Enterococcus_only_reads.tsv"
# Get filtered fastqs
LIST_3_TSV=SRR492065_Enterococcus_only_reads.tsv
join -t ' ' <( zcat "$READ1" | paste - - - - | sort -k1,1 ) <( sed 's/^/@/' "${LIST_3_TSV}" ) | tr '\t' '\n' | pigz -c > "${LIST_3_TSV%%_*}_cleaned_R1.fastq.gz"
join -t ' ' <( zcat "$READ2" | paste - - - - | sort -k1,1 ) <( sed 's/^/@/' "${LIST_3_TSV}" ) | tr '\t' '\n' | pigz -c > "${LIST_3_TSV%%_*}_cleaned_R2.fastq.gz"