Contamination is not always easy to detect at the read level. Let’s assess the assemblies again using Kraken2, but this time at the assembly level.
module load bioinfo-tools Kraken2/2.0.7-beta-bc14b13 Krona/2.7
mkdir Kraken
cd Kraken
ln -s ../*.fasta .
KRAKEN2DB=$SNIC_TMP/kraken_bacterial_db
rsync -av "/proj/sllstore2017027/workshop-GA2018/databases/kraken_bacterial_db/" "$KRAKEN2DB"
apply_kraken () {
ASSEMBLY="$1" # The assembly is the first parameter to this function. This file must end in .fasta
PREFIX=$( basename "$ASSEMBLY" .fasta )
echo "Running Kraken2: $ASSEMBLY"
kraken2 --threads "${CPUS:-10}" --db "$KRAKEN2DB" --report "${PREFIX}_kraken.rpt" "$ASSEMBLY" > "${PREFIX}_kraken.tsv"
ktImportTaxonomy <( cut -f2,3 "${PREFIX}_kraken.tsv" ) -o "${PREFIX}_kraken_krona.html"
}
One could also classify the sequences using Blast, however this is a time consuming process. Running a single assembly on the resources allocated to you can take up to an hour.
Copy the results we have already provided and inspect them.
/proj/sllstore2017027/workshop-GA2018/data/SRR492065_blast/
The results were generated in the following way:
module load bioinfo-tools blast/2.7.1+ Krona/2.7
mkdir Blast
cd Blast
ln -s ../*.fasta .
set_difference () {
sort "$1" "$2" "$2" | uniq -u
}
apply_Blast () {
ASSEMBLY="$1" # The assembly is the first parameter to this function. The file must end in .fasta
PREFIX=$( basename "$ASSEMBLY" .fasta )
echo "Blast: $ASSEMBLY"
blastn -task megablast -query "$ASSEMBLY" -db "${BLASTDB:-/sw/data/uppnex/blast_databases}/nt" \
-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
-culling_limit 5 -num_threads "${CPUS:-10}" -evalue 1e-25 -out "${PREFIX}_blast_alignment.tsv"
# Blast tabular output does not include `no hit`. The set difference is used to include remaining unclassified sequence.
ktImportTaxonomy <( cat <( cut -f1,2 "${PREFIX}_blast_alignment.tsv" | sort -u ) <( set_difference <( grep ">" "$ASSEMBLY" | cut -c2- ) <(cut -f1 "${PREFIX}_blast_alignment.tsv" ) )) -o "${PREFIX}_blast_krona.html"
}
Run Blobtools on each assembly. Blobtools requires both a BAM file as input and blast output for the classification step. What do these plots show?
module load bioinfo-tools blobtools/1.0-af55ced
mkdir Blobtools
cd Blobtools
ln -s ../*.fasta .
ln -s ../BWA/*.bam .
ln -s ../Blast/*.tsv .
apply_blobtools () {
ASSEMBLY="$1" # The assembly is the first parameter to this function. The file must end in .fasta
BAM="$2" # The BAM file is the second parameter to this function
BLAST="$3" # The BLAST file is the third parameter to this function
PREFIX=$( basename "$ASSEMBLY" .fasta )
blobtools create -i "$ASSEMBLY" -b "$BAM" -t "$BLAST" -o "${PREFIX}_blobtools"
blobtools blobplot -i "${PREFIX}_blobtools.blobDB.json" -o "${PREFIX}_blobtools"
}
Bandage is a great tool to visualise assembly graphs. Load and draw some of the assembly graphs.
The assembly graphs for some of the assemblies can be found here:
/proj/sllstore2017027/workshop-GA2018/data/SRR492065_graphs
Bandage is loaded through the conda environment GA2018
.
conda activate GA2018
Bandage --help
Use the Blast results to create a label csv to load into bandage and identify scaffolds. Have a go at labelling only 5 sequences.
Look at the workshop wiki for a brief description of the GFA file format, and the Bandage webpage for information on how to construct the csv.
Hint: Use nano
, a unix file editor to write the file.
Optional: Use the unix command line tools such as grep
, sort
, cut
, and join
to manipulate the data into csv.
grep "NODE_1000_length_306_cov_2.50996" spades_k21-55_full_blast_alignment.tsv
Node,Contig,Blast
1511768+,NODE_1000_length_306_cov_2.509960,Staphylococcus epidermidis
9815238+,NODE_1001_length_305_cov_4.088000,Staphylococcus epidermidis
10091228+,NODE_1002_length_305_cov_3.296000,Staphylococcus epidermidis
1099986+,NODE_1003_length_305_cov_2.700000,Staphylococcus hominis
10320035+,NODE_1004_length_304_cov_39.807229,Paenibacillus sp. FSL R7-0331
cut -f1,15 spades_k21-55_full_blast_alignment.tsv | sort -u > spades_k21-55_full_blast_annotation.tsv
head spades_k21-55_full_blast_annotation.tsv
NODE_1000_length_306_cov_2.509960 Staphylococcus epidermidis
NODE_1000_length_306_cov_2.509960 Staphylococcus epidermidis ATCC 12228
NODE_1000_length_306_cov_2.509960 Staphylococcus epidermidis RP62A
NODE_1001_length_305_cov_4.088000 Staphylococcus epidermidis
NODE_1001_length_305_cov_4.088000 Staphylococcus epidermidis ATCC 12228
NODE_1001_length_305_cov_4.088000 Staphylococcus epidermidis RP62A
NODE_1002_length_305_cov_3.296000 Staphylococcus epidermidis
NODE_1002_length_305_cov_3.296000 Staphylococcus epidermidis ATCC 12228
NODE_1002_length_305_cov_3.296000 Staphylococcus epidermidis PM221
NODE_1003_length_305_cov_2.700000 Staphylococcus hominis
cut -f1,15 spades_k21-55_full_blast_alignment.tsv | sort -u -k1,1 > spades_k21-55_full_blast_annotation.tsv
NODE_1000_length_306_cov_2.509960 Staphylococcus epidermidis
NODE_1001_length_305_cov_4.088000 Staphylococcus epidermidis
NODE_1002_length_305_cov_3.296000 Staphylococcus epidermidis
NODE_1003_length_305_cov_2.700000 Staphylococcus hominis
NODE_1004_length_304_cov_39.807229 Paenibacillus sp. FSL R7-0331
NODE_1005_length_304_cov_2.815261 Staphylococcus aureus
NODE_1006_length_304_cov_2.112450 Staphylococcus hominis
NODE_1007_length_303_cov_3.193548 Staphylococcus epidermidis
NODE_1008_length_303_cov_2.286290 Staphylococcus hominis
NODE_1009_length_303_cov_1.326613 Staphylococcus hominis
grep "^P" spades_k21-55_full.gfa | cut -f2,3 | head
NODE_1_length_1448318_cov_80.178312_1 10311124+,10031624+,10311308+,9911828-,10258069-,10057802+,10272111+,10118480+,9985602+,10086712+,1392052-,9996864-,10171502-,10320089-,10273179-,10319496-,10273179-,10319498-,9886408-,10074530-,10294167-,10135462-,9886408-,10294169-,10294167-,10178112-,10128536+,10309890+,10128536+,10317320-,723032-,10317322-,723032-,10306893-,10226805-,10245575-,10226805-,10319919-,10318590+,10320057+,9955186-,10319875-,9711424-,10320055+,9955186-,10319933-,10318590+,10315156-,9711424-,10319640-,10302373+,10317464+,10314624+,10314630+,10302021+,10135030+,10310524+,10310194+,10310388+,235774+,10313776+,10300185+,10314676+,10314684+,10297987+,10313836+,10316434-,10017942-,10317812-,10314202-,1548974+,10309092+,10314132+,10314140+,10313694+,10314684-,10314676-,10318362-,10314186+,10314194+,10314202+,10317812+,10312142-,10218043+,10313772-,10310880+,9796758-,10311458+,10171996-,10270995+,666778+,10307503+,1936534-,10312728+,9906056+,10312142+,10317812-,10314202-,1548974+,10313566+,10287551+,10313866+,10297503+,10318937+,10085614+,10311126+,10085614+,10310820-,10200554-,10115886-,10085372-,10313984-,10314700-,10314692-,10314684-,10314676-,10300185-,10313832+,9946622+,10210134+,9946622+,10315032-,10311484+,10311344-,2957484+,10317710+,2957484+,10317708+,10171502+,10248517-,1392052+,10184644+,9985602-,10186332-,10272111-,10204768+,10258069+,10068726+,10311310-,10031624-,10320101-,10200554-,10186842+,10085372-,10287176-,10319791-,10313030-,10048488-,10110418-,9804040+,10309472-
NODE_2_length_336381_cov_10.156140_1 10285711+,10243057+,10268313+,10243057+,10287871-
NODE_2_length_336381_cov_10.156140_2 10320938+
NODE_3_length_320122_cov_11.770314_1 10247199-,10281667+
NODE_3_length_320122_cov_11.770314_2 10289913-
NODE_3_length_320122_cov_11.770314_3 10287919-
NODE_3_length_320122_cov_11.770314_4 10299737+,1484104+,10291809+
NODE_3_length_320122_cov_11.770314_5 10321250+,328964-,2720874-
NODE_4_length_310123_cov_12.870461_1 10291757+
NODE_4_length_310123_cov_12.870461_2 10300875+
grep "^P" spades_k21-55_full.gfa | cut -f2,3 | cut -f1 -d, | head
NODE_1_length_1448318_cov_80.178312_1 10311124+
NODE_2_length_336381_cov_10.156140_1 10285711+
NODE_2_length_336381_cov_10.156140_2 10320938+
NODE_3_length_320122_cov_11.770314_1 10247199-
NODE_3_length_320122_cov_11.770314_2 10289913-
NODE_3_length_320122_cov_11.770314_3 10287919-
NODE_3_length_320122_cov_11.770314_4 10299737+
NODE_3_length_320122_cov_11.770314_5 10321250+
NODE_4_length_310123_cov_12.870461_1 10291757+
NODE_4_length_310123_cov_12.870461_2 10300875+
grep "^P" spades_k21-55_full.gfa | cut -f2,3 | cut -f1 -d, | awk '{ print $2 "\t" $1 }' | cut -f1-6 -d"_" | head
10311124+ NODE_1_length_1448318_cov_80.178312
10285711+ NODE_2_length_336381_cov_10.156140
10320938+ NODE_2_length_336381_cov_10.156140
10247199- NODE_3_length_320122_cov_11.770314
10289913- NODE_3_length_320122_cov_11.770314
10287919- NODE_3_length_320122_cov_11.770314
10299737+ NODE_3_length_320122_cov_11.770314
10321250+ NODE_3_length_320122_cov_11.770314
10291757+ NODE_4_length_310123_cov_12.870461
10300875+ NODE_4_length_310123_cov_12.870461
grep "^P" spades_k21-55_full.gfa | cut -f2,3 | cut -f1 -d, | awk '{ print $2 "\t" $1 }' | cut -f1-6 -d"_" > spades_k21-55_full_node_contig_names.tsv
join -1 2 -2 1 <( sort -k2,2 spades_k21-55_full_node_contig_names.tsv ) <( sort -k1,1 spades_k21-55_full_blast_annotation.tsv) -t $'\t' -o 1.1,1.2,2.2 | head
1511768+ NODE_1000_length_306_cov_2.509960 Staphylococcus epidermidis
9815238+ NODE_1001_length_305_cov_4.088000 Staphylococcus epidermidis
10091228+ NODE_1002_length_305_cov_3.296000 Staphylococcus epidermidis
1099986+ NODE_1003_length_305_cov_2.700000 Staphylococcus hominis
10320035+ NODE_1004_length_304_cov_39.807229 Paenibacillus sp. FSL R7-0331
10115802+ NODE_1005_length_304_cov_2.815261 Staphylococcus aureus
9844216+ NODE_1006_length_304_cov_2.112450 Staphylococcus hominis
10271579+ NODE_1007_length_303_cov_3.193548 Staphylococcus epidermidis
10031180+ NODE_1008_length_303_cov_2.286290 Staphylococcus hominis
9908708+ NODE_1009_length_303_cov_1.326613 Staphylococcus hominis
join -1 2 -2 1 <( sort -k2,2 spades_k21-55_full_node_contig_names.tsv ) <( sort -k1,1 spades_k21-55_full_blast_annotation.tsv) -t $'\t' -o 1.1,1.2,2.2 | tr "\t" "," > spades_k21-55_full_bandage_labels.csv
head spades_k21-55_full_bandage_labels.csv
1511768+,NODE_1000_length_306_cov_2.509960,Staphylococcus epidermidis
9815238+,NODE_1001_length_305_cov_4.088000,Staphylococcus epidermidis
10091228+,NODE_1002_length_305_cov_3.296000,Staphylococcus epidermidis
1099986+,NODE_1003_length_305_cov_2.700000,Staphylococcus hominis
10320035+,NODE_1004_length_304_cov_39.807229,Paenibacillus sp. FSL R7-0331
10115802+,NODE_1005_length_304_cov_2.815261,Staphylococcus aureus
9844216+,NODE_1006_length_304_cov_2.112450,Staphylococcus hominis
10271579+,NODE_1007_length_303_cov_3.193548,Staphylococcus epidermidis
10031180+,NODE_1008_length_303_cov_2.286290,Staphylococcus hominis
9908708+,NODE_1009_length_303_cov_1.326613,Staphylococcus hominis
Bandage load spades_k21-55_full.gfa --draw