logo_raukr

RaukR 2018

Analyzing amplicon data using DADA2

31-May-2018

John Sundh & Marcin Kierczak

In this workshop we will use the DADA2 pipeline to analyze amplicon sequence datasets. We will use a subset of the samples used in the study by Kraemer et al (2018) Influence of pig farming on the human’s nasal microbiota: The key role of the airborne microbial communities.

However, if you have data of your own that you would like to analyze feel free to do so and modify the steps here according to your own data.

Some code is hidden by default. Click on the ‘Code’ button on the right side to show the code.

1 Setup

1.1 Installing DADA2

If you don’t have the dada2 package installed you can install it with

source("https://bioconductor.org/biocLite.R")
biocLite("dada2")

1.2 Loading packages

Let’s go ahead and load the R packages we will need.

library(dada2)
library(vegan)
library(ggplot2)
packageVersion("dada2")

1.3 Download the example data

The original Kraemer et al (2018) study included 255 samples from 43 pig farmers, 56 pigs and 27 air samples as well as 43 control subjects. For the purpose of this workshop we have scaled the sample number down to 24 total samples. Some general information for these samples are shown below. As you can see there are samples from 9 locations: 3 pig farms, 3 cow farms and 3 for negative controls. Two samples were taken per human individual (anterior and posterior nares).

df <- read.table("info/example_subset_info.tsv", sep="\t", header = TRUE)

datatable(df[,c(9,3,4,11,14,15)], options = list(pageLength=26, order=list(5, 'asc')))

The raw data for these samples can be downloaded from here: Dropbox link

Run the following to download the example data and unpack in the data/ directory.

system("curl -o data/example_data.tar.gz -L https://www.dropbox.com/s/rewj8p08y1xc8np/example_data.tar.gz?dl=0")
system("tar -C data -xvf data/example_data.tar.gz")

2 DADA2 pipeline

2.1 Loading the sample files

First of all we need to store the names of the sample files.

path <- paste(getwd(), "data", sep="/")
fR1 <- sort(list.files(path, pattern="L1_R1.fastq.gz", full.names = TRUE))
fR2 <- sort(list.files(path, pattern="L1_R2.fastq.gz", full.names = TRUE))

Let’s also extract the sample names from the files.

sample.names <- regmatches(basename(fR1),regexpr("[0-9]+",basename(fR1)))

2.2 Trimming and filtering reads

DADA2 comes with a function plotQualityProfile to plot the quality distribution of nucleotides in your reads. Let’s take a look at the profiles for the forward and reverse reads of the first sample in our list.

plotQualityProfile(c(fR1[1],fR2[1]))

This plot shows the distribution of quality scores for each position in the reads. Darker greyscale indicates a higher count. The green line is the mean and the dashed lines represent the 25th and 75th quantiles.

Because DADA2 uses quality scores to build the error model so low-quality nucleotides typically seen in Illumina sequencing (especially for the reverse reads as you can see) isn’t too much of a problem. However, getting rid of regions with very low quality can make DADA2 more sensitive. Also, adapters and primers with ambiguous nucleotides will cause problems so we need to make sure those are removed.

The function filterAndTrim in DADA2 can trim reads to a certain length and filter out reads based on expected errors after trimming.

First we’ll specify the output directory and file names for the filtered reads:

filt_path <- file.path(path, "filtered")
filtfR1 <- file.path(filt_path, paste0(sample.names, "_R1_filt.fastq.gz"))
filtfR2 <- file.path(filt_path, paste0(sample.names, "_R2_filt.fastq.gz"))

In the study by Kraemer the Methods section in the supplementary material states that:

Forward reads were trimmed at 200 bp and reverse reads were trimmed at 150 bp to remove low quality regions. The 20 first base pairs and instances of a quality score less than or equal to two were truncated from all reads. Reads (and their respective forward and reverse read) containing ambiguous bases and more than two expected errors were filtered out.

QUESTION: How would you use the filterAndTrim function to process the reads like the original study? Take a look at the documentation for the function using help(filterAndTrim). Give it some thought and then check the code below.

filt_trim <- filterAndTrim(fR1, filtfR1, fR2, filtfR2, truncLen=c(200,150), trimLeft = 20,
              maxN=0, maxEE=c(2,2), truncQ=2, rm.phix=TRUE,
              compress=TRUE, multithread=TRUE)

2.3 Dereplication

In the dereplication step all identical reads are combined and the total abundance of the unique reads is stored. This is done via the derepFastq function.

derepR1 <- derepFastq(filtfR1, verbose=TRUE)
names(derepR1) <- sample.names
derepR2 <- derepFastq(filtfR2, verbose=TRUE)
names(derepR2) <- sample.names

2.4 Estimating the error model

The error model used by DADA2 to infer the correct sequences is made up of transition probabilities and is estimated, or learned, from the data. This is done with the function learnErrors.

error_ratesR1 <- learnErrors(derepR1, multithread=TRUE)
error_ratesR2 <- learnErrors(derepR2, multithread=TRUE)

By default, the error rates are estimated from pooled samples using up to 1 million reads. Samples from different sequencing runs should typically not be pooled as they don’t share an “error history”. To estimate the error rates for just the first sample you could do:

error_ratesR1_sample1 <- learnErrors(derepR1[[1]], multithread=TRUE)
error_ratesR2_sample1 <- learnErrors(derepR2[[1]], multithread=TRUE)

plotErrors can be used to plot the observed nucleotide transitions as a function of quality. This is what forms the error model of DADA2. The plot can be used to check how well the error model fits your data. Run the function on the forward and reverse reads.

plotErrors(error_ratesR1)
plotErrors(error_ratesR2)

QUESTION: Does the model (the fitted line) seem to match the observed error rates? Try plotting error rates estimated from just one sample and compare to the pooled estimates. Is there any difference?

2.5 Inferring ASVs

The core algorithm of DADA2 is the sequence variant inference which is done with the aptly named function dada. In a paired-end dataset the algorithm is run separately for each read in a fragment.

dadaR1 <- dada(derepR1, err=error_ratesR1, multithread=TRUE)
dadaR2 <- dada(derepR2, err=error_ratesR2, multithread=TRUE)

Typically, it’s fine to use the default settings of the dada function but it’s always good to know about the parameters. Three important parameters of DADA2 are:

OMEGA_A: This sets the threshold for the ‘abundance p-value’, a measure of the probability that a sequence is the result of sequencing errors. During the partitioning of sequences, if the abundance p-value drops below the OMEGA_A threshold a new partition is created and all unique sequences are compared to that new partition. By default it is set to 1e-40.
KDIST_CUTOFF: This is one of the parameters used as a heuristic during the pairwise alignment phase of DADA2. Prior to aligning the kmer distance between sequences is calculated and only sequences with a smaller kmer distance than KDIST_CUTOFF are aligned. By default it is set to 0.42. Increasing it will increase the number of pairwise alignments but also the running time.
BAND_SIZE: This is a threshold on how many potential gaps to allow in a pairwise alignment. By default it is set to 16. The heuristic can be turned off using a negative value.

2.5.1 Merging reads

After ASVs have been inferred for each read the pairs are merged and potential pairs with insufficient overlap or too many mismatches in the overlapping regions are removed.

mergers <- mergePairs(dadaR1, derepR1, dadaR2, derepR2, verbose=TRUE)

Note that if you have non-overlapping pairs you can just concatenate the two reads in a pair rather than merge them using justConcatenate=TRUE in mergePairs.

2.5.2 Make sequence table

When the ASVs have been inferred, DADA2 can create a sequence table with counts for each ASV (columns) across the provided samples (rows).

seqtab <- makeSequenceTable(mergers)

2.5.3 To pool or not to pool

By default DADA2 infers ASVs separately for each sample in a dataset. However, the dada(...) function can be set to take into account information from all samples (pooling) by setting pool=TRUE. By not pooling samples the computation time and memory requirements are lowered. However, having more information can increase sensitivity of the algorithm. Take a moment to think about why pooling may increase sensitivity.

The original study pooled samples for estimations of the error rate but did not pool them when inferring ASVs. If you want you can run the ASV inference with pooled samples and investigate the differences yourself. Click the code section below to see an example on how to look at the differences in ASV numbers.

dadaR1_pooled <- dada(derepR1, err=error_ratesR1, multithread=TRUE, pool=TRUE)
dadaR2_pooled <- dada(derepR2, err=error_ratesR2, multithread=TRUE, pool=TRUE)

# Merge reads
mergers_pooled <- mergePairs(dadaR1_pooled, derepR1, dadaR2_pooled, derepR2, verbose=TRUE)

# Make sequence table
seqtab_pooled <- makeSequenceTable(mergers_pooled)

# Look at differences in ASV numbers
ncol(seqtab)
ncol(seqtab_pooled)

# Plot abundance differences
ASV_diff <- setdiff(colnames(seqtab_pooled),colnames(seqtab))
ASV_diff_perc <- rowSums(seqtab_pooled[,ASV_diff]) / rowSums(seqtab_pooled) * 100
ASV_diff_df <- data.frame(percent=ASV_diff_perc, Sample=names(ASV_diff_perc))
ASV_diff_df <- cbind(ASV_diff_df,paste("Loc",df[sort.list(df$sample_alias),"location.number"]))
colnames(ASV_diff_df) <- c("percent","Sample","Location")
ggplot(data=ASV_diff_df, mapping=aes(x=Sample, y=percent, fill=Location)) + geom_bar(stat="identity") + scale_x_discrete(labels=df[sort.list(df$location.number),"sample_alias"])

2.5.4 Remove chimeras

DADA2 has now performed error correction on the sequences but chimeras may still remain. Chimeric sequences are identified if they can be put together by the left- and right-segment of two more abundant sequences in the data. In the default consensus method a sequence is removed if it is flagged as chimeric in a fraction of the samples (0.9 by default). The original study used the pooled strategy instead in which all samples are pooled prior to chimera evaluation.

seqtab.nochim <- removeBimeraDenovo(seqtab, method="pooled", multithread=TRUE, verbose=TRUE)

It may be interesting to see what proportion of the sequences are identified as chimeric:

print(1-dim(seqtab.nochim)[2] / dim(seqtab)[2])

That may seem like a high proportion of chimeric sequences. Let’s take a look at how much they contribute to the total:

print((1-sum(seqtab.nochim)/sum(seqtab))*100)

However, the chimeras make up roughly 4% of the total sequences.

2.5.5 Check sequences throughout the pipeline

DADA2 also comes with a built-in way to follow the number of sequences through the pipeline.

getN <- function(x) sum(getUniques(x))
track <- cbind(filt_trim, sapply(dadaR1, getN), sapply(dadaR2, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")

You should see something like this:

datatable(track, options = list(pageLength=24))

2.6 Assign taxonomy

DADA2 can assign a taxonomy to the inferred ASVs. Here we’ll do it using the formatted data from the Ribosomal Database Project (RDP). For other database see here.

First download the training data for the RDP classifier:

system("curl -o data/rdp_species_assignment_16.fa.gz https://zenodo.org/record/801828/files/rdp_species_assignment_16.fa.gz")
system("curl -o data/rdp_train_set_16.fa.gz https://zenodo.org/record/801828/files/rdp_train_set_16.fa.gz")

Then run the assign taxonomy step:

taxa <- assignTaxonomy(seqtab.nochim, "data/rdp_train_set_16.fa.gz", multithread=TRUE)

And add species level assignments.

taxa <- addSpecies(taxa, "data/rdp_species_assignment_16.fa.gz")

3 Analysis

3.1 Sample differences (\(\beta\) diversity)

Beta diversity refers to the the difference in community composition between samples. We can take the sequence abundance table seqtab and plot the dissimilarities in a reduced space using Nonmetric Multidimensional Scaling (NMDS). There are several ways of doing this type of analysis but here we’ll use function metaMDS from the vegan package

# Transform to proportions
seqtab.prop <- seqtab.nochim/rowSums(seqtab.nochim)
mds <- metaMDS(seqtab.prop, k=3)
df$MDS1 <- mds$points[as.character(df$sample_alias),1]
df$MDS2 <- mds$points[as.character(df$sample_alias),2]

Then we can plot the first two dimensions, color samples by sample type and label them with sample_id.

ggplot(data=df, mapping=aes(x=MDS1, y=MDS2, color=sample_type)) + geom_point(size=3) +   geom_text(aes(label=sample_id,hjust=0),nudge_x=0.05,size=3)

QUESTION: What conclusions regarding the community compositions would you draw from this plot?

The metaMDS function also provides so called ‘species scores’ so that we can plot how the sequences themselves are spread in this reduced space. Let’s add some taxonomic information on top of that plot as well.

taxa <- as.data.frame(taxa)
# Merge sequence table with taxonomy
seqtab.prop_t <- t(seqtab.prop)
seqtab.prop_taxa <- merge(taxa, seqtab.prop_t[rownames(taxa),], by=0)
# Sum to rank
rank <- "Phylum"
seqtab.prop_rank <- aggregate(seqtab.prop_taxa[,sample.names], by=list(rank=seqtab.prop_taxa[,rank]), FUN=sum, drop=TRUE)
rownames(seqtab.prop_rank) <- seqtab.prop_rank$rank
seqtab.prop_rank <- seqtab.prop_rank[,sample.names]
# Get means across all samples
rank_means <- rowMeans(seqtab.prop_rank)
# Select phyla with top 10 mean abundances
top_phyla <- names(rank_means[sort.list(rank_means, decreasing=TRUE)][1:10])
# Add the species scores and phyla
ASVs <- rownames(taxa[taxa$Phylum%in%top_phyla,])
phyla <- taxa[taxa$Phylum%in%top_phyla,"Phylum"]
mds1 <- mds$species[ASVs,1]
mds2 <- mds$species[ASVs,2]
species_df <- data.frame(phylum=phyla, MDS1=mds1, MDS2=mds2)
ggplot() + geom_point(data = species_df, mapping=aes(x=MDS1, y=MDS2, color=phylum)) + scale_color_brewer(palette="Set3") + geom_point(data=df, mapping=aes(x=MDS1, y=MDS2), size=3) + geom_text(data=df, aes(x = MDS1, y=MDS2, label=sample_id,hjust=0),nudge_x=0.05,size=3)

Let’s also take a quick glance at the differences with a bar plot.

library(reshape2)
seqtab_tmp <- seqtab.prop_rank
seqtab_tmp$Phylum <- rownames(seqtab.prop_rank)
seqtab.prop_rank_melt <- melt(seqtab_tmp[top_phyla,], id.vars = c("Phylum"), variable.name = "Sample", value.name = "proportion")
# Reorder by location number 
sample_alias <- df[sort.list(df$location.number),"sample_alias"]
# Set xticklabels
xticklabels <- paste(df[sort.list(df$location.number),"location.number"], df[sort.list(df$location.number),"sample_type"], sep="/")
seqtab.prop_rank_melt$sample <- factor(seqtab.prop_rank_melt$Sample, levels = sample_alias)
ggplot(data=seqtab.prop_rank_melt, mapping=aes(x=Sample, y=proportion, fill=Phylum)) + geom_col()+ scale_fill_brewer(palette="Set3") + scale_x_discrete(name="Location/sample_type", limits=as.factor(sample_alias), labels=xticklabels) + theme(axis.text.x = element_text(angle = 90, hjust=1.1, vjust = 0.5))

3.2 \(\alpha\) diversity (using Phyloseq)

Results from DADA2 can be directly imported into the phyloseq package. Let’s give it a quick try.

First install and load the phyloseq package.

install.packages("phyloseq")
library(phyloseq)

Then create the phyloseq object by supplying a DADA2 sequence table and a dataframe of metadata.

ps_sample_df <- df
rownames(ps_sample_df) <- as.character(df$sample_alias)
ps <- phyloseq(otu_table(seqtab.nochim[as.character(df$sample_alias),], taxa_are_rows=FALSE), sample_data(ps_sample_df,tax_table(taxa)))

Plot Richness and Shannon diversity metrics by sample type. Essentially we are recreating Fig. 1A in the original article. Do the results that we get from this limited study appear to be similar?

plot_richness(ps, x="sample_type", measures=c("Observed")) + geom_boxplot(aes(fill=sample_type)) + scale_x_discrete(limits=c("pig","air","pigf","cowf","ne"), labels = c("pig","air","pig farmer","cow farmer","non-exposed"))

4 Challenge dataset

In case you’ve finished the workshop using the example data and still have time to spare you can go ahead and analyze a challenge dataset. The study is titled Longitudinal analysis of microbial interaction between humans and the indoor environment and was published in Science in 2014. Bacterial communities associated with humans and their indoor environment were investigated using 16S amplicon sequencing, so it follows the same theme as the rest of the workshop but there are differences: the study used single-end sequencing and clustered sequences into OTUS at 97% similarity. A total of 1,625 samples were collected (see here for the full set of raw data) but we’ve made a subset of 60 samples from 5 houses and 4 sample types that you can download here:

5 Session Info

For R Markdown, see http://rmarkdown.rstudio.com and https://rmarkdown.rstudio.com/authoring_pandoc_markdown.html

This document has been created in RStudio using R Markdown and related packages. For details about the OS, packages and versions, see detailed information below:

sessionInfo()

## R version 3.4.4 (2018-03-15)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.4
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] phyloseq_1.22.3 reshape2_1.4.3  vegan_2.5-1     lattice_0.20-35
##  [5] permute_0.9-4   dada2_1.6.0     Rcpp_0.12.17    ggplot2_2.2.1  
##  [9] DT_0.4          captioner_2.2.3 bookdown_0.7    knitr_1.20     
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.38.0             splines_3.4.4             
##  [3] foreach_1.4.4              jsonlite_1.5              
##  [5] RcppParallel_4.4.0         shiny_1.1.0               
##  [7] stats4_3.4.4               latticeExtra_0.6-28       
##  [9] GenomeInfoDbData_1.0.0     Rsamtools_1.30.0          
## [11] yaml_2.1.19                pillar_1.2.3              
## [13] backports_1.1.2            digest_0.6.15             
## [15] GenomicRanges_1.30.3       RColorBrewer_1.1-2        
## [17] promises_1.0.1             XVector_0.18.0            
## [19] colorspace_1.3-2           htmltools_0.3.6           
## [21] httpuv_1.4.3               Matrix_1.2-14             
## [23] plyr_1.8.4                 pkgconfig_2.0.1           
## [25] ShortRead_1.36.1           zlibbioc_1.24.0           
## [27] xtable_1.8-2               scales_0.5.0              
## [29] later_0.7.2                BiocParallel_1.12.0       
## [31] tibble_1.4.2               mgcv_1.8-23               
## [33] IRanges_2.12.0             SummarizedExperiment_1.8.1
## [35] BiocGenerics_0.24.0        lazyeval_0.2.1            
## [37] survival_2.42-3            magrittr_1.5              
## [39] mime_0.5                   evaluate_0.10.1           
## [41] nlme_3.1-137               MASS_7.3-50               
## [43] hwriter_1.3.2              Cairo_1.5-9               
## [45] tools_3.4.4                data.table_1.10.4-3       
## [47] matrixStats_0.53.1         stringr_1.3.1             
## [49] S4Vectors_0.16.0           munsell_0.4.3             
## [51] cluster_2.0.7-1            DelayedArray_0.4.1        
## [53] Biostrings_2.46.0          ade4_1.7-11               
## [55] compiler_3.4.4             GenomeInfoDb_1.14.0       
## [57] rlang_0.2.0                rhdf5_2.22.0              
## [59] grid_3.4.4                 RCurl_1.95-4.10           
## [61] biomformat_1.6.0           iterators_1.0.9           
## [63] htmlwidgets_1.2            igraph_1.2.1              
## [65] crosstalk_1.0.0            bitops_1.0-6              
## [67] labeling_0.3               rmarkdown_1.9             
## [69] multtest_2.34.0            codetools_0.2-15          
## [71] gtable_0.2.0               R6_2.2.2                  
## [73] GenomicAlignments_1.14.2   rprojroot_1.3-2           
## [75] ape_5.1                    stringi_1.2.2             
## [77] parallel_3.4.4             xfun_0.1

2018 | SciLifeLab > NBIS > RaukR