Analysing chip seq data is still much more of investigative process compared to the fairly mature pipelines and processes available for analysing RNA-sequence data. The small data we will use originates from the ENCODE (www.encodeproject.org) project. It consists of duplicates of ChIP-seq of a transcription factor REST in two different cell lines. The REST transcription factor is involved in repression of neural genes in non-neural cells. Here we will compare a neural cell to another cell type and would hence expect more binding to the non-neural cell type.
The input data is in the form of sorted, deduplicated bam files. For the sake of making the analysis feasible on a laptotp the data have been mapped only to chromosome 1 and 2 of the human genome.
The structure of the data is as follows:
In addition we have a access to the corresponding input libraries so that we can see if the experiment actually has worked.
All data is available here:
In csaw a param object that holds data about the settings used for the analysis is central. This is done to make sure that one is importing the same thing from the bam files throughout the analysis.
We create this file by first deciding on some parameter to add to the param file. We hence supply the name of the bam files with Chip data, fragment length, window width and blacklist. The blacklist is an imported bed file that contains information about areas in the genome that is not meaningful to analyse for chip signal. These are areas that are not suitable to retain for any downstream analysis as they will interfer with the true chip signal. The reason for this is not known, but in same cases it caused by repetitive sequence structures.
We start off by loading the libraries.
library(rtracklayer) # Importing bed
library(GenomicAlignments) # Working with aligned objects
library(GenomicRanges) # Working with geonomic ranges
library(Gviz) # Plot genome data
library(csaw) # Chip seq analysis
library(Rsamtools)
library(edgeR) # Differential binding analysis
We can then create the param object for the analysis. In this case we set the mapping quality and make sure to skip reads that are mapping in the blacklist region. Note that the blacklist file is often downloaded from external resources (https://personal.broadinstitute.org/anshul/projects/encode/rawdata/blacklists/hg19-blacklist-README.pdf).
bams <- list.files("~/data/chip/REST", pattern = ".bam$", full.names = TRUE)
frag.len <- 110
win.width <- 10
blacklist <- import("~/data/dukeExcludeRegions.bed")
param <- readParam(minq=50, discard = blacklist)
data <- windowCounts(bams, ext=frag.len, width=win.width, param=param)
After this one can look at the cross correlation to give us information on the fragment length. We hence use the correlateReads() function to estimate the distance between reads on forward and reverse strand in the data. This occurs as one will read the only the end of each DNA piece that was captured. This will hence tell us about the actual average size of the captured DNA.
x <- correlateReads(bams)
frag.len <- which.max(x) - 1
If one want to make sure that the cross correlation looks as expected one can easily make a plot showing the distribution.
plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
abline(v=frag.len, col="red")
text(x=frag.len, y=min(x), paste(frag.len, "bp"), pos=4, col="red")
We can do the but for the input libraries eg. libraries from the same cells that have not been subjected to enrichment where we do not expect any peak.
bams2 <- list.files("~/data/chip/input/", pattern = ".bam$", full.names = TRUE)
data <- windowCounts(bams2, ext=frag.len, width=win.width, param=param)
x2 <- correlateReads(bams2)
frag.len2 <- which.max(x2) - 1
plot(1:length(x2)-1, x2, xlab="Delay1 (bp)", ylab="CCF", type="l")
abline(v=frag.len2, col="blue")
text(x=frag.len2, y=min(x2), paste(frag.len2, "bp"), pos=4, col="red")
It does indeed look like the chip libraries have more pronounced peak suggesting that it works as expected. The “peak” one see at the later is much weaker as you can see by the scale of the y-axis.
The data can hence (at least based on this summary) be used for counting reads in windows and look for differences among the cell types. Depending on type of experiment the window length has to be altered, but for transcription factors a short lengh is needed to make sure that one capture even narrow peaks.
win.data <- windowCounts(bams, param = param, width=10, ext = frag.len)
win.data
## class: RangedSummarizedExperiment
## dim: 412699 4
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen
A common problem in identifying true peaks is if there is in general more binding in one cell type or one treatment as this might lead to identification “false” differences. This is handled by counting reads in larger bins and normalize out the effect of this in the the detection of differential binding.
bins <- windowCounts(bams, bin=TRUE, width=10000, param=param)
normfacs <- normOffsets(bins)
normfacs
y.bin <- asDGEList(bins)
bin.ab <- aveLogCPMliby.bin <- asDGEList(bins)
bin.ab <- aveLogCPM(y.bin)
adjc <- cpm(y.bin, log=TRUE)
## class: RangedSummarizedExperiment
## dim: 46356 4
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(5): bam.files totals ext rlen norm.factors
Removal of low-abundance windows done with the following cammands.
filter.stat <- filterWindows(win.data, bins, type="global")
min.fc <- 3
keep <- filter.stat$filter > log2(min.fc)
summary(keep)
## Mode FALSE TRUE
## logical 358754 53945
filtered.data <- win.data[keep,]
Here we will use the edgeR to test for differences between the cell types. So we set up an design matrix in a similar way as was done for the RNA-seq data.
celltype <- factor(c("Neural", "Neural", "SKNSH", "SKNSH"))
designChip <- model.matrix(~0 + celltype)
colnames(designChip) <- levels(celltype)
We are now ready for estimating dispersion parameters.
yChip <- asDGEList(filtered.data, norm.factors =
normfacs$norm.factors)
yChip <- estimateDisp(yChip, designChip)
summary(yChip$trended.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1502 0.1998 0.2170 0.2157 0.2344 0.3648
We can now check weather or not the celltypes do look different by using a MDS plot.
plotMDS(cpm(yChip, log=TRUE), top=10000, labels = celltype,
col=c("tomato3", "blue")[as.integer(celltype)])
We first run run the linear model
fitChip <- glmQLFit(yChip, designChip, robust=TRUE)
Then we specify the contrast of interest and test for differential binding between cell types. Since there can be a very large number of windows identified one can merge signals from adjacent windows in an attempt to reduce the the need for multiple corrections.
contrast <- makeContrasts(Neural-SKNSH, levels = designChip)
resChip <- glmQLFTest(fitChip, contrast=contrast)
merged <- mergeWindows(rowRanges(filtered.data), tol=100, max.width=5000)
tabcom <- combineTests(merged$id, resChip$table)
tabbest <- getBestTest(merged$id, resChip$table)
is.sig <- tabcom$FDR <= 0.05
summary(is.sig)
## Mode FALSE TRUE
## logical 5038 5790
We can then rearrange and summarize data and have a go at plotting the patterns for a region that are different between cells. We make sure to extract the mid regions in best candidates
out.ranges <- merged$region
elementMetadata(out.ranges) <- data.frame(tabcom,
best.pos=mid(ranges(rowRanges(filtered.data[tabbest$best]))),
best.logFC=tabbest$logFC)
We read in two packages with annotation information.
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
We can then extract our top candidate regions
o <- order(out.ranges$PValue)
cur.region <- out.ranges[o[1]]
cur.region
## GRanges object with 1 range and 8 metadata columns:
## seqnames ranges strand | nWindows logFC.up logFC.down
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## [1] chr2 75810051-75810210 * | 4 0 4
## PValue FDR direction best.pos
## <numeric> <numeric> <character> <integer>
## [1] 2.74771152494641e-07 0.000149807807342976 down 75810155
## best.logFC
## <numeric>
## [1] -8.47796941498082
## -------
## seqinfo: 2 sequences from an unspecified genome
anno <- detailRanges(out.ranges, orgdb=org.Hs.eg.db,
txdb=TxDb.Hsapiens.UCSC.hg19.knownGene)
gax <- GenomeAxisTrack(col="black", fontsize=10, size=2)
g_reg <- GeneRegionTrack(TxDb.Hsapiens.UCSC.hg19.knownGene,
showId=TRUE, geneSymbol=TRUE, name="",
background.title="transparent")
symbols <- unlist(mapIds(org.Hs.eg.db, gene(g_reg), "SYMBOL",
"ENTREZID", multiVals = "first"))
symbol(g_reg) <- symbols[gene(g_reg)]
collected <- list()
lib.sizes <- filtered.data$totals/1e6
for (i in 1:length(bams)) {
reads <- extractReads(bam.file=bams[i], cur.region, param=param)
pcov <- as(coverage(reads[strand(reads)=="+"])/lib.sizes[i],
"GRanges")
ncov <- as(coverage(reads[strand(reads)=="-"])/-lib.sizes[i],
"GRanges")
ptrack <- DataTrack(pcov, type="histogram", lwd=0,
ylim=c(-5, 5),
name=bams[i],
col.axis="black",
col.title="black",
fill="blue",
col.histogram=NA)
ntrack <- DataTrack(ncov, type="histogram",
lwd=0, ylim=c(-5, 5),
fill="red", col.histogram=NA)
collected[[i]] <- OverlayTrack(trackList=list(ptrack, ntrack))
}
plotTracks(c(gax, collected, g_reg), chromosome=as.character(seqnames(cur.region)),
from=start(cur.region)-1000, to=end(cur.region)+1000)
This is a crude plot that does not properly show the name of the bam files (why?) nor do we plot the input. Try to modify the plot to include more details and modify it according to your own preferences and plot other identified peaks and see if there are any that are closer to known transcripts.
The csaw approach is a R solution for doing Chip seq. analysis. It is a flexible, but there are many tools both outside and inside R that are well suited of designed for analysis of Chip sequence data.
Below is a few of tools that are often deployed. The last three are bioconductor packages. - Phantompeakqualtools - Deeptools - ChIPQC - ChIPpeakAnno - DiffBind