1 Chip sequence data

Analysing chip seq data is still much more of investigative process compared to the fairly mature pipelines and processes available for analysing RNA-sequence data. The small data we will use originates from the ENCODE (www.encodeproject.org) project. It consists of duplicates of ChIP-seq of a transcription factor REST in two different cell lines. The REST transcription factor is involved in repression of neural genes in non-neural cells. Here we will compare a neural cell to another cell type and would hence expect more binding to the non-neural cell type.

The input data is in the form of sorted, deduplicated bam files. For the sake of making the analysis feasible on a laptotp the data have been mapped only to chromosome 1 and 2 of the human genome.

The structure of the data is as follows:

  • ENCFF000OWQ neural replicate 1
  • ENCFF000OWM neural replicate 2
  • ENCFF000RAG SK-N-SH replicate 1
  • ENCFF000RAH SK-N-SH replicate 2

In addition we have a access to the corresponding input libraries so that we can see if the experiment actually has worked.

  • ENCFF000OXB neural replicate 1
  • ENCFF000OXE neural replicate 2
  • ENCFF000RBT SK-N-SH replicate 1
  • ENCFF000RBU SK-N-SH replicate 2

All data is available here:

ChipRest CHipInput Blacklist

1.1 csaw - ChIP-seq analysis with windows

In csaw a param object that holds data about the settings used for the analysis is central. This is done to make sure that one is importing the same thing from the bam files throughout the analysis.

We create this file by first deciding on some parameter to add to the param file. We hence supply the name of the bam files with Chip data, fragment length, window width and blacklist. The blacklist is an imported bed file that contains information about areas in the genome that is not meaningful to analyse for chip signal. These are areas that are not suitable to retain for any downstream analysis as they will interfer with the true chip signal. The reason for this is not known, but in same cases it caused by repetitive sequence structures.

We start off by loading the libraries.

library(rtracklayer) # Importing bed
library(GenomicAlignments) # Working with aligned objects
library(GenomicRanges) # Working with geonomic ranges
library(Gviz) # Plot genome data
library(csaw) # Chip seq analysis
library(Rsamtools)
library(edgeR) # Differential binding analysis

We can then create the param object for the analysis. In this case we set the mapping quality and make sure to skip reads that are mapping in the blacklist region. Note that the blacklist file is often downloaded from external resources (https://personal.broadinstitute.org/anshul/projects/encode/rawdata/blacklists/hg19-blacklist-README.pdf).

bams <- list.files("~/data/chip/REST", pattern = ".bam$", full.names = TRUE)
frag.len <- 110
win.width <- 10
blacklist <- import("~/data/dukeExcludeRegions.bed")
param <- readParam(minq=50, discard = blacklist)
data <- windowCounts(bams, ext=frag.len, width=win.width, param=param)

After this one can look at the cross correlation to give us information on the fragment length. We hence use the correlateReads() function to estimate the distance between reads on forward and reverse strand in the data. This occurs as one will read the only the end of each DNA piece that was captured. This will hence tell us about the actual average size of the captured DNA.

1.2 Cross correlations

x <- correlateReads(bams)
frag.len <- which.max(x) - 1

If one want to make sure that the cross correlation looks as expected one can easily make a plot showing the distribution.

plot(1:length(x)-1, x, xlab="Delay (bp)", ylab="CCF", type="l")
abline(v=frag.len, col="red")
text(x=frag.len, y=min(x), paste(frag.len, "bp"), pos=4, col="red")

We can do the but for the input libraries eg. libraries from the same cells that have not been subjected to enrichment where we do not expect any peak.

bams2 <- list.files("~/data/chip/input/", pattern = ".bam$", full.names = TRUE)
data <- windowCounts(bams2, ext=frag.len, width=win.width, param=param)
x2 <- correlateReads(bams2)
frag.len2 <- which.max(x2) - 1
plot(1:length(x2)-1, x2, xlab="Delay1 (bp)", ylab="CCF", type="l")
abline(v=frag.len2, col="blue")
text(x=frag.len2, y=min(x2), paste(frag.len2, "bp"), pos=4, col="red")

It does indeed look like the chip libraries have more pronounced peak suggesting that it works as expected. The “peak” one see at the later is much weaker as you can see by the scale of the y-axis.

The data can hence (at least based on this summary) be used for counting reads in windows and look for differences among the cell types. Depending on type of experiment the window length has to be altered, but for transcription factors a short lengh is needed to make sure that one capture even narrow peaks.

win.data <- windowCounts(bams, param = param, width=10, ext = frag.len)
win.data
## class: RangedSummarizedExperiment 
## dim: 412699 4 
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(4): bam.files totals ext rlen

A common problem in identifying true peaks is if there is in general more binding in one cell type or one treatment as this might lead to identification “false” differences. This is handled by counting reads in larger bins and normalize out the effect of this in the the detection of differential binding.

bins <- windowCounts(bams, bin=TRUE, width=10000, param=param)
normfacs <- normOffsets(bins)
normfacs
y.bin <- asDGEList(bins)
bin.ab <- aveLogCPMliby.bin <- asDGEList(bins)
bin.ab <- aveLogCPM(y.bin)
adjc <- cpm(y.bin, log=TRUE)
## class: RangedSummarizedExperiment 
## dim: 46356 4 
## metadata(6): spacing width ... param final.ext
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames: NULL
## colData names(5): bam.files totals ext rlen norm.factors

Removal of low-abundance windows done with the following cammands.

filter.stat <- filterWindows(win.data, bins, type="global")
min.fc <- 3
keep <- filter.stat$filter > log2(min.fc)
summary(keep)
##    Mode   FALSE    TRUE 
## logical  358754   53945
filtered.data <- win.data[keep,]

1.3 Model

Here we will use the edgeR to test for differences between the cell types. So we set up an design matrix in a similar way as was done for the RNA-seq data.

celltype <- factor(c("Neural", "Neural", "SKNSH", "SKNSH"))
designChip <- model.matrix(~0 + celltype)
colnames(designChip) <- levels(celltype)

We are now ready for estimating dispersion parameters.

yChip <- asDGEList(filtered.data, norm.factors =
                     normfacs$norm.factors)
yChip <- estimateDisp(yChip, designChip)
summary(yChip$trended.dispersion)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1502  0.1998  0.2170  0.2157  0.2344  0.3648

We can now check weather or not the celltypes do look different by using a MDS plot.

plotMDS(cpm(yChip, log=TRUE), top=10000, labels = celltype,
    col=c("tomato3", "blue")[as.integer(celltype)])

1.4 Testing for Differential binding in the two cell types

We first run run the linear model

fitChip <- glmQLFit(yChip, designChip, robust=TRUE)

Then we specify the contrast of interest and test for differential binding between cell types. Since there can be a very large number of windows identified one can merge signals from adjacent windows in an attempt to reduce the the need for multiple corrections.

contrast <- makeContrasts(Neural-SKNSH, levels = designChip)
resChip <- glmQLFTest(fitChip, contrast=contrast)
merged <- mergeWindows(rowRanges(filtered.data), tol=100, max.width=5000)
tabcom <- combineTests(merged$id, resChip$table)
tabbest <- getBestTest(merged$id, resChip$table)
is.sig <- tabcom$FDR <= 0.05
summary(is.sig)
##    Mode   FALSE    TRUE 
## logical    5038    5790

1.5 Plot an example region

We can then rearrange and summarize data and have a go at plotting the patterns for a region that are different between cells. We make sure to extract the mid regions in best candidates

out.ranges <- merged$region
elementMetadata(out.ranges) <- data.frame(tabcom,
     best.pos=mid(ranges(rowRanges(filtered.data[tabbest$best]))),
     best.logFC=tabbest$logFC)

We read in two packages with annotation information.

library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

We can then extract our top candidate regions

o <- order(out.ranges$PValue)
cur.region <- out.ranges[o[1]]
cur.region
## GRanges object with 1 range and 8 metadata columns:
##       seqnames            ranges strand |  nWindows  logFC.up logFC.down
##          <Rle>         <IRanges>  <Rle> | <integer> <integer>  <integer>
##   [1]     chr2 75810051-75810210      * |         4         0          4
##                     PValue                  FDR   direction  best.pos
##                  <numeric>            <numeric> <character> <integer>
##   [1] 2.74771152494641e-07 0.000149807807342976        down  75810155
##              best.logFC
##               <numeric>
##   [1] -8.47796941498082
##   -------
##   seqinfo: 2 sequences from an unspecified genome
anno <- detailRanges(out.ranges, orgdb=org.Hs.eg.db,
     txdb=TxDb.Hsapiens.UCSC.hg19.knownGene)

gax <- GenomeAxisTrack(col="black", fontsize=10, size=2)
g_reg <- GeneRegionTrack(TxDb.Hsapiens.UCSC.hg19.knownGene,
                         showId=TRUE, geneSymbol=TRUE, name="",
                         background.title="transparent")

symbols <- unlist(mapIds(org.Hs.eg.db, gene(g_reg), "SYMBOL",
     "ENTREZID", multiVals = "first"))
symbol(g_reg) <- symbols[gene(g_reg)]
collected <- list()
lib.sizes <- filtered.data$totals/1e6
for (i in 1:length(bams)) {
    reads <- extractReads(bam.file=bams[i], cur.region, param=param)
    pcov <- as(coverage(reads[strand(reads)=="+"])/lib.sizes[i],
               "GRanges")
    ncov <- as(coverage(reads[strand(reads)=="-"])/-lib.sizes[i],
               "GRanges")
    ptrack <- DataTrack(pcov, type="histogram",  lwd=0, 
                        ylim=c(-5, 5), 
                        name=bams[i], 
                        col.axis="black", 
                        col.title="black",
                        fill="blue", 
                        col.histogram=NA)
    ntrack <- DataTrack(ncov, type="histogram", 
                        lwd=0, ylim=c(-5, 5),
                        fill="red", col.histogram=NA)
    collected[[i]] <- OverlayTrack(trackList=list(ptrack, ntrack))
}
plotTracks(c(gax, collected, g_reg), chromosome=as.character(seqnames(cur.region)),
           from=start(cur.region)-1000, to=end(cur.region)+1000)

This is a crude plot that does not properly show the name of the bam files (why?) nor do we plot the input. Try to modify the plot to include more details and modify it according to your own preferences and plot other identified peaks and see if there are any that are closer to known transcripts.

1.6 Further reading

The csaw approach is a R solution for doing Chip seq. analysis. It is a flexible, but there are many tools both outside and inside R that are well suited of designed for analysis of Chip sequence data.

Below is a few of tools that are often deployed. The last three are bioconductor packages. - Phantompeakqualtools - Deeptools - ChIPQC - ChIPpeakAnno - DiffBind