1 Genomic data

Reference genomic data for your projects are available from Ensembl. This is usually the latest build of the genome, transcriptome etc as well as the annotations in GTF or GFF format. Most common organisms are available from ensembl.org. You can select the organism and then click on Download FASTA/Download GTF/GFF which takes you to the FTP site.

You can also go directly to their FTP site ftp://ftp.ensembl.org/pub/release-96 where you can select the type of data you need, and then select the organism. For eg; homo_sapiens, under which you find cdna, cds, dna, dna_index, ncrna and pep. Under dna, the FASTA files are available as full genome or as separate chromosomes. Each of them are again available as regular (repeat content as normal bases), soft-masked (sm, repeat content in lowercase) or repeat-masked (rm, repeat content as Ns). Full genomes are also available as primary assembly or top-level. Primary assembly is what most people would need. The top-level is much larger in size and contains non-chromosomal contigs, patches, haplotypes etc. This is significantly larger in size compared to the primary assembly.

Clades such as metazoa, protists, bacteria, fungi and plants are available through separate ensembl websites. These are listed on http://ensemblgenomes.org/.

2 Biomart

2.1 Genes

In this section, we will download annotation data using R package biomaRt. Annotations refer to known features (verified experimentally or predicted) in the genome. Usually, our features of interest in RNA-Seq are genes, their IDs, position in the genome, gene biotype (protein coding, non-coding etc) etc. We will also use the dplyr package to pipe data through functions.

R
library(biomaRt)
library(dplyr)

listMarts()
##                biomart               version
## 1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 96
## 2   ENSEMBL_MART_MOUSE      Mouse strains 96
## 3     ENSEMBL_MART_SNP  Ensembl Variation 96
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 96

If this returns an error due to the ensembl server being down, you can also try listMarts(host="uswest.ensembl.org").

We will use the code below to find the name of the Human ensembl genes dataset under ensembl mart.

R
mart <- useMart("ENSEMBL_MART_ENSEMBL")
ds <- as.data.frame(listDatasets(mart=mart))

# find all rows in dataset 'ds' where column 'description' contains the string 'human'
ds[grepl("human",tolower(ds$description)),]
##                  dataset              description    version
## 73 hsapiens_gene_ensembl Human genes (GRCh38.p12) GRCh38.p12

Now that we know the name of the dataset, we can list all the columns (filters) in this dataset.

R
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset(mart=mart,dataset="hsapiens_gene_ensembl")
la <- listAttributes(mart=mart)
head(la)
##                            name                  description         page
## 1               ensembl_gene_id               Gene stable ID feature_page
## 2       ensembl_gene_id_version       Gene stable ID version feature_page
## 3         ensembl_transcript_id         Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5            ensembl_peptide_id            Protein stable ID feature_page
## 6    ensembl_peptide_id_version    Protein stable ID version feature_page

One can also search for attributes of interest.

R
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset(mart=mart,dataset="hsapiens_gene_ensembl")
searchAttributes(mart=mart,pattern="entrez")
##                     name                   description         page
## 55 entrezgene_trans_name EntrezGene transcript name ID feature_page
## 74            entrezgene                  NCBI gene ID feature_page

We create a vector of our columns of interest.

R
myattributes <- c("ensembl_gene_id",
                  "entrezgene",
                  "external_gene_name",
                  "chromosome_name",
                  "start_position",
                  "end_position",
                  "strand",
                  "gene_biotype",
                  "description")

We then use this to download our data. Note that this can be a slow step.

R
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset(mart=mart,dataset="hsapiens_gene_ensembl")
bdata <- getBM(mart=mart,attributes=myattributes,uniqueRows=T)
head(bdata)
  ensembl_gene_id entrezgene external_gene_name chromosome_name start_position
1 ENSG00000210049         NA              MT-TF              MT            577
2 ENSG00000211459         NA            MT-RNR1              MT            648
3 ENSG00000210077         NA              MT-TV              MT           1602
4 ENSG00000210082         NA            MT-RNR2              MT           1671
5 ENSG00000209082         NA             MT-TL1              MT           3230
6 ENSG00000198888       4535             MT-ND1              MT           3307
  end_position strand   gene_biotype
1          647      1        Mt_tRNA
2         1601      1        Mt_rRNA
3         1670      1        Mt_tRNA
4         3229      1        Mt_rRNA
5         3304      1        Mt_tRNA
6         4262      1 protein_coding
                                                                                               description
1                              mitochondrially encoded tRNA-Phe (UUU/C) [Source:HGNC Symbol;Acc:HGNC:7481]
2                                       mitochondrially encoded 12S RNA [Source:HGNC Symbol;Acc:HGNC:7470]
3                                mitochondrially encoded tRNA-Val (GUN) [Source:HGNC Symbol;Acc:HGNC:7500]
4                                       mitochondrially encoded 16S RNA [Source:HGNC Symbol;Acc:HGNC:7471]
5                            mitochondrially encoded tRNA-Leu (UUA/G) 1 [Source:HGNC Symbol;Acc:HGNC:7490]
6 mitochondrially encoded NADH:ubiquinone oxidoreductase core subunit 1 [Source:HGNC Symbol;Acc:HGNC:7455]

We find that there are several duplicates for all the IDs. This needs to be fixed when this information is to be used downstream.

R
sum(duplicated(bdata$ensembl_gene_id))
sum(duplicated(bdata$entrezgene))
sum(duplicated(bdata$external_gene_name))
252
45751
6417
R
# arrange table by chr name and start position
bdata <- dplyr::arrange(bdata,chromosome_name,start_position)
write.table(bdata,"./data/human_genes.txt",row.names=F,quote=F,col.names=T,sep="\t")

2.2 Transcript

Here we download transcript to gene mappings. Notice that we can specify the mart and dataset in the useMart() function.

R
mart <- useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl")
t2g <- getBM(attributes=c("ensembl_transcript_id","ensembl_gene_id","external_gene_name"),mart=mart)
write.table(t2g,"./data/human_transcripts.txt",row.names=F,quote=F,col.names=T,sep="\t")

The transcipt information file is saved to a file and will be used in the lab on Kallisto.

2.3 Gene ontology

Similarly, we can get entrez gene ID to GO ID relationships. List all the GO related filters:

R
mart <- biomaRt::useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl")
lf <- listFilters(mart=mart)

# find all rows in dataset 'lf' where column 'name' contains the string 'go'
lf[grepl("go",tolower(lf$name)),]
                   name                        description
1               with_go                      With GO ID(s)
2       with_goslim_goa              With GOSlim GOA ID(s)
3                    go         GO ID(s) [e.g. GO:0000002]
4            goslim_goa GOSlim GOA ID(s) [e.g. GO:0000003]
5        go_parent_term              Parent term accession
6        go_parent_name                   Parent term name
7      go_evidence_code                   GO Evidence code
8   with_cdingo_homolog            Orthologous Dingo Genes
9 with_ggorilla_homolog          Orthologous Gorilla Genes
R
mart <- biomaRt::useMart(biomart="ensembl",dataset="hsapiens_gene_ensembl")
bdata <- getBM(mart=mart,attributes=c("entrezgene","go","go_evidence_code"),uniqueRows=T)
write.table(bdata,"./data/go.txt",row.names=F,quote=F,col.names=T,sep="\t")

2.4 ID conversion

We can also take a quick look at converting IDs. It is often desirable to convert a certain gene identifier to another (ensembl gene ID, entrez gene ID, gene ID). Sometimes, it may be necessary to convert gene IDs of one organism to another. biomaRt has a convenient function for this called getLDS().

Here is an example where we convert a few mouse ensembl IDs to Human Hugo gene IDs.

R
mouse_genes <- c("ENSMUSG00000035847","ENSMUSG00000000214")
mouse <- useMart("ensembl",dataset="mmusculus_gene_ensembl")
human <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
human_genes <- getLDS(attributes=c("ensembl_gene_id"),filters="ensembl_gene_id",values=mouse_genes,mart=mouse, attributesL=c("hgnc_symbol"),martL=human,valuesL="hgnc_symbol",uniqueRows=F)[,1]
      Gene.stable.ID HGNC.symbol
1 ENSMUSG00000000214          TH
2 ENSMUSG00000035847         IDS

3 Session info

Output
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     datasets  grDevices utils     graphics 
## [8] methods   base     
## 
## other attached packages:
##  [1] pheatmap_1.0.12             edgeR_3.24.3               
##  [3] limma_3.38.3                DESeq2_1.22.2              
##  [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
##  [7] BiocParallel_1.16.6         matrixStats_0.54.0         
##  [9] Biobase_2.42.0              GenomicRanges_1.34.0       
## [11] GenomeInfoDb_1.18.2         IRanges_2.16.0             
## [13] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## [15] biomaRt_2.39.3              ggplot2_3.1.1              
## [17] dplyr_0.8.0.1               leaflet_2.0.2              
## [19] captioner_2.2.3             bookdown_0.9               
## [21] knitr_1.22                 
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137           bitops_1.0-6           bit64_0.9-7           
##  [4] RColorBrewer_1.1-2     progress_1.2.0         httr_1.4.0            
##  [7] tools_3.5.2            backports_1.1.4        R6_2.4.0              
## [10] rpart_4.1-13           mgcv_1.8-26            Hmisc_4.2-0           
## [13] DBI_1.0.0              lazyeval_0.2.2         colorspace_1.4-1      
## [16] nnet_7.3-12            withr_2.1.2            tidyselect_0.2.5      
## [19] gridExtra_2.3          prettyunits_1.0.2      curl_3.3              
## [22] bit_1.1-14             compiler_3.5.2         htmlTable_1.13.1      
## [25] labeling_0.3           scales_1.0.0           checkmate_1.9.1       
## [28] genefilter_1.64.0      stringr_1.4.0          digest_0.6.18         
## [31] foreign_0.8-71         rmarkdown_1.12         XVector_0.22.0        
## [34] base64enc_0.1-3        pkgconfig_2.0.2        htmltools_0.3.6       
## [37] htmlwidgets_1.3        rlang_0.3.4            rstudioapi_0.10       
## [40] RSQLite_2.1.1          shiny_1.3.2            jsonlite_1.6          
## [43] crosstalk_1.0.0        acepack_1.4.1          RCurl_1.95-4.12       
## [46] magrittr_1.5           GenomeInfoDbData_1.2.0 Formula_1.2-3         
## [49] Matrix_1.2-15          Rcpp_1.0.1             munsell_0.5.0         
## [52] stringi_1.4.3          yaml_2.2.0             zlibbioc_1.28.0       
## [55] plyr_1.8.4             grid_3.5.2             blob_1.1.1            
## [58] promises_1.0.1         crayon_1.3.4           lattice_0.20-38       
## [61] cowplot_0.9.4          splines_3.5.2          annotate_1.60.1       
## [64] hms_0.4.2              locfit_1.5-9.1         pillar_1.3.1          
## [67] ggpubr_0.2             geneplotter_1.60.0     XML_3.98-1.19         
## [70] glue_1.3.1             evaluate_0.13          latticeExtra_0.6-28   
## [73] data.table_1.12.2      httpuv_1.5.1           tidyr_0.8.3           
## [76] gtable_0.3.0           purrr_0.3.2            assertthat_0.2.1      
## [79] xfun_0.6               mime_0.6               xtable_1.8-3          
## [82] later_0.8.0            survival_2.43-3        tibble_2.1.1          
## [85] AnnotationDbi_1.44.0   memoise_1.1.0          cluster_2.0.7-1