The biomaRt package provides an interface to databases implementing the BioMart software suite, e.g. Ensembl and Uniprot. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. The Ensembl database holds a lot of data on genome sequences and annotations. Take a look at http://www.ensembl.org/biomart/martview/ to get an idea of what can be downloaded from Ensembl.


1 biomaRt in summary

Working with biomaRt typically consists of these 3 steps:

  1. Choose a database (useMart)
  2. Choose a dataset (useDataset)
  3. Query datasets (getBM)

First, make sure that the biomaRt package is installed and loaded. In the code examples below, you will also need the dplyr package.

source("https://bioconductor.org/biocLite.R")
biocLite("biomaRt")
library(biomaRt)
library(tidyverse)

1.1 Databases

List available databases and choose the mart corresponding to “Ensembl Genes 92”.

listMarts()
ensembl = useMart("ENSEMBL_MART_ENSEMBL")

1.2 Datasets

List available datasets. In Ensembl each species is a separate dataset. Let’s start by looking at the human genes.

listDatasets(ensembl)
ensembl_mart = useDataset("hsapiens_gene_ensembl",mart=ensembl)

1.3 Query function

getBM() is the main function for querying datasets. These queries consist of attributes, filters and filter values.

  • filters: These are used to restrict your query (e.g. chromosomal location or gene ids). You can provide as many or as few filters as you want.
  • values: These are the filter values you want retrieved (e.g. chromosme 11). You should provide as many values as filters.
  • attributes: What information do you want to retrieve (e.g. chromosome, gene name, description)?

1.3.1 Attributes

Use listAttributes() to view all available attributes

attributes = listAttributes(ensembl_mart)
head(attributes)

1.3.2 Filters

Use listFilters() to view all filter options

filters = listFilters(ensembl_mart)
head(filters)

You can see that there are a lot of attributes to fetch and a lot of options for filtering. If you feel lost among these, it might help to use grep() to search for the ones that are useful for you. For example, if you are intrested in the fruitfly (Drosophila melanogaster) homologs of the human genes, you can look for all attributes with “melanogaster” in the name:

grep("melanogaster", attributes$name, value=TRUE)
##  [1] "dmelanogaster_homolog_ensembl_gene"                
##  [2] "dmelanogaster_homolog_associated_gene_name"        
##  [3] "dmelanogaster_homolog_ensembl_peptide"             
##  [4] "dmelanogaster_homolog_chromosome"                  
##  [5] "dmelanogaster_homolog_chrom_start"                 
##  [6] "dmelanogaster_homolog_chrom_end"                   
##  [7] "dmelanogaster_homolog_canonical_transcript_protein"
##  [8] "dmelanogaster_homolog_subtype"                     
##  [9] "dmelanogaster_homolog_orthology_type"              
## [10] "dmelanogaster_homolog_perc_id"                     
## [11] "dmelanogaster_homolog_perc_id_r1"                  
## [12] "dmelanogaster_homolog_goc_score"                   
## [13] "dmelanogaster_homolog_wga_coverage"                
## [14] "dmelanogaster_homolog_dn"                          
## [15] "dmelanogaster_homolog_ds"                          
## [16] "dmelanogaster_homolog_orthology_confidence"

1.3.3 Filter options

Some filters have a limited set of values. For example, boolean filters take TRUE or FALSE. For other filters the function filterOptions() can be used to get all possible values.

Find all valid options for the filter ‘biotype’.

filterOptions('biotype',ensembl_mart)
## [1] "[3prime_overlapping_ncRNA,antisense,bidirectional_promoter_lncRNA,IG_C_gene,IG_C_pseudogene,IG_D_gene,IG_J_gene,IG_J_pseudogene,IG_pseudogene,IG_V_gene,IG_V_pseudogene,lincRNA,macro_lncRNA,miRNA,misc_RNA,Mt_rRNA,Mt_tRNA,non_coding,polymorphic_pseudogene,processed_pseudogene,processed_transcript,protein_coding,pseudogene,ribozyme,rRNA,scaRNA,scRNA,sense_intronic,sense_overlapping,snoRNA,snRNA,sRNA,TEC,transcribed_processed_pseudogene,transcribed_unitary_pseudogene,transcribed_unprocessed_pseudogene,translated_processed_pseudogene,TR_C_gene,TR_D_gene,TR_J_gene,TR_J_pseudogene,TR_V_gene,TR_V_pseudogene,unitary_pseudogene,unprocessed_pseudogene,vaultRNA]"

2 Building queries

Now we can put all this together to build queries using the function getBM(attributes = “”, filters = “”, values = “”, mart,…).

2.0.1 Gene annotation

Let’s first try to find the chromosomal position of the gene SRC.

getBM(attributes = c("chromosome_name", "start_position", "end_position"), filters = c("hgnc_symbol"), values = c("SRC"), mart = ensembl_mart)

You may also use getBM() to annotate all genes in a list with for example gene name, location and description.

q_genes <- c("ENSG00000197122", "ENSG00000182866")
gene_annot <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol","chromosome_name","band","strand","start_position","end_position","description", "gene_biotype"), filters = "ensembl_gene_id", values = q_genes, mart = ensembl_mart)
gene_annot

2.0.2 Gene ontology annotations

Next, we want to find all genes with the Gene Ontology (GO) annotation TOR complex (GO:0038201). First, we search for filters related to “go”:

grep("go", filters$name, value=TRUE)
## [1] "with_go"               "with_goslim_goa"       "go"                   
## [4] "goslim_goa"            "go_parent_term"        "go_parent_name"       
## [7] "go_evidence_code"      "with_ggorilla_homolog"

Let’s try to use the filter “go”:

tor_table <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol", "go_id"), filters = "go", values = "GO:0038201", mart = ensembl_mart) 
tor_table

How many genes were found? Note that the “go” filter will only give the genes annoatated to exactly this Gene Ontology term. If we want to find all genes annotated to the given term or any of the child terms, we instead use “go_parent_term”. This query takes a bit longer to run.

tor_table2 <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol", "go_id"), filters = "go_parent_term", values = "GO:0038201", mart = ensembl_mart) 
tor_table2

We can also try to retreive the GO terms (id + name) associated with the genes in our list. Let’s say that we are only interested in the GO domain “Biological Process” (see attribute namespace_1003).

go_table <- getBM( attributes = c("external_gene_name", "go_id","name_1006", "namespace_1003"), filters = "ensembl_gene_id", values = q_genes, mart = ensembl_mart) %>% filter(namespace_1003 == "biological_process")
go_table

2.1 Multiple filters

You can combine several filters in the same query. Note that the values should be a list of vectors. Search for all protein coding genes on chromosome Y that have an ortholog in fruit fly. It can be nice to sort the genes by start position.

orth_table <- getBM( attributes = c("hgnc_symbol", "chromosome_name", "start_position", "end_position"), filters = c("chromosome_name", "biotype", "with_dmelanogaster_homolog"), values = list("Y", "protein_coding", TRUE), mart = ensembl_mart) %>% arrange(start_position)
orth_table

2.2 Get sequences

BiomaRt can also be used to access sequence data. To find the sequence options, look at the “sequences” page of listAttributes().

#pages = attributePages(ensembl_mart)
listAttributes(ensembl_mart, page = "sequences")

2.2.1 Get sequences using getBM

We will first use getBM() to retrieve the cDNA sequences of the genes in q_genes.

seq <- getBM(attributes = c("ensembl_gene_id","cdna"), filters = "ensembl_gene_id", values = q_genes, mart = ensembl_mart)
seq[,c("ensembl_gene_id", "cdna")]

Note that you get several sequences per gene. These represent the different transcript isoforms. Add the transcript ids to the output to see this.

2.2.2 The getSequence function

It is also possible to use a wrapper function, getSequence(), to fetch the sequences.

Try to get the same sequences using this function. Valid seqType arguments can be found in the help for getSequence. We can also order them according to gene id.

seq <- getSequence(id = q_genes, type= "ensembl_gene_id", seqType = "cdna", mart = ensembl_mart) %>% arrange(ensembl_gene_id)
seq[,c("ensembl_gene_id", "cdna")] 

Next, we want the 100 bp upstream promoter sequences of the q_genes.

seq <- getSequence(id = q_genes, type = "ensembl_gene_id", seqType = "coding_gene_flank", upstream = 100, mart = ensembl_mart)
seq[,c("ensembl_gene_id", "coding_gene_flank")]

This function can also be used with chromosome positions. As an example, get the peptide sequences of Ensembl genes in: chr1:32251239-32286165. Note that you have to use type even if you filter on position.

seq <- getSequence(chromosome = gene_annot$chromosome_name[1], start = gene_annot$start_position[1], end = gene_annot$end_position[1], type = "ensembl_gene_id", seqType = "peptide", mart = ensembl_mart)
seq[,c("ensembl_gene_id", "peptide")]

If there is no sequence of this type it may be listed as “Sequence unavailable”.

2.2.3 Export to fasta

Sequences can be exported to file in fasta format by exportFASTA(). Export the sequences from the previous exercise. It may be wise to exclude entries with “Sequence unavailable”.

exportFASTA(seq[seq$peptide != "Sequence unavailable",],"myFastaFile.fa")

Note that if the file already exists, this command will add new sequences to the existing ones.

2.3 A real example

A typical use of the package is when you get a list of differentially expressed genes that you want to annotate. To try this out, load the file called DE_table. This file contains an example of what the output from a differential expression analysis (using EdgeR) can look like. You can also use your own data if you have any.

The dataset can be downloaded here.

de_tab <- read.table("DE_table.txt", sep="\t", header=TRUE, as.is=TRUE)

Annotate the genes in this list and merge the annotations with the original data. Try it out before looking at the code example!

de_gene_annot <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol","chromosome_name","strand","start_position","end_position","description", "gene_biotype"), filters = "ensembl_gene_id", values = de_tab$ensembl_gene_id, mart = ensembl_mart)
merge(de_tab, de_gene_annot, by = "ensembl_gene_id")

3 Other datasets

3.1 Variants

Now we will try a different database (“ENSEMBL_MART_SNP”) containing genetic variants. Select the mart and dataset, this can be done in two steps as before, or using a single command. Then use listFilters() and listAttributes() to see the filters and attributes available for this dataset.

#ensembl_snp = useMart("ENSEMBL_MART_SNP")
#snp_mart = useDataset("hsapiens_snp",mart=ensembl_snp)
snp_mart = useMart(biomart = "ENSEMBL_MART_SNP", dataset="hsapiens_snp")

Retrieve all common (minor allele frequency >= 0.01) nonsynonymous SNPs in the genes above (q_genes) and find at least the variant names, positions and consequences.

snps <- getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand','consequence_type_tv',"minor_allele_freq", "associated_gene"), 
      filters = c('ensembl_gene','minor_allele_freq_second'), 
      values = list(q_genes,'0.01'), 
      mart = snp_mart) %>% 
  filter(consequence_type_tv == "missense_variant")

snps

3.2 Linking datasets

It is also possible to link information between different datasets, e.g. to find orthologs between species. To do this you access two datasets at once, called the primary and the linked datasets. The function uses attributes, filters and values to query each dataset.

human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
mouse = useMart("ensembl", dataset = "mmusculus_gene_ensembl")

getLDS(attributes = c("hgnc_symbol","chromosome_name", "start_position"),
       filters = "hgnc_symbol", values = "SRC",mart = human,
      attributesL = c("mgi_id","chromosome_name","start_position"), martL = mouse)

4 Session Info

  • This document has been created in RStudio using R Markdown and related packages.
  • For R Markdown, see http://rmarkdown.rstudio.com
  • For details about the OS, packages and versions, see detailed information below:
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows >= 8 x64 (build 9200)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] biomaRt_2.34.2             forcats_0.3.0             
##  [3] stringr_1.3.1              dplyr_0.7.5               
##  [5] purrr_0.2.5                readr_1.1.1               
##  [7] tidyr_0.8.1                tibble_1.4.2              
##  [9] tidyverse_1.2.1            captioner_2.2.3           
## [11] bookdown_0.7               knitr_1.20                
## [13] DESeq2_1.18.1              shiny_1.0.5               
## [15] scater_1.6.3               SingleCellExperiment_1.0.0
## [17] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
## [19] matrixStats_0.53.1         GenomicRanges_1.30.3      
## [21] GenomeInfoDb_1.14.0        IRanges_2.12.0            
## [23] S4Vectors_0.16.0           Biobase_2.38.0            
## [25] BiocGenerics_0.24.0        bindrcpp_0.2.2            
## [27] gridExtra_2.3              Seurat_2.3.1              
## [29] Matrix_1.2-14              cowplot_0.9.2             
## [31] ggplot2_2.2.1             
## 
## loaded via a namespace (and not attached):
##   [1] prabclus_2.2-6         ModelMetrics_1.1.0     R.methodsS3_1.7.1     
##   [4] acepack_1.4.1          bit64_0.9-7            irlba_2.3.2           
##   [7] R.utils_2.6.0          data.table_1.11.4      rpart_4.1-13          
##  [10] RCurl_1.95-4.10        metap_0.9              snow_0.4-2            
##  [13] RSQLite_2.1.1          RANN_2.5.1             VGAM_1.0-5            
##  [16] proxy_0.4-22           bit_1.1-14             xml2_1.2.0            
##  [19] lubridate_1.7.4        httpuv_1.4.3           assertthat_0.2.0      
##  [22] viridis_0.5.1          gower_0.1.2            xfun_0.1              
##  [25] tximport_1.6.0         hms_0.4.2              evaluate_0.10.1       
##  [28] promises_1.0.1         DEoptimR_1.0-8         progress_1.1.2        
##  [31] caTools_1.17.1         readxl_1.1.0           igraph_1.2.1          
##  [34] DBI_1.0.0              geneplotter_1.56.0     htmlwidgets_1.2       
##  [37] ddalpha_1.3.3          backports_1.1.2        trimcluster_0.1-2     
##  [40] annotate_1.56.2        Cairo_1.5-9            ROCR_1.0-7            
##  [43] abind_1.4-5            caret_6.0-80           withr_2.1.2           
##  [46] sfsmisc_1.1-2          robustbase_0.93-0      checkmate_1.8.5       
##  [49] prettyunits_1.0.2      mclust_5.4             mnormt_1.5-5          
##  [52] cluster_2.0.7-1        ape_5.1                diffusionMap_1.1-0    
##  [55] segmented_0.5-3.0      lazyeval_0.2.1         crayon_1.3.4          
##  [58] genefilter_1.60.0      edgeR_3.20.9           recipes_0.1.2         
##  [61] pkgconfig_2.0.1        labeling_0.3           nlme_3.1-137          
##  [64] vipor_0.4.5            nnet_7.3-12            bindr_0.1.1           
##  [67] rlang_0.2.1            diptest_0.75-7         doSNOW_1.0.16         
##  [70] modelr_0.1.2           cellranger_1.1.0       rprojroot_1.3-2       
##  [73] lmtest_0.9-36          zoo_1.8-2              base64enc_0.1-3       
##  [76] beeswarm_0.2.3         ggridges_0.5.0         png_0.1-7             
##  [79] viridisLite_0.3.0      rjson_0.2.20           bitops_1.0-6          
##  [82] shinydashboard_0.7.0   R.oo_1.22.0            KernSmooth_2.23-15    
##  [85] blob_1.1.1             DRR_0.0.3              lars_1.2              
##  [88] scales_0.5.0.9000      memoise_1.1.0          magrittr_1.5          
##  [91] plyr_1.8.4             ica_1.0-2              gplots_3.0.1          
##  [94] gdata_2.18.0           zlibbioc_1.24.0        compiler_3.4.3        
##  [97] RColorBrewer_1.1-2     dimRed_0.1.0           fitdistrplus_1.0-9    
## [100] cli_1.0.0              dtw_1.20-1             XVector_0.18.0        
## [103] pbapply_1.3-4          htmlTable_1.12         magic_1.5-8           
## [106] Formula_1.2-3          MASS_7.3-50            tidyselect_0.2.4      
## [109] stringi_1.2.3          yaml_2.1.19            locfit_1.5-9.1        
## [112] latticeExtra_0.6-28    grid_3.4.3             tools_3.4.3           
## [115] rstudioapi_0.7         foreach_1.4.4          foreign_0.8-70        
## [118] prodlim_2018.04.18     scatterplot3d_0.3-41   Rtsne_0.13            
## [121] digest_0.6.15          FNN_1.1                lava_1.6.1            
## [124] fpc_2.1-11             Rcpp_0.12.17           broom_0.4.4           
## [127] SDMTools_1.1-221       later_0.7.3            httr_1.3.1            
## [130] AnnotationDbi_1.40.0   psych_1.8.4            kernlab_0.9-26        
## [133] colorspace_1.3-2       rvest_0.3.2            XML_3.98-1.11         
## [136] ranger_0.10.1          reticulate_1.8         CVST_0.2-2            
## [139] splines_3.4.3          RcppRoll_0.3.0         flexmix_2.3-14        
## [142] xtable_1.8-2           jsonlite_1.5           geometry_0.3-6        
## [145] timeDate_3043.102      modeltools_0.2-21      ipred_0.9-6           
## [148] tclust_1.4-1           R6_2.2.2               Hmisc_4.1-1           
## [151] pillar_1.2.3           htmltools_0.3.6        mime_0.5              
## [154] glue_1.2.0             BiocParallel_1.12.0    class_7.3-14          
## [157] codetools_0.2-15       tsne_0.1-3             mvtnorm_1.0-8         
## [160] lattice_0.20-35        mixtools_1.1.0         curl_3.2              
## [163] ggbeeswarm_0.6.0       gtools_3.5.0           survival_2.42-3       
## [166] limma_3.34.9           rmarkdown_1.10         munsell_0.5.0         
## [169] rhdf5_2.22.0           GenomeInfoDbData_1.0.0 iterators_1.0.9       
## [172] haven_1.1.1            reshape2_1.4.3         gtable_0.2.0

Page built on: 16-Jun-2018 at 14:12:02.


2018 | SciLifeLab > NBIS > RaukR website twitter