R is a programming language for statistical computing, and data wrangling. It is open-source, widely used in data science, has a wide range of functions and algorithms for graphing and data analyses.

We will use R in this course for generating plots from different biological data that are of higher quality and standard for publications. To do this, you will brush-up your memory on some important aspects of R that are important for this course below:

Before starting with the lab sessions of the entire course, you must have downloaded all the necessary files from [link removed] and make sure that the directory tree looks similar to the one displayed in that page

In that case, you can proceed with the exercise now and remember to have fun :)

1 Input/Output

Input and output of data and images is an important aspect with data analysis.

1.1 Text

Data can come in a variety of formats which needs to be read into R and converted to an R data type.

Text files are the most commonly used input. Text files can be read in using the function read.table. We have a sample file to use: metadata.csv.

dfr <- read.table("data/metadata.csv",sep = ";", header=TRUE,stringsAsFactors=F)

This reads in a tab-delimited text file with a header. The argument sep='\t' is set by default to specify that the delimiter is a tab. stringsAsFactors=F setting ensures that character columns are not automatically converted to factors.

It’s always a good idea to check the data after import.

head(dfr)
output
##   Sample_ID Sample_Name Time Replicate Cell
## 1  Sample_1        t0_A   t0         A A431
## 2  Sample_2        t0_B   t0         B A431
## 3  Sample_3        t0_C   t0         C A431
## 4  Sample_4        t2_A   t2         A A431
## 5  Sample_5        t2_B   t2         B A431
## 6  Sample_6        t2_C   t2         C A431
str(dfr)
output
## 'data.frame':    12 obs. of  5 variables:
##  $ Sample_ID  : chr  "Sample_1" "Sample_2" "Sample_3" "Sample_4" ...
##  $ Sample_Name: chr  "t0_A" "t0_B" "t0_C" "t2_A" ...
##  $ Time       : chr  "t0" "t0" "t0" "t2" ...
##  $ Replicate  : chr  "A" "B" "C" "A" ...
##  $ Cell       : chr  "A431" "A431" "A431" "A431" ...

Check ?read.table for other wrapper functions to read in text files.

Let’s filter this data.frame and create a new data set.

dfr1 <- dfr[dfr$Time == "t0",]

And we can write this as a text file.

write.table(dfr1,"iris-setosa.txt",sep="\t",row.names=F,quote=F)

sep="\t" sets the delimiter to tab. row.names=F denotes that rownames should not be written. quote=F specifies that doubles must not be placed around strings.

2 Data-Frames

You have probably learnt about data.frame in your previous course and this is the most important data structure for generating plots using ggplot. So, below you have some quick exercises on how to work with Data-Frames.

Vectors positions can be accessed using []. R follows 1-based indexing, meaning that the indexing starts at 1.

Data-frame or matrix positions can be accessed using [] specifying row and column like [row,column].

dfr <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr
dfr[1,]
dfr[,1]
dfr[2,2]
output
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
##   x y
## 1 1 a
## [1] 1 2 3
## [1] "b"

The function cbind() is used to join two data-frames column-wise.

dfr1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr2 <- data.frame(p = 4:6, q = c("d", "e", "f"))
dfr1
dfr2

cbind(dfr1,dfr2)
output
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
##   p q
## 1 4 d
## 2 5 e
## 3 6 f
##   x y p q
## 1 1 a 4 d
## 2 2 b 5 e
## 3 3 c 6 f

Similarly, rbind() is used to join two data-frames row-wise.

dfr1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr2 <- data.frame(x = 4:6, y = c("d", "e", "f"))
dfr1
dfr2

rbind(dfr1,dfr2)
output
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
##   x y
## 1 4 d
## 2 5 e
## 3 6 f
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f

Two data-frames can be merged based on a shared column using the merge() function.

dfr1 <- data.frame(x = 1:4, p = c("a", "b", "c","d"))
dfr2 <- data.frame(x = 3:6, q = c("l", "m", "n","o"))
dfr1
dfr2

merge(dfr1,dfr2,by="x")
merge(dfr1,dfr2,by="x",all.x=T)
merge(dfr1,dfr2,by="x",all.y=T)
merge(dfr1,dfr2,by="x",all=T)
output
##   x p
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
##   x q
## 1 3 l
## 2 4 m
## 3 5 n
## 4 6 o
##   x p q
## 1 3 c l
## 2 4 d m
##   x p    q
## 1 1 a <NA>
## 2 2 b <NA>
## 3 3 c    l
## 4 4 d    m
##   x    p q
## 1 3    c l
## 2 4    d m
## 3 5 <NA> n
## 4 6 <NA> o
##   x    p    q
## 1 1    a <NA>
## 2 2    b <NA>
## 3 3    c    l
## 4 4    d    m
## 5 5 <NA>    n
## 6 6 <NA>    o

3 Data formats

In terms of the R (and other similar programming languages), the data can be viewed or stored in two main formats! They are called wide and long formats. Below you will see what exactly they stand for and why is it important for plotting in ggplot.

3.1 Wide format

A quick preview:

Counts Table

Sample_1 Sample_2 Sample_3 Sample_4
ENSG00000000003 321 303 204 492
ENSG00000000005 0 0 0 0
ENSG00000000419 696 660 472 951
ENSG00000000457 59 54 44 109
ENSG00000000460 399 405 236 445
ENSG00000000938 0 0 0 0

And we usually have our metadata related to the samples in another table like below:

Metadata Table

Sample_ID Sample_Name Time Replicate Cell
Sample_1 t0_A t0 A A431
Sample_2 t0_B t0 B A431
Sample_3 t0_C t0 C A431
Sample_4 t2_A t2 A A431
  • Wide format data is called “wide” because it typically has a lot of columns which stretch widely across the page or your computer screen.
  • Most of us are familiar with looking at wide format data
    • It is convenient and we are more used to looking at data this way in our Excel sheets.
    • It often lets you see more of the data, at one time, on your screen

3.2 Long format

Below is glimpse how the long format of the same data look like:

Gene Samples count
ENSG00000000003 Sample_1 321
ENSG00000000005 Sample_1 0
ENSG00000000419 Sample_1 696
ENSG00000000457 Sample_1 59
ENSG00000000460 Sample_1 399
ENSG00000000938 Sample_1 0
ENSG00000000003 Sample_2 303
ENSG00000000005 Sample_2 0
ENSG00000000419 Sample_2 660
ENSG00000000457 Sample_2 54

Or to be even more complete and precise:

Sample_ID Sample_Name Time Replicate Cell Gene count
Sample_1 t0_A t0 A A431 ENSG00000000003 321
Sample_1 t0_A t0 A A431 ENSG00000000005 0
Sample_1 t0_A t0 A A431 ENSG00000000419 696
Sample_1 t0_A t0 A A431 ENSG00000000457 59
Sample_1 t0_A t0 A A431 ENSG00000000460 399
Sample_1 t0_A t0 A A431 ENSG00000000938 0
Sample_2 t0_B t0 B A431 ENSG00000000003 303
Sample_2 t0_B t0 B A431 ENSG00000000005 0
Sample_2 t0_B t0 B A431 ENSG00000000419 660
Sample_2 t0_B t0 B A431 ENSG00000000457 54
  • Long format data is typically less familiar to most humans
    • It seems awfully hard to get a good look at all (or most) of it
    • It seems like it would require more storage on your hard disk
    • It seems like it would be harder to enter data in a long format

3.3 Which is better?

  • Well, there are some contexts where putting things in wide format is computationally efficient because you can treat data in a matrix format and to efficient matrix calculations on it.
  • However, adding data to wide format data sets is very hard:
    1. It is very difficult to conceive of analytic schemes that apply generally across all wide-format data sets.
    2. Many tools in R want data in long format like ggplot
    3. The long format for data corresponds to the relational model for storing data, which is the model used in most modern data bases like the SQL family of data base systems.
  • A more technical treatment of wide versus long data requires some terminology:
    • Identifier variables are often categorical things that cross-classify observations into categories.
    • Measured variables are the names given to properties or characteristics that you can go out and measure.
    • Values are the values that you measure are record for any particular measured variable.
  • In any particular data set, what you might want to call an Identifier variables versus a Measured variables can not always be entirely clear.
    • Other people might choose to define things differently.
  • However, to my mind, it is less important to be able to precisely recognize these three entities in every possible situation (where there might be some fuzziness about which is which)
  • And it is more important to understand how Identifier variables, Measured variables, and Values interact and behave when we are transforming data between wide and long formats.

4 Conversion between formats

As for the biological data analysis, to be able to use tools such as ggplot, in simple terms we should learn to convert our data:

As per our previous examples: We should learn to convert

From this format

Sample_1 Sample_2 Sample_3 Sample_4
ENSG00000000003 321 303 204 492
ENSG00000000005 0 0 0 0
ENSG00000000419 696 660 472 951
ENSG00000000457 59 54 44 109
ENSG00000000460 399 405 236 445
ENSG00000000938 0 0 0 0

To this format

Sample_ID Sample_Name Time Replicate Cell Gene count
Sample_1 t0_A t0 A A431 ENSG00000000003 321
Sample_1 t0_A t0 A A431 ENSG00000000005 0
Sample_1 t0_A t0 A A431 ENSG00000000419 696
Sample_1 t0_A t0 A A431 ENSG00000000457 59
Sample_1 t0_A t0 A A431 ENSG00000000460 399
Sample_1 t0_A t0 A A431 ENSG00000000938 0
Sample_2 t0_B t0 B A431 ENSG00000000003 303
Sample_2 t0_B t0 B A431 ENSG00000000005 0
Sample_2 t0_B t0 B A431 ENSG00000000419 660
Sample_2 t0_B t0 B A431 ENSG00000000457 54

Here we will only cover the conversion from wide to long, as this is more relevant to us. For the other way around, one can look into spread() from the tidyr package.

4.1 Using reshape2

By using the melt() function from the reshape2 package we can convert the wide-formatted data into long-formatted data! Here, to combine the metadata table to the gene counts table, we will also use the merge() function like we did before!

library(reshape2)
gc <- read.table("data/counts_raw.txt", header = T, row.names = 1, sep = "\t")
md <- read.table("data/metadata.csv", header = T, sep = ";")
rownames(md) <- md$Sample_ID

#merging gene counts table with metadata
merged_data_wide <- merge(md, t(gc), by = 0)
#removing redundant columns
merged_data_wide$Row.names <- NULL

merged_data_long <- melt(merged_data_wide, id.vars = c("Sample_ID","Sample_Name","Time","Replicate","Cell"), variable.name = "Gene", value.name = "count")  
head(merged_data_long)
output
##   Sample_ID Sample_Name Time Replicate Cell            Gene count
## 1  Sample_1        t0_A   t0         A A431 ENSG00000000003   321
## 2 Sample_10       t24_A  t24         A A431 ENSG00000000003   950
## 3 Sample_11       t24_B  t24         B A431 ENSG00000000003   760
## 4 Sample_12       t24_C  t24         C A431 ENSG00000000003  1436
## 5  Sample_2        t0_B   t0         B A431 ENSG00000000003   303
## 6  Sample_3        t0_C   t0         C A431 ENSG00000000003   204

4.2 Using tidyr

If you are more familiar with using tidyverse or tidyr packages, you can combine tables by join() and then use gather() to make long formatted data in the same command. This is a powerful and more cleaner way of dealing with data in R.

library(tidyverse)
gc_long <- gc %>%
  rownames_to_column(var = "Gene") %>%
  gather(Sample_ID, count, -Gene) %>% 
  full_join(md, by = "Sample_ID") %>% 
  select(Sample_ID, everything()) %>% 
  select(-c(Gene,count), c(Gene,count)) 
gc_long %>%
  head(10)
output
##    Sample_ID Sample_Name Time Replicate Cell            Gene count
## 1   Sample_1        t0_A   t0         A A431 ENSG00000000003   321
## 2   Sample_1        t0_A   t0         A A431 ENSG00000000005     0
## 3   Sample_1        t0_A   t0         A A431 ENSG00000000419   696
## 4   Sample_1        t0_A   t0         A A431 ENSG00000000457    59
## 5   Sample_1        t0_A   t0         A A431 ENSG00000000460   399
## 6   Sample_1        t0_A   t0         A A431 ENSG00000000938     0
## 7   Sample_1        t0_A   t0         A A431 ENSG00000000971     0
## 8   Sample_1        t0_A   t0         A A431 ENSG00000001036   844
## 9   Sample_1        t0_A   t0         A A431 ENSG00000001084   535
## 10  Sample_1        t0_A   t0         A A431 ENSG00000001167   493

5 Exercise

Task   Here in this exercise we have used the counts_raw.txt file, you can try to make similar R objects for each of the other three counts (counts_filtered.txt, counts_vst.txt and counts_deseq2.txt) in long format. So for example you would have gc_filt, gc_vst and gc_deseq2 R objects in the end.

Tip   Remember to take a look at how these files are formatted, before you import!

6 Acknowledgements

Much of the data format explanations and exercises were obtained from the GitHub of Eric C. Anderson

7 Getting help

  • Use ?function to get function documentation
  • Use ??bla to search for a function
  • Use args(function) to get the arguments to a function
  • Go to the package CRAN page/webpage for vignettes
  • R Cookbook: General purpose reference.
  • Quick R: General purpose reference.
  • Stackoverflow: Online community to find solutions to your problems.

8 Session info

sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] treeio_1.23.1          ggtree_3.7.2           pheatmap_1.0.12       
##  [4] swemaps_1.0            mapdata_2.3.1          maps_3.4.1            
##  [7] gridExtra_2.3          jpeg_0.1-10            ggpubr_0.6.0          
## [10] cowplot_1.1.1          ggthemes_4.2.4         scales_1.2.1          
## [13] ggrepel_0.9.3          wesanderson_0.3.6      forcats_1.0.0         
## [16] stringr_1.5.0          purrr_1.0.1            readr_2.1.4           
## [19] tidyr_1.3.0            tibble_3.2.1           tidyverse_2.0.0       
## [22] reshape2_1.4.4         ggplot2_3.4.2          formattable_0.2.1     
## [25] kableExtra_1.3.4       dplyr_1.1.1            lubridate_1.9.2       
## [28] leaflet_2.1.2          yaml_2.3.7             fontawesome_0.5.0.9000
## [31] captioner_2.2.3        bookdown_0.33          knitr_1.42            
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-155            RColorBrewer_1.1-3      webshot_0.5.4          
##  [4] httr_1.4.5              tools_4.1.3             backports_1.4.1        
##  [7] bslib_0.4.2             utf8_1.2.3              R6_2.5.1               
## [10] lazyeval_0.2.2          mgcv_1.8-39             colorspace_2.1-0       
## [13] withr_2.5.0             tidyselect_1.2.0        compiler_4.1.3         
## [16] cli_3.6.1               rvest_1.0.3             xml2_1.3.3             
## [19] labeling_0.4.2          sass_0.4.5              yulab.utils_0.0.6      
## [22] systemfonts_1.0.4       digest_0.6.31           rmarkdown_2.21         
## [25] svglite_2.1.1           pkgconfig_2.0.3         htmltools_0.5.5        
## [28] fastmap_1.1.1           highr_0.10              htmlwidgets_1.6.2      
## [31] rlang_1.1.0             rstudioapi_0.14         gridGraphics_0.5-1     
## [34] jquerylib_0.1.4         farver_2.1.1            generics_0.1.3         
## [37] jsonlite_1.8.4          crosstalk_1.2.0         car_3.1-2              
## [40] magrittr_2.0.3          ggplotify_0.1.0         patchwork_1.1.2        
## [43] Matrix_1.5-4            Rcpp_1.0.10             munsell_0.5.0          
## [46] fansi_1.0.4             ape_5.7-1               abind_1.4-5            
## [49] lifecycle_1.0.3         stringi_1.7.12          carData_3.0-5          
## [52] plyr_1.8.8              parallel_4.1.3          lattice_0.20-45        
## [55] splines_4.1.3           hms_1.1.3               pillar_1.9.0           
## [58] ggsignif_0.6.4          glue_1.6.2              evaluate_0.20          
## [61] ggfun_0.0.9             leaflet.providers_1.9.0 vctrs_0.6.1            
## [64] tzdb_0.3.0              gtable_0.3.3            cachem_1.0.7           
## [67] xfun_0.38               broom_1.0.4             tidytree_0.4.2         
## [70] rstatix_0.7.2           viridisLite_0.4.1       aplot_0.1.10           
## [73] timechange_0.2.0        ellipsis_0.3.2