R is a programming language for statistical computing, and data wrangling. It is open-source, widely used in data science, has a wide range of functions and algorithms for graphing and data analyses.

We will use R in this course for generating plots from different biological data that are of higher quality and standard for publications. To do this, you will brush-up your memory on some important aspects of R that are important for this course below:

Before starting with the lab sessions of the entire course, you must have downloaded all the necessary files from [link removed] and make sure that the directory tree looks similar to the one displayed in that page

In that case, you can proceed with the exercise now and remember to have fun :)

1 Input/Output

Input and output of data and images is an important aspect with data analysis.

1.1 Text

Data can come in a variety of formats which needs to be read into R and converted to an R data type.

Text files are the most commonly used input. Text files can be read in using the function read.table. We have a sample file to use: metadata.csv.

dfr <- read.table("data/metadata.csv",sep = ";", header=TRUE,stringsAsFactors=F)

This reads in a tab-delimited text file with a header. The argument sep='\t' is set by default to specify that the delimiter is a tab. stringsAsFactors=F setting ensures that character columns are not automatically converted to factors.

It’s always a good idea to check the data after import.

head(dfr)

output

##   Sample_ID Sample_Name Time Replicate Cell
## 1  Sample_1        t0_A   t0         A A431
## 2  Sample_2        t0_B   t0         B A431
## 3  Sample_3        t0_C   t0         C A431
## 4  Sample_4        t2_A   t2         A A431
## 5  Sample_5        t2_B   t2         B A431
## 6  Sample_6        t2_C   t2         C A431

str(dfr)

output

## 'data.frame':    12 obs. of  5 variables:
##  $ Sample_ID  : chr  "Sample_1" "Sample_2" "Sample_3" "Sample_4" ...
##  $ Sample_Name: chr  "t0_A" "t0_B" "t0_C" "t2_A" ...
##  $ Time       : chr  "t0" "t0" "t0" "t2" ...
##  $ Replicate  : chr  "A" "B" "C" "A" ...
##  $ Cell       : chr  "A431" "A431" "A431" "A431" ...

Check ?read.table for other wrapper functions to read in text files.

Let’s filter this data.frame and create a new data set.

dfr1 <- dfr[dfr$Time == "t0",]

And we can write this as a text file.

write.table(dfr1,"iris-setosa.txt",sep="\t",row.names=F,quote=F)

sep="\t" sets the delimiter to tab. row.names=F denotes that rownames should not be written. quote=F specifies that doubles must not be placed around strings.

2 Data-Frames

You have probably learnt about data.frame in your previous course and this is the most important data structure for generating plots using ggplot. So, below you have some quick exercises on how to work with Data-Frames.

Vectors positions can be accessed using []. R follows 1-based indexing, meaning that the indexing starts at 1.

Data-frame or matrix positions can be accessed using [] specifying row and column like [row,column].

dfr <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr
dfr[1,]
dfr[,1]
dfr[2,2]

output

##   x y
## 1 1 a
## 2 2 b
## 3 3 c
##   x y
## 1 1 a
## [1] 1 2 3
## [1] "b"

The function cbind() is used to join two data-frames column-wise.

dfr1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr2 <- data.frame(p = 4:6, q = c("d", "e", "f"))
dfr1
dfr2

cbind(dfr1,dfr2)

output

##   x y
## 1 1 a
## 2 2 b
## 3 3 c
##   p q
## 1 4 d
## 2 5 e
## 3 6 f
##   x y p q
## 1 1 a 4 d
## 2 2 b 5 e
## 3 3 c 6 f

Similarly, rbind() is used to join two data-frames row-wise.

dfr1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr2 <- data.frame(x = 4:6, y = c("d", "e", "f"))
dfr1
dfr2

rbind(dfr1,dfr2)

output

##   x y
## 1 1 a
## 2 2 b
## 3 3 c
##   x y
## 1 4 d
## 2 5 e
## 3 6 f
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f

Two data-frames can be merged based on a shared column using the merge() function.

dfr1 <- data.frame(x = 1:4, p = c("a", "b", "c","d"))
dfr2 <- data.frame(x = 3:6, q = c("l", "m", "n","o"))
dfr1
dfr2

merge(dfr1,dfr2,by="x")
merge(dfr1,dfr2,by="x",all.x=T)
merge(dfr1,dfr2,by="x",all.y=T)
merge(dfr1,dfr2,by="x",all=T)

output

##   x p
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
##   x q
## 1 3 l
## 2 4 m
## 3 5 n
## 4 6 o
##   x p q
## 1 3 c l
## 2 4 d m
##   x p    q
## 1 1 a <NA>
## 2 2 b <NA>
## 3 3 c    l
## 4 4 d    m
##   x    p q
## 1 3    c l
## 2 4    d m
## 3 5 <NA> n
## 4 6 <NA> o
##   x    p    q
## 1 1    a <NA>
## 2 2    b <NA>
## 3 3    c    l
## 4 4    d    m
## 5 5 <NA>    n
## 6 6 <NA>    o

3 Data formats

In terms of the R (and other similar programming languages), the data can be viewed or stored in two main formats! They are called wide and long formats. Below you will see what exactly they stand for and why is it important for plotting in ggplot.

3.1 Wide format

A quick preview:

Counts Table

	Sample_1	Sample_2	Sample_3	Sample_4
ENSG00000000003	321	303	204	492
ENSG00000000005	0	0	0	0
ENSG00000000419	696	660	472	951
ENSG00000000457	59	54	44	109
ENSG00000000460	399	405	236	445
ENSG00000000938	0	0	0	0

And we usually have our metadata related to the samples in another table like below:

Metadata Table

Sample_ID	Sample_Name	Time	Replicate	Cell
Sample_1	t0_A	t0	A	A431
Sample_2	t0_B	t0	B	A431
Sample_3	t0_C	t0	C	A431
Sample_4	t2_A	t2	A	A431

Wide format data is called “wide” because it typically has a lot of columns which stretch widely across the page or your computer screen.
Most of us are familiar with looking at wide format data
- It is convenient and we are more used to looking at data this way in our Excel sheets.
- It often lets you see more of the data, at one time, on your screen

3.2 Long format

Below is glimpse how the long format of the same data look like:

Gene	Samples	count
ENSG00000000003	Sample_1	321
ENSG00000000005	Sample_1	0
ENSG00000000419	Sample_1	696
ENSG00000000457	Sample_1	59
ENSG00000000460	Sample_1	399
ENSG00000000938	Sample_1	0
ENSG00000000003	Sample_2	303
ENSG00000000005	Sample_2	0
ENSG00000000419	Sample_2	660
ENSG00000000457	Sample_2	54

Or to be even more complete and precise:

Sample_ID	Sample_Name	Time	Replicate	Cell	Gene	count
Sample_1	t0_A	t0	A	A431	ENSG00000000003	321
Sample_1	t0_A	t0	A	A431	ENSG00000000005	0
Sample_1	t0_A	t0	A	A431	ENSG00000000419	696
Sample_1	t0_A	t0	A	A431	ENSG00000000457	59
Sample_1	t0_A	t0	A	A431	ENSG00000000460	399
Sample_1	t0_A	t0	A	A431	ENSG00000000938	0
Sample_2	t0_B	t0	B	A431	ENSG00000000003	303
Sample_2	t0_B	t0	B	A431	ENSG00000000005	0
Sample_2	t0_B	t0	B	A431	ENSG00000000419	660
Sample_2	t0_B	t0	B	A431	ENSG00000000457	54

Long format data is typically less familiar to most humans
- It seems awfully hard to get a good look at all (or most) of it
- It seems like it would require more storage on your hard disk
- It seems like it would be harder to enter data in a long format

3.3 Which is better?

Well, there are some contexts where putting things in wide format is computationally efficient because you can treat data in a matrix format and to efficient matrix calculations on it.
However, adding data to wide format data sets is very hard:
1. It is very difficult to conceive of analytic schemes that apply generally across all wide-format data sets.
2. Many tools in R want data in long format like ggplot
3. The long format for data corresponds to the relational model for storing data, which is the model used in most modern data bases like the SQL family of data base systems.
A more technical treatment of wide versus long data requires some terminology:
- Identifier variables are often categorical things that cross-classify observations into categories.
- Measured variables are the names given to properties or characteristics that you can go out and measure.
- Values are the values that you measure are record for any particular measured variable.
In any particular data set, what you might want to call an Identifier variables versus a Measured variables can not always be entirely clear.
- Other people might choose to define things differently.
However, to my mind, it is less important to be able to precisely recognize these three entities in every possible situation (where there might be some fuzziness about which is which)
And it is more important to understand how Identifier variables, Measured variables, and Values interact and behave when we are transforming data between wide and long formats.

4 Conversion between formats

As for the biological data analysis, to be able to use tools such as ggplot, in simple terms we should learn to convert our data:

As per our previous examples: We should learn to convert

From this format

	Sample_1	Sample_2	Sample_3	Sample_4
ENSG00000000003	321	303	204	492
ENSG00000000005	0	0	0	0
ENSG00000000419	696	660	472	951
ENSG00000000457	59	54	44	109
ENSG00000000460	399	405	236	445
ENSG00000000938	0	0	0	0

To this format

Sample_ID	Sample_Name	Time	Replicate	Cell	Gene	count
Sample_1	t0_A	t0	A	A431	ENSG00000000003	321
Sample_1	t0_A	t0	A	A431	ENSG00000000005	0
Sample_1	t0_A	t0	A	A431	ENSG00000000419	696
Sample_1	t0_A	t0	A	A431	ENSG00000000457	59
Sample_1	t0_A	t0	A	A431	ENSG00000000460	399
Sample_1	t0_A	t0	A	A431	ENSG00000000938	0
Sample_2	t0_B	t0	B	A431	ENSG00000000003	303
Sample_2	t0_B	t0	B	A431	ENSG00000000005	0
Sample_2	t0_B	t0	B	A431	ENSG00000000419	660
Sample_2	t0_B	t0	B	A431	ENSG00000000457	54

Here we will only cover the conversion from wide to long, as this is more relevant to us. For the other way around, one can look into spread() from the tidyr package.

4.1 Using reshape2

By using the melt() function from the reshape2 package we can convert the wide-formatted data into long-formatted data! Here, to combine the metadata table to the gene counts table, we will also use the merge() function like we did before!

library(reshape2)
gc <- read.table("data/counts_raw.txt", header = T, row.names = 1, sep = "\t")
md <- read.table("data/metadata.csv", header = T, sep = ";")
rownames(md) <- md$Sample_ID

#merging gene counts table with metadata
merged_data_wide <- merge(md, t(gc), by = 0)
#removing redundant columns
merged_data_wide$Row.names <- NULL

merged_data_long <- melt(merged_data_wide, id.vars = c("Sample_ID","Sample_Name","Time","Replicate","Cell"), variable.name = "Gene", value.name = "count")  
head(merged_data_long)

output

##   Sample_ID Sample_Name Time Replicate Cell            Gene count
## 1  Sample_1        t0_A   t0         A A431 ENSG00000000003   321
## 2 Sample_10       t24_A  t24         A A431 ENSG00000000003   950
## 3 Sample_11       t24_B  t24         B A431 ENSG00000000003   760
## 4 Sample_12       t24_C  t24         C A431 ENSG00000000003  1436
## 5  Sample_2        t0_B   t0         B A431 ENSG00000000003   303
## 6  Sample_3        t0_C   t0         C A431 ENSG00000000003   204

4.2 Using tidyr

If you are more familiar with using tidyverse or tidyr packages, you can combine tables by join() and then use gather() to make long formatted data in the same command. This is a powerful and more cleaner way of dealing with data in R.

library(tidyverse)
gc_long <- gc %>%
  rownames_to_column(var = "Gene") %>%
  gather(Sample_ID, count, -Gene) %>% 
  full_join(md, by = "Sample_ID") %>% 
  select(Sample_ID, everything()) %>% 
  select(-c(Gene,count), c(Gene,count)) 
gc_long %>%
  head(10)

output

##    Sample_ID Sample_Name Time Replicate Cell            Gene count
## 1   Sample_1        t0_A   t0         A A431 ENSG00000000003   321
## 2   Sample_1        t0_A   t0         A A431 ENSG00000000005     0
## 3   Sample_1        t0_A   t0         A A431 ENSG00000000419   696
## 4   Sample_1        t0_A   t0         A A431 ENSG00000000457    59
## 5   Sample_1        t0_A   t0         A A431 ENSG00000000460   399
## 6   Sample_1        t0_A   t0         A A431 ENSG00000000938     0
## 7   Sample_1        t0_A   t0         A A431 ENSG00000000971     0
## 8   Sample_1        t0_A   t0         A A431 ENSG00000001036   844
## 9   Sample_1        t0_A   t0         A A431 ENSG00000001084   535
## 10  Sample_1        t0_A   t0         A A431 ENSG00000001167   493

5 Exercise

Task Here in this exercise we have used the counts_raw.txt file, you can try to make similar R objects for each of the other three counts (counts_filtered.txt, counts_vst.txt and counts_deseq2.txt) in long format. So for example you would have gc_filt, gc_vst and gc_deseq2 R objects in the end.

Tip Remember to take a look at how these files are formatted, before you import!

6 Acknowledgements

Much of the data format explanations and exercises were obtained from the GitHub of Eric C. Anderson

7 Getting help

Use ?function to get function documentation
Use ??bla to search for a function
Use args(function) to get the arguments to a function
Go to the package CRAN page/webpage for vignettes
R Cookbook: General purpose reference.
Quick R: General purpose reference.
Stackoverflow: Online community to find solutions to your problems.

8 Session info

sessionInfo()

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] treeio_1.23.1          ggtree_3.7.2           pheatmap_1.0.12       
##  [4] swemaps_1.0            mapdata_2.3.1          maps_3.4.1            
##  [7] gridExtra_2.3          jpeg_0.1-10            ggpubr_0.6.0          
## [10] cowplot_1.1.1          ggthemes_4.2.4         scales_1.2.1          
## [13] ggrepel_0.9.3          wesanderson_0.3.6      forcats_1.0.0         
## [16] stringr_1.5.0          purrr_1.0.1            readr_2.1.4           
## [19] tidyr_1.3.0            tibble_3.2.1           tidyverse_2.0.0       
## [22] reshape2_1.4.4         ggplot2_3.4.2          formattable_0.2.1     
## [25] kableExtra_1.3.4       dplyr_1.1.1            lubridate_1.9.2       
## [28] leaflet_2.1.2          yaml_2.3.7             fontawesome_0.5.0.9000
## [31] captioner_2.2.3        bookdown_0.33          knitr_1.42            
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-155            RColorBrewer_1.1-3      webshot_0.5.4          
##  [4] httr_1.4.5              tools_4.1.3             backports_1.4.1        
##  [7] bslib_0.4.2             utf8_1.2.3              R6_2.5.1               
## [10] lazyeval_0.2.2          mgcv_1.8-39             colorspace_2.1-0       
## [13] withr_2.5.0             tidyselect_1.2.0        compiler_4.1.3         
## [16] cli_3.6.1               rvest_1.0.3             xml2_1.3.3             
## [19] labeling_0.4.2          sass_0.4.5              yulab.utils_0.0.6      
## [22] systemfonts_1.0.4       digest_0.6.31           rmarkdown_2.21         
## [25] svglite_2.1.1           pkgconfig_2.0.3         htmltools_0.5.5        
## [28] fastmap_1.1.1           highr_0.10              htmlwidgets_1.6.2      
## [31] rlang_1.1.0             rstudioapi_0.14         gridGraphics_0.5-1     
## [34] jquerylib_0.1.4         farver_2.1.1            generics_0.1.3         
## [37] jsonlite_1.8.4          crosstalk_1.2.0         car_3.1-2              
## [40] magrittr_2.0.3          ggplotify_0.1.0         patchwork_1.1.2        
## [43] Matrix_1.5-4            Rcpp_1.0.10             munsell_0.5.0          
## [46] fansi_1.0.4             ape_5.7-1               abind_1.4-5            
## [49] lifecycle_1.0.3         stringi_1.7.12          carData_3.0-5          
## [52] plyr_1.8.8              parallel_4.1.3          lattice_0.20-45        
## [55] splines_4.1.3           hms_1.1.3               pillar_1.9.0           
## [58] ggsignif_0.6.4          glue_1.6.2              evaluate_0.20          
## [61] ggfun_0.0.9             leaflet.providers_1.9.0 vctrs_0.6.1            
## [64] tzdb_0.3.0              gtable_0.3.3            cachem_1.0.7           
## [67] xfun_0.38               broom_1.0.4             tidytree_0.4.2         
## [70] rstatix_0.7.2           viridisLite_0.4.1       aplot_0.1.10           
## [73] timechange_0.2.0        ellipsis_0.3.2

Lokesh Mano • 17-Apr-2023

1 Input/Output

1.1 Text

2 Data-Frames

3 Data formats

3.1 Wide format

3.2 Long format

3.3 Which is better?

4 Conversion between formats

4.1 Using reshape2

4.2 Using tidyr

5 Exercise

6 Acknowledgements

7 Getting help

8 Session info