R is a programming language for statistical computing, and data wrangling. It is open-source, widely used in data science, has a wide range of functions and algorithms for graphing and data analyses.
We will use R in this course for generating plots from different biological data that are of higher quality and standard for publications. To do this, you will brush-up your memory on some important aspects of R that are important for this course below:
Before starting with the lab sessions of the entire course, you must have downloaded all the necessary files from [link removed] and make sure that the directory tree looks similar to the one displayed in that page
In that case, you can proceed with the exercise now and remember to have fun :)
Input and output of data and images is an important aspect with data analysis.
Data can come in a variety of formats which needs to be read into R and converted to an R data type.
Text files are the most commonly used input. Text files can be read in using the function read.table
. We have a sample file to use: metadata.csv.
dfr <- read.table("data/metadata.csv",sep = ";", header=TRUE,stringsAsFactors=F)
This reads in a tab-delimited text file with a header. The argument sep='\t'
is set by default to specify that the delimiter is a tab. stringsAsFactors=F
setting ensures that character columns are not automatically converted to factors.
It’s always a good idea to check the data after import.
head(dfr)
## Sample_ID Sample_Name Time Replicate Cell
## 1 Sample_1 t0_A t0 A A431
## 2 Sample_2 t0_B t0 B A431
## 3 Sample_3 t0_C t0 C A431
## 4 Sample_4 t2_A t2 A A431
## 5 Sample_5 t2_B t2 B A431
## 6 Sample_6 t2_C t2 C A431
str(dfr)
## 'data.frame': 12 obs. of 5 variables:
## $ Sample_ID : chr "Sample_1" "Sample_2" "Sample_3" "Sample_4" ...
## $ Sample_Name: chr "t0_A" "t0_B" "t0_C" "t2_A" ...
## $ Time : chr "t0" "t0" "t0" "t2" ...
## $ Replicate : chr "A" "B" "C" "A" ...
## $ Cell : chr "A431" "A431" "A431" "A431" ...
Check ?read.table
for other wrapper functions to read in text files.
Let’s filter this data.frame and create a new data set.
dfr1 <- dfr[dfr$Time == "t0",]
And we can write this as a text file.
write.table(dfr1,"iris-setosa.txt",sep="\t",row.names=F,quote=F)
sep="\t"
sets the delimiter to tab. row.names=F
denotes that rownames should not be written. quote=F
specifies that doubles must not be placed around strings.
You have probably learnt about data.frame
in your previous course and this is the most important data structure for generating plots using ggplot
. So, below you have some quick exercises on how to work with Data-Frames.
Vectors positions can be accessed using []
. R follows 1-based indexing, meaning that the indexing starts at 1.
Data-frame or matrix positions can be accessed using []
specifying row and column like [row,column]
.
dfr <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr
dfr[1,]
dfr[,1]
dfr[2,2]
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## x y
## 1 1 a
## [1] 1 2 3
## [1] "b"
The function cbind()
is used to join two data-frames column-wise.
dfr1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr2 <- data.frame(p = 4:6, q = c("d", "e", "f"))
dfr1
dfr2
cbind(dfr1,dfr2)
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## p q
## 1 4 d
## 2 5 e
## 3 6 f
## x y p q
## 1 1 a 4 d
## 2 2 b 5 e
## 3 3 c 6 f
Similarly, rbind()
is used to join two data-frames row-wise.
dfr1 <- data.frame(x = 1:3, y = c("a", "b", "c"))
dfr2 <- data.frame(x = 4:6, y = c("d", "e", "f"))
dfr1
dfr2
rbind(dfr1,dfr2)
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## x y
## 1 4 d
## 2 5 e
## 3 6 f
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## 5 5 e
## 6 6 f
Two data-frames can be merged based on a shared column using the merge()
function.
dfr1 <- data.frame(x = 1:4, p = c("a", "b", "c","d"))
dfr2 <- data.frame(x = 3:6, q = c("l", "m", "n","o"))
dfr1
dfr2
merge(dfr1,dfr2,by="x")
merge(dfr1,dfr2,by="x",all.x=T)
merge(dfr1,dfr2,by="x",all.y=T)
merge(dfr1,dfr2,by="x",all=T)
## x p
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
## x q
## 1 3 l
## 2 4 m
## 3 5 n
## 4 6 o
## x p q
## 1 3 c l
## 2 4 d m
## x p q
## 1 1 a <NA>
## 2 2 b <NA>
## 3 3 c l
## 4 4 d m
## x p q
## 1 3 c l
## 2 4 d m
## 3 5 <NA> n
## 4 6 <NA> o
## x p q
## 1 1 a <NA>
## 2 2 b <NA>
## 3 3 c l
## 4 4 d m
## 5 5 <NA> n
## 6 6 <NA> o
In terms of the R (and other similar programming languages), the data can be viewed or stored in two main formats! They are called wide
and long
formats. Below you will see what exactly they stand for and why is it important for plotting in ggplot.
A quick preview:
Counts Table
Sample_1 | Sample_2 | Sample_3 | Sample_4 | |
---|---|---|---|---|
ENSG00000000003 | 321 | 303 | 204 | 492 |
ENSG00000000005 | 0 | 0 | 0 | 0 |
ENSG00000000419 | 696 | 660 | 472 | 951 |
ENSG00000000457 | 59 | 54 | 44 | 109 |
ENSG00000000460 | 399 | 405 | 236 | 445 |
ENSG00000000938 | 0 | 0 | 0 | 0 |
And we usually have our metadata related to the samples in another table like below:
Metadata Table
Sample_ID | Sample_Name | Time | Replicate | Cell |
---|---|---|---|---|
Sample_1 | t0_A | t0 | A | A431 |
Sample_2 | t0_B | t0 | B | A431 |
Sample_3 | t0_C | t0 | C | A431 |
Sample_4 | t2_A | t2 | A | A431 |
Below is glimpse how the long format of the same data look like:
Gene | Samples | count |
---|---|---|
ENSG00000000003 | Sample_1 | 321 |
ENSG00000000005 | Sample_1 | 0 |
ENSG00000000419 | Sample_1 | 696 |
ENSG00000000457 | Sample_1 | 59 |
ENSG00000000460 | Sample_1 | 399 |
ENSG00000000938 | Sample_1 | 0 |
ENSG00000000003 | Sample_2 | 303 |
ENSG00000000005 | Sample_2 | 0 |
ENSG00000000419 | Sample_2 | 660 |
ENSG00000000457 | Sample_2 | 54 |
Or to be even more complete and precise:
Sample_ID | Sample_Name | Time | Replicate | Cell | Gene | count |
---|---|---|---|---|---|---|
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000003 | 321 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000005 | 0 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000419 | 696 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000457 | 59 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000460 | 399 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000938 | 0 |
Sample_2 | t0_B | t0 | B | A431 | ENSG00000000003 | 303 |
Sample_2 | t0_B | t0 | B | A431 | ENSG00000000005 | 0 |
Sample_2 | t0_B | t0 | B | A431 | ENSG00000000419 | 660 |
Sample_2 | t0_B | t0 | B | A431 | ENSG00000000457 | 54 |
As for the biological data analysis, to be able to use tools such as ggplot, in simple terms we should learn to convert our data:
As per our previous examples: We should learn to convert
From this format
Sample_1 | Sample_2 | Sample_3 | Sample_4 | |
---|---|---|---|---|
ENSG00000000003 | 321 | 303 | 204 | 492 |
ENSG00000000005 | 0 | 0 | 0 | 0 |
ENSG00000000419 | 696 | 660 | 472 | 951 |
ENSG00000000457 | 59 | 54 | 44 | 109 |
ENSG00000000460 | 399 | 405 | 236 | 445 |
ENSG00000000938 | 0 | 0 | 0 | 0 |
To this format
Sample_ID | Sample_Name | Time | Replicate | Cell | Gene | count |
---|---|---|---|---|---|---|
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000003 | 321 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000005 | 0 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000419 | 696 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000457 | 59 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000460 | 399 |
Sample_1 | t0_A | t0 | A | A431 | ENSG00000000938 | 0 |
Sample_2 | t0_B | t0 | B | A431 | ENSG00000000003 | 303 |
Sample_2 | t0_B | t0 | B | A431 | ENSG00000000005 | 0 |
Sample_2 | t0_B | t0 | B | A431 | ENSG00000000419 | 660 |
Sample_2 | t0_B | t0 | B | A431 | ENSG00000000457 | 54 |
Here we will only cover the conversion from wide
to long
, as this is more relevant to us. For the other way around, one can look into spread()
from the tidyr
package.
By using the melt()
function from the reshape2 package we can convert the wide-formatted data into long-formatted data! Here, to combine the metadata table to the gene counts table, we will also use the merge()
function like we did before!
library(reshape2)
gc <- read.table("data/counts_raw.txt", header = T, row.names = 1, sep = "\t")
md <- read.table("data/metadata.csv", header = T, sep = ";")
rownames(md) <- md$Sample_ID
#merging gene counts table with metadata
merged_data_wide <- merge(md, t(gc), by = 0)
#removing redundant columns
merged_data_wide$Row.names <- NULL
merged_data_long <- melt(merged_data_wide, id.vars = c("Sample_ID","Sample_Name","Time","Replicate","Cell"), variable.name = "Gene", value.name = "count")
head(merged_data_long)
## Sample_ID Sample_Name Time Replicate Cell Gene count
## 1 Sample_1 t0_A t0 A A431 ENSG00000000003 321
## 2 Sample_10 t24_A t24 A A431 ENSG00000000003 950
## 3 Sample_11 t24_B t24 B A431 ENSG00000000003 760
## 4 Sample_12 t24_C t24 C A431 ENSG00000000003 1436
## 5 Sample_2 t0_B t0 B A431 ENSG00000000003 303
## 6 Sample_3 t0_C t0 C A431 ENSG00000000003 204
If you are more familiar with using tidyverse
or tidyr
packages, you can combine tables by join()
and then use gather()
to make long formatted data in the same command. This is a powerful and more cleaner way of dealing with data in R.
library(tidyverse)
gc_long <- gc %>%
rownames_to_column(var = "Gene") %>%
gather(Sample_ID, count, -Gene) %>%
full_join(md, by = "Sample_ID") %>%
select(Sample_ID, everything()) %>%
select(-c(Gene,count), c(Gene,count))
gc_long %>%
head(10)
## Sample_ID Sample_Name Time Replicate Cell Gene count
## 1 Sample_1 t0_A t0 A A431 ENSG00000000003 321
## 2 Sample_1 t0_A t0 A A431 ENSG00000000005 0
## 3 Sample_1 t0_A t0 A A431 ENSG00000000419 696
## 4 Sample_1 t0_A t0 A A431 ENSG00000000457 59
## 5 Sample_1 t0_A t0 A A431 ENSG00000000460 399
## 6 Sample_1 t0_A t0 A A431 ENSG00000000938 0
## 7 Sample_1 t0_A t0 A A431 ENSG00000000971 0
## 8 Sample_1 t0_A t0 A A431 ENSG00000001036 844
## 9 Sample_1 t0_A t0 A A431 ENSG00000001084 535
## 10 Sample_1 t0_A t0 A A431 ENSG00000001167 493
Task Here in this exercise we have used the counts_raw.txt
file, you can try to make similar R objects for each of the other three counts (counts_filtered.txt
, counts_vst.txt
and counts_deseq2.txt
) in long format
. So for example you would have gc_filt
, gc_vst
and gc_deseq2
R objects in the end.
Tip Remember to take a look at how these files are formatted, before you import!
Much of the data format explanations and exercises were obtained from the GitHub of Eric C. Anderson
?function
to get function documentation??bla
to search for a functionargs(function)
to get the arguments to a functionsessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] treeio_1.23.1 ggtree_3.7.2 pheatmap_1.0.12
## [4] swemaps_1.0 mapdata_2.3.1 maps_3.4.1
## [7] gridExtra_2.3 jpeg_0.1-10 ggpubr_0.6.0
## [10] cowplot_1.1.1 ggthemes_4.2.4 scales_1.2.1
## [13] ggrepel_0.9.3 wesanderson_0.3.6 forcats_1.0.0
## [16] stringr_1.5.0 purrr_1.0.1 readr_2.1.4
## [19] tidyr_1.3.0 tibble_3.2.1 tidyverse_2.0.0
## [22] reshape2_1.4.4 ggplot2_3.4.2 formattable_0.2.1
## [25] kableExtra_1.3.4 dplyr_1.1.1 lubridate_1.9.2
## [28] leaflet_2.1.2 yaml_2.3.7 fontawesome_0.5.0.9000
## [31] captioner_2.2.3 bookdown_0.33 knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-155 RColorBrewer_1.1-3 webshot_0.5.4
## [4] httr_1.4.5 tools_4.1.3 backports_1.4.1
## [7] bslib_0.4.2 utf8_1.2.3 R6_2.5.1
## [10] lazyeval_0.2.2 mgcv_1.8-39 colorspace_2.1-0
## [13] withr_2.5.0 tidyselect_1.2.0 compiler_4.1.3
## [16] cli_3.6.1 rvest_1.0.3 xml2_1.3.3
## [19] labeling_0.4.2 sass_0.4.5 yulab.utils_0.0.6
## [22] systemfonts_1.0.4 digest_0.6.31 rmarkdown_2.21
## [25] svglite_2.1.1 pkgconfig_2.0.3 htmltools_0.5.5
## [28] fastmap_1.1.1 highr_0.10 htmlwidgets_1.6.2
## [31] rlang_1.1.0 rstudioapi_0.14 gridGraphics_0.5-1
## [34] jquerylib_0.1.4 farver_2.1.1 generics_0.1.3
## [37] jsonlite_1.8.4 crosstalk_1.2.0 car_3.1-2
## [40] magrittr_2.0.3 ggplotify_0.1.0 patchwork_1.1.2
## [43] Matrix_1.5-4 Rcpp_1.0.10 munsell_0.5.0
## [46] fansi_1.0.4 ape_5.7-1 abind_1.4-5
## [49] lifecycle_1.0.3 stringi_1.7.12 carData_3.0-5
## [52] plyr_1.8.8 parallel_4.1.3 lattice_0.20-45
## [55] splines_4.1.3 hms_1.1.3 pillar_1.9.0
## [58] ggsignif_0.6.4 glue_1.6.2 evaluate_0.20
## [61] ggfun_0.0.9 leaflet.providers_1.9.0 vctrs_0.6.1
## [64] tzdb_0.3.0 gtable_0.3.3 cachem_1.0.7
## [67] xfun_0.38 broom_1.0.4 tidytree_0.4.2
## [70] rstatix_0.7.2 viridisLite_0.4.1 aplot_0.1.10
## [73] timechange_0.2.0 ellipsis_0.3.2