R is an excellent tool for creating graphs and plots. The graphics capabilities and functions provided by the base R installation are called base R graphics. Numerous packages exist to extend the functionality of base graphics.
We can try out a few of the common plot types. Let’s start with a scatterplot. First we create a data.frame, as this is the most commonly used data object.
dfr <- data.frame(a=sample(1:100,10),b=sample(1:100,10))
Now we have a dataframe with two continuous variables that can be plotted against each other.
plot(dfr$a,dfr$b)
This is probably the simplest and most basic plot. We can modify the x and y axis labels.
plot(dfr$a,dfr$b,xlab="Variable a",ylab="Variable b")
We can also connect the points with lines. Setting type="b" draws both points and lines, while type="l" would draw lines only.
plot(dfr$a,dfr$b,xlab="Variable a",ylab="Variable b",type="b")
Let’s add a categorical column to our dataframe.
dfr$cat <- rep(c("C1","C2"),each=5)
And then colour the points by category.
# subset data
dfr_c1 <- subset(dfr,dfr$cat == "C1")
dfr_c2 <- subset(dfr,dfr$cat == "C2")
plot(dfr_c1$a,dfr_c1$b,xlab="Variable a",ylab="Variable b",col="red",pch=1)
points(dfr_c2$a,dfr_c2$b,col="blue",pch=2)
legend(x="topright",legend=c("C1","C2"),
col=c("red","blue"),pch=c(1,2))
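The subsetting approach above needs one points() call per category. As an alternative sketch, the colour and symbol can be looked up per row from the category itself, so a single plot() call handles any number of groups. The names cols and pchs are made up for this illustration, and set.seed() is added only to make the random data repeatable.

```r
# Recreate the example data (seed added here only for reproducibility)
set.seed(1)
dfr <- data.frame(a = sample(1:100, 10), b = sample(1:100, 10))
dfr$cat <- rep(c("C1", "C2"), each = 5)

# Hypothetical lookup vectors: one colour and one symbol per category
cols <- c(C1 = "red", C2 = "blue")
pchs <- c(C1 = 1, C2 = 2)

# Indexing by the category name returns one colour/symbol per row
plot(dfr$a, dfr$b, xlab = "Variable a", ylab = "Variable b",
     col = cols[dfr$cat], pch = pchs[dfr$cat])
legend(x = "topright", legend = names(cols), col = cols, pch = pchs)
```

Adding a third category would then only require one more entry in each lookup vector.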
Let’s create a barplot.
ldr <- data.frame(a=letters[1:10],b=sample(1:50,10))
barplot(ldr$b,names.arg=ldr$a)
Grid graphics have a completely different underlying framework compared to base graphics. Generally, base graphics and grid graphics cannot be plotted together. The most popular grid-graphics based plotting library is ggplot2.
Let’s create the same plot as before using ggplot2. Make sure you have the package installed.
library(ggplot2)
ggplot(dfr)+
geom_point(mapping = aes(x=a,y=b,colour=cat))+
labs(x="Variable a",y="Variable b")
It is generally easier and more consistent to create plots using the ggplot2 package compared to the base graphics.
Let’s create a barplot as well.
ggplot(ldr,aes(x=a,y=b))+
geom_col()
Let’s take a look at saving plots.
Note This part just gives you a quick look at how to save images from RStudio. The different image formats will be explained in a lecture tomorrow.
The general idea for saving plots is to open a graphics device, create the plot, and then close the device. We will use png here. Check out ?png for the arguments and other devices.
dfr <- data.frame(a=sample(1:100,10),b=sample(1:100,10))
png(filename="plot-base.png")
plot(dfr$a,dfr$b)
dev.off()
The same idea can be applied to ggplot2, but in a slightly different way. First save the plot to a variable, and then export it by printing it while the device is open.
p <- ggplot(dfr,aes(a,b)) + geom_point()
png(filename="plot-ggplot-1.png")
print(p)
dev.off()
Tip ggplot2 also has an easier helper function to export images.
ggsave(filename="plot-ggplot-2.png",plot=p)
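Both routes accept arguments for the image size. Here is a minimal sketch, assuming a 12 x 8 cm figure at 300 dpi is wanted. These values, the file names and the small data frame are arbitrary choices for illustration.

```r
library(ggplot2)

# Small made-up data set so that the example is self-contained
dfr <- data.frame(a = 1:10, b = (1:10)^2)
p <- ggplot(dfr, aes(a, b)) + geom_point()

# Graphics device route: size and resolution are set when opening the device
png(filename = "plot-ggplot-sized-1.png", width = 12, height = 8,
    units = "cm", res = 300)
print(p)
dev.off()

# ggsave route: the device is chosen from the file extension
ggsave(filename = "plot-ggplot-sized-2.png", plot = p,
       width = 12, height = 8, units = "cm", dpi = 300)
```

Changing the extension in the ggsave() call, for example to .pdf, should be enough to switch the output format.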
Make sure the library is loaded in your environment.
library(ggplot2)
In the previous section we saw very quickly how to use ggplot. Let’s take a look at it again a bit more carefully. For this, let’s first look at a simple dataset that is available in R: the iris data.
This dataset has four continuous variables and one categorical variable. It is important to keep the data types in mind when plotting graphs.
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
When we initiate the ggplot object using the data, it just creates a blank plot!
ggplot(iris)
Now we can specify what we want on the x and y axes using aesthetic mapping, and we specify the geometric objects using geoms. Note that the variable names do not have double quotes "" like in base plots.
ggplot(data=iris)+
geom_point(mapping=aes(x=Petal.Length,y=Petal.Width))
Further geoms can be added. For example let’s add a regression line. When multiple geoms with the same aesthetics are used, they can be specified as a common mapping. Note that the order in which geoms are plotted depends on the order in which the geoms are supplied in the code. In the code below, the points are plotted first and then the regression line.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point()+
geom_smooth(method="lm")
There are many other geoms, and you can find most of them here in this cheatsheet.
Let’s also try to use ggplot for a “more common” gene counts dataset. Here, we will first convert the gene counts and the related metadata to the long format. If you are not familiar with the long and wide formats of data, please follow the material here.
gc <- read.table("data/counts_raw.txt", header = TRUE, row.names = 1, sep = "\t")
md <- read.table("data/metadata.csv", header = TRUE, sep = ";")
rownames(md) <- md$Sample_ID
library(tidyverse)
gc_long <- gc %>%
rownames_to_column(var = "Gene") %>%
gather(Sample_ID, count, -Gene) %>%
full_join(md, by = "Sample_ID") %>%
select(Sample_ID, everything()) %>%
select(-c(Gene,count), c(Gene,count))
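gather() still works, but the tidyr documentation describes it as superseded by pivot_longer(). The sketch below shows the same wide-to-long step on a made-up table with two genes and three samples. The toy_* objects and their values are invented for illustration and are not the course data.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Invented wide table: genes in rows, samples in columns
toy_wide <- data.frame(Sample_1 = c(10, 0),
                       Sample_2 = c(12, 3),
                       Sample_3 = c(7, 1),
                       row.names = c("geneA", "geneB"))

# Invented metadata with one row per sample
toy_md <- data.frame(Sample_ID = c("Sample_1", "Sample_2", "Sample_3"),
                     Time = c("t0", "t0", "t2"))

toy_long <- toy_wide %>%
  rownames_to_column(var = "Gene") %>%
  # every column except Gene becomes a (Sample_ID, count) pair
  pivot_longer(cols = -Gene, names_to = "Sample_ID", values_to = "count") %>%
  # attach the sample metadata to every gene/sample row
  full_join(toy_md, by = "Sample_ID")

toy_long
```

Each gene/sample combination ends up on its own row, which is the shape ggplot expects when a column such as Time is mapped to an aesthetic.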
ggplot(data = gc_long) +
geom_boxplot(mapping = aes(x = Sample_Name, y = log10(count +1)))
Note Notice that ggplot sorts factors or variables alphanumerically, as in the case above with Sample_Name.
Tip There is a trick that you can use to give the order of variables manually. The example is shown below:
gc_long$Sample_Name <- factor(gc_long$Sample_Name, levels = c("t0_A","t0_B","t0_C","t2_A","t2_B","t2_C","t6_A","t6_B","t6_C","t24_A","t24_B","t24_C"))
ggplot(data = gc_long) +
geom_boxplot(mapping = aes(x = Sample_Name, y = log10(count + 1)))
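Typing all levels by hand becomes tedious with many samples. As a sketch of one alternative, the order can be derived from the number embedded in the name. This assumes the names follow the pattern t<number>_<replicate> seen above; the four sample names used here are an invented subset.

```r
# Invented sample names in the same style as the course data
sample_names <- c("t24_A", "t0_A", "t6_A", "t2_A")

# Default factor levels are sorted as text, not by the number in the name
levels(factor(sample_names))

# Extract the number after "t" and order the names by it
time_num <- as.numeric(sub("^t([0-9]+)_.*$", "\\1", sample_names))
ordered_levels <- sample_names[order(time_num)]

# Use the numerically ordered names as the factor levels
levels(factor(sample_names, levels = ordered_levels))
```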
First, if we look at the iris data, we can use the categorical column Species to color the points. The color aesthetic is used by geom_point and geom_smooth. Three different regression lines are now drawn. Notice that a legend is automatically created.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width,color=Species))+
geom_point()+
geom_smooth(method="lm")
If we wanted to keep a common regression line while keeping the colors for the points, we could specify the color aesthetic only for geom_point.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species))+
geom_smooth(method="lm")
Similarly, we can do the same with the gene counts data.
ggplot(data = gc_long) +
geom_boxplot(mapping = aes(x = Sample_Name, y = log10(count + 1), color = Time))
Tip We can also use the fill aesthetic to give it a better look.
ggplot(data = gc_long) +
geom_boxplot(mapping = aes(x = Sample_Name, y = log10(count + 1), fill = Time))
We can change the default colors by specifying new values inside a scale.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species))+
geom_smooth(method="lm")+
scale_color_manual(values=c("red","blue","green"))
Tip To specify manual colors, you can use either their names or their hexadecimal codes. For example, you can choose colors by name from an online source like in this cheatsheet, or you can pick a hexadecimal code from a source like here. I personally prefer the hexadecimal option for manual colors.
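As a small sketch of the hexadecimal route, the colors can be supplied as a named vector so that each species is tied to a specific color regardless of the order of the levels. The three hex codes and the name species_colours are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical colour choices as hexadecimal codes, named by species
species_colours <- c(setosa = "#1B9E77",
                     versicolor = "#D95F02",
                     virginica = "#7570B3")

p <- ggplot(data = iris, mapping = aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(color = Species)) +
  # names of the vector are matched against the values of Species
  scale_color_manual(values = species_colours)
p
```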
We can also map the colors to a continuous variable. This creates a color bar legend item.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Sepal.Width))+
geom_smooth(method="lm")
Tip Here, you can also choose among different palettes to find the right continuous palette. Some palette packages are used very often, such as RColorBrewer and wesanderson, if you are a fan of his choice of colors ;)
library(wesanderson)
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Sepal.Width))+
geom_smooth(method="lm") +
scale_color_gradientn(colours = wes_palette("Moonrise3"))
Tip You can also use simple base R color palettes like rainbow() or terrain.colors(). Use ? to look at the help for these functions and see how to use them.
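For instance, a base R palette can be passed to the same gradient scale used above. This is only a sketch; the choice of five colors is arbitrary.

```r
library(ggplot2)

p <- ggplot(data = iris, mapping = aes(x = Petal.Length, y = Petal.Width)) +
  geom_point(aes(color = Sepal.Width)) +
  # terrain.colors(5) returns five colours that the gradient interpolates between
  scale_color_gradientn(colours = terrain.colors(5))
p
```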
We can change the size of all points by a fixed amount by specifying size outside the aesthetic parameter.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species),size=3)+
geom_smooth(method="lm")
We can map another variable to the size of the points. This is done by specifying size inside the aesthetic mapping. Now the size of the points denotes Sepal.Width. A new legend group is created to show this new aesthetic.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species,size=Sepal.Width))+
geom_smooth(method="lm")
Here, as a quick example, we will try to make use of the different combinations of geoms, aes and color in simple plots.
Let’s take a quick look at some widely used plot types like histograms and density plots in ggplot. Intuitively, these can be drawn with geom_histogram() and geom_density(). Using bins and binwidth in geom_histogram(), one can customize the histogram.
ggplot(data=iris,mapping=aes(x=Sepal.Length))+
geom_histogram()
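The two customization options mentioned above can be sketched as follows. The values 15 and 0.25 are arbitrary and only meant to show the difference between fixing the number of bins and fixing their width.

```r
library(ggplot2)

# Fixed number of bins (15 is an arbitrary choice)
p_bins <- ggplot(data = iris, mapping = aes(x = Sepal.Length)) +
  geom_histogram(bins = 15)

# Fixed bin width in the units of the x axis (0.25 is also arbitrary)
p_width <- ggplot(data = iris, mapping = aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25)

p_bins
p_width
```

Setting one of the two explicitly also avoids the message ggplot prints when it falls back to its default number of bins.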
Let’s look at the same data as a density plot.
ggplot(data=iris,mapping=aes(x=Sepal.Length))+
geom_density()
The above plot is not very informative. Let’s see how the different species contribute:
ggplot(data=iris,mapping=aes(x=Sepal.Length))+
geom_density(aes(fill = Species), alpha = 0.8)
Note The alpha option inside geom_density controls the transparency of the plot.
Task Make boxplots similar to the one we did here in this exercise for the other three count files (counts_filtered.txt, counts_vst.txt and counts_deseq2.txt).
Tip You can save the plots themselves as R objects. You will get the plot by just calling those objects. You can then add layers to those objects. An example is shown below:
plot_obj_1 <- ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Sepal.Width))+
geom_smooth(method="lm")
plot_obj_1
plot_obj_2 <- plot_obj_1 +
scale_color_gradientn(colours = wes_palette("Moonrise3"))
plot_obj_2
This way, you can create different plot objects for the different counts. We will use them in the later exercises.
We can create subplots using the faceting functionality.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Sepal.Width))+
geom_smooth(method="lm") +
facet_wrap(~Species)
We can try the same with the gene counts data, faceted by time.
ggplot(data = gc_long) +
geom_boxplot(mapping = aes(x = Sample_Name, y = log10(count + 1), color = Time)) +
facet_wrap(~Time)
In the above plot, you see some empty samples in each facet. In this case, you could use facet_grid together with the space and scales options to make it look neat and intuitive. You can use ?facet_grid and ?facet_wrap to figure out the exact difference between the two.
ggplot(data = gc_long) +
geom_boxplot(mapping = aes(x = Sample_Name, y = log10(count + 1), color = Time)) +
facet_grid(~Time , scales = "free", space = "free")
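The effect of these two options can be seen on a small invented data set where each sample belongs to exactly one time point. The toy data frame and its values are made up for this sketch, and only the x axis is freed here.

```r
library(ggplot2)

# Invented data: three samples, each belonging to a single time point
toy <- data.frame(Sample_Name = rep(c("t0_A", "t0_B", "t2_A"), each = 4),
                  Time = rep(c("t0", "t0", "t2"), each = 4),
                  value = c(1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6))

# Fixed scales: every facet shows all three samples, some of them empty
p_fixed <- ggplot(toy, aes(x = Sample_Name, y = value)) +
  geom_boxplot() +
  facet_grid(~Time)

# Free x scale and free space: each facet keeps only its own samples,
# and the facet width follows the number of samples it contains
p_free <- ggplot(toy, aes(x = Sample_Name, y = value)) +
  geom_boxplot() +
  facet_grid(~Time, scales = "free_x", space = "free_x")

p_fixed
p_free
```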
You can also make a grid from different variables using the vars() function together with the rows and cols options!
ggplot(data = gc_long) +
geom_boxplot(mapping = aes(x = Sample_Name, y = log10(count + 1), color = Time)) +
facet_grid(rows = vars(Time), cols = vars(Replicate), scales = "free", space = "free")
Here, we will quickly mention how one can add labels to the plots. Items on the plot can be labelled using the geom_text or geom_label geoms.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species))+
geom_text(aes(label=Species,hjust=0),nudge_x=0.5,size=3)
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species))+
geom_label(aes(label=Species,hjust=0),nudge_x=0.5,size=3)
The R package ggrepel allows for non-overlapping labels.
library(ggrepel)
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species))+
geom_text_repel(aes(label=Species),size=3)
Custom annotations of any geom can be added arbitrarily anywhere on the plot.
ggplot(data=iris,mapping=aes(x=Petal.Length,y=Petal.Width))+
geom_point(aes(color=Species))+
annotate("text",x=2.5,y=2.1,label="There is a random line here")+
annotate("segment",x=2,xend=4,y=1.5,yend=2)
Let’s now make some bar charts with the data we have. We can start with the simple iris data first.
ggplot(data=iris,mapping=aes(x=Species,y=Petal.Width))+
geom_col()
Note There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity() and it leaves the data as is.
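To make the difference concrete, here is a small sketch on iris. The summary step with aggregate() and the object name iris_mean are additions for illustration.

```r
library(ggplot2)

# geom_bar(): only x is mapped, bar height is the number of rows per species
p_count <- ggplot(data = iris, mapping = aes(x = Species)) +
  geom_bar()

# geom_col(): bar height is taken from the data, here the mean petal width
iris_mean <- aggregate(Petal.Width ~ Species, data = iris, FUN = mean)
p_mean <- ggplot(data = iris_mean, mapping = aes(x = Species, y = Petal.Width)) +
  geom_col()

p_count
p_mean
```

Passing the raw iris rows to geom_col(), as in the plot further up, stacks all 50 values per species, so the bar height is their sum rather than their mean.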
Similarly, we can use the gene counts data to make a barplot as well. But first, let’s get the data into the right format for bar plots. This is where knowledge of tidyverse would be super useful.
se <- function(x) sqrt(var(x)/length(x))
gc_long %>%
group_by(Time) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
head()
## # A tibble: 4 × 3
## Time mean se
## <chr> <dbl> <dbl>
## 1 t0 0.560 0.00237
## 2 t2 0.605 0.00248
## 3 t24 0.675 0.00262
## 4 t6 0.606 0.00247
Note There are a couple of things to note here. In the above example, we use the pipe symbol %>%, which passes the output of one command as the input to the next. Then we group the data by the variable Time, followed by summarizing the count with the mean() function and our own se() function to get the mean and standard error of the respective counts. The head() function just prints the first few lines.
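As a quick sanity check of the se() helper, the sketch below compares it with the more familiar form sd(x) / sqrt(n) on a few invented values.

```r
# Standard error of the mean, as defined above
se <- function(x) sqrt(var(x) / length(x))

# Invented values for a quick check
x <- c(2, 4, 6, 8)

se(x)
# The same quantity written as standard deviation over sqrt(n)
sd(x) / sqrt(length(x))
```

Both expressions should print the same number, since sd(x) is the square root of var(x).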
Now that we have summarized the data to be able to plot the bar graph that we want, we can pass the data directly to ggplot using the %>% sign.
gc_long %>%
group_by(Time) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
ggplot(aes(x=Time, y=mean)) +
geom_bar(stat = "identity")
Note Notice that the %>% sign is used in the tidyverse based commands and + is used for all the ggplot based commands.
One can also easily flip the x and y axes.
gc_long %>%
group_by(Time) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
ggplot(aes(x=Time, y=mean)) +
geom_col() +
coord_flip()
Now that we have the bar plots, we can also add error bars to them using the se values we calculated in the previous step.
gc_long %>%
group_by(Time) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
ggplot(aes(x=Time, y=mean, fill = Time)) +
geom_col() +
geom_errorbar(aes(ymax=mean+se,ymin=mean-se),width=0.2)
Let’s now try to make stacked bars. For this, we first need to make the data more usable for stacked bars. Let’s use the group_by function to make the groups based on both Time and Replicate.
se <- function(x) sqrt(var(x)/length(x))
gc_long %>%
group_by(Time, Replicate) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
head()
## # A tibble: 6 × 4
## # Groups: Time [2]
## Time Replicate mean se
## <chr> <chr> <dbl> <dbl>
## 1 t0 A 0.587 0.00424
## 2 t0 B 0.579 0.00419
## 3 t0 C 0.515 0.00385
## 4 t2 A 0.622 0.00438
## 5 t2 B 0.617 0.00436
## 6 t2 C 0.577 0.00416
Let’s build the stacked bars!
gc_long %>%
group_by(Time, Replicate) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
ggplot(aes(x=Time, y=mean, fill = Replicate)) +
geom_col(position = "stack")
One can also have dodged bars, placed side by side instead of stacked.
gc_long %>%
group_by(Time, Replicate) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
ggplot(aes(x=Time, y=mean, fill = Replicate)) +
geom_col(position = "dodge")
We can now try to plot error bars on them. The error bars would look weird and complicated if one forgets to add position = "dodge" to geom_errorbar() as well.
gc_long %>%
group_by(Time, Replicate) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
ggplot(aes(x= Time, y= mean, fill = Replicate)) +
geom_col(position = "dodge") +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se), position = "dodge")
Note It is important that you keep track of what kind of aesthetics you give when you initialize ggplot() and what you add in the geoms later.
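One way to keep the bars and the error bars aligned is to create a single position object and reuse it in both geoms. The sketch below uses an invented summary table; the toy data frame, its values and the width settings are assumptions for illustration.

```r
library(ggplot2)

# Invented summary table: two time points, two replicates
toy <- data.frame(Time = rep(c("t0", "t2"), each = 2),
                  Replicate = rep(c("A", "B"), times = 2),
                  mean = c(0.59, 0.58, 0.62, 0.61),
                  se = c(0.02, 0.03, 0.02, 0.04))

# One shared position object, so both layers are dodged by the same amount
dodge <- position_dodge(width = 0.9)

p <- ggplot(toy, aes(x = Time, y = mean, fill = Replicate)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se),
                position = dodge, width = 0.2)
p
```

Because fill = Replicate is set in ggplot(), both layers inherit the same grouping, which is what the dodging relies on.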
You can also make these error bars look nicer by playing around with some of the available parameters, as in the example below:
gc_long %>%
group_by(Time, Replicate) %>%
summarise(mean=mean(log10(count +1)),se=se(log10(count +1))) %>%
ggplot(aes(x= Time, y= mean, fill = Replicate)) +
geom_col(position = position_dodge2()) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se), position = position_dodge2(.9, padding = .6))
Task Make the following plots.
Tip It is more of a tidyverse exercise than a ggplot one, because to get these plots you need to get the data into the right format.
Task Plot 1:
Task Plot 2:
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggrepel_0.9.1 wesanderson_0.3.6 gridExtra_2.3
## [4] jpeg_0.1-9 ggpubr_0.4.0 cowplot_1.1.1
## [7] ggthemes_4.2.4 scales_1.2.1 forcats_0.5.2
## [10] stringr_1.4.1 purrr_0.3.5 readr_2.1.3
## [13] tidyr_1.2.1 tibble_3.1.8 tidyverse_1.3.2
## [16] reshape2_1.4.4 ggplot2_3.3.6 formattable_0.2.1
## [19] kableExtra_1.3.4 dplyr_1.0.10 lubridate_1.8.0
## [22] leaflet_2.1.1 yaml_2.3.5 fontawesome_0.3.0.9000
## [25] captioner_2.2.3 bookdown_0.29 knitr_1.40
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-155 fs_1.5.2 webshot_0.5.4
## [4] httr_1.4.4 tools_4.1.3 backports_1.4.1
## [7] bslib_0.4.0 utf8_1.2.2 R6_2.5.1
## [10] DBI_1.1.3 mgcv_1.8-39 colorspace_2.0-3
## [13] withr_2.5.0 tidyselect_1.2.0 compiler_4.1.3
## [16] cli_3.4.1 rvest_1.0.3 xml2_1.3.3
## [19] labeling_0.4.2 sass_0.4.2 systemfonts_1.0.4
## [22] digest_0.6.29 rmarkdown_2.17 svglite_2.1.0
## [25] pkgconfig_2.0.3 htmltools_0.5.3 dbplyr_2.2.1
## [28] fastmap_1.1.0 highr_0.9 htmlwidgets_1.5.4
## [31] rlang_1.0.6 readxl_1.4.1 rstudioapi_0.14
## [34] jquerylib_0.1.4 generics_0.1.3 farver_2.1.1
## [37] jsonlite_1.8.2 crosstalk_1.2.0 car_3.1-0
## [40] googlesheets4_1.0.1 magrittr_2.0.3 Matrix_1.5-1
## [43] Rcpp_1.0.9 munsell_0.5.0 fansi_1.0.3
## [46] abind_1.4-5 lifecycle_1.0.3 stringi_1.7.8
## [49] carData_3.0-5 plyr_1.8.7 crayon_1.5.2
## [52] lattice_0.20-45 haven_2.5.1 splines_4.1.3
## [55] hms_1.1.2 pillar_1.8.1 ggsignif_0.6.3
## [58] reprex_2.0.2 glue_1.6.2 evaluate_0.17
## [61] modelr_0.1.9 vctrs_0.4.2 tzdb_0.3.0
## [64] cellranger_1.1.0 gtable_0.3.1 assertthat_0.2.1
## [67] cachem_1.0.6 xfun_0.33 broom_1.0.1
## [70] rstatix_0.7.0 googledrive_2.0.0 viridisLite_0.4.1
## [73] gargle_1.2.1 ellipsis_0.3.2