RaukR 2024 • Advanced R for Bioinformatics
Marcin Kierczak
21-Jun-2024
When this module is complete, you will:
know what tidyverse is and a bit about its history
be able to use different pipes, including advanced ones and placeholders
know whether the data you work with are tidy
be able to load, debug and tidy your data
understand how to combine data sets using *_join()
be aware of useful packages within tidyverse

?(Tidyverse OR !Tidyverse)
Warning
☠️ There are still some people out there talking about the tidyverse curse though… ☠️
Navigating the balance between base R and the tidyverse is a challenge to learn.
- Robert A. Muenchen

Source: http://www.storybench.org/getting-started-with-tidyverse-in-r/
magrittr package: tidyverse and beyond
%>% pipe
x %>% f \(\equiv\) f(x)
x %>% f(y) \(\equiv\) f(x, y)
x %>% f %>% g %>% h \(\equiv\) h(g(f(x)))
Instead of writing this:
write this:
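The two snippets this comparison refers to are not reproduced here. As a minimal sketch, assuming a made-up numeric vector `x` (not from the original slides), the nested and the piped form of the same computation could look like this:

```r
library(magrittr)

# Hypothetical input vector, chosen only for illustration
x <- c(1, 4, 9, 16)

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(x)), digits = 1)

# Piped form: reads left to right, one step at a time
piped <- x %>% sqrt() %>% mean() %>% round(digits = 1)

identical(nested, piped)
```

Both forms compute the same value; the pipe only changes the order in which the steps are written.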
%T>%: provided by magrittr, not in tidyverse
magrittr Pipes: %$%
We need the %$% pipe with exposition of variables:
This is because cor() has no data argument, whereas a pipe-friendly function takes the data as its very first argument.
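A short sketch of both special pipes may help here. The choice of the built-in `iris` columns and of `cat()` as the side-effect function is an assumption made for illustration only:

```r
library(magrittr)

# %$% exposes the columns of iris by name, so cor() can use them directly
r_exposed <- iris %$% cor(Sepal.Length, Sepal.Width)

# The base R equivalent, for comparison
r_base <- cor(iris$Sepal.Length, iris$Sepal.Width)

# %T>% calls cat() only for its side effect (printing) and passes the
# left-hand side on to sum(); with %>% instead, sum() would receive
# the NULL returned by cat()
total <- c(1, 2, 3) %T>% cat("\n") %>% sum()
```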
magrittr Pipes: %<>%
It exists but can lead to somewhat confusing code! 💀
x %<>% f \(\equiv\) x <- f(x)
Since R 4.1.0 there is a native |> pipe that is a bit faster than %>%. Since R 4.2.0 it supports the _ placeholder, but only for named arguments, so it is less flexible than the magrittr placeholder mechanism.
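As a minimal sketch of the native pipe (base R only, `mtcars` chosen here purely as an example data set):

```r
# The native pipe needs no package; the `_` placeholder requires R >= 4.2.0
total <- c(1, 4, 9) |> sqrt() |> sum()

# `_` must be passed to a named argument, here `data`
fit <- mtcars |> lm(mpg ~ wt, data = _)
```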
magrittr Pipes
Sometimes we want to pass the resulting data to an argument other than the first one of the next function in the chain. magrittr provides a placeholder mechanism for this:
x %>% f(y, .) \(\equiv\) f(y, x),
x %>% f(y, z = .) \(\equiv\) f(y, z = x).
But for nested expressions:
x %>% f(a = p(.), b = q(.)) \(\equiv\) f(x, a = p(x), b = q(x))
x %>% {f(a = p(.), b = q(.))} \(\equiv\) f(a = p(x), b = q(x))
We can even use placeholders as the first element of a pipe:
Functional sequence with the following components:
1. sin(.)
2. cos(.)
Use 'functions' to extract the individual functions.
and, indeed the f function works:
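The placeholder rules and the functional sequence can be sketched as follows. The inputs are made up for illustration; only `f`, built from `sin` and `cos`, follows the printed output above:

```r
library(magrittr)

# A dot used as a direct argument replaces first-argument insertion
repeated <- 3 %>% rep("a", times = .)        # rep("a", times = 3)

# A dot only inside a nested call: x is still inserted as first argument
with_length <- 1:3 %>% c(length(.))          # c(1:3, length(1:3))

# Braces switch first-argument insertion off
range_only <- 1:3 %>% {c(min(.), max(.))}    # c(min(1:3), max(1:3))

# A pipe starting with the dot defines a function (functional sequence)
f <- . %>% sin %>% cos
f(0)                                         # same as cos(sin(0))
```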

tibble is one of the unifying features of tidyverse,
it is a data.frame realization,
a data.frame can be coerced to tibble using as_tibble()
data.frame to tibble
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
tibble
# A tibble: 50 × 4
x y z outcome
<dbl> <dbl> <dbl> <dbl>
1 1 0.568 1.32 -0.953
2 1 0.140 1.02 -0.267
3 1 0.669 1.45 -0.291
4 1 0.267 1.07 -0.439
5 1 0.533 1.28 -0.490
6 1 0.483 1.23 0.851
7 1 0.129 1.02 -0.782
8 1 0.798 1.64 -0.520
9 1 0.946 1.89 0.223
10 1 0.0507 1.00 -0.269
# ℹ 40 more rows
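The calls that produced the two tibbles above are not shown. A sketch that gives tibbles of the same shape, assuming the column definitions that the printed values suggest (`z` looks like `x + y^2`; the random columns and the seed are assumptions):

```r
library(tibble)

# Coerce an existing data.frame
iris_tbl <- as_tibble(iris)

# Build a tibble from scratch; later columns may refer to earlier ones,
# and the length-one x is recycled to 50 rows
set.seed(42)  # arbitrary seed, for reproducibility only
my_tibble <- tibble(
  x = 1,
  y = runif(50),
  z = x + y^2,
  outcome = rnorm(50)
)
```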
tibble:
my_tibble %>% print(n = 50, width = Inf),
options(tibble.print_min = 15, tibble.print_max = 25),
options(dplyr.print_min = Inf),
options(tibble.width = Inf)
vehicles will be our tibble version of cars
We can access data like this:
Or, alternatively, using placeholders:
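The access examples themselves are missing above. A minimal sketch, assuming `vehicles` holds the first five rows of the built-in `cars` data set:

```r
library(tibble)
library(magrittr)

# Assumed definition of vehicles: a tibble version of (part of) cars
vehicles <- as_tibble(cars[1:5, ])

# Direct access by name or by position
by_dollar <- vehicles$speed
by_name   <- vehicles[["speed"]]
by_index  <- vehicles[[1]]

# The same kind of access inside a pipe needs the dot placeholder
piped_dollar <- vehicles %>% .$dist
piped_name   <- vehicles %>% .[["dist"]]
```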
Note! Not all old R functions work with tibbles; in that case you have to use as.data.frame(my_tibble).
In tidyverse you import data using the readr package, which provides a number of useful data import functions:
read_delim() a generic function for reading x-delimited files. There are a number of convenience wrappers:
read_csv() used to read comma-delimited files,
read_csv2() reads semicolon-delimited files,
read_tsv() that reads tab-delimited files.
read_fwf() for reading fixed-width files with its wrappers:
read_log() for reading Apache-style logs.
The most commonly used read_csv() has some familiar arguments like:
skip – to specify the number of rows to skip (headers),
col_names – to supply a vector of column names,
comment – to specify what character designates a comment,
na – to specify how missing values are represented.
parse_* Functions
Under the hood, data-reading functions use parse_* functions:
[1] 272555850
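The calls behind this output are not shown. The sketch below uses a made-up inline CSV string and a made-up number string; the printed value above is consistent with the `parse_number()` call, but the exact input is an assumption:

```r
library(readr)

# Hypothetical file content: one junk line, a header, a comment, a missing value
raw <- "exported by a hypothetical instrument\nid,value\n1,4.2\n# a comment line\n2,-\n"

# I() marks the string as literal data rather than a path (readr >= 2.0.0)
dat <- read_csv(I(raw), skip = 1, comment = "#", na = "-",
                show_col_types = FALSE)

# parse_* functions turn character vectors into typed vectors
d <- parse_double("42.24")
n <- parse_number("272'555'849,55",
                  locale = locale(decimal_mark = ",", grouping_mark = "'"))
```

With the default of seven significant digits, `n` prints in rounded form even though the decimals are kept.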
parse_factor so that it warns you when an unknown level is present in the data:
landscapes <- c('mountains', 'swamps', 'seaside')
parse_factor(c('mountains', 'plains', 'seaside', 'swamps'),
             levels = landscapes)
[1] mountains <NA> seaside swamps
attr(,"problems")
# A tibble: 1 × 4
row col expected actual
<int> <int> <chr> <chr>
1 2 NA value in level set plains
Levels: mountains swamps seaside
parse_
vector, time, number, logical, integer, double, character, date, datetime, guess
The readr package also provides functions useful for writing tibbled data into a file:
write_csv()
write_tsv()
write_excel_csv()
They always save:
But saving in csv (or tsv) does mean you lose information about the type of data in particular columns. You can avoid this by using:
write_rds() and read_rds() to read/write objects in R binary rds format,
write_feather() and read_feather() from package feather to read/write objects in a fast binary format that other programming languages can access.
dplyr
Let us create a tibble:
# A tibble: 5 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75

filter()
Caution
🚣 Be careful with floating point comparisons!
🦜 Also, rows with comparison resulting in NA are skipped by default!
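A sketch of `filter()` and the two cautions above. It assumes that `bijou` is the first six rows of `ggplot2::diamonds`, which matches the printed rows but is not stated in the text; the small `with_na` tibble is made up:

```r
library(dplyr)

# Assumed definition of bijou (requires the ggplot2 package for the data)
bijou <- as_tibble(ggplot2::diamonds) %>% head(n = 6)

# Conditions separated by commas are combined with AND
kept <- bijou %>% filter(cut %in% c("Ideal", "Premium"), carat > 0.22)

# Floating point: == can fail where near() succeeds
exact  <- sqrt(2)^2 == 2          # FALSE because of rounding error
approx <- near(sqrt(2)^2, 2)      # TRUE within a small tolerance

# Rows where the condition evaluates to NA are dropped unless asked for
with_na <- tibble(v = c(1, NA, 3))
dropped <- with_na %>% filter(v > 1)
kept_na <- with_na %>% filter(is.na(v) | v > 1)
```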
arrange()
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
4 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
5 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
6 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
Caution
The NAs always end up at the end of the rearranged tibble.
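A sketch of `arrange()`, again assuming `bijou` is the first six rows of `ggplot2::diamonds`; the sorting columns are a guess that reproduces the row order printed above, and the `with_na` tibble is made up:

```r
library(dplyr)

# Assumed definition of bijou
bijou <- as_tibble(ggplot2::diamonds) %>% head(n = 6)

# Sort by cut (an ordered factor), then by carat within each cut
sorted <- bijou %>% arrange(cut, carat)

# NAs go last, for ascending and for descending order alike
with_na <- tibble(v = c(2, NA, 1))
asc  <- with_na %>% arrange(v)
desc <- with_na %>% arrange(desc(v))
```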
select()
Note
rename is a variant of select, here used with everything() to move x to the beginning and rename it to var_x
# A tibble: 2 × 10
carat cut color clarity depth table price var_x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
Tip
use everything() to bring some columns to the front
# A tibble: 2 × 10
x y z carat cut color clarity depth table price
<dbl> <dbl> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int>
1 3.95 3.98 2.43 0.23 Ideal E SI2 61.5 55 326
2 3.89 3.84 2.31 0.21 Premium E SI1 59.8 61 326
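Calls that would give column layouts like the two shown above can be sketched as follows, assuming `bijou` is the first six rows of `ggplot2::diamonds`:

```r
library(dplyr)

# Assumed definition of bijou
bijou <- as_tibble(ggplot2::diamonds) %>% head(n = 6)

# rename() keeps every column in place and only changes the name
renamed <- bijou %>% rename(var_x = x) %>% head(n = 2)

# select() with everything() moves x, y, z to the front, keeping the rest
fronted <- bijou %>% select(x:z, everything()) %>% head(n = 2)
```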
mutate
# A tibble: 5 × 9
carat cut color clarity x y z p q
<dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 3.95 3.98 2.43 6.38 10.4
2 0.21 Premium E SI1 3.89 3.84 2.31 6.2 10.0
3 0.23 Good E VS1 4.05 4.07 2.31 6.36 10.4
4 0.29 Premium I VS2 4.2 4.23 2.63 6.83 11.1
5 0.31 Good J SI2 4.34 4.35 2.75 7.09 11.4
transmute 🧙♂️
Caution
Only the transformed variables will be retained.
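A sketch of both verbs, assuming `bijou` is the first six rows of `ggplot2::diamonds`. The definitions `p = x + z` and `q = p + y` agree with the printed values above; the `transmute()` columns are made up for illustration:

```r
library(dplyr)

# Assumed definition of bijou
bijou <- as_tibble(ggplot2::diamonds) %>% head(n = 6)

# mutate() adds columns; later definitions can use earlier ones
with_sums <- bijou %>%
  mutate(p = x + z, q = p + y) %>%
  select(-(depth:price))

# transmute() keeps only the columns named in the call
only_new <- bijou %>% transmute(carat, cut, total = x + y + z)
```

In dplyr 1.1.0 and later, `transmute()` is superseded by `mutate(.keep = "none")`, but it still works.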
bijou %>%
  group_by(cut, color) %>%
  summarize(max_price = max(price),
            mean_price = mean(price),
            min_price = min(price)) %>%
  head(n = 5)
# A tibble: 5 × 5
# Groups: cut [3]
cut color max_price mean_price min_price
<ord> <ord> <int> <dbl> <int>
1 Good E 327 327 327
2 Good J 335 335 335
3 Very Good J 336 336 336
4 Premium E 326 326 326
5 Premium I 334 334 334
# A tibble: 4 × 2
cut count
<ord> <int>
1 Good 2
2 Very Good 1
3 Premium 2
4 Ideal 1
When you need to regroup within the same pipe, use ungroup().
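The count table above and the regrouping advice can be sketched like this, assuming `bijou` is the first six rows of `ggplot2::diamonds`; the second pipeline is a made-up illustration of `ungroup()`:

```r
library(dplyr)

# Assumed definition of bijou
bijou <- as_tibble(ggplot2::diamonds) %>% head(n = 6)

# Number of rows per cut
by_cut <- bijou %>%
  group_by(cut) %>%
  summarize(count = n())

# Regrouping within one pipe: drop the old grouping explicitly first
regrouped <- bijou %>%
  group_by(cut) %>%
  mutate(cut_mean = mean(price)) %>%   # computed per cut
  ungroup() %>%
  group_by(color) %>%
  summarize(mean_of_cut_means = mean(cut_mean))
```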
Usually data are untidy in only one way. However, if you are unlucky, they are really untidy and thus a pain to work with…

Are these data tidy?
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |

| Species | variable | value |
|---|---|---|
| setosa | Sepal.Length | 5.1 |
| setosa | Sepal.Length | 4.9 |
| setosa | Sepal.Length | 4.7 |

Are these data tidy?
| Sepal.L.W | Petal.L.W | Species |
|---|---|---|
| 5.1/3.5 | 1.4/0.2 | setosa |
| 4.9/3 | 1.4/0.2 | setosa |
| 4.7/3.2 | 1.3/0.2 | setosa |

| Sepal.Length | 5.1 | 4.9 | 4.7 | 4.6 |
| Sepal.Width | 3.5 | 3.0 | 3.2 | 3.1 |
| Petal.Length | 1.4 | 1.4 | 1.3 | 1.5 |
| Petal.Width | 0.2 | 0.2 | 0.2 | 0.2 |
| Species | setosa | setosa | setosa | setosa |
pivot_longer
If some of your column names should be values of a variable, use pivot_longer (old gather):
pivot_wider
If some of your observations are scattered across many rows, use pivot_wider (old spread):
# A tibble: 9 × 5
cut price clarity dimension measurement
<ord> <int> <ord> <chr> <dbl>
1 Ideal 326 SI2 x 3.95
2 Premium 326 SI1 x 3.89
3 Good 327 VS1 x 4.05
4 Ideal 326 SI2 y 3.98
5 Premium 326 SI1 y 3.84
6 Good 327 VS1 y 4.07
7 Ideal 326 SI2 z 2.43
8 Premium 326 SI1 z 2.31
9 Good 327 VS1 z 2.31
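A round trip between the wide and the long layout can be sketched as follows. It assumes `bijou` is the first six rows of `ggplot2::diamonds`; the column names `dimension` and `measurement` follow the output above. Note that `pivot_longer()` orders rows by observation, so the row order differs from the printed table:

```r
library(dplyr)
library(tidyr)

# Assumed definition of bijou, reduced to three rows and six columns
bijou <- as_tibble(ggplot2::diamonds) %>% head(n = 6)
small <- bijou %>% head(n = 3) %>% select(cut, price, clarity, x, y, z)

# Wide to long: column names x, y, z become values of `dimension`
long <- small %>%
  pivot_longer(cols = x:z, names_to = "dimension", values_to = "measurement")

# Long to wide: the inverse operation restores one row per observation
wide <- long %>%
  pivot_wider(names_from = dimension, values_from = measurement)
```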
separate
If some of your columns contain more than one value, use separate:
# A tibble: 2 × 4
cut price clarity dim
<ord> <int> <ord> <chr>
1 Ideal 326 SI2 3.95/3.98/2.43
2 Premium 326 SI1 3.89/3.84/2.31
# A tibble: 2 × 6
cut price clarity x y z
<ord> <int> <ord> <dbl> <dbl> <dbl>
1 Ideal 326 SI2 3.95 3.98 2.43
2 Premium 326 SI1 3.89 3.84 2.31
Note
A numeric sep is interpreted as the position to split on. A character sep is interpreted as a regular expression, so it can also be a delimiting string/character. Pretty flexible approach!
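A sketch of the split shown in the two tables above. The input tibble is typed in by hand from the printed rows, and the use of `"/"` as the separator is an assumption based on how the `dim` values look:

```r
library(tibble)
library(tidyr)

# Input rebuilt from the printed rows (cut and clarity as plain strings)
dims <- tibble(
  cut = c("Ideal", "Premium"),
  price = c(326L, 326L),
  clarity = c("SI2", "SI1"),
  dim = c("3.95/3.98/2.43", "3.89/3.84/2.31")
)

# Split on the delimiter; convert = TRUE turns the pieces into numbers
split_dims <- dims %>%
  separate(dim, into = c("x", "y", "z"), sep = "/", convert = TRUE)
```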
unite
If some of your values are split across several columns, use unite:
# A tibble: 5 × 7
cut price clarity_prefix clarity_suffix x y z
<ord> <int> <chr> <chr> <dbl> <dbl> <dbl>
1 Ideal 326 SI 2 3.95 3.98 2.43
2 Premium 326 SI 1 3.89 3.84 2.31
3 Good 327 VS 1 4.05 4.07 2.31
4 Premium 334 VS 2 4.2 4.23 2.63
5 Good 335 SI 2 4.34 4.35 2.75
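Starting from a table like the one above, the two clarity columns could be merged as sketched here. The input is a hand-typed subset of the printed rows, and the new column name `clarity` is an assumption:

```r
library(tibble)
library(tidyr)

# Subset of the printed table, typed in by hand
parts <- tibble(
  cut = c("Ideal", "Premium", "Good"),
  price = c(326L, 326L, 327L),
  clarity_prefix = c("SI", "SI", "VS"),
  clarity_suffix = c("2", "1", "1")
)

# Paste the two columns together without a separator
united <- parts %>%
  unite(clarity, clarity_prefix, clarity_suffix, sep = "")
```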
complete
# A tibble: 7 × 4
cut continent clarity price
<ord> <chr> <ord> <int>
1 Fair Aus <NA> NA
2 Fair Eur <NA> NA
3 Good Aus VS1 327
4 Good Eur SI2 335
5 Very Good Aus VVS2 336
6 Very Good Eur <NA> NA
7 Premium Aus SI1 326
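The call behind this output is not shown. A minimal sketch of `complete()` on a made-up table (the `sales` tibble and its values are hypothetical, loosely modelled on the output above):

```r
library(tibble)
library(tidyr)

# Hypothetical input: not every cut/continent combination is present,
# and the factor level "Fair" does not occur at all
sales <- tibble(
  cut = factor(c("Good", "Good", "Premium"),
               levels = c("Fair", "Good", "Premium")),
  continent = c("Aus", "Eur", "Aus"),
  price = c(327L, 335L, 326L)
)

# complete() adds a row for every missing combination and fills it with NA;
# for factors, all levels count, including unused ones
full <- sales %>% complete(cut, continent)
```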
Often, we need to combine a number of data tables (relational data) to get the full picture of the data. Here different types of joins come to help:
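mutating joins that add new variables to data table A based on matching observations (rows) from data table B
filtering joins that filter observations from data table A based on whether they match observations in data table B
set operations that treat observations in A and B as elements of a set.
Let us create two example tibbles that share a key: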
| key | x |
|---|---|
| a | A1 |
| b | A2 |
| c | A3 |
| e | A4 |

| key | y |
|---|---|
| a | B1 |
| b | NA |
| c | B3 |
| d | B4 |
inner_join

| key | x |
|---|---|
| a | A1 |
| b | A2 |
| c | A3 |
| e | A4 |

| key | y |
|---|---|
| a | B1 |
| b | NA |
| c | B3 |
| d | B4 |
left_join

| key | x |
|---|---|
| a | A1 |
| b | A2 |
| c | A3 |
| e | A4 |

| key | y |
|---|---|
| a | B1 |
| b | NA |
| c | B3 |
| d | B4 |
right_join

| key | x |
|---|---|
| a | A1 |
| b | A2 |
| c | A3 |
| e | A4 |

| key | y |
|---|---|
| a | B1 |
| b | NA |
| c | B3 |
| d | B4 |
full_join

| key | x |
|---|---|
| a | A1 |
| b | A2 |
| c | A3 |
| e | A4 |

| key | y |
|---|---|
| a | B1 |
| b | NA |
| c | B3 |
| d | B4 |
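The results of the four joins are not shown above. A sketch using the two example tables (the names `A` and `B` follow the text; the two filtering joins at the end are added for comparison):

```r
library(dplyr)

# The two example tibbles sharing the column `key`
A <- tribble(~key, ~x,
             "a", "A1", "b", "A2", "c", "A3", "e", "A4")
B <- tribble(~key, ~y,
             "a", "B1", "b", NA_character_, "c", "B3", "d", "B4")

# Mutating joins: add the columns of B to A
inner <- inner_join(A, B, by = "key")  # keys present in both: a, b, c
left  <- left_join(A, B, by = "key")   # all keys of A: a, b, c, e
right <- right_join(A, B, by = "key")  # all keys of B: a, b, c, d
full  <- full_join(A, B, by = "key")   # all keys of A and B

# Filtering joins: keep or drop rows of A, never add columns
semi <- semi_join(A, B, by = "key")    # rows of A with a match in B
anti <- anti_join(A, B, by = "key")    # rows of A without a match in B
```

Key `b` stays in the inner join even though its `y` value is NA: the match is made on the key, not on the values.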
stringr for string manipulation and regular expressions
forcats for working with factors
lubridate for working with dates
_
platform x86_64-pc-linux-gnu
os linux-gnu
major 4
minor 3.2
2024 • SciLifeLab • NBIS • RaukR