Tidy work in Tidyverse

RaukR 2023 • Advanced R for Bioinformatics

Marcin Kierczak

27-Jun-2023

Learning Outcomes

When this module is complete, you will:

know what tidyverse is and a bit about its history
be able to use different pipes, including advanced ones and placeholders
know whether the data you work with are tidy
be able to load, debug and tidy your data
understand how to combine data sets using join_*
be aware of useful packages within tidyverse

Tidyverse — what is it all about?

tidyverse is a collection of packages 📦,
created by Hadley Wickham,
has become a de facto standard in data analysis,
a philosophy of programming or a programming paradigm: everything is about 🌊 the flow of 🧹 tidy data.

`?(Tidyverse OR !Tidyverse)`

Warning

☠️ There are still some people out there talking about the tidyverse curse though… ☠️

Navigating the balance between base R and the tidyverse is a challenge to learn.
- Robert A. Muenchen

Typical Tidyverse Workflow

Source: http://www.storybench.org/getting-started-with-tidyverse-in-r/

Introduction to Pipes or Let My Data Flow 🌊

René Magritte, La trahison des images, Wikimedia Commons

magrittr package — tidyverse and beyond
the %>% pipe
- x %>% f $\equiv$ f(x)
- x %>% f(y) $\equiv$ f(x, y)
- x %>% f %>% g %>% h $\equiv$ h(g(f(x)))
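As a quick check of the equivalences above, here is a minimal sketch that compares a piped chain with its nested form. The vector `x` and the choice of `sqrt`, `sum` and `round` are illustrative assumptions, not part of the original slides.

```r
library(magrittr)

# Illustrative input (an assumption for this sketch)
x <- c(1, 4, 9)

# x %>% f %>% g %>% h, with an extra argument passed to the last step
piped <- x %>% sqrt %>% sum %>% round(digits = 1)

# The equivalent nested call h(g(f(x)))
nested <- round(sum(sqrt(x)), digits = 1)

# Both forms should give the same value
identical(piped, nested)
```

The piped form reads left to right in the order the operations are applied, while the nested form has to be read inside out.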

Introduction to Pipes

Instead of writing this:

result <- head(iris, n=3)

write this:

iris %>% head(n=3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Other Types of Pipes — `%T>%`

Provided by magrittr, which library(tidyverse) does not attach (load it with library(magrittr))
When you call a function for its side effects

rnorm(50) %>% 
  matrix(ncol = 2) %>% 
  plot() %>% 
  summary()

Length  Class   Mode 
     0   NULL   NULL

Other Types of Pipes — `%T>%`

rnorm(50) %>% 
  matrix(ncol = 2) %T>% 
  plot() %>% 
  summary()

       V1                V2         
 Min.   :-1.7845   Min.   :-2.2556  
 1st Qu.:-0.4140   1st Qu.:-0.7914  
 Median : 0.3771   Median :-0.1386  
 Mean   : 0.1313   Mean   :-0.1667  
 3rd Qu.: 0.6717   3rd Qu.: 0.5689  
 Max.   : 1.7395   Max.   : 2.2863

Other Types of `magrittr` Pipes — `%$%`

iris %>% cor(Sepal.Length, Sepal.Width)

Error in pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs", : object 'Sepal.Width' not found

We need the exposition pipe %$%, which exposes the variables of the data to the function:

iris %$% cor(Sepal.Length, Sepal.Width)

[1] -0.1175698

This is because cor() does not take a data argument as its first argument, which is what a pipe-friendly function is expected to do.

Other Types of `magrittr` Pipes — %<>%

It exists but can lead to somewhat confusing code! 💀

x %<>% f $\equiv$ x <- f(x)

M <- matrix(rnorm(16), nrow=4); M %<>% colSums(); M

[1] -0.6989181 -0.2663241 -0.9810612  0.8953174

Native R pipe

From R >= 4.1.0 we have a native |> pipe that is a bit faster than %>%, but its placeholder mechanism is more limited.

c(1,2,3,4,5) |> mean()

[1] 3

Since R 4.2.0, a simple placeholder _ is available. But 💀 only for named arguments.

mtcars |> lm(mpg ~ disp, data = _)


Call:
lm(formula = mpg ~ disp, data = mtcars)

Coefficients:
(Intercept)         disp  
   29.59985     -0.04122
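When the piped value has to go into an unnamed position, one possible workaround is to pipe into an anonymous function. The sketch below assumes R >= 4.2.0; the `labels` example and the "sample" prefix are made up for illustration.

```r
# Requires R >= 4.2.0 for the `_` placeholder
# (R >= 4.1.0 is enough for |> and the \(x) lambda syntax)

# `_` has to be passed to a named argument, here `data`
fit <- mtcars |> lm(mpg ~ disp, data = _)

# For an unnamed position, pipe into an anonymous function instead;
# the extra pair of parentheses turns the lambda into a call
labels <- c("a", "b") |> (\(v) paste("sample", v))()

labels
```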

Placeholders in `magrittr` Pipes

Sometimes we want to pass the resulting data to an argument other than the first one of the next function in the chain. magrittr provides a placeholder mechanism for this:

x %>% f(y, .) $\equiv$ f(y, x),
x %>% f(y, z = .) $\equiv$ f(y, z = x).

M <- rnorm(4) %>% matrix(nrow = 2)
M %>% `%*%`(., .)

          [,1]       [,2]
[1,]  7.788073  0.2613652
[2,] -4.643986 -0.1556185

Placeholders for nested expressions

But for nested expressions:

x %>% f(a = p(.), b = q(.)) $\equiv$ f(x, a = p(x), b = q(x))

x %>% {f(a = p(.), b = q(.))} $\equiv$ f(a = p(x), b = q(x))

print_M_summ <- function(nrow, ncol) paste0('Matrix M has: ', nrow, ' rows and ', ncol, ' cols.')
M %>% {print_M_summ(nrow(.), ncol(.))}

[1] "Matrix M has: 2 rows and 2 cols."

Placeholders – unary functions

We can even use placeholders as the first element of a pipe:

f <- . %>% sin %>% cos
f

Functional sequence with the following components:

 1. sin(.)
 2. cos(.)

Use 'functions' to extract the individual functions.

and, indeed the f function works:

7 %>% f

[1] 0.7918362

Time to do Lab 1.1

Tibbles

tibble is one of the unifying features of tidyverse,
it is a better implementation of data.frame,
data.frame objects can be coerced to tibble using as_tibble()

Convert `data.frame` to `tibble`

as_tibble(iris)

# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows

Tibbles from scratch with `tibble`

  tibble(
    x = 1,          # recycling
    y = runif(50), 
    z = x + y^2,
    outcome = rnorm(50)
  )

# A tibble: 50 × 4
       x     y     z outcome
   <dbl> <dbl> <dbl>   <dbl>
 1     1 0.152  1.02   0.230
 2     1 0.112  1.01  -1.39 
 3     1 0.283  1.08   0.340
 4     1 0.879  1.77  -0.411
 5     1 0.182  1.03  -0.731
 6     1 0.960  1.92  -0.141
 7     1 0.936  1.88   1.67 
 8     1 0.822  1.68   0.785
 9     1 0.745  1.55  -0.360
10     1 0.751  1.56   0.289
# ℹ 40 more rows

More on Tibbles

When you print a tibble:
- all columns that fit the screen are shown,
- only the first 10 rows are shown,
- data type for each column is shown.

as_tibble(cars)

# A tibble: 50 × 2
   speed  dist
   <dbl> <dbl>
 1     4     2
 2     4    10
 3     7     4
 4     7    22
 5     8    16
 6     9    10
 7    10    18
 8    10    26
 9    10    34
10    11    17
# ℹ 40 more rows

Tibble printing options

my_tibble %>% print(n = 50, width = Inf),
options(tibble.print_min = 15, tibble.print_max = 25),
options(dplyr.print_min = Inf),
options(tibble.width = Inf)

Subsetting Tibbles

vehicles will be our tibble version of cars

vehicles <- as_tibble(cars[1:5,])

We can access data like this:

vehicles[['speed']]
vehicles[[1]]
vehicles$speed

[1] 4 4 7 7 8
[1] 4 4 7 7 8
[1] 4 4 7 7 8

Or, alternatively, using placeholders:

vehicles %>% .$speed
vehicles %>% .[['speed']]
vehicles %>% .[[1]]

Note! Not all old R functions work with tibbles; then you have to use as.data.frame(my_tibble).
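One common reason older code trips over tibbles is that single-column subsetting with `[` behaves differently. The following is a minimal sketch of that difference and of the conversion back to a data.frame; the variable names are assumptions for illustration.

```r
library(tibble)

vehicles <- as_tibble(cars[1:5, ])

# `[` on a tibble always returns a tibble, even for a single column
one_col_tbl <- vehicles[, "speed"]

# `[` on a data.frame drops to a plain vector for a single column
one_col_df <- cars[1:5, "speed"]

# Convert back when a function insists on a classic data.frame
vehicles_df <- as.data.frame(vehicles)

class(one_col_tbl)
class(one_col_df)
class(vehicles_df)
```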

Partial Matching

cars <- cars[1:5,]; colnames(vehicles)

[1] "speed" "dist"

cars$spe      # partial matching

[1] 4 4 7 7 8

vehicles$spe  # no partial matching

Warning: Unknown or uninitialised column: `spe`.

NULL

Non-existing Columns

cars$gear

NULL

vehicles$gear

Warning: Unknown or uninitialised column: `gear`.

NULL

Time to do Lab 1.2

Loading Data

In tidyverse you import data using readr package that provides a number of useful data import functions:

read_delim() a generic function for reading x-delimited files. There are a number of convenience wrappers:
- read_csv() used to read comma-delimited files,
- read_csv2() used to read semicolon-delimited files,
- read_tsv() used to read tab-delimited files.
read_fwf() for reading fixed-width files, with helpers for specifying column positions:
- fwf_widths() for width-based reading,
- fwf_positions() for position-based reading.
read_table() for reading whitespace-delimited files.
read_log() for reading Apache-style logs.
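As an illustration of fixed-width reading, the sketch below writes a tiny file to a temporary location and reads it back with read_fwf() and fwf_widths(). The file content, column names and widths are assumptions made up for this example.

```r
library(readr)

# Write a small fixed-width file: 4 characters for the sample id
# (including the trailing space), then 3 characters for the value
path <- tempfile(fileext = ".txt")
writeLines(c("s01 1.5", "s02 2.5"), path)

# fwf_widths() describes the columns by their widths;
# col_types = "cd" asks for one character and one double column
dat <- read_fwf(
  path,
  col_positions = fwf_widths(c(4, 3), col_names = c("sample", "value")),
  col_types = "cd"
)

dat
```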

Loading Data

The most commonly used read_csv() has some familiar arguments like:

skip – to specify the number of rows to skip (headers),
col_names – to supply a vector of column names,
comment – to specify what character designates a comment,
na – to specify how missing values are represented.
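The arguments listed above can be combined in a single call. Here is a minimal sketch on a small temporary file; the file content, the "missing" marker and the column names are assumptions for illustration only.

```r
library(readr)

# A small file with one metadata line, no header row,
# a comment line and a custom missing-value marker
path <- tempfile(fileext = ".csv")
writeLines(
  c("instrument export v1",
    "s1,1.5",
    "s2,missing",
    "# a comment line",
    "s3,3.0"),
  path
)

dat <- read_csv(
  path,
  skip = 1,                          # skip the metadata line
  col_names = c("sample", "value"),  # supply the column names
  comment = "#",                     # ignore lines starting with '#'
  na = "missing",                    # treat "missing" as NA
  col_types = "cd"                   # character and double columns
)

dat
```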

Under the Hood – `parse_*` Functions

Under the hood, data-reading functions use parse_* functions:

parse_double("42.24")

[1] 42.24

parse_number("272'555'849,55", 
             locale = locale(decimal_mark = ",", 
                             grouping_mark = "'"
                            )
             )

[1] 272555850

parse_number(c('100%', 'price: 500$', '21sek', '42F'))

[1] 100 500  21  42

Parsing Strings

Strings can be represented in different encodings:

text1 <- 'På en ö är en å'
text2 <- 'Zażółć gęślą jaźń'

charToRaw(text2)
parse_character(text1, locale = locale(encoding = 'UTF-8'))
guess_encoding(charToRaw("Test"))
guess_encoding(charToRaw(text2))

Parsing Factors

R is using factors to represent categorical variables.
Supply known levels to parse_factor so that it warns you when an unknown level is present in the data:

landscapes <- c('mountains', 'swamps', 'seaside')
parse_factor(c('mountains', 'plains', 'seaside', 'swamps'), 
             levels = landscapes)

[1] mountains <NA>      seaside   swamps   
attr(,"problems")
# A tibble: 1 × 4
    row   col expected           actual
  <int> <int> <chr>              <chr> 
1     2    NA value in level set plains
Levels: mountains swamps seaside

Other Parsing Functions

Other parse_* functions: parse_vector, parse_time, parse_number, parse_logical, parse_integer, parse_double, parse_character, parse_date, parse_datetime and parse_guess.

guess_parser("2018-06-11 09:00:00")
parse_guess("2018-06-11 09:00:00")

guess_parser(c(1, 2.3, "23$", "54%"))
parse_guess(c(1, 2.3, "23$", "54%"))

[1] "datetime"
[1] "2018-06-11 09:00:00 UTC"
[1] "character"
[1] "1"   "2.3" "23$" "54%"

Writing to a File

The readr package also provides functions useful for writing tibbled data into a file:

write_csv()
write_tsv()
write_excel_csv()

They always save:

Text in UTF-8,
Dates in ISO8601

But saving in csv (or tsv) does mean you lose information about the type of data in particular columns. You can avoid this by using:

write_rds() and read_rds() to read/write objects in R binary rds format,
write_feather() and read_feather() from package feather to read/write objects in a fast binary format that other programming languages can access.
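The loss of type information can be seen in a small round trip. This is only a sketch: the tibble `d`, its column names and the temporary file paths are assumptions for illustration.

```r
library(readr)
library(tibble)

# A tibble with a factor column (illustrative data)
d <- tibble(
  grade = factor(c("lo", "hi"), levels = c("lo", "hi")),
  n = 1:2
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")

write_csv(d, csv_path)
write_rds(d, rds_path)

# Reading the csv back has to guess the column types again
from_csv <- read_csv(csv_path, show_col_types = FALSE)

# Reading the rds restores the object as it was saved
from_rds <- read_rds(rds_path)

class(from_csv$grade)
class(from_rds$grade)
```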

Time to do Lab 1.3

Basic Data Transformations with `dplyr`

Let us create a tibble:

bijou <- as_tibble(diamonds) %>% head()
bijou[1:5,]

# A tibble: 5 × 10
  carat cut     color clarity depth table price     x     y     z
  <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75

Picking Observations using `filter()`

bijou %>% filter(cut == 'Ideal' | cut == 'Premium', carat >= 0.23) %>% head()

# A tibble: 2 × 10
  carat cut     color clarity depth table price     x     y     z
  <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63

Floating point and `tidyverse`

Caution

⛵ Be careful with floating point comparisons!
🏴‍☠️ Also, rows with comparison resulting in NA are skipped by default!

bijou %>% filter(near(0.23, carat) | is.na(carat)) %>% head(n = 4)

# A tibble: 2 × 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
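Both cautions can be reproduced on a tiny example. The sketch below uses a made-up tibble `d` with one missing carat value; the names and values are assumptions for illustration.

```r
library(dplyr)

# Exact comparison of floating point numbers can fail ...
exact <- sqrt(2)^2 == 2

# ... while near() compares with a small tolerance
approx <- near(sqrt(2)^2, 2)

# Illustrative data with one missing value
d <- tibble(carat = c(0.23, NA, 0.31))

# The row where the comparison yields NA is dropped
kept_default <- d %>% filter(carat > 0.25)

# Keep such rows explicitly if they are needed
kept_with_na <- d %>% filter(carat > 0.25 | is.na(carat))

c(exact = exact, approx = approx)
nrow(kept_default)
nrow(kept_with_na)
```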

Rearranging Observations using `arrange()`

bijou %>% arrange(cut, carat, desc(price))

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
2  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
3  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
4  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
5  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
6  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43

Caution

The NAs always end up at the end of the rearranged tibble.

Selecting Variables with `select()`

bijou %>% select(color, clarity, x:z) %>% head(n = 4)

# A tibble: 4 × 5
  color clarity     x     y     z
  <ord> <ord>   <dbl> <dbl> <dbl>
1 E     SI2      3.95  3.98  2.43
2 E     SI1      3.89  3.84  2.31
3 E     VS1      4.05  4.07  2.31
4 I     VS2      4.2   4.23  2.63

bijou %>% select(-(x:z)) %>% head(n = 5)

# A tibble: 5 × 7
  carat cut     color clarity depth table price
  <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int>
1  0.23 Ideal   E     SI2      61.5    55   326
2  0.21 Premium E     SI1      59.8    61   326
3  0.23 Good    E     VS1      56.9    65   327
4  0.29 Premium I     VS2      62.4    58   334
5  0.31 Good    J     SI2      63.3    58   335

Renaming variables

Note

rename is a variant of select that keeps all the variables; here it is used to rename x to var_x

bijou %>% rename(var_x = x) %>% head(n = 2)

# A tibble: 2 × 10
  carat cut     color clarity depth table price var_x     y     z
  <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31

Bring columns to front

Tip

use everything() to bring some columns to the front

bijou %>% select(x:z, everything()) %>% head(n = 2)

# A tibble: 2 × 10
      x     y     z carat cut     color clarity depth table price
  <dbl> <dbl> <dbl> <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int>
1  3.95  3.98  2.43  0.23 Ideal   E     SI2      61.5    55   326
2  3.89  3.84  2.31  0.21 Premium E     SI1      59.8    61   326

Create/alter new Variables with `mutate`

bijou %>% 
  mutate(p = x + z, q = p + y) %>% 
  select(-(depth:price)) %>% 
  head(n = 5)

# A tibble: 5 × 9
  carat cut     color clarity     x     y     z     p     q
  <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
1  0.23 Ideal   E     SI2      3.95  3.98  2.43  6.38  10.4
2  0.21 Premium E     SI1      3.89  3.84  2.31  6.2   10.0
3  0.23 Good    E     VS1      4.05  4.07  2.31  6.36  10.4
4  0.29 Premium I     VS2      4.2   4.23  2.63  6.83  11.1
5  0.31 Good    J     SI2      4.34  4.35  2.75  7.09  11.4

Create/alter new Variables with `transmute` 🧙‍♂️

Caution

Only the transformed variables will be retained.

bijou %>% transmute(carat, cut, sum = x + y + z) %>% head(n = 5)

# A tibble: 5 × 3
  carat cut       sum
  <dbl> <ord>   <dbl>
1  0.23 Ideal    10.4
2  0.21 Premium  10.0
3  0.23 Good     10.4
4  0.29 Premium  11.1
5  0.31 Good     11.4

Group and Summarize


bijou %>% group_by(cut) %>% summarize(max_price = max(price),
                                      mean_price = mean(price),
                                      min_price = min(price))

# A tibble: 4 × 4
  cut       max_price mean_price min_price
  <ord>         <int>      <dbl>     <int>
1 Good            335        331       327
2 Very Good       336        336       336
3 Premium         334        330       326
4 Ideal           326        326       326

bijou %>% group_by(cut, color) %>%  summarize(max_price = max(price), 
                                              mean_price = mean(price), 
                                              min_price = min(price)) %>% head(n = 5)

# A tibble: 5 × 5
# Groups:   cut [3]
  cut       color max_price mean_price min_price
  <ord>     <ord>     <int>      <dbl>     <int>
1 Good      E           327        327       327
2 Good      J           335        335       335
3 Very Good J           336        336       336
4 Premium   E           326        326       326
5 Premium   I           334        334       334

Other data manipulation tips

bijou %>% group_by(cut) %>% summarize(count = n())

# A tibble: 4 × 2
  cut       count
  <ord>     <int>
1 Good          2
2 Very Good     1
3 Premium       2
4 Ideal         1

When you need to regroup within the same pipe, use ungroup().
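A minimal sketch of why ungroup() matters: after a grouped mutate(), later steps still operate per group until the grouping is removed. The `stones` tibble and the new column names are assumptions for illustration.

```r
library(dplyr)

# Illustrative data
stones <- tibble(
  cut = c("Good", "Good", "Ideal"),
  price = c(327, 335, 326)
)

res <- stones %>%
  group_by(cut) %>%
  mutate(cut_mean = mean(price)) %>%     # computed within each cut
  ungroup() %>%
  mutate(overall_mean = mean(price))     # computed over all rows

res
```

Without the ungroup() step, overall_mean would simply repeat the per-cut means.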

The Concept of Tidy Data

Each and every observation is represented as exactly one row,
Each and every variable is represented by exactly one column,
Thus each data table cell contains only one value.

Usually data are untidy in only one way. However, if you are unlucky, they are really untidy and thus a pain to work with…

Tidy Data

Are these data tidy?

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa

Species	variable	value
setosa	Sepal.Length	5.1
setosa	Sepal.Length	4.9
setosa	Sepal.Length	4.7

Tidy Data

Are these data tidy?

Sepal.L.W	Petal.L.W	Species
5.1/3.5	1.4/0.2	setosa
4.9/3	1.4/0.2	setosa
4.7/3.2	1.3/0.2	setosa

Sepal.Length	5.1	4.9	4.7	4.6
Sepal.Width	3.5	3.0	3.2	3.1
Petal.Length	1.4	1.4	1.3	1.5
Petal.Width	0.2	0.2	0.2	0.2
Species	setosa	setosa	setosa	setosa

Tidying Data with `pivot_longer`

If some of your column names should be values of a variable, use pivot_longer (old gather):

bijou2 %>% head(n = 5)

# A tibble: 5 × 3
  cut     `2008` `2009`
  <ord>    <int>  <dbl>
1 Ideal      326    328
2 Premium    326    328
3 Good       327    329
4 Premium    334    336
5 Good       335    337

bijou2 %>% 
  pivot_longer(cols = c(`2008`, `2009`), names_to = 'year', values_to = 'price') %>% 
  head(n = 5)

# A tibble: 5 × 3
  cut     year  price
  <ord>   <chr> <dbl>
1 Ideal   2008    326
2 Ideal   2009    328
3 Premium 2008    326
4 Premium 2009    328
5 Good    2008    327

Tidying Data with `pivot_wider`

If some of your observations are scattered across many rows, use pivot_wider (old spread):

bijou3

# A tibble: 9 × 5
  cut     price clarity dimension measurement
  <ord>   <int> <ord>   <chr>           <dbl>
1 Ideal     326 SI2     x                3.95
2 Premium   326 SI1     x                3.89
3 Good      327 VS1     x                4.05
4 Ideal     326 SI2     y                3.98
5 Premium   326 SI1     y                3.84
6 Good      327 VS1     y                4.07
7 Ideal     326 SI2     z                2.43
8 Premium   326 SI1     z                2.31
9 Good      327 VS1     z                2.31

bijou3 %>% 
  pivot_wider(names_from = dimension, values_from = measurement) %>% 
  head(n = 5)

# A tibble: 3 × 6
  cut     price clarity     x     y     z
  <ord>   <int> <ord>   <dbl> <dbl> <dbl>
1 Ideal     326 SI2      3.95  3.98  2.43
2 Premium   326 SI1      3.89  3.84  2.31
3 Good      327 VS1      4.05  4.07  2.31
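The two pivoting functions are inverses of each other, which a small round trip can illustrate. This is a sketch on made-up data; the tibble `wide` and its values are assumptions, not the bijou2 or bijou3 objects used on the slides.

```r
library(tidyr)
library(tibble)

# Illustrative wide table: one price column per year
wide <- tibble(
  cut = c("Ideal", "Good"),
  `2008` = c(326, 327),
  `2009` = c(328, 329)
)

# Column names become values of the new 'year' variable
long <- wide %>%
  pivot_longer(cols = c(`2008`, `2009`),
               names_to = "year", values_to = "price")

# Values of 'year' become column names again
back <- long %>%
  pivot_wider(names_from = year, values_from = price)

long
back
```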

Tidying Data with `separate`

If some of your columns contain more than one value, use separate:

# A tibble: 2 × 4
  cut     price clarity dim           
  <ord>   <int> <ord>   <chr>         
1 Ideal     326 SI2     3.95/3.98/2.43
2 Premium   326 SI1     3.89/3.84/2.31

bijou4 %>% 
  separate(dim, into = c("x", "y", "z"), sep = "/", convert = T)

# A tibble: 2 × 6
  cut     price clarity     x     y     z
  <ord>   <int> <ord>   <dbl> <dbl> <dbl>
1 Ideal     326 SI2      3.95  3.98  2.43
2 Premium   326 SI1      3.89  3.84  2.31

Note

Here, sep is a character string, so it is interpreted as a regular expression to split on. It can also be a numeric vector of positions to split at. Pretty flexible approach!
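When sep is numeric, separate() treats it as a position to split at rather than as a pattern. A minimal sketch, assuming a made-up tibble `d` with a clarity column:

```r
library(tidyr)
library(tibble)

# Illustrative data
d <- tibble(clarity = c("SI2", "VS1"))

# sep = 2 splits after the second character;
# convert = TRUE turns the numeric-looking suffix into an integer
out <- d %>%
  separate(clarity, into = c("prefix", "suffix"), sep = 2, convert = TRUE)

out
```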

Tidying Data with `unite`

If a single variable is spread across several columns, use unite:

# A tibble: 5 × 7
  cut     price clarity_prefix clarity_suffix     x     y     z
  <ord>   <int> <chr>          <chr>          <dbl> <dbl> <dbl>
1 Ideal     326 SI             2               3.95  3.98  2.43
2 Premium   326 SI             1               3.89  3.84  2.31
3 Good      327 VS             1               4.05  4.07  2.31
4 Premium   334 VS             2               4.2   4.23  2.63
5 Good      335 SI             2               4.34  4.35  2.75

bijou5 %>% unite(clarity, clarity_prefix, clarity_suffix, sep='')

# A tibble: 5 × 6
  cut     price clarity     x     y     z
  <ord>   <int> <chr>   <dbl> <dbl> <dbl>
1 Ideal     326 SI2      3.95  3.98  2.43
2 Premium   326 SI1      3.89  3.84  2.31
3 Good      327 VS1      4.05  4.07  2.31
4 Premium   334 VS2      4.2   4.23  2.63
5 Good      335 SI2      4.34  4.35  2.75

Completing Missing Values Using `complete`

bijou %>% head(n = 10) %>% select(cut, clarity, price) %>% 
  mutate(continent = sample(c('Aus', 'Eur'), size = 6, replace = T)) -> missing_stones

missing_stones %>% complete(cut, continent) %>% head(n = 7)

# A tibble: 7 × 4
  cut       continent clarity price
  <ord>     <chr>     <ord>   <int>
1 Fair      Eur       <NA>       NA
2 Good      Eur       VS1       327
3 Good      Eur       SI2       335
4 Very Good Eur       VVS2      336
5 Premium   Eur       SI1       326
6 Premium   Eur       VS2       334
7 Ideal     Eur       SI2       326
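Because the example above relies on random sampling, a deterministic sketch may make the effect of complete() easier to see. The `obs` tibble below is made up for illustration.

```r
library(tidyr)
library(tibble)

# Illustrative data: the combination Ideal/Aus was never observed
obs <- tibble(
  cut = c("Good", "Good", "Ideal"),
  continent = c("Aus", "Eur", "Eur"),
  price = c(327, 335, 326)
)

# complete() adds a row for every missing cut/continent combination
# and fills the remaining columns with NA
full <- obs %>% complete(cut, continent)

full
```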

Combining Datasets

Often, we need to combine a number of data tables (relational data) to get the full picture of the data. Here different types of joins come to help:

mutating joins that add new variables to data table A based on matching observations (rows) from data table B

filtering joins that filter observations from data table A based on whether they match observations in data table B

set operations that treat observations in A and B as elements of a set.

Let us create two example tibbles that share a key:

key	x
a	A1
b	A2
c	A3
e	A4

key	y
a	B1
b	NA
c	B3
d	B4
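The slides that follow focus on mutating joins. As a complement, here is a minimal sketch of filtering joins and a set operation on the same two tables, with A and B constructed as shown above.

```r
library(dplyr)

A <- tibble(key = c("a", "b", "c", "e"), x = c("A1", "A2", "A3", "A4"))
B <- tibble(key = c("a", "b", "c", "d"), y = c("B1", NA, "B3", "B4"))

# Filtering joins keep or drop rows of A; no columns from B are added
in_both <- A %>% semi_join(B, by = "key")   # rows of A with a match in B
only_A  <- A %>% anti_join(B, by = "key")   # rows of A without a match in B

# Set operations need tables with the same columns
shared_keys <- intersect(select(A, key), select(B, key))

in_both
only_A
shared_keys
```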

The Joins Family — `inner_join`

key	x
a	A1
b	A2
c	A3
e	A4

key	y
a	B1
b	NA
c	B3
d	B4

A %>% inner_join(B, by = 'key')
# All non-matching rows are dropped!

# A tibble: 3 × 3
  key   x     y    
  <chr> <chr> <chr>
1 a     A1    B1   
2 b     A2    <NA> 
3 c     A3    B3

The Joins Family — `left_join`

key	x
a	A1
b	A2
c	A3
e	A4

key	y
a	B1
b	NA
c	B3
d	B4

A %>% left_join(B, by = 'key')

# A tibble: 4 × 3
  key   x     y    
  <chr> <chr> <chr>
1 a     A1    B1   
2 b     A2    <NA> 
3 c     A3    B3   
4 e     A4    <NA>

The Joins Family — `right_join`

key	x
a	A1
b	A2
c	A3
e	A4

key	y
a	B1
b	NA
c	B3
d	B4

A %>% right_join(B, by = 'key')

# A tibble: 4 × 3
  key   x     y    
  <chr> <chr> <chr>
1 a     A1    B1   
2 b     A2    <NA> 
3 c     A3    B3   
4 d     <NA>  B4

The Joins Family — `full_join`

key	x
a	A1
b	A2
c	A3
e	A4

key	y
a	B1
b	NA
c	B3
d	B4

A %>% full_join(B, by = 'key')

# A tibble: 5 × 3
  key   x     y    
  <chr> <chr> <chr>
1 a     A1    B1   
2 b     A2    <NA> 
3 c     A3    B3   
4 e     A4    <NA> 
5 d     <NA>  B4

Some Other Friends

stringr for string manipulation and regular expressions
forcats for working with factors
lubridate for working with dates

Thank you! Questions?

         _                  
platform x86_64-pc-linux-gnu
os       linux-gnu          
major    4                  
minor    2.3

2023 • SciLifeLab • NBIS • RaukR
