class: center, middle, inverse, title-slide

.title[
# Tidy work in Tidyverse
]

.subtitle[
## R Foundation for Life Scientists
]

.author[
### Marcin Kierczak
]

---
exclude: true
count: false

<link href="https://fonts.googleapis.com/css?family=Roboto|Source+Sans+Pro:300,400,600|Ubuntu+Mono&subset=latin-ext" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.3.1/css/all.css" integrity="sha384-mzrmE5qonljUremFsqc01SB46JvROS7bZs3IO2EmfFsd15uHvIt+Y8vEf7N7fWAU" crossorigin="anonymous">

<!-- ----------------- Only edit title & author above this ----------------- -->

---
name: setup_livecode

# Livecode Setup

By typing `http://livecode.kierczak.net:7777` in your browser, you can access the livecode server.

---
name: learning_outcomes

# Learning Outcomes

<br>

Upon completing this module, you will:

* know what `tidyverse` is and a bit about its history
* be aware of useful packages within `tidyverse`
* be able to use basic pipes (including the native R pipe)
* know whether the data you are working with are tidy
* be able to do basic tidying of your data

---
name: tidyverse_overview

# Tidyverse -- What is it all About?

* [tidyverse](http://www.tidyverse.org) is a collection of R packages
* created by [Hadley Wickham](http://hadley.nz)
* has become a *de facto* standard in data analyses
* a philosophy of programming or a **programming paradigm**: everything is about the flow of tidy data

.center[
<img src="data/slide_tidyverse/hex-tidyverse.png", style="height:200px;">
<img src="data/slide_tidyverse/Hadley-wickham2016-02-04.jpeg", style="height:200px;">
<img src="data/slide_tidyverse/RforDataScience.jpeg", style="height:200px;">
]

.vsmall[sources of images: www.tidyverse.org, Wikipedia, www.tidyverse.org]

---
name: tidyverse_curse

# ?(Tidyverse OR !Tidyverse)

> ⚠️ There are still some people out there talking about the tidyverse curse though...
⚠️<br>

--

> Navigating the balance between base R and the tidyverse is a challenge to learn.<br>[-Robert A. Muenchen](http://r4stats.com/articles/why-r-is-hard-to-learn/)

--

.center[<img src="data/slide_tidyverse/tidyverse-flow.png", style="height:400px;">]

.vsmall[source: http://www.storybench.org/getting-started-with-tidyverse-in-r/]

---
name: intro_to_pipes

# Pipes or Let my Data Flow

.pull-left-50[
.center[<img src="data/slide_tidyverse/pipe_magritte.jpg", style="width:300px;">]

.vsmall[René Magritte, *La trahison des images*, [Wikimedia Commons](https://en.wikipedia.org/wiki/The_Treachery_of_Images#/media/File:MagrittePipe.jpg)]

.center[<img src="data/slide_tidyverse/magrittr.png", style="width:150px;">]
]

--

.pull-right-50[
* Let the data flow.
* *Ceci n'est pas une pipe* -- `magrittr`
* The `%>%` pipe:
  + `x %>% f` `\(\equiv\)` `f(x)`
  + `x %>% f(y)` `\(\equiv\)` `f(x, y)`
  + `x %>% f %>% g %>% h` `\(\equiv\)` `h(g(f(x)))`
]

--

.pull-right-50[
instead of writing this:

``` r
data <- iris
data <- head(data, n=3)
```
]

--

.pull-right-50[
write this:

``` r
iris %>% head(n=3)
```

```
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
```
]

---
name: native_r_pipe

# Native R Pipe

From R 4.1.0, we have a native pipe operator `|>` that is a bit faster than the `magrittr` pipe `%>%`. It differs from the `magrittr` pipe in some respects, however: for example, it does not accept the dot `.` as a placeholder. Since R 4.2.0 it offers a simpler `_` placeholder instead, which must be passed to a named argument.
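
The placeholder difference can be sketched as follows. This is a minimal illustration, assuming R 4.2.0 or later and an installed `magrittr`; the linear model on the built-in `mtcars` data is only a convenient example and not part of the original slides.

```r
library(magrittr)  # provides the %>% pipe

# magrittr pipe: the dot marks where the left-hand side is inserted
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)

# native pipe (assumes R >= 4.2.0): the underscore must be given to a named argument
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# both pipes fit the same model, so the coefficients agree
all.equal(coef(fit_magrittr), coef(fit_native))
```

In both cases the data frame ends up in the `data` argument rather than in the first position, which is where a plain pipe would have put it.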
``` r
c(1:5) |> mean()
```

```
## [1] 3
```

``` r
c(1:5) %>% mean()
```

```
## [1] 3
```

---
name: tibble_intro

# Tibbles

.pull-left-50[
.center[<img src="data/slide_tidyverse/tibble_tweet.jpg">]
]

.pull-right-50[
* `tibble` is one of the unifying features of tidyverse,
* it is a *better* `data.frame` realization,
* `data.frame` objects can be coerced to `tibble` using `as_tibble()`
]

---
name: convert_to_tibble

# Convert `data.frame` to `tibble`

``` r
as_tibble(iris)
```

```
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>
##  1          5.1         3.5          1.4         0.2 setosa
##  2          4.9         3            1.4         0.2 setosa
##  3          4.7         3.2          1.3         0.2 setosa
##  4          4.6         3.1          1.5         0.2 setosa
##  5          5           3.6          1.4         0.2 setosa
##  6          5.4         3.9          1.7         0.4 setosa
##  7          4.6         3.4          1.4         0.3 setosa
##  8          5           3.4          1.5         0.2 setosa
##  9          4.4         2.9          1.4         0.2 setosa
## 10          4.9         3.1          1.5         0.1 setosa
## 11          5.4         3.7          1.5         0.2 setosa
## 12          4.8         3.4          1.6         0.2 setosa
## 13          4.8         3            1.4         0.1 setosa
## 14          4.3         3            1.1         0.1 setosa
## 15          5.8         4            1.2         0.2 setosa
## # ℹ 135 more rows
```

---
name: tibble_from_scratch

# Tibbles from scratch with `tibble()`

``` r
tibble(
  x = 1, # recycling
  y = runif(4),
  z = x + y^2,
  outcome = rnorm(4)
)
```

--

```
## # A tibble: 4 × 4
##       x      y     z outcome
##   <dbl>  <dbl> <dbl>   <dbl>
## 1     1 0.0861  1.01   2.04
## 2     1 0.897   1.80  -0.527
## 3     1 0.179   1.03   1.14
## 4     1 0.694   1.48   0.485
```

---
name: more_on_tibbles

# More on Tibbles

* When you print a `tibble`:
  + all columns that fit the screen are shown,
  + only the first rows are shown (10 by default),
  + the data type of each column is shown.
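
A minimal sketch of how these printing rules can be steered for a single call, assuming the `tibble` package is installed; the object name `cars_tbl` is made up for this illustration.

```r
library(tibble)

cars_tbl <- as_tibble(cars)

# print only the first three rows instead of the default number
print(cars_tbl, n = 3)

# glimpse() turns the view around: one line per column, with its type
glimpse(cars_tbl)

# the column types shown in the header can also be queried directly
col_types <- vapply(cars_tbl, function(column) class(column)[1], character(1))
col_types
```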
``` r
as_tibble(cars)
```

```
## # A tibble: 50 × 2
##    speed  dist
##    <dbl> <dbl>
##  1     4     2
##  2     4    10
##  3     7     4
##  4     7    22
##  5     8    16
##  6     9    10
##  7    10    18
##  8    10    26
##  9    10    34
## 10    11    17
## 11    11    28
## 12    12    14
## 13    12    20
## 14    12    24
## 15    12    28
## # ℹ 35 more rows
```

---
name: tibble_printing_options

# Tibble Printing Options

* `my_tibble %>% print(n = 50, width = Inf)`,
* `options(tibble.print_min = 15, tibble.print_max = 25)`,
* `options(dplyr.print_min = Inf)`,
* `options(tibble.width = Inf)`

---
name: subsetting_tibbles

# Subsetting Tibbles

``` r
vehicles <- as_tibble(cars[1:5,])
vehicles %>% print(n = 5)
```

```
## # A tibble: 5 × 2
##   speed  dist
##   <dbl> <dbl>
## 1     4     2
## 2     4    10
## 3     7     4
## 4     7    22
## 5     8    16
```

--

We can subset tibbles in a number of ways:

``` r
vehicles[['speed']] # try also vehicles['speed']
vehicles[[1]]
vehicles$speed
```

```
## [1] 4 4 7 7 8
## [1] 4 4 7 7 8
## [1] 4 4 7 7 8
```

--

> **Note!** Not all old R functions work with tibbles; when one does not, convert the tibble back with `as.data.frame(my_tibble)`.

---
name: tibbles_partial_matching

# Tibbles are Stricter than `data.frames`

``` r
cars <- cars[1:5,]
```

``` r
cars$spe # partial matching
```

```
## [1] 4 4 7 7 8
```

``` r
vehicles$spe # no partial matching
```

```
## Warning: Unknown or uninitialised column: `spe`.
```

```
## NULL
```

``` r
cars$gear
```

```
## NULL
```

``` r
vehicles$gear
```

```
## Warning: Unknown or uninitialised column: `gear`.
```

```
## NULL
```

---
name: loading_data

# Loading Data

In `tidyverse` you import data using the `readr` package, which provides a number of useful data import functions:

* `read_delim()` a generic function for reading *-delimited files. There are a number of convenience wrappers:
  + `read_csv()` used to read comma-delimited files,
  + `read_csv2()` reads semicolon-delimited files,
  + `read_tsv()` that reads tab-delimited files.
* `read_fwf()` for reading fixed-width files with its wrappers:
  + `fwf_widths()` for width-based reading,
  + `fwf_positions()` for positions-based reading and
  + `read_table()` for reading white space-delimited fixed-width files.
* `read_log()` for reading Apache-style logs.

--

>The most commonly used `read_csv()` has some familiar arguments like:

* `skip` -- to specify the number of rows to skip (headers),
* `col_names` -- to supply a vector of column names,
* `comment` -- to specify what character designates a comment,
* `na` -- to specify how missing values are represented.

---
name: readr_writing

# Writing to a File

The `readr` package also provides functions useful for writing tibbled data into a file:

* `write_csv()`
* `write_tsv()`
* `write_excel_csv()`

They **always** save:

* text in UTF-8,
* dates in ISO8601

But saving in csv (or tsv) does mean you lose information about the type of data in particular columns. You can avoid this by using:

* `write_rds()` and `read_rds()` to read/write objects in R binary rds format,
* `write_feather()` and `read_feather()` from package `feather` to read/write objects in a fast binary format that other programming languages can access.
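
The loss of type information can be seen in a small round trip. This is only a sketch, assuming `readr` (2.0 or later, for `show_col_types`) and `tibble` are installed; the `stones` tibble and the temporary file names are made up for the illustration.

```r
library(readr)
library(tibble)

# a tiny example table with an ordered factor and an integer column
stones <- tibble(
  cut   = factor(c("Good", "Ideal"), levels = c("Good", "Ideal"), ordered = TRUE),
  price = c(327L, 326L)
)

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".rds")

write_csv(stones, csv_file)
write_rds(stones, rds_file)

from_csv <- read_csv(csv_file, show_col_types = FALSE)
from_rds <- read_rds(rds_file)

class(from_csv$cut)  # "character": the ordered factor did not survive the csv
class(from_rds$cut)  # "ordered" "factor": the rds file kept the type
```

With csv you would have to restore the factor levels yourself after reading, for instance through the `col_types` argument of `read_csv()`.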
---
name: basic_data_transformations

# Basic Data Transformations with `dplyr`

Let us create a tibble:

``` r
bijou <- as_tibble(diamonds) %>% head()
bijou[1:5, ]
```

```
## # A tibble: 5 × 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75
```

.center[ <img src="data/slide_tidyverse/diamonds.png", style="height:200px"> ]

---
name: filter

# Picking Observations using `filter()`

``` r
bijou %>%
  filter(cut == 'Ideal' | cut == 'Premium', carat >= 0.23) %>%
  head(n = 4)
```

```
## # A tibble: 2 × 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
```

--

>Be careful with floating point comparisons! <br> Also, rows with comparison resulting in `NA` are skipped by default!
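
Both cautions can be reproduced in a few lines. This is a minimal sketch assuming `dplyr` is installed; the `measurements` tibble is made up for the illustration.

```r
library(dplyr)

# exact comparison fails because of rounding in the last bits
sqrt(2)^2 == 2       # FALSE
# near() compares within a small tolerance instead
near(sqrt(2)^2, 2)   # TRUE

# hypothetical data with one missing value
measurements <- tibble(carat = c(0.23, NA, 0.31))

# the NA row is dropped silently: only 0.31 remains
only_large <- measurements %>% filter(carat > 0.25)

# asking for the NA rows explicitly keeps them
large_or_missing <- measurements %>% filter(carat > 0.25 | is.na(carat))
```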
``` r
bijou %>%
  filter(near(0.23, carat) | is.na(carat)) %>%
  head(n = 4)
```

```
## # A tibble: 2 × 10
##   carat cut   color clarity depth table price     x     y     z
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.23 Good  E     VS1      56.9    65   327  4.05  4.07  2.31
```

---
name: arrange

# Rearranging Observations using `arrange()`

``` r
bijou %>% arrange(cut, carat, desc(price))
```

--

```
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 2  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 3  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
## 4  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 5  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 6  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
```

--

>The `NA`s always end up at the end of the rearranged `tibble`!

---
name: select

# Selecting Variables with `select()`

Simple `select` with a range:

``` r
bijou %>% select(color, clarity, x:z) %>% head(n = 4)
```

```
## # A tibble: 4 × 5
##   color clarity     x     y     z
##   <ord> <ord>   <dbl> <dbl> <dbl>
## 1 E     SI2      3.95  3.98  2.43
## 2 E     SI1      3.89  3.84  2.31
## 3 E     VS1      4.05  4.07  2.31
## 4 I     VS2      4.2   4.23  2.63
```

--

Exclusive `select`:

``` r
bijou %>% select(-(x:z)) %>% head(n = 4)
```

```
## # A tibble: 4 × 7
##   carat cut     color clarity depth table price
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int>
## 1  0.23 Ideal   E     SI2      61.5    55   326
## 2  0.21 Premium E     SI1      59.8    61   326
## 3  0.23 Good    E     VS1      56.9    65   327
## 4  0.29 Premium I     VS2      62.4    58   334
```

---
name: rename

# Renaming Variables

>`rename()` changes column names while keeping every column in its place; here `x` is renamed to `var_x`.

``` r
bijou %>% rename(var_x = x) %>% head(n = 5)
```

--

```
## # A tibble: 5 × 10
##   carat cut     color clarity depth table price var_x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good    E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good    J     SI2      63.3    58   335  4.34  4.35  2.75
```

---
name: bring_to_front

# Bring columns to front

>use `everything()` to bring some columns to the front:

``` r
bijou %>% select(x:z, everything()) %>% head(n = 4)
```

--

```
## # A tibble: 4 × 10
##       x     y     z carat cut     color clarity depth table price
##   <dbl> <dbl> <dbl> <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int>
## 1  3.95  3.98  2.43  0.23 Ideal   E     SI2      61.5    55   326
## 2  3.89  3.84  2.31  0.21 Premium E     SI1      59.8    61   326
## 3  4.05  4.07  2.31  0.23 Good    E     VS1      56.9    65   327
## 4  4.2   4.23  2.63  0.29 Premium I     VS2      62.4    58   334
```

---
name: mutate

# Create or Alter Variables with `mutate`

``` r
bijou %>%
  mutate(p = x + z, q = p + y) %>%
  select(-(depth:price)) %>%
  head(n = 5)
```

```
## # A tibble: 5 × 9
##   carat cut     color clarity     x     y     z     p     q
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      3.95  3.98  2.43  6.38  10.4
## 2  0.21 Premium E     SI1      3.89  3.84  2.31  6.2   10.0
## 3  0.23 Good    E     VS1      4.05  4.07  2.31  6.36  10.4
## 4  0.29 Premium I     VS2      4.2   4.23  2.63  6.83  11.1
## 5  0.31 Good    J     SI2      4.34  4.35  2.75  7.09  11.4
```

---
name: transmute

# Create or Alter Variables with `transmute`

>Only the transformed variables will be retained.
``` r
bijou %>% transmute(carat, cut, sum = x + y + z) %>% head(n = 5)
```

```
## # A tibble: 5 × 3
##   carat cut       sum
##   <dbl> <ord>   <dbl>
## 1  0.23 Ideal    10.4
## 2  0.21 Premium  10.0
## 3  0.23 Good     10.4
## 4  0.29 Premium  11.1
## 5  0.31 Good     11.4
```

---
name: grouped_summaries

# Group and Summarize

``` r
bijou %>%
  group_by(cut) %>%
  summarize(max_price = max(price),
            mean_price = mean(price),
            min_price = min(price))
```

```
## # A tibble: 4 × 4
##   cut       max_price mean_price min_price
##   <ord>         <int>      <dbl>     <int>
## 1 Good            335        331       327
## 2 Very Good       336        336       336
## 3 Premium         334        330       326
## 4 Ideal           326        326       326
```

--

``` r
bijou %>%
  group_by(cut, color) %>%
  summarize(max_price = max(price),
            mean_price = mean(price),
            min_price = min(price)) %>%
  head(n = 4)
```

```
## # A tibble: 4 × 5
## # Groups:   cut [3]
##   cut       color max_price mean_price min_price
##   <ord>     <ord>     <int>      <dbl>     <int>
## 1 Good      E           327        327       327
## 2 Good      J           335        335       335
## 3 Very Good J           336        336       336
## 4 Premium   E           326        326       326
```

---
name: other_data_manipulations

# Other data manipulation tips

``` r
bijou %>% group_by(cut) %>% summarize(count = n())
```

```
## # A tibble: 4 × 2
##   cut       count
##   <ord>     <int>
## 1 Good          2
## 2 Very Good     1
## 3 Premium       2
## 4 Ideal         1
```

--

When you need to regroup within the same pipe, use `ungroup()`.

---
name: concept_of_tidy_data

# The Concept of Tidy Data

Data are tidy *sensu Wickham* if:

* each and every observation is represented as exactly one row,
* each and every variable is represented by exactly one column,
* thus each data table cell contains only one value.

<img src="data/slide_tidyverse/tidy_data.png" width="2560" style="display: block; margin: auto auto auto 0;" />

Usually data are untidy in only one way. However, if you are unlucky, they are really untidy and thus a pain to work with...
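
To make the three rules concrete, here is a small sketch with the same information stored in an untidy and in a tidy layout. It assumes `tibble` and `dplyr` are installed; the tables and their prices are made up for the illustration.

```r
library(tibble)
library(dplyr)

# untidy: the values of the variable "year" are hidden in the column names
untidy <- tibble(
  cut    = c("Good", "Ideal"),
  `2008` = c(327, 326),
  `2009` = c(331, 330)
)

# tidy: one row per observation (cut and year), one column per variable
tidy <- tibble(
  cut   = rep(c("Good", "Ideal"), each = 2),
  year  = rep(c(2008, 2009), times = 2),
  price = c(327, 331, 326, 330)
)

# with tidy data a per-year summary is a plain group_by() and summarize()
price_by_year <- tidy %>%
  group_by(year) %>%
  summarize(mean_price = mean(price))
price_by_year
```

The same summary on the untidy table would need the column names to be treated as data first, which is exactly what the tidying verbs are for.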
--- name: tidy_data # Tidy Data <img src="data/slide_tidyverse/tidy_data.png" style="height:100px"> -- .center[**Are these data tidy?**] .pull-left-70[ <table class="table table-striped table-hover table-responsive table-condensed" style=""> <thead> <tr> <th style="text-align:center;"> Sepal.Length </th> <th style="text-align:center;"> Sepal.Width </th> <th style="text-align:center;"> Petal.Length </th> <th style="text-align:center;"> Petal.Width </th> <th style="text-align:center;"> Species </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 5.1 </td> <td style="text-align:center;"> 3.5 </td> <td style="text-align:center;"> 1.4 </td> <td style="text-align:center;"> 0.2 </td> <td style="text-align:center;"> setosa </td> </tr> <tr> <td style="text-align:center;"> 4.9 </td> <td style="text-align:center;"> 3.0 </td> <td style="text-align:center;"> 1.4 </td> <td style="text-align:center;"> 0.2 </td> <td style="text-align:center;"> setosa </td> </tr> <tr> <td style="text-align:center;"> 4.7 </td> <td style="text-align:center;"> 3.2 </td> <td style="text-align:center;"> 1.3 </td> <td style="text-align:center;"> 0.2 </td> <td style="text-align:center;"> setosa </td> </tr> </tbody> </table> ] -- .pull-right-30[ <table class="table table-striped table-hover table-responsive table-condensed" style=""> <thead> <tr> <th style="text-align:center;"> Species </th> <th style="text-align:center;"> variable </th> <th style="text-align:center;"> value </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> setosa </td> <td style="text-align:center;"> Sepal.Length </td> <td style="text-align:center;"> 5.1 </td> </tr> <tr> <td style="text-align:center;"> setosa </td> <td style="text-align:center;"> Sepal.Length </td> <td style="text-align:center;"> 4.9 </td> </tr> <tr> <td style="text-align:center;"> setosa </td> <td style="text-align:center;"> Sepal.Length </td> <td style="text-align:center;"> 4.7 </td> </tr> </tbody> </table> ] <br> <hr><br> -- .pull-left-50[ 
<table class="table table-striped table-hover table-responsive table-condensed" style=""> <thead> <tr> <th style="text-align:center;"> Sepal.L.W </th> <th style="text-align:center;"> Petal.L.W </th> <th style="text-align:center;"> Species </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 5.1/3.5 </td> <td style="text-align:center;"> 1.4/0.2 </td> <td style="text-align:center;"> setosa </td> </tr> <tr> <td style="text-align:center;"> 4.9/3 </td> <td style="text-align:center;"> 1.4/0.2 </td> <td style="text-align:center;"> setosa </td> </tr> <tr> <td style="text-align:center;"> 4.7/3.2 </td> <td style="text-align:center;"> 1.3/0.2 </td> <td style="text-align:center;"> setosa </td> </tr> </tbody> </table> ] -- .pull-right-50[ <table class="table table-striped table-hover table-responsive table-condensed" style=""> <tbody> <tr> <td style="text-align:left;"> Sepal.Length </td> <td style="text-align:center;"> 5.1 </td> <td style="text-align:center;"> 4.9 </td> <td style="text-align:center;"> 4.7 </td> <td style="text-align:center;"> 4.6 </td> </tr> <tr> <td style="text-align:left;"> Sepal.Width </td> <td style="text-align:center;"> 3.5 </td> <td style="text-align:center;"> 3.0 </td> <td style="text-align:center;"> 3.2 </td> <td style="text-align:center;"> 3.1 </td> </tr> <tr> <td style="text-align:left;"> Petal.Length </td> <td style="text-align:center;"> 1.4 </td> <td style="text-align:center;"> 1.4 </td> <td style="text-align:center;"> 1.3 </td> <td style="text-align:center;"> 1.5 </td> </tr> <tr> <td style="text-align:left;"> Petal.Width </td> <td style="text-align:center;"> 0.2 </td> <td style="text-align:center;"> 0.2 </td> <td style="text-align:center;"> 0.2 </td> <td style="text-align:center;"> 0.2 </td> </tr> <tr> <td style="text-align:left;"> Species </td> <td style="text-align:center;"> setosa </td> <td style="text-align:center;"> setosa </td> <td style="text-align:center;"> setosa </td> <td style="text-align:center;"> setosa </td> </tr> 
</tbody> </table> ]

---
name: tidying_data_pivot_longer

# Tidying Data with `tidyr::pivot_longer`

If some of your column names are actually values of a variable, use `pivot_longer` (replaces `gather`):

``` r
bijou2 %>% head(n = 5)
```

```
## # A tibble: 5 × 3
##   cut     `2008` `2009`
##   <ord>    <int>  <dbl>
## 1 Ideal      326   330.
## 2 Premium    326   330.
## 3 Good       327   331.
## 4 Premium    334   338.
## 5 Good       335   339.
```

``` r
bijou2 %>%
  pivot_longer(c(`2008`, `2009`), names_to = 'year', values_to = 'price') %>%
  head(n = 5)
```

```
## # A tibble: 5 × 3
##   cut     year  price
##   <ord>   <chr> <dbl>
## 1 Ideal   2008   326
## 2 Ideal   2009   330.
## 3 Premium 2008   326
## 4 Premium 2009   330.
## 5 Good    2008   327
```

---
name: tidying_data_pivot_wider

# Tidying Data with `tidyr::pivot_wider`

If some of your observations are scattered across many rows, use `pivot_wider` (replaces `spread`):

``` r
bijou3
```

```
## # A tibble: 9 × 5
##   cut     price clarity dimension measurement
##   <ord>   <int> <ord>   <chr>           <dbl>
## 1 Ideal     326 SI2     x                3.95
## 2 Premium   326 SI1     x                3.89
## 3 Good      327 VS1     x                4.05
## 4 Ideal     326 SI2     y                3.98
## 5 Premium   326 SI1     y                3.84
## 6 Good      327 VS1     y                4.07
## 7 Ideal     326 SI2     z                2.43
## 8 Premium   326 SI1     z                2.31
## 9 Good      327 VS1     z                2.31
```

``` r
bijou3 %>%
  pivot_wider(names_from=dimension, values_from=measurement) %>%
  head(n = 4)
```

```
## # A tibble: 3 × 6
##   cut     price clarity     x     y     z
##   <ord>   <int> <ord>   <dbl> <dbl> <dbl>
## 1 Ideal     326 SI2      3.95  3.98  2.43
## 2 Premium   326 SI1      3.89  3.84  2.31
## 3 Good      327 VS1      4.05  4.07  2.31
```

---
name: tidying_data_separate

# Tidying Data with `separate`

If some of your columns contain more than one value, use `separate`:

``` r
bijou4
```

```
## # A tibble: 5 × 4
##   cut     price clarity dim
##   <ord>   <int> <ord>   <chr>
## 1 Ideal     326 SI2     3.95/3.98/2.43
## 2 Premium   326 SI1     3.89/3.84/2.31
## 3 Good      327 VS1     4.05/4.07/2.31
## 4 Premium   334 VS2     4.2/4.23/2.63
## 5 Good      335 SI2     4.34/4.35/2.75
```

``` r
bijou4 %>% separate(dim, into = c("x", "y", "z"), sep = "/", convert = TRUE)
```

```
## # A tibble: 5 × 6
##   cut     price clarity     x     y     z
##   <ord>   <int> <ord>   <dbl> <dbl> <dbl>
## 1 Ideal     326 SI2      3.95  3.98  2.43
## 2 Premium   326 SI1      3.89  3.84  2.31
## 3 Good      327 VS1      4.05  4.07  2.31
## 4 Premium   334 VS2      4.2   4.23  2.63
## 5 Good      335 SI2      4.34  4.35  2.75
```

---
name: tidying_data_unite

# Tidying Data with `unite`

If a single value is split across several columns, use `unite`:

``` r
bijou5
```

```
## # A tibble: 5 × 7
##   cut     price clarity_prefix clarity_suffix     x     y     z
##   <ord>   <int> <chr>          <chr>          <dbl> <dbl> <dbl>
## 1 Ideal     326 SI             2               3.95  3.98  2.43
## 2 Premium   326 SI             1               3.89  3.84  2.31
## 3 Good      327 VS             1               4.05  4.07  2.31
## 4 Premium   334 VS             2               4.2   4.23  2.63
## 5 Good      335 SI             2               4.34  4.35  2.75
```

``` r
bijou5 %>% unite(clarity, clarity_prefix, clarity_suffix, sep='')
```

```
## # A tibble: 5 × 6
##   cut     price clarity     x     y     z
##   <ord>   <int> <chr>   <dbl> <dbl> <dbl>
## 1 Ideal     326 SI2      3.95  3.98  2.43
## 2 Premium   326 SI1      3.89  3.84  2.31
## 3 Good      327 VS1      4.05  4.07  2.31
## 4 Premium   334 VS2      4.2   4.23  2.63
## 5 Good      335 SI2      4.34  4.35  2.75
```

---
name: missing_complete

# Completing Missing Values Using `complete`

``` r
# bijou has only 6 rows, hence size = 6
missing_stones <- bijou %>%
  head(n = 10) %>%
  select(cut, clarity, price) %>%
  mutate(continent = sample(c('AusOce', 'Eur'), size = 6, replace = TRUE))
```

``` r
missing_stones %>% complete(cut, continent)
```

```
## # A tibble: 12 × 4
##    cut       continent clarity price
##    <ord>     <chr>     <ord>   <int>
##  1 Fair      AusOce    <NA>       NA
##  2 Fair      Eur       <NA>       NA
##  3 Good      AusOce    <NA>       NA
##  4 Good      Eur       VS1       327
##  5 Good      Eur       SI2       335
##  6 Very Good AusOce    <NA>       NA
##  7 Very Good Eur       VVS2      336
##  8 Premium   AusOce    <NA>       NA
##  9 Premium   Eur       SI1       326
## 10 Premium   Eur       VS2       334
## 11 Ideal     AusOce    SI2       326
## 12 Ideal     Eur       <NA>       NA
```

---
name: joins

# Joining Data with `_join`

.pull-left-50[
```
## # A tibble: 5 × 2
##     key value1
##   <dbl> <chr>
## 1     1 a
## 2     2 b
## 3     3 c
## 4     4 d
## 5     5 e
```
]

.pull-right-50[
```
## # A tibble: 5 × 2
##     key value2
##   <dbl> <chr>
## 1     1 A
## 2     2 B
## 3     3 C
## 4     6 F
## 5     7 G
```
]

**Example:**

``` r
inner_join(tibble1, tibble2, by = 'key')
```

```
## # A tibble: 3 × 3
##     key value1 value2
##   <dbl> <chr>  <chr>
## 1     1 a      A
## 2     2 b      B
## 3     3 c      C
```

`[inner, left, right, full]_join` are available. Try these!

---
name: more_tidyverse

# Some Other Friends

* `stringr` for string manipulation and regular expressions,
* `forcats` for working with factors,
* `lubridate` for working with dates.

---
name: end-slide
class: end-slide

# Thank you. Questions?

[More?](https://nbisweden.github.io/raukr-2024/)

.end-text[
<p class="smaller">
<span class="small" style="line-height: 1.2;">Graphics from </span><img src="./assets/freepik.jpg" style="max-height:20px; vertical-align:middle;"><br>
Created: 31-Oct-2024 • <a href="https://www.scilifelab.se/">SciLifeLab</a> • <a href="https://nbis.se/">NBIS</a>
</p>
]