This is the Tidyverse coursework material for the Introduction to R Programming for Life Scientists course, Uppsala, fall 2018.
Welcome to the hands-on workshop “Tidy Work in Tidyverse”. Most of what you need to complete the tutorials and challenges was covered in the lecture. However, some tasks require that you check the docs or search online. Not all our solutions are optimal, so let us know if you can do better or solve things in a different way. If you get stuck, look at the hints first, then search online, and if you are still stuck, ask a TA. There is a lot of material, so do not feel bad if you do not solve all the tasks. Good luck!
Rewrite the following code chunk as one pipe (magrittr):
my_cars <- mtcars[, c(1:4, 7)]
my_cars <- my_cars[my_cars$disp > mean(my_cars$disp), ]
print(my_cars)
my_cars <- colMeans(my_cars)
library(tidyverse)
library(magrittr)  # provides the %T>% and %$% pipes
my_cars <- mtcars %>%
  select(c(1:4, 7)) %>%
  filter(disp > mean(disp)) %T>%
  print() %>%
  colMeans()
Rewrite the correlations below using pipes.
cor(mtcars$gear, mtcars$mpg)
mtcars %$% cor(gear, mpg)
cor(mtcars)
mtcars %>% cor()
Convert the mtcars dataset to a tibble vehicles.
Select the number of cylinders (cyl) variable using:
[[index]] accessor,
[[string]] accessor,
$ accessor.
Convert vehicles back to a data.frame called automobiles.
# 1
vehicles <- mtcars %>% as_tibble()
# 2
vehicles[['cyl']]
vehicles[[2]]
vehicles$cyl
# 3
vehicles %T>%
  {print(.[['cyl']])} %T>%
  {print(.[[2]])} %>%
  .$cyl
# 4
vehicles
# 5
vehicles %>% head(n = 30)
# 6
options(tibble.print_min = 15, tibble.print_max = 30)
# 7
automobiles <- as.data.frame(vehicles)
Do you think tibbles are lazy? Try to create a tibble that tests whether lazy evaluation applies to tibbles too.
tibble(x = sample(1:10, size = 10, replace = TRUE), y = log10(x))
The nycflights13 package contains information about all flights that departed from NYC (i.e., EWR, JFK and LGA) in 2013: 336,776 flights with 19 variables. To help understand what causes delays, it also includes a number of other useful datasets: weather, planes, airports, airlines. We will use it to practice working with tibbles and dplyr.
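As a first orientation in the data described above, the following is a minimal sketch, assuming the dplyr and nycflights13 packages are already installed. It only inspects the tables; the comparison of column names is our own illustration of how the companion tables relate to flights, not part of the course material.

```r
# Sketch only: assumes dplyr and nycflights13 are installed.
library(dplyr)
library(nycflights13)

# Size of the main table: rows are flights, columns are variables.
dim(flights)

# The NYC departure airports mentioned above.
flights %>% distinct(origin)

# Column names shared with the companion tables hint at how
# each of them relates to flights.
intersect(names(flights), names(airlines))
intersect(names(flights), names(planes))
intersect(names(flights), names(weather))
```

Looking at the shared column names before starting the exercises makes it easier to see which variables identify a carrier, a plane or a weather observation.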
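Load the nycflights13 package (install if necessary),
Look at the flights tibble.
Select all columns except carrier and arr_time,
Select carrier, tailnum and origin,
Hide columns from day through carrier,
Select all columns that have to do with arrival (hint: ?tidyselect),
Select columns based on the vector v <- c("arr_time", "sched_arr_time", "arr_delay"),
Rename column dest to destination using:
select() and
rename()
install.packages('nycflights13')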
library('nycflights13')
?nycflights13
flights
flights %>% select(-carrier, -arr_time)
flights %>% select(carrier, tailnum, origin)
flights %>% select(-(day:carrier))
flights %>% select(contains('arr_')) # or
v <- c("arr_time", "sched_arr_time", "arr_delay")
flights %>% select(v) # or
flights %>% select(one_of(v))
flights %>% select(destination = dest)
flights %>% rename(destination = dest)
# select keeps only the renamed column while rename returns the whole dataset
# with the column renamed
Select rows 1234 through 1258 (hint: ?slice),
Sample (hint: ?sample_n()) 3 random flights per day in March,
Find unique() routes and sort them by origin,
Find distinct() routes and sort them by origin,
Is unique() more efficient than distinct()?
flights %>% filter(arr_delay < 0)
flights %>% filter(dep_delay >= 10, dep_delay <= 33) # or
flights %>% filter(between(dep_delay, 10, 33))
flights %>% filter(is.na(arr_time))
flights %>% slice(1234:1258)
flights %>% filter(month == 3) %>%
  group_by(day) %>%
  sample_n(3)
flights %>%
  filter(month == 1) %>%
  group_by(carrier) %>%
  top_n(5, dep_delay)
air_time is the amount of time in minutes spent in the air. Add a new column air_spd that will contain aircraft’s airspeed in mph,
as above, but keep only the new air_spd variable,
use rownames_to_column() on mtcars to add car model as an extra column,
flights %>% mutate(air_spd = distance/(air_time / 60))
flights %>% transmute(air_spd = distance/(air_time / 60))
mtcars %>% rownames_to_column('model')
Use group_by(), summarise() and n() to see how many planes were delayed (departure) every month,
flights %>%
  filter(dep_delay > 0) %>%
  group_by(month) %>%
  summarise(num_dep_delayed = n())
What was the mean dep_delay per month?
flights %>%
  group_by(month) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
flights %>%
  filter(arr_delay > 0) %>%
  group_by(origin) %>%
  summarise(cnt = n()) %>%
  arrange(desc(cnt))
Use summarise() to sum total dep_delay per month in hours,
flights %>%
  group_by(month) %>%
  summarise(tot_dep_delay = sum(dep_delay/60, na.rm = TRUE))
Use group_size() on carrier. What does it return?
flights %>%
  group_by(carrier) %>%
  group_size()
Use n_groups() to check the number of unique origin-carrier pairs,
flights %>%
  group_by(origin, carrier) %>%
  n_groups()
Note on ungroup: Depending on the version of dplyr, you may or may not need to use ungroup() if you want to group your data on some other variables. In the newer versions, summarise() drops one grouping level (the last one), while mutate() keeps the grouping unchanged.
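The effect described in the note can be checked directly. The sketch below uses mtcars rather than flights so that it runs without extra data, and the object names (by_cyl_gear, per_group) are ours, chosen for illustration only. It assumes a dplyr version that provides group_vars().

```r
# Sketch only: shows how summarise() peels off one grouping level
# and what ungroup() changes. Object names are illustrative.
library(dplyr)

by_cyl_gear <- mtcars %>% group_by(cyl, gear)
group_vars(by_cyl_gear)          # both grouping variables are present

# summarise() collapses each (cyl, gear) group to one row and
# drops the last grouping variable, so the result is grouped by cyl.
per_group <- by_cyl_gear %>% summarise(mean_mpg = mean(mpg))
group_vars(per_group)

# Without ungroup(), a second summarise() still works per cyl ...
per_group %>% summarise(overall = mean(mean_mpg))

# ... whereas after ungroup() it works on the whole table.
per_group %>% ungroup() %>% summarise(overall = mean(mean_mpg))
```

Calling group_vars() after each step is a cheap way to find out which grouping is still active in the dplyr version you have installed.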
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_1.5 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.8
## [5] purrr_0.2.5 readr_1.1.1 tidyr_0.8.2 tibble_1.4.2
## [9] ggplot2_3.1.0 tidyverse_1.2.1 captioner_2.2.3 bookdown_0.7
## [13] knitr_1.20
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 cellranger_1.1.0 plyr_1.8.4 compiler_3.5.0
## [5] pillar_1.3.0 bindr_0.1.1 tools_3.5.0 digest_0.6.18
## [9] lubridate_1.7.4 jsonlite_1.5 evaluate_0.12 nlme_3.1-137
## [13] gtable_0.2.0 lattice_0.20-38 pkgconfig_2.0.2 rlang_0.3.0.1
## [17] cli_1.0.1 rstudioapi_0.8 yaml_2.2.0 haven_1.1.2
## [21] xfun_0.4 bindrcpp_0.2.2 withr_2.1.2 xml2_1.2.0
## [25] httr_1.3.1 hms_0.4.2 rprojroot_1.3-2 grid_3.5.0
## [29] tidyselect_0.2.5 glue_1.3.0 R6_2.3.0 readxl_1.1.0
## [33] rmarkdown_1.10 modelr_0.1.2 backports_1.1.2 scales_1.0.0
## [37] htmltools_0.3.6 rvest_0.3.2 assertthat_0.2.0 colorspace_1.3-2
## [41] stringi_1.2.4 lazyeval_0.2.1 munsell_0.5.0 broom_0.5.0
## [45] crayon_1.3.4
Page built on: 14-Nov-2018 at 15:06:50.
2018 | SciLifeLab > NBIS