This is the Tidyverse course work material for RaukR 2019.
Welcome to the hands-on workshop “Tidy Work in Tidyverse”. Most of the things necessary to complete the tutorials and challenges were covered in the lecture. However, sometimes the tasks require that you check the docs or search online. Not all our solutions are optimal. Let us know if you can do better or solve things in a different way. If stuck, look at hints, next google and if still stuck, turn to TA. It is a lot of material, do not fee bad if you do not solve all tasks. If you completed Challenge 3 you are good and have used the most important features of tidyverse! Good luck!
Datasets are available here.
Rewrite the following code chunks as pipes (Load package magrittr
because tidyverse
supports only the %>%
pipe!):
my_cars <- mtcars[, c(1:4, 10)]
my_cars <- my_cars[my_cars$disp > mean(my_cars$disp), ]
my_cars <- colMeans(my_cars)
mtcars %>%
select(c(1:4, 10)) %>%
filter(disp > mean(disp)) %>%
colMeans() -> my_cars
What is wrong with our solution?
# It is better to have the result assigned on the left hand side: result <- expression. In this case the expression is the **whole** pipe.
# Our 'expression -> result' is correct but can easily be missed when reading the code.
The summary(x)
function is a bit special: when you type summary(x)
in the console, print
is called in an implicite way. Pipe call does not do, so you will have to invoke print
in an explicite way. But the %T>%
does unbranch for one call only, you will have to make printing of the summary
a one single composed call using {}
.
summary(cars)
colSums(cars)
cars %T>% {print(summary(.))} %>% colSums()
Rewrite correlations to pipes.
cor(mtcars$gear, mtcars$mpg)
mtcars %$% cor(gear, mpg)
cor(mtcars)
mtcars %>% cor()
Given is the dim_summary(nrows, ncols)
function which takes matrix nrows
and ncols
as arguments and prints this info about its dimensions:
dim_summary <- function(nrows, ncols) {
print(
paste0('Matrix M has: ', nrows, ' rows and ', ncols, ' columns.')
)
}
Rewrite the code chunks below as pipes:
distr1 <- rnorm(16)
M <- matrix(distr1, ncol = 4)
plot(M)
M <- M + sample(M)
dim_summary(nrows = nrow(M), ncols = ncol(M))
distr2 <- rnorm(16)
N <- matrix(distr2, ncol = 4)
colnames(N) <- (letters[1:4])
summary(N)
P <- M %x% t(N)
heatmap(P)
colnames(P) <- letters[1:dim(P)[2]]
cor(P[ ,'a'], P[ ,'i'])
A class of functions, called the replacement functions are of the form function(arguments)<-value
and rownames(x) <- c('a', 'b', 'c')
is a good example of a replacement function. When writing pipes, we have bear in mind that whole function<-
is the name of the replacement function and thus we have to use it as such in the pipe.
Note. Not always one can put everything into one single pipe, sometimes the results of running two or more pipes have to be used in the final pipe.
dim_summary <- function(nrows, ncols) {
print(paste0('Matrix M has: ', nrows, ' rows and ', ncols, ' columns.'))
}
M <- rnorm(16) %>%
matrix(ncol = 4) %T>%
plot() %>%
`+`(., sample(.)) %T>%
{dim_summary(nrow(.), ncol(.))}
N <- rnorm(16) %>%
matrix(ncol = 4) %>%
`colnames<-`(letters[1:4]) %>%
summary()
P <- M %>%
`%x%`(., N) %T>%
heatmap() %>%
`colnames<-`(letters[1:dim(.)[2]]) %>%
as_data_frame() %$%
cor(a, i)
mtcars
dataset to a tibble vehicles
.cyl
) variable using:
[[index]]
accessor,[[string]]
accessor,$
accessor.tibble.print_max
and dplyr.print_min
? Is there any? Test it.vehicles
back to a data.frame
called automobiles
.# 1
vehicles <- mtcars %>% as_tibble()
# 2
vehicles[['cyl']]
vehicles[[2]]
vehicles$cyl
# 3
vehicles %T>%
{print(.[['cyl']])} %T>%
{print(.[[2]])} %>%
.$cyl
# 4
vehicles
# 5
vehicles %>% head(n = 30)
# 6
options(tibble.print_min = 15, tibble.print_max = 30)
# 7
# In theory there should be no difference. dplyr imports tibble from the tibble package
# and dplyr.width, dplyr.print_min and dplyr.print_min are passed down to the tibble.
# But test both behaviours. First with only the tibble package loaded, later with dplyr # loaded.
# 8
automobiles <- as.data.frame(vehicles)
Create the following tibble using tribble()
:
id | event | date |
---|---|---|
1 | success | 24-04-2017 |
2 | failed | 25-04-2017 |
3 | failed | 25-04-2017 |
4 | success | 27-04-2017 |
tab <- tribble(
~id, ~event, ~date,
1, 'success', '24-04-2017',
2, 'failed', '25-04-2017',
3, 'failed', '25-04-2017',
4, 'success', '27-04-2017'
)
Compare the performance of as.data.frame()
, as_data_frame()
and as_tibble()
on a
100 x 30 matrix filled with random integers. Use package microbenchmark
. Fill in your result here in the Tidyverse Lab sheet, Tibbles – performance.
tst <- replicate(30, sample(100), simplify = TRUE)
colnames(tst) = paste0(rep('col', times = dim(tst)[2]), 1:dim(tst)[2])
microbenchmark::microbenchmark(
as.data.frame(tst),
as_data_frame(tst),
as_tibble(tst)
)
Do you think tibbles are lazy? Try to create a tibble that tests whether lazy evaluation applies to tibbles too.
tibble(x = sample(1:10, size = 10, replace = T), y = log10(x))
Parse the following vectors:
vec1 <- c(1, 7.2, 3.84, -5.23)
– parse it as double (any problems? why?).c(1, 7.2, 3.84, -5.23)
as integer. What happens?vec2 <- c('2', '3,45', '?', '-7,28')
vec3 <- c('2', '3,45', '?', '-7.28')
vec4 <- c('barrel: 432.7$', 'liter: 15.42PLN', 'gallon costs approx 32.1SEK', 'sunny, wind gusts up till 55m/s')
vec4
as number? Do it if you can.vec5 <- "25 Dec 2015"
as date (hint: ?parse_date()
).10_Jul_1410
as date.vec1 <- c(1, 7.2, 3.84, -5.23)
vec2 <- c('2', '3,45', '?', '-7,28')
vec3 <- c('2', '3,45', '?', '-7.28')
vec4 <- c('barrel: 432.7$', 'liter: 15.42PLN', 'gallon costs approx 32.1SEK', 'sunny, wind gusts up till 55m/s')
vec5 <- "25 Dec 2015"
parse_double(vec1)
parse_integer(vec1)
parse_integer(as.integer(vec1)) # Is it the best way? Hint: rounding.
parse_double(vec2, na = '?', locale = locale(decimal_mark = ','))
parse_number(vec2, na = '?', locale = locale(decimal_mark = ','))
guess_parser(vec4)
parse_guess(vec4)
# Yes, you can:
parse_number(vec4)
parse_date(vec5, format="%d %b %Y")
parse_date("10_Jul_1410", format="%d%.%b%.%Y")
The nycflights13
package contains information about all flights that departed from NYC (i.e., EWR, JFK and LGA) in 2013: 336,776 flights with 16 variables. To help understand what causes delays, it also includes a number of other useful datasets: weather, planes, airports, airlines. We will use it to train working with tibbles and dplyr
.
nycflights13
package (install if necessary),flights
tibble.carrier
and arr_time
,carrier
, tailnum
and origin
,day
through carrier
,arr
ival (hint: ?tidyselect
),v <- c("arr_time", "sched_arr_time", "arr_delay")
,dest
to destination
using:
select()
andrename()
install.packages('nycflights13')
library('nycflights13')
?nycflights13
flights
flights %>% select(-carrier, -arr_time)
flights %>% select(carrier, tailnum, origin)
flights %>% select(-(day:carrier))
flights %>% select(contains('arr_')) # or
v <- c("arr_time", "sched_arr_time", "arr_delay")
flights %>% select(v) # or
flights %>% select(one_of(v))
flights %>% select(destination = dest)
flights %>% rename(destination = dest)
# select keeps only the renamed column while rename returns the whole dataset
# with the column renamed
?slice
),?sample_n()
) 3 random flights per day in March,unique()
routes and sort them by destination,distinct()
routes and sort them by destination,unique()
more efficient than distinct()
?flights %>% filter(arr_delay < 0)
flights %>% filter(dep_delay >= 10, dep_delay <= 33) # or
flights %>% filter(between(dep_delay, 10, 33))
flights %>% filter(is.na(arr_time))
flights %>% slice(1234:1258)
flights %>% filter(month == 3) %>%
group_by(day) %>%
sample_n(3)
flights %>%
filter(month == 1) %>%
group_by(carrier) %>%
top_n(5, dep_delay)
nycflights13::flights %>%
select(origin, dest) %>%
unique() %>% arrange(dest)
flights %>% mutate(route = paste(origin, dest, sep="-")) %>% select(route) %>% unique()
nycflights13::flights %>%
select(origin, dest) %>%
distinct() %>% arrange(dest)
# or
flights %>%
mutate(route = paste(origin, dest, sep="-")) %>%
distinct(route)
microbenchmark(nycflights13::flights %>%
select(origin, dest) %>%
unique() %>% arrange(dest))
microbenchmark(nycflights13::flights %>%
select(origin, dest) %>%
distinct() %>% arrange(dest))
Distinct is faster.
air_time
is the amount of time in minutes spent in the air. Add a new column air_spd
that will contain aircraft’s airspeed in mph,
as above, but keep only the new air_spd
variable,
use rownames_to_column()
on mtcars
to add car model as an extra column,
flights %>% mutate(air_spd = distance/(air_time / 60))
flights %>% transmute(air_spd = distance/(air_time / 60))
mtcars %>% rownames_to_column('model')
group_by()
, summarise()
and n()
to see how many planes were delayed (departure) every month,flights %>%
filter(dep_delay > 0) %>%
group_by(month) %>%
summarise(num_dep_delayed = n())
tally()
and count()
,flights %>%
filter(dep_delay > 0) %>%
group_by(month) %>%
tally()
flights %>%
filter(dep_delay > 0) %>%
count(month)
dep_delay
per month?flights %>%
group_by(month) %>%
summarise(mean_dep_delay = mean(dep_delay, na.rm = T))
flights %>%
filter(arr_delay > 0) %>%
group_by(origin) %>%
summarise(cnt = n()) %>%
arrange(desc(cnt))
tally()
flights %>%
filter(arr_delay > 0) %>%
group_by(origin) %>%
tally(sort = T)
summarise()
to sum total dep_delay
per month in hours, flights %>%
group_by(month) %>%
summarize(tot_dep_delay = sum(dep_delay/60, na.rm = T))
wt
parameter of count()
(works with tally()
too) to achieve the same, flights %>%
group_by(month) %>%
count(wt = dep_delay/60)
group_size()
on carrier
what does it return?flights %>%
group_by(carrier) %>%
group_size()
n_groups()
to check the number of unique origin-carrier pairs,flights %>%
group_by(carrier) %>%
n_groups()
Note on ungroup
Depending on the version of dplyr
you may or may need to use the ungroup()
if you want to group your data on some other variables:
flights %>%
group_by(origin) %>%
mutate(mean_delay_orig = (mean(dep_delay, na.rm = T) + mean(arr_delay, na.rm = T)) / 2) %>%
ungroup() %>%
group_by(carrier) %>%
mutate(mean_delay_carr = (mean(dep_delay, na.rm = T) + mean(arr_delay, na.rm = T)) / 2) %>%
select(origin, carrier, mean_delay_orig, mean_delay_carr)
may or may need ungroup depending on your dplyr
version. In the newer versions, summarise
and mutate
drop one aggregation level.
Given the following tibbles:
set1
:
knitr::kable(set1)
id | color |
---|---|
id1 | grey |
id1 | red |
id2 | green |
id3 | blue |
and set2
:
knitr::kable(set2)
id | size |
---|---|
id2 | XL |
id3 | M |
id4 | M |
Perform joins on id
that result in the grey area from the Venn diagrams below. We have not talked about all possible joins, so read the docs if you do not know which join to use.
left_join(set1, set2)
right_join(set1, set2)
inner_join(set1, set2) # or
semi_join(set1, set2) # semi_join removes duplicates in x
# and also returns only columns from x.
full_join(set1, set2) # or
anti_join(set1, set2) # or
Now time to do some data tidying. First install a package with some untidy data:
devtools::install_github("rstudio/EDAWR")
library(EDAWR)
cases
so that years are not in separate columns, but in the column called year
containing a value per each year.tidy_cases <- cases %>% gather(key = 'year', value = 'count', 2:4)
pollution
dataset. Tidy it so that there separate columns for large
and small
pollution values.tidy_pollution <- pollution %>% spread(size, amount)
storms
dataset contains the date
column. Make it into 3 columns: year
, month
and day
. Store the result as tidy_storms
tidy_storms <- storms %>%
separate(col = date,
into = c('year', 'month', 'day'),
sep = '-')
year
, month
and day
in tidy_storms
into a date
column again but in the “DD/MM/YYYY” format.tidy_storms %>% unite(col = "date", 4:6, sep = "/")
You will be given a fastq
file coming from MinION sequencer (Oxford Nanopore). This file contains test reads from the chicken genome. The flow-cell used here has 512 channels, each channel consists of 4 pores and only one pore is active at a time. Once your sequence gets stuck for some reason, the device will attempt to remove it from the pore by playing with reversing polarity on that pore. If this was successful, the pore will be re-used. Your task will be to visualise reading events from the meta-data in the fastq
dataset and to see how each and every channel behaved. Also, you will plot the distribution of reading times.
Datasets are available here.
First, we will need to load the necessary libraries. I will give you a hint – you need the following libraries:
here
– not necessary, but it is an elegant way of reading the data locally from the project folder,tidyverse
– well, quite obvious why,ShortRead
from Bioconductor – to deal with short reads in fastq
,lubridate
– to figure out reading times.library(here)
library(tidyverse)
library(ShortRead)
library(lubridate)
Now, let’s read the fastq data. Check ShortRead
documentation to see how to read our fastq
file. Also, try to use package here
. If you write: data <- here::here('data/my.fastq')
, the my.fastq
file will be read from the data
folder which is a subfolder of your project root, i.e. the folder where your r script is. It is a good practise and also prevents Jenny Bryan from coming to your office and setting your computer on fire.
Now think a bit, to plot reading events, do we need al the data in the file or only some specific part? You may want to see some few first lines of the fastq
to learn about the data structure.
raw_data <- here::here('docs/tidyverse_Marcin/lab/lab_assets/FUL1_fastqs_GRID2.fastq')
f <- FastqFile(raw_data)
rfq <- readFastq(f)
headers <- rfq@id
close(f)
In this step, we are extracting data from fastq headers of each and every read in the fastq file. Not super efficient and perhaps the slowest step of the whole analyses. Can you do it better than our example solution?
Desired output: a table (tibble/data.frame) with reads as rows and meta-data as columns.
Hint: use strsplit()
to explode string data into columns and str_remove_all()
to get rid of unwanted characters. Since we did not talk much about regular expressions and the stringr
package:
strsplit
. Here we split on comma.text <- "This text is long, or not?"
strsplit(text, ',')
str_remove_all(text, ",.*")
data <- dplyr::as_tibble(matrix(NA_character_, ncol = 6, nrow = length(headers)))
colnames(data) <- c('id', 'run_id', 'sample_id', 'read', 'channel', 'start_time')
for (i in 1:length(headers)) {
tmp <- toString(headers[[i]])
tmp_flat <- unlist(strsplit(tmp, ' ')) %>%
str_remove_all(".*=")
data[i,] <- tmp_flat
}
Now, the fun part begins:
start_dttm
that represents start time for a given read as proper datetime
object (read lubridate
docs) andchan
that is the proper numeric representation of the channel,Hint: read about lead()
data2 <- data %>%
mutate(start_dttm = as_datetime(start_time)) %>%
mutate(chan=as.numeric(channel)) %>%
group_by(chan) %>%
arrange(start_dttm) %>%
mutate(time_diff = lead(start_dttm) - start_dttm) %>%
arrange(chan)
Important! If you are not familiar with ggplot
either:
ggplot2
solution.Here, we want to see what was happening in each channel over time. Plot the data you have just prepared so that:
Can you visualise this in a better way? Different geometry?
ggplot(data2, mapping = aes(x = start_dttm,
y = as.factor(chan),
col = as.numeric(time_diff)
)
) +
geom_point(size = .5) +
theme_bw()
Now, we want to see how time-to-next-read is distributed. Since it has a veeeeery long right tail, I am cutting off everything above 200 seconds (just by eyeballing).
histogram
),# Show time-to-next read distribution
# thr <- mean(data2$time_diff, na.rm = T) + 3 * sd(data2$time_diff, na.rm = T)
tmp <- data2 %>%
ungroup() %>%
filter(time_diff < 200) %>%
select(time_diff)
hist(as.numeric(tmp$time_diff), breaks = 1000, las=1)
In this challenge, your task will be to analyse species composition of some samples. The samples, were actual products containing parts of plants. DNA has been isolated form the samples and an amplicon metabarcoding was performed using two sets of primers: for the ITS1 and the ITS2 region. Each sample had 3 technical replicates. Your task will be to transform BLAST output to a tidy form suitable for further analyses or visualisation.
We will obviously need tidyverse
, we will also do some string manipulations with stringr
also here
package is good to have.
library(tidyverse)
library(stringr)
library(here)
Here, we will define our input variables. We need:
file
that contains the path to the dataset,sample_name
is a string, the name of the sample you want to analyse,threshold
is an intiger saying what is a the minimal number of replicates that have to contain an OTU in order to call it a true positive (TP),strict
a logical. If set to TRUE, only the OTUs deemed TP will be shown.Below we set some examle values:
# Change the path to your project path, where your data is
file <- here::here("docs/tidyverse_Marcin/lab/lab_assets/blast_result.csv")
sample_name <- 'SAMPLE12'
threshold <- 1
strict <- F
Now, you should read the data:
species_orig <- read_csv(file, col_names = c("sample","its","replicate","OTU","size","hit","perc_ident","score","family","species")) %>% select(-score)
sample | its | replicate | OTU | size | hit | perc_ident | family | species |
---|---|---|---|---|---|---|---|---|
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | KX522674 | 97.08 | Anacardiaceae | Spondias_tuberosa |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | AJ783644 | 97.53 | Betulaceae | Betula_populifolia |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | AJ783641 | 97.53 | Betulaceae | Betula_alnoides |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | MF171078 | 98.71 | Pentaphylacaceae | Pentaphylax_euryoides |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214931 | 98.33 | Musaceae | Musa_ornata |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214926 | 96.23 | Musaceae | Musa_sp. |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214930 | 95.73 | Musaceae | Musa_basjoo |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214927 | 97.67 | Musaceae | Musella_lasiocarpa |
SAMPLE10 | ITS1 | R2 | OTU1 | 357 | AM233397 | 95.62 | Apocynaceae | Secamone_filiformis |
SAMPLE10 | ITS1 | R2 | OTU1 | 357 | FR832858 | 96.69 | Rubiaceae | Tricalysia_perrieri |
As you see, the following information are included in the data:
sample
is simply the name of the sample,its
is either ITS1 or ITS2 and tells which set of PCR primers has been used,replicate
contains information on which replicate the sequences come from,OTU
is a unique identifier of the so-called Operational Taxonomic Unit, an OTU often corresponds to one species but not always. Sometimes 2 OTUs represent the same species, sometimes 1 OTU consists of more than one species,size
is the number of reads that support that particular OTU,hit
is the BLAST hit identifier. The 4 top BLAST hits are reported per OTU,perc_identity
is the percentage identity of the sequence to the BLAST hit,family
is the identified plant family,species
is the identified plant species.Create a new dataset species
that contains an extra column n_replicates
. The column contains number of replicates this particular species is present in. Do it per sample and its.
species <- species_orig %>%
group_by(sample, its, species) %>%
mutate(n_replicates = n_distinct(replicate)) %>%
ungroup()
sample | its | replicate | OTU | size | hit | perc_ident | family | species | n_replicates |
---|---|---|---|---|---|---|---|---|---|
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | KX522674 | 97.08 | Anacardiaceae | Spondias_tuberosa | 2 |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | AJ783644 | 97.53 | Betulaceae | Betula_populifolia | 1 |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | AJ783641 | 97.53 | Betulaceae | Betula_alnoides | 1 |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | MF171078 | 98.71 | Pentaphylacaceae | Pentaphylax_euryoides | 1 |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214931 | 98.33 | Musaceae | Musa_ornata | 1 |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214926 | 96.23 | Musaceae | Musa_sp. | 1 |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214930 | 95.73 | Musaceae | Musa_basjoo | 1 |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214927 | 97.67 | Musaceae | Musella_lasiocarpa | 1 |
SAMPLE10 | ITS1 | R2 | OTU1 | 357 | AM233397 | 95.62 | Apocynaceae | Secamone_filiformis | 1 |
SAMPLE10 | ITS1 | R2 | OTU1 | 357 | FR832858 | 96.69 | Rubiaceae | Tricalysia_perrieri | 1 |
Now, your task is to filter out all but your sample_name
samples from the dataset.
Call the resulting dataset my_sample
.
my_sample <- species %>%
filter(sample == sample_name)
What happens if a set of primers failed to amplify or if one replicate was lost?
Use complete()
to make sure you have NA
values in such cases.
my_sample <- my_sample %>%
complete(its = c("ITS1", "ITS2"),
replicate = c("R1","R2","R3")
)
Look, the first sample in the table is SAMPLE10
. Why not SAMPLE1
?
Thats a sorting issue: if sorted as character, 10 will come before 1. WE have to fix this by adding trailing zero to the values in OTU
. We do not expect more than 99 OTUs in a sample, so it is ok with only one trailing 0 (otherwise the 100-th sample will spoil our sorting and come out like: SAMPLE100, SAMPLE01, SAMPLE10). We will need to use regular expression:
OTU
column that follow pattern “OTUdigit” we need to change to “OTU0digit”. Regular expression that matches this is OTU([0-9]$)
and it should be replaced by: OTU0\\1
. Ask your TAs to explain this if you do not know much about regular expressions and pattern matching.my_sample <- my_sample %>% mutate(OTU = str_replace(OTU,
pattern = "OTU([0-9]$)",
replacement = "OTU0\\1"
)
)
its | replicate | sample | OTU | size | hit | perc_ident | family | species | n_replicates |
---|---|---|---|---|---|---|---|---|---|
ITS1 | R1 | SAMPLE12 | OTU01 | 4169 | LC089998 | 99.44 | Brassicaceae | Capsella_bursa-pastoris | 3 |
ITS1 | R1 | SAMPLE12 | OTU02 | 2686 | KP794392 | 98.19 | Euphorbiaceae | Tragia_cf. | 3 |
ITS1 | R1 | SAMPLE12 | OTU02 | 2686 | KP794390 | 98.19 | Euphorbiaceae | Tragia_chlorocaulon | 3 |
ITS1 | R1 | SAMPLE12 | OTU02 | 2686 | KP794378 | 98.19 | Euphorbiaceae | Tragia_bahiensis | 3 |
ITS1 | R1 | SAMPLE12 | OTU02 | 2686 | KP794320 | 98.19 | Euphorbiaceae | Bia_alienata | 3 |
ITS1 | R1 | SAMPLE12 | OTU04 | 1353 | GQ396671 | 97.44 | Euphorbiaceae | Triadica_sebifera | 3 |
ITS1 | R1 | SAMPLE12 | OTU04 | 1353 | KX522674 | 96.25 | Anacardiaceae | Spondias_tuberosa | 3 |
ITS1 | R1 | SAMPLE12 | OTU04 | 1353 | KY860928 | 96.23 | Adoxaceae | Viburnum_prunifolium | 3 |
ITS1 | R1 | SAMPLE12 | OTU04 | 1353 | KX757310 | 96.23 | Caryophyllaceae | Silene_caesia | 3 |
ITS1 | R1 | SAMPLE12 | OTU06 | 577 | KY860926 | 100.00 | Asteraceae | Taraxacum_officinale | 3 |
Sometimes, an OTU generates two or more top BLAST hits that come from the same species. We have decided to sum reads in such cases. Do it!
my_sample <- my_sample %>%
ungroup() %>%
group_by(sample, its, replicate, OTU, species, n_replicates) %>%
summarise(n_reads = sum(size)) %>%
ungroup() %>%
group_by(its, species, OTU)
sample | its | replicate | OTU | species | n_replicates | n_reads |
---|---|---|---|---|---|---|
NA | ITS2 | R1 | NA | NA | NA | NA |
NA | ITS2 | R2 | NA | NA | NA | NA |
NA | ITS2 | R3 | NA | NA | NA | NA |
SAMPLE12 | ITS1 | R1 | OTU01 | Capsella_bursa-pastoris | 3 | 4169 |
SAMPLE12 | ITS1 | R1 | OTU02 | Bia_alienata | 3 | 2686 |
SAMPLE12 | ITS1 | R1 | OTU02 | Tragia_bahiensis | 3 | 2686 |
SAMPLE12 | ITS1 | R1 | OTU02 | Tragia_cf. | 3 | 2686 |
SAMPLE12 | ITS1 | R1 | OTU02 | Tragia_chlorocaulon | 3 | 2686 |
SAMPLE12 | ITS1 | R1 | OTU04 | Silene_caesia | 3 | 1353 |
SAMPLE12 | ITS1 | R1 | OTU04 | Spondias_tuberosa | 3 | 1353 |
Now, we want to see how many identifications an OTU got. Implement this. Store the result in a new tibble diversity
.
diversity <- my_sample %>%
ungroup() %>%
group_by(its, replicate, OTU) %>%
summarise(n_species = n())
its | replicate | OTU | n_species |
---|---|---|---|
ITS1 | R1 | OTU01 | 1 |
ITS1 | R1 | OTU02 | 4 |
ITS1 | R1 | OTU04 | 4 |
ITS1 | R1 | OTU06 | 4 |
ITS1 | R1 | OTU07 | 3 |
ITS1 | R1 | OTU08 | 2 |
ITS1 | R1 | OTU10 | 1 |
ITS1 | R1 | OTU11 | 3 |
ITS1 | R1 | OTU13 | 3 |
ITS1 | R1 | OTU15 | 2 |
Add the diversity
data to my_sample
using appropriate join
function.
Also, remove the column with sample names since we are dealing with only one sample.
my_sample <- my_sample %>%
left_join(diversity) %>%
select(-sample)
its | replicate | OTU | species | n_replicates | n_reads | n_species |
---|---|---|---|---|---|---|
ITS2 | R1 | NA | NA | NA | NA | 1 |
ITS2 | R2 | NA | NA | NA | NA | 1 |
ITS2 | R3 | NA | NA | NA | NA | 1 |
ITS1 | R1 | OTU01 | Capsella_bursa-pastoris | 3 | 4169 | 1 |
ITS1 | R1 | OTU02 | Bia_alienata | 3 | 2686 | 4 |
ITS1 | R1 | OTU02 | Tragia_bahiensis | 3 | 2686 | 4 |
ITS1 | R1 | OTU02 | Tragia_cf. | 3 | 2686 | 4 |
ITS1 | R1 | OTU02 | Tragia_chlorocaulon | 3 | 2686 | 4 |
ITS1 | R1 | OTU04 | Silene_caesia | 3 | 1353 | 4 |
ITS1 | R1 | OTU04 | Spondias_tuberosa | 3 | 1353 | 4 |
Can you think of a good way of visualising the data? Think of:
ITS
, replicate
, n_reads
, n_replicates
, OTU
, threshold
, n_species
and specie
. Well, pretty much all of them :-)Our example solution involves some ggplot2
magics. But would base-R be good enough for this type of plot?
Use the FAA report and tidyverse
to learn more about aircraft incidents with wildlife. Use your imagination and NYC data science blog for inspiration!
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] lubridate_1.7.4 ShortRead_1.40.0
## [3] GenomicAlignments_1.18.0 SummarizedExperiment_1.12.0
## [5] DelayedArray_0.8.0 matrixStats_0.54.0
## [7] Biobase_2.42.0 Rsamtools_1.34.0
## [9] GenomicRanges_1.34.0 GenomeInfoDb_1.18.0
## [11] Biostrings_2.50.1 XVector_0.22.0
## [13] IRanges_2.16.0 S4Vectors_0.20.0
## [15] BiocParallel_1.16.0 BiocGenerics_0.28.0
## [17] here_0.1 eulerr_5.1.0
## [19] bsplus_0.1.1 forcats_0.4.0
## [21] stringr_1.4.0 dplyr_0.8.1
## [23] purrr_0.3.2 readr_1.3.1
## [25] tidyr_0.8.3 tibble_2.1.1
## [27] tidyverse_1.2.1 ggplot2_3.1.1
## [29] captioner_2.2.3 bookdown_0.11
## [31] knitr_1.23
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.0 jsonlite_1.6 modelr_0.1.4
## [4] assertthat_0.2.1 highr_0.8 latticeExtra_0.6-28
## [7] GenomeInfoDbData_1.2.0 cellranger_1.1.0 yaml_2.2.0
## [10] pillar_1.4.0 backports_1.1.4 lattice_0.20-38
## [13] glue_1.3.1 digest_0.6.19 RColorBrewer_1.1-2
## [16] polyclip_1.10-0 rvest_0.3.4 colorspace_1.4-1
## [19] htmltools_0.3.6 Matrix_1.2-17 plyr_1.8.4
## [22] pkgconfig_2.0.2 broom_0.5.2 haven_2.1.0
## [25] zlibbioc_1.28.0 scales_1.0.0 generics_0.0.2
## [28] withr_2.1.2 lazyeval_0.2.2 cli_1.1.0
## [31] magrittr_1.5 crayon_1.3.4 readxl_1.3.1
## [34] evaluate_0.14 nlme_3.1-140 hwriter_1.3.2
## [37] xml2_1.2.0 tools_3.5.0 hms_0.4.2
## [40] munsell_0.5.0 compiler_3.5.0 rlang_0.3.4
## [43] grid_3.5.0 RCurl_1.95-4.12 rstudioapi_0.10
## [46] labeling_0.3 bitops_1.0-6 rmarkdown_1.13.1
## [49] gtable_0.3.0 R6_2.4.0 rprojroot_1.3-2
## [52] polylabelr_0.1.0 stringi_1.4.3 Rcpp_1.0.1
## [55] tidyselect_0.2.5 xfun_0.7
Page built on: 12-Jun-2019 at 12:43:16.
2018 | SciLifeLab > NBIS > RaukR