Welcome to the hands-on workshop “Tidy Work in Tidyverse”. Most of the functions necessary to complete the tutorials and challenges were covered in the lecture. However, sometimes the tasks require that you check the docs or search online. Our solutions are not the only possible ones! Let us know if you can do better or solve things in a different way!
If stuck, look at the hints, then do some Google searches and, if still stuck, turn to a TA.
It is a lot of material, we know! Do not feel bad if you do not solve all the tasks. If you completed Challenge 3, you have used all the most important features of tidyverse. Good luck!
1 General exercises
Datasets are available here.
1.1 Pipes
1.1.1 Chunk 1
Rewrite the following code chunks as pipes (load package magrittr because tidyverse supports only the %>% pipe!):

my_cars <- mtcars[, c(1:4, 10)]
my_cars <- my_cars[my_cars$disp > mean(my_cars$disp), ]
my_cars <- colMeans(my_cars)
my_cars

This is our solution:
Code
mtcars %>%
  select(c(1:4, 10)) %>%
  filter(disp > mean(disp)) %>%
  colMeans() -> my_cars
What is wrong with our solution?
- It is better to have the result assigned on the left hand side: result <- expression. In this case the expression is the whole pipe.
- Our ‘expression -> result’ is correct but can easily be missed when reading the code.
1.1.2 Chunk 2
The summary(x) function is a bit special: when you type summary(x) in the console, print is called implicitly. A pipe does not make such an implicit call, so you will have to invoke print explicitly. But since %T>% branches off for one call only, you will have to make printing of the summary one single composed call using {}. Try to wrap your mind around this. If in doubt, turn to a TA.
summary(cars)
colSums(cars)
Code
cars %T>% {print(summary(.))} %>% colSums()
1.1.3 Chunk 3
Rewrite the following correlations using pipes.
cor(mtcars$gear, mtcars$mpg)
Code
mtcars %$% cor(gear, mpg)
cor(mtcars)
Code
mtcars %>% cor()
1.1.4 Chunk 4
Given is the dim_summary(nrows, ncols) function which takes nrows and ncols as arguments and prints this info:

dim_summary <- function(nrows, ncols) {
  print(
    paste0('Matrix M has: ', nrows, ' rows and ', ncols, ' columns.')
  )
}
Rewrite each of the code chunks below using pipes:

distr1 <- rnorm(16)
M <- matrix(distr1, ncol = 4)
plot(M)
M <- M + sample(M)
dim_summary(nrows = nrow(M), ncols = ncol(M))

distr2 <- rnorm(16)
N <- matrix(distr2, ncol = 4)
colnames(N) <- letters[1:4]
summary(N)
N <- N + 0

P <- M %x% t(N)
heatmap(P)
colnames(P) <- letters[1:dim(P)[2]]
cor(P[, 'a'], P[, 'i'])
Beware of a class of functions called replacement functions. These beasts are of the form function(arguments) <- value, and rownames(x) <- c('a', 'b', 'c') is a good example of such a beast. When writing pipes, we have to bear in mind that the whole `function<-` is the name of the replacement function, and thus we have to use it as such in the pipe, quoted using backticks. Yes, we know… but you won't see this too often.
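For instance, a minimal sketch of calling a replacement function inside a pipe (assuming magrittr is loaded):

# `rownames<-` is the full name of the replacement function behind
# rownames(x) <- value; backticks let us call it like an ordinary function
matrix(1:9, ncol = 3) %>%
  `rownames<-`(c('a', 'b', 'c'))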
Sometimes, it may not be possible to put everything into one single pipe and the results of running two or more pipes have to be used in the final pipe.
Code
dim_summary <- function(nrows, ncols) {
  print(paste0('Matrix M has: ', nrows, ' rows and ', ncols, ' columns.'))
}

M <- rnorm(16) %>%
  matrix(ncol = 4) %T>%
  plot() %>%
  `+`(., sample(.)) %T>%
  {dim_summary(nrow(.), ncol(.))}

N <- rnorm(16) %>%
  matrix(ncol = 4) %>%
  `colnames<-`(letters[1:4]) %T>%
  summary() %>%
  `+`(., 0)

P <- M %>%
  `%x%`(., t(N)) %T>%
  heatmap() %>%
  `colnames<-`(letters[1:dim(.)[2]]) %>%
  as_data_frame() %$%
  cor(a, i)
1.2 Tibbles
1.2.1 Task 1
- Convert the mtcars dataset to a tibble vehicles.
- Select the number of cylinders (cyl) variable using:
  - the [[index]] accessor,
  - the [[string]] accessor,
  - the $ accessor.
- Do the same selection as above, but using pipe and placeholders (use all three ways of accessing a variable).
- Print the tibble.
- Print the 30 first rows of the tibble.
- Change the default behaviour of printing a tibble so that at least 15 and at most 30 rows are printed.
- What is the difference between the tibble.print_max and dplyr.print_min options? Is there any? Test it.
- Convert vehicles back to a data.frame called automobiles.
Code
# 1
vehicles <- mtcars %>% as_tibble()

# 2
vehicles[['cyl']]
vehicles[[2]]
vehicles$cyl

# 3
vehicles %T>%
  {print(.[['cyl']])} %T>%
  {print(.[[2]])} %>%
  .$cyl

# 4
vehicles

# 5
vehicles %>% head(n = 30)

# 6
options(tibble.print_min = 15, tibble.print_max = 30)

# 7
# In theory there should be no difference. dplyr imports tibble from the tibble package
# and dplyr.width, dplyr.print_min and dplyr.print_max are passed down to tibble.
# But test both behaviours: first with only the tibble package loaded, later with dplyr loaded.

# 8
automobiles <- as.data.frame(vehicles)
1.2.2 Task 2
Create the following tibble using tribble():
id | event | date |
---|---|---|
1 | success | 24-04-2017 |
2 | failed | 25-04-2017 |
3 | failed | 25-04-2017 |
4 | success | 27-04-2017 |
Code
tab <- tribble(
  ~id, ~event, ~date,
  1, 'success', '24-04-2017',
  2, 'failed', '25-04-2017',
  3, 'failed', '25-04-2017',
  4, 'success', '27-04-2017'
)
1.2.3 Task 3
Compare the performance of as.data.frame(), as_data_frame() and as_tibble() on a 100 x 30 matrix filled with some random integers. Use package microbenchmark. Fill in your result here in the Tidyverse Lab sheet, Tibbles – performance.
Code
tst <- replicate(30, sample(100), simplify = TRUE)
colnames(tst) <- paste0(rep('col', times = dim(tst)[2]), 1:dim(tst)[2])
microbenchmark::microbenchmark(
  as.data.frame(tst),
  as_data_frame(tst),
  as_tibble(tst)
)
1.2.4 Task 4
Do you think tibbles are lazy? Try to create a tibble that tests whether lazy evaluation applies to tibbles too.
Code
tibble(x = sample(1:10, size = 10, replace = TRUE), y = log10(x))
# This works because tibble() evaluates its arguments sequentially, so a
# later column (y) can refer to an earlier one (x).
1.3 Parsing
Parse the following vectors using parse_ functions:
- vec1 <- c(1, 7.2, 3.84, -5.23) – parse it as double (any problems? why?).
- Now, parse the same vector c(1, 7.2, 3.84, -5.23) as integer. What happens?
- Can you still parse it as integer somehow?
- Parse as double: vec2 <- c('2', '3,45', '?', '-7,28').
- Parse correctly: vec3 <- c('2', '3,45', '?', '-7.28').
- Parse the following, guessing the parser: vec4 <- c('barrel: 432.7$', 'liter: 15.42PLN', 'gallon costs approx 32.1SEK', 'sunny, wind gusts up till 55m/s').
- Can you parse vec4 as number? Do it if you can.
- Parse vec5 <- "25 Dec 2015" as date (hint: ?parse_date()).
- Parse 10_Jul_1410 as date.
Code
vec1 <- c(1, 7.2, 3.84, -5.23)
vec2 <- c('2', '3,45', '?', '-7,28')
vec3 <- c('2', '3,45', '?', '-7.28')
vec4 <- c('barrel: 432.7$', 'liter: 15.42PLN', 'gallon costs approx 32.1SEK', 'sunny, wind gusts up till 55m/s')
vec5 <- "25 Dec 2015"

parse_double(vec1)
parse_integer(vec1)
parse_integer(as.character(as.integer(vec1))) # Is it the best way? Hint: rounding.
parse_double(vec2, na = '?', locale = locale(decimal_mark = ','))
parse_number(vec2, na = '?', locale = locale(decimal_mark = ','))
guess_parser(vec4)
parse_guess(vec4)
# Yes, you can:
parse_number(vec4)
parse_date(vec5, format = "%d %b %Y")
parse_date("10_Jul_1410", format = "%d%.%b%.%Y")
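We did not show a parser for vec3: it mixes decimal marks (',' and '.'), so no single locale handles it directly. One possible workaround (a sketch using stringr, which tidyverse loads) is to normalise the decimal mark first:

# replace the comma decimal mark with a dot, then parse; '?' becomes NA
parse_double(str_replace(vec3, ',', '.'), na = '?')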
2 NYC flights Challenge
The nycflights13 package contains information about all flights that departed from NYC (i.e., EWR, JFK and LGA) in 2013: 336,776 flights with 16 variables. To help understand what causes delays, it also includes a number of other useful datasets: weather, planes, airports, airlines. We will use it to train working with tibbles and dplyr.
2.1 Task 1: Selecting columns
- Load the nycflights13 package (install if necessary).
- Read about the data in the package docs.
- Inspect the flights tibble.
- Select all columns but carrier and arr_time.
- Select carrier, tailnum and origin.
- Hide columns from day through carrier.
- Select all columns that have to do with arrival (hint: ?tidyselect).
- Select columns based on a vector v <- c("arr_time", "sched_arr_time", "arr_delay").
- Rename column dest to destination using select() and rename(). What is the difference between the two approaches?
Code
install.packages('nycflights13')
library('nycflights13')
?nycflights13
flights

flights %>% select(-carrier, -arr_time)
flights %>% select(carrier, tailnum, origin)
flights %>% select(-(day:carrier))
flights %>% select(contains('arr_'))
v <- c("arr_time", "sched_arr_time", "arr_delay")
flights %>% select(v) # ambiguous, or better:
flights %>% select(all_of(v))
flights %>% select(destination = dest)
flights %>% rename(destination = dest)
# select() keeps only the renamed column while rename() returns the whole
# dataset with the column renamed.
2.2 Task 2: Filtering rows
- Filter only the flights that arrived ahead of schedule.
Code
flights %>% filter(arr_delay < 0)
- Filter the flights that had departure delay between 10 and 33.
Code
flights %>% filter(dep_delay >= 10, dep_delay <= 33)
# or
flights %>% filter(between(dep_delay, 10, 33))
- Fish out all flights with unknown arrival time.
Code
flights %>% filter(is.na(arr_time))
- Retrieve rows 1234:1258 (hint: ?slice).
Code
flights %>% slice(1234:1258)
- Sample (?sample_n()) 3 random flights per day in March.
Code
nycflights13::flights %>%
  filter(month == 3) %>%
  group_by(day) %>%
  slice_sample(n = 3) # sample_n() is superseded; slice_sample() is its modern equivalent
- Show 5 most departure-delayed flights in January per carrier.
Code
nycflights13::flights %>%
  filter(month == 1) %>%
  group_by(carrier) %>%
  slice_max(dep_delay, n = 5)
- Retrieve all unique() routes and sort them by destination.
Code
nycflights13::flights %>%
  select(origin, dest) %>%
  unique() %>%
  arrange(dest)
# or
nycflights13::flights %>%
  mutate(route = paste(origin, dest, sep = "-")) %>%
  select(route) %>%
  unique()
- Retrieve all distinct() routes and sort them by destination.
Code
nycflights13::flights %>%
  select(origin, dest) %>%
  distinct() %>%
  arrange(dest)
# or
flights %>%
  mutate(route = paste(origin, dest, sep = "-")) %>%
  distinct(route)
- Is unique() more efficient than distinct()?
Code
microbenchmark::microbenchmark(
  unique = nycflights13::flights %>%
    select(origin, dest) %>%
    unique() %>%
    arrange(dest),
  distinct = nycflights13::flights %>%
    distinct(origin, dest) %>%
    arrange(dest),
  times = 10L
)
# distinct() is faster.
2.3 Task 3: Trans(mutations)
- air_time is the amount of time in minutes spent in the air. Add a new column air_spd that will contain the aircraft's airspeed in mph.
- As above, but keep only the new air_spd variable.
- Use rownames_to_column() on mtcars to add car model as an extra column.
Code
flights %>% mutate(air_spd = distance / (air_time / 60))
flights %>% transmute(air_spd = distance / (air_time / 60))
mtcars %>% rownames_to_column('model')
2.4 Task 4: Groups and counts
- Use group_by(), summarise() and n() to see how many planes were delayed (departure) every month.
Code
flights %>%
  filter(dep_delay > 0) %>%
  group_by(month) %>%
  summarise(num_dep_delayed = n())
- Do the same but using tally() and count().
Code
flights %>%
  filter(dep_delay > 0) %>%
  group_by(month) %>%
  tally()

flights %>%
  filter(dep_delay > 0) %>%
  count(month)
- What was the mean dep_delay per month?
Code
flights %>%
  group_by(month) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE))
- Count the number of incoming delayed flights from each unique origin and sort origins by this count (descending).
Code
flights %>%
  filter(arr_delay > 0) %>%
  group_by(origin) %>%
  summarise(cnt = n()) %>%
  arrange(desc(cnt))
- Do the same using tally().
Code
flights %>%
  filter(arr_delay > 0) %>%
  group_by(origin) %>%
  tally(sort = TRUE)
- Use summarise() to sum total dep_delay per month, in hours.
Code
flights %>%
  group_by(month) %>%
  summarize(tot_dep_delay = sum(dep_delay / 60, na.rm = TRUE))
- Use the wt parameter of count() (works with tally() too) to achieve the same.
Code
flights %>%
  group_by(month) %>%
  count(wt = dep_delay / 60)
- Run group_size() on flights grouped by carrier. What does it return?
Code
flights %>%
  group_by(carrier) %>%
  group_size()
- Use n_groups() to check the number of unique origin-carrier pairs.
Code
flights %>%
  group_by(origin, carrier) %>%
  n_groups()
ungroup
Depending on the version of dplyr, you may or may not need to use ungroup() if you want to group your data on some other variables. In newer versions, summarise and mutate drop one grouping level.
flights %>%
  group_by(origin) %>%
  mutate(mean_delay_orig = (mean(dep_delay, na.rm = TRUE) + mean(arr_delay, na.rm = TRUE)) / 2) %>%
  ungroup() %>%
  group_by(carrier) %>%
  mutate(mean_delay_carr = (mean(dep_delay, na.rm = TRUE) + mean(arr_delay, na.rm = TRUE)) / 2) %>%
  select(origin, carrier, mean_delay_orig, mean_delay_carr)
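If you would rather not rely on a version-dependent default, newer dplyr (1.0.0 and later) lets you state explicitly what should happen to the grouping via the .groups argument of summarise(); a small sketch:

flights %>%
  group_by(origin, carrier) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = 'drop')
# .groups = 'drop' returns an ungrouped tibble; 'drop_last' (the default with
# several grouping variables) would keep the result grouped by origin.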
2.5 Task 5: Joins
Given the following tibbles set1 and set2:

set1 <- tribble(
  ~id, ~color,
  'id1', 'grey',
  'id1', 'red',
  'id2', 'green',
  'id3', 'blue'
)

set2 <- tribble(
  ~id, ~size,
  'id2', 'XL',
  'id3', 'M',
  'id4', 'M'
)

set1
set2
id | color |
---|---|
id1 | grey |
id1 | red |
id2 | green |
id3 | blue |
id | size |
---|---|
id2 | XL |
id3 | M |
id4 | M |
Perform joins on id that result in the grey area from the Venn diagrams below. We have not talked about all possible joins, so read the docs if you do not know which join to use.
Code
left_join(set1, set2, by = 'id')
Code
right_join(set1, set2, by = 'id')
Code
inner_join(set1, set2, by = 'id') # or
semi_join(set1, set2, by = 'id') # semi_join removes duplicates in x
# and also returns only columns from x.
Code
full_join(set1, set2, by = 'id')
Code
anti_join(set1, set2, by = 'id')
3 Tidying data
Now time to do some data tidying. First install a package with some untidy data:
#renv::install("rstudio/EDAWR")
library(EDAWR)
- Tidy cases so that years are not in separate columns, but in a column called year containing a value for each year.
Code
tidy_cases <- cases %>%
  pivot_longer(-country, names_to = "year", values_to = "count")
- Now time for the pollution dataset. Tidy it so that there are separate columns for large and small pollution values.
Code
tidy_pollution <- pollution %>%
  pivot_wider(id_cols = city, names_from = size, values_from = amount)
- The storms dataset contains the date column. Make it into 3 columns: year, month and day. Store the result as tidy_storms.
Code
tidy_storms <- storms %>%
  separate(col = date,
           into = c("year", "month", "day"),
           sep = "-")
- Now, merge year, month and day in tidy_storms into a date column again, but in the "DD/MM/YYYY" format.
Code
tidy_storms %>% unite(col = "date", day, month, year, sep = "/") # note the day, month, year order to get DD/MM/YYYY
4 Nanopore Channel Activity Challenge
4.1 Introduction
You will be given a fastq file coming from an Oxford Nanopore MinION sequencer. This file contains test reads from the chicken genome. The flow-cell used here has 512 channels; each channel consists of 4 pores and only one pore is active at a time. Once your sequence gets stuck for some reason, the device will attempt to remove it from the pore by playing with reversing polarity on that pore. If this was successful, the pore will be re-used. Your task will be to visualize reading events from the meta-data in the fastq dataset and to see how each and every channel behaved. Also, you will plot the distribution of reading times.
Datasets are available here.
4.2 Preparations
First, we will need to load the necessary libraries. I will give you a hint – you need the following libraries:
- here – not necessary, but it is an elegant way of reading the data locally from the project folder,
- tidyverse – well, quite obvious why,
- ShortRead from Bioconductor – to deal with short reads in fastq,
- lubridate – to figure out reading times.
library(here)
library(tidyverse)
#BiocManager::install("ShortRead")
library(ShortRead)
library(lubridate)
4.3 Reading data
Now, let's read the fastq data. Check the ShortRead documentation to see how to read our fastq file. Also, try to use the here package. If you write data <- here::here('data/my.fastq'), the my.fastq file will be read from the data folder, which is a subfolder of your project root, i.e. the folder where your R script is. It is a good practice and also prevents Jenny Bryan from coming to your office and setting your computer on fire.
Now think a bit: to plot reading events, do we need all the data in the file or only some specific part? You may want to see the first few lines of the fastq to learn about the data structure.
raw_data <- here::here('labs/tidyverse/assets/FUL1_fastqs_GRID2.fastq')
f <- ShortRead::FastqFile(raw_data)
rfq <- ShortRead::readFastq(f)
headers <- rfq@id
close(f)
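It may help to peek at a few headers first to learn their structure (a quick sketch):

# each header is a space-separated list of key=value fields
head(as.character(headers), n = 3)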
4.4 Extracting information you need
In this step, we are extracting data from the fastq header of each and every read in the fastq file. Not super efficient and perhaps the slowest step of the whole analysis. Can you do it better than our example solution?
Desired output: a table (tibble/data.frame) with reads as rows and meta-data as columns.
We will use the stringr package for string manipulation. Use str_split() to explode string data into columns and str_remove_all() to get rid of unwanted characters.
- To split a string on a particular character or group of characters, use str_split. Here we split on a comma:
text <- "This text is long, or not?"
str_split(text, ',')
- To remove everything following a given character, e.g. comma:
str_remove_all(text, ",.*")
Code
data <- dplyr::as_tibble(matrix(NA_character_, ncol = 6, nrow = length(headers)), .name_repair = 'minimal')
colnames(data) <- c('id', 'run_id', 'sample_id', 'read', 'channel', 'start_time')
for (i in 1:length(headers)) {
  data[i, ] <- toString(headers[[i]]) %>%
    strsplit(' ') %>%
    unlist() %>%
    str_remove_all(".*=") %>%
    t()
}
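A possibly faster, vectorised alternative (a sketch; it assumes every header carries the same six space-separated key=value fields):

# turn the headers into one character column, split it into six fields,
# then strip the 'key=' prefix from every field
data <- tibble(header = as.character(headers)) %>%
  separate(header,
           into = c('id', 'run_id', 'sample_id', 'read', 'channel', 'start_time'),
           sep = ' ') %>%
  mutate(across(everything(), ~ str_remove(.x, '.*=')))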
4.5 Preparing tidy dataset
Now, the fun part begins:
- Add a column start_dttm that represents the start time of a given read as a proper datetime object (read the lubridate docs),
- Add a column chan that is the proper numeric representation of the channel,
- Group reads by channel,
- Arrange them by time,
- Add the time to the next read (NA if this was the last read), and
- Sort by channel again.
Read about lead().
Code
data2 <- data %>%
  mutate(start_dttm = as_datetime(start_time)) %>%
  mutate(chan = as.numeric(channel)) %>%
  group_by(chan) %>%
  arrange(start_dttm) %>%
  mutate(time_diff = lead(start_dttm) - start_dttm) %>%
  arrange(chan)
4.6 Plotting events per channel
Here, we want to see what was happening in each channel over time. Plot the data you have just prepared so that:
- Each point is the start of a new read,
- Colour corresponds to the lag to the next read.
Can you visualize this in a better way? Different geometry?
Code
ggplot(data2, mapping = aes(x = start_dttm,
                            y = as.factor(chan),
                            col = as.numeric(time_diff))) +
  geom_point(size = .5) +
  theme_bw()
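One different geometry worth trying (a sketch): draw each inter-read gap as a horizontal segment, which makes long idle stretches per channel easier to spot:

ggplot(data2, aes(x = start_dttm, xend = start_dttm + time_diff,
                  y = as.factor(chan), yend = as.factor(chan),
                  col = as.numeric(time_diff))) +
  geom_segment() +
  theme_bw()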
4.7 Distribution of time intervals
Now, we want to see how time-to-next-read is distributed. Since it has a veeeeery long right tail, I am cutting off everything above 200 seconds (just by eyeballing).
- Plot the time-to-next-read distribution (you can use a base-R histogram),
- Can you find a better cutoff value?
Code
# Show time-to-next-read distribution
# thr <- mean(data2$time_diff, na.rm = T) + 3 * sd(data2$time_diff, na.rm = T)
tmp <- data2 %>%
  ungroup() %>%
  filter(time_diff < 200) %>%
  select(time_diff)
hist(as.numeric(tmp$time_diff), breaks = 1000, las = 1)
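As for a better cutoff: instead of eyeballing 200 seconds, one option (an assumption worth checking against the histogram) is a high quantile of the gaps, e.g. the 99th percentile:

thr <- quantile(as.numeric(data2$time_diff), probs = 0.99, na.rm = TRUE)
data2 %>%
  ungroup() %>%
  filter(time_diff < thr) %>%
  pull(time_diff) %>%
  as.numeric() %>%
  hist(breaks = 1000, las = 1)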
5 Species Identification Challenge
In this challenge, your task will be to analyze the species composition of some samples. The samples were actual products containing parts of plants. DNA has been isolated from the samples and amplicon metabarcoding was performed using two sets of primers: for the ITS1 and the ITS2 region. Each sample had 3 technical replicates. Your task will be to transform BLAST output into a tidy form suitable for further analyses or visualization.
5.1 Load necessary libraries
We will obviously need tidyverse; we will also do some string manipulation with stringr. The here package is also good to have.
library(tidyverse)
library(stringr)
library(here)
5.2 Input variables
Here, we will define our input variables. We need:
- file – the path to the dataset,
- sample_name – a string, the name of the sample you want to analyze,
- threshold – an integer giving the minimal number of replicates that have to contain an OTU in order to call it a true positive (TP),
- strict – a logical. If set to TRUE, only the OTUs deemed TP will be shown.
Below we set some example values:
# Change the path to your project path, where your data is
file <- here::here("docs/tidyverse_Marcin/lab/assets/blast_result.csv")
sample_name <- 'SAMPLE12'
threshold <- 1
strict <- F
5.3 Reading the data
Now, you should read the data:
Code
species_orig <- read_csv(file, col_names = c("sample", "its", "replicate", "OTU", "size", "hit", "perc_ident", "score", "family", "species")) %>%
  select(-score)
head(species_orig, n = 10)
sample | its | replicate | OTU | size | hit | perc_ident | family | species |
---|---|---|---|---|---|---|---|---|
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | KX522674 | 97.08 | Anacardiaceae | Spondias_tuberosa |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | AJ783644 | 97.53 | Betulaceae | Betula_populifolia |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | AJ783641 | 97.53 | Betulaceae | Betula_alnoides |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | MF171078 | 98.71 | Pentaphylacaceae | Pentaphylax_euryoides |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214931 | 98.33 | Musaceae | Musa_ornata |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214926 | 96.23 | Musaceae | Musa_sp. |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214930 | 95.73 | Musaceae | Musa_basjoo |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214927 | 97.67 | Musaceae | Musella_lasiocarpa |
SAMPLE10 | ITS1 | R2 | OTU1 | 357 | AM233397 | 95.62 | Apocynaceae | Secamone_filiformis |
SAMPLE10 | ITS1 | R2 | OTU1 | 357 | FR832858 | 96.69 | Rubiaceae | Tricalysia_perrieri |
As you see, the following information is included in the data:
- sample is simply the name of the sample,
- its is either ITS1 or ITS2 and tells which set of PCR primers has been used,
- replicate contains information on which replicate the sequences come from,
- OTU is a unique identifier of the so-called Operational Taxonomic Unit; an OTU often corresponds to one species, but not always. Sometimes 2 OTUs represent the same species, sometimes 1 OTU consists of more than one species,
- size is the number of reads that support that particular OTU,
- hit is the BLAST hit identifier. The 4 top BLAST hits are reported per OTU,
- perc_ident is the percentage identity of the sequence to the BLAST hit,
- family is the identified plant family,
- species is the identified plant species.
5.4 Number of replicates per species
Create a new dataset species that contains an extra column n_replicates. The column contains the number of replicates this particular species is present in. Do it per sample and its.
Code
species <- species_orig %>%
  group_by(sample, its, species) %>%
  mutate(n_replicates = n_distinct(replicate)) %>%
  ungroup()
head(species, n = 10)
sample | its | replicate | OTU | size | hit | perc_ident | family | species | n_replicates |
---|---|---|---|---|---|---|---|---|---|
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | KX522674 | 97.08 | Anacardiaceae | Spondias_tuberosa | 2 |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | AJ783644 | 97.53 | Betulaceae | Betula_populifolia | 1 |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | AJ783641 | 97.53 | Betulaceae | Betula_alnoides | 1 |
SAMPLE10 | ITS1 | R1 | OTU1 | 372 | MF171078 | 98.71 | Pentaphylacaceae | Pentaphylax_euryoides | 1 |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214931 | 98.33 | Musaceae | Musa_ornata | 1 |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214926 | 96.23 | Musaceae | Musa_sp. | 1 |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214930 | 95.73 | Musaceae | Musa_basjoo | 1 |
SAMPLE10 | ITS1 | R1 | OTU2 | 14 | KY214927 | 97.67 | Musaceae | Musella_lasiocarpa | 1 |
SAMPLE10 | ITS1 | R2 | OTU1 | 357 | AM233397 | 95.62 | Apocynaceae | Secamone_filiformis | 1 |
SAMPLE10 | ITS1 | R2 | OTU1 | 357 | FR832858 | 96.69 | Rubiaceae | Tricalysia_perrieri | 1 |
5.5 Filter out unwanted samples
Now, your task is to filter out all but your sample_name sample from the dataset. Call the resulting dataset my_sample.
Code
my_sample <- species %>%
  filter(sample == sample_name)
5.6 Missing observations
What happens if a set of primers failed to amplify or if one replicate was lost? Use complete() to make sure you have NA values in such cases.
Code
my_sample <- my_sample %>%
  complete(its = c("ITS1", "ITS2"),
           replicate = c("R1", "R2", "R3"))
5.7 Sorting issue
Look, the first sample in the table is SAMPLE10. Why not SAMPLE1? That's a sorting issue: if sorted as character, 10 will come before 1. We have to fix this by adding a leading zero to the values in OTU. We do not expect more than 99 OTUs in a sample, so a single leading 0 is enough (otherwise the 100th value would spoil our sorting again and come out like: OTU100, OTU01, OTU10). We will need to use a regular expression:
- All values in the OTU column that follow the pattern "OTUdigit" need to be changed to "OTU0digit". The regular expression that matches this is OTU([0-9]$) and it should be replaced by OTU0\\1. Ask your TAs to explain this if you do not know much about regular expressions and pattern matching.
my_sample <- my_sample %>%
  mutate(OTU = str_replace(OTU, pattern = "OTU([0-9]$)",
                           replacement = "OTU0\\1"))
head(my_sample, n = 10)
its | replicate | sample | OTU | size | hit | perc_ident | family | species | n_replicates |
---|---|---|---|---|---|---|---|---|---|
ITS1 | R1 | SAMPLE12 | OTU01 | 4169 | LC089998 | 99.44 | Brassicaceae | Capsella_bursa-pastoris | 3 |
ITS1 | R1 | SAMPLE12 | OTU02 | 2686 | KP794392 | 98.19 | Euphorbiaceae | Tragia_cf. | 3 |
ITS1 | R1 | SAMPLE12 | OTU02 | 2686 | KP794390 | 98.19 | Euphorbiaceae | Tragia_chlorocaulon | 3 |
ITS1 | R1 | SAMPLE12 | OTU02 | 2686 | KP794378 | 98.19 | Euphorbiaceae | Tragia_bahiensis | 3 |
ITS1 | R1 | SAMPLE12 | OTU02 | 2686 | KP794320 | 98.19 | Euphorbiaceae | Bia_alienata | 3 |
ITS1 | R1 | SAMPLE12 | OTU04 | 1353 | GQ396671 | 97.44 | Euphorbiaceae | Triadica_sebifera | 3 |
ITS1 | R1 | SAMPLE12 | OTU04 | 1353 | KX522674 | 96.25 | Anacardiaceae | Spondias_tuberosa | 3 |
ITS1 | R1 | SAMPLE12 | OTU04 | 1353 | KY860928 | 96.23 | Adoxaceae | Viburnum_prunifolium | 3 |
ITS1 | R1 | SAMPLE12 | OTU04 | 1353 | KX757310 | 96.23 | Caryophyllaceae | Silene_caesia | 3 |
ITS1 | R1 | SAMPLE12 | OTU06 | 577 | KY860926 | 100.00 | Asteraceae | Taraxacum_officinale | 3 |
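As an aside, if all you need is the natural sort order (rather than padded labels), stringr can compare the embedded numbers directly:

# numeric = TRUE sorts embedded numbers as numbers, no padding required
str_sort(c('OTU10', 'OTU2', 'OTU1'), numeric = TRUE)
# "OTU1" "OTU2" "OTU10"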
5.8 Supporting reads
Sometimes, an OTU generates two or more top BLAST hits that come from the same species. We have decided to sum reads in such cases. Do it!
Code
my_sample <- my_sample %>%
  ungroup() %>%
  group_by(sample, its, replicate, OTU, species, n_replicates) %>%
  summarise(n_reads = sum(size)) %>%
  ungroup() %>%
  group_by(its, species, OTU)
head(my_sample, n = 10)
sample | its | replicate | OTU | species | n_replicates | n_reads |
---|---|---|---|---|---|---|
SAMPLE12 | ITS1 | R1 | OTU01 | Capsella_bursa-pastoris | 3 | 4169 |
SAMPLE12 | ITS1 | R1 | OTU02 | Bia_alienata | 3 | 2686 |
SAMPLE12 | ITS1 | R1 | OTU02 | Tragia_bahiensis | 3 | 2686 |
SAMPLE12 | ITS1 | R1 | OTU02 | Tragia_cf. | 3 | 2686 |
SAMPLE12 | ITS1 | R1 | OTU02 | Tragia_chlorocaulon | 3 | 2686 |
SAMPLE12 | ITS1 | R1 | OTU04 | Silene_caesia | 3 | 1353 |
SAMPLE12 | ITS1 | R1 | OTU04 | Spondias_tuberosa | 3 | 1353 |
SAMPLE12 | ITS1 | R1 | OTU04 | Triadica_sebifera | 3 | 1353 |
SAMPLE12 | ITS1 | R1 | OTU04 | Viburnum_prunifolium | 3 | 1353 |
SAMPLE12 | ITS1 | R1 | OTU06 | Taraxacum_amplum | 3 | 577 |
5.9 Within-OTU species diversity
Now, we want to see how many identifications an OTU got. Implement this. Store the result in a new tibble diversity.
Code
diversity <- my_sample %>%
  ungroup() %>%
  group_by(its, replicate, OTU) %>%
  summarise(n_species = n())
head(diversity, n = 10)
its | replicate | OTU | n_species |
---|---|---|---|
ITS1 | R1 | OTU01 | 1 |
ITS1 | R1 | OTU02 | 4 |
ITS1 | R1 | OTU04 | 4 |
ITS1 | R1 | OTU06 | 4 |
ITS1 | R1 | OTU07 | 3 |
ITS1 | R1 | OTU08 | 2 |
ITS1 | R1 | OTU10 | 1 |
ITS1 | R1 | OTU11 | 3 |
ITS1 | R1 | OTU13 | 3 |
ITS1 | R1 | OTU15 | 2 |
5.10 Adding diversity data
Add the diversity data to my_sample using the appropriate join function. Also, remove the column with sample names since we are dealing with only one sample.
Code
my_sample <- my_sample %>%
  left_join(diversity) %>%
  select(-sample)
head(my_sample, n = 10)
its | replicate | OTU | species | n_replicates | n_reads | n_species |
---|---|---|---|---|---|---|
ITS1 | R1 | OTU01 | Capsella_bursa-pastoris | 3 | 4169 | 1 |
ITS1 | R1 | OTU02 | Bia_alienata | 3 | 2686 | 4 |
ITS1 | R1 | OTU02 | Tragia_bahiensis | 3 | 2686 | 4 |
ITS1 | R1 | OTU02 | Tragia_cf. | 3 | 2686 | 4 |
ITS1 | R1 | OTU02 | Tragia_chlorocaulon | 3 | 2686 | 4 |
ITS1 | R1 | OTU04 | Silene_caesia | 3 | 1353 | 4 |
ITS1 | R1 | OTU04 | Spondias_tuberosa | 3 | 1353 | 4 |
ITS1 | R1 | OTU04 | Triadica_sebifera | 3 | 1353 | 4 |
ITS1 | R1 | OTU04 | Viburnum_prunifolium | 3 | 1353 | 4 |
ITS1 | R1 | OTU06 | Taraxacum_amplum | 3 | 577 | 4 |
- What columns did we join on in our example solution?
5.11 Visualising the data
Can you think of a good way of visualizing the data? Think of:
- The type of plot you want. The simpler the better?
- Which variables you would like to visualize? We have chosen its, replicate, n_reads, n_replicates, OTU, threshold, n_species and species. Well, pretty much all of them :-)
- How do you represent variables: colors, shapes, separate plots, sizes?
- What transformations do you need to apply before visualizing the data?
Our example solution involves some ggplot2 magic. But would base-R be good enough for this type of plot?
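For reference, a minimal sketch of one possible ggplot2 take (not our exact solution; it reuses threshold and strict from the input variables section):

my_sample %>%
  filter(!strict | n_replicates >= threshold) %>% # keep only TPs when strict
  ggplot(aes(x = species, y = n_reads, fill = replicate)) +
  geom_col(position = 'dodge') +
  facet_wrap(~its) +
  coord_flip() +
  theme_bw()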
6 Wildlife Aircraft Strikes Challenge
Use the FAA report and tidyverse to learn more about aircraft incidents with wildlife. Use your imagination and the NYC data science blog for inspiration!