The objective of this lab is to improve your coding skills by focusing on coding style, code benchmarking and optimization. Below, you will find a number of tasks connected to the topics covered in the Best Coding Practices lecture. Some tasks extend lectures content and require you to find some more information online. Please, note that while we are providing example solutions to many tasks, these are only examples. If you solve a task in a different way it does not matter your solution is wrong. In fact, it may be better than our solution. If in doubt, ask PI for help. We are here for you!
Which of the following are valid/good variable names in R. What is wrong with the ones that are invalid/bad?
var1
, 3way_handshake
, .password
, __test__
, my-matrix-M
, three.dimensional.array
, 3D.distance
, .2objects
, wz3gei92
, next
, P
, Q
, R
, S
, T
, X
, is.larger?
The code below works, but can be improved. Do improve it!
myIterAtoR.max <- 5
second_iterator.max<-7
col.NUM= 10
row.cnt =10
fwzy45 <- matrix(rep(1, col.NUM*row.cnt),nrow=row.cnt)
for(haystack in (2-1):col.NUM){
for(needle in 1:row.cnt) {
if(haystack>=myIterAtoR.max){
fwzy45[haystack, needle]<-NA}
}}
iter_max <- 5
col_num <- 10
row_num <- 10
A <- matrix(rep(1, col_num * row_num), nrow = row_num)
for (i in 1:col_num) {
for (j in 1:row_num) {
if (i >= iter_max) {
A[i, j] <- NA
}
}
}
# Can you improve the code more by eliminating loops or at least one of them?
Improve formatting and style of the following code:
simulate_genotype <- function( q, N=100 ) {
if( length(q)==1 ){
p <- (1 - q)
f_gt <- c(p^2, 2*p*q, q^2) # AA, AB, BB
}else{
f_gt<-q
}
tmp <- sample( c('AA','AB','BB'), size =N, prob=f_gt, replace=T )
return(tmp)
}
simulate_genotype <- function(q, N = 100) {
if (length(q) == 1) {
p <- (1 - q)
f_gt <- c(p^2, 2*p*q, q^2) # AA, AB, BB
} else {
f_gt <- q
}
genotype <- sample(c('AA', 'AB', 'BB'),
size = N,
prob = f_gt,
replace = T)
return(genotype)
}
Modify the function below so that it works with R pipes %>%
:
my_filter <- function(threshold = 1, data, scalar = 5) {
data[data >= threshold] <- NA
data <- data * scalar
return(data)
}
Note you need to have the magrittr
or tidyverse
package loaded in order to be able to use the pipe %>%
!
my_filter <- function(data, threshold = 1, scalar = 5) {
data[data >= threshold] <- NA
data <- data * scalar
return(data)
}
# Test:
c(-5, 5) %>% my_filter()
Is the code below correct? Can it be improved?
simulate_phenotype <- function(pop_params, gp_map, gtype) {
pop_mean <- pop_params[1]
pop_var <- pop_params[2]
pheno <- rnorm(n = N, mean = pop_mean, sd = sqrt(pop_var))
effect <- rep(0, times = length(N))
for (gt_iter in c('AA', 'AB', 'BB')) {
effect[gtype == gt_iter] <- rnorm(n = sum(gtype == gt_iter),
mean = gp_map[gt_iter, 'mean_eff'],
sd = sqrt(gp_map[gt_iter, 'var_eff']))
}
dat <- data.frame(gt = gtype, raw_pheno = pheno, effect = effect, pheno = pheno + effect)
return(dat)
}
Maybe some small improvements can be done, but in principle the code is clean! Except that... the N is not initialized anywhere.
Write a modular code (function or functions) that computes the sample standard deviation given a vector of numbers. Decide how to logically structure the code. Assume there are no built-in R functions for computing mean and variance available. The formula for variance is: \(SD = \sqrt{\frac{\Sigma_{i=1}^{N}(x_i - \bar{x})^2}{N-1}}\). Standard deviation is \(Var=SD^2\).
Hint: consider that you may want to re-use some computed values in future, e.g. variance.
sample_mean <- function(x) {
mean <- sum(x) / length(x)
return(mean)
}
sum_squared_deviations <- function(x) {
tmp <- (x - sample_mean(x)) ^ 2
sum_sq_dev <- sum(tmp)
return(sum_sq_dev)
}
std_dev <- function(x) {
variance <- sqrt(sum_squared_deviations(x) / (length(x) - 1))
return(variance)
}
variance <- function(x) {
return(std_dev(x) ^ 2)
}
You found two functions in two different packages: the randomSampleInt
function that generates a random sample of integer numbers and the randomSampleLetter
function for generating a random sample of letters. Unfortunately, the functions are called in different ways which you want to unify in order to use them interchangeably in your code. Write a wrapper function around the randomSampleLetter
that will provide the same interface to the function as the randomSampleInt
. Also, the randomSampleLetter
cannot handle the seed. Can you add this feature to your wrapper?
randomSampleInt <- function(x, verbose, length, seed = 42) {
if (verbose) {
print(paste0('Generating random sample of ', length, ' integers using seed ', seed))
}
set.seed(seed)
sampleInt <- sample(x = x, size = length, replace = TRUE)
return(sampleInt)
}
randomSampleLetter <- function(N, silent=T, lett) {
if (!silent) {
print(paste0('Generating random sample of ', N, ' letters.'))
}
sample <- sample(x = lett, size = N, replace = TRUE)
return(sample)
}
randomSampleLetterWrapper <- function(x, verbose, length, seed = 42) {
set.seed(seed)
result <- randomSampleLetter(N = length, silent = !verbose, lett = x)
return(result)
}
plot
.Write a wrapper around the graphics::plot
function that modifies its default behaviour so that it plots red crosses instead of black points. Do it in a way that enables the user to modify other function arguments. Hint: you may want to have a look at graphics::plot.default
.
my_plot <- function(x, ...) {
plot(x, pch = 3, color = 'red', ...)
}
What if you want to pass some additional parameters to a function and, sadly, the authors forgot to add ...
to the list of function arguments. There is a way out – you can bind extra arguments supplied as alist
structure to the original function arguments retrieved by formals
. Try to fix the function below, so that the call red_plot(1, 1, col='red', pch=19)
will result in points being represented by red circles. Do use alist
and formals
and do not edit the red_plot
itself! Hint: read help for alist
and formals
. Original function:
red_plot <- function(x, y) {
plot(x, y, las=1, cex.axis=.8, ...)
}
red_plot <- function(x, y) {
plot(x, y, las=1, cex.axis=.8, ...)
}
red_plot(1, 1, col='red', pch=19) # Does not work.
formals(red_plot) <- c(formals(red_plot), alist(... = )) # Fix.
red_plot(1, 1, col='red', pch=19) # Works!
options
.Use options
to change the default prompt in R to hello :-) >
.
Check what options are stored in the hidden variable called Options.
options(prompt = "hello :-) > ")
.Options
options(prompt = "> ") # restoring the default
Which of the following chunks of code are correct and which contain errors? Identify these errors.
input <- sample(1:1000, size = 1000, replace = T)
currmin <- NULL
for (i in input) {
if (input > currmin) {
currmin <- input
print(paste0("The new minimum is: ", currmin))
}
}
input <- sample(1:1000, size = 1000, replac = T)
currmin <- NULL
for (i in input) {
if (input < currmin) {
currmin <- input
print(paste0("The new minimum is: ", currmin))
}
}
for (cnt in 1:100) {
if (cnt > 12) {
print("12+")
} else {
print("Not 12+")
}
}
result <- logical(10)
input <- sample(1:10, size = 10, replace = T)
for (i in 0:length(input)) {
if (input[i] >= 5) {
result[i] <- TRUE
}
}
Play with debugger as described in lecture slides 17-21.
Can you fix the code below so that it produces more reliable result?
Hint: think in terms of system-specific representation \(\epsilon\).
Put the value of your double \(\epsilon\) into this spreadsheet (Best Coding Practises Lab sheet).
vec <- seq(0.1, 0.9, by=0.1)
vec == 0.7
# One way is to use epsilon
# Check machine's floating point representation
vec <- seq(0.1, 0.9, by=0.1)
# Make a custom function that uses machines' epsilon for comparing
# values
is_equal <- function(x, y) {
isEqual <- F
if (abs(x - y) < unlist(.Machine)['double.eps']) {
isEqual <- T
}
isEqual
}
# Some tests
0.7 == 0.6 + 0.1
is_equal(0.7, 0.6 + 0.1)
0.7 == 0.8 - 0.1
is_equal(0.7, 0.8 - 0.1)
# Now you can use the is_equal to fix the code!
Create a 10 000 x 10 000 matrix and fill it with random numbers (from 1 to 42), first row by row and later column by column. Use proc.time
to see if there is any difference. Is the measurement reliable? Record the values you got in this spreadsheet:
N <- 10e3 * 10e3
# By row
t1 <- proc.time()
M <- matrix(sample(1:42, size = N, replace = T), nrow = sqrt(N), byrow = T)
t2 <- proc.time()
(t2 - t1)
# By column
t1 <- proc.time()
M <- matrix(sample(1:42, size = N, replace = T), nrow = sqrt(N), byrow = F)
t2 <- proc.time()
(t2 - t1)
In the lecture slides, you have seen how to time sampling from Gaussian distribution:
system.time(rnorm(n = 10e6))
Is such single measurement reliable? Run the code 100 times, plot and record the mean and the variance of the elapsed
time. Put these values (elapsed.time mean and variance) into this spreadsheet (Best Coding Practises Lab sheet).
timing <- double(100)
for (i in 1:100) {
st <- system.time(rnorm(n = 10e6))
timing[i] <- st[3]
}
boxplot(timing)
mean(timing)
var(timing)
An alternative approach or, more exactly, an alternative notation that achieves the same as the previous chunk of code but in a more compact way makes use of the replicate
, a wrapper function around sapply
that simplifies repeated evaluation of expressions. The drawback is you do not get the vector of the actual timing values but the results of calling system.time
are already averaged for you. Try to read about the replicate
and use it to re-write the code above. Put the elapsed.time
into this spreadsheet (Best Coding Practises Lab sheet). How does this value compare to calling system.time
within a loop in the previous chunk of code? Are the values similar?
st2 <- system.time(replicate(n = 100, rnorm(n = 10e6)))
While system.time
might be sufficient most of the time, there is also an excellent package microbenchmark
that enables more accurate time profiling, aiming at microsecond resolution that most of modern operating systems offer. Most of the benchmarking the microbenchmark
does is implemented in low-overhead C functions and also the package makes sure to:
* estimate granularity and resolution of timing for your particular OS,
* warm up your processor before measuring, i.e. wake the processor up from any idle state or likewise.
Begin by installing the microbenchmark
package.
microbenchmark::get_nanotime()
Modify the code below so that it uses the current value of platform’s timer:
timing <- double(100)
for (i in 1:100) {
st <- system.time(rnorm(n = 10e6))
timing[i] <- st[3]
}
boxplot(timing)
Put the mean and the variance into this spreadsheet (Best Coding Practises Lab sheet, Microbenchmark – loop)
library(microbenchmark)
timing <- double(100)
for (i in 1:100) {
nanotime_start <- get_nanotime()
rnorm(n = 10e6)
nanotime_stop <- get_nanotime()
timing[i] <- nanotime_stop - nanotime_start
}
mean(timing)
var(timing)
boxplot(timing)
microbenchmark
package that helps the package estimate granularity and resolution of your particular timing subsystem. According to the documentation, the function measures the overhead of timing a C function call rounds times and returns all non-zero timings observed.
microtiming_precision
function and put the mean and the variance of the resulting vector into this spreadsheet (Best Coding Practises Lab sheet, Microbenchmark – precision)
precision <- microbenchmark::microtiming_precision()
mean(precision)
var(precision)
# In version 1.4-4 of the package, all three ways give different results!
microbenchmark::microtiming_precision()
precision <- microbenchmark::microtiming_precision()
?microbenchmark::microtiming_precision
Finally, let’s benchmark our rnorm
example using microbenchmark
:
rnorm(n = 10e6)
expression,ggplot2
and a boxplot (read the microbenchmark
package documentation),require(microbenchmark)
require(ggplot2)
mb <- microbenchmark(rnorm(n = 10e6))
autoplot(mb)
boxplot(mb)
summary(mb)
f <- function() {}
mb2 <- microbenchmark(f(), pi, 2+2)
summary(mb2)
autoplot(mb2)
Now, we will use a even more sophisticated approach to profiling.
Rprof
way.Write three functions that fill by row a \(N \times N\) matrix \(M\) with randomly generated numbers from a vector given as argument bag
, allow for passing random seed value as function argument with the default value of 42. After filling the matrix with values, add to each and every element of \(M\) the number of column the element is in and return such matrix from the function. Functions should:
fill_alloc
) – use memory allocation prior to loop in which the matrix is being filled and allocate memory using init
value passed as argument and by default set to NULL
,fill_noalloc
– not use memory allocation prior to the loop,fill_noloop
should not the loop for filling the matrix in.
** NOTE! ** do not perform addition of column number in the same
Following this and using rnorm(1000, mean = 0, sd = 1)
:
Rprof
to profile the functinos using the same seed and N=100,Rprof
to check whether there is a difference between initializing the matrix using NULL
and 0 in fill_alloc
,fill_noloop <- function(N, bag, seed = 42) {
set.seed(seed)
values <- sample(bag, size = N^2, replace = T)
M <- matrix(data = values, nrow = N, byrow = T)
for (col_num in 1:N) {
M[, col_num] <- M[, col_num] + col_num
}
return(M)
}
fill_noalloc <- function(N, bag, seed = 42) {
set.seed(seed)
values <- sample(bag, size = N^2, replace = T)
M <- NULL
cnt = 1
for (row in 1:N) {
row_tmp <- c()
for (col in 1:N) {
row_tmp <- c(row_tmp, values[cnt])
cnt <- cnt + 1
}
M <- rbind(M, row_tmp)
}
for (col_num in 1:N) {
M[, col_num] <- M[, col_num] + col_num
}
return(M)
}
fill_alloc <- function(N, bag, seed = 42, init = NA) {
set.seed(seed)
values <- sample(bag, size = N^2, replace = T)
M <- matrix(rep(init, times=N^2), nrow = N, byrow = T)
cnt = 1
for (row in 1:N) {
for (col in 1:N) {
M[row, col] <- values[cnt]
cnt <- cnt + 1
}
}
for (col_num in 1:N) {
M[, col_num] <- M[, col_num] + col_num
}
return(M)
}
summary <- summaryRprof('profiler_test_fillers.out', memory='both')
summary$by.self
# answers to the remaining questions are not given
Have a look at our answers.
fill_alloc
even further (call the optimized version fill_alloc_opt
)?fill_alloc_opt <- function(N, bag, seed = 42, init = NA) {
set.seed(seed)
values <- sample(bag, size = N^2, replace = T)
M <- matrix(rep(init, times=N^2), nrow = N, byrow = T)
cnt = 1
for (row in 1:N) {
for (col in 1:N) {
M[row, col] <- values[cnt] + col
cnt <- cnt + 1
}
}
return(M)
}
fill_noloop
to fill_noloops
that does not use any loops at all.fill_noloops <- function(N, bag, seed = 42) {
values <- sample(bag, size = N^2, replace = T)
inc <- rep(x = 1:N, times = N)
M <- matrix(data = values + inc, nrow = N, byrow = T)
return(M)
}
profr
package.profr
package.profr
to profile fill_noloop
, fill_noloops
and fill_alloc_opt
.library(profr)
Rprof("profr_noloop.out", interval = 0.01)
fill_noloop(1000, rnorm(1000), seed = 42)
Rprof(NULL)
profile_noloop_df <- parse_rprof('profr_noloop.out')
Rprof("profr_noloops.out", interval = 0.01)
fill_noloops(100, rnorm(1000), seed = 42)
Rprof(NULL)
profile_noloops_df <- parse_rprof('profr_noloops.out')
Rprof("profr_alloc_opt.out", interval = 0.01)
fill_alloc_opt(10, rnorm(1000), seed = 42)
Rprof(NULL)
profile_alloc_opt_df <- parse_rprof('profr_alloc_opt.out')
profr::ggplot.profr(profile_noloop_df)
profr::ggplot.profr(profile_noloops_df)
profr::ggplot.profr(profile_alloc_opt_df)
profvis
package.profvis
package.profvis
to profile fill_noloop
, and fill_alloc
functions.In this section, we will deal with some selected ways to optimize your code.
Given is a function:
optimize_me <- function(N = 1000, values = c(1:1e4)) {
N = 10; values = c(1:1e4)
dat1 <- matrix(size = N^2)
for (i in 1:N) {
for (j in 1:N) {
dat1[i, j] <- sample(values, 1)
}
}
dat0 <- dat1
dat1[lower.tri(dat1)] <- t(dat1)[lower.tri(dat1)]
dat2 <- NULL
for (i in 1:N) {
i_tmp <- c()
for (j in 1:N) {
i_tmp <- c(i_tmp, sample(values, 1))
}
dat2 <- rbind(dat2, i_tmp)
}
dat2[lower.tri(dat2)] <- t(dat2)[lower.tri(dat2)]
M <- dat2
for (i in 1:N) {
for (j in 1:N) {
M[i, j] <- dat1[i, j] * dat2[i, j]
}
}
for (i in 1:N) {
for (j in 1:N) {
M[i, j] <- M[i, j] + values[3]
}
}
N <- M %*% dat0
result <- apply(N, 2, mean)
return(result)
}
dat1 <- matrix(size = N^2)
better than dat1 <- matrix(NA, nrow=N, ncol=N)
?BLAS
?apply
somewhere?apply
further?sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS 10.14.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bsplus_0.1.1 captioner_2.2.3 bookdown_0.11 knitr_1.23
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.0 magrittr_1.5 tools_3.5.0 htmltools_0.3.6
## [5] yaml_2.2.0 Rcpp_1.0.1 lubridate_1.7.4 stringi_1.4.3
## [9] rmarkdown_1.13.1 stringr_1.4.0 xfun_0.7 digest_0.6.19
## [13] evaluate_0.14
Page built on: 09-Jun-2019 at 22:25:37.
2018 | SciLifeLab > NBIS > RaukR