+ - 0:00:00
Notes for current slide
Notes for next slide

Best Coding Practices in R

Advanced R for Bioinformatics. Visby, 2018

Marcin Kierczak

28 May, 2018

RaukR 2018 . 1/41

Topics of This Presentation



Code:

  • Style -- __howTo_style.yourCode?

  • Structure -- manufacture your own building blocks.

  • Debugging -- my code does not run.

  • Profiling -- now it does run but... out of memory!

  • Optimization -- making things better.

RaukR 2018 . 2/41

Coding Style

  • Naming conventions -- assigning names to variables.

  • Code formatting -- placement of braces, use of whitespace characters etc.


    From: Behind The Lines 2010-09-23. By Oliver Widder, Webcomics Geek And Poke.

RaukR 2018 . 3/41

Naming Conventions

A syntactically valid name:

  • Consists of:

    • letters: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
    • digits: 0123456789
    • period: .
    • underscore: _
  • begins with a letter or the period (.) not followed by a number

  • cannot be one of the reserved words: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_

  • also cannot be: c, q, t, C, D, I

RaukR 2018 . 4/41

Naming Style

Variable names that are legal are not necesserily a good style and they may be dangerous:

F + T
F <- T
F + T
## [1] 1
## [1] 2

do not do this!

RaukR 2018 . 5/41

Naming Style

Variable names that are legal are not necesserily a good style and they may be dangerous:

F + T
F <- T
F + T
## [1] 1
## [1] 2

do not do this! unless you are a politician...


Avoid T and F as variable names.

RaukR 2018 . 5/41

Customary Variable Names

Also, there is a number of variable names that are traditionally used to name particular variables:

  • usr -- user,
  • pwd -- password,
  • x, y, z -- vectors,
  • w -- weights,
  • f, g -- functions,
  • n -- number of rows,
  • p -- number of columns,
  • i, j, k -- indexes,
  • df -- data frame,
  • cnt -- counter,
  • M, N, W -- matrices,
  • tmp -- temporary variables

Sometimes these are domain-specific:

  • p, q -- allele frequencies in genetics,
  • N, k -- number of trials and number of successes in stats

    Try to avoid use these in this way to avoid possible confusion.
RaukR 2018 . 6/41

Different Notations

People use different notation styles hroughout their code:

  • snake_notation_looks_like_this,
  • camelNotationLooksLikeThis,
  • period.notation.looks.like.this,
  • LousyNotation_looks.likeThis

Try to be consistent and stick to one of them. Bear in mind period.notation is used by S3 classes to create generic functions, e.g. plot.my.object. A reason to avoid it?


It is also important to maintain code readability by having your variable names:

  • informative, e.g. genotypes vs. fsjht45jkhsdf4,

  • consistent across your code - the same naming convention,

  • not too long, e.g. - weight vs. phenotype.weight.measured,

  • in the period notation and the snake notation avoid my.var.2 or my_var_2, use my.var2 and my_var2 instead,

RaukR 2018 . 7/41

Special Variable Names

Few more things to consider:

  • there are built-in variable names:

    • LETTERS: the 26 upper-case letters of the Roman alphabet

    • letters: the 26 lower-case letters of the Roman alphabet

    • month.abb: the three-letter abbreviations for the English month names

    • month.name: the English names for the months of the year

    • pi: the ratio of the circumference of a circle to its diameter

  • variable names beginning with period are hidden: .my_secret_variable will not be shown but can be accessed.

RaukR 2018 . 8/41

Structure Your Code

Decompose the problem!


source: Wikimedia Commons

RaukR 2018 . 9/41

Structure Your Code

Decompose the problem!


source: Wikimedia Commons

  • Divide et impera / top-down approach -- split your BIG problem into small subproblems recursively and, at some level, encapsulate your code in functional blocks (functions).

  • A function should be performing a small task. Should be a logical program unit.

When should I write a function?

  • one screen rule (resolution...),
  • re-use twice rule.

Consider creating an S4 class -- data-type safety!

RaukR 2018 . 9/41

How to write functions

  • Avoid accessing (and modifying) globals!

  • Use data as the very first argument (pipes).

  • Set parameters to defaults -- better more params than too few.

  • Remember that global defaults can be changed by options.

  • If you are re-using someone else's function -- write a wrapper.

  • Showing progress and messages is good, but let the others turn this off.

  • If you are calling other functions, consider using ...

RaukR 2018 . 10/41

Debugging Your Code

  • Sooner or later ALL of us, even the most experienced programmers, introduce errors to their code.

  • 20 percent of the code has 80 percent of the errors. Find them, fix them! -- Lowell Arthur

  • Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald Knuth

  • The process of debugging is about confirming, one-by-one, that our beliefs about the code are actually true (loosely inspired by Pete Salzman).

  • Debug in a top-down and modular manner! provided you have coded in a modular way...

RaukR 2018 . 11/41

Types of bugs

There are different types of bugs we can introduce:

  • Syntax -- prin(var1), mean(sum(seq((x + 2) * (y - 9 * b)))

  • Arithmetic -- x/0 (not in R, though!) Inf/Inf

  • Type -- mean('a')

  • Logic -- everything works and produces seemingly valid output that is WRONG!

To avoid bugs:

  • Encapsulate your code in smaller units (functions), you can test.

  • Use classes and type checking.

  • Test at the boundaries, e.g. loops at min and max value.

  • Feed your functions with test data that should result with a known output.

  • Use antibugging: stopifnot(y <= 75)

RaukR 2018 . 12/41

Arithmetic bugs

(vec <- seq(0.1, 0.9, by=0.1))
vec == 0.7
vec == 0.5
(0.5 + 0.1) - 0.6
(0.7 + 0.1) - 0.8
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [1] 0
## [1] -1.110223e-16

Beware of floating point arithmetics!

head(unlist(.Machine))
head(unlist(.Platform))
## double.eps double.neg.eps double.xmin double.xmax double.base
## 2.220446e-16 1.110223e-16 2.225074e-308 1.797693e+308 2.000000e+00
## double.digits
## 5.300000e+01
## OS.type file.sep dynlib.ext GUI endian pkgType
## "unix" "/" ".so" "unknown" "little" "source"
RaukR 2018 . 13/41

Handling Errors

input <- c(1, 10, -7, -2/5, 0, 'char', 100, pi)
for (val in input) {
(paste0('Log of ', val, 'is ', log10(val)))
}
## Error in log10(val): non-numeric argument to mathematical function

One option is to use the try block:

for (val in input) {
val <- as.numeric(val)
try(print(paste0('Log of ', val, ' is ', log10(val))))
}
## [1] "Log of 1 is 0"
## [1] "Log of 10 is 1"
## [1] "Log of -7 is NaN"
## [1] "Log of -0.4 is NaN"
## [1] "Log of 0 is -Inf"
## [1] "Log of NA is NA"
## [1] "Log of 100 is 2"
## [1] "Log of 3.14159265358979 is 0.497149872694133"
RaukR 2018 . 14/41

Handling Errors with tryCatch

for (val in input) {
val <- as.numeric(val)
result <- tryCatch(log10(val),
warning = function(w) { print('Negative argument supplied. Negating.'); log10(-val) },
error = function(e) { print('Not a number!'); NaN })
print(paste0('Log of ', val, ' is ', result))
}
## [1] "Log of 1 is 0"
## [1] "Log of 10 is 1"
## [1] "Negative argument supplied. Negating."
## [1] "Log of -7 is 0.845098040014257"
## [1] "Negative argument supplied. Negating."
## [1] "Log of -0.4 is -0.397940008672038"
## [1] "Log of 0 is -Inf"
## [1] "Log of NA is NA"
## [1] "Log of 100 is 2"
## [1] "Log of 3.14159265358979 is 0.497149872694133"
RaukR 2018 . 15/41

Debugging -- Errors and Warnings

  • An error in your code will result in a call to the stop() function that:

    • breaks the execution of the program (loop, if-statemetnt, etc.),

    • performs the action defined by the global parameter error.

  • A warning just prints out the warning message (or reports it in another way).

  • Global parameter error defines what R should do when an error occurs.

options(error = )
  • You can use simpleError() and simpleWarning() to generate errors and warnings in your code:
f <- function(x) {
if (x < 0) {
x <- abs(x)
w <- simpleWarning("Value less than 0. Taking abs(x)")
w
}
}
RaukR 2018 . 16/41

Debugging -- What are my Options?

  • Old-school debugging: a lot of print statements

    • print values of your variables at some checkpoints,
    • sometimes fine but often laborious,
    • need to remove/comment out manually after debugging.
  • Dumping frames

    • on error, R state will be saved to a file,
    • file can be read into debugger,
    • values of all variables can be checked,
    • can debug on another machine, e.g. send dump to your colleague!
  • Traceback

    • a list of the recent function calls with values of their params,
  • Step-by-step debugging

    • execute code line by line within the debugger
RaukR 2018 . 17/41

Option 1: Dumping Frames

options(error = quote(dump.frames("testdump", TRUE)))
f <- function(x) {
sin(x)
}
f('test')
## Error in sin(x): non-numeric argument to mathematical function
options(error = NULL)
load("testdump.rda")
# debugger(testdump)
Message: Error in sin(x) : non-numeric argument to mathematical function
Available environments had calls:
1: f("test")

Enter an environment number, or 0 to exit
Selection: 1
Browsing in the environment with call:
f("test")
Called from: debugger.look(ind)
Browse[1]> x
[1] "test"
Browse[1]>
[1] "test"
Browse[1]>
Last empty line brings you back to the environments menu.
RaukR 2018 . 18/41

Option 2: Traceback

f <- function(x) {
log10(x)
}
g <- function(x) {
f(x)
}
g('test')
## Error in log10(x): non-numeric argument to mathematical function
> traceback()
2: f(x) at #2
1: g("test")

traceback() shows what were the function calls and what parameters were passed to them when the error occured.

RaukR 2018 . 19/41

Option 3: Debug step-by-step

Let us define a new function h(x, y):

h <- function(x, y) {
f(x)
f(y)
}

Now, we can use debug() to debug the function in a step-by-step manner:

debug(h)
h('text', 7)
undebug(h)
RaukR 2018 . 20/41

Option 3: Debug step-by-step cted.

n -- execute next line, c -- execute whole function, q -- quit debugger mode.

> debug(h)
> h('text', 7)
debugging in: h("text", 7)
debug at #1: {
f(x)
f(y)
}
RaukR 2018 . 21/41

Option 3: Debug step-by-step cted.

n -- execute next line, c -- execute whole function, q -- quit debugger mode.

> debug(h)
> h('text', 7)
debugging in: h("text", 7)
debug at #1: {
f(x)
f(y)
}
Browse[2]> x
[1] "text"
RaukR 2018 . 21/41

Option 3: Debug step-by-step cted.

n -- execute next line, c -- execute whole function, q -- quit debugger mode.

> debug(h)
> h('text', 7)
debugging in: h("text", 7)
debug at #1: {
f(x)
f(y)
}
Browse[2]> x
[1] "text"
Browse[2]> y
[1] 7
RaukR 2018 . 21/41

Option 3: Debug step-by-step cted.

n -- execute next line, c -- execute whole function, q -- quit debugger mode.

> debug(h)
> h('text', 7)
debugging in: h("text", 7)
debug at #1: {
f(x)
f(y)
}
Browse[2]> x
[1] "text"
Browse[2]> y
[1] 7
Browse[2]> n
debug at #2: f(x)
RaukR 2018 . 21/41

Option 3: Debug step-by-step cted.

n -- execute next line, c -- execute whole function, q -- quit debugger mode.

> debug(h)
> h('text', 7)
debugging in: h("text", 7)
debug at #1: {
f(x)
f(y)
}
Browse[2]> x
[1] "text"
Browse[2]> y
[1] 7
Browse[2]> n
debug at #2: f(x)
Browse[2]> x
[1] "text"
RaukR 2018 . 21/41

Option 3: Debug step-by-step cted.

n -- execute next line, c -- execute whole function, q -- quit debugger mode.

> debug(h)
> h('text', 7)
debugging in: h("text", 7)
debug at #1: {
f(x)
f(y)
}
Browse[2]> x
[1] "text"
Browse[2]> y
[1] 7
Browse[2]> n
debug at #2: f(x)
Browse[2]> x
[1] "text"
Browse[2]> n
Error in log10(x) : non-numeric argument to mathematical function
RaukR 2018 . 21/41

Profiling -- proc.time()

Profiling is the process of identifying memory and time bottlenecks in your code.

proc.time()
## user system elapsed
## 2.497 0.205 2.931
  • user time -- CPU time charged for the execution of user instructions of the calling process,
  • system time -- CPU time charged for execution by the system on behalf of the calling process,
  • elapsed time -- total CPU time elapsed for the currently running R process.
pt1 <- proc.time()
tmp <- runif(n = 10e5)
pt2 <- proc.time()
pt2 - pt1
## user system elapsed
## 0.035 0.003 0.038
RaukR 2018 . 22/41

Profiling -- system.time()

system.time(runif(n = 10e6))
system.time(rnorm(n = 10e6))
## user system elapsed
## 0.388 0.026 0.420
## user system elapsed
## 0.677 0.010 0.695
RaukR 2018 . 23/41

Profiling in Action

Let's see profiling in action! We will define four functions that fill a large vector in two different ways:

RaukR 2018 . 24/41

Profiling in Action

Let's see profiling in action! We will define four functions that fill a large vector in two different ways:

fun_fill_loop1 <- function(n = 10e6, f) {
result <- NULL
for (i in 1:n) {
result <- c(result, eval(call(f, 1)))
}
return(result)
}
RaukR 2018 . 24/41

Profiling in Action

Let's see profiling in action! We will define four functions that fill a large vector in two different ways:

fun_fill_loop1 <- function(n = 10e6, f) {
result <- NULL
for (i in 1:n) {
result <- c(result, eval(call(f, 1)))
}
return(result)
}
fun_fill_loop2 <- function(n = 10e6, f) {
result <- vector(length = n)
for (i in 1:n) {
result[i] <- eval(call(f, 1))
}
return(result)
}
RaukR 2018 . 24/41

Profiling in Action cted.

It is maybe better to use...

RaukR 2018 . 25/41

Profiling in Action cted.

It is maybe better to use...vectorization!

RaukR 2018 . 25/41

Profiling in Action cted.

It is maybe better to use...vectorization!

fun_fill_vec1 <- function(n = 10e6, f) {
result <- NULL
result <- eval(call(f, n))
return(result)
}
RaukR 2018 . 25/41

Profiling in Action cted.

It is maybe better to use...vectorization!

fun_fill_vec1 <- function(n = 10e6, f) {
result <- NULL
result <- eval(call(f, n))
return(result)
}
fun_fill_vec2 <- function(n = 10e6, f) {
result <- vector(length = n)
result <- eval(call(f, n))
return(result)
}
RaukR 2018 . 25/41

Profiling our functions

system.time(fun_fill_loop1(n = 10e4, "runif")) # Loop 1
system.time(fun_fill_loop2(n = 10e4, "runif")) # Loop 2
system.time(fun_fill_vec1(n = 10e4, "runif")) # Vectorized 1
system.time(fun_fill_vec2(n = 10e4, "runif")) # Vectorized 2
## user system elapsed
## 17.723 10.508 28.850
## user system elapsed
## 0.308 0.037 0.346
## user system elapsed
## 0.004 0.000 0.004
## user system elapsed
## 0.005 0.000 0.005

The system.time() function is not the most accurate though. During the lab, we will experiment with package microbenchmark.

RaukR 2018 . 26/41

More advanced profiling

We can also do a bit more advanced profiling, including the memory profiling, using, e.g. Rprof() function.

And let us summarise:

summary <- summaryRprof('profiler_test.out', memory='both')
datatable(summary$by.self, options=list(pageLength = 10, searching = F, info = F))
#knitr::kable(summary$by.self)
 
RaukR 2018 . 27/41

Profiling -- profr package

There are also packages available that enable even more advanced profiling:

library(profr)
Rprof("profiler_test2.out", interval = 0.01)
tmp <- table(sort(rnorm(1e5)))
Rprof(NULL)
profile_df <- parse_rprof('profiler_test2.out')

This returns a table that can be visualised:

 
RaukR 2018 . 28/41

Profiling -- profr package cted.

We can also use the overloaded ggplot function:

profr::ggplot.profr(profile_df)

RaukR 2018 . 29/41

Profiling with profvis

Yet another nice way to profile your code is by using Hadley Wickham's profvis package:

library(profvis)
profvis({fun_fill_loop2(1e4, 'runif')
fun_fill_vec2(1e4, 'runif')
})
RaukR 2018 . 30/41

Profiling with profvis cted.

Flame Graph
Data
Options ▾
<expr>MemoryTime
library(profvis)
profvis({fun_fill_loop2(1e4, 'runif')
c()
})
0
RaukR 2018 . 31/41

Optimizing your code

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

-- Donald Knuth

RaukR 2018 . 32/41

Ways to optimize the code

  • write it in a more efficient way, e.g. use vectorization or *apply family instead of loops etc.,
  • allocating memory to avoid copy-on-modify,
  • use package BLAS for linear algebra,
  • use bigmemory package,
  • GPU computations,
  • multicore support, e.g. multicore, snow
  • use data.table or tibble instead of data.frame
RaukR 2018 . 33/41

Copy-on-modify

library(pryr)
order <- 1024
matrix_A <- matrix(rnorm(order^2), nrow = order)
matrix_B <- matrix_A
RaukR 2018 . 34/41

Copy-on-modify

library(pryr)
order <- 1024
matrix_A <- matrix(rnorm(order^2), nrow = order)
matrix_B <- matrix_A

Check where the objects are in the memory:

RaukR 2018 . 34/41

Copy-on-modify

library(pryr)
order <- 1024
matrix_A <- matrix(rnorm(order^2), nrow = order)
matrix_B <- matrix_A

Check where the objects are in the memory:

address(matrix_A)
address(matrix_B)
## [1] "0x112e5a000"
## [1] "0x112e5a000"
RaukR 2018 . 34/41

Copy-on-modify

library(pryr)
order <- 1024
matrix_A <- matrix(rnorm(order^2), nrow = order)
matrix_B <- matrix_A

Check where the objects are in the memory:

address(matrix_A)
address(matrix_B)
## [1] "0x112e5a000"
## [1] "0x112e5a000"

What happens if we modify a value in one of the matrices?

RaukR 2018 . 34/41

Copy-on-modify

library(pryr)
order <- 1024
matrix_A <- matrix(rnorm(order^2), nrow = order)
matrix_B <- matrix_A

Check where the objects are in the memory:

address(matrix_A)
address(matrix_B)
## [1] "0x112e5a000"
## [1] "0x112e5a000"

What happens if we modify a value in one of the matrices?

matrix_B[1,1] <- 1
address(matrix_A)
address(matrix_B)
## [1] "0x112e5a000"
## [1] "0x11365b000"
RaukR 2018 . 34/41

Avoid copying by allocating memory

No memory allocation

f1 <- function(to = 3, silent=F) {
tmp <- c()
for (i in 1:to) {
a1 <- address(tmp)
tmp <- c(tmp, i)
a2 <- address(tmp)
if(!silent) { print(paste0(a1, " --> ", a2)) }
}
}
f1()
## [1] "0x7f83af012b78 --> 0x7f83b4d25538"
## [1] "0x7f83b4d25538 --> 0x7f83b4d25898"
## [1] "0x7f83b4d25898 --> 0x7f83b474cf88"
RaukR 2018 . 35/41

Avoid copying by allocating memory cted.

With allocation

f2 <- function(to = 3, silent = FALSE) {
tmp <- vector(length = to, mode='numeric')
for (i in 1:to) {
a1 <- address(tmp)
tmp[i] <- i
a2 <- address(tmp)
if(!silent) { print(paste0(a1, " --> ", a2)) }
}
}
f2()
## [1] "0x7f83b5805808 --> 0x7f83b5805808"
## [1] "0x7f83b5805808 --> 0x7f83b5805808"
## [1] "0x7f83b5805808 --> 0x7f83b5805808"
RaukR 2018 . 36/41

Allocating memory -- benchmark.

library(microbenchmark)
benchmrk <- microbenchmark(f1(to = 1e3, silent = T),
f2(to = 1e3, silent = T),
times = 100L)
autoplot(benchmrk)

RaukR 2018 . 37/41

GPU

library(gpuR)
library(microbenchmark)
A = matrix(rnorm(1000^2), nrow=1000) # stored: RAM, computed: CPU
B = matrix(rnorm(1000^2), nrow=1000)
gpuA = gpuMatrix(A, type = "double") # stored: RAM, computed: GPU
gpuB = gpuMatrix(B, type = "double")
vclA = vclMatrix(A, type = "double") # stored: GPU, computed: GPU
vclB = vclMatrix(B, type = "double")
bch <- microbenchmark(
cpuC = A %*% B,
gpuC = gpuA %*% gpuB,
vclC = vclA %*% vclB,
times = 10L)
RaukR 2018 . 38/41

GPU cted.

autoplot(bch)

RaukR 2018 . 39/41

Parallelization using package parallel

Easiest to paralllelize is lapply:

result <- lapply(1:2, function(x) { c(x, x^2, x^3) } )
result
## [[1]]
## [1] 1 1 1
##
## [[2]]
## [1] 2 4 8
library(parallel)
num_cores <- detectCores() - 1
cl <- makeCluster(num_cores) # Init cluster
parLapply(cl, 1:2, function(x) { c(x, x^2, x^3)} )
stopCluster(cl)
## [[1]]
## [1] 1 1 1
##
## [[2]]
## [1] 2 4 8
RaukR 2018 . 40/41

Thank you

RaukR 2018 . 41/41

Topics of This Presentation



Code:

  • Style -- __howTo_style.yourCode?

  • Structure -- manufacture your own building blocks.

  • Debugging -- my code does not run.

  • Profiling -- now it does run but... out of memory!

  • Optimization -- making things better.

RaukR 2018 . 2/41
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow