Code:
Style -- __howTo_style.yourCode?
Structure -- manufacture your own building blocks.
Debugging -- my code does not run.
Profiling -- now it does run but... out of memory!
Optimization -- making things better.
Naming conventions -- assigning names to variables.
Code formatting -- placement of braces, use of whitespace characters etc.
From: Behind The Lines 2010-09-23. By Oliver Widder, Webcomics Geek And Poke.
A syntactically valid name:
consists of:
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
. (period)
_ (underscore)
begins with a letter or a period (.) not followed by a number
cannot be one of the reserved words: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_
should also not mask the base functions: c, q, t, C, D, I
Variable names that are legal are not necessarily good style and they may be dangerous:

F + T
F <- T
F + T

## [1] 1
## [1] 2

do not do this! unless you are a politician...
Avoid T and F as variable names.
Also, there are a number of variable names that are traditionally used for particular purposes:

usr -- user,
pwd -- password,
x, y, z -- vectors,
w -- weights,
f, g -- functions,
n -- number of rows,
p -- number of columns,
i, j, k -- indexes,
df -- data frame,
cnt -- counter,
M, N, W -- matrices,
tmp -- temporary variables

Sometimes these are domain-specific:

p, q -- allele frequencies in genetics,
N, k -- number of trials and number of successes in stats
People use different notation styles throughout their code:

snake_notation_looks_like_this,
camelNotationLooksLikeThis,
period.notation.looks.like.this,
LousyNotation_looks.likeThis

Try to be consistent and stick to one of them. Bear in mind that period.notation is used by S3 classes to create generic functions, e.g. plot.my.object. A reason to avoid it?
It is also important to maintain code readability by having your variable names:

informative, e.g. genotypes vs. fsjht45jkhsdf4,
consistent across your code -- the same naming convention,
not too long, e.g. weight vs. phenotype.weight.measured,
in the period notation and the snake notation avoid my.var.2 or my_var_2; use my.var2 and my_var2 instead.
A few more things to consider:
there are built-in variable names:
LETTERS: the 26 upper-case letters of the Roman alphabet
letters: the 26 lower-case letters of the Roman alphabet
month.abb: the three-letter abbreviations for the English month names
month.name: the English names for the months of the year
pi: the ratio of the circumference of a circle to its diameter
variable names beginning with a period are hidden: .my_secret_variable will not be shown but can be accessed.
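A quick illustration of hidden variables:

```r
# Variables whose names start with a period are hidden from ls() by default
.my_secret_variable <- 42
ls()                  # does not list .my_secret_variable
ls(all.names = TRUE)  # now it shows up
.my_secret_variable   # direct access always works
```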
Decompose the problem!
source: Wikimedia Commons
Divide et impera / top-down approach -- split your BIG problem into small subproblems recursively and, at some level, encapsulate your code in functional blocks (functions).
A function should perform one small task and be a logical program unit.
When should I write a function?
Consider creating an S4 class -- data-type safety!
Avoid accessing (and modifying) globals!
Use data as the very first argument (pipes).
Set parameters to defaults -- better more params than too few.
Remember that global defaults can be changed by options
.
If you are re-using someone else's function -- write a wrapper.
Showing progress and messages is good, but let others turn this off.
If you are calling other functions, consider using ...
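A hypothetical function following these guidelines (its name and arguments are invented for illustration): data comes first so it is pipe-friendly, the remaining parameters have defaults, messages can be switched off, and ... is forwarded to the underlying function:

```r
# Hypothetical sketch: data-first argument, defaults, verbosity switch,
# and '...' passed through to the underlying function (here: mean).
summarise_column <- function(data, column = 1, verbose = TRUE, ...) {
  if (verbose) message("Summarising column ", column)
  mean(data[[column]], ...)
}

df <- data.frame(x = c(1, 2, NA, 4))
summarise_column(df, verbose = FALSE, na.rm = TRUE)  # na.rm forwarded to mean()
```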
source: http://www.xkcd.com/292
Sooner or later ALL of us, even the most experienced programmers, introduce errors into our code.
20 percent of the code has 80 percent of the errors. Find them, fix them! -- Lowell Arthur
Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald Knuth
The process of debugging is about confirming, one-by-one, that our beliefs about the code are actually true (loosely inspired by Pete Salzman).
Debug in a top-down and modular manner! provided you have coded in a modular way...
Syntax -- prin(var1), mean(sum(seq((x + 2) * (y - 9 * b)))
Arithmetic -- x/0 (not in R, though!), Inf/Inf
Type -- mean('a')
Logic -- everything works and produces seemingly valid output that is WRONG!
Encapsulate your code in smaller units (functions) that you can test.
Use classes and type checking.
Test at the boundaries, e.g. loops at min and max value.
Feed your functions with test data that should result with a known output.
Use antibugging: stopifnot(y <= 75)
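For instance, a guard with stopifnot() makes a function fail fast, close to the cause (the percentage check here is just an illustrative assumption):

```r
# Antibugging: fail immediately when an input assumption is violated.
to_fraction <- function(y) {
  stopifnot(is.numeric(y), y >= 0, y <= 100)
  y / 100
}

to_fraction(75)     # 0.75
# to_fraction(120)  # Error: y <= 100 is not TRUE
```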
(vec <- seq(0.1, 0.9, by = 0.1))
vec == 0.7
vec == 0.5
(0.5 + 0.1) - 0.6
(0.7 + 0.1) - 0.8

## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [1] 0
## [1] -1.110223e-16
Beware of floating-point arithmetic!
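The usual remedy is to compare floating-point numbers within a tolerance rather than exactly, e.g. via all.equal():

```r
vec <- seq(0.1, 0.9, by = 0.1)

vec[7] == 0.7                                # FALSE: exact comparison fails
isTRUE(all.equal(vec[7], 0.7))               # TRUE: comparison within a tolerance
abs(vec[7] - 0.7) < .Machine$double.eps^0.5  # the same idea done by hand
```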
head(unlist(.Machine))
head(unlist(.Platform))

##     double.eps double.neg.eps    double.xmin    double.xmax    double.base 
##   2.220446e-16   1.110223e-16  2.225074e-308  1.797693e+308   2.000000e+00 
##  double.digits 
##   5.300000e+01 

##   OS.type  file.sep dynlib.ext        GUI     endian    pkgType 
##    "unix"       "/"      ".so"  "unknown"   "little"   "source"
input <- c(1, 10, -7, -2/5, 0, 'char', 100, pi)
for (val in input) {
  print(paste0('Log of ', val, ' is ', log10(val)))
}
## Error in log10(val): non-numeric argument to mathematical function
One option is to use the try
block:
for (val in input) {
  val <- as.numeric(val)
  try(print(paste0('Log of ', val, ' is ', log10(val))))
}
## [1] "Log of 1 is 0"
## [1] "Log of 10 is 1"
## [1] "Log of -7 is NaN"
## [1] "Log of -0.4 is NaN"
## [1] "Log of 0 is -Inf"
## [1] "Log of NA is NA"
## [1] "Log of 100 is 2"
## [1] "Log of 3.14159265358979 is 0.497149872694133"
tryCatch
for (val in input) {
  val <- as.numeric(val)
  result <- tryCatch(log10(val),
    warning = function(w) { print('Negative argument supplied. Negating.'); log10(-val) },
    error = function(e) { print('Not a number!'); NaN })
  print(paste0('Log of ', val, ' is ', result))
}
## [1] "Log of 1 is 0"
## [1] "Log of 10 is 1"
## [1] "Negative argument supplied. Negating."
## [1] "Log of -7 is 0.845098040014257"
## [1] "Negative argument supplied. Negating."
## [1] "Log of -0.4 is -0.397940008672038"
## [1] "Log of 0 is -Inf"
## [1] "Log of NA is NA"
## [1] "Log of 100 is 2"
## [1] "Log of 3.14159265358979 is 0.497149872694133"
An error in your code will result in a call to the stop()
function that:
breaks the execution of the program (loop, if-statement, etc.),
performs the action defined by the global parameter error
.
A warning just prints out the warning message (or reports it in another way).
Global parameter error
defines what R should do when an error occurs.
options(error = )
Use simpleError() and simpleWarning() to generate errors and warnings in your code:

f <- function(x) {
  if (x < 0) {
    x <- abs(x)
    w <- simpleWarning("Value less than 0. Taking abs(x)")
    w
  }
}
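Note that simpleWarning() only creates the condition object; nothing is signalled until it is passed to warning() (or, for errors, stop()). A minimal sketch:

```r
f <- function(x) {
  if (x < 0) {
    # signal the condition; execution continues after a warning
    warning(simpleWarning("Value less than 0. Taking abs(x)"))
    x <- abs(x)
  }
  x
}

f(-3)                    # returns 3, with a warning
suppressWarnings(f(-3))  # the caller can silence it
```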
Old-school debugging: a lot of print
statements
Dumping frames
Traceback
Step-by-step debugging
options(error = quote(dump.frames("testdump", TRUE)))

f <- function(x) {
  sin(x)
}

f('test')
## Error in sin(x): non-numeric argument to mathematical function
options(error = NULL)
load("testdump.rda")
# debugger(testdump)
f <- function(x) {
  log10(x)
}

g <- function(x) {
  f(x)
}

g('test')
## Error in log10(x): non-numeric argument to mathematical function
> traceback()

traceback() shows which function calls were made, and what parameters were passed to them, when the error occurred.
Let us define a new function h(x, y):
h <- function(x, y) {
  f(x)
  f(y)
}
Now, we can use debug()
to debug the function in a step-by-step manner:
debug(h)
h('text', 7)
undebug(h)
n -- execute next line,
c -- execute whole function,
q -- quit debugger mode.
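An alternative to debug() is to place a browser() call at the exact point of interest; a minimal sketch (f and h are illustrative stand-ins for the functions above):

```r
f <- function(x) log10(x)

h <- function(x, y) {
  browser()  # in an interactive session execution pauses here;
             # the same n / c / q commands apply
  f(x) + f(y)
}

h(10, 100)   # in non-interactive runs browser() has no effect
```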
proc.time()
Profiling is the process of identifying memory and time bottlenecks in your code.
proc.time()
##    user  system elapsed 
##   2.497   0.205   2.931
user time -- CPU time charged for the execution of user instructions of the calling process,
system time -- CPU time charged for execution by the system on behalf of the calling process,
elapsed time -- real (wall-clock) time elapsed for the currently running R process.

pt1 <- proc.time()
tmp <- runif(n = 10e5)
pt2 <- proc.time()
pt2 - pt1
##    user  system elapsed 
##   0.035   0.003   0.038
system.time()
system.time(runif(n = 10e6))
system.time(rnorm(n = 10e6))
##    user  system elapsed 
##   0.388   0.026   0.420 
##    user  system elapsed 
##   0.677   0.010   0.695
Let's see profiling in action! We will define four functions that fill a large vector in two different ways:

fun_fill_loop1 <- function(n = 10e6, f) {
  result <- NULL
  for (i in 1:n) {
    result <- c(result, eval(call(f, 1)))
  }
  return(result)
}

fun_fill_loop2 <- function(n = 10e6, f) {
  result <- vector(length = n)
  for (i in 1:n) {
    result[i] <- eval(call(f, 1))
  }
  return(result)
}
It may be better to use... vectorization!

fun_fill_vec1 <- function(n = 10e6, f) {
  result <- NULL
  result <- eval(call(f, n))
  return(result)
}

fun_fill_vec2 <- function(n = 10e6, f) {
  result <- vector(length = n)
  result <- eval(call(f, n))
  return(result)
}
system.time(fun_fill_loop1(n = 10e4, "runif"))  # Loop 1
system.time(fun_fill_loop2(n = 10e4, "runif"))  # Loop 2
system.time(fun_fill_vec1(n = 10e4, "runif"))   # Vectorized 1
system.time(fun_fill_vec2(n = 10e4, "runif"))   # Vectorized 2
##    user  system elapsed 
##  17.723  10.508  28.850 
##    user  system elapsed 
##   0.308   0.037   0.346 
##    user  system elapsed 
##   0.004   0.000   0.004 
##    user  system elapsed 
##   0.005   0.000   0.005
The system.time() function is not the most accurate though. During the lab, we will experiment with the microbenchmark package.
We can also do more advanced profiling, including memory profiling, using, e.g., the Rprof() function.
And let us summarise:
summary <- summaryRprof('profiler_test.out', memory = 'both')
datatable(summary$by.self, options = list(pageLength = 10, searching = F, info = F))
# knitr::kable(summary$by.self)
The profr package

There are also packages available that enable even more advanced profiling:

library(profr)
Rprof("profiler_test2.out", interval = 0.01)
tmp <- table(sort(rnorm(1e5)))
Rprof(NULL)
profile_df <- parse_rprof('profiler_test2.out')
This returns a table that can be visualised:
The profr package (cted.)

We can also use the overloaded ggplot function:

profr::ggplot.profr(profile_df)
profvis
Yet another nice way to profile your code is by using Hadley Wickham's profvis package:

library(profvis)
profvis({
  fun_fill_loop2(1e4, 'runif')
  fun_fill_vec2(1e4, 'runif')
})
profvis (cted.)

[interactive profvis output: memory use and execution time per expression]
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
-- Donald Knuth
*apply family instead of loops etc.,
BLAS for linear algebra,
bigmemory package,
multicore, snow,
data.table or tibble instead of data.frame.
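As a small sketch of the first point, the same computation written as a loop, with sapply(), and fully vectorized (the example data are made up):

```r
x <- 1:10

# explicit loop
res_loop <- numeric(length(x))
for (i in seq_along(x)) res_loop[i] <- x[i]^2

# *apply version
res_apply <- sapply(x, function(v) v^2)

# fully vectorized: usually the fastest and the clearest
res_vec <- x^2
```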
library(pryr)
order <- 1024
matrix_A <- matrix(rnorm(order^2), nrow = order)
matrix_B <- matrix_A

Check where the objects are in the memory:

address(matrix_A)
address(matrix_B)

## [1] "0x112e5a000"
## [1] "0x112e5a000"

What happens if we modify a value in one of the matrices?

matrix_B[1,1] <- 1
address(matrix_A)
address(matrix_B)

## [1] "0x112e5a000"
## [1] "0x11365b000"
f1 <- function(to = 3, silent = F) {
  tmp <- c()
  for (i in 1:to) {
    a1 <- address(tmp)
    tmp <- c(tmp, i)
    a2 <- address(tmp)
    if (!silent) {
      print(paste0(a1, " --> ", a2))
    }
  }
}
f1()

## [1] "0x7f83af012b78 --> 0x7f83b4d25538"
## [1] "0x7f83b4d25538 --> 0x7f83b4d25898"
## [1] "0x7f83b4d25898 --> 0x7f83b474cf88"
f2 <- function(to = 3, silent = FALSE) {
  tmp <- vector(length = to, mode = 'numeric')
  for (i in 1:to) {
    a1 <- address(tmp)
    tmp[i] <- i
    a2 <- address(tmp)
    if (!silent) {
      print(paste0(a1, " --> ", a2))
    }
  }
}
f2()

## [1] "0x7f83b5805808 --> 0x7f83b5805808"
## [1] "0x7f83b5805808 --> 0x7f83b5805808"
## [1] "0x7f83b5805808 --> 0x7f83b5805808"
library(microbenchmark)
benchmrk <- microbenchmark(f1(to = 1e3, silent = T),
                           f2(to = 1e3, silent = T),
                           times = 100L)
autoplot(benchmrk)
library(gpuR)
library(microbenchmark)

A = matrix(rnorm(1000^2), nrow = 1000)  # stored: RAM, computed: CPU
B = matrix(rnorm(1000^2), nrow = 1000)

gpuA = gpuMatrix(A, type = "double")    # stored: RAM, computed: GPU
gpuB = gpuMatrix(B, type = "double")

vclA = vclMatrix(A, type = "double")    # stored: GPU, computed: GPU
vclB = vclMatrix(B, type = "double")

bch <- microbenchmark(
  cpuC = A %*% B,
  gpuC = gpuA %*% gpuB,
  vclC = vclA %*% vclB,
  times = 10L)
More on Charles Determan's Blog.
autoplot(bch)
parallel
Easiest to parallelize is lapply:
result <- lapply(1:2, function(x) { c(x, x^2, x^3) })
result

## [[1]]
## [1] 1 1 1
## 
## [[2]]
## [1] 2 4 8
library(parallel)
num_cores <- detectCores() - 1
cl <- makeCluster(num_cores)  # Init cluster
parLapply(cl, 1:2, function(x) { c(x, x^2, x^3) })
stopCluster(cl)

## [[1]]
## [1] 1 1 1
## 
## [[2]]
## [1] 2 4 8
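On Unix-alikes, mclapply() from the same parallel package is a fork-based shortcut that needs no explicit cluster setup:

```r
library(parallel)

# Fork-based parallel lapply; no makeCluster()/stopCluster() needed.
# Note: forking is not available on Windows, where only mc.cores = 1 works.
result <- mclapply(1:2, function(x) { c(x, x^2, x^3) }, mc.cores = 2)
result
```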