class: center, middle, inverse, title-slide # Best Coding Practices in R ## Advanced R for Bioinformatics. Visby, 2018 ### Marcin Kierczak ### 28 May, 2018 --- class: spaced name: overview ## Topics of This Presentation <br><br> Code: <br> * **Style** -- __howTo_style.yourCode? * **Structure** -- manufacture your own building blocks. * **Debugging** -- my code does not run. * **Profiling** -- now it does run but... out of memory! * **Optimization** -- making things better. --- name: coding-style ## Coding Style * Naming conventions -- assigning names to variables. * Code formatting -- placement of braces, use of whitespace characters etc. .center[ <img src="./assets/coding_style.jpg" class="fancyimage", style="width:49%; height:49%; box-shadow:0px 0px 0px white"><br> .vsmall[From: [Behind The Lines](http://geekandpoke.typepad.com/geekandpoke/2010/09/behind-the-lines.html) 2010-09-23. By Oliver Widder, Webcomics Geek And Poke.] ] --- name: naming-conventions ## Naming Conventions A syntactically valid name: * Consists of: + letters: `abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ` + digits: `0123456789` + period: `.` + underscore: `_` * begins with a letter or the period (`.`) **not** followed by a number * cannot be one of the *reserved words*: `if`, `else`, `repeat`, `while`, `function`, `for`, `in`, `next`, `break`, `TRUE`, `FALSE`, `NULL`, `Inf`, `NaN`, `NA`, `NA_integer_`, `NA_real_`, `NA_complex_`, `NA_character_` * also cannot be: `c`, `q`, `t`, `C`, `D`, `I` --- name: naming-styles ## Naming Style Variable names that are legal are not necesserily a good style and they may be dangerous: ```r F + T F <- T F + T ``` ``` ## [1] 1 ## [1] 2 ``` do not do this! -- unless you are a politician... <br><br><br> .center[.large[Avoid `T` and `F` as variable names.]] --- ## Customary Variable Names Also, there is a number of variable names that are traditionally used to name particular variables: * `usr` -- user, * `pwd` -- password, * `x`, `y`, `z` -- vectors, * `w` -- weights, * `f`, `g` -- functions, * `n` -- number of rows, * `p` -- number of columns, * `i`, `j`, `k` -- indexes, * `df` -- data frame, * `cnt` -- counter, * `M`, `N`, `W` -- matrices, * `tmp` -- temporary variables Sometimes these are domain-specific: * `p`, `q` -- allele frequencies in genetics, * `N`, `k` -- number of trials and number of successes in stats <br><br> .center[.large[Try to avoid use these in this way to avoid possible confusion.]] --- ## Different Notations People use different notation styles hroughout their code: * `snake_notation_looks_like_this`, * `camelNotationLooksLikeThis`, * `period.notation.looks.like.this`, * `LousyNotation_looks.likeThis` Try to be consistent and stick to one of them. Bear in mind `period.notation` is used by S3 classes to create generic functions, e.g. `plot.my.object`. A reason to avoid it? .center[***] It is also important to maintain code readability by having your variable names: * informative, e.g. `genotypes` vs. `fsjht45jkhsdf4`, * consistent across your code - the same naming convention, * not too long, e.g. - `weight` vs. `phenotype.weight.measured`, * in the period notation and the snake notation avoid `my.var.2` or `my_var_2`, use `my.var2` and `my_var2` instead, --- ## Special Variable Names Few more things to consider: * there are built-in variable names: + LETTERS: the 26 upper-case letters of the Roman alphabet + letters: the 26 lower-case letters of the Roman alphabet + month.abb: the three-letter abbreviations for the English month names + month.name: the English names for the months of the year + pi: the ratio of the circumference of a circle to its diameter * variable names beginning with period are **hidden**: `.my_secret_variable` will not be shown but can be accessed. --- name: structuring_your_code ## Structure Your Code Decompose the problem! .center[ <img src="./assets/Philip-ii-of-macedon.jpg" class="fancyimage", style="height:200px; box-shadow:0px 0px 0px white"> <img src="./assets/Julius_Ceasar.jpg" class="fancyimage", style="height:200px; box-shadow:0px 0px 0px white"> <img src="./assets/Napoleon_Bonaparte.jpg" class="fancyimage", style="height:200px; box-shadow:0px 0px 0px white"><br> .vsmall[source: Wikimedia Commons] ] -- * *Divide et impera* / top-down approach -- split your BIG problem into small subproblems recursively and, **at some level**, encapsulate your code in functional blocks (functions). * A function should be performing a small task. Should be a logical program unit. **When should I write a function?** * one screen rule (resolution...), * re-use twice rule. Consider creating an S4 class -- data-type safety! --- name: how_to_write_functions ## How to write functions * Avoid accessing (and modifying) globals! * Use data as the very first argument (pipes). * Set parameters to defaults -- better more params than too few. * Remember that global defaults can be changed by `options`. * If you are re-using someone else's function -- write a wrapper. * Showing progress and messages is good, but let the others turn this off. * If you are calling other functions, consider using `...` .center[ <img src="./assets/goto.png" class="fancyimage", style="height:230px; box-shadow:0px 0px 0px white"><br> .vsmall[source: http://www.xkcd/com/292] ] --- name: debugging ## Debugging Your Code * Sooner or later ALL of us, even the most experienced programmers, introduce errors to their code. * *20 percent of the code has 80 percent of the errors. Find them, fix them!* .right[*-- Lowell Arthur*] * *Beware of bugs in the above code; I have only proved it correct, not tried it.* .right[*-- Donald Knuth*] * The process of debugging is about confirming, one-by-one, that our beliefs about the code are actually true (loosely inspired by Pete Salzman). * Debug in a *top-down* and *modular* manner! provided you have coded in a modular way... <img src="assets/konqui_debugging.png" style="display:block; width:30%; margin-left:auto; margin-right:auto;"> --- name: types_of_bugs ## Types of bugs ### There are different types of bugs we can introduce: * Syntax -- `prin(var1), mean(sum(seq((x + 2) * (y - 9 * b)))` * Arithmetic -- `x/0` (not in R, though!) `Inf/Inf` * Type -- `mean('a')` * Logic -- everything works and produces seemingly valid output that is WRONG! ### To avoid bugs: * Encapsulate your code in smaller units (functions), you can test. * Use classes and type checking. * Test at the boundaries, e.g. loops at min and max value. * Feed your functions with test data that should result with a known output. * Use *antibugging*: `stopifnot(y <= 75)` --- name: arithmetic_bugs ## Arithmetic bugs ```r (vec <- seq(0.1, 0.9, by=0.1)) vec == 0.7 vec == 0.5 (0.5 + 0.1) - 0.6 (0.7 + 0.1) - 0.8 ``` ``` ## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 ## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE ## [1] 0 ## [1] -1.110223e-16 ``` Beware of floating point arithmetics! .small[ ```r head(unlist(.Machine)) head(unlist(.Platform)) ``` ``` ## double.eps double.neg.eps double.xmin double.xmax double.base ## 2.220446e-16 1.110223e-16 2.225074e-308 1.797693e+308 2.000000e+00 ## double.digits ## 5.300000e+01 ## OS.type file.sep dynlib.ext GUI endian pkgType ## "unix" "/" ".so" "unknown" "little" "source" ``` ] --- name: error_handling ## Handling Errors ```r input <- c(1, 10, -7, -2/5, 0, 'char', 100, pi) for (val in input) { (paste0('Log of ', val, 'is ', log10(val))) } ``` ``` ## Error in log10(val): non-numeric argument to mathematical function ``` One option is to use the `try` block: ```r for (val in input) { val <- as.numeric(val) try(print(paste0('Log of ', val, ' is ', log10(val)))) } ``` ``` ## [1] "Log of 1 is 0" ## [1] "Log of 10 is 1" ## [1] "Log of -7 is NaN" ## [1] "Log of -0.4 is NaN" ## [1] "Log of 0 is -Inf" ## [1] "Log of NA is NA" ## [1] "Log of 100 is 2" ## [1] "Log of 3.14159265358979 is 0.497149872694133" ``` --- name: error_handling_cted ## Handling Errors with `tryCatch` ```r for (val in input) { val <- as.numeric(val) result <- tryCatch(log10(val), warning = function(w) { print('Negative argument supplied. Negating.'); log10(-val) }, error = function(e) { print('Not a number!'); NaN }) print(paste0('Log of ', val, ' is ', result)) } ``` ``` ## [1] "Log of 1 is 0" ## [1] "Log of 10 is 1" ## [1] "Negative argument supplied. Negating." ## [1] "Log of -7 is 0.845098040014257" ## [1] "Negative argument supplied. Negating." ## [1] "Log of -0.4 is -0.397940008672038" ## [1] "Log of 0 is -Inf" ## [1] "Log of NA is NA" ## [1] "Log of 100 is 2" ## [1] "Log of 3.14159265358979 is 0.497149872694133" ``` --- ## Debugging -- Errors and Warnings * An error in your code will result in a call to the `stop()` function that: + breaks the execution of the program (loop, if-statemetnt, etc.), + performs the action defined by the global parameter `error`. * A warning just prints out the warning message (or reports it in another way). * Global parameter `error` defines what R should do when an error occurs. ```r options(error = ) ``` * You can use `simpleError()` and `simpleWarning()` to generate errors and warnings in your code: ```r f <- function(x) { if (x < 0) { x <- abs(x) w <- simpleWarning("Value less than 0. Taking abs(x)") w } } ``` --- ## Debugging -- What are my Options? * Old-school debugging: a lot of `print` statements + print values of your variables at some checkpoints, + sometimes fine but often laborious, + need to remove/comment out manually after debugging. * Dumping frames + on error, R state will be saved to a file, + file can be read into debugger, + values of all variables can be checked, + can debug on another machine, e.g. send dump to your colleague! * Traceback + a list of the recent function calls with values of their params, * Step-by-step debugging + execute code line by line within the debugger --- name: debugging_dump_frames ## Option 1: Dumping Frames ```r options(error = quote(dump.frames("testdump", TRUE))) f <- function(x) { sin(x) } f('test') ``` ``` ## Error in sin(x): non-numeric argument to mathematical function ``` ```r options(error = NULL) load("testdump.rda") # debugger(testdump) ``` .small[<tt>Message: Error in sin(x) : non-numeric argument to mathematical function <br> Available environments had calls: <br> 1: f("test") <br> <br> Enter an environment number, or 0 to exit <br> Selection: 1 <br> Browsing in the environment with call: <br> f("test") <br> Called from: debugger.look(ind) <br> Browse[1]> x <br> [1] "test" <br> Browse[1]> <br> [1] "test" <br> Browse[1]> </tt> ] Last empty line brings you back to the environments menu. --- name: debugging_traceback ## Option 2: Traceback ```r f <- function(x) { log10(x) } g <- function(x) { f(x) } g('test') ``` ``` ## Error in log10(x): non-numeric argument to mathematical function ``` <tt> > traceback()<br> 2: f(x) at #2<br> 1: g("test")<br> </tt> `traceback()` shows what were the function calls and what parameters were passed to them when the error occured. --- ## Option 3: Debug step-by-step Let us define a new function `h(x, y)`: ```r h <- function(x, y) { f(x) f(y) } ``` Now, we can use `debug()` to debug the function in a step-by-step manner: ```r debug(h) h('text', 7) undebug(h) ``` --- ## Option 3: Debug step-by-step cted. `n` -- execute next line, `c` -- execute whole function, `q` -- quit debugger mode. <tt> > debug(h)<br> > h('text', 7)<br> debugging in: h("text", 7)<br> debug at #1: {<br> f(x)<br> f(y)<br> }<br> </tt> -- <tt>Browse[2]> x<br> [1] "text"<br></tt> -- <tt>Browse[2]> y<br> [1] 7<br></tt> -- <tt>Browse[2]> n<br> debug at #2: f(x)<br></tt> -- <tt>Browse[2]> x<br> [1] "text"<br></tt> -- <tt>Browse[2]> n<br> Error in log10(x) : non-numeric argument to mathematical function<br> </tt> --- name: profiling_proc_time ## Profiling -- `proc.time()` Profiling is the process of identifying memory and time bottlenecks in your code. ```r proc.time() ``` ``` ## user system elapsed ## 2.497 0.205 2.931 ``` * `user time` -- CPU time charged for the execution of user instructions of the calling process, * `system time` -- CPU time charged for execution by the system on behalf of the calling process, * `elapsed time` -- total CPU time elapsed for the currently running R process. ```r pt1 <- proc.time() tmp <- runif(n = 10e5) pt2 <- proc.time() pt2 - pt1 ``` ``` ## user system elapsed ## 0.035 0.003 0.038 ``` --- name: profiling_system_time ## Profiling -- `system.time()` ```r system.time(runif(n = 10e6)) system.time(rnorm(n = 10e6)) ``` ``` ## user system elapsed ## 0.388 0.026 0.420 ## user system elapsed ## 0.677 0.010 0.695 ``` --- name: profiling_system_time ## Profiling in Action Let's see profiling in action! We will define four functions that fill a large vector in two different ways: -- ```r fun_fill_loop1 <- function(n = 10e6, f) { result <- NULL for (i in 1:n) { result <- c(result, eval(call(f, 1))) } return(result) } ``` -- ```r fun_fill_loop2 <- function(n = 10e6, f) { result <- vector(length = n) for (i in 1:n) { result[i] <- eval(call(f, 1)) } return(result) } ``` --- ## Profiling in Action cted. It is maybe better to use... -- vectorization! -- ```r fun_fill_vec1 <- function(n = 10e6, f) { result <- NULL result <- eval(call(f, n)) return(result) } ``` -- ```r fun_fill_vec2 <- function(n = 10e6, f) { result <- vector(length = n) result <- eval(call(f, n)) return(result) } ``` --- name: compare_loop_vec_sys_time ## Profiling our functions ```r system.time(fun_fill_loop1(n = 10e4, "runif")) # Loop 1 system.time(fun_fill_loop2(n = 10e4, "runif")) # Loop 2 system.time(fun_fill_vec1(n = 10e4, "runif")) # Vectorized 1 system.time(fun_fill_vec2(n = 10e4, "runif")) # Vectorized 2 ``` ``` ## user system elapsed ## 17.723 10.508 28.850 ## user system elapsed ## 0.308 0.037 0.346 ## user system elapsed ## 0.004 0.000 0.004 ## user system elapsed ## 0.005 0.000 0.005 ``` The `system.time()` function is not the most accurate though. During the lab, we will experiment with package `microbenchmark`. --- name: Rprof ## More advanced profiling We can also do a bit more advanced profiling, including the memory profiling, using, e.g. `Rprof()` function. And let us summarise: .small[ ```r summary <- summaryRprof('profiler_test.out', memory='both') datatable(summary$by.self, options=list(pageLength = 10, searching = F, info = F)) #knitr::kable(summary$by.self) ```
] --- name: profr_package ## Profiling -- `profr` package There are also packages available that enable even more advanced profiling: ```r library(profr) Rprof("profiler_test2.out", interval = 0.01) tmp <- table(sort(rnorm(1e5))) Rprof(NULL) profile_df <- parse_rprof('profiler_test2.out') ``` This returns a table that can be visualised:
--- name: profr_package_cted ## Profiling -- `profr` package cted. We can also use the overloaded `ggplot` function: ```r profr::ggplot.profr(profile_df) ``` <img src="" style="display: block; margin: auto;" /> --- name: profviz_package ## Profiling with `profvis` Yet another nice way to profile your code is by using Hadley Wickham's `profvis` package: ```r library(profvis) profvis({fun_fill_loop2(1e4, 'runif') fun_fill_vec2(1e4, 'runif') }) ``` --- name: profvis_run ## Profiling with `profvis` cted.
--- name: optimizing_code1 ## Optimizing your code *We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.* <br><br> *-- Donald Knuth* <div class="pull-left"> <img src="./assets/xkcd_automation.png" style="height:300px;"> <br> .vsmall[source: http://www.xkcd/com/1319] </div> <div class="pull-right"> <img src="./assets/xkcd_is_it_worth_the_time_2x.png" style="height:300px;"> <br> .vsmall[source: http://www.xkcd/com/1205] </div> --- name: optimization_types ## Ways to optimize the code * write it in a more efficient way, e.g. use vectorization or `*apply` family instead of loops etc., * allocating memory to avoid copy-on-modify, * use package `BLAS` for linear algebra, * use `bigmemory` package, * GPU computations, * multicore support, e.g. `multicore`, `snow` * use `data.table` or `tibble` instead of `data.frame` --- name: copy-on-modify ## Copy-on-modify ```r library(pryr) order <- 1024 matrix_A <- matrix(rnorm(order^2), nrow = order) matrix_B <- matrix_A ``` -- Check where the objects are in the memory: -- ```r address(matrix_A) address(matrix_B) ``` ``` ## [1] "0x112e5a000" ## [1] "0x112e5a000" ``` -- What happens if we modify a value in one of the matrices? -- ```r matrix_B[1,1] <- 1 address(matrix_A) address(matrix_B) ``` ``` ## [1] "0x112e5a000" ## [1] "0x11365b000" ``` --- ## Avoid copying by allocating memory ### No memory allocation ```r f1 <- function(to = 3, silent=F) { tmp <- c() for (i in 1:to) { a1 <- address(tmp) tmp <- c(tmp, i) a2 <- address(tmp) if(!silent) { print(paste0(a1, " --> ", a2)) } } } f1() ``` ``` ## [1] "0x7f83af012b78 --> 0x7f83b4d25538" ## [1] "0x7f83b4d25538 --> 0x7f83b4d25898" ## [1] "0x7f83b4d25898 --> 0x7f83b474cf88" ``` --- ## Avoid copying by allocating memory cted. ### With allocation ```r f2 <- function(to = 3, silent = FALSE) { tmp <- vector(length = to, mode='numeric') for (i in 1:to) { a1 <- address(tmp) tmp[i] <- i a2 <- address(tmp) if(!silent) { print(paste0(a1, " --> ", a2)) } } } f2() ``` ``` ## [1] "0x7f83b5805808 --> 0x7f83b5805808" ## [1] "0x7f83b5805808 --> 0x7f83b5805808" ## [1] "0x7f83b5805808 --> 0x7f83b5805808" ``` --- ## Allocating memory -- benchmark. ```r library(microbenchmark) benchmrk <- microbenchmark(f1(to = 1e3, silent = T), f2(to = 1e3, silent = T), times = 100L) autoplot(benchmrk) ``` <img src="" style="display: block; margin: auto auto auto 0;" /> --- ## GPU ```r library(gpuR) library(microbenchmark) A = matrix(rnorm(1000^2), nrow=1000) # stored: RAM, computed: CPU B = matrix(rnorm(1000^2), nrow=1000) gpuA = gpuMatrix(A, type = "double") # stored: RAM, computed: GPU gpuB = gpuMatrix(B, type = "double") vclA = vclMatrix(A, type = "double") # stored: GPU, computed: GPU vclB = vclMatrix(B, type = "double") bch <- microbenchmark( cpuC = A %*% B, gpuC = gpuA %*% gpuB, vclC = vclA %*% vclB, times = 10L) ``` .small[ More on [Charles Determan's Blog](https://www.r-bloggers.com/r-gpu-programming-for-all-with-gpur/). ] --- ## GPU cted. ```r autoplot(bch) ``` <img src="" style="display: block; margin: auto;" /> --- ## Parallelization using package `parallel` Easiest to paralllelize is `lapply`: ```r result <- lapply(1:2, function(x) { c(x, x^2, x^3) } ) result ``` ``` ## [[1]] ## [1] 1 1 1 ## ## [[2]] ## [1] 2 4 8 ``` ```r library(parallel) num_cores <- detectCores() - 1 cl <- makeCluster(num_cores) # Init cluster parLapply(cl, 1:2, function(x) { c(x, x^2, x^3)} ) stopCluster(cl) ``` ``` ## [[1]] ## [1] 1 1 1 ## ## [[2]] ## [1] 2 4 8 ``` --- name: end-slide class: end-slide <h2 style="color:#fff"> Thank you</h2>