RaukR 2025 • R Beyond the Basics
Marcin Kierczak
08-May-2025
Everything works and produces seemingly valid output that is WRONG!
IMHO those are the hardest 💀 to debug!
stopifnot(y <= 75)
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[1] 0
[1] -1.110223e-16
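A comparison like the following can produce output of this kind (a hedged sketch; the variable name vec is assumed, not taken from the original code):

vec <- seq(0.1, 0.9, by = 0.1)   # assumed input vector
vec
vec == 0.7                       # all FALSE: 0.7 is not stored exactly
vec == 0.5                       # TRUE only at position 5
0.5 - vec[5]                     # exactly zero
0.7 - vec[7]                     # a tiny non-zero remainder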
💀 Beware of floating point arithmetic! 💀
Comparing floating point numbers:
double.eps double.neg.eps double.xmin double.xmax double.base
2.220446e-16 1.110223e-16 2.225074e-308 1.797693e+308 2.000000e+00
double.digits
5.300000e+01
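Instead of testing floating point numbers for exact equality, compare them with a tolerance, for example:

a <- 0.7
b <- seq(0.1, 0.9, by = 0.1)[7]

a == b                                 # FALSE: exact comparison fails
isTRUE(all.equal(a, b))                # TRUE: all.equal() compares with a tolerance
abs(a - b) < .Machine$double.eps^0.5   # TRUE: explicit tolerance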
Let us generate some errors:
input <- c(1, 10, -7, -2/5, 0, 'char', 100, pi, NaN)  # 'char' coerces the whole vector to character
for (val in input) {
  print(paste0('Log of ', val, ' is ', log10(val)))
}
Error in log10(val): non-numeric argument to mathematical function
So, how to handle this mess?
try
[1] "Log of 1 is 0"
[2] "Log of 10 is 1"
[3] "Log of -7 is NaN"
[4] "Log of -0.4 is NaN"
[5] "Log of 0 is -Inf"
[6] "Log of char is NA"
[7] "Log of 100 is 2"
[8] "Log of 3.14159265358979 is 0.497149872694133"
[9] "Log of NaN is NaN"
tryCatch
[1] "Log of 1 is 0"
[1] "Log of 10 is 1"
[1] "Warning! Negative argument supplied. Negating."
[1] "Log of -7 is 0.845098040014257"
[1] "Warning! Negative argument supplied. Negating."
[1] "Log of -0.4 is -0.397940008672038"
[1] "Log of 0 is -Inf"
[1] "Log of NA is NA"
[1] "Log of 100 is 2"
[1] "Log of 3.14159265358979 is 0.497149872694133"
[1] "Log of NaN is NaN"
stop() is a function that generates an error.

One simple, old-school way to debug is to sprinkle print statements through the code.
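When an error occurs, the call frames can also be dumped to a file and inspected later with debugger(), which presents a menu of environments. A hedged sketch following the pattern documented in ?dump.frames (the function name buggy and the file name testdump are illustrative):

buggy <- function(x) x + 1   # hypothetical function that fails on bad input

options(error = quote(dump.frames(dumpto = "testdump", to.file = TRUE)))
buggy("not a number")        # error: the frames are written to testdump.rda
options(error = NULL)        # restore the default error handling

load("testdump.rda")         # creates the object `testdump`
debugger(testdump)           # choose an environment to inspect its variables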
Hint: Last empty line brings you back to the environments menu.
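For example, a nested call like the following produces an error deep in the call stack (a hedged reconstruction of the demo whose output is shown below):

f <- function(x) {
  log10(x)         # fails for non-numeric x
}
g <- function(x) {
  f(x)             # this call shows up as 'f(x) at #2' in the traceback
}
g("test")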
Error in log10(x): non-numeric argument to mathematical function
> traceback()
2: f(x) at #2
1: g("test")
traceback() shows what the function calls were and what arguments were passed to them when the error occurred.
proc.time()

Profiling is the process of identifying memory and time bottlenecks 🍾 in your code. proc.time() reports:

- user time – the CPU time charged for the execution of user instructions of the calling process,
- system time – the CPU time charged for execution by the system on behalf of the calling process,
- elapsed time – the real (wall-clock) time elapsed for the currently running R process.

system.time()
user system elapsed
0.268 0.016 0.285
user system elapsed
0.447 0.020 0.468
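system.time() wraps any expression and reports these three numbers. As an illustration (not the code that produced the timings above):

system.time(sqrt(runif(1e6)))            # a vectorised computation

system.time({                            # the same work done in an explicit loop
  x <- numeric(1e6)
  for (i in seq_len(1e6)) x[i] <- sqrt(runif(1))
})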
These 4 functions fill a large vector with values supplied by function f:

- fn1 – loop without memory allocation,
- fn2 – loop with memory allocation.

But maybe it is better to use… vectorization!

- fn3 – vectorized, without memory allocation,
- fn4 – vectorized, with memory allocation.
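Hedged sketches of what such functions might look like (the exact definitions behind the timings below may differ):

fun_fill_loop1 <- function(n, f) {   # 1 - loop, no pre-allocation: the vector grows
  vec <- c()
  for (i in seq_len(n)) vec <- c(vec, do.call(f, list(1)))
  vec
}

fun_fill_loop2 <- function(n, f) {   # 2 - loop, vector pre-allocated
  vec <- numeric(n)
  for (i in seq_len(n)) vec[i] <- do.call(f, list(1))
  vec
}

fun_fill_vec1 <- function(n, f) {    # 3 - vectorized call, no pre-allocation
  vec <- c()
  vec <- do.call(f, list(n))
  vec
}

fun_fill_vec2 <- function(n, f) {    # 4 - vectorized call, vector pre-allocated
  vec <- numeric(n)
  vec <- do.call(f, list(n))
  vec
}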
fn | user.self | sys.self | elapsed |
---|---|---|---|
fn1 | 7.644 | 0.166 | 7.811 |
fn2 | 0.479 | 0.000 | 0.480 |
fn3 | 0.001 | 0.000 | 0.002 |
fn4 | 0.002 | 0.000 | 0.002 |
The system.time() function is not the most accurate, though. During the lab, we will experiment with the microbenchmark package.
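An illustrative microbenchmark call (not taken from the lab material): each expression is run many times and the distribution of timings is reported.

library(microbenchmark)

microbenchmark(
  loop       = { v <- numeric(1e4); for (i in 1:1e4) v[i] <- sqrt(i) },
  vectorised = sqrt(1:1e4),
  times      = 100L
)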
We can also do a bit more advanced profiling, including memory profiling, using e.g. the Rprof() function.
Rprof('profiler_test.out', interval = 0.01, memory.profiling = TRUE)
for (i in 1:5) {
  result <- fun_fill_loop2(n = 10e4, "runif")
  print(head(result))
}
Rprof(NULL)
[1] 0.6536342 0.6183371 0.6189789 0.3408437 0.2664858 0.6185547
[1] 0.6383574 0.5914124 0.2921400 0.3233635 0.9352206 0.1025050
[1] 0.98932042 0.03798655 0.64528481 0.83488187 0.14626268 0.32709305
[1] 0.63197483 0.32713220 0.34499497 0.32908306 0.26927382 0.09823248
[1] 0.569201405 0.860486382 0.096056464 0.941512789 0.001748405 0.055280883
And let us summarise:
summary <- summaryRprof("profiler_test.out", memory = "both")
knitr::kable(summary$by.self)
unlink("profiler_test.out")
function | self.time | self.pct | total.time | total.pct | mem.total |
---|---|---|---|---|---|
“eval” | 1.21 | 50.21 | 2.41 | 100.00 | 1606.6 |
“force” | 0.50 | 20.75 | 0.66 | 27.39 | 464.5 |
“runif” | 0.28 | 11.62 | 0.28 | 11.62 | 175.6 |
“fun_fill_loop2” | 0.26 | 10.79 | 2.41 | 100.00 | 1606.6 |
“parent.frame” | 0.07 | 2.90 | 0.07 | 2.90 | 54.4 |
“is.list” | 0.05 | 2.07 | 0.05 | 2.07 | 30.0 |
“is.pairlist” | 0.03 | 1.24 | 0.03 | 1.24 | 18.3 |
“baseenv” | 0.01 | 0.41 | 0.01 | 0.41 | 9.5 |
There are also packages available that enable even more advanced profiling:

The profr package
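profr() wraps an expression and samples the call stack at regular intervals. An illustrative call (the expression profiled for the table below is not reproduced verbatim):

library(profr)

profr_tab <- profr(example(glm))   # profile an expression that runs long enough to be sampled
head(profr_tab)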
This returns a table that can be visualised:
level | g_id | t_id | f | start | end | n | leaf | time | source |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | .main | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
2 | 1 | 1 | execute | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
3 | 1 | 1 | rmarkdown::render | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
4 | 1 | 1 | knitr::knit | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
5 | 1 | 1 | process_file | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
6 | 1 | 1 | xfun:::handle_error | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
7 | 1 | 1 | withCallingHandlers | 0 | 0.28 | 1 | FALSE | 0.28 | base |
8 | 1 | 1 | process_group | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
9 | 1 | 1 | call_block | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
10 | 1 | 1 | block_exec | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
11 | 1 | 1 | eng_r | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
12 | 1 | 1 | in_input_dir | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
13 | 1 | 1 | in_dir | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
14 | 1 | 1 | evaluate | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
15 | 1 | 1 | evaluate::evaluate | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
16 | 1 | 1 | withRestarts | 0 | 0.28 | 1 | FALSE | 0.28 | base |
17 | 1 | 1 | withRestartList | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
18 | 1 | 1 | withOneRestart | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
19 | 1 | 1 | doWithOneRestart | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
20 | 1 | 1 | withRestartList | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
21 | 1 | 1 | withOneRestart | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
22 | 1 | 1 | doWithOneRestart | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
23 | 1 | 1 | with_handlers | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
24 | 1 | 1 | eval | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
25 | 1 | 1 | eval | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
26 | 1 | 1 | withCallingHandlers | 0 | 0.28 | 1 | FALSE | 0.28 | base |
27 | 1 | 1 | withVisible | 0 | 0.28 | 1 | FALSE | 0.28 | base |
28 | 1 | 1 | eval | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
29 | 1 | 1 | eval | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
30 | 1 | 1 | table | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
31 | 1 | 1 | standardGeneric | 0 | 0.28 | 1 | FALSE | 0.28 | base |
32 | 1 | 1 | eval | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
33 | 1 | 1 | eval | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
34 | 1 | 1 | eval | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
35 | 1 | 1 | table | 0 | 0.28 | 1 | FALSE | 0.28 | BiocGenerics |
36 | 1 | 1 | base::table | 0 | 0.28 | 1 | FALSE | 0.28 | NA |
37 | 1 | 1 | factor | 0 | 0.28 | 1 | FALSE | 0.28 | base |
38 | 1 | 1 | unique | 0 | 0.16 | 1 | FALSE | 0.16 | BiocGenerics |
39 | 1 | 1 | unique.default | 0 | 0.16 | 1 | TRUE | 0.16 | base |
The profr package (ctd.)

We can also plot the results using the proftools package.
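For instance, the table returned by profr (the profr_tab object from the sketch above) can be plotted directly:

plot(profr_tab)   # profr's plot method draws the sampled call stack over time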
profvis
Yet another nice way to profile your code is by using Hadley Wickham’s profvis package:
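An illustrative profvis call (assuming the fun_fill_loop2 function used earlier):

library(profvis)

profvis({
  fun_fill_loop2(n = 10e4, "runif")   # opens an interactive flame-graph view
})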
profvis (ctd.)

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be deluded into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.
– Donald Knuth
source: https://xkcd.com/1319
source: https://xkcd.com/1205/
Ways to optimize your code include:

- use the *apply family instead of loops etc.,
- use an optimized BLAS for linear algebra,
- use the bigmemory package,
- parallelize with multicore, snow or futures,
- use data.table or tibble instead of data.frame.
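For example, an explicit loop can often be replaced by a member of the *apply family:

squares_loop <- numeric(10)
for (i in 1:10) squares_loop[i] <- i^2                        # explicit loop

squares_apply <- vapply(1:10, function(i) i^2, numeric(1))    # *apply version
identical(squares_loop, squares_apply)                        # TRUE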
Check where the objects are in the memory:
What happens if we modify a value in one of the matrices?
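A hedged sketch of such a check using pryr::address() (object names are assumed):

library(pryr)                    # provides address()

a <- matrix(rnorm(4), nrow = 2)
b <- a                           # no copy yet: both names point to the same data
address(a)
address(b)                       # same address as a

b[1, 1] <- 0                     # copy-on-modify: b is copied before the change
address(b)                       # now different from address(a)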
No memory allocation
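A hedged sketch of a counterpart without pre-allocation (hypothetical name f1): the vector grows inside the loop, so R reallocates it and its address keeps changing.

f1 <- function(to = 3, silent = FALSE) {
  tmp <- numeric(0)                      # start empty: no pre-allocation
  for (i in 1:to) {
    a1 <- address(tmp)
    tmp <- c(tmp, i)                     # growing the vector forces a reallocation
    a2 <- address(tmp)
    if (!silent) { print(paste0(a1, " --> ", a2)) }
  }
}
f1()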
With memory allocation
f2 <- function(to = 3, silent = FALSE) {
  tmp <- vector(length = to, mode = 'numeric')   # pre-allocate the result
  for (i in 1:to) {
    a1 <- address(tmp)
    tmp[i] <- i                                  # modify in place
    a2 <- address(tmp)
    if (!silent) { print(paste0(a1, " --> ", a2)) }
  }
}
f2()
[1] "0x556ee39c8e78 --> 0x556ee39c8e78"
[1] "0x556ee39c8e78 --> 0x556ee39c8e78"
[1] "0x556ee39c8e78 --> 0x556ee39c8e78"
Computations can also be offloaded to the GPU, for example with the gpuR package:

library(gpuR)              # gpuMatrix() and vclMatrix()
library(microbenchmark)

A <- matrix(rnorm(1000^2), nrow = 1000)   # stored: RAM, computed: CPU
B <- matrix(rnorm(1000^2), nrow = 1000)
gpuA <- gpuMatrix(A, type = "float")      # stored: RAM, computed: GPU
gpuB <- gpuMatrix(B, type = "float")
vclA <- vclMatrix(A, type = "float")      # stored: GPU, computed: GPU
vclB <- vclMatrix(B, type = "float")

bch <- microbenchmark(
  cpu_ram = A %*% B,
  gpu_ram = gpuA %*% gpuB,
  gpu_vcl = vclA %*% vclB,
  times = 10L
)
More on Charles Determan’s Blog.
parallel
The easiest thing to parallelize is lapply():
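A hedged sketch using the parallel package (the worker count and the toy task are illustrative):

library(parallel)

slow_task <- function(i) mean(rnorm(1e6))            # toy unit of work

res_serial <- lapply(1:4, slow_task)                 # ordinary lapply

res_fork <- mclapply(1:4, slow_task, mc.cores = 2)   # forked workers (Unix-alikes only)

cl <- makeCluster(2)                                 # portable socket cluster
res_cluster <- parLapply(cl, 1:4, slow_task)
stopCluster(cl)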