Code: Simplicity or Speed?
While delving into the details of this StackOverflow question
I am trying to generate a vector containing a increasing, reverse series such as
1,2,1,3,2,1,4,3,2,1,5,4,3,2,1.
various solutions arose. The simplest (in terms of number of keystrokes) was provided by user Henrik
rev(sequence(5:1))
which is indeed a simple and elegant solution. However, it wasn't the fastest, as we will soon see.
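To see why this one-liner works, it helps to split it into its two steps. The snippet below is only an illustration; the variable names `steps` and `result` are mine, not part of the original answer.

```r
# sequence(nvec) concatenates seq_len(n) for each n in nvec,
# so sequence(5:1) is 1:5 followed by 1:4, 1:3, 1:2 and 1.
steps <- sequence(5:1)
print(steps)
# [1] 1 2 3 4 5 1 2 3 4 1 2 3 1 2 1

# Reversing the whole vector yields the increasing, reverse series.
result <- rev(steps)
print(result)
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
```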
As with many programming problems, there is a trade-off between code simplicity and speed. One of the first lessons in R (especially if you are moving from other languages) is that it's better to forgo the apparent simplicity of constructs like for loops in favour of optimised functions like the apply family. On the other hand, there are whole libraries (think dplyr and the tidyverse in general) whose primary aims include improving code readability.
With that in mind, let's get back to the StackOverflow example. In addition to Henrik's solution, the ever-present user akrun (with input from others) suggested
unlist(lapply(1:5, ":", 1))
which is also a nice solution that requires a few more keystrokes, but in practice runs faster.
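Broken into its parts, this approach builds each descending run separately and then flattens them. The sketch below uses my own variable names (`pieces`, `result`) purely for illustration.

```r
# lapply() calls `:`(i, 1), that is i:1, for each i in 1:5.
pieces <- lapply(1:5, ":", 1)
str(pieces)  # a list of five descending runs: 1, 2:1, 3:1, 4:1, 5:1

# unlist() flattens the list into a single vector.
result <- unlist(pieces)
print(result)
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
```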
The need for speed...
In trying to provide an alternative answer I went back to basics, looking for a faster implementation. Drawing on what I've learnt while integrating C++ into my googleway package, I came up with a simple for-loop written in Rcpp.
(And hopefully, as the loop is written in C++, all the for-loop-in-R haters will be appeased.)
library(Rcpp)
cppFunction('NumericVector reverseSequence(int maxValue, int vectorLength){
  // vectorLength must equal 1 + 2 + ... + maxValue
  NumericVector out(vectorLength);
  int counter = 0;
  // for each i, append the descending run i, i-1, ..., 1
  for(int i = 1; i <= maxValue; i++){
    for(int j = i; j > 0; j--){
      out[counter] = j;
      counter++;
    }
  }
  return out;
}')
maxValue <- 5
reverseSequence(maxValue, sum(1:maxValue))
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
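The second argument, sum(1:maxValue), is the total number of elements: 1 + 2 + ... + maxValue, which equals maxValue * (maxValue + 1) / 2. As a rough sketch of the same logic without Rcpp, here is a pure-R translation of the double loop. The name `reverse_sequence_r` is hypothetical and only here for illustration; it is not what was benchmarked, and I'd expect it to be slower than the C++ version.

```r
# Pure-R reference version of the same double loop (illustrative only;
# reverse_sequence_r is a made-up name, not part of the original code).
reverse_sequence_r <- function(max_value) {
  # The output holds 1 + 2 + ... + max_value elements.
  out <- integer(max_value * (max_value + 1) / 2)
  counter <- 0L
  for (i in seq_len(max_value)) {
    # Append the descending run i, i-1, ..., 1.
    for (j in i:1) {
      counter <- counter + 1L  # R vectors are 1-indexed
      out[counter] <- j
    }
  }
  out
}

reverse_sequence_r(5)
# [1] 1 2 1 3 2 1 4 3 2 1 5 4 3 2 1
```

Computing the length inside the function also removes the risk of passing a vectorLength that doesn't match maxValue.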
Rcpp provides methods that allow you to easily integrate R and C++, and its speed benefit became clear when I benchmarked it against the two R solutions. The looping C++ implementation is faster most of the time (a median of 1,038 microseconds, compared with 1,901 microseconds (akrun) and 4,994 microseconds (Henrik)).
library(microbenchmark)
maxValue <- 1000
microbenchmark(
  henrik = {
    rev(sequence(maxValue:1))
  },
  akrun = {
    unlist(lapply(1:maxValue, ":", 1))
  },
  symbolix = {
    reverseSequence(maxValue, sum(1:maxValue))
  }
)
# Unit: microseconds
#      expr      min       lq     mean   median       uq      max neval
#    henrik 3788.987 4567.422 7085.908 4993.793 5689.287 35355.34   100
#     akrun 1533.615 1723.819 3302.222 1900.983 2688.463 35944.15   100
#  symbolix  502.540  663.786 2818.100 1037.945 1545.540 33808.83   100
Righto, so which one?
Back to the title of this blog: Simplicity or Speed? Well, I can't answer that for you; you'll have to decide whether the run time saved is worth the time spent designing a longer piece of code. In this case we have a sequence of 1000 and a difference in median run time of about four milliseconds. But if we have a sequence of one million, the impact is much larger.
Because we deal with big data we often look for speed over code simplicity. I like watching my code tick away for a while, but it wears thin if every test takes an hour (or a day) to complete.
If you are still not sure, you can always refer to the repository of all programming wisdom, xkcd:
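One reason the gap widens so quickly is that the output length grows quadratically with the maximum value. The back-of-the-envelope sketch below is my own arithmetic, not a benchmark; `output_length` is a hypothetical helper, and the size estimate assumes the result is stored as doubles (8 bytes each), as a NumericVector is.

```r
# Output length for a given maximum value n is n * (n + 1) / 2.
output_length <- function(n) n * (n + 1) / 2

n <- c(1e3, 1e4, 1e6)
sizes <- data.frame(
  n = n,
  elements = output_length(n),
  # Approximate size if stored as doubles (8 bytes each), in gigabytes.
  approx_gb = output_length(n) * 8 / 1e9
)
print(sizes)
```

For a maximum value of 1000 that is 500,500 elements; for one million it is roughly 500 billion, so at that scale memory becomes a concern as well as run time.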