spatial.data.table 1 - Efficient Haversine Calculation
I'm a fan of data.table
, and I'm a fan of StackOverflow. So when I see questions like How to efficiently calculate distance between pair of coordinates using data.table, my intrigue is naturally piqued.
The then-accepted answer used geosphere::distGeo()
inside a data.table
'update-by-reference' function (:=). This seemed llike a good, logical approach (from my years of using data.table
).
It was only when writing this answer a few days later that I started to look into the solution more deeply. For example, it seemed a bit unnecessary to have to convert the columns of the data.table
into a matrix just to use geosphere::distHaversine()
(you'll also see from a comment to my answer that using good ol' c()
didn't work either).
So, I opened up distHaversine
to have a look:
geosphere::distHaversine
function (p1, p2, r = 6378137)
{
toRad <- pi/180
p1 <- .pointsToMatrix(p1) * toRad
p2 <- .pointsToMatrix(p2) * toRad
p = cbind(p1[, 1], p1[, 2], p2[, 1], p2[, 2], as.vector(r))
dLat <- p[, 4] - p[, 2]
dLon <- p[, 3] - p[, 1]
a <- sin(dLat/2) * sin(dLat/2) + cos(p[, 2]) * cos(p[, 4]) *
sin(dLon/2) * sin(dLon/2)
dist <- 2 * atan2(sqrt(a), sqrt(1 - a)) * p[, 5]
return(as.vector(dist))
}
Note that
The arguments
p1
andp2
"can be a vector of two numbers, a matrix of 2 columns, or a SpatialPoints* object".There are two calls to
.pointsToMatrix()
, which again convert the points into a matrix (plus a few other checks).
This seemed a tad unnecessary to me. And having used data.table
on a daily basis for quite a while, I was used to just passing-in unquoted column names into my functions.
Simplifying the functions
The first thing I did was to write my own version (which is unashamedly a copy of geosphere
, but without the matrix conversion).
dt.haversine <- function(lat_from, lon_from, lat_to, lon_to, r = 6378137){
radians <- pi/180
lat_to <- lat_to * radians
lat_from <- lat_from * radians
lon_to <- lon_to * radians
lon_from <- lon_from * radians
dLat <- (lat_to - lat_from)
dLon <- (lon_to - lon_from)
a <- (sin(dLat/2)^2) + (cos(lat_from) * cos(lat_to)) * (sin(dLon/2)^2)
return(2 * atan2(sqrt(a), sqrt(1 - a)) * r)
}
A quick benchmark showed it to be a non-insignificant improvement (but more on the benchmarking to come). And as it turns out, all the geosphere::dist_
calculations use this .pointsToMatrix()
function.
So, I re-wrote most of them without it, packaged it up, et voila, spatialdatatable
is born.
The need for speed
To try and eek out even more speed when working on multiple rows of data at a time, I re-wrote the functions again in C++.*
And, if you don't believe me about the speed improvements, here's some benchmarks:
## install via github
# devtools::install_github("SymbolixAU/spatialdatatable")
library(spatialdatatable)
library(microbenchmark)
library(geosphere)
n <- 100000
set.seed(20170511)
lats <- -90:90
lons <- -180:180
dt <- data.table::data.table(lat1 = sample(lats, size = n, replace = T),
lon1 = sample(lons, size = n, replace = T),
lat2 = sample(lats, size = n, replace = T),
lon2 = sample(lons, size = n, replace = T))
dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
m <- microbenchmark(
sdt = { dt1[, dtDistance := dtHaversine(lat1, lon1, lat2, lon2)] },
geo1 = { dt2[, geoDistance := distHaversine(
matrix(c(lon1, lat1), ncol = 2),
matrix(c(lon2, lat2), ncol = 2))] },
geo2 = { dt3[, geoDistance := distHaversine(dt3[, .(lon1, lat1)], dt3[, .(lon2, lat2)])] }
)
* A word of warning
- This is still a development package, so use at your own risk
- I have no idea what's causing that long tail in the benchmark, happy to hear any ideas
- I'm not a C++ programmer