Make a date with R and MariaDB (and avoid timezone troubles)
If you work with POSIXct objects in R, and save your data in a database, read on, this post could save your sanity.
Most programmers and analysts have experienced the pain of managing timezones in databases and across different programming environments, and R is no different. Lately there have been a number of issues and discussions about the right way to handle timezones for data stored in the MariaDB database, but accessed using R. The most complete and current discussions can be found here and here.
For those of you that don't have tons of time to read through github issues, we have run some benchmarks to explore this problem and come up with some recommendations.
What's all the fuss about?
If you use RMariaDB
drivers to both insert and read data, you're probably safe.
If you query data using RMariaDB
and
- you don't know how it was inserted
- it was inserted using another driver such as
RMySQL
orDBI
- it was inserted using a SQL GUI
- it was inserted using a SQL terminal
- any other possible scenario
you may need to check your dates.
Why?
As a handy guide here's a summary table showing how POSIX objects get stored in MariaDB when inserting data from R using various combinations of libraries and drivers.
Notice the posix_datetime
objects are different when RMariaDB
is used to insert the POSIXct data? This post will hopefully answer the 'why'.
What we're working with
- MariaDB database
- MySQL Workbench
- R 3.5.1
- RMariaDB 1.0.6
- Based in Melbourne, Australia (UTC+10 / UTC+11 in summer time)
The timezone of the database server is set to system
(i.e. local time). You can check your own server's timezone with the command SELECT @@session.time_zone;
.
Set up the database
To store data, we need a database and a table. So let's make one.
For this we work directly in SQL, using MySQL Workbench:
CREATE TABLE `tbl_time` (
posix_datetime DATETIME,
posix_varchar VARCHAR(25)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now we have a table with two columns, posix_datetime
which will store DATETIME objects, and posix_varchar
, which will store VARCHAR (characters).
We've made these two columns so you can see both how the date object is stored, and the string representation of what we think it should be.
Inserting data - part 1
We first insert and query the data directly from within Workbench.
INSERT INTO tbl_time (posix_datetime, posix_varchar) VALUE ('2018-01-01 12:00:00', '2018-01-01 12:00:00');
Everything seems OK here
Getting data into R
To get the data into R we are using the RMariaDB driver, since
RMariaDB is a database interface and MariaDB driver for R. This version is aimed at full compliance with the DBI specification, as a replacement for the old RMySQL package.
library(RMariaDB)
con_maria <- RMariaDB::dbConnect(
drv = RMariaDB::MariaDB()
, user = "abcd"
, host = "127.0.0.1"
, port = 1234
, password = "xxxxxx"
, dbname = "mydb"
)
dt <- RMariaDB::dbGetQuery(con_maria, "SELECT * FROM tbl_time")
dt
# posix_datetime posix_varchar
# 1 2018-01-01 23:00:00 2018-01-01 12:00:00
Notice the posix_datetime
value is now '2018-01-01 23:00:00', whereas in workbench it was '2018-01-01 12:00:00'.
This is because the RMariaDB
driver auto-converts to UTC.
dt$posix_datetime
"2018-01-01 23:00:00 AEDT"
Reading a DATETIME from the database using RMariaDB
driver will assume the database is UTC, and will convert it to your local / R / system timezone.
What about other drivers?
Using either of
- RMySQL::MySQL()
- DBI::dbDriver("MySQL")
will return DATETIME
objects as character
. So you'll need to manually convert to POSIXct in R yourself. For the sake of time & the length of this post I'm leaving that as a task for the reader, but the summary table shows the result of various driver combinations.
Inserting data - part 2
Now we insert data from R. Here we create a data.frame with two columns, one POSIXct and one character (to reflect the database structure)
df <- data.frame(
posix_datetime = as.POSIXct("2018-01-01 12:00:00")
, posix_varchar = "2018-01-01 12:00:00"
)
RMariaDB::dbWriteTable(
conn = con_maria
, name = "tbl_time"
, value = df
, append = TRUE
)
What's going on here?
Well, we created our POSIXct object without setting the tz
argument. So it takes the default timezone of your R session, which in our case is Australia/Melbourne.
df$posix_datetime
# [1] "2018-01-01 12:00:00 AEDT"
Which then gets stored in the database as UTC.
If we set the tz
to UTC we can see it doesn't change when in the database
df <- data.frame(
posix_datetime = as.POSIXct("2018-01-01 12:00:00", tz = "UTC")
, posix_varchar = "2018-01-01 12:00:00"
)
RMariaDB::dbWriteTable(
conn = con_maria
, name = "tbl_time"
, value = df
, append = TRUE
)
Making sense?
Here's that summary table again, just to reinfoce the point
I get it. But what should I DO?
First rule of timezones - be careful, very careful.
Rule 2: Know the timezones of your server and your system and your data
Rule 3: Make sure the method you use to write your data is compatable with the method you will use to extract it.
- If you only have small data sets then writing/reading using the
MariaDB
driver is a good option. - The
MariaDB
write method uses row by row inserts so can be very slow if you have bigger datasets. In this case you will want to use aLOAD DATA INFILE
or similar to write the data in bulk chunks. In this case you will need to do a timezone conversion when reading the data in usingRMariaDB
drivers - Keep an eye on the online discussions. This issue is in active discussion on the
RMariaDB
github pages (see the links at the top of this post). Watch the discussion pages and package news to see if any changes are made in future.