Make a date with R and MariaDB (and avoid timezone troubles)

If you work with POSIXct objects in R, and save your data in a database, read on, this post could save your sanity.

Photo by Jon Tyson on Unsplash

Photo by Jon Tyson on Unsplash

Most programmers and analysts have experienced the pain of managing timezones in databases and across different programming environments, and R is no different. Lately there have been a number of issues and discussions about the right way to handle timezones for data stored in the MariaDB database, but accessed using R. The most complete and current discussions can be found here and here.

For those of you that don't have tons of time to read through github issues, we have run some benchmarks to explore this problem and come up with some recommendations.

What's all the fuss about?

If you use RMariaDB drivers to both insert and read data, you're probably safe.

If you query data using RMariaDB and

  • you don't know how it was inserted
  • it was inserted using another driver such as RMySQL or DBI
  • it was inserted using a SQL GUI
  • it was inserted using a SQL terminal
  • any other possible scenario

you may need to check your dates.

Why?

As a handy guide here's a summary table showing how POSIX objects get stored in MariaDB when inserting data from R using various combinations of libraries and drivers.

summary.png

Notice the posix_datetime objects are different when RMariaDB is used to insert the POSIXct data? This post will hopefully answer the 'why'.

What we're working with

  • MariaDB database
  • MySQL Workbench
  • R 3.5.1
  • RMariaDB 1.0.6
  • Based in Melbourne, Australia (UTC+10 / UTC+11 in summer time)

The timezone of the database server is set to system (i.e. local time). You can check your own server's timezone with the command SELECT @@session.time_zone;.

Set up the database

To store data, we need a database and a table. So let's make one.

For this we work directly in SQL, using MySQL Workbench:

CREATE TABLE `tbl_time` (
    posix_datetime DATETIME,
    posix_varchar VARCHAR(25)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Now we have a table with two columns, posix_datetime which will store DATETIME objects, and posix_varchar, which will store VARCHAR (characters).

We've made these two columns so you can see both how the date object is stored, and the string representation of what we think it should be.

Inserting data - part 1

We first insert and query the data directly from within Workbench.

INSERT INTO tbl_time (posix_datetime, posix_varchar) VALUE ('2018-01-01 12:00:00', '2018-01-01 12:00:00');
Insert_data.png

Everything seems OK here

Getting data into R

To get the data into R we are using the RMariaDB driver, since

RMariaDB is a database interface and MariaDB driver for R. This version is aimed at full compliance with the DBI specification, as a replacement for the old RMySQL package.

library(RMariaDB)

con_maria <- RMariaDB::dbConnect(
    drv = RMariaDB::MariaDB()
    , user = "abcd"
    , host = "127.0.0.1"
    , port = 1234
    , password = "xxxxxx"
    , dbname = "mydb"
)

dt <- RMariaDB::dbGetQuery(con_maria, "SELECT * FROM tbl_time")
dt

#        posix_datetime       posix_varchar
# 1 2018-01-01 23:00:00 2018-01-01 12:00:00

Notice the posix_datetime value is now '2018-01-01 23:00:00', whereas in workbench it was '2018-01-01 12:00:00'.

This is because the RMariaDB driver auto-converts to UTC.

dt$posix_datetime
"2018-01-01 23:00:00 AEDT"

Reading a DATETIME from the database using RMariaDB driver will assume the database is UTC, and will convert it to your local / R / system timezone.

What about other drivers?

Using either of

  • RMySQL::MySQL()
  • DBI::dbDriver("MySQL")

will return DATETIME objects as character. So you'll need to manually convert to POSIXct in R yourself. For the sake of time & the length of this post I'm leaving that as a task for the reader, but the summary table shows the result of various driver combinations.

Inserting data - part 2

Now we insert data from R. Here we create a data.frame with two columns, one POSIXct and one character (to reflect the database structure)

df <- data.frame(
  posix_datetime = as.POSIXct("2018-01-01 12:00:00")
  , posix_varchar = "2018-01-01 12:00:00"
  )

RMariaDB::dbWriteTable(
  conn = con_maria
  , name = "tbl_time"
  , value = df
  , append = TRUE
  )
insert_data_r.png

What's going on here?

Well, we created our POSIXct object without setting the tz argument. So it takes the default timezone of your R session, which in our case is Australia/Melbourne.

df$posix_datetime
# [1] "2018-01-01 12:00:00 AEDT"

Which then gets stored in the database as UTC.

If we set the tz to UTC we can see it doesn't change when in the database

df <- data.frame(
  posix_datetime = as.POSIXct("2018-01-01 12:00:00", tz = "UTC")
  , posix_varchar = "2018-01-01 12:00:00"
  )

RMariaDB::dbWriteTable(
  conn = con_maria
  , name = "tbl_time"
  , value = df
  , append = TRUE
  )
insert_data_utc.png

Making sense?

Here's that summary table again, just to reinfoce the point

summary.png

I get it. But what should I DO?

First rule of timezones - be careful, very careful.

Rule 2: Know the timezones of your server and your system and your data

Rule 3: Make sure the method you use to write your data is compatable with the method you will use to extract it.

  • If you only have small data sets then writing/reading using the MariaDB driver is a good option.
  • The MariaDB write method uses row by row inserts so can be very slow if you have bigger datasets. In this case you will want to use a LOAD DATA INFILE or similar to write the data in bulk chunks. In this case you will need to do a timezone conversion when reading the data in using RMariaDB drivers
  • Keep an eye on the online discussions. This issue is in active discussion on the RMariaDB github pages (see the links at the top of this post). Watch the discussion pages and package news to see if any changes are made in future.
https://imgs.xkcd.com/comics/supervillain_plan.png

https://imgs.xkcd.com/comics/supervillain_plan.png