Image source: Wikipedia

Day 6: Lubridate

Working with times ⏳ and dates 📅 is tricky for all kinds of reasons, not least because dates are fundamentally absurd. A year isn’t a perfect multiple of days, so a solar calendar needs leap years and other corrections. The year doesn’t divide easily into 12 months – or frankly any other useful number either – so months have a variable number of days and doing “addition” with months is really weird. The week days don’t distribute themselves cleanly across months or years, so those are a headache too 🤕.

It’s not just an issue of nomenclature either. We organise our lives around seasons, weekdays/weekends, day/night and various other cycles that don’t always play nicely with one another. The weirdness of dates reflects the structure of the world and our lives, and data analysts can’t pretend that this complexity isn’t there.

Adding to all that, time zones are a whole other layer of complexity, as is daylight saving time, and the fact that various jurisdictions change the rules on a whim. Oh, and then there are places like South Australia that for some reason have a time zone offset by 30 minutes from the east coast of Australia, while many software systems only let you specify a time zone in hour increments from UTC. Once you add in cultural and linguistic variations, we end up with all sorts of ambiguities. Does 9/11 refer to September 11 or November 9? It depends on where you live. 🤷

Because of all this, date specification is hard. R natively comes with POSIXlt and POSIXct classes representing dates and times in a nice standards-compliant way and – for me at least 😄 – they’re terrible to work with. Not because they’re badly implemented or anything – quite the contrary! – it’s just that times and dates are inherently messy.

In practice though, a lot of the time we don’t want to have to deal with the low-level stuff; we just want to be able to represent the time and date, and do basic operations – e.g., compute the length of time between two dates – for the purposes of data analysis. The Date class is the main thing I work with on a regular basis, I suppose, and the big headache most of the time is getting data into a date format.
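To illustrate the kind of “basic operation” I mean, here’s a minimal base-R sketch (the dates themselves are made up): subtracting one Date from another gives a difftime measured in days.

```r
# Base R: subtracting two Date objects gives a difftime in days
start <- as.Date("2020-01-01")
end <- as.Date("2020-03-01")
end - start
## Time difference of 60 days

# as.numeric() strips the difftime class if you want a plain number
as.numeric(end - start)
## [1] 60
```

That’s 31 days of January plus 29 days of a leap-year February, which is exactly the sort of arithmetic I’d rather not do in my head.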

Many have tried to make times and dates easy. Many have failed. However, I hear that the lubridate package is pretty good? It’s a tidyverse package so it’s probably pretty cool – they always are! Since I’m working on a new machine today and waiting for Xcode to install (sigh, 😒) I have some time to kill ⏳ 🔪.

Looking at the tutorial

Okay, reading through the examples on the lubridate page is really promising!

As long as you know the order of “year, month, day” your data come in, you can just pass it straight to the relevant function: ymd, mdy, etc. The example they give is ymd(20101215), which produces this as output:

library(lubridate)
x <- ymd(20101215)
print(x)
## [1] "2010-12-15"
class(x)
## [1] "Date"

In this example, the ymd function recognises that the numeric input has four digits for the year, two for the month and two for the day without having to be told, and returns a Date object. The nice thing is that the functions are pretty flexible:

ymd(20101215) == mdy("12/15/10")
## [1] TRUE

Nice! The functions are smart enough to handle strings and numbers as input, and if the year is indeterminate it assumes a 21st century date. As a consequence, both of the expressions above return 15th December 2010 as the date. I can see why we need separate functions though, since the same numeric input, formatted differently…

dmy(20101215)
## [1] "1215-10-20"

… might indeed correspond to 20th October 1215. Admittedly this might be very plausible if the data set somehow pertained to the Magna Carta!
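As an aside on that “21st century” assumption: if I’m reading the docs right, lubridate follows the usual strptime convention for two-digit years, where 00–68 map to the 2000s and 69–99 map to the 1900s. A quick sketch (worth double-checking against your own data before relying on it):

```r
library(lubridate)  # assuming lubridate is installed

# two-digit years are pivoted: 00-68 -> 20xx, 69-99 -> 19xx
mdy("12/15/68")
## [1] "2068-12-15"
mdy("12/15/69")
## [1] "1969-12-15"
```

So “10” becomes 2010 here, but a data set with birth dates from the 1950s would need explicit handling.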

Arithmetic with months!

A quick look at the package suggests there are a lot of cool things it can do, but I’m on a time budget with these posts so I’ll quickly note one.

I love ❤️ the fact that lubridate supplies an addition operator %m+% that allows you to add months to dates in a somewhat coherent way: e.g., the last day of one month maps to the last day of the next month even if they have a different number of days. For instance, if we “add one month” to “January 31st 2000” we get this:

dmy("31-Jan-2000") %m+% months(1)
## [1] "2000-02-29"

It maps the last day to the last day, and since 2000 was a leap year this returns a date of Feb 29th. Doing the same thing for 2001 gives a Feb 28 result:

dmy("31-Jan-2001") %m+% months(1)
## [1] "2001-02-28"

In contrast, April always has 30 days, so adding three months to Jan 31 would always give April 30.
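To check my own claim, a sketch (assuming lubridate is loaded): with %m+% the impossible “April 31st” rolls back to April 30th, whereas ordinary + with a period does no rollback and returns NA.

```r
library(lubridate)  # assuming lubridate is installed

# %m+% rolls the impossible "April 31st" back to April 30th
dmy("31-Jan-2000") %m+% months(3)
## [1] "2000-04-30"

# ordinary + does no rollback, so the result is NA
dmy("31-Jan-2000") + months(3)
## [1] NA
```

That NA behaviour is arguably the “honest” answer, but %m+% is usually what you actually want.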

Fun!

A real world example?

That being said… when it comes to time and date data, I’ve had my heart broken before 💔. So often when a new data set arrives on my desk, it has so many weirdnesses going on with the date field that I end up having to do quite a bit of work to make sense of it. My partner, who works as an applied statistician and consultant, has many more horror stories when it comes to dates.

So, in order to get a sense of how helpful lubridate can be, I asked her to send me a couple of real examples of messily formatted date fields. Here’s the first one she sent…

example1 <- read.csv("https://djnavarro.net/files/oddballdate1.csv")
head(example1)
##   Day.no.   Date X X.1
## 1     196 110662 B  NW
## 2     197 120662 B  SE
## 3     198 140662 A   W
## 4     200 260562 A   0
## 5     202 180662 A WSW
## 6     203 190662 B SVV

Okay, those numbers are a bit weird? Each is a six-digit number, and they all end in “62”, so I’m guessing that’s a two-digit year code meaning 1962? The first two digits can be larger than 12, so that’s probably the day, which makes me think these data are numeric “ddmmyy” values. So let’s try the dmy function:

mydate <- dmy(example1$Date)
mydate
##  [1] "2062-06-11" "2062-06-12" "2062-06-14" "2062-05-26" "2062-06-18"
##  [6] "2062-06-19" "2062-07-15" "2062-07-21" "2062-07-22" "2062-08-01"
## [11] "2062-08-06" "2062-08-07" "2062-08-16" "2062-08-18" "2062-08-21"
## [16] "2062-09-01" "2062-09-03" "2062-09-05" "2062-09-07" "2062-09-11"
## [21] "2062-09-16" "2062-09-18" "2062-10-13" "2063-05-22" "2063-05-23"
## [26] "2063-06-01" "2063-06-03" "2063-06-05" "2063-06-06" "2063-06-08"
## [31] "2063-06-10" "2063-06-15" "2063-06-16" "2063-06-28" "2063-07-03"
## [36] "2063-07-04" "2063-07-07" "2063-07-08" "2063-07-26" "2063-08-02"
## [41] "2063-08-09" "2063-08-12" "2063-08-13" "2063-08-15" "2063-08-17"
## [46] "2063-09-01" "2063-09-03" "2063-09-05" "2063-09-18" "2063-09-19"

Nice.

Apart from being obviously the wrong century, that looks like it worked perfectly! I have no idea what this data set is about but just eyeballing it makes me feel reasonably confident that this is the right ordering. The “year” is 2062 or 2063 for all cases in the data set. The “month” varies a bit more but is always in the range 1-12, and the “day” varies quite dramatically, but again over a sensible range. Now, about that century issue? A quick check of the docs suggests I can use the year function to extract the year component…

year(mydate)
##  [1] 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062
## [15] 2062 2062 2062 2062 2062 2062 2062 2062 2062 2063 2063 2063 2063 2063
## [29] 2063 2063 2063 2063 2063 2063 2063 2063 2063 2063 2063 2063 2063 2063
## [43] 2063 2063 2063 2063 2063 2063 2063 2063

… and reassign it?

year(mydate) <- year(mydate) - 100
mydate
##  [1] "1962-06-11" "1962-06-12" "1962-06-14" "1962-05-26" "1962-06-18"
##  [6] "1962-06-19" "1962-07-15" "1962-07-21" "1962-07-22" "1962-08-01"
## [11] "1962-08-06" "1962-08-07" "1962-08-16" "1962-08-18" "1962-08-21"
## [16] "1962-09-01" "1962-09-03" "1962-09-05" "1962-09-07" "1962-09-11"
## [21] "1962-09-16" "1962-09-18" "1962-10-13" "1963-05-22" "1963-05-23"
## [26] "1963-06-01" "1963-06-03" "1963-06-05" "1963-06-06" "1963-06-08"
## [31] "1963-06-10" "1963-06-15" "1963-06-16" "1963-06-28" "1963-07-03"
## [36] "1963-07-04" "1963-07-07" "1963-07-08" "1963-07-26" "1963-08-02"
## [41] "1963-08-09" "1963-08-12" "1963-08-13" "1963-08-15" "1963-08-17"
## [46] "1963-09-01" "1963-09-03" "1963-09-05" "1963-09-18" "1963-09-19"

Woohoo! 🎉 Test number 1 survived!

Or not?

My initial enthusiasm notwithstanding … I think there’s a typo in the raw data for the fourth date. The data set has a “day” variable as well as a “date” variable, and they mostly match up, in the sense that (say) day 196 of the study (whatever it was) would be “1962-06-11”, day 197 was “1962-06-12” and so on. They’re often out by one day, but I can think of all sorts of ways in which a study could be designed in which that would be expected.

Row 4 is deeply suspicious. Ostensibly the date is “1962-05-26”, but that’s supposed to correspond to day 200 of the study? Probably not. If the other dates are all correct, then day 200 should be “1962-06-16” or thereabouts? That might be a typo or two? Someone wrote “260562” when it should have been “160662”? In real life I’d want to query that! Either there’s a typo or I made a mistake! Of course lubridate can’t catch things like that, so as usual there’s no substitute for actually looking at the data!
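A quick programmatic version of that suspicion, sketched here with a toy vector that mirrors the first few parsed dates (base R only): if the “day” numbers only ever increase, the parsed dates should too, so any negative difference between consecutive dates flags an out-of-order entry.

```r
# toy vector mirroring the first few parsed dates
d <- as.Date(c("1962-06-11", "1962-06-12", "1962-06-14",
               "1962-05-26", "1962-06-18"))

# a negative diff flags the drop going into the out-of-order entry
which(diff(d) < 0)
## [1] 3
```

The returned index points at the transition into the suspect row, which is exactly the row 4 problem here.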

I’m on the clock, so here’s the plot showing the date extracted by the dmy() function against the “day” variable in the raw data set:

plot(mydate, example1$Day.no., pch=19)

So presumably this was a study over two years (1962 and 1963), conducted at roughly the same time of year in both years, with the “day” variable recorded consecutively… but the speculation that Row 4 has a typo seems to be borne out. Every other day-date pair falls on a monotonic curve, so I’m pretty sure there’s a typo in one of the dates!

Still, I admit that as date-wrangling goes, lubridate did make it a pretty painless exercise.

Another real world example?

Okay, time for a tougher test. The sneaky typo in the first one notwithstanding, the truth is that it was a pretty easy data set to handle. All the dates are in the same format, visual inspection tells you which function to try, and the answers all seem to make sense (though for all I know I might be completely wrong). What happens when the data aren’t formatted in a consistent way? Enter annoying real world data set #2…

example2 <- readLines("https://djnavarro.net/files/oddballdate2.csv")
example2
##  [1] "4/4/16 9:00"   "4/4/16 14:13"  "6/4/16 5:55"   "6/4/16 11:27" 
##  [5] "8/4/16 15:04"  "12/4/16 12:29" "14/4/16 12:14" "14/4/16 19:16"
##  [9] "18/4/16 16:18" "20/4/16 11:10" "28/4/16 16:10" "28/4/16"      
## [13] "1/5/16 17:11"  "2/5/16 14:53"  "3/5/16 10:32"  "3/5/16 12:46" 
## [17] "3/5/16 13:08"  "3/5/16 15:51"  "4/5/16 11:04"  "4/5/16 11:13" 
## [21] "4/5/16 11:16"  "4/5/16 12:54"  "4/5/16 14:40"  "4/5/16 20:16" 
## [25] "6/5/16 15:13"  "10/5/16 11:17" "11/5/16 12:41" "12/5/16"      
## [29] "12/5/16"       "5/13/16"       "16/5/16 10:11" "16/5/16 22:47"
## [33] "17/5/16 15:20" "18/5/16"       "18/5/16 10:32" "19/5/16 2:27" 
## [37] "19/5/16 10:59" "19/5/16"

Yes. Yes, that is annoying. My partner is grinning maliciously at me 😈 and I hate her so much right now.

Okay, so most of the cases have a time stamp and are formatted in a way that would suit dmy_hm, but others are date-only and I guess I need dmy for those. Trying both functions individually gives me this…

dmy(example2)
## Warning: 33 failed to parse.
##  [1] NA           NA           NA           NA           NA          
##  [6] NA           NA           NA           NA           NA          
## [11] NA           "2016-04-28" NA           NA           NA          
## [16] NA           NA           NA           NA           NA          
## [21] NA           NA           NA           NA           NA          
## [26] NA           NA           "2016-05-12" "2016-05-12" NA          
## [31] NA           NA           NA           "2016-05-18" NA          
## [36] NA           NA           "2016-05-19"
dmy_hm(example2)
## Warning: 6 failed to parse.
##  [1] "2016-04-04 09:00:00 UTC" "2016-04-04 14:13:00 UTC"
##  [3] "2016-04-06 05:55:00 UTC" "2016-04-06 11:27:00 UTC"
##  [5] "2016-04-08 15:04:00 UTC" "2016-04-12 12:29:00 UTC"
##  [7] "2016-04-14 12:14:00 UTC" "2016-04-14 19:16:00 UTC"
##  [9] "2016-04-18 16:18:00 UTC" "2016-04-20 11:10:00 UTC"
## [11] "2016-04-28 16:10:00 UTC" NA                       
## [13] "2016-05-01 17:11:00 UTC" "2016-05-02 14:53:00 UTC"
## [15] "2016-05-03 10:32:00 UTC" "2016-05-03 12:46:00 UTC"
## [17] "2016-05-03 13:08:00 UTC" "2016-05-03 15:51:00 UTC"
## [19] "2016-05-04 11:04:00 UTC" "2016-05-04 11:13:00 UTC"
## [21] "2016-05-04 11:16:00 UTC" "2016-05-04 12:54:00 UTC"
## [23] "2016-05-04 14:40:00 UTC" "2016-05-04 20:16:00 UTC"
## [25] "2016-05-06 15:13:00 UTC" "2016-05-10 11:17:00 UTC"
## [27] "2016-05-11 12:41:00 UTC" NA                       
## [29] NA                        NA                       
## [31] "2016-05-16 10:11:00 UTC" "2016-05-16 22:47:00 UTC"
## [33] "2016-05-17 15:20:00 UTC" NA                       
## [35] "2016-05-18 10:32:00 UTC" "2016-05-19 02:27:00 UTC"
## [37] "2016-05-19 10:59:00 UTC" NA

Okay, so the functions do something pretty sensible: they convert the dates they can, insert NA for the dates they can’t, and then loudly complain about the fact that something is amiss.
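As an aside, lubridate also provides parse_date_time(), which tries several candidate formats in one call. A sketch like this might cover both patterns at once (I haven’t battle-tested it against the full data, and it would still trip over any rogue mdy entry):

```r
library(lubridate)  # assuming lubridate is installed

# try the date-time format first, then fall back to date-only;
# the two strings here are representative cases from the data
parse_date_time(c("4/4/16 9:00", "28/4/16"),
                orders = c("dmy HM", "dmy"))
```

Still, doing it by hand makes the logic explicit, so that’s what I’ll do below.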

So let’s do this:

# convert the dmy_hm cases
mydate2 <- example2 %>% dmy_hm %>% date
## Warning: 6 failed to parse.
# convert the dmy cases
ind <- is.na(mydate2) 
mydate2[ind] <- example2[ind] %>% dmy %>% date 
## Warning: 1 failed to parse.
# print
mydate2
##  [1] "2016-04-04" "2016-04-04" "2016-04-06" "2016-04-06" "2016-04-08"
##  [6] "2016-04-12" "2016-04-14" "2016-04-14" "2016-04-18" "2016-04-20"
## [11] "2016-04-28" "2016-04-28" "2016-05-01" "2016-05-02" "2016-05-03"
## [16] "2016-05-03" "2016-05-03" "2016-05-03" "2016-05-04" "2016-05-04"
## [21] "2016-05-04" "2016-05-04" "2016-05-04" "2016-05-04" "2016-05-06"
## [26] "2016-05-10" "2016-05-11" "2016-05-12" "2016-05-12" NA          
## [31] "2016-05-16" "2016-05-16" "2016-05-17" "2016-05-18" "2016-05-18"
## [36] "2016-05-19" "2016-05-19" "2016-05-19"

Well that almost worked. What’s the one NA value?

example2[is.na(mydate2)]
## [1] "5/13/16"

Argh!!! 😡 Looking at all the other dates, every single one is in April or May… but this last one is supposedly the 5th day of the 13th month? Sigh, obviously we have one entry formatted in mdy() form.

badcase <- is.na(mydate2)
mydate2[badcase] <- example2[badcase] %>% mdy %>% date
mydate2
##  [1] "2016-04-04" "2016-04-04" "2016-04-06" "2016-04-06" "2016-04-08"
##  [6] "2016-04-12" "2016-04-14" "2016-04-14" "2016-04-18" "2016-04-20"
## [11] "2016-04-28" "2016-04-28" "2016-05-01" "2016-05-02" "2016-05-03"
## [16] "2016-05-03" "2016-05-03" "2016-05-03" "2016-05-04" "2016-05-04"
## [21] "2016-05-04" "2016-05-04" "2016-05-04" "2016-05-04" "2016-05-06"
## [26] "2016-05-10" "2016-05-11" "2016-05-12" "2016-05-12" "2016-05-13"
## [31] "2016-05-16" "2016-05-16" "2016-05-17" "2016-05-18" "2016-05-18"
## [36] "2016-05-19" "2016-05-19" "2016-05-19"

There. Done.

Sadly, when you get sufficiently messy data it’s not as simple as asking lubridate to solve it for you, but I will admit that it felt easier than it usually is. At least I spent most of my time puzzling over the data format rather than getting frustrated by having to write a parsing function!

Waait…

Just like last time, I would prefer to visualise the dates to provide a quick sanity check. I don’t trust this data…

freq <- mydate2 %>% table # frequency table 
counts <- freq %>% as.numeric # numeric counts
dates <- freq %>% names %>% ymd # unique dates

plot(
  x = dates, 
  y = counts, 
  xlab="Date", 
  ylab="Number of Sign Ups",  
  type="h", 
  ylim=c(0,max(counts)), 
  lwd=2
)

I check with my partner… she agrees that’s probably about right for the data set, so yay!!!

Summary?

In conclusion, lubridate is lovely, but dates are still a huge pain in the… um… neck.

Danielle Navarro
Associate Professor of Cognitive Science
