Thursday, July 11, 2013

Let's Make a Date: Date and Time classes in R

Of all the frustrating data manipulations to deal with in any programming language, dates and times are the worst in my opinion. In R, there are many different packages that use various functions to deal with dates, which leads to different classes of dates that are not always compatible. Depending on how your data is organized, there are different solutions to your date and time problems. Here, I'll show the way that I think is easiest to deal with dates depending on the organization of your data.

Ok so in public health data, I find that birth dates and death dates are the most common dates to deal with. In a dataset like the DHS, dates can either be found as three separate integer variables (month, day, year) or they can be in one character variable like "05/03/2009". So let's start with the first situation, numeric dates.

Numeric dates

#Create some data
dates<-as.data.frame(cbind(c(1,3,6,11,4,12,5,3), 
                           c(30,14,NA,NA,16,NA,20,31), 
                           c(1980, 1980, 1980, 1983,1983, 1983, 1986, 1980), 
                           c(2, NA, NA, NA, NA, 12, 4, NA), 
                           c(2, NA, NA, NA, NA, NA, 29, NA), 
                           c(1980, NA, NA, 1985, NA, 1983, 1987, NA)))
colnames(dates)<-c("birth_month", "birth_day", "birth_year", "death_month", "death_day", "death_year")
dates
##   birth_month birth_day birth_year death_month death_day death_year
## 1           1        30       1980           2         2       1980
## 2           3        14       1980          NA        NA         NA
## 3           6        NA       1980          NA        NA         NA
## 4          11        NA       1983          NA        NA       1985
## 5           4        16       1983          NA        NA         NA
## 6          12        NA       1983          12        NA       1983
## 7           5        20       1986           4        29       1987
## 8           3        31       1980          NA        NA         NA

I've included a lot of missing cells, even in birth date, because that's the most common problem that I have - birth and death data, especially from developing countries, is full of missing birth months or days.

Ok, so what I would like to do here is to figure out if I have any infant mortalities in my sample of 8. I find the easiest way to create a date object from three integer month/day/year variables is by using ISOdate(), which is in base R. ISOdate() has the following syntax:

ISOdate(year, month, day, hour = 12, min = 0, sec = 0, tz = "GMT")

and if you had hours and minutes you could pass those in too, or use the ISOdatetime() function, which takes the same arguments. So let's create a birth date:

dates$DOB <- ISOdate(dates$birth_year, dates$birth_month, dates$birth_day)
dates
##   birth_month birth_day birth_year death_month death_day death_year                 DOB
## 1           1        30       1980           2         2       1980 1980-01-30 12:00:00
## 2           3        14       1980          NA        NA         NA 1980-03-14 12:00:00
## 3           6        NA       1980          NA        NA         NA                <NA>
## 4          11        NA       1983          NA        NA       1985                <NA>
## 5           4        16       1983          NA        NA         NA 1983-04-16 12:00:00
## 6          12        NA       1983          12        NA       1983                <NA>
## 7           5        20       1986           4        29       1987 1986-05-20 12:00:00
## 8           3        31       1980          NA        NA         NA 1980-03-31 12:00:00

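This is super easy. The column DOB that we have added to the dataframe is of class POSIXct, one of R's two calendar date-and-time classes (the other is POSIXlt). Since I don't have any time variables, I don't want the 12:00:00 crowding up my dataset, so I can use the function strptime() to get rid of it. Here's the syntax:

strptime(x, format, tz = "")

where x is a character vector (our POSIXct column is converted to character first), and format describes which parts of that string to read (see below). With the format "%Y-%m-%d" only the date part is read, and the result is a POSIXlt object. The strptime help file (?strptime, or http://astrostatistics.psu.edu/su07/R/html/base/html/strptime.html) is a good place to understand the formats.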

dates$DOB <- strptime(dates$DOB, format = "%Y-%m-%d")
dates$DOD <- strptime(ISOdate(dates$death_year, dates$death_month, dates$death_day), format = "%Y-%m-%d")

You can, of course, do this in one step, which I've done for death date. Now we can look at our dataset:

dates
##   birth_month birth_day birth_year death_month death_day death_year        DOB        DOD
## 1           1        30       1980           2         2       1980 1980-01-30 1980-02-02
## 2           3        14       1980          NA        NA         NA 1980-03-14       <NA>
## 3           6        NA       1980          NA        NA         NA       <NA>       <NA>
## 4          11        NA       1983          NA        NA       1985       <NA>       <NA>
## 5           4        16       1983          NA        NA         NA 1983-04-16       <NA>
## 6          12        NA       1983          12        NA       1983       <NA>       <NA>
## 7           5        20       1986           4        29       1987 1986-05-20 1987-04-29
## 8           3        31       1980          NA        NA         NA 1980-03-31       <NA>
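If you only need calendar dates and never times, another option is to convert the ISOdate() result straight to base R's Date class with as.Date(). The following is a minimal sketch on a few made-up values (the vectors birth_year, birth_month and birth_day are stand-ins for the dataframe columns above), not a replacement for the strptime() approach:

```r
# Stand-in vectors mirroring three rows of the example data
birth_year  <- c(1980, 1980, 1983)
birth_month <- c(1, 6, 4)
birth_day   <- c(30, NA, 16)

# ISOdate() returns POSIXct at 12:00 GMT; as.Date() keeps only the calendar date
dob <- as.Date(ISOdate(birth_year, birth_month, birth_day))

class(dob)   # "Date"
format(dob)  # "1980-01-30" NA "1983-04-16"
```

A Date column prints without a time component and works with difftime() in the same way.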

Now if we want age at death, we can use the difftime() function, which has the following syntax and produces a difftime object (which you can convert to numeric using as.numeric() if you want to, which I highly recommend):

difftime(time1, time2, tz, units = c("auto", "secs", "mins", "hours", "days", "weeks"))

dates$Age.atdeath <- difftime(dates$DOD, dates$DOB, units = "days")
dates$Age.atdeath
## Time differences in days
## [1]   3  NA  NA  NA  NA  NA 344  NA
## attr(,"tzone")
## [1] ""
class(dates$Age.atdeath)
## [1] "difftime"
# check if there were any infant mortalities
dates$Age.atdeath < 365
## [1] TRUE   NA   NA   NA   NA   NA TRUE   NA
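Because the comparison returns NA wherever a date is missing, counting the infant deaths takes one extra step. A small sketch on stand-alone vectors (dob and dod are illustrative names, with values taken from three rows of the example data):

```r
dob <- as.Date(c("1980-01-30", "1980-03-14", "1986-05-20"))
dod <- as.Date(c("1980-02-02", NA,           "1987-04-29"))

# Convert the difftime to a plain number of days
age_days <- as.numeric(difftime(dod, dob, units = "days"))

# Treat a missing age as "not a known infant death" instead of NA
infant_death <- !is.na(age_days) & age_days < 365

age_days           # 3 NA 344
sum(infant_death)  # 2
```

Whether a missing age should count as FALSE or stay NA depends on the analysis; here it is simply excluded from the count.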

Ok, I found two infant mortalities, but I see that there's a problem. Three birth dates come up as NA because the birth day is missing for those people, and two of them also have a death year on record. However, I do have birth month and birth year, and this is information I don't want to lose. There are many ways to deal with this, including imputing birth days or assigning them randomly from a uniform distribution or whatever.

In this case, what I will do is very simple - just replace a missing birth day with 1 (and similarly a missing birth month with 1) and replace a missing death month and day with 12 and 30, respectively. That way I get roughly the maximum possible age at death and I don't lose potentially important information. There are a number of ways to accomplish what I want to do, but I love using the ifelse() function because I find it extremely intuitive so I will do that:

dates$DOB2<-strptime(ISOdate(year=dates$birth_year, 
                             month=ifelse(is.na(dates$birth_month), 1, dates$birth_month), 
                             day=ifelse(is.na(dates$birth_day),1, dates$birth_day)), 
                     format="%Y-%m-%d")

dates$DOD2<-strptime(ISOdate(year=dates$death_year, 
                             month=ifelse(is.na(dates$death_month),12,dates$death_month), 
                             day=ifelse(is.na(dates$death_day),30, dates$death_day)), 
                     format="%Y-%m-%d")

dates$Ageatdeath_2<-as.numeric(difftime(dates$DOD2,dates$DOB2,units="days"))

dates[,c(1:6,10:12)]
##   birth_month birth_day birth_year death_month death_day death_year       DOB2       DOD2 Ageatdeath_2
## 1           1        30       1980           2         2       1980 1980-01-30 1980-02-02            3
## 2           3        14       1980          NA        NA         NA 1980-03-14       <NA>           NA
## 3           6        NA       1980          NA        NA         NA 1980-06-01       <NA>           NA
## 4          11        NA       1983          NA        NA       1985 1983-11-01 1985-12-30          790
## 5           4        16       1983          NA        NA         NA 1983-04-16       <NA>           NA
## 6          12        NA       1983          12        NA       1983 1983-12-01 1983-12-30           29
## 7           5        20       1986           4        29       1987 1986-05-20 1987-04-29          344
## 8           3        31       1980          NA        NA         NA 1980-03-31       <NA>           NA

So now we see above that I have all birthdays completed, and I was able to reveal that I have another infant mortality in my data, and another death more generally.
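One caveat with a fixed fill-in day of 30: it falls one day short in 31-day months, and it produces NA if the death month is February, because ISOdate() returns NA for dates that don't exist. A possible refinement is sketched below; last_day_of_month() is a hypothetical helper written for this illustration, not a base R function:

```r
# Hypothetical helper: last calendar day of a given month,
# computed as the first day of the following month minus one day
last_day_of_month <- function(year, month) {
  next_year  <- ifelse(month == 12, year + 1, year)
  next_month <- ifelse(month == 12, 1, month + 1)
  as.Date(ISOdate(next_year, next_month, 1)) - 1
}

is.na(ISOdate(1983, 2, 30))                        # TRUE: February 30 does not exist
format(last_day_of_month(1984, 2))                 # "1984-02-29"
format(last_day_of_month(c(1983, 1985), c(12, 4))) # "1983-12-31" "1985-04-30"
```

The returned Date could then stand in for the imputed death date when only the day is missing.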

Character Dates

Ok so onto the second type of dataset you may encounter, which is that somebody entered the dates into Excel or whatever like this:

dates2<-as.data.frame(cbind(c(1:5), 
                            c("8/31/70", "NA", "10/12/60", "1/1/66", "12/31/80"), 
                            c("8/31/56", "12-31-1977", "12Aug55", "July 31 1965" ,"30jan1952")))
colnames(dates2)<-c("ID", "date_factor", "date_horrible")
dates2
##   ID date_factor date_horrible
## 1  1     8/31/70       8/31/56
## 2  2          NA    12-31-1977
## 3  3    10/12/60       12Aug55
## 4  4      1/1/66  July 31 1965
## 5  5    12/31/80     30jan1952

In the first column at least all of the dates have the same format, but in the second (which happens really often!), every date is a different format which seems, at first, like a total nightmare. But fortunately R is here to rescue us.

Ok, let's start with the easy one. The tricky part with this kind of data is that R often immediately converts the dates to factors (this is the default in R versions before 4.0.0, where stringsAsFactors is TRUE), like so:

class(dates2$date_factor)
## [1] "factor"

This happens if you are reading in a csv file as well. You can either stop R from doing that by using the as.is option like so (assuming your dates were the second and third columns):

df <- read.table("data_with_dates.txt", header = TRUE, as.is = 2:3)

or we just have to remember to use the as.character() function before we do anything. Ok, so the point here is that even though these look like dates, they are not dates. They are factors or maybe characters, but not manipulatable like you would want. For example, this gives you the following error:

# NO: this gives an error, you can't do this with characters, need the date format
dates2$age <- difftime("02/27/13", as.character(dates2$date_factor), unit = "days")
## Error: character string is not in a standard unambiguous format
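On the import step mentioned above, a related way to keep date columns as plain character is the stringsAsFactors argument. This is a small self-contained sketch that reads from an inline string (csv_text is made up for the illustration) rather than from a file:

```r
# Made-up CSV content standing in for a file on disk
csv_text <- "ID,date_factor\n1,8/31/70\n2,NA\n3,10/12/60"

# stringsAsFactors = FALSE keeps text columns as character in any R version
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(df$date_factor)     # "character"
is.na(df$date_factor[2])  # TRUE: the string "NA" is read as a missing value
```

The columns are still just text at this point, so they need converting before any date arithmetic.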

So we need to get these character dates into actual date formats. If all of our formats are the same, we can use the chron package and the chron() function really easily like so.

chron(dates., times., format = c(dates = "m/d/y", times = "h:m:s"), out.format, origin.)

where dates is the vector of character dates, and the format is whatever the format is of the data you have. Note that it yields warning messages because of the missing value, but that is ok.

library(chron)
dates2$date.fmt <- chron(as.character(dates2$date_factor), format = "m/d/y")
## Warning: wrong number of fields in entry(ies) 2
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
class(dates2$date.fmt)
## [1] "dates" "times"
dates2[, c(1, 2, 4)]
##   ID date_factor date.fmt
## 1  1     8/31/70 08/31/70
## 2  2          NA     <NA>
## 3  3    10/12/60 10/12/60
## 4  4      1/1/66 01/01/66
## 5  5    12/31/80 12/31/80

Note that it looks very similar to the date_factor column but it is not, because it is now a dates and times class as I show above. We can also change how we want the outgoing format to look:

dates2$date.fmt <- chron(as.character(dates2$date_factor), format = "m/d/y", out.format = "month day year")
dates2[, c(1, 2, 4)]
##   ID date_factor         date.fmt
## 1  1     8/31/70   August 31 1970
## 2  2          NA             <NA>
## 3  3    10/12/60  October 12 1960
## 4  4      1/1/66  January 01 1966
## 5  5    12/31/80 December 31 1980

And now we can find the age of these people using difftime() as before. Let's say I interviewed everyone on March 1st of this year and I want their age in years at interview. I divide the number of days by 365.25, the average length of a year including leap years:

dates2$age <- as.numeric(floor(difftime(chron("03/01/2013"), dates2$date.fmt, units = "days")/365.25))
dates2[c(1, 2, 4, 5)]
##   ID date_factor         date.fmt age
## 1  1     8/31/70   August 31 1970  42
## 2  2          NA             <NA>  NA
## 3  3    10/12/60  October 12 1960  52
## 4  4      1/1/66  January 01 1966  47
## 5  5    12/31/80 December 31 1980  32
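Dividing a day count by a fixed year length is only approximate and can be off by one close to a birthday. If exact completed years matter, one option is to compare calendar fields directly. The sketch below uses base R Date objects rather than chron, and age_in_years() is a hypothetical helper written for this illustration:

```r
# Hypothetical helper: completed years between a birth date and a reference date
age_in_years <- function(birth, on) {
  b <- as.POSIXlt(birth)
  o <- as.POSIXlt(on)
  # Has this year's birthday not happened yet on the reference date?
  before_birthday <- (o$mon < b$mon) | (o$mon == b$mon & o$mday < b$mday)
  (o$year - b$year) - ifelse(before_birthday, 1, 0)
}

births <- as.Date(c("1970-08-31", "1960-10-12", "1966-01-01", "1980-12-31"))
age_in_years(births, as.Date("2013-03-01"))  # 42 52 47 32
```

Missing birth dates simply come back as NA.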

I can also do things like add a day to everybody's date if I had some reason to do that, and I can compare dates to see which one came first, which can be useful:

# Add a day to everyone's date for some reason
dates2$date.fmt + 1
## [1] September 01 1970 <NA>              October 13 1960   January 02 1966   January 01 1981
# Compare the date to some other date to see which came first using < operator
dates2$date.fmt < chron("04/02/62")
## [1] FALSE    NA  TRUE FALSE FALSE
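For a column where every date has the same format, base R's as.Date() with a format string is an alternative to chron. One thing to watch for: with the two-digit-year code %y, R may read low years such as 60 as 2060 rather than 1960. The sketch below sidesteps that by expanding the year to four digits first, under the assumption (true for this example, not in general) that every date is in the 1900s:

```r
raw <- c("8/31/70", "NA", "10/12/60", "1/1/66", "12/31/80")

# Assumption: all dates are in the 1900s, so prefix the two-digit year with "19"
expanded <- sub("/([0-9]{2})$", "/19\\1", raw)

# Strings that do not match the format (here "NA") become NA
parsed <- as.Date(expanded, format = "%m/%d/%Y")

format(parsed)  # "1970-08-31" NA "1960-10-12" "1966-01-01" "1980-12-31"
```

The result is a Date vector, so adding days and comparing with < work just as with chron.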

Ok finally what if your data is as horrible as what we see in our second column?

dates2[, c(1, 3)]
##   ID date_horrible
## 1  1       8/31/56
## 2  2    12-31-1977
## 3  3       12Aug55
## 4  4  July 31 1965
## 5  5     30jan1952

Chron won't help us here, as it needs one format for everyone:

# NO: chron needs the same format
chron(as.character(dates2$date_horrible))
## [1] 08/31/56 <NA>     <NA>     <NA>     <NA>

It just gives us NAs (and warnings, which I've hidden). However, there is a package called date that will take any type of date and figure it out for us. Watch how amazing it is, the only argument to the function you need is the vector of character dates:

library(date)
# as.date (lower case) will correctly convert dates in vector
as.date(as.character(dates2$date_horrible))
## [1] 31Aug56 31Dec77 12Aug55 31Jul65 30Jan52

Extraordinary! :) It just knows. However, I find that it stubbornly reformats itself into the number of days since 1960 when adding it as a column to the dataframe:

dates2$date_autofmt <- as.date(as.character(dates2$date_horrible))
dates2[, c(1, 3, 6)]
##   ID date_horrible date_autofmt
## 1  1       8/31/56        -1218
## 2  2    12-31-1977         6574
## 3  3       12Aug55        -1603
## 4  4  July 31 1965         2038
## 5  5     30jan1952        -2893

It's also not super easy to work with because it's a date object, not a Date object (VERY case-sensitive stuff here!). Feel free to chime in here if you have a better solution, but my simple fix is just to wrap it in an as.Date() function (from base R) like so:

dates2$date_amazing <- as.Date(as.date(as.character(dates2$date_horrible)))
dates2[, c(1, 3, 7)]
##   ID date_horrible date_amazing
## 1  1       8/31/56   1956-08-31
## 2  2    12-31-1977   1977-12-31
## 3  3       12Aug55   1955-08-12
## 4  4  July 31 1965   1965-07-31
## 5  5     30jan1952   1952-01-30
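The numbers that showed up in the date_autofmt column are day counts from January 1, 1960, so they can also be turned into Date objects by hand. This sketch needs only base R and reuses the integers printed above:

```r
# Day counts copied from the date_autofmt column
days_since_1960 <- c(-1218, 6574, -1603, 2038, -2893)

# Count forward (or backward, for negative values) from the 1960-01-01 origin
converted <- as.Date(days_since_1960, origin = "1960-01-01")

format(converted)
# "1956-08-31" "1977-12-31" "1955-08-12" "1965-07-31" "1952-01-30"
```

This matches the date_amazing column, which is a handy check that the conversion did what you expected.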

Now it works great. That's pretty incredible that one short line of code can transform your awful dates column into perfectly coordinated and workable dates. You can use difftime() the same way as before. Hope that this was useful for those pulling their hair out over date and time objects in R.

Also, if you made it all the way to the end, please enjoy this hilarious episode (http://www.youtube.com/watch?v=PD3aHKeFlSI) of Let's Make a Date on "Whose Line Is It Anyway?" featuring none other than Stephen Colbert. :D



5 comments:

  1. Hi Slawa -- this is great. Am just about to get into some survey data where I am going to need to be tidy with my dates so this should be really useful. Thanks for posting! N

    ps - Am viewing this on a Chrome browser and the last 3 lines seem to get cut-off (appeared fine when I opened in IE). Would have missed the Steven Colbert link!

    1. Thanks! Glad it's useful :) I'm still getting used to making the knitr seamless with blogger, and it's been a challenge. Does it look better now? Are you on a mac or PC?

  2. Well I think lubridate package could help a lot here http://cran.r-project.org/web/packages/lubridate/index.html

    1. Yeah there are lots of packages for dates and lubridate has a lot of capabilities but it seemed to me more useful for when you already have date objects. The functions and packages I used here are great for raw data that are not yet in any date format.
