Thursday, July 11, 2013

Let's Make a Date: Date and Time classes in R


    library(date)
    library(chron)
    options(width=150)

Of all the frustrating data manipulations to deal with in any programming language, dates and times are the worst in my opinion. In R, many different packages provide their own functions for dates, which leads to different date classes that are not always compatible. Depending on how your data is organized, there are different solutions to your date and time problems. Here, I'll show the way that I think is easiest to deal with dates depending on the organization of your data.

Ok, so I find that in public health data, birthdays and death days are the most common dates to deal with. In a dataset like the DHS, dates can either be found as three separate integer variables (month, day, year) or they can be in one character variable like "05/03/2009". So let's start with the first situation, numeric dates.

Numeric dates

    #Create some data
    dates<-as.data.frame(cbind(c(1,3,6,11,4,12,5,3),
                               c(30,14,NA,NA,16,NA,20,31),
                               c(1980,1980,1980,1983,1983,1983,1986,1980),
                               c(2,NA,NA,NA,NA,12,4,NA),
                               c(2,NA,NA,NA,NA,NA,29,NA),
                               c(1980,NA,NA,1985,NA,1983,1987,NA)))
    colnames(dates)<-c("birth_month", "birth_day", "birth_year", "death_month", "death_day", "death_year")
    dates

I've included a lot of missing cells, even in birth date, because that's the most common problem that I have - birth and death data, especially from developing countries, is full of missing birth months or days.

Ok, so what I would like to do here is to figure out if I have any infant mortalities in my sample of 8. I find the easiest way to create a date object from three integer month/day/year variables is by using **ISOdate()**, which is just in base R. ISOdate() has the following syntax:

    ISOdate(year, month, day, hour = 12, min = 0, sec = 0, tz = "GMT")

and if you had hours and minutes you could easily throw those in too, or use the ISOdatetime() function, which takes the same arguments (there, hour, min and sec have no defaults).
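To make the ISOdate() call a bit more concrete, here is a minimal sketch on three made-up people (the vectors `birth_year`, `birth_month`, `birth_day` are just illustrative, not the dataset above). It shows what class comes back and what happens when a component is missing:

```r
# Month, day and year stored as separate numeric vectors (illustrative values)
birth_year  <- c(1980, 1980, 1983)
birth_month <- c(1, 6, 11)
birth_day   <- c(30, NA, 15)

# ISOdate() is vectorised: one date-time per person
dob <- ISOdate(birth_year, birth_month, birth_day)

class(dob)                      # a POSIXct date-time
is.na(dob)                      # the person with the missing day becomes NA
format(dob, "%Y-%m-%d %H:%M")   # the time part defaults to 12:00 GMT

# One possible way to drop the time part: convert to a base Date
dob_date <- as.Date(dob)
dob_date
```

The takeaway is that any missing piece (month, day or year) makes the whole date NA, which matters a lot for data like this.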
So let's create a birth date:

    dates$DOB<-ISOdate(dates$birth_year, dates$birth_month, dates$birth_day)
    dates

This is super easy. The column DOB that we have added to the dataframe is of class POSIXct, which is a class of calendar date and time. Since I don't have any time variables, I don't want the time crowding up my dataset, so I can use the function strptime() to get rid of it. Here's the syntax:

    strptime(x, format, tz = "")

where x is a character vector (our POSIXct column gets converted to character on the way in) and format describes how those strings are laid out (see below). The result is of class POSIXlt. The [strptime help file](http://astrostatistics.psu.edu/su07/R/html/base/html/strptime.html) is a good place to understand the formats.

    dates$DOB<-strptime(dates$DOB, format="%Y-%m-%d")
    dates$DOD<-strptime(ISOdate(dates$death_year, dates$death_month, dates$death_day), format="%Y-%m-%d")

You can, of course, do this in one step, which I've done for death date. Now we can look at our dataset:

    dates

Now if we want age at death, we can use the difftime() function, which has the following syntax and produces a difftime object (which you can convert to numeric using as.numeric(), which I highly recommend):

    difftime(time1, time2, tz, units = c("auto", "secs", "mins", "hours", "days", "weeks"))

    dates$Age.atdeath<-difftime(dates$DOD, dates$DOB, units="days")
    dates$Age.atdeath
    class(dates$Age.atdeath)
    #check if there were any infant mortalities
    dates$Age.atdeath<365

Ok, I found two infant mortalities, but I see that there's a problem. Several birthdays come up as NA because the birth day is missing, and two of those people (rows 4 and 6) have a recorded death year, so their age at death is NA too. However, I do have birth month and birth year for them, and this is information I don't want to lose. There are many ways to deal with this, including imputing birth days or assigning them randomly from a uniform distribution.
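As a small aside on the difftime() step above, here is a hedged sketch of the same calculation on three made-up people, using base Date objects rather than the POSIXlt columns (that swap is my assumption for keeping the example short; the names `dob`, `dod` and `age_days` are illustrative):

```r
# Three made-up people; the last one has no usable birth date
dob <- as.Date(c("1980-01-30", "1986-05-20", NA))
dod <- as.Date(c("1980-02-02", "1987-04-29", "1985-12-30"))

age <- difftime(dod, dob, units = "days")
class(age)                                    # "difftime"

# Converting to plain numbers makes comparisons and summaries simpler
age_days <- as.numeric(age, units = "days")
age_days                                      # 3 344 NA

# NA stays NA in the comparison, so count with which() or sum(..., na.rm = TRUE)
age_days < 365
sum(age_days < 365, na.rm = TRUE)             # 2
```

Note how the missing birth date silently turns into an NA age rather than an error, which is exactly the problem described above.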
In this case, what I will do is very simple - just replace a missing birth day with 1 (and similarly a missing birth month with 1) and replace a missing death month and day with 12 and 30, respectively. That way I get (roughly) the maximum possible age at death and I don't lose potentially important information. There are a number of ways to accomplish what I want to do, but I love using the ifelse() function because I find it extremely intuitive, so I will do that:

    dates$DOB2<-strptime(ISOdate(year=dates$birth_year,
                                 month=ifelse(is.na(dates$birth_month), 1, dates$birth_month),
                                 day=ifelse(is.na(dates$birth_day), 1, dates$birth_day)),
                         format="%Y-%m-%d")
    dates$DOD2<-strptime(ISOdate(year=dates$death_year,
                                 month=ifelse(is.na(dates$death_month), 12, dates$death_month),
                                 day=ifelse(is.na(dates$death_day), 30, dates$death_day)),
                         format="%Y-%m-%d")
    dates$Ageatdeath_2<-as.numeric(difftime(dates$DOD2, dates$DOB2, units="days"))
    dates[,c(1:6,10:12)]

So now we see above that I have all birthdays completed, and I was able to reveal that I have another infant mortality in my data, and another death more generally.

Character dates

Ok, so on to the second type of dataset you may encounter, which is that somebody entered the dates into Excel or whatever like this:

    dates2<-as.data.frame(cbind(c(1:5),
                                c("8/31/70", "NA", "10/12/60", "1/1/66", "12/31/80"),
                                c("8/31/56", "12-31-1977", "12Aug55", "July 31 1965", "30jan1952")))
    colnames(dates2)<-c("ID", "date_factor", "date_horrible")
    dates2

In the first date column (date_factor) at least all of the dates have the same format, but in the second (date_horrible), which happens really often, every date is in a different format, which seems, at first, like a total nightmare. But fortunately R is here to rescue us.

Ok, let's start with the easy one. The tricky part with this kind of data is that R often immediately converts the dates to factors, like so:

    class(dates2$date_factor)

This happens if you are reading in a csv file as well. (In R 4.0.0 and later the default is stringsAsFactors = FALSE, so you will get character columns here unless you ask for factors.)
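To see why the factor issue matters, here is a minimal sketch with a toy data frame (`raw` and `dob_chr` are made-up names, and I force `stringsAsFactors = TRUE` so it behaves the same way on any R version):

```r
# Force the old default so the column is a factor regardless of R version
raw <- data.frame(id  = 1:3,
                  dob = c("8/31/70", "10/12/60", "1/1/66"),
                  stringsAsFactors = TRUE)

class(raw$dob)        # "factor"
levels(raw$dob)       # the distinct strings, stored once
as.integer(raw$dob)   # internal level codes, NOT anything date-like

# as.character() recovers the original strings, which is what date parsers need
dob_chr <- as.character(raw$dob)
dob_chr
class(dob_chr)        # "character"
```

So a factor is really a vector of integer codes with labels attached, which is why it has to go through as.character() before any date function sees it.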
You can either stop R from doing that by using the as.is option like so (assuming your dates were the second and third columns):

    df <- read.table("data_with_dates.txt", header = TRUE, as.is = 2:3)

or we just have to remember to use the **as.character()** function before we do anything.

Ok, so the point here is that even though these look like dates, they are *not* dates. They are factors or maybe characters, and you can't manipulate them the way you would want. For example, this gives you an error:

    #NO: this gives an error, you can't do this with characters, need the date format
    dates2$age<-difftime("02/27/13", as.character(dates2$date_factor), units="days")

So we need to get these character dates into actual date formats. If all of our formats are the same, we can use the chron package and the **chron()** function really easily like so.

    chron(dates., times., format = c(dates = "m/d/y", times = "h:m:s"), out.format, origin.)

where dates. is the vector of **character** dates, and format is whatever the format is of the data you have. Note that it yields warning messages because of the missing value, but that is ok.

    library(chron)
    dates2$date.fmt<-chron(as.character(dates2$date_factor), format="m/d/y")
    class(dates2$date.fmt)
    dates2[,c(1,2,4)]

Note that it looks very similar to the date_factor column but it is not, because it is now a dates and times class as I show above. We can also change how we want the outgoing format to look:

    dates2$date.fmt<-chron(as.character(dates2$date_factor), format="m/d/y", out.format="month day year")
    dates2[,c(1,2,4)]

And now we can find the age of these people using difftime() as before.
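One thing worth double-checking with two-digit years like these is which century they land in. As a rough cross-check in base R (this is not what chron does internally, just a sketch with as.Date()), the `%y` format maps 00 to 68 onto 20xx and 69 to 99 onto 19xx, so birth years in the 1960s come out a century late. The fix below assumes every date in the column is from the 1900s, which is my assumption for this toy vector:

```r
raw <- c("8/31/70", "10/12/60", "1/1/66", "12/31/80")

# Base R's %y puts 60 and 66 in the 2000s
parsed <- as.Date(raw, format = "%m/%d/%y")
format(parsed, "%Y")   # "1970" "2060" "2066" "1980"

# Assumption: all of these dates are in the 1900s, so spell out the century
raw4  <- sub("([0-9]{2})$", "19\\1", raw)   # "8/31/1970", "10/12/1960", ...
fixed <- as.Date(raw4, format = "%m/%d/%Y")
format(fixed, "%Y-%m-%d")
```

Whatever package you use, it is worth printing the four-digit years once to make sure nobody was born in the future.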
Let's say I interviewed everyone on March 1st of this year and I want their age in years at interview (dividing the age in days by 365.25 to account for leap years):

    dates2$age<-as.numeric(floor(difftime(chron("03/01/2013"), dates2$date.fmt, units="days")/365.25))
    dates2[,c(1,2,4,5)]

I can also do things like add a day to everybody's date if I had some reason to do that, and I can compare dates to see which one came first, which can be useful:

    #Add a day to everyone's date for some reason
    dates2$date.fmt+1
    #Compare the date to some other date to see which came first using < operator

Ok, finally, what if your data is as horrible as what we see in our date_horrible column?

    dates2[,c(1,3)]

Chron won't help us here, as it needs one format for everyone:

    #NO: chron needs the same format
    chron(as.character(dates2$date_horrible))

It just gives us NAs (and warnings, which I've hidden). However, there is a package called date that will take many kinds of date strings and figure them out for us. Watch how amazing it is: the only argument the function needs is the vector of character dates:

    library(date)
    #as.date (lower case) will correctly convert dates in vector
    as.date(as.character(dates2$date_horrible))

Extraordinary! :) It just knows. However, I find that it stubbornly reformats itself into the number of days since 1960 when adding it as a column to the dataframe:

    dates2$date_autofmt<-as.date(as.character(dates2$date_horrible))
    dates2[,c(1,3,6)]

It's also not super easy to work with because it's a date object, not a Date object (VERY case-sensitive stuff here!). Feel free to chime in here if you have a better solution, but my simple fix is just to wrap it in an **as.Date()** call (from base R) like so:

    dates2$date_amazing<-as.Date(as.date(as.character(dates2$date_horrible)))
    dates2[,c(1,3,7)]

Now it works great. It's pretty incredible that one short line of code can transform your awful dates column into perfectly coordinated and workable dates. You can use difftime() the same way as before.
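Once everything is a base Date, the arithmetic from earlier carries over directly. Here is a minimal sketch with three dates typed in by hand (the vector `d`, the comparison date and the `interview` date are illustrative, and dividing by 365.25 is only an approximation of age in years):

```r
d <- as.Date(c("1956-08-31", "1977-12-31", "1955-08-12"))

# Add a day to every date
d + 1

# Compare against another date to see which came first
d < as.Date("1960-01-01")        # TRUE FALSE TRUE

# Approximate age in whole years at a (made-up) interview date
interview <- as.Date("2013-03-01")
age_days  <- as.numeric(difftime(interview, d, units = "days"))
age_years <- floor(age_days / 365.25)
age_years                        # 56 35 57
```

The 365.25 divisor can be off by a day right around someone's birthday, so if exact ages matter you may want to compare month and day explicitly instead.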
Hope that this was useful for those pulling their hair out over date and time objects in R. Also, if you made it all the way to the end, please enjoy [this hilarious episode](http://www.youtube.com/watch?v=PD3aHKeFlSI) of Let's Make a Date on "Whose Line Is It Anyway?" featuring none other than Stephen Colbert. :D

5 comments:

  1. Hi Slawa -- this is great. Am just about to get into some survey data where I am going to need to be tidy with my dates so this should be really useful. Thanks for posting! N

    ps - Am viewing this on a Chrome browser and the last 3 lines seem to get cut off (appeared fine when I opened in IE). Would have missed the Stephen Colbert link!

    1. Thanks! Glad it's useful :) I'm still getting used to making the knitr seamless with blogger, and it's been a challenge. Does it look better now? Are you on a mac or PC?

  2. Well I think lubridate package could help a lot here http://cran.r-project.org/web/packages/lubridate/index.html

    1. Yeah there are lots of packages for dates and lubridate has a lot of capabilities but it seemed to me more useful for when you already have date objects. The functions and packages I used here are great for raw data that are not yet in any date format.
