
Thursday, July 11, 2013

Let's Make a Date: Date and Time classes in R

Of all the frustrating data manipulations to deal with in any programming language, dates and times are the worst in my opinion. In R, there are many different packages that use various functions to deal with dates, which leads to different classes of dates that are not always compatible. Depending on how your data is organized, there are different solutions to your date and time problems. Here, I'll show the way that I think is easiest to deal with dates depending on the organization of your data.

Ok so I find in public health data that birthdays and death days are the most common dates to be dealing with. In a dataset like the DHS, dates can either be found as three separate integer variables (month, day, year) or they can be in one character variable like “05/03/2009”. So let's start with the first situation, numeric dates.

Numeric dates

#Create some data
dates<-as.data.frame(cbind(c(1,3,6,11,4,12,5,3), 
                           c(30,14,NA,NA,16,NA,20,31), 
                           c(1980, 1980, 1980, 1983,1983, 1983, 1986, 1980), 
                           c(2, NA, NA, NA, NA, 12, 4, NA), 
                           c(2, NA, NA, NA, NA, NA, 29, NA), 
                           c(1980, NA, NA, 1985, NA, 1983, 1987, NA)))
colnames(dates)<-c("birth_month", "birth_day", "birth_year", "death_month", "death_day", "death_year")
dates

##   birth_month birth_day birth_year death_month death_day death_year
## 1           1        30       1980           2         2       1980
## 2           3        14       1980          NA        NA         NA
## 3           6        NA       1980          NA        NA         NA
## 4          11        NA       1983          NA        NA       1985
## 5           4        16       1983          NA        NA         NA
## 6          12        NA       1983          12        NA       1983
## 7           5        20       1986           4        29       1987
## 8           3        31       1980          NA        NA         NA

I've included a lot of missing cells, even in birth date, because that's the most common problem that I have - birth and death data, especially from developing countries, is full of missing birth months or days.

Ok, so what I would like to do here is to figure out if I have any infant mortalities in my sample size of 8. I find the easiest way to create a date object from three integer month/day/year variables is by using ISOdate(), which is just in base R. ISOdate() follows the following syntax:

ISOdate(year, month, day, hour = 12, min = 0, sec = 0, tz = "GMT")

and if you have hours and minutes you can pass those in too. There is also an ISOdatetime() function that takes the same arguments, except that hour, min, and sec have no defaults and tz defaults to your local time zone. So let's create a birth date:

dates$DOB <- ISOdate(dates$birth_year, dates$birth_month, dates$birth_day)
dates

##   birth_month birth_day birth_year death_month death_day death_year                 DOB
## 1           1        30       1980           2         2       1980 1980-01-30 12:00:00
## 2           3        14       1980          NA        NA         NA 1980-03-14 12:00:00
## 3           6        NA       1980          NA        NA         NA                <NA>
## 4          11        NA       1983          NA        NA       1985                <NA>
## 5           4        16       1983          NA        NA         NA 1983-04-16 12:00:00
## 6          12        NA       1983          12        NA       1983                <NA>
## 7           5        20       1986           4        29       1987 1986-05-20 12:00:00
## 8           3        31       1980          NA        NA         NA 1980-03-31 12:00:00

This is super easy. The column DOB that we have added to the dataframe is of class POSIXct, which is a class of calendar date and time. Since I don't have any time variables, I don't want the time crowding up my dataset, so I can use the function strptime() to get rid of it. Here's the syntax:

strptime(x, format, tz = "")

where x is the date-time you want to parse (strptime() converts it to character first), and format describes the part of that text you want to read (see below). The result is of class POSIXlt. The strptime help file (http://astrostatistics.psu.edu/su07/R/html/base/html/strptime.html) is a good place to understand the formats.

dates$DOB <- strptime(dates$DOB, format = "%Y-%m-%d")
dates$DOD <- strptime(ISOdate(dates$death_year, dates$death_month, dates$death_day), format = "%Y-%m-%d")

You can, of course, do this in one step, which I've done for death date. Now we can look at our dataset:

dates

##   birth_month birth_day birth_year death_month death_day death_year        DOB        DOD
## 1           1        30       1980           2         2       1980 1980-01-30 1980-02-02
## 2           3        14       1980          NA        NA         NA 1980-03-14       <NA>
## 3           6        NA       1980          NA        NA         NA       <NA>       <NA>
## 4          11        NA       1983          NA        NA       1985       <NA>       <NA>
## 5           4        16       1983          NA        NA         NA 1983-04-16       <NA>
## 6          12        NA       1983          12        NA       1983       <NA>       <NA>
## 7           5        20       1986           4        29       1987 1986-05-20 1987-04-29
## 8           3        31       1980          NA        NA         NA 1980-03-31       <NA>
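If the goal is only to drop the time of day, one alternative worth knowing is base R's as.Date(), which returns a Date object instead of a POSIXlt one. Here is a minimal sketch using made-up vectors (yr, mo, and dy are hypothetical names, not columns from the data above):

```r
# Hypothetical toy vectors standing in for year, month and day columns
yr <- c(1980, 1980, 1983)
mo <- c(1, 6, 4)
dy <- c(30, NA, 16)

# ISOdate() returns a POSIXct date-time (noon GMT by default)
dob_ct <- ISOdate(yr, mo, dy)
class(dob_ct)

# as.Date() keeps only the calendar date; the missing day stays NA
dob <- as.Date(dob_ct)
dob
```

Whether you prefer Date or POSIXlt is mostly a matter of taste here; both work with difftime().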

Now if we want age at death, we can use the difftime() function that follows the following syntax and produces a difftime object (which you can convert to numeric using as.numeric() if you want to, which I highly recommend):

difftime(time1, time2, tz, units = c("auto", "secs", "mins", "hours", "days", "weeks"))

dates$Age.atdeath <- difftime(dates$DOD, dates$DOB, units = "days")
dates$Age.atdeath

## Time differences in days
## [1]   3  NA  NA  NA  NA  NA 344  NA
## attr(,"tzone")
## [1] ""

class(dates$Age.atdeath)

## [1] "difftime"

# check if there were any infant mortalities
dates$Age.atdeath < 365

## [1] TRUE   NA   NA   NA   NA   NA TRUE   NA

Ok, I found two infant mortalities, but I see that there's a problem. I see that I'm missing two birthdays that come up as NA, and this is because I'm missing the birth day for those two people. However, I do have birth month and birth year, and this is information I don't want to lose. There are many ways to deal with this, including imputing birth days or assigning them randomly from a uniform distribution or whatever.
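Because a comparison involving NA returns NA rather than FALSE, counting the infant deaths takes a little care. Here is a small sketch on toy Date vectors (dob, dod, age_days, and infant_death are names I made up for illustration) that converts the difftime to a plain number and treats unknown ages as "not a known infant death":

```r
# Toy dates (three of the rows above, rebuilt as Date vectors)
dob <- as.Date(c("1980-01-30", "1980-03-14", "1986-05-20"))
dod <- as.Date(c("1980-02-02", NA, "1987-04-29"))

# Age at death in days, as a plain number instead of a difftime
age_days <- as.numeric(difftime(dod, dob, units = "days"))

# TRUE only when the age is known AND under a year; unknown ages become FALSE
infant_death <- !is.na(age_days) & age_days < 365

sum(infant_death)    # number of known infant deaths
which(infant_death)  # their row positions
```

Keep in mind that the FALSE values here mix "survived past a year" with "no death recorded", so this is only suitable for counting known cases.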

In this case, what I will do is very simple - just replace the missing birth day with 1 if it's missing (and similarly replace a missing birth month with 1 if it's missing) and replace missing death month and day with 12 and 30, respectively. That way I have roughly the maximum possible age at death and I don't lose potentially important information. (One caveat: day 30 does not exist in February, so a February death with a missing day would still come out as NA under this rule.) There are a number of ways to accomplish what I want to do, but I love using the ifelse() function because I find it extremely intuitive so I will do that:

dates$DOB2<-strptime(ISOdate(year=dates$birth_year, 
                             month=ifelse(is.na(dates$birth_month), 1, dates$birth_month), 
                             day=ifelse(is.na(dates$birth_day),1, dates$birth_day)), 
                     format="%Y-%m-%d")

dates$DOD2<-strptime(ISOdate(year=dates$death_year, 
                             month=ifelse(is.na(dates$death_month),12,dates$death_month), 
                             day=ifelse(is.na(dates$death_day),30, dates$death_day)), 
                     format="%Y-%m-%d")

dates$Ageatdeath_2<-as.numeric(difftime(dates$DOD2,dates$DOB2,unit="days"))

dates[,c(1:6,10:12)]

##   birth_month birth_day birth_year death_month death_day death_year       DOB2       DOD2 Ageatdeath_2
## 1           1        30       1980           2         2       1980 1980-01-30 1980-02-02            3
## 2           3        14       1980          NA        NA         NA 1980-03-14       <NA>           NA
## 3           6        NA       1980          NA        NA         NA 1980-06-01       <NA>           NA
## 4          11        NA       1983          NA        NA       1985 1983-11-01 1985-12-30          790
## 5           4        16       1983          NA        NA         NA 1983-04-16       <NA>           NA
## 6          12        NA       1983          12        NA       1983 1983-12-01 1983-12-30           29
## 7           5        20       1986           4        29       1987 1986-05-20 1987-04-29          344
## 8           3        31       1980          NA        NA         NA 1980-03-31       <NA>           NA

So now we see above that I have all birthdays completed, and I was able to reveal that I have another infant mortality in my data, and another death more generally.
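If you find yourself repeating this fill-in-then-build pattern, it can be wrapped in a small function. The sketch below is only an illustration: impute_date() is a hypothetical helper of my own naming, and it returns Date objects via as.Date() rather than going through strptime(). It reproduces rows 4 and 6 from the table above and also shows the February caveat.

```r
# Hypothetical helper: fill in missing month/day with defaults, then build a Date
impute_date <- function(year, month, day, default_month, default_day) {
  month <- ifelse(is.na(month), default_month, month)
  day   <- ifelse(is.na(day), default_day, day)
  as.Date(ISOdate(year, month, day))
}

# Rows 4 and 6 of the example data: birth day missing, death partly missing
dob <- impute_date(c(1983, 1983), c(11, 12), c(NA, NA), 1, 1)
dod <- impute_date(c(1985, 1983), c(NA, 12), c(NA, NA), 12, 30)
as.numeric(dod - dob)   # ages at death in days

# Caveat: a known February death with a missing day becomes "February 30",
# which is not a real date, so the result is NA
impute_date(1985, 2, NA, 12, 30)
```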

Character Dates

Ok so onto the second type of dataset you may encounter, which is that somebody entered the dates into Excel or whatever like this:

dates2<-as.data.frame(cbind(c(1:5), 
                            c("8/31/70", "NA", "10/12/60", "1/1/66", "12/31/80"), 
                            c("8/31/56", "12-31-1977", "12Aug55", "July 31 1965" ,"30jan1952")))
colnames(dates2)<-c("ID", "date_factor", "date_horrible")
dates2

##   ID date_factor date_horrible
## 1  1     8/31/70       8/31/56
## 2  2          NA    12-31-1977
## 3  3    10/12/60       12Aug55
## 4  4      1/1/66  July 31 1965
## 5  5    12/31/80     30jan1952

In the first column at least all of the dates have the same format, but in the second (which happens really often!), every date is a different format which seems, at first, like a total nightmare. But fortunately R is here to rescue us.

Ok, let's start with the easy one. The tricky part with this kind of data is that R often immediately converts the dates to factors (this is the default behavior in R versions before 4.0.0, where stringsAsFactors is TRUE), like so:

class(dates2$date_factor)

## [1] "factor"

This happens if you are reading in a csv file as well. You can either stop R from doing that by using the as.is option like so (assuming your dates were the second and third columns):

df <- read.table("data_with_dates.txt", header = TRUE, as.is = 2:3)

or we just have to remember to use the as.character() function before we do anything. Ok, so the point here is that even though these look like dates, they are not dates. They are factors or maybe characters, but not manipulatable like you would want. For example, this gives you the following error:

# NO: this gives an error, you can't do this with characters, need the date format
dates2$age <- difftime("02/27/13", as.character(dates2$date_factor), unit = "days")

## Error: character string is not in a standard unambiguous format

So we need to get these character dates into actual date formats. If all of our formats are the same, we can use the chron package and the chron() function really easily like so.

chron(dates., times., format = c(dates = "m/d/y", times = "h:m:s"), out.format, origin.)

where dates is the vector of character dates, and the format is whatever the format is of the data you have. Note that it yields warning messages because of the missing value, but that is ok.

library(chron)
dates2$date.fmt <- chron(as.character(dates2$date_factor), format = "m/d/y")

## Warning: wrong number of fields in entry(ies) 2

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

## Warning: NAs introduced by coercion

class(dates2$date.fmt)

## [1] "dates" "times"

dates2[, c(1, 2, 4)]

##   ID date_factor date.fmt
## 1  1     8/31/70 08/31/70
## 2  2          NA     <NA>
## 3  3    10/12/60 10/12/60
## 4  4      1/1/66 01/01/66
## 5  5    12/31/80 12/31/80

Note that it looks very similar to the date_factor column but it is not, because it is now a dates and times class as I show above. We can also change how we want the outgoing format to look:

dates2$date.fmt <- chron(as.character(dates2$date_factor), format = "m/d/y", out.format = "month day year")
dates2[, c(1, 2, 4)]

##   ID date_factor         date.fmt
## 1  1     8/31/70   August 31 1970
## 2  2          NA             <NA>
## 3  3    10/12/60  October 12 1960
## 4  4      1/1/66  January 01 1966
## 5  5    12/31/80 December 31 1980
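For comparison, here is a sketch of the same conversion using only base R's as.Date() with a format string. One thing to watch for: with %y, two-digit years from 00 to 68 are read as 20xx, so "10/12/60" would land in 2060. The century correction below rests on my own assumption that nobody in the data was born after the interview date; the object names (x, d, interview, late) are made up for illustration.

```r
x <- c("8/31/70", "NA", "10/12/60", "1/1/66", "12/31/80")

# %m/%d/%y = month/day/two-digit year; the "NA" string simply becomes NA
d <- as.Date(x, format = "%m/%d/%y")
d   # note that the 60 and 66 years are read as 2060 and 2066

# Assumption: no birth date can be later than the interview date,
# so any date in the future is moved back by one century
interview <- as.Date("2013-03-01")
late <- !is.na(d) & d > interview
lt <- as.POSIXlt(d[late])
lt$year <- lt$year - 100
d[late] <- as.Date(lt)
d
```

A rule like this would need rethinking for data where both centuries are plausible, for example a sample that includes both young children and people over 100.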

And now we can find the age of these people using difftime() as before. Let's say I interviewed everyone on March 1st of this year and I want their age in years at interview. I divide the number of days by 365.25, the average length of a year, and round down:

dates2$age <- floor(as.numeric(difftime(chron("03/01/2013"), dates2$date.fmt, units = "days")) / 365.25)
dates2[c(1, 2, 4, 5)]

##   ID date_factor         date.fmt age
## 1  1     8/31/70   August 31 1970  42
## 2  2          NA             <NA>  NA
## 3  3    10/12/60  October 12 1960  52
## 4  4      1/1/66  January 01 1966  47
## 5  5    12/31/80 December 31 1980  32
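Dividing a day count by a fixed year length is always an approximation and can be off by one close to a birthday. If you need exact completed years, one option is to compare the calendar components directly. This is a sketch only; age_years() is a hypothetical helper I made up, not something from chron or base R.

```r
# Hypothetical helper: completed years between a birth date and a reference date
age_years <- function(dob, ref) {
  dob <- as.POSIXlt(dob)
  ref <- as.POSIXlt(ref)
  # Has the birthday already happened in the reference year?
  had_birthday <- (ref$mon > dob$mon) |
    (ref$mon == dob$mon & ref$mday >= dob$mday)
  (ref$year - dob$year) - ifelse(had_birthday, 0, 1)
}

births <- as.Date(c("1970-08-31", "1960-10-12", "1966-01-01", "1980-12-31"))
age_years(births, as.Date("2013-03-01"))
```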

I can also do things like add a day to everybody's date if I had some reason to do that, and I can compare dates to see which one came first, which can be useful:

# Add a day to everyone's date for some reason
dates2$date.fmt + 1

## [1] September 01 1970 <NA>              October 13 1960   January 02 1966   January 01 1981

# Compare the date to some other date to see which came first using < operator
dates2$date.fmt < chron("04/02/62")

## [1] FALSE    NA  TRUE FALSE FALSE

Ok finally what if your data is as horrible as what we see in our second column?

dates2[, c(1, 3)]

##   ID date_horrible
## 1  1       8/31/56
## 2  2    12-31-1977
## 3  3       12Aug55
## 4  4  July 31 1965
## 5  5     30jan1952

Chron won't help us here, as it needs one format for everyone:

# NO: chron needs the same format
chron(as.character(dates2$date_horrible))

## [1] 08/31/56 <NA>     <NA>     <NA>     <NA>

It mostly gives us NAs (and warnings, which I've hidden). However, there is a package called date that will take many common date layouts and figure them out for us. Watch how amazing it is: the only argument the function needs is the vector of character dates:

library(date)
# as.date (lower case) will correctly convert dates in vector
as.date(as.character(dates2$date_horrible))

## [1] 31Aug56 31Dec77 12Aug55 31Jul65 30Jan52

Extraordinary! :) It just knows. However, I find that it stubbornly reformats itself into the number of days since 1960 when adding it as a column to the dataframe:

dates2$date_autofmt <- as.date(as.character(dates2$date_horrible))
dates2[, c(1, 3, 6)]

##   ID date_horrible date_autofmt
## 1  1       8/31/56        -1218
## 2  2    12-31-1977         6574
## 3  3       12Aug55        -1603
## 4  4  July 31 1965         2038
## 5  5     30jan1952        -2893

It's also not super easy to work with because it's a date object, not a Date object (VERY case-sensitive stuff here!). Feel free to chime in here if you have a better solution, but my simple fix on that is just to wrap it in an as.Date() function (from base R) like so:

dates2$date_amazing <- as.Date(as.date(as.character(dates2$date_horrible)))
dates2[, c(1, 3, 7)]

##   ID date_horrible date_amazing
## 1  1       8/31/56   1956-08-31
## 2  2    12-31-1977   1977-12-31
## 3  3       12Aug55   1955-08-12
## 4  4  July 31 1965   1965-07-31
## 5  5     30jan1952   1952-01-30

Now it works great. That's pretty incredible that one short line of code can transform your awful dates column into perfectly coordinated and workable dates. You can use difftime() the same way as before. Hope that this was useful for those pulling their hair out over date and time objects in R.
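If you can't install the date package, a more manual route in base R is to recognize each layout with a regular expression and parse it with the matching format. The sketch below is my own illustration, not how as.date() works internally: the layouts table only covers the five layouts in this example, month names are matched in the current locale (English assumed here), and the century fix again assumes no date is later than the interview date.

```r
x <- c("8/31/56", "12-31-1977", "12Aug55", "July 31 1965", "30jan1952")

# Hypothetical lookup table: regular expression for a layout = its format
layouts <- c(
  "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2}$" = "%m/%d/%y",
  "^[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}$" = "%m-%d-%Y",
  "^[0-9]{1,2}[A-Za-z]{3}[0-9]{2}$"  = "%d%b%y",
  "^[0-9]{1,2}[A-Za-z]{3}[0-9]{4}$"  = "%d%b%Y",
  "^[A-Za-z]+ [0-9]{1,2} [0-9]{4}$"  = "%B %d %Y"
)

# Parse each entry with the format whose pattern it matches
parsed <- rep(as.Date(NA), length(x))
for (pattern in names(layouts)) {
  hit <- grepl(pattern, x)
  parsed[hit] <- as.Date(x[hit], format = layouts[[pattern]])
}

# Two-digit years 00-68 are read as 20xx; assume nothing is after the interview
late <- !is.na(parsed) & parsed > as.Date("2013-03-01")
lt <- as.POSIXlt(parsed[late])
lt$year <- lt$year - 100
parsed[late] <- as.Date(lt)
parsed
```

Anything that matches none of the patterns stays NA, which at least makes the unparsed rows easy to find.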

Also, if you made it all the way to the end, please enjoy this hilarious episode (http://www.youtube.com/watch?v=PD3aHKeFlSI) of Let's Make a Date on “Whose Line Is It Anyway?” featuring none other than Stephen Colbert. :D


Thursday, November 8, 2012

Data types part 2: Using classes to your advantage

Last week I talked about objects including scalars, vectors, matrices, dataframes, and lists. This post will show you how to use the objects (and their corresponding classes) you create in R to your advantage.

First off, it's important to remember that columns of dataframes are vectors. That is, if I have a dataframe called mydata, the columns mydata$Height and mydata$Weight are vectors. Numeric vectors can be multiplied or added together, squared, added or multiplied by a constant, etc. Operations on vectors are done element by element, meaning here row by row.

First, I read in a file of data, called mydata, using the read.csv() function. I get the dataframe below:

I check the classes of my objects using class(), or all at the same time with ls.str().

class(mydata$Weight)
class(mydata$Height)

or

ls.str()

So I see that mydata is a dataframe and all my columns are numeric (num). Now, if I want to create a new column in my dataset which calculates BMI, I can do some vector operations:

mydata$BMI<-mydata$Weight/(mydata$Height)^2 * 703

Which is the formula for BMI from weight in pounds and height in inches. Notice how if any component of the calculation is a missing (NA) value, R calculates the BMI as NA as well.
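To see the element-by-element behavior and the NA handling in isolation, here is a tiny sketch with a made-up dataframe (mydata_demo and its values are hypothetical, not the data from this post):

```r
# Hypothetical three-person dataset; the second weight is missing
mydata_demo <- data.frame(Weight = c(150, NA, 200),
                          Height = c(65, 70, 72))

# Same formula as above, applied row by row
mydata_demo$BMI <- mydata_demo$Weight / (mydata_demo$Height)^2 * 703

# The second BMI is NA because its Weight is NA
mydata_demo
```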

Now I can do summary statistics on my data and store those as a matrix. For example, I start with summary statistics on my Age vector:

summary(mydata$Age)

If I want to extract an element of this summary table, say the minimum, I can do

summary(mydata$Age)[1]

which extracts the first element (of 6) of the summary table.

But what I really want is a summary matrix of a bunch of variables: Age, Sex, and BMI. To do this I can rowbind the summary statistics of those three variables together using the rbind() function, but only take the 1st, 4th, and 6th elements of the summary table, which as you can see correspond to the Min, Mean, and Max. This creates a matrix, which I call summary.matrix:

summary.matrix<-rbind(summary(mydata$Age)[c(1,4,6)], summary(mydata$BMI)[c(1,4,6)], summary(mydata$Sex)[c(1,4,6)])

Rowbinding is basically stacking rows on top of each other. I add rownames and then print the class of my summary matrix and the results.

rownames(summary.matrix)<-c("Age", "BMI", "Sex")
class(summary.matrix)
summary.matrix

There is also a much more efficient way of doing this using the apply() function. Previously I had another post on the apply function, but I find that it takes a lot of examples to get comfortable with so here is another application.

apply() is a great example of using classes because it takes a dataframe (or matrix) as its first argument; here that is mydata, all rows, but only columns 2, 3, and 7. Setting MARGIN=2 tells it to work down the columns of this subsetted dataframe, each of which is a numeric vector, and for each column I calculate the mean and standard deviation, removing the NA's from consideration. I save the result in a matrix I call summary.matrix2.

summary.matrix2<-apply(mydata[,c(2,3,7)], MARGIN=2, FUN=function(x) c(mean(x,na.rm=TRUE), sd(x, na.rm=TRUE)))

I then rename the rows of this matrix and print the results, rounded to two decimal places. Notice how the format of the final matrix is different here. Above, the rows were the variables and the columns the summary statistics, while here it is reversed. I could have column-bound the summaries (cbind() instead of rbind()) in the first case and I would have gotten the matrix transposed to be like this one.

rownames(summary.matrix2)<-c("Mean", "Stdev")
round(summary.matrix2, 2)
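Here is the same idea on a small made-up dataframe (demo is hypothetical), with the rows named inside the function so no separate rownames() step is needed. I also include sapply() for comparison: apply() first converts the dataframe to a matrix, which only works well when every selected column is numeric, while sapply() goes through the columns one by one.

```r
# Hypothetical data with one missing BMI
demo <- data.frame(Age = c(25, 40, 61),
                   BMI = c(22.5, 30.1, NA))

# Function returning a named vector, so the result rows are labelled
mean_sd <- function(x) c(Mean = mean(x, na.rm = TRUE),
                         Stdev = sd(x, na.rm = TRUE))

by_apply  <- apply(demo, MARGIN = 2, FUN = mean_sd)
by_sapply <- sapply(demo, mean_sd)

round(by_apply, 2)
round(by_sapply, 2)
```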

Finally, I want to demonstrate how you can take advantage of scalars and vectors when graphing. Creating scalar and vector objects is really helpful when you are doing the same task multiple times. I give the example of creating a bunch of scatterplots.

I want to make a scatterplot for each of three variables (Height, Weight, and BMI) against age. Since all three scatterplots are going to be very similar, I want to standardize all of my plotting arguments including the range of ages, the plot symbols and the plot colors. I want to include a vertical line for the mean age and a title for each plot. The code is below:

##Assign numeric vector for the range of x-axis
agelimit<-c(20,80)

##Assign numeric single scalar to plotsymbols and meanage
plotsymbols<-2
meanage<-mean(mydata$Age)

##Assign single character words to plottype and plotcolor
plottype<-"p"
plotcolor<-"darkgreen"

##Assign a vector of characters to titletext
titletext<-c("Scatterplot", "vs Age")

Ok, so now that I have all those assigned, I can plot the three plots all together using the following code. Notice how most of the arguments are the same in each plot (except for the y variable and the main title) and I'm using the assigned objects I just created. The great part about this is that if I decide I actually want the plot color to be red, I can change it in just one place. You can think about how this would be useful in other situations (data cleaning, regressions, etc) when you do the same thing multiple times and then decide to change one little parameter. If you're not sure about the code below, I posted on the basics of plotting here.

##Plot area is 1 row, 3 columns
par(mfrow=c(1,3))

##Plot all three plots using the assigned objects
plot(mydata$Age, mydata$Height, xlab="Age", ylab="Height", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Height", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$Weight, xlab="Age", ylab="Weight", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Weight", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$BMI, xlab="Age", ylab="BMI", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "BMI", titletext[2]))
abline(v=meanage)

Notice how I do the main title with the paste statement. Paste() is useful for combining words and elements of another variable together into one phrase. The output looks like this, below. Pretty nice!
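Since paste() works on whole vectors, all three titles can also be built in one call. A small sketch (yvars and titles are names I made up; titletext is redefined here so the snippet runs on its own):

```r
titletext <- c("Scatterplot", "vs Age")
yvars <- c("Height", "Weight", "BMI")

# The single words are recycled against the vector of variable names
titles <- paste(titletext[1], yvars, titletext[2])
titles
```

Each element of titles could then be passed as the main argument of the corresponding plot() call.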

Thursday, November 1, 2012

Data types, part 1: Ways to store variables

I've been alluding to different R data types, or classes, in various posts, so I want to go over them in more detail. This is part 1 of a 3 part series on data types. In this post, I'll describe and give a general overview of useful data types. In parts 2 and 3, I'll show you in more detailed examples how you can use these data types to your advantage when you're programming.

When you program in R, you must always refer to various objects that you have created. This is in contrast to say, Stata, where you open up a dataset and any variables you refer to are columns of that dataset (with the exception of local macro variables and so on). So for example, if I have a dataset like the one below:

I can just say in Stata

keep if Age>25

and Stata knows that I am talking about the column Age of this dataset.

But in R, I can't do that because I get this error:

As the error indicates, 'Age' is not an object that I have created. This is because 'Age' is a column of the dataframe that is called "mydata". A dataframe, as we will see below, is an object (and in this case also a class) with certain properties. How do I know it's a dataframe? I can check with the class() function:
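Here is a small sketch of both points. The mydata below is a hypothetical stand-in for the dataset in this post, with made-up values, and the last line shows one way (an assumption on my part, not the only way) to do the equivalent of Stata's keep if Age>25.

```r
# Hypothetical stand-in for the post's mydata (made-up values)
mydata <- data.frame(ID = 1:4, Age = c(22, 31, 27, 19))

# Referring to Age on its own fails because no object called Age exists;
# tryCatch() captures the error message instead of stopping the script
age_error <- tryCatch(Age > 25, error = function(e) conditionMessage(e))
age_error

# class() reports what kind of object mydata is
class(mydata)

# One R counterpart to Stata's "keep if Age>25": name the dataframe explicitly
older <- mydata[mydata$Age > 25, ]
older
```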

What does it mean for "mydata" to be a dataframe? Well, there are many different ways to store variables in R (i.e. objects), which have corresponding classes. I enumerate the most common and useful subset of these objects below along with their description and class:

Object	Description	Class
Single Number or letter/word	Just a single number or character/word/phrase in quotes	Either numeric or character
Vector	A vector of either all numbers or all characters strung together	Either all numeric or all character
Matrix	Has columns and rows - all entries are of the same class	Either all numeric or all character
Dataframe	Like a matrix but columns can be different classes	data.frame
List	A bunch of different objects all grouped together under one name	list

There are other classes including factors, which are so useful that they will be a separate post in this blog, so for now I'll leave those aside. You can also make your own classes, but that's definitely beyond the scope of this introduction to objects and classes.

Ok, so here are some examples of different ways of assigning names to these objects and printing the contents on the screen. I chose to name my variables after what they are (like numeric.var or matrix.var), but of course you can name them anything you want with any mix of periods and underscores, lowercase and uppercase letters, e.g. id_number, Height.cm, BIRTH.YEAR.MONTH, firstname_lastname_middlename, etc. The only thing I would caution against is giving variables names like mean or median, since those are established functions in R and reusing the names can lead to confusing behavior.

1. Single number or character/word/phrase in quotation marks: just assign one number or one thing in quotes to the variable name

numeric.var<-10
character.var<-"Hello!"

2. Vector: use the c() function or a function like seq() or rep() to combine several numbers or character strings into one vector.

vector.numeric<-c(1,2,3,10)
vector.char<-rep("abc",3)
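As a quick sketch of how these pieces behave, the lines below also try seq() and check the classes with class(). The names vector.seq and vector.mixed are just illustrative.

```r
vector.numeric <- c(1, 2, 3, 10)
vector.char <- rep("abc", 3)

# seq() builds a regular sequence: here 0, 5, 10
vector.seq <- seq(from = 0, to = 10, by = 5)
vector.seq

# class() shows what kind of values each vector holds
class(vector.numeric)
class(vector.char)

# A vector holds only one class: mixing numbers and text turns everything into text
vector.mixed <- c(1, "a")
class(vector.mixed)
```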

3. Matrix: use the matrix() function to specify the entries, then the number of rows and the number of columns in the matrix. By default the entries fill the matrix column by column. Matrices are usually indexed using matrix notation, like [1,2] for row 1, column 2. More about indexing in my previous post on subsetting.

matrix.numeric<-matrix(data=c(1:6),nrow=3,ncol=2)
matrix.character<-matrix(data=c("a","b","c","d"), nrow=2, ncol=2)
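A short sketch of matrix indexing, using the same matrix.numeric as above:

```r
matrix.numeric <- matrix(data = c(1:6), nrow = 3, ncol = 2)

# Entries fill column by column: column 1 is 1,2,3 and column 2 is 4,5,6
matrix.numeric

# Row 1, column 2
matrix.numeric[1, 2]

# Leaving an index blank selects everything along that dimension
matrix.numeric[, 2]   # the whole second column
matrix.numeric[3, ]   # the whole third row

# dim() gives the number of rows and columns
dim(matrix.numeric)
```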

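4. Dataframe: use the data.frame() function to combine variables together. Here I wrap the variables in the cbind() function to "column bind" them, although this is not required: you can also pass the columns straight to data.frame(). Be aware that cbind() first builds a matrix, so every entry is converted to character before the dataframe is made. In general, a dataframe can mix numeric columns with character columns, which is not possible in matrices. If I want to refer to a specific column, I use the $ operator, like dataframe.var$ID for the second column.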

dataframe.var<-data.frame(cbind(School=1, ID=1:5, Test=c("math","read","math","geo","hist")))

Alternatively, any dataset you pull into R using the read.csv(), read.dta(), or read.xport() functions (see my blog post about this here), will automatically be a dataframe.

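What's important to note about dataframes is that the variables in your dataframe also have classes. So for example, the class of the whole dataframe is "data.frame", but the class of the ID column is "factor" (in R versions before 4.0.0; from 4.0.0 onward it is "character"). The ID column is not numeric here because cbind() converted every entry to character before data.frame() was called.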
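To check the class of every column at once, and to see how the classes change when the columns are passed to data.frame() directly, here is a hedged sketch. The name dataframe.direct is just for illustration, and stringsAsFactors=FALSE is set explicitly so the result is the same across R versions.

```r
# The version from this post: cbind() makes a character matrix first
dataframe.var <- data.frame(cbind(School = 1, ID = 1:5,
                                  Test = c("math", "read", "math", "geo", "hist")))

# Every column is "factor" (R before 4.0.0) or "character" (R 4.0.0 and later)
sapply(dataframe.var, class)

# Passing the columns directly keeps each column's own class
dataframe.direct <- data.frame(School = 1, ID = 1:5,
                               Test = c("math", "read", "math", "geo", "hist"),
                               stringsAsFactors = FALSE)
sapply(dataframe.direct, class)

# The $ operator pulls out one column
dataframe.direct$ID
```

Which version you want depends on what you plan to do with the columns; the direct version is usually easier if you need to do arithmetic on ID or School.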

Again, I'll go into factors in another post and how to change back and forth between factors and numeric or character classes.

5. List: use the list() function and list all of the objects you want to include. The list combines all the objects together and has a specific indexing convention, the double square bracket like so: [[1]]. I will go into lists in another post.

list.var<-list(numeric.var, vector.char, matrix.numeric, dataframe.var)
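As a brief sketch of the double square bracket convention, here is a shorter list with just two elements. The names short.list and named.list are hypothetical and only used for this illustration.

```r
numeric.var <- 10
vector.char <- rep("abc", 3)
short.list <- list(numeric.var, vector.char)

# Double brackets return the object stored in that position
short.list[[2]]
class(short.list[[2]])

# Single brackets return a smaller list that still wraps the object
class(short.list[2])

# List elements can also be given names and then pulled out with $
named.list <- list(number = numeric.var, words = vector.char)
named.list$number
```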

To know what kinds of objects you have created and thus what is in your local memory, use the ls() function like so:

To remove an object, you do:

rm(character.var)

and to remove all objects, you can do:

rm(list=ls())
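If you want to try ls() and rm() without wiping out your own workspace, one option is to practice inside a separate environment. This is only a sketch: demo.env is a hypothetical name, and the envir argument points ls() and rm() at that environment instead of your workspace.

```r
# A scratch environment so nothing in the real workspace is touched
demo.env <- new.env()
assign("numeric.var", 10, envir = demo.env)
assign("character.var", "Hello!", envir = demo.env)

# ls() lists the names of the objects that exist
ls(demo.env)

# Remove one object by name
rm(list = "character.var", envir = demo.env)
ls(demo.env)

# Remove everything that is left
rm(list = ls(demo.env), envir = demo.env)
length(ls(demo.env))
```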

So that was a brief introduction to objects and classes. Next week, I'll go into how these are useful for easier and more efficient programming.

Why R for public health?

I created this blog to help public health researchers who are used to Stata or SAS begin using R. I find that public health data is unique, and this blog is meant to address the specific data management and analysis needs of the world of public health.

R is a very powerful tool for programming but can have a steep learning curve. In my experience, people find it easier to do it the long way with another programming language, rather than try R, because it just takes longer to learn. I think all statistical packages are useful and have their place in the public health world. However, I am a strong proponent of R and I hope this blog can help you move toward using it when it makes sense for you.

Please email me with posts you would like to see or R questions, and I'll try my best to answer them. Thanks for following!