
Thursday, July 11, 2013

Let's Make a Date: Date and Time classes in R

Of all the frustrating data manipulations to deal with in any programming language, dates and times are, in my opinion, the worst. In R, many different packages offer their own functions for dates, which leads to different date classes that are not always compatible. Depending on how your data is organized, there are different solutions to your date and time problems. Here, I'll show the way that I think is easiest to deal with dates depending on the organization of your data.

Ok so I find in public health data that birthdays and death days are the most common dates to be dealing with. In a dataset like the DHS, dates can either be found as three separate integer variables (month, day, year) or they can be in one character variable like “05/03/2009”. So let's start with the first situation, numeric dates.

Numeric dates

#Create some data
dates<-as.data.frame(cbind(c(1,3,6,11,4,12,5,3), 
                           c(30,14,NA,NA,16,NA,20,31), 
                           c(1980, 1980, 1980, 1983,1983, 1983, 1986, 1980), 
                           c(2, NA, NA, NA, NA, 12, 4, NA), 
                           c(2, NA, NA, NA, NA, NA, 29, NA), 
                           c(1980, NA, NA, 1985, NA, 1983, 1987, NA)))
colnames(dates)<-c("birth_month", "birth_day", "birth_year", "death_month", "death_day", "death_year")
dates
##   birth_month birth_day birth_year death_month death_day death_year
## 1           1        30       1980           2         2       1980
## 2           3        14       1980          NA        NA         NA
## 3           6        NA       1980          NA        NA         NA
## 4          11        NA       1983          NA        NA       1985
## 5           4        16       1983          NA        NA         NA
## 6          12        NA       1983          12        NA       1983
## 7           5        20       1986           4        29       1987
## 8           3        31       1980          NA        NA         NA

I've included a lot of missing cells, even in birth date, because that's the most common problem that I have - birth and death data, especially from developing countries, is full of missing birth months or days.

Ok, so what I would like to do here is to figure out if I have any infant mortalities in my sample size of 8. I find the easiest way to create a date object from three integer month/day/year variables is by using ISOdate(), which is just in base R. ISOdate() follows the following syntax:

ISOdate(year, month, day, hour = 12, min = 0, sec = 0, tz = "GMT")

If you had hours and minutes, you could pass them through the hour, min and sec arguments, or use the ISOdatetime() function, which takes the same arguments but has no defaults for the time parts. So let's create a birth date:

dates$DOB <- ISOdate(dates$birth_year, dates$birth_month, dates$birth_day)
dates
##   birth_month birth_day birth_year death_month death_day death_year                 DOB
## 1           1        30       1980           2         2       1980 1980-01-30 12:00:00
## 2           3        14       1980          NA        NA         NA 1980-03-14 12:00:00
## 3           6        NA       1980          NA        NA         NA                <NA>
## 4          11        NA       1983          NA        NA       1985                <NA>
## 5           4        16       1983          NA        NA         NA 1983-04-16 12:00:00
## 6          12        NA       1983          12        NA       1983                <NA>
## 7           5        20       1986           4        29       1987 1986-05-20 12:00:00
## 8           3        31       1980          NA        NA         NA 1980-03-31 12:00:00

This is super easy. The column DOB that we have added to the dataframe is of class POSIXct, which is a calendar date-and-time class. Since I don't have any time variables, I don't want the time crowding up my dataset, so I can use the function strptime() to get rid of it. Here's the syntax:

strptime(x, format, tz = "")

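where x is a character vector (a date-time object like DOB is converted to character first), and format describes how that text is laid out (see below). The result is of class POSIXlt. The strptime help file (http://astrostatistics.psu.edu/su07/R/html/base/html/strptime.html) is a good place to understand the formats.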

dates$DOB <- strptime(dates$DOB, format = "%Y-%m-%d")
dates$DOD <- strptime(ISOdate(dates$death_year, dates$death_month, dates$death_day), format = "%Y-%m-%d")

You can, of course, do this in one step, which I've done for death date. Now we can look at our dataset:

dates
##   birth_month birth_day birth_year death_month death_day death_year        DOB        DOD
## 1           1        30       1980           2         2       1980 1980-01-30 1980-02-02
## 2           3        14       1980          NA        NA         NA 1980-03-14       <NA>
## 3           6        NA       1980          NA        NA         NA       <NA>       <NA>
## 4          11        NA       1983          NA        NA       1985       <NA>       <NA>
## 5           4        16       1983          NA        NA         NA 1983-04-16       <NA>
## 6          12        NA       1983          12        NA       1983       <NA>       <NA>
## 7           5        20       1986           4        29       1987 1986-05-20 1987-04-29
## 8           3        31       1980          NA        NA         NA 1980-03-31       <NA>
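If all you need is the calendar date, another option is to convert straight to the Date class with as.Date(), which has no time component at all. This is a sketch of an alternative, not what the rest of this post uses, and the small birth data frame is made up for illustration:

```r
# Made-up example data (not the dates data frame from the post)
birth <- data.frame(birth_year  = c(1980, 1980, 1983),
                    birth_month = c(1, 6, 4),
                    birth_day   = c(30, NA, 16))

# ISOdate() returns a POSIXct date-time at 12:00 GMT
dob <- ISOdate(birth$birth_year, birth$birth_month, birth$birth_day)
class(dob)

# as.Date() keeps only the calendar date; a missing day still gives NA
dob_date <- as.Date(dob)
class(dob_date)
format(dob_date)
```

Date objects work with difftime() in the same way as the date-time classes.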

Now if we want age at death, we can use the difftime() function that follows the following syntax and produces a difftime object (which you can convert to numeric using as.numeric() if you want to, which I highly recommend):

difftime(time1, time2, tz, units = c("auto", "secs", "mins", "hours", "days", "weeks"))

dates$Age.atdeath <- difftime(dates$DOD, dates$DOB, units = "days")
dates$Age.atdeath
## Time differences in days
## [1]   3  NA  NA  NA  NA  NA 344  NA
## attr(,"tzone")
## [1] ""
class(dates$Age.atdeath)
## [1] "difftime"
# check if there were any infant mortalities
dates$Age.atdeath < 365
## [1] TRUE   NA   NA   NA   NA   NA TRUE   NA
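Here is a small self-contained sketch of the same check with the difftime converted to plain numbers. The three date pairs are typed in by hand, and treating a missing age as "not an infant death" is an assumption you may not want in a real analysis:

```r
# Hand-typed example dates (two deaths, one person with no death date)
dob <- as.Date(c("1980-01-30", "1980-03-14", "1986-05-20"))
dod <- as.Date(c("1980-02-02", NA, "1987-04-29"))

# difftime() gives a difftime object; as.numeric() strips it to plain days
age_days <- as.numeric(difftime(dod, dob, units = "days"))
age_days      # 3 NA 344

# Assumption: a missing age counts as FALSE rather than NA
infant_death <- !is.na(age_days) & age_days < 365
infant_death  # TRUE FALSE TRUE
sum(infant_death)
```

Working with a plain numeric vector makes it easier to count or tabulate the result, which is why I recommend the conversion.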

Ok, I found two infant mortalities, but I see that there's a problem. Three birth dates come up as NA because the birth day is missing, and two of those people have at least a death year recorded. I do have birth month and birth year for all of them, and this is information I don't want to lose. There are many ways to deal with this, including imputing birth days or assigning them randomly from a uniform distribution or whatever.

In this case, what I will do is very simple - just replace the missing birth day with 1 if it's missing (and similarly replace a missing birth month with 1 if it's missing) and replace missing death month and day with 12 and 30, respectively. That way I have the maximum possible age at death and I don't lose potentially important information. There are a number of ways to accomplish what I want to do, but I love using the ifelse() function because I find it extremely intuitive so I will do that:

dates$DOB2<-strptime(ISOdate(year=dates$birth_year, 
                             month=ifelse(is.na(dates$birth_month), 1, dates$birth_month), 
                             day=ifelse(is.na(dates$birth_day),1, dates$birth_day)), 
                     format="%Y-%m-%d")

dates$DOD2<-strptime(ISOdate(year=dates$death_year, 
                             month=ifelse(is.na(dates$death_month),12,dates$death_month), 
                             day=ifelse(is.na(dates$death_day),30, dates$death_day)), 
                     format="%Y-%m-%d")

dates$Ageatdeath_2<-as.numeric(difftime(dates$DOD2,dates$DOB2,units="days"))

dates[,c(1:6,10:12)]
##   birth_month birth_day birth_year death_month death_day death_year       DOB2       DOD2 Ageatdeath_2
## 1           1        30       1980           2         2       1980 1980-01-30 1980-02-02            3
## 2           3        14       1980          NA        NA         NA 1980-03-14       <NA>           NA
## 3           6        NA       1980          NA        NA         NA 1980-06-01       <NA>           NA
## 4          11        NA       1983          NA        NA       1985 1983-11-01 1985-12-30          790
## 5           4        16       1983          NA        NA         NA 1983-04-16       <NA>           NA
## 6          12        NA       1983          12        NA       1983 1983-12-01 1983-12-30           29
## 7           5        20       1986           4        29       1987 1986-05-20 1987-04-29          344
## 8           3        31       1980          NA        NA         NA 1980-03-31       <NA>           NA

So now we see above that I have all birthdays completed, and I was able to reveal that I have another infant mortality in my data, and another death more generally.
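If you need this substitution in more than one place, the ifelse() calls can be folded into a small helper. This is only a sketch: the name impute_date and its arguments are made up here, and it returns a Date rather than the POSIXlt used above.

```r
# Hypothetical helper: build a Date, filling missing month/day with defaults
impute_date <- function(year, month, day, default_month, default_day) {
  month <- ifelse(is.na(month), default_month, month)
  day   <- ifelse(is.na(day), default_day, day)
  as.Date(ISOdate(year, month, day))
}

# Rows 4 and 6 of the example data: birth day missing, death partly missing
dob <- impute_date(c(1983, 1983), c(11, 12), c(NA, NA), 1, 1)
dod <- impute_date(c(1985, 1983), c(NA, 12), c(NA, NA), 12, 30)
as.numeric(dod - dob)   # 790 29

# Caveat: a fixed day of 30 does not exist in February, so this is NA
is.na(ISOdate(1980, 2, 30))
```

A missing year still gives NA, which is what we want, because there is nothing sensible to fill in there.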

Character Dates

Ok so onto the second type of dataset you may encounter, which is that somebody entered the dates into Excel or whatever like this:

dates2<-as.data.frame(cbind(c(1:5), 
                            c("8/31/70", "NA", "10/12/60", "1/1/66", "12/31/80"), 
                            c("8/31/56", "12-31-1977", "12Aug55", "July 31 1965" ,"30jan1952")))
colnames(dates2)<-c("ID", "date_factor", "date_horrible")
dates2
##   ID date_factor date_horrible
## 1  1     8/31/70       8/31/56
## 2  2          NA    12-31-1977
## 3  3    10/12/60       12Aug55
## 4  4      1/1/66  July 31 1965
## 5  5    12/31/80     30jan1952

In the first column at least all of the dates have the same format, but in the second (which happens really often!), every date is a different format which seems, at first, like a total nightmare. But fortunately R is here to rescue us.

Ok, let's start with the easy one. The tricky part with this kind of data is that R often immediately converts the dates to factors (this is the default in R versions before 4.0.0; newer versions keep them as character), like so:

class(dates2$date_factor)
## [1] "factor"

This happens if you are reading in a csv file as well. You can either stop R from doing that by using the as.is option like so (assuming your dates were the second and third columns):

df <- read.table("data_with_dates.txt", header = TRUE, as.is = 2:3)

or we just have to remember to use the as.character() function before we do anything. Ok, so the point here is that even though these look like dates, they are not dates. They are factors or maybe characters, but not manipulatable like you would want. For example, this gives you the following error:

# NO: this gives an error, you can't do this with characters, need the date format
dates2$age <- difftime("02/27/13", as.character(dates2$date_factor), unit = "days")
## Error: character string is not in a standard unambiguous format

So we need to get these character dates into actual date formats. If all of our formats are the same, we can use the chron package and the chron() function really easily like so.

chron(dates., times., format = c(dates = "m/d/y", times = "h:m:s"), out.format, origin.)

where dates is the vector of character dates, and the format is whatever the format is of the data you have. Note that it yields warning messages because of the missing value, but that is ok.

library(chron)
dates2$date.fmt <- chron(as.character(dates2$date_factor), format = "m/d/y")
## Warning: wrong number of fields in entry(ies) 2
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
class(dates2$date.fmt)
## [1] "dates" "times"
dates2[, c(1, 2, 4)]
##   ID date_factor date.fmt
## 1  1     8/31/70 08/31/70
## 2  2          NA     <NA>
## 3  3    10/12/60 10/12/60
## 4  4      1/1/66 01/01/66
## 5  5    12/31/80 12/31/80

Note that it looks very similar to the date_factor column but it is not, because it is now a dates and times class as I show above. We can also change how we want the outgoing format to look:

dates2$date.fmt <- chron(as.character(dates2$date_factor), format = "m/d/y", out.format = "month day year")
dates2[, c(1, 2, 4)]
##   ID date_factor         date.fmt
## 1  1     8/31/70   August 31 1970
## 2  2          NA             <NA>
## 3  3    10/12/60  October 12 1960
## 4  4      1/1/66  January 01 1966
## 5  5    12/31/80 December 31 1980
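If you would rather stay in base R, as.Date() with a format string can read the same column. The sketch below is an alternative to chron(), not part of the workflow in this post, and it has one trap: with %y, two-digit years 00 to 68 are read as 20xx, so "10/12/60" becomes 2060. The fix shown assumes every value is a birth date on or before March 1, 2013, the interview date used in this post.

```r
x <- c("8/31/70", NA, "10/12/60", "1/1/66", "12/31/80")

# %y maps 00-68 to 20xx and 69-99 to 19xx
d <- as.Date(x, format = "%m/%d/%y")
format(d)        # two of these land in the 2060s

# Assumption: no birth date can be after the interview date,
# so anything in the future belongs one century earlier
in_future <- !is.na(d) & d > as.Date("2013-03-01")
lt <- as.POSIXlt(d)
lt$year[in_future] <- lt$year[in_future] - 100L
d_fixed <- as.Date(lt)
format(d_fixed)
```

Whichever function you use, it is worth checking the range of the converted dates before calculating any ages.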

And now we can find the age of these people using difftime() as before. Let's say I interviewed everyone on March 1st of this year and I want their age in years at interview:

dates2$age <- as.numeric(floor(difftime(chron("03/01/2013"), dates2$date.fmt, units = "days")/365.25))
dates2[c(1, 2, 4, 5)]
##   ID date_factor         date.fmt age
## 1  1     8/31/70   August 31 1970  42
## 2  2          NA             <NA>  NA
## 3  3    10/12/60  October 12 1960  52
## 4  4      1/1/66  January 01 1966  47
## 5  5    12/31/80 December 31 1980  32
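Dividing a day count by a fixed year length is only an approximation, and it can be off by one for people interviewed close to their birthday. As a sketch of a stricter alternative, you can compare the calendar fields directly. The helper name age_years is made up here, and the birth dates are typed in as base R Date objects rather than chron objects:

```r
# Hypothetical helper: completed years between a birth date and a reference date
age_years <- function(birth, on) {
  b <- as.POSIXlt(birth)
  o <- as.POSIXlt(on)
  # Has the birthday already happened in the reference year?
  had_birthday <- (o$mon > b$mon) | (o$mon == b$mon & o$mday >= b$mday)
  (o$year - b$year) - ifelse(had_birthday, 0, 1)
}

birth <- as.Date(c("1970-08-31", NA, "1960-10-12", "1966-01-01", "1980-12-31"))
age_years(birth, as.Date("2013-03-01"))   # 42 NA 52 47 32
```

Missing birth dates simply stay NA.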

I can also do things like add a day to everybody's date if I had some reason to do that, and I can compare dates to see which one came first, which can be useful:

# Add a day to everyone's date for some reason
dates2$date.fmt + 1
## [1] September 01 1970 <NA>              October 13 1960   January 02 1966   January 01 1981
# Compare the date to some other date to see which came first using < operator
dates2$date.fmt < chron("04/02/62")
## [1] FALSE    NA  TRUE FALSE FALSE

Ok finally what if your data is as horrible as what we see in our second column?

dates2[, c(1, 3)]
##   ID date_horrible
## 1  1       8/31/56
## 2  2    12-31-1977
## 3  3       12Aug55
## 4  4  July 31 1965
## 5  5     30jan1952

Chron won't help us here, as it needs one format for everyone:

# NO: chron needs the same format
chron(as.character(dates2$date_horrible))
## [1] 08/31/56 <NA>     <NA>     <NA>     <NA>

It just gives us NAs (and warnings, which I've hidden). However, there is a package called date that will recognize many common date formats and figure them out for us. Watch how amazing it is, the only argument to the function you need is the vector of character dates:

library(date)
# as.date (lower case) will correctly convert dates in vector
as.date(as.character(dates2$date_horrible))
## [1] 31Aug56 31Dec77 12Aug55 31Jul65 30Jan52

Extraordinary! :) It just knows. However, I find that it stubbornly reformats itself into the number of days since 1960 when adding it as a column to the dataframe:

dates2$date_autofmt <- as.date(as.character(dates2$date_horrible))
dates2[, c(1, 3, 6)]
##   ID date_horrible date_autofmt
## 1  1       8/31/56        -1218
## 2  2    12-31-1977         6574
## 3  3       12Aug55        -1603
## 4  4  July 31 1965         2038
## 5  5     30jan1952        -2893

It's also not super easy to work with because it's a date object, not a Date object (VERY case-sensitive stuff here!). Feel free to chime in here if you have a better solution, but my simple fix for that is just to wrap it in an as.Date() function (from base R) like so:

dates2$date_amazing <- as.Date(as.date(as.character(dates2$date_horrible)))
dates2[, c(1, 3, 7)]
##   ID date_horrible date_amazing
## 1  1       8/31/56   1956-08-31
## 2  2    12-31-1977   1977-12-31
## 3  3       12Aug55   1955-08-12
## 4  4  July 31 1965   1965-07-31
## 5  5     30jan1952   1952-01-30

Now it works great. That's pretty incredible that one short line of code can transform your awful dates column into perfectly coordinated and workable dates. You can use difftime() the same way as before. Hope that this was useful for those pulling their hair out over date and time objects in R.
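For anyone who prefers not to depend on the date package, here is one base R possibility, sketched under the assumption that you can list the formats that occur in your column. The helper name parse_mixed is made up, the three example strings use four-digit years to sidestep the two-digit-year question, and month names are matched in the current locale (English here):

```r
# Hypothetical helper: try each format in turn, filling in only what is still NA
parse_mixed <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (f in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = f)
  }
  out
}

x <- c("12-31-1977", "July 31 1965", "30jan1952")
parsed <- parse_mixed(x, c("%m-%d-%Y", "%B %d %Y", "%d%b%Y"))
format(parsed)
```

The order of the formats matters when a string could match more than one of them, so put the strictest formats first.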

Also, if you made it all the way to the end, please enjoy this hilarious episode (http://www.youtube.com/watch?v=PD3aHKeFlSI) of Let's Make a Date on "Whose Line Is It Anyway?" featuring none other than Stephen Colbert. :D



Monday, January 14, 2013

For loops (and how to avoid them)

My experience when starting out in R was trying to clean and recode data using for() loops, usually with a few if() statements in the loop as well, and finding the whole thing complicated and frustrating.

In this post, I'll go over how you can avoid for() loops for both improving the quality and speed of your programming, as well as your sanity.

So here we have our classic dataset called mydata.Rdata (you can download this if you want, link at the right):



And if I were in Stata and wanted to create an age group variable, I could just do:

gen Agegroup=1
replace Agegroup=2 if Age>10 & Age<20
replace Agegroup=3 if Age>=20

But when I try this in R, it fails:
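A sketch of the kind of attempt that goes wrong, using a made-up stand-in for mydata (the ages are assumptions, not the real dataset). The tryCatch() wrapper is only there so the message can be captured and printed:

```r
# Made-up stand-in for mydata with 10 observations
mydata <- data.frame(Age = c(5, 12, 18, 25, 40, 8, 15, 22, 31, 19))

# if() expects a single TRUE/FALSE, but mydata$Age < 10 has 10 values
msg <- tryCatch({
  if (mydata$Age < 10) "young" else "not young"
}, warning = function(w) conditionMessage(w),
   error   = function(e) conditionMessage(e))
msg
```

Depending on your version of R, the message arrives as a warning or as an error, but in both cases it complains that the condition has length greater than 1.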







Why does it fail? It fails because Age is a vector, so the condition if(mydata$Age<10) is asking "is the vector Age less than 10", which is not what we want to know. We want to ask, row by row, whether each element of Age is less than 10, so we need to specify the element of the vector we're referring to. We don't specify the element, and thus we get the warning "only the first element will be used" (since R 4.2.0 this is an outright error). So when this fails, the first way people try to solve this problem is with a crazy for() loop like this:

###########Unnecessarily long and ugly code below#######
mydata$Agegroup1<-0

for (i in  1:10){
  if(mydata$Age[i]>10 & mydata$Age[i]<20){
    mydata$Agegroup1[i]<-1
  }
  if(mydata$Age[i]>=20){
    mydata$Agegroup1[i]<-2
  }
}

Here we tell R to go down the rows from i=1 to i=10, and for each of those rows indexed by i, check to see what value of Age it is, and then assign Agegroup a value of 1 or 2. This works, but at a high cost: you can easily make a mistake with all those indexed vectors, and an element-by-element loop like this is usually much slower in R than a vectorized function, which would be a big deal if this dataset were 10000 observations instead of 10.

So how can we avoid doing this?

One of the most useful functions I have found is one that I have referred to a number of times in my blog so far - the ifelse() function.  The ifelse() function evaluates a condition, and then assigns a value if it's true and a value if it's false.  The great part about it is that it can read in a vector and check each element of the vector one by one so you don't need indices or a loop. You don't even need to initialize some new variable before you run the statement.  Like this:

mydata$newvariable<-ifelse(Condition of some variable,
                    Value of new variable if condition is true,
                    Value of new variable if condition is false)

so for example:

mydata$Old<-ifelse(mydata$Age>40,1,0)

This says, check to see if the elements of the vector mydata$Age are greater than 40: if an element is greater than 40, it assigns the value of 1 to mydata$Old, and if it's not greater than 40, it assigns the value of 0 to mydata$Old.

But we wanted to assign values 0, 1, and 2 to an Agegroup variable.  To do this, we can use nested ifelse() statements:

mydata$Agegroup2<-ifelse(mydata$Age>10 & mydata$Age<20,1,     
                  ifelse(mydata$Age>=20, 2,0))

Now this says, first check whether each element of the Age vector is >10 and <20.  If it is, assign 1 to Agegroup2.  If it's not, then evaluate the next ifelse() statement, whether Age>=20 (the same cutoff as in the loop).  If it is, assign Agegroup2 a value of 2.  If it's not any of those, then assign it 0.  We can see that both the loop and the ifelse() statements give us the same result:
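One way to check this yourself, sketched with made-up ages (including the boundary values 10 and 20) and using >= 20 in both versions so that an age of exactly 20 lands in the same group:

```r
# Made-up ages standing in for mydata$Age
mydata <- data.frame(Age = c(5, 10, 12, 19, 20, 25, 33, 8, 15, 41))

# Loop version
mydata$Agegroup1 <- 0
for (i in seq_len(nrow(mydata))) {
  if (mydata$Age[i] > 10 && mydata$Age[i] < 20) mydata$Agegroup1[i] <- 1
  if (mydata$Age[i] >= 20) mydata$Agegroup1[i] <- 2
}

# Vectorized version with nested ifelse()
mydata$Agegroup2 <- ifelse(mydata$Age > 10 & mydata$Age < 20, 1,
                    ifelse(mydata$Age >= 20, 2, 0))

identical(mydata$Agegroup1, mydata$Agegroup2)   # TRUE
```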


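You can nest ifelse() statements as much as you like. Just be careful about your final category: it assigns the last value to whatever values are left over that didn't meet any condition, so make sure you want that to happen. Missing values are the exception. When the condition itself evaluates to NA, ifelse() returns NA for that element, so handle NAs explicitly with is.na() if you want them recoded.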
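A quick check of how the leftover category and missing values behave, with made-up ages. The 9999 code for missing is just an example value:

```r
age <- c(5, 15, NA, 25)

# The comparison on an NA age is itself NA, so the result is NA
grp <- ifelse(age > 10 & age < 20, 1,
       ifelse(age >= 20, 2, 0))
grp            # 0 1 NA 2

# To give missing ages their own code, test for NA first
grp_coded <- ifelse(is.na(age), 9999, grp)
grp_coded      # 0 1 9999 2
```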


Other examples of ways to use the ifelse() function:
  • If you want to add a column with the mean of Weight by sex for each individual, you can do this with ifelse() like this:
mydata$meanweight.bysex<-ifelse(mydata$Sex==0,  
               mean(mydata$Weight[mydata$Sex==0], na.rm=TRUE),         
               mean(mydata$Weight[mydata$Sex==1], na.rm=TRUE))



  • If you want to recode missing values:
mydata$Height.recode<-ifelse(is.na(mydata$Height),
                      9999, 
                      mydata$Height)

  • If you want to combine two variables together into a new one, such as to create a new ID variable based on year (which I added to this dataframe) and ID:
mydata$ID.long<-ifelse(mydata$ID<10, 
                paste(mydata$year, "-0",mydata$ID,sep=""), 
                paste(mydata$year, "-", mydata$ID, sep=""))
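For the first and last of these examples, base R also has functions that do the same job without spelling out each group. This is a sketch of alternatives, not a replacement for the ifelse() versions above. The data frame is made up, and ave() and sprintf() are swapped in here for illustration:

```r
# Made-up data with the same column names as in the examples above
mydata <- data.frame(ID = c(1, 2, 10, 11), year = 2012,
                     Sex = c(0, 1, 0, 1), Weight = c(60, 80, NA, 90))

# ave() computes a statistic within each group and returns it for every row
mydata$meanweight.bysex <- ave(mydata$Weight, mydata$Sex,
                               FUN = function(w) mean(w, na.rm = TRUE))
mydata$meanweight.bysex   # 60 85 60 85

# sprintf() pads the ID with a leading zero where needed
mydata$ID.long <- sprintf("%d-%02d", as.integer(mydata$year), as.integer(mydata$ID))
mydata$ID.long            # "2012-01" "2012-02" "2012-10" "2012-11"
```

The ave() version keeps working if Sex has more than two categories, which the two-branch ifelse() does not.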



Other ways to avoid the for loop:

  • The apply functions:  If you think you have to use a loop because you have to apply some sort of function to each observation in your data, think again! Use the apply() family of functions (apply(), sapply(), lapply() and so on) instead.
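  • You can also use other functions such as cut() to do the age grouping above. Here's the post on how this function works, so I won't go over it again, except to say two things. By default cut() closes each interval on the right, so with the breaks below an Age of exactly 20 falls in group 1, not group 2 as in the loop. And if you convert from a factor to a numeric, *always* convert to a character before converting it to numeric: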
mydata$Agegroup3<-as.numeric(as.character(cut(mydata$Age, c(0,10,20,100),labels=0:2)))
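To make the apply functions mentioned above a little more concrete, here is a minimal sketch with a made-up two-column data frame. sapply() runs a function over each column, and apply() with a margin of 1 runs it over each row:

```r
# Made-up measurements with a couple of missing values
mydata <- data.frame(Weight = c(60, 72, NA, 85),
                     Height = c(160, 175, 168, NA))

# One mean per column, no loop needed
col_means <- sapply(mydata, mean, na.rm = TRUE)
col_means

# Largest non-missing value in each row (1 = rows, 2 = columns)
row_max <- apply(mydata, 1, max, na.rm = TRUE)
row_max
```

Extra arguments such as na.rm = TRUE are passed along to the function being applied.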


Basically, any time you think you have to do a loop, think about how you can do it with another function. It will save you a lot of time and mistakes in your code.