R for Public Health: January 2013

Sunday, January 20, 2013

Translating Weird R Errors

I love R. I think it's intuitive and clever and overall a great language. But I do get really annoyed sometimes at the completely ridiculous, cryptic error messages it often gives me. This post will go over some of those seemingly nonsensical errors so you don't have to go crazy trying to find the bug in your code.

1. all arguments must have the same length

To start with, I just make up some quick data:

# Build the data frame directly, naming the columns as we go
prob1<-data.frame(Education=c(1,2,3), Ethnicity=c(5,4,3))

And now I just want to do a simple table but I get this error:

What the heck. I look back at my dataset and make sure that both those variables are the same length, which they are. The problem here is that I misspelled "Education". There's a missing "a" in there and instead of telling me that I referenced a variable that doesn't exist, R bizarrely tells me to check the length of my variables. Remember: Anytime you get an error, check to make sure you've spelled everything right.
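As a minimal sketch of the failing call described above, assuming the typo was Eduction (Education without its "a"), the error can be reproduced and inspected like this:

```r
# Same toy data as above
prob1 <- as.data.frame(cbind(c(1, 2, 3), c(5, 4, 3)))
colnames(prob1) <- c("Education", "Ethnicity")

# Assumed typo: "Eduction" is not a column, so prob1$Eduction is NULL
is.null(prob1$Eduction)

# table() then receives one length-0 argument and one length-3 argument
msg <- tryCatch(
  table(prob1$Eduction, prob1$Ethnicity),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Running names(prob1), or is.null() on the column you think you have, is a quick way to catch this kind of typo before R gets a chance to confuse you.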

If I do this, everything works out great:
table(prob1$Education, prob1$Ethnicity)

2. replacement has 0 rows, data has 3

A very similar problem, with a very different error message. Let's say I forgot what columns were in my prob1 data and I thought I had a Sex indicator in there. So I try to recode it like this:

This error message is also pretty unhelpful. The syntax is totally correct; the problem is that I just don't have a variable named Sex in my dataset. If I do this instead to recode education, a variable that exists, everything is fine:

prob1$Educ_recode<-as.numeric(prob1$Education==2)
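For comparison, here is a sketch of the kind of call that triggers the message. The exact failing recode isn't reproduced in this post, so the name Sex_recode and the comparison to 1 are assumptions:

```r
prob1 <- as.data.frame(cbind(c(1, 2, 3), c(5, 4, 3)))
colnames(prob1) <- c("Education", "Ethnicity")

# prob1$Sex does not exist, so it is NULL and the comparison has length 0
length(prob1$Sex == 1)

# Assigning a length-0 vector into a 3-row data frame raises the error
msg <- tryCatch(
  {
    prob1$Sex_recode <- as.numeric(prob1$Sex == 1)  # hypothetical recode
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```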

3. undefined columns selected

Ironically, the error we so badly wanted before comes up but for a completely different reason. See if you can find the problem here. I'll take that same little dataset and I just want to know how many rows there are in which Education is not equal to 1.

So, if I want to know the number of rows of the dataframe prob1, I do:

nrow(prob1)

and if I want to know how many have a value of Education not equal to 1, I do the following (incorrectly) and get an error:

Now I check my variable name and I've definitely spelled Education right this time. The problem, actually, is not that I have referenced a column that doesn't exist, but that I've messed up the subsetting inside the nrow() call: I haven't said which columns I want. When I do,

prob1[prob1$Education!=1]

this doesn't do what I intend. To subset prob1 by a condition, I have to specify which rows I want and which columns I want, separated by a comma. With only one index in the brackets and no comma, R treats it as a column selector, so the three TRUE/FALSE values get matched against two columns and the third one points at a column that doesn't exist. See my post on subsetting for more details on this.

If I do it the following way, all is good since I'm saying to subset prob1 with only rows with education !=1 and all columns:

nrow(prob1[prob1$Education!=1,])
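A short sketch putting the broken and fixed versions side by side, using the same toy data. The sum() line at the end is an extra shortcut, not something from the steps above:

```r
prob1 <- as.data.frame(cbind(c(1, 2, 3), c(5, 4, 3)))
colnames(prob1) <- c("Education", "Ethnicity")

# One index, no comma: treated as a column selector, which fails
msg <- tryCatch(
  nrow(prob1[prob1$Education != 1]),
  error = function(e) conditionMessage(e)
)
print(msg)

# With the comma, the condition picks rows and all columns are kept
n_rows <- nrow(prob1[prob1$Education != 1, ])

# Counting the TRUEs directly gives the same answer without subsetting
n_sum <- sum(prob1$Education != 1)
print(c(n_rows, n_sum))
```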

So this error message does make sense in a way, but it's still a bit cryptic in my opinion.

Monday, January 14, 2013

For loops (and how to avoid them)

My experience when starting out in R was trying to clean and recode data using for() loops, usually with a few if() statements in the loop as well, and finding the whole thing complicated and frustrating.

In this post, I'll go over how you can avoid for() loops to improve both the quality and speed of your programming, as well as your sanity.

So here we have our classic dataset called mydata.Rdata (you can download this if you want, link at the right):

And if I were in Stata and wanted to create an age group variable, I could just do:

gen Agegroup=1
replace Agegroup=2 if Age>10 & Age<20
replace Agegroup=3 if Age>=20

But when I try this in R, it fails:
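Here is a sketch of the kind of attempt that fails, using a small made-up stand-in for mydata (the values are assumptions, not the real dataset). Depending on your R version, if() with a vector condition gives either a warning or an error, so tryCatch() is used to capture whichever one shows up:

```r
# Hypothetical stand-in for mydata; only the Age column matters here
mydata <- data.frame(Age = c(5, 12, 18, 25, 40))

mydata$Agegroup <- 1

# if() expects a single TRUE/FALSE, but this condition has one value per row
result <- tryCatch(
  {
    if (mydata$Age > 10 & mydata$Age < 20) mydata$Agegroup <- 2
    "ran without complaint"
  },
  warning = function(w) conditionMessage(w),
  error = function(e) conditionMessage(e)
)
print(result)
```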

Why does it fail? It fails because Age is a vector, so the condition if(mydata$Age<10) is asking "is the vector Age less than 10", which is not what we want to know. We want to ask, row by row, whether each element of Age is less than 10, so we need to specify the element of the vector we're referring to. We don't specify the element and thus we get the warning (really, error), "only the first element will be used." (In R 4.2.0 and later it is a true error: "the condition has length > 1".) So when this fails, the first way people try to solve this problem is with a crazy for() loop like this:

###########Unnecessarily long and ugly code below#######
mydata$Agegroup1<-0

for (i in 1:10){
  if(mydata$Age[i]>10 & mydata$Age[i]<20){
    mydata$Agegroup1[i]<-1
  }
  if(mydata$Age[i]>=20){
    mydata$Agegroup1[i]<-2
  }
}

Here we tell R to go down the rows from i=1 to i=10, and for each of those rows indexed by i, check to see what value of Age it is, and then assign Agegroup1 a value of 1 or 2. This works, but at a high cost: you can easily make a mistake with all those indexed vectors, and element-by-element for() loops run much slower than R's vectorized functions, which would be a big deal if this dataset were 10000 observations instead of 10.

So how can we avoid doing this?

One of the most useful functions I have found is one that I have referred to a number of times in my blog so far - the ifelse() function. The ifelse() function evaluates a condition, and then assigns a value if it's true and a value if it's false. The great part about it is that it can read in a vector and check each element of the vector one by one so you don't need indices or a loop. You don't even need to initialize some new variable before you run the statement. Like this:

mydata$newvariable<-ifelse(Condition of some variable,
Value of new variable if condition is true,
Value of new variable if condition is false)

so for example:

mydata$Old<-ifelse(mydata$Age>40,1,0)

This says, check to see if the elements of the vector mydata$Age are greater than 40: if an element is greater than 40, it assigns the value of 1 to mydata$Old, and if it's not greater than 40, it assigns the value of 0 to mydata$Old.
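To see the element-by-element behavior on its own, here is a tiny sketch with a made-up vector of ages (not the real mydata):

```r
ages <- c(25, 45, 60, 40)        # made-up ages for illustration
old <- ifelse(ages > 40, 1, 0)   # one answer per element, no loop needed
print(old)
```

Each element is tested separately, so the result is a vector of the same length as ages. Note that 40 itself is not greater than 40, so it gets a 0.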

But we wanted to assign values 0, 1, and 2 to an Agegroup variable. To do this, we can use nested ifelse() statements:

mydata$Agegroup2<-ifelse(mydata$Age>10 & mydata$Age<20,1,
                  ifelse(mydata$Age>=20, 2,0))

Now this says, first check whether each element of the Age vector is >10 and <20. If it is, assign 1 to Agegroup2. If it's not, then evaluate the next ifelse() statement, whether Age>=20. If it is, assign Agegroup2 a value of 2. If it's not any of those, then assign it 0. We can see that both the loop and the ifelse() statements give us the same result:

You can nest ifelse() statements as much as you like. Just be careful about your final category, because it assigns the last value to whatever values are left over that didn't meet any condition, so make sure you want that to happen. One exception is missing values: if Age is NA, the condition itself is NA and ifelse() returns NA for that row instead of the final value.
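As a rough check of the loop-versus-ifelse() comparison, here is a sketch on a made-up vector of ages (an assumption, not the real mydata). It uses Age>=20 for the top group in both versions and then shows what ifelse() does with a missing value:

```r
ages <- c(5, 10, 15, 20, 35)   # made-up ages, including the boundary values

# Loop version, one element at a time
group_loop <- rep(0, length(ages))
for (i in seq_along(ages)) {
  if (ages[i] > 10 & ages[i] < 20) group_loop[i] <- 1
  if (ages[i] >= 20) group_loop[i] <- 2
}

# Vectorized version with nested ifelse()
group_vec <- ifelse(ages > 10 & ages < 20, 1,
             ifelse(ages >= 20, 2, 0))

same <- identical(group_loop, group_vec)
print(same)

# A missing age gives a missing group, not the leftover value 0
na_group <- ifelse(NA > 10, 1, 0)
print(na_group)
```

If you want missing ages to land in a specific category, recode them explicitly with is.na() first.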

Other examples of ways to use the ifelse() function:

If you want to add a column with the mean of Weight by sex for each individual, you can do this with ifelse() like this:

mydata$meanweight.bysex<-ifelse(mydata$Sex==0,
                         mean(mydata$Weight[mydata$Sex==0], na.rm=TRUE),
                         mean(mydata$Weight[mydata$Sex==1], na.rm=TRUE))
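The ifelse() approach works nicely for two groups. As an alternative sketch (not the method used above), base R's ave() computes a group-wise summary for any number of groups. The data below are made up for illustration:

```r
# Hypothetical data; the column names follow the post
mydata <- data.frame(Sex    = c(0, 1, 0, 1, 1),
                     Weight = c(60, 80, NA, 90, 70))

# Two-group version with ifelse(), as in the post
by_ifelse <- ifelse(mydata$Sex == 0,
                    mean(mydata$Weight[mydata$Sex == 0], na.rm = TRUE),
                    mean(mydata$Weight[mydata$Sex == 1], na.rm = TRUE))

# ave() applies the function within each level of Sex and
# returns one value per row
by_ave <- ave(mydata$Weight, mydata$Sex,
              FUN = function(w) mean(w, na.rm = TRUE))

print(cbind(by_ifelse, by_ave))
```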

If you want to recode missing values:

mydata$Height.recode<-ifelse(is.na(mydata$Height),
9999,
mydata$Height)

If you want to combine two variables together into a new one, such as to create a new ID variable based on year (which I added to this dataframe) and ID:

mydata$ID.long<-ifelse(mydata$ID<10,
                paste(mydata$year, "-0",mydata$ID,sep=""),
                paste(mydata$year, "-", mydata$ID, sep=""))
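If all you need is the zero padding, sprintf() can do it in one step. This is an alternative sketch rather than the approach above, shown on made-up year and ID values and assuming ID holds whole numbers:

```r
# Hypothetical data with a one-digit and a two-digit ID
mydata <- data.frame(year = c(2012, 2012, 2013), ID = c(3, 12, 7))

# Version from the post: choose the separator based on the size of ID
id_paste <- ifelse(mydata$ID < 10,
                   paste(mydata$year, "-0", mydata$ID, sep = ""),
                   paste(mydata$year, "-", mydata$ID, sep = ""))

# "%02d" pads the ID with a leading zero to two digits
id_sprintf <- sprintf("%s-%02d", mydata$year, as.integer(mydata$ID))

print(id_paste)
print(id_sprintf)
```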

Other ways to avoid the for loop:

The apply functions: If you think you have to use a loop because you have to apply some sort of function to each observation in your data, think again! Use the apply() functions instead. For example:

If you have a lot of missing values and want to recode them all at once, or want to sum up the number of times you see a certain value in a row, check out my post on the apply function here.
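As a small illustration of the row-counting idea, here is a sketch with a made-up data frame called scores (not from the original data). apply() with a margin of 1 runs a function on every row, and sapply() runs one on every column:

```r
# Hypothetical survey answers, where 9 is a code we want to count
scores <- data.frame(q1 = c(1, 9, 3),
                     q2 = c(9, 9, 2),
                     q3 = c(4, NA, 9))

# How many times does 9 appear in each row?
n_nines <- apply(scores, 1, function(row) sum(row == 9, na.rm = TRUE))
print(n_nines)

# How many missing values does each column have?
n_missing <- sapply(scores, function(col) sum(is.na(col)))
print(n_missing)
```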

You can also use other functions such as cut() to do the age grouping above. Here's the post on how this function works, so I won't go over it again, except to say if you convert from a factor to a numeric, *always* convert to a character before converting it to numeric:

mydata$Agegroup3<-as.numeric(as.character(cut(mydata$Age, c(0,10,20,100),labels=0:2)))
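To see why the as.character() step matters, here is a sketch with a made-up vector of ages. Converting a factor straight to numeric returns its internal level codes, not the labels:

```r
ages <- c(5, 15, 20, 25)   # made-up ages for illustration

grp <- cut(ages, c(0, 10, 20, 100), labels = 0:2)

wrong <- as.numeric(grp)                 # internal codes: 1, 2, 3
right <- as.numeric(as.character(grp))   # the labels we asked for: 0, 1, 2

print(rbind(wrong, right))
```

One thing to watch: by default cut() builds intervals that are closed on the right, so an Age of exactly 20 falls in the (10,20] group here, whereas the loop above puts Age>=20 in the top group. Check your boundary values if the two need to agree.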

Basically, any time you think you have to do a loop, think about how you can do it with another function. It will save you a lot of time and mistakes in your code.

Why R for public health?

I created this blog to help public health researchers who are used to Stata or SAS begin using R. I find that public health data is unique, and this blog is meant to address the specific data management and analysis needs of the world of public health.

R is a very powerful tool for programming but can have a steep learning curve. In my experience, people find it easier to do it the long way with another programming language, rather than try R, because it just takes longer to learn. I think all statistical packages are useful and have their place in the public health world. However, I am a strong proponent of R and I hope this blog can help you move toward using it when it makes sense for you.

Please email me with posts you would like to see or R questions, and I'll try my best to answer them. Thanks for following!