Monday, September 24, 2012

From continuous to categorical


During data analysis, it is often super useful to turn continuous variables into categorical ones.  In Stata you would do something like this:

gen catvar=0
replace catvar=1 if contvar>0 & contvar<=3
replace catvar=2 if contvar>3 & contvar<=5 

etc.  And then you would label your values like so:

label define agelabel 0 "0" 1 "1-3" 2 "3-5"
label values catvar agelabel

How can we do this in R? There's a great function in R called cut() that does everything at once.  It takes in a continuous variable and returns a factor (which is an ordered or unordered categorical variable).  Factor variables are extremely useful for regression because they can be treated as dummy variables.  I'll have another post on the merits of factor variables soon.

But for now, let's focus on getting our categorical variable.  Here is our data:



And now we want to take that "Age" variable and turn in into a categorical variable.  The most basic statement is like so:


mydata$Agecat1<-cut(mydata$Age, c(0,5,10,15,20,25,30))

Here the function cut() takes in as the first argument the continuous variable mydata$Age and it cuts it into chunks that are described in the second argument.  So here I've indicated to make groups that go from 0-5, 6-10, 11-15, 16-20, etc.  By default, the right side of the interval is closed while the left is open.  You can change that, as we will see below.  First, the output with the new "Agecat" variable:



Now we can customize our intervals.  First, in Agecat2, I show how instead of spelling out every cutoff of the interval, I can just specify a sequence using seq(0, 30, 5) - this means we start at 0 and go to 30 by intervals of 5.

For Agecat3, I switch the default closed interval to be the left one by specifying "right=FALSE".

Finally, for Agecat4 I add in my own labels instead of the default "(0,5]" labels that are provided by R.  I want them to be numbers instead so I indicate "labels=c(1:6)".  The output of all of the options are shown below.

mydata$Agecat2<-cut(mydata$Age, seq(0,30,5))

mydata$Agecat3<-cut(mydata$Age, seq(0,30,5), right=FALSE)

mydata$Agecat4<-cut(mydata$Age, seq(0,30,5), right=FALSE, labels=c(1:6))




Now, if I want some summary statistics or a bivariate table, I get some nice output:



summary(mydata$Agecat1)

  (0,5]  (5,10] (10,15] (15,20] (20,25] (25,30] 
      0       1       2       0       0       1 




table(mydata$Agecat1, mydata$Sex)
       
          0 1
  (0,5]   0 0
  (5,10]  0 1
  (10,15] 1 1
  (15,20] 0 0
  (20,25] 0 0
  (25,30] 0 1








Sunday, September 23, 2012

The infamous apply function

For R beginners, the apply() function seems like a secret doorway into programming bliss. It seems so powerful, and yet, beyond reach. For those just starting out, examples of how to use apply() can really help with the intuition of how to harness its power. Here are some great ways to use apply() that can really help make R programming enjoyable and useful.  

First, the general structure of apply() is like so:

apply(x, MARGIN, FUN)

  1. The first argument, "x", is whatever dataset or columns of a dataset you want to do something to.
  2. The second argument, "MARGIN", is how you want to apply function.  The choices are either over the rows (MARGIN=1) or the columns (MARGIN=2).
  3. The third argument (FUN) is the function you apply.  

So for an easy example, if you want to just sum the entries of all the columns in your dataset called "mydata", you can do it this way:

apply(mydata, 2, sum)

But this is not always very useful.  We have other columns in our datasets, and we probably don't want to just sum all the time.  What else can we do? Here are two nice ways to use apply():


1. Counting how many columns meet a certain condition 

I have 13 child outcomes in a dataset named "births" and I want to count up how many live births there were. My "births" data looks like this: 



How can I add up the live births, especially with those pesky NA's in there? Here's a one line way to do it: 

 births$childcount<-apply(births[,1:5], MARGIN=1, function(x) {sum(x=="live birth", na.rm=TRUE)}) 

This code is saying, for the first 5 columns of my dataset births, for each row (MARGIN=1), apply the following function. The function takes x as the input (x is just the births[,1:5] dataset), and sums up for each column of this dataset the number of times it sees "live birth". The na.rm option removes any NA's from consideration.  If you had other conditions, you could say function(x) {sum(x>2010, na.rm=TRUE)}) for example, if you wanted to count up how many years were after 2010. 


 2. Changing coded missing values to NA for multiple columns at a time 

 Often datasets code their missing values as 99 or -99 instead of just leaving them blank. We might want to change these to actual missing so we can work with the data better.  For one variable at at time, I can do it with with ifelse() statement:

originaldata$variable1<-ifelse(originaldata$variable1==99 | originaldata$variable1==-99, NA, originalvariable1)

This is equivalent to the cond() command in stata, where the first argument evaluates the condition, the second argument is what is done if the condition is true, and the third argument is what is done if the condition is false.  

But what if I have 3 or 30 columns that I want to do this to? I don't have to write ifelse() statements for them all individually.  Instead, I use apply.

Here we have a dataset called "originaldata" and we have 4 variables that we want to change from the original missing values to NA values. These variables are in column numbers 2, 4, 5, and 6, as below:




I take the columns of original dataset, and for each of those columns, I use an ifelse statement to check the value of the entry: if it's 99 or -99 I change it to NA, and if it's not then I leave it the way it is. This creates a new dataset called "new data" with just those columns that I choose.

newdata<-apply(originaldata[,c(2,4:6)], MARGIN=2, function(x) {ifelse(x==99 | x==-99, NA,x)})

We print out newdata:


Now if we want the original dataset together with the changed variables, we can just cbind (column bind) them together like so:

alldata<-cbind(originaldata[, c(-2,-4:-6)], newdata)





If you want to be extra fancy, you can just combine the cbind() statement with the apply() in one statement, like this:


newdata<-cbind(originaldata[,c(-2,-4,-6)], apply(originaldata[,c(2,4:6)], MARGIN=2, function(x) {ifelse(x==99 | x==-99, NA,x)}))