R for Public Health: From continuous to categorical

Monday, September 24, 2012

From continuous to categorical

During data analysis, it is often super useful to turn continuous variables into categorical ones. In Stata you would do something like this:

gen catvar=0
replace catvar=1 if contvar>0 & contvar<=3
replace catvar=2 if contvar>3 & contvar<=5

etc. And then you would label your values like so:

label define agelabel 0 "0" 1 "1-3" 2 "3-5"
label values catvar agelabel

How can we do this in R? There's a great function in R called cut() that does everything at once. It takes in a continuous variable and returns a factor (which is an ordered or unordered categorical variable). Factor variables are extremely useful for regression because they can be treated as dummy variables. I'll have another post on the merits of factor variables soon.

But for now, let's focus on getting our categorical variable. Our data frame, mydata, has a continuous "Age" variable and a "Sex" variable coded 0/1.
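If you want to follow along, here is a small made-up data frame. To be clear, these are not the original values from this post; they are an assumption, picked only so that they are consistent with the summary and table output shown further down.

```r
# Hypothetical stand-in data (NOT the original data from the post).
# Ages 10 and 15 sit exactly on cutoffs, which is handy for seeing
# which side of an interval is closed.
mydata <- data.frame(
  Age = c(10, 12, 15, 27),
  Sex = c(1, 0, 1, 1)
)

str(mydata)
```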

And now we want to take that "Age" variable and turn it into a categorical variable. The most basic statement is like so:

mydata$Agecat1<-cut(mydata$Age, c(0,5,10,15,20,25,30))

Here the function cut() takes the continuous variable mydata$Age as its first argument and cuts it into chunks described by the second argument. The breaks above create the intervals (0,5], (5,10], (10,15], (15,20], (20,25], and (25,30]; for ages in whole years, that means groups of 1-5, 6-10, 11-15, 16-20, etc. By default, the right side of each interval is closed while the left is open, so an age of exactly 0 (or anything above 30) falls outside every interval and becomes NA. You can change that, as we will see below. The result is stored in the new "Agecat1" variable.
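To see the default "left open, right closed" behavior in action, here is a quick sketch using a few illustrative ages (made up for this example) that sit on and around the cutoffs:

```r
# Illustrative values only, chosen to land on and around the breaks.
ages <- c(0, 5, 5.5, 6, 10, 30, 31)

# Same breaks as the Agecat1 example: (0,5], (5,10], ..., (25,30]
agecat <- cut(ages, c(0, 5, 10, 15, 20, 25, 30))

# Show each age next to the interval it was assigned to.
# 0 and 31 fall outside every interval and come back as NA.
data.frame(ages, agecat)
```

Note that 5 lands in (0,5] while 5.5 and 6 both land in (5,10], and that values outside the breaks are silently set to NA, so it is worth checking for NAs after cutting.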

Now we can customize our intervals. First, in Agecat2, I show how instead of spelling out every cutoff of the interval, I can just specify a sequence using seq(0, 30, 5) - this means we start at 0 and go to 30 by intervals of 5.

For Agecat3, I switch the closed side of the interval to the left by specifying "right=FALSE", so the intervals become [0,5), [5,10), and so on.

Finally, for Agecat4 I add in my own labels instead of the default "(0,5]" labels that are provided by R. I want them to be numbers instead, so I indicate "labels=c(1:6)". Note that the result is still a factor whose levels are "1" through "6", not a numeric variable. The code for all three options is shown below.

mydata$Agecat2<-cut(mydata$Age, seq(0,30,5))

mydata$Agecat3<-cut(mydata$Age, seq(0,30,5), right=FALSE)

mydata$Agecat4<-cut(mydata$Age, seq(0,30,5), right=FALSE, labels=c(1:6))
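Here is a sketch that runs the three variants side by side on a small made-up data frame (an assumption for illustration, not the original data). It also shows two things that tend to trip people up: converting the numeric-looking labels into actual numbers, and the include.lowest argument of cut(), which closes the otherwise open end of the outermost interval.

```r
# Hypothetical stand-in data (NOT the original data from the post).
mydata <- data.frame(Age = c(10, 12, 15, 27), Sex = c(1, 0, 1, 1))

mydata$Agecat2 <- cut(mydata$Age, seq(0, 30, 5))
mydata$Agecat3 <- cut(mydata$Age, seq(0, 30, 5), right = FALSE)
mydata$Agecat4 <- cut(mydata$Age, seq(0, 30, 5), right = FALSE, labels = 1:6)

# Age 10 is in (5,10] for Agecat2 but in [10,15) for Agecat3.
print(mydata)

# Agecat4 is a factor with levels "1".."6". To get real numbers,
# go through character first so you recover the labels themselves.
agecat4_num <- as.integer(as.character(mydata$Agecat4))

# include.lowest = TRUE closes the first interval on both sides
# (with the default right = TRUE), so an age of 0 is no longer NA.
edge <- cut(c(0, 30), seq(0, 30, 5), include.lowest = TRUE)
print(edge)
```

With right=FALSE, include.lowest=TRUE closes the last interval instead, so an age of exactly 30 would be kept.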

Now, if I want some summary statistics or a bivariate table, I get some nice output:

summary(mydata$Agecat1)

  (0,5]  (5,10] (10,15] (15,20] (20,25] (25,30]
      0       1       2       0       0       1

table(mydata$Agecat1, mydata$Sex)

          0 1
  (0,5]   0 0
  (5,10]  0 1
  (10,15] 1 1
  (15,20] 0 0
  (20,25] 0 0
  (25,30] 0 1
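Notice the rows of zeros: cut() keeps every interval as a factor level even when no observation falls in it. As a rough sketch, again on made-up data that is only assumed to resemble the original, this is how you could drop the empty categories from the table with droplevels() if you don't want them:

```r
# Hypothetical stand-in data (NOT the original data from the post).
mydata <- data.frame(Age = c(10, 12, 15, 27), Sex = c(1, 0, 1, 1))
mydata$Agecat1 <- cut(mydata$Age, c(0, 5, 10, 15, 20, 25, 30))

# All six intervals appear, including the empty ones.
counts <- table(mydata$Agecat1, mydata$Sex)
print(counts)

# droplevels() removes unused factor levels before tabulating.
counts_used <- table(droplevels(mydata$Agecat1), mydata$Sex)
print(counts_used)
```

Whether to keep the empty levels is a judgment call: they are useful when you want tables from different datasets to line up, and just noise otherwise.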

2 comments:

Unknown, April 20, 2013 at 6:22 PM
Thank you Slawa Rokicki, your idea of helping us use R is great, especially giving examples from STATA.

The cut command creates a series of intervals that are all closed on one side and open on the other: how do I make the last interval (or first) closed on both sides so as not to have excluded observations?

Thank you,

Why R for public health?

I created this blog to help public health researchers who are used to Stata or SAS begin using R. I find that public health data is unique, and this blog is meant to address the specific data management and analysis needs of the world of public health.

R is a very powerful tool for programming but can have a steep learning curve. In my experience, people find it easier to do things the long way in another programming language rather than try R, because R takes longer to learn. I think all statistical packages are useful and have their place in the public health world. However, I am a strong proponent of R and I hope this blog can help you move toward using it when it makes sense for you.

Please email me with posts you would like to see or R questions, and I'll try my best to answer them. Thanks for following!