R for Public Health: Quick and Easy Subsetting

Monday, October 1, 2012

Quick and Easy Subsetting

Public health datasets can be enormous and difficult to look at. Often it is great to be able to only look at specific parts of the dataset, or to only run analysis on a specific part of a dataset. There are two ways that you can subset a dataset in R:

Using the subset() function
Using matrix indexing

The first way may sound easier, but the second one is very useful. Let me show you why.

Let's start with the subset() function. The basic structure is like this:

subset(x, condition, select=c(var1, var2))

where x is the original dataset you want to subset, condition is a condition on your rows, and select is the columns you want to keep. So for example, I have a dataset called mydata, shown below. I want to look at only women that are over 50 years old, and I only want to look at their ID, age, and weight. Therefore I do this:

sub.data<-subset(mydata, Age>50 & Sex==0, select=c(ID, Age, Weight))

So here I am saying, take the dataset called "mydata", take only rows where the Age >50 and the Sex ==0 (when you write a condition, you always use two equal signs), and then take only the columns ID, Age, and Weight.

Here are the original (left) and the subsetted (right) datasets:

You can also do a statement like this:

sub.data2<-subset(mydata, Age>50 & Sex==0, select=c(ID:Sex))

which takes all of the columns between (and including) ID and Sex. This is nice so you don't have to list out all the variables one by one.

However, now let's take a more real example. I have a dataset with 200 columns and 3000 rows. I want to keep say 50 columns, but they're not all sequential. Some are at the beginning of my dataset, some are at the end. Do I have to go in and write out each name of the variable? Nope. Instead we can use R's matrix indexing capability. It's super easy and it goes like this in a general form:

newdata<-originaldata[rows I want, columns I want]

The comma is crucial here since the dataset is organized by rows and columns. If you don't put anything in the "rows I want" place, R keeps everything. So, for example, I can do this:

sub.data3<-mydata[,c(1:3)]

So this is saying, take the dataset "mydata", and take all the rows (since there is nothing before the comma), and take columns 1-3. Call this subsetted dataset "sub.data3".

I can also make the equivalent statement from sub.data2 using indexing like this:

sub.data4<-mydata[mydata$Age>50 & mydata$Sex==0, c(1:3)]

Or, if I just want to look at the first 50 observations and all columns, I can do this:

sub.data5<-mydata[c(1:50), ]

Finally, I can do what I said before, in that I can choose the variables I want from a super long list. I first can find out the index numbers of my columns using the names() command like this:

names(mydata)

Then I can say I want columns 1-5, 8, 12-15, 100-120, 197, and 199. I can easily do this by saying

sub.data6<-mydata[,c(1:5, 8, 12:15, 100:120, 197, 199)]

Finally, the last thing I want to point out is that if you want to keep most columns but just take away, say, the last 10, you can do this:

sub.data6<-mydata[,-c(190:200)]

The negative sign is like a 'drop' command. Keep everything except the columns I list out.

It's a very nice feature and very fast!

2 comments:

AnonymousSeptember 2, 2014 at 3:10 AM
Hello! I love your blog posts! I have recently come across the dplyr package which seems to be an excellent wrapper for data manipulation functions http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

Keep up the good work !
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

Why R for public health?

I created this blog to help public health researchers that are used to Stata or SAS to begin using R. I find that public health data is unique and this blog is meant to address the specific data management and analysis needs of the world of public health.

R is a very powerful tool for programming but can have a steep learning curve. In my experience, people find it easier to do it the long way with another programming language, rather than try R, because it just takes longer to learn. I think all statistical packages are useful and have their place in the public health world. However, I am a strong proponent of R and I hope this blog can help you move toward using it when it makes sense for you.

Please email me with posts you would like to see or R questions, and I'll try my best to answer them. Thanks for following!