Monday, October 1, 2012

Quick and Easy Subsetting



Public health datasets can be enormous and difficult to look at.  Often it is great to be able to only look at specific parts of the dataset, or to only run analysis on a specific part of a dataset.  There are two ways that you can subset a dataset in R:

  1. Using the subset() function
  2. Using matrix indexing
The first way may sound easier, but the second one is very useful.  Let me show you why.

Let's start with the subset() function.  The basic structure is like this:

subset(x, condition, select=c(var1, var2))

where x is the original dataset you want to subset, condition is a condition on your rows, and select is the columns you want to keep.  So for example, I have a dataset called mydata, shown below.  I want to look at only women that are over 50 years old, and I only want to look at their ID, age, and weight.  Therefore I do this:

sub.data<-subset(mydata, Age>50 & Sex==0, select=c(ID, Age, Weight))

So here I am saying, take the dataset called "mydata", take only rows where the Age >50 and the Sex ==0 (when you write a condition, you always use two equal signs), and then take only the columns ID, Age, and Weight.

Here are the original (left) and the subsetted (right) datasets:

         










You can also do a statement like this:

sub.data2<-subset(mydata, Age>50 & Sex==0, select=c(ID:Sex))

which takes all of the columns between (and including) ID and Sex.  This is nice so you don't have to list out all the variables one by one.

However, now let's take a more real example.  I have a dataset with 200 columns and 3000 rows. I want to keep say 50 columns, but they're not all sequential.  Some are at the beginning of my dataset, some are at the end.  Do I have to go in and write out each name of the variable? Nope.  Instead we can use R's matrix indexing capability.  It's super easy and it goes like this in a general form:

newdata<-originaldata[rows I want, columns I want]

The comma is crucial here since the dataset is organized by rows and columns.  If you don't put anything in the "rows I want" place, R keeps everything. So, for example, I can do this:

sub.data3<-mydata[,c(1:3)]

So this is saying, take the dataset "mydata", and take all the rows (since there is nothing before the comma), and take columns 1-3.  Call this subsetted dataset "sub.data3".  

I can also make the equivalent statement from sub.data2 using indexing like this:

sub.data4<-mydata[mydata$Age>50 & mydata$Sex==0, c(1:3)]

Or, if I just want to look at the first 50 observations and all columns, I can do this:

sub.data5<-mydata[c(1:50), ]

Finally, I can do what I said before, in that I can choose the variables I want from a super long list. I first can find out the index numbers of my columns using the names() command like this:

names(mydata)

Then I can say I want columns 1-5, 8, 12-15, 100-120, 197, and 199.  I can easily do this by saying

sub.data6<-mydata[,c(1:5, 8, 12:15, 100:120, 197, 199)]

Finally, the last thing I want to point out is that if you want to keep most columns but just take away, say, the last 10, you can do this:

sub.data6<-mydata[,-c(190:200)]

The negative sign is like a 'drop' command.  Keep everything except the columns I list out.

It's a very nice feature and very fast!











2 comments:

  1. Hello! I love your blog posts! I have recently come across the dplyr package which seems to be an excellent wrapper for data manipulation functions http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

    Keep up the good work !

    ReplyDelete
    Replies
    1. Thanks! This looks pretty cool; I remember I saw a presentation on this by Hadley Wickham a few months ago and then haven't used it since. I'll try it out.

      Delete