Friday, February 1, 2013

Converting a dataset from wide to long


I recently had to convert a dataset that I was working with from a wide format to a long format for my analysis.  I struggled with this a bit, but finally found the right sources and the right package to do it, so I thought I'd share my practical example of reshaping data in R. This post is specifically helpful for those using Demographic and Health Survey (DHS) data.

The DHS dataset includes one observation for each woman. For each observation, there are 20 columns for each birth she could have had for 16 different characteristics.  If no birth happened then the cell is left missing. The characteristics include birth month, birth year, sex of the child, death month, death year, who the child lives with, current age of child, and so on.

For clarity, I've shortened the data to just 7 observations and two characteristics of each birth (b2 and b4) for 3 possible births:


Here v012 is the mother's age, all the b2 variables are year of births, and the b4 variables are the sex of the child.

So the first subject, aged 30, has had two births - one in 2000 and one in 2005, both boys.  Since she did not have a third birth yet, her values for b2_03 and b4_03 are missing.

I would like to convert this dataset to one where I have one observation per child born - i.e. from the wide format to the long format. Specifically, I would like the columns to just be the caseid of the mother, age of the mother, year of birth of the child, and sex of the child.

There are a number of ways of doing this in R, including melt() and cast()plyr()aggregate(), and others.  However, after a good deal of struggle and looking things up, I found that the reshape() function is the most intuitive and user-friendly for the needs of this problem. You don't need to use melt and cast at all, which are difficult to manipulate in my opinion.

The reshape() function takes in a number of important parameters that will be necessary for our transformation (there are more parameters than this, but I've boiled it down to the ones that are crucial):

reshape(data, varying = NULL, timevar = "time", idvar = "id", direction, sep = "")

where
data = dataframe you want to convert
varying = columns in the wide format that correspond to a single column in the long format
timevar = name of new variable that differentiates multiple observations from the same individual
idvar = variable in your dataset that identifies multiple records from the same individual
direction = "wide" if you're going from long to wide and "long" if you're going from wide to long
sep = the symbol that separates the name of a varying column from its number

Ok, so the most important thing about the reshape function is that you have to give it variable names that it can understand. Specifically, the columns that I want to turn into one column should all follow the same structure in the naming convention. It can be anything with a pattern like this:

Birthyear_1, Birthyear_2, Birthyear_3
b1, b2, b3
year.1, year.2, year.3

etc.  As long as they follow the same text and number pattern. If the dataset is not in this format, you will have a lot of problems and I suggest changing your variable names before trying to convert. For the first example, you would use sep="_", the second would be sep="", and the third is sep="." and all of these are valid.

Ok, so let's try it out. All variables that we want to convert into one column we can put into the varying parameter and R will sort them out based on the naming patterns. We specify a long direction, that our id variable is caseid, and, importantly, that the separating symbol between the name of the variable and its order number is a "_".

births.long1<-reshape(births.wide, varying=c("b2_01","b2_02","b2_03", "b4_01", "b4_02", "b4_03"), direction="long", idvar="caseid", sep="_")

This gives us the following result (truncated picture):



Which is what we wanted! Each observation is now a birth, with corresponding mother caseid and age, and the year (b2) and sex of the child (b4) for each observation correctly lined up. R has added in the time variable which is just the number after the "_" for the varying variables. Notice that R automatically orders the dataset by this time variable, so you will have to re-order if you want it by caseid.

If you wanted to convert all of the 16 characteristics of each birth, you do not need to write out each individual variable. You can easily do it like this, indicating the number of the columns that you want:

births.long2<-reshape(births.wide, varying=c(3:8), direction="long", idvar="caseid", sep="_", timevar="order")

births.long2<-births.long2[order(births.long2$caseid),]

names(births.long2)<-c("subject","age","order", "birthyear", "childsex")

births.long2<-na.omit(births.long2)

Here, I've specified that all variables from column 3 to column 8 are varying variables, and I've also indicated that the new variable that R creates should be called "order" instead of the default "time". Then I reorder my dataset by caseid, name the new columns, and take out all of the missing observations, which are just non-existent births. I get the following result: 



Notice how subject 6 is completely gone because she did not have any children and this is now a dataset of children ever born.

I've made this example of reshape very specific to DHS data, but there are also many great sources on how to reshape data in R. Here are a couple that I found especially helpful:

12 comments:

  1. Best post yet! I have found reshape to be incredibly tricky. Every time I think I've mastered it, I try it on a new dataset and get tons of errors. (Not gonna lie, once I just gave up and did the analysis in Stata.) This is now bookmarked for next time I need to use the command.

    Can't wait for you to do a post on tapply() !!!

    ReplyDelete
    Replies
    1. hi Portia can you show me how to do this in STATA or the intuition on how to if you don't mind. im trying to have a dataset of all children born in my DHS dataset

      Delete
  2. Thanks! :) I'm waiting on your guest post too!

    ReplyDelete
  3. I find melt() and cast() so intuitive that I haven't tried anything else. The terms "wide" and "long" also always bug me because it seems so relative. Often "wide" can get wider and "long" can get longer.

    ReplyDelete
  4. Again, a nice post. There are many ways to do this, and it is important that you get use to the syntax of one of the ways and use it as often as possible. Sometimes looking at the function can scare one off, but with a little practice they become more intuitive. Great examples, and very clear coding!

    ReplyDelete
  5. Wow, Fantastic Blog, it’s so helpful to me, and your blog is very good, I’ve from your blog here, Keep on going, my friend; I will keep an eye on it,learned a lot. Thanks for sharing it...

    Deep muscle pain relief

    ReplyDelete
  6. hi Slawa have you got any idea how to do this in STATA or the intuition on how to if you don't mind. im trying to have a dataset of all children born in my DHS dataset

    ReplyDelete
    Replies
    1. I think you would do

      reshape long b2 b4, i(caseid) j(childnumber)

      More help at this site:
      http://www.ats.ucla.edu/stat/stata/modules/reshapel.htm

      Delete
  7. I recently had to convert a dataset that I was working with from a wide format to a long format for my analysis. I struggled with this a bit, but finally found the right sources and the right package to do it, so I thought I'd share my practical example of reshaping data in R. This post is specifically helpful for those using Demographic and Health Survey (DHS) data
    http://ways2gettaller.eklablog.com/

    ReplyDelete
  8. Really impressed! It is so very informative information. Each & every guidelines of your publish are amazing.

    purchase worldwide facebook fans
    buy facebook likes/fans

    ReplyDelete
  9. Thank-you. I always got hung up on the sep="" part of the command structure. You have cleared this up for me and given me a valuable tool!

    ReplyDelete
    Replies
    1. Glad it was useful, great to hear your feedback!

      Delete