Friday, October 4, 2013

Loops revisited: How to rethink macros when using R

If you're a Stata user (or SAS for that matter), you are most likely a big fan of macros. They're very helpful when you're repeating the same actions over and over again. In R, we don't have macros. Instead we have functions and loops, and even better than loops are the apply functions. I already had one post on the apply() function about a year ago, so as this is the one-year anniversary of my blog (yay!), I revisit apply() and show even more examples of how incredibly versatile and useful this function is, once you get used to the syntax. I also show where loops can be useful.

This blog post is inspired by this great U-W Madison site for computing that I found when I was searching for a way to do a loop in Stata. They go through all of the ways you may want to loop using macros in Stata. So in this blog post, I show how to do all of these problems in R.

### The R perspective

The first thing to do when you're trying to think about how to solve a problem in R that you've done in Stata using macros is to stop thinking 'macro' and start thinking 'objects'. The idea is that when you use R, you have a space in which to store many different objects - vectors, dataframes, matrices, lists, etc. I went over all of these in a series of blog posts called "Data types" in November of last year. You can use the power of objects to change the way you're thinking about your programming problem.

Let's start with U-W's first example: running multiple regressions of various types using a fixed set of control variables. We'll run a linear regression and a logit. In Stata you do:

    local controlVars age sex occupation location maritalStatus hasChildren
    reg income education `controlVars'
    logit employed education `controlVars'

In R, we can take those local controlVars and put them into a new object, for example a matrix. Then we use that same matrix in all of our regressions. Here is an example.
We create some data:

    set.seed(10)
    x<-rnorm(100,5,2)
    z<-rnorm(100,6,5)
    w<-rnorm(100,3,2)
    y<-x*2+w*.5+rnorm(100,0,1)
    ybin<-as.numeric(y<10)
    mydata<-as.data.frame(cbind(x,z,w,y,ybin))

And now if we have two models to run, a linear and a logit, we can create a matrix of explanatory variables that we put on the right hand side each time. You don't need the "data=mydata" part since ybin is also a vector in our workspace, but generally if you were to import this data as a dataframe, then you would need to include it or you would need to create a separate ybin vector object from the dataframe you imported.

    xvars<-cbind(x,z,w)
    summary(lm(ybin~xvars, data=mydata))
    summary(glm(ybin~xvars, family=binomial(logit), data=mydata))

Next, we want to run the regression if the data meet some requirement. In Stata we would do:

    local blackWoman race==1 & female
    reg income education `controlVars' if `blackWoman'

Again in R, think objects. We can subset our original dataframe "mydata" to another dataframe "data.sub". I go over subsetting in a blogpost here. I take the subset of my data based on the conditions I want, and to stick with the above example, I make an xvars matrix as well so I can combine the two methods like so. Notice the degrees of freedom have been reduced since we're only using data in which x>2 and z<3:

    data.sub<-as.data.frame(mydata[x>2 & z<3,c("x","z","ybin")])
    xvars.sub<-as.matrix(data.sub[,c("x","z")])
    summary(lm(ybin~xvars.sub, data=data.sub))

Well, now that we have the basics down, let's get to some of the more interesting problems of loops.

### Looping over Variables

In Stata, if we want to run regressions for three different outcome variables, we can do it this way, via a **foreach** loop:

    foreach yvar in mpg price displacement {
        reg `yvar' foreign weight
    }

In R, we can use apply().
The syntax of apply() is three arguments: apply(dataset, margin, function)

* the first argument is the dataframe or matrix that you want to apply your function to,
* the second argument is the margin, meaning are you doing it over rows (margin=1) or over columns (margin=2) of the data,
* and the third argument is the function that you want to apply.

In this case, we can use the apply() function as follows. I will take the two columns of mydata (y and ybin), and for each of those columns (since margin=2) I apply the function that I create in the third argument. The function I create is to take an argument (outcome) which refers to each of those two columns and to run a linear model on each of those columns with x and z as the explanatory variables:

    apply(mydata[,c("y","ybin")], 2, function(outcome){summary(lm(outcome~x+z))})

And out come the two summaries of the models, the first with y and the second with ybin.

### Looping over parts of variable names

In this Stata example, which I feel comes up often, we want to take each of our monthly income variables and create an indicator of whether there was positive income in that month. We want to create 12 new variables with the original name of the variable like "Jan", now in an indicator form like "hadIncJan". The Stata code is as follows:

    foreach month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec {
        gen hadInc`month'=(inc`month'>0) if inc`month'<.
    }

In R, this is one of those times that a for() loop actually works very nicely. I actually found this solution on Stack Overflow, which if you're not familiar, can be a lifesaver for your programming struggles.

I create some month data. In this example, we'll use the names() function in R, which is very useful. I had a blog post about the names() function if you want to read up on its other uses. We also use the paste0() function, which is a way to concatenate character strings and numbers, along with a new way to index, the "[[" operator.
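
As a small sketch of how "[[" behaves with a column name held in a variable (the data frame `df` and its columns are made up for illustration and are not part of the example data in this post):

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(a = c(-1, 2, 3), b = c(4, -5, 6))

# "[[" with a column name returns that column as a plain vector
col <- "a"
df[[col]]

# Assigning to a name that does not exist yet creates a new column
df[[paste0("pos_", col)]] <- as.numeric(df[[col]] > 0)
names(df)
```

Because the name inside "[[" can be any character string, it can be built on the fly with paste0(), which is what makes it convenient inside a loop.
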
You can read about this operator in the help file. The idea is to refer to a single element of the dataframe (a vector) by inserting the column name that is being referred to via an index from the loop. I think reading the code makes things more clear. We make our data:

    jan<-rnorm(100,3,5)
    feb<-rnorm(100,4,8)
    march<-rnorm(100,2,5)
    months<-as.data.frame(cbind(jan,feb,march))
    names(months)
    head(months)

And now we run our loop:

    for (n in names(months)){
        months[[paste0("HadInc_",n)]] <- as.numeric(months[[n]]>0)
    }
    head(months)

The for() loop takes each column name in months, creates a new variable by concatenating the string "HadInc_" with that column name, and assigns to it a binary indicator of whether the original variable had monthly income greater than 0. If this is confusing, I suggest breaking it down into parts. You can run it this way to see what is actually happening (output not shown here, but you can run it on your own to understand it):

    for (n in names(months)){
        print(n)
        print(months[[n]])
        print(paste0("HadInc_",n))
    }

### Looping over Varlists

In Stata, in order to do something to all or a lot of variables (for example, to rename them to have all lower case or upper case letters), you use a **foreach** loop like this:

    foreach oldname of varlist * {
        local newname=lower("`oldname'")
        rename `oldname' `newname'
    }

In R, for this exact situation, you can use the toupper() function (or tolower() for lower case) on the names of the data, or a subset of the data if you only wanted to do some of the column names.

    names(months)<-toupper(names(months))
    head(months)

For other situations, like replacing indicators of missing values with NA for a bunch of variables at a time, check out my previous blog post on using apply() in these situations.

### Looping over numbers

We revisit the same problem we had with the monthly income, except now we want an indicator by year. We have variables of the yearly income.
In Stata this would be the code, using now the **forvalues** loop with a macro:

    forvalues year=1990/2010 {
        gen hadInc`year'=(inc`year'>0) if inc`year'<.
    }

In R we can use a for() loop along with the "[[" operator that we used before, but this time we make use of the seq(along=x) syntax that will let us go along a sequence of numbers. Our dataframe called "Income" includes columns for each year of income. We make a new vector called "years" that just contains numbers from 1990 to 1992. Then for each value along that vector, we make a new column in our Income dataframe with a name that concatenates "hadInc_" with the number in the sequence, and this variable is just a binary indicator of whether that year's income was positive. Note that Income[[i]] picks the i-th column by position, so this relies on the income columns being in the same order as the years vector.

    Inc1990<-rnorm(100,5,6)
    Inc1991<-rnorm(100,3,8)
    Inc1992<-rnorm(100,4,4)
    Income<-as.data.frame(cbind(Inc1990, Inc1991, Inc1992))
    years<-c(1990:1992)
    head(Income)
    for (i in seq(along=years)){
        Income[[paste0("hadInc_",years[i])]] <- as.numeric(Income[[i]]>0)
    }
    head(Income)

Again, run just parts of it to understand it if you're having trouble with the syntax.

### Looping over Values and Levelsof

Finally, we may want to run the same functions over values or levels of a variable. Here are two situations in Stata: the first can use the **by** statement, and the second uses **forvalues** for a survey function.

    by race: regress income age

    forvalues race=1/3 {
        svy, subpop(if race==`race'): reg income age
    }

In R, there is no "by" option for linear regression but we can use the lapply() function instead. The function lapply() is the same idea as apply() except it can be used to apply some function over a list. We create some data similar to the Stata example:

    race<-c(rep(1,30),rep(2,30),rep(3,40))
    age<-rnorm(100,25,3)
    y<-age*10+ifelse(race==1,100, ifelse(race==2, 2000, 0))+rnorm(100,0,1)
    racedata<-as.data.frame(cbind(race,age,y))
    racedata$race<-as.factor(racedata$race)

Now we use lapply() to run the summary of lm for a subset of the racedata where we subset by each value of the list.

    lapply(1:3, function(index) summary(lm(y~age, data=racedata[racedata$race==index,])))

To make this even better, we take the levels of racedata$race, which is a factor, and run the lapply() function over those instead of the numbers 1-3 so that if those levels change, we won't have to change our code. (The standalone vector race is still numeric and has no levels, so we need the factor column from the dataframe.)

    lapply(as.numeric(levels(racedata$race)), function(index) summary(lm(y~age, data=racedata[racedata$race==index,])))

This idea can also be applied to any function that you want to evaluate by different values.

Of course there may be more efficient ways to do what I've shown here. If you have comments on improvement over these solutions, let me know!
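
One possible variation on the by-group regressions above is to let split() do the subsetting, so the results come back labelled by level. This is only a sketch: the data are simulated again with an arbitrary seed, and the names `by_race` and `fits` are illustrative rather than anything used earlier in the post.

```r
# Simulated data, similar in spirit to the race example (assumed setup)
set.seed(1)
racedata <- data.frame(
  race = factor(rep(1:3, times = c(30, 30, 40))),
  age  = rnorm(100, 25, 3)
)
racedata$y <- racedata$age * 10 + rnorm(100, 0, 1)

# split() cuts the data frame into a named list, one piece per level of race
by_race <- split(racedata, racedata$race)

# Fit the same model on every piece; the list keeps the level names
fits <- lapply(by_race, function(d) lm(y ~ age, data = d))

# Collect the age coefficient from each fit into a named vector
sapply(fits, function(f) coef(f)[["age"]])
```

Since the list elements are named after the factor levels, it is easy to pull out one group afterwards, for example with summary(fits[["2"]]).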


  1. Hi Slawa, thank you for this wonderful post!

    In the first several examples, it could be costly in terms of memory use to generate sub data frames when the data set is very large. I guess a better way is to work with formulas. We can write a function to do this:

    genForm = function(dep, main, control){
      # join the main and control variable names into the right-hand side
      indep = paste(paste(main, collapse = " + "), "+", paste(control, collapse = " + "))
      form = paste(dep,"~",indep)
      as.formula(form)
    }

    list.control = c("z", "w")
    list.main = c("x")

    summary(lm(genForm("y", "x", list.control), data = mydata))
    summary(glm(genForm("ybin", "x", list.control), family = binomial(logit), data = mydata))

    And regression with a repeatedly used sub data frame can be done this way

    con1 = expression(x > 2 & z < 3)
    summary(glm(genForm("ybin", "x", list.control), family = binomial(logit), data = mydata[eval(con1),]))

    And looping over variables

    lapply(c("y", "ybin") , function(outcome) summary(lm(genForm(outcome, "x", list.control), data = mydata)) )
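
    A related sketch, not part of the original comment: base R's reformulate() builds a formula from a character vector of term labels, which covers the same ground as pasting the string by hand. The data below are simulated along the lines of the post, and the seed and coefficients are assumptions made only for illustration.

    ```r
    # Simulated data along the lines of the post (assumed setup)
    set.seed(10)
    mydata <- data.frame(x = rnorm(100, 5, 2), z = rnorm(100, 6, 5), w = rnorm(100, 3, 2))
    mydata$y <- mydata$x * 2 + mydata$w * 0.5 + rnorm(100, 0, 1)

    list.main    <- c("x")
    list.control <- c("z", "w")

    # reformulate() takes the right-hand-side terms and the response name
    form <- reformulate(c(list.main, list.control), response = "y")
    form

    # The resulting formula is used like any other
    fit <- lm(form, data = mydata)
    names(coef(fit))
    ```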

    1. Hi Yimeng, thanks for your reply! Those are great suggestions. I try to use the shortest possible code but you're right about needing to think about memory, especially with large data sets, which a lot of people work with. Thanks for your contribution!

  2. Actually, there is an even more elegant way to subset on rows within most lm-like model-fitting functions:

    summary(lm(ybin ~ xvars.sub, data = data.sub,subset=x>2 & z<3))
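
The subset argument mentioned in the last comment can also be used directly on the full data frame, which avoids building data.sub at all. A minimal sketch, assuming simulated data along the lines of the post (the seed and the way ybin is generated here are illustrative assumptions):

```r
# Simulated data along the lines of the post (assumed setup)
set.seed(10)
mydata <- data.frame(x = rnorm(100, 5, 2), z = rnorm(100, 6, 5))
mydata$ybin <- as.numeric(mydata$x * 2 + rnorm(100, 0, 1) < 10)

# subset= is evaluated inside mydata, so no separate sub data frame is needed
fit <- lm(ybin ~ x + z, data = mydata, subset = x > 2 & z < 3)

# The number of rows used equals the number of rows meeting the condition
nobs(fit)
sum(mydata$x > 2 & mydata$z < 3)
```

The condition is looked up in the data frame passed as data, so this version keeps working when x and z exist only as columns of an imported data set and not as vectors in the workspace.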