## Friday, October 4, 2013

### Loops revisited: How to rethink macros when using R

The R perspective

If you're a Stata user (or SAS for that matter), you are most likely a big fan of macros. They're very helpful when you're repeating the same actions over and over again. In R, we don't have macros. Instead we have functions and loops, and even better than loops are the apply functions. I already had one post on the apply() function about a year ago, so as this is the one year anniversary of my blog (yay!), I revisit apply() and show even more examples of how incredibly versatile and useful this function is, once you get used to the syntax. I also show where loops can be useful.

This blog post is inspired by this great U-W Madison site for computing that I found when I was searching for a way to do a loop in Stata. They go through all of the ways you may want to loop using macros in Stata. So in this blog post, I show how to do all of these problems in R.

### The R perspective

The first thing to do when you're trying to think about how to solve a problem in R that you've done in Stata using macros, is to stop thinking 'macro' and start thinking 'objects'. The idea is that when you use R, you have a space in which to store many different objects - vectors, dataframes, matrices, lists, etc. I went over all of these in a series of blog posts called “Data types” in November of last year.

You can use the power of objects to change the way you're thinking about your programming problem. Let's start with U-W's first example: running multiple regressions of various types using a fixed set of control variables. We'll run a linear regression and a logit. In Stata you do:

``````local controlVars age sex occupation location maritalStatus hasChildren
reg income education `controlVars'
logit employed education `controlVars'
``````

In R, we can take those local controlVars and put them into a new object, for example a matrix. Then we use that same matrix in all of our regressions. Here is an example. We create some data:

``````set.seed(10)
x <- rnorm(100, 5, 2)
z <- rnorm(100, 6, 5)
w <- rnorm(100, 3, 2)
y <- x * 2 + w * 0.5 + rnorm(100, 0, 1)
ybin <- as.numeric(y < 10)

mydata <- as.data.frame(cbind(x, z, w, y, ybin))
``````

And now if we have two models to run, a linear and a logit, we can create a matrix of explanatory variables that we put on the right hand side each time. You don't need the “data=mydata” part since ybin is also a vector in our workspace, but generally if you were to import this data as a dataframe, then you would need to include it or you would need to create a separate ybin vector object from the dataframe you imported.

``````xvars <- cbind(x, z, w)
summary(lm(ybin ~ xvars, data = mydata))
``````
``````##
## Call:
## lm(formula = ybin ~ xvars, data = mydata)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6179 -0.2446  0.0175  0.2177  0.6699
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.44087    0.10888   13.23   <2e-16 ***
## xvarsx      -0.20002    0.01673  -11.95   <2e-16 ***
## xvarsz      -0.00179    0.00656   -0.27    0.785
## xvarsw      -0.03123    0.01631   -1.91    0.058 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.313 on 96 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:  0.593
## F-statistic:   49 on 3 and 96 DF,  p-value: <2e-16
``````
``````summary(glm(ybin ~ xvars, family = binomial(logit), data = mydata))
``````
``````##
## Call:
## glm(formula = ybin ~ xvars, family = binomial(logit), data = mydata)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.3283  -0.0962  -0.0030   0.0720   2.1836
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  20.0872     5.3972    3.72  0.00020 ***
## xvarsx       -4.0466     1.0590   -3.82  0.00013 ***
## xvarsz       -0.0136     0.1197   -0.11  0.90967
## xvarsw       -1.1068     0.4088   -2.71  0.00679 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 133.750  on 99  degrees of freedom
## Residual deviance:  29.772  on 96  degrees of freedom
## AIC: 37.77
##
## Number of Fisher Scoring iterations: 8
``````

Next, we want to run the regression if the data meet some requirement. In Stata we would do:

``````local blackWoman race==1 & female
reg income education `controlVars' if `blackWoman'
``````

Again in R, think objects. We can subset our original dataframe “mydata” to another matrix “data.sub”. I go over subsetting in a blogpost here. I take the subset of my data based on the conditions I want, and to stick with the above example, I do a xvars matrix as well so I can combine the two methods like so. Notice the degrees of freedom have been reduced since we're only using data in which x>2 and z<3:

``````data.sub <- as.data.frame(mydata[x > 2 & z < 3, c("x", "z", "ybin")])
xvars.sub <- as.matrix(data.sub[, c("x", "z")])
summary(lm(ybin ~ xvars.sub, data = data.sub))
``````
``````##
## Call:
## lm(formula = ybin ~ xvars.sub, data = data.sub)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.4706 -0.1793 -0.0171  0.1504  0.6144
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.4185     0.1611    8.81  1.5e-09 ***
## xvars.subx   -0.2113     0.0287   -7.38  5.0e-08 ***
## xvars.subz   -0.0434     0.0238   -1.83    0.079 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.274 on 28 degrees of freedom
## Multiple R-squared:  0.672,  Adjusted R-squared:  0.649
## F-statistic: 28.7 on 2 and 28 DF,  p-value: 1.67e-07
``````

Well now that we have the basics down, let's get to some of the more interesting problems of loops.

### Looping over Variables

In Stata, if we want to run regressions for three different outcome variables, we can do it this way, via a foreach loop:

``````foreach yvar in mpg price displacement {
reg `yvar' foreign weight
}
``````

In R, we can use apply(). The syntax of apply() is three arguments:

apply(dataset, margin, function)

• the first argument is the dataframe or matrix that you want to apply your function to,
• the second argument is the margin, meaning are you doing it over rows (margin=1) or over columns (margin=2) of the data,
• and the third argument is the function that you want to apply.

In this case, we can use the apply() function as follows. I will take the two columns of mydata (y and ybin), and for each of those columns (since margin=2) I apply the function that I create in the third argument. The function I create is to take an argument (outcome) which refers to each of those two columns and to run a linear model on each of those columns with x and z as the explanatory variables:

``````apply(mydata[,c("y","ybin")], 2, function(outcome){summary(lm(outcome~x+z))})
``````
``````## \$y
##
## Call:
## lm(formula = outcome ~ x + z)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.863 -0.887 -0.035  0.899  3.570
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   1.7342     0.4299    4.03  0.00011 ***
## x             2.0224     0.0764   26.46  < 2e-16 ***
## z            -0.0204     0.0297   -0.69  0.49450
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.43 on 97 degrees of freedom
## Multiple R-squared:  0.879,  Adjusted R-squared:  0.877
## F-statistic:  352 on 2 and 97 DF,  p-value: <2e-16
##
##
## \$ybin
##
## Call:
## lm(formula = outcome ~ x + z)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.6057 -0.2183  0.0068  0.1967  0.7292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.335967   0.095381   14.01   <2e-16 ***
## x           -0.200010   0.016962  -11.79   <2e-16 ***
## z           -0.000098   0.006587   -0.01     0.99
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.317 on 97 degrees of freedom
## Multiple R-squared:  0.59,   Adjusted R-squared:  0.581
## F-statistic: 69.7 on 2 and 97 DF,  p-value: <2e-16
``````

And out come the two summaries of the models, the first with y and the second with ybin.

### Looping over parts of variable names

In this Stata example, which I feel comes up often, we want to take each of our monthly income variables and create an indicator of whether there was positive income in that month. We want to create 12 new variables with the original name of the variable like “Jan”, now in an indicator form like “hadIncJan”. The Stata code is as follows:

``````foreach month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec {
}
``````

In R, this is one of those times that a for() loop actually works very nicely. I actually found this solution on Stackoverflow, which if you're not familiar, can be a lifesaver for your programming struggles.

I create some month data. In this example, we'll use the names() function in R, which is very useful. I had a blog post about the names() function if you want to read up on its other uses.

We also use the paste0() function, which is a way to concatenate character strings and numbers, along with a new way to index, the “[[” operator. You can read about this operator in the help file. The idea is to refer a single element of the dataframe (a vector) by inserting the column name that is being referred to via an index from the loop. I think reading the code makes things more clear. We make our data:

``````jan <- rnorm(100, 3, 5)
feb <- rnorm(100, 4, 8)
march <- rnorm(100, 2, 5)

months <- as.data.frame(cbind(jan, feb, march))
names(months)
``````
``````## [1] "jan"   "feb"   "march"
``````
``````head(months)
``````
``````##       jan    feb  march
## 1  2.6027 10.956 -2.510
## 2  8.9088 -1.440  1.726
## 3 13.9307  5.386 -4.784
## 4  5.0309  2.724  3.586
## 5 -0.6918 10.348  8.979
## 6 -6.7824 17.555 -4.056
``````

And now we run our loop:

``````for (n in names(months)) {
months[[paste0("HadInc_", n)]] <- as.numeric(months[[n]] > 0)
}
``````
``````##       jan    feb  march HadInc_jan HadInc_feb HadInc_march
## 1  2.6027 10.956 -2.510          1          1            0
## 2  8.9088 -1.440  1.726          1          0            1
## 3 13.9307  5.386 -4.784          1          1            0
## 4  5.0309  2.724  3.586          1          1            1
## 5 -0.6918 10.348  8.979          0          1            1
## 6 -6.7824 17.555 -4.056          0          1            0
``````

The for() loop takes each column name in months, and creates a new variable by concatenating the string “HadInc_” with that column name, and assigns to it a binary indicator of whether the original variable had monthly income greater than 0. If this is confusing, I suggest breaking it down to parts. You can run it this way to see what is actually happening (output not shown here, but you can run it on your own to understand it)

``````for (n in names(months)) {
print(n)
print(months[[n]])
}
``````

### Looping over Varlists

In Stata in order to do something to all or a lot of variables (for example, to rename them to have all lower case or upper case letters), you use a foreach loop like this:

``````foreach oldname of varlist * {
local newname=lower("`oldname'")
rename `oldname' `newname'
}
``````

In R, for this exact situation, you can use the toupper() function on the names of the data, or a subset of the data if you only wanted to do some of the column names.

``````names(months) <- toupper(names(months))
``````
``````##       JAN    FEB  MARCH HADINC_JAN HADINC_FEB HADINC_MARCH
## 1  2.6027 10.956 -2.510          1          1            0
## 2  8.9088 -1.440  1.726          1          0            1
## 3 13.9307  5.386 -4.784          1          1            0
## 4  5.0309  2.724  3.586          1          1            1
## 5 -0.6918 10.348  8.979          0          1            1
## 6 -6.7824 17.555 -4.056          0          1            0
``````

For other situations, like replacing indicators of missing values to NA for a bunch of variables at a time, check out my previous blog post on using apply() in these situations.

### Looping over numbers

We revisit the same problem we had with the monthly income, except now we want an indicator by year. We have variables of the yearly income. In Stata this would be the code, using now the forvalues loop with a macro:

``````forvalues year=1990/2010 {
}
``````

In R we can use a for() loop along with the “[[” operator that we used before, but this time we make use of the seq(along=x) syntax that will let us go along a sequence of numbers. Our dataframe called “Income” includes columns for each year of income. We make a new vector called “years” that just contains numbers from 1990 to 1992. Then for each value along that vector, we make a new column in our Income dataframe with a name that concatenates “hadInc_” with the number in the sequence, and this variable is just a binary indicator of whether that year's income was positive.

``````Inc1990 <- rnorm(100, 5, 6)
Inc1991 <- rnorm(100, 3, 8)
Inc1992 <- rnorm(100, 4, 4)

Income <- as.data.frame(cbind(Inc1990, Inc1991, Inc1992))
years <- c(1990:1992)
``````
``````##   Inc1990 Inc1991 Inc1992
## 1  3.6391 -14.001   6.169
## 2  0.8242   7.169   8.095
## 3  8.0399   2.629   7.251
## 4  8.6974  -6.196   9.204
## 5 -5.0013   8.035   6.272
## 6  2.9825  -2.812   3.520
``````
``````for (i in seq(along = years)) {
Income[[paste0("hadInc_", years[i])]] <- as.numeric(Income[[i]] > 0)
}
``````
``````##   Inc1990 Inc1991 Inc1992 hadInc_1990 hadInc_1991 hadInc_1992
## 1  3.6391 -14.001   6.169           1           0           1
## 2  0.8242   7.169   8.095           1           1           1
## 3  8.0399   2.629   7.251           1           1           1
## 4  8.6974  -6.196   9.204           1           0           1
## 5 -5.0013   8.035   6.272           0           1           1
## 6  2.9825  -2.812   3.520           1           0           1
``````

Again, run just parts of it to understand it if you're having trouble with the syntax.

### Looping over Values and Levelsof

Finally, we may want to run the same functions over values or levels of a variable. Here's are two situations in Stata, the first can use the by statement, and the second uses forvalues for a survey function.

``````by race: regress income age i.education

forvalues race=1/3 {
svy, subpop(if race==`race'): reg income age i.education
}
``````

In R, there is no “by” option for linear regression but we can use the lapply() function instead. The function lapply() is the same idea as apply() except it can be used to apply some function over a list. We create some data similar to the Stata example:

``````race <- c(rep(1, 30), rep(2, 30), rep(3, 40))
age <- rnorm(100, 25, 3)
y <- age * 10 + ifelse(race == 1, 100, ifelse(race == 2, 2000, 0)) + rnorm(100,
0, 1)

racedata <- as.data.frame(cbind(race, age, y))
racedata\$race <- as.factor(racedata\$race)
``````

Now we use lapply() to run the summary of lm for a subset of the racedata where we subset by each value of the list.

``````lapply(1:3, function(index) summary(lm(y~age, data=racedata[racedata\$race==index,])))
``````
``````## [[1]]
##
## Call:
## lm(formula = y ~ age, data = racedata[racedata\$race == index,
##     ])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.0021 -0.8978 -0.0686  0.8602  2.0371
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 103.0121     2.2522    45.7   <2e-16 ***
## age           9.8836     0.0895   110.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.2 on 28 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.998
## F-statistic: 1.22e+04 on 1 and 28 DF,  p-value: <2e-16
##
##
## [[2]]
##
## Call:
## lm(formula = y ~ age, data = racedata[racedata\$race == index,
##     ])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.4785 -0.4682 -0.0427  0.6685  1.3434
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.00e+03   1.34e+00    1491   <2e-16 ***
## age         9.96e+00   5.31e-02     188   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.881 on 28 degrees of freedom
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999
## F-statistic: 3.52e+04 on 1 and 28 DF,  p-value: <2e-16
##
##
## [[3]]
##
## Call:
## lm(formula = y ~ age, data = racedata[racedata\$race == index,
##     ])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.1961 -0.3204  0.0736  0.5295  2.1861
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -2.0024     1.0816   -1.85    0.072 .
## age          10.0814     0.0442  228.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.998 on 38 degrees of freedom
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999
## F-statistic: 5.21e+04 on 1 and 38 DF,  p-value: <2e-16
``````

To make this even better, we take the levels of race, which is a factor, and run the lapply() function of those instead of the number 1-3 so that if those levels change, we won't have to change our code.

``````lapply(as.numeric(levels(race)), function(index) summary(lm(y~age, data=racedata[racedata\$race==index,])))
``````

This idea can also be applied to any function that you want to evaluate by different values.

Of course there may be more efficient ways to do what I've shown here. If you have comments on improvement over these solutions, let me know!

1. Hi Slawa, thank you for this wonderful post!

In the first several examples, it could be costly in terms of memory use to generate sub data frames when the data set is very large. I guess a better way is to work with formula. We can write a function to do this:

genForm = function(dep, main, control){
indep = paste(paste(list.main, collapse = " + "), "+", paste(list.control, collapse = " + "))
form = paste(dep,"~",indep)
form = as.formula(form)
}

list.control = c("z", "w")
list.main = c("x")

summary(lm(genForm("y", "x", list.control), data = mydata))
summary(glm(genForm("ybin", "x", list.control), data = mydata))

And regression with a repeatedly used sub data frame can be done this way

con1 = expression(x > 2 & z < 3)
summary(glm(genForm("ybin", "x", list.control), data = mydata[eval(con1),]))

And looping over variables

lapply(c("y", "ybin") , function(outcome) summary(lm(genForm(outcome, "x", list.control), data = mydata)) )

1. Hi Yimeng, thanks for your reply! Those are great suggestions. I try to use the shortest possible code but you're right about needing to think about memory, especially with large data sets, which a lot of people work with. Thanks for your contribution!

2. Actually, there is an even more elegant way to subset on rows within most lm-like model-fitting functions:

summary(lm(ybin ~ xvars.sub, data = data.sub,subset=x>2 & z<3))