## Monday, October 20, 2014

### Easy Clustered Standard Errors in R

Public health data can often be hierarchical in nature; for example, individuals are grouped in hospitals which are grouped in counties. When units are not independent, then regular OLS standard errors are biased.

One way to correct for this is using clustered standard errors. This post will show you how you can easily put together a function to calculate clustered SEs and get everything else you need, including confidence intervals, F-tests, and linear hypothesis testing.

Under standard OLS assumptions, with independent errors,

$V_{OLS} = \sigma^2(X'X)^{-1}$

We can estimate $$\sigma^2$$ with $$s^2$$:

$s^2 = \frac{1}{N-K}\sum_{i=1}^N e_i^2$

where N is the number of observations, K is the rank (number of variables in the regression), and $$e_i$$ are the residuals from the regression.

But if the errors are not independent because the observations are clustered within groups, then confidence intervals obtained will not have $$1-\alpha$$ coverage probability. To fix this, we can apply a sandwich estimator, like this:

$V_{Cluster} = (X'X)^{-1} \sum_{j=1}^{n_c} (u_j'*u_j) (X'X)^{-1}$

where $$n_c$$ is the total number of clusters and $$u_j = \sum_{j_{cluster}}e_i*x_i$$.

$$x_i$$ is the row vector of predictors including the constant. Programs like Stata also use a degree of freedom adjustment (small sample size adjustment), like so:

$\frac{M}{M-1}*\frac{N-1}{N-K} * V_{Cluster}$

where M is the number of clusters, N is the sample size, and K is the rank.

In this example, we’ll use the Crime dataset from the plm package. It includes yearly data on crime rates in counties across the United States, with some characteristics of those counties. Let’s load in the libraries we need and the Crime data:

library(plm)
library(lmtest)
data(Crime)

We would like to see the effect of percentage males aged 15-24 (pctymle) on crime rate, adjusting for police per capita (polpc), region, and year. However, there are multiple observations from the same county, so we will cluster by county.

In Stata the commands would look like this.

reg crmrte pctymle polpc  i.region  year, cluster(county)

In R, we can first run our basic ols model using lm() and save the results in an object called m1. Unfortunately, there’s no ‘cluster’ option in the lm() function. But there are many ways to get the same result.

#basic linear model with standard variance estimate
Crime$region<-factor(Crime$region)
m1<-lm(crmrte~pctymle+polpc+region+year, data=Crime)

#### >>> Get the cluster-adjusted variance-covariance matrix

First, I’ll show how to write a function to obtain clustered standard errors. If you are unsure about how user-written functions work, please see my posts about them, here (How to write and debug an R function) and here (3 ways that functions can improve your R code).

There are many sources to help us write a function to calculate clustered SEs. Check out these helpful links: Mahmood Arai’s paper found here and DiffusePrioR’s blogpost found here.

I’ll base my function on the first source. It uses functions from the sandwich and the lmtest packages so make sure to install those packages. However, instead of returning the coefficients and standard errors, I am going to modify Arai’s function to return the variance-covariance matrix, so I can work with that later. The function will input the lm model object and the cluster vector.

After that, I’ll do it the super easy way with the new multiwayvcov package which has a cluster.vcov() function. Help on this package found here. The function also needs the model and the cluster as inputs.

Here are both ways:

#write your own function to return variance covariance matrix under clustered SEs
get_CL_vcov<-function(model, cluster){
require(sandwich, quietly = TRUE)
require(lmtest, quietly = TRUE)

M <- length(unique(cluster))
N <- length(cluster)
K <- model$rank dfc <- (M/(M-1))*((N-1)/(N-K)) #calculate the uj's uj <- apply(estfun(model),2, function(x) tapply(x, cluster, sum)) #use sandwich to get the var-covar matrix vcovCL <- dfc*sandwich(model, meat=crossprod(uj)/N) return(vcovCL) } #call our new function and save the var-cov matrix output in an object m1.vcovCL<-get_CL_vcov(m1, Crime$county)

#equivalent way: use the cluster.vcov function to get variance-covariance matrix
library(multiwayvcov)
m1.vcovCL.2<-cluster.vcov(m1, Crime$county) Now, in order to obtain the coefficients and SEs, we can use the coeftest() function in the lmtest library, which allows us to input our own var-covar matrix. Let’s compare our standard OLS SEs to the clustered SEs. #the regular OLS standard errors coeftest(m1) ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.00262057 0.02757101 -0.0950 0.9243 ## pctymle 0.16568860 0.02688128 6.1637 1.276e-09 *** ## polpc 1.40604239 0.23482618 5.9876 3.595e-09 *** ## regionwest -0.01424935 0.00168946 -8.4343 2.317e-16 *** ## regioncentral 0.00227384 0.00147881 1.5376 0.1246 ## year 0.00022901 0.00032286 0.7093 0.4784 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #the clustered standard errors by indicating the correct var-covar matrix coeftest(m1, m1.vcovCL) ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.00262057 0.01909911 -0.1372 0.89091 ## pctymle 0.16568860 0.06511197 2.5447 0.01118 * ## polpc 1.40604239 0.88833006 1.5828 0.11398 ## regionwest -0.01424935 0.00329691 -4.3220 1.799e-05 *** ## regioncentral 0.00227384 0.00391121 0.5814 0.56120 ## year 0.00022901 0.00019074 1.2006 0.23035 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 We can see that the SEs generally increased, due to the clustering. Now, let’s obtain the F-statistic and the confidence intervals. #### >>>Getting the F-statistic and confidence intervals for clustered SEs To obtain the F-statistic, we can use the waldtest() function from the lmtest library with test=“F” indicated for the F-test. For the 95% CIs, we can write our own function that takes in the model and the variance-covariance matrix and produces the 95% CIs. You can modify this function to make it better and more versatile, but I’m going to keep it simple. #function to return confidence intervals get_confint<-function(model, vcovCL){ t<-qt(.975, model$df.residual)
ct<-coeftest(model, vcovCL)
est<-cbind(ct[,1], ct[,1]-t*ct[,2], ct[,1]+t*ct[,2])
colnames(est)<-c("Estimate","LowerCI","UpperCI")
return(est)
}

Now, we can get the F-stat and the confidence intervals:

#do a wald test to get F-statistic
waldtest(m1, vcov = m1.vcovCL, test = "F")
## Wald test
##
## Model 1: crmrte ~ pctymle + polpc + region + year
## Model 2: crmrte ~ 1
##   Res.Df Df      F    Pr(>F)
## 1    624
## 2    629 -5 7.3616 1.018e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that now the F-statistic is calculated based on a Wald test (using the cluster-robustly esimtated var-covar matrix) rather than on sums of squares and degrees of freedom. The degrees of freedom listed here are for the model, but the var-covar matrix has been corrected for the fact that there are only 90 independent observations.

If you want to save the F-statistic itself, save the waldtest function call in an object and extract:

#save waldtest in an object
w<-waldtest(m1, vcov = m1.vcovCL, test = "F")

#check out what is saved in w
names(w)
## [1] "Res.Df" "Df"     "F"      "Pr(>F)"
#use $F to get the F-statistic w$F[2]
## [1] 7.361599

For confidence intervals, we can use the function we wrote:

#obtain the confidence interval using our function
get_confint(m1, m1.vcovCL)
##                    Estimate      LowerCI       UpperCI
## (Intercept)   -0.0026205744 -0.040126885  0.0348857363
## pctymle        0.1656885998  0.037823465  0.2935537342
## polpc          1.4060423854 -0.338436157  3.1505209280
## regionwest    -0.0142493530 -0.020723744 -0.0077749620
## regioncentral  0.0022738418 -0.005406889  0.0099545724
## year           0.0002290114 -0.000145564  0.0006035867

As an aside, to get the R-squared value, you can extract that from the original model m1, since that won’t change if the errors are clustered. Again, remember that the R-squared is calculated via sums of squares, which are technically no longer relevant because of the corrected variance-covariance matrix. But it can still be used as a measure of goodness-of-fit.

#r-squared
summary(m1)$r.squared ## [1] 0.2198695 #### >>>One function to do everything Notice, that you could wrap all of these 3 components (F-test, coefficients/SEs, and CIs) in a function that saved them all in a list, for example like this: super.cluster.fun<-function(model, cluster) { require(multiwayvcov) require(lmtest) vcovCL<-cluster.vcov(model, cluster) coef<-coeftest(model, vcovCL) w<-waldtest(model, vcov = vcovCL, test = "F") ci<-get_confint(model, vcovCL) return(list(coef, w, ci)) } super.cluster.fun(m1, Crime$county)
## [[1]]
##
## t test of coefficients:
##
##                  Estimate  Std. Error t value  Pr(>|t|)
## (Intercept)   -0.00262057  0.01909911 -0.1372   0.89091
## pctymle        0.16568860  0.06511197  2.5447   0.01118 *
## polpc          1.40604239  0.88833006  1.5828   0.11398
## regionwest    -0.01424935  0.00329691 -4.3220 1.799e-05 ***
## regioncentral  0.00227384  0.00391121  0.5814   0.56120
## year           0.00022901  0.00019074  1.2006   0.23035
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## [[2]]
## Wald test
##
## Model 1: crmrte ~ pctymle + polpc + region + year
## Model 2: crmrte ~ 1
##   Res.Df Df      F    Pr(>F)
## 1    624
## 2    629 -5 7.3616 1.018e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## [[3]]
##                    Estimate      LowerCI       UpperCI
## (Intercept)   -0.0026205744 -0.040126885  0.0348857363
## pctymle        0.1656885998  0.037823465  0.2935537342
## polpc          1.4060423854 -0.338436157  3.1505209280
## regionwest    -0.0142493530 -0.020723744 -0.0077749620
## regioncentral  0.0022738418 -0.005406889  0.0099545724
## year           0.0002290114 -0.000145564  0.0006035867

Then you could extract each component with the [[]] operator. Check out this post(“Returning a list of objects”) if you’re unsure.

#### >>>Hypothesis testing when errors are clustered

Now what if we wanted to test whether the west region coefficient was different from the central region? Again, we need to incorporate the right var-cov matrix into our calculation. Fortunately the car package has a linearHypothesis() function that allows for specification of a var-covar matrix. The inputs are the model, the var-cov matrix, and the coefficients you want to test. Check out the help file of the function to see the wide range of tests you can do.

library(car)
linearHypothesis(m1, vcov=m1.vcovCL, "regionwest = regioncentral")
## Linear hypothesis test
##
## Hypothesis:
## regionwest - regioncentral = 0
##
## Model 1: restricted model
## Model 2: crmrte ~ pctymle + polpc + region + year
##
## Note: Coefficient covariance matrix supplied.
##
##   Res.Df Df      F   Pr(>F)
## 1    625
## 2    624  1 18.522 1.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#### >>>Troubleshooting and error messages

One reason to opt for the cluster.vcov() function from the multiwayvcov package is that it can handle missing values without any problems. When doing the variance-covariance matrix using the user-written function get_CL_vcov above, an error message can often come up:

#error messages
get_CL_vcov(m1, Crime$couunty) ## Error in tapply(x, cluster, sum): arguments must have same length There are two common reasons for this. One is just that you spelled the name of the cluster variable incorrectly (as above). Make sure to check that. The second is that you have missing values in your outcome or explanatory variables. In this case, the length of the cluster will be different from the length of the outcome or covariates and tapply() will not work. One possible solutions is to remove the missing values by subsetting the cluster to include only those values where the outcome is not missing. Another option is to run na.omit() on the entire dataset to remove all missing vaues. Here’s an example: #force in missing value in outcome Crime2<-Crime Crime2$crmrte[1]<-NA

#rerun model
m2<-lm(crmrte~pctymle+polpc+region+year, data=Crime2)

#this produces errors
v1<-get_CL_vcov(m2, Crime2$county) ## Error in tapply(x, cluster, sum): arguments must have same length #but we can remove the observations in county for which crmrte is missing v2<-get_CL_vcov(m2, Crime2$county[!is.na(Crime2$crmrte)]) #or can use na.omit Crime3<-na.omit(Crime2) m3<-lm(crmrte~pctymle+polpc+region+year, data=Crime3) v3<-get_CL_vcov(m3, Crime3$county)

However, if you’re running a number of regressions with different covariates, each with a different missing pattern, it may be annoying to create multiple datasets and run na.omit() on them to deal with this. To avoid this, you can use the cluster.vcov() function, which handles missing values within its own function code, so you don’t have to.

#### >>>Using plm

Finally, you can also use the plm() and vcovHC() functions from the plm package. You still need to do your own small sample size correction though.

#use plm function to define formula, dataset, that it's a pooling model, and the cluster variable
p1 <- plm(crmrte~pctymle+polpc+region+year, Crime, model='pooling', index=c('county'))

G <- length(unique(Crime$county)) N <- length(Crime$county)
dfa <- (G/(G - 1)) * (N - 1)/p1$df.residual #use coeftest and the vcovHC functions, specifying HC0 type and identifying cluster as 'group' coeftest(p1, vcov=function(x) dfa*vcovHC(x, cluster="group", type="HC0")) ## ## t test of coefficients: ## ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.00262057 0.01909911 -0.1372 0.89091 ## pctymle 0.16568860 0.06511197 2.5447 0.01118 * ## polpc 1.40604239 0.88833006 1.5828 0.11398 ## regionwest -0.01424935 0.00329691 -4.3220 1.799e-05 *** ## regioncentral 0.00227384 0.00391121 0.5814 0.56120 ## year 0.00022901 0.00019074 1.2006 0.23035 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 A website that goes further into this function is here. Update: A reader pointed out to me that another package that can do clustering is the rms package, so definitely check that out as well. Public health data can often be hierarchical in nature; for example, individuals are grouped in hospitals which are grouped in counties. When units are not independent, then regular OLS standard errors are biased. One way to correct for this is using clustered standard errors. This post will show you how you can easily put together a function to calculate clustered SEs and get everything else you need, including confidence intervals, F-tests, and linear hypothesis testing. Under standard OLS assumptions, with independent errors, $$V_{OLS} = \sigma^2(X'X)^{-1}$$ We can estimate$\sigma^2$with$s^2$: $$s^2 = \frac{1}{N-K}\sum_{i=1}^N e_i^2$$ where N is the number of observations, K is the rank (number of variables in the regression), and$e_i$are the residuals from the regression. But if the errors are not independent because the observations are clustered within groups, then confidence intervals obtained will not have$1-\alpha$coverage probability. To fix this, we can apply a sandwich estimator, like this: $$V_{Cluster} = (X'X)^{-1} \sum_{j=1}^{n_c} (u_j'*u_j) (X'X)^{-1}$$ where$n_c$is the total number of clusters and$u_j = \sum_{j_{cluster}}e_i*x_i$.$x_i$is the row vector of predictors including the constant. Programs like Stata also use a degree of freedom adjustment (small sample size adjustment), like so: $$\frac{M}{M-1}*\frac{N-1}{N-K} * V_{Cluster}$$ where M is the number of clusters, N is the sample size, and K is the rank. In this example, we'll use the Crime dataset from the plm package. It includes yearly data on crime rates in counties across the United States, with some characteristics of those counties. Let's load in the libraries we need and the Crime data: library(plm) library(lmtest) data(Crime) We would like to see the effect of percentage males aged 15-24 (pctymle) on crime rate, adjusting for police per capita (polpc), region, and year. However, there are multiple observations from the same county, so we will cluster by county. In Stata the commands would look like this. reg crmrte pctymle polpc i.region year, cluster(county) In R, we can first run our basic ols model using lm() and save the results in an object called m1. Unfortunately, there's no 'cluster' option in the lm() function. But there are many ways to get the same result #basic linear model with standard variance estimate Crime$region<-factor(Crime$region) m1<-lm(crmrte~pctymle+polpc+region+year, data=Crime) Get the cluster-adjusted variance-covariance matrix First, I'll show how to write a function to obtain clustered standard errors. If you are unsure about how user-written functions work, please see my posts about them, [here](http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html) (How to write and debug an R function) and [here](http://rforpublichealth.blogspot.com/2014/07/3-ways-that-functions-can-improve-your.html) (3 ways that functions can improve your R code). There are many sources to help us write a function to calculate clustered SEs. Check out these helpful links: Mahmood Arai's paper [found here](http://people.su.se/~ma/clustering.pdf) and DiffusePrioR's blogpost [found here](http://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/). I'll base my function on the first source. It uses functions from the sandwich and the lmtest packages so make sure to install those packages. However, instead of returning the coefficients and standard errors, I am going to modify Arai's function to return the variance-covariance matrix, so I can work with that later. The function will input the lm model object and the cluster vector. After that, I'll do it the super easy way with the new multiwayvcov package which has a cluster.vcov() function. Help on this package [found here](http://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf). The function also needs the model and the cluster as inputs. Here are both ways: #write your own function to return variance covariance matrix under clustered SEs get_CL_vcov<-function(model, cluster){ require(sandwich, quietly = TRUE) require(lmtest, quietly = TRUE) #calculate degree of freedom adjustment M <- length(unique(cluster)) N <- length(cluster) K <- model$rank dfc <- (M/(M-1))*((N-1)/(N-K)) #calculate the uj's uj <- apply(estfun(model),2, function(x) tapply(x, cluster, sum) #use sandwich to get the var-covar matrix vcovCL <- dfc*sandwich(model, meat=crossprod(uj)/N) return(vcovCL)} #call our new function and save the var-cov matrix output in an object m1.vcovCL<-get_CL_vcov(m1, Crime$county) #equivalent way: use the cluster.vcov function to get variance-covariance matrix library(multiwayvcov) m1.vcovCL.2<-cluster.vcov(m1, Crime$county) Now, in order to obtain the coefficients and SEs, we can use the coeftest() function in the lmtest library, which allows us to input our own var-covar matrix. Let's compare our standard OLS SEs to the clustered SEs. #the regular OLS standard errors coeftest(m1) #the clustered standard errors by indicating the correct var-covar matrix coeftest(m1, m1.vcovCL) We can see that the SEs generally increased, due to the clustering. Now, let's obtain the F-statistic and the confidence intervals. Getting the F-statistic and confidence intervals for clustered SEs To obtain the F-statistic, we can use the waldtest() function from the lmtest library with test="F" indicated for the F-test. For the 95% CIs, we can write our own function that takes in the model and the variance-covariance matrix and produces the 95% CIs. You can modify this function to make it better and more versatile, but I'm going to keep it simple. #function to return confidence intervals get_confint<-function(model, vcovCL){ t<-qt(.975, model$df.residual) ct<-coeftest(model, vcovCL) est<-cbind(ct[,1], ct[,1]-t*ct[,2], ct[,1]+t*ct[,2]) colnames(est)<-c("Estimate","LowerCI","UpperCI") return(est)} Now, we can get the F-stat and the confidence intervals: #do a wald test to get F-statistic waldtest(m1, vcov = m1.vcovCL, test = "F") Note that now the F-statistic is calculated based on a Wald test (using the cluster-robustly esimtated var-covar matrix) rather than on sums of squares and degrees of freedom. The degrees of freedom listed here are for the model, but the var-covar matrix has been corrected for the fact that there are only 90 independent observations. If you want to save the F-statistic itself, save the waldtest function call in an object and extract: #save waldtest in an object w<-waldtest(m1, vcov = m1.vcovCL, test = "F") #check out what is saved in w names(w) #use$F to get the F-statistic w$F[2] For confidence intervals, we can use the function we wrote: #obtain the confidence interval using our function get_confint(m1, m1.vcovCL) As an aside, to get the R-squared value, you can extract that from the original model m1, since that won't change if the errors are clustered. Again, remember that the R-squared is calculated via sums of squares, which are technically no longer relevant because of the corrected variance-covariance matrix. But it can still be used as a measure of goodness-of-fit. #r-squared summary(m1)$r.squared One function to do everything Notice, that you could wrap all of these 3 components (F-test, coefficients/SEs, and CIs) in a function that saved them all in a list, for example like this: super.cluster.fun<-function(model, cluster){ require(multiwayvcov) require(lmtest) vcovCL<-cluster.vcov(model, cluster) coef<-coeftest(model, vcovCL) w<-waldtest(model, vcov = vcovCL, test = "F") ci<-get_confint(model, vcovCL) return(list(coef, w, ci))} super.cluster.fun(m1, Crime$county) Then you could extract each component with the [[]] operator. Check out [this post](http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html)("Returning a list of objects") if you're unsure. Hypothesis testing when errors are clustered Now what if we wanted to test whether the west region coefficient was different from the central region? Again, we need to incorporate the right var-cov matrix into our calculation. Fortunately the car package has a __linearHypothesis()__ function that allows for specification of a var-covar matrix. The inputs are the model, the var-cov matrix, and the coefficients you want to test. Check out the help file of the function to see the wide range of tests you can do. library(car) linearHypothesis(m1, vcov=m1.vcovCL, "regionwest = regioncentral") Troubleshooting and error messages One reason to opt for the cluster.vcov() function from the multiwayvcov package is that it can handle missing values without any problems. When doing the variance-covariance matrix using the user-written function get_CL_vcov above, an error message can often come up: #error messages get_CL_vcov(m1, Crime$couunty) There are two common reasons for this. One is just that you spelled the name of the cluster variable incorrectly (as above). Make sure to check that. The second is that you have __missing values__ in your outcome or explanatory variables. In this case, the length of the cluster will be different from the length of the outcome or covariates and tapply() will not work. One possible solutions is to remove the missing values by subsetting the cluster to include only those values where the outcome is not missing. Another option is to run na.omit() on the entire dataset to remove all missing vaues. Here's an example: #force in missing value in outcome Crime2<-Crime Crime2$crmrte[1]<-NA #rerun model m2<-lm(crmrte~pctymle+polpc+region+year, data=Crime2) #this produces errors v1<-get_CL_vcov(m2, Crime2$county) #but we can remove the observations in county for which crmrte is missing v2<-get_CL_vcov(m2, Crime2$county[!is.na(Crime2$crmrte)]) #or can use na.omit Crime3<-na.omit(Crime2) m3<-lm(crmrte~pctymle+polpc+region+year, data=Crime3) v3<-get_CL_vcov(m3, Crime3$county) However, if you're running a number of regressions with different covariates, each with a different missing pattern, it may be annoying to create multiple datasets and run na.omit() on them to deal with this. To avoid this, you can use the cluster.vcov() function, which handles missing values within its own function code, so you don't have to. Using plm Finally, you can also use the plm() and vcovHC() functions from the plm package. You still need to do your own small sample size correction though. #use plm function to define formula, dataset, that it's a pooling model, and the cluster variable p1 <- plm(crmrte~pctymle+polpc+region+year, Crime, model='pooling', index=c('county')) #calculate small sample size adjustment G <- length(unique(Crime$county)) N <- length(Crime$county) dfa <- (G/(G - 1)) * (N - 1)/p1$df.residual #use coeftest and the vcovHC functions, specifying HC0 type and identifying cluster as 'group' coeftest(p1, vcov=function(x) dfa*vcovHC(x, cluster="group", type="HC0")) A website that goes further into this function [is here](http://landroni.wordpress.com/2012/06/02/fama-macbeth-and-cluster-robust-by-firm-and-time-standard-errors-in-r/). Update: A reader pointed out to me that another package that can do clustering is the rms package, so definitely [check that out](http://cran.r-project.org/web/packages/rms/rms.pdf) as well.