## Saturday, August 24, 2013

### Motivation

Basic linear regression in R is super easy: just fit the model with the lm() function, summarize it with summary(), and you're done. But as researchers we often need more than that. What if our model is more complicated, say because our errors are heteroskedastic? And once we have run our regressions with adjusted standard errors, how do we export the results for presentation in Word or LaTeX documents? This post covers exactly these things, with the help of the stargazer package created by my fellow Harvard grad student Marek Hlavac. More posts about stargazer can be found here on this blog.

Today we are going to analyze how temperature affects ice cream sales. We just make ourselves a little dataset like so:

set.seed(5)
temp <- rnorm(500, mean = 80, sd = 12)
sales <- 2 + temp * 3

# add noise: more variance at extreme temperatures than in the middle
for (i in 1:length(sales)) {
  if (temp[i] < 75 | temp[i] > 95) sales[i] <- sales[i] + rnorm(1, 0, 25)
  else sales[i] <- sales[i] + rnorm(1, 0, 8)
}

female <- rnorm(500, mean = 0.5, sd = 0.01)
icecream <- data.frame(temp, sales, female)
head(icecream)

##     temp sales female
## 1  69.91 209.6 0.4854
## 2  96.61 296.8 0.5124
## 3  64.93 227.5 0.4957
## 4  80.84 244.6 0.5001
## 5 100.54 271.2 0.5012
## 6  72.77 210.4 0.4959


So there we have the temperature for 500 cities, ice cream sales per 100,000 people, and the proportion of females in each of these cities. Let's plot temperature against sales.

plot(icecream$temp, icecream$sales,
     xlab = "Temperature", ylab = "Sales/100,000",
     main = "I scream for icecream")


Here we clearly see a relationship between a city's temperature and its ice cream sales. However, the errors are not exactly homoskedastic: for both low and very high temperatures, the variance is larger than at middle temperatures. So we will have to deal with that.

Let's start by running a linear regression using the lm() function and summarizing the results:

fit1 <- lm(sales ~ temp + female, data = icecream)
summary(fit1)

##
## Call:
## lm(formula = sales ~ temp + female, data = icecream)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -76.34  -8.02   0.34   8.19  84.97
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  23.9164    41.1872    0.58     0.56
## temp          2.9716     0.0671   44.30   <2e-16 ***
## female      -37.8194    81.0642   -0.47     0.64
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.1 on 497 degrees of freedom
## Multiple R-squared:  0.799,  Adjusted R-squared:  0.798
## F-statistic:  985 on 2 and 497 DF,  p-value: <2e-16

names(summary(fit1))

##  [1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"         "df"            "r.squared"     "adj.r.squared"
## [10] "fstatistic"    "cov.unscaled"
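Any of those components can be pulled out with the $ operator. As a quick standalone illustration (a sketch using a toy model, so it runs on its own rather than depending on the ice cream data):

```r
# toy model just to illustrate extracting components from a summary.lm object
set.seed(1)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 3 * d$x + rnorm(100)
s <- summary(lm(y ~ x, data = d))

s$r.squared                    # multiple R-squared
s$sigma                        # residual standard error
s$coefficients[, "Std. Error"] # the (naive) standard errors
s$fstatistic                   # value, numerator df, denominator df
```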


We see the strong association between temperature and ice cream sales, but no relationship between the gender ratio and sales. Don't forget that you can extract many things from the summary of the lm object (read up on extracting information using names() in this post). Now let's deal with the heteroskedasticity. In general, the covariance of the OLS estimator is

$V[\hat\beta] = (X'X)^{-1}X'\Sigma X(X'X)^{-1}$

where, under the constant variance assumption,

$\Sigma = \sigma^2I$

so the expression collapses to the familiar $$\sigma^2(X'X)^{-1}$$, and an unbiased estimate of $$\sigma^2$$ comes from the sum of squared residuals (divided by the residual degrees of freedom). In this case, however, we have seen that the variance is not constant, so the naive covariance of the OLS estimator is biased. To fix this we employ the Huber-White, or sandwich, method of robust standard errors. The sandwich method instead assumes that

$\Sigma = diag(\sigma_1^2, ... , \sigma_n^2)$

and the $$\hat\sigma_i^2$$ are estimated by the squared residuals $$\hat u_i^2$$. In R, we can use the sandwich package to estimate robust standard errors this way:

library(sandwich)
cov.fit1 <- vcovHC(fit1, type = "HC")
rob.std.err <- sqrt(diag(cov.fit1))


The vcovHC() function returns the variance-covariance matrix under "HC" (heteroskedasticity-consistent) estimation; in the sandwich package, type = "HC" is equivalent to "HC0", the original White estimator with no small-sample correction. We then take the diagonal of this matrix and square root it to get the robust standard errors. You could of course do this in one line, without creating the cov.fit1 object.
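To see that vcovHC() really is computing the sandwich formula above, we can build it by hand. This is a sketch: it regenerates the same data from the top of the post so it runs standalone, and it relies on type = "HC" (i.e., HC0) using the plain squared residuals with no finite-sample adjustment.

```r
library(sandwich)

# rebuild the ice cream data and model from above
set.seed(5)
temp <- rnorm(500, mean = 80, sd = 12)
sales <- 2 + temp * 3
for (i in 1:length(sales)) {
  sales[i] <- sales[i] + rnorm(1, 0, if (temp[i] < 75 | temp[i] > 95) 25 else 8)
}
icecream <- data.frame(temp, sales, female = rnorm(500, mean = 0.5, sd = 0.01))
fit1 <- lm(sales ~ temp + female, data = icecream)

# sandwich by hand: (X'X)^{-1} X' diag(u^2) X (X'X)^{-1}
X <- model.matrix(fit1)
u <- residuals(fit1)
bread <- solve(crossprod(X))   # (X'X)^{-1}
meat <- crossprod(X * u)       # X' diag(u^2) X
manual <- bread %*% meat %*% bread

all.equal(manual, vcovHC(fit1, type = "HC"))
```

The agreement (up to numerical error) confirms that the "bread" is $$(X'X)^{-1}$$ and the "meat" is $$X'\Sigma X$$ with the squared residuals on the diagonal.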

Now we can put the estimates, the naive standard errors, and the robust standard errors together in a nice little table. We save the naive standard errors into a vector and then column-bind the three columns together like so:
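That table might look like the sketch below. It regenerates the data and model from above so the snippet runs standalone; the column labels are just illustrative choices.

```r
library(sandwich)

# rebuild the ice cream data and model from above
set.seed(5)
temp <- rnorm(500, mean = 80, sd = 12)
sales <- 2 + temp * 3
for (i in 1:length(sales)) {
  sales[i] <- sales[i] + rnorm(1, 0, if (temp[i] < 75 | temp[i] > 95) 25 else 8)
}
icecream <- data.frame(temp, sales, female = rnorm(500, mean = 0.5, sd = 0.01))
fit1 <- lm(sales ~ temp + female, data = icecream)

# naive SEs from summary(), robust SEs from the sandwich estimator
naive.std.err <- summary(fit1)$coefficients[, "Std. Error"]
rob.std.err <- sqrt(diag(vcovHC(fit1, type = "HC")))

results <- cbind(Estimate = coef(fit1),
                 "Naive SE" = naive.std.err,
                 "Robust SE" = rob.std.err)
round(results, 3)
```

For this data the robust standard errors should come out somewhat different from the naive ones, reflecting the extra variance at extreme temperatures.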