My previous blog post went over the basics of the syntax and debugging of user-written functions. In this post, I’ll show you examples of useful functions that you can write to make your life easier when using R.
Here is the data we’ll be using for this post:
set.seed(10)
bpdata<-data.frame(bp=rnorm(1000,140,20),
age=rnorm(1000,50,3),
sex=rbinom(1000,1,.5),
race=as.factor(c(rep(1,500),rep(2,500))),
out=rbinom(1000,1,.8))
bpdata[c(100,200,400),2]<-NA
bpdata[c(300),1]<-400
>> Functions to help you avoid repeating yourself
One of the main reasons to use functions is that you can re-use them over and over again, without repeating the same code again and again. Copying and pasting code that you change just a little bit every time leads to errors. With a function, if you decide to change the way you do something, you just change the one function and it will now automatically update your code everywhere you use that function.
As an example, in a previous blog post I wrote about calculating robust standard errors and exporting them.
The calculation went like this:
library(sandwich)
cov.fit1 <- vcovHC(fit1, type = "HC")
rob.std.err <- sqrt(diag(cov.fit1))
Well, if you are going to be using robust standard errors more than once in your analysis, it makes more sense to make a function. That way you don’t need to copy and paste this code and change the object fit1 everytime.
To do this, I just take the exact same code, but I change the fit1 to a more general modelfit that is passed in as an argument to a function I call robust.SE(). Also, instead of loading the library sandwich, I can just require it for that function (note that you still have to have the package installed for R to load it using require). Then you can pass in the model fit as the argument and return the robust standard errors. Like this:
#function to calculate robust standard errors for a model
robust.SE<-function(modelfit){
require(sandwich, quietly=TRUE)
cov.fit1 <- vcovHC(modelfit, type = "HC")
rob.std.err <- sqrt(diag(cov.fit1))
return(rob.std.err)
}
#get robust SEs for model 1
model1<-lm(out~age + sex, data=bpdata)
robust.SE(model1)
## (Intercept) age sex
## 0.203340 0.004071 0.025304
#get robust SEs for model2
model2<-lm(bp~age+sex, data=bpdata)
robust.SE(model2)
## (Intercept) age sex
## 10.7907 0.2141 1.3653
Instead of copying and pasting the code for robust standard errors twice, I just wrap it up in a function that allows me to evaluate it for any model fit I want to throw at it.
>> Functions to customize your output
Another reason to use functions is to make it easier and faster to do the things you always need to do, in the exact way you want. For example, it’s important when you start with a new dataset to look at it descriptively and look at whether there are strange values or missing values. There are many ways to do this. You can of course do summary(data) which is a good start:
#use the basic summary built-in function
summary(bpdata)
## bp age sex race out
## Min. : 79.8 Min. :40.3 Min. :0.000 1:500 Min. :0.0
## 1st Qu.:126.5 1st Qu.:48.0 1st Qu.:0.000 2:500 1st Qu.:1.0
## Median :139.9 Median :50.0 Median :0.000 Median :1.0
## Mean :140.5 Mean :50.0 Mean :0.492 Mean :0.8
## 3rd Qu.:154.5 3rd Qu.:52.2 3rd Qu.:1.000 3rd Qu.:1.0
## Max. :400.0 Max. :61.4 Max. :1.000 Max. :1.0
## NA's :3
But this doesn’t give me all the information I want and it’s not in the format I want. If I had 20 variables, which is not out of the ordinary, it would be hard to read this and it takes up too much space in a report. I could also use the describe() function in the psych package, but it doesn’t deal well with factor variables.
I want to create a table with the variables down the rows and some specific stats going across the columns, namely mean, standard deviation, min, max, and the number missing. I want to make sure to treat factors as dummies instead of numeric variables. Here I create a new function that takes a dataset and produces this kind of table for all the variables within the dataset. I call the function summarize.vars.
The only argument I pass through is a dataset. I use the package dummies to create dummy variables of all the factors so that I have only numeric variables in my data (it ignores all character variables if I had any). Then I use the apply() function to do my summarizing. Check out my previous post on how apply works, here.
#function to summarize the variables in the data
summarize.vars<-function(data){
#use dummies package to turn all factors into dummies
require(dummies, quietly=TRUE)
dat.d<-dummy.data.frame(data, dummy.class="factor")
#use apply to calculate statistics for each variable
mat<-t(apply(dat.d, 2, function(x) c(length(x),
round(mean(x, na.rm=TRUE),2),
round(sd(x, na.rm=TRUE),2),
round(min(x, na.rm=TRUE),2),
round(max(x, na.rm=TRUE),2),
length(x)-length(x[!is.na(x)]))))
#assign column names and rownames to output table
colnames(mat)<-c("N","Mean","SD","Min","Max","Num Missing")
rownames(mat)<-colnames(dat.d)
return(mat)
}
summarize.vars(bpdata)
## N Mean SD Min Max Num Missing
## bp 1000 140.46 21.46 79.76 400.00 0
## age 1000 50.05 3.14 40.31 61.44 3
## sex 1000 0.49 0.50 0.00 1.00 0
## race1 1000 0.50 0.50 0.00 1.00 0
## race2 1000 0.50 0.50 0.00 1.00 0
## out 1000 0.80 0.40 0.00 1.00 0
I can also use this to summarize only part of my data, for example by subsetting my data like so:
summarize.vars(bpdata[bpdata$race==1,])
You can think about making this function better. One way would be to have the N for the factors count within the level, since that is more useful information. You can customize it in whatever way makes sense to get the kind of information that is most useful.
Let’s do another example of a function to summarize information. This time we’ll put together a function that runs a linear model, succinctly summarizes the results, and produces some diagnostic plots at the same time. This is useful if you run linear models often.
Remember that you can pass anything through a function, including a formula. In this function, I pass in a formula for the model and the dataframe and I don’t return any values; instead, I print directly from inside the function. You could change this so that it returned the output and you could store it in an object outside the function.
#linear model function
lm.model.diagnostics<-function(formula, dataset){
#run model and print specific output
model1<-lm(formula=formula, data=dataset)
stats<-round(c(summary(model1)$fstatistic[c(1,3)],
summary(model1)$sigma,
summary(model1)$r.squared,
summary(model1)$adj.r.squared),3)
names(stats)<-c("F","DF", "Sigma","Rsq","AdjRsq")
l1<-list(round(summary(model1)$coefficients,3), stats)
names(l1)<-c("Coefficients","Stats")
print(l1)
#run specific diagnostic tests
par(mfrow=c(1,3))
hist(model1$residuals, main="Histogram of residuals", xlab="")
plot(model1, 1)
plot(model1, 2)
}
#run function for model of blood pressure on age and sex
lm.model.diagnostics(bp~age+sex, bpdata)
## $Coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122.111 10.909 11.193 0.000
## age 0.359 0.217 1.658 0.098
## sex 0.693 1.359 0.510 0.610
##
## $Stats
## F DF Sigma Rsq AdjRsq
## 1.480 994.000 21.439 0.003 0.001
So we can see that there is a funny outlier in observation 300. That observation has a blood pressure of 400, which we think is an incorrect value. We can see what happens when we take out that outlier:
#take out the outlier
lm.model.diagnostics(bp~age+sex, bpdata[c(-300),])
## $Coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 125.961 10.089 12.485 0.000
## age 0.283 0.200 1.410 0.159
## sex 0.152 1.257 0.121 0.904
##
## $Stats
## F DF Sigma Rsq AdjRsq
## 0.996 993.000 19.818 0.002 0.000
Note that if your formula is long and you use it more than once, you can avoid copying and pasting it as well by just saving it in an object like this:
form1<-bp~age+sex
lm.model.diagnostics(form1, bpdata)
>> Functions to aid in your analysis
Another useful way to use functions is for improving your analysis. For example, R summary output of a model fit doesn’t provide confidence intervals. It’s useful to have a function to calculate the confidence intervals and put them in a nice table. As a bonus, you can make the function versatile so that it can provide nice output of logit or poisson models, by exponentiating the results to get odds ratios or incidence rate ratios, respectively.
Notice how in this function I set up the defaults for the parameters that I pass in. Unless I indicate otherwise, exponent=FALSE, alpha=.05, and digits=2. That way the function runs even without specifying those parameters. If I want to change them, I can do so the way I do in the second example.
#function to get confidence intervals for glm output, can get exponentiated output for logit or poisson
glmCI <- function(glmfit, exponent=FALSE, alpha=0.05, digits=2)
{
#get SE from model fit
se <- sqrt(diag(summary(glmfit)$cov.scaled))
#calculuate CI for linear case
mat <- cbind(coef(glmfit),
coef(glmfit) - qnorm(1-alpha/2)*se,
coef(glmfit) + qnorm(1-alpha/2)*se)
colnames(mat) <- c("Beta", "LowerCI", "UpperCI")
#if exponent=TRUE, exponeniate the coefficients and CIs
if(exponent == TRUE)
{
mat <- exp(mat)
if(summary(glmfit)$family$link=="logit") colnames(mat)[1] <- "OR"
if(summary(glmfit)$family$link=="log") colnames(mat)[1] <- "IRR"
}
#return a rounded matrix of results
return(round(mat, digits=digits))
}
#1. use glm with identity link on continuous response data (default family is gaussian)
g.glm<-glm(bp~age+sex, data=bpdata)
glmCI(g.glm)
## Beta LowerCI UpperCI
## (Intercept) 122.11 100.73 143.49
## age 0.36 -0.07 0.78
## sex 0.69 -1.97 3.36
#2. use glm with logit link on binary response data
b.glm<-glm(out~age+sex+bp, family=binomial(link="logit"), data=bpdata)
glmCI(b.glm, exponent=TRUE, digits=3)
## OR LowerCI UpperCI
## (Intercept) 5.906 0.414 84.210
## age 0.990 0.942 1.040
## sex 0.831 0.609 1.133
## bp 1.001 0.994 1.009
There are tons of other possibilities for functions, but hopefully this post has convinced you that functions can improve your coding by reducing repetition, increasing customizability, and improving your analysis and reports.