Monday, July 7, 2014

3 ways that functions can improve your R code

My previous blog post (http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html) went over the basics of the syntax and debugging of user-written functions. In this post, I’ll show you examples of useful functions that you can write to make your life easier when using R.

Here is the data we’ll be using for this post:

set.seed(10)
bpdata<-data.frame(bp=rnorm(1000,140,20),
                   age=rnorm(1000,50,3),
                   sex=rbinom(1000,1,.5),
                   race=as.factor(c(rep(1,500),rep(2,500))),
                   out=rbinom(1000,1,.8))
bpdata[c(100,200,400),2]<-NA
bpdata[c(300),1]<-400

>> Functions to help you avoid repeating yourself

One of the main reasons to use functions is that you can re-use them over and over without copying and pasting the same code. Copying and pasting code that you change just a little bit each time leads to errors. With a function, if you decide to change the way you do something, you change the one function, and the update automatically applies everywhere you use it.

As an example, in a previous blog post I wrote about calculating robust standard errors and exporting them (http://rforpublichealth.blogspot.com/2013/08/exporting-results-of-linear-regression_24.html).

The calculation went like this:

library(sandwich)
cov.fit1 <- vcovHC(fit1, type = "HC")
rob.std.err <- sqrt(diag(cov.fit1))

Well, if you are going to be using robust standard errors more than once in your analysis, it makes more sense to write a function. That way you don’t need to copy and paste this code and change the object fit1 every time.

To do this, I take the exact same code, but I change fit1 to a more general modelfit that is passed in as an argument to a function I call robust.SE(). Also, instead of loading the sandwich library, I can just require() it inside the function (note that you still have to have the package installed for R to load it using require()). Then you can pass in the model fit as the argument and get back the robust standard errors. Like this:

#function to calculate robust standard errors for a model
robust.SE<-function(modelfit){
  require(sandwich, quietly=TRUE)
  cov.fit1 <- vcovHC(modelfit, type = "HC")
  rob.std.err <- sqrt(diag(cov.fit1))
  return(rob.std.err)
}

#get robust SEs for model 1
model1<-lm(out~age + sex, data=bpdata)
robust.SE(model1)
## (Intercept)         age         sex 
##    0.203340    0.004071    0.025304
#get robust SEs for model2
model2<-lm(bp~age+sex, data=bpdata)
robust.SE(model2)
## (Intercept)         age         sex 
##     10.7907      0.2141      1.3653

Instead of copying and pasting the code for robust standard errors twice, I just wrap it up in a function that allows me to evaluate it for any model fit I want to throw at it.
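
A minimal extension (sketch): if you have several fits, you can apply robust.SE() to all of them in one call with lapply(). The fits list below is my own name, not from the original code:

#sketch: get robust SEs for several model fits at once
fits<-list(model1=model1, model2=model2)
lapply(fits, robust.SE)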

>> Functions to customize your output

Another reason to use functions is to make it easier and faster to do the things you always need to do, in exactly the way you want. For example, when you start with a new dataset, it’s important to look at it descriptively and check for strange or missing values. There are many ways to do this. You can of course run summary(data), which is a good start:

#use the basic summary built-in function
summary(bpdata)
##        bp             age            sex        race         out     
##  Min.   : 79.8   Min.   :40.3   Min.   :0.000   1:500   Min.   :0.0  
##  1st Qu.:126.5   1st Qu.:48.0   1st Qu.:0.000   2:500   1st Qu.:1.0  
##  Median :139.9   Median :50.0   Median :0.000           Median :1.0  
##  Mean   :140.5   Mean   :50.0   Mean   :0.492           Mean   :0.8  
##  3rd Qu.:154.5   3rd Qu.:52.2   3rd Qu.:1.000           3rd Qu.:1.0  
##  Max.   :400.0   Max.   :61.4   Max.   :1.000           Max.   :1.0  
##                  NA's   :3

But this doesn’t give me all the information I want, and it’s not in the format I want. If I had 20 variables, which is not out of the ordinary, this would be hard to read and would take up too much space in a report. I could also use the describe() function from the psych package, but it doesn’t deal well with factor variables.

I want to create a table with the variables down the rows and some specific stats across the columns: mean, standard deviation, min, max, and the number missing. I also want to make sure to treat factors as dummies instead of numeric variables. Here I create a new function that takes a dataset and produces this kind of table for all the variables in it. I call the function summarize.vars.

The only argument I pass in is a dataset. I use the dummies package to create dummy variables for all the factors, so that my data contains only numeric variables (it ignores any character variables, if I had them). Then I use the apply() function to do the summarizing. Check out my previous post on how apply() works (http://rforpublichealth.blogspot.com/2012/09/the-infamous-apply-function.html).

#function to summarize the variables in the data
summarize.vars<-function(data){
  
  #use dummies package to turn all factors into dummies
  require(dummies, quietly=TRUE)
  dat.d<-dummy.data.frame(data, dummy.class="factor")
  
  #use apply to calculate statistics for each variable
  mat<-t(apply(dat.d, 2, function(x) c(length(x), 
                                      round(mean(x, na.rm=TRUE),2), 
                                      round(sd(x, na.rm=TRUE),2), 
                                      round(min(x, na.rm=TRUE),2), 
                                      round(max(x, na.rm=TRUE),2), 
                                      sum(is.na(x)))))

  #assign column names and rownames to output table
  colnames(mat)<-c("N","Mean","SD","Min","Max","Num Missing")
  rownames(mat)<-colnames(dat.d)
  return(mat)
}

summarize.vars(bpdata)
##          N   Mean    SD   Min    Max Num Missing
## bp    1000 140.46 21.46 79.76 400.00           0
## age   1000  50.05  3.14 40.31  61.44           3
## sex   1000   0.49  0.50  0.00   1.00           0
## race1 1000   0.50  0.50  0.00   1.00           0
## race2 1000   0.50  0.50  0.00   1.00           0
## out   1000   0.80  0.40  0.00   1.00           0

I can also use this function on just part of my data, for example by subsetting like so:

summarize.vars(bpdata[bpdata$race==1,])

You can think about making this function better. One way would be to have the N for the factors count within the level, since that is more useful information. You can customize it in whatever way makes sense to get the kind of information that is most useful.
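
In the meantime, you can get per-level summaries without changing the function by splitting the data on the factor and applying summarize.vars() to each piece. A minimal sketch (the by.race name is mine):

#sketch: summarize each race group separately with split() and lapply()
by.race<-lapply(split(bpdata, bpdata$race), summarize.vars)
by.race[["1"]]  #summary table for the race==1 subset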

Let’s do another example of a function to summarize information. This time we’ll put together a function that runs a linear model, succinctly summarizes the results, and produces some diagnostic plots at the same time. This is useful if you run linear models often.

Remember that you can pass anything into a function, including a formula. In this function, I pass in a model formula and a data frame, and I don’t return any values; instead, I print directly from inside the function. You could change this so that it returns the output, which you could then store in an object outside the function (see the sketch at the end of this section).

#linear model function
lm.model.diagnostics<-function(formula, dataset){
  
  #run model and print specific output
  model1<-lm(formula=formula, data=dataset)
  stats<-round(c(summary(model1)$fstatistic[c(1,3)], 
                 summary(model1)$sigma, 
                 summary(model1)$r.squared, 
                 summary(model1)$adj.r.squared),3)
  names(stats)<-c("F","DF", "Sigma","Rsq","AdjRsq")
  l1<-list(round(summary(model1)$coefficients,3), stats)
  names(l1)<-c("Coefficients","Stats")
  print(l1)
  
  #run specific diagnostic tests
  par(mfrow=c(1,3))
  hist(model1$residuals, main="Histogram of residuals", xlab="")
  plot(model1, 1)
  plot(model1, 2)
}

#run function for model of blood pressure on age and sex
lm.model.diagnostics(bp~age+sex, bpdata)
## $Coefficients
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  122.111     10.909  11.193    0.000
## age            0.359      0.217   1.658    0.098
## sex            0.693      1.359   0.510    0.610
## 
## $Stats
##       F      DF   Sigma     Rsq  AdjRsq 
##   1.480 994.000  21.439   0.003   0.001

[Figure: diagnostic plots: histogram of residuals, residuals vs. fitted values, and normal Q-Q plot]

From the diagnostic plots, we can see that there is a funny outlier at observation 300. That observation has a blood pressure of 400, which we think is an incorrect value. We can see what happens when we take out that outlier:

#take out the outlier
lm.model.diagnostics(bp~age+sex, bpdata[c(-300),])
## $Coefficients
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  125.961     10.089  12.485    0.000
## age            0.283      0.200   1.410    0.159
## sex            0.152      1.257   0.121    0.904
## 
## $Stats
##       F      DF   Sigma     Rsq  AdjRsq 
##   0.996 993.000  19.818   0.002   0.000

[Figure: the same diagnostic plots with observation 300 removed]

Note that if your formula is long and you use it more than once, you can avoid copying and pasting it, too, by saving it in an object like this:

form1<-bp~age+sex
lm.model.diagnostics(form1, bpdata)
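
And as promised, here is a minimal sketch of a variant that returns its results instead of only printing them. lm.model.summary is a hypothetical name, not from the original post; it drops the plots and returns the list invisibly so you can store it in an object:

#sketch: same summary, but returned invisibly rather than printed
lm.model.summary<-function(formula, dataset){
  model1<-lm(formula=formula, data=dataset)
  stats<-round(c(summary(model1)$fstatistic[c(1,3)], 
                 summary(model1)$sigma, 
                 summary(model1)$r.squared, 
                 summary(model1)$adj.r.squared),3)
  names(stats)<-c("F","DF", "Sigma","Rsq","AdjRsq")
  l1<-list(round(summary(model1)$coefficients,3), stats)
  names(l1)<-c("Coefficients","Stats")
  invisible(l1)  #returned without printing, so it can be assigned
}

#store the results and pull out pieces as needed
res<-lm.model.summary(form1, bpdata)
res$Stats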

>> Functions to aid in your analysis

Another useful way to use functions is to aid your analysis itself. For example, R’s summary output for a model fit doesn’t provide confidence intervals. It’s useful to have a function that calculates the confidence intervals and puts them in a nice table. As a bonus, you can make the function versatile so that it also gives nice output for logit or Poisson models, by exponentiating the results to get odds ratios or incidence rate ratios, respectively.

Notice how in this function I set up defaults for the parameters that I pass in. Unless I indicate otherwise, exponent=FALSE, alpha=0.05, and digits=2. That way the function runs even without specifying those parameters, and if I want to change them, I can, as I do in the second example.

#function to get confidence intervals for glm output, can get exponentiated output for logit or poisson
glmCI <- function(glmfit, exponent=FALSE, alpha=0.05, digits=2)
{
  #get SE from model fit
  se <- sqrt(diag(summary(glmfit)$cov.scaled))
  
  #calculate CI for the linear case
  mat <- cbind(coef(glmfit),
                 coef(glmfit) - qnorm(1-alpha/2)*se,
                 coef(glmfit) + qnorm(1-alpha/2)*se)
  colnames(mat) <- c("Beta", "LowerCI", "UpperCI")
  
  #if exponent=TRUE, exponentiate the coefficients and CIs
  if(exponent == TRUE)
  {
    mat <- exp(mat)
    if(summary(glmfit)$family$link=="logit") colnames(mat)[1] <- "OR"
    if(summary(glmfit)$family$link=="log") colnames(mat)[1] <- "IRR"
  }
  
  #return a rounded matrix of results
  return(round(mat, digits=digits))
}

#1. use glm with identity link on continuous response data (default family is gaussian)
g.glm<-glm(bp~age+sex, data=bpdata)
glmCI(g.glm)
##               Beta LowerCI UpperCI
## (Intercept) 122.11  100.73  143.49
## age           0.36   -0.07    0.78
## sex           0.69   -1.97    3.36
#2. use glm with logit link on binary response data
b.glm<-glm(out~age+sex+bp, family=binomial(link="logit"), data=bpdata)
glmCI(b.glm, exponent=TRUE, digits=3)
##                OR LowerCI UpperCI
## (Intercept) 5.906   0.414  84.210
## age         0.990   0.942   1.040
## sex         0.831   0.609   1.133
## bp          1.001   0.994   1.009
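
The same function also covers the Poisson case. bpdata doesn’t include a count outcome, so this sketch simulates a hypothetical visits variable purely to show the IRR labeling:

#3. sketch: glm with log link on a simulated count outcome gives IRRs
set.seed(11)
bpdata$visits<-rpois(1000, lambda=2)  #hypothetical count variable
p.glm<-glm(visits~age+sex, family=poisson(link="log"), data=bpdata)
glmCI(p.glm, exponent=TRUE)  #first column will be labeled "IRR"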

There are tons of other possibilities for functions, but hopefully this post has convinced you that functions can improve your coding by reducing repetition, increasing customizability, and improving your analysis and reports.


