Friday, December 26, 2014

Animations and GIFs using ggplot2

[Animated GIFs: Tracing a regression line | Diverging density plots | Happy New Year plot]

Happy New Year everyone! For the last post of the year, I thought I'd have a little fun with the new animation package in R. It's actually really easy to use. I recently had some fun with it when I presented my research at an electronic poster session and had an animated movie embedded in the PowerPoint. All of the GIFs above use the ggplot2 and animation packages.

The main idea is to iterate the same plot over and over again, changing incrementally whatever it is that you want to move in the graph, and then save all those plots together into one GIF.

Let's start with the first plot, which traces a regression line over the scatterplot of the points. We'll make up some data and fit a loess of degree two (the default). You could easily do this with any kind of regression.

    #load the packages used throughout this post
    library(ggplot2)
    library(animation)

    #make up some data
    tracedat<-data.frame(x=rnorm(1000,0,1))
    tracedat$y<-abs(tracedat$x)*2+rnorm(1000,0,3)

    #predict a loess fit and add predicted values to the dataframe
    loess_fit <- loess(y ~ x, tracedat)
    tracedat$predict_y<-predict(loess_fit)

Now let's make the completed scatterplot and loess line using ggplot. If you need help on how to plot a scatterplot in ggplot, see my post ggplot2: Cheatsheet for Scatterplots. It is possible to use **stat_smooth()** within **ggplot** to get the loess fit without predicting the values and using **geom_line()**, but the predicted values are going to make it easier to make the animation.

    #plot finished scatterplot with loess fit
    ggplot(tracedat, aes(x,y)) +
      geom_point() +
      geom_line(data=tracedat, aes(x,predict_y), color="red", size=1.3) +
      scale_x_continuous(limits=c(-3, 3)) +
      scale_y_continuous(limits=c(-10, 10))

Now what we need is for the loess fit to appear bit by bit. To do this, we'll restrict the dataframe passed to **geom_line** to only those x-values up to a certain cutoff (by subsetting the dataframe called tracedat in the **geom_line** statement). Then we'll just keep moving that cutoff forward as we iterate over the range of all x-values.
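
The cutoff idea described above can be sketched as a small function. This is only a sketch under assumptions: the name draw.curve() matches the function called later in this post, but the body below is a guess at one way to write it, and rows.up.to() is a hypothetical helper introduced here for illustration. The data are rebuilt so the block runs on its own, and ggplot2 functions are called with the `ggplot2::` prefix so no library() call is needed.

```r
# Sketch only: the body of draw.curve() is an assumption, not the post's original code.
# Rebuild the example data so the block runs on its own.
set.seed(1)
tracedat <- data.frame(x = rnorm(1000, 0, 1))
tracedat$y <- abs(tracedat$x) * 2 + rnorm(1000, 0, 3)
tracedat$predict_y <- predict(loess(y ~ x, tracedat))

# Hypothetical helper: keep only the rows whose x is at or below the cutoff
rows.up.to <- function(dat, cutoff) {
  dat[dat$x <= cutoff, , drop = FALSE]
}

# Full scatterplot, with the loess line drawn only up to the cutoff
draw.curve <- function(cutoff) {
  p <- ggplot2::ggplot(tracedat, ggplot2::aes(x, y)) +
    ggplot2::geom_point() +
    ggplot2::geom_line(data = rows.up.to(tracedat, cutoff),
                       ggplot2::aes(x, predict_y), color = "red", size = 1.3) +
    ggplot2::scale_x_continuous(limits = c(-3, 3)) +
    ggplot2::scale_y_continuous(limits = c(-10, 10))
  print(p)
}
```

Printing the plot inside the function matters here: when the function is later called from inside lapply(), a ggplot object that is not explicitly printed would never be drawn, so no frame would be recorded.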

First, we will build a function that takes the cutoff value as an argument. Then we can pass whatever value of x we want and it will only graph the line up to that cutoff. Notice how the scatterplot itself, however, is for the full data. For more on how to write functions, see my post about them.

    #function to draw the scatterplot, but the curve fit only up to whatever index we set it at

    #try it out: draw curve up to cutoff x-value of -2
    draw.curve(cutoff=-2)

Almost done! Now we just need to iterate the **draw.curve()** function we just created over the full range of the values of x. So we'll use **lapply()** to step through the sequence from i=-3 to 3 (incrementing by .2), calling **draw.curve()** for each value of i. Finally, we'll use the saveGIF() function from the animation package to stick all the images together successively into one GIF. The interval argument sets how long (in seconds) the GIF waits before moving from one image to the next, and you can give the file a name. If you don't want the GIF to loop back to the start again, you would add the argument "loop=FALSE" to the function call.

    #function to iterate over the full span of x-values
    trace.animate <- function() {
      lapply(seq(-3,3,.2), function(i) {
        draw.curve(i)
      })
    }

    #save all iterations into one GIF
    saveGIF(trace.animate(), interval = .2, movie.name = "trace.gif")

#### 2. Diverging density plots

The same idea used above is used to make the diverging density plots (plot 2 above). Here we are showing the distribution of scores before some intervention and after the intervention. We need to create the code for a ggplot density plot and turn it into a function that can take as an argument the variable we want to plot (that way we can use the same code for the "before" intervention plot and the "after" plot). Then to animate, we'll iterate between them. We'll make up some data and plot the "before" plot to start.

Notice that we'll use **aes_string** rather than simply **aes** in the ggplot statement in order to be able to pass the variable name in as a character argument when we turn this into a function. More on how to plot distributions using **ggplot** is in my post ggplot2: Cheatsheet for Visualizing Distributions. It's important to set the scale for the x and y axes so that when we iterate over the two plots, we have the same dimensions each time. The alpha argument in **geom_density** makes the colors more transparent.

    <-data.frame(base=rnorm(5000, 0, 3), follow=rnorm(5000, 0, 3), type=c(rep("Type 1",2500),rep("Type 2",2500)))
    $follow<-ifelse($type=="Type 2",$follow+12,$follow)

    #plot one to make sure it's working. Use aes_string rather than aes
    p<-ggplot(, aes_string("base", fill="type")) +
      geom_density(alpha=0.5) +
      theme(legend.position="bottom") +
      scale_x_continuous(limits=c(-10, 20)) +
      scale_y_continuous(limits=c(0, 0.20)) +
      scale_fill_manual("", labels=c("Type 1", "Type 2"), values = c("orange","purple")) +
      labs(x="Score",y="Density", title="title")

Again, we write two functions: one that draws a density plot based on the arguments passed to it (**plot.dens()**), and one that iterates over the two different plots (called **distdiverge.animate()**). In the **plot.dens()** function, we pass the plot.item (a character string that aes_string will understand as the name of the column to plot) and the title, which is also a character string.

    #function that plots a density plot with arguments for the variable to plot and the title
    plot.dens<-function(plot.item, title.item){
      p<-ggplot(, aes_string(plot.item, fill="type"))+
        geom_density(alpha=0.5) +
        theme(legend.position="bottom") +
        scale_x_continuous(limits=c(-10, 20)) +
        scale_y_continuous(limits=c(0, 0.20)) +
        scale_fill_manual("", labels=c("Type 1", "Type 2"), values = c("orange","purple"))+
        labs(x="Score",y="Density", title=title.item)
      print(p)
    }

    #try it out - plot it for the follow data with the title "After Intervention"
    plot.dens(plot.item="follow", title.item="After Intervention")

    #function that iterates over the two different plots
    distdiverge.animate <- function() {
      items<-c("base", "follow")
      titles<-c("Before Intervention","After Intervention")
      lapply(1:2, function(i) {
        plot.dens(items[i], titles[i])
      })
    }

We'll make the interval slower so there's more time to view each plot before the GIF moves to the next one.

    #save in a GIF
    saveGIF(distdiverge.animate(), interval = .65, movie.name = "dist.gif")

#### 3. Happy New Year plot

Finally, to make the fun "Happy 2015" plot, we just make a black background plot, create new random data at every iteration for the snow, and then iterate over the letters of the sign.

Let's start with the first plot for the letter "H". We create some data and the objects that will hold the letters of the sign that scrolls through, the colors (I use the colorspace package to pull some colors for me), and the x- and y-coordinates for where we want the letters to go. You can think about randomizing the coordinates too. Then we just plot a scatterplot and use annotate to add the letter to it.

    #rainbow_hcl() comes from the colorspace package
    library(colorspace)

    #create dataset
    happy2015<-data.frame(x=rnorm(500, 0, 1.5), y=rnorm(500, 0, 1.5), z=rnorm(500,0,1.5))

    #create objects to hold the letters, colors, and x and y coordinates that we will scroll through
    sign<-c("H","A","P","P","Y","2","0","1","5","!!")
    colors <- rainbow_hcl(10, c=300)
    xcoord<-rep(c(-2, -1, 0, 1, 2),2)
    ycoord<-c(2, 1.7, 2.1, 1.5, 2, -.5, 0, -1, -.8, -.7)

We set up the ggplot theme and test the first plot.

    #set up the theme in an object (get rid of axes, grids, and legend)
    theme.both<- theme(legend.position="none",
      panel.background = element_blank(),
      axis.ticks = element_blank(),
      axis.line = element_blank(),
      axis.text.x = element_blank(),
      axis.text.y = element_blank(),
      plot.background = element_rect(fill = "black"),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank())

    #plot the first letter (set index=1 to get the first element of color, letter, and coordinates)
    index<-1
    ggplot(happy2015, aes(x, y, alpha = z, color=z)) +
      geom_point(alpha=0.2) +
      labs(title="", x="", y="") +
      theme.both +
      scale_colour_gradient(low = "white", high="lightblue")+
      annotate("text", x=xcoord[index], y=ycoord[index], size=15, label=sign[index], color=colors[index])

Finally, we again go through the structure of two functions: one to draw a plot based on the "index" we give it as an argument, and one to iterate through all the letters using lapply(). Notice we put the dataframe statement inside the first function. This makes the scatterplot different every time rather than static, which is more festive (more or less like falling snow). Again, we save it in a GIF, with a slow interval in order to give time to read it.

    #set up function to create a new dataset, plot it, and annotate it by an index argument
    draw.a.plot<- function(index){
      #make up a new dataframe
      happy2015<-data.frame(x=rnorm(500, 0, 1.5), y=rnorm(500, 0, 1.5), z=rnorm(500,0,1.5))
      #plot according to the index passed
      g<-ggplot(happy2015, aes(x, y, alpha = z, color=z)) +
        geom_point(alpha=0.2) +
        labs(title="", x="", y="") +
        theme.both +
        scale_colour_gradient(low = "white", high="lightblue")+
        annotate("text", x=xcoord[index], y=ycoord[index], size=15, label=sign[index], color=colors[index])
      #print out the plot
      print(g)
    }

    #set up function to loop through the draw.a.plot() function
    loop.animate <- function() {
      lapply(1:length(sign), function(i) {
        draw.a.plot(i)
      })
    }

    #save the images into a GIF
    saveGIF(loop.animate(), interval = .5, movie.name = "happy2015.gif")

Monday, October 20, 2014

Easy Clustered Standard Errors in R

Public health data can often be hierarchical in nature; for example, individuals are grouped in hospitals which are grouped in counties. When units are not independent, regular OLS standard errors are biased. One way to correct for this is to use clustered standard errors. This post will show you how you can easily put together a function to calculate clustered SEs and get everything else you need, including confidence intervals, F-tests, and linear hypothesis testing.

Under standard OLS assumptions, with independent errors,

$$V_{OLS} = \sigma^2(X'X)^{-1}$$

We can estimate $\sigma^2$ with $s^2$:

$$s^2 = \frac{1}{N-K}\sum_{i=1}^N e_i^2$$

where $N$ is the number of observations, $K$ is the rank (number of variables in the regression), and $e_i$ are the residuals from the regression.

But if the errors are not independent because the observations are clustered within groups, then the confidence intervals obtained will not have $1-\alpha$ coverage probability. To fix this, we can apply a sandwich estimator, like this:

$$V_{Cluster} = (X'X)^{-1} \left(\sum_{j=1}^{n_c} u_j' u_j\right) (X'X)^{-1}$$

where $n_c$ is the total number of clusters and $u_j = \sum_{i \in \text{cluster } j} e_i x_i$, with $x_i$ the row vector of predictors including the constant.

Programs like Stata also use a degree of freedom adjustment (small sample size adjustment), like so:

$$\frac{M}{M-1} \cdot \frac{N-1}{N-K} \cdot V_{Cluster}$$

where $M$ is the number of clusters (the $n_c$ above), $N$ is the sample size, and $K$ is the rank.

In this example, we'll use the Crime dataset from the plm package. It includes yearly data on crime rates in counties across the United States, with some characteristics of those counties. Let's load in the libraries we need and the Crime data:

    library(plm)
    library(lmtest)
    data(Crime)

We would like to see the effect of percentage males aged 15-24 (pctymle) on crime rate, adjusting for police per capita (polpc), region, and year. However, there are multiple observations from the same county, so we will cluster by county.
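
To make the sandwich formula above concrete, here is a minimal base R sketch that builds the clustered variance-covariance matrix by hand, term by term. It uses simulated data rather than the Crime dataset, and every object name in it (cluster, shock, fit, uj, bread, meat, V.cluster) is an illustrative assumption.

```r
# Sketch with simulated data; all object names here are illustrative assumptions.
set.seed(42)
n.clusters <- 30
cluster <- rep(seq_len(n.clusters), each = 10)
shock <- rnorm(n.clusters)[cluster]           # error component shared within a cluster
x <- rnorm(length(cluster))
y <- 1 + 0.5 * x + shock + rnorm(length(cluster))
fit <- lm(y ~ x)

X <- model.matrix(fit)                        # N x K matrix of predictors, incl. constant
e <- residuals(fit)                           # the residuals e_i
M <- length(unique(cluster))
N <- nrow(X)
K <- fit$rank

# u_j: sum of e_i * x_i within each cluster (one row per cluster)
uj <- rowsum(X * e, group = cluster)

bread <- solve(crossprod(X))                  # (X'X)^{-1}
meat <- crossprod(uj)                         # sum over clusters of u_j' u_j
dfc <- (M / (M - 1)) * ((N - 1) / (N - K))    # small sample size adjustment
V.cluster <- dfc * bread %*% meat %*% bread

# compare the usual OLS SEs with the clustered SEs
cbind(OLS.SE = sqrt(diag(vcov(fit))), Cluster.SE = sqrt(diag(V.cluster)))
```

Here `X * e` multiplies each row of X by its residual, and rowsum() then adds those rows up within each cluster, which is exactly the $u_j$ in the formula.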

In Stata the command would look like this:

    reg crmrte pctymle polpc i.region year, cluster(county)

In R, we can first run our basic OLS model using lm() and save the results in an object called m1. Unfortunately, there's no 'cluster' option in the lm() function. But there are many ways to get the same result.

    #basic linear model with standard variance estimate
    Crime$region<-factor(Crime$region)
    m1<-lm(crmrte~pctymle+polpc+region+year, data=Crime)

#### Get the cluster-adjusted variance-covariance matrix

First, I'll show how to write a function to obtain clustered standard errors. If you are unsure about how user-written functions work, please see my posts about them: How to write and debug an R function, and 3 ways that functions can improve your R code.

There are many sources to help us write a function to calculate clustered SEs. Two helpful ones are Mahmood Arai's paper and DiffusePrioR's blog post. I'll base my function on the first source. It uses functions from the sandwich and the lmtest packages, so make sure to install those packages. However, instead of returning the coefficients and standard errors, I am going to modify Arai's function to return the variance-covariance matrix, so I can work with that later. The function takes the lm model object and the cluster vector as inputs.

After that, I'll do it the super easy way with the new multiwayvcov package, which has a cluster.vcov() function. That function also needs the model and the cluster as inputs.

Here are both ways:

    #write your own function to return variance covariance matrix under clustered SEs
    get_CL_vcov<-function(model, cluster){
      require(sandwich, quietly = TRUE)
      require(lmtest, quietly = TRUE)
      #calculate degree of freedom adjustment
      M <- length(unique(cluster))
      N <- length(cluster)
      K <- model$rank
      dfc <- (M/(M-1))*((N-1)/(N-K))
      #calculate the uj's
      uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))
      #use sandwich to get the var-covar matrix
      vcovCL <- dfc*sandwich(model, meat=crossprod(uj)/N)
      return(vcovCL)
    }

    #call our new function and save the var-cov matrix output in an object
    m1.vcovCL<-get_CL_vcov(m1, Crime$county)

    #equivalent way: use the cluster.vcov function to get variance-covariance matrix
    library(multiwayvcov)
    m1.vcovCL.2<-cluster.vcov(m1, Crime$county)

Now, in order to obtain the coefficients and SEs, we can use the coeftest() function in the lmtest library, which allows us to input our own var-covar matrix. Let's compare our standard OLS SEs to the clustered SEs.

    #the regular OLS standard errors
    coeftest(m1)

    #the clustered standard errors by indicating the correct var-covar matrix
    coeftest(m1, m1.vcovCL)

We can see that the SEs generally increased, due to the clustering. Now, let's obtain the F-statistic and the confidence intervals.

#### Getting the F-statistic and confidence intervals for clustered SEs

To obtain the F-statistic, we can use the waldtest() function from the lmtest library with test="F" indicated for the F-test. For the 95% CIs, we can write our own function that takes in the model and the variance-covariance matrix and produces the 95% CIs. You can modify this function to make it better and more versatile, but I'm going to keep it simple.

    #function to return confidence intervals
    get_confint<-function(model, vcovCL){
      t<-qt(.975, model$df.residual)
      ct<-coeftest(model, vcovCL)
      est<-cbind(ct[,1], ct[,1]-t*ct[,2], ct[,1]+t*ct[,2])
      colnames(est)<-c("Estimate","LowerCI","UpperCI")
      return(est)
    }

Now, we can get the F-stat and the confidence intervals:

    #do a wald test to get F-statistic
    waldtest(m1, vcov = m1.vcovCL, test = "F")

Note that now the F-statistic is calculated based on a Wald test (using the cluster-robustly estimated var-covar matrix) rather than on sums of squares and degrees of freedom. The degrees of freedom listed here are for the model, but the var-covar matrix has been corrected for the fact that there are only 90 independent observations.

If you want to save the F-statistic itself, save the waldtest function call in an object and extract:

    #save waldtest in an object
    w<-waldtest(m1, vcov = m1.vcovCL, test = "F")
    #check out what is saved in w
    names(w)
    #use $F to get the F-statistic
    w$F[2]

For confidence intervals, we can use the function we wrote:

    #obtain the confidence interval using our function
    get_confint(m1, m1.vcovCL)

As an aside, to get the R-squared value, you can extract that from the original model m1, since that won't change if the errors are clustered. Again, remember that the R-squared is calculated via sums of squares, which are technically no longer relevant because of the corrected variance-covariance matrix. But it can still be used as a measure of goodness-of-fit.

    #r-squared
    summary(m1)$r.squared

#### One function to do everything

Notice that you could wrap all 3 of these components (F-test, coefficients/SEs, and CIs) in a function that saves them all in a list, for example like this:

    <-function(model, cluster){
      require(multiwayvcov)
      require(lmtest)
      vcovCL<-cluster.vcov(model, cluster)
      coef<-coeftest(model, vcovCL)
      w<-waldtest(model, vcov = vcovCL, test = "F")
      ci<-get_confint(model, vcovCL)
      return(list(coef, w, ci))
    }
    , Crime$county)

Then you could extract each component with the [[]] operator. Check out "Returning a list of objects" in my earlier post if you're unsure.

#### Hypothesis testing when errors are clustered

Now what if we wanted to test whether the west region coefficient was different from the central region? Again, we need to incorporate the right var-cov matrix into our calculation. Fortunately the car package has a __linearHypothesis()__ function that allows for specification of a var-covar matrix. The inputs are the model, the var-cov matrix, and the coefficients you want to test. Check out the help file of the function to see the wide range of tests you can do.

    library(car)
    linearHypothesis(m1, vcov=m1.vcovCL, "regionwest = regioncentral")

#### Troubleshooting and error messages

One reason to opt for the cluster.vcov() function from the multiwayvcov package is that it can handle missing values without any problems. When computing the variance-covariance matrix using the user-written function get_CL_vcov above, an error message can often come up:

    #error messages
    get_CL_vcov(m1, Crime$couunty)

There are two common reasons for this. One is just that you spelled the name of the cluster variable incorrectly (as above). Make sure to check that. The second is that you have __missing values__ in your outcome or explanatory variables. In this case, the length of the cluster will be different from the length of the outcome or covariates and tapply() will not work.
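
The length mismatch described above can be seen directly in base R. The sketch below uses simulated data (all names are illustrative assumptions, not objects from this post) and shows that lm() silently drops incomplete rows, while the cluster vector taken from the original dataframe keeps its full length. It also shows one possible way to line the two up, using the row positions that lm() records in its na.action component.

```r
# Sketch with simulated data; object names are illustrative assumptions.
set.seed(7)
dat <- data.frame(id = rep(1:20, each = 5), x = rnorm(100))
dat$y <- 2 + dat$x + rnorm(100)
dat$y[c(3, 50)] <- NA                 # force two missing outcomes

fit <- lm(y ~ x, data = dat)          # lm() silently drops the incomplete rows

nobs(fit)                             # rows actually used by the model
length(dat$id)                        # length of the full cluster vector

# rows dropped by lm() are recorded in the model's na.action component
dropped <- as.integer(na.action(fit))
cluster.used <- if (length(dropped) > 0) dat$id[-dropped] else dat$id

# the trimmed cluster vector now matches the model
length(cluster.used) == nobs(fit)
```

The if/else guard is there because indexing with an empty negative vector (`dat$id[-integer(0)]`) would return nothing at all rather than the full vector.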

One possible solution is to remove the missing values by subsetting the cluster to include only those values where the outcome is not missing. Another option is to run na.omit() on the entire dataset to remove all missing values. Here's an example:

    #force in missing value in outcome
    Crime2<-Crime
    Crime2$crmrte[1]<-NA

    #rerun model
    m2<-lm(crmrte~pctymle+polpc+region+year, data=Crime2)

    #this produces errors
    v1<-get_CL_vcov(m2, Crime2$county)

    #but we can remove the observations in county for which crmrte is missing
    v2<-get_CL_vcov(m2, Crime2$county[!is.na(Crime2$crmrte)])

    #or can use na.omit
    Crime3<-na.omit(Crime2)
    m3<-lm(crmrte~pctymle+polpc+region+year, data=Crime3)
    v3<-get_CL_vcov(m3, Crime3$county)

However, if you're running a number of regressions with different covariates, each with a different missing pattern, it may be annoying to create multiple datasets and run na.omit() on them to deal with this. To avoid this, you can use the cluster.vcov() function, which handles missing values within its own function code, so you don't have to.

#### Using plm

Finally, you can also use the plm() and vcovHC() functions from the plm package. You still need to do your own small sample size correction though.

    #use plm function to define formula, dataset, that it's a pooling model, and the cluster variable
    p1 <- plm(crmrte~pctymle+polpc+region+year, Crime, model='pooling', index=c('county'))

    #calculate small sample size adjustment
    G <- length(unique(Crime$county))
    N <- length(Crime$county)
    dfa <- (G/(G - 1)) * (N - 1)/p1$df.residual

    #use coeftest and the vcovHC functions, specifying HC0 type and identifying cluster as 'group'
    coeftest(p1, vcov=function(x) dfa*vcovHC(x, cluster="group", type="HC0"))

Update: A reader pointed out to me that another package that can do clustering is the rms package, so definitely check that out as well.

Monday, July 7, 2014

3 ways that functions can improve your R code

My previous blog post went over the basics of the syntax and debugging of user-written functions. In this post, I'll show you examples of useful functions that you can write to make your life easier when using R. Here is the data we'll be using for this post:

    set.seed(10)
    bpdata<-data.frame(bp=rnorm(1000,140,20),
                       age=rnorm(1000,50,3),
                       sex=rbinom(1000,1,.5),
                       race=as.factor(c(rep(1,500),rep(2,500))),
                       out=rbinom(1000,1,.8))
    bpdata[c(100,200,400),2]<-NA
    bpdata[c(300),1]<-400

#### Functions to help you avoid repeating yourself

One of the main reasons to use functions is that you can re-use them over and over again, without repeating the same code again and again. Copying and pasting code that you change just a little bit every time leads to errors. With a function, if you decide to change the way you do something, you just change the one function and it will now automatically update your code everywhere you use that function.

As an example, in a previous blog post I wrote about calculating robust standard errors and exporting them. The calculation went like this:

    library(sandwich)
    cov.fit1 <- vcovHC(fit1, type = "HC")
    rob.std.err <- sqrt(diag(cov.fit1))

Well, if you are going to be using robust standard errors more than once in your analysis, it makes more sense to make a function. That way you don't need to copy and paste this code and change the object fit1 every time. To do this, I just take the exact same code, but I change fit1 to a more general modelfit that is passed in as an argument to a function I call **robust.SE()**. Also, instead of loading the library sandwich, I can just require it for that function (note that you still have to have the package installed for R to load it using require). Then you can pass in the model fit as the argument and return the robust standard errors.

Like this:

    #function to calculate robust standard errors for a model
    robust.SE<-function(modelfit){
      require(sandwich, quietly=TRUE)
      cov.fit1 <- vcovHC(modelfit, type = "HC")
      rob.std.err <- sqrt(diag(cov.fit1))
      return(rob.std.err)
    }

    #get robust SEs for model 1
    model1<-lm(out~age + sex, data=bpdata)
    robust.SE(model1)

    #get robust SEs for model2
    model2<-lm(bp~age+sex, data=bpdata)
    robust.SE(model2)

Instead of copying and pasting the code for robust standard errors twice, I just wrap it up in a function that allows me to evaluate it for any model fit I want to throw at it.

#### Functions to customize your output

Another reason to use functions is to make it easier and faster to do the things you always need to do, in the exact way you want. For example, it's important when you start with a new dataset to look at it descriptively and check whether there are strange values or missing values. There are many ways to do this. You can of course do **summary(data)**, which is a good start:

    #use the basic summary built-in function
    summary(bpdata)

But this doesn't give me all the information I want and it's not in the format I want. If I had 20 variables, which is not out of the ordinary, it would be hard to read this and it takes up too much space in a report. I could also use the describe() function in the psych package, but it doesn't deal well with factor variables.

I want to create a table with the variables down the rows and some specific stats going across the columns, namely mean, standard deviation, min, max, and the number missing. I want to make sure to treat factors as dummies instead of numeric variables. Here I create a new function that takes a dataset and produces this kind of table for all the variables within the dataset. I call the function **summarize.vars**. The only argument I pass through is a dataset.
I use the package dummies to create dummy variables of all the factors so that I have only numeric variables in my data (it ignores all character variables if I had any). Then I use the **apply()** function to do my summarizing. Check out my previous post on how apply works.

    #function to summarize the variables in the data
    summarize.vars<-function(data){
      #use dummies package to turn all factors into dummies
      require(dummies, quietly=TRUE)
      dat.d<-dummy.data.frame(data, dummy.class="factor")
      #use apply to calculate statistics for each variable
      mat<-t(apply(dat.d, 2, function(x) c(length(x),
                                           round(mean(x, na.rm=TRUE),2),
                                           round(sd(x, na.rm=TRUE),2),
                                           round(min(x, na.rm=TRUE),2),
                                           round(max(x, na.rm=TRUE),2),
                                           length(x)-length(x[!is.na(x)]))))
      #assign column names and rownames to output table
      colnames(mat)<-c("N","Mean","SD","Min","Max","Num Missing")
      rownames(mat)<-colnames(dat.d)
      return(mat)
    }

    summarize.vars(bpdata)

I can also use this to summarize only part of my data, for example by subsetting my data like so:

    summarize.vars(bpdata[bpdata$race==1,])

You can think about making this function better. One way would be to have the N for the factors count within the level, since that is more useful information. You can customize it in whatever way makes sense to get the kind of information that is most useful.

Let's do another example of a function to summarize information. This time we'll put together a function that runs a linear model, succinctly summarizes the results, and produces some diagnostic plots at the same time. This is useful if you run linear models often. Remember that you can pass anything through a function, including a formula. In this function, I pass in a formula for the model and the dataframe and I don't return any values; instead, I print directly from inside the function. You could change this so that it returned the output and you could store it in an object outside the function.

    #linear model function
    lm.model.diagnostics<-function(formula, dataset){
      #run model and print specific output
      model1<-lm(formula=formula, data=dataset)
      stats<-round(c(summary(model1)$fstatistic[c(1,3)],
                     summary(model1)$sigma,
                     summary(model1)$r.squared,
                     summary(model1)$adj.r.squared),3)
      names(stats)<-c("F","DF", "Sigma","Rsq","AdjRsq")
      l1<-list(round(summary(model1)$coefficients,3), stats)
      names(l1)<-c("Coefficients","Stats")
      print(l1)
      #run specific diagnostic tests
      par(mfrow=c(1,3))
      hist(model1$residuals, main="Histogram of residuals", xlab="")
      plot(model1, 1)
      plot(model1, 2)
    }

    #run function for model of blood pressure on age and sex
    lm.model.diagnostics(bp~age+sex, bpdata)

So we can see that there is a funny outlier in observation 300. That observation has a blood pressure of 400, which we think is an incorrect value. We can see what happens when we take out that outlier:

    #take out the outlier
    lm.model.diagnostics(bp~age+sex, bpdata[c(-300),])

Note that if your formula is long and you use it more than once, you can avoid copying and pasting it as well by just saving it in an object like this:

    form1<-bp~age+sex
    lm.model.diagnostics(form1, bpdata)

#### Functions to aid in your analysis

Another useful way to use functions is for improving your analysis. For example, R summary output of a model fit doesn't provide confidence intervals. It's useful to have a function to calculate the confidence intervals and put them in a nice table. As a bonus, you can make the function versatile so that it can provide nice output of logit or poisson models, by exponentiating the results to get odds ratios or incidence rate ratios, respectively.

Notice how in this function I set up the defaults for the parameters that I pass in. Unless I indicate otherwise, exponent=FALSE, alpha=.05, and digits=2. That way the function runs even without specifying those parameters. If I want to change them, I can do so the way I do in the second example.

    #function to get confidence intervals for glm output, can get exponentiated output for logit or poisson
    glmCI <- function(glmfit, exponent=FALSE, alpha=0.05, digits=2){
      #get SE from model fit
      se <- sqrt(diag(summary(glmfit)$cov.scaled))
      #calculate CI for linear case
      mat <- cbind(coef(glmfit),
                   coef(glmfit) - qnorm(1-alpha/2)*se,
                   coef(glmfit) + qnorm(1-alpha/2)*se)
      colnames(mat) <- c("Beta", "LowerCI", "UpperCI")
      #if exponent=TRUE, exponentiate the coefficients and CIs
      if(exponent == TRUE) {
        mat <- exp(mat)
        if(summary(glmfit)$family$link=="logit") colnames(mat)[1] <- "OR"
        if(summary(glmfit)$family$link=="log") colnames(mat)[1] <- "IRR"
      }
      #return a rounded matrix of results
      return(round(mat, digits=digits))
    }

    #1. use glm with identity link on continuous response data (default family is gaussian)
    g.glm<-glm(bp~age+sex, data=bpdata)
    glmCI(g.glm)

    #2. use glm with logit link on binary response data
    b.glm<-glm(out~age+sex+bp, family=binomial(link="logit"), data=bpdata)
    glmCI(b.glm, exponent=TRUE, digits=3)

There are tons of other possibilities for functions, but hopefully this post has convinced you that functions can improve your coding by reducing repetition, increasing customizability, and improving your analysis and reports.
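
As a quick sanity check on the interval formula used in glmCI() above (estimate plus or minus a normal quantile times the standard error), the sketch below compares a by-hand calculation with base R's confint.default(), which computes the same kind of Wald interval. The simulated data and the object names (simdata, fit, by.hand, built.in) are assumptions made for illustration and are not part of the original analysis.

```r
# Sketch with simulated data; object names are illustrative assumptions.
set.seed(10)
simdata <- data.frame(age = rnorm(500, 50, 3), sex = rbinom(500, 1, 0.5))
simdata$out <- rbinom(500, 1, plogis(-2 + 0.04 * simdata$age))
fit <- glm(out ~ age + sex, family = binomial(link = "logit"), data = simdata)

# by hand: coefficient +/- normal quantile * standard error
se <- sqrt(diag(summary(fit)$cov.scaled))
by.hand <- cbind(coef(fit) - qnorm(0.975) * se,
                 coef(fit) + qnorm(0.975) * se)

# built in: Wald intervals from the stats package
built.in <- confint.default(fit, level = 0.95)

# the two agree up to numerical tolerance
all.equal(unname(by.hand), unname(built.in))

# exponentiate to put the interval on the odds ratio scale
exp(built.in)
```

Note that calling confint() rather than confint.default() on a glm fit gives profile likelihood intervals, which will generally differ slightly from these Wald intervals.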

Sunday, June 8, 2014

How to write and debug an R function

I've been asked on a few occasions what the deal is with R user-written functions. First of all, how does the syntax work? And second of all, why would you ever want to do this?

In Stata, we don't write functions; we execute built-in commands like **browse** or **gen** or **logit**. You can write your own programs that create new commands (like ado files) but it's less common for users to do so. In R, there are built-in functions like **summary()** or **glm()** or **median()**, but you can also write your own functions. You can write a quick, one-line function or long elaborate functions. I use functions all the time to make my code cleaner and less repetitive.

In this post I'll go over the basics of how to write functions. In the next post, I'll explain what kinds of functions I have used commonly in public health research that have improved my data cleaning and analyses.

#### Basic syntax of a function

A function needs to have a name, probably at least one argument (although it doesn't have to), and a body of code that does something. At the end it usually should (although doesn't have to) return an object out of the function. The important idea behind functions is that objects that are created within the function are local to the environment of the function - they don't exist outside of the function. But you can "return" the value of the object from the function, meaning pass the value of it into the global environment. I'll go over this in more detail.

Functions need to have curly braces around the statements, like so:

    name.of.function <- function(argument1, argument2){
      statements
      return(something)
    }

The argument can be any type of object (like a scalar, a matrix, a dataframe, a vector, a logical, etc), and it's not necessary to define what it is in any way. As a very simple example, we can write a function that squares an incoming argument. The function below takes the argument x and multiplies it by itself.
It saves this value into the object called square, and then it returns the value of the object square.

#### Writing and calling a function

    <-function(x){
      square<-x*x
      return(square)
    }

I can now call the function by passing in a scalar or a vector or matrix as its argument, because all of those objects will square nicely. But it won't work if I input a character as its argument because although it will pass "hi" into the function, R can't multiply "hi".

    #square a number
    #square a vector
    ,4,2))
    #square a character (not going to happen)
    "hi")

I can also pass in an object that I already have saved. For example, here I have a matrix called matrix1, so I pass that into the ** function**. R takes this matrix1 into the function as x. That is, in the local function environment it is now called x, where it is squared, and returned.

    matrix1<-cbind(c(3,10),c(4,5))

#### Local vs global environment

Now, it's not necessarily the case that you must use **return()** at the end of your function. The reason you return an object is if you've saved the value of your statements into an object inside the function - in this case, the objects in the function are in a local environment and won't appear in your global environment. See how it works in the following two examples:

    fun1<-function(x){
      3*x-1
    }
    fun1(5)

    fun2<-function(x){
      y <- 3*x-1
    }
    fun2(5)

In the first function, I just evaluate the statement 3*x-1 without saving it anywhere inside the function. So when I run fun1(5), the result comes popping out of the function. However, when I call fun2(5), nothing is printed. That's because the object y that I saved my result into doesn't exist outside the function and I haven't used __return(y)__ to pass the value of y outside the function. When I try to print y, it doesn't exist because it was created in the local environment of the function.

    print(y)

I can return the *value* of y using **return(y)** at the end of the function fun2, but I can't return the object itself; it's stuck inside the function.
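
A small sketch may help make the difference concrete. fun2 is the function from the example above; fun3 and result are hypothetical names added here for illustration. One detail worth knowing: in R, an assignment used as the last statement of a function still hands its value back, just invisibly, so the value can be captured even though nothing is printed at the console.

```r
# fun2 as in the example above: the last statement is an assignment
fun2 <- function(x){
  y <- 3*x-1        # the value of the assignment is returned invisibly
}
fun2(5)             # nothing is printed at the console
result <- fun2(5)   # but the value can still be captured
result

# hypothetical variant with an explicit return(), which prints when called
fun3 <- function(x){
  y <- 3*x-1
  return(y)
}
fun3(5)
```

In both cases the object y itself never appears in the global environment; only its value travels out of the function.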
Getting more complex Obviously writing a whole function to square something when you could just use the ^ operator is silly. But you can do much more complicated things in functions, once you get the hang of them. Calling other functions and passing multiple arguments First, you can pass multiple arguments into a function and you can call other functions within your function. Here's an example. I'm passing in 3 arguments which I want to be a matrix, a vector, and a scalar. In the function body, I first call my previous function **** and use it to square the scalar. Then I multiply the matrix by the vector. Then I multiply those two results together and return the final object. <- function(X.matrix, y.vec, z.scalar){ #use my previous function to square the scalar and save result sq.scalar< #multiply the matrix by the vector using %*% operator mult<-X.matrix%*%y.vec #multiply the two resulting objects together to get a final object final<-mult*sq.scalar #return the result return(final) } When you have multiple arguments in a function that you call, R will just evaluate them in order of how you've written the function (the first argument will correspond to X.matrix, the second y.vec, and so on), but for clarity I would name the arguments in the function call. In this example below, I already have two saved objects, my.mat and my.vec that I pass through as the X.matrix and y.vec arguments, and then I just assign the z.scalar argument the number 9. #save a matrix and a vector object my.mat<-cbind(c(1,3,4),c(5,4,3)) my.vec<-c(4,3) #pass my.mat and my.vec into the function, y.vec=my.vec, z.scalar=9) #this is the same as, my.vec, 9) Returning a list of objects Also, if you need to return multiple objects from a function, you can use **list()** to list them together. 
An example of this is my blog post on sample size functions. For example:<-function(sq.matrix, vector){ #transpose matrix and square the vector step1<-t(sq.matrix) step2<-vector*vector #save both results in a list and return final<-list(step1, step2) return(final) } #call the function and save result in object called outcome outcome<,2),c(3,4)), vector=c(2,3)) #print the outcome list print(outcome) Now to separate those objects for use in your further code, you can extract them using the [[]] operator: ###extract first in list outcome[[1]] ##extract second in list outcome[[2]] Tricks for troubleshooting and debugging When you execute multiple statements in a function, sometimes things go wrong. What's nice about functions is that R evaluates every statement until it reaches an error. So in the last function, the dimensions of the objects really matter. You can't multiply matrices of incompatible dimensions. Like this:, y.vec=c(2,3,6,4,1), z.scalar=9) Using the debug() function When you have an error, one thing you can do is use R's built-in debugger **debug()** to find at what point the error occurs. You indicate which function you want to debug, then run your statement calling the function, and R shows you at what point the function stops because of errors: debug(, y.vec=c(2,3,6,4,1), z.scalar=9) We see that the first line calling the **** function was fine, but then an error occurred in the line defining mult. This debugging is useful especially if you had many more statements in your function that multiplied matrices and you weren't sure which one was causing the issues. So now we know the problem is that X.matrix and y.vec won't multiply. But we still need to know why they won't multiply. More on debugging can be found here. Printing out what's happening (sanity checks) At this point, a good way to troubleshoot this is to print out the dimensions or lengths of the objects or even the objects themselves that are going into the statement causing errors.
The great part about functions is that they evaluate all the way until there's an error. So you can see what is happening inside your function before the error. If the object is too long, you can **print(head(object))**. This helps to see if you're doing what you think you're doing. Note that you have to use the function **print()** to actually print out anything from within a function. <- function(X.matrix, y.vec, z.scalar){ print("xmatrix") print(X.matrix) print("yvec") print(y.vec) print("Dimensions") print(dim(X.matrix)) print(length(y.vec)) #use my previous function to square the scalar and save result sq.scalar< print(paste("sq.scalar=", sq.scalar)) #multiply the matrix by the vector using %*% operator mult<-X.matrix%*%y.vec #multiply the two resulting objects together to get a final object final<-mult*sq.scalar #return the result return(final) }, y.vec=c(2,3,6,4,1), z.scalar=9) Now we can see the actual dimensions of the objects and fix them accordingly. This example is really simple, but you can imagine that if you've written a long function that uses many arguments, you could easily lose track of them and not be sure where the issue in your function was. You can also throw in these statements along the way as sanity checks to make sure that things are proceeding as you think they should, even if there isn't any error. Using the stop() and stopifnot() functions to write your own error messages One other trick you can use is writing your own error messages using the **stop()** and **stopifnot()** functions. In this example, if I know I need dimensions to be the right size, I can check them and print out a message that says they are incorrect. That way I know what the issue is immediately. 
Here's an example:<-function(matrix, vector){ if(dim(matrix)[2]!=length(vector)){ stop("Can't multiply matrix%*%vector because the dimensions are wrong") } product<-matrix%*%vector return(product) } #function works when dimensions are right, c(6,5)) #function call triggered error, c(6,5,7)) You can write these kinds of error messages for yourself as checks so you know exactly what triggered the error. As another example, if you are dividing by an object, you could add a check for whether its value is 0.

Good function writing practices

Based on my experience, there are a few good practices that I would recommend keeping in mind when writing functions.

1. Keep your functions short. Remember you can use them to call other functions!
   * If things start to get very long, you can probably split up your function into more manageable chunks that call other functions. This makes your code cleaner and easier to test.
   * It also makes your code easy to update. You only have to change one function and every other function that uses that function will also be automatically updated.
2. Put in comments describing the inputs to the function, what the function does, and what the output is.
3. Check for errors along the way.
   * Try out your function with simple examples to make sure it's working properly.
   * Use debugging and error messages, as well as sanity checks, as you build your function.

The next post will go over examples of useful functions that you can use in your day-to-day R coding.
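Going back to the error-message section for a moment: stopifnot() was mentioned there but never shown. Here is a minimal sketch of how it could be used for the same dimension check. The function name safe_multiply and the object msg are made up for this illustration; my.mat is the same matrix used earlier in the post.

```r
# safe_multiply is a hypothetical name, not a function from the post
safe_multiply <- function(mat, vec) {
  # every condition must be TRUE, otherwise R stops with an error
  # that names the first condition that failed
  stopifnot(is.matrix(mat), is.numeric(vec), ncol(mat) == length(vec))
  mat %*% vec
}

my.mat <- cbind(c(1, 3, 4), c(5, 4, 3))

# dimensions match, so this returns a 3x1 matrix
safe_multiply(my.mat, c(6, 5))

# dimensions don't match; tryCatch() captures the error message
# so the script keeps running and we can look at it
msg <- tryCatch(safe_multiply(my.mat, c(6, 5, 7)),
                error = function(e) conditionMessage(e))
print(msg)
```

Compared with if() plus stop(), stopifnot() is shorter to write, but you don't get to word the message yourself, so stop() is still the better choice when you want the error to explain what went wrong in plain language.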

Monday, February 17, 2014

ggplot2: Cheatsheet for Visualizing Distributions

In the third and last of the ggplot series, this post will go over interesting ways to visualize the distribution of your data. I will make up some data, and make sure to set the seed. library(ggplot2) library(gridExtra) set.seed(10005) xvar<-c(rnorm(1500, mean=-1), rnorm(1500, mean=1.5)) yvar<-c(rnorm(1500,mean=1), rnorm(1500, mean=1.5)) zvar<-as.factor(c(rep(1,1500),rep(2,1500))) xy<-data.frame(xvar,yvar,zvar) Histograms I've already done a [post on histograms]( using base R, so I won't spend too much time on them. Here are the basics of doing them in ggplot. [More on all options for histograms here.]( The R cookbook has a nice page about it too: Also, I found [this really great aggregation]( of all of the possible geom layers and options you can add to a plot. In general the site is a great reference for all things ggplot. #counts on y-axis g1<-ggplot(xy, aes(xvar)) + geom_histogram() #horribly ugly default g2<-ggplot(xy, aes(xvar)) + geom_histogram(binwidth=1) #change binwidth g3<-ggplot(xy, aes(xvar)) + geom_histogram(fill=NA, color="black") + theme_bw() #nicer looking #density on y-axis g4<-ggplot(xy, aes(x=xvar)) + geom_histogram(aes(y = ..density..), color="black", fill=NA) + theme_bw() grid.arrange(g1, g2, g3, g4, nrow=1) Notice the warnings about the default binwidth that always is reported unless you specify it yourself. I will remove the warnings from all plots that follow to conserve space. Density plots We can do basic density plots as well. Note that the default for the smoothing kernel is gaussian, and you can change it to a number of different options, including __kernel="epanechnikov"__ and __kernel="rectangular"__ or whatever you want. 
You can [find all of those options here]( #basic density p1<-ggplot(xy, aes(xvar)) + geom_density() #histogram with density line overlaid p2<-ggplot(xy, aes(x=xvar)) + geom_histogram(aes(y = ..density..), color="black", fill=NA) + geom_density(color="blue") #split and color by third variable, alpha fades the color a bit p3<-ggplot(xy, aes(xvar, fill = zvar)) + geom_density(alpha = 0.2) grid.arrange(p1, p2, p3, nrow=1) Boxplots and more We can also look at other ways to visualize our distributions. Boxplots are probably the most useful in order to describe the statistics of a distribution, but sometimes other visualizations are nice. I show a jitter plot and a violin plot. [More on boxplots here.]( Note that I removed the legend from each one because it is redundant. #boxplot b1<-ggplot(xy, aes(zvar, xvar)) + geom_boxplot(aes(fill = zvar)) + theme(legend.position = "none") #jitter plot b2<-ggplot(xy, aes(zvar, xvar)) + geom_jitter(alpha=I(1/4), aes(color=zvar)) + theme(legend.position = "none") #violin plot b3<-ggplot(xy, aes(x = xvar)) + stat_density(aes(ymax = ..density.., ymin = -..density.., fill = zvar, color = zvar), geom = "ribbon", position = "identity") + facet_grid(. ~ zvar) + coord_flip() + theme(legend.position = "none") grid.arrange(b1, b2, b3, nrow=1) Putting multiple plots together Finally, it's nice to put different plots together to get a real sense of the data. We can make a scatterplot of the data, and add marginal density plots to each side. Most of the code below I adapted from this [StackOverflow page]( One way to do this is to add distribution information to a scatterplot as a "rug plot". It adds a little tick mark for every point in your data projected onto the axis. #rug plot ggplot(xy,aes(xvar,yvar)) + geom_point() + geom_rug(col="darkred",alpha=.1) Another way to do this is to add histograms or density plots or boxplots to the sides of a scatterplot. 
I followed the stackoverflow page, but let me know if you have suggestions on a better way to do this, especially without the use of the empty plot as a place-holder. I do the density plots by the zvar variable to highlight the differences in the two groups. #placeholder plot - prints nothing at all empty <- ggplot()+geom_point(aes(1,1), colour="white") + theme( plot.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks = element_blank() ) #scatterplot of x and y variables scatter <- ggplot(xy,aes(xvar, yvar)) + geom_point(aes(color=zvar)) + scale_color_manual(values = c("orange", "purple")) + theme(legend.position=c(1,1),legend.justification=c(1,1)) #marginal density of x - plot on top plot_top <- ggplot(xy, aes(xvar, fill=zvar)) + geom_density(alpha=.5) + scale_fill_manual(values = c("orange", "purple")) + theme(legend.position = "none") #marginal density of y - plot on the right plot_right <- ggplot(xy, aes(yvar, fill=zvar)) + geom_density(alpha=.5) + coord_flip() + scale_fill_manual(values = c("orange", "purple")) + theme(legend.position = "none") #arrange the plots together, with appropriate height and width for each row and column grid.arrange(plot_top, empty, scatter, plot_right, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4)) It's really nice that grid.arrange() clips the plots together so that the scales are automatically the same. You could get rid of the redundant axis labels by adding in __theme(axis.title.x = element_blank())__ in the density plot code. I think it comes out looking very nice, with not a ton of effort. You could also add linear regression lines and confidence intervals to the scatterplot. Check out my first [ggplot2 cheatsheet for scatterplots]( if you need a refresher.
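Coming back to the kernel options mentioned in the density plots section, here is a rough sketch of how the choices compare, using only base R. It relies on stats::density(), which accepts the same kernel names as geom_density(kernel=...). The object names kernels and fits are just for this illustration, and only xvar is re-created from the made-up data above.

```r
# re-create the x variable from the made-up data
set.seed(10005)
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))

# estimate the density once per kernel
kernels <- c("gaussian", "epanechnikov", "rectangular")
fits <- lapply(kernels, function(k) density(xvar, kernel = k))
names(fits) <- kernels

# overlay the three estimates with base graphics
plot(fits[["gaussian"]], main = "Kernel comparison", lwd = 2)
lines(fits[["epanechnikov"]], col = "blue", lty = 2)
lines(fits[["rectangular"]], col = "red", lty = 3)
legend("topright", legend = kernels,
       col = c("black", "blue", "red"), lty = 1:3)
```

With this much data the curves should sit close to each other, which is a hint that the bandwidth usually matters more than the kernel.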

Thursday, January 9, 2014

ggplot2: Cheatsheet for Barplots

In the second of the series, this post will go over barplots in ggplot2. Our data is from mtcars as before. library(ggplot2) library(gridExtra) mtc<-mtcars #preview data head(mtc) To introduce the barplot, I show the basic default bargraph that you would get if you indicate an x-variable and use the default geom_bar layer, which is geom_bar(stat="bin"). You could just write geom_bar() and it would also work. Remember that in ggplot we add layers to make plots, so first we specify the data we want to use and then we specify that we want to plot it as a bar graph (instead of points or lines). The basic plot gives a count of the number in each group of the x-variable (gears). ggplot(mtc, aes(x = factor(gear))) + geom_bar(stat="bin") Aggregate data for barplot Instead of this, we would like to graphically show the mean weight of the cars by the number of gears. There are a number of ways to make this graph. The first way is that we summarize the data beforehand, and use the summarized data in the ggplot statement. I show two ways to summarize here, with two different results of how the data looks when summarized, using aggregate and tapply. Using the tapply() inside of a data.frame() statement, we can put together a new dataframe of the mean weight by gear. #using aggregate ag.mtc<-aggregate(mtc$wt, by=list(mtc$gear), FUN=mean) ag.mtc #using tapply summary.mtc <- data.frame( gear=levels(as.factor(mtc$gear)), meanwt=tapply(mtc$wt, mtc$gear, mean)) summary.mtc Now we can use the summarized dataframe in a ggplot statement and use the geom_bar layer to plot it. In the first argument we indicate that the dataframe is summary.mtc, next we indicate in the aes() statement that the x-axis is gear and the y-axis is meanwt, and finally we add the geom_bar() layer. We use the geom_bar(stat="identity") to indicate that we want the y-values to be exactly the values in the dataset. 
Remember, by default the stat is set to stat="bin" which is a count of the x-axis variable, so it's important to change it when we have summarized our data. ggplot(summary.mtc, aes(x = factor(gear), y = meanwt)) + geom_bar(stat = "identity") Another option for quick graphing is to use the built-in __stat_summary()__ layer. Instead of summarizing data, we use the original dataset and indicate that the x-axis is gear and the y-axis is just weight. However, we use __stat_summary()__ to calculate the mean of the y for each x with the following command: ggplot(mtc,aes(x=factor(gear), y=wt)) + stat_summary(fun.y=mean, geom="bar") There are reasons why we would want to use the first or second method. For the first, summarizing our data the way we want it gives us validity that we are sure that we are doing what we want to be doing and gives us more flexibility in case we want to use that summarized data in a later portion of our analysis (like in a table). Using the stat_summary() layer is faster and less code to write. For now, we continue with the second method, but later on we'll come back to the summarizing method. Horizontal bars, colors, width of bars We can make these plots look more presentable with a variety of options. First, we rotate the bars so they are horizontal. Second, we change the colors of the bars. Finally, we change the width of the bars. #1. horizontal bars p1<-ggplot(mtc,aes(x=factor(gear),y=wt)) + stat_summary(fun.y=mean,geom="bar") + coord_flip() #2. change colors of bars p2<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(gear))) + stat_summary(fun.y=mean,geom="bar") + scale_fill_manual(values=c("purple", "blue", "darkgreen")) #3. change width of bars p3<-ggplot(mtc,aes(x=factor(gear),y=wt)) + stat_summary(fun.y=mean,geom="bar", aes(width=0.5)) grid.arrange(p1, p2, p3, nrow=1) For the colors, I color the bars by the gear variable so it's a different color for each bar, and then indicate manually the colors I want. 
You could color them all the same way using fill="blue" for example, or you can keep the default colors when you fill by gear by leaving off scale_fill_manual altogether. You can also use scale_fill_brewer() to fill the bars with a scale of one color (default is blue). This R cookbook site is particularly useful for understanding how to get the exact colors you want: Note that if you are summarizing the data yourself, you change the width this way (graphs not shown since they look the same): ggplot(summary.mtc, aes(x = factor(gear), y = meanwt)) + geom_bar(stat = "identity", width=0.2) Split and color by another variable Now, let's make the graph more complicated by adding a third variable. We can do this in three ways: bars next to each other, bars stacked, or using 'faceting' which is making multiple graphs at once. We would like to know the mean weight by both gear and engine type (vs). Stacking is a particularly bad idea in this example, but I'll show it for completeness. #1. next to each other p1<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs)), color=factor(vs)) + stat_summary(fun.y=mean,position=position_dodge(),geom="bar") #2. stacked p2<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs)), color=factor(vs)) + stat_summary(fun.y=mean,position="stack",geom="bar") #3. with facets p3<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs)), color=factor(vs)) + stat_summary(fun.y=mean, geom="bar") + facet_wrap(~vs) grid.arrange(p1, p2, p3, nrow=1) You can also indicate the width of the spread between the bars in the first plot using position=position_dodge(width=.5) and play around with the width number. You can change the order of the stacking by re-ordering the levels of the fill variable. 
Here is a prior blog post I had about [how to reorder factors]( mtc$vs2<-factor(mtc$vs, levels = c(1,0)) ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs2)), color=factor(vs2)) + stat_summary(fun.y=mean,position="stack",geom="bar") Note that if you are using summarized data, just indicate the position in the geom_bar() statement. Faceting is a really nice feature in ggplot2 and deserves more space on this blog, but for now more information on how faceting works can be found here: Add text to the bars, label axes, and label legend Next, I would like to add the value in text to the top of each bar. This is a case in which you definitely want to summarize the data first - it is much easier and cleaner that way. I use the aggregate() function to summarize the data by both gear and type of engine. ag.mtc<-aggregate(mtc$wt, by=list(mtc$gear,mtc$vs), FUN=mean) colnames(ag.mtc)<-c("gear","vs","meanwt") ag.mtc Now, I use the geom_bar() layer as in the first example, and the geom_text() layer to add the text. In order to move the text to the top of each bar, I use the position_dodge and vjust options to move the text around. The first plot shows the basic output, but we see that the first number is cutoff by the top of the y-axis and we need to round the text. We can fix it by adjusting the range of the y-axis exactly how we did in a scatterplot, by adding a scale_y_continuous layer to the plot. I also change the x-axis label using scale_x_discrete, change the text to be black so it's readable, and label the legend. Notice here, it is the scale_fill_discrete layer. Go back to the [cheatsheet for scatterplots]( if you want to go over how to customize axes and legends. #1. basic g1<-ggplot(ag.mtc, aes(x = factor(gear), y = meanwt, fill=factor(vs),color=factor(vs))) + geom_bar(stat = "identity", position=position_dodge()) + geom_text(aes(y=meanwt, ymax=meanwt, label=meanwt),position= position_dodge(width=0.9), vjust=-.5) #2. 
fixing the yaxis problem, changing the color of text, legend labels, and rounding to 2 decimals g2<-ggplot(ag.mtc, aes(x = factor(gear), y = meanwt, fill=factor(vs))) + geom_bar(stat = "identity", position=position_dodge()) + geom_text(aes(y=meanwt, ymax=meanwt, label=round(meanwt,2)), position= position_dodge(width=0.9), vjust=-.5, color="black") + scale_y_continuous("Mean Weight",limits=c(0,4.5),breaks=seq(0, 4.5, .5)) + scale_x_discrete("Number of Gears") + scale_fill_discrete(name ="Engine", labels=c("V-engine", "Straight engine")) grid.arrange(g1, g2, nrow=1) Add error bars or best fit line Again there are two ways to do this, but I prefer summarizing the data first and then adding in error bars. I use tapply to get the mean and SD of the weight by gear, then I add a geom_bar layer and a geom_errorbar layer, where I indicate the range of the error bar using ymin and ymax in the aes() statement. summary.mtc2 <- data.frame( gear=levels(as.factor(mtc$gear)), meanwt=tapply(mtc$wt, mtc$gear, mean), sd=tapply(mtc$wt, mtc$gear, sd)) summary.mtc2 ggplot(summary.mtc2, aes(x = factor(gear), y = meanwt)) + geom_bar(stat = "identity", position="dodge", fill="lightblue") + geom_errorbar(aes(ymin=meanwt-sd, ymax=meanwt+sd), width=.3, color="darkblue") And if you were really cool and wanted to add a linear fit to the barplot, you can do it in two ways. You can evaluate the linear model yourself, and then use geom_abline() with an intercept and slope indicated. Or you can take advantage of the stat_summary() layer to summarize the data and the geom_smooth() layer to add a linear model instantly. 
#summarize data summary.mtc3 <- data.frame( hp=levels(as.factor(mtc$hp)), meanmpg=tapply(mtc$mpg, mtc$hp, mean)) #run a model l<-summary(lm(meanmpg~as.numeric(hp), data=summary.mtc3)) #manually entering the intercept and slope f1<-ggplot(summary.mtc3, aes(x = factor(hp), y = meanmpg)) + geom_bar(stat = "identity", fill="darkblue")+ geom_abline(aes(intercept=l$coef[1,1], slope=l$coef[2,1]), color="red", size=1.5) #using stat_smooth to fit the line for you f2<-ggplot(summary.mtc3, aes(x = factor(hp), y = meanmpg)) + geom_bar(stat = "identity", fill="darkblue")+ stat_smooth(aes(group=1),method="lm", se=FALSE, color="orange", size=1.5) grid.arrange(f1, f2, nrow=1) And as before, check out [The R cookbook]( and the [ggplot2 documentation]( for more help on getting the bargraph of your dreams.
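One more option for the summarize-first approach used throughout this post: aggregate() also has a formula interface, which keeps the column names so the colnames() step isn't needed. This is a sketch in base R only; the object names ag.formula and ag.list are made up here, and the second half simply repeats the by=list() version from the post for comparison.

```r
mtc <- mtcars

# formula interface: mean of wt for each combination of gear and vs
ag.formula <- aggregate(wt ~ gear + vs, data = mtc, FUN = mean)
# rename the summarized column to match the name used in the post
names(ag.formula)[names(ag.formula) == "wt"] <- "meanwt"
print(ag.formula)

# the by=list() version from the post, for comparison
ag.list <- aggregate(mtc$wt, by = list(mtc$gear, mtc$vs), FUN = mean)
colnames(ag.list) <- c("gear", "vs", "meanwt")

# both versions should hold the same means
all.equal(sort(ag.formula$meanwt), sort(ag.list$meanwt))
```

Either dataframe can go straight into ggplot() with geom_bar(stat = "identity") as shown above.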

Wednesday, November 13, 2013

ggplot2: Cheatsheet for Scatterplots

The graphics package ggplot2 is powerful, aesthetically pleasing, and (after a short learning curve to understand the syntax) easy to use. I have made some pretty cool plots with it, but on the whole I find myself making a lot of the same ones, since doing something over and over again is generally how research goes. Since I constantly forget the options that I need to customize my plots, this next series of posts will serve as cheatsheets for scatterplots, barplots, and density plots. We start with scatterplots.

### Quick Intro to ggplot2

The way ggplot2 works is by layering components of your plot on top of each other. You start with the basics of the data you want your plot to include (x and y variables), and then layer on top the kind of plotting colors/symbols you want, the look of the x- and y-axes, the background color, etc. You can also easily add regression lines and summary statistics. For great reference guides, use the ggplot2 documentation or the R Graphs Cookbook.

In this post, we focus only on scatterplots with a continuous x and continuous y. We are going to use the mtcars data that is available through R.

    library(ggplot2)
    library(gridExtra)
    mtc<-mtcars

Here's the basic syntax of a scatterplot. We give it a dataframe, mtc, and then in the **aes()** statement, we give it an x-variable and a y-variable to plot. I save it as a ggplot object called p1, because we are going to use this as the base and then layer everything else on top:

    #Basic scatterplot
    p1 <- ggplot(mtc, aes(x = hp, y = mpg))

Now for the plot to print, we need to specify the next layer, which is how the symbols should look - do we want points or lines, what color, how big. Let's start with points:

    #Print plot with default points
    p1+geom_point()

That's the bare bones of it. Now we have fun with adding layers. For each of the examples, I'm going to use the *grid.arrange()* function in the **gridExtra** package to create multiple graphs in one panel to save space.
Change color of points We start with options for colors just by adding how we want to color our points in the geom_point() layer: p2 <- p1 + geom_point(color="red") #set one color for all points p3 <- p1 + geom_point(aes(color = wt)) #set color scale by a continuous variable p4 <- p1 + geom_point(aes(color=factor(am))) #set color scale by a factor variable grid.arrange(p2, p3, p4, nrow=1) We can also change the default colors that are given by ggplot2 like this: #Change default colors in color scale p1 + geom_point(aes(color=factor(am))) + scale_color_manual(values = c("orange", "purple")) Change shape or size of points We're sticking with the basic p1 plot, but now changing the shape and size of the points: p2 <- p1 + geom_point(size = 5) #increase all points to size 5 p3 <- p1 + geom_point(aes(size = wt)) #set point size by continuous variable p4 <- p1 + geom_point(aes(shape = factor(am))) #set point shape by factor variable grid.arrange(p2, p3, p4, nrow=1) Again, if we want to change the default shapes we can: p1 + geom_point(aes(shape = factor(am))) + scale_shape_manual(values=c(0,2)) * More options for [color and shape manual changes are here]( * All shape and line types can be found here: Add lines to scatterplot p2 <- p1 + geom_point(color="blue") + geom_line() #connect points with line p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE) #add regression line p4 <- p1 + geom_point() + geom_vline(xintercept = 100, color="red") #add vertical line grid.arrange(p2, p3, p4, nrow=1) You can also take out the points, and just create a line plot, and change size and color as before: ggplot(mtc, aes(x = wt, y = qsec)) + geom_line(size=2, aes(color=factor(vs))) * More help on scatterplots can be found here: Change axis labels There are a few ways to do this. If you only want to quickly add labels you can use the *labs()* layer. If you want to change the font size and style of the label, then you need to use the *theme()* layer. 
More on this at the end of this post. If you want to change around the limits of the axis, and exactly where the breaks are, you use the *scale_x_continuous* (and *scale_y_continuous* for the y-axis). p2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point() p3 <- p2 + labs(x="Horsepower", y = "Miles per Gallon") #label all axes at once p4 <- p2 + theme(axis.title.x = element_text(face="bold", size=20)) + labs(x="Horsepower") #label and change font size p5 <- p2 + scale_x_continuous("Horsepower", limits=c(0,400), breaks=seq(0, 400, 50)) #adjust axis limits and breaks grid.arrange(p3, p4, p5, nrow=1) * More axis options can be found here: Change legend options We start off by creating a new ggplot base object, g1, which colors the points by a factor variable. Then we show three basic options to modify the legend. g1<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(aes(color=factor(vs))) g2 <- g1 + theme(legend.position=c(1,1),legend.justification=c(1,1)) #move legend inside g3 <- g1 + theme(legend.position = "bottom") #move legend bottom g4 <- g1 + scale_color_discrete(name ="Engine", labels=c("V-engine", "Straight engine")) #change labels grid.arrange(g2, g3, g4, nrow=1) If we had changed the shape of the points, we would use *scale_shape_discrete()* with the same options. We can also remove the entire legend altogether by using **theme(legend.position="none")** Next we customize a legend when the scale is continuous: g5<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size=2, aes(color = wt)) g5 + scale_color_continuous(name="Weight", #name of legend breaks = with(mtc, c(min(wt), mean(wt), max(wt))), #choose breaks of variable labels = c("Light", "Medium", "Heavy"), #label low = "pink", #color of lowest value high = "red") #color of highest value * More legend options can be found here: Change background color and style The look of the plot in terms of the background colors and style is the **theme()**. 
I personally don't like the look of the default gray so here are some quick ways to change it. I often the theme_bw() layer, which gets rid of the gray. * All of the theme options [can be found here]( g2<- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point() #Completely clear all lines except axis lines and make background white t1<-theme( plot.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank(), axis.line = element_line(size=.4) ) #Use theme to change axis label style t2<-theme( axis.title.x = element_text(face="bold", color="black", size=10), axis.title.y = element_text(face="bold", color="black", size=10), plot.title = element_text(face="bold", color = "black", size=12) ) g3 <- g2 + t1 g4 <- g2 + theme_bw() g5 <- g2 + theme_bw() + t2 + labs(x="Horsepower", y = "Miles per Gallon", title= "MPG vs Horsepower") grid.arrange(g2, g3, g4, g5, nrow=1) Finally, here's a nice graph using a combination of options: g2<- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size=2, aes(color=factor(vs), shape=factor(vs))) + geom_smooth(aes(color=factor(vs)),method = "lm", se = TRUE) + scale_color_manual(name ="Engine", labels=c("V-engine", "Straight engine"), values=c("red","blue")) + scale_shape_manual(name ="Engine", labels=c("V-engine", "Straight engine"), values=c(0,2)) + theme_bw() + theme( axis.title.x = element_text(face="bold", color="black", size=12), axis.title.y = element_text(face="bold", color="black", size=12), plot.title = element_text(face="bold", color = "black", size=12), legend.position=c(1,1), legend.justification=c(1,1)) + labs(x="Horsepower", y = "Miles per Gallon", title= "Linear Regression (95% CI) of MPG vs Horsepower by Engine type") g2 Reader request: Display Regression Line Equation on Scatterplot I received a request asking how to overlay the regression equation itself on a plot, so I've decided to update this post with that information. 
There are two ways to put text on a ggplot: annotate or geom_text(). I was finding that the geom_text() layer did not look very nice on my screen so I checked up on it and it seems others have this issue as well. I'll show you how the two behave, at least in my version of everything I use on my mac. We'll go back to the example where I add a regression line to the plot using geom_smooth(). To add text, you need to run the regression outside of ggplot, extract the coefficients, and then paste them together into some text that you can layer onto the plot. We're plotting MPG against horsepower so we create an object m that stores the linear model, and then extract the coefficients using the coef() function. We envelope the coef() function with signif() in order to round the coefficients to two significant digits. I then paste the regression equation text together, using sep="" in order to eliminate spaces. m <- lm(mtc$mpg ~ mtc$hp) a <- signif(coef(m)[1], digits = 2) b <- signif(coef(m)[2], digits = 2) textlab <- paste("y = ",b,"x + ",a, sep="") print(textlab) Next, I take the original p1 ggplot object, add points and a linear model to it, and then add a layer of text. I will show the two ways here, first using geom_smooth and then using annotate. With both methods, you must specify the x- and y-coordinates for where the text should be centered. In the geom_text code, notice that that label=textlab is included in the aes statement, while this is not the case for annotate. If there were mathematical or formatting symbols in the text, I would indicate parse=TRUE instead of FALSE, as we will see in the next example. 
    ## basic ggplot with points and linear model
    p3 <- p1 + geom_point(color = "red") + geom_smooth(method = "lm", se = TRUE)

    ## add regression text using geom_text
    r1 <- p3 + geom_text(aes(x = 245, y = 30, label = textlab),
                         color = "black", size = 5, parse = FALSE)

    ## add regression text using annotate
    r2 <- p3 + annotate("text", x = 245, y = 30, label = textlab,
                        color = "black", size = 5, parse = FALSE)

    grid.arrange(r1, r2, nrow = 1)

In a fancier way that I got from [this StackOverflow page](, you can use a function to piece together your text (which would be useful if you were doing this a lot). It also shows how you can put in mathematical symbols and formatting changes by using substitute(), like making your variables italic and adding in a dot for the multiplication symbol.

The function lm_eqn() takes the arguments x, y, and a dataframe and evaluates the same linear model as before. Then it uses the substitute() function to piece together the regression equation as an unevaluated expression, which is an R object of class "call". Finally, the function returns the expression converted to a character string, and is used exactly the same way in the two ggplot statements, EXCEPT that since we now have these formatting changes, we must use parse=TRUE in order to properly display the expressions.
    ## function to create equation expression
    lm_eqn <- function(x, y, df) {
      m <- lm(y ~ x, df)
      eq <- substitute(italic(y) == b %.% italic(x) + a,
                       list(a = format(coef(m)[1], digits = 2),
                            b = format(coef(m)[2], digits = 2)))
      # return the expression as a string so it can be used with parse = TRUE
      as.character(as.expression(eq))
    }

    ## add regression equation using geom_text
    r3 <- p3 + geom_text(aes(x = 245, y = 30, label = lm_eqn(mtc$hp, mtc$mpg, mtc)),
                         color = "black", size = 5, parse = TRUE)

    ## add regression equation using annotate
    r4 <- p3 + annotate("text", x = 245, y = 30, label = lm_eqn(mtc$hp, mtc$mpg, mtc),
                        color = "black", size = 5, parse = TRUE)

    grid.arrange(r3, r4, nrow = 1)

Of course, you can change the font and do more formatting on the text itself - [find that information here.](

Lastly, I will go over functions in a post that I plan on doing very soon, so be on the lookout for that if the function used here is confusing or you'd like to know more.
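One small thing to watch for with the plain-text label built from lm(), coef(), signif() and paste(): if the intercept is negative, the pasted text reads "x + -3". Here is a minimal sketch of one way to handle that, using only base R. The helper name eqn_label() is made up for illustration and is not from any package, and I'm assuming that mtc is simply a copy of the built-in mtcars data.

```r
# Hypothetical helper (not part of ggplot2 or the code above): build
# "y = bx + a" text from a fitted simple linear model, rounding with signif().
eqn_label <- function(model, digits = 2) {
  a <- signif(unname(coef(model)[1]), digits)  # intercept
  b <- signif(unname(coef(model)[2]), digits)  # slope
  # Pick the operator from the intercept's sign so we never print "+ -3"
  op <- if (a < 0) " - " else " + "
  paste("y = ", b, "x", op, abs(a), sep = "")
}

# Assumption: mtc is a copy of the built-in mtcars data
mtc <- mtcars
m <- lm(mpg ~ hp, data = mtc)
textlab <- eqn_label(m)
print(textlab)

# A made-up exact line with a negative intercept, to show the sign handling
toy <- data.frame(x = 1:10)
toy$y <- 2 * toy$x - 3
print(eqn_label(lm(y ~ x, data = toy)))
```

The resulting string can be passed as label = textlab to annotate("text", ...) with parse=FALSE, the same way as the pasted text earlier.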