## Monday, July 7, 2014

### 3 ways that functions can improve your R code

My [previous blog post](http://rforpublichealth.blogspot.com/2014/06/how-to-write-and-debug-r-function.html) went over the basics of the syntax and debugging of user-written functions. In this post, I'll show you examples of useful functions that you can write to make your life easier when using R. Here is the data we'll be using for this post: set.seed(10) bpdata<-data.frame(bp=rnorm(1000,140,20), age=rnorm(1000,50,3), sex=rbinom(1000,1,.5), race=as.factor(c(rep(1,500),rep(2,500))), out=rbinom(1000,1,.8)) bpdata[c(100,200,400),2]<-NA bpdata[c(300),1]<-400 Functions to help you avoid repeating yourself One of the main reasons to use functions is that you can re-use them over and over again, without repeating the same code again and again. Copying and pasting code that you change just a little bit every time leads to errors. With a function, if you decide to change the way you do something, you just change the one function and it will now automatically update your code everywhere you use that function. As an example, in a previous blog post I wrote about [calculating robust standard errors and exporting](http://rforpublichealth.blogspot.com/2013/08/exporting-results-of-linear-regression_24.html) them. The calculation went like this: library(sandwich) cov.fit1 <- vcovHC(fit1, type = "HC") rob.std.err <- sqrt(diag(cov.fit1)) Well, if you are going to be using robust standard errors more than once in your analysis, it makes more sense to make a function. That way you don't need to copy and paste this code and change the object fit1 everytime. To do this, I just take the exact same code, but I change the fit1 to a more general modelfit that is passed in as an argument to a function I call **robust.SE()**. Also, instead of loading the library sandwich, I can just require it for that function (note that you still have to have the package installed for R to load it using require). Then you can pass in the model fit as the argument and return the robust standard errors. Like this: #function to calculate robust standard errors for a model robust.SE<-function(modelfit){ require(sandwich, quietly=TRUE) cov.fit1 <- vcovHC(modelfit, type = "HC") rob.std.err <- sqrt(diag(cov.fit1)) return(rob.std.err)} #get robust SEs for model 1 model1<-lm(out~age + sex, data=bpdata) robust.SE(model1) #get robust SEs for model2 model2<-lm(bp~age+sex, data=bpdata) robust.SE(model2) Instead of copying and pasting the code for robust standard errors twice, I just wrap it up in a function that allows me to evaluate it for any model fit I want to throw at it. Functions to customize your output Another reason to use functions is to make it easier and faster to do the things you always need to do, in the exact way you want. For example, it's important when you start with a new dataset to look at it descriptively and look at whether there are strange values or missing values. There are many ways to do this. You can of course do **summary(data)** which is a good start: #use the basic summary built-in function summary(bpdata) But this doesn't give me all the information I want and it's not in the format I want. If I had 20 variables, which is not out of the ordinary, it would be hard to read this and it takes up too much space in a report. I could also use the describe() function in the psych package, but it doesn't deal well with factor variables. I want to create a table with the variables down the rows and some specific stats going across the columns, namely mean, standard deviation, min, max, and the number missing. I want to make sure to treat factors as dummies instead of numeric variables. Here I create a new function that takes a dataset and produces this kind of table for all the variables within the dataset. I call the function **summarize.vars**. The only argument I pass through is a dataset. I use the package dummies to create dummy variables of all the factors so that I have only numeric variables in my data (it ignores all character variables if I had any). Then I use the **apply()** function to do my summarizing. Check out my previous post on how apply works, [here](http://rforpublichealth.blogspot.com/2012/09/the-infamous-apply-function.html). #function to summarize the variables in the data summarize.vars<-function(data){ #use dummies package to turn all factors into dummies require(dummies, quietly=TRUE) dat.d<-dummy.data.frame(data, dummy.class="factor") #use apply to calculate statistics for each variable mat<-t(apply(dat.d, 2, function(x) c(length(x), round(mean(x, na.rm=TRUE),2), round(sd(x, na.rm=TRUE),2), round(min(x, na.rm=TRUE),2), round(max(x, na.rm=TRUE),2), length(x)-length(x[!is.na(x)])))) #assign column names and rownames to output table colnames(mat)<-c("N","Mean","SD","Min","Max","Num Missing") rownames(mat)<-colnames(dat.d) return(mat)} summarize.vars(bpdata) I can also use this to summarize only part of my data, for example by subsetting my data like so: summarize.vars(bpdata[bpdata$race==1,]) You can think about making this function better. One way would be to have the N for the factors count within the level, since that is more useful information. You can customize it in whatever way makes sense to get the kind of information that is most useful. Let's do another example of a function to summarize information. This time we'll put together a function that runs a linear model, succinctly summarizes the results, and produces some diagnostic plots at the same time. This is useful if you run linear models often. Remember that you can pass anything through a function, including a formula. In this function, I pass in a formula for the model and the dataframe and I don't return any values; instead, I print directly from inside the function. You could change this so that it returned the output and you could store it in an object outside the function. #linear model function lm.model.diagnostics<-function(formula, dataset){ #run model and print specific output model1<-lm(formula=formula, data=dataset) stats<-round(c(summary(model1)$fstatistic[c(1,3)], summary(model1)$sigma, summary(model1)$r.squared, summary(model1)$adj.r.squared),3) names(stats)<-c("F","DF", "Sigma","Rsq","AdjRsq") l1<-list(round(summary(model1)$coefficients,3), stats) names(l1)<-c("Coefficients","Stats") print(l1) #run specific diagnostic tests par(mfrow=c(1,3)) hist(model1$residuals, main="Histogram of residuals", xlab="") plot(model1, 1) plot(model1, 2)} #run function for model of blood pressure on age and sex lm.model.diagnostics(bp~age+sex, bpdata) So we can see that there is a funny outlier in observation 300. That observation has a blood pressure of 400, which we think is an incorrect value. We can see what happens when we take out that outlier: #take out the outlier lm.model.diagnostics(bp~age+sex, bpdata[c(-300),]) Note that if your formula is long and you use it more than once, you can avoid copying and pasting it as well by just saving it in an object like this: form1<-bp~age+sex lm.model.diagnostics(form1, bpdata) Functions to aid in your analysis Another useful way to use functions is for improving your analysis. For example, R summary output of a model fit doesn't provide confidence intervals. It's useful to have a function to calculate the confidence intervals and put them in a nice table. As a bonus, you can make the function versatile so that it can provide nice output of logit or poisson models, by exponentiating the results to get odds ratios or incidence rate ratios, respectively. Notice how in this function I set up the defaults for the parameters that I pass in. Unless I indicate otherwise, exponent=FALSE, alpha=.05, and digits=2. That way the function runs even without specifying those parameters. If I want to change them, I can do so the way I do in the second example. #function to get confidence intervals for glm output, can get exponentiated output for logit or poisson glmCI <- function(glmfit, exponent=FALSE, alpha=0.05, digits=2){ #get SE from model fit se <- sqrt(diag(summary(glmfit)$cov.scaled)) #calculuate CI for linear case mat <- cbind(coef(glmfit), coef(glmfit) - qnorm(1-alpha/2)*se, coef(glmfit) + qnorm(1-alpha/2)*se) colnames(mat) <- c("Beta", "LowerCI", "UpperCI") #if exponent=TRUE, exponeniate the coefficients and CIs if(exponent == TRUE) { mat <- exp(mat) if(summary(glmfit)$family$link=="logit") colnames(mat)[1] <- "OR" if(summary(glmfit)$family$link=="log") colnames(mat)[1] <- "IRR"} #return a rounded matrix of results return(round(mat, digits=digits))} #1. use glm with identity link on continuous response data (default family is gaussian) g.glm<-glm(bp~age+sex, data=bpdata) glmCI(g.glm) #2. use glm with logit link on binary response data b.glm<-glm(out~age+sex+bp, family=binomial(link="logit"), data=bpdata) glmCI(b.glm, exponent=TRUE, digits=3) There are tons of other possibilities for functions, but hopefully this post has convinced you that functions can improve your coding by reducing repetition, increasing customizability, and improving your analysis and reports.

## Thursday, January 9, 2014

### ggplot2: Cheatsheet for Barplots

In the second of the series, this post will go over barplots in ggplot2. Our data is from mtcars as before. library(ggplot2) library(gridExtra) mtc<-mtcars #preview data head(mtc) To introduce the barplot, I show the basic default bargraph that you would get if you indicate an x-variable and use the default geom_bar layer, which is geom_bar(stat="bin"). You could just write geom_bar() and it would also work. Remember that in ggplot we add layers to make plots, so first we specify the data we want to use and then we specify that we want to plot it as a bar graph (instead of points or lines). The basic plot gives a count of the number in each group of the x-variable (gears). ggplot(mtc, aes(x = factor(gear))) + geom_bar(stat="bin") Aggregate data for barplot Instead of this, we would like to graphically show the mean weight of the cars by the number of gears. There are a number of ways to make this graph. The first way is that we summarize the data beforehand, and use the summarized data in the ggplot statement. I show two ways to summarize here, with two different results of how the data looks when summarized, using aggregate and tapply. Using the tapply() inside of a data.frame() statement, we can put together a new dataframe of the mean weight by gear. #using aggregate ag.mtc<-aggregate(mtc$wt, by=list(mtc$gear), FUN=mean) ag.mtc #using tapply summary.mtc <- data.frame( gear=levels(as.factor(mtc$gear)), meanwt=tapply(mtc$wt, mtc$gear, mean)) summary.mtc Now we can use the summarized dataframe in a ggplot statement and use the geom_bar layer to plot it. In the first argument we indicate that the dataframe is summary.mtc, next we indicate in the aes() statement that the x-axis is gear and the y-axis is meanwt, and finally we add the geom_bar() layer. We use the geom_bar(stat="identity") to indicate that we want the y-values to be exactly the values in the dataset. Remember, by default the stat is set to stat="bin" which is a count of the x-axis variable, so it's important to change it when we have summarized our data. ggplot(summary.mtc, aes(x = factor(gear), y = meanwt)) + geom_bar(stat = "identity") Another option for quick graphing is to use the built-in __stat_summary()__ layer. Instead of summarizing data, we use the original dataset and indicate that the x-axis is gear and the y-axis is just weight. However, we use __stat_summary()__ to calculate the mean of the y for each x with the following command: ggplot(mtc,aes(x=factor(gear), y=wt)) + stat_summary(fun.y=mean, geom="bar") There are reasons why we would want to use the first or second method. For the first, summarizing our data the way we want it gives us validity that we are sure that we are doing what we want to be doing and gives us more flexibility in case we want to use that summarized data in a later portion of our analysis (like in a table). Using the stat_summary() layer is faster and less code to write. For now, we continue with the second method, but later on we'll come back to the summarizing method. Horizontal bars, colors, width of bars We can make these plots look more presentable with a variety of options. First, we rotate the bars so they are horizontal. Second, we change the colors of the bars. Finally, we change the width of the bars. #1. horizontal bars p1<-ggplot(mtc,aes(x=factor(gear),y=wt)) + stat_summary(fun.y=mean,geom="bar") + coord_flip() #2. change colors of bars p2<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(gear))) + stat_summary(fun.y=mean,geom="bar") + scale_fill_manual(values=c("purple", "blue", "darkgreen")) #3. change width of bars p3<-ggplot(mtc,aes(x=factor(gear),y=wt)) + stat_summary(fun.y=mean,geom="bar", aes(width=0.5)) grid.arrange(p1, p2, p3, nrow=1) For the colors, I color the bars by the gear variable so it's a different color for each bar, and then indicate manually the colors I want. You could color them all the same way using fill="blue" for example, or you can keep the default colors when you fill by gear by leaving off scale_fill_manual altogether. You can also use scale_fill_brewer() to fill the bars with a scale of one color (default is blue). This R cookbook site is particularly useful for understanding how to get the exact colors you want:http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/ Note that if you are summarizing the data yourself, you change the width this way (graphs not shown since they look the same): ggplot(summary.mtc, aes(x = factor(gear), y = meanwt)) + geom_bar(stat = "identity", width=0.2) Split and color by another variable Now, let's make the graph more complicated by adding a third variable. We can do this in three ways: bars next to each other, bars stacked, or using 'faceting' which is making multiple graphs at once. We would like to know the mean weight by both gear and engine type (vs). Stacking is a particularly bad idea in this example, but I'll show it for completeness. #1. next to each other p1<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs)), color=factor(vs)) + stat_summary(fun.y=mean,position=position_dodge(),geom="bar") #2. stacked p2<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs)), color=factor(vs)) + stat_summary(fun.y=mean,position="stack",geom="bar") #3. with facets p3<-ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs)), color=factor(vs)) + stat_summary(fun.y=mean, geom="bar") + facet_wrap(~vs) grid.arrange(p1, p2, p3, nrow=1) You can also indicate the width of the spread between the bars in the first plot using position=position_dodge(width=.5) and play around with the width number. You can change the order of the stacking by re-ordering the levels of the fill variable. Here is a prior blog post I had about [how to reorder factors](http://rforpublichealth.blogspot.com/2012/11/data-types-part-3-factors.html). mtc$vs2<-factor(mtc$vs, levels = c(1,0)) ggplot(mtc,aes(x=factor(gear),y=wt,fill=factor(vs2)), color=factor(vs2)) + stat_summary(fun.y=mean,position="stack",geom="bar") Note that if you are using summarized data, just indicate the position in the geom_bar() statement. Faceting is a really nice feature in ggplot2 and deserves more space on this blog, but for now more information on how faceting works can be found here: http://www.cookbook-r.com/Graphs/Facets_(ggplot2)/ Add text to the bars, label axes, and label legend Next, I would like to add the value in text to the top of each bar. This is a case in which you definitely want to summarize the data first - it is much easier and cleaner that way. I use the aggregate() function to summarize the data by both gear and type of engine. ag.mtc<-aggregate(mtc$wt, by=list(mtc$gear,mtc$vs), FUN=mean) colnames(ag.mtc)<-c("gear","vs","meanwt") ag.mtc Now, I use the geom_bar() layer as in the first example, and the geom_text() layer to add the text. In order to move the text to the top of each bar, I use the position_dodge and vjust options to move the text around. The first plot shows the basic output, but we see that the first number is cutoff by the top of the y-axis and we need to round the text. We can fix it by adjusting the range of the y-axis exactly how we did in a scatterplot, by adding a scale_y_continuous layer to the plot. I also change the x-axis label using scale_x_discrete, change the text to be black so it's readable, and label the legend. Notice here, it is the scale_fill_discrete layer. Go back to the [cheatsheet for scatterplots](http://rforpublichealth.blogspot.com/2013/11/ggplot2-cheatsheet-for-scatterplots.html) if you want to go over how to customize axes and legends. #1. basic g1<-ggplot(ag.mtc, aes(x = factor(gear), y = meanwt, fill=factor(vs),color=factor(vs))) + geom_bar(stat = "identity", position=position_dodge()) + geom_text(aes(y=meanwt, ymax=meanwt, label=meanwt),position= position_dodge(width=0.9), vjust=-.5) #2. fixing the yaxis problem, changing the color of text, legend labels, and rounding to 2 decimals g2<-ggplot(ag.mtc, aes(x = factor(gear), y = meanwt, fill=factor(vs))) + geom_bar(stat = "identity", position=position_dodge()) + geom_text(aes(y=meanwt, ymax=meanwt, label=round(meanwt,2)), position= position_dodge(width=0.9), vjust=-.5, color="black") + scale_y_continuous("Mean Weight",limits=c(0,4.5),breaks=seq(0, 4.5, .5)) + scale_x_discrete("Number of Gears") + scale_fill_discrete(name ="Engine", labels=c("V-engine", "Straight engine")) grid.arrange(g1, g2, nrow=1) Add error bars or best fit line Again there are two ways to do this, but I prefer summarizing the data first and then adding in error bars. I use tapply to get the mean and SD of the weight by gear, then I add a geom_bar layer and a geom_errorbar layer, where I indicate the range of the error bar using ymin and ymax in the aes() statement. summary.mtc2 <- data.frame( gear=levels(as.factor(mtc$gear)), meanwt=tapply(mtc$wt, mtc$gear, mean), sd=tapply(mtc$wt, mtc$gear, sd)) summary.mtc2 ggplot(summary.mtc2, aes(x = factor(gear), y = meanwt)) + geom_bar(stat = "identity", position="dodge", fill="lightblue") + geom_errorbar(aes(ymin=meanwt-sd, ymax=meanwt+sd), width=.3, color="darkblue") And if you were really cool and wanted to add a linear fit to the barplot, you can do it in two ways. You can evaluate the linear model yourself, and then use geom_abline() with an intercept and slope indicated. Or you can take advantage of the stat_summary() layer to summarize the data and the geom_smooth() layer to add a linear model instantly. #summarize data summary.mtc3 <- data.frame( hp=levels(as.factor(mtc$hp)), meanmpg=tapply(mtc$mpg, mtc$hp, mean)) #run a model l<-summary(lm(meanmpg~as.numeric(hp), data=summary.mtc3)) #manually entering the intercept and slope f1<-ggplot(summary.mtc3, aes(x = factor(hp), y = meanmpg)) + geom_bar(stat = "identity", fill="darkblue")+ geom_abline(aes(intercept=l$coef[1,1], slope=l$coef[2,1]), color="red", size=1.5) #using stat_smooth to fit the line for you f2<-ggplot(summary.mtc3, aes(x = factor(hp), y = meanmpg)) + geom_bar(stat = "identity", fill="darkblue")+ stat_smooth(aes(group=1),method="lm", se=FALSE, color="orange", size=1.5) grid.arrange(f1, f2, nrow=1) And as before, check out [The R cookbook](http://www.cookbook-r.com/Graphs) and the [ggplot2 documentation](http://docs.ggplot2.org/0.9.3.1/geom_bar.html) for more help on getting the bargraph of your dreams.

## Wednesday, November 13, 2013

### ggplot2: Cheatsheet for Scatterplots

The graphics package ggplot2 is powerful, aesthetically pleasing, and (after a short learning curve to understand the syntax) easy to use. I have made some pretty cool plots with it, but on the whole I find myself making a lot of the same ones, since doing something over and over again is generally how research goes. Since I constantly forget the options that I need to customize my plots, this next series of posts will serve as cheatsheets for scatterplots, barplots, and density plots. We start with scatterplots. ### Quick Intro to ggplot2 The way ggplot2 works is by layering components of your plot on top of each other. You start with the basic of the data you want your plot to include (x and y variables), and then layer on top the kind of plotting colors/symbols you want, the look of the x- and y-axes, the background color, etc. You can also easily add regression lines and summary statistics. For great reference guides, use the [ggplot2 documentation](http://docs.ggplot2.org/0.9.2.1/index.html) or the [R Graphs Cookbook](http://www.cookbook-r.com/Graphs). In this post, we focus only on scatterplots with a continuous x and continuous y. We are going to use the mtcars data that is available through R. library(ggplot2) library(gridExtra) mtc<-mtcars Here's the basic syntax of a scatterplot. We give it a dataframe, mtc, and then in the **aes()** statement, we give it an x-variable and a y-variable to plot. I save it as a ggplot object called p1, because we are going to use this as the base and then layer everything else on top: #Basic scatterplot p1 <- ggplot(mtc, aes(x = hp, y = mpg)) Now for the plot to print, we need to specify the next layer, which is how the symbols should look - do we want points or lines, what color, how big. Let's start with points: #Print plot with default points p1+geom_point() That's the bare bones of it. Now we have fun with adding layers. For each of the examples, I'm going to use the *grid.arrange()* function in the **gridExtra** package to create multiple graphs in one panel to save space. Change color of points We start with options for colors just by adding how we want to color our points in the geom_point() layer: p2 <- p1 + geom_point(color="red") #set one color for all points p3 <- p1 + geom_point(aes(color = wt)) #set color scale by a continuous variable p4 <- p1 + geom_point(aes(color=factor(am))) #set color scale by a factor variable grid.arrange(p2, p3, p4, nrow=1) We can also change the default colors that are given by ggplot2 like this: #Change default colors in color scale p1 + geom_point(aes(color=factor(am))) + scale_color_manual(values = c("orange", "purple")) Change shape or size of points We're sticking with the basic p1 plot, but now changing the shape and size of the points: p2 <- p1 + geom_point(size = 5) #increase all points to size 5 p3 <- p1 + geom_point(aes(size = wt)) #set point size by continuous variable p4 <- p1 + geom_point(aes(shape = factor(am))) #set point shape by factor variable grid.arrange(p2, p3, p4, nrow=1) Again, if we want to change the default shapes we can: p1 + geom_point(aes(shape = factor(am))) + scale_shape_manual(values=c(0,2)) * More options for [color and shape manual changes are here](http://docs.ggplot2.org/0.9.3.1/scale_manual.html) * All shape and line types can be found here: http://www.cookbook-r.com/Graphs/Shapes_and_line_types Add lines to scatterplot p2 <- p1 + geom_point(color="blue") + geom_line() #connect points with line p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE) #add regression line p4 <- p1 + geom_point() + geom_vline(xintercept = 100, color="red") #add vertical line grid.arrange(p2, p3, p4, nrow=1) You can also take out the points, and just create a line plot, and change size and color as before: ggplot(mtc, aes(x = wt, y = qsec)) + geom_line(size=2, aes(color=factor(vs))) * More help on scatterplots can be found here: http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2) Change axis labels There are a few ways to do this. If you only want to quickly add labels you can use the *labs()* layer. If you want to change the font size and style of the label, then you need to use the *theme()* layer. More on this at the end of this post. If you want to change around the limits of the axis, and exactly where the breaks are, you use the *scale_x_continuous* (and *scale_y_continuous* for the y-axis). p2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point() p3 <- p2 + labs(x="Horsepower", y = "Miles per Gallon") #label all axes at once p4 <- p2 + theme(axis.title.x = element_text(face="bold", size=20)) + labs(x="Horsepower") #label and change font size p5 <- p2 + scale_x_continuous("Horsepower", limits=c(0,400), breaks=seq(0, 400, 50)) #adjust axis limits and breaks grid.arrange(p3, p4, p5, nrow=1) * More axis options can be found here: http://www.cookbook-r.com/Graphs/Axes_(ggplot2) Change legend options We start off by creating a new ggplot base object, g1, which colors the points by a factor variable. Then we show three basic options to modify the legend. g1<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(aes(color=factor(vs))) g2 <- g1 + theme(legend.position=c(1,1),legend.justification=c(1,1)) #move legend inside g3 <- g1 + theme(legend.position = "bottom") #move legend bottom g4 <- g1 + scale_color_discrete(name ="Engine", labels=c("V-engine", "Straight engine")) #change labels grid.arrange(g2, g3, g4, nrow=1) If we had changed the shape of the points, we would use *scale_shape_discrete()* with the same options. We can also remove the entire legend altogether by using **theme(legend.position="none")** Next we customize a legend when the scale is continuous: g5<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size=2, aes(color = wt)) g5 + scale_color_continuous(name="Weight", #name of legend breaks = with(mtc, c(min(wt), mean(wt), max(wt))), #choose breaks of variable labels = c("Light", "Medium", "Heavy"), #label low = "pink", #color of lowest value high = "red") #color of highest value * More legend options can be found here: http://www.cookbook-r.com/Graphs/Legends_(ggplot2) Change background color and style The look of the plot in terms of the background colors and style is the **theme()**. I personally don't like the look of the default gray so here are some quick ways to change it. I often the theme_bw() layer, which gets rid of the gray. * All of the theme options [can be found here](http://docs.ggplot2.org/0.9.3/theme.html). g2<- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point() #Completely clear all lines except axis lines and make background white t1<-theme( plot.background = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.border = element_blank(), panel.background = element_blank(), axis.line = element_line(size=.4) ) #Use theme to change axis label style t2<-theme( axis.title.x = element_text(face="bold", color="black", size=10), axis.title.y = element_text(face="bold", color="black", size=10), plot.title = element_text(face="bold", color = "black", size=12) ) g3 <- g2 + t1 g4 <- g2 + theme_bw() g5 <- g2 + theme_bw() + t2 + labs(x="Horsepower", y = "Miles per Gallon", title= "MPG vs Horsepower") grid.arrange(g2, g3, g4, g5, nrow=1) Finally, here's a nice graph using a combination of options: g2<- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size=2, aes(color=factor(vs), shape=factor(vs))) + geom_smooth(aes(color=factor(vs)),method = "lm", se = TRUE) + scale_color_manual(name ="Engine", labels=c("V-engine", "Straight engine"), values=c("red","blue")) + scale_shape_manual(name ="Engine", labels=c("V-engine", "Straight engine"), values=c(0,2)) + theme_bw() + theme( axis.title.x = element_text(face="bold", color="black", size=12), axis.title.y = element_text(face="bold", color="black", size=12), plot.title = element_text(face="bold", color = "black", size=12), legend.position=c(1,1), legend.justification=c(1,1)) + labs(x="Horsepower", y = "Miles per Gallon", title= "Linear Regression (95% CI) of MPG vs Horsepower by Engine type") g2 Reader request: Display Regression Line Equation on Scatterplot I received a request asking how to overlay the regression equation itself on a plot, so I've decided to update this post with that information. There are two ways to put text on a ggplot: annotate or geom_text(). I was finding that the geom_text() layer did not look very nice on my screen so I checked up on it and it seems others have this issue as well. I'll show you how the two behave, at least in my version of everything I use on my mac. We'll go back to the example where I add a regression line to the plot using geom_smooth(). To add text, you need to run the regression outside of ggplot, extract the coefficients, and then paste them together into some text that you can layer onto the plot. We're plotting MPG against horsepower so we create an object m that stores the linear model, and then extract the coefficients using the coef() function. We envelope the coef() function with signif() in order to round the coefficients to two significant digits. I then paste the regression equation text together, using sep="" in order to eliminate spaces. m <- lm(mtc$mpg ~ mtc$hp) a <- signif(coef(m)[1], digits = 2) b <- signif(coef(m)[2], digits = 2) textlab <- paste("y = ",b,"x + ",a, sep="") print(textlab) Next, I take the original p1 ggplot object, add points and a linear model to it, and then add a layer of text. I will show the two ways here, first using geom_smooth and then using annotate. With both methods, you must specify the x- and y-coordinates for where the text should be centered. In the geom_text code, notice that that label=textlab is included in the aes statement, while this is not the case for annotate. If there were mathematical or formatting symbols in the text, I would indicate parse=TRUE instead of FALSE, as we will see in the next example. ##basic ggplot with points and linear model p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE) ##add regression text using geom_text r1 <- p3 + geom_text(aes(x = 245, y = 30, label = textlab), color="black", size=5, parse = FALSE) ##add regression text using annotate r2 <- p3 + annotate("text", x = 245, y = 30, label = textlab, color="black", size = 5, parse=FALSE) grid.arrange(r1, r2, nrow=1) In a fancier way that I got from [this StackOverflow page](http://stackoverflow.com/questions/7549694/ggplot2-adding-regression-line-equation-and-r2-on-graph), you can use a function to piece together your text (which would be useful if you were doing this a lot). It also shows you how you can put in mathematical symbols and formattting changes, like making your variables italic by using substitute(), and adding in a dot for the multiplication symbol. The function lm_eqn() takes the arguments x, y, and a dataframe and evaluates the same linear model as before. Then it uses the substitute() function to piece together the regression equation using an expression, which is an R object of class "call". Finally, the function returns the expression, and is used exactly the same way in the two ggplot statements, EXCEPT that since we now have these formatting changes, we must use parse=TRUE in order to properly display the expressions. ##function to create equation expression lm_eqn = function(x, y, df){ m <- lm(y ~ x, df); eq <- substitute(italic(y) == b %.% italic(x) + a, list(a = format(coef(m)[1], digits = 2), b = format(coef(m)[2], digits = 2))) as.character(as.expression(eq)); } ##add regression equation using geom_text r3 <- p3 + geom_text(aes(x = 245, y = 30, label = lm_eqn(mtc$hp, mtc$mpg, mtc)), color="black", size=5, parse = TRUE) ##add regression equation using annotate r4 <- p3 + annotate("text", x = 245, y = 30, label = lm_eqn(mtc$hp, mtc$mpg, mtc), color="black", size = 5, parse=TRUE) grid.arrange(r3, r4, nrow=1) Of course, you can change the font and do more formatting stuff on the text itself - [find that information here.](http://docs.ggplot2.org/0.9.3.1/geom_text.html) Lastly, I will go over functions in a post that I plan on doing very soon so be on the lookout for that if the function used here is confusing or you'd like to know more.