Wednesday, November 13, 2013

ggplot2: Cheatsheet for Scatterplots

The graphics package ggplot2 is powerful, aesthetically pleasing, and (after a short learning curve to understand the syntax) easy to use. I have made some pretty cool plots with it, but on the whole I find myself making a lot of the same ones, since doing something over and over again is generally how research goes. Since I constantly forget the options that I need to customize my plots, this next series of posts will serve as cheatsheets for scatterplots, barplots, and density plots. We start with scatterplots.

### Quick Intro to ggplot2

The way ggplot2 works is by layering components of your plot on top of each other. You start with the basics, namely the data you want your plot to include (x and y variables), and then layer on top the kind of plotting colors/symbols you want, the look of the x- and y-axes, the background color, etc. You can also easily add regression lines and summary statistics.

For great reference guides, use the ggplot2 documentation or the R Graphs Cookbook.

In this post, we focus only on scatterplots with a continuous x and continuous y. We are going to use the mtcars data that is available through R.

    library(ggplot2)
    library(gridExtra)
    mtc <- mtcars

Here's the basic syntax of a scatterplot. We give it a dataframe, mtc, and then in the **aes()** statement, we give it an x-variable and a y-variable to plot. I save it as a ggplot object called p1, because we are going to use this as the base and then layer everything else on top:

    #Basic scatterplot
    p1 <- ggplot(mtc, aes(x = hp, y = mpg))

Now for the plot to print, we need to specify the next layer, which is how the symbols should look: do we want points or lines, what color, how big. Let's start with points:

    #Print plot with default points
    p1 + geom_point()

That's the bare bones of it. Now we have fun with adding layers. For each of the examples, I'm going to use the *grid.arrange()* function in the **gridExtra** package to create multiple graphs in one panel to save space.
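To make the layering idea concrete, here is a minimal sketch. The object names `base`, `with_points`, and `with_fit` are my own for illustration and are not used elsewhere in this post. It assumes that a ggplot object keeps its layers in a list element called `layers`, which holds in the ggplot2 versions I am aware of, and simply counts the layers as each `+` adds one:

```r
library(ggplot2)

# Data and aesthetic mappings only: no layers yet
base <- ggplot(mtcars, aes(x = hp, y = mpg))

# Each "+ geom_*()" appends one layer to the plot object
with_points <- base + geom_point()
with_fit    <- with_points + geom_smooth(method = "lm", se = TRUE)

# Count the layers stored in each object
layer_counts <- c(
  base        = length(base$layers),
  with_points = length(with_points$layers),
  with_fit    = length(with_fit$layers)
)
layer_counts
```

The base object carries the data and the x/y mappings but no layers, so there is nothing to draw until a geom is added on top.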
### Change color of points

We start with options for colors just by adding how we want to color our points in the geom_point() layer:

    p2 <- p1 + geom_point(color="red")            #set one color for all points
    p3 <- p1 + geom_point(aes(color = wt))        #set color scale by a continuous variable
    p4 <- p1 + geom_point(aes(color=factor(am)))  #set color scale by a factor variable

    grid.arrange(p2, p3, p4, nrow=1)

We can also change the default colors that are given by ggplot2 like this:

    #Change default colors in color scale
    p1 + geom_point(aes(color=factor(am))) + scale_color_manual(values = c("orange", "purple"))

### Change shape or size of points

We're sticking with the basic p1 plot, but now changing the shape and size of the points:

    p2 <- p1 + geom_point(size = 5)                 #increase all points to size 5
    p3 <- p1 + geom_point(aes(size = wt))           #set point size by continuous variable
    p4 <- p1 + geom_point(aes(shape = factor(am)))  #set point shape by factor variable

    grid.arrange(p2, p3, p4, nrow=1)

Again, if we want to change the default shapes we can:

    p1 + geom_point(aes(shape = factor(am))) + scale_shape_manual(values=c(0,2))

* More options for color and shape manual changes are here
* All shape and line types can be found here:

### Add lines to scatterplot

    p2 <- p1 + geom_point(color="blue") + geom_line()                           #connect points with line
    p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE)  #add regression line
    p4 <- p1 + geom_point() + geom_vline(xintercept = 100, color="red")         #add vertical line

    grid.arrange(p2, p3, p4, nrow=1)

You can also take out the points, and just create a line plot, and change size and color as before:

    ggplot(mtc, aes(x = wt, y = qsec)) + geom_line(size=2, aes(color=factor(vs)))

* More help on scatterplots can be found here:

### Change axis labels

There are a few ways to do this. If you only want to quickly add labels you can use the *labs()* layer. If you want to change the font size and style of the label, then you need to use the *theme()* layer.
More on this at the end of this post. If you want to change the limits of the axis, and exactly where the breaks are, you use *scale_x_continuous()* (and *scale_y_continuous()* for the y-axis).

    p2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point()
    p3 <- p2 + labs(x="Horsepower", y = "Miles per Gallon")  #label all axes at once
    p4 <- p2 + theme(axis.title.x = element_text(face="bold", size=20)) + labs(x="Horsepower")  #label and change font size
    p5 <- p2 + scale_x_continuous("Horsepower", limits=c(0,400), breaks=seq(0, 400, 50))  #adjust axis limits and breaks

    grid.arrange(p3, p4, p5, nrow=1)

* More axis options can be found here:

### Change legend options

We start off by creating a new ggplot base object, g1, which colors the points by a factor variable. Then we show three basic options to modify the legend.

    g1 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(aes(color=factor(vs)))

    g2 <- g1 + theme(legend.position=c(1,1),legend.justification=c(1,1))  #move legend inside
    g3 <- g1 + theme(legend.position = "bottom")                          #move legend bottom
    g4 <- g1 + scale_color_discrete(name ="Engine", labels=c("V-engine", "Straight engine"))  #change labels

    grid.arrange(g2, g3, g4, nrow=1)

If we had changed the shape of the points, we would use *scale_shape_discrete()* with the same options. We can also remove the legend altogether by using **theme(legend.position="none")**.

Next we customize a legend when the scale is continuous:

    g5 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size=2, aes(color = wt))
    g5 + scale_color_continuous(name="Weight",                                      #name of legend
                                breaks = with(mtc, c(min(wt), mean(wt), max(wt))),  #choose breaks of variable
                                labels = c("Light", "Medium", "Heavy"),             #label
                                low = "pink",                                       #color of lowest value
                                high = "red")                                       #color of highest value

* More legend options can be found here:

### Change background color and style

The look of the plot in terms of the background colors and style is controlled by the **theme()** layer.
I personally don't like the look of the default gray so here are some quick ways to change it. I often use the theme_bw() layer, which gets rid of the gray.

    g2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point()

    #Completely clear all lines except axis lines and make background white
    t1 <- theme(
      plot.background = element_blank(),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.border = element_blank(),
      panel.background = element_blank(),
      axis.line = element_line(size=.4)
    )

    #Use theme to change axis label style
    t2 <- theme(
      axis.title.x = element_text(face="bold", color="black", size=10),
      axis.title.y = element_text(face="bold", color="black", size=10),
      plot.title = element_text(face="bold", color = "black", size=12)
    )

    g3 <- g2 + t1
    g4 <- g2 + theme_bw()
    g5 <- g2 + theme_bw() + t2 + labs(x="Horsepower", y = "Miles per Gallon", title= "MPG vs Horsepower")

    grid.arrange(g2, g3, g4, g5, nrow=1)

* All of the theme options can be found here

Finally, here's a nice graph using a combination of options:

    g2 <- ggplot(mtc, aes(x = hp, y = mpg)) +
      geom_point(size=2, aes(color=factor(vs), shape=factor(vs))) +
      geom_smooth(aes(color=factor(vs)), method = "lm", se = TRUE) +
      scale_color_manual(name ="Engine",
                         labels=c("V-engine", "Straight engine"),
                         values=c("red","blue")) +
      scale_shape_manual(name ="Engine",
                         labels=c("V-engine", "Straight engine"),
                         values=c(0,2)) +
      theme_bw() +
      theme(
        axis.title.x = element_text(face="bold", color="black", size=12),
        axis.title.y = element_text(face="bold", color="black", size=12),
        plot.title = element_text(face="bold", color = "black", size=12),
        legend.position=c(1,1),
        legend.justification=c(1,1)) +
      labs(x="Horsepower", y = "Miles per Gallon",
           title= "Linear Regression (95% CI) of MPG vs Horsepower by Engine type")

    g2

### Reader request: Display Regression Line Equation on Scatterplot

I received a request asking how to overlay the regression equation itself on a plot, so I've decided to update this post with that information.
There are two ways to put text on a ggplot: annotate() or geom_text(). I was finding that the geom_text() layer did not look very nice on my screen so I checked up on it and it seems others have this issue as well. I'll show you how the two behave, at least in my version of everything I use on my Mac.

We'll go back to the example where I add a regression line to the plot using geom_smooth(). To add text, you need to run the regression outside of ggplot, extract the coefficients, and then paste them together into some text that you can layer onto the plot. We're plotting MPG against horsepower so we create an object m that stores the linear model, and then extract the coefficients using the coef() function. We wrap the coef() call in signif() in order to round the coefficients to two significant digits. I then paste the regression equation text together, using sep="" in order to eliminate spaces.

    m <- lm(mtc$mpg ~ mtc$hp)
    a <- signif(coef(m)[1], digits = 2)
    b <- signif(coef(m)[2], digits = 2)
    textlab <- paste("y = ",b,"x + ",a, sep="")
    print(textlab)

Next, I take the original p1 ggplot object, add points and a linear model to it, and then add a layer of text. I will show the two ways here, first using geom_text() and then using annotate(). With both methods, you must specify the x- and y-coordinates for where the text should be centered. In the geom_text() code, notice that label=textlab is included in the aes statement, while this is not the case for annotate(). If there were mathematical or formatting symbols in the text, I would indicate parse=TRUE instead of FALSE, as we will see in the next example.

    ##basic ggplot with points and linear model
    p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE)

    ##add regression text using geom_text
    r1 <- p3 + geom_text(aes(x = 245, y = 30, label = textlab), color="black", size=5, parse = FALSE)

    ##add regression text using annotate
    r2 <- p3 + annotate("text", x = 245, y = 30, label = textlab, color="black", size = 5, parse=FALSE)

    grid.arrange(r1, r2, nrow=1)

In a fancier way that I got from a StackOverflow page, you can use a function to piece together your text (which would be useful if you were doing this a lot). It also shows you how you can put in mathematical symbols and formatting changes, like making your variables italic by using substitute(), and adding in a dot for the multiplication symbol.

The function lm_eqn() takes the arguments x, y, and a dataframe and evaluates the same linear model as before. Then it uses the substitute() function to piece together the regression equation using an expression, which is an R object of class "call". Finally, the function returns the expression as a character string, which is used exactly the same way in the two ggplot statements, EXCEPT that since we now have these formatting changes, we must use parse=TRUE in order to properly display the expressions.

    ##function to create equation expression
    lm_eqn = function(x, y, df){
      m <- lm(y ~ x, df);
      eq <- substitute(italic(y) == b %.% italic(x) + a,
                       list(a = format(coef(m)[1], digits = 2),
                            b = format(coef(m)[2], digits = 2)))
      as.character(as.expression(eq));
    }

    ##add regression equation using geom_text
    r3 <- p3 + geom_text(aes(x = 245, y = 30, label = lm_eqn(mtc$hp, mtc$mpg, mtc)), color="black", size=5, parse = TRUE)

    ##add regression equation using annotate
    r4 <- p3 + annotate("text", x = 245, y = 30, label = lm_eqn(mtc$hp, mtc$mpg, mtc), color="black", size = 5, parse=TRUE)

    grid.arrange(r3, r4, nrow=1)

Of course, you can change the font and do more formatting on the text itself; find that information here.

Lastly, I will go over functions in a post that I plan on doing very soon, so be on the lookout for that if the function used here is confusing or you'd like to know more.
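A possible explanation for the rough-looking geom_text() labels in the regression-equation section: when the label is supplied inside aes(), geom_text() inherits the plot's data and draws the same text once per row (32 times for mtcars), whereas annotate() draws it once. The sketch below checks this with layer_data(). The object names (`label_text`, `label_df`, and so on) are mine, and the label string is hard-coded for illustration rather than computed from the model:

```r
library(ggplot2)

# Assumed label text; in the post this comes from paste() on the lm coefficients
label_text <- "y = -0.068x + 30"

base <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()

# geom_text() inherits the 32-row data, so the label is drawn once per row
with_geom_text <- base +
  geom_text(aes(x = 245, y = 30, label = label_text))

# annotate() builds its own one-row data, so the label is drawn once
with_annotate <- base +
  annotate("text", x = 245, y = 30, label = label_text)

# Giving geom_text() a one-row data frame also draws the label only once
label_df <- data.frame(hp = 245, mpg = 30, label = label_text)
with_one_row <- base +
  geom_text(data = label_df, aes(label = label))

# Layer 2 is the text layer in each plot; count how many labels get drawn
n_drawn <- c(
  geom_text = nrow(layer_data(with_geom_text, 2)),
  annotate  = nrow(layer_data(with_annotate, 2)),
  one_row   = nrow(layer_data(with_one_row, 2))
)
n_drawn
```

If you prefer to stay with geom_text(), passing it a one-row data frame as in `with_one_row` is one way to get the same single-draw behaviour as annotate().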


  1. Thanks! Very clear and helpful.

  2. Indeed, very clear and helpful. One question: in your last example, you change both colour and shape to vary with vs. Having colour represent vs, and shape, say, am, is not a problem; but how does one construct a suitable legend?

    1. Thanks! You would change scale_shape_manual and scale_color_manual accordingly. I took out the regression lines because it would be confusing but here is the plot with color by vs and shape by am with the legend:

      g2<- ggplot(mtc, aes(x = hp, y = mpg)) +
      geom_point(size=3, aes(color=factor(vs), shape=factor(am))) +
      scale_color_manual(name ="Engine",
      labels=c("V-engine", "Straight"),
      values=c("red","blue")) +
      scale_shape_manual(name ="Transmission",
      labels=c("Automatic", "Manual"),
      values=c(0,2)) +
      theme_bw() +
      theme(
      axis.title.x = element_text(face="bold", color="black", size=12),
      axis.title.y = element_text(face="bold", color="black", size=12),
      plot.title = element_text(face="bold", color = "black", size=12),
      legend.justification=c(1,1)) +
      labs(x="Horsepower", y = "Miles per Gallon", title= "MPG vs Horsepower by Engine and Transmission")

    2. Some of the plots are not loading (e.g. 4, 6, 8, 10, ...)

    3. Hmm, they look fine to me. Which one specifically doesn't load? Or can you send me a screenshot?

  3. Hi Rokicki. I'm also a Public Health researcher and admire R very much. It's amazing to learn more of R from your blog. I liked this particular ggplot series on scatterplots. I would like to know how we can put the regression equation onto the plot, for example in your plot
    p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE) #add regression line

    Thank you.

    1. Hi Manoj, Great question! I have updated the Scatterplot blog post to answer it. Check out the last section now and I hope it helps! Thanks for reading.

  4. Thanks for sharing, that was useful. However, annotate() is a better way than geom_text(), as you can see from the poor, jagged annotations it produces, caused by printing over and over. See

  5. Thank you very much for taking the initiative to organize this very useful information in a clear and concise way.

    I recently finished MITx's excellent 15.071x MOOC in data analytics, and this post plus your

    complement the visualization unit of that course very well.

    1. Thanks Nick! I'm really glad it's helpful. That class sounds really interesting. I'll check it out.