Wednesday, November 13, 2013

ggplot2: Cheatsheet for Scatterplots

The graphics package ggplot2 is powerful, aesthetically pleasing, and (after a short learning curve to understand the syntax) easy to use. I have made some pretty cool plots with it, but on the whole I find myself making a lot of the same ones, since doing something over and over again is generally how research goes. Since I constantly forget the options that I need to customize my plots, this next series of posts will serve as cheatsheets for scatterplots, barplots, and density plots. We start with scatterplots.

Quick Intro to ggplot2

The way ggplot2 works is by layering components of your plot on top of each other. You start with the basics, the data you want your plot to include (x and y variables), and then layer on top the kind of plotting colors/symbols you want, the look of the x- and y-axes, the background color, etc. You can also easily add regression lines and summary statistics.

For great reference guides, use the ggplot2 documentation (http://docs.ggplot2.org/0.9.2.1/index.html) or the R Graphs Cookbook (http://www.cookbook-r.com/Graphs).

In this post, we focus only on scatterplots with a continuous x and continuous y. We are going to use the mtcars data that is available through R.

library(ggplot2)
library(gridExtra)
mtc <- mtcars

Here's the basic syntax of a scatterplot. We give it a dataframe, mtc, and then in the aes() statement, we give it an x-variable and a y-variable to plot. I save it as a ggplot object called p1, because we are going to use this as the base and then layer everything else on top:

# Basic scatterplot
p1 <- ggplot(mtc, aes(x = hp, y = mpg))

Now for the plot to print, we need to specify the next layer, which is how the symbols should look - do we want points or lines, what color, how big. Let's start with points:

# Print plot with default points
p1 + geom_point()

That's the bare bones of it. Now we have fun with adding layers. For each of the examples, I'm going to use the grid.arrange() function in the gridExtra package to create multiple graphs in one panel to save space.

>> Change color of points

We start with options for colors just by adding how we want to color our points in the geom_point() layer:

p2 <- p1 + geom_point(color="red")            #set one color for all points
p3 <- p1 + geom_point(aes(color = wt))        #set color scale by a continuous variable
p4 <- p1 + geom_point(aes(color=factor(am)))  #set color scale by a factor variable

grid.arrange(p2, p3, p4, nrow=1)

We can also change the default colors that are given by ggplot2 like this:

#Change default colors in color scale
p1 + geom_point(aes(color=factor(am))) + scale_color_manual(values = c("orange", "purple"))
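
If you would rather not depend on the order of the factor levels, here is a minimal sketch that ties each level to a color by name and relabels the legend at the same time. The object names p_named and am_colors and the legend text are my own choices for illustration (per ?mtcars, am is 0 for automatic and 1 for manual).

```r
library(ggplot2)

# Named vector: each factor level is tied to a color explicitly,
# so the mapping does not depend on the order of the levels
am_colors <- c("0" = "orange", "1" = "purple")

p_named <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(aes(color = factor(am))) +
  scale_color_manual(name = "Transmission",
                     values = am_colors,
                     breaks = c("0", "1"),
                     labels = c("Automatic", "Manual"))
p_named
```

The names in am_colors have to match the factor levels exactly ("0" and "1" here), otherwise the points for that level get no color.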

>> Change shape or size of points

We're sticking with the basic p1 plot, but now changing the shape and size of the points:

p2 <- p1 + geom_point(size = 5)                   #increase all points to size 5
p3 <- p1 + geom_point(aes(size = wt))             #set point size by continuous variable
p4 <- p1 + geom_point(aes(shape = factor(am)))    #set point shape by factor variable    

grid.arrange(p2, p3, p4, nrow=1)

Again, if we want to change the default shapes we can:

p1 + geom_point(aes(shape = factor(am))) + scale_shape_manual(values=c(0,2))

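More options for color and shape manual changes are here: http://docs.ggplot2.org/0.9.3.1/scale_manual.html

All shape and line types can be found here: http://www.cookbook-r.com/Graphs/Shapes_and_line_types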

>> Add lines to scatterplot

p2 <- p1 + geom_point(color="blue") + geom_line()                           #connect points with line
p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE)  #add regression line
p4 <- p1 + geom_point() + geom_vline(xintercept = 100, color="red")         #add vertical line

grid.arrange(p2, p3, p4, nrow=1)
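
The vertical line has horizontal and sloped counterparts, geom_hline() and geom_abline(). As a small sketch (the names fit and p_lines are mine, and the dashed mean line is just an example), this draws a horizontal line at the mean MPG and a line built directly from regression coefficients:

```r
library(ggplot2)

# Simple linear regression of MPG on horsepower
fit <- lm(mpg ~ hp, data = mtcars)

p_lines <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  # horizontal dashed line at the mean MPG
  geom_hline(yintercept = mean(mtcars$mpg), linetype = "dashed") +
  # straight line drawn from the fitted intercept and slope
  geom_abline(intercept = coef(fit)[[1]], slope = coef(fit)[[2]],
              color = "red")
p_lines
```

Unlike geom_smooth(), geom_abline() only draws the line you hand it, so there is no confidence band.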

You can also take out the points, and just create a line plot, and change size and color as before:

ggplot(mtc, aes(x = wt, y = qsec)) + geom_line(size=2, aes(color=factor(vs)))

More help on scatterplots can be found here: http://www.cookbook-r.com/Graphs/Scatterplots_(ggplot2)

>> Change axis labels

There are a few ways to do this. If you only want to quickly add labels, you can use the labs() layer. If you want to change the font size and style of the label, then you need to use the theme() layer. More on this at the end of this post. If you want to change the limits of the axis and exactly where the breaks are, use scale_x_continuous() (and scale_y_continuous() for the y-axis).

p2 <- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point()

p3 <- p2 + labs(x="Horsepower", 
                y = "Miles per Gallon")                                  #label all axes at once

p4 <- p2 + theme(axis.title.x = element_text(face="bold", size=20)) + 
           labs(x="Horsepower")                                          #label and change font size

p5 <- p2 + scale_x_continuous("Horsepower",
                              limits=c(0,400),
                              breaks=seq(0, 400, 50))                    #adjust axis limits and breaks

grid.arrange(p3, p4, p5, nrow=1)
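
One thing to keep in mind with limits: setting them in the scale drops the data outside the range before any statistics are computed, while coord_cartesian() only zooms the view. Here is a rough sketch of the difference, where the 100 to 200 range and the names base, z1 and z2 are arbitrary choices of mine:

```r
library(ggplot2)
library(gridExtra)

base <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Scale limits: cars outside 100-200 hp are removed (with a warning),
# so the regression line is fit to the remaining cars only
z1 <- base + scale_x_continuous("Horsepower", limits = c(100, 200))

# Coordinate limits: every car is used for the fit, only the view is zoomed
z2 <- base + coord_cartesian(xlim = c(100, 200)) + labs(x = "Horsepower")

grid.arrange(z1, z2, nrow = 1)
```

The two regression lines will generally differ, because only the second one is fit to all of the cars.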

More axis options can be found here: http://www.cookbook-r.com/Graphs/Axes_(ggplot2)

>> Change legend options

We start off by creating a new ggplot base object, g1, which colors the points by a factor variable. Then we show three basic options to modify the legend.

g1<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(aes(color=factor(vs)))

g2 <- g1 + theme(legend.position=c(1,1),legend.justification=c(1,1))        #move legend inside                
g3 <- g1 + theme(legend.position = "bottom")                                #move legend bottom         
g4 <- g1 + scale_color_discrete(name ="Engine", 
                                labels=c("V-engine", "Straight engine"))    #change labels

grid.arrange(g2, g3, g4, nrow=1)

If we had changed the shape of the points, we would use scale_shape_discrete() with the same options. We can also remove the legend altogether by using theme(legend.position="none").

Next we customize a legend when the scale is continuous:

g5<-ggplot(mtc, aes(x = hp, y = mpg)) + geom_point(size=2, aes(color = wt))
g5 + scale_color_continuous(name="Weight",                                     #name of legend
                            breaks = with(mtc, c(min(wt), mean(wt), max(wt))), #choose breaks of variable
                            labels = c("Light", "Medium", "Heavy"),            #label
                            low = "pink",                                      #color of lowest value
                            high = "red")                                      #color of highest value

More legend options can be found here: http://www.cookbook-r.com/Graphs/Legends_(ggplot2)

>> Change background color and style

The look of the plot in terms of the background colors and style is controlled by theme(). I personally don't like the look of the default gray, so here are some quick ways to change it. I often use the theme_bw() layer, which gets rid of the gray.

All of the theme options can be found here: http://docs.ggplot2.org/0.9.3/theme.html

g2<- ggplot(mtc, aes(x = hp, y = mpg)) + geom_point()

#Completely clear all lines except axis lines and make background white
t1<-theme(                              
  plot.background = element_blank(), 
  panel.grid.major = element_blank(), 
  panel.grid.minor = element_blank(), 
  panel.border = element_blank(), 
  panel.background = element_blank(),
  axis.line = element_line(size=.4)
)

#Use theme to change axis label style
t2<-theme(                              
  axis.title.x = element_text(face="bold", color="black", size=10),
  axis.title.y = element_text(face="bold", color="black", size=10),
  plot.title = element_text(face="bold", color = "black", size=12)
)


g3 <- g2 + t1
g4 <- g2 + theme_bw()
g5 <- g2 + theme_bw() + t2 + labs(x="Horsepower", y = "Miles per Gallon", title= "MPG vs Horsepower")


grid.arrange(g2, g3, g4, g5, nrow=1)
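
If you find yourself adding theme_bw() to every plot, you can make it the session default with theme_set() instead. This is just a sketch of that idea; old_theme and p_default are names I made up.

```r
library(ggplot2)

# theme_set() changes the default theme and returns the previous one
old_theme <- theme_set(theme_bw())

p_default <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
p_default                 # drawn with theme_bw(), no theme layer needed

# Restore the previous default so later plots are unaffected
theme_set(old_theme)
```

The default theme is looked up when the plot is drawn, so a plot printed after the restore goes back to the gray background.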

Finally, here's a nice graph using a combination of options:

g2<- ggplot(mtc, aes(x = hp, y = mpg)) + 
  geom_point(size=2, aes(color=factor(vs), shape=factor(vs))) +
  geom_smooth(aes(color=factor(vs)),method = "lm", se = TRUE) +
  scale_color_manual(name ="Engine", 
                     labels=c("V-engine", "Straight engine"),
                     values=c("red","blue")) +
  scale_shape_manual(name ="Engine", 
                     labels=c("V-engine", "Straight engine"),
                     values=c(0,2)) +
  theme_bw() + 
  theme(                              
    axis.title.x = element_text(face="bold", color="black", size=12),
    axis.title.y = element_text(face="bold", color="black", size=12),
    plot.title = element_text(face="bold", color = "black", size=12),
    legend.position=c(1,1),
    legend.justification=c(1,1)) +
  labs(x="Horsepower", 
       y = "Miles per Gallon", 
       title= "Linear Regression (95% CI) of MPG vs Horsepower by Engine type")

g2

>> Reader request: Display Regression Line Equation on Scatterplot

I received a request asking how to overlay the regression equation itself on a plot, so I've decided to update this post with that information.

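There are two ways to put text on a ggplot: annotate() or geom_text(). I was finding that the geom_text() layer did not look very nice on my screen, so I checked up on it and it seems others have this issue as well. The likely cause is that geom_text() draws the label once for every row of the data, so the copies overprint each other and the text looks jagged, while annotate() draws it once. I'll show you how the two behave, at least with the versions of everything I use on my Mac.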

We'll go back to the example where I add a regression line to the plot using geom_smooth(). To add text, you need to run the regression outside of ggplot, extract the coefficients, and then paste them together into some text that you can layer onto the plot.

We're plotting MPG against horsepower so we create an object m that stores the linear model, and then extract the coefficients using the coef() function. We wrap coef() in signif() in order to round the coefficients to two significant digits. I then paste the regression equation text together, using sep="" in order to eliminate spaces.

m <- lm(mpg ~ hp, data = mtc)
a <- signif(coef(m)[1], digits = 2)
b <- signif(coef(m)[2], digits = 2)
textlab <- paste("y = ",b,"x + ",a, sep="")
print(textlab)
## [1] "y = -0.068x + 30"
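
One catch with pasting "x + " in by hand is that a negative intercept would print as "x + -5". Here is a small sketch of a helper that picks the operator from the sign of the intercept. eq_label() is a hypothetical name of mine, not something from ggplot2, and I refit the model so the snippet runs on its own.

```r
# Same regression as above, refit so this snippet is self-contained
fit <- lm(mpg ~ hp, data = mtcars)
a <- signif(coef(fit)[[1]], digits = 2)   # intercept
b <- signif(coef(fit)[[2]], digits = 2)   # slope

# Build the label, choosing " + " or " - " from the sign of the intercept
eq_label <- function(slope, intercept) {
  op <- if (intercept < 0) " - " else " + "
  paste0("y = ", slope, "x", op, abs(intercept))
}

eq_label(b, a)     # "y = -0.068x + 30"
eq_label(2, -5)    # "y = 2x - 5"
```

The result is a plain character string, so it can be used as the label in either geom_text() or annotate().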

Next, I take the original p1 ggplot object, add points and a linear model to it, and then add a layer of text. I will show the two ways here, first using geom_text() and then using annotate().

With both methods, you must specify the x- and y-coordinates for where the text should be centered. In the geom_text() code, notice that label=textlab is included in the aes() statement, while this is not the case for annotate(). If there were mathematical or formatting symbols in the text, I would indicate parse=TRUE instead of FALSE, as we will see in the next example.

##basic ggplot with points and linear model 
p3 <- p1 + geom_point(color="red") + geom_smooth(method = "lm", se = TRUE)

##add regression text using geom_text
r1 <- p3 + geom_text(aes(x = 245, y = 30, label = textlab), color="black", size=5, parse = FALSE)

##add regression text using annotate
r2 <- p3 + annotate("text", x = 245, y = 30, label = textlab, color="black", size = 5, parse=FALSE)

grid.arrange(r1, r2, nrow=1)
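
geom_text() draws one label per row of the data it receives, so when it inherits the full mtcars data the same text is printed 32 times on top of itself. If you want to stay with geom_text(), one workaround is to give that layer its own one-row data frame. This is a sketch under that assumption; label_df and r_single are names I made up.

```r
library(ggplot2)

textlab <- "y = -0.068x + 30"   # the label built earlier in the post

# One row only, with columns named like the plot's x and y variables
label_df <- data.frame(hp = 245, mpg = 30, label = textlab)

r_single <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "red") +
  geom_smooth(method = "lm", se = TRUE) +
  # this layer uses label_df, so the text is drawn exactly once
  geom_text(data = label_df, aes(label = label), color = "black", size = 5)
r_single
```

Because label_df reuses the column names hp and mpg, the text layer can inherit the x and y mappings from the base plot.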

In a fancier way that I got from this StackOverflow page (http://stackoverflow.com/questions/7549694/ggplot2-adding-regression-line-equation-and-r2-on-graph), you can use a function to piece together your text (which would be useful if you were doing this a lot). It also shows how to put in mathematical symbols and formatting changes, like making your variables italic with italic() and adding a dot for the multiplication symbol with %.%.

The function lm_eqn() takes the arguments x, y, and a dataframe and evaluates the same linear model as before. Then it uses the substitute() function to fill the coefficients into the regression equation, which gives an unevaluated R object of class "call".

Finally, the function converts that expression to a character string and returns it. The string is used exactly the same way in the two ggplot statements, EXCEPT that since we now have these formatting changes, we must use parse=TRUE in order to properly display the expression.

##function to create equation expression
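lm_eqn <- function(x, y, df){
  # unless df has columns named x and y, lm() takes them from the function arguments
  m <- lm(y ~ x, df)
  # unname() keeps the coefficient names out of the label
  eq <- substitute(italic(y) == b %.% italic(x) + a,
                   list(a = format(unname(coef(m)[1]), digits = 2), 
                        b = format(unname(coef(m)[2]), digits = 2)))
  # return the expression as a string so that parse=TRUE can read it
  as.character(as.expression(eq))
}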

##add regression equation using geom_text
r3 <- p3 + geom_text(aes(x = 245, y = 30, label = lm_eqn(mtc$hp, mtc$mpg, mtc)), color="black", size=5, parse = TRUE)

##add regression equation using annotate
r4 <- p3 + annotate("text", x = 245, y = 30, label = lm_eqn(mtc$hp, mtc$mpg, mtc), color="black", size = 5, parse=TRUE)

grid.arrange(r3, r4, nrow=1)
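
The same idea extends to showing R-squared next to the equation, which is what the StackOverflow thread is about. The version below is my own adaptation rather than the exact code from that page: lm_eqn_r2() is a hypothetical name, and it takes a formula and a data frame instead of two vectors.

```r
library(ggplot2)

# Build a plotmath string with the fitted equation and R-squared
lm_eqn_r2 <- function(formula, data) {
  m <- lm(formula, data = data)
  eq <- substitute(
    italic(y) == b %.% italic(x) + a * "," ~ ~italic(r)^2 ~ "=" ~ r2,
    list(a  = format(unname(coef(m)[1]), digits = 2),
         b  = format(unname(coef(m)[2]), digits = 2),
         r2 = format(summary(m)$r.squared, digits = 3)))
  as.character(as.expression(eq))
}

lab <- lm_eqn_r2(mpg ~ hp, mtcars)

p_eq <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "red") +
  geom_smooth(method = "lm", se = TRUE) +
  # parse = TRUE so the string is rendered as a math expression
  annotate("text", x = 245, y = 30, label = lab, size = 5, parse = TRUE)
p_eq
```

Since the label is longer now, you may need to move the x-coordinate so the text does not run off the panel.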

Of course, you can change the font and do more formatting on the text itself. Find that information here: http://docs.ggplot2.org/0.9.3.1/geom_text.html

Lastly, I will go over functions in a post that I plan on doing very soon so be on the lookout for that if the function used here is confusing or you'd like to know more.

