Monday, February 17, 2014

ggplot2: Cheatsheet for Visualizing Distributions

In the third and last of the ggplot series, this post will go over interesting ways to visualize the distribution of your data. I will make up some data, and make sure to set the seed.

library(ggplot2)
library(gridExtra)
set.seed(10005)

xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)
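Before plotting, it can help to confirm that the made-up data look the way we expect. The snippet below is a small sketch I am adding (base R only, nothing beyond the variables defined above): it recreates xy with the same seed and summarizes xvar and yvar within each level of zvar. The group means should land close to the values passed to rnorm().

```r
# Recreate the simulated data from above (same seed, same calls)
set.seed(10005)
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)

# Mean and standard deviation of xvar and yvar within each group of zvar
group_means <- aggregate(cbind(xvar, yvar) ~ zvar, data = xy, FUN = mean)
group_sds   <- aggregate(cbind(xvar, yvar) ~ zvar, data = xy, FUN = sd)
print(group_means)
print(group_sds)
```

Because xvar stacks two groups centered at -1 and 1.5, its overall distribution should look bimodal, while yvar (centered at 1 and 1.5) should overlap much more.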

>> Histograms

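I've already done a post on histograms using base R (http://rforpublichealth.blogspot.com/2012/12/basics-of-histograms.html), so I won't spend too much time on them. Here are the basics of doing them in ggplot. More on all options for histograms here: http://docs.ggplot2.org/current/geom_histogram.html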

The R cookbook has a nice page about it too: http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/

Also, I found this really great aggregation of all of the possible geom layers and options you can add to a plot: http://sape.inf.usi.ch/quick-reference/ggplot2/geom. In general the site is a great reference for all things ggplot.

#counts on y-axis
g1<-ggplot(xy, aes(xvar)) + geom_histogram()                                      #horribly ugly default
g2<-ggplot(xy, aes(xvar)) + geom_histogram(binwidth=1)                            #change binwidth
g3<-ggplot(xy, aes(xvar)) + geom_histogram(fill=NA, color="black") + theme_bw()   #nicer looking

#density on y-axis
g4<-ggplot(xy, aes(x=xvar)) + geom_histogram(aes(y = ..density..), color="black", fill=NA) + theme_bw()

grid.arrange(g1, g2, g3, g4, nrow=1)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.


Notice the message about the default binwidth, which is reported for every histogram unless you set the binwidth yourself (three times here, once each for g1, g3, and g4). I will remove these messages from all plots that follow to conserve space.
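If you would rather choose the binwidth than silence the message, one option is to compute it from the data. The sketch below uses the Freedman-Diaconis rule (2 * IQR / n^(1/3)); that rule is my choice for illustration and not something ggplot2 does for you, and the names bw_fd and h_fd are made up for this example. Newer ggplot2 releases word the message differently (they mention bins = 30), but setting binwidth addresses it the same way.

```r
library(ggplot2)

# Same simulated x variable as above
set.seed(10005)
xy <- data.frame(xvar = c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5)))

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
bw_fd <- 2 * IQR(xy$xvar) / nrow(xy)^(1/3)

# With binwidth set explicitly, ggplot2 does not fall back to its default
h_fd <- ggplot(xy, aes(xvar)) +
  geom_histogram(binwidth = bw_fd, fill = NA, color = "black") +
  theme_bw()
print(h_fd)
```

A data-driven width like this is only a starting point; it is still worth trying a few values by eye, as with binwidth=1 in g2.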

>> Density plots

We can do basic density plots as well. Note that the default smoothing kernel is gaussian, and you can change it to a number of other options, including kernel="epanechnikov" and kernel="rectangular". You can find all of those options here: http://docs.ggplot2.org/current/stat_density.html

#basic density
p1<-ggplot(xy, aes(xvar)) + geom_density()

#histogram with density line overlaid
p2<-ggplot(xy, aes(x=xvar)) + 
  geom_histogram(aes(y = ..density..), color="black", fill=NA) +
  geom_density(color="blue")

#split and color by third variable, alpha fades the color a bit
p3<-ggplot(xy, aes(xvar, fill = zvar)) + geom_density(alpha = 0.2)

grid.arrange(p1, p2, p3, nrow=1)

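To see what changing the kernel actually does, here is a rough sketch that overlays three kernels on the same data and, separately, tightens the bandwidth with adjust. The object names k_plot and k_adjust are just for this example, and the colors are arbitrary.

```r
library(ggplot2)

# Same simulated x variable as above
set.seed(10005)
xy <- data.frame(xvar = c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5)))

# Three kernels drawn as separate layers so the curves can be compared
k_plot <- ggplot(xy, aes(xvar)) +
  geom_density(kernel = "gaussian", color = "black") +
  geom_density(kernel = "epanechnikov", color = "blue") +
  geom_density(kernel = "rectangular", color = "red")

# adjust multiplies the default bandwidth; values below 1 give a wigglier curve
k_adjust <- ggplot(xy, aes(xvar)) +
  geom_density(adjust = 0.5)

print(k_plot)
print(k_adjust)
```

With 3000 points the kernels tend to give very similar curves, so the bandwidth (via adjust) usually matters more than the kernel itself.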

>> Boxplots and more

We can also look at other ways to visualize our distributions. Boxplots are probably the most useful for summarizing a distribution (median, quartiles, and outliers), but sometimes other visualizations are nice. I show a jitter plot and a violin plot. More on boxplots here: http://docs.ggplot2.org/0.9.3.1/geom_boxplot.html. Note that I removed the legend from each one because it is redundant.

#boxplot
b1<-ggplot(xy, aes(zvar, xvar)) + 
  geom_boxplot(aes(fill = zvar)) +
  theme(legend.position = "none")

#jitter plot
b2<-ggplot(xy, aes(zvar, xvar)) + 
  geom_jitter(alpha=I(1/4), aes(color=zvar)) +
  theme(legend.position = "none")

#violin plot
b3<-ggplot(xy, aes(x = xvar)) +
  stat_density(aes(ymax = ..density..,  ymin = -..density..,
               fill = zvar, color = zvar),
               geom = "ribbon", position = "identity") +
  facet_grid(. ~ zvar) +
  coord_flip() +
  theme(legend.position = "none")

grid.arrange(b1, b2, b3, nrow=1)

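The violin plot above is built by hand from stat_density(). ggplot2 also ships a geom_violin() layer, so a shorter alternative might look like the sketch below. The name b4 and the narrow boxplot drawn inside each violin are my additions, not part of the original code.

```r
library(ggplot2)

# Same simulated data as above (only the columns needed here)
set.seed(10005)
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, zvar)

# Violin per group, with a slim boxplot overlaid to show median and quartiles
b4 <- ggplot(xy, aes(zvar, xvar)) +
  geom_violin(aes(fill = zvar)) +
  geom_boxplot(width = 0.1, outlier.shape = NA) +
  theme(legend.position = "none")
print(b4)
```

This puts the groups on the x-axis like the boxplot and jitter plot do, so no facet_grid() or coord_flip() is needed.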

>> Putting multiple plots together

Finally, it's nice to put different plots together to get a real sense of the data. We can make a scatterplot of the data, and add marginal density plots to each side. Most of the code below I adapted from this StackOverflow page: http://stackoverflow.com/questions/8545035/scatterplot-with-marginal-histograms-in-ggplot2

One way to do this is to add distribution information to a scatterplot as a “rug plot”. It adds a little tick mark for every point in your data projected onto the axis.

#rug plot
ggplot(xy,aes(xvar,yvar))  + geom_point() + geom_rug(col="darkred",alpha=.1)

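By default the rug is drawn on the bottom and left axes. As a small variation (a sketch, with the name r_bottom made up for this example), the sides argument of geom_rug() restricts it to one axis, and mapping color to zvar shows which group each tick belongs to.

```r
library(ggplot2)

# Same simulated data as above
set.seed(10005)
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)

# Rug on the bottom axis only ("b"), colored by group
r_bottom <- ggplot(xy, aes(xvar, yvar)) +
  geom_point(alpha = 0.3) +
  geom_rug(aes(color = zvar), sides = "b", alpha = 0.1) +
  theme(legend.position = "none")
print(r_bottom)
```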

Another way to do this is to add histograms, density plots, or boxplots to the sides of a scatterplot. I followed the StackOverflow page, but let me know if you have suggestions on a better way to do this, especially without the use of the empty plot as a placeholder.

I split the density plots by the zvar variable to highlight the differences between the two groups.

#placeholder plot - prints nothing at all
empty <- ggplot()+geom_point(aes(1,1), colour="white") +
     theme(                              
       plot.background = element_blank(), 
       panel.grid.major = element_blank(), 
       panel.grid.minor = element_blank(), 
       panel.border = element_blank(), 
       panel.background = element_blank(),
       axis.title.x = element_blank(),
       axis.title.y = element_blank(),
       axis.text.x = element_blank(),
       axis.text.y = element_blank(),
       axis.ticks = element_blank()
     )

#scatterplot of x and y variables
scatter <- ggplot(xy,aes(xvar, yvar)) + 
  geom_point(aes(color=zvar)) + 
  scale_color_manual(values = c("orange", "purple")) + 
  theme(legend.position=c(1,1),legend.justification=c(1,1)) 

#marginal density of x - plot on top
plot_top <- ggplot(xy, aes(xvar, fill=zvar)) + 
  geom_density(alpha=.5) + 
  scale_fill_manual(values = c("orange", "purple")) + 
  theme(legend.position = "none")

#marginal density of y - plot on the right
plot_right <- ggplot(xy, aes(yvar, fill=zvar)) + 
  geom_density(alpha=.5) + 
  coord_flip() + 
  scale_fill_manual(values = c("orange", "purple")) + 
  theme(legend.position = "none") 

#arrange the plots together, with appropriate height and width for each row and column
grid.arrange(plot_top, empty, scatter, plot_right, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))

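One possible way to avoid building an empty ggplot as the placeholder is to pass an empty grob from the grid package instead. The sketch below assumes that arrangeGrob() from gridExtra accepts a mix of ggplot objects and plain grobs, which is how I understand it to work; the plots are simplified versions of the ones above (the scatterplot legend is moved to the bottom here), and the name combined is made up for this example.

```r
library(ggplot2)
library(gridExtra)
library(grid)   # nullGrob(), grid.newpage(), grid.draw()

# Same simulated data as above
set.seed(10005)
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)

scatter <- ggplot(xy, aes(xvar, yvar)) +
  geom_point(aes(color = zvar)) +
  scale_color_manual(values = c("orange", "purple")) +
  theme(legend.position = "bottom")

plot_top <- ggplot(xy, aes(xvar, fill = zvar)) +
  geom_density(alpha = .5) +
  scale_fill_manual(values = c("orange", "purple")) +
  theme(legend.position = "none")

plot_right <- ggplot(xy, aes(yvar, fill = zvar)) +
  geom_density(alpha = .5) +
  coord_flip() +
  scale_fill_manual(values = c("orange", "purple")) +
  theme(legend.position = "none")

# nullGrob() draws nothing, so it fills the top-right cell in place of `empty`
combined <- arrangeGrob(plot_top, nullGrob(), scatter, plot_right,
                        ncol = 2, nrow = 2, widths = c(4, 1), heights = c(1, 4))
grid.newpage()
grid.draw(combined)
```

The layout is the same 2-by-2 grid as before; only the placeholder changes, so the long theme() call that blanks out every element is no longer needed.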

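It's really nice that grid.arrange() places the plots on one grid, and because the marginal plots map the same variables as the scatterplot, their scales cover the same range. (The panel edges can still be slightly offset when the axis labels differ in width.) You could get rid of the redundant axis labels by adding theme(axis.title.x = element_blank()) to the density plot code. I think it comes out looking very nice without a ton of effort. You could also add linear regression lines and confidence intervals to the scatterplot. Check out my first ggplot2 cheatsheet for scatterplots (http://rforpublichealth.blogspot.com/2013/11/ggplot2-cheatsheet-for-scatterplots.html) if you need a refresher.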
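As a rough sketch of those two tweaks: the code below drops the x-axis title from the top density plot and adds a per-group linear fit with its confidence band to the scatterplot. The names plot_top_clean and scatter_lm are made up for this example, and fitting one line per group (rather than one overall line) is my choice.

```r
library(ggplot2)

# Same simulated data as above
set.seed(10005)
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)

# Top marginal density without the redundant x-axis title
plot_top_clean <- ggplot(xy, aes(xvar, fill = zvar)) +
  geom_density(alpha = .5) +
  scale_fill_manual(values = c("orange", "purple")) +
  theme(legend.position = "none", axis.title.x = element_blank())

# Scatterplot with one linear fit per group; se = TRUE draws the confidence band
scatter_lm <- ggplot(xy, aes(xvar, yvar, color = zvar)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  scale_color_manual(values = c("orange", "purple"))

print(plot_top_clean)
print(scatter_lm)
```

Since xvar and yvar are generated independently within each group, the fitted lines should come out close to flat.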
