Monday, February 17, 2014

ggplot2: Cheatsheet for Visualizing Distributions


In the third and last post of the ggplot series, I will go over interesting ways to visualize the distribution of your data. I will make up some data, and make sure to set the seed.

    library(ggplot2)
    library(gridExtra)

    set.seed(10005)
    xvar <- c(rnorm(1500, mean=-1), rnorm(1500, mean=1.5))
    yvar <- c(rnorm(1500, mean=1), rnorm(1500, mean=1.5))
    zvar <- as.factor(c(rep(1,1500), rep(2,1500)))
    xy <- data.frame(xvar, yvar, zvar)

Histograms

I've already done a [post on histograms](http://rforpublichealth.blogspot.com/2012/12/basics-of-histograms.html) using base R, so I won't spend too much time on them. Here are the basics of doing them in ggplot. [More on all options for histograms here.](http://docs.ggplot2.org/current/geom_histogram.html)

The R cookbook has a nice page about it too: http://www.cookbook-r.com/Graphs/Plotting_distributions_(ggplot2)/

Also, I found [this really great aggregation](http://sape.inf.usi.ch/quick-reference/ggplot2/geom) of all of the possible geom layers and options you can add to a plot. In general the site is a great reference for all things ggplot.

    #counts on y-axis
    g1 <- ggplot(xy, aes(xvar)) + geom_histogram()                                     #horribly ugly default
    g2 <- ggplot(xy, aes(xvar)) + geom_histogram(binwidth=1)                           #change binwidth
    g3 <- ggplot(xy, aes(xvar)) + geom_histogram(fill=NA, color="black") + theme_bw()  #nicer looking

    #density on y-axis
    g4 <- ggplot(xy, aes(x=xvar)) +
      geom_histogram(aes(y = ..density..), color="black", fill=NA) +
      theme_bw()

    grid.arrange(g1, g2, g3, g4, nrow=1)

Notice the message about the default binwidth, which is always reported unless you specify the binwidth yourself. I will remove these messages from all plots that follow to conserve space.

Density plots

We can do basic density plots as well. Note that the default for the smoothing kernel is gaussian, and you can change it to a number of different options, including __kernel="epanechnikov"__ and __kernel="rectangular"__ or whatever you want.
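As a rough sketch of how the kernel option changes the estimate, the snippet below draws the same density with three different kernels side by side. It rebuilds data like xvar so that it runs on its own; the names dens_data, k1, k2 and k3 are made up for this example, and the kernel argument is simply passed through geom_density() to the underlying density calculation.

```r
library(ggplot2)
library(gridExtra)

set.seed(10005)
# same two-component mixture as xvar, rebuilt so this block is self-contained
dens_data <- data.frame(xvar = c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5)))

# one density plot per kernel; only the kernel argument changes
k1 <- ggplot(dens_data, aes(xvar)) + geom_density(kernel = "gaussian") + ggtitle("gaussian")
k2 <- ggplot(dens_data, aes(xvar)) + geom_density(kernel = "epanechnikov") + ggtitle("epanechnikov")
k3 <- ggplot(dens_data, aes(xvar)) + geom_density(kernel = "rectangular") + ggtitle("rectangular")

# arrange the three plots in one row for comparison
grid.arrange(k1, k2, k3, nrow = 1)
```

With this much data the three curves look broadly similar; the rectangular kernel tends to give the most jagged line.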
You can [find all of those options here](http://docs.ggplot2.org/current/stat_density.html).

    #basic density
    p1 <- ggplot(xy, aes(xvar)) + geom_density()

    #histogram with density line overlaid
    p2 <- ggplot(xy, aes(x=xvar)) +
      geom_histogram(aes(y = ..density..), color="black", fill=NA) +
      geom_density(color="blue")

    #split and color by third variable, alpha fades the color a bit
    p3 <- ggplot(xy, aes(xvar, fill = zvar)) + geom_density(alpha = 0.2)

    grid.arrange(p1, p2, p3, nrow=1)

Boxplots and more

We can also look at other ways to visualize our distributions. Boxplots are probably the most useful for describing the summary statistics of a distribution, but sometimes other visualizations are nice. I show a jitter plot and a violin plot. [More on boxplots here.](http://docs.ggplot2.org/0.9.3.1/geom_boxplot.html) Note that I removed the legend from each one because it is redundant.

    #boxplot
    b1 <- ggplot(xy, aes(zvar, xvar)) +
      geom_boxplot(aes(fill = zvar)) +
      theme(legend.position = "none")

    #jitter plot
    b2 <- ggplot(xy, aes(zvar, xvar)) +
      geom_jitter(alpha=I(1/4), aes(color=zvar)) +
      theme(legend.position = "none")

    #violin plot
    b3 <- ggplot(xy, aes(x = xvar)) +
      stat_density(aes(ymax = ..density.., ymin = -..density.., fill = zvar, color = zvar),
                   geom = "ribbon", position = "identity") +
      facet_grid(. ~ zvar) +
      coord_flip() +
      theme(legend.position = "none")

    grid.arrange(b1, b2, b3, nrow=1)

Putting multiple plots together

Finally, it's nice to put different plots together to get a real sense of the data. We can make a scatterplot of the data, and add marginal density plots to each side. Most of the code below I adapted from this [StackOverflow page](http://stackoverflow.com/questions/8545035/scatterplot-with-marginal-histograms-in-ggplot2).

One way to do this is to add distribution information to a scatterplot as a "rug plot". It adds a little tick mark for every point in your data projected onto the axis.

    #rug plot
    ggplot(xy, aes(xvar, yvar)) + geom_point() + geom_rug(col="darkred", alpha=.1)

Another way to do this is to add histograms or density plots or boxplots to the sides of a scatterplot. I followed the StackOverflow page, but let me know if you have suggestions on a better way to do this, especially without the use of the empty plot as a place-holder. I do the density plots by the zvar variable to highlight the differences in the two groups.

    #placeholder plot - prints nothing at all
    empty <- ggplot() + geom_point(aes(1,1), colour="white") +
      theme(
        plot.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        panel.background = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank()
      )

    #scatterplot of x and y variables
    scatter <- ggplot(xy, aes(xvar, yvar)) +
      geom_point(aes(color=zvar)) +
      scale_color_manual(values = c("orange", "purple")) +
      theme(legend.position=c(1,1), legend.justification=c(1,1))

    #marginal density of x - plot on top
    plot_top <- ggplot(xy, aes(xvar, fill=zvar)) +
      geom_density(alpha=.5) +
      scale_fill_manual(values = c("orange", "purple")) +
      theme(legend.position = "none")

    #marginal density of y - plot on the right
    plot_right <- ggplot(xy, aes(yvar, fill=zvar)) +
      geom_density(alpha=.5) +
      coord_flip() +
      scale_fill_manual(values = c("orange", "purple")) +
      theme(legend.position = "none")

    #arrange the plots together, with appropriate height and width for each row and column
    grid.arrange(plot_top, empty, scatter, plot_right, ncol=2, nrow=2, widths=c(4, 1), heights=c(1, 4))

It's really nice how the marginal plots sit right against the scatterplot, but keep in mind that grid.arrange() only places the plots in a grid; it does not guarantee that the panels are aligned or that the scales match exactly. You could get rid of the redundant axis labels by adding in __theme(axis.title.x = element_blank())__ in the density plot code.
I think it comes out looking very nice, without much effort. You could also add linear regression lines and confidence intervals to the scatterplot. Check out my first [ggplot2 cheatsheet for scatterplots](http://rforpublichealth.blogspot.com/2013/11/ggplot2-cheatsheet-for-scatterplots.html) if you need a refresher.
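For the regression-line idea, a minimal sketch might look like the following. It assumes the same simulated data as above (rebuilt here so the block runs by itself), and the name scatter_lm is only illustrative. geom_smooth(method = "lm") fits a linear model per group and, by default, draws a 95% confidence band around each line.

```r
library(ggplot2)

set.seed(10005)
# rebuild the simulated data so this block is self-contained
xvar <- c(rnorm(1500, mean = -1), rnorm(1500, mean = 1.5))
yvar <- c(rnorm(1500, mean = 1), rnorm(1500, mean = 1.5))
zvar <- as.factor(c(rep(1, 1500), rep(2, 1500)))
xy <- data.frame(xvar, yvar, zvar)

# scatterplot with one fitted line (and confidence band) per zvar group
scatter_lm <- ggplot(xy, aes(xvar, yvar, color = zvar)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  scale_color_manual(values = c("orange", "purple"))

print(scatter_lm)
```

Setting se = FALSE would drop the confidence bands and leave only the fitted lines.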

9 comments:

  1. Wow, Slawa this is a great resource! Thanks so much for putting this together.

    Also, I think the 3rd plot under 'Boxplots' is not a volcano plot, but a "violin plot":
    http://en.wikipedia.org/wiki/Violin_plot

    Great stuff.

    Replies
    1. Thanks for the comment! I changed it.

  2. You should check out beanplots, which are basically violin plots, with superimposed boxplots and dot plots. There is a beanplot package for R, but ggplot2 does not include a geom specifically for this. You can easily create one by using geom_violin, geom_boxplot, and geom_point.

    Replies
    1. Yes, I think that's really the beauty of ggplot2 and what I've tried to convey over three posts about it is the idea of layering. You can superimpose layers of points, boxplots, and whatever else you want very easily once you know how to build the different components.

  3. > "It's really nice that grid.arrange() clips the plots together so that the scales are automatically the same. "

    That's not the case, and for this very reason I wouldn't recommend using grid.arrange when the axes ought to be aligned. Consider using gtable instead, e.g. http://stackoverflow.com/a/21531303/471093

  4. Such a great post, Slawa!
    A tip for readers: if you are interested in customizing your graphs in ggplot, check out this blog post on R-bloggers: http://www.r-bloggers.com/how-to-customize-ggplot2-graphics/

  5. That is an extremely smart written article. I will be sure to bookmark it and return to learn extra of your useful information. Thank you for the post. I will certainly return.

