Monday, October 15, 2012

What a nice looking scatterplot!


This week, we look at plotting data using scatterplots. I'll cover other ways of plotting data, like boxplots and histograms, in a later post.

Our data from last week remains the same:
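If you don't have last week's data handy, here is a small made-up stand-in so the commands in this post can be run as-is. The column names (Age, Sex, Weight, Height) and the coding of Sex==1 as male come from the code in this post. The values, the ID column, the column order, and the coding of 2 as female are assumptions for illustration only.

```r
## Hypothetical stand-in for mydata: the values and the column order are invented
mydata <- data.frame(
  ID     = 1:8,
  Age    = c(23, 35, 41, 29, 52, 60, 18, 47),
  Sex    = c(1, 2, 1, 2, 2, 1, 2, 1),                  # assumed coding: 1 = male, 2 = female
  Weight = c(180, 130, 165, NA, 120, 190, 110, 175),   # one NA, to mimic missing data
  Height = c(70, 64, 68, 66, 62, 72, 61, 69)
)

## Quick checks on the structure before plotting
str(mydata)
summary(mydata[, c(2, 4, 5)])   # in this stand-in, columns 2, 4, and 5 are Age, Weight, Height
```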


First, a quick way to look at all of your continuous variables at once is to call plot() on the data frame itself.  Here, I subset the data to just three columns and plot those against each other:

plot(mydata[,c(2,4,5)])

This takes all rows, and the columns 2, 4, and 5 from the dataset and plots them all against each other, like this:
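As far as I know, calling plot() on a data frame with more than two columns hands the work to pairs(), so you can also call pairs() directly. The sketch below does that and selects the columns by name rather than by position, which is a little safer if the column order ever changes. The data are made up for illustration.

```r
## Made-up data for illustration only
mydata <- data.frame(
  Age    = c(23, 35, 41, 29, 52, 60),
  Weight = c(180, 130, 165, 150, 120, 190),
  Height = c(70, 64, 68, 66, 62, 72)
)

## Select columns by name instead of by position, then draw the scatterplot matrix
vars <- c("Age", "Weight", "Height")
pairs(mydata[, vars], main = "Pairwise scatterplots")

## Pairwise correlations for the same columns, as a numeric companion to the plot
round(cor(mydata[, vars], use = "pairwise.complete.obs"), 2)
```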



Next, I want to make a nice scatterplot of Height on Weight, with Weight on the x axis and Height on the y axis.  The basic format is

plot(xvariable, yvariable)

So it looks like this:

plot(mydata$Weight, mydata$Height)
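An alternative that some people find easier to read is the formula interface, where you write y ~ x and pass the data frame separately. This is only a sketch with made-up numbers. The axis labels should default to the variable names rather than to mydata$Weight and mydata$Height.

```r
## Made-up data for illustration only
mydata <- data.frame(
  Weight = c(180, 130, 165, 150, 120, 190),
  Height = c(70, 64, 68, 66, 62, 72)
)

## Formula interface: y ~ x, so Height goes on the y axis and Weight on the x axis
plot(Height ~ Weight, data = mydata)
```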


But this is a little ugly.  Fortunately, there are a million options that I can take advantage of.  In this first post on plotting, I will:

  • add labels for the x and y axes (xlab and ylab, respectively) 
  • change the range of the x and y axes so the plot is not quite so condensed (xlim and ylim)
  • add a title (main) and change the font size of the title (cex.main)
  • get rid of the frame around the plot (frame.plot=FALSE)
  • change the type of plotting symbol from little circles to little triangles (pch=2) and make those little triangles blue (col="blue").


plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE , col="blue")



Now, let's get a little more complicated.  I want the color of the plot symbol to indicate whether the observation is male or female, and I want to add a legend too. This is super easy right inside the plot function call using the ifelse() function.  To review, ifelse() is similar to the cond() function in Stata.  It looks like this:

ifelse(condition, result if condition is true, result if condition is false)

So here I change my parameter col="blue" to col=ifelse(mydata$Sex==1, "red", "blue"). This is saying that if the sex is a 1, make the color of the triangle red, else make it blue:

plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE, col=ifelse(mydata$Sex==1, "red", "blue"))
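To see why this works, it helps to look at what ifelse() returns on its own: it evaluates the condition for every element and gives back a vector of the same length, so each point gets its own color. The Sex codes and measurements below are made up for illustration, and treating 2 as female is an assumption.

```r
## Made-up Sex codes for illustration (1 = male as in the post; 2 = female is assumed)
sex <- c(1, 2, 2, 1, 2)

## ifelse() works element by element and returns one color per observation
point_colors <- ifelse(sex == 1, "red", "blue")
print(point_colors)

## The same vector can be passed to col= so each point gets its own color
plot(c(180, 130, 120, 175, 140), c(70, 64, 62, 69, 65),
     pch = 2, col = point_colors, xlab = "Weight (lbs)", ylab = "Height (inches)")
```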

Then I add in the legend. The first two parameters of the legend function are the x and y coordinates of the legend's top-left corner (here the point (80,75)).  Then I give one plotting symbol per legend entry (pch=c(2,2)): the vector has two elements because there are two entries, and both are 2 because both groups are drawn with pch=2 (a triangle), as in the previous example.  Next I say I want the first triangle to be red and the second one blue.  I label the two symbols "Male" and "Female".  Next, I indicate I want a box around the legend (bty="o") and that I want the box to be darkgreen.  Finally, I shrink the legend text to 80% of its default size (cex=.8).

legend(80, 75, pch=c(2,2), col=c("red", "blue"), c("Male", "Female"), bty="o",  box.col="darkgreen", cex=.8)
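Because pch, col, and the labels are all vectors with one element per legend entry, you can give each group its own symbol as well as its own color. The sketch below is a variation on the plot above and not what the post's figure shows. The data are made up, and using plus signs (pch=3) for the second group is my choice for illustration.

```r
## Made-up data for illustration only (Sex coding 1 = male, 2 = female is assumed)
weight <- c(180, 130, 120, 175, 140, 165)
height <- c(70, 64, 62, 69, 65, 68)
sex    <- c(1, 2, 2, 1, 2, 1)

## One symbol and one color per group: triangles (2) for males, plus signs (3) for females
plot(weight, height, xlab = "Weight (lbs)", ylab = "Height (inches)",
     xlim = c(80, 200), ylim = c(55, 75),
     pch = ifelse(sex == 1, 2, 3), col = ifelse(sex == 1, "red", "blue"))

## Each vector has one element per legend entry, in the same order as the labels
legend("topleft", legend = c("Male", "Female"),
       pch = c(2, 3), col = c("red", "blue"), bty = "o", cex = 0.8)
```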



Incidentally, I could get the same plot by passing "topleft" as the position in my legend() call, as below.  But sometimes it's nice to put the legend exactly where you want it, and the keyword option only allows for "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", and "center".

legend("topleft", pch=c(2,2), col=c("red", "blue"), c("Male", "Female"), bty="o", cex=.8, box.col="darkgreen")

Finally, one of the really nice aspects of R is being able to manipulate the plot region and make it do exactly what you want.  For starters, we can have two plots side by side, by indicating:

par(mfrow=c(1,2))

meaning one row and two columns. If I wanted a 2 by 2 plot area, I would do mfrow=c(2,2).
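One thing to keep in mind is that the mfrow setting stays in effect for later plots on the same device. A common pattern, sketched here with throwaway data, is to save the old settings when you change them and restore them when you are done. The name op is arbitrary.

```r
## Save the current settings while switching to 1 row and 2 columns
op <- par(mfrow = c(1, 2))

## Two throwaway plots with made-up numbers, just to fill the two panels
plot(1:10, (1:10)^2, main = "Left panel")
plot(1:10, sqrt(1:10), main = "Right panel")

## Restore the previous layout so later plots use the whole device again
par(op)
```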

Now I'll show the full code for the plot below.  The first part is just the first plot we already did, but I add in a vertical line at the average weight and add in text.  The second plot is Height on Age, and I add in the linear regression line.  This is quite easy.  I start by running the regression of Height on Age and save it as "reg".  Then I use abline() to add the line to the plot.  Finally, I use the text() function to add text to the plot anywhere I want.  I walk you through the code below:


##set up the plot area with 1 row and 2 columns of plots
par(mfrow=c(1,2))

##first plot height on weight
plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE, col="blue")

##add in the vertical line at the mean of the weight, using na.rm=TRUE to remove the NAs from consideration
abline(v=mean(mydata$Weight, na.rm=TRUE), col="orange")

##add in text at the point (140, 73), scaled to 80% of the default size (cex=.8). The position is 4, meaning that the text is placed to the right of the starting point. The "\n" is a newline (moves the text to the next line)
text(140,73, cex=.8, pos=4, "Orange line is\n sample average\n weight")

##add in the second plot of height on age
plot(mydata$Age, mydata$Height, xlab="Age (years)", ylab="Height (inches)", xlim=c(0,80), ylim=c(55,75), main="Height vs Age", pch=3, cex.main=1.5, frame.plot=FALSE, col="darkred")


##run a linear regression of Height on Age - if this is confusing, I'll do a post on linear regressions very soon
reg<-lm(Height~Age, data=mydata)

##add the regression line to the plot
abline(reg)


##add text to the plot. Start at the point (0,72).  Position the text to the right of this point (pos=4), make the font smaller (cex=.8), and build the text with the paste function since I'm pasting in both text and the contents of some variables. For the text, I extract the intercept by taking the first coefficient from the reg object with the code reg$coef[1], and the coefficient on Age by taking the second coefficient, reg$coef[2].  I round both of those to two decimal places using round(x,2). There's a lot going on here but hopefully I've unpacked it for everyone.
text(0,72, paste("Height ~ ", round(reg$coef[1],2), "+", round(reg$coef[2],2), "*Age"), pos=4, cex=.8)
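One thing to watch in the text() call above: paste() puts a space between each piece, and a negative slope would print as "+ -0.05". Here is a hedged alternative using coef() and sprintf() that always shows two decimals and handles the sign explicitly. The data are made up, and eq_label is a name I chose for illustration.

```r
## Made-up data for illustration only
mydata <- data.frame(
  Age    = c(23, 35, 41, 29, 52, 60, 18, 47),
  Height = c(70, 64, 68, 66, 62, 72, 61, 69)
)

reg <- lm(Height ~ Age, data = mydata)
b <- coef(reg)   # named vector with elements "(Intercept)" and "Age"

## Build the label with two decimals and an explicit sign for the slope
eq_label <- sprintf("Height ~ %.2f %s %.2f*Age",
                    b[["(Intercept)"]],
                    if (b[["Age"]] < 0) "-" else "+",
                    abs(b[["Age"]]))

## Same plot, line, and text placement as in the walkthrough above
plot(mydata$Age, mydata$Height, xlab = "Age (years)", ylab = "Height (inches)",
     xlim = c(0, 80), ylim = c(55, 75), pch = 3, col = "darkred")
abline(reg)
text(0, 72, eq_label, pos = 4, cex = 0.8)
```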


Plots are really fun to do in R.  This post was just a basic introduction and more will come on the many other interesting plotting features one can take advantage of in R.  If you want to see more options in R plotting, you can always look at R documentation, or other R blogs and help pages.  Here are a few:






