Sunday, April 7, 2013

Mastering Matrices

R has many ways to store information.  Most of the time, our data comes in the form of a dataset, which we bring into R as a data.frame object. However, there are times when we want to use matrices as well. This post will show you how matrices can be useful and how to manipulate them easily.

First of all, the big difference between matrices and dataframes is that all of the rows and columns of a matrix must have the same class (numeric, character, etc).  In a dataframe, you can have some of each. See my initial post about objects, here.

You can convert from one to the other using as.data.frame() or as.matrix().  Be careful though, that if you convert a dataframe with different classes of columns, then your matrix will just be all character:


In order to have a numeric matrix, I'm going to just take the first 6 columns of the mydata dataframe. I can delete columns of a matrix or dataframe in two ways:
mydata.mat<-as.matrix(mydata[,1:6])
mydata.mat<-as.matrix(mydata[,-7])
These two lines are doing the exact same thing. In the first, I am subsetting the dataframe mydata by taking all rows and the first 6 columns of the dataframe, then I'm converting that subset to a matrix. In the second, I'm taking all rows and all columns except the 7th column. Note that if I wanted to drop even more columns, I would just use the c() function like this:
mydata.mat<-as.matrix(mydata[,c(-3,-7)])
Note now that since I have taken out the one character column in my dataframe before I convert it to a matrix, I will get a numeric matrix instead of a character matrix:


This kind of operation for deleting columns works the same way in both matrices and dataframes. However, to add a column to a dataframe or matrix is different. In a dataframe, you can use the $ operator to identify columns, like mydata$Married is the vector corresponding to the Married column. However, you can't use the $ operator on matrices. You will get the following error that the "$ operator is invalid for atomic vectors", which I see all the time when I'm converting back and forth from dataframes to matrices and make a mistake:


All this message means is that the object you're using is a matrix (mydata.mat) and you can't use the $ operator on a matrix.  If you get this message, you can either use as.data.frame() to convert your matrix to a dataframe, or you can adjust what you are doing to accomodate the rules of matrices.  For adding columns to a matrix, you use cbind(), and likewise for rows, rbind().

So let's say I want to add an age squared column. In the dataframe, I do:
mydata$agesq<-mydata$Age^2
which instantly names the new column "agesq". Now for a matrix, there are two ways to do this, via indexing by number or by name of the original column:
mydata.mat<-cbind(mydata.mat, mydata.mat[,2]^2)
mydata.mat<-cbind(mydata.mat, mydata.mat[,"Age"]^2)
In the first line, I'm taking all rows and the second column of the mydata.mat matrix and squaring it, then I'm column binding it to the original matrix. In the second line, I'm doing the exact same thing, except that instead of indexing with a number, I can use the name of the column "Age". I get the following after running both statements:



Notice that the last two columns of this matrix do not have names, which can be rectified, by using the colnames() function:
colnames(mydata.mat)[7:8]<-c("AgeSq", "AgeSqAgain")
I don't want to rename everything, so I take the 7th and 8th columns and name those appropriately.

Finally, what can matrices do for us? One important aspect of matrices is of course matrix multiplication, which is how we do any multivariable regression analysis. I'll do a post soon on regression analysis by hand in R. But another reason is that matrices are great way to store values that you return during the course of running a loop.

For example, say I want to show how great the central limit theorem is. I'll generate deviates from  some other distribution, say the Poisson, and take the mean of the draws each time. I'll do this 1000 times and then show what the histogram looks like.  In a problem like this, I'll use a loop.  I'll also use a matrix to store the mean each time.

Ok, we start out by initializing a matrix. We'll create a matrix of all NAs with 1000 rows and 3 columns using the matrix() function:
mat1<-matrix(NA, nrow=1000, ncol=3)


Next, we'll set up the for() loop. Let's look at it first and then go through the logic:
for(i in 1:nrow(mat1)){
  vec1<-rpois(1,1)
  vec2<-rpois(10,1)
  vec3<-rpois(100,1)
  mat1[i,]<-c(mean(vec1), mean(vec2), mean(vec3))
 
}

So in the first line, we're saying for each value of i going from i=1 to i=nrow(mat1), do the stuff in the loop. We could have written 1:1000, but it's nice to leave it as nrow(mat1) since we may want to change the size of mat1 later and this way the loop will still be fine.

Next, we draw from a Poisson distribution three times, each time a larger number of draws (first 1 draw, then 10, then 100), and each time with a lambda of 1.

Finally, and this is where the matrix comes in, we'll take the mean of each one of those vectors and store it. We will store the three values in the ith row of the mat1 matrix, filling in all three columns.  In a longer way, I could have done:
mat1[i,1]<-mean(vec1)
mat1[i,2]<-mean(vec2)
mat1[i,3]<-mean(vec3)
and it would have come out the same, but the first way is nicer since it's more compact. Remember that matrices are just columns and rows of vectors, so you can always assign a vector to a row, as long as it's the same length. When you concatenate numbers (using the c() function), you make a vector, which is why it works.

Now, let's see how the old CLT is working by plotting some histograms:
par(mfrow=c(1,3))
hist(mat1[,1], main="n=1")
hist(mat1[,2], main="n=10")
hist(mat1[,3], main="n=100")

Again, with the histograms, I can plot each column at a time by subsetting the mat1 matrix:

Pretty nice! Other very helpful places to better understand matrices:



Sunday, March 17, 2013

Extracting Information From Objects Using Names()

One of the big differences between a language like Stata compared to R is the ability in R to handle many different types of objects at once, and combine them together or pull them apart.  I had a post about objects last year, but I thought I'd show in this post how to extract information from objects you create in R.

For this example, I'll go back to a dataset I've used in the past called mydata.Rdata and it's in the Code and Data Download site.

One function that is extremely useful to know is names().  The names() function will show you everything that is stored in R under that object name.  So, for example, if you do





where mydata is a dataframe object, you will get the names of the columns, which are the vectors that comprise the dataframe. Note that names(mydata) is an object itself (because everything is an object in R) - it is a character vector of length 7.  You can save this vector and print out the class to verify this.








But names() can be useful for much more than just column names, as we'll see in a moment.

But before we go on, let's take a moment to remember how subsetting works. In subsetting, you use square brackets to pull out exactly the element of an object that you want. So if I want to subset a dataframe, I can say

mydata.subset<-mydata[,c(1:2)]

which is saving into the new object mydata.subset, all the rows and only the first two columns of the mydata dataframe.

Now, let's combine the concept of using the names() function with the concept of subsetting to change one of the column names of our dataset:

names(mydata)[4]<-"Weight_lbs"

Here we are saying, of the names(mydata) object, take the fourth component and make it "Weight_lbs".  Now, if you run the names() on our dataframe, we find the change has been made:




Ok, so now we'll see how the names() function can be used in other applications.

1. Summary objects

There are two ways to extract information from objects in R, using subsetting and using the "$" operator. 

Below, we summarize the Age vector and store the results in sum.vec.  We print out the sum.vec object and the print out the corresponding names.  Now we can extract the 1st element of the summary vector of Age in the following way using the [ ] operator.













This gives us the first element, which is the minimum. We could also do:

sum.vec[c(2,3,5)] 

for the 25th, 50th, and 75th percentiles.


The other way to extract is by using "$".  For example, the summary() function on a table object gives you a Chi squared test:












Here, you can extract any of the pieces of information that came out in the test, including the number of cases, the number of variables, the test statistic, etc.  We can extract the pvalue of the test statistic by using the "$" operator, like this:






Let's see how this can be useful in the next example.

2. Regressions and statistical tests

The standard linear regression that we run in R is using lm().  It looks like this:











But there's a lot more that R has calculated that is not shown here. We can see this by saving this linear regression as an object and running names() on it:




So we see that saved under the reg.object are the coefficients, the residuals, fitted values, degrees of freedom, and a lot more.   To find out everything that names() provides for a given object, look it up by doing ?lm.  Now, to extract any of these components, like the residuals, use the "$" operator like this:

reg.object$residuals

You can make use of this extraction by taking the mean of the residuals





or plotting their distribution:

hist(reg.object$residuals, main="Distribution of Residuals" ,xlab="Residuals")

Don't forget that you can summarize regression objects using summary(), and get the names() of that summary too, like this:

summary(reg.object)
names(summary(reg.object))

which will give you more objects you can extract from your regression. You can use the names() function on any statistical model or function such as aov(), t.test(), chisq.test(), etc.

3.  Histograms and boxplots

Finally, let's go back to that histogram and save that into an object. There are objects under names() of the histogram object now:

 I showed how you can manipulate those in my post on histograms.

Similarly, for boxplot:













Here I've extracted the stats object which gives you the lower whisker, the lower hinge, the median, the upper hinge, and the upper whisker for each group, which you can see below.



Monday, February 25, 2013

Normal distribution functions

Ah, the Central Limit Theorem.  The basis of much of statistical inference and how we get those 95% confidence intervals.  It's just so beautiful!  Lately, I have found myself looking up the normal distribution functions in R.  They can be difficult to keep straight, so this post will give a succinct overview and show you how they can be useful in your data analysis.

To start, here is a table with all four normal distribution functions and their purpose, syntax, and an example:

Purpose Syntax Example
rnorm Generates random numbers 
from normal distribution
rnorm(n, mean, sd) rnorm(1000, 3, .25)
Generates 1000 numbers
from a normal with mean 3
and sd=.25
dnorm Probability Density Function
(PDF)
dnorm(x, mean, sd) dnorm(0, 0, .5)
Gives the density (height of the
PDF) of the normal
with mean=0 and sd=.5. 
pnorm Cumulative Distribution Function
(CDF)
pnorm(q, mean, sd) pnorm(1.96, 0, 1)
Gives the area under the
standard normal curve to
the left of 1.96,
i.e. ~0.975
qnorm Quantile Function - inverse of
pnorm
qnorm(p, mean, sd) qnorm(0.975, 0, 1)
Gives the value at which the
CDF of the standard normal
is .975, i.e. ~1.96

Note that for all functions, leaving out the mean and standard deviation would result in default values of mean=0 and sd=1, a standard normal distribution.

Another important note for the pnorn() function is the ability to get the right hand probability using the lower.tail=FALSE option.  For example,







In the first line, we are calculating the area to the left of 1.96, while in the second line we are calculating the area to the right of 1.96.

With these functions, I can do some fun plotting. I create a sequence of values from -4 to 4, and then calculate both the standard normal PDF and the CDF of each of those values.  I also generate 1000 random draws from the standard normal distribution. I then plot these next to each other. Whenever you use probability functions, you should, as a habit, remember to set the seed. Setting the seed means locking in the sequence of "random" (they are pseudorandom) numbers that R gives you, so you can reproduce your work later on.

set.seed(3000)
xseq<-seq(-4,4,.01)
densities<-dnorm(xseq, 0,1)
cumulative<-pnorm(xseq, 0, 1)
randomdeviates<-rnorm(1000,0,1)
 
par(mfrow=c(1,3), mar=c(3,4,4,2))

plot(xseq, densities, col="darkgreen",xlab="", ylab="Density", type="l",lwd=2, cex=2, main="PDF of Standard Normal", cex.axis=.8)

plot(xseq, cumulative, col="darkorange", xlab="", ylab="Cumulative Probability",type="l",lwd=2, cex=2, main="CDF of Standard Normal", cex.axis=.8)

hist(randomdeviates, main="Random draws from Std Normal", cex.axis=.8, xlim=c(-4,4))

The par() parameters set up a plotting area of 1 row and 3 columns (mfrow), and move the three plots closer to each other (mar). Here is a good explanation of the plotting area.  The output is below:

















Now, when we have our actual data, we can do a visual check of the normality of our outcome variable, which, if we assume a linear relationship with normally distributed errors, should also be normal. Let's make up some data, where I add noise by using rnorm() - here I'm generating the same amount of random numbers as is the length of the xseq vector, with a mean of 0 and a standard deviation of 5.5.

xseq<-seq(-4,4,.01)
y<-2*xseq + rnorm(length(xseq),0,5.5)

And now I can plot a histogram of y (check out my post on histograms if you want more detail) and add a curve() function to the plot using the mean and standard deviation of y as the parameters:

hist(y, prob=TRUE, ylim=c(0,.06), breaks=20)
curve(dnorm(x, mean(y), sd(y)), add=TRUE, col="darkblue", lwd=2)

Here, the curve() function takes as its first parameter a function itself (or an expression) that must be written as some function of x.  Our function here is dnorm(). The x in the dnorm() function is not an object we have created; rather, it's indicating that there's a variable that is being evaluated, and the evaluation is the normal density at the mean of y and standard deviation of y. Make sure to include add=TRUE so that the curve is plotted on the same plot as the histogram.  Here is what we get:


Here are some other good sources on the topic of probability distribution functions:

Friday, February 1, 2013

Converting a dataset from wide to long


I recently had to convert a dataset that I was working with from a wide format to a long format for my analysis.  I struggled with this a bit, but finally found the right sources and the right package to do it, so I thought I'd share my practical example of reshaping data in R. This post is specifically helpful for those using Demographic and Health Survey (DHS) data.

The DHS dataset includes one observation for each woman. For each observation, there are 20 columns for each birth she could have had for 16 different characteristics.  If no birth happened then the cell is left missing. The characteristics include birth month, birth year, sex of the child, death month, death year, who the child lives with, current age of child, and so on.

For clarity, I've shortened the data to just 7 observations and two characteristics of each birth (b2 and b4) for 3 possible births:


Here v012 is the mother's age, all the b2 variables are year of births, and the b4 variables are the sex of the child.

So the first subject, aged 30, has had two births - one in 2000 and one in 2005, both boys.  Since she did not have a third birth yet, her values for b2_03 and b4_03 are missing.

I would like to convert this dataset to one where I have one observation per child born - i.e. from the wide format to the long format. Specifically, I would like the columns to just be the caseid of the mother, age of the mother, year of birth of the child, and sex of the child.

There are a number of ways of doing this in R, including melt() and cast()plyr()aggregate(), and others.  However, after a good deal of struggle and looking things up, I found that the reshape() function is the most intuitive and user-friendly for the needs of this problem. You don't need to use melt and cast at all, which are difficult to manipulate in my opinion.

The reshape() function takes in a number of important parameters that will be necessary for our transformation (there are more parameters than this, but I've boiled it down to the ones that are crucial):

reshape(data, varying = NULL, timevar = "time", idvar = "id", direction, sep = "")

where
data = dataframe you want to convert
varying = columns in the wide format that correspond to a single column in the long format
timevar = name of new variable that differentiates multiple observations from the same individual
idvar = variable in your dataset that identifies multiple records from the same individual
direction = "wide" if you're going from long to wide and "long" if you're going from wide to long
sep = the symbol that separates the name of a varying column from its number

Ok, so the most important thing about the reshape function is that you have to give it variable names that it can understand. Specifically, the columns that I want to turn into one column should all follow the same structure in the naming convention. It can be anything with a pattern like this:

Birthyear_1, Birthyear_2, Birthyear_3
b1, b2, b3
year.1, year.2, year.3

etc.  As long as they follow the same text and number pattern. If the dataset is not in this format, you will have a lot of problems and I suggest changing your variable names before trying to convert. For the first example, you would use sep="_", the second would be sep="", and the third is sep="." and all of these are valid.

Ok, so let's try it out. All variables that we want to convert into one column we can put into the varying parameter and R will sort them out based on the naming patterns. We specify a long direction, that our id variable is caseid, and, importantly, that the separating symbol between the name of the variable and its order number is a "_".

births.long1<-reshape(births.wide, varying=c("b2_01","b2_02","b2_03", "b4_01", "b4_02", "b4_03"), direction="long", idvar="caseid", sep="_")

This gives us the following result (truncated picture):



Which is what we wanted! Each observation is now a birth, with corresponding mother caseid and age, and the year (b2) and sex of the child (b4) for each observation correctly lined up. R has added in the time variable which is just the number after the "_" for the varying variables. Notice that R automatically orders the dataset by this time variable, so you will have to re-order if you want it by caseid.

If you wanted to convert all of the 16 characteristics of each birth, you do not need to write out each individual variable. You can easily do it like this, indicating the number of the columns that you want:

births.long2<-reshape(births.wide, varying=c(3:8), direction="long", idvar="caseid", sep="_", timevar="order")

births.long2<-births.long2[order(births.long2$caseid),]

names(births.long2)<-c("subject","age","order", "birthyear", "childsex")

births.long2<-na.omit(births.long2)

Here, I've specified that all variables from column 3 to column 8 are varying variables, and I've also indicated that the new variable that R creates should be called "order" instead of the default "time". Then I reorder my dataset by caseid, name the new columns, and take out all of the missing observations, which are just non-existent births. I get the following result: 



Notice how subject 6 is completely gone because she did not have any children and this is now a dataset of children ever born.

I've made this example of reshape very specific to DHS data, but there are also many great sources on how to reshape data in R. Here are a couple that I found especially helpful:

Sunday, January 20, 2013

Translating Weird R Errors


I love R. I think it's intuitive and clever and overall a great language. But I do get really annoyed sometimes at the completely ridiculous, cryptic error messages it often gives me.  This post will go over some of those seemingly nonsensical errors so you don't have to go crazy trying to find the bug in your code.

1. all arguments must have the same length

To start with, I just make up some quick data:

prob1<-as.data.frame(cbind(c(1,2,3),c(5,4,3)))
colnames(prob1)<-c("Education","Ethnicity")

And now I just want to do a simple table but I get this error:






What the heck. I look back at my dataset and make sure that both those variables are the same length, which they are. The problem here is that I misspelled "Education".  There's a missing "a" in there and instead of telling me that I referenced a variable that doesn't exist, R bizarrely tells me to check the length of my variables. Remember: Anytime you get an error, check to make sure you've spelled everything right. 

If I do this, everything works out great:
table(prob1$Education, prob1$Ethnicity)


2. replacement has 0 rows, data has 3

A very similar problem, with a very different error message. Let's say I forgot what columns were in my prob1 data and I thought I had a Sex indicator in there. So I try to recode it like this:

This error message is also pretty unhelpful. The syntax is totally correct; the problem is that I just don't have a variable named Sex in my dataset. If I do this instead to recode education, a variable that exists, everything is fine:

prob1$Educ_recode<-as.numeric(prob1$Education==2)


3. undefined columns selected

Ironically, the error we so badly wanted before comes up but for a completely different reason. See if you can find the problem here.  I'll take that same little dataset and I just want to know how many rows there are in which Education is not equal to 1.

So, if I want to know the number of rows of the dataframe prob1, I do:

nrow(prob1)

and if I want to know how many have a value of Education not equal to 1, I do the following (incorrectly) and get an error:






Now I check my variable name and I've definitely spelled Education right this time. The problem, actually, is not that I have referenced a column that doesn't exist but I've messed up the syntax to the nrow() function, in that I haven't defined what columns I want to subset.  When I do,

prob1[prob1$Education!=1]

this doesn't make any sense, because I'm saying to subset prob1 but to do this I have to specify which rows I want and which columns I want.  This just lists one condition in the brackets and it's unclear whether it's for the rows or columns.  See my post on subsetting for more details on this.

If I do it the following way, all is good since I'm saying to subset prob1 with only rows with education !=1 and all columns:

nrow(prob1[prob1$Education!=1,])

So this error message does make sense in a way, but it's still a bit cryptic in my opinion.


Monday, January 14, 2013

For loops (and how to avoid them)

My experience when starting out in R was trying to clean and recode data using for() loops, usually with a few if() statements in the loop as well, and finding the whole thing complicated and frustrating.

In this post, I'll go over how you can avoid for() loops for both improving the quality and speed of your programming, as well as your sanity.

So here we have our classic dataset called mydata.Rdata (you can download this if you want, link at the right):



And if I were in Stata and wanted to create an age group variable, I could just do:

gen Agegroup=1
replace Agegroup=2 if Age>10 & Age<20
replace Agegroup=3 if Age>=20

But when I try this in R, it fails:







Why does it fail? It fails because Age is a vector so the condition if(mydata$Age<10) is asking "is the vector Age less than 10", which is not what we want to know.  We want to ask, row by row is each element of Age<10, so we need to specify the element of the vector we're referring to. We don't specify the element and thus we get the warning (really, error), "only the first element will be used."  So when this fails, the first way people try to solve this problem is with a crazy for() loop like this:

###########Unnecessarily long and ugly code below#######
mydata$Agegroup1<-0

for (i in  1:10){
  if(mydata$Age[i]>10 & mydata$Age[i]<20){
    mydata$Agegroup1[i]<-1
  }
  if(mydata$Age[i]>=20){
    mydata$Agegroup1[i]<-2
  }
}

Here we tell R to go down the rows from i=1 to i=10, and for each of those rows indexed by i, check to see what value of Age it is, and then assign Agegroup a value of 1 or 2.  This works, but at a high cost - you can easily make a mistake with all those indexed vectors, and also for() loops take a lot of computing time, which would be a big deal if this dataset were 10000 observations instead of 10.

So how can we avoid doing this?

One of the most useful functions I have found is one that I have referred to a number of times in my blog so far - the ifelse() function.  The ifelse() function evaluates a condition, and then assigns a value if it's true and a value if it's false.  The great part about it is that it can read in a vector and check each element of the vector one by one so you don't need indices or a loop. You don't even need to initialize some new variable before you run the statement.  Like this:

mydata$newvariable<-ifelse(Condition of some variable,
                    Value of new variable if condition is true
                    Value of new variable if condition is false)

so for example:

mydata$Old<-ifelse(mydata$Age>40,1,0)

This says, check to see if the elements of the vector mydata$Age are greater than 40: if an element is greater than 40, it assigns the value of 1 to mydata$Old, and if it's not greater than 40, it assigns the value of 0 to mydata$Old.

But we wanted to assign values 0, 1, and 2 to an Agegroup variable.  To do this, we can use nested ifelse() statements:

mydata$Agegroup2<-ifelse(mydata$Age>10 & mydata$Age<20,1,     
                  ifelse(mydata$Age>20, 2,0))

Now this says, first check whether each element of the Age vector is >10 and <20.  If it is, assign 1 to Agegroup2.  If it's not, then evaluate the next ifelse() statement, whether Age>20.  If it is, assign Agegroup2 a value of 2.  If it's not any of those, then assign it 0.  We can see that both the loop and the ifelse() statements give us the same result:


You can nest ifelse() statement as much as you like. Just be careful about your final category - it assigns the last value to whatever values are left over that didn't meet any condition (including if a value is NA!) so make sure you want that to happen.


Other examples of ways to use the ifelse() function:
  • If you want to add a column with the mean of Weight by sex for each individual, you can do this with ifelse() like this:
mydata$meanweight.bysex<-ifelse(mydata$Sex==0,  
               mean(mydata$Weight[mydata$Sex==0], na.rm=TRUE),         
               mean(mydata$Weight[mydata$Sex==1], na.rm=TRUE))



  • If you want to recode missing values:
mydata$Height.recode<-ifelse(is.na(mydata$Height),
                      9999, 
                      mydata$Height)

  • If you want to combine two variables together into a new one, such as to create a new ID variable based on year (which I added to this dataframe) and ID:
mydata$ID.long<-ifelse(mydata$ID<10, 
                paste(mydata$year, "-0",mydata$ID,sep=""), 
                paste(mydata$year, "-", mydata$ID, sep=""))



Other ways to avoid the for loop:

  • The apply functions:  If you think you have to use a loop because you have to apply some sort of function to each observation in your data, think again! Use the apply() functions instead.  For example:
  • You can also use other functions such as cut() to do the age grouping above. Here's the post on how this function works, so I won't go over it again, except to say if you convert from a factor to a numeric, *always* convert to a character before converting it to numeric:
mydata$Agegroup3<-as.numeric(as.character(cut(mydata$Age, c(0,10,20,100),labels=0:2)))


Basically, any time you think you have to do a loop, think about how you can do it with another function. It will save you a lot of time and mistakes in your code.


Saturday, December 22, 2012

Basics of Histograms


Histograms are used very often in public health to show the distributions of your independent and dependent variables.  Although the basic command for histograms (hist()) in R is simple, getting your histogram to look exactly like you want takes getting to know a few options of the plot.  Here I present ways to customize your histogram for your needs.

First, I want to point out that ggplot2 is a package in R that does some amazing graphics, including histograms.  I will do a post on ggplot2 in the coming year.   However, the hist() function in base R is really easy and fast, and does the job for most of your histogram-ing needs. However, if you want to do complicated histograms, I would recommend reading up on ggplot2.

Okay so for our purposes today, instead of importing data, I'll create some normally distributed data myself. In R, you can generate normal data this way using the rnorm() function:

BMI<-rnorm(n=1000, m=24.2, sd=2.2) 

So now we have some BMI data, and the basic histogram plot that comes out of R looks like this:

hist(BMI)


Which is actually pretty nice.  There are a number of things that R does by default in creating this histogram, and I think it's useful to print out that information to understand the parameters of this histogram.  You can do this by saving the histogram as an object and then printing it like this:

histinfo<-hist(BMI)
histinfo

And you get the output below:



This is helpful because you can see how R has decided to break up your data by default. It shows the breaks, which are the cutoff points for the bins. It shows the counts, intensity/density for each bin (same thing but two different names for R version compatibility), the midpoints of each bin, and then the name of the variable, whether the bins are equidistant, and the class of the object. You can of course take any one of these outputs by itself, i.e.  histinfo$counts would give you just the vector of counts.

Now we can manipulate this information to customize our plot.

1. Number of bins

R chooses how to bin your data for you by default using an algorithm, but if you want coarser or finer groups, there are a number of ways to do this. We can see that right now from the output above that the breaks go from 17 to 32 by 1.  You can use the breaks() option to change this in a number of ways.  An easy way is just to give it one number that gives the number of cells for the histogram:


hist(BMI, breaks=20, main="Breaks=20")
hist(BMI, breaks=5, main="Breaks=5")



The bins don't correspond to exactly the number you put in, because of the way R runs its algorithm to break up the data but it gives you generally what you want. If you want more control over exactly the breakpoints between bins, you can be more precise with the breaks() option and give it a vector of breakpoints, like this:

hist(BMI, breaks=c(17,20,23,26,29,32), main="Breaks is vector of breakpoints")


This dictates exactly the start and end point of each bin.  Of course, you could give the breaks vector as a sequence like this to cut down on the messiness of the code:

hist(BMI, breaks=seq(17,32,by=3), main="Breaks is vector of breakpoints")

Note that when giving breakpoints, the default for R is that the histogram cells are right-closed (left open) intervals of the form (a,b]. You can change this with the right=FALSE option, which would change the intervals to be of the form [a,b).  This is important if you have a lot of points exactly at the breakpoint.


2. Frequency vs Density 

Often, we are more interested in density than frequency, since frequency is relative to your sample size. Instead of counting the number of datapoints per bin, R can give the probability densities using the freq=FALSE option:


hist(BMI, freq=FALSE, main="Density plot")


Notice the y-axis now.  If the breaks are equidistant, with difference between breaks=1, then the height of each rectangle is proportional to the number of points falling into the cell, and thus the sum of the probability densities adds up to 1.  Here I specify plot=FALSE because I just want the histogram output, not the plot, and show how the sum of all of the densities is 1:


However, if you choose to make bins that are not all separated by 1 (like breaks=c(17,25,26, 32) or something like that), then the plot still has an area of 1, but the area of the rectangles is the fraction of data points falling into the cells. The densities are calculated like this as counts/(n*diff(breaks).  Thus, this adds up to 1 if add up the areas of the rectangles, i.e. you multiply each density by the difference in the breaks like this:











3. Plot aesthetics 

Finally, we can make the histogram better looking by adjusting the x-axis, y-axis, axis labels, title, and color like this:


hist(BMI, freq=FALSE, xlab="Body Mass Index", main="Distribution of Body Mass Index", col="lightgreen", xlim=c(15,35),  ylim=c(0, .20))

Here along with our frequency option, I changed the x-axis label, changed the main title, made the color light green, and provided limits for both the x-axis and y-axis.  Note that defining the look of you axis using xlim() will not have any impact on the bins - this option is only for the aesthetics of the plot itself.

Finally, I can add a nice normal distribution curve to this plot using the curve() function, in which I specify a normal density function with mean and standard deviation that is equal to the mean and standard deviation of my data, and I add this to my previous plot with a dark blue color and a line width of 2. You can play around with these options to get the kind of line you want:

curve(dnorm(x, mean=mean(BMI), sd=sd(BMI)), add=TRUE, col="darkblue", lwd=2) 


And we get this!