
Sunday, June 8, 2014

How to write and debug an R function


I've been asked on a few occasions what is the deal with R user-written functions. First of all, how does the syntax work? And second of all, why would you ever want to do this?

In Stata, we don't write functions; we execute built-in commands like browse or gen or logit. You can write your own programs that create new commands (like ado files) but it's less common for users to do so.

In R, there are built-in functions like summary() or glm() or median(), but you can also write your own functions. You can write a quick, one-line function or long elaborate functions. I use functions all the time to make my code cleaner and less repetitive.

In this post I'll go over the basics of how to write functions. In the next post, I'll explain what kinds of functions I have used commonly in public health research that have improved my data cleaning and analyses.

Basic syntax of a function

A function needs a name, usually at least one argument (although it doesn't have to have any), and a body of code that does something. At the end, it usually should (although doesn't have to) return an object. The important idea behind functions is that objects created within the function are local to the function's environment: they don't exist outside of the function. But you can “return” the value of an object from the function, meaning pass its value into the global environment. I'll go over this in more detail below.

Functions need to have curly braces around the statements, like so:

name.of.function <- function(argument1, argument2) {
    statements
    return(something)
}

The argument can be any type of object (like a scalar, a matrix, a dataframe, a vector, a logical, etc), and it's not necessary to define what it is in any way. As a very simple example, we can write a function that squares an incoming argument. The function below takes the argument x and multiplies it by itself. It saves this value into the object called square, and then it returns the value of the object square.

Writing and calling a function

square.it <- function(x) {
    square <- x * x
    return(square)
}

I can now call the function by passing in a scalar or a vector or matrix as its argument, because all of those objects will square nicely. But it won't work if I input a character as its argument because although it will pass “hi” into the function, R can't multiply “hi”.

# square a number
square.it(5)
## [1] 25
# square a vector
square.it(c(1, 4, 2))
## [1]  1 16  4
# square a character (not going to happen)
square.it("hi")
## Error: non-numeric argument to binary operator

I can also pass in an object that I already have saved. For example, here I have a matrix called matrix1, so I pass that into the square.it() function. R takes this matrix1 into the function as x. That is, in the local function environment it is now called x, where it is squared, and returned.

matrix1 <- cbind(c(3, 10), c(4, 5))
square.it(matrix1)
##      [,1] [,2]
## [1,]    9   16
## [2,]  100   25

Local vs. global environment

Now, it's not necessarily the case that you must use return() at the end of your function. The reason you return an object is if you've saved the value of your statements into an object inside the function - in this case, the objects in the function are in a local environment and won't appear in your global environment. See how it works in the following two examples:

fun1 <- function(x) {
    3 * x - 1
}
fun1(5)
## [1] 14
fun2 <- function(x) {
    y <- 3 * x - 1
}
fun2(5)

In the first function, I just evaluate the statement 3*x-1 without saving it anywhere inside the function. So when I run fun1(5), the result comes popping out of the function. However, when I call fun2(5), nothing happens. That's because the object y that I saved my result into doesn't exist outside the function and I haven't used return(y) to pass the value of y outside the function. When I try to print y, it doesn't exist because it was created in the local environment of the function.

print(y)
## Error: object 'y' not found

I can return the value of y by calling return(y) at the end of the function fun2, but I can't return the object itself; it's stuck inside the function.
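To make this concrete, here is a variant of fun2 (named fun3 here, purely for illustration) that does use return(), so the value makes it back out:

```r
# same computation as fun2, but the value of y is now returned
fun3 <- function(x) {
    y <- 3 * x - 1
    return(y)
}
fun3(5)
## [1] 14
result <- fun3(5)  # the returned value can be saved into a global object
```

Note that y itself still doesn't exist in the global environment afterward; only its value came out and was saved into result.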

Getting more complex

Obviously writing a whole function to square something when you could just use the ^ operator is silly. But you can do much more complicated things in functions, once you get the hang of them.

Calling other functions and passing multiple arguments

First, you can pass multiple arguments into a function and you can call other functions within your function. Here's an example. I'm passing in 3 arguments which I want to be a matrix, a vector, and a scalar. In the function body, I first call my previous function square.it() and use it to square the scalar. Then I multiply the matrix by the vector. Then I multiply those two results together and return the final object.

my.fun <- function(X.matrix, y.vec, z.scalar) {

    # use my previous function square.it() to square the scalar and save result
    sq.scalar <- square.it(z.scalar)

    # multiply the matrix by the vector using %*% operator
    mult <- X.matrix %*% y.vec

    # multiply the two resulting objects together to get a final object
    final <- mult * sq.scalar

    # return the result
    return(final)
}

When you call a function with multiple arguments, R matches them by position (the first argument you pass corresponds to X.matrix, the second to y.vec, and so on), but for clarity I would name the arguments in the function call. In the example below, I already have two saved objects, my.mat and my.vec, that I pass in as the X.matrix and y.vec arguments, and then I just assign the z.scalar argument the number 9.

# save a matrix and a vector object
my.mat <- cbind(c(1, 3, 4), c(5, 4, 3))
my.vec <- c(4, 3)

# pass my.mat and my.vec into the my.fun function
my.fun(X.matrix = my.mat, y.vec = my.vec, z.scalar = 9)
##      [,1]
## [1,] 1539
## [2,] 1944
## [3,] 2025
# this is the same as
my.fun(my.mat, my.vec, 9)
##      [,1]
## [1,] 1539
## [2,] 1944
## [3,] 2025

Returning a list of objects

Also, if you need to return multiple objects from a function, you can use list() to list them together. An example of this is my blog post on sample size functions.

For example:

another.fun <- function(sq.matrix, vector) {

    # transpose matrix and square the vector
    step1 <- t(sq.matrix)
    step2 <- vector * vector

    # save both results in a list and return
    final <- list(step1, step2)
    return(final)
}

# call the function and save result in object called outcome
outcome <- another.fun(sq.matrix = cbind(c(1, 2), c(3, 4)), vector = c(2, 3))

# print the outcome list
print(outcome)
## [[1]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## 
## [[2]]
## [1] 4 9

Now to separate those objects for use in your further code, you can extract them using the [[]] operator:

### extract first in list
outcome[[1]]
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

## extract second in list
outcome[[2]]
## [1] 4 9

Tricks for troubleshooting and debugging

When you execute multiple statements in a function, sometimes things go wrong. What's nice about functions is that R evaluates every statement until it reaches an error. So in the last function, the dimensions of the objects really matter: you can't multiply matrices of incompatible dimensions. Like this:

my.fun(X.matrix = my.mat, y.vec = c(2, 3, 6, 4, 1), z.scalar = 9)
## Error: non-conformable arguments

Using the debug() function

When you have an error, one thing you can do is use R's built-in debugger debug() to find at what point the error occurs. You indicate which function you want to debug, then run your statement calling the function, and R shows you at what point the function stops because of errors:

debug(my.fun)
my.fun(X.matrix = my.mat, y.vec = c(2, 3, 6, 4, 1), z.scalar = 9)
## debugging in: my.fun(X.matrix = my.mat, y.vec = c(2, 3, 6, 4, 1), z.scalar = 9)
## debug at <text>#1: {
##     sq.scalar <- square.it(z.scalar)
##     mult <- X.matrix %*% y.vec
##     final <- mult * sq.scalar
##     return(final)
## }
## debug at <text>#4: sq.scalar <- square.it(z.scalar)
## debug at <text>#7: mult <- X.matrix %*% y.vec
## Error: non-conformable arguments

We see that the first line calling the square.it() function was fine, but then an error occurred in the line defining mult. This debugging is useful especially if you had many more statements in your function that multiplied matrices and you weren't sure which one was causing the issues. So now we know the problem is that X.matrix and y.vec won't multiply. But we still need to know why they won't multiply.

More on debugging can be found here.
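As a side note, the prompt that debug() opens accepts a few standard browser commands (n steps to the next statement, c continues to the end of the function, Q quits the debugger), and undebug() turns the debugging flag back off. A minimal self-contained illustration, using a toy function:

```r
f <- function(x) x + 1  # a toy function, just for illustration
debug(f)                # flag f for debugging

# calling f(2) at this point would drop you into the browser prompt,
# where n, c, and Q control stepping as described above

undebug(f)              # remove the debugging flag when you're done
isdebugged(f)
## [1] FALSE
```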

Printing out what's happening (sanity checks)

At this point, a good way to troubleshoot this is to print out the dimensions or lengths of the objects or even the objects themselves that are going into the statement causing errors. The great part about functions is that they evaluate all the way until there's an error. So you can see what is happening inside your function before the error.

If the object is too long, you can print(head(object)). This helps to see if you're doing what you think you're doing. Note that you have to use the function print() to actually print out anything from within a function.

my.fun <- function(X.matrix, y.vec, z.scalar) {
    print("xmatrix")
    print(X.matrix)

    print("yvec")
    print(y.vec)

    print("Dimensions")
    print(dim(X.matrix))
    print(length(y.vec))

    # use my previous function square.it() to square the scalar and save result
    sq.scalar <- square.it(z.scalar)
    print(paste("sq.scalar=", sq.scalar))

    # multiply the matrix by the vector using %*% operator
    mult <- X.matrix %*% y.vec

    # multiply the two resulting objects together to get a final object
    final <- mult * sq.scalar

    # return the result
    return(final)
}

my.fun(X.matrix = my.mat, y.vec = c(2, 3, 6, 4, 1), z.scalar = 9)
## [1] "xmatrix"
##      [,1] [,2]
## [1,]    1    5
## [2,]    3    4
## [3,]    4    3
## [1] "yvec"
## [1] 2 3 6 4 1
## [1] "Dimensions"
## [1] 3 2
## [1] 5
## [1] "sq.scalar= 81"
## Error: non-conformable arguments

Now we can see the actual dimensions of the objects and fix them accordingly. This example is really simple, but you can imagine that if you've written a long function that uses many arguments, you could easily lose track of them and not be sure where the issue in your function was. You can also throw in these statements along the way as sanity checks to make sure that things are proceeding as you think they should, even if there isn't any error.

Using the stop() and stopifnot() functions to write your own error messages

One other trick you can use is writing your own error messages using the stop() and stopifnot() functions. In this example, if I know I need dimensions to be the right size, I can check them and print out a message that says they are incorrect. That way I know what the issue is immediately. Here's an example:

my.second.fun <- function(matrix, vector) {

    if (dim(matrix)[2] != length(vector)) {
        stop("Can't multiply matrix%*%vector because the dimensions are wrong")
    }

    product <- matrix %*% vector

    return(product)

}

# function works when dimensions are right
my.second.fun(my.mat, c(6, 5))
##      [,1]
## [1,]   31
## [2,]   38
## [3,]   39
# function call triggered error
my.second.fun(my.mat, c(6, 5, 7))
## Error: Can't multiply matrix%*%vector because the dimensions are wrong

You can write these kinds of error messages as checks for yourself so you know exactly what triggered the error. As another example, you could put in a check that an object isn't 0 before you divide by it.
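stopifnot() is the more compact of the two: you hand it one or more conditions, and it stops with an error if any of them is false. Here is a sketch combining it with the divide-by-zero check just mentioned (the function name and messages are my own, hypothetical choices):

```r
safe.ratio <- function(numerator, denominator) {
    # stopifnot() errors out if any condition evaluates to FALSE
    stopifnot(is.numeric(numerator), is.numeric(denominator))
    if (denominator == 0) {
        stop("Can't divide because the denominator is 0")
    }
    return(numerator/denominator)
}

safe.ratio(10, 4)
## [1] 2.5
```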

Good function writing practices

Based on my experience, there are a few good practices that I would recommend keeping in mind when writing functions.

  1. Keep your functions short. Remember you can use them to call other functions!

    • If things start to get very long, you can probably split up your function into more manageable chunks that call other functions. This makes your code cleaner and easily testable.
    • It also makes your code easy to update. You only have to change one function and every other function that uses that function will also be automatically updated.
  2. Put in comments on what are the inputs to the function, what the function does, and what is the output.

  3. Check for errors along the way.

    • Try out your function with simple examples to make sure it's working properly
    • Use debugging and error messages, as well as sanity checks as you build your function.
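For point 2, the comments don't have to follow any particular format; something as simple as this (a made-up helper, just to illustrate the idea) goes a long way:

```r
# standardize: center and scale a numeric vector
# Input:  x, a numeric vector with at least two values
# Output: a numeric vector with mean 0 and standard deviation 1
standardize <- function(x) {
    stopifnot(is.numeric(x), length(x) > 1)
    return((x - mean(x))/sd(x))
}
```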

The next post will go over examples of useful functions that you can use in your day to day R coding.



Friday, October 4, 2013

Loops revisited: How to rethink macros when using R


If you're a Stata user (or SAS for that matter), you are most likely a big fan of macros. They're very helpful when you're repeating the same actions over and over again. In R, we don't have macros. Instead we have functions and loops, and even better than loops are the apply functions. I already had one post on the apply() function about a year ago, so as this is the one year anniversary of my blog (yay!), I revisit apply() and show even more examples of how incredibly versatile and useful this function is, once you get used to the syntax. I also show where loops can be useful.

This blog post is inspired by a great UW-Madison computing site that I found when I was searching for a way to do a loop in Stata. It goes through all of the ways you may want to loop using macros in Stata, so in this blog post I show how to solve the same problems in R.

The R perspective

The first thing to do when you're trying to solve a problem in R that you've handled in Stata with macros is to stop thinking “macro” and start thinking “objects”. The idea is that in R you have a workspace in which to store many different objects: vectors, dataframes, matrices, lists, etc. I went over all of these in a series of blog posts called “Data types” in November of last year.

You can use the power of objects to change the way you're thinking about your programming problem. Let's start with U-W's first example: running multiple regressions of various types using a fixed set of control variables. We'll run a linear regression and a logit. In Stata you do:

local controlVars age sex occupation location maritalStatus hasChildren
reg income education `controlVars'
logit employed education `controlVars'

In R, we can take those local controlVars and put them into a new object, for example a matrix. Then we use that same matrix in all of our regressions. Here is an example. We create some data:

set.seed(10)
x <- rnorm(100, 5, 2)
z <- rnorm(100, 6, 5)
w <- rnorm(100, 3, 2)
y <- x * 2 + w * 0.5 + rnorm(100, 0, 1)
ybin <- as.numeric(y < 10)

mydata <- as.data.frame(cbind(x, z, w, y, ybin))

And now if we have two models to run, a linear and a logit, we can create a matrix of explanatory variables that we put on the right hand side each time. You don't need the “data=mydata” part since ybin is also a vector in our workspace, but generally if you were to import this data as a dataframe, then you would need to include it or you would need to create a separate ybin vector object from the dataframe you imported.

xvars <- cbind(x, z, w)
summary(lm(ybin ~ xvars, data = mydata))
## 
## Call:
## lm(formula = ybin ~ xvars, data = mydata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6179 -0.2446  0.0175  0.2177  0.6699 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.44087    0.10888   13.23   <2e-16 ***
## xvarsx      -0.20002    0.01673  -11.95   <2e-16 ***
## xvarsz      -0.00179    0.00656   -0.27    0.785    
## xvarsw      -0.03123    0.01631   -1.91    0.058 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.313 on 96 degrees of freedom
## Multiple R-squared:  0.605,  Adjusted R-squared:  0.593 
## F-statistic:   49 on 3 and 96 DF,  p-value: <2e-16
summary(glm(ybin ~ xvars, family = binomial(logit), data = mydata))
## 
## Call:
## glm(formula = ybin ~ xvars, family = binomial(logit), data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3283  -0.0962  -0.0030   0.0720   2.1836  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  20.0872     5.3972    3.72  0.00020 ***
## xvarsx       -4.0466     1.0590   -3.82  0.00013 ***
## xvarsz       -0.0136     0.1197   -0.11  0.90967    
## xvarsw       -1.1068     0.4088   -2.71  0.00679 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 133.750  on 99  degrees of freedom
## Residual deviance:  29.772  on 96  degrees of freedom
## AIC: 37.77
## 
## Number of Fisher Scoring iterations: 8

Next, we want to run the regression if the data meet some requirement. In Stata we would do:

local blackWoman race==1 & female
reg income education `controlVars' if `blackWoman'

Again in R, think objects. We can subset our original dataframe mydata into another dataframe, data.sub. I go over subsetting in a blog post here. I take the subset of my data based on the conditions I want, and to stick with the above example, I make an xvars.sub matrix as well so I can combine the two methods. Notice the degrees of freedom have been reduced, since we're only using data in which x>2 and z<3:

data.sub <- as.data.frame(mydata[x > 2 & z < 3, c("x", "z", "ybin")])
xvars.sub <- as.matrix(data.sub[, c("x", "z")])
summary(lm(ybin ~ xvars.sub, data = data.sub))
## 
## Call:
## lm(formula = ybin ~ xvars.sub, data = data.sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4706 -0.1793 -0.0171  0.1504  0.6144 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.4185     0.1611    8.81  1.5e-09 ***
## xvars.subx   -0.2113     0.0287   -7.38  5.0e-08 ***
## xvars.subz   -0.0434     0.0238   -1.83    0.079 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.274 on 28 degrees of freedom
## Multiple R-squared:  0.672,  Adjusted R-squared:  0.649 
## F-statistic: 28.7 on 2 and 28 DF,  p-value: 1.67e-07

Well now that we have the basics down, let's get to some of the more interesting problems of loops.

Looping over variables

In Stata, if we want to run regressions for three different outcome variables, we can do it this way, via a foreach loop:

foreach yvar in mpg price displacement {
reg `yvar' foreign weight
}

In R, we can use apply(). The syntax of apply() is three arguments:

apply(dataset, margin, function)

  • the first argument is the dataframe or matrix that you want to apply your function to,
  • the second argument is the margin, meaning are you doing it over rows (margin=1) or over columns (margin=2) of the data,
  • and the third argument is the function that you want to apply.

In this case, we can use the apply() function as follows. I will take the two columns of mydata (y and ybin), and for each of those columns (since margin=2) I apply the function that I create in the third argument. The function I create is to take an argument (outcome) which refers to each of those two columns and to run a linear model on each of those columns with x and z as the explanatory variables:

apply(mydata[,c("y","ybin")], 2, function(outcome){summary(lm(outcome~x+z))})
## $y
## 
## Call:
## lm(formula = outcome ~ x + z)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.863 -0.887 -0.035  0.899  3.570 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.7342     0.4299    4.03  0.00011 ***
## x             2.0224     0.0764   26.46  < 2e-16 ***
## z            -0.0204     0.0297   -0.69  0.49450    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.43 on 97 degrees of freedom
## Multiple R-squared:  0.879,  Adjusted R-squared:  0.877 
## F-statistic:  352 on 2 and 97 DF,  p-value: <2e-16
## 
## 
## $ybin
## 
## Call:
## lm(formula = outcome ~ x + z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6057 -0.2183  0.0068  0.1967  0.7292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.335967   0.095381   14.01   <2e-16 ***
## x           -0.200010   0.016962  -11.79   <2e-16 ***
## z           -0.000098   0.006587   -0.01     0.99    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.317 on 97 degrees of freedom
## Multiple R-squared:  0.59,   Adjusted R-squared:  0.581 
## F-statistic: 69.7 on 2 and 97 DF,  p-value: <2e-16

And out come the two summaries of the models, the first with y and the second with ybin.
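If you only need part of each summary (say, the slope on x), you can extract it inside the function instead of returning the whole summary object. Here is a sketch with made-up simulated data (the variable names mirror the example above, but the numbers are invented for illustration):

```r
set.seed(1)  # made-up data for illustration only
x <- rnorm(50)
z <- rnorm(50)
mydata <- data.frame(y = 2 * x + rnorm(50), ybin = as.numeric(x > 0))

# Pull out just the estimated coefficient on x from each model
slopes <- apply(mydata, 2, function(outcome) {
    coef(summary(lm(outcome ~ x + z)))["x", "Estimate"]
})
slopes  # a named vector with one slope per outcome column
```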

Looping over parts of variable names

In this Stata example, which I feel comes up often, we want to take each of our monthly income variables and create an indicator of whether there was positive income in that month. We want to create 12 new variables with the original name of the variable like “Jan”, now in an indicator form like “hadIncJan”. The Stata code is as follows:

foreach month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec {
gen hadInc`month'=(inc`month'>0) if inc`month'<.
}

In R, this is one of those times that a for() loop actually works very nicely. I actually found this solution on Stackoverflow, which if you're not familiar, can be a lifesaver for your programming struggles.

I create some month data. In this example, we'll use the names() function in R, which is very useful. I had a blog post about the names() function if you want to read up on its other uses.

We also use the paste0() function, which is a way to concatenate character strings and numbers, along with a new way to index: the “[[” operator. You can read about this operator in the help file. The idea is to refer to a single element of the dataframe (a vector) by inserting the column name via an index from the loop. I think reading the code makes things clearer. We make our data:

jan <- rnorm(100, 3, 5)
feb <- rnorm(100, 4, 8)
march <- rnorm(100, 2, 5)

months <- as.data.frame(cbind(jan, feb, march))
names(months)
## [1] "jan"   "feb"   "march"
head(months)
##       jan    feb  march
## 1  2.6027 10.956 -2.510
## 2  8.9088 -1.440  1.726
## 3 13.9307  5.386 -4.784
## 4  5.0309  2.724  3.586
## 5 -0.6918 10.348  8.979
## 6 -6.7824 17.555 -4.056

And now we run our loop:

for (n in names(months)) {
    months[[paste0("HadInc_", n)]] <- as.numeric(months[[n]] > 0)
}
head(months)
##       jan    feb  march HadInc_jan HadInc_feb HadInc_march
## 1  2.6027 10.956 -2.510          1          1            0
## 2  8.9088 -1.440  1.726          1          0            1
## 3 13.9307  5.386 -4.784          1          1            0
## 4  5.0309  2.724  3.586          1          1            1
## 5 -0.6918 10.348  8.979          0          1            1
## 6 -6.7824 17.555 -4.056          0          1            0

The for() loop takes each column name in months, creates a new variable by concatenating the string “HadInc_” with that column name, and assigns to it a binary indicator of whether the original variable had monthly income greater than 0. If this is confusing, I suggest breaking it down into parts. You can run it this way to see what is actually happening (output not shown here, but you can run it on your own to understand it):

for (n in names(months)) {
    print(n)
    print(months[[n]])
    print(paste0("HadInc_", n))
}
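As an aside, the same result can be had without an explicit loop by applying the indicator function to every column at once and binding the results on with new names. Here is a sketch on a small made-up dataframe (called months2 so as not to disturb the one above):

```r
months2 <- data.frame(jan = c(2.6, -1.2, 0), feb = c(10.9, -0.5, 3.3))

# Compute all the indicators in one shot, then rename and bind them on
inds <- as.data.frame(lapply(months2, function(v) as.numeric(v > 0)))
names(inds) <- paste0("HadInc_", names(months2))
months2 <- cbind(months2, inds)
months2$HadInc_jan  # 1 0 0
```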

Looping over Varlists

In Stata, in order to do something to all or a lot of variables (for example, to rename them to have all lower case or upper case letters), you use a foreach loop like this:

foreach oldname of varlist * {
local newname=lower("`oldname'")
rename `oldname' `newname'
}

In R, for this exact situation, you can use the toupper() function on the names of the data, or on a subset of the names if you only want to change some of the columns.

names(months) <- toupper(names(months))
head(months)
##       JAN    FEB  MARCH HADINC_JAN HADINC_FEB HADINC_MARCH
## 1  2.6027 10.956 -2.510          1          1            0
## 2  8.9088 -1.440  1.726          1          0            1
## 3 13.9307  5.386 -4.784          1          1            0
## 4  5.0309  2.724  3.586          1          1            1
## 5 -0.6918 10.348  8.979          0          1            1
## 6 -6.7824 17.555 -4.056          0          1            0

For other situations, like replacing indicators of missing values to NA for a bunch of variables at a time, check out my previous blog post on using apply() in these situations.
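As a tiny sketch of that idea, recoding a made-up missing-value code of -99 to NA can be done for many columns at once with logical indexing (the dataframe and the -99 convention here are invented for illustration):

```r
# Made-up data where -99 stands in for missing
df <- data.frame(a = c(1, -99, 3), b = c(-99, 2, 5))

df[df == -99] <- NA  # recode every -99 in every column to NA
df
```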

Looping over numbers

We revisit the same problem we had with the monthly income, except now we want an indicator by year; we have variables for yearly income. In Stata this would be the code, now using the forvalues loop with a macro:

forvalues year=1990/2010 {
gen hadInc`year'=(inc`year'>0) if inc`year'<.
}

In R we can use a for() loop along with the “[[” operator that we used before, but this time we make use of the seq(along=x) syntax that will let us go along a sequence of numbers. Our dataframe called “Income” includes columns for each year of income. We make a new vector called “years” that just contains numbers from 1990 to 1992. Then for each value along that vector, we make a new column in our Income dataframe with a name that concatenates “hadInc_” with the number in the sequence, and this variable is just a binary indicator of whether that year's income was positive.

Inc1990 <- rnorm(100, 5, 6)
Inc1991 <- rnorm(100, 3, 8)
Inc1992 <- rnorm(100, 4, 4)

Income <- as.data.frame(cbind(Inc1990, Inc1991, Inc1992))
years <- c(1990:1992)
head(Income)
##   Inc1990 Inc1991 Inc1992
## 1  3.6391 -14.001   6.169
## 2  0.8242   7.169   8.095
## 3  8.0399   2.629   7.251
## 4  8.6974  -6.196   9.204
## 5 -5.0013   8.035   6.272
## 6  2.9825  -2.812   3.520
for (i in seq(along = years)) {
    Income[[paste0("hadInc_", years[i])]] <- as.numeric(Income[[i]] > 0)
}
head(Income)
##   Inc1990 Inc1991 Inc1992 hadInc_1990 hadInc_1991 hadInc_1992
## 1  3.6391 -14.001   6.169           1           0           1
## 2  0.8242   7.169   8.095           1           1           1
## 3  8.0399   2.629   7.251           1           1           1
## 4  8.6974  -6.196   9.204           1           0           1
## 5 -5.0013   8.035   6.272           0           1           1
## 6  2.9825  -2.812   3.520           1           0           1

Again, run just parts of it to understand it if you're having trouble with the syntax.
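One small note: seq(along = years) is the older spelling of what base R also provides as seq_along(), and the two are interchangeable here:

```r
years <- 1990:1992
identical(seq(along = years), seq_along(years))  # TRUE; both give 1 2 3
```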

Looping over Values and Levelsof

Finally, we may want to run the same functions over values or levels of a variable. Here are two situations in Stata: the first can use the by statement, and the second uses forvalues with a survey command.

by race: regress income age i.education

forvalues race=1/3 {
svy, subpop(if race==`race'): reg income age i.education
}

In R, there is no “by” option for linear regression but we can use the lapply() function instead. The function lapply() is the same idea as apply() except it can be used to apply some function over a list. We create some data similar to the Stata example:

race <- c(rep(1, 30), rep(2, 30), rep(3, 40))
age <- rnorm(100, 25, 3)
y <- age * 10 + ifelse(race == 1, 100, ifelse(race == 2, 2000, 0)) + rnorm(100, 
    0, 1)

racedata <- as.data.frame(cbind(race, age, y))
racedata$race <- as.factor(racedata$race)

Now we use lapply() to run the summary of lm for a subset of the racedata where we subset by each value of the list.

lapply(1:3, function(index) summary(lm(y~age, data=racedata[racedata$race==index,])))
## [[1]]
## 
## Call:
## lm(formula = y ~ age, data = racedata[racedata$race == index, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0021 -0.8978 -0.0686  0.8602  2.0371 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 103.0121     2.2522    45.7   <2e-16 ***
## age           9.8836     0.0895   110.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.2 on 28 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.998 
## F-statistic: 1.22e+04 on 1 and 28 DF,  p-value: <2e-16
## 
## 
## [[2]]
## 
## Call:
## lm(formula = y ~ age, data = racedata[racedata$race == index, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4785 -0.4682 -0.0427  0.6685  1.3434 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.00e+03   1.34e+00    1491   <2e-16 ***
## age         9.96e+00   5.31e-02     188   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.881 on 28 degrees of freedom
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999 
## F-statistic: 3.52e+04 on 1 and 28 DF,  p-value: <2e-16
## 
## 
## [[3]]
## 
## Call:
## lm(formula = y ~ age, data = racedata[racedata$race == index, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1961 -0.3204  0.0736  0.5295  2.1861 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.0024     1.0816   -1.85    0.072 .  
## age          10.0814     0.0442  228.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.998 on 38 degrees of freedom
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999 
## F-statistic: 5.21e+04 on 1 and 38 DF,  p-value: <2e-16

To make this even better, we take the levels of race (a factor) and run lapply() over those instead of the numbers 1-3, so that if those levels change, we won't have to change our code.

lapply(as.numeric(levels(racedata$race)), function(index) summary(lm(y~age, data=racedata[racedata$race==index,])))

This idea can also be applied to any function that you want to evaluate by different values.
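The same by-group pattern can also be written with split(), which chops the dataframe into a list of pieces by factor level and pairs naturally with lapply(). Here is a sketch with small made-up data (names chosen to echo the example above):

```r
set.seed(2)  # made-up data for illustration
race <- factor(c(rep(1, 10), rep(2, 10), rep(3, 10)))
age <- rnorm(30, 25, 3)
y <- age * 10 + c(100, 2000, 0)[race] + rnorm(30)
d <- data.frame(race, age, y)

# One lm() per level of race, named by the level
models <- lapply(split(d, d$race), function(sub) lm(y ~ age, data = sub))
names(models)  # "1" "2" "3"
```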

Of course there may be more efficient ways to do what I've shown here. If you have comments on improvement over these solutions, let me know!



Tuesday, June 25, 2013

Sample size calculations equivalent to Stata functions


Hi everyone, I’m trying out R knitr for my blog posts now; let me know what you think!

Recently, I was looking for sample size calculations in R and found that R really doesn’t have good built-in functions for sample size and power. For example, for finding the sample size necessary to find a difference in two proportions, one can use the power.prop.test() function and get the following output:

power.prop.test(n=NULL, .10, .25, .05, .90, alternative="two.sided")
## 
##      Two-sample comparison of proportions power calculation 
## 
##               n = 132.7557
##              p1 = 0.1
##              p2 = 0.25
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

But there are very few options in this function: for example, there is no option for a continuity correction to the normal approximation for binary data, you can’t run a one-sample test of a null versus an alternative proportion instead of a two-sample comparison, and you can’t change the ratio of the sizes of the two samples. Even more importantly, there’s no way to find the effective sample size when data are clustered. There are other functions in R, but I found them complicated and not user-friendly. The sampsi and sampclus commands in Stata are intuitive and useful, and have all of those important options. I decided to build my own R functions with the same capabilities, and I’ve posted them here for others to use and comment on. Eventually I would like to make an R package that includes all of these functions for ease of use.
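That said, power.prop.test() does handle the quantities it knows about flexibly: leave any one of n, power, or the significance level as NULL and it solves for that one. For example, the power achieved with the rounded-up n of 133 per group from the output above:

```r
# Solving for power instead of n (no continuity correction is applied)
pw <- power.prop.test(n = 133, p1 = .10, p2 = .25, sig.level = .05)$power
round(pw, 2)  # just above the 0.90 we asked for
```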

The first function, sampsi.prop(), calculates sample size for one- and two-sample tests of proportions. The required n for each is as follows:

  • Two-sample test of equality of proportions, with continuity correction:

\[n_1 = \frac{n'}{4}\left(1+\left\{1+\frac{2(r+1)}{n'r|p_1-p_2|}\right\}^{1/2}\right)^2\]

and \(n_2 = rn_1\)

\[n' = \frac{\left(z_{1-\alpha/2}\left\{(r+1)\bar{p}\bar{q}\right\}^{1/2}+z_{1-\beta}(rp_1q_1+p_2q_2)^{1/2}\right)^2}{r(p_1-p_2)^2}\]

and \(\bar{p} = (p_1+rp_2)/(r+1)\) and \(\bar{q}=1-\bar{p}\)

For the test without the continuity correction, \(n_1=n'\) and \(n_2=rn_1\).

  • One-sample test of proportions where null is \(p=p_0\) and alternative is \(p=p_A\):

\[n = \left(\frac{z_{1-\alpha/2}\left\{p_0(1-p_0)\right\}^{1/2} + z_{1-\beta}\left\{p_A(1-p_A)\right\}^{1/2}}{p_A-p_0}\right)^2\]
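As a quick numeric check of this one-sample formula, plugging in \(p_0=.5\), \(p_A=.55\), power of .80, and \(\alpha=.05\) by hand:

```r
z.a <- qnorm(.975)  # z for two-sided alpha = .05
z.b <- qnorm(.80)   # z for power = .80
n <- ((z.a * sqrt(.5 * .5) + z.b * sqrt(.55 * .45)) / (.55 - .50))^2
ceiling(n)  # 783
```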

The function sampsi.prop() takes arguments p1 and p2, which are either the null and alternative proportions, respectively, for the one-sample test, or the two proportions for the two-sample test. These arguments are required for the function to run. The rest of the arguments have preset defaults so do not need to be specified: the ratio of the smaller group to the larger group (default = 1), power (default = .90), significance level (default of \(\alpha=.05\)), a logical for whether to apply the continuity correction (default is TRUE), whether to perform a two-sided or one-sided test (default is two-sided), and whether to perform the two-sample or one-sample test (default is the two-sample test). The output is a list object.

sampsi.prop<-function(p1, p2, ratio=1, power=.90, alpha=.05, cont.corr=TRUE, two.sided=TRUE, one.sample=FALSE){
  
  effect.size<-abs(p2-p1)
  avg.p<-(p1+ratio*p2)/(ratio+1)
  sd=ifelse(one.sample==FALSE, sqrt(ratio*p1*(1-p1)+p2*(1-p2)), sqrt(p2*(1-p2)))
  
  z.pow<-qt(1-power, df=Inf, lower.tail=FALSE)
  z.alph<-ifelse(two.sided==TRUE, qt(alpha/2, df=Inf, lower.tail=FALSE), qt(alpha, df=Inf, lower.tail=FALSE))
  ct<-(z.pow+z.alph)
  
  n1<-(z.alph*sqrt((ratio+1)*avg.p*(1-avg.p))+z.pow*sd)^2/(effect.size^2*ratio)
  n1.cont<-ifelse(cont.corr==FALSE, n1, (n1/4)*(1+sqrt(1+(2*(ratio+1))/(n1*ratio*effect.size)))^2)

  n<-(((z.alph*sqrt(p1*(1-p1)))+z.pow*sd)/effect.size)^2
  
  if(one.sample==FALSE){
    col1<-c("alpha", "power", "p1", "p2", "effect size", "n2/n1", "n1", "n2")
    col2<-c(alpha,  power, p1, p2, effect.size, ratio, ceiling(n1.cont), ceiling(n1.cont*ratio))
  }
  else{
    col1<-c("alpha", "power", "p", "alternative", "n")
    col2<-c(alpha, power, p1, p2, ceiling(n))
  }
  ret<-as.data.frame(cbind(col1, col2))
  ret$col2<-as.numeric(as.character(ret$col2))
  colnames(ret)<-c("Assumptions", "Value")
  
  description<-paste(ifelse(one.sample==FALSE, "Two-sample", "One-sample"), ifelse(two.sided==TRUE, "two-sided", "one-sided"), "test of proportions", ifelse(cont.corr==FALSE, "without", "with"), "continuity correction")
  
  retlist<-list(description, ret)
  
  return(retlist)
}

Now we can do the following sample size calculations that match up perfectly to the corresponding commands in Stata:

sampsi.prop(.10, .25)
## [[1]]
## [1] "Two-sample two-sided test of proportions with continuity correction"
## 
## [[2]]
##   Assumptions  Value
## 1       alpha   0.05
## 2       power   0.90
## 3          p1   0.10
## 4          p2   0.25
## 5 effect size   0.15
## 6       n2/n1   1.00
## 7          n1 146.00
## 8          n2 146.00

Notice that the results from the calculation above are corrected for continuity, so do not match up with the power.prop.test output from above. To get those results, specify cont.corr=FALSE like so:

sampsi.prop(.10, .25,  cont.corr=FALSE)
## [[1]]
## [1] "Two-sample two-sided test of proportions without continuity correction"
## 
## [[2]]
##   Assumptions  Value
## 1       alpha   0.05
## 2       power   0.90
## 3          p1   0.10
## 4          p2   0.25
## 5 effect size   0.15
## 6       n2/n1   1.00
## 7          n1 133.00
## 8          n2 133.00

Here are a couple more examples with some of the other defaults changed:

sampsi.prop(.5, .55, power=.8, one.sample=TRUE)
## [[1]]
## [1] "One-sample two-sided test of proportions with continuity correction"
## 
## [[2]]
##   Assumptions  Value
## 1       alpha   0.05
## 2       power   0.80
## 3           p   0.50
## 4 alternative   0.55
## 5           n 783.00
sampsi.prop(.5, .55, ratio=2, two.sided=FALSE)
## [[1]]
## [1] "Two-sample one-sided test of proportions with continuity correction"
## 
## [[2]]
##   Assumptions   Value
## 1       alpha    0.05
## 2       power    0.90
## 3          p1    0.50
## 4          p2    0.55
## 5 effect size    0.05
## 6       n2/n1    2.00
## 7          n1 1310.00
## 8          n2 2619.00

Next we can calculate sample size for a comparison of means. The required n for the one sample and two sample tests are as follows:

  • One sample test of mean where null hypothesis is \(\mu=\mu_0\) and the alternative is \(\mu=\mu_A\): \[n = \left\{\frac{(z_{1-\alpha/2} + z_{1-\beta})\sigma}{\mu_A-\mu_0}\right\}^2\]

  • Two-sample test of equality of means:

\[n_1 = \frac{(\sigma_1^2+\sigma_2^2/r)(z_{1-\alpha/2}+z_{1-\beta})^2}{(\mu_1-\mu_2)^2}\] \[n_2=rn_1\]
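A quick hand check of the two-sample formula with means 0 and 10, a common sd of 15, r = 1, power of .80, and \(\alpha=.05\):

```r
z <- qnorm(.975) + qnorm(.80)
n1 <- (15^2 + 15^2 / 1) * z^2 / (10 - 0)^2
ceiling(n1)  # 36
```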

The sampsi.means() function again works the same way as in Stata: you must input as arguments either a null and an alternative mean for a one-sample test, or two means for a two-sample test. At least one standard deviation is also required. It is possible to have different standard deviations in the two-sample test, but if only the first standard deviation is supplied and a two-sample test is designated, the function assumes both standard deviations to be the same. The other arguments and their defaults are similar to those of the sampsi.prop() function from above.

sampsi.means<-function(m1, m2, sd1, sd2=NA, ratio=1, power=.90, alpha=.05, two.sided=TRUE, one.sample=FALSE){
  
  effect.size<-abs(m2-m1)
  sd2<-ifelse(!is.na(sd2), sd2, sd1)
  
  z.pow<-qt(1-power, df=Inf, lower.tail=FALSE)
  z.alph<-ifelse(two.sided==TRUE, qt(alpha/2, df=Inf, lower.tail=FALSE), qt(alpha, df=Inf, lower.tail=FALSE))
  ct<-(z.pow+z.alph)
  
  n1<-(sd1^2+(sd2^2)/ratio)*(ct)^2/(effect.size^2)
  n<-(ct*sd1/effect.size)^2
  
  if(one.sample==FALSE){
    col1<-c("alpha", "power", "m1", "m2", "sd1", "sd2", "effect size", "n2/n1", "n1", "n2")
    col2<-c(alpha,  power, m1, m2, sd1, sd2, effect.size, ratio, ceiling(n1), ceiling(n1*ratio))
  }
  else{
    col1<-c("alpha", "power", "null", "alternative", "n")
    col2<-c(alpha, power, m1, m2, ceiling(n))
  }
  ret<-as.data.frame(cbind(col1, col2))
  ret$col2<-as.numeric(as.character(ret$col2))
  colnames(ret)<-c("Assumptions", "Value")
  
  description<-paste(ifelse(one.sample==FALSE, "Two-sample", "One-sample"), ifelse(two.sided==TRUE, "two-sided", "one-sided"), "test of means")
  
  retlist<-list(description, ret)
  
  return(retlist)
}

And here are the examples:

sampsi.means(0, 10, sd1=15, power=.8)
## [[1]]
## [1] "Two-sample two-sided test of means"
## 
## [[2]]
##    Assumptions Value
## 1        alpha  0.05
## 2        power  0.80
## 3           m1  0.00
## 4           m2 10.00
## 5          sd1 15.00
## 6          sd2 15.00
## 7  effect size 10.00
## 8        n2/n1  1.00
## 9           n1 36.00
## 10          n2 36.00
sampsi.means(10, 30, sd1=15, sd2=20, alpha=.1, ratio=1)
## [[1]]
## [1] "Two-sample two-sided test of means"
## 
## [[2]]
##    Assumptions Value
## 1        alpha   0.1
## 2        power   0.9
## 3           m1  10.0
## 4           m2  30.0
## 5          sd1  15.0
## 6          sd2  20.0
## 7  effect size  20.0
## 8        n2/n1   1.0
## 9           n1  14.0
## 10          n2  14.0

Finally, we often work with clustered data and would like to calculate sample sizes for clustered observations. Here I created a samp.clus() function that works the same way as the sampclus command in Stata; that is, it takes a sampsi.prop() or sampsi.means() object, along with an intraclass correlation coefficient (\(\rho\)) and either the number of observations per cluster or the number of clusters, and calculates the corresponding effective sample size.

samp.clus <- function(sampsi.object, rho, num.clus = NA, obs.clus = NA){
  
  if(is.na(num.clus) && is.na(obs.clus)) stop("Either num.clus or obs.clus must be specified")
  
  # Pull the uncorrected sample sizes out of the sampsi object
  so <- sampsi.object[[2]]
  n1 <- as.numeric(so[so$Assumptions == "n1", 2])
  n2 <- as.numeric(so[so$Assumptions == "n2", 2])
  
  if(!is.na(obs.clus)){
    # Fixed cluster size: inflate each arm by the design effect
    deff <- 1 + (obs.clus - 1) * rho
    n1.clus <- n1 * deff
    n2.clus <- n2 * deff
    num.clus <- ceiling((n1.clus + n2.clus) / obs.clus)
  }
  else{
    # Fixed number of clusters: solve for the required cluster size
    tot <- (n1 * (1 - rho) + n2 * (1 - rho)) / (1 - (n1 * rho / num.clus) - (n2 * rho / num.clus))
    if(tot <= 0) stop("Number of clusters is too small")
    obs.clus <- ceiling(tot / num.clus)
    deff <- 1 + (obs.clus - 1) * rho
    n1.clus <- n1 * deff
    n2.clus <- n2 * deff
  }
  
  col1 <- c("n1 uncorrected", "n2 uncorrected", "ICC", "Avg obs/cluster", "Min num clusters", "n1 corrected", "n2 corrected")
  col2 <- c(n1, n2, rho, obs.clus, num.clus, ceiling(n1.clus), ceiling(n2.clus))
  ret <- as.data.frame(cbind(col1, col2))
  colnames(ret) <- c("Assumptions", "Value")
  return(ret)
}

Here are a couple of examples of how it works (the same samp.clus() function accepts either a sampsi.prop or a sampsi.means object):

ss<-sampsi.prop(.10, .25, power=.8, two.sided=FALSE)
samp.clus(ss, rho=.05, obs.clus=15)
##        Assumptions Value
## 1   n1 uncorrected    92
## 2   n2 uncorrected    92
## 3              ICC  0.05
## 4  Avg obs/cluster    15
## 5 Min num clusters    21
## 6     n1 corrected   157
## 7     n2 corrected   157
samp.clus(ss, rho=.05, num.clus=150)
##        Assumptions Value
## 1   n1 uncorrected    92
## 2   n2 uncorrected    92
## 3              ICC  0.05
## 4  Avg obs/cluster     2
## 5 Min num clusters   150
## 6     n1 corrected    97
## 7     n2 corrected    97
ss2<-sampsi.means(10, 15, sd1=15, power=.8)
samp.clus(ss2, rho=.05, obs.clus=15)
##        Assumptions Value
## 1   n1 uncorrected   142
## 2   n2 uncorrected   142
## 3              ICC  0.05
## 4  Avg obs/cluster    15
## 5 Min num clusters    33
## 6     n1 corrected   242
## 7     n2 corrected   242
samp.clus(ss2, rho=.05, num.clus=5) 
## Error in samp.clus(ss2, rho = 0.05, num.clus = 5): Number of clusters is too small
## Note: the call above fails because 5 clusters is too few for these
## assumptions; increase num.clus to avoid the error.
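When the number of clusters \(k\) is fixed instead of the cluster size, the total corrected sample size is \((n_1(1-\rho)+n_2(1-\rho)) / (1 - n_1\rho/k - n_2\rho/k)\); a nonpositive denominator is what triggers the error above. A quick check of the num.clus=150 example (tot.clus is a hypothetical name for illustration):

```r
# Total corrected sample size when the number of clusters k is fixed;
# a nonpositive result means k is too small for these inputs.
tot.clus <- function(n1, n2, rho, k) {
  (n1 * (1 - rho) + n2 * (1 - rho)) / (1 - (n1 * rho / k) - (n2 * rho / k))
}

tot <- tot.clus(92, 92, rho = 0.05, k = 150)
obs.clus <- ceiling(tot / 150)              # 2 observations per cluster
ceiling(92 * (1 + (obs.clus - 1) * 0.05))   # 97, matching "n1 corrected"
```

With k = 5 the denominator is negative, which is why the last example errors out.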

Lastly, I add a feature that Stata's basic sample size calculations lack: graphing sample size as a function of power. I created two functions for this purpose, one for proportions and one for means. The inputs are the starting and ending points of the power range, along with the same arguments and defaults as the corresponding sampsi.prop or sampsi.means function:

graph.power.prop <- function(from.power, to.power, p1, p2, ratio=1, alpha=.05, cont.corr=TRUE, two.sided=TRUE, one.sample=FALSE){
  
  seq.p <- seq(from.power, to.power, by=.01)
  n <- rep(NA, length(seq.p))

  for(i in 1:length(seq.p)){
    ob <- sampsi.prop(p1=p1, p2=p2, power=seq.p[i], alpha=alpha, ratio=ratio, cont.corr=cont.corr, two.sided=two.sided, one.sample=one.sample)[[2]]
    n[i] <- as.numeric(ob[7, 2])
  }
  plot(n, seq.p, ylab="Power", xlab="n (in smaller arm)", type="l", main=paste("Power graph for p1=", p1, "and p2=", p2))
}


graph.power.means <- function(from.power, to.power, m1, m2, sd1, sd2=NA, ratio=1, alpha=.05, two.sided=TRUE, one.sample=FALSE){
  
  seq.p <- seq(from.power, to.power, by=.01)
  n <- rep(NA, length(seq.p))
  for(i in 1:length(seq.p)){
    ob <- sampsi.means(m1=m1, m2=m2, sd1=sd1, sd2=sd2, power=seq.p[i], alpha=alpha, ratio=ratio, one.sample=one.sample, two.sided=two.sided)[[2]]
    n[i] <- as.numeric(ob[9, 2])
  }
  plot(n, seq.p, ylab="Power", xlab="n (in smaller arm)", type="l", main=paste("Power graph for m1=", m1, "and m2=", m2))
}

And this is what it looks like. This example graphs power as a function of sample size, restricting the power range to \(1-\beta=0.6\) through \(1-\beta=1\).

graph.power.prop(.6, 1, p1=.20, p2=.35)