Tuesday, August 25, 2015

How to use lists in R

In the last post (http://rforpublichealth.blogspot.com/2015/03/basics-of-lists.html), I went over the basics of lists, including constructing, manipulating, and converting lists to other classes.

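Knowing the basics, in this post, we’ll use the apply() functions to see just how powerful working with lists can be. I’ve done two posts on apply() for data frames and matrices, here (http://rforpublichealth.blogspot.com/2012/09/the-infamous-apply-function.html) and here (http://rforpublichealth.blogspot.com/2013/10/loops-revisited-how-to-rethink-macros.html), so give those a read if you need a refresher.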

Intro to apply-based functions for lists

There are a variety of apply()-based functions that can be used depending on what you want to do. The table below shows the function, what it inputs, and what it outputs:

Function   Input            Output
apply      matrix           vector or matrix
sapply     vector or list   vector or matrix
lapply     vector or list   list
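
The examples in this post cover sapply() and lapply(). As a minimal sketch of the first row of the table, here is apply() on a small matrix. The matrix m is made up purely for illustration:

```r
# a hypothetical 2x3 matrix, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 1 applies the function to each row, MARGIN = 2 to each column
row_totals <- apply(m, 1, sum)
col_totals <- apply(m, 2, sum)

row_totals   # one total per row
col_totals   # one total per column
```

Either way the input is a matrix and the output is a plain vector, as the table says.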

For example, if you have a list and you want to produce a vector (of the same length), use sapply(). If you have a vector and want to produce a list of the same length, use lapply(). Let’s try an example.

The syntax of lapply() is:

lapply(INPUT, function(x) (Some function here))

where INPUT, as we see from the table above, must be a vector or a list, and function(x) is any kind of function that takes each element of the INPUT and applies the function to it. The function can be something that already exists in R, or it can be a new function that you’ve written up.

For example, let’s construct a list of 3 vectors like so:

mylist<-list(x=c(1,5,7), y=c(4,2,6), z=c(0,3,4))
mylist
## $x
## [1] 1 5 7
## 
## $y
## [1] 4 2 6
## 
## $z
## [1] 0 3 4

and now we can use lapply() to find the mean of each element of the list (mean of each of the vectors x, y, and z), and output to a new list:

lapply(mylist, function(x) mean(x))
## $x
## [1] 4.333333
## 
## $y
## [1] 4
## 
## $z
## [1] 2.333333
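
Because mean() already takes a vector as its first argument, the anonymous wrapper is optional. A small sketch (by_name and by_wrapper are just illustrative names) showing that passing the function by name gives the same list:

```r
mylist <- list(x = c(1, 5, 7), y = c(4, 2, 6), z = c(0, 3, 4))

# pass the existing function directly...
by_name <- lapply(mylist, mean)

# ...or wrap it in an anonymous function, as above
by_wrapper <- lapply(mylist, function(x) mean(x))

# both routes produce the same named list of means
identical(by_name, by_wrapper)
```

The wrapper form becomes useful once the body does more than call a single function.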

But let’s say we wanted the result in a vector, not in a list, for whatever reason. Instead of doing the above and then converting the list into a vector (using unlist() or ldply() or whatever), we can do this directly using sapply() instead of lapply(). That’s because, as you can see in the table, sapply() can take in a list as the input, and it will return a vector (or matrix). Let’s try it:

sapply(mylist, function(x) mean(x))
##        x        y        z 
## 4.333333 4.000000 2.333333
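
For comparison, here is a sketch of two other routes to the same vector: the two-step unlist(lapply()) mentioned above, and base R's vapply(), which works like sapply() but asks you to state the type and length of each result up front. The object names means_unlist and means_vapply are only for illustration:

```r
mylist <- list(x = c(1, 5, 7), y = c(4, 2, 6), z = c(0, 3, 4))

# two-step route: lapply() first, then flatten the list
means_unlist <- unlist(lapply(mylist, mean))

# vapply() errors unless every result is a single numeric value
means_vapply <- vapply(mylist, mean, FUN.VALUE = numeric(1))

means_vapply
```

sapply() is the most convenient at the console; vapply() may be the safer choice inside functions, since it fails loudly if a result has an unexpected shape.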

This is really great! Anytime you want to do the same thing over and over again, put all those things in a list and then use one of the apply functions. This saves you from writing an explicit loop, which usually takes more code and, if the loop grows its result one element at a time, can run a lot slower.

Let’s do another example where we write our own function this time:

#write function to find the span of numbers in a vector and check if it's at least 5
span.fun<-function(x) {(max(x)-min(x))>=5}

#apply that function to the list
sapply(mylist, span.fun)
##     x     y     z 
##  TRUE FALSE FALSE
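
If the cutoff should not be hard-coded, any extra arguments given after the function are passed along to it for every element. A minimal sketch, where span.atleast and its cutoff argument are hypothetical names rather than part of the example above:

```r
mylist <- list(x = c(1, 5, 7), y = c(4, 2, 6), z = c(0, 3, 4))

# hypothetical variant of span.fun with an adjustable cutoff
span.atleast <- function(x, cutoff) (max(x) - min(x)) >= cutoff

# arguments after the function name are forwarded on every call
wide5 <- sapply(mylist, span.atleast, cutoff = 5)
wide4 <- sapply(mylist, span.atleast, cutoff = 4)

wide5
wide4
```

The spans here are 6, 4, and 4, so only x passes a cutoff of 5 while all three pass a cutoff of 4.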

Creating a list using lapply()

You don’t need to have a list already created to use lapply() - in fact, lapply() can be used to make a list. This is because the key about lapply() is that it returns a list of the same length as whatever you input.

For example, let’s initialize a list to have 2 empty matrices that are size 2x3. We’ll use lapply(): our input is just a vector containing 1 and 2, and the function we specify uses the matrix() function to construct a 2x3 matrix of empty cells for each element of this vector, so it returns a list of two such matrices.

If instead of empty matrices we wanted to fill these matrices with random numbers, we could do that too. Check out both possibilities below.

#initialize list to 2 empty matrices of 2 by 3
list2<-lapply(1:2, function(x) matrix(NA, nrow=2, ncol=3))
list2
## [[1]]
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
#initialize list to 2 matrices with random numbers from normal distribution
list2<-lapply(1:2, function(x) matrix(rnorm(6, 10, 1), nrow=2, ncol=3))
list2
## [[1]]
##           [,1]      [,2]     [,3]
## [1,]  9.467982  9.794397 10.52168
## [2,] 10.022561 10.179758 10.47954
## 
## [[2]]
##          [,1]     [,2]     [,3]
## [1,] 7.990455 10.95596 11.94031
## [2,] 8.952418 10.97080 11.24791
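
Two side notes on the code above, sketched under my own assumptions: the random numbers will differ on every run unless a seed is set (the seed value 1 below is arbitrary), and since the x in function(x) is never used, replicate() with simplify = FALSE is one alternative way to build the same kind of list:

```r
# fix the seed so the random matrices are the same on every run
set.seed(1)
list_a <- lapply(1:2, function(i) matrix(rnorm(6, 10, 1), nrow = 2, ncol = 3))

# replicate() re-evaluates the expression twice; simplify = FALSE keeps a list
set.seed(1)
list_b <- replicate(2, matrix(rnorm(6, 10, 1), nrow = 2, ncol = 3),
                    simplify = FALSE)

# same seed, same draws, so the two lists should hold the same matrices
all.equal(list_a, list_b)
```

lapply() remains the better fit when the function actually needs the index, for example to vary the matrix size per element.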

Again, we can use lapply() or sapply() on this newly created list to get the sum of each column of each matrix:

#input list, output column sums of each matrix into a new list
lapply(list2, colSums)
## [[1]]
## [1] 19.49054 19.97416 21.00121
## 
## [[2]]
## [1] 16.94287 21.92676 23.18822
#input list, output column sums simplified into a matrix (one column per list element)
sapply(list2, colSums)
##          [,1]     [,2]
## [1,] 19.49054 16.94287
## [2,] 19.97416 21.92676
## [3,] 21.00121 23.18822
#to get one row per matrix instead, transpose the result with t():
t(sapply(list2, colSums))
##          [,1]     [,2]     [,3]
## [1,] 19.49054 19.97416 21.00121
## [2,] 16.94287 21.92676 23.18822
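
sapply() can only bind the results into a matrix when every result has the same length. As a small sketch with made-up matrices of different widths (uneven is a hypothetical object, not one from this post), it quietly falls back to returning a list:

```r
# two hypothetical matrices with different numbers of columns
uneven <- list(matrix(1, nrow = 2, ncol = 3),
               matrix(1, nrow = 2, ncol = 4))

sums <- sapply(uneven, colSums)

# the results have lengths 3 and 4, so they cannot form a matrix
is.list(sums)
lengths(sums)
```

This is worth checking whenever later code assumes the output of sapply() is a vector or matrix.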

Practical uses of lists using lapply()

Finally, what are lists good for? Often, I find lists are great when I want to store several complex objects in one object, for example to group a bunch of data.frames into a list, or to store all my model results in one list. Here’s an example, where I run four linear models for four different outcomes. I want to store all my models in one object.

There are two ways to do this:

  • Use a for() loop and insert the results of each iteration into the list
  • Use lapply()! Less code, and no need to initialize the list first

#create some data
set.seed(2000)
x=rbinom(1000,1,.6)
mydata<-data.frame(trt=x,
                   out1=x*3+rnorm(1000,0,3),
                   out2=x*5+rnorm(1000,0,3),
                   out3=rnorm(1000,5,3),
                   out4=x*1+rnorm(1000,0,8))

head(mydata)
##   trt      out1      out2      out3       out4
## 1   1  1.496148 5.2140842 7.8220283 12.7108382
## 2   0 -1.243485 0.5332667 2.8407921  4.6709677
## 3   1 11.070722 4.6477594 4.6725192  0.4216170
## 4   1  2.681000 1.8717883 0.3333281  0.4401036
## 5   0 -3.459300 0.8945582 3.1010555 -0.2620342
## 6   1 -2.266221 9.1754452 6.4914437  3.0443185
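
Lists can hold data.frames just as well as models, as mentioned at the start of this section. A minimal sketch using a tiny made-up data frame (toy stands in for mydata, and the values are invented): split() returns a list with one data.frame per group, and the apply functions then work on each piece.

```r
# a small made-up data frame standing in for mydata
toy <- data.frame(trt = c(0, 0, 1, 1), out1 = c(1, 3, 5, 7))

# split() returns a list of data.frames, one per value of trt
by_trt <- split(toy, toy$trt)

# apply a function to each data.frame in the list
group_means <- sapply(by_trt, function(d) mean(d$out1))
group_means
```

The list elements are named after the group values, so the result is a named vector with one mean per treatment group.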

Now I want to regress each of the four outcomes on the trt variable using linear regression and save the results. I’ll do this first as a loop, then using lapply():

#1. Use a loop
#first, initialize the results list
results<-vector("list", 4) 

#now use a loop for each outcome
for(i in 1:4){
  results[[i]]<-lm(mydata[,i+1]~trt, data = mydata) 
}


#2.Or, use lapply in one statement!
results<-lapply(2:5, function(x) lm(mydata[,x]~trt, data = mydata))

In the second case, we are taking the vector c(2,3,4,5) and for each component of this vector, we’re running the model that we describe in the function. We can always name the components of the list as below, and I’ll print out the first two elements:

names(results)<-names(mydata)[2:5]
print(results, max=2)
## $out1
## 
## Call:
## lm(formula = mydata[, x] ~ trt, data = mydata)
## 
## Coefficients:
## (Intercept)          trt  
##      0.1905       2.7707  
## 
## 
## $out2
## 
## Call:
## lm(formula = mydata[, x] ~ trt, data = mydata)
## 
## Coefficients:
## (Intercept)          trt  
##    -0.01892      4.73405  
## 
## 
##  [ reached getOption("max.print") -- omitted 2 entries ]
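
Notice that every model's formula refers to mydata[, x], so the stored formula no longer says which outcome was used. One possible alternative, sketched here rather than taken from the code above, is to loop over the column names and build each formula with reformulate() from base R. The names outcomes and results_named are my own:

```r
# recreate the data from above so this block runs on its own
set.seed(2000)
x <- rbinom(1000, 1, .6)
mydata <- data.frame(trt = x,
                     out1 = x*3 + rnorm(1000, 0, 3),
                     out2 = x*5 + rnorm(1000, 0, 3),
                     out3 = rnorm(1000, 5, 3),
                     out4 = x*1 + rnorm(1000, 0, 8))

outcomes <- names(mydata)[2:5]

# build out1 ~ trt, out2 ~ trt, ... from the column names
results_named <- lapply(outcomes, function(y) {
  lm(reformulate("trt", response = y), data = mydata)
})
names(results_named) <- outcomes

# the model formula now carries the outcome's name
formula(results_named$out1)
```

The fitted coefficients are the same as with column indexing; only the bookkeeping changes.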

Why is this a great way to store data? Well, we can keep using the apply() functions, for example to put together all of the treatment effects for each outcome into one matrix:

#extract coefficient and std error for each outcome and store in a matrix
sapply(results, function(x) summary(x)$coefficients[2,1:2])
##                 out1      out2       out3      out4
## Estimate   2.7707490 4.7340543 -0.1344969 1.3293520
## Std. Error 0.1915748 0.1876549  0.1912755 0.5324664

You can also easily use other functions like stargazer() (previous post on this function here: http://rforpublichealth.blogspot.ie/2013/08/exporting-results-of-linear-regression_24.html) to create a quick table of results like so (in LaTeX code):

require(stargazer)
stargazer(results, 
          column.labels=names(results),
          keep.stat=c("rsq","n"),
          dep.var.labels="")

Or easily create a graph of the model estimates and 95% confidence intervals:

#extract coefficients from the list
coefs<-as.data.frame(t(sapply(results, function(x) summary(x)$coefficients[2,1:2])))
coefs
##        Estimate Std. Error
## out1  2.7707490  0.1915748
## out2  4.7340543  0.1876549
## out3 -0.1344969  0.1912755
## out4  1.3293520  0.5324664
#add outcome column and change name of SE column
coefs$Outcome<-rownames(coefs)
names(coefs)[2]<-"SE"

#use ggplot to plot all the estimates
require(ggplot2)
ggplot(coefs, aes(Outcome,Estimate)) +
  geom_point(size=4) + 
  theme(legend.position="none")+
  labs(title="Treatment effect on outcomes", x="", y="Estimate and 95% CI")+
  geom_errorbar(aes(ymin=Estimate-1.96*SE,ymax=Estimate+1.96*SE),width=0.1)+
  geom_hline(yintercept = 0, color="red")+
  coord_flip()
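
The error bars above use the normal approximation, Estimate ± 1.96*SE. As a hedged alternative, confint() computes the interval from the t distribution for each model. With 1000 observations the two should be very close. The sketch below rebuilds the data and models so it runs on its own, and ci is just an illustrative name:

```r
# recreate the data and the list of models from above
set.seed(2000)
x <- rbinom(1000, 1, .6)
mydata <- data.frame(trt = x,
                     out1 = x*3 + rnorm(1000, 0, 3),
                     out2 = x*5 + rnorm(1000, 0, 3),
                     out3 = rnorm(1000, 5, 3),
                     out4 = x*1 + rnorm(1000, 0, 8))
results <- lapply(2:5, function(i) lm(mydata[, i] ~ trt, data = mydata))
names(results) <- names(mydata)[2:5]

# confint() gives the lower and upper limit for the trt coefficient;
# sapply() binds them as columns, so transpose to get one row per outcome
ci <- t(sapply(results, function(m) confint(m, "trt", level = 0.95)))
colnames(ci) <- c("lower", "upper")
ci
```

These two columns could be passed to geom_errorbar() as ymin and ymax in place of the 1.96*SE calculation.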

I hope that was useful! There are many great ways to use lists and the apply() functions to make your programming more efficient and less prone to errors.

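For another great resource on using the apply() functions with lists, definitely check out this StackOverflow page (http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega).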


Comments:

  1. Very useful introduction but the final section on creating data.frames appears to be missing. Or did you just mean us to use the SO link?

    Replies
    1. Hmm - what is the last line you see? The end of the post should be "For another great resource on using the apply() functions with lists, definitely check out this StackOverflow page."

  2. Nice! Do you know the coefplot package?