Tuesday, August 25, 2015

How to use lists in R

In the last post, I went over the basics of lists, including constructing, manipulating, and converting lists to other classes.

With the basics covered, in this post we’ll use the apply() functions to see just how powerful working with lists can be. I’ve done two posts on apply() for data frames and matrices, here and here, so give those a read if you need a refresher.

Intro to apply-based functions for lists

There are a variety of apply()-based functions that can be used depending on what you want to do. The table below shows each function, what it takes as input, and what it returns:

Function   Input            Output
apply      matrix           vector or matrix
sapply     vector or list   vector or matrix
lapply     vector or list   list
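
To make the table concrete, here is a minimal sketch that calls each function on a small input and checks what kind of object comes back. The objects m and v are made up purely for illustration.

```r
# Hypothetical inputs, used only for illustration
m <- matrix(1:6, nrow = 2)   # 2 x 3 matrix
v <- list(a = 1:3, b = 4:6)  # list of two vectors

# apply(): matrix in, vector out (MARGIN = 1 means "by row")
row_totals <- apply(m, 1, sum)

# sapply(): list in, vector out
elem_totals <- sapply(v, sum)

# lapply(): list in, list out
elem_totals_list <- lapply(v, sum)

is.vector(row_totals)      # TRUE
is.list(elem_totals)       # FALSE
is.list(elem_totals_list)  # TRUE
```

One thing the table glosses over: sapply() only simplifies when it can. If the results have different lengths, it falls back to returning a list.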

For example, if you have a list and you want to produce a vector (of the same length), use sapply(). If you have a vector and want to produce a list of the same length, use lapply(). Let’s try an example.

The syntax of lapply() is:

lapply(INPUT, function(x) (Some function here))

where INPUT, as we see from the table above, must be a vector or a list, and function(x) is any function that gets applied to each element of INPUT in turn. It can be a function that already exists in R, or a new one that you’ve written yourself.

For example, let’s construct a list of 3 vectors like so:

mylist<-list(x=c(1,5,7), y=c(4,2,6), z=c(0,3,4))
mylist
## $x
## [1] 1 5 7
## 
## $y
## [1] 4 2 6
## 
## $z
## [1] 0 3 4

and now we can use lapply() to find the mean of each element of the list (mean of each of the vectors x, y, and z), and output to a new list:

lapply(mylist, function(x) mean(x))
## $x
## [1] 4.333333
## 
## $y
## [1] 4
## 
## $z
## [1] 2.333333

But let’s say we wanted the result in a vector, not in a list, for whatever reason. Instead of doing the above and then converting the list into a vector (using unlist(), or ldply() from the plyr package, or whatever), we can do this directly using sapply() instead of lapply(). That’s because, as you can see in the table, sapply() can take in a list as the input, and it will return a vector (or matrix). Let’s try it:

sapply(mylist, function(x) mean(x))
##        x        y        z 
## 4.333333 4.000000 2.333333
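
For comparison, here is a short sketch of the two-step route mentioned above (lapply() followed by unlist()). It also shows vapply(), a base R relative of sapply() that this post doesn't otherwise cover, which makes you declare the type and length of each result up front. The object names are arbitrary.

```r
mylist <- list(x = c(1, 5, 7), y = c(4, 2, 6), z = c(0, 3, 4))

# Two steps: build a list of means, then flatten it into a vector
two_step <- unlist(lapply(mylist, mean))

# One step with sapply()
one_step <- sapply(mylist, mean)

# vapply() throws an error if any result is not a single numeric value
checked <- vapply(mylist, mean, numeric(1))

all.equal(two_step, one_step)  # TRUE
all.equal(one_step, checked)   # TRUE
```

All three give the same named vector here. vapply() is mostly worth the extra typing inside functions or scripts, where a surprise change in output type would be hard to spot.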

This is really great! Anytime you want to do the same thing over and over again, put all those things in a list and then use one of the apply functions. This saves you from writing out a loop, which usually takes more code.

Let’s do another example where we write our own function this time:

#write function to find the span of numbers in a vector and check if it's at least 5
span.fun<-function(x) {(max(x)-min(x))>=5}

#apply that function to the list
sapply(mylist, span.fun)
##     x     y     z 
##  TRUE FALSE FALSE
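
If the cutoff needs to vary, one option is to make it an argument of the function. sapply() and lapply() hand any additional arguments on to the function being applied. The name span.fun2 and its threshold argument are made up here for illustration.

```r
mylist <- list(x = c(1, 5, 7), y = c(4, 2, 6), z = c(0, 3, 4))

# Hypothetical variant of span.fun with an adjustable cutoff
span.fun2 <- function(x, threshold = 5) {
  (max(x) - min(x)) >= threshold
}

# Default cutoff of 5: same result as span.fun
default_cutoff <- sapply(mylist, span.fun2)
default_cutoff

# Arguments given after the function are passed along to it
lower_cutoff <- sapply(mylist, span.fun2, threshold = 4)
lower_cutoff
```

With the cutoff lowered to 4, all three vectors pass, since y and z each span exactly 4.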

Creating a list using lapply()

You don’t need to have a list already created to use lapply() - in fact, lapply() can be used to make a list. That’s because lapply() always returns a list of the same length as whatever you input.

For example, let’s initialize a list to have 2 empty matrices that are size 2x3. We’ll use lapply(): our input is just a vector containing 1 and 2, and the function we specify uses the matrix() function to construct a 2x3 matrix of empty cells for each element of this vector, so it returns a list of two such matrices.

If instead of empty matrices we wanted to fill these matrices with random numbers, we could do that too. Check out both possibilities below.

#initialize list to 2 empty matrices of 2 by 3
list2<-lapply(1:2, function(x) matrix(NA, nrow=2, ncol=3))
list2
## [[1]]
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
#initialize list to 2 matrices with random numbers from normal distribution
list2<-lapply(1:2, function(x) matrix(rnorm(6, 10, 1), nrow=2, ncol=3))
list2
## [[1]]
##           [,1]      [,2]     [,3]
## [1,]  9.467982  9.794397 10.52168
## [2,] 10.022561 10.179758 10.47954
## 
## [[2]]
##          [,1]     [,2]     [,3]
## [1,] 7.990455 10.95596 11.94031
## [2,] 8.952418 10.97080 11.24791
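
In both calls above, the function ignores its argument x, so the input vector only controls how many matrices get made. Here is a small sketch of two variations: actually using the index inside the function, and calling set.seed() first so the random matrices can be reproduced. The seed value and the names list3 and list4 are arbitrary choices.

```r
# Use the index: matrix i is filled with the value i
list3 <- lapply(1:3, function(i) matrix(i, nrow = 2, ncol = 3))
list3[[3]]

# Fix the seed so the same random draws come back on every run
set.seed(123)
list4 <- lapply(1:2, function(i) {
  matrix(rnorm(6, mean = 10, sd = 1), nrow = 2, ncol = 3)
})

# Check the dimensions of every element (one column per matrix)
sapply(list4, dim)
```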

Again, we can use lapply() or sapply() on this newly created list to get the sum of each column of each matrix:

#input list, output column sums of each matrix into a new list
lapply(list2, colSums)
## [[1]]
## [1] 19.49054 19.97416 21.00121
## 
## [[2]]
## [1] 16.94287 21.92676 23.18822
#input list, output column sums into a matrix (sapply puts each matrix's column sums in its own column)
sapply(list2, colSums)
##          [,1]     [,2]
## [1,] 19.49054 16.94287
## [2,] 19.97416 21.92676
## [3,] 21.00121 23.18822
#to get one row per matrix instead, use the transpose function t():
t(sapply(list2, colSums))
##          [,1]     [,2]     [,3]
## [1,] 19.49054 19.97416 21.00121
## [2,] 16.94287 21.92676 23.18822
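
Another way to get one row per matrix, sketched here as an alternative rather than what this post uses, is to keep the lapply() result and bind its elements by row with do.call() and rbind(). The seed is an arbitrary choice so the example can be rerun.

```r
set.seed(1)
list2 <- lapply(1:2, function(x) matrix(rnorm(6, 10, 1), nrow = 2, ncol = 3))

# Transpose the matrix that sapply() builds
stacked_t <- t(sapply(list2, colSums))

# Or bind the list elements returned by lapply() row by row
stacked_rbind <- do.call(rbind, lapply(list2, colSums))

all.equal(stacked_t, stacked_rbind)  # TRUE
```

The do.call(rbind, ...) pattern is handy more generally, for example when each list element is a data frame with the same columns.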

Practical uses of lists using lapply()

Finally, what are lists good for? I find lists are great when I want to store several multi-dimensional objects in one object, for example to group a bunch of data frames into a list, or to store all my model results in one list. Here’s an example, where I run four linear models for four different outcomes. I want to store all my models in one object.

There are two ways to do this:

  • Use a for() loop and insert the results of each iteration into the list
  • Use lapply()! Faster to write and less code
#create some data
set.seed(2000)
x=rbinom(1000,1,.6)
mydata<-data.frame(trt=x,
                   out1=x*3+rnorm(1000,0,3),
                   out2=x*5+rnorm(1000,0,3),
                   out3=rnorm(1000,5,3),
                   out4=x*1+rnorm(1000,0,8))

head(mydata)
##   trt      out1      out2      out3       out4
## 1   1  1.496148 5.2140842 7.8220283 12.7108382
## 2   0 -1.243485 0.5332667 2.8407921  4.6709677
## 3   1 11.070722 4.6477594 4.6725192  0.4216170
## 4   1  2.681000 1.8717883 0.3333281  0.4401036
## 5   0 -3.459300 0.8945582 3.1010555 -0.2620342
## 6   1 -2.266221 9.1754452 6.4914437  3.0443185

Now I want to regress each of the four outcomes on the trt variable using linear regression and save the results. I’ll do this first as a loop, then using lapply():

#1. Use a loop
#first, initialize the results list
results<-vector("list", 4) 

#now use a loop for each outcome
for(i in 1:4){
  results[[i]]<-lm(mydata[,i+1]~trt, data = mydata) 
}


#2. Or, use lapply in one statement!
results<-lapply(2:5, function(x) lm(mydata[,x]~trt, data = mydata))

In the second case, we are taking the vector c(2,3,4,5) and for each component of this vector, we’re running the model that we describe in the function. We can always name the components of the list as below, and I’ll print out the first two elements:

names(results)<-names(mydata)[2:5]
print(results, max=2)
## $out1
## 
## Call:
## lm(formula = mydata[, x] ~ trt, data = mydata)
## 
## Coefficients:
## (Intercept)          trt  
##      0.1905       2.7707  
## 
## 
## $out2
## 
## Call:
## lm(formula = mydata[, x] ~ trt, data = mydata)
## 
## Coefficients:
## (Intercept)          trt  
##    -0.01892      4.73405  
## 
## 
##  [ reached getOption("max.print") -- omitted 2 entries ]
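
As an alternative to picking columns by position, here is a sketch that loops over the outcome names and builds each formula with reformulate() from base R's stats package. The data are regenerated so the block runs on its own; outcomes and results_by_name are names chosen for illustration.

```r
set.seed(2000)
x <- rbinom(1000, 1, .6)
mydata <- data.frame(trt = x,
                     out1 = x * 3 + rnorm(1000, 0, 3),
                     out2 = x * 5 + rnorm(1000, 0, 3),
                     out3 = rnorm(1000, 5, 3),
                     out4 = x * 1 + rnorm(1000, 0, 8))

# Outcome columns, selected by name rather than by position
outcomes <- c("out1", "out2", "out3", "out4")

# reformulate("trt", response = "out1") builds the formula out1 ~ trt
results_by_name <- lapply(outcomes, function(y) {
  lm(reformulate("trt", response = y), data = mydata)
})
names(results_by_name) <- outcomes

# Same coefficients as the position-based version
results <- lapply(2:5, function(i) lm(mydata[, i] ~ trt, data = mydata))
all.equal(unname(coef(results_by_name[[1]])), unname(coef(results[[1]])))
```

Working by name means the code keeps selecting the right outcomes even if the column order of mydata changes later.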

Why is this a great way to store data? Well, we can keep using the apply() functions, for example to put together all of the treatment effects for each outcome into one matrix:

#extract coefficient and std error for each outcome and store in a matrix
sapply(results, function(x) summary(x)$coefficients[2,1:2])
##                 out1      out2       out3      out4
## Estimate   2.7707490 4.7340543 -0.1344969 1.3293520
## Std. Error 0.1915748 0.1876549  0.1912755 0.5324664
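
A small variation, offered as a sketch: index the coefficient table by row and column name instead of by position, and transpose so that each outcome gets its own row. The data and models are regenerated so the block runs on its own, and effects is a name chosen for illustration.

```r
set.seed(2000)
x <- rbinom(1000, 1, .6)
mydata <- data.frame(trt = x,
                     out1 = x * 3 + rnorm(1000, 0, 3),
                     out2 = x * 5 + rnorm(1000, 0, 3),
                     out3 = rnorm(1000, 5, 3),
                     out4 = x * 1 + rnorm(1000, 0, 8))

results <- lapply(2:5, function(i) lm(mydata[, i] ~ trt, data = mydata))
names(results) <- names(mydata)[2:5]

# Pick the trt row and the two columns of interest by name,
# then transpose: one row per outcome, one column per statistic
effects <- t(sapply(results, function(m) {
  summary(m)$coefficients["trt", c("Estimate", "Std. Error")]
}))
effects
```

Indexing by name is a bit more typing, but it still points at the treatment effect if another term is added to the model.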

You can also easily use other functions like stargazer() (previous post on this function here) to create a quick table of results like so (in LaTeX code):

library(stargazer)
stargazer(results, 
          column.labels=names(results),
          keep.stat=c("rsq","n"),
          dep.var.labels="")