Hi Slawa, thank you for this wonderful post!

In the first several examples, generating sub data frames could be costly in terms of memory when the data set is very large. I think a better way is to work with formulas. We can write a function to do this:

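genForm = function(dep, main, control){
  # Build "dep ~ main + control" from character vectors, using the
  # function's own arguments rather than global variables
  indep = paste(c(main, control), collapse = " + ")
  # Return the result as a formula object
  as.formula(paste(dep, "~", indep))
}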

list.control = c("z", "w")

list.main = c("x")

summary(lm(genForm("y", "x", list.control), data = mydata))

summary(glm(genForm("ybin", "x", list.control), family = binomial, data = mydata))
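As a side note, base R also ships reformulate(), which builds a formula from character vectors directly. Here is a minimal sketch of the same idea; the simulated mydata and the coefficients used to generate y are made up for illustration and are not the data from the post.

```r
# Simulated data for illustration only (hypothetical, not the post's data set)
set.seed(1)
n <- 100
mydata <- data.frame(x = rnorm(n), z = rnorm(n), w = rnorm(n))
mydata$y <- 1 + 2 * mydata$x + 0.5 * mydata$z + rnorm(n)

list.main <- c("x")
list.control <- c("z", "w")

# reformulate() joins the term labels with "+" and attaches the response
form <- reformulate(c(list.main, list.control), response = "y")
print(form)

# The formula object can be passed to lm() like a hand-written one
fit <- lm(form, data = mydata)
summary(fit)
```

Either approach keeps the variable lists in one place, so changing the set of controls only means editing list.control.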

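A regression on a repeatedly used subset of the data can be done this way (the condition is evaluated inside mydata, so x and z refer to its columns):

con1 = expression(x > 2 & z < 3)

summary(glm(genForm("ybin", "x", list.control), family = binomial, data = mydata[eval(con1, mydata),]))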
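To make the stored-condition idea concrete, here is a small sketch that evaluates the condition once against the data frame and reuses the resulting logical vector across two models. The simulated data, the name keep, and the two model formulas are assumptions for illustration only.

```r
# Simulated data for illustration only (hypothetical)
set.seed(42)
n <- 200
mydata <- data.frame(x = runif(n, 0, 5), z = runif(n, 0, 5), w = rnorm(n))
mydata$y <- 1 + 2 * mydata$x - mydata$z + rnorm(n)

# Store the condition unevaluated, then evaluate it with the data frame
# as the environment so that x and z are looked up as columns
con1 <- quote(x > 2 & z < 3)
keep <- eval(con1, mydata)   # logical vector with one entry per row

# Reuse the same row selection in several models
fit1 <- lm(y ~ x + z + w, data = mydata[keep, ])
fit2 <- lm(y ~ x + w, data = mydata[keep, ])

# Number of rows that satisfied the condition
sum(keep)
```

Without the second argument to eval(), R would look for x and z in the calling environment instead of in mydata.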

And looping over outcome variables:

lapply(c("y", "ybin"), function(outcome) summary(lm(genForm(outcome, "x", list.control), data = mydata)))
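One small refinement when looping: naming the input vector makes the results easier to pick apart afterwards. A sketch under assumed data follows; the simulated mydata, the way ybin is derived from y, and the name fits are all hypothetical.

```r
# Simulated data for illustration only (hypothetical)
set.seed(7)
n <- 150
mydata <- data.frame(x = rnorm(n), z = rnorm(n), w = rnorm(n))
mydata$y <- 1 + 2 * mydata$x + rnorm(n)
mydata$ybin <- as.integer(mydata$y > median(mydata$y))

outcomes <- c("y", "ybin")

# setNames() labels each element, so the returned list is named by outcome
fits <- lapply(setNames(outcomes, outcomes), function(outcome) {
  lm(reformulate(c("x", "z", "w"), response = outcome), data = mydata)
})

# Pull the coefficient on x out of every fitted model
sapply(fits, function(f) unname(coef(f)["x"]))
```

Keeping the fitted models rather than only their summaries means you can still call coef(), confint(), or summary() on each one later.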

Hi Yimeng, thanks for your reply! Those are great suggestions. I try to use the shortest possible code but you're right about needing to think about memory, especially with large data sets, which a lot of people work with. Thanks for your contribution!

Actually, there is an even more elegant way to subset on rows within most lm-like model-fitting functions:

summary(lm(ybin ~ xvars.sub, data = data.sub, subset = x > 2 & z < 3))
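As a quick check of the subset argument, the sketch below fits the same model twice, once with subset and once with rows selected by hand, and compares the coefficients. The simulated data and the names fit_arg and fit_manual are assumptions for illustration.

```r
# Simulated data for illustration only (hypothetical)
set.seed(123)
n <- 200
mydata <- data.frame(x = runif(n, 0, 5), z = runif(n, 0, 5), w = rnorm(n))
mydata$y <- 1 + 2 * mydata$x - mydata$z + rnorm(n)

# subset is evaluated inside data, so bare column names work
fit_arg <- lm(y ~ x + z + w, data = mydata, subset = x > 2 & z < 3)

# The same rows selected by hand before fitting
rows <- mydata$x > 2 & mydata$z < 3
fit_manual <- lm(y ~ x + z + w, data = mydata[rows, ])

# Both fits should agree on the coefficients
all.equal(coef(fit_arg), coef(fit_manual))
```

The subset form avoids creating and naming an intermediate data frame in your workspace, which also keeps the code shorter.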