Day 9: The LSR package has some deficiencies
by Danielle Navarro, 05 May 2018
This is a very different style of post to the previous ones. The lsr package is one that I’m very familiar with, having written it to accompany my introductory stats lecture notes, Learning Statistics with R, which I’ve now released under the CC BY-SA 4.0 licence in case anyone has any desire to reuse or adapt them!
The package does have some nice features for teaching purposes, but it also has some flaws. Some of those flaws exist because the code and the notes are old. I started writing in late 2010. Not only was the tidyverse not a thing the way it is now, but RStudio itself hadn’t reached the 1.0 release and was still undergoing a lot of substantial design changes. If I were writing the package (and the notes!) from scratch now, I would do it very differently. That being said, there are still some features to it that I like. In this post, I’m going to revisit the
lsr package and comment on what I like about it, what I think is bad about it, and how I’d do things differently if I ever had the time to revisit it properly!
Broadly speaking, I’d group the functions into the following categories:
- Wrapper functions for introductory statistics concepts
- Implementations of effect size measures that I couldn’t find elsewhere
- Miscellaneous functions for data manipulation, viewing, and plotting
In terms of how I think of them now, I’d group them into
- Things that I still think are pedagogically useful
- Things that are kind of nice but I wish someone smarter than me would implement
- Things that are now pointless because there are much better tools
Things I like
I’ve spent a lot of time teaching introductory statistics in R, generally to large undergraduate classes of psychology students with no programming or statistical background. For this audience, I’m not a fan of the way some of the analyses for classical hypothesis tests are reported. Take the t.test function, for instance. It’s a single function that allows you to run a one-sample t-test, a paired-samples t-test or an independent-samples t-test depending on the input you give it. For someone who knows how a t-test works, this can be nice: there’s just one function that does everything. For a novice, it’s a nightmare. What I have found is that students take time to really grasp the differences between the different t-tests, and bundling all the tests into one function doesn’t help them learn. With that in mind, I made the decision in the lsr package to split this into three wrapper functions – oneSampleTTest, pairedSamplesTTest and independentSamplesTTest – to help students be clear in their own mind what test they are running. (I did something similar with categorical data analysis, writing associationTest and goodnessOfFitTest as separate wrappers for chisq.test, depending on whether one is intending to run a test of association or a goodness-of-fit test.) Similarly, when I wrote the wrapper functions, I made the output much more verbose. So, for instance, take the AFL data set:
# load some data
afl <- read.csv(file = "https://djnavarro.net/files/afl24.csv")
afl$margin <- abs(afl$home.score - afl$away.score)
head(afl)
##         home.team away.team home.score away.score year round weekday day
## 1 North Melbourne  Brisbane        104        137 1987     1     Fri  27
## 2 Western Bulldogs  Essendon        62        121 1987     1     Sat  28
## 3         Carlton  Hawthorn        104        149 1987     1     Sat  28
## 4     Collingwood    Sydney         74        165 1987     1     Sat  28
## 5       Melbourne   Fitzroy        128         89 1987     1     Sat  28
## 6        St Kilda   Geelong        101        102 1987     1     Sat  28
##   month is.final              venue attendance margin
## 1     3    FALSE                MCG      14096     33
## 2     3    FALSE      Waverley Park      22550     59
## 3     3    FALSE       Princes Park      19967     45
## 4     3    FALSE      Victoria Park      17129     91
## 5     3    FALSE                MCG      18012     39
## 6     3    FALSE Gold Coast Stadium      15867      1
Let’s suppose I want to run a t-test to see if the margins for games during the home and away season tend to be different than during the final series. I might call the
t.test function like this:
t.test( formula = margin ~ is.final, data = afl )
##
## 	Welch Two Sample t-test
##
## data:  margin by is.final
## t = -0.27956, df = 218.21, p-value = 0.7801
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.571427  3.435666
## sample estimates:
## mean in group FALSE  mean in group TRUE
##            36.17212            36.74000
This output is informative, but terse, and I didn’t find it easy to teach to psychology undergraduates. With that in mind, the output from the
lsr wrapper function is more verbose, and is more explicit about what the hypothesis test entails:
independentSamplesTTest( formula = margin ~ is.final, data = afl )
## Warning in independentSamplesTTest(formula = margin ~ is.final, data =
## afl): group variable is not a factor
##
## Welch's independent samples t-test
##
## Outcome variable:   margin
## Grouping variable:  is.final
##
## Descriptive statistics:
##              FALSE    TRUE
##    mean     36.172  36.740
##    std dev. 27.602  28.072
##
## Hypotheses:
##    null:        population means equal for both groups
##    alternative: different population means in each group
##
## Test results:
##    t-statistic:  -0.28
##    degrees of freedom:  218.208
##    p-value:  0.78
##
## Other information:
##    two-sided 95% confidence interval:  [-4.571, 3.436]
##    estimated effect size (Cohen's d):  0.02
To my mind, this output is a lot easier for a novice to understand. Whatever else might be wrong with the lsr package (and there are a few things 😀) I am really pleased with the wrapper functions. While I might want to tinker with them, I genuinely feel that the t-test wrappers, associationTest, goodnessOfFitTest and posthocPairwiseT all serve useful pedagogical purposes.
More wrapper functions!
Many (most?) of the miscellaneous functions are probably useless at this point. In almost all cases there is a better tidyverse option, or some functionality in RStudio that makes them entirely redundant. There are a few, though, that I still think of fondly. For example, the expandFactors function provides a wrapper around model.matrix and returns a data frame that replaces all factors in the original data frame with the corresponding contrasts for linear regression. Here’s the example from the documentation:
grading <- data.frame(
  teacher = factor( c("Amy","Amy","Ben","Ben","Cat") ),
  gender = factor( c("male","female","female","male","male") ),
  grade = c(75,80,45,50,65)
)
grading
##   teacher gender grade
## 1     Amy   male    75
## 2     Amy female    80
## 3     Ben female    45
## 4     Ben   male    50
## 5     Cat   male    65
grading2 <- expandFactors( grading )
##   teacherBen teacherCat gendermale grade
## 1          0          0          1    75
## 2          0          0          0    80
## 3          1          0          0    45
## 4          1          0          1    50
## 5          0          1          1    65
Annoyingly, there’s a bug in the expandFactors function: I never deleted a line of debugging code, so the function prints extra output to the console that it shouldn’t.
The bug notwithstanding, I still like this idea. I want my students to be able to understand how factors get recoded as a set of contrasts for a linear model, but most of them aren’t at the point where they can work comfortably with the various contrast functions. From a pedagogical perspective, I think this is handy.
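For anyone curious what that recoding looks like without lsr, here’s a minimal sketch of the same idea using base R’s model.matrix directly. This is my own reconstruction of the behaviour, not the package’s code, though the column names happen to match the output above:

```r
# Rebuild the grading example, then let model.matrix do the
# factor-to-contrast recoding that expandFactors wraps
grading <- data.frame(
  teacher = factor(c("Amy", "Amy", "Ben", "Ben", "Cat")),
  gender  = factor(c("male", "female", "female", "male", "male")),
  grade   = c(75, 80, 45, 50, 65)
)

# model.matrix applies the default treatment contrasts to each factor;
# the first column is the intercept, which we drop to mimic expandFactors
mm <- model.matrix(~ teacher + gender, data = grading)
grading2 <- data.frame(mm[, -1], grade = grading$grade)
grading2
```

The nice thing about seeing it this way is that the contrast coding stops being magic: each non-reference factor level simply becomes its own 0/1 column.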
Things that need to be implemented better
At the time I started writing the LSR book, I really struggled to find functions that would compute standard measures of effect size in the format in which they are typically taught to undergraduates. These days, I’m sure if I asked around on #Rstats twitter I’d get a lot of wonderful assistance! But at the time either that didn’t exist or I didn’t know about it – I was the only person I knew who was actively using R for anything, so I was pretty isolated 😢 – and I tried to implement them myself. The lsr package has functions like cohensD, etaSquared and standardCoefs. In truth I’m not entirely sure I agree with any of these as the best way to think about effect sizes (standardised coefficients for regression models strike me as meaningless in a lot of situations, and eta-squared has some flaws), but the curriculum I had to teach really did need these measures, and I couldn’t find anything online about them, so there didn’t seem to be much of an option other than to write my own functions.
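To make the eta-squared idea concrete, here’s a hand-rolled version of the measure as it’s usually taught to undergraduates – the proportion of total variance attributable to the effect, SS_effect / SS_total – computed on made-up data. This is a sketch of the concept, not lsr’s etaSquared implementation:

```r
# Simulated one-way design: three groups with different population means
set.seed(1)
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 10)),
  score = c(rnorm(10, mean = 0), rnorm(10, mean = 0.5), rnorm(10, mean = 1))
)

# Fit the one-way ANOVA and pull out the sums of squares
fit <- aov(score ~ group, data = dat)
ss  <- summary(fit)[[1]][["Sum Sq"]]   # c(SS_group, SS_residual)

# Eta-squared: effect sum of squares over total sum of squares
eta_sq <- ss[1] / sum(ss)
eta_sq
```

For a one-way design this is the whole story; the complications (and the disagreements) start once you have multiple predictors and have to decide between eta-squared and partial eta-squared.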
What bothers me most is that I don’t know how much I trust the implementation of these functions. The etaSquared function in particular doesn’t work when the weights argument to the aov function is used, and I haven’t been able to work up the motivation to revisit it. I wouldn’t advise anyone to use this function until it does get fixed, but I have a lot of unresolved anxiety associated with this package - or more precisely, about submitting it to CRAN - so for the sake of my own sanity I’m not really all that keen to revisit it.
Things that were always a bad idea…
- bars: This was never actually used in the book, as far as I can recall, and it’s got some quirky behaviour. It would be better replaced with a proper plotting package like ggplot2.
- importList: Doesn’t really serve much of a purpose. There was a use case for this in one specific class I ran, but realistically I don’t think it needs to be there.
- modeOf and maxFreq: These are descriptive statistics calculators that I really didn’t want to have at all, but felt obligated to include because of curriculum constraints. Now that I’m not teaching that class, I don’t feel especially pleased with them.
Things that can be done better in other ways…
Several of the functions in the package are unnecessary thanks to RStudio:
- rmAll: I wrote this function because at the time I wasn’t sure what options RStudio was going to give for novice users. The fact that there’s a button you can press to clear all objects from the workspace makes this unnecessary.
- unlibrary: I wrote this wrapper because the detach function is a little complicated for total novices. It’s unnecessary now because RStudio has checkbox options for loading and unloading packages.
- who: Inspired by the Matlab function of the same name, this lists the objects in the workspace in a little more detail than the objects function. Again, RStudio makes this unnecessary because you can get the same thing in a prettier form through the Environment tab.
The remaining functions are very simple data manipulation functions. They all have analogs in other packages (especially in the tidyverse) and probably don’t need to exist.
- colCopy and rowCopy are both similar to replicate, but are slightly easier for novices; they were useful in some teaching contexts, but not amazingly helpful in practice.
- permuteLevels allows you to reorder the levels of a factor. The forcats package is probably a better option for this functionality.
- longToWide and wideToLong are both wrappers to the reshape function, aimed to be easier for novices to use. I’d probably suggest using the tidyr package instead.
- quantileCut allows you to split a variable into groups of approximately equal size; it’s analogous to cut, but does so by quantile rather than by values on the raw scale. I suppose this is useful in some situations?
- tFrame transposes a data frame when this is possible to do. The use case for this is very specific, and honestly I have only rarely found occasion to use it.
- sortFrame allows you to sort a data frame by the values of one or more columns. That’s handy, but I’m pretty sure that functionality exists in quite a lot of places.
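For what it’s worth, a couple of these are easy enough to approximate in base R alone. Here’s a sketch of quantileCut-style and sortFrame-style behaviour, reconstructed from the descriptions above rather than from lsr’s actual code:

```r
# quantileCut-style split: carve x into three groups of roughly equal
# size by giving cut() quantile-based breakpoints
x <- c(2, 11, 5, 9, 1, 14)
groups <- cut(
  x,
  breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
  include.lowest = TRUE
)
table(groups)   # three groups, two observations each

# sortFrame-style sorting with base R's order()
df <- data.frame(
  team   = c("Carlton", "Sydney", "Geelong"),
  margin = c(45, 91, 1)
)
df[order(df$margin), ]   # rows sorted by margin, ascending
```

In a tidyverse setting, dplyr::arrange(df, margin) does the same sort, and dplyr::ntile(x, 3) gives equal-sized quantile groups directly.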
Is the package worth fixing?
I’m not sure. I feel like if I were writing my lecture notes from scratch today I’d never have bothered writing any of the data manipulation functions. There are better tidyverse or RStudio-based options for all of them, and I don’t see that they serve any purpose. The wrapper functions for tests I still like and would want to keep, but those work just fine as they are, and don’t really need updating. There are some bug fixes I wish I could find time for, but if I’m being perfectly honest my interactions with people when trying to submit packages to CRAN have been… unpleasant. Enough so that I have never been able to work up the motivation to revisit this.
I suppose what I could do is take the functions that I do like and push them to a GitHub repository. 🤷 Maybe I’ll do that someday. Probably not though. All the code for the
lsr package is already on GitHub here, so if anyone ever wanted to do a better job of this they’d be welcome to take the code and do it!
To be honest I feel burned enough by this that I find it hard to think about the
lsr package without anxiety, and I have enough stress in my life as it is. It was hard just looking at the package long enough to write this post, and I’m feeling kind of worn down. I’ve documented my reservations about the package - and the
etaSquared function in particular - and at this point I’m done with it. At least until my anxiety about CRAN goes away, I suppose.
…well, this post got unexpectedly gloomy at the end. I’m a bit tired today! Sorry about that! Next post I’ll try to be more cheerful, I promise 😄