Image: Danielle Navarro

Day 9: The LSR package has some deficiencies

This is a very different style of post to the previous ones. The lsr package is one that I’m very familiar with, having written it to accompany my introductory stats lecture notes Learning Statistics with R that I’ve now released under the CC BY-SA 4.0 licence in case anyone has any desire to reuse or adapt them!

The package does have some nice features for teaching purposes, but it also has some flaws. Some of those flaws exist because the code and the notes are old. I started writing in late 2010. Not only was the tidyverse not a thing the way it is now, but RStudio itself hadn’t reached the 1.0 release and was still undergoing a lot of substantial design changes. If I were writing the package (and the notes!) from scratch now, I would do it very differently. That being said, there are still some features to it that I like. In this post, I’m going to revisit the lsr package and comment on what I like about it, what I think is bad about it, and how I’d do things differently if I ever had the time to revisit it properly!

Broadly speaking, I’d group the functions into the following categories:

  • Wrapper functions for introductory statistics concepts: associationTest, ciMean, correlate, goodnessOfFitTest, independentSamplesTTest, oneSampleTTest, pairedSamplesTTest, posthocPairwiseT
  • Implementations of effect size measures that I couldn’t find elsewhere: cohensD, cramersV, etaSquared, standardCoefs
  • Miscellaneous functions for data manipulation, viewing, an plotting bars, colCopy, expandFactors, importList, longToWide, permuteLevels, quantileCut, rmAll, rowCopy, sortFrame, tFrame, unlibrary, who, wideToLong, modeOf, maxFreq, aad,

In terms of how I think of them now, I’d group them into

  • Things that I still think are pedagogically useful
  • Things that are kind of nice but I wish someone smarter than me would implement
  • Things that are now pointless because there are much better tools

Things I like

Wrapper functions!

I’ve spent a lot of time teaching introductory statistics in R, generally to large undergraduate classes of psychology students with no programming or statistical background. For this audience, I’m not a fan of the way some of the analyses for classical hypothesis tests are reported. Take the t.test function for instance. It’s a single function that allows you to run a one-sample t-test, a paired-samples t-test or an independent-samples t-test depending on the input you give it. For someone who knows how a t-test works, this can be nice. There’s just one function that does everything. For a novice, it’s a nightmare. What I have found is that students take time to really grasp the differences between the different t-tests, and bundling all the tests into one function doesn’t help them learn. With that in mind, I made the decision in the lsr package to split this into three wrapper functions independentSamplesTTest, pairedSamplesTTest and oneSampleTTest to help students be clear in their own mind what test they are running. (I did something similar with categorical data analysis, writing separate wrappers for chisq.test depending on whether one is intending to run an associationTest or a goodnessOfFit). Similarly, when I wrote the wrapper functions, I made the output much more verbose. So, for instance, take the AFL data set:

# load some data
afl <- read.csv(file="https://djnavarro.net/files/afl24.csv")
afl$margin <- abs(afl$home.score - afl$away.score)
head(afl)
##          home.team away.team home.score away.score year round weekday day
## 1  North Melbourne  Brisbane        104        137 1987     1     Fri  27
## 2 Western Bulldogs  Essendon         62        121 1987     1     Sat  28
## 3          Carlton  Hawthorn        104        149 1987     1     Sat  28
## 4      Collingwood    Sydney         74        165 1987     1     Sat  28
## 5        Melbourne   Fitzroy        128         89 1987     1     Sat  28
## 6         St Kilda   Geelong        101        102 1987     1     Sat  28
##   month is.final              venue attendance margin
## 1     3    FALSE                MCG      14096     33
## 2     3    FALSE      Waverley Park      22550     59
## 3     3    FALSE       Princes Park      19967     45
## 4     3    FALSE      Victoria Park      17129     91
## 5     3    FALSE                MCG      18012     39
## 6     3    FALSE Gold Coast Stadium      15867      1

Let’s suppose I want to run a t-test to see if the margins for games during the home and away season tend to be different than during the final series. I might call the t.test function like this:

t.test(
  formula = margin ~ is.final,
  data = afl
)
## 
##  Welch Two Sample t-test
## 
## data:  margin by is.final
## t = -0.27956, df = 218.21, p-value = 0.7801
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.571427  3.435666
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            36.17212            36.74000

This output is informative, but terse, and I didn’t find it easy to teach to psychology undergraduates. With that in mind, the output from the lsr wrapper function is more verbose, and is more explicit about what the hypothesis test entails:

independentSamplesTTest(
  formula = margin ~ is.final,
  data = afl
)
## Warning in independentSamplesTTest(formula = margin ~ is.final, data =
## afl): group variable is not a factor
## 
##    Welch's independent samples t-test 
## 
## Outcome variable:   margin 
## Grouping variable:  is.final 
## 
## Descriptive statistics: 
##              FALSE   TRUE
##    mean     36.172 36.740
##    std dev. 27.602 28.072
## 
## Hypotheses: 
##    null:        population means equal for both groups
##    alternative: different population means in each group
## 
## Test results: 
##    t-statistic:  -0.28 
##    degrees of freedom:  218.208 
##    p-value:  0.78 
## 
## Other information: 
##    two-sided 95% confidence interval:  [-4.571, 3.436] 
##    estimated effect size (Cohen's d):  0.02

To my mind, this output is a lot easier for a novice to understand. Whatever else might be wrong with the lsr package (and there are a few things 😀) I am really pleased with the wrapper functions. While I might want to tinker with them, I genuinely feel that associationTest, ciMean, correlate, goodnessOfFitTest, independentSamplesTTest, oneSampleTTest, pairedSamplesTTest and posthocPairwiseT all serve useful pedagogical purposes

More wrapper functions!

Many (most?) of the miscellaneous functions are probably useless at this point. In almost all cases there is a better tidyverse option, or some functionality in RStudio that makes them entirely redundant. There are a few, though, that I still think of fondly. For example, the expandFactors function provides a wrapper function into model.matrix and returns a data frame that replaces all factors in the original data frame with the corresponding contrasts for linear regresson. Here’s the example from the documentation:

grading <- data.frame( 
  teacher = factor( c("Amy","Amy","Ben","Ben","Cat") ), 
  gender = factor( c("male","female","female","male","male") ),
  grade = c(75,80,45,50,65) 
)
grading
##   teacher gender grade
## 1     Amy   male    75
## 2     Amy female    80
## 3     Ben female    45
## 4     Ben   male    50
## 5     Cat   male    65
grading2 <- expandFactors( grading )
grading2
##   teacherBen teacherCat gendermale grade
## 1          0          0          1    75
## 2          0          0          0    80
## 3          1          0          0    45
## 4          1          0          1    50
## 5          0          1          1    65

Annoyingly, there’s a bug in the expandFactors function in which I never deleted a line of debugging code (there’s a print statement that shouldn’t be there)

The bug notwithstanding, I still like this idea. I want my students to be able to understand how factors get recoded as a set of contrasts for a linear model, but most of them aren’t at the point where they can work comfortably with the various contrast functions. From a pedagogical perspective, I think this is handy.

Things that need to be implemented better

At the time I started writing the LSR book, I really struggled to find functions that would compute standard measures of effect size in the format that they are typically taught to undergraduates. These days, I’m sure if I asked around on #Rstats twitter I’d get a lot of wonderful assistance! But at the time either that didn’t exist or I didn’t know about it – I was the only person I knew who was actively using R for anything so I was pretty isolated 😢, and I tried to implement them myself.

So the lsr package has functions cohensD, cramersV, etaSquared, standardCoefs. In truth I’m not entirely sure I agree with any of these as the best way to think about effect sizes (standardised coefficients for regression models strike me as meaningless in a lot of situations, and eta-squared has some flaws), but the curriculum I had to teach to really did need these measures, I couldn’t find anything online about them, so there didn’t to be much of an option for it other than to write my own functions.

What bothers me most, is that I don’t know how much I trust the implementation of these functions. The etaSquared function in particular doesn’t work when the weights argument to the aov function is used, and I haven’t been able to work up the motivation to revisit it. I wouldn’t advise anyone using this function until it does get fixed, but I have a lot of unresolved anxiety associated with this package - or more precisely, about submitting it to CRAN - so for the sake of my own sanity I’m not really all that keen to revisit it.

Things that were always a bad idea…

  • bars: This was never actually used in the book, as far as I can recall, and it’s got some quirky behaviour. It should be replaced with a call to ggplot
  • importList: Doesn’t really serve much of a purpose. There was a use case for this in one specific class I ran, but realistically I don’t think this needs to be there
  • aad, modeOf, maxFreq: These are descriptive statistics calculators that really I didn’t want to have at all, but I felt obligated to include because of curriculum constraints. Now that I’m not teaching that class, I really don’t feel especially pleased with them

Things that can be done better in other ways…

Several of the functions in the package are unnecessary thanks to RStudio:

  • rmAll: I wrote this function because at the time I wasn’t sure what options RStudio was going to give for novice users. The fact there’s a button you can press to clear all objects from the workspace makes this unnecessary.
  • unlibrary: I wrote this wrapper because the detach function is a little complicated for total novices. It’s unnecessary now because RStudio has check box options for loading and unloading packages.
  • who: Inspired by the Matlab function of the same name, this lists the objects in the workspace in a little more detail than the objects function. Again, RStudio makes this unnecessary because you can get the same thing in a prettier form through the Environment tab.

The remaining functions are very simple data manipulation functions. They all have analogs in other packages (especially in the tidyverse) and probably don’t need to exist.

  • colCopy and rowCopy are both similar to replicate, but are slightly easier for novices; they were useful in some teaching contexts, but not amazingly helpful in practive
  • permuteLevels allows you to reorder the levels of a factor. The forcats package is probably a better option for this functionality
  • longToWide and wideToLong are both wrappers to the reshape function aimed to be easier for novices to use. I’d probably suggest using gather and spread from the tidyr package.
  • quantileCut allows you to split a variable into groups of approximately equal size; it’s analogous to cut but does so by quantile rather than by values on the raw scale. I suppose this is useful in some situations?
  • tFrame transposes a data frame when this is possible to do. The use case for this is very specific, and honestly I have only rarely found occasion to use it.
  • sortFrame allows you to sort a data frame by the values on one or more columns. That’s handy, but I’m pretty sure that functionality exists in quite a lot of places.

Is the package worth fixing?

I’m not sure. I feel like if I were writing my lecture notes from scratch today I’d never have bothered writing any of the data manipulation functions. There are better tidyverse or RStudio based options for all of them; and I don’t see that they serve any purpose. The wrapper functions for tests I still like and would want to keep, but those work just fine as they are, and don’t really need updating. There are some bug fixes I wish I could find time for, but if I’m being perfectly honest my interactions with people when trying to submit packages to CRAN have been… unpleasant. Enough so that I have never been able to work up the motivation to revisit this.

I suppose what I could do is take the functions that I do like and push them to a GitHub repository. 🤷 Maybe I’ll do that someday. Probably not though. All the code for the lsr package is already on GitHub here, so if anyone ever wanted to do a better job of this they’d be welcome to take the code and do it!

To be honest I feel burned enough by this that I find it hard to think about the lsr package without anxiety, and I have enough stress in my life as it is. It was hard just looking at the package long enough to write this post, and I’m feeling kind of worn down. I’ve documented my reservations about the package - and the etaSquared function in particular - and at this point I’m done with it. At least until my anxiety about CRAN goes away, I suppose.

…well, this post got unexpectedly gloomy at the end. I’m a bit tired today! Sorry about that! Next post I’ll try to be more cheerful I promise 😄

Avatar
Danielle Navarro
Associate Professor of Cognitive Science

Related