Girl you gotta wonder ‘bout a test like that

by Danielle Navarro, 20 May 2019

I don’t really wanna wake you
I just came to get my things
But the pretty lil’ thing lying there beside you
Better take off my wedding ring
Hey, girl, do you really wanna do this
You don’t know what you’re stepping in
He’s got more where that came from
You’re not so special in the end
   – Gin Wigmore, Man Like That

Consider the oddity of the Student t-test. It’s not a particularly flexible tool. It assumes a single predictor that can be represented as a binary-valued variable, relies on a theory of inference that reduces the scientific goal to a binary yes-no decision, and on top of that applies a half-reasonable assumption (normality) and an absurd one (homogeneity of variance), both of which are ancillary to the research question a scientist typically asks. But it’s been around since the early 20th century. Everyone teaches it and everyone uses it, so as the author of an introductory statistics text it’s hard to escape. Still, if I am going to discuss t-tests in a revised version of Learning Statistics with R, I’d like to do it in a way that makes sense.


Desiderata for t-test functions

If the book is to be accompanied by an R package, one of the first questions I find myself asking is whether or not that package should contain any functions to run t-tests. It’s never a good idea to reinvent the wheel, so maybe I should just stick to the t.test() function in base R? That’s clearly a bad idea: as I mentioned in the last post the t.test() function doesn’t play nicely with tidyverse and the output is pretty much incomprehensible if you don’t already understand how Neyman-Pearson tests operate. As a teaching tool, t.test() has little value to me. Learning Statistics with R is notionally written for a new learner who has no previous experience with statistics, R, or programming, and at the point in the text where I will be introducing t-tests in the revised book, my notional reader is still learning how null hypothesis testing actually works. From that point of view, the functions I want them to be using should support this learning goal.

A more interesting comparison would be to consider the infer package, which includes a clean and simple t_test() function that plays nicely with the pipe. However, as much as I like the infer package, I don't think it will serve my purposes in Learning Statistics with R either: it is a function that is friendly to someone who already knows tidyverse and who understands null hypothesis testing, but not one that is ideal for my pedagogical purposes.

To illustrate why I think this, let's get some data to work with. As I write this post my playlist includes two songs about infidelity, Man Like That by Gin Wigmore and You Know I'm No Good by Amy Winehouse, and they feel rather different in emotional flavour, so I'll put together a data set to let me test this1 2. To construct the data set, I do this:

cheatsongs <- bind_rows(
  "Gin Wigmore" = genius_lyrics("gin wigmore", "man like that"),
  "Amy Winehouse" =  genius_lyrics("amy winehouse", "you know i'm no good"),
  .id = "artist")

This only imports the lyrics themselves, and I want some measure of the overall emotional tone. So, in the venerable psychological tradition of measuring something only tangentially connected to the thing you’re really interested in because that’s the only thing you know how to measure, I’ll use the sentimentr package to estimate the valence of each line:

cheatsongs <- cheatsongs %>% mutate(
  valence = sentiment(lyric) %>% pull(sentiment))
## # A tibble: 111 x 5
##    artist     track_title   line lyric                              valence
##    <chr>      <chr>        <dbl> <chr>                                <dbl>
##  1 Gin Wigmo… Man Like Th…     1 I don't really wanna wake you        0    
##  2 Gin Wigmo… Man Like Th…     2 I just came to get my things         0    
##  3 Gin Wigmo… Man Like Th…     3 But the pretty lil' thing lying t…   0.188
##  4 Gin Wigmo… Man Like Th…     4 Better take off my wedding ring      0.327
##  5 Gin Wigmo… Man Like Th…     5 Hey, girl, do you really wanna do…   0    
##  6 Gin Wigmo… Man Like Th…     6 You don't know what you're steppi…   0    
##  7 Gin Wigmo… Man Like Th…     7 He's got more where that came from   0    
##  8 Gin Wigmo… Man Like Th…     8 You're not so special in the end    -0.302
##  9 Gin Wigmo… Man Like Th…     9 I'm messing up the place            -0.447
## 10 Gin Wigmo… Man Like Th…    10 Kicking down the door                0    
## # … with 101 more rows

Okay, now that we have some data, we could use the infer package:

cheatsongs %>% t_test(
  formula = valence ~ track_title, 
  order = c("Man Like That", "You Know I'm No Good"))
## # A tibble: 1 x 6
##   statistic  t_df      p_value alternative lower_ci upper_ci
##       <dbl> <dbl>        <dbl> <chr>          <dbl>    <dbl>
## 1      6.02  89.5 0.0000000378 two.sided      0.197    0.392

There’s a lot to like about this. The infer::t_test() function follows tidyverse principles and the output takes the form of a simple table. For an experienced R user who codes in the tidyverse style, it is helpful to receive the results as a tibble whose structure is exposed transparently, rather than hidden behind a fancy print method; fancy printing isn’t useful once you know what you’re doing.

But to a novice? Who doesn’t already know what the t-test does? Who is running this analysis to assist their learning about what a null hypothesis test is? Not so much. At this point in the book, my hypothetical reader is expected to be somewhat familiar with R but not yet comfortable and fluent with the language, and completely unfamiliar with null hypothesis tests. The output from infer::t_test() really isn’t ideal for that purpose, and I’ve yet to find anything that provides the kind of output that I want to give the reader at this point in the book. So, to serve my purposes, the tidylsr package uses three separate functions: ttest_onesample(), ttest_twosample() and ttest_paired(). With that in mind, I’ll talk about these three functions in this post, in the hope that if someone has a better suggestion for how to approach this they might let me know!

One sample tests

The simplest of the three functions is the one sample test. Let’s suppose that our analyst wants to ask the question “is the valence of Man Like That less than zero?” This question maps naturally3 to a one-sided one-sample test, and our analysis pipeline might look like this:

cheatsongs %>% 
  filter(track_title == "Man Like That") %>%
  ttest_onesample(
    outcome = valence,
    null_mean = 0,
    test_greater = FALSE)

One reason to structure the code like this is to highlight the preprocessing steps involved in data analysis. Even for the simplest analyses there are often data wrangling operations that we have to do, and I think it’s good practice for those steps to be easily visible in the code. In this function, the outcome argument specifies which variable in the data frame contains the data to be tested (note that the data have been preprocessed to include only the Man Like That lyrics, and for consistency with most tidyverse analyses the outcome is a quoted argument – the user doesn’t have to place it in quotes). The null_mean argument specifies the fixed value of the mean against which the data are to be compared.

Finally – and somewhat inelegantly – the test_greater argument requires the user to indicate if the alternative hypothesis specifies that the population mean is greater (test_greater = TRUE), less (test_greater = FALSE) or neither (test_greater = NULL, the default value, indicates a two-sided test).4
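For readers who already know base R, it may help to note that test_greater is playing the same role as the alternative argument in stats::t.test(). A minimal sketch of the correspondence, using simulated data rather than the lyrics:

```r
# Sketch of the test_greater -> t.test(alternative = ...) correspondence.
# Simulated stand-in data; none of these numbers come from the lyrics.
set.seed(1)
x <- rnorm(30, mean = 0.2, sd = 0.3)

out_less    <- t.test(x, mu = 0, alternative = "less")       # test_greater = FALSE
out_greater <- t.test(x, mu = 0, alternative = "greater")    # test_greater = TRUE
out_two     <- t.test(x, mu = 0, alternative = "two.sided")  # test_greater = NULL

# the t-statistic is identical in all three; only the p-value and CI change
unname(out_less$statistic) == unname(out_two$statistic)
```

A handy sanity check is that the two one-sided p-values always sum to one.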

Here’s what the output looks like:

##     One sample t-test 
## Variables: 
##     outcome:  valence 
## Descriptives: 
##     sample      mean        sd
##    valence     0.224     0.261
## Hypotheses: 
##     null:         population mean greater than or equal to 0 
##     alternative:  population mean less than 0 
## Test results: 
##     t-statistic:         7.166 
##     degrees of freedom:  69 
##     p-value:             1 
## 95% confidence interval:
##     lower bound:  -Inf 
##     upper bound:  0.276

What I like about this output:

  • It is verbose
  • It spells out the structure of the hypothesis test
  • It includes relevant descriptive statistics
  • It separates the hypothesis test from the confidence interval
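For what it’s worth, every number in a printout like this can be reproduced with a few lines of arithmetic. A sketch using simulated data as a stand-in for the valence scores (so the numbers themselves will differ from those above):

```r
# Hand-computing a one-sided one-sample t-test; simulated stand-in data.
set.seed(42)
x  <- rnorm(70, mean = 0.2, sd = 0.25)
mu <- 0  # the null_mean

se     <- sd(x) / sqrt(length(x))      # standard error of the mean
t_stat <- (mean(x) - mu) / se          # t-statistic
df     <- length(x) - 1                # degrees of freedom
p_val  <- pt(t_stat, df)               # one-sided p ("less than" alternative)
upper  <- mean(x) + qt(0.95, df) * se  # upper bound of the one-sided 95% CI

# same answers as base R
ref <- t.test(x, mu = mu, alternative = "less")
```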

Two sample tests

A more typical kind of research hypothesis might be “do the two songs differ in emotional tone?”, and in that case we might consider a two-sample test. When writing the ttest_twosample() function I found myself facing something of a dilemma. In the textbook, my plan is to introduce one sample tests first, then two-sample tests, and finally paired-sample tests. To keep things as clean as possible in the exposition, I really want to be able to call ttest_twosample() using arguments that are as close as possible to the arguments used in ttest_onesample(). To my mind this is good pedagogical practice: try to introduce new ideas one at a time. For the sake of argument, suppose I have a directional hypothesis: having listened to the music for the two songs (and not the words), I might guess that Man Like That (which has something of a jaunty tune) will have higher valence than You Know I’m No Good (which does not). So the way I would call this function would be to specify a group variable, and use the test_greater argument to indicate which song I am hypothesising (under the alternative) to have the higher valence:

cheatsongs %>% 
  ttest_twosample(
    outcome = valence, 
    group = track_title,
    test_greater = "Man Like That")

Desirable properties of this code:

  • The code is pipe-friendly, and the conventions (variable names are passed without quotes) mirror those they’ll have been learning in the preceding sections of the book. Yay!

  • The arguments look almost identical to those used in ttest_onesample(). In my experience of teaching orthodox null hypothesis tests to psychology students, at this point in the class students are being bombarded with lots of new ideas, so consistency matters.

  • The outcome and group arguments are explicitly named in the code. In the surrounding text describing the ideas behind null hypothesis testing, I’ll be talking about “outcomes” and “grouping variables”, and my suspicion is that having the names specified explicitly will help new learners map from “the statistical concepts in the text” to “the computational concepts in the code”.

To my mind, these three features should help minimise the friction involved when first encountering null hypothesis tests. However, there is one very large drawback: the code doesn’t use model formulas. Later on in the book – very likely the next chapter in fact – I’m planning to introduce linear models via lm(), and it would be nice to be able to foreshadow the role that model formulas play by allowing the t-test functions to support formulas. My current thinking is that it makes most sense to order the content like this:

  • Introduce one sample test (two-sided, named arguments)
  • Introduce two sample test (two-sided, named arguments)
  • Introduce paired samples tests (two-sided, named arguments)
  • Introduce one-sided tests (one-sample, two-sample and paired-sample)
  • Introduce model formulas (two-sample and paired-sample)

To do this cleanly, I need all three functions to use the same arguments as much as possible (e.g., all three tests use test_greater to specify a one-sided hypothesis), but I also need to support model formulas. So the second argument to ttest_twosample() is formula, which allows you to specify the test using a model formula. As a general rule I don’t like the idea of writing functions that allow multiple ways of doing the same thing. It complicates the user interface and is finicky to explain. However, I think this is a rare exception precisely because I want to use the “outcome/group” arguments as a stepping stone to introduce the concept of model formulas later. As a consequence, the code is written so that these yield the same answer:

# named arguments
tt1 <- cheatsongs %>% 
  ttest_twosample(
    outcome = valence, 
    group = track_title)

# model formulas
tt2 <- cheatsongs %>% 
  ttest_twosample(valence ~ track_title)

# check
all.equal(tt1, tt2)
## [1] TRUE
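As an aside, one way to support both calling styles in a single function is simply to branch on whether the second argument is a formula. This sketch is hypothetical (the real tidylsr functions use tidy evaluation so that variable names can be passed unquoted; here I cheat and use strings, delegating to base R for the test itself):

```r
# Hypothetical sketch: one function, two interfaces (formula or named args).
ttest_sketch <- function(data, formula = NULL, outcome = NULL, group = NULL) {
  if (inherits(formula, "formula")) {
    vars    <- all.vars(formula)  # e.g. valence ~ track_title
    outcome <- vars[1]
    group   <- vars[2]
  }
  t.test(data[[outcome]] ~ data[[group]])  # Welch test via base R
}

set.seed(7)
d  <- data.frame(y = rnorm(20), g = rep(c("a", "b"), each = 10))
t1 <- ttest_sketch(d, y ~ g)
t2 <- ttest_sketch(d, outcome = "y", group = "g")
```

Both calls reach the same code path once the variable names are extracted, which is why the results must agree.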

In the same spirit as the one sample test, the output from the print method produces verbose output that names the test, the variables, the descriptive statistics, the hypotheses, and the various quantities related to the test itself:

##     Welch's two sample t-test 
## Variables: 
##     outcome:  valence 
##     group:    track_title 
## Descriptives: 
##                   sample      mean        sd
##            Man Like That    0.2239     0.261
##     You Know I'm No Good   -0.0706     0.241
## Hypotheses: 
##     null:         population means are equal 
##     alternative:  population means are different 
## Test results: 
##     t-statistic:         6.02 
##     degrees of freedom:  89.455 
##     p-value:             <.001 
## 95% confidence interval:
##     lower bound:  0.197 
##     upper bound:  0.392
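The fractional degrees of freedom (89.455) in that printout is the usual giveaway that a Welch test has been run: it comes from the Welch–Satterthwaite approximation. A sketch of that computation, with simulated stand-in groups:

```r
# Welch t-statistic and Welch-Satterthwaite df, by hand; simulated data.
set.seed(123)
x <- rnorm(70, mean = 0.22, sd = 0.26)   # stand-in for Man Like That
y <- rnorm(41, mean = -0.07, sd = 0.24)  # stand-in for You Know I'm No Good

v_x <- var(x) / length(x)  # squared standard error, group 1
v_y <- var(y) / length(y)  # squared standard error, group 2

t_stat <- (mean(x) - mean(y)) / sqrt(v_x + v_y)
df <- (v_x + v_y)^2 /
  (v_x^2 / (length(x) - 1) + v_y^2 / (length(y) - 1))

# Welch's test is what base R t.test() runs by default (var.equal = FALSE)
ref <- t.test(x, y)
```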

Paired samples tests

Finally we have paired-samples tests. I spent some time wondering whether I should support data in wide form (distinct measurements are distinct variables in the data frame), or long form (distinct measurements are distinct rows grouped by an id variable). Supporting both seems like a very bad idea, and likely confusing to a new user. In the end I decided long form is better, as it is more consistent with how mixed models are specified in lme4. To stick with the cheating songs example, one possible objection to the two-sample test is that the songs are different lengths, and it might make more sense to compare the first line of Man Like That to the first line of You Know I’m No Good and so on. Okay, that’s probably a really foolish idea in practice, but it does let me apply the ttest_paired() function. As before, there are two different ways to call the function. Using named arguments, the code looks like this:

tt3 <- cheatsongs %>% 
  ttest_paired(
    outcome = valence, 
    group = track_title, 
    id = line)
## Warning in ttest_handle_missing(wide_data, grp_names): 29 observations
## removed due to missingness

Notice that this code throws a warning. The songs are different lengths, so some data have necessarily been removed in order to conduct the line-matched analysis. In general, this is what the ttest functions do: warn the user about missing data.

For the model formula syntax5 I’ve been relying on a slight variation of the lme4 syntax, enclosing the name of the id variable in parentheses:

tt4 <- cheatsongs %>% ttest_paired(valence ~ track_title + (line))
tt5 <- cheatsongs %>% ttest_paired(valence ~ (line) + track_title)

There might be better ways to do this, but given that a paired samples t-test isn’t a “true” repeated measures analysis (it’s just a one sample test conducted on difference scores) I don’t want to complicate this too much. The intended role is simply to serve as a precursor to the model formulas notation that will appear later in the book.
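That equivalence – a paired test is just a one-sample test on the difference scores – is easy to check directly in base R, with simulated measurements standing in for the matched lines:

```r
# Paired t-test vs one-sample t-test on differences; simulated stand-ins.
set.seed(99)
line_a <- rnorm(41, mean = 0.14, sd = 0.26)
line_b <- rnorm(41, mean = -0.07, sd = 0.24)

paired  <- t.test(line_a, line_b, paired = TRUE)
on_diff <- t.test(line_a - line_b, mu = 0)

# identical t, df and p-value
c(paired$statistic, on_diff$statistic)
```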

In any case, I can quickly check that all three versions produce the same answer:

all.equal(tt3, tt4)
all.equal(tt4, tt5)
## [1] TRUE
## [1] TRUE

Finally the output looks like this:

##     Paired samples t-test 
## Variables: 
##     outcome:  valence 
##     group:    track_title 
##     id:       line 
## Descriptives: 
##                   sample      mean        sd
##            Man Like That    0.1380     0.256
##     You Know I'm No Good   -0.0706     0.241
##                    diff.    0.2087     0.341
## Hypotheses: 
##     null:         population means are equal 
##     alternative:  population means are different 
## Test results: 
##     t-statistic:         3.919 
##     degrees of freedom:  40 
##     p-value:             <.001 
## 95% confidence interval:
##     lower bound:  0.101 
##     upper bound:  0.316

Again, it’s verbose in order to be clear to new learners, and the descriptive statistics are now extended to cover the mean and standard deviation of the difference scores.

Under the hood?

The objects returned by these functions are nothing fancy. Each one is just a list of tibbles with an "lsr_ttest" class attached, and the print method does all the formatting work:

class(tt3)
## [1] "lsr_ttest"

unclass(tt3)
## $variables
## # A tibble: 1 x 6
##   outcome group       id    sample1       sample2              null_mean
##   <chr>   <chr>       <chr> <chr>         <chr>                    <dbl>
## 1 valence track_title line  Man Like That You Know I'm No Good        NA
## $test
## # A tibble: 1 x 8
##   type   hypotheses     t    df        p ci_lower ci_upper ci_level
##   <chr>  <chr>      <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 paired two.sided   3.92    40 0.000339    0.101    0.316     0.95
## $descriptives
## # A tibble: 3 x 3
##   sample                  mean    sd
##   <chr>                  <dbl> <dbl>
## 1 Man Like That         0.138  0.256
## 2 You Know I'm No Good -0.0706 0.241
## 3 diff.                 0.209  0.341
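The general pattern here – a plain list with a class attribute, plus a print method that does the formatting – is easy to imitate. A hypothetical sketch (the field names mimic the output above, but the code is illustrative, not tidylsr’s actual implementation):

```r
# Hypothetical sketch of the classed-list + print-method pattern.
tt <- structure(
  list(test = data.frame(type = "paired", t = 3.92, df = 40, p = 0.000339)),
  class = "demo_ttest"
)

# the print method controls what a user sees at the console
print.demo_ttest <- function(x, ...) {
  cat("    Paired samples t-test \n")
  cat("    t-statistic:", x$test$t, "\n")
  cat("    degrees of freedom:", x$test$df, "\n")
  invisible(x)
}

print(tt)    # friendly, formatted output
unclass(tt)  # the raw list underneath
```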

  1. Anyone messaging me to tell me that this is data reuse – as I have already listened to the songs – and hence not a valid context in which to conduct a hypothesis test will be guillotined by the pedantry police (a.k.a. my children). If it makes you feel any better though, they have preregistered the execution plan.

  2. Oh fine, I’ll be honest. I only just discovered the genius package for importing song lyrics into R (thanks Jen!) and wanted an excuse to use it in a blog post. Sue me.

  3. Well, not that naturally. After all Man Like That is one song, written exactly one time by a single person. It is not a repeatable event in the sense typically required by the frequentist interpretation of probability, and probably would have caused quite considerable consternation among orthodox statisticians before the measure theorists took over and sigma algebras conquered all. More practically though, if you view the song as a finite population of lines, there is a sense in which the “sample” is the population, in which case what precisely is the thing about which we are doing inference? Arguably, it does make sense to consider a population of possible songs that Gin Wigmore might have written instead of this one, if she were in the same state of mind as when writing this one. Under that conceptualisation, there exists some generative process from which we have a single realisation (namely the song that actually was written), and our inferences can be construed as an attempt to make decisions (if we are using orthodox tests) or update our beliefs (if we are Bayesian) about the nature of this process. Inference is philosophically tricky at the best of times, I suppose.

  4. The name for the test_greater argument is a little less than ideal, and as will become clearer later it’s a bit of a compromise between competing goals. Frankly, if anyone has a better suggestion for how to approach this I’d love to hear about it.

  5. Warning: As of this writing I haven’t actually implemented the model formula version properly, so it’s not robust!