Girl you gotta wonder ‘bout a test like that

by Danielle Navarro, 27 May 2019

I don’t really wanna wake you
I just came to get my things
But the pretty lil’ thing lying there beside you
Better take off my wedding ring
Hey, girl, do you really wanna do this
You don’t know what you’re stepping in
He’s got more where that came from
You’re not so special in the end
   – Gin Wigmore, Man Like That

Consider the oddity of the Student t-test. It’s not a particularly flexible tool. It assumes a single predictor that can be represented as a binary-valued variable, relies on a theory of inference that reduces the scientific goal to a binary yes-no decision, and on top of that applies a half-reasonable assumption (normality) and an absurd one (homogeneity of variance), both of which are ancillary to the research question a scientist typically asks. But it’s been around since the early 20th century. Everyone teaches it and everyone uses it, so as the author of an introductory statistics text it’s hard to escape. Still, if I am going to discuss t-tests in a revised version of Learning Statistics with R, I’d like to do it in a way that makes sense.


Desiderata for t-test functions

If the book is to be accompanied by an R package, one of the first questions I find myself asking is whether or not that package should contain any functions to run t-tests. It’s never a good idea to reinvent the wheel, so maybe I should just stick to the t.test() function in base R? That’s clearly a bad idea: as I mentioned in the last post the t.test() function doesn’t play nicely with tidyverse and the output is pretty much incomprehensible if you don’t already understand how Neyman-Pearson tests operate. As a teaching tool, t.test() has little value to me. Learning Statistics with R is notionally written for a new learner who has no previous experience with statistics, R, or programming, and at the point in the text where I will be introducing t-tests in the revised book, my notional reader is still learning how null hypothesis testing actually works. From that point of view, the functions I want them to be using should support this learning goal.

A more interesting comparison would be to consider the infer package, which includes a clean and simple ttest function that plays nicely with the pipe. However, as much as I like the infer package, I don’t think it will serve my purposes in Learning Statistics with R either: it is a function that is friendly to someone who already knows tidyverse and who understands null hypothesis testing, but not one that is ideal for my pedagogical purposes.

To illustrate why I think this, let’s get some data to work with. As I write this post my playlist includes two songs about infideity, Man Like That by Gin Wigmore to You Know I’m No Good by Amy Winehouse, and they feel rather different in emotional flavour, so I’ll put together a data set to let me test this1 2. To construct the data set, the first step is to import the lyrics to both songs…

cheatsongs <- bind_rows(
  "Gin Wigmore" = genius_lyrics("gin wigmore", "man like that"),
  "Amy Winehouse" =  genius_lyrics("amy winehouse", "you know i'm no good"),
  .id = "artist")

Of course, this is only half the job. The genius package lets me imports the lyrics, but I want some measure of the overall emotional tone, preferably on a line-by-line basis. So, in the venerable psychological tradition of measuring something only tangentially connected to the thing you’re really interested in because that’s the only thing you know how to measure,3 I’ll use the sentimentr package to estimate the valence of each line:

cheatsongs <- cheatsongs %>% mutate(
  valence = sentiment(lyric) %>% pull(sentiment))
## # A tibble: 111 x 5
##    artist     track_title   line lyric                              valence
##    <chr>      <chr>        <dbl> <chr>                                <dbl>
##  1 Gin Wigmo… Man Like Th…     1 I don't really wanna wake you        0    
##  2 Gin Wigmo… Man Like Th…     2 I just came to get my things         0    
##  3 Gin Wigmo… Man Like Th…     3 But the pretty lil' thing lying t…   0.188
##  4 Gin Wigmo… Man Like Th…     4 Better take off my wedding ring      0.327
##  5 Gin Wigmo… Man Like Th…     5 Hey, girl, do you really wanna do…   0    
##  6 Gin Wigmo… Man Like Th…     6 You don't know what you're steppi…   0    
##  7 Gin Wigmo… Man Like Th…     7 He's got more where that came from   0    
##  8 Gin Wigmo… Man Like Th…     8 You're not so special in the end    -0.302
##  9 Gin Wigmo… Man Like Th…     9 I'm messing up the place            -0.447
## 10 Gin Wigmo… Man Like Th…    10 Kicking down the door                0    
## # … with 101 more rows

Armed with a data set, we can now attempt to run a hypothesis test.4 Using the t_test() function from the infer package I can run a two-sample t-test as part of a data-analysis pipeline, and all I need to do is specify the model formula and the order in which the two songs should be entered when calculating the t-statistic. Here’s what that looks like:

cheatsongs %>% t_test(
  formula = valence ~ track_title, 
  order = c("Man Like That", "You Know I'm No Good"))
## # A tibble: 1 x 6
##   statistic  t_df      p_value alternative lower_ci upper_ci
##       <dbl> <dbl>        <dbl> <chr>          <dbl>    <dbl>
## 1      6.02  89.5 0.0000000378 two.sided      0.197    0.392

There’s a lot to like about this. The infer package adheres to tidyverse principles so the input is predictable, the output is type-stable and takes the form of a simple table. For an experienced R user who codes tidyverse style, this is very helpful. In particular, the fact that the output is a raw tibble with no frills is really nice. The structure of the output is revealed transparently to the user, making any further data analysis operations easy to plan. Easy input, easy output. This is really nice, and a fancy print method that explains the output is of little utility to an experienced R programmer.

Except of course, my book isn’t pitched at experienced R users wanting to learn statistics. It’s pitched at an audience trying to learn statistics and R at the same time, and given that constraint I have to take a little more care as an author. At this point in the book, my hypothetical reader would be expected to be somewhat familiar with R (and tidyverse) but not yet comfortable and fluent with the language. Moreover, they would be completely unfamiliar with null hypothesis tests, because my current plan is to use the t-test as a tool for introducing orthodox null hypothesis tests. The output from infer::t_test() isn’t ideal for that purpose. In fact I’m yet to find anything that provides the kind of output that I want to provide to the reader at this point in the book.5

To make functions that will serve my purposes, the tidylsr package uses three separate functions: ttest_onesample(), ttest_twosample() and ttest_paired(). With that in mind, I’ll talk about these three functions in this post, in the hope that if someone has a better suggestion for how to approach this they might let me know!

One sample tests

The simplest of the three functions is the one sample test. So let’s imagine that I’d just heard the tune to Man Like That, which is rather upbeat, and I’ve formed the hypothesis that the lyrics will have positive valence. This question maps naturally6 to a one-sided one-sample t-test, and a simple analysis pipeline might look like this:

cheatsongs %>% 
  filter(track_title == "Man Like That") %>%
    outcome = valence,
    null_mean = 0,
    test_greater = TRUE)

One thing I really like about the tidyverse style is that it encourages code that looks like this, where the preprocessing operations (in this case, filtering the data to the relevant subset) are revealed transparently in the code, and passed directly to the test. Accordingly, anyone reading the code later can easily see what was done to the data prior to testing, a state of affairs greatly to be desired in my view. In any case, to specify the test, besides piping the data into the function, there are three detail argiments we need to specify:

  • The outcome specifies which variable in the data frame contains the data to be tested. Note that this is a quoted argument, so the user enters the name of the variable as without quotes, consistent with how variables are usually specified in dplyr and ggplot2.

  • The null_mean argument specifies the fixed value of the mean against which the data are to be compared. In this case it is simply zero.

  • Finally, we have the (optional) test_greater argument, which allows the user to specify a one-sided alternative. By default the function specifies test_greater = NULL, which is interpreted as a two-sided test. Setting test_greater = TRUE implies a one-sided test in which the alternative hypothesis is that the population mean exceeds null_value, whereas test_greater = FALSE implies that opposite. (I am not entirely happy with this name, and might change it later)

Here’s what the output looks like:

##     One sample t-test 
## Variables: 
##     outcome:  valence 
## Descriptives: 
##     sample      mean        sd
##    valence     0.224     0.261
## Hypotheses: 
##     null:         population mean less than or equal to 0 
##     alternative:  population mean greater than 0 
## Test results: 
##     t-statistic:         7.166 
##     degrees of freedom:  69 
##     p-value:             <.001 
## 95% confidence interval:
##     lower bound:  0.172 
##     upper bound:  Inf

This output is much more verbose than infer::t_test() and indeed t.test(). It explcitly spells out the structure of the hypothesis test, it includes relevant descriptive statistics, and it separates the information that is strictly part of the hypothesis test (t, df, p) from the ancillary estimation problem that is typically reported alongside the test (confidence interval). Formatting the output this way makes my writing task a lot easier, as it allows me to walk the reader through the results section by section, introducing the key ideas in orthodox hypothesis testing as I go. As an added bonus, the output doesn’t reveal the underlying data structure to the reader, so I can defer any discussion of that data structure to a later point in the text.

Two sample tests

So far so good. However, in practice one rarely has call to run a one-sample t-test. In psychology at least, it is much more typical to test hypotheses about the relative means of two (or more) populations rather than test hypotheses about the absolute value of those means. To push this onto my example data, we might consider the question of whether the two songs differ in emotional valence, thereby necessitating a two-sample test.

When writing the ttest_twosample() function, however, I found myself facing something of a dilemma. In the textbook, my plan is to introduce one sample tests first, then two-sample tests, and finally paired-sample tests. To keep things as clean as possible in the exposition, I really want to be able to call ttest_twosample() using arguments that are as close as possible to the arguments used in ttest_onesample(). To my mind this is good pedagogical practice: try to introduce new ideas one at a time. Let’s see how that plays out…

For the sake of argument, suppose I have a directional hypothesis: having listened to the music for the two songs (and not the words), I might guess that Man Like That (which has something of a jaunty tune) will have higher valence than You Know I’m No Good (which does not). So the way I would call this function would be to specify a group variable, and use the test_greater argument to indicate which song I am hypothesising (under the alternative) to have the higher valence:

cheatsongs %>% 
    outcome = valence, 
    group = track_title,
    test_greater = "You Know I'm No Good")

This code has a number of really desirable properties, from my perspective as the author of the textbook:

  • Consistency with tidyverse conventions: data argument comes first thus supporting the pipe, outcome and group are quoted arguments, and test_greater is not.

  • Consistency with previous learning. The arguments to ttest_twosample() in this code look almost identical to those used in ttest_onesample(). This is a good thing. In my experience of teaching orthodox null hypothesis tests to psychology students, at this point in the class students are being bombarded with lots of new ideas, so consistency matters. From a writing perspective it is a lot easier to move from the one-sample to two-sample test if the R functions behave similarly.

  • Consistency with statistical language. The outcome and group arguments are explicitly named in this code, and in my book I tend to use outcomes and predictors to describe the roles that variables can play. As a consequence, in the surrounding text describing the ideas behind null hypothesis testing, I’ll be talking about “outcomes” and “grouping variables” (as the predictor), and so I think it is helpful that those same words appear in the code. Again this helps with conceptual mapping: my suspicion is that having the names specified explicitly will help new learners map from “the statistical concepts in the text” to “the computational concepts in the code”.

To my mind, these three features should help minimise the friction involved when first encountering null hypothesis test, and I don’t want to lose any of those properties.

However, there is one huge drawback: this code doesn’t use model formulas. The t.test() and infer::t_test() functions both support model formulas, as of course they should gien how central formulas are to statistical modelling in R. It would be rather negligent of me not to take account of this. Besides, it always helps to be smart from the beginning. As I’ve mentioned, I tend to think of the t-test as a toy rather than a serious statistical tool, and in truth I’m really introducing it as a stepping stone on the way to linear modelling. So later on in the book – very likely the next chapter in fact – I’m planning to introduce linear models via lm(), and it would be nice to be able to foreshadow the role that model formulas play by allowing the t-test functions to support formulas. Accordingly my current thinking is that it makes most sense to order the content like this:

  • Introduce one sample test (two-sided, named arguments)
  • Introduce two sample test (two-sided, named arguments)
  • Introduce paired samples tests (two-sided, named arguments)
  • Introduce one-sided tests (one-sample, two-sample and paired-sample)
  • Introduce model formulas (two-sample and paired-sample)

To do this cleanly, I need all three functions to use the same arguments as much as possible (e.g., all three tests use test_greater to specify a one-sided hypothesis), but I also need to support model formulas.

Given the above the second argument to ttest_twosample() is formula, and allows you to specify the test using model formulas in the usual. As a general rule I don’t like doing this. I think functions should support one way to allow user input, but no more than that. Allowing multiple input formats complicates the user interface and is finicky to explain. However, reality is always messy, and I think the pedagogical goals of the package warrant making an exception. As a consequence, the code is written so that these yield the same answer. For the two-sided test:

# named arguments
tt1 <- cheatsongs %>% 
    outcome = valence, 
    group = track_title)

# model formulas
tt2 <- cheatsongs %>% 
  ttest_twosample(valence ~ track_title)

# check
all.equal(tt1, tt2)
## [1] TRUE

In the same spirit as the one sample test, the output from the print method produces verbose output that names the test, the variables, the descriptive statistics, the hypotheses, and the various quantities related to the test itself:

##     Welch's two sample t-test 
## Variables: 
##     outcome:  valence 
##     group:    track_title 
## Descriptives: 
##                   sample      mean        sd
##            Man Like That    0.2239     0.261
##     You Know I'm No Good   -0.0706     0.241
## Hypotheses: 
##     null:         population means are equal 
##     alternative:  population means are different 
## Test results: 
##     t-statistic:         6.02 
##     degrees of freedom:  89.455 
##     p-value:             <.001 
## 95% confidence interval:
##     lower bound:  0.197 
##     upper bound:  0.392

As before, it is awfully verbose, but I’m okay with that.

Paired samples tests

Finally we have paired-samples tests. I spent some time wondering whether I should support data in wide form (distinct measurements are distinct variables in the data frame), or long form (distinct measurements are distinct rows grouped by an id variable). Supporting both seems like a very bad idea, and likely confusing to a new user. In the end I decided long form is better, as it is more consistent with how mixed models are specified in lme. To stick with the cheating songs example, one possible objection to the two-sample test is that the songs are different lengths, and it might make more sense to compare the first line of Man Like That to the first line of You Know I’m No Good and so on. Okay, that’s probably a really foolish idea in practice, but it does let me apply the ttest_paired() function. As before, there are two different ways to call the function. Using named arguments, the code looks like this:

tt3 <- cheatsongs %>% 
    outcome = valence, 
    group = track_title, 
    id = line)
## Warning in ttest_handle_missing(wide_data, grp_names): 29 observations
## removed due to missingness

Notice that this code throws a warning. The songs are different lengths, so some data have necessarily been removed in order to conduct the line-matched analysis. In general, this is what the ttest functions do: warn the user about missing data.

For the model formula syntax7 I’ve been relying on a slight variation of the lme syntax, enclosing the name of the id variable in parentheses:

tt4 <- cheatsongs %>% ttest_paired(valence ~ track_title + (line))
tt5 <- cheatsongs %>% ttest_paired(valence ~ (line) + track_title)

There might be better ways to do this, but given that a paired samples t-test isn’t a “true” repeated measures analysis – i.e., it’s just a one sample test conducted on difference scores – I don’t want to complicate this syntax too much. The intended role of this argument is to serve as a precursor to the richer model formulas notation that will appear later in the book, nothing more. In any case, it’s easy to verify that all three versions produce the same answer:

all.equal(tt3, tt4)
all.equal(tt4, tt5)
## [1] TRUE
## [1] TRUE

More to the point, here’s what the output looks like:

##     Paired samples t-test 
## Variables: 
##     outcome:  valence 
##     group:    track_title 
##     id:       line 
## Descriptives: 
##                   sample      mean        sd
##            Man Like That    0.1380     0.256
##     You Know I'm No Good   -0.0706     0.241
##                    diff.    0.2087     0.341
## Hypotheses: 
##     null:         population means are equal 
##     alternative:  population means are different 
## Test results: 
##     t-statistic:         3.919 
##     degrees of freedom:  40 
##     p-value:             <.001 
## 95% confidence interval:
##     lower bound:  0.101 
##     upper bound:  0.316

Again, it’s verbose in order to be clear to new learners, and the descriptive statistics are now extended to cover the mean and standard deviation of the difference scores.

Under the hood?

I’m messing up the place
Kicking down the door
Never wanna see his face no more

At some point in the book (I haven’t really decided yet) I’ll probably introduce the reader to the basics of object-oriented programming in R.8 It’s hard to get by in R without at least knowing a tiny bit about S3 classes so – again trying to be smart from the beginning – I’ve written the class in a way that might help me out later on. First off, what is the class called?

## [1] "lsr_ttest"

Okay, yep, that’s a class name that no-one other than me would ever think to specify, so we should be able to avoid conflicts. It’s also pretty clear what kind of object we’re describing, I hope! So what does the object contain?

## [1] "variables"    "test"         "descriptives"

Seems pretty straightforward. I’ve written it so that an lsr_ttest is a list of three tibbles (I’ll probably change the order of these three things later – I’m not sure why “descriptives” are last). So let’s look at them one at a time:

## # A tibble: 1 x 6
##   outcome group       id    sample1       sample2              null_mean
##   <chr>   <chr>       <chr> <chr>         <chr>                    <dbl>
## 1 valence track_title line  Man Like That You Know I'm No Good        NA

Regardless of which of the three t-test functions is called, this is a tibble with one row and six columns (irrelevant variable names, like id for two-sample tests, would be specified as NA in those cases), listing the variables involved in the hypothesis test. I struggled a little trying to decide whether the value of "null_mean" for a one-sample test properly belongs in this part of the output, but my view is that it is analogous to the names of the groups ("sample1" and "sample2“), so I’ve treated it as if it were a variable”name".

Next up, we have a tibble that describes the output of the test itself:

## # A tibble: 1 x 8
##   type   hypotheses     t    df        p ci_lower ci_upper ci_level
##   <chr>  <chr>      <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 paired two.sided   3.92    40 0.000339    0.101    0.316     0.95

This is deliberately very similar (though not identical) to the output of infer::t_test(). It lists the relevant information about the test and the confidence interval, in a fashion I hope makes sense. I’m not happy with the name "hypotheses" for the second column, and will almost certainly change it back to "alternative" for consistency with t.test() and infer::t_test().

Finally, the descriptives:

## # A tibble: 3 x 3
##   sample                  mean    sd
##   <chr>                  <dbl> <dbl>
## 1 Man Like That         0.138  0.256
## 2 You Know I'm No Good -0.0706 0.241
## 3 diff.                 0.209  0.341

Again the internal data structure is a tibble, in order to remain maximally similar to the data sets I’ll be providing to the reader throughout the book, and it reports sufficient statistics (mean and sd in this case) for the original variables and (for the paired test) the difference scores. My hope is that, by designing the object structure in a fashion that is visually quite similar to the pretty output generated by the print.lsr_ttest() method, the mapping between the printed output and the hidden data structure should be fairly easy for a novice user to understand.

You’re not so special in the end

Girl, you better wake up
(Ooh, ooh) girl, you better run
(He’s gone) first thing in the morning
Faster than a bullet coming out of that gun

I find it deeply amusing that I’m in my 40s and writing functions implementing t-tests, of all things. I mean, I’ve written papers discussing the relationship between model selection tools based on algorithmic information theory and those based on Bayesian inference, I’ve spend an inordinate amount of my time trying to understand Bayesian nonparametric models9 … and here I am thinking deeply about t-tests. Seriously? Aren’t I “better” than that?

In short: no. No I’m not.

Despite having a fairly dim view of the t-test in general, I think it matters how we teach the ideas underpinning it, and there’s an element of arrogance to my desire not to bother with such “trivial” things. The t-test is a key stepping stone in the bridge across the perilous waters of statistical learning, and so we really should care about how this tool is implemented in our software, and how we describe it in our textbooks. I’m not so “special” that I think I can ignore these everyday concerns, so my goal in writing these functions is to make the learning process as painless as possible for someone encountering these concepts for the first time. I want functions that allow a novice to quickly become familiar with orthodox null hypothesis testing in R, that cement their knowledge of tidyverse, and allow them to transition smoothly from simple tests to linear modelling. That’s not a small thing, and I’d argue that it’s much more important in the real world than any of the other fancy things I do to pay the bills.

Transitions are always tricky to manage10 and it matters to me that this one is managed as cleanly as possible.

(Ooh, ooh) tells you that he loves you
(Ooh, ooh) then he take it all back
Girl, you gotta wonder
Girl, you gotta wonder
Girl, you gotta wonder ’bout a man like that

  1. Anyone messaging me to tell me that this is data reuse – as I have already listened to the songs – and hence not a valid context in which to conduct hypothesis test will be guillotined by the pedantry police (a.k.a. my children). If it makes you feel any better though, they have preregistered the execution plan.

  2. Oh fine, I’ll be honest. I only just discovered the genius package for importing song lyrics into R (thanks Jen!) and wanted an excuse to use it in a blog post. Sue me.

  3. Do not get me started on my opinions about the perils of an unexamined operationalisation.

  4. In real life, this would be terrible practice. The number of times I see people jumping straight from “collect data” to “run a test” without first properly exploring the data to understand what it is doing is just depressing. Worse yet, I feel that the current zeitgeist in psychology is just encouraging this terrible practice by making feel people obligated to adhere to a preregistered data analysis fantasy (sorry “plan”). However, my opinions about the … um, let’s call it self-confidence … that is required to preregister an analysis plan and then stick to it appear to be rather unpopular these days. At some point I might write a longer post about that topic, but for now let’s just say I’m less than thrilled with statistical self-assurance that seems to underpin the strain of methodological zeal currently dominant in psychology.

  5. Okay fine, I guess the functions I wrote in the original lsr package do that, but that package is so badly written that no-one should ever use it under any circumstances except as an object lesson in what not to do.

  6. Well, not that naturally. After all Man Like That is one song, written exactly one time by a single person. It is not a repeatable event in the sense typically required by the frequentist interpretation of probability, and probably would have caused quite considerable consternation among orthodox statisticians before the measure theorists took over and sigma algebras conquered all. More practically though, if you view the song as a finite population of lines, there is a sense in which the “sample” is the population, in which case what precisely is the thing about which we are doing inference? Arguably, it does make sense to consider a population of possible songs that Gin Wigmore might have written instead of this one, if she were in the same state of mind as when writing this one. Under that conceptualisation, there exists some generative process from which we have a single realisation (namely the song that actually was written), and our inferences can be construed as an attempt to make decisions (if we are using orthodox tests) or update our beliefs (if we are Bayesian) about the nature of this process. Inference is philosophically tricky at the best of times, I suppose.

  7. Warning: As of this writing I haven’t implemented the model formula version especially well, so it is probably not robust to the kinds of things one might normally expect (e.g., inline transformations of the variables). The back end is very much a work in progress

  8. And indeed only the basics. I’m pretty comfortable with S3 and I know the basics of S4, but those are the only OOP paradigms I really know much about and frankly the whole thing is a bit confusing to my tiny little brain. I’m not so delusional as to think I can teach content I only half understand myself

  9. Oh dear lord. Nothing - and I do mean nothing - about statistical inference terrifies me more than Bayesian nonparametrics. Trying to specify priors that have broad suppport across the space of all (computable) probability distributions is the statistical equivalent of staring into the abyss, and I have nothing but admiration for anyone who can read the Diaconis & Friedman (1986) paper and not run screaming into the night madly yelling about the impossibility of specifying an element of an uncountably infinite set using only a countably infinite number of observations… HERE BE FUCKING DRAGONS PEOPLE