Be some other name

by Danielle Navarro, 15 May 2019



JULIET:
O Romeo, Romeo, wherefore art thou Romeo?
Deny thy father and refuse thy name.
Or if thou wilt not, be but sworn my love
And I’ll no longer be a Capulet.
‘Tis but thy name that is my enemy:
Thou art thyself, though not a Montague.
What’s Montague? It is nor hand nor foot
Nor arm nor face nor any other part
Belonging to a man. O be some other name.
What’s in a name? That which we call a rose
By any other name would smell as sweet;
So Romeo would, were he not Romeo call’d,
Retain that dear perfection which he owes
Without that title. Romeo, doff thy name,
And for that name, which is no part of thee,
Take all myself.

I have always loved this speech from Romeo and Juliet. Firstly, oh my. Juliet realising just how badly she is in love with exactly the wrong man never fails to make my heart ache. Secondly, there’s an element of bitterness to her assertion that a rose by any other name would smell as sweet. Romeo’s name is not like the smell of a rose - he is Montague and she is Capulet. One’s family name is not so easily forsworn, nor are the social consequences that attach to that name.

In the software development world this seems to be well understood. The way you name your functions matters to your users, and the way you name and order their arguments is important too. One poorly named function in a widely-used package can produce a lot of confusion and frustration in the real world, and so a great deal of care is needed in how one chooses a name. Not for nothing does Hadley Wickham include a slide with this quote in some of his talks about API design:

A function by any other name would NOT smell as sweet.

Names matter. They shape the way we organise content, they supply connotations and context, they help us understand the purpose of a thing before we use it, and so on. This principle seems to be less well understood in other disciplines. Statistics, for instance, is the discipline that gave us “errors of the first kind” and “errors of the second kind”, neither of which strikes me as particularly helpful.

When I first started writing Learning Statistics with R I partially understood this. For instance, I spent a lot of time thinking about the terminology I would use to describe the different roles variables play in a statistical model. Should I discuss independent and dependent variables per the ANOVA tradition, with all their connotations about causal direction? Should I refer to covariates as is more common in regression, despite the fact that the word means very little to a new learner? Both choices seemed less than ideal to me, so I settled on terminology favoured in the machine learning literature, namely predictors and outcomes. I’m very happy with that choice, and will likely retain it in any future revision.1

On the other hand, I really didn’t think enough about names in some parts of the lsr package.

Case study 1: Good names are helpful

A good example is the who() function. Unless you know MATLAB and know what it does there, it’s pretty hard to guess: what does this function do?

library(tidyverse)
library(lsr)
library(lsr2)

# create some variables
seeker <- "Sarah"
lover <- c("Sally", "Holly")
keeper <- data.frame(
  first = c("Sarah", "Sally", "Holly"),
  second = c("Blasko", "Seltmann", "Throsby")
)

# show the workspace
who()
##    -- Name --   -- Class --   -- Size --
##    keeper       data.frame    3 x 2     
##    lover        character     2         
##    seeker       character     1

Okay well now that makes some sense. The purpose of who() is to provide a summary of the contents of the calling environment, which for the purposes of the book is almost always the user workspace. The idea is to be slightly more informative than objects() and less overwhelming than ls.str(). For users who are in RStudio it’s not particularly important since the environment pane contains that information in a friendlier form. Even so, I often find myself wanting to show the contents of the workspace in the book itself, and I dislike taking screenshots, so the functionality is useful to me.

As I start thinking about what I need in a revised version, I’m playing around with different ideas in an lsr2 package (temporary name!) on GitHub.2 So what should I call this function in the new package?

  • My first thought was workspace(). It’s not too long and the name refers explicitly to what the function produces as output. It is definitely an improvement on who().

  • My second thought was… yes, but function names usually work better as verbs, or at least verb phrases. This isn’t a function that I expect anyone to be typing a lot, so there doesn’t seem to be a need for a very short name. So my next attempt at coming up with a name was show_workspace(). One thing I like about this, besides the fact that it is more explicit about what the function does, is that it implicitly hints to the user that the intention is for it to be used in an interactive context. I honestly can’t think of a good reason for anyone except me (while writing the book) to use it in source code.

  • My third thought was… yes but consistency matters too. For new users, both R and RStudio can be intimidating. There are so many different panes and tabs and unfamiliar concepts being represented there, and it seems to me that anything I can do to minimise the friction early on would be wise. In that respect, it seems important to me that the relevant RStudio pane is named “environment”. On the one hand I’m not sure I really want my new users to be thinking too much about environments per se (much less scope chains!) but I think that’s an easy one to deal with when writing. So my current thought is that show_environment() is a better choice, because I can conceptually link it to the RStudio environment pane when introducing it in the book.

What I haven’t done is look for conflicts yet. I don’t imagine there are other functions with the same name, but there are many functions in many packages that operate on environments, and I want to think a little more. Nevertheless, this is where I’m up to in this part of the revision:

show_environment()
## # A tibble: 3 x 3
##   variable class      size               
##   <chr>    <chr>      <chr>              
## 1 keeper   data.frame rectangular: 3 by 2
## 2 lover    character  length: 2          
## 3 seeker   character  length: 1

So I’m still pondering this. I like the fact that it returns a tibble, but am wondering what information to include. Variable name, class, and some notion of length or dimension seems sensible. I can’t think of anything else I’d want in a function primarily aimed at new users. One thing I am wondering about is whether I should leave the output as a tibble, or if I should give it a class and write a print method to make it look pretty. Maybe that’s unnecessary? The output seems quite readable as is?
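If I did decide to go down the print method route, the machinery wouldn’t be complicated. Here is a rough sketch, with the class name workspace_summary purely as a placeholder rather than anything I’ve committed to:

# inside show_environment(), the returned tibble would get an extra class:
# class(out) <- c("workspace_summary", class(out))

# a print method can then add a small header and hand the rest of the
# work back to the ordinary tibble printing
print.workspace_summary <- function(x, ...) {
  cat("Workspace contents:\n")
  NextMethod()
  invisible(x)
}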

There is one thing I would like to improve. I think show_environment() should have an argument that allows the user to specify which environment to show. At the moment it always displays the calling environment, but my intuition is that’s something that works better as an argument with a default value, rather than a hard coded constraint. As with everything, it’s a work in progress.
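To make that concrete, here is a minimal sketch of the idea. This is not the actual lsr2 code, just an illustration of how an environment argument with a sensible default might look, assuming the function keeps returning a tibble with the same three columns:

show_environment <- function(env = parent.frame()) {
  vars <- ls(envir = env)
  rows <- lapply(vars, function(v) {
    obj <- get(v, envir = env)
    size <- if (is.data.frame(obj) || is.matrix(obj)) {
      paste("rectangular:", nrow(obj), "by", ncol(obj))
    } else {
      paste("length:", length(obj))
    }
    tibble::tibble(variable = v, class = class(obj)[1], size = size)
  })
  dplyr::bind_rows(rows)
}

# defaults to the calling environment, but any environment can be passed
show_environment(globalenv())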

Case study 2: Good names work well together

There are a number of other poorly named functions in lsr – as well as many, many functions that should not even exist – but I think the most interesting cases to think about are the ones where I did something half-right the first time.

For example, one thing that I think lsr did quite well was to provide wrapper functions for common statistical tests in a way that presented the output in a more verbose fashion for novices. To illustrate, here’s a simple example from base R. Suppose I have this fake data and I want to run a two-sample t-test.

dat <- data.frame(
  result = c(rnorm(15), rnorm(15,1)),
  treatment = c(rep("A", 15), rep("B", 15))
)

t.test(result ~ treatment, dat)
## 
##  Welch Two Sample t-test
## 
## data:  result by treatment
## t = -2.8474, df = 27.565, p-value = 0.008231
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6106928 -0.2623148
## sample estimates:
## mean in group A mean in group B 
##      0.04965387      0.98615764

This drives me crazy for so many reasons:

  • First, the function is inconsistent in the input it takes. You could specify the two samples using x and y, or you could specify the outcome and a grouping variable using the formula and data arguments (both styles are shown in the snippet after this list). That’s really confusing to new users. Pick one style and stick with it. Sheesh.

  • Second, the function is inconsistent in the test it runs. When teaching, we typically treat the one-sample test, the two-sample test and the paired-samples test as a family of related tests, not as one test with different details. So I strongly feel that for novice users it makes most sense to split it into three related functions.

  • Third, oh my god that output is unreadable. Yes, I do understand that the various hypothesis testing functions in base R all share a common htest class and so are constrained by a shared print method, but for my teaching purposes I really don’t care about this. I’d much rather have a class that is defined just for t-tests, and prints output that novices will find helpful for a t-test.

  • Fourth, I know everyone points this out,3 but the dot in the function name t.test() is a disaster when there is also an S3 generic called t(). I mean that’s asking for bizarre S3 dispatch if someone ever defines a class "test" and test objects are in some sense transposable.
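To make the first of those complaints concrete, here are the two calling styles for the exact same test, using the fake data from above:

# style 1: pass the two samples separately as x and y
t.test(
  x = dat$result[dat$treatment == "A"],
  y = dat$result[dat$treatment == "B"]
)

# style 2: pass a formula plus a data frame
t.test(result ~ treatment, data = dat)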

This is, to put it mildly, not great.

From a pedagogical standpoint, I don’t want my readers encountering this mess so early in their learning. It seems wiser to write some wrapper functions. In lsr what I did was create these three functions:

oneSampleTTest
independentSamplesTTest
pairedSamplesTTest

These names are informative, but in retrospect I am unhappy with them for several reasons:

  • They are too long. Even though I do have a preference for informativeness over conciseness, this is taking things a bit too far.

  • The names don’t group the functions together in a nice, searchable way. It is better to group by prefix than by suffix.

  • I’m a snake_case girl these days, and that uglyCamelCase style is so very 2010.4

To address all three of these concerns, my current plan is to use these as the function names in the next version of the book:

ttest_onesample
ttest_twosample
ttest_paired

To me they seem unobtrusive, informative, and if the user starts typing tt and hits “tab” they will see those three options appear. That seems to be a highly desirable property.

Moving on. Let’s take a look at how the old functions worked. Here’s the independentSamplesTTest() function from lsr in operation:

independentSamplesTTest(
  formula =  result ~ treatment, 
  data = dat
)
## 
##    Welch's independent samples t-test 
## 
## Outcome variable:   result 
## Grouping variable:  treatment 
## 
## Descriptive statistics: 
##                 A     B
##    mean     0.050 0.986
##    std dev. 0.956 0.842
## 
## Hypotheses: 
##    null:        population means equal for both groups
##    alternative: different population means in each group
## 
## Test results: 
##    t-statistic:  -2.847 
##    degrees of freedom:  27.565 
##    p-value:  0.008 
## 
## Other information: 
##    two-sided 95% confidence interval:  [-1.611, -0.262] 
##    estimated effect size (Cohen's d):  1.04

I like the output from this function. Yes it’s verbose, but that’s useful for people who are new to learning statistics. It clearly states which hypothesis test was performed, it shows the descriptive statistics relevant to the test, articulates the hypotheses precisely, and separates information pertaining to the hypothesis test itself (test statistics, degrees of freedom, p-value) from other information that the researcher might want to know (effect size, confidence interval). Yay! Playing nicely with the user.

That being said, one major weakness in the lsr functions is that they don’t play well with the pipe, they don’t accept tibbles (because the type-checking is really badly written) and just… ugh. Everything about them just screams “the author was thinking about base R”, and to a tidyverse native, these are very much not tidy functions. To illustrate, suppose the user started with a nice, tidy tibble, like this one:

irises <- iris %>% 
  janitor::clean_names() %>%
  as_tibble()

head(irises)
## # A tibble: 6 x 5
##   sepal_length sepal_width petal_length petal_width species
##          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa 
## 4          4.6         3.1          1.5         0.2 setosa 
## 5          5           3.6          1.4         0.2 setosa 
## 6          5.4         3.9          1.7         0.4 setosa

Yeah, we’ve all seen the iris data before, so let’s move on. The key thing is that this is how I would expect my readers/users to be encountering their data.

Now suppose I wanted to compare setosas to versicolor on, say, sepal length. At the point in my book where users start thinking about tests, they will already have encountered pipes and dplyr functions, and so they might have the intuition that this is what you’d do. Assuming they have already loaded the relevant packages, this is probably what they’d expect:

irises %>%
  filter(species != "virginica") %>%
  independentSamplesTTest(outcome = sepal_length, group = species)
## Error in independentSamplesTTest(., outcome = sepal_length, group = species): unused arguments (outcome = sepal_length, group = species)

I mean, of course that doesn’t work. The function uses a formula/data interface, the data argument is placed in the second position so the pipe doesn’t do what it’s supposed to, and it doesn’t know how to handle tibbles. To get it to work, the code you’d actually need in the current version is this:

irises %>%
  filter(species != "virginica") %>%
  as.data.frame() %>%
  independentSamplesTTest(sepal_length ~ species, .)

Again. I’ve hidden the results, but suffice it to say this produces the right output. Yes it works, but at what price to the user’s sanity? This is waaaaay too much work just to subset the data and then run a t-test. Not great at all.

So I’ve started work on the new functions. They’re a work in progress but I think this is closer to what a user might expect?

irises %>%
  filter(species != "virginica") %>%
  ttest_twosample(outcome = sepal_length, group = species)

I’m slightly torn about whether using outcome and group as the arguments is the right thing to do. I’m so familiar with specifying model formulas (e.g., sepal_length ~ species) that to me it feels like the natural thing to do. But I don’t think my readers would share the same intuition. At the point in the book where they’re encountering their first hypothesis tests, they’ll have some skills with dplyr, ggplot2 and so on, but none of these rely on formulas to do the work. At a later point (indeed, quite shortly after), the book will introduce linear models and at that point they will definitely start needing to learn model formulas. The real question then is: is it better to keep a tidyverse style interface, in which we have the data argument first, and then specify two detail arguments later (outcome and group),5 or should I foreshadow the arrival of model formulas that will appear at the same time I introduce lm()? At the moment I’m leaning towards the tidyverse style, but I could easily be convinced otherwise.
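For concreteness, here is what the two candidate interfaces would look like side by side. Neither is final, and the formula version is purely hypothetical at this point since I haven’t written that method; the setosa_vs_versicolor object is just a stand-in for the filtered data from earlier:

setosa_vs_versicolor <- irises %>% filter(species != "virginica")

# tidyverse style: data first, then two "detail" arguments
ttest_twosample(setosa_vs_versicolor, outcome = sepal_length, group = species)

# formula style: foreshadows the lm() interface that arrives later in the book
ttest_twosample(sepal_length ~ species, data = setosa_vs_versicolor)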

In any case, I’ve started writing the code. At the moment the output is rubbish, but just to show that it is (technically) functional, here’s what the code chunk above produces:

## [[1]]
## # A tibble: 2 x 3
##   species    `mean(sepal_length)` `sd(sepal_length)`
##   <fct>                     <dbl>              <dbl>
## 1 setosa                     5.01              0.352
## 2 versicolor                 5.94              0.516
## 
## [[2]]
## 
##  Welch Two Sample t-test
## 
## data:  sepal_length by species
## t = -10.521, df = 86.538, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.1057074 -0.7542926
## sample estimates:
##     mean in group setosa mean in group versicolor 
##                    5.006                    5.936

Okay, might still need a little more work. Yes, the code is very pretty but the output ends badly. Perhaps that is only fitting, given the theme of this post.


  1. I have a similar dilemma regarding the terminology for psychological research methodology. Terms like “replication crisis” seem unnecessarily dramatic to me, but overall seem unproblematic. I’m very uncomfortable using terms like “p-hacking” and “data dredging” in my own writings though. Even the ubiquitous “questionable research practice” frustrates me a little, because I feel like I’m being forced to adopt a much more judgemental tone in my book and I’m really trying hard to avoid doing that. My writing goals are pedagogical not policing, and so at some point I’m going to have to find a workaround for that too. But that’s probably a whole post in itself.

  2. This is of course a naming problem in and of itself. I really don’t like the idea of calling it lsr2, that’s not a meaningful name. I haven’t thought of a better one though.

  3. Okay fine, I learned this specific thing quite recently at Hadley Wickham’s Tidy Tools workshop, but I’ve been hurt by S3 dispatch so many times that I just want to cry. I do like S3, but sometimes it is hard to love.

  4. The other reason I don’t like these functions is that the code is really quite terrible. That said, I’m very much opposed to “code shaming” in all its forms. It would indeed be very poor if I were writing code like that now, given how many opportunities I’ve had to learn to write better code, but I refuse to go back in time and pass judgment on myself for being less than perfect as a novice programmer.

  5. I’m drawing the “data” vs “detail” distinction from tidyverse principles but I feel this is an ambiguous case. When construed as a model formula, sepal_length ~ species feels more substantial than a mere detail, but when recast as outcome = sepal_length and group = species it feels more like two detail arguments. Sigh. This is giving me a headache.