A loss of confidence

by Danielle Navarro, 03 Oct 2018

I’m a big fan of the idea behind the loss of confidence project. The idea is basically that researchers who no longer believe their own published findings can say so without stigma. It’s a nice thought – after all, a non-trivial proportion of the scientific literature is probably incorrect. We could quibble over how much, but really who cares about the number? The point is that it would be nice to know which reported phenomena are genuine and which aren’t. Seems totally worthwhile. Okay, sure, the loss of confidence project itself is no longer receiving submissions, but there’s nothing stopping me thinking about my own work and forming some views on what I now think about each paper. I think that’s just an inherently useful exercise.1 So let’s have a go at this!

Getting started

For simplicity, I’ll take the “20 most cited” 🙄 papers according to my Google Scholar page, on the assumption that these are the papers people have used the most2, and see where that takes me.

The first thing I notice looking at these papers is that almost none of them are empirical papers in the traditional sense. Many of them do contain novel empirical work, but most are driven by computational work, and many of them are entirely theoretical in nature.3

The model selection papers

There’s one cluster of papers I wrote about model selection methods. There’s a paper on Bayesian nonparametrics, a paper on normalised maximum likelihood, etc etc.

In one sense I’ve already made my “loss of confidence” statement in relation to those in this blog post.

The truth is I’ve lost confidence in almost all statistical model selection methods and I don’t think any of them really work as advertised. In fact, the main reason why this series of papers stopped around about 10 years ago is that I stopped believing any of the claims being made about model selection in the psychological literature. I’m not taking sides here - I think we’re all doing it wrong and I have no idea how to do it right. The closest I can come to a positive claim here is that I broadly think that Bayesian inference is pretty sensible, and that model selection works better when the researcher has clear goals about what counts as a theoretically meaningful pattern in the data. So to the extent that I still hold any opinions on this topic, the Pitt et al (2006) paper on “parameter space partitioning” (silly name for an interesting idea) is probably still where I feel like I am.

Sampling assumptions in reasoning

The one that gives me the most headaches is this one:

There was nothing sketchy in the data collection or modelling process (honest!) but I have a lot of difficulty in accepting my own conclusions here. The paper was looking at whether (and to what extent) people’s beliefs about how data were generated shape the way we make inductive generalisations from those data. I think it’s an important problem, since it’s pretty fundamental to how people think about evidence in the world, but the empirical result in the paper feels weaker to me than I’d like. The result ended up being “some people are sensitive to it in a way that is predicted by the standard Bayesian inductive generalisation model proposed by Josh Tenenbaum and Tom Griffiths back in 2001 and some people aren’t”, which somehow feels… unsatisfying.

In a sense, my dissatisfaction with that paper has led to a lot of follow up papers. It wasn’t exactly intentional, but I’ve tried this basic “sampling assumptions” manipulation here, here, here, here, here, here and here. Across all these papers I must have run 15-20 experiments? There’s not much of a file drawer here, thankfully. Almost every experiment I’ve run has ended up documented in one of these papers (note that some are still under review), so there are only one or two undocumented failures there. The problem I have is that the effects aren’t as consistent as I’d like. There are some cases where you get the effect in a super reliable way - for instance in this paper a simple manipulation of the sampling mechanism produced a large main effect of sampling in 9 out of 9 experiments. That one I feel pretty confident in! But the Bayesian theory makes a lot of other ancillary predictions about what kind of interaction effects should exist. On those occasions where we found any effects, they were (almost) always exactly what the model predicts… but we don’t always get the effects. For instance: the quantity of evidence should interact with the effect of sampling mechanism. When we manipulated it within-subject we obtained really strong interactions exactly as predicted, but when we manipulated it between-subject the effect was as null as I’ve ever seen. The theory doesn’t accommodate that (yet!).

Anyway, I guess I do have confidence that these papers are saying something sensible, but I’m still very unhappy with them as a whole because – despite being the person who ran all of them – I can’t yet see how all the pieces fit together in a coherent way, and I would not be at all surprised if some of the effects turn out to be fragile or don’t replicate.

To my mind this is still a work in progress.

Bayesian nonparametric categorisation models

There’s a cluster of papers that are all broadly about the same project, and I think it makes sense to evaluate them together.

I don’t think there’s a single empirical result in any of these papers! They’re all computational modelling results or mathematical results specifying the relationships between different models. In that sense there’s nothing to lose confidence in. Looking back at all my modelling papers from this era I really wish I’d been better at archiving my code. I think I’ve been on GitHub for quite some time (I was deeply amused to get an email from them referring to me as an “early adopter”… which seems dubious), but I’ve only recently started getting into the habit of using it properly. I’d love to know how well the modelling in these papers stacks up.

My feeling is that Anderson’s (1990) rational model of categorisation is pretty finicky to work with, especially when the underlying Dirichlet process mixture model is combined with the greedy sequential assignment algorithm he originally proposed. I do think that the variations on it that we introduced in the Sanborn et al (2010) paper do actually make the model less temperamental, but it’s been so long and I wish I had better code archives. (Okay that one I think I can “technically” blame Adam since he did all the really hard stuff on that paper, but it’s not like my code from that era is well archived). The one thing that I am a little less than happy with in retrospect is that I guess I wish we’d applied the model to more data sets, to get a better sense of which inference algorithms work best in different contexts.

The textbook

Ah right, the textbook!

I think there’s a lot to like about it as a simple tutorial introduction that covers a range of topics in statistics and in using R, but I really don’t like the lsr package associated with the book. I stopped revising it after a less than pleasant experience submitting to CRAN, and decided that R package development wasn’t for me after that and I’ve never revisited the code. Honestly, I wouldn’t recommend anyone use the lsr package for anything other than following along with the content in the book. Especially the etaSquared() function, which I think is broken for unbalanced ANOVA designs. Besides, $$\eta^2$$ isn’t a very good measure of effect size, so that’s another reason not to use it!

The book itself, on the other hand… it’s okay I think? I think a lot of it is helpful, and because I had the foresight to maintain control over copyright I could release it under a Creative Commons licence. Seeing other people reuse parts of the book has been really wonderful (I’ve been trying to keep track of them on the lsr homepage) and it’s been one of the most satisfying things I’ve experienced in my career. Plus, seeing other people cannibalise the parts that they liked from it has given me a bit more confidence in thinking about what parts I want to reuse in my own teaching resources. My current version is here.

Reaction time modelling

Two more papers that are purely theoretical. The Navarro & Fuss (2009) one is the more important of the two, and derives a simple method for computing first passage time densities for Wiener diffusion processes.

The core idea in the paper isn’t all that novel, insofar as it’s really just a heuristic for working out which of the two infinite series expansions from Feller’s (1968) textbook you should use to compute the density, and the idea had been floating around before we formalised it.
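To make the heuristic a bit more concrete, here is a minimal Python sketch of the general idea: compute how many terms each series would need to hit a target error, then evaluate whichever series is cheaper. Treat it as an illustration under stated assumptions rather than the reference implementation. The function names (`small_time_series`, `large_time_series`, `terms_needed`, `wiener_fpt_density`) are made up for this example, and the parameter names `v` (drift rate), `a` (boundary separation), `w` (relative start point) and `eps` (error tolerance) are my own labels. The truncation rules in `terms_needed` follow my recollection of the bounds reported in the paper, so check them against the original before relying on them.

```python
import math


def small_time_series(u, w, n_terms):
    """Small-time expansion of the standardised density (v = 0, a = 1)."""
    lo = -math.floor((n_terms - 1) / 2)
    hi = math.ceil((n_terms - 1) / 2)
    total = sum(
        (w + 2 * k) * math.exp(-((w + 2 * k) ** 2) / (2 * u))
        for k in range(lo, hi + 1)
    )
    return total / math.sqrt(2 * math.pi * u ** 3)


def large_time_series(u, w, n_terms):
    """Large-time expansion of the same standardised density."""
    total = sum(
        k * math.exp(-(k ** 2) * math.pi ** 2 * u / 2) * math.sin(k * math.pi * w)
        for k in range(1, n_terms + 1)
    )
    return math.pi * total


def terms_needed(u, eps):
    """Terms each series needs for error below eps (bounds are an assumption)."""
    if math.pi * u * eps < 1:
        k_large = math.sqrt(-2 * math.log(math.pi * u * eps) / (math.pi ** 2 * u))
        k_large = max(k_large, 1 / (math.pi * math.sqrt(u)))
    else:
        k_large = 1 / (math.pi * math.sqrt(u))

    if 2 * math.sqrt(2 * math.pi * u) * eps < 1:
        k_small = 2 + math.sqrt(-2 * u * math.log(2 * math.sqrt(2 * math.pi * u) * eps))
        k_small = max(k_small, math.sqrt(u) + 1)
    else:
        k_small = 2.0
    return k_small, k_large


def wiener_fpt_density(t, v, a, w, eps=1e-10):
    """Density of hitting the lower boundary at time t."""
    if t <= 0:
        return 0.0
    u = t / a ** 2  # rescale to the standardised (a = 1) time axis
    k_small, k_large = terms_needed(u, eps)
    # The heuristic: use whichever expansion needs fewer terms.
    if k_small < k_large:
        p = small_time_series(u, w, math.ceil(k_small))
    else:
        p = large_time_series(u, w, math.ceil(k_large))
    # Undo the standardisation and reintroduce the drift.
    return p * math.exp(-v * a * w - (v ** 2) * t / 2) / a ** 2
```

Something like `wiener_fpt_density(0.5, v=1.0, a=2.0, w=0.5)` ends up on the small-time branch, whereas a long standardised time (say `t=3.0` with `a=1.0`) ends up on the large-time branch. Both series converge to the same function, so the choice only affects how much work is needed.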

As for loss of confidence? None really! I still believe this one. There are too many different independent implementations of this algorithm and other ones that all give the same results. If there was something broken Joachim Vandekerckhove would have discovered it and told me about it by now 😀

The small world of words paper

Ah the small world of words paper. This is a fun one to think about. This paper was the first release of a large scale data collection exercise that Simon De Deyne has been working on for many years now (the rest of us have helped out where we can but he deserves 99% of the praise for this project!).

This version of the paper pertains to the Dutch language norms, and the English language version has finally been accepted at Behavior Research Methods. In one sense I have quite a lot of confidence in this project: we have data from tens of thousands of people and have several million responses in the data. Of course, there are a lot of limitations to the work (e.g., self-selected samples, definitely a bias towards WEIRD populations etc), but at the same time I think the project is well documented, so this one I trust.

Similarity models

Ugh, the similarity model papers…

Look, at this point your guess is as good as mine. Again, these aren’t empirical papers, they’re modelling papers. For what it’s worth:

• As I mentioned in the Bayes factor post I’m not at all convinced that the algorithm we used in the “dimensions and features” paper works well. My recollection from doing it is that it’s super finicky and depends on the prior specification in ways that I could never find very defensible.
• The Navarro & Lee (2004) paper, well, I think I still agree with our claim that there’s something weird about how Tversky (1977) operationalised the “global weighting parameters” in his featural models of similarity, but I’m not entirely convinced our version was any better, quite frankly.
• The Navarro & Griffiths (2008) paper is an interesting case. As I’ve mentioned, I’ve become pretty skeptical about the value of Bayesian nonparametrics as a model selection tool (despite once arguing for them!) so I no longer think there’s much value in using the Indian Buffet Process in the particular way we used it in this paper. On the other hand, I do remember being surprised at how well the algorithm worked at finding good features with very little tuning. The idea of using a Bayesian model averaging method here to select features that have high posterior probability actually seemed to work better than it had any right to. On the gripping hand… it did seem to have a bias to extract “small” features (i.e., features that are possessed only by a small number of entities), and I was never 100% convinced my implementation was as good as it needed to be, so I think there’s something a little weird going on.

As with the other modelling papers, I wish I’d archived code better. I’m so happy I’ve moved to GitHub these days. Now I just need to learn how to use Docker so that I can do long-term archiving better!

Miscellaneous

For the other papers, again they’re mostly theoretical.

I have often thought about running a replication of the experiment from Lee & Navarro (2002), though. In retrospect I feel like that was a little underpowered. I have no reason to expect it not to replicate (the effects were large and fairly unsurprising), but it would be worthwhile doing at some point.

The theoretical work in the Shafto et al (2012) paper and the Navarro & Perfors (2011) paper both seem sensible to me, but I think in both cases the key limitation is that we don’t have the data (at least in these papers) to test any non-obvious predictions of the models. For what it’s worth I think that the Navarro & Perfors (2011) paper has largely been superseded by later papers (e.g., some of the work by Todd Gureckis, Jonathan Nelson, and others) that more clearly articulate where we wanted to go with that work, and honestly I think they do it better than we did. For the Shafto et al (2012) paper you’d probably have to ask Pat his opinion! I helped out on that paper enough that I don’t feel bad about being included as an author, but it’s just far enough outside my usual area that I’d not want to hazard strong guesses about how well that paper holds up.

So what do I take away from this?

I guess looking over this I feel like there’s a few main thoughts:

• I’m not as confident in the sampling assumptions papers as it looks in the literature, which I think is the reason I’m still working on that problem
• I’ve become very skeptical about model selection methods, and think almost all of my old papers are way too optimistic about the usefulness of statistical tools!
• I like my stats textbook much more now that it’s open for other people to use and remix in ways they see fit
• I really wish I’d archived my code better

Now that I write it down, that feels right to me as a simple summary of how I feel about the main lines of work that I’ve done (at least those papers that other people liked enough to cite, I suppose!) None of these strike me as very deep insights, but it is kind of satisfying for me personally to have done this.

Overall verdict: I kind of enjoyed thinking about my own papers in this fashion. Asking yourself, “in all honesty, do you still believe what you wrote?” is kind of useful. B+ navel gazing experience. Would recommend.

1. In truth the real reason I’m doing this is that I’m trying to work out how I want to approach preregistration (something I’m quite overdue on doing, I feel) - I think it’ll help me get my thoughts together for the future by looking back.

2. I’m skipping that one paper on quantum walk models because I think that most of the citations to that are false positives by Google Scholar, plus I have nothing to say about that paper other than to confess that Ian did all the maths and I barely understand what the paper is saying!

3. As an aside, I suspect that’s a big part of why I’ve been struggling to work out how to think about preregistration. Most things I’ve read on the topic seem to focus on preregistration for empirical work that aims to test hypotheses, which is great, but my empirical work is usually pretty exploratory, in the sense that I often don’t have any particular hypotheses other than “I think this data set will help me distinguish between rival models”. I suppose I could preregister that?? Anyway, let’s leave that for later.