Paths in strange spaces, part I

by Danielle Navarro, 09 Nov 2019

Here we go again, around and round
Here we go again, around and round
We’re babies passing for adults
Who’ve loaded up their catapults
And can’t believe the end results
So here we go again
      – Aimee Mann, A simple fix

A little over a week ago some colleagues and I uploaded a preprint to PsyArXiv entitled Preregistration is redundant, at best. The manuscript is very brief, and is a response to another brief paper entitled Preregistration is hard, and worthwhile published in Trends in Cognitive Sciences. We didn’t advertise the paper on twitter. Personally I would have preferred not to do so at all, but the psyarxiv-bot took matters out of our hands and posted it for us.

The firestorm that erupted was almost instantaneous.

I’m only a minor author on the paper – the hard work was done by Aba Szollosi and Chris Donkin, with commentary and contributions added by myself, Rich Shiffrin, Trish Van Zandt, Iris Van Rooij and David Kellen – but I’m terribly visible on twitter. So I had the privilege – or misfortune, depending on your point of view – to be tagged in tweets before I even knew the preprint was up. The experience was unpleasant, and not one that I’m in a hurry to repeat. In the last week there have been many responses to our paper, some positive and others negative, some thoughtful and others cruel. In this deluge of commentary there were two responses that I found particularly valuable: this blog post by E.J. Wagenmakers and this thread by Christina Bergmann. What I like about both responses is that while both authors end up arguing that preregistration is worthwhile – and both can be viewed as critiques of our paper – neither is dismissive and neither is insulting. Better yet, both critiques are thoughtful and have made me think a lot about my own views, reconsidering some and reframing others. This kind of critique is one of my favourite things in science. I only wish I worked in an environment in which critiques were common and insults rare, rather than the real world in which these quantifiers are reversed.

Inspired by E.J. and Christina, this post is an attempt to articulate my own opinions on preregistration. Curious as this may sound given the high drama of the last week or so, I have not actually said much about my own views about preregistration. The paper – of which I am rather fond – was mostly Aba and Chris’ work. While I’m glad I was able to contribute to their work, the only time I’ve ever expressed my own views about preregistration at any length was when I wrote Prediction, prespecification and transparency in an invited blog post for the Psychonomic Society back in January.

I mention this because my post here is not really intended to add more fuel to the conflagration that has engulfed so much of open science twitter lately. Now that the reviews of our manuscript have been returned (with a revise and resubmit decision) I thought I might instead pick up where I left off back in January, when I started trying to put my own thoughts together.

Before I start, however, an attention-conservation notice. This is going to be a long post, but I’ll try to make it more enjoyable with a few digressions into generative artwork. Perhaps it will transpire that some of these discursions turn out to be relevant to the topic at hand…

seed_heart(1000) %>%
  time_tempest(iterations = 2, scale = .1) %>%
  style_ribbon(type = "curve", size = .5, alpha_init = .5,
    palette = shades("gray"), background = pagecolour)

Prelude I: Defining one’s terms

One property I have observed in the online discussions of preregistration is that people are often unclear about what they take the term preregistration to mean. Does it refer specifically to the registrations toolkit within OSF? Does it cover the less polished system offered by Does a non-public statement count as a preregistration? Is it a preregistration if the author can edit it after the fact but does not in fact do so? Does a precise mathematical model stated in paper X count as a preregistration for its subsequent use in paper Y? There is a distinct lack of clarity about these questions, and this lack of precision makes it difficult to have a sensible conversation on the topic. So for the purposes of the post I will assume that a preregistration system has the following properties:

  • It allows authors to state data collection and analysis plans prior to running an experiment
  • The statement must pertain to a specific experiment, not a general class of future experiments
  • The statement is public, not private, and easily accessible by anyone
  • Once the statement is lodged, the author cannot modify it

According to this construction, the OSF registration system and the system both constitute preregistration tools; GitHub repositories do not. Similarly, open lab notebooks do not constitute preregistrations; previously published computational models are not preregistrations; and so on.

This is a narrow definition of preregistration, admittedly, but I think it is an appropriate one. Almost without exception the public discussion around preregistration focuses on systems that are akin to the OSF registration system, and the only occasions on which I have seen anyone assert that “but a formal model is a preregistration…” are those in which the speaker has found themselves trying to move the goalposts. When one reads the Preregistration is hard, and worthwhile paper, it is very clearly making claims about an OSF-like system, and so too was our response paper. Neither paper makes claims that pertain to (say) the appendices of Tversky’s (1977) paper on Features of similarity in which he provides an axiomatic definition of what his featural account of stimulus similarity asserts. It is pure sophistry to pretend that the latter is part of the scope of what we mean in everyday parlance (it violates the first and second criteria above), and I will not treat this as a form of preregistration.

Prelude II: Defining the scope of one’s argument

The second area in which I believe clarity is important at the outset regards the scope of the argument being made. The discussion on the vices and virtues of preregistration has been far-ranging, and I have neither the desire nor the ability to capture all of the relevant issues in a blog post. As such, I will restrict myself to two specific claims about preregistration, namely:

  • Preregistration prevents p-hacking
  • Preregistration provides transparency

I chose these two quite deliberately, because I think they are the two most commonly advanced justifications for preregistration, but I think they are quite different from one another. In the first half of this post I’ll talk about why I do not take the former claim seriously in the context of psychological research; and in the second half I’ll talk about why I think the latter claim is accurate but misleading.

A second point about the scope of the piece. I’m not going to offer empirical evidence for my assertions here. Although I am quite firmly of the view that many questions about the efficacy and utility of preregistration are ultimately empirical claims rather than logical or philosophical ones, it is the latter that I am addressing in this post. That is to say, I am sketching out a case for why a rational person might have some justifiable skepticism about some of the claims that have been made about preregistration. Whether such skepticism turns out to be warranted by the data about how science progresses (in the presence or absence of preregistration tools) is a different question than the one I’m considering here. It is, nevertheless, an important topic and the fact that I do not discuss it here should not be taken as evidence that I think such matters are irrelevant.

With these preliminaries taken care of, I’ll get to work, and the place for me to begin is to comment a little on how I think about the process of scientific discovery…

seed_bubbles(1, 5000) %>%
  time_tempest(iterations = 50, scale = .05) %>%
  style_ribbon(burnin = 45, type = "curve", 
    size = .25, alpha_init = .5, alpha_decay = .03, 
    palette = shades("gray"), background = pagecolour)

On scientific discovery

Our conclusions must be warranted by the whole of the data, since less than the whole may be to any degree misleading. This, of course, is no reason against the use of absolutely precise forms of statement when these are available. It is only a warning to those who may be tempted to think that the particular precise code of mathematical statements in which they have been drilled at College is a substitute for the use of reasoning powers … in which, as the history of the theory of probability shows, the process of codification is still incomplete.
      – Sir Ronald Fisher, The logic of inductive inference (1935)

Very often, when people discuss preregistration as a tool that scientists can adopt, it is introduced as a tool for improving our statistical inferences. Specifically, preregistration is advocated as a tool we can use to minimise the risk of “p-hacking” in the scientific literature. Although this is so often the go-to justification for preregistration, in my opinion it is easily the least valuable way in which scientists can use preregistration. I say this because I am increasingly of the view that hypothesis testing – be that in the form of an orthodox null hypothesis test or a Bayes factor analysis – is almost never relevant to the practice of scientific discovery. To put it crudely: I am quite unconcerned with p-hacking because in most instances I have little interest in p-values specifically nor in hypothesis testing generally.

The story of why I’m so uninterested in the process of testing a scientific hypothesis – in the usual sense of seeking to falsify that hypothesis – begins with a paper I published with my colleague Amy Perfors in Psychological Review back in 2011. The paper is entitled Hypothesis generation, sparse categories and the positive test strategy and it is not a statistics paper or indeed any kind of methodological paper. Rather, it is a cognitive science paper, and its goal was to discuss some of the reasons why confirmation bias is so prevalent in human reasoning. The key idea in our paper is that in the kinds of environments people tend to inhabit, if you have only a single plausible-sounding hypothesis that might account for some observed phenomenon and are afforded only the opportunity to pose yes/no “questions” of the world, on average you will tend to learn more (in the sense of expected information gain) by asking questions that are designed to confirm your hypothesis, rather than falsify it. Looking for “yes” is more informative – in the typical case – than looking for “no”.

This result is surprising and might sound suspicious to some readers. Indeed, some suspicion is deserved, because the result does not hold generally and only applies to some kinds of worlds that the reasoner might find themselves in. Moreover, there’s quite the literature on this topic and I’m not going to delve into it deeply here (except to note that our paper is merely one of many that have discussed this topic, and arguably not even one of the best papers that have done so!) but the thing I want to point out is this: if you have a hypothesis and your goal is to learn the truth about the world, attempting to falsify your hypothesis is in many cases a very inefficient learning strategy.

To anyone with a background in the philosophy of science literature these psychological findings would come as little surprise. The view of hypothesis testing I articulated earlier (and which is popular among many scientists) is usually referred to by philosophers of science as the “naive falsificationist” view, and is a simplified version of Karl Popper’s work. From my – admittedly superficial – reading of the history, Karl Popper’s views on falsification stemmed from the logical asymmetry between verification and falsification: if a theoretical proposition \(P\) entails that we should observe the quantity \(Q\) and we do in fact observe \(Q\), we cannot logically conclude \(P\) to be true. Affirmation of the consequent is not a deductively valid form of reasoning. In contrast, if we observe \(Q\) to be false we can indeed reject proposition \(P\); denial of the consequent is our modus tollens, and is deductively valid. Viewed this way, falsificationism offers us the tantalising possibility of placing science on a deductively solid foundation.
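
For readers who like to see the asymmetry written out, here is one way to set the two argument forms side by side in standard propositional notation. The layout is my own shorthand rather than anything taken from Popper:

\[
\underbrace{\frac{P \rightarrow Q \qquad \neg Q}{\therefore\ \neg P}}_{\text{modus tollens (valid)}}
\qquad\qquad
\underbrace{\frac{P \rightarrow Q \qquad Q}{\therefore\ P}}_{\text{affirming the consequent (invalid)}}
\]

A single counterexample is enough to show the second form fails: take \(P\) false and \(Q\) true. Both premises hold (the conditional is vacuously true) yet the conclusion \(P\) does not.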

This is an appealing idea, as it would give us some of the inferential certainties we as scientists crave, but on closer inspection this foundation is decidedly unstable. Even if we set aside the nuances of what falsificationism looks like in practice – e.g., I’m going to pretend for the purposes of this post that I have never heard of the problem of ancillary assumptions – Popper’s argument pertains to the testing of a single hypothesis. He says nothing about what process a scientist should follow in order to construct this hypothesis, nor does he tell us what we should do if that hypothesis is falsified. Worse yet, the falsificationist perspective provides no guarantee at all that this process of sequentially testing and falsifying hypotheses will get us any closer to the truth, or indeed any truth-like theories. There are infinitely many hypotheses that one might entertain about the world and only finitely many tests that we as scientists have the ability to construct. Mere falsification does not get us very far, and if our goal is to discover truths, a stronger principle is required.

Another way of framing this is to note that we have two rather different problems to consider. The process by which scientists search the space of possible theories for plausible hypotheses, and the process by which a scientist tests these hypotheses once generated, are largely unrelated. As we found in our Psychological Review paper, a strategy that is good for one need not be good for the other: a falsificationist perspective allows you to evaluate a single hypothesis (per the infamous “four card task” proposed by Peter Wason in 1968), but it says very little about how one should search a space of possible hypotheses (per the much more interesting “2-4-6” task that Wason introduced in 1960).

Naturally, because philosophers of science are sneaky people, this issue has been discussed extensively over in their little corner of the world. In particular, Imre Lakatos raised exactly this counterargument against Karl Popper’s falsificationism. Viewed from the perspective of a scientist whose goal is to make discoveries, rather than the perspective of an engineer who wishes to test a system, Popper’s framing of the problem is a kind of stage magic. What he does, in effect, is sweep everything that makes science hard (i.e., search and discovery) off to the side, replace it with a much simpler problem of testing and falsification, and declare that the result of this procedure can be viewed as a normative theory of science. Fortunately for me, I don’t have to call out this trickery myself, because Lakatos already did it! Having read only a small part of his work, I’ll cheat slightly and rely on the Stanford Encyclopedia of Philosophy:

But Lakatos points out a problem. There is now a disconnect between the game of science and the aim of science. The game of science consists in putting forward falsifiable, risky and problem-solving conjectures and sticking with the unrefuted and the well-corroborated ones. But the aim of science consists in developing true or truth-like theories about a largely mind-independent world. And Popper has given us no reason to suppose that by playing the game we are likely to achieve the aim. After all, a theory can be falsifiable, unfalsified, problem-solving and well-corroborated without being true.

This is precisely my concern with equating scientific work with the “testing” of hypotheses. At a fundamental level I do not believe this is a sensible thing to do, and to focus so relentlessly on testing things the way we do is a very bad idea.

Switching focus slightly, it is worth noting that Lakatos’ philosophical concerns can be found mirrored in the writings of scientists and statisticians. As an example, you can find a lot of the same concern expressed in the early writing of Sir Ronald Fisher. His 1935 paper on the logic of inductive inference is a quite wonderful article. Not only does he cover traditional statistical topics such as consistency, sufficiency and ancillarity, he is quite careful to highlight the dangers of relying too much on deductive reasoning:

Although some uncertain inferences can be rigorously expressed in terms of mathematical probability, it does not follow that mathematical probability is an adequate concept for the rigorous expression of uncertain inferences of every kind. This was at first assumed; but once the distinction between the proposition and its converse is clearly stated, it is seen to be an assumption, and a hazardous one. The inferences of the classical theory of probability are all deductive in character. They are statements about the behaviour of individuals, or samples, or sequences of samples, drawn from populations which are fully known.

In modern parlance, what Fisher is talking about here is the fact that every statistical inference that we make requires us to assume that some model \(\mathcal{M}\) provides an adequate characterisation of the problem at hand. We cannot do statistics at all without constructing a model (or set of models), and all our inferences depend on a kind of parlour trick… we use our models as proxies for the world itself, consider what our models assert about the data we have observed, and then use this consideration to licence conclusions about the world. As was his wont, Fisher pulls no punches. He goes on to note that Bayesianism (the theory of inverse probability) offers no defence against this vulnerability:

Even when the theory attempted inferences respecting populations, as in the theory of inverse probability, its method of doing so was to introduce an assumption, or postulate, concerning the population of populations from which the unknown population was supposed to have been drawn at random; and so to bring the problem within the domain of the theory of probability, by making it a deduction from the general to the particular

Now, some care is needed here. Fisher’s assertions about the limits of probabilistic models are not precisely the same as Lakatos’ claims about the limits of falsificationism. They operate at different levels of abstraction, they have different historical contexts, and so on. The thing they have in common, however, is that they expose the danger of pretending that your elegant inferential tool – whether that be falsificationism or statistical hypothesis testing – can possibly serve as a good proxy for the scientific discovery process itself.

Though I am myself a Bayesian data analyst by preference, and as such I’m less opposed to inverse probability than Fisher was, I am very much in agreement with him that all statistics starts with an unverifiable supposition that a particular model class is applicable to the problem at hand, and because these models are – at best – a crude approximation to the world, it is an act of extreme recklessness to put too much faith in them. They are tools, nothing more. When it comes to doing science, the tools might be handy things to use from time to time, but when you want to work out what they are telling you about the world itself, you’re on your own honey.

On hypothesis tests and mathematistry

It has not escaped my attention that nothing in my discussion of scientific discovery has, to this point, made any connection with the role of preregistration and the dreaded crime of “p-hacking”. It is high time I remedied this oversight, and the place for me to start is by questioning the value of hypothesis tests – after all, if preregistration is purportedly of value because it prevents us from distorting the results of a hypothesis test, then it must follow that hypothesis tests are themselves of some value to us as scientists. There is little point in trying to fix a thing that is irredeemably broken, after all.

With that in mind, I want to turn to my very favourite article in statistics. It is a 1976 paper by George Box, simply entitled Science and statistics. Box’s paper is a reflection on the legacy of Sir Ronald Fisher as a scientist and a statistician, and the manner in which Fisher’s thinking was influenced by the fact that he was both. It is a brief paper, and one I would recommend that every scientist and statistician read at least once in their career. In the later stages of the paper Box discusses Fisher’s irritations with mathematistry, which is pertinent to my annoyance with the claim that “preregistration prevents p-hacking”. Here’s Box:

Mathematistry is characterized by development of theory for theory’s sake, which since it seldom touches down with practice, has a tendency to redefine the problem rather than solve it. Typically, there has once been a statistical problem with scientific relevance but this has long since been lost sight of.

To my mind, mathematistry is slightly more general and is concerned with what we might call “the illusion of rigour”. It applies in any situation where we confine ourselves to an unnecessarily rigid system to provide ourselves with an impression of sureties where none in fact exist. In Between the devil and the deep blue sea, I suggested that mathematistry is the practice of

using formal tools to define a statistical problem that differs from the scientific one, solving the redefined problem, and declaring the scientific concern addressed.

The practice of hypothesis testing in psychological science (via null hypothesis testing, Bayes factors, or what have you) often has this character. Statistical hypothesis testing provides an elegant, clean formalism for connecting a statistical toy (your model) to a dubious measurement (your data)… and at the end you get a number, \(p\), that is supposed to summarise what these two things have revealed to us about the world.

Is this practice of interest to the scientist? I think not.

Hypothesis tests are mostly an irrelevance to scientific process, in my view, and any attempt to “fix” them via preregistration or any other practice is misguided. I’m reminded of the following quote from the classic 1963 paper Bayesian statistical inference for psychological research by Ward Edwards, Harold Lindman and Leonard Savage:

No aspect of classical statistics has been so popular with psychologists and other scientists as hypothesis testing, though some classical statisticians agree with us that the topic has been overemphasized. A statistician of great experience told us, “I don’t know much about tests, because I have never had occasion to use one.”

I wish I could say the same. In my own research I have on many occasions been required to report the results of a hypothesis test, such being the expectations of editors, reviewers and readers, but I confess in no instance have I found these tests helpful. Were I given the freedom to do science the way I think best, I would not report a single \(p\)-value or Bayes factor in any of my papers. The fact that I do so is entirely because I am forced to do so by others.

The reason I have little time for hypothesis tests is that I have yet to encounter a scientific situation where I have a hypothesis that I would deem to be sufficiently precise that it warrants testing. And I am saying this as a mathematical psychologist! Compared to most psychologists my work is highly circumscribed, extremely formal, and precise in a fashion that we don’t usually bother with in this discipline. Nevertheless, I do not believe my theories and my models are well-formed enough that I would attempt to “test” them in the manner that a null hypothesis test requires. There is too much uncertainty regarding my operationalisations and too much unclarity about what my models assert about the structure of human cognition for this to be wise. As Fisher noted in 1935:

Our conclusions must be warranted by the whole of the data, since less than the whole may be to any degree misleading.

For the situation we find ourselves in as psychologists, this is a major concern. In almost all cases our measurement instruments are proxies – we are never measuring the actual thing we care about – and our models are fantasies. It is very rare that our data are tightly linked to the phenomenon of interest and even rarer for the model to bear any strong connection to the theoretical claim we wish to assert. In such a world, your \(p\)-value is a lie before you even compute it. We are not drawing conclusions from “the whole of the data” because most of the relevant facts are hidden from us in the first place. It makes little difference whether your \(p\)-value has been “hacked” because it wasn’t telling you very much in the first place. Rigorously quantifying an inference using a terrible statistical tool simply to cling to the illusion that it is more “scientific” to attach a number to your guesswork is as pure an example of mathematistry as I can think of. As a discipline we have become entranced by the mathematical elegance of statistical theory, and I think it is to our detriment. Box continues:

Furthermore, there is unhappy evidence that mathematistry is not harmless. In such areas as sociology, psychology, education, and even, I sadly say, engineering, investigators who are not themselves statisticians sometimes take mathematistry seriously. Overawed by what they do not understand, they mistakenly distrust their own common sense and adopt inappropriate procedures devised by mathematicians with no scientific experience

Unless and until we reach the point where we could plausibly construct the required mapping between a theoretical claim and an observable measurement, there is no point in deluding ourselves into thinking that our hypothesis tests are fit for purpose, regardless of whether we have preregistered them. We are, as Aimee Mann notes, “babies passing for adults, who’ve loaded up their catapults, and can’t believe the end results”.

seed_sticks(10, 5000) %>%
  time_tempest(iterations = 5, scale = .05) %>%
  style_ribbon(burnin = 4, type = "curve", curvature = 1,
    size = .25, alpha_init = .2, alpha_decay = .03,
    palette = shades("grey"), background = pagecolour)

Intermission: When your plan becomes a prison

Before I move on, it is important to recognise the narrowness of the claim I have made above. Specifically, the argument in the preceding section is entirely focused on the particular claim that “preregistration prevents p-hacking”. It does not have anything to say about other possible virtues of preregistration. In respect of that claim, I argue that advocating preregistration as a solution to p-hacking (or its Bayesian equivalent) is deeply misguided because we should never have been relying on these tools as a proxy for scientific inference in the first place.

It is important to recognise this, I think, because there is a very real danger that preregistration systems will ossify scientific practices in an undesirable fashion. That is to say, while advocates of preregistration often claim that “preregistration is a plan, not a prison”, they will also claim that preregistration is necessary to prevent p-hacking. Unfortunately, you cannot have it both ways: in my opinion these two claims are in direct opposition to one another unless you also stipulate an unreasonable level of foreknowledge on behalf of the experimenter.

To see why I say this, I think it is helpful to formalise what we mean by “p-hacking” here. Under Fisher’s informal view of hypothesis testing it’s not entirely clear what would constitute p-hacking, so – adopting the same rhetorical position that advocates of preregistration use – I’ll switch to using Neyman’s more formal decision theoretic construction, and assert that p-hacking is any research practice that causes the “true” Type I error rate associated with a test to depart from its nominal rate.[1] Can preregistration guard against this? Well, I think it depends on how strict you consider the preregistration to be. I can think of three cases:

  • Strict. Researcher specifies the complete analysis plan in advance and does not deviate even if the data turn out to be highly surprising. This does indeed prevent p-hacking in the conventional sense but forces the researcher to use inappropriate statistics: under the strict interpretation, preregistration is in fact a prison, not a plan.

  • Flexible. Researcher specifies the analysis plan, but is willing to deviate if in retrospect the planned analysis seems inappropriate for the data. This satisfies the claim that preregistration is a plan not a prison, but does so by opening the door to p-hacking once more. A researcher may unwittingly decide to apply greater scrutiny to data that they find disappointing, and thereby become more likely to discover reasons to justify departures from the plan.

  • Oracular. There is a third possibility: the researcher writes a preregistered plan that covers every possible eventuality, listing exactly how each case will be handled and then – because this composite if/then decision making procedure is no longer a specific hypothesis test – derives an appropriate decision policy for that plan which satisfies Neyman’s admissibility criteria. This would indeed allow us to have the best of both worlds: because the prespecified “deviation plan” is incorporated into the design of the subsequent decision policy, we can have the flexibility we desire (our preregistration is indeed a plan and not a prison) while ensuring that our overall Type I error rate remains bounded at its nominal level (no p-hacking allowed). Perfect… and all it requires is godlike planning abilities! I’ll confess I don’t personally have the intellect to construct that kind of analysis plan for the kinds of experiments I do, but perhaps someone smarter than me can figure it out.
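
To put a rough number on the worry in the flexible case, here is a back-of-envelope calculation. Everything in it is an assumption chosen purely for illustration: the null hypothesis \(H_0\) is true, the researcher has \(k\) defensible analyses to choose from, each has nominal level \(\alpha\), the \(k\) tests are statistically independent, and a result gets reported if any one of them comes out significant. That last assumption is a deliberate worst case, and real analyses of the same data would be correlated rather than independent, which shrinks the inflation without removing it.

\[
\alpha_{\text{true}} \;=\; \Pr(\text{at least one rejection} \mid H_0) \;=\; 1 - (1 - \alpha)^{k}
\]

\[
\alpha = 0.05,\ k = 5: \qquad \alpha_{\text{true}} \;=\; 1 - 0.95^{5} \;\approx\; 0.226
\]

Under these toy assumptions the “true” rate is more than four times the nominal one, which is exactly the kind of departure that the definition above counts as p-hacking. The oracular plan would have to fold all \(k\) branches into a single decision rule whose overall rate is \(\alpha\).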

There is a second, perhaps worse, sense in which preregistration can become an unintended prison. In practice, when you look at the kinds of plans that researchers are often expected to preregister – and the reasons given for why we must do so – they almost always relate to the preregistration of hypothesis tests. Although it is entirely possible to preregister something else (e.g., that the researcher intends to undertake an exploratory data analysis focusing only on estimating unknown quantities and not running any tests), it is very clear from the relentless focus on p-values and p-hacking that this is NOT the use case that most people have in mind for preregistration. If anything, because we focus so much on p-hacking specifically when we talk about preregistration, we end up in a situation where we direct even more of the methodological discussion onto hypothesis tests.

In other words – as much as I love the idea of having a plan, and endorse unreservedly the suggestion that we all should do so when designing, conducting and analysing experiments – I believe that any attempt to frame preregistration as a solution to the problem of p-hacking is a way of locking ourselves into a prison. When we do this we confine ourselves to a statistical practice (hypothesis testing) that I believe is deeply ill-suited to most of the scientific problems we face in psychology.

seed_bubbles(n = 2, grain = 50) %>%
  mutate(x = x * 25, y = y * 25) %>%
  time_meander(length = 100) %>%
  style_ribbon(alpha_init = 1, palette = shades("gray"),
    background = pagecolour)

Registration, documentation, and transparency

If you get a feeling
Next time you see me
Do me a favour and let me know
Because it’s hard to tell
It’s hard to say
Oh well, okay
      – Elliott Smith, Oh Well, Okay

There is a second kind of justification that we often give when extolling the virtues of preregistration, namely that it provides transparency to the scientific process. To my way of viewing things, this justification needs to be taken much more seriously. Unlike the “problem” of p-hacking – which I view as a mere side effect of the more serious problem of psychologists trying to force a round peg (scientific discovery) into a square hole (hypothesis testing), and one that cannot itself be solved by preregistering an unwise practice – the lack of transparency around scientific processes is a real problem, and one that preregistration might plausibly help with.

In fact, as it will turn out, it is in this respect that I quite like preregistration and advocate that it – or some other documentation process that is at least as thorough and transparent – be standard practice in scientific work. However, because I have once again found myself in an awkward position on twitter (why does this keep happening?), Part 2 of this post in which I present a defence of preregistration will have to wait…

  1. As an aside, there are so many reasons why this definition – though somewhat conventional in the literature – is completely incoherent. First, simply to refer to Type I error rates requires us to adopt Neyman’s formalism, and Neyman does not allow us to associate an error rate with any specific study, only with an arbitrarily large collection of studies, which may or may not be tests of the same hypothesis. Secondly, there is the awkward fact that a Type I error rate is defined only in those situations where the null hypothesis is actually the correct description of the world; and given the ubiquity of model misspecification in psychology and other sciences, I cannot for the life of me think of any observable quantity that could possibly be mapped onto the Type I error rate. It is a completely unobservable quantity, which should make a frequentist statistician feel a little uncomfortable. But that’s all beside the point.