What good is a bad model?

by Danielle Navarro, 23 Jan 2020

This post is my review of Nathan Evans’ very interesting paper What factors are most important in finding the best model of a psychological process?, written as a commentary on one of my own papers and submitted to Metapsychology. As it is journal policy to publish reviewer comments alongside the manuscript, I thought I might as well write up my review as a blog post. The level of disclosure is identical, and it makes writing the review more fun!


I’ll start with my overall assessment: this is a good paper and I thoroughly recommend that it should be published as-is. It should be entirely unsurprising that I disagree with some of the claims made in the paper, but I strongly agree with Nathan on the following:

I would like to note that my arguments only reflect one side of the contentious debate over how models of psychological processes should be evaluated – just as Navarro’s arguments only reflected another side of the debate. Therefore, I believe that researchers should read both Navarro (2018) and my comment with an appropriate level of scrutiny, in order to gain a more complete perspective on the broad issues within this debate and decide how they believe models of psychological processes should be evaluated.

The “Devil and the Deep Blue Sea” (henceforth DDBS) paper was solicited as a brief commentary, and it is indeed quite opinionated. It is both sensible and appropriate for researchers who disagree with me to write replies to me in this respect, and I think Nathan’s paper makes a very nice counterpoint to my own. Because of this, I will not make any suggestions about how the paper “should” be changed, but will instead comment on those respects in which I agree with it, and those in which I disagree.

Formally though, I recommend acceptance!

Part 1. Agreements, some partial

Directed search is important

After providing an introduction that places the paper in an appropriate context, the first section of Evans’ paper (pages 5-7) is entitled “What is the most important function of a process model?”, and does a nice job of capturing the heart of our points of agreement and disagreement. For example, when I refer to the importance of “directed exploration” in DDBS, this is very much what I have in mind:

Like a state-of-the-art optimization algorithm in the context of estimating the parameter values of a model, a model that makes predictions for novel contexts provides an efficient method of searching through the space of all potential data. These novel predictions can help lead researchers to sources of data that are most informative in teasing apart different models, while avoiding less informative sources of data; a level of efficiency that a giant ‘grid search’ through all possible data would be unable to achieve. Providing an efficient search of the data space is where I believe predictions about novel contexts are most useful, as they provide clear directions for what experiments are most likely to discriminate between competing models most clearly.

Within the mathematical psychology literature there is a long history of using formal models of cognition in exactly this fashion, and I think Evans and I are in full agreement that this is one of the important functions of formal modelling.

Quantitative tools are useful, but so are qualitative tools

The second section of Evans’ paper (pages 8-12) is entitled “Are certain data of more ‘theoretical interest’ than others?” and I find myself agreeing with a lot of the arguments here, but I am a little puzzled by the section title itself. On my reading, most of this section appears to be addressing a different question, namely “Are quantitative measures of performance useful?” My answer to the latter question would be a qualified yes – quantified measures of model performance can be very useful, but not always, and not to the exclusion of other considerations. I suspect that this is something we agree upon.

As an illustration, consider the claim (p9-10):

in the case of the Bayes factor (Kass & Raftery, 1995), a value of 1 indicates no distinction between the models, whereas larger (or smaller) Bayes factors reflect greater distinction between the models, until the evidence becomes overwhelmingly in favour of one model over the other. Therefore, quantitative model selection appears to both have the ability to reduce the ambiguity in which model is better, and to know the strength of evidence for one model over the other

I am personally of the view that the Bayes factor is not a very good tool in practice because in high-dimensional problems it can be sensitive to aspects of the model (e.g., tails of the prior) that are hard to set in any principled way, so I would not in fact agree with the claim exactly as stated. But I do think a softer version of this claim (i.e., that model fit can be quantified and comparisons can be made between models on that basis) is a very reasonable one to believe. Similarly:
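To make the prior-sensitivity worry concrete, here is a minimal sketch using a deliberately simple one-dimensional problem, which is my own stand-in and not anything from Evans' paper. It assumes normally distributed data with known standard deviation, a point null (mu = 0), and an alternative that places a Normal(0, tau^2) prior on mu. The function names and the data summary are hypothetical. The same data can favour either hypothesis depending only on the prior width tau, which is rarely something we can set in a principled way.

```python
import math

def normal_pdf(x, var):
    """Density of a zero-mean normal with variance `var`, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def bf01(xbar, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0: mu = 0 against H1: mu ~ Normal(0, tau^2).

    The data are n observations from Normal(mu, sigma^2) with sample mean xbar.
    Marginally, xbar ~ Normal(0, sigma^2 / n) under H0 and
    xbar ~ Normal(0, tau^2 + sigma^2 / n) under H1.
    """
    se2 = sigma ** 2 / n
    return normal_pdf(xbar, se2) / normal_pdf(xbar, tau ** 2 + se2)

# Hypothetical data summary: sample mean 0.25 from 100 observations.
xbar, n = 0.25, 100
for tau in (0.1, 1.0, 10.0, 100.0):
    print(f"tau = {tau:6.1f}   BF01 = {bf01(xbar, n, tau=tau):8.3f}")
```

With these made-up numbers, a narrow prior gives a Bayes factor below 1 (favouring the alternative) and a wide prior gives one above 1 (favouring the null), even though the data never change.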

More generally, I do not believe that being able to visually observe a trend – based on the way that the data have been plotted – means that the observed trend should have priority over all other possible qualitative and quantitative trends in the data. Realistically, there are always likely to be several trends that can potentially be visually observed in the data, which may be shown or obscured by different ways of visualizing the data.

This is also very reasonable, though I would have phrased it differently. It does matter what we choose to consider to be a “theoretically meaningful property” of a data set, and different choices will lead to different conclusions.

Where I disagree is with the implication – perhaps intentional? – that quantitative measures of performance are superior in this respect. In my view they are not: Bayes factors, minimum description length, cross-validation and p-values can all give different answers to the same questions, different modelling assumptions can produce different conclusions, and so on. In other words, while I agree with much of what Evans is arguing here, I would like to suggest that this argument is something of a double-edged sword: it cuts just as sharply when applied to quantitative measures as it does to qualitative measures. Both approaches to model selection have an inherent ambiguity that can only be resolved if the researcher makes – and explicitly states – some assumptions about what they believe is important.
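As a small illustration of quantitative criteria disagreeing with one another, the sketch below compares two hypothetical models using AIC and BIC. I am swapping these in for the measures named above purely because they are easy to compute by hand. The log-likelihoods, parameter counts and sample size are invented for the example.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: lower is better."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: lower is better."""
    return k * math.log(n) - 2 * log_lik

# Invented fits: the complex model fits slightly better but uses more parameters.
n = 200
simple = {"log_lik": -300.0, "k": 2}
complex_ = {"log_lik": -296.0, "k": 4}

aic_prefers = "simple" if aic(**simple) < aic(**complex_) else "complex"
bic_prefers = "simple" if bic(**simple, n=n) < bic(**complex_, n=n) else "complex"

print("AIC prefers the", aic_prefers, "model")
print("BIC prefers the", bic_prefers, "model")
```

The two criteria pick different winners from the same fits, because they encode different assumptions about how much a parameter should cost. Neither answer is "the" quantitative answer until the researcher says which assumption they are willing to defend.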

Theoretical scope, ancillarity, and model complexity

Another area in which I find myself in (partial) agreement relates to the third section of Evans’ paper (pages 12-14) which asks an important question, namely “Where is the border between core and ancillary model assumptions?” This section presents a really nice discussion of some of the issues at hand. In most respects, I think Evans and I are very much in agreement here. For instance, we agree that the choice of ancillary assumptions can make a very substantial difference when assessing the performance of a model, and we agree that this is an important issue. We differ a little on how we believe this issue should be addressed, which I’ll discuss later, but I want to first talk about the respects in which our perspectives are not that different.

In DDBS I argued that sometimes it is better to “abstract away” from the raw data and only model those elements of the data that one believes to be relevant to the phenomenon of interest. Part of the reason is that modelling these additional characteristics of the data requires that one specify more ancillary assumptions, which leads to unnecessarily complicated models and complicates our model selection procedures by rendering the performance measures dependent on those assumptions. In response, Evans raises the following concern:

However, what exactly makes one assumption core, and another ancillary? The distinction may seem like common sense while speaking in abstract terms, but I believe that these different types of assumptions become much harder to distinguish between in practice. In practice, the lines between core and ancillary assumptions can often be blurred, and declaring certain assumptions in a model as being ancillary allows a researcher to still find evidence in favour of their preferred model – success that they attribute to the core assumptions – and dismiss evidence against their preferred model – failure that they attribute to incorrect ancillary assumptions.

This is an interesting and important topic, but I feel compelled to note that it does not appear to be a response to the argument in my paper. The core argument in DDBS is that one should avoid modelling some aspects to the data, in part to avoid introducing additional and unneeded complexity via such ancillary problems. One might even go so far as to say that Evans’ passage here is actually an argument against trying to model the data in too much detail!

Nevertheless, I do agree with the spirit of the concern. To take an extreme example, a malicious researcher could apply the approach I argued for in DDBS and “cheat”, simply by choosing to ignore inconvenient aspects to the data. However, this is another example of the argument that cuts both ways: malicious researchers can also “cheat” under quantitative approaches simply by lying about those aspects to the data. The issue at hand is quite orthogonal to the problem of fraud detection, so the real question has more to do with the practical question of which approach is likely to help honest scientists make sensible inferences. I’ll talk more about this later in the review, when I reach the “disagreements” section.

Before I move on though, I should mention that I think there is a fairly simple solution to the problem of determining what is core and what is ancillary: ask the researcher. Computational modellers are typically very concerned about such matters, and in my experience at least will be quite willing to state clearly what the core modelling claims are. For example, when specifying the contrast and ratio models of featural stimulus similarity, Tversky (1977) was extremely precise in stating which aspects to his model are the core assumptions and which are ancillary: he states it in the Appendices. Tversky stated what his core claims were so clearly that the one time I proposed a critique of his model, and the reviewers argued that my model was actually the model that Tversky “really” meant to propose, I was able to quote from the Appendices of the paper to prove that Tversky did not make this claim (see Navarro & Lee 2004). It seems to me that computational modellers of all varieties are – perhaps unlike scientists whose theories are purely verbal in nature? – very keen to be clear about the distinction between core claims and ancillary properties.

Somewhat relatedly, while I agree with the concern raised regarding model complexity …

Importantly, being allowed to adjust these ancillary assumptions can make a model infinitely flexible, even if any instantiation of the model with a specific set of ancillary assumptions is not infinitely flexible.

… I wonder about what lessons the reader should take from it. The example given is the famous paper by Jones and Dzhafarov (2014), who noted that if one does not constrain the shape of the between-trial drift rate distribution, standard choice RT models can be made arbitrarily flexible.

From afar this might look like a major concern, but as most people in the mathematical psychology community are aware, this observation did not suddenly render choice RT modelling useless. On the contrary it has simply led to more clarity from researchers that they do make some theoretical claims about the nature of trial-to-trial variability. Ratcliff (1978) originally assumed this distribution to be normal. Reasonable alternatives to the normal distribution exist, to be sure, but not every possibility is reasonable: a trimodal distribution would be rather strange, for instance. Because of the fact that “everybody knows” roughly the difference between a reasonable choice and an unreasonable one, researchers in the field still think that the diffusion model is a distinct entity from the LBA. The Jones and Dzhafarov (2014) paper has made it clearer that RT modellers do need to care about the shape of this distribution, but it has not in any sense of the term led modellers to think that the diffusion model is infinitely complex.

In a sense, I think Evans and I are in agreement here, but at the same time I am unclear on how it relates to DDBS, in which I argued that modellers need to make choices about what aspects to the data they believe are worth modelling from their theoretical perspective. In DDBS I suggested that we be explicit in doing so, but in practice this is something we all do, all the time.

In fact, choice RT models provide a good illustration of exactly this phenomenon! Consider, for instance, the fact that a choice RT task unfolds in real time, and there are real dependencies between one trial and the next. That this is empirically true is not in dispute, and literature on sequential effects in RT goes back to the 1920s. Typically, when a modeller does choose to model sequential dependencies (e.g., Gokaydin, Navarro, Ma-Wyatt and Perfors, 2016) we only consider dependencies that go back 5 trials, and those temporal dependencies do go back at least that far. Yet in their “canonical” forms, none of the major choice RT models bothers to model this characteristic of a standard 2-AFC task. The diffusion model in its “base” form assumes trials are independent of one another, as does the LBA, and so on. By convention, none of these models provide any account of these inter-trial dependencies, because they deem sequential effects to be outside the theoretical scope, and they do not wish to introduce unnecessary complexity. It is – of course! – entirely possible to incorporate sequential dependencies in a model of a choice RT task, as many authors have done previously, but in most cases it is understood to be “acceptable” to pretend that adjacent trials are independent of one another for the purposes of modelling, because sequential effects are not the characteristic of the data with which the researcher is concerned.
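To show what is being set aside when trials are treated as independent, here is a rough sketch with simulated data. The generator is entirely hypothetical: it uses Gaussian noise around a baseline RT, which is not a realistic RT distribution, and a single `carryover` parameter that passes part of each trial's deviation on to the next trial. It is not the model from any of the papers cited above.

```python
import random
import statistics

def lag1_autocorrelation(xs):
    """Sample correlation between each observation and the one before it."""
    m = statistics.fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

def simulate_rts(n_trials, carryover, rng):
    """Toy RT series (seconds) in which each trial inherits a fraction
    `carryover` of the previous trial's deviation from baseline."""
    baseline = 0.5
    deviation = 0.0
    rts = []
    for _ in range(n_trials):
        deviation = carryover * deviation + rng.gauss(0.0, 0.05)
        rts.append(baseline + deviation)
    return rts

rng = random.Random(1)
independent = simulate_rts(5000, carryover=0.0, rng=rng)  # what the "base" models assume
dependent = simulate_rts(5000, carryover=0.5, rng=rng)    # trials that leak into each other

print("lag-1 autocorrelation, independent trials:", round(lag1_autocorrelation(independent), 3))
print("lag-1 autocorrelation, dependent trials:  ", round(lag1_autocorrelation(dependent), 3))
```

A model that assumes independent trials predicts a lag-1 autocorrelation near zero, so it will misfit this feature of the second series by construction. Whether that misfit matters depends on whether sequential effects are inside the theoretical scope the modeller has declared.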

What I’m trying to say here is that in practice all modelling employs some variation of the strategy I outlined in DDBS. Researchers choose to delimit the scope of the modelling effort to some aspects of the data and ignore others, as demanded by the context. Later research might require that we reevaluate some of these choices – meaning that diligent documentation of these choices is critical – but it is very rare indeed for a modeller to find themselves in a situation where no such choices are required.

Part 2: Disagreements, hopefully of substance

Though I find much to agree with in Evans’ lovely paper, there are some points where I disagree with how the paper describes the scientific and statistical problems at hand.

Levels of (model) analysis, levels of (data) abstraction

One of the more striking things about the paper, from my perspective, is that there is a subtle rhetorical shift that takes place from the very beginning and it has a cascading effect through the paper. Throughout the paper Evans refers specifically to process models, whereas in my original DDBS paper I did not restrict my comments in this way. This is of some importance I think.

It has become standard in the cognitive modelling literature to use David Marr’s (1982) “levels of analysis” to (roughly) classify computational models in terms of the level of abstraction at which they operate: “computational level models” or “rational analyses” describe human cognition in terms of an “optimal solution” (or more realistically, “sensible solution”; see Tauber et al 2017) to an inferential problem posed by the environment. The Gaussian Process model proposed by Hayes et al (2019) that I used as my running example in DDBS is an example of a computational level model. By design it does not describe any plausible psychological mechanism by which human reasoners solve this problem.

In contrast, “process models” are typically understood to lie closer to David Marr’s “algorithmic level” of analysis. In models of this kind, human cognition and behaviour is characterised in terms of a set of information processing steps detailing how the cognitive system gives rise to the resulting behaviour. There is of course a very large literature that has debated the relative utility of computational level analysis and process models, and reasonable scientists can differ in terms of which perspective they prefer. Nevertheless, it is worth noting that where DDBS was written by an author (myself!) whose work has primarily involved computational level modelling, the current paper leans more heavily toward process models. For the most part, the Evans paper makes no mention of any level of analysis other than process level accounts.

To illustrate how critical this distinction will turn out to be, it is instructive to consider the list of citations used by Evans on page 3 to introduce the concept of a cognitive model:

Ratcliff, 1978; Brown & Heathcote, 2008; Usher & McClelland, 2001; Nosofsky & Palmeri, 1997; Shiffrin & Steyvers, 1997; Osth & Dennis, 2015

As it happens I have read all these papers bar the Osth & Dennis one (sorry Adam, I’ve been meaning to!) and they cover a very narrow range of the modelling literature. The only paper that could be considered “computational level” in this listing is the Shiffrin & Steyvers paper. Four of the papers present models of choice reaction time tasks (Ratcliff; Brown & Heathcote; Usher & McClelland; Nosofsky & Palmeri), and four are models of recognition memory/categorization (Ratcliff; Nosofsky & Palmeri; Shiffrin & Steyvers; and I believe the Osth & Dennis paper also). Though these are both interesting areas, cognitive modelling is a much broader area than these citations would suggest, and I think it is instructive to take a broader view of the field.

Theoretical commitments of a model?

Why does it matter that Evans’ paper focuses almost entirely on choice RT models, whereas DDBS focuses on problems in inductive reasoning?

Well, let’s consider the reaction time models first. The underlying experimental paradigms for studying choice reaction time have been fairly stable since the 1960s, and – as per the title of Duncan Luce’s classic (1986) book Response Times: Their Role in Inferring Elementary Mental Organization – the models and experiments are designed to investigate low-level properties of human behaviour and cognition. All the models are quite similar too. Though I suspect my former advisor Doug Vickers would strenuously disagree with me on this, there is very little to differentiate Ratcliff’s diffusion model from Brown & Heathcote’s linear ballistic accumulator, nor indeed Usher & McClelland’s leaky competing accumulator. All of these models describe a choice RT in terms of first-passage time to absorption, all account for the major phenomena under investigation, and… to be a little blunt about it in a fashion that will probably get me in trouble at the next Math Psych meeting, none of them are very useful in studying higher order cognition. 1
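For readers outside the RT literature, the phrase “first-passage time to absorption” can be sketched in a few lines. What follows is a bare-bones random walk approximation to a two-boundary diffusion process, with no between-trial variability and no non-decision time, so it should not be mistaken for the full Ratcliff model or for any of the other models named above. The function name and all parameter values are assumptions chosen for illustration.

```python
import math
import random

def diffusion_trial(drift, boundary, start, rng, dt=0.001, noise=1.0):
    """Simulate one trial: evidence starts at `start` and accumulates with
    mean rate `drift` plus Gaussian noise until it is absorbed at 0 or at
    `boundary`. Returns (choice, decision_time), where choice is 1 for the
    upper boundary and 0 for the lower one."""
    x, t = start, 0.0
    step_sd = noise * math.sqrt(dt)
    while 0.0 < x < boundary:
        x += drift * dt + rng.gauss(0.0, step_sd)
        t += dt
    return (1 if x >= boundary else 0), t

rng = random.Random(2020)
trials = [diffusion_trial(drift=2.0, boundary=1.0, start=0.5, rng=rng)
          for _ in range(500)]

p_upper = sum(choice for choice, _ in trials) / len(trials)
mean_rt = sum(t for _, t in trials) / len(trials)
print("proportion of upper-boundary responses:", round(p_upper, 3))
print("mean decision time (s):", round(mean_rt, 3))
```

The choice is whichever boundary absorbs the process, and the decision time is how long absorption took. The models discussed above differ in how evidence accumulates and in which sources of variability they allow, but they all share this basic structure.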

Moving beyond choice RT

What happens, then, when we try to apply formal models to more complicated phenomena? What are the desiderata for a model of “curiosity”? A model of “play”? A model of “politeness”? These are psychological questions in which there are no stable, agreed upon paradigms. There is no analog of a 2AFC task for “politeness”. Should we as cognitive modellers simply walk away and declare that our discipline is simply inapplicable to anything outside the small world encompassed by our standardised task?

I’d like to think we should not, but in order to do so a degree of abstraction is required. As an example, consider the rational speech act model proposed by Yoon, Tessler, Goodman & Frank (2016), which describes polite speech in terms of a trade-off between “kindness” and “informativeness”. It’s a really nice model and one that (in my opinion anyway!) makes a very nice contribution to the literature on pragmatic communication. Here is their figure reporting the model fit to Experiment 3:

Judged by the standards that we would use in the RT modelling literature this is a terrible fit. In the leftmost panel the slope of the red line for “honest” communication is entirely incorrect, the “meanness” curve is too flat, and the “niceness” curve is too high. Quantitatively speaking, this model does a very bad job of accounting for the data.

When modelling higher order cognition, such phenomena are grossly typical because – by their nature – the tasks have too many uncontrolled variables, use operationalisations that may not be well tuned to the underlying phenomenon, and so on. The list of reasons why the rational speech act model in Yoon et al (2016) is so “bad” at accounting for the data is very long.

Everyone in this computational cognitive science literature is well aware of this as an issue, however, and so we judge the Yoon et al (2016) model to be a “provisionally successful” model because it captures the rough trends in the data. Just as it would be an absurdity to judge a choice RT model by its qualitative performance (e.g., nobody would be impressed by my model simply because it can produce positively skewed RT distributions!) it is equally absurd to try to judge the RSA model of polite speech by the (in)ability to reproduce every little feature of the data. In this domain, it is an intellectual triumph to be “merely in the right ballpark” when developing models.

It is in exactly this context that I wrote this in DDBS:

To my way of thinking, understanding how the qualitative patterns in the empirical data emerge naturally from a computational model of a psychological process is often more scientifically useful than presenting a quantified measure of its performance, but it is the latter that we focus on in the “model selection” literature. Given how little psychologists understand about the varied ways in which human cognition works, and given the artificiality of most experimental studies, I often wonder what purpose is served by quantifying a model’s ability to make precise predictions about every detail in the data

Judged by most of the formal model selection tools I used to advocate in my misspent youth (e.g., Lee & Navarro 2005, Myung, Navarro & Pitt 2006), the Yoon et al. (2016) account is a “bad” model. It provides a poor fit to the data and is quantitatively “incorrect” in almost every respect. Scientifically however, it is an extremely useful model and much more helpful to me as a researcher than it would have been if the authors had presented a theoretically vacuous “curve fitting” model that reproduces the quantitative patterns in the data, even if it did so using fewer parameters.

Can (and should) we model everything?

One of the most interesting passages in Evans’ paper, at least on my reading, is this one. After presenting his (very reasonable) defence of the usefulness of quantitative model selection, the paper makes the following claim (p 11-12)

Essentially, quantitative model selection methods are able to take into account everything that visually assessing a finite number of qualitative trends can, and more. The only reason that assessing qualitative trends can give different results to quantitative model selection is that assessing only a subset of the data – as is the case when assessing qualitative trends – ignores all other aspects of the data. If researchers are only interested in explaining the single qualitative trend in the data, then I can see why only assessing the single qualitative trend makes sense. However, in cases where researchers want to explain the entire psychological process – which I think is most situations – then only assessing these visually observed qualitative trends is limiting practice, rather than a theoretically interesting one.

I love this quote, because I think it perfectly captures the issue at hand. For this quote to make any sense, one has to accept the implicit premise here that our models can explain the entire psychological process. As regards higher order cognition at least, I don’t believe that they can.

Certainly, as a researcher who studies human inductive reasoning I would love a “model of induction” that explains the entire psychological process of human reasoning about underconstrained problems. How could I not? But in practice I am but one woman and my abilities as a scientist are limited. I cannot fathom what class of models might be applicable to the entire psychological process at hand, nor can I think of how to circumscribe the scientific problem in a way that allows me to render the “real world” phenomenon of human reason subject to any kind of direct measurement.

The underconstrained nature of my scientific problem has consequences. My experiment is a measurement tool that operationalises some aspects to human reasoning, but always and inevitably confounds it with the measurement of some other phenomena. I would argue that this is true of almost every psychological experiment, and as such every data set is an unknown convolution of “the thing you are trying to study” and “numerous unknown things that you are not”. So then – as a modeller – I am trapped in a bind. If I try to model every aspect to the data (as required by quantitative measures of performance, if strictly construed) I must provide some account of these “numerous unknown things” as well as the thing I am trying to study. But if my experiment is too complex then these unknown things will themselves become quite complex (e.g., modelling “demand” effects is hard). Worse yet – because my experiment was not really designed to measure those things – my modelling efforts will be severely underconstrained in this respect, and these ancillary aspects to my model will be nothing more than wild speculations.

So what is the solution?

One answer, as the history of mathematical psychology shows, is to make the task simpler. Make the task so small and so simple that we actually can write down models that specify precise assumptions about every aspect to the task. Thus we end up with the science of “choice” that we see in the sequential sampling models literature: we had good models for the 2AFC task in the late 1970s, but it was not until 40 years later that we had good models for the “colour wheel task”. When one pauses to consider the difference between the tasks encompassed by “a theory of choice RT” and “those things in the world where a human has to make a decision under time pressure” … well, the discrepancy is rather large.

It is inconvenient, perhaps, but it remains true that our models are defined only on toy worlds; our participants, however, must occupy the real one.

How did it happen that our models became so narrow in scope? A lot of possible reasons might be proffered, but there’s one unpleasant possibility I want to suggest in particular. It might very well be the case that it is precisely because we have such stringent modelling criteria that the experimental paradigms in mathematical psychology have become ossified and highly restricted, adapted to suit only those phenomena that we know how to model in full. This can be dangerous, insofar as it provides an illusion of explanatory power, one that falls apart once we step outside the narrow confines of our paradigm.

This isn’t an original point I’m making, of course, and it’s one that has been made in the philosophy of science literature. Ian Hacking (1992) argues that over time laboratory sciences create a self-vindicating system by building theories and methods that are “mutually adjusted to each other” and cannot be falsified. He goes on to note that:

The theories of the laboratory sciences are not directly compared to ’the world’; they persist because they are true to phenomena produced or even created by apparatus in the laboratory and are measured by instruments we have engineered

This is precisely what we see in our own discipline. Mathematical psychology has a cultural history of requiring extremely precise modelling, and as a consequence we have developed an experimental tradition of reducing complex problems to simple tasks that are amenable to such precise modelling.

By way of comparison, in the adjacent discipline of computational cognitive science, models are typically assessed in a looser fashion – and the experimental tasks that are typically presented in this literature are much richer than those we employ in mathematical psychology. For example, the paper I referred to in DDBS by Lake et al (2015) presents a theory of human concept learning instantiated as a “Bayesian program induction” process. Essentially, the claim is that the learner attempts to reverse engineer the process by which observed objects were created, “inferring” a probabilistic program that can be used to generate new objects. It’s a lovely paper that then goes on to highlight how the model can pass a (quite limited) version of the Turing test for visual categories. The experimental paradigm is novel, the model is elaborate, and the paper – speaking only for myself – deeply informative. But it would be nigh impossible to evaluate this model (or any of the deep learning models to which it was compared) via Bayes factors. Even in principle, it would be very hard to specify model priors because of all the complexity of the models. Modelling the raw data in very precise detail, on a trial to trial basis, with model parameters estimated (or marginalised) is simply not realistic.

Can we determine which of these two scientific approaches is better? Is one of these disciplines superior to the other? As someone who has worked in both at different stages of her career, I’d argue that this is very difficult to discern. Mathematical psychology “knows” very little, but those things we do know we know very well; over the decades we have slowly and carefully developed a detailed understanding of a narrowly circumscribed class of problems. Computational cognitive science, with its looser modelling standards and broader theoretical scope, knows “something” about a much broader class of problems, but that knowledge is less detailed and almost certainly has more incorrect claims being made.

It is entirely beyond the scope of my review to try to answer a question of that nature, but I should like to suggest that the question does matter. Per the quote above, yes,

quantitative model selection methods are able to take into account everything that visually assessing a finite number of qualitative trends can, and more

but to rely overmuch on these tools comes at a price, and that price may be to delimit the scope of mathematical modelling in psychology to a very small thing indeed.

The tension between principled and practical

Having said this, I can also understand why researchers may be reluctant to commit to all assumptions of a model as being core to the explanation. As Navarro (2018) points out in the Hayes et al. (2018) example, there are often difficult decisions that need to be made to create a formalized model, and some of these choices can end up being somewhat arbitrary. However, making each model a complete explanation with only core assumptions does not mean that assumptions that would normally be considered ancillary cannot be tested. Specifically, multiple models can be proposed as explanations of the unknown psychological process, with each model containing a different instantiation of these ancillary assumptions, and these models being compared using quantitative model selection methods. However, each model with different assumptions is now a different, separate explanation, and researchers cannot switch between these different models for different paradigms while still claiming that this represents a success of a single explanation. I believe this presents a principled way to address the issue of ancillary assumptions in models

I’ll end my review – or commentary, I’ve quite lost track of what I am writing here – with some thoughts about this passage in Evans’ paper, because I think it’s an important one. I entirely agree with the author.

Or at least I used to.

The modelling approach described here is indeed a principled method, as the final section of Evans’ paper goes on to attest. I am certainly not opposed to principled modelling! In fact, the person who wrote the quote below would certainly have agreed wholeheartedly (Navarro, Pitt & Myung 2004):

With over a century’s worth of data collected, there is little doubt that [memory] retention follows a smooth, convex, monotonically decreasing function, and its long-run behavior should be slower-than-exponential (Jost’s law; see Alin 1997). Beyond this, it is difficult to discriminate between models that satisfy these constraints if a model’s fit to data is used as the sole criterion on which to choose a model. Data-fitting all by itself, as we illustrated at the beginning of the paper, is simply not a good tool for discriminating closely competing models. It is inappropriate given the demands of the job because to advance the science, we need to know more about the models and data than can be learned from fit alone

In that paper, we were arguing for a more detailed form of quantitative analysis than cognitive modellers typically employ, and my claims were not all that dissimilar to what Evans argues in the current paper. Yet here I am 16 years later recanting these thoughts and arguing for almost precisely the opposite. Why? What changed?

The answer to this is mostly that when I wrote it, I was working as a mathematical psychologist and the problems I was studying were tightly constrained ones. The experiments and the models discussed in the Navarro et al (2004) paper are simple enough that I was able to be precise, and the tools of mathematical psychology were appropriate to my problems. As I started to consider the broader class of problems encompassed by computational cognitive science, I started encountering difficulties.

To illustrate, I’ll compare two of my own papers, both published in 2012 around the time I was first starting to have doubts about the usefulness of the model selection literature. In the first (Navarro, Dry and Lee 2012) I attempted to study a very simple problem of reasoning inductively about sequentially observed entities that vary only on a single stimulus dimension. This problem is tightly constrained, and I was able to specify a complete formal model of the underlying reasoning process and (in the Appendix) ancillary modelling assumptions about the response generating process. This meant I could report figures like this one:

There’s a lot I could criticise about this paper, but ultimately I think it’s a useful “math psych style” approach to the problem. It is, above all else, a principled answer to a toy problem.

Now consider the second paper (Shafto, Eaves, Navarro and Perfors 2012), which considers a far more realistic problem. How do children reason inductively when presented with information by possibly-ignorant, possibly-lying agents? The stimuli in the tasks are much less controlled, the experimental data much less detailed, but the modelling framework required to account for “epistemic trust” in children is vastly more complicated than the toy model in the Navarro et al (2012) paper. The only data fits we could plausibly report in that paper looked like this:

By any quantitative statistical assessment, my modelling approach in the first instance was more rigorous, more principled, more reliable and more likely to be “true”. It simply has the minor shortcoming of not telling us very much about human cognition. In later papers I’ve tried to build on the Navarro et al (2012) paper to the point where I now think that the body of work as a whole does make a useful contribution, but on its own this really isn’t a very impressive paper.

The Shafto et al (2012) paper, on the other hand, actually does make a contribution of its own (particularly in connection to the other work Pat Shafto and colleagues have done with this modelling approach). The reason it does so is that the model evaluation method is – above all else – practical. Instead of restricting the experiments to those we can model to a high degree of confidence, the approach in the Shafto et al (2012) paper was to make contact with the messy, complicated problems that developmental psychologists have to think about when studying how children actually learn new concepts.

All of the above being said, I still don’t think that one paper is necessarily better or worse than the other, but they do make different trade-offs in how they construct and evaluate models of cognition. When modelling something as complicated as the human mind, there are always trade-offs to be made: what one gains in principled analysis, one often loses in practical utility.


Nathan’s paper is fabulous. I love it, it criticises my own paper in a really interesting way and I agree with about half of it. I thoroughly recommend that it be published in its current form!


  • Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive psychology, 57 (3), 153–178.
  • Gokaydin, D., Navarro, D. J., Ma-Wyatt, A. and Perfors, A. (2016). The structure of sequential effects. Journal of Experimental Psychology: General, 145, 110-123
  • Hacking, I. (1992). The self-vindication of the laboratory sciences. In A. Pickering (Ed.), Science as practice and culture (pp. 29–64). University of Chicago Press
  • Hayes, B. K., Banner, S., Forrester, S. and Navarro, D. J. (2019). Selective sampling and inductive inference: Drawing inferences based on observed and missing evidence. Cognitive Psychology, 113.
  • Jones, M., & Dzhafarov, E. N. (2014). Unfalsifiability and mutual translatability of major modeling schemes for choice reaction time. Psychological review, 121 (1), 1.
  • Lake, B. M., Salakhutdinov, R. & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350 (6266), 1332-1338
  • Lee, M. D. and Navarro, D. J. (2005). Minimum description length and psychological clustering models. In P. Grunwald, I. J. Myung and M. A. Pitt (Eds) Advances in Minimum Description Length: Theory and Applications (pp. 355-384).
  • Marr, D. (1982), Vision: A Computational Approach, San Francisco, Freeman & Co.
  • Myung, J. I., Navarro, D. J. and Pitt, M. A. (2006). Model selection by Normalized Maximum Likelihood. Journal of Mathematical Psychology, 50, 167-179.
  • Navarro, D. J., Dry, M. J. and Lee, M. D. (2012). Sampling assumptions in inductive generalization. Cognitive Science, 36, 187-223
  • Navarro, D. J. and Lee, M. D. (2004). Common and distinctive features in stimulus representation: A modified version of the contrast model. Psychonomic Bulletin and Review, 11, 961-974
  • Navarro, D. J., Pitt, M. A. and Myung, I. J. (2004). Assessing the distinguishability of models and the informativeness of data. Cognitive Psychology, 49, 47-84
  • Navarro, D. J. (2018). Between the devil and the deep blue sea: Tensions between scientific judgement and statistical model selection. Computational Brain & Behavior, 1–7.
  • Nosofsky, R. M., & Palmeri, T. J. (1997). An exemplar-based random walk model of speeded classification. Psychological review, 104 (2), 266.
  • Osth, A. F., & Dennis, S. (2015). Sources of interference in item and associative recognition memory. Psychological review, 122 (2), 260.
  • Ratcliff, R. (1978). A theory of memory retrieval. Psychological review, 85 (2), 59.
  • Shafto, P., Eaves, B., Navarro, D. J. and Perfors, A. (2012). Epistemic trust: Modeling children’s reasoning about others’ knowledge and intent. Developmental Science, 15, 436-447
  • Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM-retrieving effectively from memory. Psychonomic bulletin & review, 4 (2), 145–166.
  • Tauber, S., Navarro, D. J., Perfors, A. and Steyvers, M. (2017). Bayesian models of cognition revisited: Setting optimality aside and letting data drive psychological theory. Psychological Review, 124, 410-441.
  • Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352
  • Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: the leaky, competing accumulator model. Psychological review, 108 (3), 550
  • Yoon, E. J., Tessler, M. H., Goodman, N. D., & Frank, M. C. (2016). Talking with tact: Polite language as a balance between kindness and informativity. In Proceedings of the 38th annual conference of the cognitive science society (pp. 2771-2776). Cognitive Science Society.

  1. At best, such models can be used as a “front end” to some other model. The Nosofsky & Palmeri (1997) model is a good example of this: it adopts a first-passage time account of the reaction time in speeded classification tasks, but ultimately Rob Nosofsky’s primary theoretical commitment is not to the “random walk” part of the EBRW, it is to the “exemplar based” part. In fact, this is also a nice illustration of the fact that in practice modellers are generally pretty clear about the role of different assumptions: as a model of categorization, EBRW clearly distinguishes the theoretical core that Nosofsky has developed over many years (exemplar representation of categories) from an ancillary component (sequential sampling models of choice RT) that was adapted from another literature.