Nevertheless, she desisted: A brief review of Steensma et al (2013)

Note: This essay has moved around quite a bit over the years. I keep posting it because Jesse Singal says something I consider to be very foolish, and then I delete it again because I’m exhausted and I hate being caught in this never-ending culture war. I cannot promise I will leave it up, because I really, really hate this.

This paper by Steensma et al (2013) seems to have been the source of some rather intense public discussion.

Steensma, T. D., McGuire, J. K., Kreukels, B. P., Beekman, A. J., & Cohen-Kettenis, P. T. (2013). Factors associated with desistence and persistence of childhood gender dysphoria: a quantitative follow-up study. Journal of the American Academy of Child & Adolescent Psychiatry, 52(6), 582–590.

There’s been quite a lot written on this paper, and I was quite curious to know what it was about. From what I could work out from the media coverage, the paper investigated the incidence of “desistence” among children and adolescents with gender dysphoria, and was (and still is??) the largest such study to date. It featured prominently in this article by Jesse Singal published in mid-2016, though just the other day he published this piece saying that the paper was confusing and that he’d misunderstood what it was saying. The paper is defended by James Cantor here, who argues that it provides strong evidence that most children with gender dysphoria turn out not to be transgender. It’s also been critiqued fairly extensively. Julia Serano discusses it in this article, in which she points out a number of reasons to be careful about interpreting the results. Brynn Tannehill discusses it here. It seems to pop up all over the place.

What is the paper about?

Like many scientists, I’ve learned not to trust journalists when they attempt to summarise academic research. It’s not that I think they’re malicious, it’s that I think they’re not well versed in scientific research and statistical methods, and as a consequence the work is inevitably distorted by the time it gets reported in the media. Based on what I’d read, I was expecting the paper to ask how often it is that kids with gender dysphoria turn out not to be transgender. It doesn’t. In retrospect I shouldn’t have been surprised. After all, it’s in the title:

Factors Associated With Desistence and Persistence of Childhood Gender Dysphoria

Scientific articles aren’t usually supposed to be mystery novels. If the authors had been intending to write a paper measuring the rate of desistence, they would have called it something like this:

Estimates of the Rate of Desistence and Persistence of Childhood Gender Dysphoria

The distinction matters. The paper reports a study measuring the correlates of desistence, not the incidence of desistence. If you try to read it as a study measuring incidence, it does indeed come across as rather incompetent. There are many, many reasons why I would not want to rely on Steensma et al (2013) as an estimate of the frequency with which gender dysphoric children turn out not to be transgender (and for the most part, my thoughts on that point are pretty similar to those made by Julia Serano and Brynn Tannehill). However, the paper doesn’t appear to make any claim about this frequency. On my reading, the authors actually appear to have gone out of their way not to make any such claims, and focus almost entirely on the correlates of persistence. To some extent, I wonder whether people are reading more into the study than the paper itself claims.

What does the paper conclude?

Steensma et al (2013) explicitly state their conclusions in the abstract:

“Intensity of early GD (gender dysphoria) appears to be an important predictor of persistence of GD. Clinical recommendations for the support of children with GD may need to be developed independently for natal boys and for girls, as the presentation of boys and girls with GD is different, and different factors are predictive for the persistence of GD”

There are two claims here. First, the stronger the gender dysphoria symptoms at the time the child is first referred to the clinic, the more likely they are to continue with gender transition and return for further treatment. That’s not entirely surprising, but it’s useful to know.

Second, the claim is that there are sex differences, and that the factors that predict how gender dysphoria unfolds are different for natal boys and natal girls. That’s a little more curious, but it wouldn’t be all that remarkable if true: it would basically amount to a suggestion that we shouldn’t generalise too strongly from trans girls to trans boys and vice versa. That being said, after reading the paper I have some quibbles about their analysis and I am not convinced that this claim is supported by their data.

Again… you’ll notice that the conclusions do not include any claims about the “true rate” with which gender dysphoria resolves spontaneously without transition. I don’t think that’s an accident. My guess is that the authors know perfectly well that their data don’t have a lot to say on that topic, and have been appropriately circumspect.

Curiously, there’s something that does pop up in their discussion that is absent from the abstract — after doing quite a lot of analyses, the single best predictor of whether a child persists with gender transition is simply the child’s beliefs about themselves. Gender identity, it turns out, is the best predictor of gender transition. Weird, huh?

Anyway, let’s take a look at the paper in some detail.

The sample

Any time you do applied research — where you don’t have the luxury of nice experimental control — you are going to run into issues with how people get selected into a sample. Real life is messy. Here’s what Steensma et al did to obtain their sample:

Between 2000 and 2008, 225 children (144 boys, 81 girls) were consecutively referred to the clinic. From this sample, 127 adolescents were selected who were 15 years of age or older during the 4-year period of follow-up between 2008 and 2012. Of these adolescents, 47 adolescents (37%, 23 boys, 24 girls) were identified as persisters. They reapplied to the clinic in adolescence, requested medical treatment, were diagnosed again with GID, and considered eligible for treatment

To be eligible for inclusion, a child had to be referred to the clinic between 2000 and 2008, and old enough at the time of follow-up between 2008 and 2012. That seems like a reasonable practical decision, and while I can certainly think of a few ways this would produce weirdness in the data (e.g., there will be a spurious correlation between age and cohort induced by this selection into the follow-up group) from the perspective of the stated goals of the study (i.e., finding predictors of which children will persist with transition), it doesn’t seem especially problematic. I’m loath to criticise researchers for doing the best they can given the constraints on applied research.

Measuring persistence

The second thing they mention in this passage is the operational definition of persistence they use. Specifically, they note that 47 of the 127 children are classified as persisters because they reapplied to the clinic, were again diagnosed with gender dysphoria, etc. At this point, it’s hugely important to remember the stated goal of the study:

If the goal of the study was to estimate the rate with which gender dysphoria resolves on its own, this is a very dangerous error that will lead to a massive overestimate. Very few people persist with transition if they aren’t actually transgender (there are some: detransitioners do exist), so the false positive rate is very low. However, people might “desist” in the sense of not returning to the clinic for many reasons. It might be that the dysphoria has resolved harmlessly, but it might also be that living as a transgender person in a hostile world is a pretty scary prospect and they decided to live with the gender dysphoria. It could be something else too. In other words, the false negative rate here is likely to be higher. Using “does not return to the clinic” as a proxy for “not really transgender” is not a good idea.
If the goal of the study is to determine the factors that predict whether a child will return to the clinic as per the authors’ description of their study, then this is a perfectly sensible measure of persistence, because that’s the actual thing you’re trying to predict. In other words, the measure of persistence in Steensma et al (2013) is perfectly appropriate to address the stated goals of the study. It’s just poorly suited to the goals that other people have projected onto the research.

That being said… the authors do not do themselves any favours in this part of the paper. Immediately after defining the measure of persistence, they write the following:

As the Amsterdam clinic is the only gender identity service in the Netherlands where psychological and medical treatment is offered to adolescents with GD, we assumed that for the 80 adolescents (56 boys and 24 girls), who did not return to the clinic, that their GD had desisted, and that they no longer had a desire for gender reassignment (emphasis added)

Hmm… I’m not sure about this. As other people have argued, that’s not a reasonable assumption. If we were talking about getting antibiotics to treat an infection, and the Amsterdam clinic were the only place you could get it, this might be reasonable. However, we’re talking about gender dysphoria, and the context is rather different. The world is rather hostile to transgender people, and there are serious downsides to transitioning. At most, all you can say is that it seems likely that those 80 adolescents decided not to transition at this time (i.e., desisted); you cannot reasonably infer that they no longer had a desire to do so. If some of these kids transition 20 years later, then yes they are certainly “desisters” in the sense used by Steensma et al, but it does not seem reasonable to assume that they no longer had any issues with gender dysphoria. As with a lot of my comments about the paper, this isn’t exactly a criticism of what the authors did so much as a caution against overinterpreting the results, and I do feel this particular passage was not a good word choice.

Measuring childhood dysphoria

The paper reports several measures of the level of gender dysphoria for each child when they were first referred to the clinic, the extent to which the child had begun social transition, and a variety of demographic measures. Because these measures were all taken at the time the children were first referred to the clinic, data are available for all 127 children. I’ve only just started reading this literature, so I don’t have a lot of comments on the measures themselves, but on superficial inspection they seem perfectly ordinary psychometric instruments and clinical assessment measures. I’m sure that there are some problematic things in there somewhere, but behavioural research is always tricky to do well and I’m always sympathetic to researchers trying to use the tools they have.

Measuring adolescent dysphoria

Because this is a longitudinal study, Steensma et al (2013) attempted to follow up with all 127 participants in adolescence, even those who had not returned to the clinic. Here’s the relevant passage:

a set of questionnaires, assessing information on current GD, body image, and sexual orientation was mailed. All 47 persisters participated in the study. Of the 80 desisters, 46 adolescents sent back the questionnaires

Not surprisingly, all 47 people who were still attending the clinic and pursuing transition filled out the questionnaire. Of the 80 people who had not returned to the clinic, 46 returned the questionnaire, and those that did return the questionnaires reported few symptoms of dysphoria. Again, not surprising — the “desisters” were the kids who arrived at the clinic showing comparatively low levels of dysphoria and (as I’ll comment on later) hadn’t actually started social transition in any sense. As Table 2 illustrates, they had low levels of dysphoria in childhood, and lo and behold they “desisted” and when they returned their questionnaires in adolescence (see Table 4) they still turned out to have low levels of gender dysphoria. Remarkable!

Anyway, it should be noted that care is required with respect to the adolescent data. A response rate of 57.5% isn’t all that bad, but it’s hard to know how to interpret the follow-up data. It beggars belief to presume that the data are missing at random, so if you wanted to make strong claims about “desisters” in general, then you cannot ignore the non-responders or make simplistic assumptions about them. My recollection is that people with depression are far less likely to respond to surveys, so if there were adolescents with high levels of dysphoria among the desisters, they’re very likely to also be nonresponders (i.e., if you’re too miserable to bother following up with the clinic, you’re probably also too miserable to bother filling out the survey). As it happens, there is a statistical literature on non-ignorable non-response models, but you probably wouldn’t be all that surprised to learn that this is a hard problem and the authors didn’t go down that path!
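To make the worry about non-random non-response concrete, here is a minimal sketch of the arithmetic. Everything in it is hypothetical except the 80 non-returning adolescents and the 57.5% response rate: the share of desisters with high dysphoria and the two response probabilities are made-up values, chosen only so that the overall response rate comes out at 57.5%. It says nothing about what the real numbers are.

```python
# Hypothetical illustration only. Apart from the 80 non-returning adolescents,
# none of these numbers come from the paper.
n_desisters = 80
true_share_high = 0.25   # ASSUMED share of desisters with high dysphoria
respond_if_high = 0.20   # ASSUMED response probability, high dysphoria
respond_if_low = 0.70    # ASSUMED response probability, low dysphoria

# Fraction of all desisters who respond, split by dysphoria level.
responders_high = true_share_high * respond_if_high
responders_low = (1 - true_share_high) * respond_if_low
response_rate = responders_high + responders_low

# Share with high dysphoria among those who actually sent the survey back.
observed_share_high = responders_high / response_rate

print(f"overall response rate: {response_rate:.1%}")
print(f"expected responders: {n_desisters * response_rate:.0f} of {n_desisters}")
print(f"true share with high dysphoria: {true_share_high:.1%}")
print(f"share among responders: {observed_share_high:.1%}")
```

Under these invented assumptions the responders look much less dysphoric than the full group of desisters, even though the response rate matches the one reported. A different set of assumptions would give a different answer, which is exactly the problem: the returned questionnaires alone cannot tell you which scenario you are in.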

In any case, again I feel the obligation to defend Steensma et al. Remember, they aren’t writing a paper about the “true rate of desistence” and the paper doesn’t make strong claims based on the adolescent data. They’re writing a paper on which variables predict whether a child referred to a gender identity clinic will return for further treatment. For the question as stated, this issue is largely irrelevant. It’s only because people seem to persist (if you’ll pardon the pun) in treating this as a paper about something else entirely that this issue of the nonresponders keeps coming up.

Is Jesse Singal serious? This is not confusing

As an aside, I have to confess to a certain amount of personal irritation at this point. In his recent post, Jesse Singal - who seems to have positioned himself as “the” journalist reporting on transgender topics - commented on the fact that he found this whole thing confusing because the authors analyse the childhood data from desisters, which - duh! - includes data from people who were nonresponders during the follow-up at adolescence, but nevertheless do include some analysis of the data from the responding desisters at adolescence.

I mean, obviously? It’s a two stage design with serious concerns regarding heterogeneous attrition from time point 1 to time point 2, so there are limits to what you can say about the data at the second time point, but that doesn’t stop you from analysing the data from time point 1. Again, duh.

Honestly, I’m aghast.

The paper is entirely unambiguous about the study design, and it utterly escapes me how a science journalist with a focus on transgender issues failed to understand such a simple aspect of the study. It honestly makes me wonder whether he even read the paper when writing his initial article, because if you have actual training in reading papers in the psychological literature (which I do) this is not difficult.

Might I gently suggest that if you’re finding it so difficult to understand a very straightforward paper it might be appropriate for you to maybe take a step back, acknowledge that you aren’t competent to hold opinions, and refrain from commenting on a literature that is beyond your expertise? I’m sorry that this is harsh, but as an actual social scientist who has to live with her gender dysphoria on a day-to-day basis I’m more than a little annoyed at people who decide to insert themselves into a discussion for which they are grossly unqualified. If you are neither transgender nor a social science researcher, you don’t belong in this conversation.

Besides, in a moment I’m going to quibble with their data analysis. I have comments regarding the proper use of interaction terms to test group differences in a multivariate logistic regression model, and on the off-chance that anyone ever reads my notes on this paper I’d rather not have to waste my time responding to amateurs meddling in affairs beyond their ken.

Who pursues transition?

As I’ve mentioned numerous times already, this paper does not focus on the rate of desistence. However, it does report descriptive statistics (in Table 1) for all 127 participants based on the childhood data, broken down by persistence group.

A quick look at the table makes it very clear where the analysis is going - those kids who arrived at the clinic partly or completely transitioned usually persisted, whereas those who hadn’t started social transition at the time of referral tended not to “persist” with transition later. Not entirely surprising, I suppose. Anyway, the table organises the data in a way that obscures the persistence rate - again, not unreasonably, because that’s explicitly not the point of the paper! - but it’s not too hard to convert the percentages to frequencies and then work out the numbers. A quick calculation suggests there must have been 12 natal boys who came in partially or fully socially transitioned to living as a girl, and 10 of those persisted (a persistence rate of 83%, or if you really must frame it this way, a desistence rate of 17%). There were 25 natal girls who came in at least partly transitioned to living as a boy, and of those 14 persisted (56% persistence). In contrast, among those children who had not taken any steps towards transitioning when they first arrived, most did not later transition! (Shocking stuff, right?) For the natal boys, there were (.565 x 23) + (.964 x 56) = 67 who came in with no transitioning, and among those only 13 persisted (19% persistence). For natal girls, 23 came in with no transition steps taken, and 10 of those persisted (43% persistence).
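For anyone who wants to check the back-of-envelope arithmetic, here is a short sketch. The counts are the ones quoted in the paragraph above (my own reconstruction from the Table 1 percentages, not figures printed in the paper), and the group labels are just names I made up for the printout.

```python
# (persisters, total) per group, using the counts quoted in the text above.
groups = {
    "natal boys, partly/fully transitioned": (10, 12),
    "natal girls, partly/fully transitioned": (14, 25),
    "natal boys, no transition": (13, 67),
    "natal girls, no transition": (10, 23),
}

# Rebuild the 67 from the Table 1 percentages: the share with no social
# transition among the 23 persisting and the 56 desisting natal boys.
boys_no_transition = round(0.565 * 23) + round(0.964 * 56)

# Persistence rate = persisters / everyone in that group.
rates = {name: persisted / total for name, (persisted, total) in groups.items()}

print(f"natal boys with no transition at intake: {boys_no_transition}")
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} persisted, {1 - rate:.1%} desisted")
```

The group totals also add up the way they should: 12 + 67 gives the 79 natal boys and 25 + 23 gives the 48 natal girls in the follow-up sample.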

As a gender diverse person myself, none of this seems remarkable to me. There’s a big difference between wanting to transition (because you are trans) and being uncomfortable with one’s assigned gender (because gender roles are pretty stupid). Kids who are referred to the clinic after they start socially transitioning are much more likely to be transgender, so they “persist”. Kids who are referred to the clinic because they aren’t trans but have difficulties with gender roles aren’t likely to have started transitioning, nor are they likely to come back to the clinic to pursue transition. Similarly, those people will likely have fewer symptoms of dysphoria, and… well, you get the idea, right? Pretty sure this is exactly what you’d expect to see?

Interpreting the results

Okay, next up we’ll take a look at Table 2. Because the outcome variable (persistence) is binary valued (i.e., they return to the clinic or they don’t), the analyses are based on logistic regression. Table 2 examines each of the childhood measures (i.e., at time of referral) for all 127 children separately. Natal boys were more likely to desist, with an odds ratio of .41. There’s actually a useful point to make about this calculation, which you can reproduce from Table 1.

natal boys persisting = 23
natal boys desisting = 56

So the odds of persisting for natal boys is 23 / 56 = .41. For the natal girls the numbers are

natal girls persisting = 24
natal girls desisting = 24

and the odds here are 24 / 24 = 1.0. Putting these two together gives you the odds ratio of .41 / 1.0 = .41. This turns out to be a statistically significant effect.
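Here is the same calculation as a small sketch, with one addition of my own: a Wald confidence interval on the log odds ratio, which is a standard textbook approximation for a 2x2 table. I am not claiming this is how the authors computed significance (they used logistic regression), only that it gives a quick sanity check from the four counts.

```python
import math

def odds(persist, desist):
    """Odds of persisting: persisters divided by desisters."""
    return persist / desist

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio (a/b) / (c/d) with an approximate 95% Wald interval."""
    ratio = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log odds ratio
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper

boys_odds = odds(23, 56)    # natal boys: 23 persisting, 56 desisting
girls_odds = odds(24, 24)   # natal girls: 24 persisting, 24 desisting
ratio, lower, upper = odds_ratio_wald(23, 56, 24, 24)

print(f"odds, natal boys:  {boys_odds:.2f}")
print(f"odds, natal girls: {girls_odds:.2f}")
print(f"odds ratio: {ratio:.2f} (approx. 95% CI {lower:.2f} to {upper:.2f})")
```

The interval sits entirely below 1, which is consistent with the effect being reported as statistically significant.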

It’s hard to know how to interpret it though.

Suppose it really were the case that all persisters are genuinely transgender, and all desisters aren’t (doubtful, but bear with me). That would imply that there were “really” 23 trans girls (natal boys) and 24 trans boys in the initial sample. If the intake at the clinic does not impose any particular gender bias, this suggests that “actually being transgender” has roughly the same incidence rate in males and females. Yet there were many more natal boys being referred to the clinic in the first place: there were 56 desisters among the natal boys and only 24 among natal girls. Under our very dubious assumption that none of the desisters are transgender, that would suggest that there were too many boys being referred to the clinic.

Why does that happen?

I can think of some possibilities. For instance, that could happen if the social norms for gender performance by boys are narrower than for girls, which would lead to more situations in which a gender non-conforming boy is incorrectly identified as “possibly transgender” (and hence referred to a clinic) than a gender non-conforming girl.

Honestly, I don’t really believe that’s the whole story, but it’s a nice example of how the data don’t tell a nice clean political story. A brief perusal of various websites devoted to being nasty to transgender people (especially the “concerned mums” ones) is remarkable for the fact that they tend to be more worried about natal girls being falsely referred to clinics for treatment than natal boys. I strongly suspect none of these people have actually read the paper properly, because if you were to treat Steensma et al’s (2013) desistence measure as a good proxy for “not actually trans” (which I don’t recommend!) they actually tell the opposite story.

Taking the data at face value, the naive interpretation would imply that it’s gender nonconformity in natal boys that is being overly medicalised, and this is much less of an issue for natal girls. So presumably our main social justice focus should be to protect feminine boys, right, and not worry quite so much about masculine girls who don’t seem to be at risk of being medicalised in this fashion? Right?? (ツ)

I can’t imagine that conclusion going down terribly well on “4th wave now” and the like but hey… I’m not the one arguing that we should use these data for that purpose. My view is that the data have nothing useful to say on this topic at all ¯\_(ツ)_/¯

Besides, the big take home message from Table 2 is actually that natal sex isn’t the strongest predictor of subsequent behaviour - using the existing gender dysphoria scales is a better idea! I mean… surprise, right? Psychometric instruments designed to measure gender dysphoria symptoms turn out to be better predictors of whether one subsequently transitions than what genitals you were born with!

Again - I’m not criticising Steensma et al (2013) here. I’m criticising people who look at this study and want to use it to start a moral panic.

A statistical complaint

I lied. I do have a criticism of the paper.

I think their multivariate analysis reported in Table 3 is wrong in a way that undermines one of their main conclusions. The basic idea is sensible: they took the variables with strong relationships in the initial analysis (Table 2) as the basis for a multivariate logistic regression, and found that several of them made unique contributions to the prediction equation (left column of Table 3). The resulting fitted model accounts for 58% of the variance in the outcome. It’s a moderately successful regression model, and so far it all looks good.

My gripe with this analysis is that they fall for a common fallacy: as Andrew Gelman and Hal Stern grumbled back in 2006, the difference between significance and non-significance is not necessarily significant. In their general conclusions Steensma et al (2013) argue that the predictors of desistence for natal boys and natal girls are different, but they never actually test that hypothesis. What they do instead is run two separate regressions, one for natal boys and another for natal girls, then look at which variables came out significant in each case. That’s NOT a test of group difference, and if you actually look at Table 3 properly you can see that for the most part there actually aren’t any meaningful differences. For instance, “age at intake” is a significant predictor for the natal boys but not natal girls. Aha, a difference!

Yeah, no. If you look at the point estimate of the odds ratio in Table 3, it’s 1.90 for natal boys and an almost identical 1.98 for natal girls. I absolutely guarantee that this is not a significant difference between groups — what’s happened here is that the effect hovers on the borderline of statistical significance for both groups, but the sample size is larger for natal boys and the corresponding confidence interval is narrower in that regression. As a consequence one group shows a significant effect and the other doesn’t. There is no evidence of differential predictive ability here.
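A minimal sketch of why "significant in one group, not in the other" is not a test of a difference. The two odds ratios are the ones quoted above from Table 3. The standard errors are HYPOTHETICAL: I do not have the paper’s values to hand, so I have picked numbers that put one effect just over the significance line and the other just under it, and I am assuming the two regressions are on independent groups so the variances add.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Point estimates quoted from Table 3 (age at intake), on the log-odds scale.
log_or_boys = math.log(1.90)
log_or_girls = math.log(1.98)

# HYPOTHETICAL standard errors, NOT taken from the paper.
se_boys = 0.30    # larger group, tighter estimate
se_girls = 0.40   # smaller group, wider estimate

# Separate tests: is each effect different from zero?
p_boys = two_sided_p(log_or_boys / se_boys)
p_girls = two_sided_p(log_or_girls / se_girls)

# The test that matters: is the boys' effect different from the girls' effect?
diff = log_or_boys - log_or_girls
se_diff = math.sqrt(se_boys**2 + se_girls**2)
p_diff = two_sided_p(diff / se_diff)

print(f"boys:  p = {p_boys:.3f}")
print(f"girls: p = {p_girls:.3f}")
print(f"difference between groups: p = {p_diff:.3f}")
```

With these invented standard errors, one group clears p < .05 and the other does not, yet the direct comparison of the two effects is nowhere near significant.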

Across the board, the confidence intervals for the natal boy odds ratio and the natal girl odds ratio overlap considerably. Without access to the raw data it’s very hard to know if there really is a difference here — though I should note in passing that there’s a couple of those predictors where I’d guess that there actually is a real difference, but the paper doesn’t report any analysis to that effect.

This is a bit of a problem, because if we go back to the authors’ main conclusions, the second one was this:

Clinical recommendations for the support of children with GD may need to be developed independently for natal boys and for girls, as the presentation of boys and girls with GD is different, and different factors are predictive for the persistence of GD

They might actually be right about this, but as far as I can tell they haven’t run the right analysis to demonstrate it. It’s not a big deal and it’s an easy one to fix. Just re-run the regression on the full data set and include the appropriate interaction terms. If you don’t know how to do that, just ask the nearest statistician. It’s not too hard.
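For the curious, here is a rough sketch of what the interaction analysis looks like. To be clear about what is assumed: the data are SIMULATED (I do not have the raw data), the variable names and coefficient values are invented, and the fitting routine is a bare-bones Newton-Raphson loop written with the standard library so that the example is self-contained. It is not the authors’ analysis. The simulated world has the same age effect for both sexes, so a sensible test should usually find no interaction.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_logistic(X, y, iters=25):
    """Newton-Raphson logistic regression; returns coefficients and SEs."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    info[a][c] += p * (1.0 - p) * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    # Standard errors from the diagonal of the inverse information matrix
    # (as evaluated in the final iteration, i.e. at convergence).
    se = []
    for a in range(k):
        unit = [1.0 if j == a else 0.0 for j in range(k)]
        se.append(math.sqrt(solve(info, unit)[a]))
    return beta, se

# Simulated children: centred age, a natal-boy indicator, and NO true
# age-by-sex interaction. All coefficient values here are invented.
rng = random.Random(1)
X, y = [], []
for _ in range(400):
    age = rng.uniform(-3, 3)
    boy = rng.choice([0, 1])
    eta = -0.2 + 0.6 * age - 0.8 * boy
    y.append(1 if rng.random() < 1 / (1 + math.exp(-eta)) else 0)
    X.append([1.0, age, float(boy), age * boy])  # last column = interaction

beta, se = fit_logistic(X, y)
z_interaction = beta[3] / se[3]
p_interaction = math.erfc(abs(z_interaction) / math.sqrt(2))
print(f"age x sex interaction: b = {beta[3]:.2f}, z = {z_interaction:.2f}, "
      f"p = {p_interaction:.3f}")
```

The coefficient on the last column is the thing to look at: it estimates how much the age effect differs between natal boys and natal girls, and its test is the test of a group difference. In practice you would hand this to a statistics package with a model formula along the lines of persist ~ age * sex rather than writing the fitting loop yourself.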

(Also as another aside — this is how scientists criticise each other’s work. The paper is not junk science. The work is not debunked. There is an error in the data analysis - in my opinion - that undermines one conclusion. It needs fixing, and as long as the raw data still exists it’s an easy fix. The original conclusion might still be correct. It might not. That’s it. No dramas. This is how grown ups deal with these matters)

What about the follow-up questionnaires?

Oh right. I almost forgot. Remember how almost everyone is focused on the mailout questionnaires sent at follow-up for some reason? The authors do talk about them a bit on p. 587, but there’s not a lot of interesting things to say. People who returned to the clinic later on (persisters) have more gender dysphoria than those people who didn’t return but were still sufficiently motivated to return the survey. Well, yeah: those people didn’t have high levels of gender dysphoria to start with, as the childhood data in Table 1 illustrate.

But regardless, that’s pretty hard to interpret in detail because we don’t know much about why the desisters didn’t come back, and why the nonresponders didn’t respond. Commendably, the authors don’t really make a big deal out of this — it’s not especially informative, and they don’t treat it as such.

So what were the successful predictors?

Looking over the results and discussion, it seems to me the most interesting point in the paper is slightly buried. At the childhood intake, there were quite a lot of different measurements taken. So which of these measures turn out to be the best predictor of whether a child subsequently persists with transition? It isn’t sex, or degree of transition, or whatever. It’s the cognitive subscale from the GIIC (Gender Identity Interview for Children). I don’t know anything about the psychometric properties of this instrument nor what the questions are, so I’ll just defer to Steensma et al (2013) here, who write:

Persisters indicated that they believed that they were the “other” sex, and the desisters indicated they wished they were the “other” sex; this difference may also underlie our finding of a higher report of cognitive cross-gender identification in the persisters than in the desisters (emphasis added)

If I’m reading this correctly (I might not be because I don’t know the GIIC), it sounds rather like they’re arguing that self-reported gender identity turns out to be the best predictor of whether someone actually transitions. That’s more or less exactly what adult transgender people say about the distinction between gender identity and discomfort with gender roles. I can’t say I’m surprised, but it’s terribly nice to see the data backing up that intuition.

Science, eh?