
Day 12: Stan

A colleague comes in with a statistics question. It doesn’t seem particularly unreasonable, and after a bit of thought it’s not too hard to come up with a model that would make sense for the problem. The only difficulty is that it’s non-standard. After a little more thought, it seems like it might be possible to come up with a variation of the model that falls within the GLM family (or whatever), and so maybe there’s a way of solving the problem that way. The problem is that it doesn’t feel like the modified version of the model is really answering the question my colleague wants to address. What we want to do is write down the original model and use that to do the inference. This is the situation where probabilistic programming languages like BUGS, JAGS and Stan come in handy. I’ve been using JAGS for several years now, as a tool for writing down generative models (specifically Bayesian models that can be formally described as a directed acyclic graph) and using Gibbs sampling to draw from the posterior distribution in a relatively painless way. However, over the last few years some friends have been urging me to try out Stan. I’ve been meaning to do this for a while, and since it has an associated R package rstan, and since I need a blog post for today… maybe I’ll try to build some simple models with Stan?

Step 1: Installing Stan

Here are my first steps…

  • I check the Getting Started page
  • I discover that I don’t have Xcode installed on this machine
  • I start downloading Xcode, and begin writing a post about a different package
  • I come back a day later with Xcode finally installed. Sigh.
  • I get distracted by several other high-importance tasks

Sigh again. So now it’s a few days later…

  • I check the documentation again. Oh that’s right I’m learning Stan
  • The documentation provides detailed instructions on how to create the Makevars file
  • I follow the instructions blindly 😀

The documentation takes care to let me know that I’m doing something with compilers (shudder) and that there are occasionally consequences…

Be advised that setting the optimization level to 3 may prevent some other R packages from installing from source if they are only tested with the stock R configuration.

… Okay, duly noted. Back to the installation process

  • For me, following the instructions has created a file located at "/Users/dan/.R/Makevars" that contains settings for the compiler, something along these lines (the exact contents will vary by machine and compiler; this is roughly what mine looked like):
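# compiler settings per the rstan instructions (machine-specific)
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function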
  • So now I can install rstan
install.packages("rstan", repos = "https://cloud.r-project.org/", dependencies=TRUE)
  • Finally I run the test code:
fx <- inline::cxxfunction( signature(x = "integer", y = "numeric" ) , '
    return ScalarReal( INTEGER(x)[0] * REAL(y)[0] ) ;
' )
fx( 2L, 5 ) # should be 10

Yep, I get 10. Okay, I am ready to start doing some Hamiltonian Monte Carlo! And I would be less terrified of that if I had the slightest inkling of how Hamiltonian Monte Carlo works!

Seriously, I need to do some reading

… I mean, it’s not as if I really understood the inner workings of JAGS either, but at the very least I can say that I have derived, implemented, broken and then subsequently fixed my own Gibbs samplers, so the core concepts are familiar enough that I (mostly) know what I’m doing. But this is a different thing. I’ve heard of the No U-Turn Sampler, but that’s about it. For the moment I’m fine with just blindly playing with the tool, but I confess that I don’t feel comfortable actually using Stan for anything serious until I’ve at least built a NUTS algorithm manually. Even a bad one! But that can wait.

Step 2: Getting started with Stan

In any case, now that I have Stan, I can skim over the how to use rstan documentation. Loading the package gives me some messages at startup…

library("rstan") # observe startup messages
## Loading required package: ggplot2
## Loading required package: StanHeaders
## rstan (Version 2.17.3, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)

Duly noted. For my sanity, I’m not even pretending to do multicore right now, but I really like that it tells me about this from the outset.

Step 3: Reading more documentation

The Stan user manual is magnificent. I am amazed at how well documented everything is. The instructions to get me to this point have been detailed, carefully explained, and it’s all gone very smoothly. Now that I’m starting to think about doing modelling, I discover that the manual is 637 pages long and it’s great. Normally I’d find that much documentation intimidating, but it’s written very simply and without pretension. So rather than being worried I find it reassuring… Sure, I know that I don’t know much about Hamiltonian Monte Carlo, but this gives me a real feeling that if I just read the documentation, and take the time to do things properly, I can work out Stan.

A few miscellaneous thoughts I had on a first pass through the manual:

  • Huh, Stan is imperative rather than declarative
  • Interesting, it’s strongly typed. That initially worried me but on reflection I think it’s just a bit of extra overhead in learning, and they explain the decision really well (My worries are probably just horrible C nightmares)
  • Oh wow, it’s not just Hamiltonian sampling. Stan also covers variational Bayes. You can also use it to get MAP estimates. Neat!
  • Ooh user defined functions… Yay!
  • Wow there are a lot of worked examples of models! For now I’m going to ignore them, but that will be super useful later
  • There’s a lot to work through in order to grasp all the data types (but yay for having an explicit representation of a simplex - how is that not a basic data type in every probabilistic language?)
  • Hm. It doesn’t support discrete parameters, although the manual does have a chapter on marginalising them out. That might pose problems in some cases. Okay, well there are no free lunches in life I suppose!

Enough reading … let’s see if I’ve learned anything!

Step 4: Implementing a beta-binomial model

My first attempt at a Stan model will be a very traditional one… the beta-binomial model. I like using that as my first attempt for any new programming task because it’s so simple and has closed-form solutions that I can do with pencil and paper. So really, my goal here is just to see if I understand the basic structure of a Stan program, and see if I can make sense of the rstan interface. The model I want to implement is simply \[\begin{array}{rcl} \theta & \sim & \mbox{Beta}(a,b) \\ k \mid N & \sim & \mbox{Binomial}(N, \theta) \end{array} \] where \(k\) is the number of “successes” out of \(N\), \(\theta\) is the probability of success, and the \(a\) and \(b\) values describe the prior. I’ll set \(k=5\), \(N=10\), \(a=b=2\), so (via conjugacy) the posterior distribution over \(\theta\) will be a Beta(7,7). Can I do this in Stan?
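(To convince myself that the Beta(7,7) claim is right before trusting any sampler output: by conjugacy, \[ P(\theta \mid k, N) \ \propto\ \theta^{a-1}(1-\theta)^{b-1} \times \theta^{k}(1-\theta)^{N-k} \ =\ \theta^{a+k-1}(1-\theta)^{b+N-k-1} \] which is the kernel of a \(\mbox{Beta}(a+k,\ b+N-k)\) distribution; with \(a=b=2\), \(k=5\) and \(N=10\) that is indeed a Beta(7,7).)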

With a model as simple as this one, the Stan program has three code blocks: data, parameters and model. In the data block I need to declare the data that the model takes as input. In this case, I specify an integer \(N\) and another integer \(k\).

data {
  int<lower=1> N;           // number of observations
  int<lower=0, upper=N> k;  // number of successes
}

From reading the documentation, I’m led to understand that Stan is very particular about the lower and upper bounds. There are multiple places in the code where one has an opportunity (not in this model, it’s too simple) to specify constraints (e.g., truncation) and it really does matter that they be consistent with each other. Apparently things go badly with the sampler if the posterior distribution does not have support across the full range of specified parameters, including the boundaries.

The next code block defines the parameters, unobserved quantities that the model needs to do inference about. In this case there’s only one parameter \(\theta\) and it is bounded between 0 and 1:

parameters {
  real<lower=0, upper=1> theta;  // probability of success
}

Last, I need to define the model block, which specifies the full generative process. For this model there are two lines: one for the prior \(P(\theta)\) and another for the likelihood \(P(k|N,\theta)\):

model {
  theta ~ beta(2,2);       // prior
  k ~ binomial(N, theta);  // likelihood
}

My next task is to save this to a file, betabinomial.stan. Now I’m ready to call it using rstan. The command is very simple, since I’m using the default settings for everything:

library(rstan)
x <- stan(
  file = "./betabinomial.stan",
  data = list(k = 5, N = 10)
)

When I do this there are a few seconds of waiting, and I feel a bit of panic thinking I’ve made a huge mistake, but it turns out the model is just compiling (you can call stan with verbose=TRUE to see all the ugliness). The nice thing is that it won’t recompile the model every single time: if the model hasn’t changed since the most recent compilation, it doesn’t need to! (I gather this is what the rstan_options(auto_write = TRUE) advice from the startup messages is about.) Once the compiling is done, the sampling process is more or less instantaneous and I see messages on screen that look like this:

# SAMPLING FOR MODEL 'betabinomial' NOW (CHAIN 1).
# 
# Gradient evaluation took 1.8e-05 seconds
# 1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
# Adjust your expectations accordingly!
#   
#   
#   Iteration:    1 / 2000 [  0%]  (Warmup)
# Iteration:  200 / 2000 [ 10%]  (Warmup)
# Iteration:  400 / 2000 [ 20%]  (Warmup)
# Iteration:  600 / 2000 [ 30%]  (Warmup)
# Iteration:  800 / 2000 [ 40%]  (Warmup)
# Iteration: 1000 / 2000 [ 50%]  (Warmup)
# Iteration: 1001 / 2000 [ 50%]  (Sampling)
# Iteration: 1200 / 2000 [ 60%]  (Sampling)
# Iteration: 1400 / 2000 [ 70%]  (Sampling)
# Iteration: 1600 / 2000 [ 80%]  (Sampling)
# Iteration: 1800 / 2000 [ 90%]  (Sampling)
# Iteration: 2000 / 2000 [100%]  (Sampling)
# 
# Elapsed Time: 0.009085 seconds (Warm-up)
# 0.010223 seconds (Sampling)
# 0.019308 seconds (Total)

It repeats this for four chains. I save the Stan output to a file and then have a look at what I’ve got. The save is a one-liner; something like this, with a file name of my own invention:
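saveRDS(x, file = "betabinomial_samples.rds")

And then the look: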

print(x)
Inference for Stan model: betabinomial.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd   2.5%    25%   50%   75% 97.5% n_eff Rhat
theta   0.50    0.00 0.13   0.25   0.41  0.50  0.59  0.76  1746 1.00
lp__  -10.25    0.02 0.76 -12.58 -10.42 -9.95 -9.76 -9.70  1525 1.01

Samples were drawn using NUTS(diag_e) at Tue May  8 22:46:58 2018.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Hm well that ran really nicely. I’ve got a nice set of summary statistics for the parameter \(\theta\), and the log-probability. Very easy to read!

Next, what do I have here? Mindlessly copying from the documentation to work out how to extract the samples… turns out I need the extract function. Hm. Note to self - I need to understand what the data structure produced by stan actually looks like, but that’s a job for later. In the meantime, let’s verify that it worked. I’ll plot a histogram of the sampled values of \(\theta\), and verify that they look like a numerical approximation to the Beta(7,7) distribution:

ex <- extract(x, permuted = TRUE) # returns a list with theta and log prob
hist(ex$theta, breaks = 40, col = "lightblue",
     border = "white", xlab = "theta", main = "",
     freq = FALSE)
v <- seq(0,1,.01)
lines(v, dbeta(v,7,7), col = "black", lwd = 2)
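While I’m at it, a quick numerical check of my own: the sample moments should line up with the analytic Beta(7,7) values, a mean of 0.5 and a standard deviation of \(\sqrt{49/2940} \approx 0.129\):

# compare the sample moments to the analytic Beta(7,7) values
c(mean = mean(ex$theta), sd = sd(ex$theta))
c(mean = 7/14, sd = sqrt(7 * 7 / ((7 + 7)^2 * (7 + 7 + 1))))  # 0.5, ~0.129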

That looks about right! Okay I’m going to call that a win!

Step 5: Towards a “real” problem

This is a nice beginning, but I’m not sure that I really understand what’s going on yet. To test my knowledge I’m going to try a “real” problem, inspired by one that I’m currently working on with some colleagues at UNSW.

Here’s the problem. Glossing over the particulars of the research question, the design is essentially an elicitation task. There is a set of 9 ordered categories \(x\) that are roughly evenly spaced along a bounded continuum. For simplicity I’ll assume \(x_k = .1, .2, \ldots .9\) though in the real problem we might want to treat \(x_k\) as latent variables. For each of these 9 categories, people are asked to answer a question like “out of every 100 people, how many do you think will belong to category X?” The full design is a little more complicated, but for simplicity in this initial example I’m going to pretend that there are 4 replicates of the 9 judgments for each person, leading to 36 responses per person. (Again, I’m glossing stuff here, because obviously it’s not always straightforward to do proper “replicates” within person)

Okay, so what is our model? The implicit assumption we have here is that each person has some latent beliefs about the probability distribution over categories \(P(x_k)\), which we will take to be a truncated normal distribution with mean \(\mu\) and standard deviation \(\sigma\). The truncation range is assumed to be [0,1], but in this simplified model it will turn out not to matter much because I’m not going to bother with cumulative normals in the way that I should probably do; I’m just going to pretend that the belief is proportional to the normal density \(N(x_k \mid \mu, \sigma)\) at the category location. Yes, yes, that’s a bit of a simplification, but again I can return to this later.

In the full model, we would assume that each person has their own idiosyncratic belief distribution with subject-specific parameters \(\mu_i\) and \(\sigma_i\). People are expected to be pretty similar to each other, so there will be a population distribution over these, but again that’s a complexity I can build into the model later. For my first attempt, all I want to do is model the responses from a single person.

Next I need to recognise that the responses people give don’t map onto the beliefs in a straightforward way. I don’t want to assume that people generate responses that sum to 100 as they should (because people don’t do that), so there will have to be a scaling parameter \(v_i\) that could vary across people. With this in mind I’ll assume that when the \(i\)th person is presented with the \(k\)th category, they have a “response strength” \(\phi_{ik} = v_i p(x_k|\mu_i,\sigma_i)\). The response \(y_{ik}\) is then assumed to be sampled from a normal distribution with mean \(\phi_{ik}\) and subject-specific response standard deviation \(\epsilon_i\) … because honestly some people give answers that don’t make any sense and we don’t want to weight their answers too strongly. Again, in the full model we’re going to need population-level distributions for \(\epsilon_i\).

So the model uses four parameters to describe an individual subject: two to describe the underlying beliefs \((\mu_i, \sigma_i)\) and two ancillary parameters to capture the response generation process \((v_i, \epsilon_i)\). With only 9 responses per person that would be a tough ask, and we’d be relying very heavily on shrinkage through the population distributions to make this model tractable. However, the design we’re thinking about (somewhat simplified for my sanity right now) allows us to poll each person multiple times, so we’re expecting to get 36 observations per person. Though as I say, the real design is a little sneakier insofar as it partly disguises the repetitions to minimise demand effects… but that’s a long story and I don’t want to think about that right now!

Step 6: Stating the model

So, filling in some “dummy priors” for the time being, we have this as our prior over the parameters

\[\begin{array}{rcl} \mu & \sim & \mbox{Uniform}(0,1) \\ \sigma & \sim & \mbox{Uniform}(.01,2) \\ v & \sim & \mbox{Uniform}(1,30) \\ \epsilon & \sim & \mbox{Uniform}(.1,10) \\ \end{array} \]

We’ll have 9 transformed quantities: \[ \phi_k = v \, \frac{1}{\sqrt{2\pi}\sigma} \exp \left(-\frac{(\mu-x_k)^2}{2\sigma^2} \right) \]

And then the response for the \(j\)-th replication of the \(k\)-th category is distributed according to a truncated normal, where the truncation is between 0 and 100: \[ y_{jk} \sim \mbox{TruncNormal}(\phi_k, \epsilon, 0, 100) \]

Step 7: Stan code

This time around my model is going to involve arrays, and I’ll need to insert a block for the transformed parameters. First off, here’s my data block specifying the number of categories \(N=9\), the number of replications \(M=4\), the category locations \(x\) and the responses \(y\).

data {
  int<lower=1> N; // number of categories (i.e. 9)
  int<lower=1> M; // number of repetitions (i.e. 4)
  real<lower=0, upper=100> y[N,M]; // responses (i.e., a 9x4 2D-array)
  real<lower=0, upper=1> x[N]; // values (length 9 1D-array)
}

Again, lower and upper bounds are important to Stan. Next, my parameters block defines my four parameters:

parameters {
  real<lower=0, upper=1> mu;        // mean of the belief distribution
  real<lower=.01, upper=2> sigma;   // std dev of the belief distribution
  real<lower=1, upper=30> v;        // scaling parameter for the response
  real<lower=.1, upper=10> epsilon; // response noise parameter
}

The transformation on the parameters comes next, and I’m pretty certain there is a better way to do this than the version I wrote down. But for now here it is:

transformed parameters {
  real<lower=0> phi[N];
  
  // the 9 response strengths are a deterministic transformation
  // of the input values (x), and the 3 parameters: mu, sigma, v.
  // (note to self: this cannot possibly be a good way to code this!)
  for(c in 1:N) {
      phi[c] = v * (1/(sigma * sqrt(2*pi()))) * exp(-(mu-x[c])^2/(2*(sigma^2)));
  }
}
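(A first guess at that “better way”: rather than transcribing the Gaussian formula by hand, I could presumably lean on Stan’s built-in log density functions and exponentiate. Something like this sketch should be equivalent, though I haven’t checked it carefully:)

transformed parameters {
  real<lower=0> phi[N];

  // same deterministic transformation, but letting Stan compute the
  // normal density: exp(normal_lpdf(x | mu, sigma)) is the normal pdf
  for(c in 1:N) {
    phi[c] = v * exp(normal_lpdf(x[c] | mu, sigma));
  }
}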

Now that we have set everything up, here’s the model block that specifies the probabilistic model:

model {
  // priors over the beliefs are uniform over the plausible
  // range; this should be revisited but for now I just want
  // something I can write down
  mu ~ uniform(0,1);
  sigma ~ uniform(.01,2);

  // priors over the response process parameters are similarly
  // uniform; again, I don't really believe this and will need to
  // think about it later on
  v ~ uniform(1, 30);
  epsilon ~ uniform(.1, 10);

  // (More notes to self: Can I vectorise this? Would it even
  // matter for stan? Would it be better to define this on the lpdf
  // directly rather than the pdf???)
  for(c in 1:N) {
    for(r in 1:M) {
      // response is a truncated normal with mean phi and standard
      // deviation epsilon; truncation range is 0 to 100
      y[c,r] ~ normal(phi[c], epsilon) T[0,100];
    }
  }
}
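(On that last note to self: as far as I can tell from the manual, the T[0,100] statement is shorthand for incrementing the log density directly, with the normal log density renormalised by the probability mass inside the truncation range. So a longhand version of the loop, written against the lpdf, would look something like the sketch below. I haven’t verified that this matches the T[,] version, so treat it as a guess:)

model {
  // ... priors as before ...
  for(c in 1:N) {
    for(r in 1:M) {
      // truncated normal, written out by hand: the normal log density,
      // minus the log probability mass between the bounds 0 and 100
      target += normal_lpdf(y[c,r] | phi[c], epsilon)
              - log_diff_exp(normal_lcdf(100 | phi[c], epsilon),
                             normal_lcdf(0 | phi[c], epsilon));
    }
  }
}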

I have to say, now that I’ve seen the structure of Stan models for something a little closer to realistic, I’m liking the way it separates out transformed quantities from probabilistic relations. This feels nice.

In any case, I save this to probabilityjudgment.stan.

Step 8: Simulating fake data and modelling it

The next step is to write some R code that generates fake data from the model, feeds it to Stan, and checks how well Stan recovers the true belief distribution that generated the data. Code for that is saved as run_probabilityjudgment.R.

library(rstan)

# parameters
mu <- .3
sig <- .2
v <- 15
eps <- 7

# categories
x <- seq(.1, .9, .1)

# response strengths  
phi <- v * dnorm(x, mu, sig)
  
# generate simulated responses
y <- matrix(NA,9,4)
for(i in 1:9) {
  y[i,] <- round(rnorm(4,phi[i], eps))
}
y[y<1] <- 1
y[y>99] <- 99

# run stan
inputdata <- list(N = 9, M = 4, y = y, x = x)
outputdata <- stan(
   file = "./probabilityjudgment.stan",
   data = inputdata
)
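# quick sanity check (my own aside, not strictly necessary): print the
# summary for just the four named parameters; Rhat should be close to 1
print(outputdata, pars = c("mu", "sigma", "v", "epsilon"))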

# get posterior means for the various chains
parmean <- get_posterior_mean(outputdata)
phi_est <- parmean[grep("phi",rownames(parmean)),"mean-all chains"]
mu_est <- parmean[grep("mu",rownames(parmean)),"mean-all chains"]
sig_est <- parmean[grep("sigma",rownames(parmean)),"mean-all chains"]


# plot
layout(matrix(1:2,1,2))

# plot the raw data, along with the true expected response
plot(phi,type="l", ylim=c(0,max(y)*1.1), xlab = "Stimulus Number", ylab = "Response", 
     lty=2, lwd=2, main="Observed Response")
for(i in 1:4) lines(y[,i], type="p", pch=19)
legend(x="topright", legend=c("data", "expected", "inferred"), 
       lty=c(NA, 2,1), pch=c(19,NA,NA),
       lwd=2, col=c("black","black","blue"))

# overlay the model estimates of phi
lines(phi_est,lwd=2, col="blue")

# second plot to compare the model inferred belief distribution
# to the true one that generated the data
xv <- seq(0,1,.01)
b <- dnorm(xv,mu,sig)
plot(xv, b, lty=2,lwd=2, 
     xlab="Stimulus", ylab="Belief", type="l",
     main="Latent Belief", ylim=c(0,max(b)*1.1))
lines(xv, dnorm(xv, mu_est, sig_est), lwd=2, col="blue")
legend(x="topright", legend=c("true", "inferred"), lty=c(2,1), 
       lwd=2, col=c("black","blue"))

layout(1)

Here are the results:

The panel on the left plots all 36 simulated responses (black dots) as a function of category. The dashed line is the expected mean response according to the true model parameters for this simulation, and the solid blue line is the posterior mean response strength \(\phi\) estimated by the model. On the right, we have the latent belief distribution that generated the data (dashed) and the estimated belief distribution at the posterior mean values of \(\mu\) and \(\sigma\) (solid blue). Despite the noise in the data, the model does okay.

Two more simulations:

These simulations are at the slightly pessimistic end (i.e., I’m assuming people’s responses are very noisy), but that’s intentional. I wanted to see if the model is capable of handling the data if people’s judgments are very noisy, and apart from some modest misfitting, it seems to do okay. If we halve the amount of response noise (i.e., rerun the simulation with eps <- 3.5 rather than eps <- 7), it does much better:

Step 9: Celebrate!

This has been a long post, and there’s a long way to go before I feel comfortable in Stan. But I feel like I’ve learned enough to start using it, and I’m pretty sure that I’ll be able to work out what to do for the more complicated real problem that my colleague wants to study!

Yay! 🎈

Tomorrow’s post: something simpler please!

Danielle Navarro
Associate Professor of Cognitive Science
