I have recently restarted working with Stan and unfortunately ran into the problem that my (hierarchical) Bayesian models often produced divergent transitions. And when this happens, the warning basically only suggests to increase adapt_delta
:
Warning messages:
1: There were X divergent transitions after warmup. Increasing adapt_delta above 0.8 may help.
2: Examine the pairs() plot to diagnose sampling problems
However, increasing adapt_delta
often does not help, even if one uses values such as .99. Also, I never found pairs()
especially illuminating. This is the first of two blog posts dealing with this issue. In this (the first) post I will show which Stan settings need to be changed to remove the divergent transitions (to foreshadow, these are adapt_delta
, stepsize
, and max_treedepth
). In the next blog post I will show how reparameterizations of the model following Stan recommendations can remove divergent transitions often without the necessity to extensively fiddle with the sampler settings while at the same time dramatically improve the fitting speed.
My model had some similarities to the multinomial processing tree (MPT) example in the Lee and Wagenmakers cognitive modeling book. As I am a big fan of both, MPTs and the book, I investigated the issue of divergent transitions using this example. Luckily, a first implementation of all the examples of Lee and Wagenmakers in Stan has been provided by Martin Šmíra (who is now working on his PhD in Birmingham) and is part of the Stan example models. I submitted a pull request with the changes to the model discussed here so they are now also part of the example models (and contains a README file discussing those changes).
The example uses the pair-clustering model also discussed in the paper introducing MPTs formally . The model has three parameters, c
for cluster-storage, r
for cluster-retrieval, and u
for unique storage-retrieval. For the hierarchical structure the model employs the latent trait approach of : The group level (i.e., hyper-) parameters are estimated separately on the unconstrained space from -infinity to +infinity. Individual level parameters are added to the group means as displacements estimated from a multivariate normal with mean zero and freely estimated variance/covariance matrix. Only then is the unconstrained space mapped onto the unit range (i.e., 0 to 1), which represents the parameter space, via the probit transformation. This allows to freely estimate the correlation among the individual parameters on the unconstrained space and at the same time constrains the parameters after transformation onto the allowed range.
The original implementation employed two features that are particularly useful for models estimated via Gibbs sampling (as implemented in Jags), but not so much for the NUTS sampler implemented in Stan: (a) A scaled inverse Wishart as prior for the covariance matrix due to its computational convenience (following ) and (b) parameter expansion to move the scale parameters of the variance-covariance matrix away from zero and ensure reasonable priors.
The original implementation of the model in Stan is simply a literal translation of the Jags code given in Lee and Wagenmakers. Consequently, it retains the Gibbs specific features. When fitting this model it seems to produce stable estimates, but Stan reports several divergent transitions after warm up. Given that the estimates seem stable and the results basically replicate what is reported in Lee and Wagenmakers (Figures 14.5 and 14.6) one may wonder why not too trust these results. I can give no full explanation, so let me copy the relevant part from the shinystan
help. Important is the last section, it clearly says not to use the results if there are any divergent transitions.
n_divergent
Quick definition The number of leapfrog transitions with diverging error. Because NUTS terminates at the first divergence this will be either 0 or 1 for each iteration. The average value of
n_divergent
over all iterations is therefore the proportion of iterations with diverging error.More details
Stan uses a symplectic integrator to approximate the exact solution of the Hamiltonian dynamics and when
stepsize
is too large relative to the curvature of the log posterior this approximation can diverge and threaten the validity of the sampler.n_divergent
counts the number of iterations within a given sample that have diverged and any non-zero value suggests that the samples may be biased in which case the step size needs to be decreased. Note that, because sampling is immediately terminated once a divergence is encountered,n_divergent
should be only 0 or 1.If there are any post-warmup iterations for which
n_divergent = 1
then the results may be biased and should not be used. You should try rerunning the model with a higher target acceptance probability (which will decrease the step size) untiln_divergent = 0
for all post-warmup iterations.
My first step trying to get rid of the divergent transitions was to increase adapt_delta
as suggested by the warning. But as said initially, this did not help in this case even when using quite high values such as .99
or .999
. Fortunately, the quote above tells that divergent transitions are related to the stepsize with which the sampler traverses the posterior. stepsize
is also one of the control
arguments one can pass to Stan in addition to adapt_delta
. Unfortunately, the stan
help page is relatively uninformative with respect to the stepsize
argument and does not even provide its default value, it simply says stepsize (double, positive)
. Bob Carpenter clarified on the Stan mailing list that the default value is 1
(referring to the CMD Stan documentation). He goes on:
The step size is just the initial step size. It lets the first few iterations move around a bit and set relative scales on the parameters. It’ll also reduce numerical issues. On the negative side, it will also be slower because it’ll take more steps at a smaller step size before hitting a U-turn.
The adapt_delta (target acceptance rate) determines what the step size will be during sampling — the higher the accept rate, the lower the step size has to be. The lower the step size, the less likely there are to be divergent (numerically unstable) transitions.
Taken together, this means that divergent transitions can be dealt with by increasing adapt_delta
above the default value of .8
while at the same time decreasing the initial stepsize
below the default value of 1
. As this may increase the necessary number of steps one might also need to increase the max_treedepth
above the default value of 10
. After trying out various different values, the following set of control arguments seems to remove all divergent transitions in the example model (at the cost of prolonging the fitting process quite considerably):
control = list(adapt_delta = 0.999, stepsize = 0.001, max_treedepth = 20)
As this uses rstan
, the R
interface to stan, here the full call:
samples_1 <- stan(model_code=model, data=data, init=myinits, # If not specified, gives random inits pars=parameters, iter=myiterations, chains=3, thin=1, warmup=mywarmup, # Stands for burn-in; Default = iter/2 control = list(adapt_delta = 0.999, stepsize = 0.01, max_treedepth = 15) )
With these values the traceplots of the post-warmup samples look pretty good. Even for the sigma
parameters which occasionally have problems moving away from 0. As you can see from these nice plots, rstan
uses ggplot2
.
traceplot(samples_1, pars = c("muc", "mur", "muu", "Omega", "sigma", "lp__"))
Any idea when the second part of this tutorial will be available?
This is fantastic! I’ve been searching for a way to deal with divergent transitions and this is the first time I’ve seen someone recommend adjusting stepsize and max_treedepth. Thank you!