Hierarchical MPT in Stan I: Dealing with Divergent Transitions via Control Arguments

I recently started working with Stan again and unfortunately ran into the problem that my (hierarchical) Bayesian models often produced divergent transitions. When this happens, the warning basically only suggests increasing adapt_delta:

Warning messages:
1: There were X divergent transitions after warmup. Increasing adapt_delta above 0.8 may help.
2: Examine the pairs() plot to diagnose sampling problems

However, increasing adapt_delta often does not help, even with values such as .99. Also, I never found pairs() especially illuminating. This is the first of two blog posts dealing with this issue. In this first post I will show which Stan settings need to be changed to remove the divergent transitions (to foreshadow, these are adapt_delta, stepsize, and max_treedepth). In the next post I will show how reparameterizing the model following Stan recommendations can often remove divergent transitions without extensive fiddling with the sampler settings, while at the same time dramatically improving fitting speed.

My model had some similarities to the multinomial processing tree (MPT) example in the Lee and Wagenmakers cognitive modeling book. As I am a big fan of both MPTs and the book, I investigated the issue of divergent transitions using this example. Luckily, a first implementation of all the examples of Lee and Wagenmakers in Stan has been provided by Martin Šmíra (who is now working on his PhD in Birmingham) and is part of the Stan example models. I submitted a pull request with the changes to the model discussed here, so they are now also part of the example models (together with a README file discussing those changes).

The example uses the pair-clustering model, which is also discussed in the paper that formally introduced MPTs. The model has three parameters: c for cluster storage, r for cluster retrieval, and u for unique storage-retrieval. For the hierarchical structure the model employs the latent-trait approach: the group-level (i.e., hyper-) parameters are estimated separately on the unconstrained space from -infinity to +infinity. Individual-level parameters are added to the group means as displacements estimated from a multivariate normal with mean zero and a freely estimated variance-covariance matrix. Only then is the unconstrained space mapped onto the unit range (i.e., 0 to 1), which represents the parameter space, via the probit transformation (i.e., the standard normal CDF). This makes it possible to freely estimate the correlations among the individual-level parameters on the unconstrained space while constraining the transformed parameters to the allowed range.
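To make the latent-trait construction concrete, here is a minimal base-R sketch of the mapping described above. All numbers (group means, standard deviations, correlations, number of participants) are made-up values for illustration and are not taken from the example model; the variable names are likewise hypothetical.

```r
set.seed(1)
n_subj <- 5

# Hypothetical group-level means on the unconstrained (probit) scale
mu <- c(c = 0.3, r = -0.2, u = 0.5)

# Hypothetical standard deviations and correlation matrix of the displacements
sigma <- c(0.5, 0.8, 0.4)
Omega <- matrix(c(1.0,  0.3,  0.1,
                  0.3,  1.0, -0.2,
                  0.1, -0.2,  1.0), nrow = 3)
Sigma <- diag(sigma) %*% Omega %*% diag(sigma)  # variance-covariance matrix

# Individual displacements ~ MVN(0, Sigma), drawn via the Cholesky factor
z <- matrix(rnorm(n_subj * 3), nrow = n_subj)
delta <- z %*% chol(Sigma)

# Add the group means, then map onto (0, 1) with the standard normal CDF
theta_probit <- sweep(delta, 2, mu, "+")
theta <- pnorm(theta_probit)
colnames(theta) <- names(mu)
theta
```

Each row of theta is one simulated participant. The correlations live entirely on the unconstrained scale, and pnorm() guarantees that every individual parameter ends up strictly between 0 and 1.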

The original implementation employed two features that are particularly useful for models estimated via Gibbs sampling (as implemented in JAGS), but not so much for the NUTS sampler implemented in Stan: (a) a scaled inverse Wishart as prior for the covariance matrix, chosen for its computational convenience, and (b) parameter expansion to move the scale parameters of the variance-covariance matrix away from zero and ensure reasonable priors.

The original implementation of the model in Stan is simply a literal translation of the JAGS code given in Lee and Wagenmakers. Consequently, it retains the Gibbs-specific features. When fitting this model it seems to produce stable estimates, but Stan reports several divergent transitions after warmup. Given that the estimates seem stable and the results basically replicate what is reported in Lee and Wagenmakers (Figures 14.5 and 14.6), one may wonder why one should not trust these results. I cannot give a full explanation, so let me copy the relevant part from the shinystan help. The last paragraph is the important one: it clearly says not to use the results if there are any divergent transitions.

n_divergent

Quick definition The number of leapfrog transitions with diverging error. Because NUTS terminates at the first divergence this will be either 0 or 1 for each iteration. The average value of n_divergent over all iterations is therefore the proportion of iterations with diverging error.

More details

Stan uses a symplectic integrator to approximate the exact solution of the Hamiltonian dynamics and when stepsize is too large relative to the curvature of the log posterior this approximation can diverge and threaten the validity of the sampler. n_divergent counts the number of iterations within a given sample that have diverged and any non-zero value suggests that the samples may be biased in which case the step size needs to be decreased. Note that, because sampling is immediately terminated once a divergence is encountered, n_divergent should be only 0 or 1.

If there are any post-warmup iterations for which n_divergent = 1 then the results may be biased and should not be used. You should try rerunning the model with a higher target acceptance probability (which will decrease the step size) until n_divergent = 0 for all post-warmup iterations.
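As a rough sketch of how this check could be done by hand, the helper below counts divergent iterations from the per-chain sampler parameters. The function name count_divergent and the mock data are hypothetical. The commented-out line assumes rstan is installed and a fitted object exists; note that the column is called divergent__ in recent rstan versions, whereas older versions used n_divergent__.

```r
# Count divergent post-warmup iterations across chains.
# `sampler_params` is a list with one matrix per chain, in the format
# returned by rstan::get_sampler_params().
count_divergent <- function(sampler_params) {
  div <- unlist(lapply(sampler_params, function(x) x[, "divergent__"]))
  c(n = sum(div), prop = mean(div))
}

# Mock sampler output for two chains with four iterations each (made-up values)
mock <- list(
  cbind(divergent__ = c(0, 0, 1, 0)),
  cbind(divergent__ = c(0, 0, 0, 0))
)
res <- count_divergent(mock)
res

# With a real fit (assumes a stanfit object called `fit`):
# count_divergent(rstan::get_sampler_params(fit, inc_warmup = FALSE))
```

Following the help text above, anything other than n = 0 for the post-warmup iterations means the results should not be used.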

My first step in trying to get rid of the divergent transitions was to increase adapt_delta as suggested by the warning. But, as said initially, this did not help in this case, even with quite high values such as .99 or .999. Fortunately, the quote above says that divergent transitions are related to the step size with which the sampler traverses the posterior. stepsize is also one of the control arguments one can pass to Stan in addition to adapt_delta. Unfortunately, the stan help page is relatively uninformative with respect to the stepsize argument and does not even provide its default value; it simply says stepsize (double, positive). Bob Carpenter clarified on the Stan mailing list that the default value is 1 (referring to the CmdStan documentation). He goes on:

The step size is just the initial step size.  It lets the first few iterations move around a bit and set relative scales on the parameters.  It’ll also reduce numerical issues. On the negative side, it will also be slower because it’ll take more steps at a smaller step size before hitting a U-turn.

The adapt_delta (target acceptance rate) determines what the step size will be during sampling — the higher the accept rate, the lower the step size has to be.  The lower the step size, the less likely there are to be divergent (numerically unstable) transitions.

Taken together, this means that divergent transitions can be dealt with by increasing adapt_delta above the default value of .8 while at the same time decreasing the initial stepsize below the default value of 1. As this may increase the necessary number of steps one might also need to increase the max_treedepth above the default value of 10. After trying out various different values, the following set of control arguments seems to remove all divergent transitions in the example model (at the cost of prolonging the fitting process quite considerably):

control = list(adapt_delta = 0.999, stepsize = 0.001, max_treedepth = 20)
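To get a feeling for why a smaller step size can require a larger max_treedepth, recall that NUTS doubles its trajectory at each tree level, so a tree of depth d contains at most 2^d - 1 leapfrog steps. The sketch below tabulates this bound. The step size eps is a made-up value for illustration only: the stepsize control argument is merely the initial value, and the step size actually used during sampling is set by adaptation.

```r
# Upper bound on the number of leapfrog steps per iteration
max_leapfrog <- function(max_treedepth) 2^max_treedepth - 1

depths <- c(10, 15, 20)
steps <- max_leapfrog(depths)

# Longest trajectory reachable at a hypothetical adapted step size
eps <- 0.001
bounds <- data.frame(max_treedepth = depths,
                     max_leapfrog  = steps,
                     max_length    = eps * steps)
bounds
```

With the default depth of 10 the sampler can take at most 1023 steps per iteration, which at a very small step size covers only a short stretch of the posterior. This is also why the fitting time can grow so dramatically: each extra tree level doubles the possible work per iteration.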

As this uses rstan, the R interface to Stan, here is the full call:

samples_1 <- stan(model_code=model,   
                  data=data, 
                  init=myinits,  # If not specified, gives random inits
                  pars=parameters,
                  iter=myiterations, 
                  chains=3, 
                  thin=1,
                  warmup=mywarmup,  # Stands for burn-in; Default = iter/2
                  control = list(adapt_delta = 0.999, stepsize = 0.001, max_treedepth = 20)
)
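After such a run it can be worth checking which step size adaptation settled on and whether any iterations ran into the treedepth cap. The following is only a sketch: summarise_sampler is a hypothetical helper, the mock matrices contain made-up values, and the commented-out line assumes rstan is installed and that samples_1 is the object fitted above.

```r
# Per-chain summary of the adapted step size and the tree depths reached.
# `sampler_params` is a list with one matrix per chain, in the format
# returned by rstan::get_sampler_params().
summarise_sampler <- function(sampler_params, max_treedepth) {
  t(sapply(sampler_params, function(x) c(
    stepsize       = mean(x[, "stepsize__"]),
    max_depth_seen = max(x[, "treedepth__"]),
    n_at_cap       = sum(x[, "treedepth__"] >= max_treedepth)
  )))
}

# Mock sampler output for two chains (made-up values)
mock <- list(
  cbind(stepsize__ = rep(0.002, 4), treedepth__ = c(9, 11, 12, 10)),
  cbind(stepsize__ = rep(0.004, 4), treedepth__ = c(8, 9, 20, 9))
)
out <- summarise_sampler(mock, max_treedepth = 20)
out

# With the real fit:
# summarise_sampler(rstan::get_sampler_params(samples_1, inc_warmup = FALSE),
#                   max_treedepth = 20)
```

A non-zero n_at_cap would suggest that max_treedepth is still limiting the sampler for some iterations.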

With these values the traceplots of the post-warmup samples look pretty good, even for the sigma parameters, which occasionally have problems moving away from 0. As you can see from these nice plots, rstan uses ggplot2.

traceplot(samples_1, pars = c("muc", "mur", "muu", "Omega", "sigma", "lp__"))

[Figure: traceplots of the post-warmup samples]

Comments

  • Any idea when the second part of this tutorial will be available?

    Cornelius Senf, 2016-04-21
  • This is fantastic! I’ve been searching for a way to deal with divergent transitions and this is the first time I’ve seen someone recommend adjusting stepsize and max_treedepth. Thank you!

    superdayv, 2021-10-08
