Stan is probably the most interesting development in computational statistics in the last few years, at least for me. The version of Hamiltonian Monte Carlo (HMC) implemented in Stan (NUTS) is extremely efficient, and the range of probability distributions implemented in the Stan language allows one to fit an extremely wide range of models. Stan has considerably changed which models I think can realistically be estimated, both in terms of model complexity and data size. It is not an overstatement to say that Stan (and particularly rstan) has considerably changed the way I analyze data.
One of the R packages that allows one to implement Stan models in a very convenient manner, and which has created a lot of buzz recently, is brms. It allows one to specify a wide range of models using the R formula interface. Based on the formula and a specification of the model family, it generates the model code, compiles it, and then passes it together with the data to rstan for sampling. Because I usually program my models by hand (thanks to the great Stan documentation), I have so far stayed away from brms.
However, I recently learned that brms also allows the estimation of the Wiener model (i.e., the 4-parameter diffusion model) for simultaneously accounting for responses and corresponding response times in data from two-choice tasks. Such data are quite common in psychology, and the diffusion model is one of the more popular cognitive models out there. In a series of (probably 3) posts I provide an example of applying the Wiener model to some published data using brms. This first part shows how to set up and estimate the model. The second part gives an overview of model diagnostics and an assessment of model fit via posterior predictive distributions. The third part shows how to inspect and compare the posterior distributions of the parameters.
In addition to brms and a working C++ compiler, this first part also needs the package RWiener for generating the posterior predictive distribution within brms and the package rtdists for the data.
library("brms")
Data and Model
I expect the reader to already be familiar with the Wiener model and will only provide a very brief introduction here. The Wiener model is a continuous-time evidence accumulation model for binary choice tasks. It assumes that in each trial evidence is accumulated in a noisy (diffusion) process by a single accumulator. Evidence accumulation starts at the starting point and continues until the accumulator hits one of the two decision bounds, in which case the corresponding response is given. The total response time is the sum of the decision time from the accumulation process plus non-decisional components. In sum, the Wiener model allows one to decompose responses in a binary choice task and the corresponding response times into four latent processes:

- The drift rate (delta) is the average slope of the accumulation process towards the boundaries. The larger the (absolute value of the) drift rate, the stronger the evidence for the corresponding response option.
- The boundary separation (alpha) is the distance between the two decision bounds and is interpreted as a measure of response caution.
- The starting point (beta) of the accumulation process is a measure of response bias towards one of the two response boundaries.
- The non-decision time (tau) captures all non-decisional processes such as stimulus encoding and response execution.
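To build some intuition for how these four parameters jointly produce responses and response times, one can simulate from the model, for example with rwiener() from the RWiener package (a minimal sketch with arbitrary parameter values, not part of the original analysis):

library("RWiener")
set.seed(1)
# simulate 1000 trials: boundary separation alpha = 1.5, non-decision
# time tau = 0.3, unbiased starting point beta = 0.5, drift delta = 2
sim <- rwiener(n = 1000, alpha = 1.5, tau = 0.3, beta = 0.5, delta = 2)
str(sim)  # data.frame with RT (q) and boundary hit (resp: "upper"/"lower")
# a positive drift pushes most responses to the upper boundary:
prop.table(table(sim$resp))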
We will analyze part of the data from Experiment 1 of Wagenmakers, Ratcliff, Gomez, and McKoon (2008). The data come from 17 participants performing a lexical decision task in which they had to decide whether a presented letter string is a word or a non-word. Participants made their decisions either under speed or accuracy emphasis instructions in different experimental blocks. This data set comes with the rtdists package (which provides the PDF, CDF, and RNG for the full 7-parameter diffusion model). After removing some extreme RTs, we restrict the analysis to high-frequency words (frequency = high) and the corresponding high-frequency non-words (frequency = nw_high) to reduce estimation time. To set up the model we also need a numeric response variable in which 0 corresponds to responses at the lower response boundary and 1 corresponds to responses at the upper boundary. For this we transform the categorical response variable response to numeric and subtract 1, such that a word response corresponds to the lower response boundary and a nonword response to the upper boundary.
data(speed_acc, package = "rtdists")
# remove extreme RTs
speed_acc <- droplevels(speed_acc[!speed_acc$censor, ])
speed_acc <- droplevels(speed_acc[
  speed_acc$frequency %in% c("high", "nw_high"), ])
speed_acc$response2 <- as.numeric(speed_acc$response) - 1
str(speed_acc)
'data.frame':   10462 obs. of  10 variables:
 $ id       : Factor w/ 17 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ block    : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ condition: Factor w/ 2 levels "accuracy","speed": 2 2 2 2 2 2 2 2 2 2 ...
 $ stim     : Factor w/ 1611 levels "1001","1002",..: 1271 46 110 666 422 ...
 $ stim_cat : Factor w/ 2 levels "word","nonword": 2 1 1 1 1 1 2 1 1 2 ...
 $ frequency: Factor w/ 2 levels "high","nw_high": 2 1 1 1 1 1 2 1 1 2 ...
 $ response : Factor w/ 2 levels "word","nonword": 2 1 1 1 1 1 1 1 1 1 ...
 $ rt       : num 0.773 0.39 0.435 0.427 0.622 0.441 0.308 0.436 0.412 ...
 $ censor   : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ response2: num 1 0 0 0 0 0 0 0 0 0 ...
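Before modeling, it can also be helpful to glance at mean RTs and response proportions per design cell; the speed condition should show faster but less accurate responses (a quick descriptive check, not part of the original post):

# mean RT and mean response2 (i.e., proportion of upper-boundary
# "nonword" responses) per instruction condition and stimulus category
aggregate(cbind(rt, response2) ~ condition + stim_cat, 
          data = speed_acc, FUN = mean)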
Model Formula
The important decision that has to be made before setting up the model is which parameters are allowed to differ between which conditions (i.e., factor levels). One common constraint of the Wiener model (and other evidence-accumulation models) is that the parameters that are set before the evidence accumulation process starts (i.e., boundary separation, starting point, and non-decision time) cannot change based on stimulus characteristics that are not known to the participant before the start of the trial. Thus, the item type, in the present case word versus non-word, is usually only allowed to affect the drift rate. We follow this constraint. Furthermore, all four parameters are allowed to vary between the speed and accuracy conditions, as this is manipulated between blocks of trials. Also note that all relevant variables are manipulated within subjects. Thus, the maximal random-effects structure entails corresponding random-effects parameters for each fixed effect. To set up the model we need to invoke the bf() function and construct one formula for each of the four parameters of the Wiener model.
formula <- bf(rt | dec(response2) ~ 0 + condition:frequency + 
                (0 + condition:frequency|p|id), 
              bs ~ 0 + condition + (0 + condition|p|id), 
              ndt ~ 0 + condition + (0 + condition|p|id), 
              bias ~ 0 + condition + (0 + condition|p|id))
The first formula is for the drift rate and is also used for specifying the column containing the RTs (rt) and the response or decision (response2) on the left-hand side. On the right-hand side one can specify fixed effects as well as random effects in a way similar to lme4. The drift rate is allowed to vary with both variables, condition and frequency (stim_cat would be equivalent); thus we estimate fixed effects as well as random effects for both factors as well as their interaction.
We then also need to set up one formula for each of the other three parameters (which are only allowed to vary by condition). For these formulas, the left-hand side denotes the parameter names:

- bs: boundary separation (alpha)
- ndt: non-decision time (tau)
- bias: starting point (beta)
The right-hand side again specifies the fixed and random effects. Note that one common approach for setting up evidence-accumulation models is to specify that one response boundary represents correct responses and the other boundary represents incorrect responses (in contrast to the current approach, in which the response boundaries represent the two actual response options). In such a situation one cannot estimate the starting point and it needs to be fixed to 0.5 (i.e., replace the formula with bias = 0.5).
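For completeness, such an accuracy-coded model could be set up as follows. This is only a sketch; the accuracy variable acc is hypothetical and not part of the data preparation above, and this model is not estimated in this post.

# hypothetical accuracy coding: 1 if the response matches the stimulus
speed_acc$acc <- as.numeric(speed_acc$response == speed_acc$stim_cat)
formula_acc <- bf(rt | dec(acc) ~ 0 + condition:frequency + 
                    (0 + condition:frequency|p|id), 
                  bs ~ 0 + condition + (0 + condition|p|id), 
                  ndt ~ 0 + condition + (0 + condition|p|id), 
                  bias = 0.5)  # starting point fixed, not estimated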
Two further points are relevant in the formulas. First, I have used a somewhat uncommon parameterization and suppressed the intercept (e.g., ~ 0 + condition instead of ~ condition). The reason for this is that when an intercept is present, categorical variables (i.e., factors) with k levels are coded with k-1 deviation variables that represent deviations from the intercept. Thus, in a Bayesian setting one needs to consider the choice of prior for these deviation variables. In contrast, when suppressing the intercept the model can be set up such that each factor level (or design cell, in case more than one factor is involved) receives its own parameter, as done here. This essentially allows the same prior for each parameter (as long as one does not expect the parameters to vary dramatically). Furthermore, this is a common parameterization when programming a model oneself. To see the differences between the parameterizations, compare the following two calls (model.matrix is the function that creates the parameterization internally). Only the first creates a separate parameter for each condition.
unique(model.matrix(~ 0 + condition, speed_acc))
##     conditionaccuracy conditionspeed
## 36                  0              1
## 128                 1              0
unique(model.matrix(~ condition, speed_acc))
##     (Intercept) conditionspeed
## 36            1              1
## 128           1              0
Note that when more than one factor is involved and one wants to use this parameterization, one needs to combine the factors using : and not *. This can be seen when running the code below. Also note that when combining the factors with : without suppressing the intercept, the resulting model has one parameter more than can be estimated (i.e., the model matrix is rank deficient). So care needs to be taken at this step.
unique(model.matrix(~ 0 + condition:frequency, speed_acc))
unique(model.matrix(~ 0 + condition*frequency, speed_acc))
unique(model.matrix(~ condition:frequency, speed_acc))
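The rank deficiency of the last variant can be verified directly (a quick check, not part of the original post):

mm <- model.matrix(~ condition:frequency, speed_acc)
ncol(mm)     # 5 columns: the intercept plus one indicator per design cell
qr(mm)$rank  # but the rank is only 4, so one parameter is not estimable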
Second, brms formulas provide a way to estimate correlations among random-effects parameters of different formulas. To achieve this, one can place an identifier in the middle of the random-effects formula, separated by | on both sides. Correlations among random effects will then be estimated for all random-effects formulas that share the same identifier. In our case, we want to estimate the full random-effects matrix with correlations among all model parameters, following the “latent-trait approach”. We therefore place the same identifier (p) in all formulas. Thus, correlations will be estimated among all individual-level deviations across all four Wiener parameters. To estimate correlations only among the random-effects parameters of each formula, simply omit the identifier (e.g., (0 + condition|id)). Furthermore, note that brms, similar to afex, supports suppressing the correlations among categorical random-effects parameters via || (e.g., (0 + condition||id)).
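To make the three variants concrete, this is how the boundary-separation formula alone would look under each of them (only the first is used here):

bs ~ 0 + condition + (0 + condition |p| id)  # correlations across all formulas sharing "p"
bs ~ 0 + condition + (0 + condition | id)    # correlations only among the bs random effects
bs ~ 0 + condition + (0 + condition || id)   # no correlations among the bs random effects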
Family, Link-Functions, and Priors
The next step is to set up the priors. For this we can invoke the get_prior function. This function requires one to specify the formula, data, and family of the model. family is the argument where we tell brms that we want to use the wiener model. We also use it to specify the link functions for the four Wiener parameters. Because the drift rate can take on any value (i.e., from -Inf to Inf), the default link function is "identity" (i.e., no transformation), which we retain. The other three parameters all have a restricted range. The boundary separation needs to be larger than 0, the non-decision time needs to be larger than 0 and smaller than the smallest RT, and the starting point needs to be between 0 and 1. The default link functions respect these constraints and use "log" for the first two parameters and "logit" for the bias. This certainly is a possibility, but it has a number of drawbacks that lead me to use the "identity" link function for all parameters. First, when parameters are transformed, the priors need to be specified on the untransformed scale. Second, the individual-level deviations (i.e., the random-effects estimates) are assumed to come from a multivariate normal distribution. Parameter transformations would entail that these individual deviations are only normally distributed on the untransformed scale. Likewise, the correlations of parameter deviations across parameters would also be on the untransformed scale. Both make the interpretation of the random effects difficult.
When specifying the parameters without transformation (i.e., link = "identity"), care must be taken that the priors place most mass on values inside the allowed range. Likewise, starting values need to be inside the allowed range. Using the identity link function also comes with drawbacks discussed at the end. However, as long as parameters outside the allowed range occur only rarely, such a model can converge successfully, and it makes the interpretation easier.
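For example, for the priors defined below we can check how much mass falls outside the allowed range (a sanity check, not part of the original post):

# mass the normal(0.2, 0.1) prior for ndt places on impossible negative values:
pnorm(0, mean = 0.2, sd = 0.1)  # ~0.023
# mass the normal(0.5, 0.2) prior for bias places outside [0, 1]:
pnorm(0, 0.5, 0.2) + pnorm(1, 0.5, 0.2, lower.tail = FALSE)  # ~0.012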
The get_prior function returns a data.frame containing all parameters of the model. If parameters have default priors, these are listed as well. One needs to define priors either for individual parameters, for parameter classes, or for parameter classes of specific groups or dpars. Note that all parameters that do not have a default prior should receive a specific prior.
get_prior(formula, 
          data = speed_acc, 
          family = wiener(link_bs = "identity", 
                          link_ndt = "identity", 
                          link_bias = "identity"))
[Two empty columns to the right were removed from the following output.]
                 prior class                                coef group resp dpar
1                          b                                                    
2                          b     conditionaccuracy:frequencyhigh                
3                          b  conditionaccuracy:frequencynw_high                
4                          b        conditionspeed:frequencyhigh                
5                          b     conditionspeed:frequencynw_high                
6               lkj(1)   cor                                                    
7                        cor                                        id          
8  student_t(3, 0, 10)    sd                                                    
9                         sd                                        id          
10                        sd     conditionaccuracy:frequencyhigh    id          
11                        sd  conditionaccuracy:frequencynw_high    id          
12                        sd        conditionspeed:frequencyhigh    id          
13                        sd     conditionspeed:frequencynw_high    id          
14                         b                                                bias
15                         b                   conditionaccuracy           bias
16                         b                      conditionspeed           bias
17 student_t(3, 0, 10)    sd                                                bias
18                        sd                                        id     bias
19                        sd                   conditionaccuracy    id     bias
20                        sd                      conditionspeed    id     bias
21                         b                                                  bs
22                         b                   conditionaccuracy             bs
23                         b                      conditionspeed             bs
24 student_t(3, 0, 10)    sd                                                  bs
25                        sd                                        id       bs
26                        sd                   conditionaccuracy    id       bs
27                        sd                      conditionspeed    id       bs
28                         b                                                 ndt
29                         b                   conditionaccuracy            ndt
30                         b                      conditionspeed            ndt
31 student_t(3, 0, 10)    sd                                                 ndt
32                        sd                                        id      ndt
33                        sd                   conditionaccuracy    id      ndt
34                        sd                      conditionspeed    id      ndt
Priors can be defined with the prior or set_prior function, allowing different levels of control. One benefit of the way the model is parameterized is that we only need to specify priors for one set of parameters per Wiener parameter (i.e., class b) and do not have to distinguish between the intercept and other parameters.
We base our choice of priors on prior knowledge of likely parameter values for the Wiener model, but otherwise try to specify them in a weakly informative manner. That is, they should restrict the range to likely values but not affect the estimation any further. For the drift rate we use a Cauchy distribution with location 0 and scale 5, so that roughly 70% of the prior mass is between -10 and 10. For the boundary separation we use a normal prior with mean 1.5 and standard deviation 1, for the non-decision time a normal prior with mean 0.2 and standard deviation 0.1, and for the bias a normal prior with mean 0.5 (i.e., no bias) and standard deviation 0.2.
prior <- c(
  prior("cauchy(0, 5)", class = "b"),
  set_prior("normal(1.5, 1)", class = "b", dpar = "bs"),
  set_prior("normal(0.2, 0.1)", class = "b", dpar = "ndt"),
  set_prior("normal(0.5, 0.2)", class = "b", dpar = "bias")
)
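The 70% figure for the drift-rate prior can be verified directly (not part of the original post):

# prior mass of cauchy(0, 5) between -10 and 10:
pcauchy(10, location = 0, scale = 5) - pcauchy(-10, location = 0, scale = 5)
## [1] 0.7048328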
With this information we can use the make_stancode function and inspect the full model code. The important thing is to make sure that all parameters listed in the parameters block have a prior listed in the model block. We can also see, at the beginning of the model block, that none of our parameters is transformed, just as desired (a bug in a previous version of brms prevented anything but the default links for the Wiener model parameters).
make_stancode(formula, 
              family = wiener(link_bs = "identity", 
                              link_ndt = "identity", 
                              link_bias = "identity"), 
              data = speed_acc, 
              prior = prior)
// generated with brms 1.10.2
functions { 
  /* Wiener diffusion log-PDF for a single response
   * Args: 
   *   y: reaction time data
   *   dec: decision data (0 or 1)
   *   alpha: boundary separation parameter > 0
   *   tau: non-decision time parameter > 0
   *   beta: initial bias parameter in [0, 1]
   *   delta: drift rate parameter
   * Returns:  
   *   a scalar to be added to the log posterior 
   */ 
  real wiener_diffusion_lpdf(real y, int dec, real alpha, 
                             real tau, real beta, real delta) { 
    if (dec == 1) { 
      return wiener_lpdf(y | alpha, tau, beta, delta); 
    } else { 
      return wiener_lpdf(y | alpha, tau, 1 - beta, - delta); 
    } 
  } 
} 
data { 
  int<lower=1> N;  // total number of observations 
  vector[N] Y;  // response variable 
  int<lower=1> K;  // number of population-level effects 
  matrix[N, K] X;  // population-level design matrix 
  int<lower=1> K_bs;  // number of population-level effects 
  matrix[N, K_bs] X_bs;  // population-level design matrix 
  int<lower=1> K_ndt;  // number of population-level effects 
  matrix[N, K_ndt] X_ndt;  // population-level design matrix 
  int<lower=1> K_bias;  // number of population-level effects 
  matrix[N, K_bias] X_bias;  // population-level design matrix 
  // data for group-level effects of ID 1 
  int<lower=1> J_1[N]; 
  int<lower=1> N_1; 
  int<lower=1> M_1; 
  vector[N] Z_1_1; 
  vector[N] Z_1_2; 
  vector[N] Z_1_3; 
  vector[N] Z_1_4; 
  vector[N] Z_1_bs_5; 
  vector[N] Z_1_bs_6; 
  vector[N] Z_1_ndt_7; 
  vector[N] Z_1_ndt_8; 
  vector[N] Z_1_bias_9; 
  vector[N] Z_1_bias_10; 
  int<lower=1> NC_1; 
  int<lower=0,upper=1> dec[N];  // decisions 
  int prior_only;  // should the likelihood be ignored? 
} 
transformed data { 
  real min_Y = min(Y); 
} 
parameters { 
  vector[K] b;  // population-level effects 
  vector[K_bs] b_bs;  // population-level effects 
  vector[K_ndt] b_ndt;  // population-level effects 
  vector[K_bias] b_bias;  // population-level effects 
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations 
  matrix[M_1, N_1] z_1;  // unscaled group-level effects 
  // cholesky factor of correlation matrix 
  cholesky_factor_corr[M_1] L_1; 
} 
transformed parameters { 
  // group-level effects 
  matrix[N_1, M_1] r_1 = (diag_pre_multiply(sd_1, L_1) * z_1)'; 
  vector[N_1] r_1_1 = r_1[, 1]; 
  vector[N_1] r_1_2 = r_1[, 2]; 
  vector[N_1] r_1_3 = r_1[, 3]; 
  vector[N_1] r_1_4 = r_1[, 4]; 
  vector[N_1] r_1_bs_5 = r_1[, 5]; 
  vector[N_1] r_1_bs_6 = r_1[, 6]; 
  vector[N_1] r_1_ndt_7 = r_1[, 7]; 
  vector[N_1] r_1_ndt_8 = r_1[, 8]; 
  vector[N_1] r_1_bias_9 = r_1[, 9]; 
  vector[N_1] r_1_bias_10 = r_1[, 10]; 
} 
model { 
  vector[N] mu = X * b; 
  vector[N] bs = X_bs * b_bs; 
  vector[N] ndt = X_ndt * b_ndt; 
  vector[N] bias = X_bias * b_bias; 
  for (n in 1:N) { 
    mu[n] = mu[n] + (r_1_1[J_1[n]]) * Z_1_1[n] + (r_1_2[J_1[n]]) * Z_1_2[n] 
            + (r_1_3[J_1[n]]) * Z_1_3[n] + (r_1_4[J_1[n]]) * Z_1_4[n]; 
    bs[n] = bs[n] + (r_1_bs_5[J_1[n]]) * Z_1_bs_5[n] 
            + (r_1_bs_6[J_1[n]]) * Z_1_bs_6[n]; 
    ndt[n] = ndt[n] + (r_1_ndt_7[J_1[n]]) * Z_1_ndt_7[n] 
             + (r_1_ndt_8[J_1[n]]) * Z_1_ndt_8[n]; 
    bias[n] = bias[n] + (r_1_bias_9[J_1[n]]) * Z_1_bias_9[n] 
              + (r_1_bias_10[J_1[n]]) * Z_1_bias_10[n]; 
  } 
  // priors including all constants 
  target += cauchy_lpdf(b | 0, 5); 
  target += normal_lpdf(b_bs | 1.5, 1); 
  target += normal_lpdf(b_ndt | 0.2, 0.1); 
  target += normal_lpdf(b_bias | 0.5, 0.2); 
  target += student_t_lpdf(sd_1 | 3, 0, 10) 
            - 10 * student_t_lccdf(0 | 3, 0, 10); 
  target += lkj_corr_cholesky_lpdf(L_1 | 1); 
  target += normal_lpdf(to_vector(z_1) | 0, 1); 
  // likelihood including all constants 
  if (!prior_only) { 
    for (n in 1:N) { 
      target += wiener_diffusion_lpdf(Y[n] | dec[n], bs[n], ndt[n], 
                                      bias[n], mu[n]); 
    } 
  } 
} 
generated quantities { 
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1); 
  vector<lower=-1,upper=1>[NC_1] cor_1; 
  // take only relevant parts of correlation matrix 
  cor_1[1] = Cor_1[1,2]; 
  [...] 
  cor_1[45] = Cor_1[9,10]; 
}
[The output was slightly modified.]
The last piece we need before we can finally estimate the model is a function that generates initial values. Without initial values that lead to an identifiable model for all data points, estimation will not start. The function needs to provide initial values for all parameters listed in the parameters block of the model. Note that many of those parameters have at least one dimension with a parameterized extent (e.g., K). We can use make_standata to create the data set used by brms for the estimation and obtain the necessary information from it. We then use this data object (i.e., a list) to generate correctly sized initial values in the function initfun (note that initfun relies on the fact that tmp_dat is in the global environment, which is something of a code smell).
tmp_dat <- make_standata(formula, 
                         family = wiener(link_bs = "identity", 
                                         link_ndt = "identity", 
                                         link_bias = "identity"), 
                         data = speed_acc, prior = prior)
str(tmp_dat, 1, give.attr = FALSE)
## List of 26
##  $ N          : int 10462
##  $ Y          : num [1:10462(1d)] 0.773 0.39 0.435 ...
##  $ K          : int 4
##  $ X          : num [1:10462, 1:4] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_1      : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_2      : num [1:10462(1d)] 0 1 1 1 1 1 0 1 1 0 ...
##  $ Z_1_3      : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_4      : num [1:10462(1d)] 1 0 0 0 0 0 1 0 0 1 ...
##  $ K_bs       : int 2
##  $ X_bs       : num [1:10462, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_bs_5   : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_bs_6   : num [1:10462(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  $ K_ndt      : int 2
##  $ X_ndt      : num [1:10462, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_ndt_7  : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_ndt_8  : num [1:10462(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  $ K_bias     : int 2
##  $ X_bias     : num [1:10462, 1:2] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_bias_9 : num [1:10462(1d)] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_1_bias_10: num [1:10462(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  $ J_1        : int [1:10462(1d)] 1 1 1 1 1 1 1 1 1 1 ...
##  $ N_1        : int 17
##  $ M_1        : int 10
##  $ NC_1       : num 45
##  $ dec        : num [1:10462(1d)] 1 0 0 0 0 0 0 0 0 0 ...
##  $ prior_only : int 0

initfun <- function() {
  list(
    b = rnorm(tmp_dat$K),
    b_bs = runif(tmp_dat$K_bs, 1, 2),
    b_ndt = runif(tmp_dat$K_ndt, 0.1, 0.15),
    b_bias = rnorm(tmp_dat$K_bias, 0.5, 0.1),
    sd_1 = runif(tmp_dat$M_1, 0.5, 1),
    z_1 = matrix(rnorm(tmp_dat$M_1*tmp_dat$N_1, 0, 0.01), 
                 tmp_dat$M_1, tmp_dat$N_1),
    L_1 = diag(tmp_dat$M_1)
  )
}
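A quick way to confirm that the generated values have the required shapes is to inspect a single draw (not part of the original post):

# one draw of initial values; the dimensions should match the
# corresponding entries (K, K_bs, K_ndt, K_bias, M_1, N_1) in tmp_dat
str(initfun())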
Estimation (i.e., Sampling)
Finally, we have all pieces together and can estimate the Wiener model using the brm function. Note that this will take roughly a full day, or longer, depending on the speed of your PC. We also already increase the maximal treedepth to 15. We probably should have also increased adapt_delta above the default value of .8, as there are a few divergent transitions, but this is left as an exercise for the reader.

After estimation is finished, we see that there are a few (< 10) divergent transitions. If this were a real analysis and not only an example, we would need to increase adapt_delta to a larger value (e.g., .95 or .99) and rerun the estimation. In this case, however, we immediately proceed to the second step and obtain samples from the posterior predictive distribution using predict. For this it is important to specify the number of posterior samples (here we use 500). In addition, it is important to set summary = FALSE, to obtain the actual posterior predictive distribution and not a summary of it, and negative_rt = TRUE. The latter ensures that predicted responses at the lower boundary receive a negative sign whereas predicted responses at the upper boundary receive a positive sign.
fit_wiener <- brm(formula, 
                  data = speed_acc,
                  family = wiener(link_bs = "identity", 
                                  link_ndt = "identity", 
                                  link_bias = "identity"),
                  prior = prior, inits = initfun,
                  iter = 1000, warmup = 500, 
                  chains = 4, cores = 4, 
                  control = list(max_treedepth = 15))
NPRED <- 500
pred_wiener <- predict(fit_wiener, 
                       summary = FALSE, 
                       negative_rt = TRUE, 
                       nsamples = NPRED)
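The resulting object should be a 500 x 10462 matrix of signed RTs, one row per posterior draw and one column per observation (a quick check, not part of the original post):

dim(pred_wiener)  # should be c(NPRED, nrow(speed_acc))
# negative values are predicted lower-boundary ("word") responses:
mean(pred_wiener < 0)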
Because both steps are quite time-intensive (estimation one day, obtaining the posterior predictives a few hours), we save the results of both steps. Given the comparatively large size of both objects, using 'xz' compression (i.e., the strongest in R) seems like a good idea.
save(fit_wiener, file = "brms_wiener_example_fit.rda", 
     compress = "xz")
save(pred_wiener, file = "brms_wiener_example_predictions.rda", 
     compress = "xz")
The second part shows how to perform model diagnostics and how to assess the model fit. The third part shows how to test for differences in parameters between conditions.