Review of: Parsimonious mixed models
By: Douglas Bates, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen
Reviewed by: Henrik Singmann

The word has been out for a while that the above authors have been working on a response to the influential paper by Barr, Levy, Scheepers, and Tily (2013, JML), in which the latter recommended always fitting the "maximal random effects structure" in linear mixed models (LMMs). Hence, I was eagerly awaiting this manuscript, given that the recommendation to always fit the maximal model (a) needs to fail in some cases (at least in constructed pathological ones) and (b) can lead to convergence problems (e.g., as indicated by the number of questions on this issue on the relevant mailing lists). Unfortunately, I do not find the current manuscript fully convincing in addressing the central issue of how to specify a random effects structure when the main interest lies in the fixed effects estimates. Furthermore, I feel that the inclusion of another model class (GAMMs) and the discussion of auto-correlated errors distract too strongly from this point and leave the impression that the manuscript is all over the place.

Having said that, I do find the manuscript extremely interesting and would hope to see a version of it published in JML. In particular, the strategy for reducing the random effects structure is very helpful, seems well-principled, and will, I believe, have a great impact. I suggest the authors focus more strongly on this method, give broader recommendations, and perhaps also include some relevant simulations.

## Main points:

1. I agree with the authors that in most cases in which mixed models are currently employed in Psychology, the focus is on assessing the contribution of the fixed effects and not the random effects. Interestingly, in all examples given in the manuscript, the fixed effects estimates and associated standard errors of the maximal and the final or optimal models are virtually identical (see Figures 1 and 4). There does not seem to be any measurable upside to using the optimal model for this question. Nevertheless, the authors take these results as supporting their conclusion that "it is not necessary to aim for maximality when the interest is in a confirmatory analysis of factorial contrasts" (p. 33). I suggest that, quite to the contrary, these results also support an extension of the original Barr et al. recommendation: even if the maximal model is overparameterized (i.e., some random effects parameters are not identified), its use is a safeguard against Type I errors. From the current manuscript I am not convinced that the proposed reduction method can provide this same level of security (i.e., a conservative approach avoiding Type I errors). Additionally, as the proposed method of iteratively pruning the random effects structure starts with the maximal one, the estimates of the maximal model are already available to the researcher. Why should those be discarded? And what if the estimates (or decisions) from the maximal model disagree with those from the optimal model? The main reasons given by the authors for using the optimal model are: (a) "We stress the importance of bringing the information in the data and factorial LMM model complexity in agreement, which typically leads to more parsimonious models with a smaller than theoretically possible number of model parameters." (p. 5) and (b) "These parameters should be excluded from the model on grounds of parsimony." (p. 17).

While I can clearly see the virtue of these arguments, avoiding Type I errors is so important (especially in light of the current crisis of confidence in Psychology) that I feel uncomfortable with the idea of adopting a new strategy based solely on theoretical arguments. I would be more convinced if the authors could also provide simulation results showing that their proposed method does not increase Type I errors. Given that the authors raise justified criticism of the simulations by Barr et al., this seems especially desirable. I am aware that it might not be feasible to perform the proposed iterative reduction procedure for each synthetic data set in a simulation study, but it might be enough to include factors for which the random variation is minimal and show that models without those random effects perform equally well (and might even have fewer convergence issues); a minimal sketch of such a simulation is given directly below this point. It might even be possible to include factors in the data generation process that are completely unaccounted for in the modeling step, to simulate the issues mentioned in the section "Hidden complexities" (but see below).
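To make this suggestion concrete, the following is only a minimal sketch of such a check, not the authors' procedure; all names and parameter values are invented for illustration. Data are generated with a negligible random slope for a two-level within-subject factor and no corresponding fixed effect, and the maximal and the reduced LMM are then compared with respect to their empirical Type I error rate for that fixed effect.

```r
library(lme4)

## one synthetic data set and two model fits; all parameter values are
## arbitrary illustrations (near-zero random slope variance, no fixed effect)
simulate_once <- function(n_subj = 30, n_item = 20) {
  d <- expand.grid(subj = factor(seq_len(n_subj)),
                   item = factor(seq_len(n_item)),
                   cond = c(-0.5, 0.5))               # centered two-level factor
  d$y <- rnorm(n_subj, sd = 1.0)[d$subj] +            # subject intercepts
         rnorm(n_item, sd = 0.5)[d$item] +            # item intercepts
         rnorm(n_subj, sd = 0.01)[d$subj] * d$cond +  # near-zero random slope
         rnorm(nrow(d), sd = 2.0)                     # residual noise
  ## the maximal fit may trigger convergence/singularity warnings, which is
  ## part of the point of the exercise
  m_max  <- lmer(y ~ cond + (1 + cond | subj) + (1 | item), d, REML = FALSE)
  m_max0 <- lmer(y ~ 1    + (1 + cond | subj) + (1 | item), d, REML = FALSE)
  m_red  <- lmer(y ~ cond + (1 | subj) + (1 | item), d, REML = FALSE)
  m_red0 <- lmer(y ~ 1    + (1 | subj) + (1 | item), d, REML = FALSE)
  ## likelihood-ratio tests for the fixed effect of cond
  c(maximal = anova(m_max0, m_max)[2, "Pr(>Chisq)"],
    reduced = anova(m_red0, m_red)[2, "Pr(>Chisq)"])
}

set.seed(1)
res <- replicate(200, simulate_once())  # 200 replications, for illustration only
rowMeans(res < .05)                     # empirical Type I error rates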
2. In addition to the question of how well the proposed method behaves in terms of Type I errors, I feel some important issues regarding it are not discussed extensively enough. These are:

(a) Convergence failures are one reason for reducing the random effects structure. The new version of lme4 (i.e., > 1.0) contains several convergence checks producing specific warnings (e.g., "unable to evaluate scaled gradient", "Model failed to converge: degenerate Hessian with X negative eigenvalues", "Model failed to converge with max|grad| ..."). To develop an understanding of the severity of those warnings, it would be very helpful to include a list of the different warnings with an explanation of what each of them means. This should hopefully allow more researchers to develop a "statistical understanding" (p. 9) of the problem discussed here.

(b) A central suggestion is to remove the correlation parameters among random effects. For factors (arguably the most common case in experiments) the "||" syntax is unusable, as noted (p. 8). The solution to this is of course to first create the model matrix and then use the variables of the model matrix directly as (additive) random effects (a sketch of this workaround follows below). This information is, however, hidden behind a link.

(c) How should one deal with factors with more than two levels if only one of the parameters representing the factor is not identified? When applying the proposed method to one of my data sets, the situation arose that one of two variables representing a factor with three levels was clearly not identified (estimated at virtually zero and indicated by rePCA) whereas the other clearly was. Removing this factor entirely decreased fit significantly, while including it led to convergence warnings.

(d) Similar to (c): how should one deal with the situation that a main effect is clearly not identified while the interaction clearly is?
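To make point (b) concrete, here is a minimal sketch of the workaround as I understand it; the data, the three-level factor, and the contrast coding are invented for illustration, and the response contains only subject variation and noise, so only the model syntax matters. The factor is expanded into numeric contrast columns via model.matrix(), and those columns are then entered as uncorrelated random slopes.

```r
library(lme4)

## toy data with a three-level within-subject factor
d <- expand.grid(subj = factor(1:20), item = factor(1:12),
                 cond = factor(c("a", "b", "c")))
contrasts(d$cond) <- contr.sum(3)          # e.g., sum-to-zero contrasts
d$y <- rnorm(20, sd = 1)[d$subj] + rnorm(nrow(d))

## expand the factor into numeric contrast columns (dropping the intercept)
mm   <- model.matrix(~ cond, d)
d$c1 <- mm[, 2]
d$c2 <- mm[, 3]

## zero-correlation-parameter model: with numeric covariates "||" behaves as
## intended, which is not the case for the factor itself
m_zcp <- lmer(y ~ cond + (c1 + c2 || subj) + (1 | item), d)

## the same model written out explicitly
m_zcp2 <- lmer(y ~ cond + (1 | subj) + (0 + c1 | subj) + (0 + c2 | subj) +
                 (1 | item), d)
```

Once the factor is expanded this way, individual variance components (e.g., only the one for c2) can also be removed without dropping the factor from the random effects structure entirely, which relates to points (c) and (d).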
3. RePsychLing package. When installing the RePsychLing package from GitHub (which may already be difficult for many users, as it requires devtools), I cannot open the vignettes from within R. Running, e.g., vignette("KKL") returns: "Warning message: vignette ‘KKL’ has no PDF/HTML." Getting to the vignettes was somewhat complicated (one has to either use a GitHub HTML viewer or clone the project). I am also not very happy that the rePCA function, which is central to the manuscript, is not part of a package available on CRAN. Only distribution via CRAN makes a function accessible to the widest possible range of users.

4. Hidden complexities. The section devoted to hidden complexities contains many interesting and potentially very useful suggestions and observations. Nevertheless, I feel that it does not really fit into the manuscript. The main reasons are that it does not really deal with the question addressed in the remainder of the manuscript, namely how to specify random effects in linear mixed models (fitted with lme4), and that it includes many new and complicated ideas without sufficiently introducing them (e.g., what actually is a generalized additive model? what are those wiggly lines?). In the end, this section focuses completely on models fitted with mgcv, and it might be better to write a separate manuscript devoted to introducing GAMMs to Psychology (i.e., similar to what Baayen, Davidson, & Bates, 2008, is for LMMs).

## Note

It should be noted that I have already made some of the points mentioned here publicly on the lme4 mailing list: http://thread.gmane.org/gmane.comp.lang.r.lme4.devel/13444/focus=13457 Importantly, Jake Westfall has also added a review of the current manuscript there, which I do not want to copy here but would suggest taking into account when revising the paper, see http://thread.gmane.org/gmane.comp.lang.r.lme4.devel/13444/focus=13458

## Minor Points:

- p. 10: The main effect of Load is introduced with the letter (L) but given in the formula with (supposedly) the letter C.
- p. 14, section "Dropping variance components to ...": I find it difficult to understand which variance components "5 and 7 for SubjID and 1 and 4 for ItemID" are. I would suggest using the names of the components instead.
- p. 14, section "Extending the reduced LMM": I do not understand why the inclusion of the correlation provides evidence for a reliable difference between items in the precedence effect. Is the fact that the corresponding variance component is included in the final model not evidence enough for this?
- p. 15: "In a Bayesian linear mixed model, all parameters have a prior distribution defined over them; this is in contrast to the frequentist approach, where each parameter is assumed to have a fixed but unknown value." I think the comparison between the Bayesian and the frequentist approach necessarily requires the inclusion of the posterior distribution, as this is the Bayesian equivalent of the frequentist point estimate. In other words, I think this sentence is not completely correct as it stands.
- p. 16: The uniform prior on the standard deviation is most likely an "improper prior" from 0 to infinity. Perhaps it makes sense to explicitly mention this here, as it is somewhat uncommon (e.g., compared to a gamma prior).
- p. 17: I think it would be helpful if the abbreviations for the factors used later on were given when discussing the design.
- p. 18: "maxfun < 10 * length(par)^2 is not recommended". The statement that this warning indicates that we may be asking too much of these data seems incorrect. Given enough data, one can fit models with as many parameters as the one discussed here. However, maxfun does not scale with the number of data points and is fixed to an arbitrary default value. Consequently, even if we had enough data to fit this model and all its parameters, this warning would appear (a short sketch of how to raise this limit is given at the end of this review).
- p. 23, "Figure 6 shows two subjects with minor autocorrelations, a subject with autocorrelations at short lags, and two subjects with autocorrelation that persist across many lags.": I think it would be helpful if this sentence started with "The upper row of Figure 6" or something similar.
- p. 26, "initial linear mixed model": It is unclear what is meant by this, the maximal model or the optimal model? If the latter, I would suggest always using "optimal LMM" instead, to be consistent with the previous sections.
- p. 26, "we have moved from a model that explains 48% of the variance to a model that explains 54% of the variance": Variance explained, or R², is not a trivial issue, at least for GLMMs, and various formulas exist (see the lme4-faq; Nakagawa & Schielzeth, 2013; Johnson, 2014). I always thought this also applied to LMMs. It would therefore be helpful to know which R² formula is used here and whether others exist.
- p. 29, Figure 9: The text says that "for the fixed-effect factors, removal of a predictor involved removal of all contrasts and random-effect terms for that predictor". Nevertheless, Figure 9 shows two tests of, e.g., orn: the fixed effect test (ORN) and the random effect test (orn). This seems somewhat contradictory. Furthermore, when the interest is in assessing the fixed effect estimates (e.g., "Does orn have an overall effect across participants?"), withholding only the fixed effect seems to be the appropriate test.
- p. 34: It is unclear what the difference between Bates et al. (2014a) and Bates et al. (2015) is. It is perhaps better to cite only one of them.
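Finally, to make the maxfun point (p. 18 above) concrete: as far as I understand it, the warning stems from the optimizer's limit on the number of function evaluations, which does not grow with the size of the data but can simply be raised via lmerControl(). A minimal sketch, in which the formula, the data object, and the chosen limit are placeholders:

```r
library(lme4)

## raise the maximum number of function evaluations for the bobyqa optimizer;
## 1e5 is an arbitrary illustrative value
ctrl <- lmerControl(optimizer = "bobyqa",
                    optCtrl = list(maxfun = 1e5))

## m <- lmer(y ~ A * B + (1 + A * B | subj), data = dat, control = ctrl)
```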