A close look at some linear model MCMC comparisons

This is our second blog post taking a careful look at some of the results posted in an arXiv manuscript by Beraha, Falco, and Guglielmi (BFG). They compare JAGS, Stan, and NIMBLE using four examples. In their results, each package performs best in at least one example.

In our previous post, we explained that they compared apples to oranges in the accelerated failure time (AFT) example. They gave Stan a different and easier problem than they gave JAGS and NIMBLE. When we gave NIMBLE the same problem, we saw that its MCMC performance was up to 45 times better than what they reported. We looked first at the AFT example because that’s where NIMBLE seemed to perform comparatively worst.

In this post we’re looking at the simple linear model example. It turns out that the models were written more efficiently for Stan than for JAGS and NIMBLE, because matrix multiplication was used for Stan but all scalar steps of matrix multiplication were written in JAGS and NIMBLE. JAGS and NIMBLE do support matrix multiplication and inner products. When we modify the models to also use matrix multiplication, NIMBLE’s MCMC performance with default samplers increases often by 1.2 to 3-fold but sometimes by 5 to >10-fold over what was reported by BFG, as far as we can tell. This had to do with both raw computational efficiency and also MCMC samplers invoked by different ways to write the model code. Other issues are described below.

BFG’s linear model examples explore different data sizes (n = 30, 100, 1000, or in one case 2000), different numbers of explanatory variables (4, 16, 30, 50 or 100), and different priors for the variance and/or coefficients (beta[i]s), all in a simple linear model. The priors included:

  • “LM-C”: an inverse gamma prior for variance, which is used for both residual variance and variance of normal priors for beta[i]s (regression coefficients). This setup should offer conjugate sampling for both the variance parameter and the beta[i]s.
  • “LM-C Bin”: the same prior as “LM-C”. This case has Bernoulli- instead of normally-distributed explanatory variables in the data simulations. It’s very similar to “LM-C”.
  • “LM-WI”: A weakly informative (“WI”) prior for residual standard deviation using a truncated, scaled t-distribution. beta[i]s have a non-informative (sd = 100) normal prior.
  • “LM-NI”: A non-informative (“NI”) flat prior for residual standard deviation. beta[i]s have a non-informative (sd = 100) normal prior.
  • “LM-L”: A lasso (“L”) prior for beta[i]s. This uses a double-exponential prior for beta[i]s, with a parameter that itself follows an exponential prior. This prior is a Bayesian analog of the lasso for variable selection, so the scenarios used for this have large numbers of explanatory variables, with different numbers of them (z) set to 0 in the simulations. Residual variance has an inverse gamma prior.

Again, we are going to stick to NIMBLE here and not try to reproduce or explore results for JAGS or Stan.

In more detail, the big issues that jumped out from BFG’s code are:

  1. Stan was given matrix multiplication for X %*% beta, while NIMBLE and JAGS were given code to do all of the element-by-element steps of matrix multiplication. Both NIMBLE and JAGS support matrix multiplication and inner products, so we think it is better and more directly comparable to use these features.
  2. For the “LM-C” and “LM-C Bin” cases, the prior for the beta[i]s was given as a multivariate normal with a diagonal covariance matrix. It is better (and equivalent) to give each element a univariate normal prior.

There are two reasons that writing out matrix multiplication as they did is not a great way to code a model. The first is that it is just inefficient. For X that is N-by-p and beta that is p-by-1, there are N*p scalar multiplications and N summations of length p in the model code. Although somewhere in the computer those elemental steps need to be taken, they will be substantially faster if not broken up by hand-coding them. When NIMBLE generates (and then compiles) C++, it generates C++ for the Eigen linear algebra library, which gives efficient implementations of matrix operations.

The second reason, however, may be more important in this case. Using either matrix multiplication or inner products makes it easier for NIMBLE to determine that the coefficients (“beta[i]”s) in many of these cases have conjugate relationships that can be used for Gibbs sampling. The way BFG wrote the model revealed to us that we’re not detecting the conjugacy in this case. That’s something we plan to fix, but it’s not a situation that’s come before us yet. Detecting conjugacy in a graphical model — as written in the BUGS/JAGS/NIMBLE dialects of the BUGS language — involves symbolic algebra, so it’s difficult to catch all cases.

The reasons it’s better to give a set of univariate normal priors than a single multivariate normal are similar. It’s more computationally efficient, and it makes it easier to detect conjugacy.

In summary, they wrote the model inefficiently for NIMBLE and differently between packages, and we didn’t detect conjugacy for the way they wrote it. In the results below, the “better” results use matrix multiplication directly (in all cases) and use univariate normal priors instead of a multivariate normal (in the “LM-C” and “LM-C Bin” cases).

It also turns out that neither JAGS nor NIMBLE detects conjugacy for the precision parameter of the “LM-C” and “LM-C Bin” cases. (This is shown by list.samplers in rjags and configureMCMC in NIMBLE.) In NIMBLE, a summary of how conjugacy is determined is in Table 7.1 of our User Manual.  It can be obtained by changing sd = sigma to var = sigmasq in one line of BFG’s code. In these examples, we found that this issue doesn’t make much different to MCMC efficiency, so we leave it as they coded it.

Before giving our results, we’ll make a few observations on BFG’s results, shown in their Table 2. One is that JAGS gives very efficient sampling for many of these cases, and that’s something we’ve seen before. Especially when conjugate sampling is available, JAGS does well. Next is that Stan and NIMBLE each do better than the other in some cases. As we wrote about in the previous post, BFG chose not to calculate what we see as the most relevant metric for comparison. That is the rate of generating effectively independent samples, the ESS/time, which we call MCMC efficiency. An MCMC system can be efficient by slowly generating well-mixed samples or by rapidly generating poorly-mixed samples. One has to make choices such as whether burn-in (or warmup) time is counted in the denominator, depending on exactly what is of interest. BFG reported only ESS/recorded iterations and total iterations/time. The product of these is a measure of ESS/time, scaled by a ratio of total iterations / recorded iterations.

For example, in the “LM-C” case with “N = 1000, p = 4”, Stan has (ESS/recorded iterations) * (total iterations/time) = 0.99 * 157=155, while NIMBLE has 0.14 * 1571=220. Thus in this case NIMBLE is generating effectively independent samples faster than Stan, because the faster computation out-weighs the poorer mixing. In other cases, Stan has higher ESS/time than NIMBLE. When BFG round ESS/recorded iterations to “1%” in some cases, the ESS/time is unknown up to a factor of 3 because 1% could be rounded from 0.50 or from 1.49. For most cases, Stan and NIMBLE are within a factor of 2 of each other, which is close. One case where Stan really stands out is the non-informative prior (LM-NI) with p>n, but it’s worth noting that this is a statistically unhealthy case. With p>n, parameters are not identifiable without the help of a prior. In the LM-NI case, the prior is uninformative, and the posteriors for beta[i]s are not much different than their priors.

One other result jumps out as strange from their Table 2. The run-time results for “LM-WI” (total iterations / time) are much, much slower than in other cases. For example, with N = 100 and p = 4, this case was only 2.6% (294 vs 11,000 ) as fast as the corresponding “LM-C” case. We’re not sure how that could make sense, so it was something we wanted to check.

We took all of BFG’s source code and organized it to be more fully reproducible. After our previous blog post, set.seed calls were added to their source code, so we use those. We also organize the code into functions and sets of runs to save and process together. We think we interpreted their code correctly, but we can’t be sure. For ESS estimation, we used coda::effectiveSize, but Stan and mcmcse are examples of packages with other methods, and we aren’t sure what BFG used. They thin by 2 and give average results for beta[i]s. We want to compare to their results, so we take those steps too.

Here are the results:

BFG
Better code
Improvement
ESS/Ns Nit/t ESS/t ESS/Ns Nit/t ESS/t Better by
LM-C
N=100, p=4 0.15 56122.45 3738.90 1.03 23060.80 10842.00 2.90
N=1000, p=4 0.14 9401.71 609.97 1.00 2866.82 1303.10 2.14
N=100, p=16 0.04 25345.62 428.45 0.95 5555.56 2396.00 5.59
N=1000, p=16 0.03 3471.13 54.06 1.00 613.98 278.53 5.15
N=2000, p=30 0.01 863.83 5.52 1.00 137.60 62.67 11.35
N=30, p=50 0.00 11470.28 24.49 0.07 3869.15 114.62 4.68
LM-C Bin
N=100, p=4 0.12 61452.51 3303.31 0.52 22916.67 5384.40 1.63
N=1000, p=4 0.10 9945.75 441.07 0.47 2857.14 606.16 1.37
N=100, p=16 0.04 26699.03 430.92 0.49 5530.42 1223.25 2.84
N=1000, p=16 0.03 3505.42 41.68 0.55 655.46 163.59 3.92
N=30, p=50 0.01 11815.25 44.01 0.12 3941.24 211.66 4.81
LM-WI
N=100, p=4 0.38 44117.65 5595.82 0.99 22865.85 7545.97 1.35
N=1000, p=4 0.44 4874.88 709.03 0.98 2834.47 929.87 1.31
N=100, p=16 0.32 11441.65 1233.59 0.94 5845.67 1837.45 1.49
N=1000, p=16 0.42 1269.14 179.09 1.00 653.62 217.22 1.21
LM-NI
N=100, p=4 0.37 43604.65 5415.31 1.01 22935.78 7749.15 1.43
N=1000, p=4 0.43 5613.77 804.61 1.06 2751.28 974.50 1.21
N=100, p=16 0.31 12386.46 1298.40 0.94 6134.97 1932.29 1.49
N=1000, p=16 0.43 1271.83 182.56 1.02 625.94 212.29 1.16
N=30, p=50 0.01 8581.24 14.45 0.01 3755.63 13.80 0.96
LM-Lasso
N=100, p=16, z=0 0.33 10881.39 905.68 0.33 17730.50 1475.74 1.63
N=1000, p=16, z=0 0.44 1219.59 132.65 0.44 2129.02 231.57 1.75
N=1000, p=30, z=2 0.41 552.30 56.81 0.41 942.42 96.94 1.71
N=1000, p=30, z=15 0.42 540.51 56.91 0.42 941.97 99.17 1.74
N=1000, p=30, z=28 0.42 541.01 56.27 0.42 970.73 100.97 1.79
N=1000, p=100, z=2 0.36 77.75 7.06 0.36 141.22 12.83 1.82
N=1000, p=100, z=50 0.37 74.89 6.89 0.37 141.32 13.01 1.89
N=1000, p=100, z=98 0.39 74.78 7.37 0.39 142.60 14.05 1.91

The “BFG” columns gives results from the same way BFG ran the cases, we think. The “ESS/Ns” is the same as their $\varepsilon_{\beta}$. ESS is averaged for the beta parameters. Ns is the number of saved samples, after burn-in and thinning. Their code gives different choices of burn-in and saved iterations for the different cases, and we used their settings. The “Nit/t” is the total number of iterations (including burn-in) divided by total computation time. The final column, which BFG don’t give, is “ESS/t”, what we call MCMC efficiency. Choice of time in the denominator includes burn-in time (the same as for “Nit/t”).

The “Better code” columns give results when we write the code with matrix multiplication and, for “LM-C” and “LM-C Bin”, univariate priors. It is almost as efficient to write the code using an inner product for each mu[i] instead of matrix multiplication for all mu[i] together. Matrix multiplication makes sense when all of the inputs that might changes (in this case, beta[i]s updated by MCMC) require all of the same likelihood contributions to be calculated from the result (in this case, all y[i]s from all mu[i]s). Either way of coding the model makes it easier for NIMBLE to sample the beta[i]s with conjugate samplers and avoids the inefficiency of putting every scalar step into the model code.

The “Better by” column gives the ratio of “ESS/t” for the “Better code” to “ESS/t” for the BFG code. This is the factor by which the “Better code” version improves upon the “BFG” version.

We can see that writing better code often give improvements of say 1.2-3.0 fold, and sometimes of 5-10+ fold in ESS/time. These improvements — which came from writing the model in NIMBLE more similarly to how it was written in Stan — often put NIMBLE closer to or faster than Stan in various cases, and sometimes faster than JAGS with BFG’s version of the model. We’re sticking to NIMBLE, so we haven’t run JAGS with the better-written code to see how much it improves. Stan still shines for p>n, and JAGS is still really good at linear models. The results show that, for the first four categories (above the LM-Lasso results), NIMBLE also can achieve very good mixing (near 100% ESS/saved samples), with the exception of the p>n cases. BFG’s results showed worse mixing for NIMBLE in those cases.

We can also see that BFG’s computation-time results for “LM-WI” (which we noted above) do appear to be really weird. In our results, that case ran somewhat slower than the LM-C cases with matching N and p, but not around 40-times slower as reported by BFG. We won’t make detailed comparisons of LM-WI cases because we’re not confident BFG’s results are solid for these.

As a example, take LM-C, with the simplest being “N=100, p=4” and the hardest being “N=2000, p=30”, not counting the p>n case. For the simplest case, BFG report that JAGS is about 2.1 times more efficient than Stan and about 2.4 times more efficient than NIMBLE. (E.g., the 2.1 comes from (100 * 3667)/(96 * 1883), reading numbers from their Table 2.) By writing the model in the simpler, better way in NIMBLE, we see a 2.9 fold gain in efficiency. This would make NIMBLE more efficient than Stan. We did not also re-run JAGS with the better code. For the hardest case, BFG report JAGS being about 1.8 times more efficient than Stan and about 2.1 times more efficient than NIMBLE. In that case coding the model better makes NIMBLE 11.4 times more efficient, apparently more efficient than Stan and possibly than JAGS. Again, we did not run JAGS with and without the coding improvement. As a final example, in one of the middle LM-L cases, with N = 1000, p = 30, and 15 of those truly 0, Stan is reported by BFG to be about 3.6 times more efficient than NIMBLE. The better-coded model improves NIMBLE by about 1.7-fold, leaving it still behind Stan but only by about half as much.

We ran these comparisons on a MacBook Pro (2.4 GHz 8-Core Intel Core i9). It looks like this was roughly 5 times faster than the computer on which BFG ran.

Inspection of traceplots revealed that the traceplots for the variance in the 5th and 6th “LM-C” cases had not yet converged in the “BFG” version of the model. More burn-in iterations would be needed. This goes hand-in-hand with the recognition that NIMBLE benefits from good initial values. In a real analysis, if a long burn-in was observed, a practical step would be to provide better initial values for the next run. Applied analysis always involves multiple MCMC runs as one gets things working and checked. With the “better code” version, the chains do appear to have converged.

At this point we should highlight that there isn’t only one version of NIMBLE’s MCMC performance. NIMBLE’s MCMC system is highly configurable, and its default samplers are just one possible choice among many. When putting real effort into boosting performance for hard models, we’ve seen improvements by 1-3 orders of magnitude (herehere and here). In non-conjugate cases where JAGS performs well, it is worth noting that JAGS uses a lot of slice samplers, and those can also be configured in NIMBLE. (But the cases here use lots of conjugate samplers, rather than slice samplers.)

The takeaway is that we don’t know why BFG gave Stan the benefit of matrix multiplication but didn’t do so for JAGS or NIMBLE, and doing so makes a substantial difference for NIMBLE. Also, we see more conjugacy cases to catch in our symbolic processing of model relationships.

3 Comments

  1. Comparing systems on an equal footing is a well-nigh impossible task, which is why we shy away from doing it in Stan. The polite thing to do with these kinds of comparisons is to send your code to the devs of each system for tuning. That’s what we did for our autodiff eval paper.

    I don’t like this paper’s evaluation any more than you do! I’d like to see an evaluation with (a) arithmetic on an equal footing, (b) the kinds of priors we actually use in our applied work, and (c) something higher dimensional than p = 100 (as in p = 10,000 or even p = 100,000 like in the genomics regressions I’m working on now). Then the evaluation I care about is time to ESS = 100 as measured by our conservative cross-chain ESS evaluations that also allow for antithetical sampling (Stan can produce samples whose ESS is higher than the number of iterations; may estimators just truncate at the number of iterations because they don’t understand ESS and its relation to square error through the MCMC CLT). The problem with this kind of eval is that want to represent actual practice but also minimize warmup. In simple GLMs like these, Stan usually only needs 100 or maybe 200 iterations of warmup compared to harder models. So if you use our default of 1000 warmup iterations and then run sampling until you hit ESS = 100, then you’ve wasted a lot of time in too much iteration.

    One way to get around this is to evaluate ESS/second after warmup (or what you might call “burnin” if you’re still following BUGS’s terminology). But given that we rarely need more than ESS = 100 and want to run at least 4 parallel chains to debug convergence, that’s not many iterations and you start getting a lot of variance in the number of iterations it takes to get there. And things are even more sensitive to getting adaptation right. And also, I don’t think ESS/second after warmup is the metric practitioners care about unless they’re trying to evaluate tail statistics, at which point they should be seriously considering control variates rather than more sampling.

    In other words, this is a really hard problem.

    • Thanks for the comment, Bob. That’s all really interesting and makes sense.

      We also look at comparing MCMC methods and packages as hard to do, for the reasons you mentioned and others too. But what prompted us to write these blog posts was realizing that only by reading their source code does one see that in the AFT case the models are not the same and in the LM case the model is written poorly for NIMBLE (and JAGS). We’ve seen their arXiv manuscript tweeted around a bit, so we wanted to clarify some of what’s going on. Even before getting into the kinds of issues you’ve raised, the comparisons did not put the packages on similar footing, from our perspective.

      For the record, when we gave examples of backtracking from our results to the results in the arXiv manuscript to say how updated results for NIMBLE would compare to their results for JAGS or Stan, we are not trying to say those are final comparisons either. They are just “back of the envelope” comparisons of how results might look within the comparison framework of the manuscript, if the packages are put on (more) similar footing. Among the reasons to not read too much into our updated comparisons between packages is that relative computational efficiency might differ across computational environments, and we did not want to get into re-running all packages. These posts are primarily about setting up fair comparisons and making better use of NIMBLE.

      Agreed, it is a really hard problem!

  2. I just read their source code. It’s not exactly Stan best practices for statistics or computation. For instance, in mixture_model.stan, there are redundant data computations per iteration (line 25), redundant distributions (also line 25), inefficient function calls (line 31), and a conjugate parameterization inducing extra work like sqrt (line 23). Then in AFT_non_informative.stan, they use very weakly informative priors (so it’s misnamed), there are missing constraints on constrained variables (lines 17, 32), redundant computation of subexpressions and of constants (lines 26, 27), missing algebraic reductions (also lines 26 and 27), redundant initialization and setting (lines 22/23 and 26/27), redundant computations (line 32).

    The worst case for efficiency is in their coding of linear models where they use a loop rather than a matrix multiply (LinearModel_conjugate.stan, lines 30–32, which also violates every naming convention in the world by mixing underscore separators and camel case). This code also divides by 2 everywhere when it should be multiplying by 0.5 and it also has a bunch of problems like the other code of missing constraints (this one’s critical—`sigma` needs to be constrained to be greater than 0).

    Then when we look at LinearModel_non_informative_hc.stan, things get even worse. It combines the problems of LinearModel_conjugate with two really bad ones for performance: not vectorizing the normal distribution and needlessly truncating the Cauchy distribution. These would add up to at least a factor of 2 and probably much more.

    And of course, none of these exploited within-chain parallelization or GPU. And no use of sufficient stats in the conjugate cases like Linear_model_conjugate.stan.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.