A close look at some posted trials of nimble for accelerated failure time models

Posted on October 29, 2021 by nimble-admin

A bunch of folks have brought to our attention a manuscript by Beraha, Falco and Guglielmi (BFG) posted on arXiv giving some comparisons between JAGS, NIMBLE, and Stan. Naturally, we wanted to take a look. Each package performs best in some of their comparisons. There’s a lot going on, so here we’re just going to work through the last of their four examples, an accelerated failure time (AFT) model, because that’s the one where NIMBLE looks the worst in their results. The code from BFG is available on GitHub.

There may be some issues with their other three examples as well, and we might work through those in future blog post(s). NIMBLE provides a lot of flexibility for configuring MCMCs in different ways (with different samplers), which means a comparison using our default configuration is just a start. Performance differences can also arise from writing the same model in different ways. We see both kinds of issues coming up for the other examples. But the AFT example gives a lot to talk about, so we’re sticking to that one here.

It turns out that NIMBLE and JAGS were put at a huge disadvantage compared to Stan, that BFG’s results from NIMBLE don’t look valid, and that there isn’t any exploration of NIMBLE’s configurability. If we make the model for NIMBLE and JAGS comparable to the model for Stan, NIMBLE does roughly 2-45 times better in various cases than what BFG reported. If we explore a simple block sampling option, NIMBLE gets a small additional boost in some cases. It’s hard to compare results exactly with what BFG report, and we are not out to re-run the full comparison including JAGS and Stan. A “back of the envelope” comparison suggests that NIMBLE is still less efficient than Stan for this example, but not nearly to the degree reported. We’re also not out to explore many sampling configurations to try for better performance in this particular example problem, but part of NIMBLE’s design is to make it easy to do so.

Before starting into the AFT models, it’s worth recognizing that software benchmarks and other kinds of performance comparisons are really hard to do well. It’s almost inevitable that, when done by developers of one package, that package gets a boost in results even if objectivity is the honest goal. That’s because package developers almost can’t help using their package effectively and likely don’t know how to use other packages as well as their own. In this case, it’s fair to point out that NIMBLE needs more care in providing valid initial values (which BFG’s code doesn’t do) and that NIMBLE’s default samplers don’t work well here, which is because this problem features heavy right tails of Weibull distributions with shape parameter < 1. For many users, that is not a typical problem. By choosing slice samplers (which JAGS often uses too) instead of NIMBLE’s default Metropolis-Hastings samplers, the mixing is much better. This issue is only relevant to the problem as BFG formulated it for JAGS and NIMBLE and goes away when we put it on par with the formulation BFG gave to Stan. In principle, comparisons by third parties, like BFG, might be more objective than those by package developers, but in this case the comparisons by BFG don’t use JAGS or NIMBLE effectively and include incorrect results from NIMBLE.

Below we try to reproduce their (invalid) results for NIMBLE and to run some within-NIMBLE comparisons of other methods. We’ll stick to their model scenarios and performance metrics. Those metrics are not the ones we’ve used in some published MCMC comparisons, but using them will allow readers to interpret our results alongside theirs.

First we’ll give a brief summary of their model scenarios. Here goes.

Accelerated Failure Time (AFT) models

Here’s a lightning introduction to AFT models based on Weibull distributions. These are models for time-to-event data such as a “failure.” For shape $a$ and scale $s$, the Weibull probability density function for time $t$ is

$f_W(t | a, s) = \frac{a}{s}\left(\frac{t}{s}\right)^{a-1} e^{-\left(\frac{t}{s}\right)^a}$

One important thing about the Weibull is that its cumulative distribution function can be written in closed form. It is:

$F_W(t | a, s) = 1-e^{-\left(\frac{t}{s}\right)^a}$
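As a quick numerical check of these two formulas, here is a minimal sketch in base R. The helper names `weib_pdf` and `weib_cdf` are ours, chosen for illustration only. R’s built-in `dweibull` and `pweibull` use the same shape and scale parameterization, so the two versions should agree.

```r
# Weibull density and CDF written out from the formulas above.
# weib_pdf and weib_cdf are illustrative names, not part of any package.
weib_pdf <- function(t, a, s) (a / s) * (t / s)^(a - 1) * exp(-(t / s)^a)
weib_cdf <- function(t, a, s) 1 - exp(-(t / s)^a)

t <- c(0.1, 0.5, 1, 2, 10)
a <- 0.8   # shape < 1 gives a heavy right tail
s <- 1.5   # scale

# Both comparisons should be TRUE up to floating-point tolerance.
all.equal(weib_pdf(t, a, s), dweibull(t, shape = a, scale = s))
all.equal(weib_cdf(t, a, s), pweibull(t, shape = a, scale = s))
```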

The role of covariates is to accelerate or decelerate the time course towards failure, effectively stretching or shrinking the time scale for each item. Specifically, for covariate vector $x$ and coefficient vector $\beta$, define $\theta = e^{-x' \beta}$. Then the distribution of times-to-event is defined by rescaling the Weibull: $f(t | \theta, a, s) = \theta f_W(\theta t | a, s)$. This gives a Weibull with shape $a$ and scale $s / \theta$, so we have

$f(t | \theta, a, s) = f_W(t | a, \frac{s}{\theta})$
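The rescaling identity can be checked numerically. This is a small base R sketch; the covariate and coefficient values are made up purely for illustration.

```r
# Check that theta * f_W(theta * t | a, s) equals f_W(t | a, s / theta).
# The covariate and coefficient values below are made up for illustration.
x     <- c(1, 0.3, -1.2)        # covariate vector (first entry is an intercept)
beta  <- c(0.5, -0.4, 0.2)      # coefficient vector
theta <- exp(-sum(x * beta))    # theta = exp(-x' beta)

a <- 0.8
s <- 1.5
t <- c(0.1, 0.5, 1, 2, 10)

rescaled <- theta * dweibull(theta * t, shape = a, scale = s)
direct   <- dweibull(t, shape = a, scale = s / theta)

# Should be TRUE up to floating-point tolerance.
all.equal(rescaled, direct)
```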

In the code, there are two parameterizations in play. The first is $(a, s)$ as just given. This is used in Stan and could be used in NIMBLE because it supports alternative parameterizations, including that one. Given $\beta$, the scale is $\log(2)^{-\frac{1}{a}} e^{x' \beta}$. The second is $(a, \lambda = \left(\frac{1}{s}\right)^{a})$. This is the parameterization in the BUGS model language, so it is used in JAGS and is the default in NIMBLE. Given $\beta$, $\lambda = \log(2) e^{-a(x' \beta)}$.

The reason for the $\log(2)$ is that it makes the median of $f_W(t | a, s)$ equal to 1 for any $a$ when $x' \beta = 0$. Priors are put on $a$ (alpha in the code) and $\beta$ (beta). There is no separate scale parameter. Rather, $\lambda = \log(2)$ when $x' \beta = 0$. The models are equivalent with either parameterization, and the choice shouldn’t have much impact on computational efficiency. We’re just pointing them out to follow what’s going on.
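To make the link between the two parameterizations concrete, here is a short base R sketch. The function names `aft_scale` and `aft_lambda` are hypothetical, and `eta` stands for the linear predictor $x' \beta$.

```r
# Convert between the (shape, scale) and BUGS-style (shape, lambda) forms.
# Function names are illustrative; eta stands for the linear predictor x' beta.
aft_scale  <- function(eta, a) log(2)^(-1 / a) * exp(eta)   # scale s
aft_lambda <- function(eta, a) log(2) * exp(-a * eta)       # lambda = (1 / s)^a

a   <- 0.8
eta <- c(-1, 0, 0.7)

s      <- aft_scale(eta, a)
lambda <- aft_lambda(eta, a)

# lambda should equal (1 / s)^a for every eta.
all.equal(lambda, (1 / s)^a)

# The median works out to exp(eta), so it is 1 when eta = 0.
medians <- qweibull(0.5, shape = a, scale = s)
all.equal(medians, exp(eta))
```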

Right-censored failure time data

When a failure time is directly observed, its likelihood contribution is $f_W(t | a, s e^{x' \beta})$. When a unit hasn’t failed by its last observation, all that is known is that it lasted at least until $t$. Then its likelihood contribution is $1-F_W(t | a, s e^{x' \beta})$. This is called a right-censored observation. Thus the data consist of some $t$s that are actual failure times and some $t$s that are right-censoring times.

There are two ways to handle a right-censored observation in MCMC:

1. Include the likelihood factor $1-F_W(t | a, s e^{x' \beta})$. This is how BFG set up the model for Stan.
2. Include a latent state, $t'$, for the failure time. Include the likelihood factor $f_W(t' | a, s e^{x' \beta})$ and let MCMC sample $t'$, with the numerical effect of integrating over it. This is how BFG set up the model for JAGS and NIMBLE.

The first version is marginalized relative to the second version because $1-F_W(t | a, s e^{x' \beta})$ integrates over $t'$ without needing to sample it. Often, but not always, marginalization is computationally faster and gives better mixing, so it makes the MCMC problem easier. That’s why the comparison as set up by BFG seems like an apples-to-oranges comparison. They’ve made the problem substantially easier for Stan.
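Here is a minimal base R sketch of the marginalized likelihood, followed by a numerical check that the censored factor really is the integral over the latent failure time. The function name `aft_loglik` and its arguments are our own illustrative choices, not code from BFG or from any package.

```r
# Marginalized log-likelihood for right-censored Weibull AFT data.
# 'censored' is TRUE where t is a censoring time rather than a failure time.
# eta is the linear predictor x' beta; the function name is illustrative.
aft_loglik <- function(t, censored, eta, a) {
  s <- log(2)^(-1 / a) * exp(eta)   # scale for each observation
  sum(ifelse(censored,
             pweibull(t, shape = a, scale = s, lower.tail = FALSE, log.p = TRUE),
             dweibull(t, shape = a, scale = s, log = TRUE)))
}

# The censored factor 1 - F(t) is the integral of the density over t' > t,
# which is what sampling a latent t' approximates numerically.
a <- 0.8; s <- 1.5; t_cens <- 2
by_integration <- integrate(dweibull, lower = t_cens, upper = Inf,
                            shape = a, scale = s)$value
closed_form <- pweibull(t_cens, shape = a, scale = s, lower.tail = FALSE)

# Agreement is limited by the accuracy of the numerical integration.
all.equal(by_integration, closed_form, tolerance = 1e-3)
```

The closed form costs one function evaluation per censored observation, whereas the latent-state version adds one sampled quantity per censored observation to the MCMC.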

It’s easy to set up the marginalized version for JAGS or NIMBLE. This can be done using the “zeroes” trick in the BUGS language, which both packages use for writing models. In NIMBLE this can also be done by writing a user-defined distribution as a nimbleFunction, which can be compiled along with a model.
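For readers who haven’t seen it, here is a sketch of what the zeroes trick could look like for this model, written in the BUGS $(a, \lambda)$ parameterization. This is an illustration under our own assumptions rather than the exact code behind the results in this post: the variable names (`zeros`, `phi`, `censored`, `C`) are hypothetical, and priors on `alpha` and `beta` are omitted. The model code is stored with base R’s `quote()` so the example runs without any packages; with nimble loaded one would typically wrap the same braces in `nimbleCode()`.

```r
# Sketch of the "zeroes" trick for the marginalized AFT likelihood.
# Variable names and the constant C are assumptions for illustration;
# priors on alpha and beta are omitted. quote() only stores the code.
aft_zeros_code <- quote({
  for (i in 1:N) {
    lambda[i] <- log(2) * exp(-alpha * inprod(X[i, 1:p], beta[1:p]))
    logS[i] <- -lambda[i] * t[i]^alpha                    # log(1 - F)
    logf[i] <- log(alpha) + log(lambda[i]) +
      (alpha - 1) * log(t[i]) + logS[i]                   # log density
    logL[i] <- censored[i] * logS[i] + (1 - censored[i]) * logf[i]
    phi[i] <- -logL[i] + C                                # C keeps phi[i] > 0
    zeros[i] ~ dpois(phi[i])                              # zeros[i] is data, all 0
  }
})

# Why it works: a Poisson(phi) observation of 0 has probability exp(-phi),
# so each zero contributes exp(logL - C), proportional to the likelihood.
alpha <- 0.8; lambda <- log(2); t_obs <- 1.7; C <- 100
logS <- -lambda * t_obs^alpha
phi  <- -logS + C
all.equal(dpois(0, phi, log = TRUE) + C, logS)

# The hand-written log survival matches pweibull with scale = lambda^(-1/alpha).
all.equal(logS, pweibull(t_obs, shape = alpha, scale = lambda^(-1 / alpha),
                         lower.tail = FALSE, log.p = TRUE))
```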

BFG’s scenarios

BFG included the following scenarios:

Sample size, $N$, is 100 or 1000.
Number of explanatory variables, $p$, is 4 or 16. These always include an intercept. Other covariates, and the true coefficient values, are simulated.
Censoring times are drawn from another Weibull distribution. This is set up following previous works such that the expected proportion of censored values is 20%, 50% or 80%.
Most of their comparisons use informative priors. Those are the ones we look at here. Again, we weren’t out to look at everything they did.
They used $N_{it} = 10,000$ total iterations. Of these, $5,000$ were discarded as burn-in (warmup). They used a thinning interval of 2, resulting in $N_s = 2,500$ saved samples.

Some issues to explore

Now that we’ve set up the background, we are ready to list some of the issues with BFG’s comparisons that are worth exploring. For the computational experiments below, we decided to limit our efforts to NIMBLE because we are not trying to re-do BFG’s full analysis. Here are the main issues.

BFG gave Stan a much easier problem than they gave JAGS and NIMBLE. Stan was allowed to use direct calculation of right-censored probabilities. These are complementary (right-tail) cumulative probability calculations. NIMBLE and JAGS were made to sample latent failure times for censored items, even though they can be set up to use the cumulative calculations as well. Below we give NIMBLE a more comparable problem to the one given by BFG to Stan.
It looks like BFG must not have obtained valid results from NIMBLE because they did not set up valid initial values for latent failure times. NIMBLE can be more sensitive to initial values (“inits”) than JAGS. We think that’s partly because NIMBLE uses a lot of adaptive random-walk Metropolis-Hastings samplers in its default MCMC configuration. In any case, NIMBLE gives warnings at multiple steps if a user should give attention to initial values. We give warnings instead of errors because a user might have plans to add initial values at a later step, and because sometimes MCMC samplers can recover from bad initial values. In the AFT example, the model does not “know” that initial values for latent failure times must be greater than the censoring times. If they aren’t, the likelihood calculations will return a -Inf (or possibly NA), which causes trouble for the samplers. Inspection of the model after MCMC runs using BFG’s code shows that even after 10000 iterations, the model likelihood is -Inf, so the results are invalid. It’s fair to say this is an issue in how to use NIMBLE, but it’s confusing to include invalid results in a comparison.
Even with valid initial values in BFG’s model formulation, NIMBLE’s default samplers do not do well for this example. In this post, we explore slice samplers instead. The problem is that the Weibull distributions in these scenarios give long right tails, due to simulating with shape parameter < 1. This corresponds to failure rates that decrease with time, like when many failures occur early and then those that don’t fail can last a long, long time. MCMC sampling of long right tails is a known challenge. In trial runs, we saw that, to some extent, the issue can be diagnosed by monitoring the latent failure times and noticing that they don’t mix well. We also saw that sometimes regression parameters displayed mixing problems. BFG report that NIMBLE’s results have mean posterior values farther from the correct values than given by the other tools, which is a hint that something is more deeply wrong. Slice samplers work much better for this situation, and it is easy to tell NIMBLE to use slice samplers, which we did.
BFG’s code uses matrix multiplication for $x' \beta$ in Stan, but not in NIMBLE or JAGS, even though they also support matrix multiplication. Instead, BFG’s code for NIMBLE and JAGS has a scalar declaration for each element of the matrix multiplication operation, followed by the sums that form each element of the result. We modify the code to use matrix multiplication. While we don’t often see this make a huge difference in run-time performance (when we’ve looked at the issue in other examples), it could potentially matter, and it definitely speeds up NIMBLE’s model-building and compilation steps because there is less to keep track of. An intermediate option would be to use inner products (inprod).
It’s worth noting that all of these examples are fairly fast and mix fairly well. Some might disagree, but these all generate reasonable effective sample sizes in seconds-to-minutes, not hours-to-days.
There are some minor issues, and we don’t want to get nit-picky. One is that we don’t see BFG’s code being set up to be reproducible. For example, not only is there no set.seed so that others can generate identical data sets, but it looks like each package was given different simulated data sets. It can happen that MCMC performance depends on the data set. While this might not be a huge issue, we prefer below to give each package the same, reproducible, data sets. Another issue is that looking at average effective sample size across parameters can be misleading because one wants all parameters mixed well, not some mixed really well and others mixed poorly. But in these examples the parameters compared are all regression-type coefficients that play similar roles in the model, and the averaging doesn’t look like a huge issue. Finally, BFG decline to report ESS/time, preferring instead to report ESS and time and let readers make sense of them. We see ESS/time as the primary metric of interest, the number of effectively independent samples generated per second, so we report it below. This gives a way to see how both mixing (ESS) and computation time contribute to MCMC performance.
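On the initial-values issue above, the constraint is simple to satisfy: each latent failure time just needs to start above its censoring time. Here is one way that could be done in base R, as a sketch. The variable names are hypothetical and not taken from BFG’s code, and the exponential offset is an arbitrary choice.

```r
# Build initial values for latent failure times that respect the censoring
# constraint. All names here are illustrative, not taken from BFG's code.
set.seed(1)   # fixed seed so the same inits can be regenerated

t_censor <- c(0.4, 2.1, 5.3)   # right-censoring times for the censored units

# Add a strictly positive offset so every initial value exceeds its
# censoring time. The exponential offset is an arbitrary choice.
t_latent_init <- t_censor + rexp(length(t_censor), rate = 1)

stopifnot(all(t_latent_init > t_censor))
```

With a NIMBLE model object, checking that `model$calculate()` returns a finite value after setting the initial values is one way to confirm they are valid.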

Setting up the example

We use BFG’s code but modify it to organize it into functions and make it reproducible. The source files for this document include code chunks to run and save results. We are not running JAGS or Stan because we are not trying to reproduce a full set of comparisons. Instead we are looking into NIMBLE’s performance for this example. Since the main issue is that BFG gave NIMBLE and JAGS harder models than they gave Stan, we fix this in a way that is not NIMBLE-specific and should also work for JAGS.

Here is a summary of what the code does:

Set up the twelve cases with informative priors included in the first twelve rows of BFG’s table 5, which has their AFT results.
For each of the twelve cases, run:
- the original method of BFG, which gives invalid results but is useful for trying to see how much later steps improve over what BFG reported;
- a method with valid initial values and slice sampling, but still in the harder model formulation given by BFG;
- a method with the model formulation matching what BFG gave to Stan, using marginal probabilities for censored times and also using matrix multiplication;
- a method with the model formulation matching what BFG gave to Stan and also with one simple experiment in block sampling. The block sampler used is a multivariate adaptive random-walk Metropolis-Hastings sampler for all the regression coefficients. It sometimes helps to let these try multiple propose-accept/reject steps because otherwise $p$ tries are replaced with 1 try (where $p$ is the number of regression coefficients). As a heuristic choice, we used $p/2$ tries each time the sampler ran.
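To illustrate the matrix multiplication point in the third method, here are three equivalent ways to form the linear predictor, shown in plain R rather than in model code. The sizes and values are made up for illustration.

```r
# Three equivalent ways to form the linear predictor x' beta for every row.
# Sizes and values are made up for illustration.
set.seed(2)
N <- 5; p <- 4
X    <- cbind(1, matrix(rnorm(N * (p - 1)), nrow = N))  # first column: intercept
beta <- rnorm(p)

# 1. One scalar term per matrix element, then a sum per row.
eta_scalar <- numeric(N)
for (i in 1:N) {
  for (j in 1:p) eta_scalar[i] <- eta_scalar[i] + X[i, j] * beta[j]
}

# 2. One inner product per row (the analogue of inprod in model code).
eta_inprod <- sapply(1:N, function(i) sum(X[i, ] * beta))

# 3. A single matrix multiplication.
eta_matmul <- as.vector(X %*% beta)

all.equal(eta_scalar, eta_inprod)
all.equal(eta_inprod, eta_matmul)
```

In a model, the first form creates on the order of $N \times p$ separate declarations, while the third creates one, which is why it gives the model-building and compilation steps less to keep track of.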

Although the original method of BFG seems to give invalid results, we include it so we can try to roughly compare performance (shown below) against what they report. However one difficulty is that processing with -Inf and NaN values can be substantially slower than processing with actual numbers, and these issues might differ across systems.

Results here are run on a MacBook Pro (2019), with 2.4 GHz 8-Core Intel Core i9, and OS X version 11.6.

Results

Here are the results, in a table that roughly matches the format of BFG’s Table 5. “Perc” is the average fraction of observations that are right-censored.

As best as we can determine:

“ESS/Ns” is their “$\varepsilon_{\beta}$”. This is the mean effective sample size of the (4 or 16) beta coefficients per saved MCMC iteration. The number of saved iterations, $N_s$, is 2500. We used coda::effectiveSize to estimate ESS. We did not see in their code what method they used. This is another reason we can’t be sure how to compare our results to theirs.
“Nit/t” is their “$N_{it}/t_s$”, the total number of iterations (10000) per computation time, not counting compilation time.
We calculate “ESS/t”, which is the product of the previous two numbers divided by four, (ESS/Ns)*(Nit/t)/4. This is the mean effective sample size from the saved samples per total sampling time (including burn-in). One might also consider modifying this for the burn-in portion. The factor 4 comes from $N_{it}/N_s$ = 4. We do it this way to make it easier to compare to BFG’s Table 5. They decline to calculate a metric of ESS per time, which we view as a fundamental metric of MCMC performance. An MCMC can be efficient either by generating well-mixed samples at high computational cost or generating poorly-mixed samples at low computational cost, so both mixing and computational cost contribute to MCMC efficiency.
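The arithmetic behind “ESS/t” can be written out as a short base R sketch. The function name `ess_per_time` is ours, for illustration.

```r
# ESS per second of total sampling time, from the two reported ratios.
N_it   <- 10000                    # total iterations
burnin <- 5000
thin   <- 2
N_s    <- (N_it - burnin) / thin   # 2500 saved samples

# ess_per_time is an illustrative name.
ess_per_time <- function(ess_per_Ns, Nit_per_t) {
  # (ESS / N_s) * (N_it / t) * (N_s / N_it) = ESS / t
  ess_per_Ns * Nit_per_t * N_s / N_it
}

# Using the rounded table entries for N = 1000, p = 16, perc = 0.8 (Marginal).
# This gives about 54.65, close to the tabulated 54.91; the small gap is
# presumably because the table was computed before rounding ESS/Ns.
ess_per_time(0.39, 560.54)
```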

	BFG (invalid)			BFG+inits+slice			Marginal			Marginal+blocks
	ESS/Ns	Nit/t	ESS/t	ESS/Ns	Nit/t	ESS/t	ESS/Ns	Nit/t	ESS/t	ESS/Ns	Nit/t	ESS/t
Perc = 0.2
N=100, p = 4, perc = 0.2	0.27	6844.63	465.80	0.52	2325.58	300.65	0.39	9775.17	951.09	0.27	16233.77	1109.06
N=1000, p = 4, perc = 0.2	0.30	1127.27	84.71	0.55	306.22	41.83	0.41	1527.88	157.65	0.28	2490.04	171.47
N=100, p = 16, perc = 0.2	0.19	3423.49	161.60	0.36	949.49	84.94	0.27	3717.47	248.99	0.29	5621.14	408.77
N=1000, p = 16, perc = 0.2	0.08	404.22	7.80	0.57	98.86	14.16	0.41	591.82	61.12	0.30	1100.47	83.33
Perc = 0.5
N=100, p = 4, perc = 0.5	0.05	7262.16	98.39	0.08	2572.68	54.45	0.38	10214.50	960.31	0.26	15060.24	990.34
N=1000, p = 4, perc = 0.5	0.10	1106.32	26.96	0.10	298.23	7.25	0.44	1987.28	219.92	0.26	3074.09	196.19
N=100, p = 16, perc = 0.5	0.06	3411.80	52.07	0.21	940.56	49.94	0.23	3955.70	229.94	0.28	5854.80	415.89
N=1000, p = 16, perc = 0.5	0.07	339.29	5.88	0.07	95.90	1.66	0.41	601.90	61.98	0.31	1074.58	83.07
Perc = 0.8
N=100, p = 4, perc = 0.8	0.03	6761.33	51.99	0.02	2297.79	10.79	0.24	9842.52	602.28	0.20	15151.52	763.36
N=1000, p = 4, perc = 0.8	0.02	1013.27	5.16	0.02	265.58	1.50	0.39	1831.50	180.50	0.25	2856.33	176.27
N=100, p = 16, perc = 0.8	0.04	3412.97	33.45	0.03	876.96	6.74	0.17	3853.56	166.26	0.23	5820.72	329.18
N=1000, p = 16, perc = 0.8	0.01	395.99	1.22	0.05	95.33	1.22	0.39	560.54	54.91	0.29	1016.57	72.55

The left-most set of results (“BFG (invalid)”) is comparable to the right-most (“NIMBLE”) column of BFG’s Table 5, in the same row order for their first 12 rows. The simulated data sets are different. For that reason and the stochasticity of Monte Carlo methods, we shouldn’t expect to see exactly matching values. And of course the computations were run on different systems, resulting in different times. Again, these results are invalid.

The next column (“BFG+inits+slice”) gives results when BFG’s model formulation for JAGS and NIMBLE is combined with valid initialization and slice sampling in NIMBLE. We can see that valid sampling generally gives lower ESS/time than the invalid results.

The next column shows results when the problem is set up as BFG gave it to Stan, and NIMBLE’s default samplers are used. If we assume the left-most results are similar to what BFG report, but with times from the system used here, then the boost in performance is the ratio of ESS/time between methods. For example, in the last row, the marginal method is 54.91/1.22 = 45.01 times more efficient than what BFG reported. We can make a similar kind of ratio between Stan and NIMBLE from BFG’s results, which gave Stan as about 380 times more efficient than NIMBLE (although rounding error for “1%” could be a substantial issue here). Putting these together, Stan might really be about 8.4 times more efficient than NIMBLE for this case, which is the hardest case considered.

The last column shows results of the single experiment with alternative (block) samplers that we tried. In many cases, it gives a modest additional boost. Often with more work one can find a better sampling strategy, which can be worth the trouble for extended work with a particular kind of model. In the last row of our results, this gives about another 72.55 / 54.91 = 1.32 boost in performance, lowering the ratio to Stan to about 6.4. Again, we decided to limit this post to within-NIMBLE comparisons, and the comparisons to Stan based on BFG’s results should be taken with a grain of salt because we didn’t re-run them.
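For anyone who wants to retrace the back-of-the-envelope ratios for the last row, they can be reproduced in a few lines of base R. The variable names are ours, and the value 380 is the approximate Stan-to-NIMBLE ratio quoted above from BFG’s results, not something we re-ran.

```r
# Back-of-the-envelope ratios for the hardest case (N = 1000, p = 16, perc = 0.8).
ess_t_bfg      <- 1.22    # "BFG (invalid)" ESS/t from the table
ess_t_marginal <- 54.91   # "Marginal" ESS/t
ess_t_blocks   <- 72.55   # "Marginal+blocks" ESS/t
stan_vs_bfg    <- 380     # approximate Stan/NIMBLE ratio read from BFG's results

boost_marginal   <- ess_t_marginal / ess_t_bfg       # about 45
stan_vs_marginal <- stan_vs_bfg / boost_marginal     # about 8.4
boost_blocks     <- ess_t_blocks / ess_t_marginal    # about 1.32
stan_vs_blocks   <- stan_vs_marginal / boost_blocks  # about 6.4

round(c(boost_marginal, stan_vs_marginal, boost_blocks, stan_vs_blocks), 2)
```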

In summary, it looks like BFG gave Stan a different and easier accelerated failure time problem than they gave NIMBLE and JAGS. When given the same problem as they gave Stan, NIMBLE’s default samplers perform around 2 to 45 times better than what BFG reported.

2 Comments

Mario Beraha

November 2, 2021 at 5:37 am

This is great feedback, thanks a lot! We will make sure to cite your post in a future version.
We are very glad that Nimble developers have been examining our paper: the full customisability given by Nimble is very powerful, but it might be tricky for new users and practitioners to get everything right, and this kind of blog post will really help the community.

Regarding the content of your post:
1) You are completely right that the posterior that Nimble is fitting is different from the one of Stan; however, this “zeroes” trick that you mention is not something easily found in manuals (and actually we hadn’t thought about it) while in Stan you are forced to use the marginalised posterior and this is clearly explained in their manual.

2) In the same way, we acknowledge that tuning the MCMC algorithm used in Nimble can make a huge difference in the performance. Both the zeroes trick and the MCMC tuning is something that for a practitioner is extremely hard to achieve and, as we point out in the paper, our goal was just to provide a comparison for novice practitioners.

3) We did not pass initial values for the censoring times, but this is the same setting also for JAGS and Stan.

4) In our simulations we did indeed fix a random seed (and the specific one is 423156), moreover we always compare different packages on the same dataset. When we performed “repeated” simulations (see, e.g., Appendix D) we launched the same script in parallel with different random seeds. We can add back the random seed to the Github code.
- nimble-admin
  
  November 3, 2021 at 10:51 am
  
  Thanks for posting a comment on our blog post. We appreciate it.
  
  Here are some brief replies to your points:
  
  1. We understand you found it easier to write the marginalized model in Stan than in NIMBLE or JAGS. The NIMBLE User Manual has a chapter titled “Creating user-defined BUGS [language] distributions and functions”. This is pointed to from the table of contents as well as sections 1.2 and 5.1 (on writing models in NIMBLE). Our web site has an example titled “Writing a new distribution for use in BUGS code: zero-inflated Poisson”. We see the “zeroes trick” used regularly and known among JAGS users, and we used that in the blog post to provide a solution that should work for both NIMBLE and JAGS.
  
  In NIMBLE one could, instead of using the zeroes trick, use the language extensibility covered in the User Manual to write a Weibull complementary CDF (shown in shape and scale parameterization) as (this doesn’t come out formatted very well in this space):
  
  dccdf_weib <- nimbleFunction(
    run = function(x = double(), shape = double(), scale = double(), log = integer(0, default = 0)) {
      # One could use pweibull or write out the equation.
      ans <- pweibull(x, shape, scale, lower.tail = FALSE, log.p = log)
      return(ans)
      returnType(double())
    }
  )
  
  Giving user-defined distributions in NIMBLE is similar to how in Stan one can manually write a likelihood contribution. In JAGS, people do that with the zeroes trick.
  
  2. We understand that trying different sampler choices is a more advanced use, but that was not the central point of our blog post. (Also, when one tries entirely different packages, part of what one gets is different sampler choices, whether those are customizable or not.)
  
  The central point was that it is better not to confound comparisons among MCMC methods (such as from different packages) with comparisons among different model formulations. If it is harder to write a model in an effective way in one package, that seems like a fair point. But we feel it could be confusing if that point is conveyed only by different MCMC performance. In this case, it seemed especially confusing because one can readily write the model in the same way for all packages, and there are multiple ways to do that in NIMBLE.
  
  3. We know that NIMBLE, especially with its default samplers, sometimes needs more helpful initial values. That is a fair point to make. However, as far as we could tell, your article might include invalid results, and that seems like it could be confusing to readers.
  
  4. It would be fantastic if you can update your GitHub repo to include fully reproducible code. Another question we raised is what method you used to estimate effective sample size. For example, there are different methods in Stan, coda, and mcmcse. I may have missed it, but I didn't see it in your code.
  
  Thanks again for commenting.
  
  -Perry de Valpine