Unfortunately, we realized (thanks to a user post from a couple days ago) that there is a bug in this new approach to predictive nodes.
If you haven’t upgraded to version 0.13.0, simply wait to upgrade until we release a bug fix in 0.13.1 in the next couple weeks.
If you have upgraded to version 0.13.0 and if you have run an MCMC on a model that both (1) has predictive nodes and (2) has multivariate nodes, then the bug might affect your results. Please set:
nimbleOptions(MCMCusePredictiveDependenciesInCalculations = TRUE)
and then reconfigure/rebuild and rerun your MCMC. The option above will ensure that the MCMC behaves as it would in previous versions of NIMBLE.
Recently we added support for automatic differentiation (AD) to NIMBLE in a beta release, and the workshop will cover NIMBLE’s AD capabilities in detail.
The workshop will cover the following material:
If you are interested in attending, please preregister. Registration fees will be $125 (regular) or $50 (student). We are also offering a process (see the preregistration form) for students to request a fee waiver.
The workshop will assume attendees have a basic understanding of hierarchical/Bayesian models and MCMC, the BUGS (or JAGS) model language, and some familiarity with R.
We’re making this beta release available to allow our users to test and evaluate the AD functionality and the new algorithms, but it is not recommended for production use at this stage. So please give it a try, and let us know of any problems or suggestions you have, either via the nimble-users list, bug reports to our GitHub repository, or email to nimble.stats@gmail.com.
You can download the beta version and view an extensive draft manual for the AD functionality.
We plan to release this functionality in the next NIMBLE release on CRAN in the coming months.
Version 0.12.2 is a bug fix release. In particular, this release fixes a bug in our Bayesian nonparametric (BNP) distribution functionality that gives incorrect MCMC results for some models. The bug arises when using the dCRP distribution and the parameters of the mixture components (i.e., the clusters) have hyperparameters (i.e., the base measure parameters) that are unknown and sampled during the MCMC. Here is an example of a basic model structure that is affected by the bug:
k[1:n] ~ dCRP(alpha, n)
for(i in 1:n) {
  y[i] ~ dnorm(mu[k[i]], 1)
  mu[i] ~ dnorm(mu0, 1)  ## mixture component parameters with hyperparameter
}
mu0 ~ dnorm(0, 1)  ## unknown cluster hyperparameter
(There is no problem without the hyperparameter layer – i.e., if mu0 is a fixed value – which is the situation in many models.)
We strongly encourage users using models with this type of structure to rerun their analyses, and we apologize for this issue.
Other changes in this release include:
Please see the release notes on our website for more details.
NIMBLE is a system for building and sharing analysis methods for statistical models, especially for hierarchical models and computationally-intensive methods (such as MCMC and SMC).
More details and registration are available at the workshop website. No previous NIMBLE experience is required, but the workshop will assume some familiarity with hierarchical models, Markov chain Monte Carlo (MCMC), and R.
In our previous post, we explained that they compared apples to oranges in the accelerated failure time (AFT) example. They gave Stan a different and easier problem than they gave JAGS and NIMBLE. When we gave NIMBLE the same problem, we saw that its MCMC performance was up to 45 times better than what they reported. We looked first at the AFT example because that’s where NIMBLE seemed to perform comparatively worst.
In this post we’re looking at the simple linear model example. It turns out that the models were written more efficiently for Stan than for JAGS and NIMBLE, because matrix multiplication was used for Stan but all scalar steps of matrix multiplication were written in JAGS and NIMBLE. JAGS and NIMBLE do support matrix multiplication and inner products. When we modify the models to also use matrix multiplication, NIMBLE’s MCMC performance with default samplers often increases by 1.2- to 3-fold, but sometimes by 5- to >10-fold, over what was reported by BFG, as far as we can tell. This had to do with both raw computational efficiency and the MCMC samplers invoked by different ways of writing the model code. Other issues are described below.
BFG’s linear model examples explore different data sizes (n = 30, 100, 1000, or in one case 2000), different numbers of explanatory variables (4, 16, 30, 50 or 100), and different priors for the variance and/or coefficients (beta[i]s), all in a simple linear model. The priors included:
Again, we are going to stick to NIMBLE here and not try to reproduce or explore results for JAGS or Stan.
In more detail, the big issues that jumped out from BFG’s code are:
Stan was given matrix multiplication for X %*% beta, while NIMBLE and JAGS were given code to do all of the element-by-element steps of matrix multiplication. Both NIMBLE and JAGS support matrix multiplication and inner products, so we think it is better and more directly comparable to use these features.
There are two reasons that writing out matrix multiplication as they did is not a great way to code a model. The first is that it is just inefficient. For X that is N-by-p and beta that is p-by-1, there are N*p scalar multiplications and N summations of length p in the model code. Although somewhere in the computer those elemental steps need to be taken, they will be substantially faster if not broken up by hand-coding them. When NIMBLE generates (and then compiles) C++, it generates C++ for the Eigen linear algebra library, which gives efficient implementations of matrix operations.
The second reason, however, may be more important in this case. Using either matrix multiplication or inner products makes it easier for NIMBLE to determine that the coefficients (“beta[i]”s) in many of these cases have conjugate relationships that can be used for Gibbs sampling. The way BFG wrote the model revealed to us that we’re not detecting the conjugacy in this case. That’s something we plan to fix, but it’s not a situation that’s come before us yet. Detecting conjugacy in a graphical model — as written in the BUGS/JAGS/NIMBLE dialects of the BUGS language — involves symbolic algebra, so it’s difficult to catch all cases.
The reasons it’s better to give a set of univariate normal priors than a single multivariate normal are similar. It’s more computationally efficient, and it makes it easier to detect conjugacy.
In summary, they wrote the model inefficiently for NIMBLE and differently between packages, and we didn’t detect conjugacy for the way they wrote it. In the results below, the “better” results use matrix multiplication directly (in all cases) and use univariate normal priors instead of a multivariate normal (in the “LMC” and “LMC Bin” cases).
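To make the operation count concrete, here is a small illustrative sketch in plain Python (not NIMBLE, JAGS, or Stan code). The function name and the toy values of X and beta are made up for illustration. It computes the linear predictor one scalar step at a time and counts the N*p multiplications and N summations that the hand-coded version exposes as separate steps.

```python
def matvec_scalar_steps(X, beta):
    """Compute mu = X beta one scalar step at a time, counting the work."""
    n_mult = 0
    n_sums = 0
    mu = []
    for row in X:                       # one linear predictor per observation
        total = 0.0
        for x_ij, b_j in zip(row, beta):
            total += x_ij * b_j         # one scalar multiplication
            n_mult += 1
        mu.append(total)
        n_sums += 1                     # one summation of length p
    return mu, n_mult, n_sums


# Toy inputs (hypothetical values, chosen only to have N = 100 and p = 4).
N, p = 100, 4
X = [[(i + 1) * (j + 1) / 10.0 for j in range(p)] for i in range(N)]
beta = [0.5, -1.0, 2.0, 0.25]

mu, n_mult, n_sums = matvec_scalar_steps(X, beta)
print(n_mult, n_sums)  # prints: 400 100
```

A matrix library performs the same arithmetic, but as one operation rather than as N*p separately declared steps.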
It also turns out that neither JAGS nor NIMBLE detects conjugacy for the precision parameter of the “LMC” and “LMC Bin” cases. (This is shown by list.samplers in rjags and configureMCMC in NIMBLE.) In NIMBLE, a summary of how conjugacy is determined is in Table 7.1 of our User Manual. Conjugacy can be obtained by changing sd = sigma to var = sigmasq in one line of BFG’s code. In these examples, we found that this issue doesn’t make much difference to MCMC efficiency, so we leave it as they coded it.
Before giving our results, we’ll make a few observations on BFG’s results, shown in their Table 2. One is that JAGS gives very efficient sampling for many of these cases, and that’s something we’ve seen before. Especially when conjugate sampling is available, JAGS does well. Next is that Stan and NIMBLE each do better than the other in some cases. As we wrote about in the previous post, BFG chose not to calculate what we see as the most relevant metric for comparison. That is the rate of generating effectively independent samples, the ESS/time, which we call MCMC efficiency. An MCMC system can be efficient by slowly generating well-mixed samples or by rapidly generating poorly-mixed samples. One has to make choices such as whether burn-in (or warmup) time is counted in the denominator, depending on exactly what is of interest. BFG reported only ESS/recorded iterations and total iterations/time. The product of these is a measure of ESS/time, scaled by a ratio of total iterations / recorded iterations.
For example, in the “LMC” case with “N = 1000, p = 4”, Stan has (ESS/recorded iterations) * (total iterations/time) = 0.99 * 157=155, while NIMBLE has 0.14 * 1571=220. Thus in this case NIMBLE is generating effectively independent samples faster than Stan, because the faster computation outweighs the poorer mixing. In other cases, Stan has higher ESS/time than NIMBLE. When BFG round ESS/recorded iterations to “1%” in some cases, the ESS/time is unknown up to a factor of 3 because 1% could be rounded from 0.50 or from 1.49. For most cases, Stan and NIMBLE are within a factor of 2 of each other, which is close. One case where Stan really stands out is the noninformative prior (LMNI) with p>n, but it’s worth noting that this is a statistically unhealthy case. With p>n, parameters are not identifiable without the help of a prior. In the LMNI case, the prior is uninformative, and the posteriors for beta[i]s are not much different than their priors.
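The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the calculation described in the text. The function name is ours, and the inputs are the numbers quoted above from BFG’s Table 2.

```python
def ess_per_time(ess_per_recorded_iter, total_iters_per_time):
    """Product of BFG's two reported quantities: a measure of ESS/time."""
    return ess_per_recorded_iter * total_iters_per_time


# "LMC" case with N = 1000, p = 4, using the values quoted in the text.
stan = ess_per_time(0.99, 157)
nimble = ess_per_time(0.14, 1571)
print(round(stan), round(nimble))  # prints: 155 220

# A value reported as "1%" could have been rounded from 0.50 or from 1.49,
# so the implied ESS/time is uncertain by up to roughly this factor.
rounding_factor = 1.49 / 0.50
print(round(rounding_factor, 2))  # prints: 2.98
```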
One other result jumps out as strange from their Table 2. The run-time results for “LMWI” (total iterations / time) are much, much slower than in other cases. For example, with N = 100 and p = 4, this case was only 2.6% (294 vs 11,000) as fast as the corresponding “LMC” case. We’re not sure how that could make sense, so it was something we wanted to check.
We took all of BFG’s source code and organized it to be more fully reproducible. After our previous blog post, set.seed calls were added to their source code, so we use those. We also organize the code into functions and sets of runs to save and process together. We think we interpreted their code correctly, but we can’t be sure. For ESS estimation, we used coda::effectiveSize, but Stan and mcmcse are examples of packages with other methods, and we aren’t sure what BFG used. They thin by 2 and give average results for beta[i]s. We want to compare to their results, so we take those steps too.
Here are the results:
Case  BFG: ESS/Ns  BFG: Nit/t  BFG: ESS/t  Better code: ESS/Ns  Better code: Nit/t  Better code: ESS/t  Improvement: Better by
LMC  
N=100, p=4  0.15  56122.45  3738.90  1.03  23060.80  10842.00  2.90 
N=1000, p=4  0.14  9401.71  609.97  1.00  2866.82  1303.10  2.14 
N=100, p=16  0.04  25345.62  428.45  0.95  5555.56  2396.00  5.59 
N=1000, p=16  0.03  3471.13  54.06  1.00  613.98  278.53  5.15 
N=2000, p=30  0.01  863.83  5.52  1.00  137.60  62.67  11.35 
N=30, p=50  0.00  11470.28  24.49  0.07  3869.15  114.62  4.68 
LMC Bin  
N=100, p=4  0.12  61452.51  3303.31  0.52  22916.67  5384.40  1.63 
N=1000, p=4  0.10  9945.75  441.07  0.47  2857.14  606.16  1.37 
N=100, p=16  0.04  26699.03  430.92  0.49  5530.42  1223.25  2.84 
N=1000, p=16  0.03  3505.42  41.68  0.55  655.46  163.59  3.92 
N=30, p=50  0.01  11815.25  44.01  0.12  3941.24  211.66  4.81 
LMWI  
N=100, p=4  0.38  44117.65  5595.82  0.99  22865.85  7545.97  1.35 
N=1000, p=4  0.44  4874.88  709.03  0.98  2834.47  929.87  1.31 
N=100, p=16  0.32  11441.65  1233.59  0.94  5845.67  1837.45  1.49 
N=1000, p=16  0.42  1269.14  179.09  1.00  653.62  217.22  1.21 
LMNI  
N=100, p=4  0.37  43604.65  5415.31  1.01  22935.78  7749.15  1.43 
N=1000, p=4  0.43  5613.77  804.61  1.06  2751.28  974.50  1.21 
N=100, p=16  0.31  12386.46  1298.40  0.94  6134.97  1932.29  1.49 
N=1000, p=16  0.43  1271.83  182.56  1.02  625.94  212.29  1.16 
N=30, p=50  0.01  8581.24  14.45  0.01  3755.63  13.80  0.96 
LMLasso  
N=100, p=16, z=0  0.33  10881.39  905.68  0.33  17730.50  1475.74  1.63 
N=1000, p=16, z=0  0.44  1219.59  132.65  0.44  2129.02  231.57  1.75 
N=1000, p=30, z=2  0.41  552.30  56.81  0.41  942.42  96.94  1.71 
N=1000, p=30, z=15  0.42  540.51  56.91  0.42  941.97  99.17  1.74 
N=1000, p=30, z=28  0.42  541.01  56.27  0.42  970.73  100.97  1.79 
N=1000, p=100, z=2  0.36  77.75  7.06  0.36  141.22  12.83  1.82 
N=1000, p=100, z=50  0.37  74.89  6.89  0.37  141.32  13.01  1.89 
N=1000, p=100, z=98  0.39  74.78  7.37  0.39  142.60  14.05  1.91 
The “BFG” columns give results from running the cases the same way BFG did, we think. The “ESS/Ns” is the same as their $\varepsilon_{\beta}$. ESS is averaged for the beta parameters. Ns is the number of saved samples, after burn-in and thinning. Their code gives different choices of burn-in and saved iterations for the different cases, and we used their settings. The “Nit/t” is the total number of iterations (including burn-in) divided by total computation time. The final column, which BFG don’t give, is “ESS/t”, what we call MCMC efficiency. The time in the denominator includes burn-in time (the same as for “Nit/t”).
The “Better code” columns give results when we write the code with matrix multiplication and, for “LMC” and “LMC Bin”, univariate priors. It is almost as efficient to write the code using an inner product for each mu[i] instead of matrix multiplication for all mu[i] together. Matrix multiplication makes sense when all of the inputs that might change (in this case, beta[i]s updated by MCMC) require all of the same likelihood contributions to be calculated from the result (in this case, all y[i]s from all mu[i]s). Either way of coding the model makes it easier for NIMBLE to sample the beta[i]s with conjugate samplers and avoids the inefficiency of putting every scalar step into the model code.
The “Better by” column gives the ratio of “ESS/t” for the “Better code” to “ESS/t” for the BFG code. This is the factor by which the “Better code” version improves upon the “BFG” version.
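As a quick check of how this column is derived, the following Python sketch recomputes the ratio for three rows of the table above. The dictionary layout and names are ours. The numbers are the “ESS/t” values copied from the table.

```python
# (BFG ESS/t, Better code ESS/t) for three rows of the table.
rows = {
    "LMC, N=100, p=4": (3738.90, 10842.00),
    "LMC, N=2000, p=30": (5.52, 62.67),
    "LMNI, N=30, p=50": (14.45, 13.80),
}

# "Better by" is the ratio of the two ESS/t values, rounded to two decimals.
better_by = {
    case: round(better / bfg, 2) for case, (bfg, better) in rows.items()
}

for case, ratio in better_by.items():
    print(case, ratio)
```

A ratio below 1, as in the LMNI case with p>n, means the rewritten code did not help for that row.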
We can see that writing better code often gives improvements of, say, 1.2-3.0 fold, and sometimes of 5-10+ fold, in ESS/time. These improvements (which came from writing the model in NIMBLE more similarly to how it was written in Stan) often put NIMBLE closer to or faster than Stan in various cases, and sometimes faster than JAGS with BFG’s version of the model. We’re sticking to NIMBLE, so we haven’t run JAGS with the better-written code to see how much it improves. Stan still shines for p>n, and JAGS is still really good at linear models. The results show that, for the first four categories (above the LMLasso results), NIMBLE also can achieve very good mixing (near 100% ESS/saved samples), with the exception of the p>n cases. BFG’s results showed worse mixing for NIMBLE in those cases.
We can also see that BFG’s computation-time results for “LMWI” (which we noted above) do appear to be really weird. In our results, that case ran somewhat slower than the LMC cases with matching N and p, but not around 40 times slower as reported by BFG. We won’t make detailed comparisons of LMWI cases because we’re not confident BFG’s results are solid for these.
As an example, take LMC, with the simplest being “N=100, p=4” and the hardest being “N=2000, p=30”, not counting the p>n case. For the simplest case, BFG report that JAGS is about 2.1 times more efficient than Stan and about 2.4 times more efficient than NIMBLE. (E.g., the 2.1 comes from (100 * 3667)/(96 * 1883), reading numbers from their Table 2.) By writing the model in the simpler, better way in NIMBLE, we see a 2.9-fold gain in efficiency. This would make NIMBLE more efficient than Stan. We did not also re-run JAGS with the better code. For the hardest case, BFG report JAGS being about 1.8 times more efficient than Stan and about 2.1 times more efficient than NIMBLE. In that case coding the model better makes NIMBLE 11.4 times more efficient, apparently more efficient than Stan and possibly than JAGS. Again, we did not run JAGS with and without the coding improvement. As a final example, in one of the middle LML cases, with N = 1000, p = 30, and 15 of those truly 0, Stan is reported by BFG to be about 3.6 times more efficient than NIMBLE. The better-coded model improves NIMBLE by about 1.7-fold, leaving it still behind Stan but only by about half as much.
We ran these comparisons on a MacBook Pro (2.4 GHz 8-Core Intel Core i9). It looks like this was roughly 5 times faster than the computer on which BFG ran.
Inspection of traceplots revealed that the traceplots for the variance in the 5th and 6th “LMC” cases had not yet converged in the “BFG” version of the model. More burn-in iterations would be needed. This goes hand-in-hand with the recognition that NIMBLE benefits from good initial values. In a real analysis, if a long burn-in was observed, a practical step would be to provide better initial values for the next run. Applied analysis always involves multiple MCMC runs as one gets things working and checked. With the “better code” version, the chains do appear to have converged.
At this point we should highlight that there isn’t only one version of NIMBLE’s MCMC performance. NIMBLE’s MCMC system is highly configurable, and its default samplers are just one possible choice among many. When putting real effort into boosting performance for hard models, we’ve seen improvements by 1-3 orders of magnitude (here, here and here). In non-conjugate cases where JAGS performs well, it is worth noting that JAGS uses a lot of slice samplers, and those can also be configured in NIMBLE. (But the cases here use lots of conjugate samplers, rather than slice samplers.)
The takeaway is that we don’t know why BFG gave Stan the benefit of matrix multiplication but didn’t do so for JAGS or NIMBLE, and doing so makes a substantial difference for NIMBLE. Also, we see more conjugacy cases to catch in our symbolic processing of model relationships.
There may be some issues with their other three examples as well, and we might work through those in future blog post(s). NIMBLE provides a lot of flexibility for configuring MCMCs in different ways (with different samplers), which means a comparison using our default configuration is just a start. Performance differences can also arise from writing the same model in different ways. We see both kinds of issues coming up for the other examples. But the AFT example gives a lot to talk about, so we’re sticking to that one here.
It turns out that NIMBLE and JAGS were put at a huge disadvantage compared to Stan, that BFG’s results from NIMBLE don’t look valid, and that there isn’t any exploration of NIMBLE’s configurability. If we make the model for NIMBLE and JAGS comparable to the model for Stan, NIMBLE does roughly 2-45 times better in various cases than what BFG reported. If we explore a simple block sampling option, NIMBLE gets a small additional boost in some cases. It’s hard to compare results exactly with what BFG report, and we are not out to re-run the full comparison including JAGS and Stan. A “back of the envelope” comparison suggests that NIMBLE is still less efficient than Stan for this example, but not nearly to the degree reported. We’re also not out to explore many sampling configurations to try for better performance in this particular example problem, but part of NIMBLE’s design is to make it easy to do so.
Before starting into the AFT models, it’s worth recognizing that software benchmarks and other kinds of performance comparisons are really hard to do well. It’s almost inevitable that, when done by developers of one package, that package gets a boost in results even if objectivity is the honest goal. That’s because package developers almost can’t help using their package effectively and likely don’t know how to use other packages as well as their own. In this case, it’s fair to point out that NIMBLE needs more care in providing valid initial values (which BFG’s code doesn’t do) and that NIMBLE’s default samplers don’t work well here, which is because this problem features heavy right tails of Weibull distributions with shape parameter < 1. For many users, that is not a typical problem. By choosing slice samplers (which JAGS often uses too) instead of NIMBLE’s default Metropolis-Hastings samplers, the mixing is much better. This issue is only relevant to the problem as BFG formulated it for JAGS and NIMBLE and goes away when we put it on par with the formulation BFG gave to Stan. In principle, comparisons by third parties, like BFG, might be more objective than those by package developers, but in this case the comparisons by BFG don’t use JAGS or NIMBLE effectively and include incorrect results from NIMBLE.
Below we try to reproduce their (invalid) results for NIMBLE and to run some within-NIMBLE comparisons of other methods. We’ll stick to their model scenarios and performance metrics. Those metrics are not the way we’ve done some published MCMC comparisons here, here and here, but using them will allow readers to interpret our results alongside theirs.
First we’ll give a brief summary of their model scenarios. Here goes.
Here’s a lightning introduction to AFT models based on Weibull distributions. These are models for time-to-event data such as a “failure.” For shape $a$ and scale $s$, the Weibull probability density function for time $t$ is
$$f_W(t \mid a, s) = \frac{a}{s} \left(\frac{t}{s}\right)^{a-1} e^{-\left(\frac{t}{s}\right)^{a}}$$
One important thing about the Weibull is that its cumulative distribution function can be written in closed form. It is:
$$F_W(t \mid a, s) = 1 - e^{-\left(\frac{t}{s}\right)^{a}}$$
The role of covariates is to accelerate or decelerate the time course towards failure, effectively stretching or shrinking the time scale for each item. Specifically, for covariate vector $x$ and coefficient vector $\beta$, define $\theta = e^{-x'\beta}$. Then the distribution of times-to-event is defined by rescaling the Weibull: $f(t \mid \theta, a, s) = \theta f_W(\theta t \mid a, s)$. This gives a Weibull with shape $a$ and scale $s/\theta$, so we have
$$f(t \mid \theta, a, s) = f_W(t \mid a, s/\theta)$$
In the code, there are two parameterizations in play. The first is $(a, s)$ as just given. This is used in Stan and could be used in NIMBLE because it supports alternative parameterizations, including that one. Given $\beta$, the scale is $\log(2)^{-1/a} e^{x'\beta}$. The second is $(a, \lambda)$ with $\lambda = s^{-a}$. This is the parameterization in the BUGS model language, so it is used in JAGS and is the default in NIMBLE. Given $\beta$, $\lambda = \log(2) e^{-a x'\beta}$.
The reason for the $\log(2)$ is that it makes the median of $t$ be 1 for any $a$ when $x'\beta = 0$. Priors are put on $a$ (alpha in the code) and $\beta$ (beta). There is no separate scale parameter. Rather, $\lambda = \log(2)$ when $x'\beta = 0$. The models are equivalent with either parameterization, and the choice shouldn’t have much impact on computational efficiency. We’re just pointing them out to follow what’s going on.
When a failure time $t$ is directly observed, its likelihood contribution is $f_W(t \mid a, s e^{x'\beta})$. When a unit hasn’t failed by its last observation, all that is known is that it lasted at least until $t$. Then its likelihood contribution is $1 - F_W(t \mid a, s e^{x'\beta})$. This is called a right-censored observation. Thus the data consist of some $t$s that are actual failure times and some $t$s that are right-censoring times.
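The relationship between the two parameterizations can be checked numerically. The sketch below is plain Python, not BFG’s or NIMBLE’s code, and every function name in it is ours. It assumes the standard Weibull forms: shape a with scale s, and the BUGS-style shape a with lambda = s^(-a). It uses an arbitrary value for the linear predictor x'beta (xb in the code) and confirms that the two forms give the same density and that the log(2) factor puts the median at 1 when x'beta = 0.

```python
import math


def weibull_pdf(t, a, s):
    """Weibull density with shape a and scale s."""
    return (a / s) * (t / s) ** (a - 1) * math.exp(-((t / s) ** a))


def weibull_cdf(t, a, s):
    """Weibull cumulative distribution function with shape a and scale s."""
    return 1.0 - math.exp(-((t / s) ** a))


def weibull_pdf_bugs(t, a, lam):
    """Weibull density in the BUGS-style (shape, lambda) parameterization."""
    return a * lam * t ** (a - 1) * math.exp(-lam * t ** a)


def scale_from_linear_predictor(a, xb):
    """Scale s = log(2)^(-1/a) * exp(x'beta)."""
    return math.log(2) ** (-1.0 / a) * math.exp(xb)


def lambda_from_linear_predictor(a, xb):
    """lambda = log(2) * exp(-a * x'beta)."""
    return math.log(2) * math.exp(-a * xb)


a, xb = 0.7, 0.3           # arbitrary illustrative values
s = scale_from_linear_predictor(a, xb)
lam = lambda_from_linear_predictor(a, xb)

# The two parameterizations describe the same distribution.
same_lambda = math.isclose(lam, s ** (-a))
same_density = math.isclose(weibull_pdf(2.0, a, s), weibull_pdf_bugs(2.0, a, lam))

# With x'beta = 0, half of the probability lies below t = 1 (median of 1).
cdf_at_one = weibull_cdf(1.0, a, scale_from_linear_predictor(a, 0.0))
print(same_lambda, same_density, round(cdf_at_one, 6))
```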
There are two ways to handle a rightcensored observation in MCMC:
The first version is marginalized relative to the second version because the probability of surviving past the censoring time integrates over the unobserved failure time without needing to sample it. Often, but not always, marginalization is computationally faster and gives better mixing, so it makes the MCMC problem easier. That’s why the comparison as set up by BFG seems like an apples-to-oranges comparison. They’ve made the problem substantially easier for Stan.
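To illustrate what marginalization means here, the following Python sketch compares the closed-form survival probability of a Weibull with a brute-force numerical integral of the density over all failure times later than the censoring time. The parameter values and function names are made up for illustration. A shape greater than 1 is used only so that a simple midpoint rule converges quickly. This is not the code used for the results in this post.

```python
import math


def weibull_pdf(t, a, s):
    """Weibull density with shape a and scale s."""
    return (a / s) * (t / s) ** (a - 1) * math.exp(-((t / s) ** a))


def weibull_survival(t, a, s):
    """Closed-form probability that the failure time exceeds t."""
    return math.exp(-((t / s) ** a))


def integrate_midpoint(f, lo, hi, n):
    """Midpoint-rule approximation of the integral of f from lo to hi."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


a, s = 1.5, 1.0      # illustrative shape and scale
censor_time = 0.8    # the unit is only known to have survived past this time

# Summing the density over every possible latent failure time > censor_time.
# The upper limit of 20 stands in for infinity; the tail beyond it is negligible.
integral_over_latent = integrate_midpoint(
    lambda t: weibull_pdf(t, a, s), censor_time, 20.0, 200_000
)
closed_form = weibull_survival(censor_time, a, s)
print(abs(integral_over_latent - closed_form) < 1e-6)  # prints: True
```

Using the closed form in the likelihood does this integration analytically, so the sampler never has to update a latent failure time.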
It’s easy to set up the marginalized version for JAGS or NIMBLE. This can be done using the “zeroes” trick in the BUGS language, which both packages use for writing models. In NIMBLE this can also be done by writing a user-defined distribution as a nimbleFunction, which can be compiled along with a model.
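For readers unfamiliar with the “zeroes” trick, the idea as it is commonly described is to declare an observed 0 as Poisson with mean equal to the negative log-likelihood plus a constant. The Poisson probability of 0 is then proportional to the desired likelihood. The Python sketch below checks that identity for a right-censored Weibull term. It is an illustration of the algebra only, with hypothetical names and values, and is not the BUGS code used in our comparisons.

```python
import math


def poisson_prob_of_zero(mean):
    """Probability that a Poisson random variable with this mean equals 0."""
    return math.exp(-mean)


def weibull_log_survival(t, a, s):
    """Log of the probability that a Weibull(shape a, scale s) exceeds t."""
    return -((t / s) ** a)


# Hypothetical censoring time and parameter values.
t_censored, a, s = 2.0, 0.7, 1.5

# Constant that keeps the Poisson mean positive for any likelihood value.
# (A survival probability is at most 1, so it is not strictly needed here,
# but a density can exceed 1, and the same trick is used for densities.)
C = 10.0

log_lik = weibull_log_survival(t_censored, a, s)
poisson_mean = -log_lik + C

# The "observed zero" contributes exp(log_lik) * exp(-C) to the likelihood,
# which is the desired term up to a constant factor.
contribution = poisson_prob_of_zero(poisson_mean)
desired_up_to_constant = math.exp(log_lik) * math.exp(-C)
print(math.isclose(contribution, desired_up_to_constant))  # prints: True
```

Because the factor exp(-C) does not depend on any parameter, it has no effect on the posterior.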
BFG included the following scenarios:
Now that we’ve set up the background, we are ready to list some of the issues with BFG’s comparisons that are worth exploring. For the computational experiments below, we decided to limit our efforts to NIMBLE because we are not trying to redo BFG’s full analysis. Here are the main issues.
Inf (or possibly NA), which causes trouble for the samplers. Inspection of the model after MCMC runs using BFG’s code shows that even after 10000 iterations, the model likelihood is Inf, so the results are invalid. It’s fair to say this is an issue in how to use NIMBLE, but it’s confusing to include invalid results in a comparison.
inprod).
set.seed so that others can generate identical data sets, but it looks like each package was given different simulated data sets. It can happen that MCMC performance depends on the data set. While this might not be a huge issue, we prefer below to give each package the same, reproducible, data sets. Another issue is that looking at average effective sample size across parameters can be misleading because one wants all parameters mixed well, not some mixed really well and others mixed poorly. But in these examples the parameters compared are all regression-type coefficients that play similar roles in the model, and the averaging doesn’t look like a huge issue. Finally, BFG decline to report ESS/time, preferring instead to report ESS and time and let readers make sense of them. We see ESS/time as the primary metric of interest, the number of effectively independent samples generated per second, so we report it below. This gives a way to see how both mixing (ESS) and computation time contribute to MCMC performance.
We use BFG’s code but modify it to organize it into functions and make it reproducible. The source files for this document include code chunks to run and save results. We are not running JAGS or Stan because we are not trying to reproduce a full set of comparisons. Instead we are looking into NIMBLE’s performance for this example. Since the main issue is that BFG gave NIMBLE and JAGS harder models than they gave Stan, we fix this in a way that is not NIMBLE-specific and should also work for JAGS.
Here is a summary of what the code does:
Although the original method of BFG seems to give invalid results, we include it so we can try to roughly compare performance (shown below) against what they report. However, one difficulty is that processing with Inf and NaN values can be substantially slower than processing with actual numbers, and these issues might differ across systems.
Results here are run on a MacBook Pro (2019), with 2.4 GHz 8-Core Intel Core i9, and OS X version 11.6.
Here are the results, in a table that roughly matches the format of BFG’s Table 5. “Perc” is the average fraction of observations that are rightcensored.
As best as we can determine:
We used coda::effectiveSize to estimate ESS. We did not see in their code what method they used. This is another reason we can’t be sure how to compare our results to theirs.
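Case  BFG (invalid): ESS/Ns  BFG (invalid): Nit/t  BFG (invalid): ESS/t  BFG+inits+slice: ESS/Ns  BFG+inits+slice: Nit/t  BFG+inits+slice: ESS/t  Marginal: ESS/Ns  Marginal: Nit/t  Marginal: ESS/t  Marginal+blocks: ESS/Ns  Marginal+blocks: Nit/t  Marginal+blocks: ESS/t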
Perc = 0.2  
N=100, p = 4, perc = 0.2  0.27  6844.63  465.80  0.52  2325.58  300.65  0.39  9775.17  951.09  0.27  16233.77  1109.06 
N=1000, p = 4, perc = 0.2  0.30  1127.27  84.71  0.55  306.22  41.83  0.41  1527.88  157.65  0.28  2490.04  171.47 
N=100, p = 16, perc = 0.2  0.19  3423.49  161.60  0.36  949.49  84.94  0.27  3717.47  248.99  0.29  5621.14  408.77 
N=1000, p = 16, perc = 0.2  0.08  404.22  7.80  0.57  98.86  14.16  0.41  591.82  61.12  0.30  1100.47  83.33 
Perc = 0.5  
N=100, p = 4, perc = 0.5  0.05  7262.16  98.39  0.08  2572.68  54.45  0.38  10214.50  960.31  0.26  15060.24  990.34 
N=1000, p = 4, perc = 0.5  0.10  1106.32  26.96  0.10  298.23  7.25  0.44  1987.28  219.92  0.26  3074.09  196.19 
N=100, p = 16, perc = 0.5  0.06  3411.80  52.07  0.21  940.56  49.94  0.23  3955.70  229.94  0.28  5854.80  415.89 
N=1000, p = 16, perc = 0.5  0.07  339.29  5.88  0.07  95.90  1.66  0.41  601.90  61.98  0.31  1074.58  83.07 
Perc = 0.8  
N=100, p = 4, perc = 0.8  0.03  6761.33  51.99  0.02  2297.79  10.79  0.24  9842.52  602.28  0.20  15151.52  763.36 
N=1000, p = 4, perc = 0.8  0.02  1013.27  5.16  0.02  265.58  1.50  0.39  1831.50  180.50  0.25  2856.33  176.27 
N=100, p = 16, perc = 0.8  0.04  3412.97  33.45  0.03  876.96  6.74  0.17  3853.56  166.26  0.23  5820.72  329.18 
N=1000, p = 16, perc = 0.8  0.01  395.99  1.22  0.05  95.33  1.22  0.39  560.54  54.91  0.29  1016.57  72.55 
The leftmost set of results (“BFG (invalid)”) is comparable to the rightmost (“NIMBLE”) column of BFG’s Table 5, in the same row order for their first 12 rows. The simulated data sets are different. For that reason and the stochasticity of Monte Carlo methods, we shouldn’t expect to see exactly matching values. And of course the computations were run on different systems, resulting in different times. Again, these results are invalid.
The next column (“BFG+inits+slice”) gives results when BFG’s model formulation for JAGS and NIMBLE is combined with valid initialization and slice sampling in NIMBLE. We can see that valid sampling generally gives lower ESS/time than the invalid results.
The next column shows results when the problem is set up as BFG gave it to Stan, and NIMBLE’s default samplers are used. If we assume the leftmost results are similar to what BFG report, but with times from the system used here, then the boost in performance is the ratio of ESS/time between methods. For example, in the last row, the marginal method is 54.91/1.22 = 45.01 times more efficient than what BFG reported. We can make a similar kind of ratio between Stan and NIMBLE from BFG’s results, which gave Stan as about 380 times more efficient than NIMBLE (although rounding error for “1%” could be a substantial issue here). Putting these together, Stan might really be about 8.4 times more efficient than NIMBLE for this case, which is the hardest case considered.
The last column shows results of the single experiment with alternative (block) samplers that we tried. In many cases, it gives a modest additional boost. Often with more work one can find a better sampling strategy, which can be worth the trouble for extended work with a particular kind of model. In the last row of our results, this gives about another 72.55 / 54.91 = 1.32 boost in performance, lowering the ratio to Stan to about 6.4. Again, we decided to limit this post to within-NIMBLE comparisons, and the comparisons to Stan based on BFG’s results should be taken with a grain of salt because we didn’t re-run them.
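The back-of-the-envelope chain of ratios for the last row of the table can be written out as a short Python calculation. The variable names are ours. The ESS/t values come from the table, and the factor of 380 is the approximate Stan-to-NIMBLE ratio read from BFG’s results as described above.

```python
# ESS/t for the last row (N=1000, p = 16, perc = 0.8) of the table.
ess_t_bfg_invalid = 1.22
ess_t_marginal = 54.91
ess_t_marginal_blocks = 72.55

# Approximate Stan-to-NIMBLE efficiency ratio taken from BFG's results.
stan_over_nimble_reported = 380

# Gain from giving NIMBLE the same (marginalized) problem as Stan.
boost_marginal = ess_t_marginal / ess_t_bfg_invalid
# Additional gain from the block samplers.
boost_blocks = ess_t_marginal_blocks / ess_t_marginal

# Rough Stan-to-NIMBLE ratios after each improvement.
stan_over_marginal = stan_over_nimble_reported / boost_marginal
stan_over_blocks = stan_over_marginal / boost_blocks

print(round(boost_marginal, 2), round(boost_blocks, 2))      # prints: 45.01 1.32
print(round(stan_over_marginal, 1), round(stan_over_blocks, 1))  # prints: 8.4 6.4
```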
In summary, it looks like BFG gave Stan a different and easier accelerated failure time problem than they gave NIMBLE and JAGS. When given the same problem as they gave Stan, NIMBLE’s default samplers perform around 2 to 45 times better than what BFG reported.