Adding a user-defined distribution to the BUGS language

In previous implementations of the BUGS language (WinBUGS, OpenBUGS, and JAGS), there is no way to extend the language with new distributions or functions (other than diving into the low-level source code). NIMBLE tries to make it easy to add distributions and functions and then to share your code with others.

In this example, we’ll add a user-defined distribution in such a way that it can also be used with derivative-based (AD-based) algorithms.

Example: zero-inflated Poisson distribution

In a variety of situations, scientists consider the possibility that some measurements will yield zeros for reasons separate from other considerations. For example, ecologists counting plants or animals of a particular species at many sites may get a zero when the species is absent but get a Poisson-distributed count when it is present. The explanatory variables they wish to consider as influencing the probability of a so-called structural zero (the species is absent) may be different from those influencing the abundance when it is present. This is done by having one parameter representing the probability of a structural zero and another parameter representing the Poisson mean. Each parameter may itself be predicted by other variables, and a zero in the data could result from either a structural zero or a Poisson sample that just happens to be zero. Similar ideas are used for zero-inflated binomial and zero-inflated negative binomial models. Hurdle models represent a subtly different approach, in which the probability of a zero is mixed with the zero-truncated probability of non-zeros.

Since such distributions were not built into earlier dialects of BUGS, users of that language have made use of a couple of indirect techniques to accomplish them. One approach is to add a Bernoulli (0/1) latent variable to the model for each observed zero. The latent variable indicates whether the corresponding data value is a structural or non-structural zero, and the MCMC must sample it. The other approach is to use a trick for providing an arbitrary log-likelihood to BUGS (for reasons unrelated to zero-inflation, this is sometimes called the “zero trick” – are you confused yet?). The former approach is more general but adds computational cost since the MCMC must sample the latent variables. The latter trick avoids the computational cost but is less general because every time one wants to consider a different distribution one must be sure to get the log likelihood correct, which could become cumbersome and difficult for others to follow.

There are plenty of other situations where one may wish to add a distribution to their model that was not envisioned in earlier dialects of BUGS.

Adding a new distribution to BUGS in NIMBLE

We won’t labor through the indirect techniques mentioned for implementing zero inflation. Also, we won’t do this example in an interesting real model because all the other model components would get in the way. Instead we’ll just show a toy example. Say we have \(N\) observations, \(y[i], i = 1 \ldots N\), each from a zero-inflated Poisson. The parameters are the probability \(p\) of a structural zero and the mean \(\lambda\) of counts that are not structural zeros (but may be non-structural zeros).

We’ll call the density function for our new zero-inflated Poisson distribution “dZIP”.

Of course we need to load the package first:

library(nimble, warn.conflicts = FALSE)

## nimble version 1.3.0 is loaded.
## For more information on NIMBLE and a User Manual,
## please visit https://R-nimble.org.
## 
## Note for advanced users who have written their own MCMC samplers:
##   As of version 0.13.0, NIMBLE's protocol for handling posterior
##   predictive nodes has changed in a way that could affect user-defined
##   samplers in some situations. Please see Section 15.5.1 of the User Manual.

Write the model to use the dZIP distribution we plan to write

The BUGS code for a toy model with our user-defined distribution is:

ZIPcode <- nimbleCode({
  p ~ dunif(0,1)   # Probability of presence.
  lambda ~ dunif(0,10)
  for (i in 1:N)
    y[i] ~ dZIP(lambda, zeroProb = 1-p) # Note NIMBLE allows R-like named-parameter syntax.
})

Here there is nothing more than the parameters \(p\) and \(lambda\), the data \(y\), and the constant \(N\).

Write our dZIP distribution

Before we can use this code to build a model, we need to define dZIP. We do that with a nimbleFunction, which is a lot like an R function.

dZIP <- nimbleFunction(
 run = function(x = double(), lambda = double(),
                zeroProb = double(), log = logical(0, default = 0)) {
   returnType(double())
   ## For use with AD, we cannot use an `if` statement to handle the mixture.
   prob <- zeroProb * dbinom(x, size = 1, prob = 0) + (1 - zeroProb) * dpois(x, lambda)
   if (log) return(log(prob))
     return(prob)
   },
   buildDerivs = 'run'   # Needed when used with AD-based algorithms.
 )

This example doesn’t include a full introduction to nimbleFunctions, but here are some brief points:

A nimbleFunction allows only a small subset of R along with functions that can use models.
A nimbleFunction is compilable, which means NIMBLE will generate and compile C++ from it. It can also be executed in R prior to compilation for testing purposes.
The double(), integer() and logical() notation means those arguments are scalar doubles, integers, and logicals (TRUE or FALSE), respectively.
The returnType(double()) means this function will return a scalar double.
Everything else here is like R.
In a user-defined distribution, we must define a log argument indicating whether to return the probability or the log probability, as is standard in R’s distribution functions. Also, the first argument, representing a value of the random variable, must be named “x”, which again is like R’s distribution functions.
To use the distribution in a model with a derivative-based algorithm, we need to add buildDerives = 'run' as an argument to nimbleFunction().

You are not required to provide an “r” function (for example, rZIP) for random number generation unless any algorithms you use would need to generate from the distribution. Two general situations in which the “r” function is necessary are for model initialization when no initial value is provided for a node following the user-defined distribution and for sampling posterior predictive nodes (those nodes with no downstream data dependencies) that follow the user-defined distribution.

For completeness, we show the “r” function here.

rZIP <- nimbleFunction(
 run = function(n = integer(), lambda = double(), zeroProb = double()) {
   returnType(double())
   isStructuralZero <- rbinom(1, prob = zeroProb, size = 1)
   if (isStructuralZero) return(0)
   return(rpois(1, lambda))
})

A brief note on this:

The first argument, n, is the number of random draws you want, but at this moment in NIMBLE’s development the n argument isn’t really used. We require and assume n = 1. It is there for compatibility with the standard argument list of “r” functions in R and for future implementation.

(Optional) Register dZIP with NIMBLE

NIMBLE will automatically find your distribution without you doing anything else. In certain cases, you may need to use the registerDistributions function first to tell NIMBLE some information about your function that NIMBLE does not determine automatically. In this case, using registerDistributions is not needed, but we’ll illustrate its use for reference.

registerDistributions(list(
    dZIP = list(
        BUGSdist = "dZIP(lambda, zeroProb)",
        discrete = TRUE,
        range = c(0, Inf),
        types = c('value = integer()', 'lambda = double()', 'zeroProb = double()')
     )))

Build the model and simulate some data

Now we are ready to build and compile the model. We’ll also use the model to generate a data set and then run MCMC on it.

set.seed(1)
ZIPmodel <- nimbleModel(ZIPcode, constants = list(N = 100),
                        buildDerivs = TRUE) ## Needed when used with AD-based algorithms.

## Defining model

## Building model

## Running calculate on model
##   [Note] Any error reports that follow may simply reflect missing values in model variables.

## Checking model sizes and dimensions

##   [Note] This model is not fully initialized. This is not an error.
##          To see which variables are not initialized, use model$initializeInfo().
##          For more information on model initialization, see help(modelInitialization).

ZIPmodel$p <- 0.6             ## Choose values of p and lambda
ZIPmodel$lambda <- 1.8
ZIPmodel$simulate(ZIPmodel$getDependencies(c("p", "lambda"), self=FALSE)) ## Simulate `y`.
simulatedData <- ZIPmodel$y
simulatedData

##   [1] 1 4 4 0 0 0 1 2 3 2 0 3 0 2 1 0 3 2 2 3 0 0 2 3 0 0 2 0 2 0 0 3 1 0 2 0 4
##  [38] 1 2 2 0 3 3 1 4 0 3 0 2 1 0 2 1 1 2 0 0 0 1 0 0 0 1 0 0 0 4 5 0 1 0 0 1 0
##  [75] 0 2 3 2 1 2 0 0 0 2 5 2 0 1 0 1 0 3 0 1 2 2 1 1 0 3

ZIPmodel$setData(list(y = simulatedData))  ## Set those values as data in the model.
cZIPmodel <- compileNimble(ZIPmodel)       ## Compile the model.

## Compiling
##   [Note] This may take a minute.
##   [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.

Set up and run MCMC

For info on how to run MCMC in NIMBLE, see other examples. Here we’ll use buildHMC to set up an HMC sampler, but we could use buildMCMC to set up NIMBLE’s default sampling.

nsamples <- 2000
library(nimbleHMC)
ZIPmcmc <- buildHMC(ZIPmodel, control = list(warmup = nsamples/2))

## ===== Monitors =====
## thin = 1: lambda, p
## ===== Samplers =====
## NUTS sampler (1)
##   - p, lambda

cZIPmcmc <- compileNimble(ZIPmcmc, project = ZIPmodel)

## Compiling
##   [Note] This may take a minute.
##   [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.

samples <- runMCMC(cZIPmcmc, niter = nsamples)

## running chain 1...

##   [Note] NUTS sampler (nodes: p, lambda) is using 1000 warmup iterations.
##          Since `warmupMode` is 'default' and `nburnin` = 0,
##          the number of warmup iterations is equal to `niter/2`.
##          No samples will be discarded, so the first half of the samples returned
##          are from the warmup period, and the second half of the samples are post-warmup.
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|

Let’s look at a summary and trace plots to see if everything looks reasonable. As stated in the message, the second half are the valid HMC samples, although in this case they all look similar.

summary(samples[(nsamples/2 + 1):nsamples, ])

##      lambda            p         
##  Min.   :1.283   Min.   :0.5271  
##  1st Qu.:1.655   1st Qu.:0.6698  
##  Median :1.787   Median :0.7133  
##  Mean   :1.792   Mean   :0.7134  
##  3rd Qu.:1.923   3rd Qu.:0.7534  
##  Max.   :2.519   Max.   :0.9212

plot(samples[,'lambda'], type = 'l', main = 'lambda trace plot')

plot(samples[,'p'], type = 'l', main = 'p trace plot')

It worked!