Chapter 12 Creating user-defined distributions and functions for models
NIMBLE allows you to define your own functions and distributions as nimbleFunctions for use in model code. As a result, NIMBLE frees you from being constrained to the functions and distributions discussed in Chapter 5. For example, instead of setting up a Dirichlet prior with multinomial data and needing to use MCMC, one could recognize that this results in a Dirichlet-multinomial distribution for the data and provide that as a user-defined distribution instead.
Since NIMBLE allows you to wrap calls to external compiled code or arbitrary R functions as nimbleFunctions, and since you can define model functions and distributions as nimbleFunctions, you can combine these features to build external compiled code or arbitrary R functions into a model. See Sections 11.6-11.7.
As of version 1.2.0, NIMBLE also supports using more advanced nimbleFunctions, those with setup
code and possibly multiple methods (see Chapter 15), to provide functions and/or distributions for use in models. This allows the functions and/or distributions to store information internally. See section 12.3 below.
Note that NIMBLE generally expects user-defined distributions or functions to be defined in the global environment. If you define them in a function (which would generally be the case if you are using them in the context of parallelization), one approach would be to assign them to the global environment in your function:
## for a user-defined function
assign('myfun', myfun, envir = .GlobalEnv)
## for a user-defined distribution
assign('dfoo', dfoo, envir = .GlobalEnv)
assign('rfoo', rfoo, envir = .GlobalEnv)
## similarly for 'p' and 'q' functions if you define them
12.1 User-defined functions
To provide a new function for use in BUGS code, simply create a nimbleFunction as discussed in Chapter 11. Then use it in your BUGS code. That’s it.
Writing nimbleFunctions requires that you declare the dimensionality of arguments and the returned object (Section 11.4). Make sure that the dimensionality specified in your nimbleFunction matches how you use it in BUGS code. For example, if you define scalar parameters in your BUGS code you will want to define nimbleFunctions that take scalar arguments. Here is an example that returns twice its input argument:
timesTwo <- nimbleFunction(
run = function(x = double(0)) {
returnType(double(0))
return(2*x)
})
code <- nimbleCode({
for(i in 1:3) {
mu[i] ~ dnorm(0, 1)
mu_times_two[i] <- timesTwo(mu[i])
}
})
The x = double(0)
argument and returnType(double(0))
establish that the input and output will both be zero-dimensional
(scalar) numbers.
You can define nimbleFunctions that take inputs and outputs with more dimensions. Here is an example that takes a vector (1-dimensional) as input and returns a vector with twice the input values:
vectorTimesTwo <- nimbleFunction(
run = function(x = double(1)) {
returnType(double(1))
return(2*x)
}
)
code <- nimbleCode({
for(i in 1:3) {
mu[i] ~ dnorm(0, 1)
}
mu_times_two[1:3] <- vectorTimesTwo(mu[1:3])
})
There is a subtle difference between the mu_times_two
variables in the two examples. In the first example, there are individual nodes for each mu_times_two[i]
. In the second example, there is a single multivariate node, mu_times_two[1:3]
. Each implementation could be more efficient for different needs. For example, suppose an algorithm modifies the value of mu[2]
and then updates nodes that depend on it. In the first example, mu_times_two[2]
would be updated. In the second example mu_times_two[1:3]
would be updated because it is a single, vector node.
At present in compiled use of a model, you cannot provide a scalar argument where the user-defined nimbleFunction expects a vector; unlike in R, scalars are not simply vectors of length 1.
12.2 User-defined distributions
To provide a user-defined distribution, you need to define density
(‘d’) and (optionally)
simulation (‘r’) nimbleFunctions
for your distribution. In many cases you can then
simply use your distribution in BUGS code as you would any
distribution already provided by NIMBLE, while in a few special cases24
you need to explicitly register your distribution as described in Section 12.2.1.
You need to provide the simulation (‘r’) function if any algorithm used with a model that uses the distribution needs to simulate from the distribution. This is not the case for NIMBLE’s built-in MCMC sampler functions, and therefore the simulation function is not generally required for standard MCMC in NIMBLE. However, the ‘r’ function is necessary for initialization of nodes that are assigned the user-defined distribution if no initial value is provided, and for sampling posterior predictive nodes (those nodes with no downstream data dependencies) that are assigned the user-defined distribution.
You can optionally provide distribution (‘p’) and quantile (‘q’) functions, which will allow truncation to be applied to a user-defined distribution. You can also provide a list of alternative parameterizations, but only if you explicitly register the distribution.
Here is an extended example of providing a univariate exponential distribution (solely for illustration as this is already provided by NIMBLE) and a multivariate Dirichlet-multinomial distribution.
dmyexp <- nimbleFunction(
run = function(x = double(0), rate = double(0, default = 1),
log = integer(0, default = 0)) {
returnType(double(0))
logProb <- log(rate) - x*rate
if(log) return(logProb)
else return(exp(logProb))
})
rmyexp <- nimbleFunction(
run = function(n = integer(0), rate = double(0, default = 1)) {
returnType(double(0))
if(n != 1) print("rmyexp only allows n = 1; using n = 1.")
dev <- runif(1, 0, 1)
return(-log(1-dev) / rate)
})
pmyexp <- nimbleFunction(
run = function(q = double(0), rate = double(0, default = 1),
lower.tail = integer(0, default = 1),
log.p = integer(0, default = 0)) {
returnType(double(0))
if(!lower.tail) {
logp <- -rate * q
if(log.p) return(logp)
else return(exp(logp))
} else {
p <- 1 - exp(-rate * q)
if(!log.p) return(p)
else return(log(p))
}
})
qmyexp <- nimbleFunction(
run = function(p = double(0), rate = double(0, default = 1),
lower.tail = integer(0, default = 1),
log.p = integer(0, default = 0)) {
returnType(double(0))
if(log.p) p <- exp(p)
if(!lower.tail) p <- 1 - p
return(-log(1 - p) / rate)
})
ddirchmulti <- nimbleFunction(
run = function(x = double(1), alpha = double(1), size = double(0),
log = integer(0, default = 0)) {
returnType(double(0))
logProb <- lgamma(size+1) - sum(lgamma(x+1)) + lgamma(sum(alpha)) -
sum(lgamma(alpha)) + sum(lgamma(alpha + x)) -
lgamma(sum(alpha) + size)
if(log) return(logProb)
else return(exp(logProb))
})
rdirchmulti <- nimbleFunction(
run = function(n = integer(0), alpha = double(1), size = double(0)) {
returnType(double(1))
if(n != 1) print("rdirchmulti only allows n = 1; using n = 1.")
p <- rdirch(1, alpha)
return(rmulti(1, size = size, prob = p))
})
code <- nimbleCode({
y[1:K] ~ ddirchmulti(alpha[1:K], n)
for(i in 1:K) {
alpha[i] ~ dmyexp(1/3)
}
})
model <- nimbleModel(code, constants = list(K = 5, n = 10))
The distribution-related functions should take as input the parameters for a single parameterization, which will be the canonical parameterization that NIMBLE will use.
Here are more details on the requirements for distribution-related nimbleFunctions, which follow R’s conventions:
- Your distribution-related functions must have names that begin with ‘d’, ‘r’, ‘p’ and ‘q’. The name of the distribution must not be identical to any of the NIMBLE-provided distributions.
- All simulation (‘r’) functions must take
n
as their first argument. Note that you may simply have your function only handlen=1
and return an warning for other values ofn
. - NIMBLE uses doubles for numerical calculations, so we suggest simply using doubles in general, even for integer-valued parameters or values of random variables. In fact, non-scalars must be declared as doubles.
- All density functions must have as their last argument
log
and implement return of the log probability density. NIMBLE algorithms typically use onlylog = 1
(i.e., TRUE), but we recommend you implement thelog = 0
(i.e., FALSE) case for completeness. - All distribution and quantile functions must have their last two arguments be (in order)
lower.tail
andlog.p
. These functions must work forlower.tail = 1
(i.e., TRUE) andlog.p = 0
(i.e., FALSE), as these are the inputs we use when working with truncated distributions. It is your choice whether you implement the necessary calculations for other combinations of these inputs, but again we recommend doing so for completeness. - Define the nimbleFunctions in R’s global environment. Don’t expect R’s standard scoping to work25.
12.2.1 Using registerDistributions for alternative parameterizations and providing other information
Behind the scenes, NIMBLE uses the function
registerDistributions
to set up new distributions for use in BUGS
code. In some circumstances, you will need to call
registerDistributions
directly to provide information that NIMBLE
can’t obtain automatically from the nimbleFunctions you write.
The cases in which you’ll need to explicitly call
registerDistributions
are when you want to do any of the following:
- provide alternative parameterizations,
- indicate a distribution is discrete, and
- provide the range of possible values for a distribution.
If you would like to allow for
multiple parameterizations, you can do this via the Rdist
element
of the list provided to registerDistributions
as illustrated below. If
you provide CDF (‘p’) and inverse CDF (quantile, i.e. ‘q’)
functions, be sure to specify pqAvail = TRUE
when you call
registerDistributions
. Here’s an example of using registerDistributions
to provide an alternative parameterization (scale instead of rate) and to provide the range
for the user-defined exponential distribution. We can then use
the alternative parameterization in our BUGS code.
registerDistributions(list(
dmyexp = list(
BUGSdist = "dmyexp(rate, scale)",
Rdist = "dmyexp(rate = 1/scale)",
altParams = c("scale = 1/rate", "mean = 1/rate"),
pqAvail = TRUE,
range = c(0, Inf)
)
))
code <- nimbleCode({
y[1:K] ~ ddirchmulti(alpha[1:K], n)
for(i in 1:K) {
alpha[i] ~ T(dmyexp(scale = 3), 0, 100)
}
})
model <- nimbleModel(code, constants = list(K = 5, n = 10),
inits = list(alpha = rep(1, 5)))
There are a few rules for how you specify the information about a
distribution that you provide to registerDistributions
:
- The function name in the
BUGSdist
entry in the list provided toregisterDistributions
will be the name you can use in BUGS code. - The names of your nimbleFunctions must match the function name in the
Rdist
entry. If missing, theRdist
entry defaults to be the same as theBUGSdist
entry. - Your distribution-related functions must take as arguments the
parameters in default order, starting as the second argument and in the order used in the parameterizations in the
Rdist
argument toregisterDistributions
or theBUGSdist
argument if there are no alternative parameterizations. - You must specify a
types
entry in the list provided toregisterDistributions
if the distribution is multivariate or if any parameter is non-scalar.
Further details on using registerDistributions
can be found via R help on registerDistributions
. NIMBLE uses the same list format as registerDistributions
to define its distributions. This list can be found in the R/distributions_inputList.R
file in the package source code directory or as the R list nimble:::distributionsInputList
.
12.3 Advanced user-defined functions and distributions
Sometimes it is useful for a user-defined function or distribution to use
external information and/or hold variables that persist between calls. One can
do this by creating nimbleFunctions that use setup
code and have one or more
methods. If you are familiar with object-oriented programming, this variety of
nimbleFunction defines a class. Here we will give a toy example that is not
fully explained because the NIMBLE programming concepts are covered in Chapter
15. After studying that chapter, the example here
should make more sense.
As a toy example, consider this basic linear regression model with one explanatory variable:
code <- nimbleCode({
beta0 ~ dnorm(0, sd = 100)
beta1 ~ dnorm(0, sd = 100)
sigma ~ dhalfflat()
for(i in 1:n) {
y[i] ~ dnorm(beta0 + beta1*x[i], sd = sigma)
}
})
# Simulate some data:
x <- runif(10)
y <- rnorm(10, 3 + 0.5 * x, sd = 0.2)
model <- nimbleModel(code, constants = list(n = 10, x = x),
data = list(y = y),
inits = list(beta0=3.1,beta1=0.4,sigma=0.3))
Now suppose we want to rewrite this model in an equivalent way by providing a “distribution” that actually contains all the steps to calculate the log likelihood for the linear regression internally (for all of y[1:10]
at once), including that it contains the explanatory variable x
. We could do that as follows:
nimbleOptions(allowNFobjInModel=TRUE) # This option should be TRUE by default,
# but we include this step to be sure.
linear_reg <- nimbleFunction(
setup = function(predictor) {}, # Call x "predictor" here for clarity
methods = list(
dcalc = function(x = double(1), # Note that this will be y!
beta0=double(), beta1=double(), sigma=double(),
log=integer(0, default=0)) {
llh <- sum(dnorm(x, beta0+beta1*predictor, sd=sigma, log=TRUE))
if(log) return(llh)
return(exp(llh))
returnType(double())
}))
my_linear_reg <- linear_reg(x)
code2 <- nimbleCode({
beta0 ~ dnorm(0, sd = 100)
beta1 ~ dnorm(0, sd = 100)
sigma ~ dhalfflat()
y[1:n] ~ my_linear_reg$dcalc(beta0, beta1, sigma)
})
model <- nimbleModel(code2, constants = list(n = 10),
data = list(y = y),
inits = list(beta0=3.1,beta1=0.4,sigma=0.3))
What is going on, and why would one even think of doing it this way? linear_reg
here is defined as a nimbleFunction to create objects that will hold values of x
internally and use them when the dcalc
function (or “method”) is called. my_linear_reg
is one instance of linear_reg
that holds the specific values of x
provided. The method dcalc
has the format (arguments, return type, and prefix “d”) of a user-defined distribution, so it can be used that way via my_linear_reg$dcalc
. One can provide a corresponding rcalc
method, but as above this is only required if it will actually be needed by an algorithm. The suffix “calc” could be replaced with anything, but a user-defined distribution must have the prefix “d” (and can even be named just “d
”), and a corresponding simulation must begin with “r”. Any method in an object like my_linear_reg
can be used as a user-defined function, and the same nimbleFunction (linear_reg
) can define multiple user-defined functions and/or distributions.
(Note the deliberate illustration of a potential confusion: The random variable argument to a user-defined distribution is always named “x
”, which in this example will get the values of y[1:10]
.)
Now, why would one want to do this? There are several reasons. First, sometimes instead of a short vector like x
, there may be quite large data objects involved in some calculation. It can be wasteful (of time and memory when building and compiling models) to provide large objects in models for the sole purpose of passing them to user-defined functions or distributions. It can work well to access them internally as shown with x
in this example. One can also consider caching intermediate values from costly calculations or other tricks to speed up calculations.
It is vital when using this approach to understand clearly how an algorithm will use a model, and there is potential to create problems. For example, NIMBLE’s MCMC system uses a set of samplers, each of which assumes the model is fully up-to-date when it begins its operations. “Up-to-date” means that model variables all have the correct current values and all deterministic and stochastic calculations are up-to-date with those values. Sometimes, samplers copy values around to achieve this. However, a sampler has no way to know what variables a nimbleFunction object (e.g. my_linear_reg
) is managing internally, and thus has no way to keep them up to date, if necessary. Techniques such as caching or “memoization” should be careful in this regard.
One can also register a distribution in a nimbleFunction object manually, for the reasons given above. The name of the distribution is simply the string “my_linear_reg$dcalc”, to use the current example.
These include providing alternative parameterizations, specifying the range of the distribution, or specifying that the distribution is a discrete distribution.↩︎
NIMBLE can’t use R’s standard scoping because it doesn’t work for R reference classes, and nimbleFunctions are implemented as custom-generated reference classes.↩︎