Chapter 16 Automatic Derivatives
As of version 1.0.0, NIMBLE can automatically provide numerically accurate derivatives of potentially arbitrary order for most calculations in models and/or nimbleFunctions. This feature enables methods such as Hamiltonian Monte Carlo (HMC, see package nimbleHMC
), Laplace approximation, and fast optimization with methods that use function gradients.
Automatic (or algorithmic) differentiation (AD) refers to the method of carrying derivative information through a set of mathematical operations. When done this way, derivatives are numerically accurate to the precision of the computer. This is distinct from finite difference methods, used by R packages such as numDeriv
(Gilbert and Varadhan 2019) and pracma
(Borchers 2022), which approximate derivatives by calculating function values at extremely nearby points. Finite difference methods are slower and less accurate than AD. It is also distinct from writing separate functions for each derivative, which can become very complicated and sometimes slower than AD. NIMBLE uses the CppAD package (Bell 2022) as its AD engine, following TMB’s success in doing so (Kristensen et al. 2016). A general reference on AD is Griewank and Walther (2008).
Using a packaged AD algorithm should be as simple as setting buildDerivs=TRUE
in the model. On the other hand, writing new algorithms (as nimbleFunctions) that use AD requires understanding how the AD system works internally, including what can go wrong. Calls to compileNimble
that include AD features will result in slower C++ compilation.
We will introduce NIMBLE’s AD features step by step, from simply turning them on for a model to using them in your own nimbleFunctions. We will show you:
- how to turn on derivatives in a model and use them in Laplace approximation (for Hamiltonian Monte Carlo (HMC), see 7.11.2).
- how to modify user-defined functions and distributions to support derivatives.
- what functions are supported and not supported for derivatives.
- basics of obtaining derivatives in your own algorithms written as nimbleFunctions.
- advanced methods of obtaining derivatives in nimbleFunctions, including double-taping.
- how to get derivatives involving model calculations.
- automatic parameter transformations to give any model an unconstrained parameter space for algorithms to work in.
- an example showing use of nimble’s derivatives for maximum likelihood estimation.
First, make sure to set the option to enable derivative features. This should be TRUE
by default, but just in case:
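nimbleOptions(enableDerivs = TRUE) # should already be TRUE by default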
16.1 How to turn on derivatives in a model
To allow algorithms to use automatic derivatives for a model, include buildDerivs=TRUE
in the call to nimbleModel
. If you want derivatives to be set up for all models, you can run nimbleOptions(buildModelDerivs = TRUE)
and omit the buildDerivs
argument.
We’ll re-introduce the simple Poisson Generalized Linear Mixed Model (GLMM) example model from 7.11.2.1 and use Laplace approximation on it. There will be 10 groups (i
) of 5 observations (j
) each. Each observation has a covariate, X
, and each group has a random effect ran_eff
. Here is the model code:
model_code <- nimbleCode({
  # priors
  intercept ~ dnorm(0, sd = 100)
  beta ~ dnorm(0, sd = 100)
  sigma ~ dhalfflat()
  # random effects and data
  for(i in 1:10) {
    # random effects
    ran_eff[i] ~ dnorm(0, sd = sigma)
    for(j in 1:5) {
      # data
      y[i,j] ~ dpois(exp(intercept + beta*X[i,j] + ran_eff[i]))
    }
  }
})
Note that we changed the prior on sigma
to avoid having an upper bound. Prior distributions are not included in maximum likelihood using the Laplace approximation but do indicate the range of valid values. We recommend caution in using priors for variance component parameters (standard deviations, variances, precisions) that have a finite upper bound (e.g., sigma ~ dunif(0, 100)
), because the probit transformation applied in that case may result in poor optimization performance.
We’ll simulate some values for X
.
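# A minimal sketch of simulating the covariate values (any values will do;
# the outputs below reflect the values used when this example was run).
set.seed(123)
X <- matrix(rnorm(50), nrow = 10, ncol = 5)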
Next, we build the model, including buildDerivs=TRUE
.
model <- nimbleModel(model_code, constants = list(X = X), calculate = FALSE,
buildDerivs = TRUE) # Here is the argument needed for AD.
16.1.1 Finish setting up the GLMM example
As preparation for the Laplace examples below, we need to finish setting up the GLMM. We could have provided data in the call to nimbleModel
, but instead we will simulate it using the model itself. Specifically, we will set parameter values, simulate data values, and then set those as the data to use.
model$intercept <- 0
model$beta <- 0.2
model$sigma <- 0.5
model$calculate() # This will return NA because the model is not fully initialized.
## [1] NA
model$simulate(model$getDependencies('ran_eff'))
model$calculate() # Now the model is fully initialized: all nodes have valid values.
## [1] -78.44085
If you are not very familiar with using a nimble
model, that might have been confusing, but it was just for setting up the example.
Finally, we will make a compiled version of the model.
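Cmodel <- compileNimble(model)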
16.2 How to use Laplace approximation and adaptive Gauss-Hermite quadrature
Next we will show how to use nimble’s Laplace approximation, which uses derivatives internally, to get maximum (approximate) likelihood estimates for the GLMM model above. Laplace approximation is equivalent to first-order adaptive Gauss-Hermite quadrature, which is also available (via buildAGHQ
and runAGHQ
), although here we will focus on Laplace approximation only. In this context, Laplace approximation approximates the integral over continuous random effects needed to calculate the likelihood. Hence, it gives an approximate likelihood (often quite accurate) that can be used for maximum likelihood estimation. Note that the Laplace approximation uses second derivatives, and the gradient of the Laplace approximation (used for finding the MLE efficiently) uses third derivatives. These are described in detail by Skaug and Fournier (2006) and Fournier et al. (2012).
To create a Laplace approximation specialized to the parameters of interest for this model, we use the nimbleFunction buildLaplace
. For many models, the setup code in buildLaplace
will automatically determine the random effects to be integrated over and the associated nodes to calculate. In fact, if you omit the parameter nodes, it will assume that all top-level nodes in the model should be treated as parameters. If fine-grained control is needed, these various sets of nodes can be input directly into buildLaplace
. To see what default handling of nodes is being done for your model, use setupMargNodes
with the same node inputs as buildLaplace
.
glmm_laplace <- buildLaplace(model, paramNodes = c('intercept','beta','sigma'))
Cglmm_laplace <- compileNimble(glmm_laplace, project = model)
With the compiled Laplace approximation, we can now find the MLE and related information such as standard errors.
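One way to produce the results below is sketched here, using methods introduced shortly:
MLE <- Cglmm_laplace$findMLE(c(0, 0, 1)) # maximize the Laplace-approximated likelihood
summaryLaplace(Cglmm_laplace, MLE)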
## $params
## estimate stdError
## intercept -0.1491944 0.2464880
## beta 0.1935212 0.1467230
## sigma 0.5703362 0.2066517
##
## $randomEffects
## estimate stdError
## ran_eff[1] -0.33711373 0.4305831
## ran_eff[2] -0.02964535 0.3987838
## ran_eff[3] 0.40575212 0.3858675
## ran_eff[4] 1.04768889 0.3779772
## ran_eff[5] -0.36731650 0.4290568
## ran_eff[6] 0.26907207 0.3863272
## ran_eff[7] -0.54950702 0.4654196
## ran_eff[8] -0.11864461 0.4175452
## ran_eff[9] 0.10006643 0.3926128
## ran_eff[10] -0.04411292 0.3971147
##
## $vcov
## intercept beta sigma
## intercept 0.060756345 -0.002691117 -0.014082707
## beta -0.002691117 0.021527641 -0.005098532
## sigma -0.014082707 -0.005098532 0.042704916
##
## $logLik
## [1] -63.44875
##
## $df
## [1] 3
##
## $originalScale
## [1] TRUE
One of the output elements is the maximized log likelihood (logLik
), which is useful for model comparison.
buildLaplace
actually offers several choices in how computations are done, differing in whether and how they use double taping for derivatives. In some cases one choice may be more efficient than another. See help(buildLaplace)
if you want to explore it further.
Finally, let’s confirm that it worked by comparing to results from package glmmTMB
. In this case, nimble’s Laplace approximation is faster than glmmTMB
(on the machine used here), but that is not the point of this example. Here our interest is in checking that nimble’s Laplace approximation worked correctly in a case where we have an established tool such as glmmTMB
.
library(glmmTMB)
y <- as.numeric(model$y) # Re-arrange inputs for call to glmmTMB
X <- as.numeric(X)
group <- rep(1:10, 5)
data <- as.data.frame(cbind(X,y,group))
tmb_fit <- glmmTMB(y ~ X + (1 | group), family = poisson, data = data)
summary(tmb_fit)
## Family: poisson ( log )
## Formula: y ~ X + (1 | group)
## Data: data
##
## AIC BIC logLik deviance df.resid
## 132.9 138.6 -63.4 126.9 47
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## group (Intercept) 0.3253 0.5703
## Number of obs: 50, groups: group, 10
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1492 0.2465 -0.605 0.545
## X 0.1935 0.1467 1.319 0.187
## 'log Lik.' -63.44875 (df=3)
The results match within numerical tolerance typical of optimization problems. Specifically, the coefficients for (Intercept) and X match nimble’s intercept and beta, the random effects standard deviation for group matches nimble’s sigma, and the standard errors match.
16.2.1 Using the Laplace approximation methods directly
If one wants finer-grained control over using the approximation, one can use the methods provided by buildLaplace
. These include calculating the Laplace approximation for some input parameter values, calculating its gradient, and maximizing the Laplace-approximated likelihood. Here we’ll show some of these steps.
# Get the Laplace approximation for one set of parameter values.
Cglmm_laplace$calcLaplace(c(0, 0, 1))
## [1] -65.57246
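# Get the corresponding gradient (via the gr_Laplace method).
Cglmm_laplace$gr_Laplace(c(0, 0, 1))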
## [1] -1.866840 8.001648 -4.059555
MLE <- Cglmm_laplace$findMLE(c(0, 0, 1)) # Find the (approximate) MLE.
MLE$par # MLE parameter values
## [1] -0.1491983 0.1935269 0.5703413
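MLE$value # maximized (approximate) log likelihood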
## [1] -63.44875
The final outputs show the MLE for intercept
, beta
, and sigma
, followed by the maximum (approximate) likelihood.
More information about the MLE can be obtained in two ways. The summary
method
can give estimated random effects and standard errors as well as the variance-covariance matrix
for the parameters and/or the random effects. The summaryLaplace
function
returns similar information but with names included in a more useful way. Here is some example code:
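# A sketch of both approaches (exact output fields assumed):
laplace_summary <- Cglmm_laplace$summary(MLE)
laplace_summary$randomEffects$estimates
summaryLaplace(Cglmm_laplace, MLE)$params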
## [1] -0.33711487 -0.02964301 0.40575572 1.04769223 -0.36731868 0.26907375
## [7] -0.54951160 -0.11864177 0.10006955 -0.04411124
## estimate stdError
## intercept -0.1491983 0.2464895
## beta 0.1935269 0.1467230
## sigma 0.5703413 0.2066531
16.2.2 Changing the optimization methods
When finding the MLE via Laplace approximation or adaptive Gauss-Hermite quadrature (AGHQ), there are two numerical optimizations: (1) maximizing the joint log-likelihood of random effects and data given a set of parameter values to construct the approximation to the marginal log-likelihood at the given parameter values, and (2) maximizing the approximation to the marginal log-likelihood over the parameter values. Optimization (1) is the “inner” optimization and optimization (2) is the “outer” optimization.
Finding the MLE via Laplace approximation may be sensitive to the optimization methods used, in particular the choice of optimizer for the inner optimization, and the “BFGS” optimizer available through optim()
may not perform well for inner optimization.
As of version 1.3.0, the default choices for both the inner and outer optimization use R’s nlminb
optimizer.
Users can choose a different optimizer for both of the optimizations.
To change the inner or outer optimizers, one can use the innerOptimMethod
and outerOptimMethod
elements of the control
list argument to buildLaplace
. One can modify various settings that control the behavior of the inner and outer optimizers via control
as well. See help(buildLaplace)
for more details.
Once a Laplace approximation is built, one can use updateSettings
to modify the choices of optimizers and various settings that control the behavior of the inner and outer optimizers (see help(buildLaplace)
for details).
By default, NIMBLE provides various optimization methods available through R’s optim()
as well as R’s nlminb
method and the BOBYQA method from the nloptr
package (by specifying bobyqa
). Users can also provide their own optimization function in R that they can then use with Laplace approximation. User optimization functions must have a particular set and order of arguments and must first be registered with NIMBLE via nimOptimMethod
. See help(nimOptim)
for more details.
Here’s an example of setting up the Newton method optimizer from the TMB
package as the inner optimizer for use with NIMBLE’s Laplace approximation. (Note that NIMBLE and TMB have distinct AD systems and Laplace approximation implementations; here we simply use the TMB::newton
optimization function.)
library(TMB)
library(Matrix)
## Create an R wrapper function that has the interface needed for NIMBLE
## and wraps the optimizer of interest.
nimbleTMBnewton <- function(par, fn, gr, he, lower, upper, control, hessian) {
  ## Wrap `he` as return value needs to be of class `dsCMatrix`.
  he_matrix <- function(p) Matrix(he(p), doDiag = FALSE, sparse = TRUE)
  invalid <- function(x) is.null(x) || is.na(x) || is.infinite(x)
  if(invalid(control$trace)) control$trace <- 1
  if(invalid(control$maxit)) control$maxit <- 100
  if(invalid(control$reltol)) control$reltol <- 1e-8
  res <- newton(par, fn, gr, he_matrix,
                trace = control$trace, maxit = control$maxit, tol = control$reltol)
  ## Additional arguments (e.g., `alpha` and `tol10`) can be hard-coded in the `newton()` call.
  ans <- list(
    ## What is handled in the return is fairly particular, so it often needs conversion
    ## from a given method such as TMB::newton.
    par = res$par,
    value = res$value,
    counts = c(0, 0, 0), # Counts of fn/gr/he calls, but these are not counted by TMB::newton.
    evaluations = res$iterations,
    hessian = NULL, # TMB::newton gives a `dsCMatrix` but we need a base R matrix.
    message = "ran by TMB::newton",
    convergence = 0 # TMB::newton does not return a convergence code so give 0 (converged).
  )
  return(ans)
}
## Register the optimizer with NIMBLE.
nimOptimMethod("nimbleTMBnewton", nimbleTMBnewton)
## Use the optimizer for the inner optimization when finding the Laplace MLE.
glmm_laplace <- buildLaplace(model, c('intercept','beta','sigma'),
control = list(innerOptimMethod = "nimbleTMBnewton"))
16.3 How to support derivatives in user-defined functions and distributions
It is possible to activate derivative support for a user-defined function or distribution used in a model. Simply set buildDerivs=TRUE
in nimbleFunction
.
Here is a deliberately minimal example. Let’s say we want a model with one node that follows a user-defined distribution, which will happen to be the same as dnorm
(a normal distribution) for illustration.
The model code is:
toyCode <- nimbleCode({
  x ~ d_my_norm(mean = mu, sd = sigma)
  mu ~ dunif(-10, 10)
  sigma ~ dunif(0, 5)
})
The user-defined distribution is:
d_my_norm <- nimbleFunction(
  run = function(x = double(), mean = double(), sd = double(),
                 log = integer(0, default = 0)) {
    ans <- -log(sqrt(2*pi)*sd) - 0.5*((x-mean)/sd)^2
    if(log) return(ans)
    return(exp(ans))
    returnType(double())
  },
  buildDerivs = TRUE
)
Now we can build the model with buildDerivs=TRUE
and compile it:
# Don't worry about the warnings from nimbleModel in this case.
toyModel <- nimbleModel(toyCode, inits = list(mu = 0, sigma = 1, x = 0),
buildDerivs = TRUE)
Now toyModel
can be used in algorithms such as HMC and Laplace approximation, as above, or new ones you might write, as below.
Note that in models, all model variables are implemented as doubles, even when they will actually only ever take integer values. Therefore all arguments to user-defined distributions and functions used in models should be declared as type double
. The ADbreak
method discussed in Section 16.5.2 should be used if you want to stop derivative tracking on a model variable.
16.4 What operations are and aren’t supported for AD
Much of the math supported by NIMBLE will work for AD. Here are some details on what is and isn’t supported, as well as on some options to control how linear algebra is handled.
Features that are not supported for AD include:
- cumulative (“p”) and quantile (“q”) functions for distributions;
- truncated distributions (because they use cumulative distribution functions), for any derivatives;
- discrete distributions, for derivatives with respect to the random variable;
- random number generation;
- some specific functions including step, %% (mod), eigen, svd, and besselK; and
- most model operations (other than model$calculate, which is supported), including model$calculateDiff, model$simulate, model$getBound, and model$getParam.
- dynamic indexing that would produce a nonscalar result, such as p[z[i], 1:3]. Dynamic indexing that produces a scalar result, such as p[z[i], j], is allowed and can be used as a work-around, e.g., defining ptmp as for(j in 1:3) ptmp[i,j] <- p[z[i],j] and using it in place of p[z[i], 1:3].
Note that functions and distributions that are not supported can still be used in a model, as long as no algorithm tries to take derivatives with respect to nodes that are not supported. For example, x ~ dcat(p[1:3])
declares x
to be a discrete random variable following a categorical distribution. An algorithm can use derivatives with respect to p[1:3]
(or with respect to parent nodes of p[1:3]
) because the calculation of the log probability of x
given p[1:3]
is continuous with respect to p
. However, derivatives with respect to x
will not work. In the case of truncated distributions, derivatives are supported neither with respect to the parameters nor with respect to the random variable.
Some details on what is supported include:
- round, floor, ceil, and trunc are all supported, with derivatives defined to be zero everywhere.
- pow(a, b) requires positive a and b. Note that if b is coded as a (possibly negative) integer, pow_int will be used. For example pow(tau, -2) will be converted to pow_int(tau, -2).
- A new function pow_int(a, b) returns pow(a, round(b)) and thus sets all derivatives with respect to b to zero. This allows valid derivatives with respect to a even if it takes a negative value.
For the linear algebra functions %*%
, chol
, forwardsolve
, backsolve
, and inverse
, there are special extensions provided by NIMBLE for CppAD called (in CppAD jargon) “atomics”. By default, these atomics will be used and often improve efficiency. There may be cases where they decrease efficiency, which might include when the matrix operands are small or contain many zeros. To compare results with and without use of the atomics, they can be turned off with a set of options:
nimbleOptions(useADmatMultAtomic = FALSE) # for %*%
nimbleOptions(useADcholAtomic = FALSE) # for chol
nimbleOptions(useADmatInverseAtomic = FALSE) # for inverse
nimbleOptions(useADsolveAtomic = FALSE) # for forwardsolve and backsolve
When a linear algebra atomic is turned off, the AD system simply uses all the scalar operations that compose the linear algebra operation.
Another important feature of AD is that sometimes values get “baked in” to AD calculations, meaning they are used and permanently retained from the first set of calculations and then can’t have their value changed later (unless an algorithm does a “reset”, described below). For people writing user-defined distributions and functions, a brief summary of what can get baked in includes:
- the extent (iteration values) of any for-loops.
- values of any arguments that are not of type ‘double’, including e.g. the ‘log’ argument in d_my_norm. (‘d’ functions called from models always have log = TRUE, so in that case it is not a problem.)
- the evaluation path followed by any if-then-else calls.
- values of any integer indices.
See below for more thorough explanations.
16.5 Basics of obtaining derivatives in nimbleFunctions
Now that we have seen a derivative-enabled algorithm work, let’s see how to obtain derivatives in methods you write. From here on, this part of the chapter is oriented towards algorithm developers. We’ll start by showing how derivatives work in nimbleFunctions without using a model. The AD system allows you to obtain derivatives of one function or method from another.
Let’s get derivatives of the function y = exp(-d * x)
where x
is a vector, d
is a scalar, and y
is a vector.
derivs_demo <- nimbleFunction(
  setup = function() {},
  run = function(d = double(), x = double(1)) {
    return(exp(-d*x))
    returnType(double(1))
  },
  methods = list(
    derivsRun = function(d = double(), x = double(1)) {
      wrt <- 1:(1 + length(x)) # total length of d and x
      return(derivs(run(d, x), wrt = wrt, order = 0:2))
      returnType(ADNimbleList())
    }
  ),
  buildDerivs = 'run'
)
Here are some things to notice:
- Having setup code allows the nimbleFunction to have multiple methods (i.e. to behave like a class definition in object-oriented programming). Some nimbleFunctions don’t have setup code, but setup code is required when there will be a call to derivs. If, as here, you don’t need the setup to do anything, you can simply use setup=TRUE, which is equivalent to setup=function(){}.
- Any arguments to run that are real numbers (i.e. regular double precision numbers, but not integers or logicals) will have derivatives tracked when called through derivs.
- The “with-respect-to” (wrt) argument gives indices of the arguments for which you want derivatives. In this case, we’re including all elements of d and x. The indices form one sequence over all arguments.
- order is a vector of derivative orders to return, which can include any of 0 (the function value, not a derivative!), 1 (1st order), or 2 (2nd order). Higher-order derivatives can be obtained by double-taping, described below.
- Basics of nimbleFunction programming are covered in Chapters 11-15. These include type declarations (double() for scalar, double(1) for vector), the distinction between setup code and run code, and how to write more methods (of which run is simply a default name when there is only one method).
- derivs can alternatively be called nimDerivs. In fact the former will be converted to the latter internally.
Let’s see this work.
my_derivs_demo <- derivs_demo()
C_my_derivs_demo <- compileNimble(my_derivs_demo)
d <- 1.2
x <- c(2.1, 2.2)
C_my_derivs_demo$derivsRun(d, x)
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] 0.08045961 0.07136127
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -0.1689652 -0.09655153 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] 0.3548269 0.1222986 0
## [2,] 0.1222986 0.1158618 0
## [3,] 0.0000000 0.0000000 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 0.3453885 0 0.1170325
## [2,] 0.0000000 0 0.0000000
## [3,] 0.1170325 0 0.1027602
We can see that using order = 0:2
results in the value (“0th” order result, i.e. the value returned by run(d, x)
), the Jacobian (matrix of first order derivatives), and the Hessian (array of second order derivatives). The run
function here has taken three inputs (in the order d
, x[1]
, x[2]
) and returned two outputs (the first and second elements of exp(-d*x)
).
The i-th Jacobian row contains the first derivatives of the i-th output with respect to d
, x[1]
, and x[2]
, in that order. The first and second indices of the Hessian array follow the same ordering as the columns of the Jacobian. The third index of the Hessian array is for the output index. For example, hessian[3,1,2]
is the second derivative of the second output value (third index = 2) with respect to x[2]
(first index = 3) and d
(second index = 1).
(Although it may seem inconsistent to have the output index be first for Jacobian and last for Hessian, it is consistent with some standards for how these objects are created in other packages and used mathematically.)
When a function being called for derivatives (run
in this case) has non-scalar arguments (x
in this case), the indexing of inputs goes in order of arguments, and in column-major order within arguments. For example, if we had arguments x = double(1)
and z = double(2)
(a matrix), the inputs would be ordered as x[1]
, x[2]
, …, z[1, 1]
, z[2, 1]
, …, z[1, 2]
, z[2, 2]
, …, etc.
16.5.1 Checking derivatives with uncompiled execution
Derivatives can also be calculated in uncompiled execution, but they will be much slower and less accurate: slower because they are run in R, and less accurate because they use finite difference methods (from packages pracma
and/or numDeriv
). Uncompiled execution is mostly useful for checking that compiled derivatives are working correctly, because although they are slower and less accurate, they are also much simpler internally and thus provide good checks on compiled results. For example:
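my_derivs_demo$derivsRun(d, x) # the uncompiled instance created above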
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] 0.08045961 0.07136127
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -0.1689652 -0.09655153 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] 0.3548269 0.1222986 0
## [2,] 0.1222986 0.1158618 0
## [3,] 0.0000000 0.0000000 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 0.3453885 0 0.1170325
## [2,] 0.0000000 0 0.0000000
## [3,] 0.1170325 0 0.1027602
We can see that the results are very close, but typically not identical, to those from the compiled version.
16.5.2 Holding some local variables out of derivative tracking
Sometimes one wants to omit tracking of certain variables in derivative calculations. There are three ways to do this: naming variables in an ignore
set; ensuring a variable’s type is integer
(not simply numeric
with an integer value) or logical
; or using the ADbreak
function. The differences between these have to do with the fact that the variable types used for tracking derivatives in C++ are special. The ignore
method marks a variable as a regular type rather than a derivative-tracking type. The integer
or logical
method results in a regular (integer or logical) variable simply because those types are never used in derivative-tracking. The ADbreak
method assigns one derivative-tracked variable to another (derivative-tracked or not) variable while severing any actual derivative tracking between them.
16.5.2.1 Holding variables out of derivative tracking using ignore
Here is an example of ignore
, while ADbreak
is described below.
Say that we write exp(-d*x)
using a for-loop instead of a vectorized operation as above. It wouldn’t make sense to track derivatives for the for-loop index (i
), and indeed it would cause compileNimble
to fail. The code to use a for-loop index i
while telling derivs
to ignore derivatives for i
is:
derivs_demo2 <- nimbleFunction(
  setup = function() {},
  run = function(d = double(), x = double(1)) {
    ans <- numeric(length = length(x))
    for(i in 1:length(x))
      ans[i] <- exp(-d*x[i])
    return(ans)
    returnType(double(1))
  },
  methods = list(
    derivsRun = function(d = double(), x = double(1)) {
      wrt <- 1:(1 + length(x)) # total length of d and x
      return(derivs(run(d, x), wrt = wrt, order = 0:2))
      returnType(ADNimbleList())
    }
  ),
  buildDerivs = list(run = list(ignore = 'i'))
)
We can see that it gives results identical to those above, looking at only the Jacobian to keep the output short.
my_derivs_demo2 <- derivs_demo2()
C_my_derivs_demo2 <- compileNimble(my_derivs_demo2)
d <- 1.2
x <- c(2.1, 2.2)
C_my_derivs_demo2$derivsRun(d, x)$jacobian
## [,1] [,2] [,3]
## [1,] -0.1689652 -0.09655153 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352
One might think it should be obvious that i
should not be involved in derivatives, but sometimes math is actually done with a for-loop index. Another way to ensure derivatives won’t be tracked for i
would be to write i <- 1L
, before the loop. By assigning a definite integer to i
, it will be established as definitely of integer type throughout the function and thus not have derivatives tracked. (If you are not familiar with the L
in R, try storage.mode(1L)
vs storage.mode(1)
.) The only way to mark a double-precision variable to be a regular (not derivative-tracking) type is to name it in the ignore
set.
Another case where the ignore
or integer
methods are useful is in determining the start and/or end values of a for-loop. If you need for(i in 1:n) {...}
where you previously set n <- length(some_values)
, you need n
to be a non-derivative tracking type. You can achieve this either by including "n"
in the ignore
set or by doing n <- 0L
before setting n <- length(some_values)
, to establish with certainty that n
is an integer.
Because derivatives are not tracked for integer or logical variables, arguments to run
that have integer
or logical
types will not have derivatives tracked. You can also include an argument name (even for a double-precision argument) in the ignore
set. In any of these cases, you must then be careful in passing a non-derivative tracking type to another function. If passed as an argument to another function, the argument in that function must also be a non-derivative tracking type (integer
, logical
, or in the ignore
set).
If you need to assign a value from a derivative-tracked variable to a non-derivative tracked variable, you must use ADbreak
(see the next subsection for more on this). For example if y
is in the ignore
set, y <- ADbreak(x)
assigns the value of x
(with no derivative tracking) to y
. You can also do y <- ADbreak(x)
if y
is not in the ignore
set, and the result is that y
has derivative tracking and is initialized to the value of x
, but all derivatives of y
with respect to x
are 0. This severs the derivative (chain-rule) relationship between x
and any subsequent calculations with y
, but it does allow y
to be used where a derivative-tracking variable is expected (e.g. as another function’s argument). Derivative-tracked variables that are severed by ADbreak
will be “baked in” to the AD tape from the first call. This means their values will be constants and will not change on subsequent calls even if the input values change, unless you use the reset
feature. See section 16.5.4 to learn more about “baked in” values.
Only some math operations will work with a mix of derivative-tracking and regular variables. For example, most scalar operations will work with one argument of each kind. However, more complicated operations such as matrix multiplication require the arguments to be either both derivative-tracking or both regular.
Note the more elaborate value for buildDerivs
. In the previous example, this was just a character vector. Here it is a named list, with each element itself being a list of control values. The control value ignore
is a character vector of variable names to ignore in derivative tracking. buildDerivs = 'run'
is equivalent to buildDerivs = list(run = list())
.
16.5.2.2 Holding function arguments (including variables passed from a model) out of derivative tracking
Derivatives can be tracked through function calls, including from a model to a user-defined distribution or function (see 16.3 and 16.5.3). However, if one function is tracking derivatives for x
and passes x
as an argument to another function where it is non-derivative tracking type, there will be an error (most likely from compileNimble
). The problem is that, as noted above, the C++ variable types used for tracking derivatives or not are different and are not automatically interchangeable. Importantly, variables passed from models are always derivative-tracking (even if they will always have integer values).
For this situation, you can use ADbreak
to stop derivative tracking. For example, if one has a nimbleFunction argument x
, one would do x_noDeriv <- ADbreak(x)
and then use x_noDeriv
in subsequent calculations. This “breaks” derivative tracking in the sense that the derivative of x_noDeriv
with respect to x
is 0. However, it is important to realize that x_noDeriv
is still a derivative-tracking type (unless it is in the ignore
set), so it can be passed to other functions involved in the derivative-tracked operations. And, on the other hand, it can’t be used as a for-loop index or as the start or end of a for-loop extent. As noted above, if the values of x
or x_noDeriv
are used in further calculations, their values will be “baked in” to the tape.
Including "x"
in the ignore
set won’t work in this case because, if x
is a function argument, its type will then be changed to be a non-derivative type. If a derivative-tracking type is then passed as the value for x
there will be a mismatch in types. Since variables passed from models always allow derivative-tracking, one should not ignore
the corresponding function arguments.
Note that ADbreak
only works for scalars. If you need to break AD tracking for a non-scalar, you have to write a for-loop to apply ADbreak
for each element. For example:
x_noDeriv <- nimNumeric(length(x), init=FALSE)
i <- 1L
for(i in 1:length(x)) x_noDeriv[i] <- ADbreak(x[i])
In summary:
- Use the ignore set or explicit creation of variables as integer or logical to create regular variable types.
- If you need an argument to have derivative-tracking (as is always the case when called from a model) but derivatives with respect to it don’t make sense, use ADbreak to assign to a local variable that will “bake in” the value and remove it from derivative tracking. The local variable can be regular (if it is in the ignore list or was created first as an integer or logical) or derivative-tracking.
16.5.3 Using AD with multiple nimbleFunctions
Derivatives will be tracked through whatever series of calculations occur in a method, possibly including calls to other methods or nimbleFunctions that have buildDerivs
set. Let’s look at an example where we have a separate function to return the element-wise square root of an input vector. The net calculation for derivatives will be sqrt(exp(-d*x)).
nf_sqrt <- nimbleFunction(
  run = function(x = double(1)) {
    return(sqrt(x))
    returnType(double(1))
  },
  buildDerivs = TRUE
)

derivs_demo3 <- nimbleFunction(
  setup = function() {},
  run = function(d = double(), x = double(1)) {
    ans <- exp(-d*x)
    ans <- nf_sqrt(ans)
    return(ans)
    returnType(double(1))
  },
  methods = list(
    derivsRun = function(d = double(), x = double(1)) {
      wrt <- 1:(1 + length(x)) # total length of d and x
      return(derivs(run(d, x), wrt = wrt, order = 0:2))
      returnType(ADNimbleList())
    }
  ),
  buildDerivs = 'run'
)
And then let’s see it work:
my_derivs_demo3 <- derivs_demo3()
C_my_derivs_demo3 <- compileNimble(my_derivs_demo3)
d <- 1.2
x <- c(2.1, 2.2)
C_my_derivs_demo3$derivsRun(d, x)$jacobian
## [,1] [,2] [,3]
## [1,] -0.2978367 -0.1701924 0.0000000
## [2,] -0.2938488 0.0000000 -0.1602812
Note that for a nimbleFunction without setup code, one can say buildDerivs=TRUE
, buildDerivs = 'run'
, or buildDerivs = list(run = list())
. One can’t take derivatives of nf_sqrt
on its own, but it can be called by a method of a nimbleFunction that can have its derivatives taken (e.g. the run
method of a derivs_demo3
object).
16.5.4 Understanding more about how AD works: taping of operations
At this point, it will be helpful to grasp more of how AD works and its implementation in nimble via the CppAD library (Bell 2022). AD methods work by following a set of calculations multiple times, sometimes in reverse order.
For example, the derivative of \(y = \exp(x^2)\) can be calculated (by the chain rule of calculus) as the derivative of \(y = \exp(z)\) (evaluated at \(z = x^2\)) times the derivative of \(z = x^2\) (evaluated at \(x\)). This can be calculated by first going through the sequence of steps for the value, i.e. (i) \(z = x^2\), (ii) \(y = \exp(z)\), and then again for the derivatives, i.e. (i) \(dz = 2x (dx)\), (ii) \(dy = \exp(z) dz\). These steps determine the instantaneous change \(dy\) that results from an instantaneous change \(dx\). (It may be helpful to think of both values as a function of a third variable such as \(t\). Then we are determining \(dy/dt\) as a function of \(dx/dt\).) In CppAD, the basic numeric object type (double) is replaced with a special type and corresponding versions of all the basic math operations so that those operations can track not just the values but also derivative information.
The derivative calculations typically need the values. For example, the derivative of \(\exp(z)\) is (also) \(\exp(z)\), for which the value of \(z\) is needed, which in this case is \(x^2\), which is one of the steps in calculating the value of \(\exp(x^2)\). (In fact, in this case the derivative of \(\exp(z)\) can more simply use the value of \(y\) itself.) Hence, values are calculated before derivatives, which are calculated by stepping through the calculation sequence a second time.
What was just described is called forward mode in AD. The derivatives can also be calculated by working through \(\exp(x^2)\) in reverse mode. This is often less familiar to graduates of basic calculus courses. In reverse mode, one determines what instantaneous change \(dx\) would give an instantaneous change \(dy\). For example, given a value of \(dy\), we can calculate \(dz = dy / \exp(z)\) and then \(dx = dz / (2x)\). Again, values are calculated first, followed by derivatives in a second pass through the calculation sequence, this time in reverse order.
In general, choice of forward mode versus reverse mode has to do with the lengths of inputs and outputs and possibly specific operations involved. In nimble, some reasonable choices are used.
In CppAD, the metaphor of pre-digital technology – magnetic tapes – is used for the set of operations in a calculation. When a line with nimDerivs
is first run, the operations in the function it calls are taped and then re-used – played – in forward and/or reverse orders to obtain derivatives. Hence we will refer to the AD tape in explaining some features below. It is also possible to reset (i.e. re-record) the AD tape used for a particular nimDerivs
call.
When taped operations are played, the resulting operations can themselves be taped. We call this meta-taping or double-taping. It is useful because it can sometimes boost efficiency and it can be used to get third and higher order derivatives.
16.5.5 Resetting a nimDerivs call
The first time a call to nimDerivs
runs, it records a tape of the operations in its first argument (run(d, x)
in the examples here). On subsequent calls, it re-uses that tape without re-recording it. That means subsequent calls are usually much faster. It also means that subsequent calls must use the same sized arguments as the recorded call. For example, we can use C_my_derivs_demo$derivsRun
with new arguments of the same size(s):
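C_my_derivs_demo$derivsRun(-0.4, c(3.2, 5.1)) # new values; same argument sizes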
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] 3.596640 7.690609
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -11.50925 1.438656 0.000000
## [2,] -39.22211 0.000000 3.076244
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] 36.829591 -8.2003386 0
## [2,] -8.200339 0.5754624 0
## [3,] 0.000000 0.0000000 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 200.03275 0 -23.379452
## [2,] 0.00000 0 0.000000
## [3,] -23.37945 0 1.230497
However, if we call C_my_derivs_demo$derivsRun
with length(x)
different from 2, the result will be garbage. If we need to change the size of the arguments, we need to re-record the tape. This is done with the reset
argument.
Here is a slightly more general version of the derivs_demo
allowing a user to reset the tape. It also takes order
and wrt
as arguments instead of hard-coding them.
derivs_demo4 <- nimbleFunction(
  setup = function() {},
  run = function(d = double(), x = double(1)) {
    return(exp(-d*x))
    returnType(double(1))
  },
  methods = list(
    derivsRun = function(d = double(), x = double(1),
                         wrt = integer(1), order = integer(1),
                         reset = logical(0, default=FALSE)) {
      return(derivs(run(d, x), wrt = wrt, order = order, reset = reset))
      returnType(ADNimbleList())
    }
  ),
  buildDerivs = 'run'
)
Now we will illustrate the use of reset
. To make shorter output, we’ll request only the Jacobian.
my_derivs_demo4 <- derivs_demo4()
C_my_derivs_demo4 <- compileNimble(my_derivs_demo4)
d <- 1.2
x <- c(2.1, 2.2)
# On the first call, reset is ignored because the tape must be recorded.
C_my_derivs_demo4$derivsRun(d, x, 1:3, 1)
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## numeric(0)
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -0.1689652 -0.09655153 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352
## Field "hessian":
## <0 x 0 x 0 array of double>
##
# On the second call, reset=FALSE, so the tape is re-used.
C_my_derivs_demo4$derivsRun(-0.4, c(3.2, 5.1), 1:3, 1)
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## numeric(0)
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -11.50925 1.438656 0.000000
## [2,] -39.22211 0.000000 3.076244
## Field "hessian":
## <0 x 0 x 0 array of double>
##
# If we need a longer x, we need to say reset=TRUE
C_my_derivs_demo4$derivsRun(1.2, c(2.1, 2.2, 2.3), 1:4, 1, reset=TRUE)
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## numeric(0)
## Field "jacobian":
## [,1] [,2] [,3] [,4]
## [1,] -0.1689652 -0.09655153 0.00000000 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352 0.00000000
## [3,] -0.1455711 0.00000000 0.00000000 -0.07595012
## Field "hessian":
## <0 x 0 x 0 array of double>
##
# Now we can use reset=FALSE with x of length 3
C_my_derivs_demo4$derivsRun(-0.4, c(3.2, 5.1, 4.5), 1:4, 1)
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## numeric(0)
## Field "jacobian":
## [,1] [,2] [,3] [,4]
## [1,] -11.50925 1.438656 0.000000 0.000000
## [2,] -39.22211 0.000000 3.076244 0.000000
## [3,] -27.22341 0.000000 0.000000 2.419859
## Field "hessian":
## <0 x 0 x 0 array of double>
##
You could also make multiple instances of derivs_demo4
and keep track of the argument sizes currently recorded for each one.
Other important situations where you need to reset a tape are when values that are otherwise “baked in” to an AD tape need to be changed. By “baked in”, we mean that some values are permanently part of a tape until it is reset.
16.5.5.1 What gets baked into AD tapes until reset=TRUE
Specifically, you need to reset a tape when:
- the extent of any for-loops changes. For-loops are not natively recorded; rather, each operation done in execution of the for-loop is recorded. (This is sometimes called “loop unrolling”.) Therefore, in for(i in start:end), the values of start and end are in effect baked into a tape.
- the outcomes of any if-then-else conditions in a nimbleFunction change.
- values of any integer or logical inputs change.
- values of any variables with derivative tracking severed by ADbreak change. (See 16.5.2.)
- any other member data used in calculations changes. It is possible to create a variable in the setup function and use it in methods such as run. Such variables become member data in the terminology of object-oriented programming. They will not have derivatives tracked if used in a function while it is being taped, so their values will be baked into the tape. This can be useful if you understand what is happening, or confusing if not.
Uncompiled execution of derivs
ignores the reset argument.
16.6 Advanced uses: double taping
Suppose you are interested only in the Jacobian, not in the value, and/or want only some elements of the Jacobian. You might still need the value, but obtaining the value alone from an AD tape (i.e. from derivs
) will be slower than obtaining it by simply calling the function. On the other hand, AD methods need to calculate the value before calculating first order derivatives, so the value will be calculated anyway. However, in some cases, some of the steps of value calculations aren’t really needed if one only wants, say, first-order derivatives. In addition, if not all elements of the Jacobian are wanted, then some unnecessary calculations will be done internally that one might want to avoid.
A way to cut out unnecessary calculations is to record a tape of a tape, which we call double taping. Let’s see an example before explaining further.
derivs_demo5 <- nimbleFunction(
  setup = function() {},
  run = function(d = double(), x = double(1)) {
    return(exp(-d*x))
    returnType(double(1))
  },
  methods = list(
    jacobian_run_wrt_d = function(d = double(), x = double(1),
                                  wrt = integer(1),
                                  reset = logical(0, default=FALSE)) {
      ans <- derivs(run(d, x), wrt = wrt, order = 1, reset = reset)
      jac <- ans$jacobian[,1] # derivatives wrt 'd' only
      return(jac)
      returnType(double(1))
    },
    derivsJacobian = function(d = double(), x = double(1),
                              wrt = integer(1),
                              order = integer(1),
                              reset = logical(0, default = FALSE)) {
      innerWrt <- nimInteger(value = 1, length = 1)
      ans <- nimDerivs(jacobian_run_wrt_d(d, x, wrt = innerWrt, reset = reset),
                       wrt = wrt, order = order, reset = reset)
      return(ans)
      returnType(ADNimbleList())
    }
  ),
  buildDerivs = c('run', 'jacobian_run_wrt_d')
)
What is happening in this code? jacobian_run_wrt_d
is a method that returns part of the Jacobian of run(d, x)
, specifically the first column, which contains derivatives with respect to d
. It happens to use AD to do it, but otherwise it is just some function with inputs d
and x
and a vector output. (Arguments that are integer or logical do not have derivatives tracked.) derivsJacobian
calculates derivatives of jacobian_run_wrt_d
. This means that the value
returned by derivsJacobian
will be the value of jacobian_run_wrt_d
, which comprises the first derivatives of run
with respect to d
. The jacobian
returned by derivsJacobian
will contain second derivatives of run
with respect to d
and each of d
, x[1]
, and x[2]
. And the hessian
returned by derivsJacobian
will contain some third derivatives of run
. Notice that buildDerivs
now includes jacobian_run_wrt_d
.
Let’s see this in use.
my_derivs_demo5 <- derivs_demo5()
C_my_derivs_demo5 <- compileNimble(my_derivs_demo5)
d <- 1.2
x <- c(2.1, 2.2)
C_my_derivs_demo5$derivsJacobian(d, x, wrt = 1:3, order = 0:2)
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] -0.1689652 -0.1569948
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] 0.3548269 0.1222986 0.0000000
## [2,] 0.3453885 0.0000000 0.1170325
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] -0.74513642 -0.08786189 0
## [2,] -0.08786189 -0.05020679 0
## [3,] 0.00000000 0.00000000 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] -0.7598548 0 -0.10047667
## [2,] 0.0000000 0 0.00000000
## [3,] -0.1004767 0 -0.05480546
We can compare these results to those shown above from the same values of d
and x
. The value
here is the same as jacobian[1:2, 1]
above. The jacobian
here is the same as t(hessian[1, 1:3, 1:2])
or (by symmetry of second derivatives) t(hessian[1:3, 1, 1:2])
above. And the hessian[i,j,k]
here contains the third derivative of the k-th output of run(d, x)
with respect to d
, the i-th input, and the j-th input.
It is also possible to call derivs
on (i.e. tape) a function containing a derivs
call from which some 0-th and/or 2nd order derivatives are extracted. It is even possible to triple tape by calling derivs
on a function calling derivs
on a function calling derivs
, and so on.
16.7 Derivatives involving model calculations
Obtaining derivatives involving model calculations takes some special considerations, and there are two ways to do it.
We will use the Poisson GLMM above as an example model for which we want derivatives. We can obtain derivatives with respect to any variables for all or any subset of model calculations.
16.7.1 Method 1: nimDerivs of model$calculate
Recall that model$calculate(nodes)
returns the sum of the log probabilities of all stochastic nodes in nodes
. Deterministic calculations are also executed; they contribute 0 to the sum of probabilities but may be needed for inputs to subsequent calculations. Calculations are done in the order of nodes
, which should be a valid order for the model, often obtained from model$getDependencies
.
The simplest way to get derivatives for model calculations is to use model$calculate
as the function taped by derivs
(here shown by its alternative name nimDerivs
for illustration).
derivs_nf <- nimbleFunction(
  setup = function(model, with_respect_to_nodes, calc_nodes) {},
  run = function(order = integer(1),
                 reset = logical(0, default=FALSE)) {
    ans <- nimDerivs(model$calculate(calc_nodes), wrt = with_respect_to_nodes,
                     order = order, reset = reset)
    return(ans)
    returnType(ADNimbleList())
  }
)
In derivs_nf
:
- The mere presence of model, with_respect_to_nodes, and calc_nodes as setup arguments makes them available as member data for run or other methods.
- model will be a model object returned from nimbleModel.
- with_respect_to_nodes will be the names of nodes we want derivatives with respect to.
- calc_nodes will be the nodes to be calculated, in the order given.
- order can contain any of 0 (value), 1 (1st order), or 2 (2nd order) for the derivatives requested, as above.
- reset should be TRUE if the AD tape should be reset (re-recorded), as above. There are additional situations when a tape for model$calculate should be reset, discussed below.
Thus, nimDerivs(model$calculate(calc_nodes), wrt = with_respect_to_nodes, <other args>)
takes derivatives of a function whose inputs are the values of with_respect_to_nodes
(using their values in the model object) and whose output is the summed log probability returned by model$calculate(calc_nodes)
. The internal handling of this case is distinct from other calls to nimDerivs
.
Now we can make an instance of derivs_nf
, compile the model and nimbleFunction instance, and look at the results. We will assume the model
was built as above in the Laplace example.
wrt_nodes <- c('intercept','beta', 'sigma')
calc_nodes <- model$getDependencies(wrt_nodes)
derivs_all <- derivs_nf(model, wrt_nodes, calc_nodes)
cModel <- compileNimble(model)
cDerivs_all <- compileNimble(derivs_all, project = model)
derivs_result <- cDerivs_all$run(order = 0:2)
derivs_result
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] -78.44085
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -9.358104 -0.3637619 -5.81414
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] -62.35820 -16.21705 0.00000
## [2,] -16.21705 -69.47074 0.00000
## [3,] 0.00000 0.00000 -45.11516
As above, using order = 0:2
results in the value (0th order), Jacobian (1st order), and Hessian (2nd order). The function model$calculate
is organized here to have inputs that are the current values of intercept
, beta
, and sigma
in the model. It has output that is the summed log probability of the calc_nodes
. The Jacobian columns are first derivatives with respect to intercept
, beta
and sigma
, respectively. The first and second indices of the Hessian array follow the same ordering. For example derivs_result$hessian[2,1,1]
is the second derivative with respect to beta
(first index = 2) and intercept
(second index = 1).
In the case of model$calculate
, the first index of the Jacobian and the last index of the Hessian are always 1 because derivatives are of the first (and only) output value.
The ordering of inputs is similar to that used above, such as for the arguments d
and x
, but in this case the inputs are model nodes. When non-scalar nodes such as matrix or array nodes are used as with_respect_to_nodes
, the resulting elements of Jacobian columns and Hessian first and second indices will follow column-major order. For example, for a 2x2 matrix m
, the element order would be m[1, 1]
, m[2, 1]
, m[1, 2]
, m[2, 2]
. This will usually be the same ordering as the names returned by model$expandNodeNames("m", returnScalarElements=TRUE)
.
The actual values used as inputs are the current values in the compiled model object. These would typically be set by code like values(model, with_respect_to_nodes) <<- my_values
before the call to nimDerivs
in the above example.
As above, derivatives can also be calculated in uncompiled execution, but they will be much slower and less accurate. Uncompiled execution is mostly useful for checking that compiled derivatives are working correctly. This will use values in the uncompiled model object.
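derivs_all$run(order = 0:2) # the uncompiled instance created above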
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] -78.44085
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -9.358104 -0.3637619 -5.81414
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] -62.35821 -1.621705e+01 0.000000e+00
## [2,] -16.21705 -6.947075e+01 2.384186e-07
## [3,] 0.00000 2.384186e-07 -4.511517e+01
We can see that all the results clearly match the compiled results to a reasonable numerical precision.
16.7.2 Method 2: nimDerivs of a method that calls model$calculate
Sometimes one needs derivatives of calculations done in a nimbleFunction as well as in a model. And sometimes one needs to change values in a model before and/or after doing model calculations and have those changes recorded in the derivative tape.
For these reasons, it is possible to take derivatives of a method that includes a call to model$calculate
. Continuing the above example, say we want to take derivatives with respect to the log of sigma
, so the input vector will be treated as intercept
, beta
, and log(sigma)
, in that order. We will convert log(sigma)
to sigma
before using it in the model, and we want the derivative of that transformation included in the tape (i.e. using the chain rule).
(Using log(sigma)
instead of sigma
is useful so an algorithm such as optimization or HMC can use an unconstrained parameter space. A constraint such as sigma > 0
can be difficult for many algorithms. See below for NIMBLE’s automated parameter transformation system if you need to do this systematically.)
Here is a nimbleFunction to do that.
derivs_nf2 <- nimbleFunction(
  setup = function(model, wrt_nodes, calc_nodes) {
    derivsInfo <- makeModelDerivsInfo(model, wrt_nodes, calc_nodes)
    updateNodes <- derivsInfo$updateNodes
    constantNodes <- derivsInfo$constantNodes
    n_wrt <- length(wrt_nodes)
    # If wrt_nodes might contain non-scalar nodes, the more general way
    # to determine the length of all scalar elements is:
    # length(model$expandNodeNames(wrt_nodes, returnScalarComponents = TRUE))
  },
  run = function(x = double(1)) {
    x_trans <- x # x[1:2] don't need transformation
    x_trans[3] <- exp(x[3]) # transformation of x[3]
    values(model, wrt_nodes) <<- x_trans # put inputs into model
    ans <- model$calculate(calc_nodes) # calculate model
    return(ans)
    returnType(double(0))
  },
  methods = list(
    derivsRun = function(x = double(1),
                         order = integer(1),
                         reset = logical(0, default=FALSE)) {
      wrt <- 1:n_wrt
      ans <- nimDerivs(run(x), wrt = wrt, order = order, reset = reset,
                       model = model, updateNodes = updateNodes,
                       constantNodes = constantNodes)
      return(ans)
      returnType(ADNimbleList())
    }
  ),
  buildDerivs = list(run = list()) # or simply 'run' would work in this case
)
Let’s see how this can be used.
derivs_all2 <- derivs_nf2(model, wrt_nodes, calc_nodes)
cDerivs_all2 <- compileNimble(derivs_all2, project = model)
params <- values(model, wrt_nodes)
params[3] <- log(params[3])
cDerivs_all2$derivsRun(params, order = 0:2)
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] -78.44085
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -9.358104 -0.3637619 -2.90707
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] -62.35820 -16.21705 0.00000
## [2,] -16.21705 -69.47074 0.00000
## [3,] 0.00000 0.00000 -14.18586
Notice that these results are the same as we saw above except for the third elements, which represent derivatives with respect to sigma
above and to log(sigma)
here.
There are several important new points here:
- The only valid way to get values into the model that will be recorded on the AD tape is with values(model, nodes) <<- some_values as shown. Other ways, such as nimCopy or model[[node]] <<- some_value, are not currently supported to work with AD.
- The call to nimDerivs of run must be told some information about the model calculations that will be done inside of run. model is of course the model that will be used. constantNodes is a vector of node names whose values are needed for the calculations but are not expected to change until you use reset = TRUE; values of these nodes will be baked into the AD tape until reset = TRUE. updateNodes is a vector of node names whose values are needed for the calculations, might change between calls even with reset = FALSE, and are neither part of wrt_nodes nor a deterministic part of calc_nodes. The function makeModelDerivsInfo, as shown in the setup code, determines what is usually needed for constantNodes and updateNodes.
- In Method 1 above, the NIMBLE compiler automatically determines constantNodes and updateNodes using makeModelDerivsInfo based on the inputs to nimDerivs(model$calculate(...), ...).
- One can use double-taping (see the sketch after this list), but if so, both calls to nimDerivs need the model, updateNodes, and constantNodes arguments, which should normally be identical.
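As a hedged sketch of the double-taping point (the names derivs_nf3, gradRun, and hessRun are illustrative, adapting derivs_nf2 without the log(sigma) transformation), note that the inner and outer nimDerivs calls receive identical model information:

derivs_nf3 <- nimbleFunction(
  setup = function(model, wrt_nodes, calc_nodes) {
    derivsInfo <- makeModelDerivsInfo(model, wrt_nodes, calc_nodes)
    updateNodes <- derivsInfo$updateNodes
    constantNodes <- derivsInfo$constantNodes
    n_wrt <- length(wrt_nodes)
  },
  run = function(x = double(1)) {
    values(model, wrt_nodes) <<- x      # put inputs into model
    ans <- model$calculate(calc_nodes)  # calculate model
    return(ans)
    returnType(double(0))
  },
  methods = list(
    # First tape: gradient of the model calculations.
    gradRun = function(x = double(1)) {
      wrt <- 1:n_wrt
      d <- nimDerivs(run(x), wrt = wrt, order = 1,
                     model = model, updateNodes = updateNodes,
                     constantNodes = constantNodes)
      return(d$jacobian[1, ])
      returnType(double(1))
    },
    # Second tape: the Jacobian of gradRun is the Hessian. Both this
    # nimDerivs call and the one inside gradRun pass the same model,
    # updateNodes, and constantNodes.
    hessRun = function(x = double(1)) {
      wrt <- 1:n_wrt
      d <- nimDerivs(gradRun(x), wrt = wrt, order = 1,
                     model = model, updateNodes = updateNodes,
                     constantNodes = constantNodes)
      return(d$jacobian)
      returnType(double(2))
    }
  ),
  buildDerivs = list(run = list(), gradRun = list())
)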
16.7.2.1 Advanced topic: more about constantNodes and updateNodes
In most cases, you can obtain constantNodes and updateNodes from makeModelDerivsInfo without knowing exactly what they mean. But for advanced uses and possible debugging needs, let's explore these arguments in more detail. The purpose of constantNodes and updateNodes is to tell the AD system about all nodes that will be needed for taped model calculations. Specifically:
- updateNodes and constantNodes are vectors of node names that together comprise any nodes that are necessary for model$calculate(calc_nodes) but are:
  - not in the wrt argument to nimDerivs (only relevant for Method 1), and
  - not assigned into the model using values(model, nodes) <<- some_values prior to model$calculate (only relevant for Method 2), and
  - not a deterministic node in calc_nodes.
- updateNodes includes node names satisfying those conditions whose values might change between uses of the tape, regardless of the reset argument.
- constantNodes includes node names satisfying those conditions whose values will be baked into the tape (i.e., will not change) until the next call with reset=TRUE.
To fix ideas, say that we want derivatives with respect to ran_eff[1] for calculations of it and the data that depend on it, including any deterministic nodes. In other words, wrt will be ran_eff[1], and calc_nodes will be:
## [1] "ran_eff[1]"
## [2] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 1]"
## [3] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 2]"
## [4] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 3]"
## [5] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 4]"
## [6] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 5]"
## [7] "y[1, 1]"
## [8] "y[1, 2]"
## [9] "y[1, 3]"
## [10] "y[1, 4]"
## [11] "y[1, 5]"
(The "lifted" nodes here are for exp(intercept + beta*X[i,j] + ran_eff[i]), i.e. the inputs to dpois. See chapters 2 and 13 to learn about lifted nodes.)
In this case, the log probability of ran_eff[1] itself requires sigma. Since sigma is not in wrt_nodes, it is not assigned into the model by the line values(model, wrt_nodes) <<- x_trans. It is also not a deterministic node (or any node) in calc_nodes, so it must be included in updateNodes or constantNodes. Since it might change between calls, it should be included in updateNodes.
Next, notice that the stochastic node y[1, 1] in calc_nodes means that the log probability of y[1, 1] will be calculated, and this requires the actual value of y[1, 1]. This node is part of calc_nodes but is not a deterministic part of it, so it must be provided in either updateNodes or constantNodes. When data values will not be changed often, it is better to put those nodes in constantNodes, because that is more efficient than putting them in updateNodes. (If the data values are changed, use reset=TRUE, which will reset the values of all constantNodes in the tape.)
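One can check this directly (a sketch; the exact node names returned depend on the model):

# Inspect what makeModelDerivsInfo chooses for this case.
info <- makeModelDerivsInfo(model, "ran_eff[1]", calc_nodes)
info$updateNodes    # expected to include "sigma"
info$constantNodes  # expected to include the data nodes y[1, 1:5]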
The function makeModelDerivsInfo inspects the model and determines the usual needs for updateNodes and constantNodes. By default, data nodes are put in constantNodes. Use dataAsConstantNodes = FALSE in makeModelDerivsInfo if you want them put in updateNodes.
Note that a deterministic node in calc_nodes will have its value calculated as part of the operations recorded in the AD tape, so it does not need to be included in updateNodes or constantNodes.
As usual in model-generic programming in NIMBLE, be aware of lifted nodes and their implications. Suppose in the Poisson GLMM we had used a precision parameterization for the random effects.
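A sketch of the changed declarations follows (the dhalfflat prior on precision is an illustrative assumption):

precision ~ dhalfflat()  # illustrative prior, replacing the prior on sigma
for(i in 1:10) {
  ran_eff[i] ~ dnorm(0, tau = precision)  # precision parameterization
  # ... (the rest of the model is unchanged)
}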
This would result in a lifted node for the standard deviation, calculated as 1/sqrt(precision). That lifted node is what would actually be used in dnorm for each ran_eff[i]. Now if ran_eff[1] is in wrt_nodes, the lifted node (but not precision) will be in updateNodes (as determined by makeModelDerivsInfo). If you then change the value of precision, you must be sure to calculate the lifted node before obtaining derivatives; otherwise, the value of the lifted node will correspond to the old value of precision. These considerations are not unique to AD but rather are part of model-generic programming (see chapter 15).
An example of model-generic programming to update any lifted nodes that depend on precision follows.
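A minimal sketch: getDependencies with determOnly = TRUE returns the deterministic dependencies, which include any lifted nodes.

# Recalculate the deterministic (including lifted) dependencies of
# precision, so the lifted standard deviation matches the new value.
model$calculate(model$getDependencies("precision", determOnly = TRUE))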
In Method 1 above, NIMBLE automatically uses makeModelDerivsInfo based on the code nimDerivs(model$calculate(<args>), <args>). However, in Method 2, when model$calculate(<args>) is used in a method such as run, a call to nimDerivs(run(<args>), <args>) requires the model, updateNodes, and constantNodes arguments to be provided. Hence, the two functions (run and derivsRun) must be written in coordination.
16.8 Parameter transformations
Many algorithms that explore a parameter space work best in an unconstrained parameter space. For example, R's optim provides several optimization methods, but only one (L-BFGS-B) allows constraints on the parameter space. Similarly, HMC is implemented to work in an unconstrained parameter space.
NIMBLE provides a nimbleFunction to automatically create a transformation from original parameters to unconstrained parameters, along with the inverse transformation back to the original parameters. Denoting x as an original parameter and g(x) as the transformed parameter, the cases handled include:
- If scalar \(x \in (0, \infty)\), then \(g(x) = \log(x)\). For example, x ~ dweibull in a model means \(x \in (0, \infty)\).
- If scalar \(x \in (0, 1)\), then \(g(x) = \mbox{logit}(x)\). For example, x ~ dbeta in a model means \(x \in (0, 1)\).
- If scalar \(x \in (a, \infty)\), then \(g(x) = \log(x - a)\).
- If scalar \(x \in (-\infty, b)\), then \(g(x) = -\log(b - x)\).
- If scalar \(x \in (a, b)\), then \(g(x) = \mbox{logit}((x-a)/(b-a))\). For example, x ~ dunif(a, b) in a model means \(x \in (a, b)\).
- If matrix x[1:n, 1:n] ~ dwishart or x[1:n, 1:n] ~ dinvwishart in a model, then \(g(x) = \mbox{chol}(x)\) for non-diagonal elements of \(\mbox{chol}(x)\) and \(g(x) = \log(\mbox{chol}(x))\) for diagonal elements of \(\mbox{chol}(x)\), where \(\mbox{chol}(x)\) is the Cholesky decomposition of \(x\). Note that \(x\) in these cases follows a Wishart or inverse Wishart distribution, respectively, and thus is a random precision or covariance matrix. That means it must be positive definite and thus have positive diagonal elements of \(\mbox{chol}(x)\).
- If vector x[1:n] ~ ddirch in a model, then \(g(x[1]) = \mbox{logit}(x[1])\) and \(g(x[i]) = \mbox{logit}(x[i] / (1 - \sum_{k=1}^{i-1} x[k]))\) for \(i > 1\). In this case, x follows a Dirichlet distribution and thus has the simplex constraint that \(\sum_{i=1}^n x[i] = 1\).
- If matrix x[1:n, 1:n] ~ dlkj_corr_cholesky in a model, \(g(x)\) is the LKJ Cholesky transformation. This supports the LKJ prior for the Cholesky factor of a correlation matrix.
Note that the scalar transformations are all monotonic increasing functions of the original parameter.
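As a hedged sketch of the interface (parameterTransform is nimble's constructor for this system; the transform and inverseTransform methods are used here illustratively, so treat the details as a sketch rather than a definitive recipe):

# Create a transformation for chosen model nodes; sigma, being positive,
# is mapped to an unconstrained scale, while intercept and beta are
# already unconstrained.
pt <- parameterTransform(model, c("intercept", "beta", "sigma"))
orig <- values(model, c("intercept", "beta", "sigma"))
g <- pt$transform(orig)         # to the unconstrained space
back <- pt$inverseTransform(g)  # recovers the original values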
The worked example next will illustrate use of the parameter transformation system.