As of version xx.xx.xx, NIMBLE can automatically provide numerically accurate derivatives of potentially arbitrary order for most calculations in models and/or nimbleFunctions. This feature is in beta-testing mode, meaning it has some rough edges and we expect bugs will be found. This feature supports methods such as Hamiltonian Monte Carlo (HMC, see package nimbleHMC), Laplace approximation, and fast optimization with methods that use function gradients.

You can install the beta-testing version using the
`remotes`

package like this:

`remotes::install_github("nimble-dev/nimble", ref="AD-rc0", subdir="packages/nimble", quiet=TRUE)`

Automatic (or algorithmic) differentiation (AD) refers to the method
of carrying derivative information through a set of mathematical
operations. When done this way, derivatives are numerically accurate to
the precision of the computer. This is distinct from finite element
methods, used by packages such as `numDeriv`

(Gilbert and Varadhan 2019) and
`pracma`

(Borchers 2022), which
approximate derivatives by calculating function values at extremely
nearby points. Finite element methods are slower and less accurate than
AD. It is also distinct from writing separate functions for each
derivative, which can become very complicated and sometimes slower than
AD. NIMBLE uses the CppAD package (Bell
2022) as its AD engine, following TMB’s success (Kristensen et al. 2016) in doing so. CppAD
supports potentially arbitrary-order derivatives. A general reference on
AD is Griewank and Walther (2008).

AD is not a simple feature in NIMBLE. It requires care in models and
nimbleFunctions, awareness of what can go wrong, and testing to be sure
things are working as you want. Calls to `compileNimble`

that
include AD features will result in slower C++ compilation. That said,
using a packaged AD algorithm with a model should be as simple as
setting `buildDerivs=TRUE`

in the model.

We will introduce NIMBLE’s AD features step by step, from simply turning them on for a model to using them in your own nimbleFunctions. We will show you:

- how to turn on derivatives in a model and use them for Hamiltonian Monte Carlo (HMC).
- basics of obtaining derivatives in
`nimbleFunctions`

. - advanced uses of obtaining derivatives in
`nimbleFunctions`

, including*double-taping*. - how to get derivatives involving model calculations.
- details on what functions are supported and not supported for derivatives.
- automatic parameter transformations to give any model an unconstrained parameter space for algorithms to work in.
- an example showing use of nimble’s derivatives for maximum likelihood estimation.
- an example of nimble’s Laplace approximation.

First, make sure to set the option to enable derivative features.
This should be `TRUE`

by default, but just in case:

`nimbleOptions(enableDerivs = TRUE)`

NIMBLE’s HMC sampler is provided in a separate package, nimbleHMC.
HMC is an MCMC method that works by a physical analogy of frictionless
motion on a surface that is the negative log posterior density. It often
achieves good mixing, although at high computational cost. Within
`nimble`

, HMC is more flexible than in some implementations
because HMC can be used on some parts of a model while other samplers
are used on (the same or) other parts of the same model. Indeed, the
inventor of HMC discussed using it in this way (Neal 2011).

NIMBLE’s Laplace approximation is provided in
`buildLaplace`

. An example is given below.

We’ll use a Poisson GLMM as a simple example model. There will be 10
groups (`i`

) of 5 observations (`j`

) each. Each
observation has a covariate, `X`

, and each group has a random
effect `ran_eff`

. Here is the model code:

```
model_code <- nimbleCode({
# priors
intercept ~ dnorm(0, sd = 100)
beta ~ dnorm(0, sd = 100)
sigma ~ dunif(0, 10)
# random effects and data
for(i in 1:10) {
# random effects
ran_eff[i] ~ dnorm(0, sd = sigma)
for(j in 1:5) {
# data
y[i,j] ~ dpois(exp(intercept + beta*X[i,j] + ran_eff[i]))
}
}
})
```

We’ll simulate some values for `X`

.

```
set.seed(123)
X <- matrix(rnorm(50), nrow = 10)
```

In order to use one of the algorithms for a specific model, that
model must be created with `buildDerivs=TRUE`

.

```
model <- nimbleModel(model_code, constants = list(X = X), calculate = FALSE,
buildDerivs = TRUE) # Here is the argument needed for AD.
```

`## Defining model`

`## Building model`

`## Checking model sizes and dimensions`

```
## [Note] This model is not fully initialized. This is not an error.
## To see which variables are not initialized, use model$initializeInfo().
## For more information on model initialization, see help(modelInitialization).
```

To finish setting up the model, we can set parameter values, simulate data values, and then set those as the data to use.

```
model$intercept <- 0
model$beta <- 0.2
model$sigma <- 0.5
model$calculate() # This will return NA because the model is not fully initialized.
```

`## [1] NA`

```
model$simulate(model$getDependencies('ran_eff'))
model$calculate() # Now the model is fully initialized: all nodes have valid values.
```

`## [1] -80.74344`

`model$setData('y') # Now the model has y marked as data, with values from simulation.`

Now we can configure an MCMC to use HMC. In this case we’ll use HMC for the whole model, i.e. all parameters and latent states.

```
library(nimbleHMC)
HMC <- buildHMC(model)
```

```
## ===== Monitors =====
## thin = 1: beta, intercept, sigma
## ===== Samplers =====
## HMC sampler (1)
## - intercept, beta, sigma, ran_eff[1], ran_eff[2], ran_eff[3], ran_eff[4], ran_eff[5], ran_eff[6], ran_eff[7], ran_eff[8], ran_eff[9], ran_eff[10]
```

`Cmodel <- compileNimble(model)`

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

`CHMC <- compileNimble(HMC, project = model)`

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

`samples <- runMCMC(CHMC, niter = 1000, nburnin = 500) # short run for illustration`

`## running chain 1 ...`

```
## [Note] HMC sampler (nodes: intercept, beta, sigma, ran_eff[1], ran_eff[2], ran_eff[3], ran_eff[4], ran_eff[5], ran_eff[6], r...) is using 500 warmup iterations
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
## [Note] HMC sampler (nodes: intercept, beta, sigma, ran_eff[1], ran_eff[2], ran_eff[3], ran_eff[4], ran_eff[5], ran_eff[6], r...) encountered 7 divergent paths
```

We will not investigate results in detail, but here is a summary to see that reasonable results were generated.

`summary(coda::as.mcmc(samples))`

```
##
## Iterations = 1:500
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## beta 0.1826 0.1396 0.006241 0.006576
## intercept -0.2162 0.3088 0.013810 0.023374
## sigma 0.7697 0.2867 0.012821 0.022572
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## beta -0.09827 0.09303 0.1788 0.280194 0.4464
## intercept -0.88502 -0.39564 -0.1723 -0.003709 0.3818
## sigma 0.34567 0.55507 0.7178 0.930289 1.4386
```

`nimbleFunctions`

.Now that we have seen a derivative-enabled algorithm work, let’s see
how to obtain derivatives in methods you write. We’ll start by showing
how derivatives work in nimbleFunctions *without* using a model.
The AD system allows you to obtain derivatives of one function or method
from another.

Let’s get derivatives of the function `Y = exp(-d * X)`

where `X`

is a vector, `d`

is a scalar, and
`Y`

is a vector.

```
derivs_demo <- nimbleFunction(
setup = function() {},
run = function(d = double(), x = double(1)) {
return(exp(-d*x))
returnType(double(1))
},
methods = list(
derivsRun = function(d = double(), x = double(1)) {
wrt <- 1:(1 + length(x)) # total length of d and x
return(derivs(run(d, x), wrt = wrt, order = 0:2))
returnType(ADNimbleList())
}
),
buildDerivs = 'run'
)
```

Here are some things to notice:

- Having
`setup`

code allows the nimbleFunction to have multiple methods (i.e. to behave like a class definition in object-oriented programming). If, as here, you don’t need the`setup`

to do anything, you can instead use`setup=TRUE`

. - Any arguments to
`run`

that are real numbers (i.e. regular double precision numbers, but not integers or logicals) will have derivatives tracked when called through`derivs`

. - The “with-respect-to” (
`wrt`

) argument gives indices of the arguments for which you want derivatives. In this case, we’re including all elements of`d`

and`X`

. The indices form one sequence over all arguments. `order`

can contain any of`0`

(value),`1`

(1st order), or`2`

(2nd order) derivatives requested. Higher-order derivatives can be obtained by double-taping, described below.- Basics of
`nimbleFunction`

programming are covered in Chapter (ref). These include type declarations (`double()`

for scalar,`double(1)`

for vector), that`run`

is just a method with a favored name, and the distinction between setup code and run code. `derivs`

can alternatively be called`nimDerivs`

. In fact the former will be converted to the latter internally.

Let’s see this work.

```
my_derivs_demo <- derivs_demo()
C_my_derivs_demo <- compileNimble(my_derivs_demo)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

```
d <- 1.2
x <- c(2.1, 2.2)
C_my_derivs_demo$derivsRun(d, x)
```

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] 0.08045961 0.07136127
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -0.1689652 -0.09655153 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] 0.3548269 0.1222986 0
## [2,] 0.1222986 0.1158618 0
## [3,] 0.0000000 0.0000000 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 0.3453885 0 0.1170325
## [2,] 0.0000000 0 0.0000000
## [3,] 0.1170325 0 0.1027602
```

We can see that using `order = 0:2`

results in the the
value (“0th” order result, i.e. the value returned by
`run(d, x)`

), the Jacobian (matrix of first order
derivatives), and the Hessian (array of second order derivatives). The
`run`

function here has taken three inputs (in the order
`d`

, `x[1]`

, `x[2]`

) and returned two
outputs (the first and second elements of `exp(-d*x)`

).

The i-th Jacobian row contains the first derivatives of the i-th
output with respect to `d`

, `x[1]`

, and
`x[2]`

, in that order. The first and second indices of the
Hessian array follow the same ordering as the columns of the Jacobian.
The third index of the Hessian array is for the output values. For
example, `hessian[3,1,2]`

is the second derivative of the
second output value (third index = 2) with respect to `x[2]`

(first index = 3) and `d`

(second index = 1).

(Although it may seem inconsistent to have the output index be first for Jacobian and last for Hessian, it is consistent with some standards for how these objects are created in other packages and used mathematically.)

When a function being called for derivatives (`run`

in
this case) has non-scalar arguments (`x`

in this case), the
indexing of inputs goes in order of arguments, and in column-major order
within arguments. For example, if we had arguments
`x = double(1)`

and `z = double(2)`

(a matrix),
the inputs would be ordered as `x[1]`

, `x[2]`

, …,
`z[1, 1]`

, `z[2, 1]`

, …, `z[1, 2]`

,
`z[2, 2]`

, etc.

Derivatives can also be calculated in uncompiled execution, but they
will be much slower and less accurate: slower because they are run in R,
and less accurate because they use finite element methods (from packages
`pracma`

and/or `numDeriv`

). Uncompiled execution
is mostly useful for checking that compiled derivatives are working
correctly, because although they are slower and less accurate, they are
also much simpler internally and thus provide good checks on compiled
results. For example:

`my_derivs_demo$derivsRun(d, x)`

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] 0.08045961 0.07136127
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -0.1689652 -0.09655153 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] 0.3548269 0.1222986 0
## [2,] 0.1222986 0.1158618 0
## [3,] 0.0000000 0.0000000 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 0.3453885 0 0.1170325
## [2,] 0.0000000 0 0.0000000
## [3,] 0.1170325 0 0.1027602
```

We can see that the results are very close, but typically not identical, to those from the compiled version.

Sometimes one wants to omit tracking of certain variables in
derivative calculations. Here is an example. Say that we write
`exp(-d*x)`

using a for-loop instead of a vectorized
operation as above. It wouldn’t make sense to track derivatives for the
for-loop index (`i`

), and indeed it would cause
`compileNimble`

to fail. The code to use a for-loop while
telling `derivs`

to ignore the for-loop index is:

```
derivs_demo2 <- nimbleFunction(
setup = function() {},
run = function(d = double(), x = double(1)) {
ans <- numeric(length = length(x))
for(i in 1:length(x))
ans[i] <- exp(-d*x[i])
return(ans)
returnType(double(1))
},
methods = list(
derivsRun = function(d = double(), x = double(1)) {
wrt <- 1:(1 + length(x)) # total length of d and x
return(derivs(run(d, x), wrt = wrt, order = 0:2))
returnType(ADNimbleList())
}
),
buildDerivs = list(run = list(ignore = 'i'))
)
```

We can see that it gives identical results as above, looking at only the Jacobian to keep the output short.

```
my_derivs_demo2 <- derivs_demo2()
C_my_derivs_demo2 <- compileNimble(my_derivs_demo2)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

```
d <- 1.2
x <- c(2.1, 2.2)
C_my_derivs_demo2$derivsRun(d, x)$jacobian
```

```
## [,1] [,2] [,3]
## [1,] -0.1689652 -0.09655153 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352
```

One might think it should be obvious that `i`

should not
be involved in derivatives, but sometimes math is actually done with a
for-loop index. Another way to ensure derivatives won’t be tracked for
`i`

would be to write `i <- 1L`

, before the
loop. By assigning a definite integer to `i`

, it will be
established as definitely of integer type throughout the function and
thus not have derivatives tracked. (If you are not familiar with the
`L`

in R, try `storage.mode(1L)`

vs
`storage.mode(1)`

.)

Note the more elaborate value for `buildDerivs`

. Above,
this was just a character vector. Here it is a named list, with each
element itself being a list of control values. The control value
`ignore`

is a character vector of variable names to ignore in
derivative tracking. `buildDerivs = 'run'`

is equivalent to
`buildDerivs = list(run = list())`

.

Derivatives will be tracked for whatever series of calculations occur
in a method, possibly including calls to other methods or
nimbleFunctions that have `buildDerivs`

set. Let’s look at an
example where we have a separate function to return the element-wise
square root of an input vector. The net calculation for derivatives will
be `sqrt(exp(-d*x)))`

.

```
nf_sqrt <- nimbleFunction(
run = function(x = double(1)) {
return(sqrt(x))
returnType(double(1))
},
buildDerivs = TRUE
)
derivs_demo3 <- nimbleFunction(
setup = function() {},
run = function(d = double(), x = double(1)) {
ans <- exp(-d*x)
ans <- nf_sqrt(ans)
return(ans)
returnType(double(1))
},
methods = list(
derivsRun = function(d = double(), x = double(1)) {
wrt <- 1:(1 + length(x)) # total length of d and x
return(derivs(run(d, x), wrt = wrt, order = 0:2))
returnType(ADNimbleList())
}
),
buildDerivs = 'run'
)
```

And then let’s see it work:

```
my_derivs_demo3 <- derivs_demo3()
C_my_derivs_demo3 <- compileNimble(my_derivs_demo3)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

```
d <- 1.2
x <- c(2.1, 2.2)
C_my_derivs_demo3$derivsRun(d, x)$jacobian
```

```
## [,1] [,2] [,3]
## [1,] -0.2978367 -0.1701924 0.0000000
## [2,] -0.2938488 0.0000000 -0.1602812
```

Note that for a nimbleFunction without setup code, one can say
`buildDerivs=TRUE`

, `buildDerivs = 'run'`

, or
`buildDerivs = list(run = list())`

.

At this point, it will be helpful to grasp more of how AD works and its implementation in nimble via the CppAD library (Bell 2022). AD methods work by following a set of calculations multiple times, sometimes in reverse order.

For example, the derivative of \(y = \exp(x^2)\) can be calculated (by the chain rule of calculus) as the derivative of \(y = \exp(z)\) (evaluated at \(z = x^2\)) times the derivative of \(z = x^2\) (evaluated at \(x\)). This can be calculated by first going through the sequence of steps for the value, i.e. (i) \(z = x^2\), (ii) \(y = \exp(z)\), and then again for the derivatives, i.e. (i) \(dz = 2x (dx)\), (ii) \(dy = \exp(z) dz\). These steps determine the instantaneous change \(dy\) that results from an instantaneous change \(dx\). (It may be helpful to think of both values as a function of a third variable such as \(t\). Then we are determining \(dy/dt\) as a function of \(dx/dt\).)

The derivative calculations typically need the values. For example, the derivative of \(\exp(z)\) is (also) \(\exp(z)\), for which the value of \(z\) is needed, which in this case is \(x^2\), one of the steps in calculating the value of \(\exp(x^2)\). (In fact, in this case the derivative of \(\exp(z)\) can more simply use the value of \(y\) itself.) Hence, values are calculated before derivatives.

What was just described is called *forward* mode AD. The
derivatives can also be calculated by working through \(\exp(x^2)\) in *reverse* mode. This
is often less familiar to graduates of basic calculus courses. In
reverse mode, one determines what instantaneous change \(dx\) would give an instantaneous change
\(dy\). For example, given a value of
\(dy\), we can calculate \(1/dz = \exp(z) / dy\) and then \((1/dx) = (x^2) / dz\). In general, choice
of forward mode versus reverse mode has to do with the lengths of inputs
and outputs and possibly specific operations involved. In nimble, some
reasonable defaults are used.

In CppAD, the metaphor of pre-digital technology – magnetic tapes –
is used for the set of operations in a calculation. When a line with
`nimDerivs`

is first run, the operations in the function it
calls are *taped* and then re-used – *played* – in forward
and/or reverse orders to obtain derivatives. Hence we will refer to the
AD *tape* in explaining some features below. It is also possible
to reset (i.e. re-record) the AD tape used for a particular
`nimDerivs`

call.

When taped operations are played, the resulting operations can
themselves be taped. We call this *meta-taping* or
*double-taping*. It is useful because it can sometimes boost
efficiency.

`nimDerivs`

callThe first time a call to `nimDerivs`

runs, it records a
tape of the operations in the its first argument (`run(d, x)`

in this case). On subsequent calls, it re-uses that tape without
re-recording it. That means subsequent calls are usually much faster.
*It also means that subsequent calls must use the
same sized arguments as the recorded call.* For example, we can use

`C_my_derivs_demo$derivsRun`

with new arguments of the same
size(s):`C_my_derivs_demo$derivsRun(-0.4, c(3.2, 5.1))`

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] 3.596640 7.690609
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -11.50925 1.438656 0.000000
## [2,] -39.22211 0.000000 3.076244
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] 36.829591 -8.2003386 0
## [2,] -8.200339 0.5754624 0
## [3,] 0.000000 0.0000000 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 200.03275 0 -23.379452
## [2,] 0.00000 0 0.000000
## [3,] -23.37945 0 1.230497
```

However, if we call `C_my_derivs_demo$derivsRun`

with
`length(x)`

different from 2, the result will be garbage. If
we need to change the size of the arguments, we need to re-record the
tape. This is done with the `reset`

argument.

Here is a slightly more general version of the
`derivs_demo`

allowing a user to reset the tape. It also
takes `order`

and `wrt`

as arguments instead of
hard-coding them.

```
derivs_demo4 <- nimbleFunction(
setup = function() {},
run = function(d = double(), x = double(1)) {
return(exp(-d*x))
returnType(double(1))
},
methods = list(
derivsRun = function(d = double(), x = double(1),
wrt = integer(1), order = integer(1),
reset = logical(0, default=FALSE)) {
return(derivs(run(d, x), wrt = wrt, order = order, reset = reset))
returnType(ADNimbleList())
}
),
buildDerivs = 'run'
)
```

Now we will illustrate the use of reset. To make shorter output, we’ll request only the Jacobian.

```
my_derivs_demo4 <- derivs_demo4()
C_my_derivs_demo4 <- compileNimble(my_derivs_demo4)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

```
d <- 1.2
x <- c(2.1, 2.2)
# On the first call, reset is ignored because the tape must be recorded.
C_my_derivs_demo4$derivsRun(d, x, 1:3, 1)
```

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## numeric(0)
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -0.1689652 -0.09655153 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352
## Field "hessian":
## <0 x 0 x 0 array of double>
##
```

```
# On the second call, reset=FALSE, so the tape is re-used.
C_my_derivs_demo4$derivsRun(-0.4, c(3.2, 5.1), 1:3, 1)
```

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## numeric(0)
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -11.50925 1.438656 0.000000
## [2,] -39.22211 0.000000 3.076244
## Field "hessian":
## <0 x 0 x 0 array of double>
##
```

```
# If we need a longer X, we need to say reset=TRUE
C_my_derivs_demo4$derivsRun(1.2, c(2.1, 2.2, 2.3), 1:4, 1, reset=TRUE)
```

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## numeric(0)
## Field "jacobian":
## [,1] [,2] [,3] [,4]
## [1,] -0.1689652 -0.09655153 0.00000000 0.00000000
## [2,] -0.1569948 0.00000000 -0.08563352 0.00000000
## [3,] -0.1455711 0.00000000 0.00000000 -0.07595012
## Field "hessian":
## <0 x 0 x 0 array of double>
##
```

```
# Now we can use with reset=FALSE with X of length 3
C_my_derivs_demo4$derivsRun(-0.4, c(3.2, 5.1, 4.5), 1:4, 1)
```

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## numeric(0)
## Field "jacobian":
## [,1] [,2] [,3] [,4]
## [1,] -11.50925 1.438656 0.000000 0.000000
## [2,] -39.22211 0.000000 3.076244 0.000000
## [3,] -27.22341 0.000000 0.000000 2.419859
## Field "hessian":
## <0 x 0 x 0 array of double>
##
```

Remember, you can make multiple instances of
`derivs_demo4`

, so you could keep track of the argument sizes
currently recorded for each one.

Other important situations where you need to reset a tape are when values that are otherwise “baked in” to an AD tape need to be changed. By “baked in”, we mean that some values are permanently part of a tape until it is reset.

`reset=TRUE`

Specifically, you need to reset a tape when:

- the extent of any for-loops change. For-loops are not natively
recorded, but rather each operation done in execution of the for-loop is
recorded. (This is sometimes called “loop unrolling”.) Therefore, in
`for(i in start:end)`

, the values of`start`

and`end`

are in effect baked into a tape. - the outcomes of any if-then-else conditions in a nimbleFunction are changed.
- any other member data used in calculations changes. It is possible
to create a variable in the
`setup`

function and use it in methods such as`run`

. Such variables become member data in the terminology of object-oriented programming. Such variables will not have derivatives tracked if used in a function while it is being taped, so the values will be baked into the tape. This can be useful if you understand what is happening or confusing if not.

Uncompiled execution of `derivs`

ignores the reset
argument.

If you are interested in measuring the performance of AD in nimble,
please remember that the first call to `nimDerivs`

and any
subsequent call with `reset=TRUE`

will be slower, often much
slower, that subsequent calls.

Suppose you are interested only in the Jacobian, not in the value, and/or want only some elements of the Jacobian. You might still need the value, but obtaining the value alone from an AD tape will be slower than obtaining it by simply calling the function. On the other hand, AD methods need to calculate the value before calculating first order derivatives, so the value will be calculated anyway. However, in some cases, some of the steps of value calculations aren’t really needed if one only wants, say, first-order derivatives. In addition, if not all elements of the Jacobian are wanted, then some unnecessary calculations will be done internally that one might want to avoid.

A way to cut out unnecessary calculations is to record a tape of a tape, which we call double taping. Let’s see an example before explaining further.

```
derivs_demo5 <- nimbleFunction(
setup = function() {},
run = function(d = double(), x = double(1)) {
return(exp(-d*x))
returnType(double(1))
},
methods = list(
jacobian_run_wrt_d = function(d = double(), x = double(1),
wrt = integer(1),
reset = logical(0, default=FALSE)) {
ans <- derivs(run(d, x), wrt = wrt, order = 1, reset = reset)
jac <- ans$jacobian[,1] # derivatives wrt 'd' only
return(jac)
returnType(double(1))
},
derivsJacobian = function(d = double(), x = double(1),
wrt = integer(1),
order = integer(1),
reset = logical(0, default = FALSE)) {
innerWrt <- nimInteger(value = 1, length = 1)
ans <- nimDerivs(jacobian_run_wrt_d(d, x, wrt = innerWrt, reset = reset),
wrt = wrt, order = order, reset = reset)
return(ans)
returnType(ADNimbleList())
}
),
buildDerivs = c('run', 'jacobian_run_wrt_d')
)
```

What is happening in this code? `jacobian_run_wrt_d`

is a
method that returns part of the Jacobian of `run(d, x)`

,
specifically the first column, which contains derivatives with respect
to `d`

. It happens to use AD to do it, but otherwise it is
just some function with inputs `d`

and `x`

and a
vector output. (Arguments that are integer or logical do not have
derivatives tracked.) `derivsJacobian`

calculates derivatives
of `jacobian_run_wrt_d`

. This means that the
`value`

returned by `derivsJacobian`

will be the
value of `jacobian_run_wrt_d`

, which comprises the first
derivatives of `run`

with respect to `d`

. The
`jacobian`

returned by `derivsJacobian`

will
contain second derivatives of `run`

with respect to
`d`

and each of `d`

, `x[1]`

, and
`x[2]`

. And the `hessian`

returned by
`derivsJacobian`

will contain some *third* derivatives
of `run`

. Notice that `buildDerivs`

now includes
`jacobian_run_wrt_d`

.

Let’s see this in use.

```
my_derivs_demo5 <- derivs_demo5()
C_my_derivs_demo5 <- compileNimble(my_derivs_demo5)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

```
d <- 1.2
x <- c(2.1, 2.2)
C_my_derivs_demo5$derivsJacobian(d, x, wrt = 1:3, order = 0:2)
```

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] -0.1689652 -0.1569948
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] 0.3548269 0.1222986 0.0000000
## [2,] 0.3453885 0.0000000 0.1170325
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] -0.74513642 -0.08786189 0
## [2,] -0.08786189 -0.05020679 0
## [3,] 0.00000000 0.00000000 0
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] -0.7598548 0 -0.10047667
## [2,] 0.0000000 0 0.00000000
## [3,] -0.1004767 0 -0.05480546
```

We’ll compare these results to those shown above from the same values
of `d`

and `x`

. The `value`

here is the
same as `jacobian[1:2, 1]`

above. The `jacobian`

here is the same as `t(hessian[1, 1:3, 1:2])`

or (by symmetry
of second derivatives) `t(hessian[1:3, 1, 1:2])`

above. And
the `hessian[i,j,k]`

here contains the third derivative of
the k-th output of `run(d, x)`

with respect to
`d`

, the i-th input, and the j-th input.

It is also possible to call `derivs`

on (i.e. tape) a
function containing a `derivs`

call from which some 0-th
and/or 2nd order derivatives are extracted. It is even possible to
triple tape by calling `derivs`

on a function calling
`derivs`

on a function calling `derivs`

, and so
on.

Obtaining derivatives involving model calculations takes some special considerations, and there are two ways to do it.

We will use the Poisson GLMM above as an example model for which we want derivatives. We can obtain derivatives with respect to any variables for all or any subset of model calculations.

`nimDerivs`

of
`model$calculate`

Recall that `model$calculate(nodes)`

returns the sum of
the log probabilities of all stochastic nodes in `nodes`

.
Deterministic calculations are also executed; they contribute 0 to the
sum of probabilities but may be needed for inputs to subsequent
calculations. Calculations are done in the order of `nodes`

,
which should be a valid order for the model, often obtained from
`model$getDependencies`

.

The simplest way to get derivatives for model calculations is to use
`model$calculate`

as the function taped by
`derivs`

(here shown by the alias `nimDerivs`

for
illustration).

```
derivs_nf <- nimbleFunction(
setup = function(model, with_respect_to_nodes, calc_nodes) {},
run = function(order = integer(1),
reset = logical(0, default=FALSE)) {
ans <- nimDerivs(model$calculate(calc_nodes), wrt = with_respect_to_nodes,
order = order, reset = reset)
return(ans)
returnType(ADNimbleList())
}
)
```

In `derivs_nf`

:

- The mere presence of
`model`

,`with_respect_to_nodes`

, and`calc_nodes`

as setup arguments makes them available as member data for`run`

or other methods. `model`

will be a model object returned from`nimbleModel`

.`with_respect_to_nodes`

will be the names of nodes we want derivatives with respect to.`calc_nodes`

will be the nodes to be calculated, in the order given.`order`

can contain any of`0`

(value),`1`

(1st order), or`2`

(2nd order) derivatives requested, as above.`reset`

should be TRUE if the AD tape should be reset (re-recorded), as above. There are additional situations when a tape for`model$calculate`

should be reset, discussed below.

Thus,
`nimDerivs(model$calculate(calc_nodes), wrt = with_respect_to_nodes, <other args>)`

takes derivatives of a function whose inputs are the values of
`with_respect_to_nodes`

(using their values in the model
object) and whose output is the summed log probability of return by
`model$calculate(calc_nodes)`

. The internal handling of this
case is distinct from other calls to `nimDerivs`

.

Now we can make an instance of `derivs_nf`

, compile the
model and nimbleFunction instance, and look at the results. We will
assume the `model`

was built as above in the HMC example

```
wrt_nodes <- c('intercept','beta', 'sigma')
calc_nodes <- model$getDependencies(wrt_nodes)
derivs_all <- derivs_nf(model, wrt_nodes, calc_nodes)
cModel <- compileNimble(model)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

`cDerivs_all <- compileNimble(derivs_all, project = model)`

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

```
derivs_result <- cDerivs_all$run(order = 0:2)
derivs_result
```

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] -80.74344
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -9.358104 -0.3637619 -5.81414
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] -62.35820 -16.21705 0.00000
## [2,] -16.21705 -69.47074 0.00000
## [3,] 0.00000 0.00000 -45.11516
```

As above, using `order = 0:2`

results in the the value
(0th order), Jacobian (1st order), and Hessian (2nd order). The function
`model$calculate`

is organized here to have inputs that are
the current values of `intercept`

, `beta`

, and
`sigma`

in the model. It has output that is the summed log
probability of the `calc_nodes`

. The Jacobian columns are
first derivatives with respect to `intercept`

,
`beta`

and `sigma`

, respectively. The first and
second indices of the Hessian array follow the same ordering. For
example `derivs_result$hessian[2,1,1]`

is the second
derivative with respect to `beta`

(first index = 2) and
`intercept`

(second index = 1).

In the case of `model$calculate`

, the first index of the
Jacobian and the last index of the Hessian are always 1 because
derivatives are of the first (and only) output value.

The ordering of inputs is similar to that used above, such as for the
arguments `d`

and `x`

, but in this case the inputs
are model nodes. When non-scalar nodes such as matrix or array nodes are
used as `with_respect_to_nodes`

, the resulting elements of
Jacobian columns and Hessian first and second indices (corresponding to
the inputs) will follow column-major order. For example, for a 2x2
matrix `m`

, the element order would be `m[1, 1]`

,
`m[2, 1]`

, `m[1, 2]`

, `m[2, 2]`

.

The actual values used as inputs are the current values in the compiled model object.

As above, derivatives can also be calculated in uncompiled execution, but they will be much slower and less accurate. Uncompiled execution is mostly useful for checking that compiled derivatives are working correctly. This will use values in the uncompiled model object.

`derivs_all$run(order = 0:2)`

```
## nimbleList object of type NIMBLE_ADCLASS
## Field "value":
## [1] -80.74344
## Field "jacobian":
## [,1] [,2] [,3]
## [1,] -9.358104 -0.3637619 -5.81414
## Field "hessian":
## , , 1
##
## [,1] [,2] [,3]
## [1,] -62.35821 -16.21705 0.00000
## [2,] -16.21705 -69.47074 0.00000
## [3,] 0.00000 0.00000 -45.11517
```

We can see that all the results clearly match the compiled results to a reasonable numerical precision.

`nimDerivs`

of a method that calls
`model$calculate`

Sometimes one needs derivatives of calculations done in a nimbleFunction as well as in a model. And sometimes one needs to change values in a model before and/or after doing model calculations and have those changes recorded in the derivative tape.

For these reasons, it is possible in one method to take derivatives
of another method that includes a call to `model$calculate`

.
Continuing the above example, say we want to take derivatives with
respect to the log of `sigma`

, so the input vector will be
treated as `intercept`

, `beta`

, and
`log(sigma)`

, in that order. (This would be useful so we
could run an optimization algorithm on an unconstrained parameter space,
for example. Here, since `sigma`

must be positive, any
algorithm must respect that constraint, so it can be easier to use
algorithms involving `log(sigma)`

. See below for NIMBLE’s
automated parameter transformation system if you need to do this
systematically.) We will convert `log(sigma)`

to
`sigma`

before using it in the model, and we want the
derivative of that transformation included in the tape (i.e. using the
chain rule).

Here is a nimbleFunction to do that.

```
derivs_nf2 <- nimbleFunction(
setup = function(model, wrt_nodes, calc_nodes) {
derivsInfo <- makeDerivsInfo(model, wrt_nodes, calc_nodes)
updateNodes <- derivsInfo$updateNodes
constantNodes <- derivsInfo$constantNodes
n_wrt <- length(wrt_nodes)
# If wrt_nodes might contain non-scalar nodes, the more general way
# to determine the length of all scalar elements is:
# length(model$expandNodeNames(wrt_nodes, returnScalarComponents = TRUE))
},
run = function(x = double(1)) {
x_trans <- x # x[1:2] don't need transformation
x_trans[3] <- exp(x[3]) # transformation of x[3]
values(model, wrt_nodes) <<- x_trans # put inputs into model
ans <- model$calculate(calc_nodes) # calculate model
return(ans)
returnType(double(0))
},
methods = list(
derivsRun = function(x = double(1),
order = integer(1),
reset = logical(0, default=FALSE)) {
wrt <- 1:n_wrt
ans <- nimDerivs(run(x), wrt = wrt, order = order, reset = reset,
model = model, updateNodes = updateNodes,
constantNodes = constantNodes)
return(ans)
returnType(ADNimbleList())
}
),
buildDerivs = list(run = list()) # or simply 'run' would work in this case
)
```

Let’s see how this can be used.

```
derivs_all2 <- derivs_nf2(model, wrt_nodes, calc_nodes)
cDerivs_all2 <- compileNimble(derivs_all2, project = model)
params <- values(model, wrt_nodes)
params[3] <- log(params[3])
cDerivs_all2$derivsRun(params, order = 0:2)
```

Notice that these results are the same as we saw above except for
third elements, which represent derivatives with respect to
`sigma`

above and to `log(sigma)`

here.

There are several important new points here:

The only valid way to get values into the model that will be recorded on the AD tape is with

`values(model, nodes) <<- some_values`

as shown. Other ways such as`nimCopy`

or`model[[node]] <<- some_value`

are not currently supported to work with AD.The call to

`nimDerivs`

of`run`

must be told some information about the model calculations that will be done inside of`run`

.`model`

is of course the model that will be used.`constantNodes`

is a vector of node names whose values are needed for the calculations but are not expected to change until you use`reset = TRUE`

. Values of these nodes will be baked into the AD tape until`reset = TRUE`

.`updateNodes`

is a vector of node names whose values are needed for the calculations, might change between calls even with`reset = FALSE`

, and are neither part of`wrt_nodes`

nor a deterministic part of`calc_nodes`

. The function`makeDerivsInfo`

, as shown in the`setup`

code, determines what is usually needed for`constantNodes`

and`updateNodes`

.In Method 1 above, the NIMBLE compiler automatically determines

`constantNodes`

and`updateNodes`

using`makeDerivsInfo`

based on the inputs to`nimDerivs(model$calculate(...),...)`

.One can use double-taping, but if so

*both*calls to`nimDerivs`

need the`model`

,`updateNodes`

, and`constantNodes`

arguments, which should normally be identical.

It is possible to build derivative support in a user-defined function
or distribution used in a model. Simply set
`buildDerivs=TRUE`

, as shown in the `my_sqrt`

example above.

`constantNodes`

and
`updateNodes`

.In most cases, you can obtain `constantNodes`

and
`updateNodes`

from `makeDerivsInfo`

without
knowing exactly what they mean. But for advanced uses and possible
debugging needs, let’s explore these arguments in more detail. The
purpose of `constantNodes`

and `updateNodes`

is to
tell the AD system about all nodes that will be needed for taped model
calculations. Specifically:

`updateNodes`

and`constantNodes`

are vectors of nodes names that together comprise any nodes that are necessary for`model$calculate(calc_nodes)`

but are:- not in the
`wrt`

argument to`nimDerivs`

(only relevant for Method 1), and - not assigned into the model using
`values(model, nodes) <<-`

prior to`model$calculate`

(only relevant for Method 2), and - not a deterministic node in
`calc_nodes`

.

- not in the
`updateNodes`

includes node names satisfying those conditions whose values might change between uses of the tape regardless of the`reset`

argument.`constantNodes`

includes node names satisfying those conditions whose values will be baked into the tape (will not change) until the next call with`reset=TRUE`

.

To fix ideas, say that we want derivatives with respect to
`ran_eff[1]`

for calculations of it and the data that depend
on it, including any deterministic nodes. In other words,
`wrt`

will be `ran_eff[1]`

, and
`calc_nodes`

will be:

`model$getDependencies('ran_eff[1]')`

```
## [1] "ran_eff[1]"
## [2] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 1]"
## [3] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 2]"
## [4] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 3]"
## [5] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 4]"
## [6] "lifted_exp_oPintercept_plus_beta_times_X_oBi_comma_j_cB_plus_ran_eff_oBi_cB_cP_L7[1, 5]"
## [7] "y[1, 1]"
## [8] "y[1, 2]"
## [9] "y[1, 3]"
## [10] "y[1, 4]"
## [11] "y[1, 5]"
```

In this case, the log probability of `ran_eff[1]`

itself
requires `sigma`

. Since `sigma`

is not in
`wrt_nodes`

, it is not assigned into the model by the step
`values(model, wrt_nodes) <<- x_trans`

. It is also not
a deterministic node (or any node) in `calc_nodes`

, so it
must be included in `updateNodes`

or
`constantNodes`

. Since it might change between calls, it
should be included in `updateNodes`

.

Next, notice that the stochastic node `y[1, 1]`

in
`calc_nodes`

means that the log probability of
`y[1, 1]`

will be calculated, and this requires the actual
value of `y[1, 1]`

. This node is part of
`calc_nodes`

but is not a deterministic part of it, so it
must be provided in either `updateNodes`

or
`constantNodes`

. When data values will not be changed often,
it is better to put those nodes in `constantNodes`

, because
that will be more efficient than putting them in
`updateNodes`

. (If the data values are changed, use
`reset=TRUE`

, which will reset values of all
`constantNodes`

in the tape.)

The function `makeDerivsInfo`

inspects the model and
determines the usual needs for `updateNodes`

and
`constantNodes`

. By default, data nodes are put in
`constantNodes`

. Use `dataAsConstantNodes = FALSE`

in `makeDerivsInfo`

if you want them put in
`updateNodes`

.

Note that a deterministic node in `calc_nodes`

will have
its value calculated as part of the operations recorded in the AD tape,
so it does not need to be included in `updateNodes`

or
`constantNodes`

.

As usual in model-generic programming in NIMBLE, be aware of lifted nodes and their implications. Suppose in the Poisson GLMM we had used a precision parameterization for the random effects, with the changes in this code snippet:

```
precision ~ dgamma(0.01, 0.01)
# <other lines>
ran_eff[i] ~ dnorm(0, tau = precision)
```

This would result in a lifted node for the standard deviation,
calculated as `1/sqrt(precision)`

. That lifted node is what
would actually be used in `dnorm`

for each
`ran_eff[i]`

. Now if `ran_eff[1]`

is in
`wrt_nodes`

, the *lifted node* (but *not*
`precision`

) will be in `updateNodes`

(as
determined by `makeDerivsInfo`

). If you then change the value
of `precision`

, you must be sure to calculate the lifted node
before obtaining derivatives. Otherwise the value of the lifted node
will correspond to the old value of precision.

An example of model-generic programming to update any lifted nodes
that depend on `precision`

would be:

`model$calculate(model$getDependencies('precision', determOnly=TRUE))`

In Method 1 above, NIMBLE automatically uses
`makeDerivsInfo`

based on the code
`nimDerivs(model$calculate(<args>), <args>)`

.
However, in Method 2, when `model$calculate(<args>)`

is
used in a method such as `run`

, then a call to
`nimDerivs(run(<args>), <args>)`

requires the
`model`

, `updateNodes`

and
`constantNodes`

to be provided. In this sense, the two
functions (`run`

and `derivsRun`

) must be written
in coordination.

The additional rule for when tapes need to be reset based on model calculations is:

- when values of any
`constantNodes`

are changed. Usually these will only be data nodes.

Much of the math supported by NIMBLE will work for AD. Here are some details on what is and isn’t supported, as well as on some optionsn to control how linear algebra operations are handled.

Some of what is not supported for AD:

- stochastic indexing in models.
- cumulative (“p”) and quantile (“q”) functions for distributions.
- truncated distributions in models (because they use cumulative distribution functions).
- random number generation.
- specific distributions including
`dcat`

,`dcar_normal`

,`dcar_proper`

, and`dinterval`

. - some specific functions including
`step`

,`%%`

(mod),`eigen`

,`svd`

, and`bessel_k`

. - specific model operations including
`model$calculateDiff`

,`model$simulate`

,`model$getBound`

, and `model$getParam

Some details on what is supported include:

`round`

,`floor`

,`ceil`

, and`trunc`

are all supported with derivatives defined to be zero everywhere.`pow(a, b)`

requires positive`a`

and`b`

.- A new function
`pow_int(a, b)`

returns`pow(a, round(b))`

and thus sets all derivatives with respect to`b`

to zero. This allows valid derivatives with respect to`a`

even if it takes a negative value.

For the linear algebra functions `%*%`

, `chol`

,
`forwardsolve`

, `backsolve`

, and
`inverse`

, there are special extensions provided by NIMBLE
for CppAD called (in CppAD terms) “atomics”, short for derived “atomic”
classes by which one can extend CppAD. By default, these atomics will be
used and often improve efficiency. There may be cases where they
decrease efficiency, which might include when the matrix operands are
small or contain many zeros. To compare results with and without use of
the atomics, they can be turned off with a set of options:

```
nimbleOptions(useADmatMultAtomic = FALSE) # for %*%
nimbleOptions(useADcholAtomic = FALSE) # for chol
nimbleOptions(useADmatInverseAtomic = FALSE) # for inverse
nimbleOptions(useADsolveAtomic = FALSE) # for forwardsolve and backsolve
```

When a linear algebra atomic is turned off, the AD tape simply records each scalar operation that is a step in the linear algebra operation.

Many algorithms that in some way explore a parameter space are best
used in an unconstrained parameter space. For example, there are a bunch
of optimization methods provided in R’s `optim`

, but only one
(`L-BFGS-B`

) allows constraints on the parameter space.
Similarly, HMC is implemented to work in an unconstrained parameter
space.

NIMBLE provides a nimbleFunction to automatically create a
transformation from original parameters to unconstrained parameters and
inverse transformation back to the original parameters. Denoting
`x`

as an original parameter (scalar or vector) and
`g(x)`

as the transformed parameter, the cases handled
include:

- If \(x \in (0, \infty)\), \(g(x) = \log(x)\). For example,
`x ~ dweibull`

in a model means \(x \in (0, \infty)\). - If \(x \in (0, 1)\), \(g(x) = \mbox{logit}(x)\). For example,
`x ~ dbeta`

in a model means \(x \in (0, 1)\) - If \(x \in (a, \infty)\), \(g(x) = \log(x - a)\).
- If \(x \in (-\infty, b)\), \(g(x) = -\log(b - x)\).
- If \(x \in (a, b)\), \(g(x) = \mbox{logit}((x-a)/(b-a))\). For
example
`x ~ T(<some distribution>, a, b)`

in a model means \(x \in (a, b)\). - If
`x[1:n, 1:n] ~ dwishart`

or`x[1:n, 1:n] ~ dinvwishart`

in a model, \(g(x) = \mbox{chol}(x)\) for non-diagonal elements of \(\mbox{chol}(x)\) and \(g(x) = \log(\mbox{chol}(x))\) for diagonal elements of \(\mbox{chol}(x)\), where \(\mbox{chol}(x)\) is the Cholesky decomposition of \(x\). Note that \(x\) in these cases follows a Wishart or inverse Wishart distribution, respectively, and thus is a random covariance matrix. That means it must be positive definite and thus have positive diagonal elements of \(\mbox{chol}(x)\). - If
`x[1:n] ~ ddirch`

in a model, \(g(x[1]) = \mbox{logit}(x[1])\) and \(g(x[i]) = \mbox{logit}(x[i] / (1-\sum_{k=1}^{i-1}x[i]))\) for \(i \gt 1\). In this case,`x`

follows a Dirichlet distribution and thus has the simplex constraint that \(\sum_{i=1}^n x[i] = 1\). - If
`x[1:n] ~ dlkj_corr_cholesky`

, \(g(x)\) is the LKJ Cholesky transformation. This provides the LKJ prior for a covariance matrix.

Note that the scalar transformations are all monotonic increasing
functions of the original parameter. Some of these are of limited
relevance for AD because truncation of distributions using
`T()`

is not currently supported for AD.

**TBD: Better description of LKJ.**

The worked example next will illustrate use of the parameter transformation system.

`optim`

with
gradients from `nimDerivs`

.In this example, we will obtain maximum likelihood estimates of a Poisson GLM. It will be similar to the GLMM above, but without the random effects. Maximization of a likelihood function can be done with methods that don’t require gradients, but it is much faster to use methods that do use gradients.

To illustrate the parameter transformation system, we will give
`p`

a uniform prior, constrained to values between 0 and 1.
The maximum likelihood estimation below will not use the prior
probability, but nevertheless the prior will be interpreted for its
constraints on valid values in this case. This need for parameter
transformation is somewhat artifical in the interest of simplicity.

```
model_code <- nimbleCode({
# priors (will be ignored for MLE)
p ~ dunif(0, 1)
log_p <- log(p)
beta ~ dnorm(0, sd = 100)
# random effects and data
for(i in 1:50) {
# data
y[i] ~ dpois(exp(log_p + beta*X[i]))
}
})
set.seed(123)
X <- rnorm(50) # Simulate values for X
model <- nimbleModel(model_code, constants = list(X = X), calculate = FALSE,
buildDerivs = TRUE) # Create nimble model object
```

`## Defining model`

`## Building model`

`## Checking model sizes and dimensions`

```
## [Note] This model is not fully initialized. This is not an error.
## To see which variables are not initialized, use model$initializeInfo().
## For more information on model initialization, see help(modelInitialization).
```

```
model$p <- 0.5 # assign parameter values and simulate data (y)
model$beta <- 0.2
model$calculate()
```

`## [1] NA`

```
model$simulate(model$getDependencies(c('beta', 'p'), self = FALSE))
model$calculate()
```

`## [1] -53.33015`

```
model$setData('y') # Now the model has y marked as data, with values from simulation.
Cmodel <- compileNimble(model) # Make compiled version
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

Now we’ll write a nimbleFunction to provide the negative log
likelihood and its gradient using an unconstrained parameter space. The
default behavior of `optim`

below is to do a minimization
problem, so returning the negative log likelihood means that
`optim`

will find the maximum likelihood parameters.

```
logLikelihood_nf <- nimbleFunction(
setup = function(model, paramNodes) {
# Determine nodes for calculating the log likelihood for parameters given by paramNodes
calcNodes <- model$getDependencies(paramNodes, self = FALSE)
# Set up the additional arguments for nimDerivs when the function called includes model$calculate
derivsInfo <- makeDerivsInfo(model, paramNodes, calcNodes)
updateNodes <- derivsInfo$updateNodes
constantNodes <- derivsInfo$constantNodes
# Create a parameter transformation between original and unconstrained parameter spaces.
transformer <- parameterTransform(model, paramNodes)
},
methods = list(
neg_logLikelihood_p = function(p = double(1)) { # Put values in model and calcualte neg log likelihood.
values(model, paramNodes) <<- p
return(-model$calculate(calcNodes))
returnType(double())
},
neg_logLikelihood = function(ptrans = double(1)) { # Objective function for optim,
# using transformed parameter space
p <- transformer$inverseTransform(ptrans)
return(neg_logLikelihood_p(p))
returnType(double())
},
gr_neg_logLikelihood = function(ptrans = double(1)) { # Gradient of neg log likelihood
p <- transformer$inverseTransform(ptrans)
d <- derivs(neg_logLikelihood_p(p), wrt = 1:length(p), order = 1,
model = model, updateNodes = updateNodes, constantNodes = constantNodes)
return(d$jacobian[1,])
returnType(double(1))
},
transform = function(p = double(1)) { # Give easy access to transformation ...
return(transformer$transform(p))
returnType(double(1))
},
inverse = function(ptrans = double(1)) { # ... and its inverse.
return(transformer$inverseTransform(ptrans))
returnType(double(1))
}
),
buildDerivs = 'neg_logLikelihood_p'
)
```

Sometimes you might want a call to `transformer$transform`

or `transformer$inverseTransform`

to be included in a
derivative calculation. In this case, we don’t.

Next we build and compile an instance of
`logLikelihood_nf`

specialized (by the `setup`` function) to
our model and its parameters.

```
ll_glm <- logLikelihood_nf(model, c("p", "beta"))
Cll_glm <- compileNimble(ll_glm, project = model)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

Now let’s see how the parameter transformation works for this model.
For `beta`

, it already has an unconstrained range of valid
values, so it has no transformation (i.e. it has an indentity
transformation). For `p`

, it is constrained beween 0 and 1,
so it will be logit-transformed.

`Cll_glm$transform(c(0.2, 1))`

`## [1] -1.386294 1.000000`

`Cll_glm$inverse(c(-1.386294, 1)) # The inverse transformation goes back to the original parameters, within rounding.`

`## [1] 0.2000001 1.0000000`

Now we are ready to use the methods provided in `Cll_glm`

in a call to R’s `optim`

to find the MLE.

```
MLE <- optim(c(0,0), # initial values
fn = Cll_glm$neg_logLikelihood, # function to be minimized
gr = Cll_glm$gr_neg_logLikelihood, # gradient of function to be minimized
method = "BFGS") # optimization method to use
MLE$par
```

`## [1] 0.08257249 0.25577134`

`-MLE$value`

`## [1] -47.73243`

Those outputs show the MLE in the parameters `logit(p)`

,
and `beta`

, following by the maximum log likelihood
value.

Now let’s confirm that it’s working correctly by comparing to results
from `glm`

.

```
glm_fit <- glm(I(model$y) ~ X, family = poisson)
coef(glm_fit)
```

```
## (Intercept) X
## -0.6527680 0.2557751
```

`logLik(glm_fit)`

`## 'log Lik.' -47.73243 (df=2)`

We see that the coefficient for `X`

matches
`beta`

(within numerical tolerance), and the maximum log
likelihoods match. To understand the intercept, we need to be careful
about the parameter transformation used in nimble. The inverse
transformation will give `p`

, and the log of that should
match the intercept estimated by `glm.`

`Cll_glm$inverse(MLE$par)`

`## [1] 0.5206314 0.2557713`

`log(Cll_glm$inverse(MLE$par)[1]) # This should match glm's intercept`

`## [1] -0.652713`

It looks like it worked.

In this example, we will use nimble’s Laplace approximation, which uses derivatives internally, to get maximum (approximate) likelihood estimates for the GLMM model above. What Laplace approximation approximates in this context is the integral over continuous random effects needed to calculate the likelihood. Hence, it gives an approximate likelihood – often quite accurate – that can be used for maximum likelihood estimation. Note that the Laplace approximation uses second derivatives, and the gradient of the Laplace approximation uses third derivatves. These are described in detail by Skaug and Fournier (2006) and Fournier et al. (2012).

We will create the above GLMM model anew here.

```
model_code <- nimbleCode({
# priors
intercept ~ dnorm(0, sd = 100)
beta ~ dnorm(0, sd = 100)
sigma ~ dunif(0, 10)
# random effects and data
for(i in 1:10) {
# random effects
ran_eff[i] ~ dnorm(0, sd = sigma)
for(j in 1:5) {
# data
y[i,j] ~ dpois(exp(intercept + beta*X[i,j] + ran_eff[i]))
}
}
})
set.seed(123)
X <- matrix(rnorm(50), nrow = 10)
model <- nimbleModel(model_code, constants = list(X = X), calculate = FALSE,
buildDerivs = TRUE) # Here is the argument needed for AD.
```

`## Defining model`

`## Building model`

`## Checking model sizes and dimensions`

```
## [Note] This model is not fully initialized. This is not an error.
## To see which variables are not initialized, use model$initializeInfo().
## For more information on model initialization, see help(modelInitialization).
```

```
model$intercept <- 0
model$beta <- 0.2
model$sigma <- 0.5
model$calculate()
```

`## [1] NA`

```
model$simulate(model$getDependencies('ran_eff'))
model$calculate()
```

`## [1] -80.74344`

```
model$setData('y') # Now the model has y marked as data, with values from simulation.
Cmodel <- compileNimble(model)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

Next we create a Laplace approximation specialized to the parameters
of interest for this model. The `buildLaplace`

setup code
will automatically determine the random effects to be integrated over
and the associated nodes to calculate. In fact, if you omit the
parameter nodes, it will assume that all top-level nodes in the model
should be treated as parameters. If fine-grained control is needed,
these various sets of nodes can be input directly into
`buildLaplace`

.

```
glmm_laplace <- buildLaplace(model, c('intercept','beta','sigma'))
Cglmm_laplace <- compileNimble(glmm_laplace, project = model)
```

```
## Compiling
## [Note] This may take a minute.
## [Note] Use 'showCompilerOutput = TRUE' to see C++ compilation details.
```

Now we are ready to use some of the methods provided by
`buildLaplace`

. These include calculating the Laplace
approximation for some input parameter values, calculating its gradient,
and maximizing the Laplace-approximated likelihood.

`Cglmm_laplace$Laplace(c(0, 0, 1))`

`## [1] -65.57246`

`Cglmm_laplace$gr_Laplace(c(0, 0, 1))`

`## [1] -1.866842 8.001648 -4.059556`

```
MLE <- Cglmm_laplace$LaplaceMLE(c(0, 0, 1))
MLE$par
```

`## [1] -0.1492312 0.1934101 0.5703647`

`MLE$value`

`## [1] -63.44875`

The final outputs show the MLE for `intercept`

,
`beta`

, and `sigma`

, followed by the maximum
(approximate) likelihood.

`buildLaplace`

actually offers several choices in how
computations are done, differing in how they use double taping or not.
In some cases one or another choice might be more efficient. See
`help(buildLaplace)`

if you want to explore it further.

Finally, let’s confirm that it worked by comparing to results from package lme4. In this case, lme4 will be faster because it is highly specialized to GLMMs. Here our interest is in checking that nimble’s Laplace approximation worked correctly in a case where we have an established tool such as lme4.

```
library(lme4)
y <- as.numeric(model$y) # Rearrage inputs for call to glmer
X <- as.numeric(X)
group <- rep(1:10, 5)
lme4_fit <- glmer(y ~ X + (1 | group), family = poisson) # There is a warning that seems to be innocuous.
```

```
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.0178573 (tol = 0.002, component 1)
```

`summary(lme4_fit)`

```
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: y ~ X + (1 | group)
##
## AIC BIC logLik deviance df.resid
## 132.9 138.6 -63.4 126.9 47
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.02386 -0.74101 -0.03475 0.33823 2.46453
##
## Random effects:
## Groups Name Variance Std.Dev.
## group (Intercept) 0.3285 0.5731
## Number of obs: 50, groups: group, 10
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.153720 0.003913 -39.289 <2e-16 ***
## X 0.193464 0.143528 1.348 0.178
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## X -0.002
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.0178573 (tol = 0.002, component 1)
```

`logLik(lme4_fit)`

`## 'log Lik.' -63.44896 (df=3)`

The results match within numerical tolerance typical of optimization
problems. Specifically, the coefficients for `(Intercept)`

and `X`

match nimble’s `Intercept`

and
`beta`

, the random effects standard deviation for
`group`

matches nimble’s `sigma`

, and the maximum
likelihood values match.`

Bell, B. 2022. “CppAD: A
Package for Differentiation of
C++ Algorithms.” www.coin-or.org/CppAD.

Borchers, Hans W. 2022. *Pracma: Practical Numerical Math
Functions*. https://CRAN.R-project.org/package=pracma.

Fournier, David A., Hans J. Skaug, Johnoel Ancheta, James Ianelli, Arni
Magnusson, Mark N. Maunder, Anders Nielsen, and John Sibert. 2012.
“AD Model Builder: Using
Automatic Differentiation for Statistical Inference of Highly
Parameterized Complex Nonlinear Models.” *Optimization Methods
and Software* 27 (2): 233–49. https://doi.org/10.1080/10556788.2011.597854.

Gilbert, Paul, and Ravi Varadhan. 2019. *numDeriv: Accurate Numerical
Derivatives*. https://CRAN.R-project.org/package=numDeriv.

Griewank, Andreas, and Andrea Walther. 2008. *Evaluating
Derivatives: Principles and
Techniques of Algorithmic
Differentiation, Second
Edition*. Second edition. Philadelphia, PA: Society for
Industrial; Applied Mathematic.

Kristensen, Kasper, Anders Nielsen, Casper W. Berg, Hans Skaug, and
Bradley M. Bell. 2016. “TMB: Automatic
Differentiation and Laplace Approximation.”
*Journal of Statistical Software* 70 (5): 1–21. https://doi.org/10.18637/jss.v070.i05.

Neal, Radford M. 2011. “MCMC Using
Hamiltonian Dynamics.” In *Handbook
of Markov Chain Monte
Carlo*. Chapman &
Hall/CRC Handbooks of
Modern Statistical Methods.
Chapman; Hall/CRC. http://www.crcnetbase.com/doi/abs/10.1201/b10905-6.

Skaug, Hans J., and David A. Fournier. 2006. “Automatic
Approximation of the Marginal Likelihood in Non-Gaussian
Hierarchical Models.” *Computational Statistics & Data
Analysis* 51 (2): 699–709. https://doi.org/10.1016/j.csda.2006.03.005.