Variable selection in NIMBLE using reversible jump MCMC

Prepared by Sally Paganin.

Reversible Jump MCMC Overview

Reversible Jump MCMC (RJMCMC) is a general framework for MCMC simulation in which the dimension of the parameter space (i.e., the number of parameters) can vary between iterations of the Markov chain. It can be viewed as an extension of the Metropolis-Hastings algorithm onto more general state spaces. A common use case for RJMCMC is for variable selection in regression-style problems, where the dimension of the parameter space varies as variables are included or excluded from the regression specification.

Recently we added an RJMCMC sampler for variable selection to NIMBLE, using a univariate normal distribution as proposal distribution. There are two ways to use RJMCMC variable selection in your model. If you know the prior probability for inclusion of a variable in the model, you can use that directly in the RJMCMC without modifying your model. If you need the prior probability for inclusion in the model to be a model node itself, such as if it will have a prior and be estimated, you will need to write the model with indicator variables. In this post we will illustrate the basic usage of NIMBLE RJMCMC in both situations.

More information can be found in the NIMBLE User Manual and via help(configureRJ).

Linear regression example

In the following we consider a linear regression example in which we have 15 explanatory variables, and five of those are real effects while the others have no effect. First we simulate some data.

N <- 100
p <- 15
set.seed(1)
X <- matrix(rnorm(N*p), nrow = N, ncol = p)
true_betas <- c(c(0.1, 0.2, 0.3, 0.4, 0.5),
                rep(0, 10))

y <- rnorm(N, X%*%true_betas, sd = 1)

summary(lm(y ~ X))
## 
## Call:
## lm(formula = y ~ X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64569 -0.50496  0.01613  0.59480  1.97766 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.04190    0.11511  -0.364   0.7168    
## X1           0.06574    0.12113   0.543   0.5888    
## X2           0.21162    0.11469   1.845   0.0686 .  
## X3           0.16606    0.10823   1.534   0.1287    
## X4           0.66582    0.11376   5.853 9.06e-08 ***
## X5           0.51343    0.09351   5.490 4.18e-07 ***
## X6           0.01506    0.11399   0.132   0.8952    
## X7          -0.12203    0.10127  -1.205   0.2316    
## X8           0.18177    0.10168   1.788   0.0774 .  
## X9          -0.09645    0.10617  -0.908   0.3663    
## X10          0.15986    0.11294   1.416   0.1606    
## X11          0.03806    0.10530   0.361   0.7186    
## X12          0.05354    0.10834   0.494   0.6225    
## X13         -0.02510    0.10516  -0.239   0.8119    
## X14         -0.07184    0.12842  -0.559   0.5774    
## X15         -0.04327    0.11236  -0.385   0.7011    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.035 on 84 degrees of freedom
## Multiple R-squared:  0.5337,	Adjusted R-squared:  0.4505 
## F-statistic:  6.41 on 15 and 84 DF,  p-value: 7.726e-09

Reversible jump with indicator variables

Next we set up the model. In this case we explicitly include indicator variables that include or exclude the corresponding predictor variable. For this example we assume the indicator variables are exchangeable and we include the inclusion probability in the inference.

lmIndicatorCode <- nimbleCode({
  sigma ~ dunif(0, 20)  ## uniform prior per Gelman (2006)
  psi ~ dunif(0,1)    ## prior on inclusion probability

  for(i in 1:numVars) {
    z[i] ~ dbern(psi) ## indicator variable for each coefficient
    beta[i] ~ dnorm(0, sd = 100)
    zbeta[i] <- z[i] * beta[i]  ## indicator * beta
  }
  for(i in 1:N) {
    pred.y[i] <- inprod(X[i, 1:numVars], zbeta[1:numVars])
    y[i] ~ dnorm(pred.y[i], sd = sigma)
  }
})

## Set up the model.
lmIndicatorConstants <- list(N = 100, numVars = 15)
lmIndicatorInits <- list(sigma = 1, psi = 0.5,
                         beta = rnorm(lmIndicatorConstants$numVars),
                         z = sample(0:1, lmIndicatorConstants$numVars, 0.5))

lmIndicatorData  <- list(y = y, X = X)
lmIndicatorModel <- nimbleModel(code = lmIndicatorCode, constants = lmIndicatorConstants,
                                inits = lmIndicatorInits, data = lmIndicatorData)
## defining model...
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions...
## model building finished.

The above model code can potentially be used to set up variable selection in NIMBLE without using RJMCMC, since the indicator variables can turn the regression parameters off and on. However, in that case the MCMC sampling can be inefficient because a given regression parameter can wander widely in the parameter space when the corresponding variable is not in the model. This can make it difficult for the variable to be put back into the model, unless the prior for that variable is (perhaps artificially) made somewhat informative. Configuring RJMCMC sampling via our NIMBLE function configureRJ results in the MCMC not sampling the regression coefficients for variables for iterations where the variables are not in the model.

Configuring RJMCMC

The RJMCMC sampler can be added to the MCMC configuration by calling the function configureRJ(). In the example considered we introduced z as indicator variables associated with the regression coefficients beta. We can pass these, respectively, to configureRJ using the arguments indicatorNodes and targetNodes. The control arguments allow one to specify the mean and the scale of the normal proposal distribution used when proposing to put a coefficient back into the model.

lmIndicatorConf <- configureMCMC(lmIndicatorModel)
lmIndicatorConf$addMonitors('z')
## thin = 1: sigma, psi, beta, z
configureRJ(lmIndicatorConf,
            targetNodes = 'beta',
            indicatorNodes = 'z',
            control = list(mean = 0, scale = .2))

Checking the assigned samplers we see that the indicator variables are each assigned an RJ_indicator sampler whose targetNode is the corresponding coefficient, while the beta parameters have a RJ_toggled sampler. The latter sampler is a modified version of the original sampler to the targetNode that is invoked only when the variable is currently in the model.

## Check the assigned samplers
lmIndicatorConf$printSamplers(c("z[1]", "beta[1]"))
## [3] RJ_indicator sampler: z[1],  mean: 0,  scale: 0.2,  targetNode: beta[1]
## [4] RJ_toggled sampler: beta[1],  samplerType: conjugate_dnorm_dnorm

Build and run the RJMCMC

mcmcIndicatorRJ <- buildMCMC(lmIndicatorConf)

cIndicatorModel <- compileNimble(lmIndicatorModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.
## compilation finished.
CMCMCIndicatorRJ <- compileNimble(mcmcIndicatorRJ, project = lmIndicatorModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.
## compilation finished.
set.seed(1)
system.time(samplesIndicator <- runMCMC(CMCMCIndicatorRJ, niter = 10000, nburnin = 1000))
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
##    user  system elapsed 
##   2.749   0.004   2.753

Looking at the results

We can look at the sampled values of the indicator and corresponding coefficient for some of the variables.

par(mfrow = c(2, 2))
plot(samplesIndicator[,'beta[3]'], pch = 16, cex = 0.4, main = "beta[3] traceplot")
plot(samplesIndicator[,'z[3]'], pch = 16, cex = 0.4, main = "z[3] traceplot")
plot(samplesIndicator[,'beta[5]'], pch = 16, cex = 0.4, main = "beta[5] traceplot")
plot(samplesIndicator[,'z[5]'], pch = 16, cex = 0.4, main = "z[5] traceplot")

Individual inclusion probabilities

Now let’s look at the inference on the variable selection problem. We see that the fourth and fifth predictors are almost always included (these are the ones with the largest true coefficient values), while the others, including some variables that are truly associated with the outcome but have smaller true coefficient values, are almost never included.

par(mfrow = c(1, 1))
zCols <- grep("z\\[", colnames(samplesIndicator))
posterior_inclusion_prob <- colMeans(samplesIndicator[, zCols])
plot(1:length(true_betas), posterior_inclusion_prob,
        xlab = "beta", ylab = "inclusion probability",
        main = "Inclusion probabilities for each beta")

Reversible jump without indicator variables

If we assume that the inclusion probabilities for the coefficients are known, we can use the RJMCMC with model code written without indicator variables.

lmNoIndicatorCode <- nimbleCode({
  sigma ~ dunif(0, 20)

  for(i in 1:numVars) {
    beta[i] ~ dnorm(0, sd = 100)
  }
  for(i in 1:N) {
    pred.y[i] <- inprod(X[i, 1:numVars], beta[1:numVars])
    y[i] ~ dnorm(pred.y[i], sd = sigma)
  }
})

## Define model constants, inits and data
lmNoIndicatorConstants <- list(N = 100, numVars = 15)
lmNoIndicatorInits <- list(sigma = 1,
                           beta = rnorm(lmNoIndicatorConstants$numVars))
lmNoIndicatorData  <- list(y = y, X = X)

lmNoIndicatorModel <- nimbleModel(code = lmNoIndicatorCode,
                                  constants = lmNoIndicatorConstants,
                                  inits = lmNoIndicatorInits,
                                  data = lmNoIndicatorData)
## defining model...
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions...
## model building finished.

Configuring RJMCMC with no indicator variables

Again, the RJMCMC sampler can be added to the MCMC configuration by calling the function configureRJ() for nodes specified in targetNodes, but since there are no indicator variables we need to provide the prior inclusion probabilities. We use the priorProb argument, and we can provide either a vector of values or a common value.

lmNoIndicatorConf <- configureMCMC(lmNoIndicatorModel)
configureRJ(lmNoIndicatorConf,
            targetNodes = 'beta',
            priorProb = 0.5,
            control = list(mean = 0, scale = .2))

Since there are no indicator variables in this case, a RJ_fixed_prior sampler is assigned directly to each of coefficents along with the RJ_toggled sampler, which still uses the default sampler for the node, but only if the corresponding variable is in the model at a given iteration. In addition in this case one can set the coefficient to a value different from zero via the fixedValue argument in the control list.

## Check the assigned samplers
lmNoIndicatorConf$printSamplers(c("beta[1]"))
## [2] RJ_fixed_prior sampler: beta[1],  priorProb: 0.5,  mean: 0,  scale: 0.2,  fixedValue: 0
## [3] RJ_toggled sampler: beta[1],  samplerType: conjugate_dnorm_dnorm,  fixedValue: 0

Build and run the RJMCMC

mcmcNoIndicatorRJ <- buildMCMC(lmNoIndicatorConf)

cNoIndicatorModel <- compileNimble(lmNoIndicatorModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.
## compilation finished.
CMCMCNoIndicatorRJ <- compileNimble(mcmcNoIndicatorRJ, project = lmNoIndicatorModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compilation details.
## compilation finished.
set.seed(100)
system.time(samplesNoIndicator <- runMCMC(CMCMCNoIndicatorRJ, niter = 10000, nburnin = 1000))
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
##    user  system elapsed 
##   3.014   0.004   3.050

Looking at the results

In this case we just look at one of the model coefficients.

plot(samplesNoIndicator[,'beta[1]'], pch = 16, cex = 0.4, main = "beta[1] traceplot")

Individual inclusion proportion

We can calculate the proportion of times each coefficient is included in the model.

betaCols <- grep("beta\\[", colnames(samplesNoIndicator))
posterior_inclusion_proportions <- colMeans(apply(samplesNoIndicator[, betaCols],
                                                  2, function(x) x != 0))
posterior_inclusion_proportions
##      beta[1]      beta[2]      beta[3]      beta[4]      beta[5] 
## 0.0017777778 0.0097777778 0.0017777778 1.0000000000 0.9996666667 
##      beta[6]      beta[7]      beta[8]      beta[9]     beta[10] 
## 0.0015555556 0.0015555556 0.0031111111 0.0018888889 0.0028888889 
##     beta[11]     beta[12]     beta[13]     beta[14]     beta[15] 
## 0.0006666667 0.0015555556 0.0007777778 0.0024444444 0.0007777778

New NIMBLE code examples

We’ve posted a variety of new code examples for NIMBLE at https://r-nimble.org/examples.

These include some introductory code for getting started with NIMBLE, as well as examples of setting up specific kinds of models such as GLMMs, spatial models, item-response theory models, and a variety of models widely used in ecology. We also have examples of doing variable selection via reversible jump MCMC, using MCEM, and writing new distributions for use with NIMBLE.

And if you have examples that you think might be useful to make available to others, please get in touch with us.

 

Bayesian Nonparametric Models in NIMBLE, Part 2: Nonparametric Random Effects

Bayesian nonparametrics in NIMBLE: Nonparametric random effects

Overview

NIMBLE is a hierarchical modeling package that uses nearly the same language for model specification as the popular MCMC packages WinBUGS, OpenBUGS and JAGS, while making the modeling language extensible — you can add distributions and functions — and also allowing customization of the algorithms used to estimate the parameters of the model.

Recently, we added support for Markov chain Monte Carlo (MCMC) inference for Bayesian nonparametric (BNP) mixture models to NIMBLE. In particular, starting with version 0.6-11, NIMBLE provides functionality for fitting models involving Dirichlet process priors using either the Chinese Restaurant Process (CRP) or a truncated stick-breaking (SB) representation of the Dirichlet process prior.

We will illustrate NIMBLE’s BNP capabilities using two examples. In a previous post, we showed how to use nonparametric mixture models with different kernels for density estimation. In this post, we will take a parametric generalized linear mixed model and show how to switch to a nonparametric representation of the random effects that avoids the assumption of normally-distributed random effects.

For more detailed information on NIMBLE and Bayesian nonparametrics in NIMBLE, see the NIMBLE User Manual.

Parametric meta analysis of Avandia myocardial infarctions (MIs)

We will illustrate the use of nonparametric mixture models for modeling random effects distributions in the context of a meta-analysis of the side effects of a formerly very popular drug for diabetes called Avandia. The data we analyze played a role in raising serious questions about the safety of this drug. The question is whether Avandia use increases the risk of myocardial infarction (heart attack). There are 48 studies (the 49th study in the data file is different in some ways and excluded here), each with treatment and control arms.

dat <- read.csv('https://rawgit.com/nimble-dev/nimble-demos/master/intro_bnp/avandia.csv')
head(dat)
##   trial nAvandia avandiaMI nControl controlMI
## 1     1      357         2      176         0
## 2     2      391         2      207         1
## 3     3      774         1      185         1
## 4     4      213         0      109         1
## 5     5      232         1      116         0
## 6     6       43         0       47         1
dat <- dat[-49, ]

Model formulation

We begin with a standard generalized linear mixed model (GLMM)-based meta analysis. The vectors n and x contain the total number of patients in the control and the number of patients suffering from myocardial infarctions in the control group of each study, respectively. Similarly, the vectors m and y contain similar information for patients receiving the drug Avandia. The model takes the form

  x_{i} \mid \theta, \gamma_i \sim \mbox{Bin} \left(n_i, \frac{\exp\left\{ \gamma_i \right\}}{1 + \exp\left\{ \gamma_i \right\}} \right) , \quad\quad y_{i} \mid \theta, \gamma_i \sim \mbox{Bin} \left(m_i, \frac{\exp\left\{ \theta + \gamma_i \right\}}{1 + \exp\left\{ \theta + \gamma_i \right\}} \right)

where the random effects, \gamma_i, follow a common normal distribution, \gamma_i \sim \mbox{N}(0, \tau^2), and the \theta and \tau^2 are given reasonably non-informative priors. The parameter \theta quantifies the difference in risk between the control and treatment arms, while the \gamma_i quantify study-specific variation.

This model can be specified in NIMBLE using the following code:

x <- dat$controlMI
n <- dat$nControl
y <- dat$avandiaMI
m <- dat$nAvandia

nStudies <- nrow(dat)
data <- list(x = x, y = y)
constants = list(n = n, m = m, nStudies = nStudies)

codeParam <- nimbleCode({
    for(i in 1:nStudies) {
        y[i] ~ dbin(size = m[i], prob = q[i]) # avandia MIs
        x[i] ~ dbin(size = n[i], prob = p[i]) # control MIs
        q[i] <- expit(theta + gamma[i])       # Avandia log-odds
        p[i] <- expit(gamma[i])               # control log-odds
        gamma[i] ~ dnorm(mu, var = tau2)      # study effects
    }
    theta ~ dflat()        # effect of Avandia
    # random effects hyperparameters
    mu ~ dnorm(0, 10)
    tau2 ~ dinvgamma(2, 1)
})

Running the MCMC

Let’s run a basic MCMC.

set.seed(9)
inits = list(theta = 0, mu = 0, tau2 = 1, gamma = rnorm(nStudies))

samples <- nimbleMCMC(code = codeParam, data = data, inits = inits,
                      constants = constants, monitors = c("mu", "tau2", "theta", "gamma"),
                      thin = 10, niter = 22000, nburnin = 2000, nchains = 1, setSeed = TRUE)
## defining model...
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions...
## checking model calculations...
## model building finished.
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
## runMCMC's handling of nburnin changed in nimble version 0.6-11. Previously, nburnin samples were discarded *post-thinning*.  Now nburnin samples are discarded *pre-thinning*.  The number of samples returned will be floor((niter-nburnin)/thin).
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
par(mfrow = c(1, 4), cex = 1.1, mgp = c(1.8,.7,0))
ts.plot(samples[ , 'theta'], xlab = 'iteration', ylab = expression(theta))
hist(samples[ , 'theta'], xlab = expression(theta), main = 'effect of Avandia')

gammaCols <- grep('gamma', colnames(samples))
gammaMn <- colMeans(samples[ , gammaCols])
hist(gammaMn, xlab = 'posterior means', main = 'random effects distribution')
hist(samples[1000, gammaCols], xlab = 'single draw',
                   main = 'random effects distribution')

The results suggests there is an overall difference in risk between the control and treatment arms. But what about the normality assumption? Are our conclusions robust to that assumption? Perhaps the random effects distribution are skewed. (And recall that the estimates above of the random effects are generated under the normality assumption, which pushes the estimated effects to look more normal…)

DP-based random effects modeling for meta analysis

Model formulation

Now, we use a nonparametric distribution for the \gamma_is. More specifically, we assume that each \gamma_i is generated from a location-scale mixture of normal distributions:

  \gamma_i \mid \mu_i, \tau_i^2 \sim \mbox{N}(\mu_i, \tau_i^2), \quad\quad (\mu_i, \tau_i^2) \mid G \sim G, \quad\quad G \sim \mbox{DP}(\alpha, H),

where H is a normal-inverse-gamma distribution.

This specification induces clustering among the random effects. As in the case of density estimation problems, the DP prior allows the data to determine the number of components, from as few as one component (i.e., simplifying to the parametric model), to as many as n components, i.e., one component for each observation. This allows the distribution of the random effects to be multimodal if the data supports such behavior, greatly increasing its flexibility. This model can be specified in NIMBLE using the following code:

codeBNP <- nimbleCode({
    for(i in 1:nStudies) {
        y[i] ~ dbin(size = m[i], prob = q[i])   # avandia MIs
        x[i] ~ dbin(size = n[i], prob = p[i])   # control MIs
        q[i] <- expit(theta + gamma[i])         # Avandia log-odds
        p[i] <- expit(gamma[i])                 # control log-odds
        gamma[i] ~ dnorm(mu[i], var = tau2[i])  # random effects from mixture dist.
        mu[i] <- muTilde[xi[i]]                 # mean for random effect from cluster xi[i]
        tau2[i] <- tau2Tilde[xi[i]]             # var for random effect from cluster xi[i]
    }
    # mixture component parameters drawn from base measures
    for(i in 1:nStudies) {
        muTilde[i] ~ dnorm(mu0, var = var0)
        tau2Tilde[i] ~ dinvgamma(a0, b0)
    }
    # CRP for clustering studies to mixture components
    xi[1:nStudies] ~ dCRP(alpha, size = nStudies)
    # hyperparameters
    alpha ~ dgamma(1, 1)
    mu0 ~ dnorm(0, 10)
    var0 ~ dinvgamma(2, 1)
    a0 ~ dinvgamma(2, 1)
    b0 ~ dinvgamma(2, 1)
    theta ~ dflat()          # effect of Avandia
})

Running the MCMC

The following code compiles the model and runs a collapsed Gibbs sampler for the model

inits <- list(gamma = rnorm(nStudies), xi = sample(1:2, nStudies, replace = TRUE),
              alpha = 1, mu0 = 0, var0 = 1, a0 = 1, b0 = 1, theta = 0,
              muTilde = rnorm(nStudies), tau2Tilde = rep(1, nStudies))

samplesBNP <- nimbleMCMC(code = codeBNP, data = data, inits = inits,
               constants = constants,
               monitors = c("theta", "gamma", "alpha", "xi", "mu0", "var0", "a0", "b0"),
               thin = 10, niter = 22000, nburnin = 2000, nchains = 1, setSeed = TRUE)
## defining model...
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions...
## checking model calculations...
## model building finished.
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
## runMCMC's handling of nburnin changed in nimble version 0.6-11. Previously, nburnin samples were discarded *post-thinning*.  Now nburnin samples are discarded *pre-thinning*.  The number of samples returned will be floor((niter-nburnin)/thin).
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
gammaCols <- grep('gamma', colnames(samplesBNP))
gammaMn <- colMeans(samplesBNP[ , gammaCols])
xiCols <- grep('xi', colnames(samplesBNP))

par(mfrow = c(1,5), cex = 1.1, mgp = c(1.8,.7,0))
ts.plot(samplesBNP[ , 'theta'], xlab = 'iteration', ylab = expression(theta),
   main = expression(paste('traceplot for ', theta)))
hist(samplesBNP[ , 'theta'], xlab = expression(theta), main = 'effect of Avandia')
hist(gammaMn, xlab = 'posterior means',
              main = "random effects distrib'n")
hist(samplesBNP[1000, gammaCols], xlab = 'single draw',
                   main = "random effects distrib'n")

# How many mixture components are inferred?
xiRes <- samplesBNP[ , xiCols]
nGrps <- apply(xiRes, 1, function(x) length(unique(x)))
ts.plot(nGrps, xlab = 'iteration', ylab = 'number of components',
   main = 'number of components')

The primary inference seems robust to the original parametric assumption. This is probably driven by the fact that there is not much evidence of lack of normality in the random effects distribution (as evidenced by the fact that the posterior distribution of the number of mixture components places a large amount of probability on exactly one component).

More information and future development

Please see our User Manual for more details.

We’re in the midst of improvements to the existing BNP functionality as well as adding additional Bayesian nonparametric models, such as hierarchical Dirichlet processes and Pitman-Yor processes, so please add yourself to our announcement or user support/discussion Google groups.

Bayesian Nonparametric Models in NIMBLE, Part 1: Density Estimation


Bayesian Nonparametric Models in NIMBLE, Part 1: Density Estimation

Bayesian nonparametrics in NIMBLE: Density estimation

Overview

NIMBLE is a hierarchical modeling package that uses nearly the same language for model specification as the popular MCMC packages WinBUGS, OpenBUGS and JAGS, while making the modeling language extensible — you can add distributions and functions — and also allowing customization of the algorithms used to estimate the parameters of the model.

Recently, we added support for Markov chain Monte Carlo (MCMC) inference for Bayesian nonparametric (BNP) mixture models to NIMBLE. In particular, starting with version 0.6-11, NIMBLE provides functionality for fitting models involving Dirichlet process priors using either the Chinese Restaurant Process (CRP) or a truncated stick-breaking (SB) representation of the Dirichlet process prior.

In this post we illustrate NIMBLE’s BNP capabilities by showing how to use nonparametric mixture models with different kernels for density estimation. In a later post, we will take a parametric generalized linear mixed model and show how to switch to a nonparametric representation of the random effects that avoids the assumption of normally-distributed random effects.

For more detailed information on NIMBLE and Bayesian nonparametrics in NIMBLE, see the NIMBLE User Manual.

Basic density estimation using Dirichlet Process Mixture models

NIMBLE provides the machinery for nonparametric density estimation by means of Dirichlet process mixture (DPM) models (Ferguson, 1974; Lo, 1984; Escobar, 1994; Escobar and West, 1995). For an independent and identically distributed sample y_1, \ldots, y_n, the model takes the form

  y_i \mid \theta_i \sim p(y_i \mid \theta_i), \quad\quad \theta_i \mid G \sim G, \quad\quad G \mid  \alpha, H \sim \mbox{DP}(\alpha, H), \quad\quad i=1,\ldots, n .

The NIMBLE implementation of this model is flexible and allows for mixtures of arbitrary kernels, p(y_i \mid \theta), which can be either conjugate or non-conjugate to the (also arbitrary) base measure H. In the case of conjugate kernel / base measure pairs, NIMBLE is able to detect the presence of the conjugacy and use it to improve the performance of the sampler.

To illustrate these capabilities, we consider the estimation of the probability density function of the waiting time between eruptions of the Old Faithful volcano data set available in R.

data(faithful)

The observations y_1, \ldots, y_n correspond to the second column of the dataframe, and n = 272.

Fitting a location-scale mixture of Gaussian distributions using the CRP representation

Model specification

We first consider a location-scale Dirichlet process mixture of normal distributionss fitted to the transformed data y_i^{*} = \log (y_i):

  y^{*}_i \mid \mu_i, \sigma^2_i \sim \mbox{N}(\mu_i, \sigma^2_i), \quad (\mu_i, \sigma^2_i) \mid G \sim G, \quad G \mid \alpha, H \sim \mbox{DP}(\alpha, H), \quad i=1,\ldots, n,

where H corresponds to a normal-inverse-gamma distribution. This model can be interpreted as providing a Bayesian version of kernel density estimation for y^{*}_i using Gaussian kernels and adaptive bandwidths. On the original scale of the data, this translates into an adaptive log-Gaussian kernel density estimate.

Introducing auxiliary variables \xi_1, \ldots, \xi_n that indicate which component of the mixture generates each observation, and integrating over the random measure G, we obtain the CRP representation of the model (Blackwell and MacQueen, 1973):

  y_i^{*} \mid \{ \tilde{\mu}_k \}, \{ \tilde{\sigma}_k^{2} \} \sim \mbox{N}\left( \tilde{\mu}_{\xi_i}, \tilde{\sigma}^2_{\xi_i} \right), \quad\quad \xi \mid \alpha \sim \mbox{CRP}(\alpha), \quad\quad (\tilde{\mu}_k, \tilde{\sigma}_k^2) \mid H \sim H, \quad\quad i=1,\ldots, n ,

where

  p(\xi \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha + n)} \alpha^{K(\xi)} \prod_k  \Gamma\left(m_k(\xi)\right),

K(\xi) \le n is the number of unique values in the vector \xi, and m_k(\xi) is the number of times the k-th unique value appears in \xi. This specification makes it clear that each observation belongs to any of at most n normally distributed clusters, and that the CRP distribution corresponds to the prior distribution on the partition structure.

NIMBLE’s specification of this model is given by

code <- nimbleCode({
  for(i in 1:n) {
    y[i] ~ dnorm(mu[i], var = s2[i])
    mu[i] <- muTilde[xi[i]]
    s2[i] <- s2Tilde[xi[i]]
  }
  xi[1:n] ~ dCRP(alpha, size = n)
  for(i in 1:n) {
    muTilde[i] ~ dnorm(0, var = s2Tilde[i])
    s2Tilde[i] ~ dinvgamma(2, 1)
  }
  alpha ~ dgamma(1, 1)
})

Note that in the model code the length of the parameter vectors muTilde and s2Tilde has been set to n. We do this because the current implementation of NIMBLE requires that the length of vector of parameters be set in advance and does not allow for their number to change between iterations. Hence, if we are to ensure that the algorithm always performs as intended we need to work with the worst case scenario, i.e., the case where there are as many components as observations. While this ensures that the algorithm always works as intended, it is also somewhat inefficient, both in terms of memory requirements (when n is large a large number of unoccupied components need to be maintained) and in terms of computational burden (a large number of parameters that are not required for posterior inference need to be updated at every iteration). When we use a mixture of gamma distributions below, we will show a computational shortcut that improves the efficiency.

Note also that the value of \alpha controls the number of components we expect a priori, with larger values of \alpha corresponding to a larger number of components occupied by the data. Hence, by assigning a prior to \alpha we add flexibility to the model specification. The particular choice of a Gamma prior allows NIMBLE to use a data-augmentation scheme to efficiently sample from the corresponding full conditional distribution. Alternative prior specifications for \alpha are possible, in which case the default sampler for this parameter is an adaptive random-walk Metropolis-Hastings algorithm.

Running the MCMC algorithm

The following code sets up the data and constants, initializes the parameters, defines the model object, and builds and runs the MCMC algorithm. Because the specification is in terms of a Chinese restaurant process, the default sampler selected by NIMBLE is a collapsed Gibbs sampler (Neal, 2000).

set.seed(1)
# Model Data
lFaithful <- log(faithful$waiting)
standlFaithful <- (lFaithful - mean(lFaithful)) / sd(lFaithful)
data <- list(y = standlFaithful)
# Model Constants
consts <- list(n = length(standlFaithful))
# Parameter initialization
inits <- list(xi = sample(1:10, size=consts$n, replace=TRUE),
              muTilde = rnorm(consts$n, 0, sd = sqrt(10)),
              s2Tilde = rinvgamma(consts$n, 2, 1),
              alpha = 1)
# Model creation and compilation
rModel <- nimbleModel(code, data = data, inits = inits, constants = consts)
## defining model...
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions...
## model building finished.
cModel <- compileNimble(rModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
# MCMC configuration, creation, and compilation
conf <- configureMCMC(rModel, monitors = c("xi", "muTilde", "s2Tilde", "alpha"))
mcmc <- buildMCMC(conf)
cmcmc <- compileNimble(mcmc, project = rModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
samples <- runMCMC(cmcmc, niter = 7000, nburnin = 2000, setSeed = TRUE)
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|

We can extract the samples from the posterior distributions of the parameters and create trace plots, histograms, and any other summary of interest. For example, for the concentration parameter \alpha we have:

# Trace plot for the concentration parameter
ts.plot(samples[ , "alpha"], xlab = "iteration", ylab = expression(alpha))
# Posterior histogram
hist(samples[ , "alpha"], xlab = expression(alpha), main = "", ylab = "Frequency")
quantile(samples[ , "alpha"], c(0.5, 0.025, 0.975))
##        50%       2.5%      97.5% 
## 0.42365754 0.06021512 1.52299639

Under this model, the posterior predictive distribution for a new observation \tilde{y}, p(\tilde{y} \mid y_1, \ldots, y_n), is the optimal density estimator (under squared error loss). Samples for this estimator can be easily computed from the samples generated by our MCMC:

# posterior samples of the concentration parameter
alphaSamples <- samples[ , "alpha"]
# posterior samples of the cluster means
muTildeSamples <- samples[ , grep('muTilde', colnames(samples))]
# posterior samples of the cluster variances
s2TildeSamples <- samples[ , grep('s2Tilde', colnames(samples))]
# posterior samples of the cluster memberships
xiSamples <- samples [ , grep('xi', colnames(samples))]

standlGrid <- seq(-2.5, 2.5, len = 200) # standardized grid on log scale

densitySamplesStandl <- matrix(0, ncol = length(standlGrid), nrow = nrow(samples))
for(i in 1:nrow(samples)){
  k <- unique(xiSamples[i, ])
  kNew <- max(k) + 1
  mk <- c()
  li <- 1
  for(l in 1:length(k)) {
    mk[li] <- sum(xiSamples[i, ] == k[li])
    li <- li + 1
  }
  alpha <- alphaSamples[i]

  muK <-  muTildeSamples[i, k]
  s2K <-  s2TildeSamples[i, k]
  muKnew <-  muTildeSamples[i, kNew]
  s2Knew <-  s2TildeSamples[i, kNew]

  densitySamplesStandl[i, ] <- sapply(standlGrid,
                function(x)(sum(mk * dnorm(x, muK, sqrt(s2K))) +
                alpha * dnorm(x, muKnew, sqrt(s2Knew)) )/(alpha+consts$n))
}

hist(data$y, freq = FALSE, xlim = c(-2.5, 2.5), ylim = c(0,0.75), main = "",
     xlab = "Waiting times on standardized log scale")
## pointwise estimate of the density for standardized log grid
lines(standlGrid, apply(densitySamplesStandl, 2, mean), lwd = 2, col = 'black')
lines(standlGrid, apply(densitySamplesStandl, 2, quantile, 0.025), lty = 2, col = 'black')
lines(standlGrid, apply(densitySamplesStandl, 2, quantile, 0.975), lty = 2, col = 'black')

Recall, however, that this is the density estimate for the logarithm of the waiting time. To obtain the density on the original scale we need to apply the appropriate transformation to the kernel.

lgrid <- standlGrid*sd(lFaithful) + mean(lFaithful) # grid on log scale
densitySamplesl <- densitySamplesStandl / sd(lFaithful) # density samples for grid on log scale

hist(faithful$waiting, freq = FALSE, xlim = c(40, 100), ylim=c(0, 0.05),
     main = "", xlab = "Waiting times")
lines(exp(lgrid), apply(densitySamplesl, 2, mean)/exp(lgrid), lwd = 2, col = 'black')
lines(exp(lgrid), apply(densitySamplesl, 2, quantile, 0.025)/exp(lgrid), lty = 2,
      col = 'black')
lines(exp(lgrid), apply(densitySamplesl, 2, quantile, 0.975)/exp(lgrid), lty = 2,
      col = 'black')

In either case, there is clear evidence that the data has two components for the waiting times.

Generating samples from the mixing distribution

While samples from the posterior distribution of linear functionals of the mixing distribution G (such as the predictive distribution above) can be computed directly from the realizations of the collapsed sampler, inference for non-linear functionals of G requires that we first generate samples from the mixing distribution. In NIMBLE we can get posterior samples from the random measure G, using the getSamplesDPmeasure function. Note that, in order to get posterior samples from G, we need to monitor all the random variables involved in its computations, i.e., the membership variable, xi, the cluster parameters, muTilde and s2Tilde, and the concentration parameter, alpha.

The following code generates posterior samples from the random measure G. The cMCMC object includes the model and posterior samples from the parameters. The getSamplesDPmeasure function estimates a truncation level of G, namely truncG. The posterior samples are in a matrix with (\mbox{truncG} \cdot (p+1)) columns, where p is the dimension of the vector of parameters with distribution G (in this example p=2).

outputG <- getSamplesDPmeasure(cmcmc)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
## sampleDPmeasure: Approximating the random measure by a finite stick-breaking representation with and error smaller than 1e-10, leads to a truncation level of 33.
if(packageVersion('nimble') <= '0.6-12')
  samplesG <- outputG else samplesG <- outputG$samples

The following code computes posterior samples of P(\tilde{y} > 70) using the posterior samples from the random measure G. Note that these samples are computed based on the transformed model and a value larger than 70 corresponds to a value larger than 0.03557236 on the above defined grid.

if(packageVersion('nimble') >= '0.6.13')
     truncG <- outputG$trunc # truncation level for G

weightIndex <- grep('weight', colnames(samplesG))
muTildeIndex <- grep('muTilde', colnames(samplesG))
s2TildeIndex <- grep('s2Tilde', colnames(samplesG))

probY70 <- rep(0, nrow(samples))  # posterior samples of P(y.tilde > 70)
for(i in seq_len(nrow(samples))) {
  probY70[i] <- sum(samplesG[i, weightIndex] *
                    pnorm(0.03557236, mean = samplesG[i, muTildeIndex],
                      sd = sqrt(samplesG[i, s2TildeIndex]), lower.tail = FALSE))
}

hist(probY70,  xlab = "Probability", ylab = "P(yTilde > 70 | data)" , main = "" )

Fitting a mixture of gamma distributions using the CRP representation

NIMBLE is not restricted to using Gaussian kernels in DPM models. In the case of the Old Faithful data, an alternative to the mixture of Gaussian kernels on the logarithmic scale that we presented in the previous section is a (scale-and-shape) mixture of Gamma distributions on the original scale of the data.

Model specification

In this case, the model takes the form

  y_i \mid \{ \tilde{\beta}_k \}, \{ \tilde{\lambda}_k \} \sim \mbox{Gamma}\left( \tilde{\beta}_{\xi_i}, \tilde{\lambda}_{\xi_i} \right), \quad\quad \xi \mid \alpha \sim \mbox{CRP}(\alpha), \quad\quad (\tilde{\beta}_k, \tilde{\lambda}_k) \mid H \sim H ,

where H corresponds to the product of two independent Gamma distributions. The following code provides the NIMBLE specification for the model:

code <- nimbleCode({
  for(i in 1:n) {
    y[i] ~ dgamma(shape = beta[i], scale = lambda[i])
    beta[i] <- betaTilde[xi[i]]
    lambda[i] <- lambdaTilde[xi[i]]
  }
  xi[1:n] ~ dCRP(alpha, size = n)
  for(i in 1:50) { # only 50 cluster parameters
    betaTilde[i] ~ dgamma(shape = 71, scale = 2)
    lambdaTilde[i] ~ dgamma(shape = 2, scale = 2)
  }
  alpha ~ dgamma(1, 1)
})

Note that in this case the vectors betaTilde and lambdaTilde have length 50 \ll n = 272. This is done to reduce the computational and storage burdens associated with the sampling algorithm. You could think about this approach as truncating the process, except that it can be thought of as an *exact* truncation. Indeed, under the CRP representation, using parameter vector(s) with a length that is shorter than the number of observations in the sample will lead to a proper algorithm as long as the number of components instatiated by the sampler is strictly lower than the length of the parameter vector(s) for every iteration of the sampler.

Running the MCMC algorithm

The following code sets up the model data and constants, initializes the parameters, defines the model object, and builds and runs the MCMC algorithm for the mixture of Gamma distributions. Note that, when building the MCMC, a warning message about the number of cluster parameters is generated. This is because the lengths of betaTilde and lambdaTilde are smaller than n. Also, note that no error message is generated during execution, which indicates that the number of clusters required never exceeded the maximum of 50.

data <- list(y = faithful$waiting)
set.seed(1)
inits <- list(xi = sample(1:10, size=consts$n, replace=TRUE),
              betaTilde = rgamma(50, shape = 71, scale = 2),
              lambdaTilde = rgamma(50, shape = 2, scale = 2),
              alpha = 1)
rModel <- nimbleModel(code, data = data, inits = inits, constants = consts)
## defining model...
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions...
## model building finished.
cModel <- compileNimble(rModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
conf <- configureMCMC(rModel, monitors = c("xi", "betaTilde", "lambdaTilde", "alpha"))
mcmc <- buildMCMC(conf)
## Warning in samplerFunction(model = model, mvSaved = mvSaved, target = target, : sampler_CRP: The number of cluster parameters is less than the number of potential clusters. The MCMC is not strictly valid if ever it proposes more components than cluster parameters exist; NIMBLE will warn you if this occurs.
cmcmc <- compileNimble(mcmc, project = rModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
samples <- runMCMC(cmcmc, niter = 7000, nburnin = 2000, setSeed = TRUE)
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|

In this case we use the posterior samples of the parameters to construct a trace plot and estimate the posterior distribution of \alpha:

# Trace plot of the posterior samples of the concentration parameter
ts.plot(samples[ , 'alpha'], xlab = "iteration", ylab = expression(alpha))
# Histogram of the posterior samples for the concentration parameter 
hist(samples[ , 'alpha'], xlab = expression(alpha), ylab = "Frequency", main = "")

Generating samples from the mixing distribution

As before, we obtain samples from the posterior distribution of G using the getSamplesDPmeasure function.

outputG <- getSamplesDPmeasure(cmcmc)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
## sampleDPmeasure: Approximating the random measure by a finite stick-breaking representation with and error smaller than 1e-10, leads to a truncation level of 28.
if(packageVersion('nimble') <= '0.6-12')
  samplesG <- outputG else samplesG <- outputG$samples

We use these samples to create an estimate of the density of the data along with a pointwise 95% credible band:

if(packageVersion('nimble') >= '0.6.13')
     truncG <- outputG$trunc # truncation level for G
grid <- seq(40, 100, len = 200)

weightSamples <- samplesG[ , grep('weight', colnames(samplesG))]
betaTildeSamples <- samplesG[ , grep('betaTilde', colnames(samplesG))]
lambdaTildeSamples <- samplesG[ , grep('lambdaTilde', colnames(samplesG))]

densitySamples <- matrix(0, ncol = length(grid), nrow = nrow(samples))
for(iter in seq_len(nrow(samples))) {
  densitySamples[iter, ] <- sapply(grid, function(x)
    sum( weightSamples[iter, ] * dgamma(x, shape = betaTildeSamples[iter, ],
                scale = lambdaTildeSamples[iter, ])))
}


hist(faithful$waiting, freq = FALSE, xlim = c(40,100), ylim = c(0, .05), main = "",
   ylab = "", xlab = "Waiting times")
lines(grid, apply(densitySamples, 2, mean), lwd = 2, col = 'black')
lines(grid, apply(densitySamples, 2, quantile, 0.025), lwd = 2, lty = 2, col = 'black')
lines(grid, apply(densitySamples, 2, quantile, 0.975), lwd = 2, lty = 2, col = 'black')

Again, we see that the density of the data is bimodal, and looks very similar to the one we obtained before.

Fitting a DP mixture of Gammas using a stick-breaking representation

Model specification

An alternative representation of the Dirichlet process mixture uses the stick-breaking representation of the random distribution G (Sethuraman, 1994). NIMBLE allows us to specify an approximation that involves a truncation of the Dirichlet process to a finite number of atoms, L. The resulting model therefore reduces to a finite mixture with L components and a very particular prior on the weights of the mixture components.

Introducing auxiliary variables, z_1, \ldots, z_n, that indicate which component generated each observation, the corresponding model for the mixture of Gamma densities discussed in the previous section takes the form

  y_i \mid \{ {\beta}_k^{\star} \}, \{ {\lambda}_k^{\star} \}, z_i \sim \mbox{Gamma}\left( {\beta}_{z_i}^{\star}, {\lambda}_{z_i}^{\star} \right), \quad\quad \boldsymbol{z} \mid \boldsymbol{w} \sim \mbox{Discrete}(\boldsymbol{w}), \quad\quad ({\beta}_k^{\star}, {\lambda}_k^{\star}) \mid H \sim H ,

where H is again the product of two independent Gamma distributions,

  w_1=v_1, \quad\quad w_l=v_l\prod_{m=1}^{l-1}(1-v_m), \quad l=2, \ldots, L-1,\quad\quad w_L=\prod_{m=1}^{L-1}(1-v_m)

with v_l \mid \alpha\sim \mbox{Beta}(1, \alpha), l=1, \ldots, L-1. The following code provides the NIMBLE specification for the model:

code <- nimbleCode(
  {
    for(i in 1:n) {
      y[i] ~ dgamma(shape = beta[i], scale = lambda[i])
      beta[i] <- betaStar[z[i]]
      lambda[i] <- lambdaStar[z[i]]
      z[i] ~ dcat(w[1:Trunc])
    }
    for(i in 1:(Trunc-1)) { # stick-breaking variables
      v[i] ~ dbeta(1, alpha)
    }
    w[1:Trunc] <- stick_breaking(v[1:(Trunc-1)]) # stick-breaking weights
    for(i in 1:Trunc) {
      betaStar[i] ~ dgamma(shape = 71, scale = 2)
      lambdaStar[i] ~ dgamma(shape = 2, scale = 2)
    }
    alpha ~ dgamma(1, 1)
  }
)

Note that the truncation level L of G has been set to a value Trunc, which is to be defined in the constants argument of the nimbleModel function.

Running the MCMC algorithm

The following code sets up the model data and constants, initializes the parameters, defines the model object, and builds and runs the MCMC algorithm for the mixture of Gamma distributions. When a stick-breaking representation is used, a blocked Gibbs sampler is assigned (Ishwaran, 2001; Ishwaran and James, 2002).

data <- list(y = faithful$waiting)
set.seed(1)
consts <- list(n = length(faithful$waiting), Trunc = 50)
inits <- list(betaStar = rgamma(consts$Trunc, shape = 71, scale = 2),
              lambdaStar = rgamma(consts$Trunc, shape = 2, scale = 2),
              v = rbeta(consts$Trunc-1, 1, 1),
              z = sample(1:10, size = consts$n, replace = TRUE),
              alpha = 1)

rModel <- nimbleModel(code, data = data, inits = inits, constants = consts)
## defining model...
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply reflect missing values in model variables) ... 
## checking model sizes and dimensions...
## model building finished.
cModel <- compileNimble(rModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
conf <- configureMCMC(rModel, monitors = c("w", "betaStar", "lambdaStar", 'z', 'alpha'))
mcmc <- buildMCMC(conf)
cmcmc <- compileNimble(mcmc, project = rModel)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
samples <- runMCMC(cmcmc, niter = 24000, nburnin = 4000, setSeed = TRUE)
## running chain 1...
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|

Using the stick-breaking approximation automatically provides an approximation, G_L, of the random distribution G. The following code computes posterior samples of G_L using posterior samples from the samples object, and from them, a density estimate for the data.

betaStarSamples <- samples[ , grep('betaStar', colnames(samples))]
lambdaStarSamples <- samples[ , grep('lambdaStar', colnames(samples))]
weightSamples <- samples[ , grep('w', colnames(samples))]

grid <- seq(40, 100, len = 200)

densitySamples <- matrix(0, ncol = length(grid), nrow = nrow(samples))
for(i in 1:nrow(samples)) {
  densitySamples[i, ] <- sapply(grid, function(x)
    sum(weightSamples[i, ] * dgamma(x, shape = betaStarSamples[i, ],
                                    scale = lambdaStarSamples[i, ])))
}

hist(faithful$waiting, freq = FALSE,  xlab = "Waiting times", ylim=c(0,0.05),
     main = '')
lines(grid, apply(densitySamples, 2, mean), lwd = 2, col = 'black')
lines(grid, apply(densitySamples, 2, quantile, 0.025), lwd = 2, lty = 2, col = 'black')
lines(grid, apply(densitySamples, 2, quantile, 0.975), lwd = 2, lty = 2, col = 'black')

As expected, this estimate looks identical to the one we obtained through the CRP representation of the process.

More information and future development

Please see our User Manual for more details.

We’re in the midst of improvements to the existing BNP functionality as well as adding additional Bayesian nonparametric models, such as hierarchical Dirichlet processes and Pitman-Yor processes, so please add yourself to our announcement or user support/discussion Google groups.

References

Blackwell, D. and MacQueen, J. 1973. Ferguson distributions via Polya urn schemes. The Annals of Statistics 1:353-355.

Ferguson, T.S. 1974. Prior distribution on the spaces of probability measures. Annals of Statistics 2:615-629.

Lo, A.Y. 1984. On a class of Bayesian nonparametric estimates I: Density estimates. The Annals of Statistics 12:351-357.

Escobar, M.D. 1994. Estimating normal means with a Dirichlet process prior. Journal of the American Statistical Association 89:268-277.

Escobar, M.D. and West, M. 1995. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90:577-588.

Ishwaran, H. and James, L.F. 2001. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96: 161-173.

Ishwaran, H. and James, L.F. 2002. Approximate Dirichlet process computing in finite normal mixtures: smoothing and prior information. Journal of Computational and Graphical Statistics 11:508-532.

Neal, R. 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9:249-265.

Sethuraman, J. 1994. A constructive definition of Dirichlet prior. Statistica Sinica 2: 639-650.


Writing reversible jump MCMC in NIMBLE


Writing reversible jump MCMC samplers in NIMBLE

Writing reversible jump MCMC samplers in NIMBLE

Introduction

Reversible jump Markov chain Monte Carlo (RJMCMC) is a powerful method for drawing posterior samples over multiple models by jumping between models as part of the sampling. For a simple example that I’ll use below, think about a regression model where we don’t know which explanatory variables to include, so we want to do variable selection. There may be a huge number of possible combinations of variables, so it would be nice to explore the combinations as part of one MCMC run rather than running many different MCMCs on some chosen combinations of variables. To do it in one MCMC, one sets up a model that includes all possible variables and coefficients. Then “removing” a variable from the model is equivalent to setting its coefficient to zero, and “adding” it back into the model requires a valid move to a non-zero coefficient. Reversible jump MCMC methods provide a way to do that.

Reversible jump is different enough from other MCMC situations that packages like WinBUGS, OpenBUGS, JAGS, and Stan don’t do it. An alternative way to set up the problem, which does not involve the technicality of changing model dimension, is to use indicator variables. An indicator variable is either zero or one and is multiplied by another parameter. Thus when the indicator is 0, the parameter that is multipled by 0 is effectively removed from the model. Darren Wilkinson has a nice old blog post on using indicator variables for Bayesian variable selection in BUGS code. The problem with using indicator variables is that they can create a lot of extra MCMC work and the samplers operating on them may not be well designed for their situation.

NIMBLE lets one program model-generic algorithms to use with models written in the BUGS language. The MCMC system works by first making a configuration in R, which can be modified by a user or a program, and then building and compiling the MCMC. The nimbleFunction programming system makes it easy to write new kinds of samplers.

The aim of this blog post is to illustrate how one can write reversible jump MCMC in NIMBLE. A variant of this may be incorporated into a later version of NIMBLE.

Example model

For illustration, I’ll use an extremely simple model: linear regression with two candidate explanatory variables. I’ll assume the first, x1, should definitely be included. But the analyst is not sure about the second, x2, and wants to use reversible jump to include it or exclude it from the model. I won’t deal with the issue of choosing the prior probability that it should be in the model. Instead I’ll just pick a simple choice and stay focused on the reversible jump aspect of the example. The methods below could be applied en masse to large models.

Here I’ll simulate data to use:

N <- 20
x1 <- runif(N, -1, 1)
x2 <- runif(N, -1, 1)
Y <- rnorm(N, 1.5 + 0.5 * x1, sd = 1)

I’ll take two approaches to implementing RJ sampling. In the first, I’ll use a traditional indicator variable and write the RJMCMC sampler to use it. In the second, I’ll write the RJMCMC sampler to incorporate the prior probability of inclusion for the coefficient it is sampling, so the indicator variable won’t be needed in the model.

First we’ll need nimble:

library(nimble)

RJMCMC implementation 1, with indicator variable included

Here is BUGS code for the first method, with an indicator variable written into the model, and the creation of a NIMBLE model object from it. Note that although RJMCMC technically jumps between models of different dimensions, we still start by creating the largest model so that changes of dimension can occur by setting some parameters to zero (or, in the second method, possibly another fixed value).

simpleCode1 <- nimbleCode({
    beta0 ~ dnorm(0, sd = 100)
    beta1 ~ dnorm(0, sd = 100)
    beta2 ~ dnorm(0, sd = 100)
    sigma ~ dunif(0, 100)
    z2 ~ dbern(0.8)  ## indicator variable for including beta2
    beta2z2 <- beta2 * z2
    for(i in 1:N) {
        Ypred[i] <- beta0 + beta1 * x1[i] + beta2z2 * x2[i]
        Y[i] ~ dnorm(Ypred[i], sd = sigma)
    }
})

simpleModel1 <- nimbleModel(simpleCode1,
                            data = list(Y = Y, x1 = x1, x2 = x2),
                            constants = list(N = N),
                            inits = list(beta0 = 0, beta1 = 0, beta2 = 0, sigma = sd(Y), z2 = 1))

Now here are two custom samplers. The first one will sample beta2 only if the indicator variable z2 is 1 (meaning that beta2 is included in the model). It does this by containing a regular random walk sampler but only calling it when the indicator is 1 (we could perhaps set it up to contain any sampler to be used when z2 is 1, but for now it’s a random walk sampler). The second sampler makes reversible jump proposals to move beta2 in and out of the model. When it is out of the model, both beta2 and z2 are set to zero. Since beta2 will be zero every time z2 is zero, we don’t really need beta2z2, but it ensures correct behavior in other cases, like if someone runs default samplers on the model and expects the indicator variable to do its job correctly. For use in reversible jump, z2’s role is really to trigger the prior probability (set to 0.8 in this example) of being in the model.

Don’t worry about the warning message emitted by NIMBLE. They are there because when a nimbleFunction is defined it tries to make sure the user knows anything else that needs to be defined.

RW_sampler_nonzero_indicator <- nimbleFunction(
    contains = sampler_BASE,
    setup = function(model, mvSaved, target, control) {
        regular_RW_sampler <- sampler_RW(model, mvSaved, target = target, control = control$RWcontrol)
        indicatorNode <- control$indicator
    },
    run = function() {
        if(model[[indicatorNode]] == 1) regular_RW_sampler$run()
    },
    methods = list(
        reset = function() {regular_RW_sampler$reset()}
    ))
## Warning in nf_checkDSLcode(code): For this nimbleFunction to compile, these
## functions must be defined as nimbleFunctions or nimbleFunction methods:
## reset.
RJindicatorSampler <- nimbleFunction(
    contains = sampler_BASE,
    setup = function( model, mvSaved, target, control ) {
        ## target should be the name of the indicator node, 'z2' above
        ## control should have an element called coef for the name of the corresponding coefficient, 'beta2' above.  
        coefNode <- control$coef
        scale <- control$scale
        calcNodes <- model$getDependencies(c(coefNode, target))
    },
    run = function( ) { ## The reversible-jump updates happen here.
        currentIndicator <- model[[target]]
        currentLogProb <- model$getLogProb(calcNodes)
        if(currentIndicator == 1) {
            ## propose removing it
            currentCoef <- model[[coefNode]]
            logProbReverseProposal <- dnorm(0, currentCoef, sd = scale, log = TRUE)
            model[[target]] <<- 0
            model[[coefNode]] <<- 0
            proposalLogProb <- model$calculate(calcNodes)
            log_accept_prob <- proposalLogProb - currentLogProb + logProbReverseProposal
        } else {
            ## propose adding it
            proposalCoef <- rnorm(1, 0, sd = scale)
            model[[target]] <<- 1
            model[[coefNode]] <<- proposalCoef
            logProbForwardProposal <- dnorm(0, proposalCoef, sd = scale, log = TRUE)
            proposalLogProb <- model$calculate(calcNodes)
            log_accept_prob <- proposalLogProb - currentLogProb - logProbForwardProposal
        }
        accept <- decide(log_accept_prob)
        if(accept) {
            copy(from = model, to = mvSaved, row = 1, nodes = calcNodes, logProb = TRUE)
        } else {
            copy(from = mvSaved, to = model, row = 1, nodes = calcNodes, logProb = TRUE)
        }
    },
    methods = list(reset = function() {
    })
    )

Now we’ll set up and run the samplers:

mcmcConf1 <- configureMCMC(simpleModel1)
mcmcConf1$removeSamplers('z2')
mcmcConf1$addSampler(target = 'z2',
                     type = RJindicatorSampler,
                     control = list(scale = 1, coef = 'beta2'))
mcmcConf1$removeSamplers('beta2')
mcmcConf1$addSampler(target = 'beta2',
                     type = 'RW_sampler_nonzero_indicator',
                     control = list(indicator = 'z2',
                                    RWcontrol = list(adaptive = TRUE,
                                                   adaptInterval = 100,
                                                   scale = 1,
                                                   log = FALSE,
                                                   reflective = FALSE)))

mcmc1 <- buildMCMC(mcmcConf1)
compiled1 <- compileNimble(simpleModel1, mcmc1)
compiled1$mcmc1$run(10000)
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
## NULL
samples1 <- as.matrix(compiled1$mcmc1$mvSamples)

Here is a trace plot of the beta2 (slope) samples. The thick line at zero corresponds to having beta2 removed from the model.

plot(samples1[,'beta2'])

And here is a trace plot of the z2 (indicator variable) samples.

plot(samples1[,'z2'])

The chains look reasonable.

As a quick check of reasonableness, let’s compare the beta2 samples to what we’d get if it was always included in the model. I’ll do that by setting up default samplers and then removing the sampler for z2 (and z2 should be 1).

mcmcConf1b <- configureMCMC(simpleModel1)
mcmcConf1b$removeSamplers('z2')
mcmc1b <- buildMCMC(mcmcConf1b)
compiled1b <- compileNimble(simpleModel1, mcmc1b)
compiled1b$mcmc1b$run(10000)
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
## NULL
samples1b <- as.matrix(compiled1b$mcmc1b$mvSamples)
plot(samples1b[,'beta2'])

qqplot(samples1[ samples1[,'z2'] == 1, 'beta2'], samples1b[,'beta2'])
abline(0,1)

That looks correct, in the sense that the distribution of beta2 given that it’s in the model (using reversible jump) should match the distribution of beta2 when it is
always in the model.

RJ implementation 2, without indicator variables

Now I’ll set up the second version of the model and samplers. I won’t include the indicator variable in the model but will instead include the prior probability for inclusion in the sampler. One added bit of generality is that being “out of the model” will be defined as taking some fixedValue, to be provided, which will typically but not necessarily be zero. These functions are very similar to the ones above.

Here is the code to define and build a model without the indicator variable:

simpleCode2 <- nimbleCode({
    beta0 ~ dnorm(0, sd = 100)
    beta1 ~ dnorm(0, sd = 100)
    beta2 ~ dnorm(0, sd = 100)
    sigma ~ dunif(0, 100)
    for(i in 1:N) {
        Ypred[i] <- beta0 + beta1 * x1[i] + beta2 * x2[i]
        Y[i] ~ dnorm(Ypred[i], sd = sigma)
    }
})

simpleModel2 <- nimbleModel(simpleCode2,
                            data = list(Y = Y, x1 = x1, x2 = x2),
                            constants = list(N = N),
                            inits = list(beta0 = 0, beta1 = 0, beta2 = 0, sigma = sd(Y)))

And here are the samplers (again, ignore the warning):

RW_sampler_nonzero <- nimbleFunction(
    ## "nonzero" is a misnomer because it can check whether it sits at any fixedValue, not just 0
    contains = sampler_BASE,
    setup = function(model, mvSaved, target, control) {
        regular_RW_sampler <- sampler_RW(model, mvSaved, target = target, control = control$RWcontrol)
        fixedValue <- control$fixedValue
    },
    run = function() { ## Now there is no indicator variable, so check if the target node is exactly
                       ## equal to the fixedValue representing "not in the model".
        if(model[[target]] != fixedValue) regular_RW_sampler$run()
    },
    methods = list(
        reset = function() {regular_RW_sampler$reset()}
    ))
## Warning in nf_checkDSLcode(code): For this nimbleFunction to compile, these
## functions must be defined as nimbleFunctions or nimbleFunction methods:
## reset.
RJsampler <- nimbleFunction(
    contains = sampler_BASE,
    setup = function( model, mvSaved, target, control ) {
        ## target should be a coefficient to be set to a fixed value (usually zero) or not
        ## control should have an element called fixedValue (usually 0),
        ##    a scale for jumps to and from the fixedValue,
        ##    and a prior prob of taking its fixedValue
        fixedValue <- control$fixedValue
        scale <- control$scale
        ## The control list contains the prior probability of inclusion, and we can pre-calculate
        ## this log ratio because it's what we'll need later.
        logRatioProbFixedOverProbNotFixed <- log(control$prior) - log(1-control$prior)
        calcNodes <- model$getDependencies(target)
    },
    run = function( ) { ## The reversible-jump moves happen here
        currentValue <- model[[target]]
        currentLogProb <- model$getLogProb(calcNodes)
        if(currentValue != fixedValue) { ## There is no indicator variable, so check if current value matches fixedValue
            ## propose removing it (setting it to fixedValue)
            logProbReverseProposal <- dnorm(fixedValue, currentValue, sd = scale, log = TRUE)
            model[[target]] <<- fixedValue
            proposalLogProb <- model$calculate(calcNodes)
            log_accept_prob <- proposalLogProb - currentLogProb - logRatioProbFixedOverProbNotFixed + logProbReverseProposal
        } else {
            ## propose adding it
            proposalValue <- rnorm(1, fixedValue, sd = scale)
            model[[target]] <<- proposalValue
            logProbForwardProposal <- dnorm(fixedValue, proposalValue, sd = scale, log = TRUE)
            proposalLogProb <- model$calculate(calcNodes)
            log_accept_prob <- proposalLogProb - currentLogProb + logRatioProbFixedOverProbNotFixed - logProbForwardProposal
        }
        accept <- decide(log_accept_prob)
        if(accept) {
            copy(from = model, to = mvSaved, row = 1, nodes = calcNodes, logProb = TRUE)
        } else {
            copy(from = mvSaved, to = model, row = 1, nodes = calcNodes, logProb = TRUE)
        }
    },
    methods = list(reset = function() {
    })
    )

Now let’s set up and use the samplers

mcmcConf2 <- configureMCMC(simpleModel2)
mcmcConf2$removeSamplers('beta2')

mcmcConf2$addSampler(target = 'beta2',
                     type = 'RJsampler',
                     control = list(fixedValue = 0, prior = 0.8, scale = 1))
mcmcConf2$addSampler(target = 'beta2',
                     type = 'RW_sampler_nonzero',
                     control = list(fixedValue = 0,
                                     RWcontrol = list(adaptive = TRUE,
                                     adaptInterval = 100,
                                     scale = 1,
                                     log = FALSE,
                                     reflective = FALSE)))

mcmc2 <- buildMCMC(mcmcConf2)
compiled2 <- compileNimble(simpleModel2, mcmc2)

compiled2$mcmc2$run(10000)
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
## NULL
samples2 <- as.matrix(compiled2$mcmc2$mvSamples)

And again let’s look at the samples. As above, the horizontal line at 0 represents having beta2 removed from the model.

plot(samples2[,'beta2'])

Now let’s compare those results to results from the first method, above. They should match.

mean(samples1[,'beta2']==0)
## [1] 0.12
mean(samples2[,'beta2']==0)
## [1] 0.1173
qqplot(samples1[ samples1[,'beta2'] != 0,'beta2'], samples2[samples2[,'beta2'] != 0,'beta2'])
abline(0,1)

They match well.

How to apply this for larger models.

The samplers above could be assigned to arbitrary nodes in a model. The only additional code would arise from adding more samplers to an MCMC configuration. It would also be possible to refine the reversible-jump step to adapt the scale of its jumps in order to achieve better mixing. For example, one could try this method by Ehlers and Brooks. We’re interested in hearing from you if you plan to try using RJMCMC on your own models.


Building Particle Filters and Particle MCMC in NIMBLE



An Example of Using <code>nimble</code>‘s Particle Filtering Algorithms

This example shows how to construct and conduct inference on a state space model using particle filtering algorithms. nimble currently has versions of the bootstrap filter, the auxiliary particle filter, the ensemble Kalman filter, and the Liu and West filter implemented. Additionally, particle MCMC samplers are available and can be specified for both univariate and multivariate parameters.

Model Creation

Assume x_{i} is the latent state and y_{i} is the observation at time i for i=1,\ldots,t. We define our state space model as

x_{i} \sim N(a \cdot x_{i-1} + b, \sigma_{PN})
 y_{i} \sim t_{5}(x_{i}, \sigma_{OE})

with initial states

  x_{1} \sim N(\frac{b}{1-a}, \frac{\sigma_{PN}}{\sqrt{1-a^2}})
 y_{1} \sim t_{5}(x_{1}, \sigma_{OE})

and prior distributions

 a \sim Unif(-0.999, 0.999)
 b \sim N(0, 1000)

where N(\mu, \sigma) denotes a normal distribution with mean \mu and standard deviation \sigma, and t_{\nu}(\mu, \sigma) is a shifted, scaled t-distribution with center parameter \mu, scale parameter \sigma, and \nu degrees of freedom.

We specify and build our state space model below, using t=10 time points:

## load the nimble library and set seed
library('nimble')
set.seed(1)

## define the model
stateSpaceCode <- nimbleCode({
    a ~ dunif(-0.9999, 0.9999)
    b ~ dnorm(0, sd = 1000)
    sigPN ~ dunif(1e-04, 1)
    sigOE ~ dunif(1e-04, 1)
    x[1] ~ dnorm(b/(1 - a), sd = sigPN/sqrt((1-a*a)))
    y[1] ~ dt(mu = x[1], sigma = sigOE, df = 5)
    for (i in 2:t) {
        x[i] ~ dnorm(a * x[i - 1] + b, sd = sigPN)
        y[i] ~ dt(mu = x[i], sigma = sigOE, df = 5)
    }
})

## define data, constants, and initial values  
data <- list(
    y = c(0.213, 1.025, 0.314, 0.521, 0.895, 1.74, 0.078, 0.474, 0.656, 0.802)
)
constants <- list(
    t = 10
)
inits <- list(
    a = 0,
    b = .5,
    sigPN = .1,
    sigOE = .05
)

## build the model
stateSpaceModel <- nimbleModel(stateSpaceCode,
                              data = data,
                              constants = constants,
                              inits = inits,
                              check = FALSE)
## defining model...
## building model...
## setting data and initial values...
## running calculate on model (any error reports that follow may simply
## reflect missing values in model variables) ...
## 
## checking model sizes and dimensions...
## note that missing values (NAs) or non-finite values were found in model
## variables: x, lifted_a_times_x_oBi_minus_1_cB_plus_b. This is not an error,
## but some or all variables may need to be initialized for certain algorithms
## to operate properly.
## 
## model building finished.

Construct and run a bootstrap filter

We next construct a bootstrap filter to conduct inference on the latent states of our state space model. Note that the bootstrap filter, along with the auxiliary particle filter and the ensemble Kalman filter, treat the top-level parameters a, b, sigPN, and sigOE as fixed. Therefore, the bootstrap filter below will proceed as though a = 0, b = .5, sigPN = .1, and sigOE = .05, which are the initial values that were assigned to the top-level parameters.

The bootstrap filter takes as arguments the name of the model and the name of the latent state variable within the model. The filter can also take a control list that can be used to fine-tune the algorithm’s configuration.

## build bootstrap filter and compile model and filter
bootstrapFilter <- buildBootstrapFilter(stateSpaceModel, nodes = 'x')
compiledList <- compileNimble(stateSpaceModel, bootstrapFilter)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
## run compiled filter with 10,000 particles.  
## note that the bootstrap filter returns an estimate of the log-likelihood of the model.
compiledList$bootstrapFilter$run(10000)
## [1] -28.13009

Particle filtering algorithms in nimble store weighted samples of the filtering distribution of the latent states in the mvSamples modelValues object. Equally weighted samples are stored in the mvEWSamples object. By default, nimble only stores samples from the final time point.

## extract equally weighted posterior samples of x[10]  and create a histogram
posteriorSamples <- as.matrix(compiledList$bootstrapFilter$mvEWSamples)
hist(posteriorSamples)
plot of chunk plot-bootstrap

The auxiliary particle filter and ensemble Kalman filter can be constructed and run in the same manner as the bootstrap filter.

Conduct inference on top-level parameters using particle MCMC

Particle MCMC can be used to conduct inference on the posterior distribution of both the latent states and any top-level parameters of interest in a state space model. The particle marginal Metropolis-Hastings sampler can be specified to jointly sample the a, b, sigPN, and sigOE top level parameters within nimble‘s MCMC framework as follows:

## create MCMC specification for the state space model
stateSpaceMCMCconf <- configureMCMC(stateSpaceModel, nodes = NULL)

## add a block pMCMC sampler for a, b, sigPN, and sigOE 
stateSpaceMCMCconf$addSampler(target = c('a', 'b', 'sigPN', 'sigOE'),
                              type = 'RW_PF_block', control = list(latents = 'x'))

## build and compile pMCMC sampler
stateSpaceMCMC <- buildMCMC(stateSpaceMCMCconf)
compiledList <- compileNimble(stateSpaceModel, stateSpaceMCMC, resetFunctions = TRUE)
## compiling... this may take a minute. Use 'showCompilerOutput = TRUE' to see C++ compiler details.
## compilation finished.
## run compiled sampler for 5000 iterations
compiledList$stateSpaceMCMC$run(5000)
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
## NULL
## create trace plots for each parameter
library('coda')
par(mfrow = c(2,2))
posteriorSamps <- as.mcmc(as.matrix(compiledList$stateSpaceMCMC$mvSamples))
traceplot(posteriorSamps[,'a'], ylab = 'a')
traceplot(posteriorSamps[,'b'], ylab = 'b')
traceplot(posteriorSamps[,'sigPN'], ylab = 'sigPN')
traceplot(posteriorSamps[,'sigOE'], ylab = 'sigOE')
plot of chunk run-PMCMC

The above RW_PF_block sampler uses a multivariate normal proposal distribution to sample vectors of top-level parameters. To sample a scalar top-level parameter, use the RW_PF sampler instead.