Chapter 10 Bayesian nonparametric models

As of version 0.6-11, NIMBLE provides initial support for Bayesian nonparametric (BNP) mixture modeling. These features are currently considered EXPERIMENTAL – please let us know (via the NIMBLE user mailing list or by emailing us) if you have any problems or have suggestions about functionality you would like NIMBLE to support or the NIMBLE interface for using BNP.

10.1 Bayesian nonparametric mixture models

NIMBLE provides support for Bayesian nonparametric (BNP) mixture modeling. The current implementation provides support for hierarchical specifications involving Dirichlet process (DP) mixtures (Ferguson 1973; Ferguson 1974; Lo 1984; Escobar 1994; Escobar and West 1995). More specifically, a DP mixture model takes the form

\[y_i \mid G \overset{iid}{\sim} \int h(y_i \mid \theta) G(d\theta),\] \[G \mid \alpha, G_0 \sim DP(\alpha, G_0),\]

where \(h(\cdot \mid \theta)\) is a suitable kernel with parameter \(\theta\), and \(\alpha\) and \(G_0\) are the concentration and baseline distribution parameters of the DP, respectively. DP mixture models can be written with different levels of hierarchy, all being equivalent to the model above.

When the random measure \(G\) is integrated out from the model, the DP mixture model can be written using latent or membership variables, \(z_i\), following a Chinese Restaurant Process (CRP) distribution (Blackwell and MacQueen 1973), discussed in Section 10.2. The model takes the form

\[y_i \mid \tilde{\boldsymbol{\theta}}, z_i \overset{ind}{\sim} h(\cdot \mid \tilde{\theta}_{z_i}),\] \[\boldsymbol{z}\mid \alpha \sim \mbox{CRP}(\alpha),\hspace{0.5cm} \tilde{\theta}_j \overset{iid}{\sim}G_0,\] where \(\mbox{CRP}(\alpha)\) denotes the CRP distribution with concentration parameter \(\alpha\).

If a stick-breaking representation (Sethuraman 1994), discussed in section 10.3, is assumed for the random measure \(G\), then the model takes the form

\[y_i \mid {\boldsymbol{\theta}}^{\star}, \boldsymbol{v} \overset{ind}{\sim} \sum_{l=1}^{\infty}\left\{ v_l\prod_{m<l}(1-v_m)\right\} h(\cdot \mid {\theta}_l^{\star}),\] \[v_l \mid \alpha \overset{iid}{\sim} Beta(1, \alpha),\hspace{0.5cm} {\theta}_l^{\star} \overset{iid}{\sim}G_0.\]

More general representations of the random measure can be specify by considering \(v_l \mid \nu_l, \alpha_l \overset{ind}{\sim} Beta(\nu_l, \alpha_l)\). Finite dimensional approximations can be obtained by truncating the infinite sum to have \(L\) components.

Different representations of DP mixtures lead to different computational algorithms. NIMBLE supports sampling algorithms based on the CRP representation, as well as on the stick-breaking representation. NIMBLE includes definitions of structures required to implement the CRP and stick-breaking distributions, and the associated MCMC algorithms.

10.2 Chinese Restaurant Process model

The CRP is a distribution over the space of partitions of positive integers and is implemented in NIMBLE as the dCRP distribution. More details for using this distribution are available using help(CRP).

The CRP can be described as a stochastic process in which customers arrive at a restaurant, potentially with an infinite number of tables. Each customer sits at an empty or occupied table according to probabilities that depend on the number of customers in the occupied tables. Thus, the CRP partitions the set of customers, through their assignment to tables in the restaurant.

10.2.1 Specification and density

NIMBLE parametrizes the dCRP distribution by a concentration parameter and a size parameter.

10.2.1.1 Specification

The dCRP distribution is specified in NIMBLE for a membership vector z as

z[1:N] ~ dCRP(conc, size)

The conc parameter is the concentration parameter of the CRP, controlling the probability of a customer sitting on a new table, i.e., creating a new cluster. The size parameter defines the size of the set of integers to be partitioned.

The conc parameter is a positive real value that can be treated as known or unknown. When a gamma prior is assumed for the conc parameter, a specialized sampler is assigned. See more on this in section 10.4.1.

The size parameter is a positive integer that has to be fixed and equal to the length of vector z. It defines the set of consecutive integers from 1 to N to be partitioned. Each element in z can be an integer from 1 to N, and repetitions are allowed.

10.2.1.2 Density

The CRP distribution partitions the set of positive integers \({1, \ldots, N}\), into \(N^{\star} \leq N\) disjoint subsets, indicating to which subset each element belongs. For instance, if \(N=6\), the set \(\{1, 2, 3, 4, 5, 6 \}\) can be partitioned into the subsets \(S_1=\{1, 2, 6\}\), \(S_2=\{4, 5\}\), and \(S_3=\{3\}\). Note that \(N^{\star} =3\), and this is one partition from out of 203 possibilities. The CRP-distributed vector \(\boldsymbol{z}\) encodes this partition and its observed values would be \((1, 1, 3, 2, 2, 1)\), for this example. In mixture modeling, this indicates that observations 1, 2, and 6 belong to cluster 1, observations 4 and 5 to cluster 2, and observation 3 to cluster 3. Note that this representation is not unique, vector \((2, 2, 1, 3, 3, 2)\) encodes the same partition.

The joint probability function of \(z=(z_1, \ldots, z_N)\), with concentration parameter \(\alpha\), is given by

\[p(\boldsymbol{z} \mid \alpha) \propto \frac{\Gamma{(\alpha)}}{\Gamma{(\alpha + n)}} \alpha^{N^{\star}(\boldsymbol{z})}\prod_{k=1}^{N^{\star}(\boldsymbol{z})}\Gamma(m_{k}(\boldsymbol{z})),\]

where \(m_k(\boldsymbol{z})\) denotes the number of elements in \(\boldsymbol{z}\) that are equal to \(k\). The full conditional distribution for \(z_i\) given \(z_{-i}\) is

\[p(z_i = m \mid z_{-i}, \alpha) = \frac{1}{n-1+\alpha} \sum_{j \neq i}1_{\{z_j\}} (m)+ \frac{\alpha}{n-1+\alpha}1_{\{z^{new}\}}(m),\]

where \(z_{-i}\) denotes vector \(\boldsymbol{z}\) after removing its \(i-\)th component, \(z^{new}\) is a value not in \(z_{-i}\), and \(1_{A}\) denotes the indicator function at set \(A\).

Note that the probability of creating a new cluster is proportional to \(\alpha\): the larger the concentration parameter, the more clusters are created.

10.2.2 Example

The following example illustrates how to use NIMBLE to perform single density estimation for real-valued data, under a BNP approach, using the dCRP distribution. (Note that the BNP approach is also often used to perform density estimation on random effects.) The model is given by

\[ y_i \mid \tilde{\boldsymbol{\theta}}, \tilde{\boldsymbol{\sigma}}^2, z_i \overset{ind}{\sim} N(\tilde{\theta}_{z_i}, \tilde{\sigma}^2_{z_i},)\hspace{0.5cm} i = 1, \ldots, N, \] \[ \boldsymbol{z} \sim \mbox{CRP}(\alpha), \hspace{0.5cm} \alpha \sim \mbox{Gamma}(1, 1), \] \[ \tilde{\theta}_j \overset{iid}{\sim} N(0, 100), \hspace{0.5cm}\tilde{\sigma}^2_j \overset{iid}{\sim}\mbox{InvGamma}(1, 1), \hspace{0.2cm} j=1, \ldots, M. \]

code <- nimbleCode({
  z[1:N] ~ dCRP(alpha, size = N)
  alpha ~ dgamma(1, 1)
  for(i in 1:M) {
    thetatilde[i] ~ dnorm(0, 100)
    s2tilde[i] ~ dinvgamma(1, 1)
  }
  for(i in 1:N) 
    y[i] ~ dnorm(thetatilde[z[i]], var = s2tilde[z[i]])  
})

set.seed(1)
constants <- list(N = 100, M = 50)
data <- list(y = c(rnorm(50, -5, sqrt(3)), rnorm(50, 5, sqrt(4))))
inits <- list(thetatilde = rnorm(constants$N, 0, 10), 
              s2tilde = rinvgamma(constants$N, 1, 1), 
              z = sample(1:10, size = constants$N, replace = TRUE),
              alpha  = 1)
Rmodel <- nimbleModel(code, constants, data, inits)

The resulting model may be fitted through MCMC sampling. NIMBLE will assign a specialized sampler to update z and alpha. See Chapter 7 for information about NIMBLE’s MCMC engine, and Section 10.4.1 for details on MCMC sampling of the CRP.

One of the advantages of BNP mixture models is that the number of clusters is treated as random. Therefore, in MCMC sampling, the number of cluster parameters varies with the iteration. Since NIMBLE does not currently allow dynamic length allocation, the number of unique cluster parameters, \(N^{\star}\), has to be fixed. One safe option is to set this number to \(N\), but this is inefficient, both in terms of computation and in terms of storage, because in practice it is often that \(N^{\star} < N\). In addition, configuring and building the MCMC can be slow (and use a lot of memory) for large \(M\). In an effort to mitigate these inefficiencies, we allow the user to set \(N^{\star} = M\), with \(M<N\), as seen in the example above. However, if this number is too small and is exceeded in any iteration a warning is issued.

10.3 Stick-breaking model

In NIMBLE, weights defined by sequentially breaking a stick, as in the stick-breaking process, are implemented as the stick_breaking link function. More details for using this function are available using help(stick_breaking).

10.3.1 Specification and function

NIMBLE parametrizes the stick_breaking function by vector of values in \((0,1)\).

10.3.1.1 Function

The weights \((w_1, \ldots, w_L)\) follow a finite stick-breaking construction if

\[ \hspace{-2cm} w_1 = v_1, \] \[ \hspace{2.2cm} w_l = v_l\prod_{m<l}(1-v_m),l=2,\ldots,L-1 \] \[ \hspace{-0.4cm} w_L = \prod_{m<L}(1-v_m).\]

for \(v_l \in [0,1], l=1,\ldots,L-1\).

10.3.1.2 Specification

The stick_breaking function is specified in NIMBLE for a vector w of probabilities as

w[1:L] <- stick_breaking(v[1:(L-1)])

The argument v is a vector of values between 0 and 1 defining the sequential breaking points of the stick after removing the previous portions already broken off. It is of length \(L-1\), implicitly assuming that its last component is equal to 1.

In order to complete the definition of the weights in the stick-breaking representation of \(G\), a prior distribution on \((0,1)\) should to be assumed for \(v_l\), \(l =1, \ldots, L-1\), for instance a beta prior.

10.3.2 Example

Here we illustrate how to use NIMBLE for the example described in section 10.2.2, but considering a stick-breaking representation for \(G\). The model is given by

\[ y_i \mid \boldsymbol{\theta}^{\star}, {\boldsymbol{\sigma}^{\star}}^2, z_i \overset{ind}{\sim} N({{\theta}^{\star}}_{z_i}, {{\sigma}^2}^{\star}_{z_i}),\hspace{0.5cm} i = 1, \ldots, N, \] \[ \boldsymbol{z} \sim Discrete(\boldsymbol{w}), \hspace{0.5cm} v_l\overset{iid}{\sim} Beta(1, \alpha), \hspace{0.2cm}l=1, \ldots, L-1, \] \[ \alpha \sim \mbox{Gamma}(1, 1),\] \[{\theta}^{\star}_l \overset{iid}{\sim} N(0, 100), \hspace{0.5cm}{{\sigma}^2}^{\star}_l \overset{iid}{\sim}\mbox{InvGamma}(1, 1), \hspace{0.2cm} l=1, \ldots, L.\]

where \(w_1=v_1\), \(w_l=v_l \prod_{m<l}(1-v_m)\), for \(l=1, \ldots, L-1\), and \(w_L=\prod_{m<L}(1-v_m).\)

code <- nimbleCode({
  for(i in 1:(L-1)){
    v[i] ~ dbeta(1, alpha)
  }
  alpha ~ dgamma(1, 1)
  w[1:L] <- stick_breaking(v[1:(L-1)])
  for(i in 1:L) {
    thetastar[i] ~ dnorm(0, 100)
    s2star[i] ~ dinvgamma(1, 1)
  }
  for(i in 1:N) {
    z[i] ~ dcat(w[1:L])
    y[i] ~ dnorm(thetastar[z[i]], var = s2star[z[i]])  
  }
})

set.seed(1)
constants <- list(N = 100, L=50)
data <- list(y = c(rnorm(50, -5, sqrt(3)), rnorm(50, 5, sqrt(4))))
inits <- list(thetastar = rnorm(constants$L, 0, 100), 
              s2star = rinvgamma(constants$L, 1, 1), 
              z = sample(1:10, size = constants$N, replace = TRUE),
              v  = rbeta(constants$L, 1, 1),
              alpha = 1)
RmodelSB <- nimbleModel(code, constants, data, inits)

The resulting model may be carried through to MCMC sampling. NIMBLE will assign a specialized sampler to update v. See Chapter 7 for information about NIMBLE’s MCMC engine, and Section 10.4.2 for details on MCMC sampling of the stick-breaking weights.

10.4 MCMC sampling of BNP models

BNP models can be specified in different, yet equivalent, manners. Examples 10.2.2 and 10.3.2 are examples of density estimation for real-valued data, and are specified through the CRP and the stick-breaking process, respectively. Different specifications lead NIMBLE to assign different sampling algorithms for the model. When the model is specified through a CRP, a collapsed sampler (R. Neal 2000) is assigned. Under this specification, the random measure \(G\) is integrated out from the model. When a stick-breaking representation is used, a blocked Gibbs sampler is assigned, see Hemant Ishwaran and James (2001) and H Ishwaran and James (2002).

10.4.1 Sampling CRP models

NIMBLE’s MCMC engine provides specialized samplers for the dCRP distribution, updating each component of the membership vector sequentially. Internally, the sampler is assigned based on inspection of the model structure, evaluating conjugacy between the mixture kernel and the baseline distribution, as follows:

  1. A conjugate sampler in the case of the baseline distribution being conjugate for the mixture kernel.
  2. A non-conjugate sampler in the case of the baseline distribution not being conjugate for the mixture kernel.

Note that both samplers are specialized versions that operate on a vector having a CRP distribution. Details of these assignments are strictly internal to the CRP samplers. Additionally, a specialized sampler is assigned to the conc hyper parameter when a gamma hyper prior is assigned, see section 6 in Escobar and West (1995) for more details. Otherwise, a random walk Metropolis-Hastings sampler is assigned.

10.4.1.1 Initial values

Valid initial values should be provided for all elements of the process specified by a CRP structure before running the MCMC. A simple and safe choice for z is to provide a sample of size N, the same as its length, of values between 1 and some reasonable number of clusters (less than or equal to the length of z), with replacement, as done in the preceding CRP example. For the concentration parameter, a safe initial value is 1.

10.4.1.2 Sampling the random measure

In BNP models, it is oftenly of interest to make inference about the unknown measure \(G\). NIMBLE provides a sampler, getSamplesDPmeasure, for this random measure when a CRP structure is involved in the model.

The argument of the getsamplesDPmeasure function is a compiled or uncompiled MCMC object. The MCMC object should monitor the membership (clustering) variable, the cluster parameters, all stochastic nodes of the cluster parameters, and the concentration parameter, if it is random. Use the monitors argument when configuring the MCMC to ensure these variables are monitored in the MCMC.

The sampler is used only after the MCMC for the model has been run; more details are available from help(getsamplesDPmeasure).

The following code exemplifies how to generate samples from \(G\) after defining the model as in Section 10.2.2.

cRmodel <- compileNimble(Rmodel)

monitors <- c('z', 'thetatilde', 's2tilde' , 'alpha')
RmodelConf <- configureMCMC(Rmodel, monitors = monitors)
RmodelMCMC <- buildMCMC(RmodelConf)

CmodelMCMC <- compileNimble(RmodelMCMC, project = Rmodel)
CmodelMCMC$run(1000)

samplesG <- getSamplesDPmeasure(CmodelMCMC)

10.4.2 Sampling stick-breaking models

NIMBLE’s MCMC engine provides specialized samplers for the beta-distributed random variables that are the arguments to the stick-breaking function, updating each component of the weight vector sequentially. The sampler is assigned based on inspection of the model structure. Specifically, the specialized sampler is assigned when the membership vector has a categorical distribution, its weights are defined by a stick-breaking function, and the vector defining the weights follows a beta distribution.

10.4.2.1 Initial values

Valid initial values should be provided for all elements of the stick-breaking function and membership variable before running the MCMC. A simple and safe choice for \(z\) is to provide a sample of size \(N\), of values between 1 and some value less than \(L\), with replacement, as done in the preceding stick-breaking example. For the stick variables, safe initial values can be simulated from a beta distribution.

References

Ferguson, T S. 1973. “A Bayesian Analysis of Some Nonparametric Problems.” Annals of Statistics 1: 209–30.

Ferguson, T S. 1974. “Prior Distribution on the Spaces of Probability Measures.” Annals of Statistics 2: 615–29.

Lo, A Y. 1984. “On a Class of Bayesian Nonparametric Estimates I: Density Estimates.” The Annals of Statistics 12: 351–57.

Escobar, M D. 1994. “Estimating normal means with a Dirichlet process prior.” Journal of the American Statistical Association 89: 268–77.

Escobar, M D, and M West. 1995. “Bayesian density estimation and inference using mixtures.” Journal of the American Statistical Association 90: 577–88.

Blackwell, D, and J MacQueen. 1973. “Ferguson Distributions via Pólya Urn Schemes.” The Annals of Statistics 1: 353–55.

Sethuraman, J. 1994. “A Constructive Definition of Dirichlet Prior.” Statistica Sinica 2: 639–50.

Neal, R. 2000. “Markov chain sampling methods for Dirichlet process mixture models.” Journal of Computational and Graphical Statistics 9: 249–65.

Ishwaran, Hemant, and Lancelot F James. 2001. “Gibbs Sampling Methods for Stick-Breaking Priors.” Journal of the American Statistical Association 96 (453). Taylor & Francis: 161–73.

Ishwaran, H, and L F James. 2002. “Approximate Dirichlet Process Computing in Finite Normal Mixtures: Smoothing and Prior Information.” Journal of Computational and Graphical Statistics 11: 508–32.