Be careful with Multinomials in collapsed Gibbs sampling!

This post relates to “Gibbs sampling for the uninitiated” (Resnik and Hardisty, 2010), a very helpful material for those who just began to learn about Bayesian inference (like me). The model demonstrated in this paper is Naive Bayes (NB).

Another useful resource is “Parameter estimation for text analysis” (Heinrich 2008), which will walk you step by step through the derivation of a collapsed Gibbs sampler for latent Dirichlet allocation (LDA).

Please skim through the models in the two papers so that you will understand I will be talking about next.

This post is written because I’m a little confused by section 2.6 of the first paper. To summarize, the section says that we shouldn’t integrate out [latex] \theta [/latex] in the NB model because doing so would introduce an extra set of Gamma functions that will blow up the complexity of the sampler. At the time I was reading that, I thought: hmm, this NB is almost identical to LDA, so why are we still be able to integrate out [latex] \phi [/latex] of LDA (which is equivalent to [latex] \theta [/latex] of NB)?

I think the point the authors are trying to make is that representing documents by word count vectors instead of word-type vectors makes significant difference.

Why? Because generating a document under word count representation is drawing from a multinomial distribution with MULTIPLE trials. All words are drawn at the same time. On the other hand, under word type representation, we draw a word type [latex] w_i [/latex] for each position i in the document from a multinomial distribution with ONE trial. The word types are drawn independently of one another.

Mathematically, let [latex] \mathbf{w} [/latex] be the word type vector of a document, [latex] \mathbf{c} [/latex] be the word count vector, and [latex] \theta [/latex] is the parameter of the multinomial distribution.

If we draw the document from a multinomial with multiple trials, by definition, the probability of a document will look like this:

[latex] P(\mathbf{c} | \theta) = \frac{\Gamma(\sum_i c_i + 1) }{ \prod_i \Gamma(c_i + 1)} \prod_i \theta_i^{c_i} [/latex]

Hence, we will have to carry an extra set of Gamma functions everywhere. I suspect that this is the “extra set of Gamma functions” mentioned in section 2.6.

On the other hand, if we go through each position and draw a word type, we will get something very similar, but without the Gamma functions:

[latex] P(\mathbf{w} | \theta) = \prod_j \theta_{w_j} = \prod_i \theta_i^{c_i} [/latex]

But something confuses me is before equation (15) of the first paper, the authors say that “we pick a word [latex] w_i [/latex] independently by sampling randomly according to a probability distribution over words”. If that is also true, shouldn’t we be fine with integrating out [latex] \theta [/latex]? Where is the “extra set of Gamma function that won’t cancel out nicely” from?