*This research is a joint effort between Professor Brendan O’Connor and me.*

Natural language processing (NLP) models are impressive, but they are still far from perfect, so every user of NLP software takes on some risk. Fortunately, probabilistic models are aware of those risks: they define posterior distributions over the output space and therefore know how uncertain each candidate prediction is. By reporting those uncertainties to users, probabilistic models warn them about the risks they are taking. Neglecting posterior distributions and paying attention only to single-best predictions is problematic. Doing so gives the false impression that the models are perfectly confident about their predictions: the probability of the most likely candidate output is rounded up to 1 while the probabilities of other, less likely candidate outputs are rounded down to 0. For instance, in a binary prediction problem, a 40%-uncertain prediction and a 0%-uncertain prediction would both be treated as positive predictions, indistinguishable from each other, even though the former clearly carries more risk than the latter.

In many applications, posterior probabilities are even explicitly required as important elements of the solutions. Some examples are:

**Mitigating cascaded errors**: cascaded errors in NLP pipelines arise when mistakes from upstream models propagate and degrade the quality of downstream models. A popular solution is to model the pipeline as a Bayesian network: $P(y_1, \cdots, y_n \mid x) = p(y_1 \mid x) \prod_{i = 2}^{n} p(y_i \mid y_{i - 1})$. In practice, this amounts to passing multiple samples drawn from the models' posterior distributions along the pipeline instead of passing only single-best predictions.

**Exploratory data analysis**: in many computational social science studies, we may want to calculate a summary statistic from predictions, such as the marginal distribution $P(y)$ of a predicted variable $y$. Using single-best predictions results in biased estimates. A better option is to apply: $P(y) = \mathbb{E}_{x}[P(y \mid x)] \approx \frac{1}{N} \sum_{i = 1}^{N} P(y \mid x_i)$.

**Human-computer interaction**: it is better to ask a user to repeat a request than to respond to it unwisely. This requires the model to be aware of the uncertainty of each prediction.

**Risk calculation**: when adopting an NLP system, companies need to know the underlying risk of each prediction to avoid making costly decisions and to calculate expected revenue.
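To make the marginal-estimation formula concrete, here is a minimal sketch that compares averaging posteriors with counting single-best predictions. The function names and the four toy posteriors are hypothetical, chosen only for illustration.

```python
from collections import Counter


def marginal_from_posteriors(posteriors):
    """Estimate P(y) as the average of the per-example posteriors P(y | x_i)."""
    n = len(posteriors)
    labels = posteriors[0].keys()
    return {y: sum(p[y] for p in posteriors) / n for y in labels}


def marginal_from_single_best(posteriors):
    """Estimate P(y) by counting only the most likely label of each example."""
    n = len(posteriors)
    labels = posteriors[0].keys()
    counts = Counter(max(p, key=p.get) for p in posteriors)
    return {y: counts[y] / n for y in labels}


# Hypothetical posteriors for four documents over two labels.
posteriors = [
    {"pos": 0.6, "neg": 0.4},
    {"pos": 0.6, "neg": 0.4},
    {"pos": 0.7, "neg": 0.3},
    {"pos": 0.1, "neg": 0.9},
]

soft = marginal_from_posteriors(posteriors)   # uses the full posteriors
hard = marginal_from_single_best(posteriors)  # rounds each posterior to 0 or 1
print(soft, hard)
```

In this toy data, three weakly confident "pos" predictions push the single-best estimate of $P(\text{pos})$ to 0.75, whereas averaging the posteriors gives roughly 0.5.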

Nevertheless, posterior probabilities are only helpful if they are “true” in some sense. This motivates us to evaluate how good a model is at estimating its posterior probabilities. Specifically, besides emphasizing the need to report posterior probabilities of predictions, we argue that being able to produce those probabilities accurately should be a desired property of a probabilistic model and an important factor to consider when choosing between models. Ideally, if two models attain the same level of accuracy, the one offering more reliable probability estimates should be preferred. If they are also equally good at probability estimation, the one that is more certain about its decisions should be prioritized. These two criteria are the fundamental idea of **calibration**-**refinement** analysis, which has been used extensively in other fields such as meteorology but has not received enough attention in NLP. Calibration-refinement analysis directly evaluates the quality of a model’s probabilistic predictions by empirically validating them against the ground truth.

In this research, we focus on calibration analysis and develop methods to better compute and present its results to users. The outputs of our method are a calibration plot like this:

in which the more “diagonal” the curve is, the better calibrated the model is, and a single real number that summarizes the overall degree of miscalibration:

$$ \mathrm{CalibErr} = \sqrt{\mathbb{E}_q \left[ \left( q - \mathbb{P}(y = 1 \mid q) \right)^2 \right]}$$
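As a rough illustration of how such a number could be estimated from data, the sketch below groups predictions into equal-width bins and compares the mean predicted probability in each bin with the empirical frequency of $y = 1$. The equal-width binning, the function name `calibration_error`, and the `num_bins` parameter are assumptions made for this example and are not necessarily the estimator used in our paper.

```python
import math


def calibration_error(probs, labels, num_bins=10):
    """Binned estimate of sqrt(E_q[(q - P(y = 1 | q))^2]).

    probs:  predicted probabilities q that y = 1, each in [0, 1]
    labels: gold labels, each 0 or 1
    """
    bins = [[] for _ in range(num_bins)]
    for q, y in zip(probs, labels):
        idx = min(int(q * num_bins), num_bins - 1)  # q == 1.0 falls in the last bin
        bins[idx].append((q, y))

    total = len(probs)
    squared_error = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_q = sum(q for q, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        # Weight each bin by the fraction of predictions that fall into it.
        squared_error += (len(members) / total) * (mean_q - frequency) ** 2
    return math.sqrt(squared_error)


# Well calibrated: 20% of the 0.2-predictions and 80% of the 0.8-predictions are positive.
well = calibration_error([0.2] * 5 + [0.8] * 5, [1, 0, 0, 0, 0, 1, 1, 1, 1, 0])
# Overconfident: predictions of 0.9 that are right only half of the time.
over = calibration_error([0.9] * 4, [1, 1, 0, 0])
print(well, over)
```

The first call returns a value close to 0, while the overconfident model scores about 0.4, the gap between its stated confidence (0.9) and its observed accuracy (0.5).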

Our method is **model-agnostic**. It only requires posterior probabilities from the analyzed model as input. When enumerating the output space is intractable, as in structured prediction, those probabilities can still be approximated via sampling, as long as the model provides a mechanism for sampling from its posterior distributions. Popular families of models such as naive Bayes, logistic regression, hidden Markov models, and conditional random fields easily satisfy this requirement. More details in our paper!
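The sampling approximation mentioned above can be sketched as a simple Monte Carlo estimate: draw output structures from the posterior and count how often each position takes each value. Here `toy_sampler` is a hypothetical stand-in for a real model's posterior sampler, and `estimate_marginals` is an illustrative helper, not code from our paper.

```python
import random


def estimate_marginals(sample_fn, num_samples=1000, seed=0):
    """Monte Carlo estimate of per-position tag marginals from posterior samples."""
    rng = random.Random(seed)
    counts = None
    for _ in range(num_samples):
        sequence = sample_fn(rng)
        if counts is None:
            counts = [{} for _ in sequence]
        for position, tag in enumerate(sequence):
            counts[position][tag] = counts[position].get(tag, 0) + 1
    # Relative frequencies approximate P(y_t = tag | x).
    return [{tag: c / num_samples for tag, c in pos.items()} for pos in counts]


def toy_sampler(rng):
    """Hypothetical posterior sampler over two-tag sequences for one input."""
    first = "N" if rng.random() < 0.7 else "V"
    # The second tag depends on the first, as in a chain-structured model.
    p_verb = 0.9 if first == "N" else 0.2
    second = "V" if rng.random() < p_verb else "N"
    return [first, second]


marginals = estimate_marginals(toy_sampler, num_samples=5000)
print(marginals)
```

For this toy sampler the exact marginals are $P(y_1 = \text{N}) = 0.7$ and $P(y_2 = \text{V}) = 0.7 \times 0.9 + 0.3 \times 0.2 = 0.69$, and the estimates approach these values as the number of samples grows.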