The field of machine learning is rife with uncertainty quantification. In fact, the standard approach to prediction is based on quantifying confidence over the space of possible outcomes, e.g. the confidence over the whole vocabulary for next-token prediction in text generation. Despite the wide scope of uncertainty quantification in machine learning pipelines, there is still no consensus on what defines a good notion of uncertainty. Calibration, in my limited viewpoint, takes the cake as a super simple, highly minimal, and yet very powerful notion of uncertainty quantification. And maybe this is the reason it has remained an object of study among machine learning practitioners for years. Despite the promise and the popularity of calibration, I’ve always felt a tinge of discomfort every time I’ve heard or read about it. In this note, I’ll try to articulate why that might be, and why it need not be the case. In this first section, I’ll talk about the promise of calibration as well as why that has concerned me. In the second section, I’ll argue why that concern may not be that much of a concern after all.

No sure loss and calibration

So what exactly is calibration? Put simply, it is a certificate that the estimated confidence aligns with how often the outcomes actually occur in the real world over which the forecast was made. So if the forecast was that the risk of someone having a certain disease is 70%, then, of all the individuals who received the same forecast, 70% actually had a positive diagnosis in the real world. Quite nicely then, it lends interpretability to the forecast. However, the real world requires more than just interpretability. In the example we have, once the forecast of 70% is made for some individual, a medical professional needs to plan a course of action: whether to administer treatment to this person or not, or whether to administer treatment A vs treatment B. The crux of the situation is that once the prediction is made, some decision-maker has to act on it by taking certain actions. And this is where things get interesting, as there could be a range of decision-makers who will consume the same prediction, each with their specific utility or cost functions, and different risk attitudes (risk neutral or risk averse). I’ll give one more example; it is a bit unrealistic, to keep things simple. Consider a risk forecasting system estimating the chance of early-stage osteoporosis in individuals over the next five years. In short, to think about the goals of uncertainty quantification, it may be wise to consider who the consumers of that uncertainty are, and what they want.
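
To make the “of all the individuals who received the same forecast” reading concrete, here is a minimal sketch (in Python, with purely hypothetical forecasts and outcomes) of an empirical calibration check: bucket individuals by their announced risk and compare that risk with the observed frequency of positive diagnoses within the bucket.

```python
# Minimal empirical calibration check on hypothetical data: group individuals
# by their forecasted risk and compare the forecast with the observed rate
# of positive diagnoses within each group.
import numpy as np

rng = np.random.default_rng(0)
forecasts = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=10_000)  # announced risks
# For illustration, outcomes are drawn so that the forecasts are calibrated.
outcomes = rng.binomial(1, forecasts)

for p in np.unique(forecasts):
    group = forecasts == p
    print(f"forecast {p:.1f}: observed rate {outcomes[group].mean():.3f} "
          f"over {group.sum()} individuals")
```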

Notation: To informally formalise, we consider the standard machine learning setup with a space $\mathcal{X} \times \mathcal{Y}$, a distribution $P$ on it, canonical random variables on this space denoted $X$ and $Y$, and a forecaster $h : \mathcal{X} \rightarrow [0,1]$ estimating the covariate-dependent probability of some event, for example the chance of early-stage osteoporosis in an individual over the next five years based on the patient’s history. Now obviously, based on the forecast, a certain individual may have the choice to go for an advanced bone density scan or to ignore the diagnostic test, both options with certain costs attached to them. How should a patient, as a decision-maker, act then?

To think about this question, it has helped me to frame it in terms of the behavioural interpretation of probability as pioneered by de Finetti. In the risk prediction setting, if the person will eventually develop osteoporosis and they decide to go for an advanced test, let’s say they get a reward of $+20$, and if they go for the advanced test when they won’t develop osteoporosis, they get a reward of $-10$, reflecting the financial cost. When the forecaster announces that a certain individual has a 70% chance of developing osteoporosis, an individual who does not know a priori whether they will develop it, in choosing whether to go for the advanced test, is effectively facing a gamble $G$ where with $70\%$ chance they get $+20$, and with $30\%$ chance they get $-10$. Or so the forecaster wants the decision-maker to think, as the true rate of osteoporosis might differ from what the forecaster announces. How can the decision-maker then assess the quality of the forecast?

No sure loss criterion: When the forecaster announces the gamble $G$, they also announce the fair price $\mu$ at which they would buy or sell that gamble, i.e. $\mu = 0.7 \cdot 20 - 0.3 \cdot 10 = 11$. Now if the decision-maker is risk-neutral, i.e. they value the gamble $G$ exactly at its fair price, they may decide to engage in the transaction of buying $G$ at the price $\mu$, thus holding the gamble $G - \mu$, while the forecaster holds the gamble $\mu - G$. One desideratum each agent has in this situation is the no sure loss criterion: neither agent wants to be a sure loser by making this transaction. Suppose the true rate of osteoporosis for this individual is $\eta$; then the decision-maker’s expected gain from the transaction is $\mathbb{E}_{\eta}[G - \mu]$, and the forecaster’s is $\mathbb{E}_{\eta}[\mu - G]$. If neither is to lose in expectation, both quantities must be non-negative, which forces $\mathbb{E}_{\eta}[G - \mu] = 0$, or $\mathbb{E}_{\eta}[G] = \mathbb{E}_{h}[G]$ where $h$ is the forecast announced by the forecaster. This gives us the desirable requirement on the forecast: if both agents subscribe to the no sure loss criterion, then the forecast should enable the decision-maker to faithfully evaluate the value of the gamble. It’s worth reflecting on why the no sure loss criterion makes sense: as noted above, decision-makers consume probabilities to make decisions, and an estimate is minimally good enough if it enables them to evaluate the consequences of those decisions. [And yet as I’m writing this, I’m a bit uncomfortable, and I’ll get to that later.]
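
As a quick sanity check on the arithmetic above, here is a small sketch (Python, with the same illustrative rewards) that computes the fair price $\mu$ and the decision-maker’s expected gain $\mathbb{E}_{\eta}[G - \mu]$ for a few true rates $\eta$: when $\eta$ differs from the forecast, one of the two agents loses in expectation.

```python
# Numeric sketch of the no sure loss criterion under the assumed rewards
# (+20 if osteoporosis develops, -10 otherwise). All numbers are illustrative.
h = 0.7                     # announced forecast
rewards = (20.0, -10.0)     # (reward if event occurs, reward if it does not)

mu = h * rewards[0] + (1 - h) * rewards[1]   # fair price announced by the forecaster
print(mu)  # 11.0

def expected_gain(eta, mu, rewards=rewards):
    """Decision-maker's expected gain E_eta[G - mu] when the true rate is eta."""
    return eta * rewards[0] + (1 - eta) * rewards[1] - mu

print(expected_gain(0.7, mu))   #  0.0 -> no sure loss when the forecast matches the true rate
print(expected_gain(0.5, mu))   # -6.0 -> the decision-maker loses in expectation
print(expected_gain(0.9, mu))   # +6.0 -> the forecaster loses in expectation
```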

So far in my example, we’ve considered a single gamble $G$. One could, however, impose the stronger requirement of no sure loss for every gamble that depends on the forecast, i.e. $\mathbb{E}_{\eta}[G] = \mathbb{E}_{h}[G] \ \ \forall G$. For instance, these gambles could take the form of choosing among multiple advanced tests. This results in the forecast $h$ matching the true rate $\eta$ almost surely (see below). That is, if the forecast is to enable no sure loss for every gamble that depends on it, this can only happen if the forecast matches the true rate. Obviously, this is very strong and impractical. That’s where the promise of calibration kicks in.
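
A quick way to see why no sure loss for every gamble forces this: among all such gambles is the simple one that pays $1$ if the individual develops osteoporosis and $0$ otherwise, for which the matching condition reads

$$\mathbb{E}_{\eta}\left[\mathbf{1}\{Y = 1\}\right] = \mathbb{E}_{h}\left[\mathbf{1}\{Y = 1\}\right] \iff \eta = h,$$

and since this must hold for (almost) every individual, the forecast must coincide with the true rate almost surely.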

Calibration: It’s worth highlighting that, so far, we have considered individualised decision-making, which leads to the stronger requirement that the forecast match the true rate for each individual in order to guarantee no sure loss. Could one relax this? It turns out this is possible if we consider population-level no sure loss, or accurate loss estimation. Following the same setup as above, a decision-maker engaging in the transaction with the forecaster has population aggregate utility $\mathbb{E}_{(X, Y) \sim P}\left[G - \mathbb{E}_{Y \sim h(X)}[G]\right] = \mathbb{E}_{(X, Y) \sim P}[G] - \mathbb{E}_{X}\mathbb{E}_{Y \sim h(X)}[G]$. It now remains to check that if the forecast is calibrated, then this evaluates to $0$. For the second expectation, consider events of the form $A_\mu = \{x\ \text{s.t.}\ h(x) = \mu\}$; if the forecast is calibrated, then $Y \sim h(X)$ even under the true distribution restricted to each such event. Hence on each event $A_\mu$, the second expectation behaves as per the true distribution, which, when averaged across all the events $A_\mu$, matches the first expectation.
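
To see this population-level argument in action, here is a minimal simulation sketch (Python, with illustrative numbers): a deliberately coarse forecaster that only outputs the average risk within two buckets is far from the individual true rates, yet, because it is (approximately) calibrated, its aggregate valuation of the gamble matches the realised aggregate utility.

```python
# Simulation sketch: a calibrated but coarse forecaster still estimates the
# population aggregate utility of the gamble correctly. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
eta = rng.uniform(0.0, 1.0, size=n)   # true individual risks
y = rng.binomial(1, eta)              # realised outcomes

# Coarse forecaster: predicts only the average risk within two buckets, so it is
# far from the individual true rates but (approximately) calibrated.
low = eta < 0.5
h = np.where(low, eta[low].mean(), eta[~low].mean())

realised = np.where(y == 1, 20.0, -10.0)   # realised value of the gamble G
priced = h * 20.0 + (1 - h) * (-10.0)      # forecaster's valuation of G per individual

print(realised.mean())  # aggregate realised utility (about 5)
print(priced.mean())    # aggregate forecasted utility (about 5) -> they match
```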

Thus, calibration enables no sure loss, or accurate risk estimation, at the population level. The cost of replacing the requirement that the forecast exactly match the true rate with the weaker property of calibration is that no guarantees can now be derived for individual forecasts. This is just part of the reason why I find calibration a bit unsatisfying. Another reason is the notion of risk preferences, as in this note I have only talked about a risk-neutral decision-maker. A risk-neutral agent is happy holding the gamble $G - \mu$, where $\mathbb{E}[G] = \mu$, i.e. no sure loss (or gain, for that matter). However, in osteoporosis diagnosis and other critical applications like tumour relapse prediction, I find it unsettling to evaluate the gamble by expectations. I’ll get to that in the next section.

In the following section, I’ll argue why my concerns about the population-level no sure loss guarantee, as well as about evaluating the value of the gamble in terms of expectations, may have been unfounded after all.

Risk aversion, insurance, and calibration

As noted above, when the decision-maker is offered the gamble $G - \mu$, they evaluate its value as $\mathbb{E}[G - \mu]$. However, in practice a decision-maker won’t observe the value $\mathbb{E}[G]$; instead they will observe either $+20$ (if they do develop osteoporosis and opt for the advanced test) or $-10$ (if they don’t develop osteoporosis and go for the advanced test). That is, a realistic decision-maker only makes an individualised decision, the consequence of which is a one-time realisation of the gamble. How can a decision-maker, then, guarantee no sure loss?