Logistics

Date: Sunday, October 27, 2024

Location: Workshop Session B, The voco Chicago Downtown, an IHG Hotel, 350 W. Wolf Point Plaza, Building 1, Chicago, IL 60654

FOCS website here

Description

This workshop covers theoretical topics related to calibration. A predictor is calibrated if its predictions are empirically correct. For example, in weather forecasting, a predictor is calibrated if, among the times when it predicts a p probability of rain, the empirical frequency of rain is also p. Calibration allows algorithmic outputs to be reliably interpreted as probabilities, which makes predictions trustworthy and interpretable.
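
As an illustration of this definition, the short Python sketch below (our own, not taken from any workshop material; all names are made up) checks calibration empirically by binning forecasts and comparing each bin's average forecast to the observed frequency of the event.

```python
import numpy as np

def binned_calibration_report(pred_probs, outcomes, n_bins=10):
    """Group forecasts into probability bins and compare, within each bin,
    the average forecast to the empirical frequency of the event (e.g. rain)."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each forecast to a bin; the final edge is folded into the last bin.
    idx = np.clip(np.digitize(pred_probs, edges) - 1, 0, n_bins - 1)
    report = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            report.append((pred_probs[mask].mean(), outcomes[mask].mean(), int(mask.sum())))
    return report  # list of (average forecast, empirical frequency, count)

# Toy check: a forecaster that always says 0.3 is calibrated iff it rains about 30% of the time.
rng = np.random.default_rng(0)
forecasts = np.full(1000, 0.3)
rain = rng.binomial(1, 0.3, size=1000)
print(binned_calibration_report(forecasts, rain))
```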

A growing body of recent literature studies the algorithmic foundations of calibration. This workshop will introduce the following topics and their connections to calibration:

  • Complexity theory. Calibration and multi-calibration have implications in complexity theory, including outcome indistinguishability [1] and the Regularity Lemma [2].
  • Uncertainty quantification. Calibrated predictions can be used to generate prediction sets that cover their target labels with a target confidence level, not just marginally, but conditional on various overlapping properties of the covariates. More generally, everything that “multi-calibration” can do for means can be done for exactly those distributional statistics that can be expressed as M-estimators, a class that includes, e.g., quantiles, which yield applications in conformal prediction [3, 4].
  • Decision making, regret, and omni-prediction. Calibration allows decision making to be separated into two components: a prediction task and a decision-making task. Calibrated predictions from the prediction task are trustworthy for every downstream decision maker [5, 8, 11]. In particular, best responding to calibrated predictions guarantees no swap regret for every decision maker [6, 7]. 
  • Fairness. Multi-calibration originated in the algorithmic fairness literature, and continues to have important applications in guaranteeing that models capture appropriate amounts of signal about populations not just marginally, but broken down by intersecting demographic attributes such as race and sex [9, 10, 12]. (A small sketch of a group-conditional calibration check appears after this list.)
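
The sketch below is a minimal, illustrative Python audit of the group-conditional guarantee mentioned in the uncertainty-quantification and fairness items above: for each (possibly overlapping) group and each prediction level, it reports the calibration gap that multi-calibration requires to be small. Function and group names are hypothetical.

```python
import numpy as np

def group_calibration_gaps(pred_probs, outcomes, groups, n_bins=10):
    """For every (possibly overlapping) group and every forecast bin, report the
    gap between the average forecast and the empirical outcome rate.
    Multi-calibration asks that all of these gaps be small simultaneously."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(pred_probs, edges) - 1, 0, n_bins - 1)
    gaps = {}
    for name, members in groups.items():
        members = np.asarray(members, dtype=bool)
        for b in range(n_bins):
            cell = members & (idx == b)
            if cell.any():
                gaps[(name, b)] = float(pred_probs[cell].mean() - outcomes[cell].mean())
    return gaps

# Hypothetical usage with overlapping groups derived from covariates X:
#   groups = {"all": np.ones(len(y), bool), "group_A": X[:, 0] == 1, "group_B": X[:, 1] == 1}
#   audit = group_calibration_gaps(model_probs, y, groups)
```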


The workshop will be coordinated with the Fall 2024 IDEAL Special Program on Interpretability, Privacy, and Fairness. IDEAL is an NSF-funded foundations of data science institute with members from five institutions in the greater Chicago area. The program also includes a quarter-long Ph.D.-level seminar on calibration. Participants in the special program and course will be encouraged to attend the workshop.


Schedule

  • 9:00 – 9:15 Aaron Roth: Tutorial on Calibration
  • 9:15 – 9:40 Parikshit Gopalan 
    • Title: Robust and Efficient Calibration
    • Abstract: It is commonly accepted that calibration is a desirable property of probabilistic predictors. But what does it mean for a predictor to not be calibrated (in a robust sense)? The answer should guide how we measure calibration error, a topic that turns out to be unexpectedly subtle. In this talk, we will survey recent work that proposes desirable properties of a calibration measure, and measures that satisfy these properties.
  • 9:40 – 10:05 Natalie Collina 
    • Title: Tractable Agreement Protocols via Calibration
    • Abstract:
      We give an efficient reduction through which any machine learning algorithm can be converted into an interactive protocol that can be used to interact with another party (such as a human) to reach agreement on predictions and improve accuracy. The requirements on each party are calibration conditions, computationally and statistically tractable relaxations of Bayesian rationality that are sensible even in prior-free settings, and hence a substantial generalization of Aumann’s classic “agreement theorem”. In the interactive protocol, the machine learning model first produces a prediction. Then, the human responds to the model’s prediction by either conveying agreement or directional disagreement. The model then updates its state and provides a new prediction, and the human in turn may update their beliefs. The process continues until the model and the human reach agreement. We study several feedback mechanisms. The first generalizes past work on Aumann’s Agreement Theorem [Aumann 1976], in which the parties aim to agree on a one-dimensional expectation. In the simplest setting, at each round, each party simply communicates an estimate of their current prediction for the expectation. In this setting we recover the quantitative convergence theorem of [Aaronson 2005] (but under our much weaker assumptions). This result holds even if the human only needs to express that they believe the machine’s estimate is “too large” or “too small”, rather than provide a quantitative prediction. We then move on to the case in which the parties maintain beliefs about a distribution over d outcomes. If the human needs to take some downstream action from a finite set, and has an arbitrary utility function of their action and the outcome, then we show that the parties can communicate and reach agreement about the correct downstream action to take by simply communicating at each round the action that they believe to be utility maximizing. We can also generalize our protocols to more than two parties, with rates that degrade only polynomially with the number of parties. Our protocols are based on a simple, efficiently maintainable condition and result in predictions that are more accurate than any single party’s alone.
      Joint work with Surbhi Goel, Varun Gupta, and Aaron Roth. (A toy sketch of the interaction pattern in this protocol appears after this schedule block, just before the break.)
  • 10:05 – 10:30 Michael P. Kim
    • Title: Near-Optimal Algorithms for Omniprediction
    • Abstract: Omnipredictors are simple prediction functions that encode loss-minimizing predictions, simultaneously for every loss function within a class of losses L. We give an oracle-efficient online learning algorithm that achieves omniprediction in O~(T^{1/2}) regret for any class of bounded variation loss functions. Quite surprisingly, this regret bound matches the optimal regret for minimization of a single loss function (up to log factors). We leverage our result from the online setting to devise an offline learning algorithm for efficient omnipredictors, whose sample complexity scales at a near-optimal rate. Our algorithms shed light on the computational complexity of achieving omniprediction, separating the (easier) task of learning omnipredictors for convex losses from that of learning omnipredictors for proper losses. Joint work with Princewill Okoroafor and Bobby Kleinberg.
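
Related to Natalie Collina’s talk above (see the note at the end of that abstract), the toy Python loop below illustrates only the shape of the interaction: the model announces an estimate, the human either agrees or reports a directional disagreement, and the model revises. The simple halving update is our own stand-in for illustration; it is not the calibration-based procedure from the paper.

```python
def toy_agreement_loop(initial_estimate, human_feedback, step=0.25, max_rounds=50):
    """Message flow only: the model announces an estimate of a one-dimensional
    expectation, the human replies "agree", "too large", or "too small", and the
    model revises.  The halving update below is a stand-in for illustration; the
    actual protocol uses calibration conditions to guarantee fast agreement."""
    estimate = initial_estimate
    for _ in range(max_rounds):
        reply = human_feedback(estimate)          # "agree" | "too large" | "too small"
        if reply == "agree":
            return estimate
        estimate += -step if reply == "too large" else step
        step /= 2                                 # shrink the adjustment each round
    return estimate

# Toy "human" whose belief is 0.7 and who agrees once the estimate is within 0.01.
human = lambda e: "agree" if abs(e - 0.7) < 0.01 else ("too large" if e > 0.7 else "too small")
print(toy_agreement_loop(0.2, human))             # converges to roughly 0.7
```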

Break

  • 11:00 – 11:25 Adam Kalai
    • Title: Demystifying language model hallucinations
    • Abstract: We try to demystify language model hallucinations by leveraging calibration and intuition from supervised learning. We argue that the same types of facts that are hard to learn in a supervised setting are also those that language models are encouraged to hallucinate during “pre-training,” even when the training data are noiseless. We do not believe that hallucinations are inevitable: language model “post-training” should remove these hallucinations (though it will also make the models miscalibrated). Joint work with Santosh Vempala.
  • 11:25 – 11:50 Princewill Okoroafor 
    • Title: Breaking the T^{2/3} Barrier for Sequential Calibration
    • Abstract: A set of probabilistic forecasts is calibrated if each prediction of the forecaster closely approximates the empirical distribution of outcomes on the subset of timesteps where that prediction was made. We study the fundamental problem of online calibrated forecasting of binary sequences, which was initially studied by Foster & Vohra (1998). They derived an algorithm with O(T^{2/3}) calibration error after T time steps, and showed a lower bound of Ω(T^{1/2}). These bounds remained stagnant for two decades, until Qiao & Valiant (2021) improved the lower bound to Ω(T^{0.528}) by introducing a combinatorial game called sign preservation and showing that lower bounds for this game imply lower bounds for calibration. In this paper, we give the first improvement to the O(T^{2/3}) upper bound on calibration error of Foster & Vohra. We do this by introducing a variant of Qiao & Valiant’s game that we call sign preservation with reuse (SPR). We prove that the relationship between SPR and calibrated forecasting is bidirectional: not only do lower bounds for SPR translate into lower bounds for calibration, but algorithms for SPR also translate into new algorithms for calibrated forecasting. We then give an improved upper bound for the SPR game, which implies, via our equivalence, a forecasting algorithm with calibration error O(T^{2/3−ε}) for some ε>0, improving Foster & Vohra’s upper bound for the first time. Using similar ideas, we then prove a slightly stronger lower bound than that of Qiao & Valiant, namely Ω(T^{0.54389}). Our lower bound is obtained by an oblivious adversary, marking the first ω(T^{1/2}) calibration lower bound for oblivious adversaries.
      Based on joint work with Yuval Dagan, Constantinos Daskalakis, Maxwell Fishelson, Noah Golowich and Robert Kleinberg.
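
As a companion to the abstract above, here is a minimal Python sketch (ours, not from the paper) of one standard way to formalize the calibration error of a sequence of binary forecasts, namely the ℓ1 calibration error: each prediction value is compared against the empirical frequency of outcomes on the timesteps where it was made.

```python
from collections import defaultdict

def l1_calibration_error(predictions, outcomes):
    """One standard way to score a transcript of binary forecasts: for each
    distinct forecast value p, compare p to the empirical frequency of 1s on
    the timesteps where p was predicted, weighted by how often p was used."""
    buckets = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        buckets[p].append(y)
    return sum(abs(sum(ys) - p * len(ys)) for p, ys in buckets.items())

# Toy transcript: always predicting 0.5 against an alternating sequence is perfectly calibrated.
preds = [0.5] * 10
outs = [0, 1] * 5
print(l1_calibration_error(preds, outs))   # 0.0
```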


Organizers:

Jason Hartline (hartline@northwestern.edu)

Jamie Morgenstern (jamiemmt@cs.washington.edu)

Aaron Roth (aaroth@cis.upenn.edu)

Yifan Wu (yifan.wu@u.northwestern.edu)

References: 

[1] Dwork, C., Kim, M. P., Reingold, O., Rothblum, G. N., & Yona, G. (2021, June). Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (pp. 1095-1108).

[2] Casacuberta, S., Dwork, C., & Vadhan, S. (2024, June). Complexity-Theoretic Implications of Multicalibration. In Proceedings of the 56th Annual ACM Symposium on Theory of Computing (pp. 1071-1082).

[3] Jung, C., Noarov, G., Ramalingam, R., & Roth, A. (2022). Batch multivalid conformal prediction. arXiv preprint arXiv:2209.15145.

[4] Noarov, G., & Roth, A. (2023, July). The statistical scope of multicalibration. In International Conference on Machine Learning (pp. 26283-26310). PMLR.

[5] Kleinberg, B., Leme, R. P., Schneider, J., & Teng, Y. (2023, July). U-calibration: Forecasting for an unknown agent. In The Thirty Sixth Annual Conference on Learning Theory (pp. 5143-5145). PMLR.

[6] Noarov, G., Ramalingam, R., Roth, A., & Xie, S. (2024). High-Dimensional Prediction for Sequential Decision Making.

[7] Roth, A., & Shi, M. (2024). Forecasting for Swap Regret for All Downstream Agents. In Proceedings of the 25th ACM Conference on Economics and Computation (EC 2024).

[8] Gopalan, P., Okoroafor, P., Raghavendra, P., Sherry, A., & Singhal, M. (2024, June). Omnipredictors for regression and the approximate rank of convex functions. In The Thirty Seventh Annual Conference on Learning Theory (pp. 2027-2070). PMLR.

[9] Globus-Harris, I., Kearns, M., & Roth, A. (2022). An Algorithmic Framework for Bias Bounties. In ACM Conference on Fairness, Accountability, and Transparency (FAccT 2022).

[10] Roth, A., Tolbert, A., & Weinstein, S. (2023). Reconciling Individual Probability Forecasts. In ACM Conference on Fairness, Accountability, and Transparency (FAccT 2023).

[11] Gopalan, P., Kalai, A. T., Reingold, O., Sharan, V., & Wieder, U. (2022). Omnipredictors. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Schloss-Dagstuhl-Leibniz Zentrum für Informatik.

[12] Hébert-Johnson, U., Kim, M., Reingold, O., & Rothblum, G. (2018, July). Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning (pp. 1939-1948). PMLR.
