Beta · cited findings · not independently recomputed
Calibration Ledger Beta
Last updated: 2026-04-27
A pre-launch preview of the registry: the methodology (Brier 1950 + Murphy 1973), applied here to third-party-published calibration measurements across forecaster platforms, prediction markets, and AI models. Each entry below cites a public, verifiable source URL.
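For readers who want the two scoring references concrete: a minimal sketch of the Brier (1950) score and the Murphy (1973) reliability/resolution/uncertainty decomposition, in plain Python on synthetic data. Illustrative only; this is not the registry's production scoring code.

```python
import numpy as np

def brier_score(p, y):
    """Brier (1950): mean squared error of probability forecasts against
    binary outcomes. Lower is better; always forecasting 0.5 scores 0.25."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return np.mean((p - y) ** 2)

def murphy_decomposition(p, y, n_bins=10):
    """Murphy (1973): Brier = reliability - resolution + uncertainty,
    computed over forecasts grouped into probability bins."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    base = y.mean()
    uncertainty = base * (1 - base)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()                          # fraction of forecasts in bin
        p_bar, y_bar = p[mask].mean(), y[mask].mean()
        reliability += w * (p_bar - y_bar) ** 2  # calibration gap per bin
        resolution += w * (y_bar - base) ** 2    # how much bins separate outcomes
    return reliability, resolution, uncertainty

# 1000 synthetic forecasts from a perfectly calibrated source
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)
rel, res, unc = murphy_decomposition(p, y)
print(brier_score(p, y), rel - res + unc)  # the two should roughly agree
```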
What “Beta” means here
- Each entry is a calibration finding published by the source itself or by peer-reviewed research. Calibration Ledger has not independently recomputed these scores.
- Independent recomputation, time-windowed per-source breakdowns, cross-domain calibration curves, and authoritative ranking require the data-licensing + academic-co-author + design-partner-LOI prerequisites (roadmap).
- The operator’s own dated forecasts (the methodology applied at scale to one identified forecaster) are public at /track-record/.
- The Q3 2027 Phase 1 launch decision is gated by 4 prerequisites; if fewer than 3 are met by Q4 2027, the kill criterion fires (sunset / sell / publicly document why it didn’t work).
At-a-glance (10 findings · 5 of 6 source classes covered)
| Source class | Source | Metric | Measured (date) |
|---|---|---|---|
| Human forecasters | Good Judgment Project Superforecasters | Mean Brier score | 2014-12-31 |
| Forecaster aggregator platform | Metaculus community-prediction aggregate | Brier score (binary questions, all-time) | 2026-04-27 |
| Prediction market | Manifold Markets — platform calibration | Calibration curve (predicted prob vs. observed frequency) | 2026-04-27 |
| AI models | GPT-4 (OpenAI) — pre-RLHF vs post-RLHF calibration | Expected Calibration Error (ECE) on multiple-choice benchmarks | 2023-03-15 |
| Analyst firms | Sell-side equity analysts — earnings forecast accuracy | Systematic optimism + analyst-disagreement-vs-error correlation (proper-scoring-rule analogue for point forecasts) | 2011-06-30 |
| Scientific papers | Open Science Collaboration — psychological science replication rate | Replication rate + effect-size shrinkage | 2015-08-28 |
| AI models | Anthropic — Claude / language model self-knowledge | P(IK) — probability the model assigns to "I know the answer"; P(True) — calibration of confidence in own answers | 2022-07-11 |
| Scientific papers | Camerer et al. — social science experiment replication (Nature/Science 2010-2015) | Replication rate + median effect-size shrinkage | 2018-08-27 |
| Analyst firms | Federal Reserve Survey of Professional Forecasters — GDP / inflation accuracy | Real-time forecast error vs. final-revised outcome (RMSE per horizon; coverage of probability ranges) | 2026-04-27 |
| Scientific papers | Hausfather et al. — climate model projections vs. observed warming | Implied transient climate response error; observed-vs-projected warming | 2020-01-04 |
| Review platforms | deferred — calibration-specific public studies of aggregated review outcomes are sparse; coverage planned for Phase 1 | — | — |
Machine-readable exports: JSON · BibTeX (CC-BY-4.0; the compilation only — individual papers retain their own copyright).
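The export schema itself is not shown on this page. As a sketch only, one plausible record shape, serialized from Python; every field name below is an illustrative assumption, not the published schema:

```python
import json

# Hypothetical export record for one cited finding.
record = {
    "source_class": "Human forecasters",
    "source": "Good Judgment Project Superforecasters",
    "metric": "Mean Brier score",
    "reported_value": "≈ 0.25 (vs. 0.37 for the control group)",
    "measured": "2014-12-31",
    "citation": "Mellers et al. (2015), J. Exp. Psychol.: Applied, 21(1), 1–14.",
    "source_url": "https://doi.org/10.1037/xap0000040",
    "recomputed_independently": False,  # remains False until Phase 1
    "license": "CC-BY-4.0 (compilation only)",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```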
Cited findings — full detail
Human forecasters
Good Judgment Project Superforecasters
- Metric: Mean Brier score
- Reported value: ≈ 0.25 (vs. 0.37 for the control group)
- Context: Across the IARPA Aggregative Contingent Estimation forecasting tournament (2011–2014); superforecasters were the top 2% of forecasters, identified by year-1 accuracy and trained in probabilistic reasoning.
- Citation: Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S. E., Ungar, L., Bishop, M. M., Horowitz, M., Merkle, E., & Tetlock, P. (2015). The psychology of intelligence analysis: Drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 21(1), 1–14.
- Source URL: https://doi.org/10.1037/xap0000040
- Measured: 2014-12-31
Forecaster aggregator platform
Metaculus community-prediction aggregate
- Metric: Brier score (binary questions, all-time)
- Reported value: public — reported on the Metaculus track-record page
- Context: Metaculus’s aggregated community prediction across all resolved binary questions on the platform. Metaculus publishes its own track record openly; time-windowed Brier scores vary, and the platform’s methodology and live numbers are public at the citation URL.
- Citation: Metaculus, Track Record + Scoring Methodology (publicly maintained dashboard).
- Source URL: https://www.metaculus.com/questions/track-record/
- Measured: 2026-04-27
Prediction market
Manifold Markets — platform calibration
- Metric: Calibration curve (predicted probability vs. observed frequency)
- Reported value: public — Manifold publishes a live calibration plot of all resolved binary markets
- Context: Manifold Markets publishes a live calibration plot of market closing probability vs. observed YES-fraction across all resolved binary markets. As of mid-2025 it was visually well-calibrated to within roughly ±5 percentage points across the 10–90% probability range. (A minimal sketch of how such a curve is computed follows this entry.)
- Citation: Manifold Markets, public Calibration Plot.
- Source URL: https://manifold.markets/calibration
- Measured: 2026-04-27
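A plot like Manifold’s can be reproduced from any set of resolved binary markets. A minimal sketch, assuming arrays of closing probabilities and YES/NO resolutions; this is not Manifold’s own code:

```python
import numpy as np

def calibration_curve(p_close, resolved_yes, n_bins=10):
    """Bin market closing probabilities and compare each bin's mean
    predicted probability to the observed YES-fraction of its markets."""
    p = np.asarray(p_close, float)
    y = np.asarray(resolved_yes, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((p[mask].mean(), y[mask].mean(), int(mask.sum())))
    # (mean predicted, observed frequency, n markets) per bin;
    # a well-calibrated source keeps |predicted - observed| small in every bin.
    return rows
```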
AI models
GPT-4 (OpenAI) — pre-RLHF vs post-RLHF calibration
- Metric: Expected Calibration Error (ECE) on multiple-choice benchmarks
- Reported value: pre-RLHF: well-calibrated; post-RLHF: degraded calibration (per OpenAI’s own measurement)
- Context: OpenAI’s GPT-4 Technical Report explicitly reports that the pre-trained base model is well-calibrated on multiple-choice benchmarks (calibration plot in §3.2) and that RLHF post-training degraded that calibration. This is a rare publisher-acknowledged calibration finding for a frontier LLM. (A minimal ECE sketch follows this entry.)
- Citation: OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774. §3.2 “Calibration”.
- Source URL: https://arxiv.org/abs/2303.08774
- Measured: 2023-03-15
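ECE, the metric named above, is the bin-weighted average gap between stated confidence and realized accuracy. A minimal sketch on synthetic multiple-choice-style data; this is not OpenAI’s evaluation code:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: weighted average of |accuracy - mean confidence| per
    confidence bin, weighted by the fraction of answers in each bin."""
    c = np.asarray(confidence, float)  # model's probability on its chosen answer
    y = np.asarray(correct, float)     # 1 if that answer was right, else 0
    bins = np.minimum((c * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - c[mask].mean())
    return ece  # 0 = perfectly calibrated confidence
```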
Analyst firms
Sell-side equity analysts — earnings forecast accuracy
- Metric: Systematic optimism + analyst-disagreement-vs-error correlation (proper-scoring-rule analogue for point forecasts)
- Reported value: public — survey of decades of empirical work
- Context: A widely cited review of decades of empirical work on sell-side analyst earnings forecasts. Findings include: forecasts are systematically optimistic; optimism declines as the forecast horizon shortens; recommendations carry informational content for investors only when conditioned on forecast-revision history; and consensus disagreement among analysts is a useful proxy for forecast uncertainty (a calibration-adjacent property).
- Citation: Bradshaw, M. T. (2011). Analysts’ Forecasts: What Do We Know After Decades of Work? Working paper, Boston College Carroll School of Management.
- Source URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1880339
- Measured: 2011-06-30
Scientific papers
Open Science Collaboration — psychological science replication rate
- Metric: Replication rate + effect-size shrinkage
- Reported value: 36% of replications produced a statistically significant result (vs. 97% of originals); mean effect size halved on replication
- Context: Landmark large-scale replication of 100 psychology experiments published in three top journals. The findings provide a base rate against which any future per-paper or per-journal calibration claim must be evaluated; comparable replication studies in economics (Camerer et al. 2016) and the biomedical sciences are cited in the original paper for cross-discipline context.
- Citation: Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
- Source URL: https://doi.org/10.1126/science.aac4716
- Measured: 2015-08-28
AI models
Anthropic — Claude / language model self-knowledge
- Metric: P(IK) — probability the model assigns to “I know the answer”; P(True) — calibration of the model’s confidence in its own answers
- Reported value: large language models are well-calibrated on their own knowledge, with calibration improving with model scale
- Context: Anthropic study finding that base language models are well-calibrated both on whether they know the answer to a question (P(IK)) and on whether their stated answers are true (P(True)). This is a calibration-adjacent finding for AI models: not predictive forecasting per se, but the same proper-scoring-rule machinery applied to model self-confidence on factual questions. (A minimal sketch follows this entry.)
- Citation: Kadavath, S., Conerly, T., Askell, A., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.
- Source URL: https://arxiv.org/abs/2207.05221
- Measured: 2022-07-11
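The same machinery applies directly: treat P(True) as a probability forecast of the answer’s own correctness and score it with a proper scoring rule. A minimal sketch with hypothetical values; Kadavath et al. use their own evaluation pipeline, and nothing below is from the paper:

```python
import numpy as np

# p_true[i]: probability the model assigns to "my answer to question i is true"
# correct[i]: 1 if that answer was in fact correct (graded externally)
p_true = np.array([0.9, 0.7, 0.95, 0.4, 0.8])   # hypothetical values
correct = np.array([1, 1, 1, 0, 1], dtype=float)

brier = np.mean((p_true - correct) ** 2)  # proper-scoring-rule check
print(f"Brier score of P(True): {brier:.3f}")  # low = well-calibrated self-knowledge
```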
Scientific papers
Camerer et al. — social science experiment replication (Nature/Science 2010–2015)
- Metric: Replication rate + median effect-size shrinkage
- Reported value: 13 of 21 social-science experiments replicated (62%); replication effect sizes averaged about 50% of the originals
- Context: Companion study to the Open Science Collaboration 2015 effort, focused on the 21 social-behavioral experiments published in Nature and Science (2010–2015) that met the inclusion criteria. The replication rate was higher than in psychology overall (62% vs. 36%), but effect sizes still systematically shrank — a base rate for any per-paper Phase 1 scoring of social-science publications.
- Citation: Camerer, C. F., Dreber, A., Holzmeister, F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644.
- Source URL: https://doi.org/10.1038/s41562-018-0399-z
- Measured: 2018-08-27
Analyst firms
Federal Reserve Survey of Professional Forecasters — GDP / inflation accuracy
- Metric: Real-time forecast error vs. final-revised outcome (RMSE per horizon; coverage of probability ranges)
- Reported value: public — the Philadelphia Fed maintains historical SPF data and accuracy reports back to 1968
- Context: The Federal Reserve Bank of Philadelphia’s Survey of Professional Forecasters is the longest-running quarterly survey of US macroeconomic forecasts. The Philadelphia Fed publishes per-horizon forecast-accuracy statistics (RMSE for point forecasts; probability-range coverage for binned probability questions such as recession within the next four quarters). It serves as a cross-vertical Phase 1 reference for analyst-class calibration. (A minimal sketch of both statistics follows this entry.)
- Citation: Federal Reserve Bank of Philadelphia, Survey of Professional Forecasters — Documentation and Forecast Accuracy.
- Source URL: https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/survey-of-professional-forecasters
- Measured: 2026-04-27
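Both statistics are straightforward to compute from the SPF real-time files. A minimal sketch under assumed array shapes; this is not the Philadelphia Fed’s code, and the variable names are illustrative:

```python
import numpy as np

def rmse_by_horizon(forecasts, outcomes):
    """forecasts: (n_dates, n_horizons) point forecasts made in real time;
    outcomes: (n_dates,) final-revised realized values. Returns RMSE per horizon."""
    err = np.asarray(forecasts, float) - np.asarray(outcomes, float)[:, None]
    return np.sqrt(np.mean(err ** 2, axis=0))

def interval_coverage(lower, upper, outcomes):
    """Fraction of outcomes falling inside the stated probability range;
    compare this to the nominal probability to check calibration."""
    o = np.asarray(outcomes, float)
    return np.mean((o >= np.asarray(lower)) & (o <= np.asarray(upper)))
```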
Scientific papers
Hausfather et al. — climate model projections vs. observed warming
- Metric: Implied transient climate response error; observed vs. projected warming
- Reported value: 14 of 17 surveyed climate-model projections published 1970–2007 fell within the natural-variability range of subsequently observed warming once adjusted for actual emissions
- Context: Evaluation of how well climate-model projections published between 1970 and 2007 tracked observed global mean surface temperature in the years that followed. Once corrected for actual greenhouse-gas emissions (which differed from the emissions modelers assumed), most models were skillful. A landmark finding for scoring scientific model projections — directly applicable to AI-model calibration analogues.
- Citation: Hausfather, Z., Drake, H. F., Abbott, T., & Schmidt, G. A. (2020). Evaluating the Performance of Past Climate Model Projections. Geophysical Research Letters, 47(1), e2019GL085378.
- Source URL: https://doi.org/10.1029/2019GL085378
- Measured: 2020-01-04
What Phase 1 launch will add
- Independent recomputation of each cited finding using the original outcome data (not the publisher’s own scoring), under data-licensing agreements with the source platforms.
- Time-windowed per-source breakdowns: rolling 3-month, 12-month, and lifetime calibration curves with confidence intervals.
- Cross-domain calibration: how well a forecaster who scores high on AI predictions calibrates on geopolitics, markets, weather, etc.
- Append-only timestamp anchoring of every score, so retroactive “I-predicted-this-all-along” revisions are visible (one possible implementation is sketched after this list).
- Authoritative ranking + per-source citation pages for AI labs, regulators, and academic publishers — the Phase 1 enterprise product.
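On the append-only anchoring bullet above: one way to realize it, presented as an assumption for illustration rather than the project’s actual design, is a SHA-256 hash chain over timestamped score records:

```python
import hashlib
import json
import time

def append_score(ledger, score_record):
    """Append-only anchoring sketch: each entry hashes (prev_hash, timestamp,
    record), so any retroactive edit breaks every downstream hash."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "record": score_record,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    ledger.append(entry)
    return entry

ledger = []
append_score(ledger, {"source": "example", "brier": 0.21})
# Verifiers recompute each hash from its predecessor; a silently revised
# score changes the chain from that point forward and is immediately visible.
```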
Related
- Methodology v1.1 — full Brier + Murphy + append-only framework
- Track record — operator’s own dated forecasts, scored with this engine
- Source classes — the 6 source-type classes Phase 1 will score
- Roadmap — milestone status + Q3 2027 launch gate + kill criterion
- Partners — design-partner recruitment for AI labs / regulators / academics
Last verified: 2026-04-27. Page version 0.1 (beta scaffold; cited findings only; independent recomputation pending Phase 1). Operator: Paulo de Vries. Contact: contact@calibrationledger.com.