
Beta · cited findings · not independently recomputed

Calibration Ledger Beta

Last updated: 2026-04-27

A pre-launch preview of the registry: the methodology (Brier 1950 + Murphy 1973) applied to third-party-published calibration measurements across forecaster platforms, prediction markets, and AI models. Each entry below cites a public, verifiable source URL.
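
For concreteness, here is a minimal sketch of the two scoring tools the methodology names: the Brier score (Brier 1950) and its Murphy (1973) decomposition into reliability, resolution, and uncertainty. The ten-bin grouping and the synthetic forecasts are illustrative assumptions, not data from any entry below.

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between forecast probabilities p and binary outcomes y (Brier 1950)."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return np.mean((p - y) ** 2)

def murphy_decomposition(p, y, n_bins=10):
    """Murphy (1973): Brier = reliability - resolution + uncertainty,
    computed by grouping forecasts into probability bins."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    base_rate = y.mean()
    uncertainty = base_rate * (1 - base_rate)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()          # fraction of forecasts falling in this bin
        p_bar = p[mask].mean()   # mean forecast in the bin
        o_bar = y[mask].mean()   # observed frequency in the bin
        reliability += w * (p_bar - o_bar) ** 2
        resolution += w * (o_bar - base_rate) ** 2
    return reliability, resolution, uncertainty

# Illustrative use with made-up forecasts and outcomes.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
p = np.clip(y * 0.7 + 0.15 + rng.normal(0, 0.1, 1000), 0, 1)
rel, res, unc = murphy_decomposition(p, y)
# The decomposition reproduces the Brier score up to within-bin forecast variance.
print(brier_score(p, y), rel - res + unc)
```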

What “Beta” means here

  • Each entry is a calibration finding published by the source itself or by peer-reviewed research. Calibration Ledger has not independently recomputed these scores.
  • Independent recomputation, time-windowed per-source breakdowns, cross-domain calibration curves, and authoritative ranking require the data-licensing + academic-co-author + design-partner-LOI prerequisites (roadmap).
  • The operator’s own dated forecasts (the methodology applied at scale to one identified forecaster) are public at /track-record/.
  • The Q3 2027 Phase 1 launch decision is gated by four prerequisites; if fewer than three are met by Q4 2027, the kill criterion fires (sunset, sell, or publicly document why it didn’t work).

At-a-glance (10 findings · 5 of 6 source classes covered)

Source class · Source · Metric · Measured
Human forecasters · Good Judgment Project Superforecasters · Mean Brier score · 2014-12-31
Forecaster aggregator platform · Metaculus community-prediction aggregate · Brier score (binary questions, all-time) · 2026-04-27
Prediction market · Manifold Markets — platform calibration · Calibration curve (predicted prob vs. observed frequency) · 2026-04-27
AI models · GPT-4 (OpenAI) — pre-RLHF vs post-RLHF calibration · Expected Calibration Error (ECE) on multiple-choice benchmarks · 2023-03-15
Analyst firms · Sell-side equity analysts — earnings forecast accuracy · Systematic optimism + analyst-disagreement-vs-error correlation (proper-scoring-rule analogue for point forecasts) · 2011-06-30
Scientific papers · Open Science Collaboration — psychological science replication rate · Replication rate + effect-size shrinkage · 2015-08-28
AI models · Anthropic — Claude / language model self-knowledge · P(IK) — probability the model assigns to "I know the answer"; P(True) — calibration of confidence in own answers · 2022-07-11
Scientific papers · Camerer et al. — social science experiment replication (Nature/Science 2010-2015) · Replication rate + median effect-size shrinkage · 2018-08-27
Analyst firms · Federal Reserve Survey of Professional Forecasters — GDP / inflation accuracy · Real-time forecast error vs. final-revised outcome (RMSE per horizon; coverage of probability ranges) · 2026-04-27
Scientific papers · Hausfather et al. — climate model projections vs. observed warming · Implied transient climate response error; observed-vs-projected warming · 2020-01-04
Review platforms · deferred — calibration-specific public studies on aggregated review outcomes are sparse; coverage planned for Phase 1

Machine-readable exports: JSON · BibTeX (CC-BY-4.0; the compilation only — individual papers retain their own copyright).
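
The export schema itself is not published on this page; the following is a hypothetical record showing how one cited finding might be serialized, with field names that are illustrative assumptions only.

```python
import json

# Hypothetical shape of one ledger entry in the JSON export; field names are
# illustrative assumptions, not the published schema.
entry = {
    "source_class": "Human forecasters",
    "source": "Good Judgment Project Superforecasters",
    "metric": "Mean Brier score",
    "reported_value": "~0.25 (vs. 0.37 control group)",
    "measured": "2014-12-31",
    "citation": "Mellers et al. (2015), J. Exp. Psychol.: Applied 21(1), 1-14",
    "source_url": "https://doi.org/10.1037/xap0000040",
    "independently_recomputed": False,   # Beta: cited findings only
    "license": "CC-BY-4.0 (compilation only)",
}

print(json.dumps(entry, indent=2))
```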

Cited findings — full detail

Human forecasters

Good Judgment Project Superforecasters

Metric
Mean Brier score
Reported value
≈ 0.25 (vs. 0.37 control group)
Context
Across the IARPA Aggregative Contingent Estimation forecasting tournament (2011–2014); superforecasters were the top 2% of forecasters, identified by year-1 accuracy and trained in probabilistic reasoning.
Citation
Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S. E., Ungar, L., Bishop, M. M., Horowitz, M., Merkle, E., & Tetlock, P. (2015). The psychology of intelligence analysis: Drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 21(1), 1–14.
Source URL
https://doi.org/10.1037/xap0000040
Measured
2014-12-31

Forecaster aggregator platform

Metaculus community-prediction aggregate

Metric
Brier score (binary questions, all-time)
Reported value
public — reported on Metaculus track-record page
Context
Metaculus’s aggregated community prediction across all resolved binary questions on the platform. Metaculus publishes its own track record openly; specific time-windowed Brier scores vary, and the platform’s methodology and live numbers are public at the citation URL.
Citation
Metaculus, Track Record + Scoring Methodology (publicly maintained dashboard).
Source URL
https://www.metaculus.com/questions/track-record/
Measured
2026-04-27

Prediction market

Manifold Markets — platform calibration

Metric
Calibration curve (predicted prob vs. observed frequency)
Reported value
public — Manifold publishes a live calibration plot of all resolved binary markets
Context
Manifold Markets publishes a live calibration plot showing market closing probability vs. observed YES-fraction across all resolved binary markets. As of mid-2025, the plot is visually well-calibrated to within roughly ±5 percentage points across the 10–90% probability range. (A minimal binning sketch of this metric follows this entry.)
Citation
Manifold Markets, public Calibration Plot.
Source URL
https://manifold.markets/calibration
Measured
2026-04-27
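
The calibration-curve metric above reduces to a simple binning exercise: group resolved binary markets by closing probability and compare the mean forecast in each bin to the observed YES-fraction. The sketch below uses synthetic markets; Manifold’s actual bin edges and inclusion rules are not described on the cited page.

```python
import numpy as np

def calibration_curve(p_close, resolved_yes, n_bins=10):
    """For resolved binary markets: mean closing probability vs. observed YES-fraction per bin."""
    p = np.asarray(p_close, float)
    y = np.asarray(resolved_yes, float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    curve = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            curve.append((p[mask].mean(), y[mask].mean(), int(mask.sum())))
    return curve  # list of (mean forecast, observed frequency, market count)

# Synthetic example: 5,000 resolved markets whose outcomes match their closing odds.
rng = np.random.default_rng(1)
p_close = rng.uniform(0.02, 0.98, 5000)
resolved_yes = rng.random(5000) < p_close
for f, o, n in calibration_curve(p_close, resolved_yes):
    print(f"forecast {f:.2f}  observed {o:.2f}  n={n}")
```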

AI models

GPT-4 (OpenAI) — pre-RLHF vs post-RLHF calibration

Metric
Expected Calibration Error (ECE) on multiple-choice benchmarks
Reported value
pre-RLHF: well-calibrated; post-RLHF: degraded calibration (per OpenAI’s own measurement)
Context
OpenAI’s GPT-4 Technical Report explicitly reports that the base GPT-4 model is well-calibrated on multiple-choice benchmarks (calibration plot in §3.2 of the report), and that RLHF post-training degraded calibration. This is a rare publisher-acknowledged calibration finding for a frontier LLM. (A minimal ECE sketch follows this entry.)
Citation
OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774. §3.2 “Calibration”.
Source URL
https://arxiv.org/abs/2303.08774
Measured
2023-03-15
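
Expected Calibration Error is the occupancy-weighted gap between confidence and accuracy across confidence bins. Below is a minimal sketch under an assumed ten-bin scheme; OpenAI’s exact binning and benchmark details are in the cited report and are not reproduced here.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: |accuracy - mean confidence| per confidence bin, weighted by bin occupancy.
    `confidence` is the probability assigned to the chosen answer; `correct` is 0/1."""
    c = np.asarray(confidence, float)
    k = np.asarray(correct, float)
    bins = np.clip((c * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(k[mask].mean() - c[mask].mean())
    return ece

# Illustrative only: systematic overconfidence inflates ECE.
rng = np.random.default_rng(2)
true_acc = rng.uniform(0.3, 0.9, 2000)
correct = rng.random(2000) < true_acc
print(expected_calibration_error(true_acc, correct))                          # calibrated: small ECE
print(expected_calibration_error(np.minimum(true_acc + 0.2, 1.0), correct))   # overconfident: larger ECE
```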

Analyst firms

Sell-side equity analysts — earnings forecast accuracy

Metric
Systematic optimism + analyst-disagreement-vs-error correlation (proper-scoring-rule analogue for point forecasts)
Reported value
public — survey of decades of empirical work
Context
A widely cited literature review of decades of empirical work on sell-side analyst earnings forecasts. Findings include: forecasts are systematically optimistic, with optimism declining as the forecast horizon shortens; recommendations carry informational content for investors only when conditioned on forecast revision history; and consensus disagreement among analysts is a useful proxy for forecast uncertainty (a calibration-adjacent property).
Citation
Bradshaw, M. T. (2011). Analysts’ Forecasts: What Do We Know After Decades of Work? Working paper, Boston College Carroll School of Management.
Source URL
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1880339
Measured
2011-06-30

Scientific papers

Open Science Collaboration — psychological science replication rate

Metric
Replication rate + effect-size shrinkage
Reported value
36% of replications produced a statistically significant result (vs. 97% in originals); mean effect size halved on replication
Context
Landmark large-scale replication of 100 psychology experiments published in three top journals. Findings provide a base rate against which any future per-paper or per-journal calibration claim must be evaluated. Comparable replication studies in economics (Camerer et al. 2016) and biomedical sciences are cited in the original paper for cross-discipline context.
Citation
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science 349 (6251), aac4716.
Source URL
https://doi.org/10.1126/science.aac4716
Measured
2015-08-28

AI models

Anthropic — Claude / language model self-knowledge

Metric
P(IK) — probability the model assigns to "I know the answer"; P(True) — calibration of confidence in own answers
Reported value
large language models are well-calibrated on their own knowledge, with calibration improving with model scale
Context
Anthropic study finding that base language models are well-calibrated on whether they know the answer to a question (P(IK)) and on whether their answers are true (P(True)). This is a calibration-adjacent finding for AI models: not predictive forecasting per se, but the same proper-scoring-rule machinery applied to model self-confidence on factual questions.
Citation
Kadavath, S., Conerly, T., Askell, A., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.
Source URL
https://arxiv.org/abs/2207.05221
Measured
2022-07-11

Scientific papers

Camerer et al. — social science experiment replication (Nature/Science 2010-2015)

Metric
Replication rate + median effect-size shrinkage
Reported value
13 of 21 social science experiments replicated (62%); average effect size 50% of original
Context
Companion study to the Open Science Collaboration 2015 effort, focused on the 21 social-behavioral experiments published in Nature and Science 2010-2015 that met inclusion criteria. Higher replication rate than psychology overall (62% vs 36%), but effect sizes still systematically shrank — a base rate for any per-paper Phase 1 scoring of social-science publications.
Citation
Camerer, C. F., Dreber, A., Holzmeister, F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour 2, 637–644.
Source URL
https://doi.org/10.1038/s41562-018-0399-z
Measured
2018-08-27

Analyst firms

Federal Reserve Survey of Professional Forecasters — GDP / inflation accuracy

Metric
Real-time forecast error vs. final-revised outcome (RMSE per horizon; coverage of probability ranges)
Reported value
public — Philadelphia Fed maintains historical SPF data + accuracy reports back to 1968
Context
The Federal Reserve Bank of Philadelphia’s Survey of Professional Forecasters is the longest-running quarterly survey of US macroeconomic forecasts. The Philadelphia Fed publishes per-horizon forecast accuracy statistics (RMSE for point forecasts; coverage of probability ranges for binned probability questions such as recession in the next 4 quarters). A cross-vertical Phase 1 reference for analyst-class calibration. (A per-horizon RMSE sketch follows this entry.)
Citation
Federal Reserve Bank of Philadelphia, Survey of Professional Forecasters — Documentation and Forecast Accuracy.
Source URL
https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/survey-of-professional-forecasters
Measured
2026-04-27
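
Per-horizon RMSE is straightforward once real-time forecasts are paired with final-revised outcomes. A sketch with hypothetical numbers follows; the Philadelphia Fed publishes its own pre-computed error statistics, so this is only a shape-of-the-computation illustration.

```python
import numpy as np

def rmse_by_horizon(records):
    """records: iterable of (horizon_quarters, forecast, final_revised_outcome).
    Returns {horizon: RMSE of forecast errors at that horizon}."""
    by_h = {}
    for h, forecast, actual in records:
        by_h.setdefault(h, []).append(forecast - actual)
    return {h: float(np.sqrt(np.mean(np.square(errs)))) for h, errs in sorted(by_h.items())}

# Hypothetical real-GDP-growth forecasts (annualized %) vs. final-revised outcomes.
records = [
    (1, 2.1, 2.4), (1, 1.8, 1.5), (1, 3.0, 2.9),
    (4, 2.5, 1.1), (4, 2.0, 3.2), (4, 2.8, 0.9),
]
print(rmse_by_horizon(records))  # error typically grows with horizon
```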

Scientific papers

Hausfather et al. — climate model projections vs. observed warming

Metric
Implied transient climate response error; observed-vs-projected warming
Reported value
14 of 17 surveyed climate models from 1970–2007 produced projections within natural-variability range of subsequent observed warming when adjusted for actual emissions
Context
Evaluation of how well climate model projections published 1970-2007 actually tracked observed global mean surface temperature in the years following. Once corrected for actual greenhouse-gas emissions (which differed from modelers’ assumed emissions), most models were skillful. A landmark finding for scoring scientific model projections — directly applicable to AI-model calibration analogues.
Citation
Hausfather, Z., Drake, H. F., Abbott, T., & Schmidt, G. A. (2020). Evaluating the Performance of Past Climate Model Projections. Geophysical Research Letters 47(1), e2019GL085378.
Source URL
https://doi.org/10.1029/2019GL085378
Measured
2020-01-04

What Phase 1 launch will add

  • Independent recomputation of each cited finding using the original outcome data (not the publisher’s own scoring), under data-licensing agreements with the source platforms.
  • Time-windowed per-source breakdowns: rolling 3-month, 12-month, and lifetime calibration curves with confidence intervals.
  • Cross-domain calibration: how well a forecaster who scores high on AI predictions calibrates on geopolitics, markets, weather, etc.
  • Append-only timestamp anchoring of every score, so retroactive “I-predicted-this-all-along” revisions are visible (one possible mechanism is sketched after this list).
  • Authoritative ranking + per-source citation pages for AI labs, regulators, and academic publishers — the Phase 1 enterprise product.
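
One possible mechanism for the append-only anchoring above, sketched as a simple hash chain. This is an assumption about implementation, not a published design: each new score record is hashed together with the previous record’s hash, so editing any earlier entry invalidates every hash that follows it.

```python
import hashlib
import json
from datetime import datetime, timezone

def append_record(ledger, record):
    """Append a score record to a hash-chained ledger (assumed mechanism, not the published design)."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    body = {
        "record": record,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
    }
    body["hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    ledger.append(body)
    return body

def verify(ledger):
    """Recompute every hash; returns False if any entry was altered after the fact."""
    prev = "0" * 64
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True

ledger = []
append_record(ledger, {"source": "example", "brier": 0.21})
append_record(ledger, {"source": "example", "brier": 0.19})
print(verify(ledger))                      # True
ledger[0]["record"]["brier"] = 0.05        # retroactive "improvement"
print(verify(ledger))                      # False: the chain exposes the edit
```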

Related

  • Methodology v1.1 — full Brier + Murphy + append-only framework
  • Track record — operator’s own dated forecasts, scored with this engine
  • Source classes — the 6 source-type classes Phase 1 will score
  • Roadmap — milestone status + Q3 2027 launch gate + kill criterion
  • Partners — design-partner recruitment for AI labs / regulators / academics

Last verified: 2026-04-27. Page version 0.1 (beta scaffold; cited findings only; independent recomputation pending Phase 1). Operator: Paulo de Vries. Contact: contact@calibrationledger.com.