Beta · cited findings · not independently recomputed
Calibration Ledger Beta
Last updated: 2026-04-27
A pre-launch preview of the registry: the methodology (Brier 1950 + Murphy 1973), applied here to third-party-published calibration measurements across forecaster platforms, prediction markets, and AI models. Each entry below cites a public, verifiable source URL.
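For readers who want the two scoring references concrete: a minimal sketch of the Brier (1950) score and the Murphy (1973) reliability/resolution/uncertainty decomposition, in plain Python on synthetic data. Illustrative only; this is not the registry's production scoring code.

```python
import numpy as np

def brier_score(p, y):
    """Brier (1950): mean squared error of probability forecasts against
    binary outcomes. Lower is better; always forecasting 0.5 scores 0.25."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return np.mean((p - y) ** 2)

def murphy_decomposition(p, y, n_bins=10):
    """Murphy (1973): Brier = reliability - resolution + uncertainty,
    computed over forecasts grouped into probability bins."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    base = y.mean()
    uncertainty = base * (1 - base)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.mean()                          # fraction of forecasts in bin
        p_bar, y_bar = p[mask].mean(), y[mask].mean()
        reliability += w * (p_bar - y_bar) ** 2  # calibration gap per bin
        resolution += w * (y_bar - base) ** 2    # how much bins separate outcomes
    return reliability, resolution, uncertainty

# 1000 synthetic forecasts from a perfectly calibrated source
rng = np.random.default_rng(0)
p = rng.uniform(size=1000)
y = (rng.uniform(size=1000) < p).astype(float)
rel, res, unc = murphy_decomposition(p, y)
print(brier_score(p, y), rel - res + unc)  # the two should roughly agree
```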
What “Beta” means here
- Each entry is a calibration finding published by the source itself or by peer-reviewed research. Calibration Ledger has not independently recomputed these scores.
- Independent recomputation, time-windowed per-source breakdowns, cross-domain calibration curves, and authoritative ranking require the data-licensing + academic-co-author + design-partner-LOI prerequisites (roadmap).
- The operator’s own dated forecasts (the methodology applied at scale to one identified forecaster) are public at /track-record/.
- The Q3 2027 Phase 1 launch decision is gated by 4 prerequisites; if fewer than 3 are met by Q4 2027, the kill criterion fires (sunset / sell / publicly document why it didn’t work).
At-a-glance (10 findings · 5 of 6 source classes covered)
| Source class | Source | Metric | Measured (date) |
|---|---|---|---|
| Human forecasters | Good Judgment Project Superforecasters | Mean Brier score | 2014-12-31 |
| Forecaster aggregator platform | Metaculus community-prediction aggregate | Brier score (binary questions, all-time) | 2026-04-27 |
| Prediction market | Manifold Markets — platform calibration | Calibration curve (predicted prob vs. observed frequency) | 2026-04-27 |
| AI models | GPT-4 (OpenAI) — pre-RLHF vs post-RLHF calibration | Expected Calibration Error (ECE) on multiple-choice benchmarks | 2023-03-15 |
| Analyst firms | Sell-side equity analysts — earnings forecast accuracy | Systematic optimism + analyst-disagreement-vs-error correlation (proper-scoring-rule analogue for point forecasts) | 2011-06-30 |
| Scientific papers | Open Science Collaboration — psychological science replication rate | Replication rate + effect-size shrinkage | 2015-08-28 |
| AI models | Anthropic — Claude / language model self-knowledge | P(IK) — probability the model assigns to "I know the answer"; P(True) — calibration of confidence in own answers | 2022-07-11 |
| Scientific papers | Camerer et al. — social science experiment replication (Nature/Science 2010-2015) | Replication rate + median effect-size shrinkage | 2018-08-27 |
| Analyst firms | Federal Reserve Survey of Professional Forecasters — GDP / inflation accuracy | Real-time forecast error vs. final-revised outcome (RMSE per horizon; coverage of probability ranges) | 2026-04-27 |
| Scientific papers | Hausfather et al. — climate model projections vs. observed warming | Implied transient climate response error; observed-vs-projected warming | 2020-01-04 |
| Review platforms | deferred — calibration-specific public studies of aggregated review outcomes are sparse; coverage planned for Phase 1 | — | — |
Machine-readable exports: JSON · BibTeX (CC-BY-4.0; the compilation only — individual papers retain their own copyright).
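The export schema itself is not shown on this page. As a sketch only, one plausible record shape, serialized from Python; every field name below is an illustrative assumption, not the published schema:

```python
import json

# Hypothetical export record for one cited finding.
record = {
    "source_class": "Human forecasters",
    "source": "Good Judgment Project Superforecasters",
    "metric": "Mean Brier score",
    "reported_value": "≈ 0.25 (vs. 0.37 for the control group)",
    "measured": "2014-12-31",
    "citation": "Mellers et al. (2015), J. Exp. Psychol.: Applied, 21(1), 1–14.",
    "source_url": "https://doi.org/10.1037/xap0000040",
    "recomputed_independently": False,  # remains False until Phase 1
    "license": "CC-BY-4.0 (compilation only)",
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```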
Cited findings — full detail
Human forecasters
Good Judgment Project Superforecasters
- Metric: Mean Brier score
- Reported value: ≈ 0.25 (vs. 0.37 for the control group)
- Context: Across the IARPA Aggregative Contingent Estimation forecasting tournament (2011–2014); superforecasters were the top 2% of forecasters, identified by year-1 accuracy and trained in probabilistic reasoning.
- Citation: Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S. E., Ungar, L., Bishop, M. M., Horowitz, M., Merkle, E., & Tetlock, P. (2015). The psychology of intelligence analysis: Drivers of prediction accuracy in world politics. Journal of Experimental Psychology: Applied, 21(1), 1–14.
- Source URL: https://doi.org/10.1037/xap0000040
- Measured: 2014-12-31
Forecaster aggregator platform
Metaculus community-prediction aggregate
- Metric: Brier score (binary questions, all-time)
- Reported value: public — reported on the Metaculus track-record page
- Context: Metaculus’s aggregated community prediction across all resolved binary questions on the platform. Metaculus publishes its own track record openly; time-windowed Brier scores vary, and the platform’s methodology and live numbers are public at the citation URL.
- Citation: Metaculus, Track Record + Scoring Methodology (publicly maintained dashboard).
- Source URL: https://www.metaculus.com/questions/track-record/
- Measured: 2026-04-27
Prediction market
Manifold Markets — platform calibration
- Metric: Calibration curve (predicted probability vs. observed frequency)
- Reported value: public — Manifold publishes a live calibration plot of all resolved binary markets
- Context: Manifold Markets publishes a live calibration plot of market closing probability vs. observed YES-fraction across all resolved binary markets. As of mid-2025 it was visually well-calibrated to within roughly ±5 percentage points across the 10–90% probability range. (A minimal sketch of how such a curve is computed follows this entry.)
- Citation: Manifold Markets, public Calibration Plot.
- Source URL: https://manifold.markets/calibration
- Measured: 2026-04-27
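A plot like Manifold’s can be reproduced from any set of resolved binary markets. A minimal sketch, assuming arrays of closing probabilities and YES/NO resolutions; this is not Manifold’s own code:

```python
import numpy as np

def calibration_curve(p_close, resolved_yes, n_bins=10):
    """Bin market closing probabilities and compare each bin's mean
    predicted probability to the observed YES-fraction of its markets."""
    p = np.asarray(p_close, float)
    y = np.asarray(resolved_yes, float)
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((p[mask].mean(), y[mask].mean(), int(mask.sum())))
    # (mean predicted, observed frequency, n markets) per bin;
    # a well-calibrated source keeps |predicted - observed| small in every bin.
    return rows
```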
AI models
GPT-4 (OpenAI) — pre-RLHF vs post-RLHF calibration
- Metric: Expected Calibration Error (ECE) on multiple-choice benchmarks
- Reported value: pre-RLHF: well-calibrated; post-RLHF: degraded calibration (per OpenAI’s own measurement)
- Context: OpenAI’s GPT-4 Technical Report explicitly reports that the pre-trained base model is well-calibrated on multiple-choice benchmarks (calibration plot in §3.2) and that RLHF post-training degraded that calibration. This is a rare publisher-acknowledged calibration finding for a frontier LLM. (A minimal ECE sketch follows this entry.)
- Citation: OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774. §3.2 “Calibration”.
- Source URL: https://arxiv.org/abs/2303.08774
- Measured: 2023-03-15
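ECE, the metric named above, is the bin-weighted average gap between stated confidence and realized accuracy. A minimal sketch on synthetic multiple-choice-style data; this is not OpenAI’s evaluation code:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: weighted average of |accuracy - mean confidence| per
    confidence bin, weighted by the fraction of answers in each bin."""
    c = np.asarray(confidence, float)  # model's probability on its chosen answer
    y = np.asarray(correct, float)     # 1 if that answer was right, else 0
    bins = np.minimum((c * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - c[mask].mean())
    return ece  # 0 = perfectly calibrated confidence
```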
Analyst firms
Sell-side equity analysts — earnings forecast accuracy
- Metric: Systematic optimism + analyst-disagreement-vs-error correlation (proper-scoring-rule analogue for point forecasts)
- Reported value: public — survey of decades of empirical work
- Context: A widely cited review of decades of empirical work on sell-side analyst earnings forecasts. Findings include: forecasts are systematically optimistic; optimism declines as the forecast horizon shortens; recommendations carry informational content for investors only when conditioned on forecast-revision history; and consensus disagreement among analysts is a useful proxy for forecast uncertainty (a calibration-adjacent property).
- Citation: Bradshaw, M. T. (2011). Analysts’ Forecasts: What Do We Know After Decades of Work? Working paper, Boston College Carroll School of Management.
- Source URL: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1880339
- Measured: 2011-06-30
Scientific papers
Open Science Collaboration — psychological science replication rate
- Metric: Replication rate + effect-size shrinkage
- Reported value: 36% of replications produced a statistically significant result (vs. 97% of originals); mean effect size halved on replication
- Context: Landmark large-scale replication of 100 psychology experiments published in three top journals. The findings provide a base rate against which any future per-paper or per-journal calibration claim must be evaluated; comparable replication studies in economics (Camerer et al. 2016) and the biomedical sciences are cited in the original paper for cross-discipline context.
- Citation: Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
- Source URL: https://doi.org/10.1126/science.aac4716
- Measured: 2015-08-28
AI models
Anthropic — Claude / language model self-knowledge
- Metric: P(IK) — probability the model assigns to “I know the answer”; P(True) — calibration of the model’s confidence in its own answers
- Reported value: large language models are well-calibrated on their own knowledge, with calibration improving with model scale
- Context: Anthropic study finding that base language models are well-calibrated both on whether they know the answer to a question (P(IK)) and on whether their stated answers are true (P(True)). This is a calibration-adjacent finding for AI models: not predictive forecasting per se, but the same proper-scoring-rule machinery applied to model self-confidence on factual questions. (A minimal sketch follows this entry.)
- Citation: Kadavath, S., Conerly, T., Askell, A., et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221.
- Source URL: https://arxiv.org/abs/2207.05221
- Measured: 2022-07-11
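The same machinery applies directly: treat P(True) as a probability forecast of the answer’s own correctness and score it with a proper scoring rule. A minimal sketch with hypothetical values; Kadavath et al. use their own evaluation pipeline, and nothing below is from the paper:

```python
import numpy as np

# p_true[i]: probability the model assigns to "my answer to question i is true"
# correct[i]: 1 if that answer was in fact correct (graded externally)
p_true = np.array([0.9, 0.7, 0.95, 0.4, 0.8])   # hypothetical values
correct = np.array([1, 1, 1, 0, 1], dtype=float)

brier = np.mean((p_true - correct) ** 2)  # proper-scoring-rule check
print(f"Brier score of P(True): {brier:.3f}")  # low = well-calibrated self-knowledge
```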
Scientific papers
Camerer et al. — social science experiment replication (Nature/Science 2010–2015)
- Metric: Replication rate + median effect-size shrinkage
- Reported value: 13 of 21 social-science experiments replicated (62%); replication effect sizes averaged about 50% of the originals
- Context: Companion study to the Open Science Collaboration 2015 effort, focused on the 21 social-behavioral experiments published in Nature and Science (2010–2015) that met the inclusion criteria. The replication rate was higher than in psychology overall (62% vs. 36%), but effect sizes still systematically shrank — a base rate for any per-paper Phase 1 scoring of social-science publications.
- Citation: Camerer, C. F., Dreber, A., Holzmeister, F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2, 637–644.
- Source URL: https://doi.org/10.1038/s41562-018-0399-z
- Measured: 2018-08-27
Analyst firms
Federal Reserve Survey of Professional Forecasters — GDP / inflation accuracy
- Metric: Real-time forecast error vs. final-revised outcome (RMSE per horizon; coverage of probability ranges)
- Reported value: public — the Philadelphia Fed maintains historical SPF data and accuracy reports back to 1968
- Context: The Federal Reserve Bank of Philadelphia’s Survey of Professional Forecasters is the longest-running quarterly survey of US macroeconomic forecasts. The Philadelphia Fed publishes per-horizon forecast-accuracy statistics (RMSE for point forecasts; probability-range coverage for binned probability questions such as recession within the next four quarters). It serves as a cross-vertical Phase 1 reference for analyst-class calibration. (A minimal sketch of both statistics follows this entry.)
- Citation: Federal Reserve Bank of Philadelphia, Survey of Professional Forecasters — Documentation and Forecast Accuracy.
- Source URL: https://www.philadelphiafed.org/surveys-and-data/real-time-data-research/survey-of-professional-forecasters
- Measured: 2026-04-27
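Both statistics are straightforward to compute from the SPF real-time files. A minimal sketch under assumed array shapes; this is not the Philadelphia Fed’s code, and the variable names are illustrative:

```python
import numpy as np

def rmse_by_horizon(forecasts, outcomes):
    """forecasts: (n_dates, n_horizons) point forecasts made in real time;
    outcomes: (n_dates,) final-revised realized values. Returns RMSE per horizon."""
    err = np.asarray(forecasts, float) - np.asarray(outcomes, float)[:, None]
    return np.sqrt(np.mean(err ** 2, axis=0))

def interval_coverage(lower, upper, outcomes):
    """Fraction of outcomes falling inside the stated probability range;
    compare this to the nominal probability to check calibration."""
    o = np.asarray(outcomes, float)
    return np.mean((o >= np.asarray(lower)) & (o <= np.asarray(upper)))
```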
Scientific papers
Hausfather et al. — climate model projections vs. observed warming
- Metric: Implied transient climate response error; observed vs. projected warming
- Reported value: 14 of 17 surveyed climate-model projections published 1970–2007 fell within the natural-variability range of subsequently observed warming once adjusted for actual emissions
- Context: Evaluation of how well climate-model projections published between 1970 and 2007 tracked observed global mean surface temperature in the years that followed. Once corrected for actual greenhouse-gas emissions (which differed from the emissions modelers assumed), most models were skillful. A landmark finding for scoring scientific model projections — directly applicable to AI-model calibration analogues.
- Citation: Hausfather, Z., Drake, H. F., Abbott, T., & Schmidt, G. A. (2020). Evaluating the Performance of Past Climate Model Projections. Geophysical Research Letters, 47(1), e2019GL085378.
- Source URL: https://doi.org/10.1029/2019GL085378
- Measured: 2020-01-04
What Phase 1 launch will add
- Independent recomputation of each cited finding using the original outcome data (not the publisher’s own scoring), under data-licensing agreements with the source platforms.
- Time-windowed per-source breakdowns: rolling 3-month, 12-month, and lifetime calibration curves with confidence intervals.
- Cross-domain calibration: how well a forecaster who scores high on AI predictions calibrates on geopolitics, markets, weather, etc.
- Append-only timestamp anchoring of every score, so retroactive “I-predicted-this-all-along” revisions are visible (one possible implementation is sketched after this list).
- Authoritative ranking + per-source citation pages for AI labs, regulators, and academic publishers — the Phase 1 enterprise product.
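On the append-only anchoring bullet above: one way to realize it, presented as an assumption for illustration rather than the project’s actual design, is a SHA-256 hash chain over timestamped score records:

```python
import hashlib
import json
import time

def append_score(ledger, score_record):
    """Append-only anchoring sketch: each entry hashes (prev_hash, timestamp,
    record), so any retroactive edit breaks every downstream hash."""
    prev_hash = ledger[-1]["hash"] if ledger else "0" * 64
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "record": score_record,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(payload).hexdigest()
    ledger.append(entry)
    return entry

ledger = []
append_score(ledger, {"source": "example", "brier": 0.21})
# Verifiers recompute each hash from its predecessor; a silently revised
# score changes the chain from that point forward and is immediately visible.
```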
Related
- Methodology v1.1 — full Brier + Murphy + append-only framework
- Track record — operator’s own dated forecasts, scored with this engine
- Source classes — the 6 source-type classes Phase 1 will score
- Roadmap — milestone status + Q3 2027 launch gate + kill criterion
- Partners — design-partner recruitment for AI labs / regulators / academics
Last verified: 2026-04-27. Page version 0.1 (beta scaffold; cited findings only; independent recomputation pending Phase 1). Operator: Paulo de Vries. Contact: contact@calibrationledger.com.