Research

Calibration Ledger is grounded in 75 years of forecasting science — meteorology, political prediction, AI evaluation, and replication-crisis literature. This page is the citation map: the foundational works the methodology depends on, the active research adjacent to it, and the data sources we draw from.

#Foundational works

Three primary references. Every Calibration Ledger score traces to one or more of these. The full BibTeX is at /api/methodology.bib.

  • Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather Review 78(1), 1-3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2
    The original definition of the Brier score. Two pages of mathematics that quietly shaped how meteorologists, then political forecasters, then AI evaluators would think about probabilistic accuracy for the next 75 years.
  • Murphy, A. H. (1973). A New Vector Partition of the Probability Score. Journal of Applied Meteorology 12, 595-600. doi:10.1175/1520-0450(1973)012<0595:ANVPOT>2.0.CO;2
    Decomposition of the Brier score into Reliability − Resolution + Uncertainty. The reason “why is this source good or bad?” becomes answerable per source rather than collapsing into a single opaque number. A worked sketch follows this list.
  • Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown Publishers. ISBN 978-0804136693.
    The empirical case that calibration is a real, measurable, trainable skill — not an inherent property of expertise. Operationalised by the IARPA Good Judgment Project and its public successor, the Good Judgment Open platform.
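
To make the first two references concrete, a minimal Python sketch of the Brier score and Murphy's decomposition on toy data. Function names and the example data are ours; the decomposition identity is exact when forecasts are grouped by their distinct values, which is the convention used here.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Brier (1950): mean squared error between probability forecasts
    and binary 0/1 outcomes. Lower is better; 0 is perfect."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

def murphy_decomposition(forecasts, outcomes):
    """Murphy (1973): Brier = reliability - resolution + uncertainty,
    with forecasts grouped by their distinct values."""
    n = len(forecasts)
    groups = defaultdict(list)  # forecast value -> observed outcomes
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)
    base_rate = sum(outcomes) / n
    reliability = sum(len(obs) * (f - sum(obs) / len(obs)) ** 2
                      for f, obs in groups.items()) / n
    resolution = sum(len(obs) * (sum(obs) / len(obs) - base_rate) ** 2
                     for obs in groups.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty

# Toy data: the three terms recompose exactly into the Brier score.
forecasts = [0.9, 0.9, 0.7, 0.7, 0.3, 0.1]
outcomes  = [1,   1,   1,   0,   0,   0]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
assert abs(brier_score(forecasts, outcomes) - (rel - res + unc)) < 1e-12
```

Reliability (lower is better) is the miscalibration term; resolution (higher is better) rewards forecasts that separate events from non-events; uncertainty is the base rate's irreducible contribution.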

#Active research adjacent to Calibration Ledger

Calibration Ledger is not a research project. But every published score will cite the active literature. Areas we read continuously:

AI model calibration + factuality

  • Kadavath et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221
  • Lin, Hilton, & Evans (2022). TruthfulQA: Measuring How Models Mimic Human Falsehoods. ACL.
  • OpenAI (2023). GPT-4 Technical Report — calibration discussion in the appendix.
  • Anthropic (2024-25). Claude system card series — published hallucination + calibration metrics (a common such metric is sketched after this list).
  • HELM benchmark (Stanford CRFM). crfm.stanford.edu/helm
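
The metric sketched below, expected calibration error (ECE), is a common summary in this literature; the exact binning scheme varies by paper, and the function name and ten-bin default here are our own choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by stated confidence, then take the
    size-weighted average gap between mean confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(mean_conf - accuracy)
    return ece

# A perfectly calibrated source: stated confidence matches the hit rate.
print(expected_calibration_error([0.8, 0.8, 0.8, 0.8, 0.8],
                                 [1, 1, 1, 1, 0]))  # 0.0
```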

Human + market forecasting

  • Tetlock & Mellers, IARPA Good Judgment Project (2011-2015) — superforecaster identification.
  • Metaculus (2015-present). metaculus.com/questions/ — open forecasting platform with public scoring data.
  • Good Judgment Open. gjopen.com
  • Manifold Markets. manifold.markets — play-money prediction market with extensive open data.
  • Hanson, R. (2003). Combinatorial Information Market Design. Information Systems Frontiers 5(1). The market scoring rule behind automated prediction-market pricing; see the sketch after this list.
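
The Hanson paper is the market-design foundation beneath platforms like these. A minimal sketch of the logarithmic market scoring rule (LMSR) from that line of work, with our own function names and an illustrative liquidity parameter:

```python
import math

def lmsr_cost(shares, b=100.0):
    """LMSR cost function C(q) = b * ln(sum_i exp(q_i / b)).
    The liquidity parameter b caps the market maker's worst-case
    loss at b * ln(number of outcomes)."""
    return b * math.log(sum(math.exp(q / b) for q in shares))

def lmsr_prices(shares, b=100.0):
    """Instantaneous prices: the softmax of outstanding shares. Each
    price is the implied probability of that outcome; they sum to 1."""
    exps = [math.exp(q / b) for q in shares]
    total = sum(exps)
    return [e / total for e in exps]

def trade_cost(shares, outcome, amount, b=100.0):
    """Cost of buying `amount` shares of one outcome: C(after) - C(before)."""
    after = list(shares)
    after[outcome] += amount
    return lmsr_cost(after, b) - lmsr_cost(shares, b)

# A fresh binary market opens at 50/50; buying YES moves the price up.
q = [0.0, 0.0]
print(lmsr_prices(q))            # [0.5, 0.5]
print(trade_cost(q, 0, 50.0))    # ~28.1 for 50 YES shares at b = 100
print(lmsr_prices([50.0, 0.0]))  # YES price rises above 0.5
```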

Scientific replication + analyst calibration

  • Open Science Collaboration (2015). Estimating the Reproducibility of Psychological Science. Science 349.
  • Camerer et al. (2018). Evaluating the Replicability of Social Science Experiments in Nature and Science. Nature Human Behaviour.
  • Center for Open Science. cos.io
  • Ellsberg, D. (1961). Risk, Ambiguity, and the Savage Axioms. Quarterly Journal of Economics 75(4) — a foundation for understanding analyst confidence.

#Data sources we draw from

Public-domain or open-API data sources we use today (operator-led, no licensing required) — and the data partnerships we’re seeking for the 2027 launch (see /partners/):

In use today (public domain / open API)

  • SEC EDGAR filings (13F, 13D, 13G, 10-K, 10-Q) — analyst-firm position data, public domain. Already underlying HoldLens.
  • Metaculus public API — community-forecast data plus community Brier scores per question.
  • Manifold Markets public API — play-money market data with full history (see the sketch after this list).
  • arXiv preprints — for replication-status tracking via downstream citations.
  • Federal Reserve / ECB economic forecasts — central-bank Survey of Professional Forecasters data, public.
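
As an illustration of how low the barrier to these sources is, a sketch of pulling live markets from Manifold's public API using only the Python standard library. The endpoint and field names (question, probability) match the published v0 API at the time of writing but should be treated as assumptions that can drift; Metaculus exposes a comparable public endpoint.

```python
import json
import urllib.request

# Manifold's open REST API (docs.manifold.markets). No key required.
URL = "https://api.manifold.markets/v0/markets?limit=5"

with urllib.request.urlopen(URL) as resp:
    markets = json.load(resp)

for m in markets:
    # Binary markets expose a current implied probability; other
    # market types may omit the field, hence the defensive .get().
    prob = m.get("probability")
    print(f"{m['question'][:60]:<60}  p={prob}")
```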

Sought for 2027 launch (data licensing — see Roadmap)

  • Good Judgment Inc. — IARPA-derived superforecaster track-record data
  • AI-eval consortia — third-party benchmark results across labs
  • Equity-research analyst databases (FactSet / Refinitiv / I/B/E/S) — analyst price-target accuracy series
  • EU AI Act Article 50 transparency disclosures — once operators begin public reporting (mid-2026 onward)

#Contribute or advise

If you publish on calibration, scoring rules, AI evaluation, or forecasting science and would consider an advisory or co-author role on Calibration Ledger’s methodology v2.0 (peer-review draft 2026-Q4): email contact@calibrationledger.com with the subject line “Methodology advisor inquiry” and a link to your publication record.