Pre-launch · illustrative · not authoritative scoring

Source Types

Calibration Ledger will score six classes of predictive sources at launch (targeted for Q3 2027, gated on prerequisites). This page describes each class and the public-domain data we currently use to demonstrate the methodology. Nothing here is an authoritative score — these are pre-launch illustrations.

Important. Until launch in Q3 2027, no individual source is publicly scored under its own name. The methodology is demonstrated on public data only. Authoritative per-source scoring requires the data-licensing prerequisite (see roadmap), and the kill criterion reserves the operator’s right to sunset the project before anyone is scored.

#1. AI Models

What we measure. Hallucination rate, factuality on time-stamped questions with knowable answers, calibration on probabilistic queries (“rate the likelihood of X from 0 to 100”).

Public data available now. HELM benchmark scores, model cards published by labs (OpenAI / Anthropic / Google), TruthfulQA + SimpleQA replication runs, OpenReview-style benchmark replications.

Time window. Per-model versioned (each major release scored separately + carries its calibration history). Re-scored quarterly as new eval-benchmark data is published.

Why it matters. EU AI Act Article 50 transparency obligations require general-purpose AI providers to publish certain capability + limitation data. Calibration scores transform self-reported model cards into third-party-comparable signal.
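The calibration check above can be sketched as follows. This is a minimal illustration, not the Ledger’s published scoring code; the function name and toy data are invented here. The idea: group a model’s stated probabilities into buckets and compare each bucket’s mean stated likelihood to the observed outcome rate.

```python
def calibration_by_bucket(preds, outcomes, n_buckets=10):
    """Group predictions into probability buckets and compare each
    bucket's mean stated probability to its observed outcome rate."""
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)  # clamp p == 1.0 into top bucket
        buckets[i].append((p, y))
    report = []
    for i, b in enumerate(buckets):
        if not b:
            continue
        stated = sum(p for p, _ in b) / len(b)
        observed = sum(y for _, y in b) / len(b)
        report.append((i, len(b), stated, observed, abs(stated - observed)))
    return report

# Toy data (invented): a slightly overconfident model.
preds    = [0.9, 0.9, 0.8, 0.3, 0.2, 0.6, 0.95, 0.1, 0.5, 0.4]
outcomes = [1,   0,   1,   0,   0,   1,   1,    0,   0,   1]
for bucket, n, stated, observed, gap in calibration_by_bucket(preds, outcomes):
    print(f"bucket {bucket}: n={n} stated={stated:.2f} observed={observed:.2f} gap={gap:.2f}")
```

A per-bucket gap table like this is the raw material behind a single headline calibration score.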

#2. Human Forecasters

What we measure. Per-forecaster Brier score across domains and time windows; reliability per probability bucket; resolution; superforecaster classification (top 2% sustained over 4+ quarters per Tetlock 2015).

Public data available now. Metaculus public-question community + top forecasters; Good Judgment Open public data; Manifold Markets community forecasters.

Time window. Rolling 12-month and lifetime; domain-stratified (geopolitical / economics / science / personal-life / etc.).
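The Brier score and its Murphy decomposition (the framework named under Methodology) connect the three quantities above: Brier = reliability − resolution + uncertainty. The sketch below is illustrative only, with invented function names and toy data; the decomposition identity is exact when forecasts are constant within each bucket, and approximate otherwise.

```python
def brier(preds, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)

def murphy_decomposition(preds, outcomes, n_buckets=10):
    """Return (reliability, resolution, uncertainty) per Murphy (1973).
    Lower reliability is better; higher resolution is better."""
    N = len(preds)
    base = sum(outcomes) / N                     # overall base rate
    buckets = {}
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)
        buckets.setdefault(i, []).append((p, y))
    rel = res = 0.0
    for b in buckets.values():
        n = len(b)
        f = sum(p for p, _ in b) / n             # mean forecast in bucket
        o = sum(y for _, y in b) / n             # outcome rate in bucket
        rel += n * (f - o) ** 2
        res += n * (o - base) ** 2
    return rel / N, res / N, base * (1 - base)
```

Two forecasters with the same Brier score can differ sharply on reliability vs. resolution, which is why the Ledger scores the components separately.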

#3. Analyst Firms

What we measure. Price-target hit rate within stated time window; rating-change accuracy (upgrades / downgrades vs. realised return); confidence calibration on conditional forecasts (“if X, then Y is likely Z%”).

Public data available now. SEC EDGAR 13F filings (institutional positions, public domain), 13D / 13G activist filings, 10-K + 10-Q analyst-mention sections. The operator’s HoldLens is a working pre-launch demonstration of structured-filings analysis.

Time window. Per-firm rolling 4 quarters + per-analyst longitudinal; per-issued-target time-to-resolution.
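A hit-rate calculation of the kind described above can be sketched as follows. All data structures here are hypothetical simplifications (real price targets carry splits, dividends, and revisions that this ignores): a target "hits" if the close crosses it in the stated direction before the horizon expires.

```python
from datetime import date, timedelta

def hit_rate(targets, prices):
    """targets: list of (issue_date, horizon_days, target_price, is_upside).
    prices: dict mapping date -> closing price.
    Returns the fraction of targets reached within their stated window."""
    hits = 0
    for issued, horizon, target, upside in targets:
        window = [px for d, px in prices.items()
                  if issued <= d <= issued + timedelta(days=horizon)]
        if upside:
            hits += any(px >= target for px in window)
        else:
            hits += any(px <= target for px in window)
    return hits / len(targets)

# Toy series (invented): price climbs one point per day through January.
prices = {date(2024, 1, d): 100 + d for d in range(1, 31)}
targets = [
    (date(2024, 1, 1), 10, 105, True),   # reached within window
    (date(2024, 1, 1), 3, 150, True),    # not reached
]
print(hit_rate(targets, prices))
```

Scoring against the window the analyst actually stated, rather than "eventually", is what keeps the metric honest.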

#4. Scientific Papers

What we measure. Replication status (replicates / fails-to-replicate / mixed / not-yet-attempted); effect-size shrinkage in replication; citation-context analysis (is the paper being cited as “established” or as “contested”?).

Public data available now. Open Science Collaboration replication databases, Center for Open Science records, Many Labs project outputs, retracted-paper databases (Retraction Watch).

Time window. Per-paper lifetime + 5-year replication probability; per-author rolling 5-year cohort.
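Effect-size shrinkage, as used above, reduces to a simple ratio. The sketch below is illustrative (the function name and figures are invented for this page, not drawn from any replication database):

```python
def shrinkage(original_effect, replication_effect):
    """Fraction of the original effect size lost in replication.
    1.0 = effect vanished entirely; 0.0 = fully reproduced;
    negative = the replication found a *larger* effect."""
    return 1 - replication_effect / original_effect

# Hypothetical example: an original d = 0.60 replicates at d = 0.30,
# i.e. half the claimed effect survived.
print(shrinkage(0.60, 0.30))
```

Aggregating this ratio over a rolling 5-year author cohort gives a per-author shrinkage profile rather than a single pass/fail verdict.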

#5. Review Platforms

What we measure. Outcome-alignment of aggregated reviews (when reviews say X, X happens N% of the time); rating distribution vs. realised quality on durable outcomes (product longevity, return rates, etc.).

Public data available now. Aggregated outcome datasets (Consumer Reports historical comparison data, restaurant longevity vs. Yelp at-the-time scores, movie review vs. box-office-time-decay analysis).

Time window. Per-platform rolling 3 years; per-category stratified (electronics / restaurants / hospitality / financial / etc.).
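The outcome-alignment metric above can be sketched as a conditional frequency. Everything here is hypothetical (threshold, record shape, data): given pairs of at-the-time average rating and a later durable outcome, estimate P(outcome | reviews said "good").

```python
def outcome_alignment(records, rating_threshold=4.0):
    """records: (avg_rating_at_time, outcome) pairs, where outcome is 1
    if the durable result matched what the reviews implied (e.g. the
    restaurant survived three years). Returns the conditional rate, or
    None when no record clears the threshold."""
    good = [(r, y) for r, y in records if r >= rating_threshold]
    if not good:
        return None
    return sum(y for _, y in good) / len(good)

# Invented records: (rating when sampled, survived-3-years flag).
records = [(4.5, 1), (4.2, 1), (4.8, 0), (3.0, 0), (2.5, 1)]
print(outcome_alignment(records))
```

Stratifying by category before computing this rate matters, since a 4.5 means different things for electronics and for restaurants.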

#6. Prediction Markets

What we measure. Closing-price calibration (when market closes at X%, outcome occurs X% of the time); time-to-resolution efficiency; manipulation signals (anomalous late-window price moves).

Public data available now. Manifold Markets full historical data, PredictIt archive (until 2024 shutdown), Kalshi public-question outcomes, Polymarket on-chain settled markets.

Time window. Per-market lifetime + per-platform rolling 12 months; domain-stratified.
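One crude heuristic for the manipulation signal named above: compare the size of the final-stretch price move against the market's own step-to-step volatility over the rest of its life. This sketch is an invented illustration, not the Ledger's detection method, and deliberately ignores volume and order-book data a real signal would use.

```python
def late_move_score(prices, late_window=10):
    """prices: chronological series of market probabilities.
    Returns the absolute late-window move divided by the standard
    deviation of single-step moves over the earlier history; large
    values flag anomalous late swings for human review."""
    body, tail = prices[:-late_window], prices[-late_window:]
    moves = [abs(b - a) for a, b in zip(body, body[1:])]
    mean = sum(moves) / len(moves)
    var = sum((m - mean) ** 2 for m in moves) / len(moves)
    sd = var ** 0.5 or 1e-9          # avoid divide-by-zero on flat markets
    return abs(tail[-1] - tail[0]) / sd
```

Because the numerator spans several steps while the denominator is per-step volatility, this overweights legitimate news arriving late; it is a triage filter, not a verdict.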

  • Methodology v1.1 — full Brier + Murphy + append-only framework
  • Research — foundational works + active literature + data-source map
  • Roadmap — when each source-type goes live
  • Partners — design-partner recruitment for launch