Survivorship bias in prediction-market backtests: a 30-line reproduction

Every prediction-market scraper that hits a venue's /markets endpoint at backtest time silently drops every market the venue has archived. The result: you backtest on a curated population that excludes the very markets your strategy would have lost on. Your Sharpe is overstated. Here's the proof, the magnitude, and the schema-level fix.

The bias mechanism

Pull /markets from any of the three major venues today. You'll get a list of currently-visible markets — typically active or recently-resolved. Pull again in 90 days. A subset of the markets you got the first time will be missing. They've been archived: pruned from the default listing, sometimes with API access disabled, sometimes with the data physically removed.

Why venues archive markets:

  • Resolved long ago and no longer interesting (sports, daily-event)
  • Cancelled / void-resolved (e.g., the underlying event didn't occur as defined)
  • Disputed outcomes that the venue eventually invalidated
  • Cleanup of test markets, anomalous markets, or markets with regulatory issues

None of those are random with respect to outcome. Markets that resolve unusually, get disputed, or get cancelled are systematically more likely to disappear. If your backtest only sees markets that didn't get archived, your sample is censored in exactly the direction that flatters your strategy.
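A toy simulation makes the direction concrete. Under the assumption (ours, for illustration) that archived markets skew toward losers, dropping them inflates the measured edge of a strategy that has none:

```python
import random
import statistics

random.seed(0)

# hypothetical per-market P&L for 10,000 markets, zero-mean by construction:
# a strategy with no real edge
pnl = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# assume losing markets have a 30% chance of being archived and winners
# never are -- a deliberately stark assumption, but the direction matches
# the archive reasons listed above
survivors = [p for p in pnl if not (p < 0 and random.random() < 0.30)]

print(f"full-universe mean P&L: {statistics.mean(pnl):+.3f}")
print(f"survivor-only mean P&L: {statistics.mean(survivors):+.3f}")
```

The survivor-only mean comes out clearly positive even though the full universe is zero-mean; a Sharpe computed on the survivors inherits the same inflation.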

A 30-line reproduction

Take any scraper that pulls a venue's /markets list and runs a backtest. Run the same backtest against a snapshot of the venue's universe from 90 days ago: you get a different Sharpe, worse and more realistic.

# pseudo-code; replace fetch_markets with your venue API call.
# fetch_markets is assumed to return a collection of market ids.
from datetime import date, timedelta

today_universe  = fetch_markets(as_of=date.today())
historical_dump = fetch_markets(as_of=date.today() - timedelta(days=90))

# markets that were visible 90 days ago and are now gone
disappeared = set(historical_dump) - set(today_universe)
survived    = set(historical_dump) & set(today_universe)

print(f"disappeared: {len(disappeared)} / {len(historical_dump)}")

# run the strategy on each subset
sharpe_survivor = backtest(strategy, universe=survived).sharpe()
sharpe_full     = backtest(strategy, universe=historical_dump).sharpe()

print(f"sharpe survivor-only: {sharpe_survivor:.2f}")
print(f"sharpe full universe: {sharpe_full:.2f}")
print(f"survivorship-bias inflation: {sharpe_survivor - sharpe_full:.2f}")

The catch is the second fetch: fetch_markets(as_of=...) requires you to have stored the venue's universe historically. If your scraper started running last week, you can't. The data is gone unless someone archived it for you.
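The only real defense is to start recording now. A minimal daily snapshot job, assuming a fetch_markets() callable that returns the venue's current listing as JSON-serializable records (the names are ours, not any venue's API):

```python
import json
from datetime import date
from pathlib import Path

SNAPSHOT_DIR = Path("universe_snapshots")  # hypothetical local archive

def record_universe(fetch_markets):
    """Dump today's full /markets listing to a dated file.

    Run once a day (cron); each file becomes one as-of universe that a
    later backtest can load instead of the venue's curated listing.
    """
    SNAPSHOT_DIR.mkdir(exist_ok=True)
    path = SNAPSHOT_DIR / f"{date.today().isoformat()}.json"
    path.write_text(json.dumps(fetch_markets()))
    return path

def load_universe(as_of):
    """Reconstitute the universe as it was on a past snapshot date."""
    path = SNAPSHOT_DIR / f"{as_of.isoformat()}.json"
    return json.loads(path.read_text())
```

With 90 days of files on disk, an as-of universe fetch stops being pseudo-code.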

Magnitude on a real sample

We don't have 90 days of universe snapshots from before this dataset existed (we wish we did). What we can show is the shape of disappearance using closed Polymarket markets ordered by volume. Their gamma-api lets you walk closed markets descending by volumeNum; in practice the ratio of "still listed" to "archived" steps sharply at certain volume thresholds.

A back-of-envelope: of the ~50,000 closed Polymarket markets we walked at offsets 0–50,000, every one had volume above $140k. Below that threshold the markets still exist in the venue's database but drop out of the descending listing past a certain depth. Our gamma-api pull capped out at the threshold; below it, we'd need explicit per-market condition_id lookups to retrieve them, which is exactly the kind of data that disappears for a backtester who didn't think to record IDs eagerly.
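The walk itself is short. A sketch, assuming gamma-api's pagination parameters (closed, order=volumeNum, ascending, limit, offset) behave as we observed; the fetch function is injected so the pagination logic can be tested offline:

```python
import json
import urllib.parse
import urllib.request

GAMMA = "https://gamma-api.polymarket.com/markets"

def walk_closed_markets(fetch=None, page_size=500, max_pages=100):
    """Yield closed markets in descending volume order.

    `fetch` takes a URL and returns parsed JSON; the default does a real
    HTTP GET. Parameter names follow the gamma-api conventions we
    observed; treat them as assumptions, not documented guarantees.
    """
    if fetch is None:
        fetch = lambda url: json.load(urllib.request.urlopen(url))
    for page in range(max_pages):
        params = urllib.parse.urlencode({
            "closed": "true",
            "order": "volumeNum",
            "ascending": "false",
            "limit": page_size,
            "offset": page * page_size,
        })
        batch = fetch(f"{GAMMA}?{params}")
        if not batch:          # empty page: end of the visible listing
            return
        yield from batch
```

Record every conditionId the moment you see it; the IDs, not the listing, are what you cannot recover later.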

Manifold archives faster than Polymarket: the in-app feed surfaces only recent markets, and anything more than ~6 months old is hard to reach without targeted IDs. Kalshi keeps old markets queryable, but with reduced indexing and sparser documentation.

Empirically, every prediction-market backtest paper we've read either (a) explicitly addresses survivorship bias and excludes pre-2023 data, or (b) reports inflated Sharpe ratios that don't replicate out of sample. The consistent gap between published and replicated results is a tell.

The schema-level fix: deletion ledger

Our canonical schema includes a deletion_ledger table:

CREATE TABLE deletion_ledger (
    market_id      TEXT PRIMARY KEY,
    venue_id       TEXT NOT NULL,
    last_seen_at   TIMESTAMPTZ NOT NULL,
    deleted_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_snapshot  JSONB NOT NULL      -- frozen markets row
);

Whenever the daily reconciliation observes that a market that was present yesterday is now missing from the venue listing, we don't drop it from markets. We mark deleted_at and freeze the last-known snapshot in the ledger. The market remains in the canonical schema with all its trades and outcomes, just flagged.
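In SQL, the reconciliation is one anti-join insert. A sketch, assuming a canonical markets table and a seen_universe staging table holding today's /markets pull (both names ours):

```sql
-- markets present in the canonical table but absent from today's listing
-- get a ledger row with their last-known state frozen as JSONB
INSERT INTO deletion_ledger (market_id, venue_id, last_seen_at, last_snapshot)
SELECT m.market_id, m.venue_id, m.last_seen_at, to_jsonb(m)
FROM markets m
LEFT JOIN seen_universe s USING (market_id)
WHERE s.market_id IS NULL
ON CONFLICT (market_id) DO NOTHING;   -- idempotent across daily runs
```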

A survivorship-aware backtest joins on markets + deletion_ledger and includes the deleted markets in the universe. Your Sharpe drops, your strategy gets tested against the world as it actually was, and you find out before live money does.
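Concretely, the survivorship-aware universe query is a plain left join; :as_of is a bind parameter, and the listing-date column name is an assumption:

```sql
-- everything listed on or before :as_of, archived or not
SELECT m.market_id,
       COALESCE(d.last_snapshot, to_jsonb(m)) AS snapshot,
       d.deleted_at IS NOT NULL AS was_archived
FROM markets m
LEFT JOIN deletion_ledger d USING (market_id)
WHERE m.first_seen_at <= :as_of;   -- hypothetical listing-date column
```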

Resolution data: the other half of the bias

Survivorship bias has a sibling: outcome bias. If you backtest on currently-visible markets, the resolution status is "we know it now" — even if the market traded for months without a known outcome. Strategies that look great because they "knew" the resolution by virtue of the data being post-resolution are a classic look-ahead trap.

Our outcomes table fixes this with three explicit states:

  • final_payout = 1 — winner, on a market with resolved_at ≤ backtest as_of
  • final_payout = 0 — loser, same condition
  • final_payout = NULL — unresolved at the relevant time

Combined with a backtest filter that treats any market satisfying resolved_at IS NULL OR resolved_at > as_of as unresolved, this lets you simulate the world as it was at any prior date without leaking future resolutions.
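As a query, the masking is one CASE expression over the three states above (:as_of is a bind parameter):

```sql
-- resolutions that happened after as_of are masked back to NULL,
-- i.e. "unresolved as far as the backtest is allowed to know"
SELECT market_id,
       CASE WHEN resolved_at IS NOT NULL AND resolved_at <= :as_of
            THEN final_payout
            ELSE NULL
       END AS payout_as_of
FROM outcomes;
```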

Validation discipline

We borrowed the validation rigor from a parallel equity-trading project: bootstrap CI, permutation test, BH-FDR (q=0.10), out-of-sample ≥1 year, walk-forward stack, SPY-only counterfactual, top-bucket toxicity check. The same harness flags survivorship-biased strategies because they fail on out-of-sample windows that include now-archived markets.
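Of those gates, the permutation test is the easiest to reproduce in a few lines. A sketch, assuming per-market P&L as input; the harness's actual implementation may differ:

```python
import random
import statistics

def permutation_pvalue(pnl, n_perm=2000, seed=0):
    """One-sided p-value for mean P&L > 0 under a sign-flip null.

    Under the null of no edge, each market's P&L is symmetric around
    zero, so randomly flipping signs generates the null distribution
    of the mean; the p-value is the fraction of permuted means at or
    above the observed one (with the standard +1 correction).
    """
    rng = random.Random(seed)
    observed = statistics.mean(pnl)
    hits = 0
    for _ in range(n_perm):
        flipped = [p if rng.random() < 0.5 else -p for p in pnl]
        if statistics.mean(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A strategy whose edge survives this shuffle on the full universe, archived markets included, has cleared a much higher bar than one validated on survivors only.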

A working hedge that passed our internal gates (TLT TOM bond-flow) reconciled Sharpe 0.65 / OOS 0.97 / perm_p 0.0002. We don't promote anything to live without the OOS validation including the deleted/archived market subset.

Takeaways

  1. Survivorship bias in prediction markets is real and not random. Archived markets correlate with cancelled / disputed / cleaned-up outcomes — exactly the kind of trades your strategy would have lost on.
  2. If your scraper hits /markets at backtest time, you have this bias. Backtests run on "currently visible" data are biased upward.
  3. The fix is structural: keep deleted markets in the schema. A deletion ledger plus per-day snapshots lets you ask "what was visible on 2024-09-01?" and get an honest answer.
  4. Outcome bias is the silent twin. Use resolved_at explicitly in backtest filters to prevent post-hoc resolution leak.

The pred-markets dataset retains every resolved and pruned market we've seen since ingestion started, with last-known snapshots in the deletion ledger. Email us if you're running a backtest where this matters and want the historical universe reconstituted.