Dataset

One canonical schema.
Six tables. Every venue.

Every market on every supported venue maps into the same six tables. Binary or categorical, on-chain or off-chain, active or resolved — same shape, same query.

Tables

The canonical schema.

Source of truth: schema/001_canonical.sql in the repo. Every venue ingester writes into these — nothing escapes the normalizer.

markets

One row per market

Globally addressable as <venue>:<native_id>. Carries volume_native + volume_unit (different units per venue — Kalshi=contracts, Polymarket=usd, Manifold=mana). Plus created_at, closes_at, resolved_at, resolution_value, raw venue payload for audit.
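
A minimal filter on the markets table. One assumption not confirmed above: that venue_id values are lowercase venue names like "kalshi".

import pandas as pd

m = pd.read_parquet("markets.parquet")
# still-open Kalshi markets: unresolved rows for one venue
open_kalshi = m[(m["venue_id"] == "kalshi") & (m["resolved_at"].isna())]
print(open_kalshi[["market_id", "closes_at", "volume_native", "volume_unit"]].head())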

outcomes

One row per outcome

Two for binary, N for categorical. Carries final_payout: 1 for winners, 0 for losers on closed markets, NULL if unresolved. Survivorship-bias-free: losing outcomes stay in the table instead of being dropped.
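
Because losers keep a final_payout of 0 rather than vanishing, a calibration check over resolved outcomes is a one-line groupby. A sketch, assuming trades join to outcomes via an outcome_id column (not shown in this doc):

import pandas as pd

o  = pd.read_parquet("outcomes.parquet")
tr = pd.read_parquet("trades.parquet")
resolved = o[o["final_payout"].notna()]              # winners and losers alike
fills = tr.merge(resolved, on="outcome_id")          # assumed join key
fills["edge"] = fills["final_payout"] - fills["price"]
print(fills.groupby(pd.cut(fills["price"], 10), observed=True)["edge"].mean())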

trades

Every fill, normalized

price ∈ [0,1] for binary markets (probability-equivalent). size_native + size_unit in the venue's unit. tx_hash when on-chain. Composite primary key on (market_id, trade_id) so re-runs are idempotent.
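
The composite key is what makes backfills safe to repeat; the same dedup can be expressed in pandas when stitching a re-fetched batch onto an existing file. A sketch:

import pandas as pd

existing = pd.read_parquet("trades.parquet")
refetch  = pd.read_parquet("trades.parquet")   # stand-in for a re-run batch
combined = pd.concat([existing, refetch]).drop_duplicates(subset=["market_id", "trade_id"])
assert len(combined) == len(existing)          # a pure re-run adds zero rows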

orderbook_snapshots

Top-N L2 at fixed intervals

Where the venue exposes a CLOB, we snapshot top-N bids/asks at a configurable cadence. Phase-0 sample is metadata-only; Phase-1 enables 1-minute snapshots for Quant tier and tick-level for Pro.
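
Once Phase-1 snapshots land, spread stats fall out directly. A sketch only: the level-1 column names (bid_px_1, ask_px_1) are placeholders, since snapshot columns aren't documented here.

import pandas as pd

ob = pd.read_parquet("orderbook_snapshots.parquet")
ob["spread"] = ob["ask_px_1"] - ob["bid_px_1"]     # hypothetical column names
print(ob.groupby("market_id")["spread"].mean().sort_values().head())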

resolutions

Final state, oracle, dispute flag

Resolution metadata: who called it (UMA Optimistic Oracle, venue admin, Chainlink), when it cleared, source URL, dispute flag. Joins to markets.
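
Pulling disputed markets is a single join. The dispute flag's column name (disputed here) and the market_id join key are assumptions based on the quickstart examples.

import pandas as pd

m = pd.read_parquet("markets.parquet")
r = pd.read_parquet("resolutions.parquet")
disputed = r[r["disputed"]].merge(m, on="market_id")   # assumed column names
print(disputed[["market_id", "resolved_at", "resolution_value"]].head())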

recon_log

Every coverage claim has a row

Daily reconciliation output. volume_drift_pct, gap_count_5min, subgraph_vs_gamma_drift_pct, etc. The audit trail behind any "we have X% coverage" statement.
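
To use the audit trail, scan for days that breach your own tolerance. Column names come from the list above; the venue_id column and the 5% cutoff are illustrative.

import pandas as pd

rl = pd.read_parquet("recon_log.parquet")
breaches = rl[rl["volume_drift_pct"] > 5.0]    # illustrative threshold
print(breaches[["venue_id", "volume_drift_pct", "gap_count_5min"]].head())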

Coverage matrix

Where the data comes from.

No single source covers everything. Each venue has a primary path and one or more fallbacks. We document them publicly so you know what depends on what.

Venue                | Markets       | Trades (primary)            | Trades (fallback)       | Resolution                  | Recon status
Kalshi               | REST /markets | REST /markets/trades        | -                       | REST settle_time            | 100% / 0.0%
Manifold             | REST /markets | REST /bets + cursor walk    | -                       | REST resolutionTime         | 75% / ~10%
Polymarket (std CTF) | gamma-api     | Goldsky orderbook subgraph  | data-api (capped 3,500) | gamma closedTime + on-chain | ~21% drift
Polymarket (NegRisk) | gamma-api     | Polygon RPC eth_getLogs     | -                       | gamma + on-chain            | on-chain audit

Recon convention: drift is computed as |venue_reported - sum(captured)| / venue_reported, sampled daily, and logged in recon_log. Per-venue thresholds are published; we flag failures, we don't hide them.
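
The same convention in plain Python, for reference:

def drift_pct(venue_reported: float, captured_sum: float) -> float:
    # |venue_reported - sum(captured)| / venue_reported, as a percentage
    return abs(venue_reported - captured_sum) / venue_reported * 100

print(drift_pct(1_000_000, 990_000))   # -> 1.0
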
Quickstart

Three readers, one schema.

Daily Parquet drops are partitioned by venue and date. Same files work in pandas, polars, and duckdb out of the box.

pandas

import pandas as pd
m  = pd.read_parquet("markets.parquet")   # one row per market
tr = pd.read_parquet("trades.parquet")    # every fill
print(m.merge(tr, on="market_id").head())

polars

import polars as pl
(
    pl.read_parquet("trades.parquet")
    .group_by("market_id")
    .agg(
        pl.len().alias("n"),
        pl.col("size_native").sum(),
    )
)

duckdb

SELECT venue_id, count(*) AS n
FROM read_parquet('markets.parquet')
GROUP BY 1
ORDER BY n DESC;
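
The daily drops are partitioned by venue and date, so you can scan a whole partition tree instead of a single file; duckdb's hive_partitioning flag handles this from Python too. The venue=<v>/date=<d>/ path layout is an assumption about the drop structure.

import duckdb

# hive-partitioned scan; the venue=<v>/date=<d>/ layout is assumed
df = duckdb.sql("""
    SELECT venue, date, count(*) AS n
    FROM read_parquet('drops/venue=*/date=*/*.parquet', hive_partitioning = true)
    GROUP BY 1, 2
    ORDER BY n DESC
""").df()
print(df.head())
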
Sample

5 markets. 24 KB. Three readers verified.

Real Phase-0 spike data from Kalshi + Manifold + Polymarket. Email us your use case; we ship the link.