One canonical schema.
Six tables. Every venue.
Every market on every supported venue maps into the same six tables. Binary or categorical, on-chain or off-chain, active or resolved — same shape, same query.
The canonical schema.
Source of truth: schema/001_canonical.sql in the repo. Every venue ingester writes into these six tables; nothing escapes the normalizer.
One row per market
Globally addressable as <venue>:<native_id>. Carries volume_native + volume_unit (different units per venue: Kalshi=contracts, Polymarket=usd, Manifold=mana), plus created_at, closes_at, resolved_at, resolution_value, and the raw venue payload for audit.
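A minimal sketch of what that row shape implies, using sqlite3 for illustration. The authoritative DDL lives in schema/001_canonical.sql; the column types and the sample market id here are assumptions, not the real schema:

```python
import sqlite3

# Illustrative only -- the real DDL is schema/001_canonical.sql.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE markets (
        market_id        TEXT PRIMARY KEY,  -- <venue>:<native_id>
        venue_id         TEXT NOT NULL,
        native_id        TEXT NOT NULL,
        volume_native    REAL,              -- in the venue's own unit
        volume_unit      TEXT,              -- 'contracts' | 'usd' | 'mana'
        created_at       TEXT,
        closes_at        TEXT,
        resolved_at      TEXT,              -- NULL while open
        resolution_value REAL,              -- NULL while open
        raw              TEXT               -- raw venue payload, for audit
    )
""")
con.execute(
    "INSERT INTO markets (market_id, venue_id, native_id, volume_native, volume_unit) "
    "VALUES (?, ?, ?, ?, ?)",
    ("kalshi:FED-25DEC", "kalshi", "FED-25DEC", 12000, "contracts"),  # hypothetical market
)
unit = con.execute(
    "SELECT volume_unit FROM markets WHERE market_id = 'kalshi:FED-25DEC'"
).fetchone()[0]
print(unit)  # contracts
```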
One row per outcome
Two for binary, N for categorical. Carries final_payout: 1 for winners, 0 for losers on closed markets, NULL if unresolved. Survivorship-bias-free.
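The NULL convention is what makes the table survivorship-bias-free: unresolved outcomes stay in the table and must be excluded explicitly, rather than silently missing. A small sketch with hypothetical rows:

```python
# final_payout semantics: 1.0 winner, 0.0 loser, None (NULL) unresolved.
# Rows below are hypothetical, for illustration only.
outcomes = [
    {"market_id": "kalshi:A",   "outcome": "yes", "final_payout": 1.0},
    {"market_id": "kalshi:A",   "outcome": "no",  "final_payout": 0.0},
    {"market_id": "manifold:B", "outcome": "yes", "final_payout": None},  # still open
    {"market_id": "manifold:B", "outcome": "no",  "final_payout": None},
]

# Average payout over *resolved* outcomes only -- the NULLs are dropped
# deliberately, not absent from the table.
resolved = [o["final_payout"] for o in outcomes if o["final_payout"] is not None]
avg_payout = sum(resolved) / len(resolved)
print(avg_payout)  # 0.5
```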
Every fill, normalized
price ∈ [0,1] for binary markets (probability-equivalent). size_native + size_unit in the venue's unit. tx_hash when on-chain. Composite primary key on (market_id, trade_id) so re-runs are idempotent.
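Why the composite key matters: replaying an ingest batch cannot double-count fills. A sketch of the mechanism with sqlite3 (column set is illustrative, not the real trades DDL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE trades (
        market_id TEXT,
        trade_id  TEXT,
        price     REAL,   -- in [0, 1] for binary markets
        PRIMARY KEY (market_id, trade_id)
    )
""")

fill = ("kalshi:A", "t-001", 0.42)  # hypothetical fill
for _ in range(3):  # simulate re-running the same ingest batch
    con.execute("INSERT OR IGNORE INTO trades VALUES (?, ?, ?)", fill)

n = con.execute("SELECT count(*) FROM trades").fetchone()[0]
print(n)  # 1 -- duplicates are dropped, not double-counted
```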
Top-N L2 at fixed intervals
Where the venue exposes a CLOB, we snapshot top-N bids/asks at a configurable cadence. Phase-0 sample is metadata-only; Phase-1 enables 1-minute snapshots for Quant tier and tick-level for Pro.
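Fixed-interval snapshots imply every capture is aligned to a cadence bucket. A hypothetical helper showing that alignment (the function name and the 60-second default are assumptions, not part of the product API):

```python
from datetime import datetime, timezone

def snapshot_bucket(ts: datetime, cadence_s: int = 60) -> datetime:
    """Floor a timestamp to its snapshot bucket, e.g. 1-minute for Quant tier."""
    epoch = ts.timestamp()
    return datetime.fromtimestamp(epoch - epoch % cadence_s, tz=timezone.utc)

t = datetime(2025, 1, 2, 12, 34, 56, tzinfo=timezone.utc)
print(snapshot_bucket(t).isoformat())  # 2025-01-02T12:34:00+00:00
```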
Final state, oracle, dispute flag
Resolution metadata: who called it (UMA optimistic, venue admin, Chainlink), when it cleared, source URL, dispute flag. Joins to markets.
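A sketch of that join, again with sqlite3. Table and column names here are assumptions for illustration; the real names live in schema/001_canonical.sql:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE markets (market_id TEXT PRIMARY KEY, venue_id TEXT);
    CREATE TABLE resolutions (
        market_id   TEXT REFERENCES markets(market_id),
        oracle      TEXT,     -- 'uma' | 'venue_admin' | 'chainlink'
        resolved_at TEXT,
        source_url  TEXT,
        disputed    INTEGER   -- 0/1 dispute flag
    );
    INSERT INTO markets VALUES ('poly:X', 'polymarket');
    INSERT INTO resolutions VALUES ('poly:X', 'uma', '2025-01-01', NULL, 0);
""")
row = con.execute("""
    SELECT m.venue_id, r.oracle, r.disputed
    FROM markets m JOIN resolutions r USING (market_id)
""").fetchone()
print(row)  # ('polymarket', 'uma', 0)
```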
Every coverage claim has a row
Daily reconciliation output: volume_drift_pct, gap_count_5min, subgraph_vs_gamma_drift_pct, etc. The audit trail behind any "we have X% coverage" statement.
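To make one of those columns concrete, here is a hypothetical sketch of gap_count_5min: the number of 5-minute buckets with zero captured trades between the first and last fill. The function is illustrative, not the recon job's actual code:

```python
def gap_count_5min(trade_ts: list[int]) -> int:
    """Count empty 5-minute buckets between first and last trade timestamp (seconds)."""
    bucket = 300  # 5 minutes, in seconds
    buckets = {t // bucket for t in trade_ts}
    lo, hi = min(buckets), max(buckets)
    return (hi - lo + 1) - len(buckets)

# Fills at t=0s and t=100s (bucket 0), then silence until t=1600s (bucket 5):
print(gap_count_5min([0, 100, 1600]))  # 4 empty buckets (1 through 4)
```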
Where the data comes from.
No single source covers everything. Each venue has a primary path and one or more fallbacks. We document them publicly so you know what depends on what.
| Venue | Markets | Trades (primary) | Trades (fallback) | Resolution | Recon status |
|---|---|---|---|---|---|
| Kalshi | REST /markets | REST /markets/trades | — | REST settle_time | 100% / 0.0% |
| Manifold | REST /markets | REST /bets + cursor walk | — | REST resolutionTime | 75% / ~10% |
| Polymarket (std CTF) | gamma-api | Goldsky orderbook subgraph | data-api (capped 3,500) | gamma closedTime + on-chain | ~21% drift |
| Polymarket (NegRisk) | gamma-api | Polygon RPC eth_getLogs | — | gamma + on-chain | on-chain audit |
The drift figures above are |venue_reported - sum(captured)| / venue_reported, sampled daily and logged in recon_log. Each venue's threshold is published; we don't hide a market if it fails.
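Spelled out as code, with the ratio scaled to a percentage to match the _pct column name. Names and sample numbers are illustrative; the real computation runs in the daily recon job:

```python
def volume_drift_pct(venue_reported: float, captured: list[float]) -> float:
    """Drift between venue-reported volume and the sum of captured fills, in %."""
    return abs(venue_reported - sum(captured)) / venue_reported * 100

# Venue reports 1,000 units of volume; captured fills sum to 790.
print(volume_drift_pct(1000.0, [500.0, 290.0]))  # 21.0
```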
Three readers, one schema.
Daily Parquet drops are partitioned by venue and date. Same files work in pandas, polars, and duckdb out of the box.
pandas

```python
import pandas as pd

m = pd.read_parquet("markets.parquet")
tr = pd.read_parquet("trades.parquet")
print(m.merge(tr, on="market_id").head())
```
polars

```python
import polars as pl

(
    pl.read_parquet("trades.parquet")
    .group_by("market_id")
    .agg([
        pl.len().alias("n"),
        pl.col("size_native").sum(),
    ])
)
```
duckdb

```sql
SELECT venue_id, count(*) AS n
FROM read_parquet('markets.parquet')
GROUP BY 1
ORDER BY n DESC;
```
5 markets. 24 KB. Three readers verified.
Real Phase-0 spike data from Kalshi + Manifold + Polymarket. Email us your use case; we ship the link.