Dataset

One canonical schema.
Six tables. Every venue.

Every market on every supported venue maps into the same six tables. Binary or categorical, on-chain or off-chain, active or resolved — same shape, same query.

Tables

The canonical schema.

Source of truth: schema/001_canonical.sql in the repo. Every venue ingester writes into these — nothing escapes the normalizer.

markets

One row per market

Globally addressable as <venue>:<native_id>. Carries volume_native + volume_unit (different units per venue — Kalshi=contracts, Polymarket=usd, Manifold=mana). Plus created_at, closes_at, resolved_at, resolution_value, raw venue payload for audit.
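
A minimal filter on the markets table. One assumption not confirmed above: that venue_id values are lowercase venue names like "kalshi".

import pandas as pd

m = pd.read_parquet("markets.parquet")
# still-open Kalshi markets: unresolved rows for one venue
open_kalshi = m[(m["venue_id"] == "kalshi") & (m["resolved_at"].isna())]
print(open_kalshi[["market_id", "closes_at", "volume_native", "volume_unit"]].head())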

outcomes

One row per outcome

Two for binary, N for categorical. Carries final_payout: 1 for winners, 0 for losers on closed markets, NULL if unresolved. Survivorship-bias-free: losing outcomes stay in the table instead of being dropped.
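
Because losers keep a final_payout of 0 rather than vanishing, a calibration check over resolved outcomes is a one-line groupby. A sketch, assuming trades join to outcomes via an outcome_id column (not shown in this doc):

import pandas as pd

o  = pd.read_parquet("outcomes.parquet")
tr = pd.read_parquet("trades.parquet")
resolved = o[o["final_payout"].notna()]              # winners and losers alike
fills = tr.merge(resolved, on="outcome_id")          # assumed join key
fills["edge"] = fills["final_payout"] - fills["price"]
print(fills.groupby(pd.cut(fills["price"], 10), observed=True)["edge"].mean())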

trades

Every fill, normalized

price ∈ [0,1] for binary markets (probability-equivalent). size_native + size_unit in the venue's unit. tx_hash when on-chain. Composite primary key on (market_id, trade_id) so re-runs are idempotent.
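
The composite key is what makes backfills safe to repeat; the same dedup can be expressed in pandas when stitching a re-fetched batch onto an existing file. A sketch:

import pandas as pd

existing = pd.read_parquet("trades.parquet")
refetch  = pd.read_parquet("trades.parquet")   # stand-in for a re-run batch
combined = pd.concat([existing, refetch]).drop_duplicates(subset=["market_id", "trade_id"])
assert len(combined) == len(existing)          # a pure re-run adds zero rows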

orderbook_snapshots

Top-N L2 at fixed intervals

Where the venue exposes a CLOB, we snapshot top-N bids/asks at a configurable cadence. Phase-0 sample is metadata-only; Phase-1 enables 1-minute snapshots for Quant tier and tick-level for Pro.
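
Once Phase-1 snapshots land, spread stats fall out directly. A sketch only: the level-1 column names (bid_px_1, ask_px_1) are placeholders, since snapshot columns aren't documented here.

import pandas as pd

ob = pd.read_parquet("orderbook_snapshots.parquet")
ob["spread"] = ob["ask_px_1"] - ob["bid_px_1"]     # hypothetical column names
print(ob.groupby("market_id")["spread"].mean().sort_values().head())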

resolutions

Final state, oracle, dispute flag

Resolution metadata: who called it (UMA Optimistic Oracle, venue admin, Chainlink), when it cleared, source URL, dispute flag. Joins to markets.
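
Pulling disputed markets is a single join. The dispute flag's column name (disputed here) and the market_id join key are assumptions based on the quickstart examples.

import pandas as pd

m = pd.read_parquet("markets.parquet")
r = pd.read_parquet("resolutions.parquet")
disputed = r[r["disputed"]].merge(m, on="market_id")   # assumed column names
print(disputed[["market_id", "resolved_at", "resolution_value"]].head())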

recon_log

Every coverage claim has a row

Daily reconciliation output. volume_drift_pct, gap_count_5min, subgraph_vs_gamma_drift_pct, etc. The audit trail behind any "we have X% coverage" statement.
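
To use the audit trail, scan for days that breach your own tolerance. Column names come from the list above; the venue_id column and the 5% cutoff are illustrative.

import pandas as pd

rl = pd.read_parquet("recon_log.parquet")
breaches = rl[rl["volume_drift_pct"] > 5.0]    # illustrative threshold
print(breaches[["venue_id", "volume_drift_pct", "gap_count_5min"]].head())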

Coverage matrix

Where the data comes from.

No single source covers everything. Each venue has a primary path and one or more fallbacks. We document them publicly so you know what depends on what.

Venue                | Markets       | Trades (primary)            | Trades (fallback)       | Resolution                  | Recon status
Kalshi               | REST /markets | REST /markets/trades        | -                       | REST settle_time            | 100% / 0.0%
Manifold             | REST /markets | REST /bets + cursor walk    | -                       | REST resolutionTime         | 75% / ~10%
Polymarket (std CTF) | gamma-api     | Goldsky orderbook subgraph  | data-api (capped 3,500) | gamma closedTime + on-chain | ~21% drift
Polymarket (NegRisk) | gamma-api     | Polygon RPC eth_getLogs     | -                       | gamma + on-chain            | on-chain audit

Recon convention: drift is computed as |venue_reported - sum(captured)| / venue_reported, sampled daily, and logged in recon_log. Per-venue thresholds are published; we flag failures, we don't hide them.
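
The same convention in plain Python, for reference:

def drift_pct(venue_reported: float, captured_sum: float) -> float:
    # |venue_reported - sum(captured)| / venue_reported, as a percentage
    return abs(venue_reported - captured_sum) / venue_reported * 100

print(drift_pct(1_000_000, 990_000))   # -> 1.0
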
Quickstart

Three readers, one schema.

Daily Parquet drops are partitioned by venue and date. Same files work in pandas, polars, and duckdb out of the box.

pandas

import pandas as pd
m  = pd.read_parquet("markets.parquet")   # one row per market
tr = pd.read_parquet("trades.parquet")    # every fill
print(m.merge(tr, on="market_id").head())

polars

import polars as pl
(
    pl.read_parquet("trades.parquet")
    .group_by("market_id")
    .agg(
        pl.len().alias("n"),
        pl.col("size_native").sum(),
    )
)

duckdb

SELECT venue_id, count(*) AS n
FROM read_parquet('markets.parquet')
GROUP BY 1
ORDER BY n DESC;
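
The daily drops are partitioned by venue and date, so you can scan a whole partition tree instead of a single file; duckdb's hive_partitioning flag handles this from Python too. The venue=<v>/date=<d>/ path layout is an assumption about the drop structure.

import duckdb

# hive-partitioned scan; the venue=<v>/date=<d>/ layout is assumed
df = duckdb.sql("""
    SELECT venue, date, count(*) AS n
    FROM read_parquet('drops/venue=*/date=*/*.parquet', hive_partitioning = true)
    GROUP BY 1, 2
    ORDER BY n DESC
""").df()
print(df.head())
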
Sample

5 markets. 24 KB. Three readers verified.

Real Phase-0 spike data from Kalshi + Manifold + Polymarket. Email us your use case; we ship the link.