The Polymarket data-api 3,500-trade offset cap, and how we busted it on-chain
The Polymarket data-api /trades endpoint silently returns HTTP 400
above offset 3,500. For markets like Trump 2024 ($1.5B notional, 5M+ trades), the 3,500
trades you can reach are roughly 0.07% of the real history. If you've been backtesting on
data-api alone, you've been backtesting on a tiny window. Here's the way out.
The trap
Polymarket has three public data surfaces:
- gamma-api: market metadata, volume aggregates, resolution. Solid; we use it for the markets table.
- data-api: on-chain trades for one market at a time, paginated by offset.
- CLOB /markets: orderbook + token IDs; doesn't expose volume on resolved markets.
The natural backfill loop is: pull markets from gamma, then for each market walk
data-api/trades?market=<cid>&limit=500&offset=N until empty. So we did.
    async def iter_all_trades(condition_id, page_size=500):
        out, offset = [], 0
        while True:
            page = await list_trades(condition_id, limit=page_size, offset=offset)
            if not page:
                break
            out.extend(page)
            if len(page) < page_size:
                break
            offset += page_size
        return out
Looked fine. Returned ~3,500 trades per market. We figured Polymarket markets just have ~3,500 trades each on average. They don't.
The first signal: every market hits exactly 3,500
The reconciliation harness flagged it: every high-volume market reported a captured trade count pinned at 3,500 (3,499 in one case). Markets with very different volumes do not land on the same trade count by coincidence.
| Market | Reported volume (gamma) | Captured volume (data-api) | Captured trades | Drift |
|---|---|---|---|---|
| "Will Jesus Christ return before 2027?" | $60.85M | $1.17M | 3499 | 98.1% |
| "Will Oprah Winfrey win the 2028 Democratic..." | $49.72M | $0.84M | 3500 | 98.3% |
| "Will LeBron James win the 2028 US Pres Election?" | $49.00M | $0.76M | 3500 | 98.4% |
| "Will Bernie Sanders win the 2028 Dem Pres..." | $45.35M | $0.72M | 3500 | 98.4% |
That's not a statistical distribution. That's a hard cap somewhere. Sure enough, the
next pagination call returned a clean HTTP 400. Timestamp filter params don't help either:
every one we tried (before, startTs, endTime, etc.) is silently ignored and returns the
same first 5 trades. The endpoint is filterable only by market and limit, with a hard
offset ceiling at 3,500.
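A reconciliation check along these lines is enough to surface the cap. The sketch below recomputes the Drift column from the table above and flags a trade count shared by most markets. The function names and the 50% threshold are illustrative assumptions, not the actual harness.

```python
from collections import Counter
from typing import Optional, Sequence


def drift_pct(reported_usd: float, captured_usd: float) -> float:
    """Share of reported volume missing from the captured trades, in percent."""
    return 100.0 * (reported_usd - captured_usd) / reported_usd


def suspected_cap(trade_counts: Sequence[int], min_share: float = 0.5) -> Optional[int]:
    """Return the most common trade count if at least min_share of markets hit it."""
    if not trade_counts:
        return None
    count, hits = Counter(trade_counts).most_common(1)[0]
    return count if hits / len(trade_counts) >= min_share else None


# Rows from the table above: (reported volume, captured volume, captured trades).
rows = [
    (60.85e6, 1.17e6, 3499),
    (49.72e6, 0.84e6, 3500),
    (49.00e6, 0.76e6, 3500),
    (45.35e6, 0.72e6, 3500),
]

drifts = [round(drift_pct(reported, captured), 1) for reported, captured, _ in rows]
cap = suspected_cap([n for _, _, n in rows])
print(drifts, cap)
```

Unrelated markets sharing one trade count is a much stronger signal than high drift alone, since drift can also come from legitimately missing venues.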
Why this exists
We can guess: data-api is built for the Polymarket frontend (recent-trades table, live-feed strip), not for historical-data consumers. 3,500 is "plenty" for showing the latest fills. Anyone wanting deeper history is supposed to go on-chain. ICE (Polymarket's institutional data partner since the October 2025 deal) gets the deeper feed; everyone else gets the sample.
For a research dataset that's an unacceptable ceiling. The way out runs through the Polygon blockchain.
The way out, layer 1: Goldsky orderbook subgraph
Polymarket's CLOB exchange contract (0x4bFb41d5...8982E) emits an
OrderFilled event for every fill. Goldsky hosts a public subgraph indexing
those events:
https://api.goldsky.com/api/public/project_cl6mb8i9h0003e201j6li0diw/subgraphs/orderbook-subgraph/prod/gn
The schema is exactly what you'd want: orderFilledEvent, orderbook
(per-token aggregate counter with tradesQuantity and
scaledCollateralVolume), and a few other entities. Auth-free, GraphQL, paginated.
Verifying on Trump 2024 YES token 21742633...836455:
| Source | Trade count | Volume (USD) |
|---|---|---|
| data-api (capped) | 3,500 | $1.17M |
| Goldsky subgraph | 3,450,272 | $1.22B |
| gamma-api reported | — | $1.53B |
~3.45M trades vs 3,500. 986× more data. The remaining drift between subgraph ($1.22B) and gamma ($1.53B) is roughly 20% — partially due to pre-CLOB FPMM v1 activity that the orderbook-subgraph doesn't index, partially due to NegRisk-wrapped trades on a separate exchange contract (more on that below).
Subgraph pagination has its own quirk: the GraphQL where filter doesn't
support or: in this version. Trades have one side as USDC (asset_id "0")
and the other as the outcome token. To get all of a market's trades you must run two
queries — one filtered by makerAssetId_in, one by takerAssetId_in —
and dedupe by event id. Use timestamp_gte for cursor pagination, never
timestamp_gt: events sharing a unix-second at the page boundary will be
silently skipped otherwise.
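The two-query walk with a timestamp_gte cursor can be sketched as follows. The network call is abstracted behind a hypothetical fetch_page(side, since_ts, limit) callable, which is where the GraphQL request filtered by makerAssetId_in or takerAssetId_in would go. The event dictionaries only assume an id and a timestamp field, and the in-memory data is made up for the demo.

```python
from typing import Callable, Dict, List

# Hypothetical signature: returns events for one side ("maker" or "taker")
# with timestamp >= since_ts, ordered oldest first, at most `limit` rows.
FetchPage = Callable[[str, int, int], List[dict]]


def walk_side(fetch_page: FetchPage, side: str, page_size: int = 1000) -> Dict[str, dict]:
    """Walk one side with a timestamp_gte cursor, deduping by event id."""
    seen: Dict[str, dict] = {}
    since_ts = 0
    while True:
        page = fetch_page(side, since_ts, page_size)
        new = [ev for ev in page if ev["id"] not in seen]
        for ev in new:
            seen[ev["id"]] = ev
        if len(page) < page_size:
            break  # short page: nothing left on this side
        if not new:
            # A full page of already-seen events means one second holds more
            # events than page_size, so the cursor cannot advance.
            raise RuntimeError(f"cursor stuck at timestamp {since_ts}")
        # Re-query from the last timestamp (gte, not gt) so that events sharing
        # that second are not skipped; the overlap is removed by the id check.
        since_ts = int(page[-1]["timestamp"])
    return seen


def walk_market(fetch_page: FetchPage, page_size: int = 1000) -> List[dict]:
    """Run the maker-side and taker-side walks and merge them by event id."""
    merged = walk_side(fetch_page, "maker", page_size)
    merged.update(walk_side(fetch_page, "taker", page_size))
    return sorted(merged.values(), key=lambda ev: (int(ev["timestamp"]), ev["id"]))


# In-memory stand-in for the subgraph; "c" and "d" share a second at a page boundary.
_EVENTS = [
    {"id": "a", "timestamp": "1", "side": "maker"},
    {"id": "b", "timestamp": "2", "side": "maker"},
    {"id": "c", "timestamp": "3", "side": "maker"},
    {"id": "d", "timestamp": "3", "side": "maker"},
    {"id": "e", "timestamp": "4", "side": "maker"},
    {"id": "f", "timestamp": "2", "side": "taker"},
]


def fake_fetch(side: str, since_ts: int, limit: int) -> List[dict]:
    rows = [ev for ev in _EVENTS if ev["side"] == side and int(ev["timestamp"]) >= since_ts]
    return sorted(rows, key=lambda ev: int(ev["timestamp"]))[:limit]


trades = walk_market(fake_fetch, page_size=3)
print([ev["id"] for ev in trades])
```

With a timestamp_gt cursor the same data would lose event "d", because the first page ends on "c" in the same second.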
The way out, layer 2: Polygon RPC for NegRisk
About 75% of Polymarket's high-volume markets are NegRisk-wrapped (multi-outcome events
where outcomes share a basket adapter). Their trades flow through a different exchange
contract — NegRiskCtfExchange at
0xC5d563A...20f80a — and the orderbook-subgraph doesn't index it.
No one we found does, publicly.
The path: read OrderFilled events directly off Polygon. Same event signature
as the standard exchange:
    topic0 = keccak256("OrderFilled(bytes32,address,address,uint256,uint256,uint256,uint256,uint256)")
           = 0xd0a08e8c493f9c94f29311604c9de1b4e8c8d4c06bd0c789af57f2d65bfec0f6
Indexed args: orderHash, maker, taker. Unindexed:
makerAssetId, takerAssetId, makerAmountFilled,
takerAmountFilled, fee — all uint256, ABI-encoded in the data
blob.
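Given that layout, a raw log entry can be decoded without any ABI library. The sketch below assumes the log is a dictionary shaped like an eth_getLogs result, with hex-string topics and data. The OrderFilled tuple and the sample values are illustrative, and asset IDs come back as integers rather than the string form used elsewhere in this post.

```python
from typing import NamedTuple


class OrderFilled(NamedTuple):
    order_hash: str
    maker: str
    taker: str
    maker_asset_id: int
    taker_asset_id: int
    maker_amount_filled: int
    taker_amount_filled: int
    fee: int


def decode_order_filled(log: dict) -> OrderFilled:
    """Decode one log entry for the OrderFilled event described above."""
    topics = log["topics"]
    # topics[0] is the event signature hash; the indexed args follow in order.
    order_hash = topics[1]
    # Indexed addresses are left-padded to 32 bytes; keep the low 20 bytes.
    maker = "0x" + topics[2][-40:]
    taker = "0x" + topics[3][-40:]

    data = log["data"][2:] if log["data"].startswith("0x") else log["data"]
    if len(data) != 5 * 64:
        raise ValueError(f"expected 5 uint256 words, got {len(data)} hex chars")
    # Static uint256 values are ABI-encoded as consecutive 32-byte big-endian words.
    words = [int(data[i:i + 64], 16) for i in range(0, 5 * 64, 64)]
    return OrderFilled(order_hash, maker, taker, *words)


def _word(n: int) -> str:
    """Encode an integer as one 32-byte hex word (test helper)."""
    return format(n, "064x")


# Made-up log: maker pays 0.5 USDC (asset 0) for 1.0 share of token 123.
sample_log = {
    "topics": [
        "0xd0a08e8c493f9c94f29311604c9de1b4e8c8d4c06bd0c789af57f2d65bfec0f6",
        "0x" + "11" * 32,
        "0x" + "00" * 12 + "aa" * 20,
        "0x" + "00" * 12 + "bb" * 20,
    ],
    "data": "0x" + "".join(_word(n) for n in (0, 123, 500_000, 1_000_000, 0)),
}

fill = decode_order_filled(sample_log)
print(fill.maker, fill.taker_asset_id, fill.maker_amount_filled)
```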
Two free public archive RPCs work:
- polygon.drpc.org
- polygon.lava.build
The widely-used polygon-bor-rpc.publicnode.com prunes history (you'll get
error: -32701 History has been pruned), so don't bother for backfills.
Decoding a 20-block window at block 65,000,000 (Aug 2024) yielded 125 OrderFilled events from
NegRiskCtfExchange, 11 of which matched our 50-market token universe. Writes are an idempotent
UPSERT keyed on a trade_id hashed from (tx_hash, log_index), so re-runs add nothing.
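The idempotent write can be illustrated with a small sketch. It uses Python's built-in sqlite3 as a stand-in for the production database, and the table layout, column names, and SHA-256 choice are assumptions made for the example rather than the real schema.

```python
import hashlib
import sqlite3
from typing import Iterable


def trade_id(tx_hash: str, log_index: int) -> str:
    """Deterministic id derived from (tx_hash, log_index)."""
    return hashlib.sha256(f"{tx_hash.lower()}:{log_index}".encode()).hexdigest()


def upsert_trades(conn: sqlite3.Connection, trades: Iterable[dict]) -> int:
    """Insert trades, skipping any whose trade_id already exists. Returns rows added."""
    before = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO trades (trade_id, tx_hash, log_index, usdc, shares) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (trade_id(t["tx_hash"], t["log_index"]), t["tx_hash"], t["log_index"],
             t["usdc"], t["shares"])
            for t in trades
        ],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades ("
    "trade_id TEXT PRIMARY KEY, tx_hash TEXT, log_index INTEGER, usdc REAL, shares REAL)"
)

# Made-up fills: two logs from the same transaction.
batch = [
    {"tx_hash": "0xabc", "log_index": 0, "usdc": 0.5, "shares": 1.0},
    {"tx_hash": "0xabc", "log_index": 1, "usdc": 2.0, "shares": 4.0},
]
first_run = upsert_trades(conn, batch)
second_run = upsert_trades(conn, batch)
print(first_run, second_run)
```

Keying on the transaction hash plus log index works because a log's position inside its transaction is fixed once the block is final, so the same fill always maps to the same id.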
USDC scale and price math
USDC has 6 decimals on Polygon. Polymarket CTF outcome tokens also conventionally use
6-decimal scaling. Both makerAmountFilled and takerAmountFilled
arrive as raw uint256s — divide by 10^6 to get USD or shares.
Determining the side: one of the two assets in any fill is USDC (id "0").
If makerAssetId == "0", the maker is offering USDC and the taker is delivering
outcome shares — the taker is selling. If takerAssetId == "0", taker is
buying. (Not the other way around — we got this backwards on the first ship and an audit
caught it.)
    if maker_asset == "0":
        usdc = maker_amount_filled / 1e6
        shares = taker_amount_filled / 1e6
        side = "sell"  # taker delivered shares for USDC
    else:
        usdc = taker_amount_filled / 1e6
        shares = maker_amount_filled / 1e6
        side = "buy"   # taker took shares with USDC
    price = usdc / shares
Cost / latency tradeoff
Subgraph: ~5–20 seconds for a 1k-trade page. RPC: ~50–200 ms per eth_getLogs
with a 1k-block chunk; block-timestamp lookup is the dominant cost (one RPC call per
unique block in the result set, batched via asyncio.gather). For a market
with 100k trades over a year of trading, expect:
- Subgraph: ~3–5 minutes wall-clock to walk both maker/taker sides.
- RPC: ~10–30 minutes wall-clock for the full event log + timestamps.
Not interactive, but acceptable for nightly backfills. For Pro-tier customers we cache both into Postgres and serve via REST.
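The RPC side of that cost profile comes from two pieces: chunked eth_getLogs filters and one timestamp lookup per unique block. A minimal sketch, assuming a hypothetical async fetch_timestamp(block_number) coroutine that wraps the actual RPC call; the concurrency limit and the fake timestamps are illustrative.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterator, List, Tuple


def block_chunks(start: int, end: int, size: int = 1000) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (from_block, to_block) ranges covering start..end."""
    for lo in range(start, end + 1, size):
        yield lo, min(lo + size - 1, end)


def get_logs_filter(address: str, topic0: str, lo: int, hi: int) -> dict:
    """Filter object for the standard eth_getLogs JSON-RPC method."""
    return {"address": address, "topics": [topic0],
            "fromBlock": hex(lo), "toBlock": hex(hi)}


async def block_timestamps(
    logs: List[dict],
    fetch_timestamp: Callable[[int], Awaitable[int]],
    max_in_flight: int = 20,
) -> Dict[int, int]:
    """Look up each unique block once, concurrently, via asyncio.gather."""
    blocks = sorted({int(log["blockNumber"], 16) for log in logs})
    sem = asyncio.Semaphore(max_in_flight)  # cap concurrent requests to the RPC

    async def one(block_number: int) -> int:
        async with sem:
            return await fetch_timestamp(block_number)

    stamps = await asyncio.gather(*(one(n) for n in blocks))
    return dict(zip(blocks, stamps))


calls: List[int] = []


async def fake_timestamp(block_number: int) -> int:
    """Stand-in for the RPC call; returns a made-up timestamp."""
    calls.append(block_number)
    return 1_700_000_000 + block_number


# Three logs across two blocks should trigger only two timestamp lookups.
logs = [
    {"blockNumber": hex(65_000_000)},
    {"blockNumber": hex(65_000_000)},
    {"blockNumber": hex(65_000_001)},
]
stamps = asyncio.run(block_timestamps(logs, fake_timestamp))
print(len(calls), list(block_chunks(0, 2499)))
```

Deduplicating block numbers before the lookup matters because busy blocks carry many fills, and the timestamp call is the dominant cost.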
Takeaways
- Don't rely on Polymarket's data-api for historical work: the 3,500 ceiling is silent and absolute.
- Goldsky orderbook-subgraph covers the standard CTFExchange. Use timestamp_gte + dedupe across maker/taker side queries.
- NegRisk-wrapped markets need direct RPC. polygon.drpc.org and polygon.lava.build work without auth and have archive history.
- USDC scale is 1e6, both sides. Don't trust whichever scale your favorite library decided was "obvious".
In the pred-markets dataset, Polymarket coverage runs all three layers: gamma for
metadata + resolution, Goldsky subgraph for standard CTF trades, Polygon RPC for NegRisk.
Daily reconciliation logs both subgraph_vs_gamma_drift_pct and a separate
subgraph_negrisk_uncovered_pct so the standard-CTF dashboard stays clean.
Coverage matrix.