The Polymarket data-api 3,500-trade offset cap, and how we busted it on-chain

The Polymarket data-api /trades endpoint stops serving results above offset 3,500: the next page comes back as an HTTP 400, not as more trades. For a market like Trump 2024 ($1.5B notional, 5M+ trades), the 3,500 trades you can reach are roughly 0.07% of the real history. If you've been backtesting on data-api alone, you've been backtesting on a tiny window. Here's the way out.

The trap

Polymarket has three public data surfaces:

  • gamma-api — market metadata, volume aggregates, resolution. Solid; we use it for the markets table.
  • data-api — on-chain trades for one market at a time, paginated by offset.
  • CLOB /markets — orderbook + token IDs; doesn't expose volume on resolved markets.

The natural backfill loop is: pull markets from gamma, then for each market walk data-api/trades?market=<cid>&limit=500&offset=N until empty. So we did.

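async def iter_all_trades(condition_id, page_size=500):
    # Naive offset walk. list_trades is the client helper for data-api /trades
    # (defined elsewhere).
    out, offset = [], 0
    while True:
        page = await list_trades(condition_id, limit=page_size, offset=offset)
        if not page:
            break
        out.extend(page)
        if len(page) < page_size:
            break  # short page: assume we reached the end
        offset += page_size
    return out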

Looked fine. Returned ~3,500 trades per market. We figured Polymarket markets just have ~3,500 trades each on average. They don't.

The first signal: every market hits exactly 3,500

The reconciliation harness flagged it: every high-volume market reported a captured trade count pinned at the same ceiling. Three of the four markets below sit at exactly 3,500 and the fourth at 3,499; none exceeds it.

Market                                               Reported volume   Captured   n_trades   Drift
"Will Jesus Christ return before 2027?"              $60.85M           $1.17M     3499       98.1%
"Will Oprah Winfrey win the 2028 Democratic..."      $49.72M           $0.84M     3500       98.3%
"Will LeBron James win the 2028 US Pres Election?"   $49.00M           $0.76M     3500       98.4%
"Will Bernie Sanders win the 2028 Dem Pres..."       $45.35M           $0.72M     3500       98.4%

That's not a probabilistic distribution. That's a hard cap somewhere. Sure enough, the next pagination call returned a clean HTTP 400. Try any timestamp filter param you can think of (before, startTs, endTime, etc.) — all silently ignored, all return the same first 5 trades. The endpoint is only filterable by market and limit, with a hard offset ceiling at 3,500.
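
The kind of check that surfaces this pattern can be sketched in a few lines. This is a minimal illustration, not the reconciliation harness itself: the names OFFSET_CAP, drift_pct and flag_capped, the one-trade tolerance, and the 90% drift threshold are all assumptions made for the example.

```python
OFFSET_CAP = 3_500  # ceiling observed on data-api /trades


def drift_pct(reported_usd: float, captured_usd: float) -> float:
    """Share of reported volume missing from the captured trades, in percent."""
    if reported_usd <= 0:
        raise ValueError("reported volume must be positive")
    return 100.0 * (1.0 - captured_usd / reported_usd)


def flag_capped(n_trades: int, reported_usd: float, captured_usd: float,
                min_drift_pct: float = 90.0) -> bool:
    """True when a market looks truncated by the offset cap.

    Assumption: a count within one trade of the cap, combined with a large
    volume drift, is treated as truncation and not as a small market.
    """
    near_cap = n_trades >= OFFSET_CAP - 1
    return near_cap and drift_pct(reported_usd, captured_usd) >= min_drift_pct


# First row of the table: $60.85M reported, $1.17M captured, 3499 trades.
print(round(drift_pct(60.85e6, 1.17e6), 1), flag_capped(3499, 60.85e6, 1.17e6))
```

A market with a low trade count and matching volume passes untouched; only the combination of a pinned count and a large gap trips the flag.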

Why this exists

We can guess: data-api is built for the Polymarket frontend (recent-trades table, live-feed strip), not for historical-data consumers. 3,500 is "plenty" for showing the latest fills. Anyone wanting deeper history is supposed to go on-chain. ICE (Polymarket's institutional data partner since the October deal) gets the deeper feed; everyone else gets the sample.

For a research dataset that's an unacceptable ceiling. The way out runs through the Polygon blockchain.

The way out, layer 1: Goldsky orderbook subgraph

Polymarket's CLOB exchange contract (0x4bFb41d5...8982E) emits an OrderFilled event for every fill. Goldsky hosts a public subgraph indexing those events:

https://api.goldsky.com/api/public/project_cl6mb8i9h0003e201j6li0diw
    /subgraphs/orderbook-subgraph/prod/gn

The schema is exactly what you'd want: orderFilledEvent, orderbook (per-token aggregate counter with tradesQuantity and scaledCollateralVolume), and a few other entities. Auth-free, GraphQL, paginated.

Verifying on Trump 2024 YES token 21742633...836455:

Source               Trade count   Volume (USD)
data-api (capped)    3,500         $1.17M
Goldsky subgraph     3,450,272     $1.22B
gamma-api reported   -             $1.53B

~3.45M trades vs 3,500. 986× more data. The remaining drift between subgraph ($1.22B) and gamma ($1.53B) is roughly 20% — partially due to pre-CLOB FPMM v1 activity that the orderbook-subgraph doesn't index, partially due to NegRisk-wrapped trades on a separate exchange contract (more on that below).

Subgraph pagination has its own quirk: the GraphQL where filter doesn't support or: in this version. Trades have one side as USDC (asset_id "0") and the other as the outcome token. To get all of a market's trades you must run two queries — one filtered by makerAssetId_in, one by takerAssetId_in — and dedupe by event id. Use timestamp_gte for cursor pagination, never timestamp_gt: events sharing a unix-second at the page boundary will be silently skipped otherwise.
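
A minimal sketch of that two-query walk, assuming a hypothetical fetch_page callable that wraps the GraphQL request and returns events sorted by ascending timestamp. The function names and the dict shape (string "id" and "timestamp" fields) are assumptions for illustration, not the subgraph client itself. Because timestamp_gte re-reads the boundary second, rows are deduped by id within each side as well as across sides.

```python
from typing import Callable, Dict, Iterator, List

Event = Dict[str, str]
# Hypothetical transport: (where_key, token_ids, timestamp_gte, first) -> events,
# sorted ascending by timestamp. In practice this would POST a GraphQL query.
FetchPage = Callable[[str, List[str], int, int], List[Event]]


def walk_side(fetch_page: FetchPage, where_key: str, token_ids: List[str],
              page_size: int = 1000) -> Iterator[Event]:
    """Yield every event for one side, paginating on timestamp_gte."""
    cursor = 0
    seen = set()  # ids already yielded; boundary rows come back twice
    while True:
        page = fetch_page(where_key, token_ids, cursor, page_size)
        fresh = [e for e in page if e["id"] not in seen]
        for event in fresh:
            seen.add(event["id"])
            yield event
        if len(page) < page_size:
            return  # short page: nothing left
        last_ts = int(page[-1]["timestamp"])
        if last_ts == cursor and not fresh:
            # A full page inside a single second: the cursor cannot advance.
            raise RuntimeError("page_size too small for events sharing one timestamp")
        cursor = last_ts


def all_trades(fetch_page: FetchPage, token_ids: List[str],
               page_size: int = 1000) -> List[Event]:
    """Union of the maker-side and taker-side walks, deduped by event id."""
    merged: Dict[str, Event] = {}
    for where_key in ("makerAssetId_in", "takerAssetId_in"):
        for event in walk_side(fetch_page, where_key, token_ids, page_size):
            merged.setdefault(event["id"], event)
    return sorted(merged.values(), key=lambda e: (int(e["timestamp"]), e["id"]))
```

The explicit RuntimeError covers the one case timestamp_gte cannot handle on its own: when a full page of events shares a single second, the cursor never moves, so the walk stops loudly instead of looping forever. The seen set grows with the market, which is fine for a sketch but worth bounding to the boundary second for multi-million-trade tokens.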

The way out, layer 2: Polygon RPC for NegRisk

About 75% of Polymarket's high-volume markets are NegRisk-wrapped (multi-outcome events where outcomes share a basket adapter). Their trades flow through a different exchange contract — NegRiskCtfExchange at 0xC5d563A...20f80a — and the orderbook-subgraph doesn't index it. No one we found does, publicly.

The path: read OrderFilled events directly off Polygon. Same event signature as the standard exchange:

topic0 = keccak256("OrderFilled(bytes32,address,address,uint256,uint256,uint256,uint256,uint256)")
       = 0xd0a08e8c493f9c94f29311604c9de1b4e8c8d4c06bd0c789af57f2d65bfec0f6

Indexed args: orderHash, maker, taker. Unindexed: makerAssetId, takerAssetId, makerAmountFilled, takerAmountFilled, fee — all uint256, ABI-encoded in the data blob.
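
Decoding one log by hand needs nothing beyond the standard library. The sketch below follows the layout just described (three indexed topics after topic0, five 32-byte words in data); the function name and the returned dict keys are hypothetical, and asset ids are kept as decimal strings so that USDC compares equal to "0".

```python
from typing import Dict, List, Union

WORD = 32  # ABI word size in bytes


def decode_order_filled(topics: List[str], data: str) -> Dict[str, Union[str, int]]:
    """Decode a raw OrderFilled log into plain Python values.

    topics: [topic0, orderHash, maker, taker] as 0x-prefixed hex strings.
    data:   0x-prefixed hex of five uint256 words.
    """
    if len(topics) != 4:
        raise ValueError("expected topic0 plus three indexed args")
    blob = bytes.fromhex(data[2:] if data.startswith("0x") else data)
    if len(blob) != 5 * WORD:
        raise ValueError("expected exactly five 32-byte words in data")
    words = [int.from_bytes(blob[i:i + WORD], "big") for i in range(0, len(blob), WORD)]
    maker_asset, taker_asset, maker_amount, taker_amount, fee = words
    return {
        "order_hash": topics[1],
        # Indexed addresses are left-padded to 32 bytes; keep the last 20.
        "maker": "0x" + topics[2][-40:],
        "taker": "0x" + topics[3][-40:],
        "maker_asset_id": str(maker_asset),
        "taker_asset_id": str(taker_asset),
        "maker_amount_filled": maker_amount,
        "taker_amount_filled": taker_amount,
        "fee": fee,
    }
```

Amounts stay as raw integers here; scaling to USD or shares is a separate step.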

Two free public archive RPCs work:

  • polygon.drpc.org
  • polygon.lava.build

The widely-used polygon-bor-rpc.publicnode.com prunes history (you'll get error: -32701 History has been pruned), so don't bother for backfills.

A 20-block window at block 65,000,000 (Aug 2024) decoded 125 OrderFilled events from NegRiskCtfExchange, 11 of which matched our 50-market token universe. Writes are an idempotent UPSERT keyed on a composite trade_id, a hash of (tx_hash, log_index), so re-runs add nothing.
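
The idempotent write can be illustrated with a small stand-in. This sketch uses an in-memory SQLite table in place of the production store, and the SHA-256 construction of trade_id, the table layout, and the function names are assumptions for the example; only the (tx_hash, log_index) key comes from the text above.

```python
import hashlib
import sqlite3
from typing import Iterable, Tuple


def trade_id(tx_hash: str, log_index: int) -> str:
    """Deterministic id for one log entry (assumed scheme: SHA-256 of the pair)."""
    return hashlib.sha256(f"{tx_hash.lower()}:{log_index}".encode()).hexdigest()


def upsert_trades(conn: sqlite3.Connection,
                  rows: Iterable[Tuple[str, int, float]]) -> int:
    """Insert (tx_hash, log_index, usdc) rows; return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trades ("
        " trade_id TEXT PRIMARY KEY,"
        " tx_hash TEXT NOT NULL,"
        " log_index INTEGER NOT NULL,"
        " usdc REAL NOT NULL)"
    )
    before = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
    conn.executemany(
        "INSERT INTO trades (trade_id, tx_hash, log_index, usdc)"
        " VALUES (?, ?, ?, ?)"
        " ON CONFLICT(trade_id) DO NOTHING",
        [(trade_id(tx, idx), tx, idx, usdc) for tx, idx, usdc in rows],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
batch = [("0xabc", 0, 1.5), ("0xabc", 1, 2.0)]
first_run = upsert_trades(conn, batch)   # both rows are new
second_run = upsert_trades(conn, batch)  # same batch again: nothing added
```

Keying on the log's position in the chain, not on anything derived from the decoded payload, means a decoder fix can be re-run over the same blocks without creating duplicates.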

USDC scale and price math

USDC has 6 decimals on Polygon. Polymarket CTF outcome tokens also conventionally use 6-decimal scaling. Both makerAmountFilled and takerAmountFilled arrive as raw uint256s — divide by 10^6 to get USD or shares.

Determining the side: one of the two assets in any fill is USDC (id "0"). If makerAssetId == "0", the maker is offering USDC and the taker is delivering outcome shares — the taker is selling. If takerAssetId == "0", taker is buying. (Not the other way around — we got this backwards on the first ship and an audit caught it.)

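def classify_fill(maker_asset, maker_amount_filled, taker_amount_filled):
    # Amounts are raw uint256 values; both assets use 6-decimal scaling.
    if maker_asset == "0":
        usdc   = maker_amount_filled / 1e6
        shares = taker_amount_filled / 1e6
        side   = "sell"     # taker delivered shares for USDC
    else:
        usdc   = taker_amount_filled / 1e6
        shares = maker_amount_filled / 1e6
        side   = "buy"      # taker took shares with USDC
    # Guard against a zero-share fill before dividing.
    price = usdc / shares if shares else None
    return side, usdc, shares, price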

Cost / latency tradeoff

Subgraph: ~5–20 seconds for a 1k-trade page. RPC: ~50–200 ms per eth_getLogs with a 1k-block chunk; block-timestamp lookup is the dominant cost (one RPC call per unique block in the result set, batched via asyncio.gather). For a market with 100k trades over a year of trading, expect:

  • Subgraph: ~3–5 minutes wall-clock to walk both maker/taker sides.
  • RPC: ~10–30 minutes wall-clock for the full event log + timestamps.

Not interactive, but acceptable for nightly backfills. For Pro-tier customers we cache both into Postgres and serve via REST.
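
The 1k-block chunking mentioned above can be sketched as follows. The helper names are hypothetical, and the exchange address is passed in as a parameter instead of being hard-coded; the payload shape is the standard eth_getLogs JSON-RPC request, whose fromBlock and toBlock bounds are both inclusive.

```python
from typing import Dict, Iterator, Tuple

ORDER_FILLED_TOPIC0 = (
    "0xd0a08e8c493f9c94f29311604c9de1b4e8c8d4c06bd0c789af57f2d65bfec0f6"
)


def block_chunks(start: int, end: int, size: int = 1_000) -> Iterator[Tuple[int, int]]:
    """Split [start, end] into inclusive, non-overlapping windows of at most size blocks."""
    if size <= 0 or start > end:
        raise ValueError("need size > 0 and start <= end")
    lo = start
    while lo <= end:
        hi = min(lo + size - 1, end)
        yield lo, hi
        lo = hi + 1


def get_logs_payload(address: str, from_block: int, to_block: int,
                     request_id: int = 1) -> Dict:
    """Build one eth_getLogs request body filtered to OrderFilled events."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_getLogs",
        "params": [{
            "address": address,
            "fromBlock": hex(from_block),  # JSON-RPC quantities are 0x-prefixed hex
            "toBlock": hex(to_block),
            "topics": [ORDER_FILLED_TOPIC0],
        }],
    }
```

Each payload can then be POSTed to an archive endpoint with whatever HTTP client the backfill already uses. Because the windows are inclusive and never overlap, no log is fetched twice at a chunk boundary.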

Takeaways

  1. Don't rely on Polymarket's data-api for historical work — the 3,500 ceiling is silent and absolute.
  2. Goldsky orderbook-subgraph covers the standard CTFExchange. Use timestamp_gte + dedupe across maker/taker side queries.
  3. NegRisk-wrapped markets need direct RPC. polygon.drpc.org and polygon.lava.build work without auth and have archive history.
  4. USDC scale is 1e6, both sides. Don't trust whichever scale your favorite library decided was "obvious".

In the pred-markets dataset, Polymarket coverage runs all three layers: gamma for metadata + resolution, Goldsky subgraph for standard CTF trades, Polygon RPC for NegRisk. Daily reconciliation logs both subgraph_vs_gamma_drift_pct and a separate subgraph_negrisk_uncovered_pct so the standard-CTF dashboard stays clean. See the coverage matrix.