Blog · Apr 17, 2026 · benchmarks · evals · proof

How we publish reproducible commerce benchmarks

Turaxia publishes every benchmark as a replayable card, not a marketing number. Here is how the proof pack works.

In the AI-commerce category the default is to put a large number on the homepage and call it a day: "99.9% accuracy", "sub-second latency", "processes thousands of products per minute". None of those numbers is replayable. None of them survives a serious technical evaluation.

We made a different call. Every number on turaxia.com is backed by a card in our proof bundle: a reproducible artifact, generated by a validation script, against an approved real-source scenario, with the exact command and audit events recorded.

What is in the proof bundle

The live bundle lives at /proof and contains, for every primitive we publish a number on:

  • A card with the latency, output summary, and confidence signals from a real run.
  • A demo recording you can watch end-to-end.
  • An evidence row that names the asset, the owner, and the maturity of the module it came from.
  • A manifest that ties it all together with the scenario name and the source URL.

If the underlying behaviour changes, the card changes. If the card is missing, the number does not appear on the site.

The scenarios are real

Our first public scenario takes a real GU Japan product page, parses it, translates it to Kazakhstan Russian, prices the landed cost in KZT, and picks an optimal carrier with a package plan. We keep the scenario small on purpose — a wide benchmark nobody can actually reproduce is worse than a small one that anyone can.

When we expand the public cohort we add scenarios one at a time, as design partners opt in.

Why this matters for the agent era

Agents do not tolerate marketing numbers. If an agent reads our docs and calls a Turaxia tool on a page that looks nothing like what we claimed to support, the agent will fail publicly. Publishing reproducible cards is not just a trust play; it is the only way the primitives stay usable by downstream agent authors.

How to reproduce

  • Read the quickstart for the end-to-end command.
  • Download /proof/latest/manifest.json and confirm the scenario and source URL.
  • Run the primitives on the same URL and compare your output to the manifest card.

If a number ever appears on this site that is not backed by a proof card, tell us and we will remove it.


Written by Yerzhan Karatayev. Last updated Apr 17, 2026.

← All posts