Parse

Turaxia Parse performs confidence-aware DOM extraction with variant-matrix decomposition, encoding detection, and per-field typed output.

Parse is the first stage of the Turaxia pipeline. It takes a supplier URL, optionally accompanied by already-captured HTML, and returns a typed product record.

Contract

type ParseInput =
  | { sourceUrl: string }
  | { sourceUrl: string; rawHtml: string };

type ParseOutput = {
  productId: string;
  title: string | null;
  description: string | null;
  images: Array<{ url: string; role?: 'hero' | 'gallery' }>;
  variants: Array<{
    color?: string;
    size?: string;
    material?: string;
    sku?: string;
    price?: { currency: string; amount: number };
  }>;
  pricing: { currency: string; amount: number } | null;
  hsCodeHint?: string;
  weightGramsEstimate?: number;
  confidenceSignals: Array<{ field: string; confidence: number; source: string }>;
};
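
As a rough sketch of how a caller might work with this contract, the snippet below narrows the ParseInput union and looks up the confidence reported for a field. The helper names hasRawHtml and confidenceFor are hypothetical and not part of Turaxia's API, and taking the lowest value when several signals mention the same field is an assumption made for illustration.

```typescript
// Types copied from the contract above, trimmed to what this sketch uses.
type ParseInput =
  | { sourceUrl: string }
  | { sourceUrl: string; rawHtml: string };

type ConfidenceSignal = { field: string; confidence: number; source: string };

// Hypothetical helper: true when the caller supplied already-captured HTML.
function hasRawHtml(
  input: ParseInput
): input is { sourceUrl: string; rawHtml: string } {
  return 'rawHtml' in input;
}

// Hypothetical helper: the lowest confidence reported for a field,
// or undefined when no signal mentions that field.
function confidenceFor(
  signals: ConfidenceSignal[],
  field: string
): number | undefined {
  const values = signals
    .filter((s) => s.field === field)
    .map((s) => s.confidence);
  return values.length > 0 ? Math.min(...values) : undefined;
}
```

Using the lowest confidence is the conservative choice here: a field is only treated as trustworthy if every signal about it is.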

What it handles

  • Encoding detection: Shift_JIS, EUC-JP, UTF-8, Windows-1252.
  • JS-rendered pages where the server returns a near-empty shell.
  • Variant matrices (size × color × material grids) with sparse availability.
  • Per-retailer rules layered on top of a generic extractor.
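
The variant-matrix item in the list above can be illustrated with a small sketch: expand the axes into every combination, then keep only the combinations the page lists as available. This is not Turaxia's implementation; the names Axes, expandMatrix, variantKey, and keepAvailable are hypothetical, and representing availability as a list of variants is an assumption.

```typescript
type Variant = { color?: string; size?: string; material?: string };
type Axes = { color?: string[]; size?: string[]; material?: string[] };

const AXIS_KEYS = ['color', 'size', 'material'] as const;

// Build the Cartesian product of whichever axes are present.
// With no axes at all, the result is a single empty variant.
function expandMatrix(axes: Axes): Variant[] {
  let combos: Variant[] = [{}];
  for (const key of AXIS_KEYS) {
    const values = axes[key];
    if (!values || values.length === 0) continue; // axis absent on this page
    const next: Variant[] = [];
    for (const partial of combos) {
      for (const value of values) {
        const variant: Variant = { ...partial };
        variant[key] = value;
        next.push(variant);
      }
    }
    combos = next;
  }
  return combos;
}

// Lookup key for a combination; absent axes become empty strings.
// Assumes axis values never contain the '|' separator.
function variantKey(v: Variant): string {
  return AXIS_KEYS.map((k) => v[k] ?? '').join('|');
}

// Keep only the combinations listed as available (sparse availability).
function keepAvailable(all: Variant[], available: Variant[]): Variant[] {
  const keys = available.map(variantKey);
  return all.filter((v) => keys.indexOf(variantKey(v)) !== -1);
}
```

For example, two colors and two sizes expand to four combinations, and passing two of them as available leaves exactly those two in the result.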

Benchmarks

  • Fixture baseline: 9.22 ms on approved GU HTML (card in /proof).
  • Live supplier HTML: 79.97 ms P50 (card in /proof).

Both cards are regenerated by real validation scripts. Reproduce via turaxia benchmark run.

When Parse returns null

When a field cannot be derived with confidence, Parse sets that field to null and emits a confidenceSignals entry. This is intentional: downstream primitives (Localize, Price, Route) are designed around typed nulls, not hallucinations.
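
A minimal sketch of what designing around typed nulls could look like on the consumer side, assuming the nullable fields from the contract (title, description, pricing). The helpers missingFields and displayPrice are hypothetical, as is the 'dom' source value used in the usage note; none of them are part of Turaxia's API.

```typescript
type ConfidenceSignal = { field: string; confidence: number; source: string };

// The nullable subset of ParseOutput, plus its confidence signals.
type ParsedFields = {
  title: string | null;
  description: string | null;
  pricing: { currency: string; amount: number } | null;
  confidenceSignals: ConfidenceSignal[];
};

type MissingField = { field: string; signal?: ConfidenceSignal };

// Hypothetical helper: list the fields that came back null, attaching
// the first confidence signal that mentions each one, if any.
function missingFields(record: ParsedFields): MissingField[] {
  const nullable = ['title', 'description', 'pricing'] as const;
  const missing: MissingField[] = [];
  for (const field of nullable) {
    if (record[field] === null) {
      const signal = record.confidenceSignals.filter((s) => s.field === field)[0];
      missing.push(signal ? { field, signal } : { field });
    }
  }
  return missing;
}

// Hypothetical helper: render a price without guessing when it is null.
function displayPrice(record: ParsedFields): string {
  if (record.pricing === null) return 'price unavailable';
  return `${record.pricing.currency} ${record.pricing.amount}`;
}
```

A record with a null title and a signal such as { field: 'title', confidence: 0.2, source: 'dom' } would be reported by missingFields with that signal attached, so the caller can decide whether to retry, fall back, or surface the gap instead of inventing a value.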

Availability

Parse is generally available. See the developer hub for the status of every module.

