Parse
Turaxia Parse — confidence-aware DOM extraction with variant-matrix decomposition, encoding detection, and per-field typed output.
Parse is the first stage of the Turaxia pipeline. It takes a supplier URL, optionally accompanied by already-captured HTML, and returns a typed product record.
Contract
type ParseInput =
  | { sourceUrl: string }
  | { sourceUrl: string; rawHtml: string };

type ParseOutput = {
  productId: string;
  title: string | null;
  description: string | null;
  images: Array<{ url: string; role?: 'hero' | 'gallery' }>;
  variants: Array<{
    color?: string;
    size?: string;
    material?: string;
    sku?: string;
    price?: { currency: string; amount: number };
  }>;
  pricing: { currency: string; amount: number } | null;
  hsCodeHint?: string;
  weightGramsEstimate?: number;
  confidenceSignals: Array<{ field: string; confidence: number; source: string }>;
};
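As a rough illustration of the two input shapes, the sketch below shows how a caller might tell them apart before invoking Parse. The names hasRawHtml and describeInput are hypothetical helpers written for this example; they are not part of the Turaxia API, and the example URL is a placeholder.

```typescript
// Input shapes copied from the contract above.
type ParseInput =
  | { sourceUrl: string }
  | { sourceUrl: string; rawHtml: string };

// Hypothetical type guard: true when the caller supplied captured HTML.
function hasRawHtml(
  input: ParseInput,
): input is { sourceUrl: string; rawHtml: string } {
  return typeof (input as { rawHtml?: unknown }).rawHtml === "string";
}

// Hypothetical helper: describes which path a given input would take.
function describeInput(input: ParseInput): string {
  return hasRawHtml(input)
    ? `parse ${input.rawHtml.length} chars captured from ${input.sourceUrl}`
    : `fetch and parse ${input.sourceUrl}`;
}

console.log(describeInput({ sourceUrl: "https://example.test/p/1" }));
```

Note that both shapes carry sourceUrl, so captured HTML is always paired with the URL it came from.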
What it handles
- Encoding detection: Shift_JIS, EUC-JP, UTF-8, Windows-1252.
- JS-rendered pages where the server returns a near-empty shell.
- Variant matrices (size × color × material grids) with sparse availability.
- Per-retailer rules layered on top of a generic extractor.
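To make the variant-matrix item concrete, here is a minimal sketch of one way a size × color × material grid with sparse availability could be decomposed into variant records. It is an illustration under stated assumptions, not Turaxia's implementation: the Axes type, the key format, and the functions variantKey and expandVariants are all hypothetical.

```typescript
// Variant fields mirror the axis fields of ParseOutput.variants.
type Variant = { color?: string; size?: string; material?: string };
// Hypothetical input: the option values found for each axis on a page.
type Axes = { color?: string[]; size?: string[]; material?: string[] };

const AXIS_ORDER = ["color", "size", "material"] as const;

// Builds a stable key such as "color=navy|size=M" for availability lookups.
function variantKey(v: Variant): string {
  return AXIS_ORDER
    .filter((axis) => v[axis] !== undefined)
    .map((axis) => `${axis}=${v[axis]}`)
    .join("|");
}

// Expands the axes into their cartesian product, then keeps only the
// combinations listed in `available` (the sparse part of the matrix).
function expandVariants(axes: Axes, available: Set<string>): Variant[] {
  let combos: Variant[] = [{}];
  for (const axis of AXIS_ORDER) {
    const values = axes[axis];
    if (!values || values.length === 0) continue; // axis absent on this page
    const next: Variant[] = [];
    for (const partial of combos) {
      for (const value of values) {
        next.push({ ...partial, [axis]: value });
      }
    }
    combos = next;
  }
  return combos.filter((v) => available.has(variantKey(v)));
}

const variants = expandVariants(
  { color: ["navy", "white"], size: ["S", "M"] },
  new Set(["color=navy|size=S", "color=white|size=M"]),
);
console.log(variants);
```

Here a 2 × 2 grid yields only the two combinations that are actually offered, rather than four records with placeholder data.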
Benchmarks
- Fixture baseline: 9.22 ms on approved GU HTML (card in /proof).
- Live supplier HTML: 79.97 ms P50 (card in /proof).
Both cards are regenerated by real validation scripts. Reproduce via turaxia benchmark run.
When Parse returns null
When a field cannot be derived with confidence, Parse returns null for that field and emits a confidenceSignals entry. This is intentional. Downstream primitives (Localize, Price, Route) are designed around typed nulls, not hallucinations.
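A minimal sketch of how a downstream consumer might work with typed nulls follows. It assumes that the field value of a confidenceSignals entry matches the name of the output property it describes, which the contract does not state explicitly. The NullableFields type and the explainNulls function are hypothetical, and the sample values are invented for illustration.

```typescript
type ConfidenceSignal = { field: string; confidence: number; source: string };

// Subset of ParseOutput from the contract, limited to the nullable fields.
type NullableFields = {
  title: string | null;
  description: string | null;
  pricing: { currency: string; amount: number } | null;
  confidenceSignals: ConfidenceSignal[];
};

// For each null field, looks up the signal emitted for it, if any.
// Assumption: signal.field equals the output property name.
function explainNulls(
  record: NullableFields,
): Array<{ field: string; signal?: ConfidenceSignal }> {
  const nullable = ["title", "description", "pricing"] as const;
  return nullable
    .filter((field) => record[field] === null)
    .map((field) => ({
      field,
      signal: record.confidenceSignals.find((s) => s.field === field),
    }));
}

// Invented sample record: pricing could not be derived with confidence.
const gaps = explainNulls({
  title: "Crew neck T-shirt",
  description: null,
  pricing: null,
  confidenceSignals: [
    { field: "pricing", confidence: 0.2, source: "generic-extractor" },
  ],
});
console.log(gaps);
```

Checking for null explicitly, instead of substituting a default, keeps a missing price from being mistaken for a real one further down the pipeline.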
Availability
Parse is generally available. See the developer hub for the status of every module.