Benchmark methodology
read as.md benchmarks.ggui.ai is the public dashboard for ggui’s generation quality. It runs nightly and publishes per-cell quality, latency, and cost across a three-tier model matrix. This page is the methodology behind those numbers — what is measured, how it is scored, and how to reproduce a run yourself.
What it measures
Section titled “What it measures”Every night the harness generates UI for a fixed corpus of prompts across a three-tier model matrix — fast, balanced, and premium capability tiers, each instantiated on the three providers ggui supports (claude, openai, google). Every matrix cell records three things:
- Quality — the aesthetic score (below), 0–100.
- Latency — wall-clock time to a compiled, contract-typed component.
- Cost — provider spend for the generation, in USD.
The dashboard publishes per-cell results. It is not a provider leaderboard — see Judge panel.
Quality scoring
Section titled “Quality scoring”Quality is the mean of five aesthetic dimensions, each weighted equally at 20% and scored 0–100:
| Dimension | What it captures |
|---|---|
| Layout | Spacing, alignment, structure, responsive behavior |
| Design tokens | Correct use of @ggui-ai/design tokens over ad-hoc CSS |
| Hierarchy | Visual weighting — what reads first, second, third |
| Polish | States, affordances, finish; the absence of rough edges |
| Data presentation | How clearly the contract’s data is rendered |
The pass threshold is 70. A cell at or above 70 is a pass; below is a fail. The five-dimension breakdown is published alongside the composite so a regression can be traced to the dimension that moved.
Judge panel
Section titled “Judge panel”Quality is not scored by a single model. Each generation is judged by a three-provider panel — claude, openai, and google — all run at temperature 0 for determinism. The published score is the panel mean, and the per-cell spread across the three judges is shown alongside it.
The panel exists to neutralize single-model bias: no model judges only its own output, and a generous-to-self or harsh-to-rivals bias from any one judge is diluted by the other two. The visible spread is the honesty check — a wide spread on a cell is a signal that the judges disagree, not a number to trust blindly.
This is why the dashboard publishes per-cell scores, not a provider ranking. The unit of truth is “this model, this tier, on this prompt” — rolling that up into a single “best provider” headline would discard exactly the per-cell, per-dimension detail the panel is designed to preserve.
Corpus
Section titled “Corpus”The harness runs against a fixed set of generation prompts — representative UI shapes that exercise the contract surface: weather-card, survey-form, kanban-board, and others, plus gadget commits (renderer-side capability flows). The corpus is fixed so that night-over-night movement reflects model and triad changes, not a shifting set of prompts.
Reproducibility
Section titled “Reproducibility”The benchmark is source-available — the entire harness ships in the public repo. To run it yourself:
git clone https://github.com/ggui-ai/gguicd gguipnpm installpnpm --filter @ggui-ai/benchmark bench …You need a provider API key (set the relevant provider environment variable; the harness reads it the same way the live dashboard does). The benchmark dataset is licensed CC-BY-4.0 — reuse it, cite it, build on it.
See also
Section titled “See also”- UI Generator — the harness the benchmark exercises.
- benchmarks.ggui.ai — the live dashboard.