Benchmark methodology


benchmarks.ggui.ai is the public dashboard for ggui’s generation quality. It runs nightly and publishes per-cell quality, latency, and cost across a three-tier model matrix. This page is the methodology behind those numbers — what is measured, how it is scored, and how to reproduce a run yourself.

What it measures

Every night the harness generates UI for a fixed corpus of prompts across a three-tier model matrix — fast, balanced, and premium capability tiers, each instantiated on the three providers ggui supports (claude, openai, google). Every matrix cell records three things:

Quality — the aesthetic score (below), 0–100.
Latency — wall-clock time to a compiled, contract-typed component.
Cost — provider spend for the generation, in USD.

The dashboard publishes per-cell results. It is not a provider leaderboard — see Judge panel.
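As a rough illustration of what a per-cell record might look like, the sketch below models the three tiers, the three providers, and the three recorded measurements. The type and field names (`CellResult`, `latencyMs`, `costUsd`) and the millisecond unit are assumptions made for this example, not the harness's actual schema, and the real harness may also key a cell by prompt.

```typescript
// Hypothetical types for illustration; the real harness schema may differ.
type Tier = "fast" | "balanced" | "premium";
type Provider = "claude" | "openai" | "google";

interface CellResult {
  tier: Tier;
  provider: Provider;
  quality: number;   // aesthetic score, 0–100
  latencyMs: number; // wall-clock time to a compiled, contract-typed component (unit assumed)
  costUsd: number;   // provider spend for the generation, in USD
}

const TIERS: Tier[] = ["fast", "balanced", "premium"];
const PROVIDERS: Provider[] = ["claude", "openai", "google"];

// Build a readable key such as "fast/claude" for one cell.
function cellKey(cell: Pick<CellResult, "tier" | "provider">): string {
  return `${cell.tier}/${cell.provider}`;
}

// Enumerate every tier x provider combination in the matrix.
function matrixKeys(): string[] {
  const keys: string[] = [];
  for (const tier of TIERS) {
    for (const provider of PROVIDERS) {
      keys.push(cellKey({ tier, provider }));
    }
  }
  return keys;
}

console.log(matrixKeys().length); // 9
```

Keying on tier and provider keeps every combination individually addressable, which matches the dashboard's per-cell framing.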

Quality scoring

Quality is the mean of five aesthetic dimensions, each weighted equally at 20% and scored 0–100:

Dimension	What it captures
Layout	Spacing, alignment, structure, responsive behavior
Design tokens	Correct use of `@ggui-ai/design` tokens over ad-hoc CSS
Hierarchy	Visual weighting — what reads first, second, third
Polish	States, affordances, finish; the absence of rough edges
Data presentation	How clearly the contract’s data is rendered

The pass threshold is 70. A cell at or above 70 is a pass; below is a fail. The five-dimension breakdown is published alongside the composite so a regression can be traced to the dimension that moved.
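The composite and the pass check can be sketched in a few lines. This is a minimal illustration of an equally weighted mean with a threshold of 70, not the harness's own code; the names `DimensionScores`, `compositeQuality`, and `isPass` are hypothetical, and the sample scores are made up.

```typescript
// Hypothetical names; scores are each expected to be in the 0–100 range.
interface DimensionScores {
  layout: number;
  designTokens: number;
  hierarchy: number;
  polish: number;
  dataPresentation: number;
}

const PASS_THRESHOLD = 70;

// Equal 20% weights across five dimensions reduce to a plain arithmetic mean.
function compositeQuality(s: DimensionScores): number {
  const values = [s.layout, s.designTokens, s.hierarchy, s.polish, s.dataPresentation];
  for (const v of values) {
    if (!(v >= 0 && v <= 100)) {
      throw new RangeError(`dimension score out of range: ${v}`);
    }
  }
  const sum = values.reduce((acc, v) => acc + v, 0);
  return sum / values.length;
}

// A composite at or above the threshold passes; anything below fails.
function isPass(composite: number): boolean {
  return composite >= PASS_THRESHOLD;
}

// Made-up scores: (80 + 60 + 75 + 70 + 65) / 5 = 70, which sits exactly on the threshold.
const sampleComposite = compositeQuality({
  layout: 80,
  designTokens: 60,
  hierarchy: 75,
  polish: 70,
  dataPresentation: 65,
});
console.log(sampleComposite, isPass(sampleComposite)); // 70 true
```

In this made-up example the cell passes overall even though Design tokens (60) and Data presentation (65) are individually below 70, which is the kind of detail the published five-dimension breakdown makes visible.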

Judge panel

Quality is not scored by a single model. Each generation is judged by a three-provider panel (claude, openai, and google), with every judge run at temperature 0 to keep scoring as repeatable as possible. The published score is the panel mean, and the per-cell spread across the three judges is shown alongside it.

The panel exists to neutralize single-model bias: every generation is scored by judges from all three providers, so no model is the sole judge of its own output, and a generous-to-self or harsh-to-rivals bias from any one judge is diluted by the other two. The visible spread is the honesty check: a wide spread on a cell signals that the judges disagree, so the mean for that cell should not be trusted blindly.
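The aggregation described above can be sketched as follows. The panel mean comes straight from the text; the definition of spread is an assumption here (highest judge score minus lowest), since the exact statistic is not specified, and `PanelScores` and `summarizePanel` are hypothetical names.

```typescript
// One score per judge for a single generation (hypothetical shape).
interface PanelScores {
  claude: number;
  openai: number;
  google: number;
}

// Returns the panel mean plus a spread, assumed here to be max minus min.
function summarizePanel(scores: PanelScores): { mean: number; spread: number } {
  const { claude, openai, google } = scores;
  const mean = (claude + openai + google) / 3;
  const spread = Math.max(claude, openai, google) - Math.min(claude, openai, google);
  return { mean, spread };
}

// Made-up judge scores: mean is 75, spread is 78 - 72 = 6.
const panelSummary = summarizePanel({ claude: 72, openai: 78, google: 75 });
console.log(panelSummary.mean, panelSummary.spread); // 75 6
```

Under this reading, two cells with the same mean can look very different: a spread of 6 suggests broad agreement among the judges, while a spread of 30 would suggest the mean is hiding a disagreement.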

This is why the dashboard publishes per-cell scores, not a provider ranking. The unit of truth is “this model, this tier, on this prompt” — rolling that up into a single “best provider” headline would discard exactly the per-cell, per-dimension detail the panel is designed to preserve.

Corpus

The harness runs against a fixed set of generation prompts — representative UI shapes that exercise the contract surface: weather-card, survey-form, kanban-board, and others, plus gadget commits (renderer-side capability flows). The corpus is fixed so that night-over-night movement reflects model and triad changes, not a shifting set of prompts.

Reproducibility

The benchmark is source-available — the entire harness ships in the public repo. To run it yourself:

    git clone https://github.com/ggui-ai/ggui
    cd ggui
    pnpm install
    pnpm --filter @ggui-ai/benchmark bench …

You need a provider API key (set the relevant provider environment variable; the harness reads it the same way the live dashboard does). The benchmark dataset is licensed CC-BY-4.0 — reuse it, cite it, build on it.