Benchmark methodology

benchmarks.ggui.ai is the public dashboard for ggui’s generation quality. It runs nightly and publishes per-cell quality, latency, and cost across a three-tier model matrix. This page is the methodology behind those numbers — what is measured, how it is scored, and how to reproduce a run yourself.

Every night the harness generates UI for a fixed corpus of prompts across a three-tier model matrix: fast, balanced, and premium capability tiers, each instantiated on the three providers ggui supports (claude, openai, google). Every matrix cell records three things:

  • Quality — the aesthetic score (below), 0–100.
  • Latency — wall-clock time to a compiled, contract-typed component.
  • Cost — provider spend for the generation, in USD.

The dashboard publishes per-cell results. It is not a provider leaderboard; the judge panel discussion below explains why.
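As a rough illustration of the matrix shape, the sketch below enumerates the nine tier-by-provider cells and attaches the three recorded values to each. Only the tier and provider names come from this page. The type names, the field names (`quality`, `latencyMs`, `costUsd`), the millisecond unit, and the `collectMatrix` helper are hypothetical and are not taken from the harness.

```typescript
// Hypothetical shapes for illustration; the real harness types may differ.
type Tier = "fast" | "balanced" | "premium";
type Provider = "claude" | "openai" | "google";

interface CellResult {
  tier: Tier;
  provider: Provider;
  quality: number;   // aesthetic score, 0-100
  latencyMs: number; // wall-clock time to a compiled component (unit assumed)
  costUsd: number;   // provider spend for the generation, in USD
}

// The three values recorded per cell, without the cell coordinates.
type Measurement = Omit<CellResult, "tier" | "provider">;

const TIERS: readonly Tier[] = ["fast", "balanced", "premium"];
const PROVIDERS: readonly Provider[] = ["claude", "openai", "google"];

// Visit every (tier, provider) pair: 3 tiers x 3 providers = 9 cells.
// `measure` stands in for whatever actually runs and scores a generation.
function collectMatrix(
  measure: (tier: Tier, provider: Provider) => Measurement
): CellResult[] {
  const cells: CellResult[] = [];
  for (const tier of TIERS) {
    for (const provider of PROVIDERS) {
      cells.push({ tier, provider, ...measure(tier, provider) });
    }
  }
  return cells;
}
```

Calling `collectMatrix` with a stub that returns fixed numbers yields nine records, ordered tier by tier. In the dashboard's terms, each record corresponds to one published cell for a given prompt.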

Quality is the mean of five aesthetic dimensions, each weighted equally at 20% and scored 0–100:

| Dimension | What it captures |
| --- | --- |
| Layout | Spacing, alignment, structure, responsive behavior |
| Design tokens | Correct use of @ggui-ai/design tokens over ad-hoc CSS |
| Hierarchy | Visual weighting: what reads first, second, third |
| Polish | States, affordances, finish; the absence of rough edges |
| Data presentation | How clearly the contract’s data is rendered |

The pass threshold is 70. A cell at or above 70 is a pass; below is a fail. The five-dimension breakdown is published alongside the composite so a regression can be traced to the dimension that moved.
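The composite and the pass check described above can be sketched in a few lines. This is a minimal illustration, assuming the composite is the plain arithmetic mean with no rounding; the identifier names (`DimensionScores`, `compositeQuality`, `isPass`) are hypothetical and not the harness's own.

```typescript
// Hypothetical names; only the five dimensions, the equal weighting,
// and the threshold of 70 come from the methodology text.
const DIMENSIONS = [
  "layout",
  "designTokens",
  "hierarchy",
  "polish",
  "dataPresentation",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type DimensionScores = Record<Dimension, number>; // each scored 0-100

const PASS_THRESHOLD = 70;

// Equal 20% weights are the same as an unweighted mean of the five scores.
function compositeQuality(scores: DimensionScores): number {
  let total = 0;
  for (const dim of DIMENSIONS) {
    total += scores[dim];
  }
  return total / DIMENSIONS.length;
}

// A cell passes at or above the threshold and fails below it.
function isPass(scores: DimensionScores): boolean {
  return compositeQuality(scores) >= PASS_THRESHOLD;
}
```

For example, dimension scores of 80, 65, 75, 60, and 70 sum to 350, so the composite is exactly 70 and the cell passes. Dropping polish from 60 to 55 lowers the composite to 69 and the cell fails, and the per-dimension breakdown shows which dimension moved.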

Quality is not scored by a single model. Each generation is judged by a three-provider panel (claude, openai, and google), with every judge run at temperature 0 to keep scoring as repeatable as possible. The published score is the panel mean, and the per-cell spread across the three judges is shown alongside it.

The panel exists to neutralize single-model bias: no model judges only its own output, and a generous-to-self or harsh-to-rivals bias from any one judge is diluted by the other two. The visible spread is the honesty check — a wide spread on a cell is a signal that the judges disagree, not a number to trust blindly.
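To make the panel arithmetic concrete, here is a small sketch that reduces three judge scores to a mean and a spread. The page does not say how spread is computed, so this assumes the simplest definition, highest judge score minus lowest; the names `PanelScores`, `PanelSummary`, and `summarizePanel` are likewise hypothetical.

```typescript
// Hypothetical names. The spread definition (max - min) is an assumption;
// the dashboard may use a different measure such as standard deviation.
type Judge = "claude" | "openai" | "google";
type PanelScores = Record<Judge, number>; // one 0-100 score per judge

interface PanelSummary {
  mean: number;   // the published score
  spread: number; // disagreement between judges (assumed: max - min)
}

function summarizePanel(scores: PanelScores): PanelSummary {
  const values = [scores.claude, scores.openai, scores.google];
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const spread = Math.max(...values) - Math.min(...values);
  return { mean, spread };
}
```

With judge scores of 72, 78, and 75, the mean is 75 and the spread is 6. With scores of 55, 90, and 80, the mean is still 75 but the spread is 35, which is the kind of disagreement the published spread is meant to expose.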

This is why the dashboard publishes per-cell scores, not a provider ranking. The unit of truth is “this model, this tier, on this prompt” — rolling that up into a single “best provider” headline would discard exactly the per-cell, per-dimension detail the panel is designed to preserve.

The harness runs against a fixed set of generation prompts — representative UI shapes that exercise the contract surface: weather-card, survey-form, kanban-board, and others, plus gadget commits (renderer-side capability flows). The corpus is fixed so that night-over-night movement reflects model and triad changes, not a shifting set of prompts.

The benchmark is source-available — the entire harness ships in the public repo. To run it yourself:

git clone https://github.com/ggui-ai/ggui
cd ggui
pnpm install
pnpm --filter @ggui-ai/benchmark bench

You need a provider API key (set the relevant provider environment variable; the harness reads it the same way the live dashboard does). The benchmark dataset is licensed CC-BY-4.0 — reuse it, cite it, build on it.