---
title: Benchmark methodology
description: How benchmarks.ggui.ai measures UI-generation quality, latency, and cost — the model matrix, the five aesthetic dimensions, the three-provider judge panel, the corpus, and how to reproduce a run locally.
---

[benchmarks.ggui.ai](https://benchmarks.ggui.ai) is the public dashboard for ggui's generation quality. It runs nightly and publishes per-cell **quality**, **latency**, and **cost** across a three-tier model matrix. This page is the methodology behind those numbers — what is measured, how it is scored, and how to reproduce a run yourself.

## What it measures

Every night the harness generates UI for a fixed corpus of prompts across a **three-tier model matrix** — `fast`, `balanced`, and `premium` capability tiers, each instantiated on the three providers ggui supports (`claude`, `openai`, `google`). Every matrix cell records three things:

- **Quality** — the aesthetic score (below), 0–100.
- **Latency** — wall-clock time to a compiled, contract-typed component.
- **Cost** — provider spend for the generation, in USD.

The dashboard publishes **per-cell** results. It is not a provider leaderboard — see [Judge panel](#judge-panel).
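As a rough illustration of the matrix described above, the sketch below enumerates the tier and provider combinations and shows one possible shape for a recorded result. All type and function names here (`MatrixCell`, `CellResult`, `buildMatrix`) are hypothetical and not taken from the harness. It also assumes latency is stored in milliseconds and that a cell is a tier and provider pair with results recorded per prompt; neither is confirmed by this page.

```typescript
// Hypothetical sketch: names and field units are assumptions, not the harness's actual types.
const TIERS = ["fast", "balanced", "premium"] as const;
const PROVIDERS = ["claude", "openai", "google"] as const;

type Tier = (typeof TIERS)[number];
type Provider = (typeof PROVIDERS)[number];

interface MatrixCell {
  tier: Tier;
  provider: Provider;
}

// One recorded measurement for a cell on a single corpus prompt.
interface CellResult extends MatrixCell {
  prompt: string;    // corpus prompt id, e.g. "weather-card"
  quality: number;   // aesthetic score, 0-100
  latencyMs: number; // wall-clock time to a compiled component (unit assumed)
  costUsd: number;   // provider spend for the generation
}

// Cross every tier with every provider.
function buildMatrix(): MatrixCell[] {
  const cells: MatrixCell[] = [];
  for (const tier of TIERS) {
    for (const provider of PROVIDERS) {
      cells.push({ tier, provider });
    }
  }
  return cells;
}

const matrix = buildMatrix();
console.log(matrix.length); // 9 (3 tiers x 3 providers)
```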

## Quality scoring

Quality is the mean of **five aesthetic dimensions**, each weighted equally at 20% and scored 0–100:

| Dimension         | What it captures                                        |
| ----------------- | ------------------------------------------------------- |
| Layout            | Spacing, alignment, structure, responsive behavior      |
| Design tokens     | Correct use of `@ggui-ai/design` tokens over ad-hoc CSS |
| Hierarchy         | Visual weighting — what reads first, second, third      |
| Polish            | States, affordances, finish; the absence of rough edges |
| Data presentation | How clearly the contract's data is rendered             |

The **pass threshold is 70**. A cell at or above 70 is a pass; below is a fail. The five-dimension breakdown is published alongside the composite so a regression can be traced to the dimension that moved.
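The arithmetic above can be sketched in a few lines. With equal 20% weights, the composite is the plain mean of the five dimension scores, compared against the threshold of 70. The identifiers (`DimensionScores`, `compositeScore`, `passes`) are illustrative only, and whether the harness rounds the composite before comparing it to the threshold is not stated here, so this sketch compares the unrounded mean.

```typescript
// Hypothetical sketch: identifiers are illustrative, not the harness's API.
const DIMENSIONS = [
  "layout",
  "designTokens",
  "hierarchy",
  "polish",
  "dataPresentation",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type DimensionScores = Record<Dimension, number>;

const PASS_THRESHOLD = 70;

// Equal weighting at 20% each is the same as the arithmetic mean.
function compositeScore(scores: DimensionScores): number {
  let sum = 0;
  for (const dimension of DIMENSIONS) {
    const value = scores[dimension];
    if (value < 0 || value > 100) {
      throw new RangeError(`${dimension} must be within 0-100, got ${value}`);
    }
    sum += value;
  }
  return sum / DIMENSIONS.length;
}

// At or above the threshold is a pass; below is a fail.
function passes(composite: number): boolean {
  return composite >= PASS_THRESHOLD;
}

const example: DimensionScores = {
  layout: 80,
  designTokens: 70,
  hierarchy: 75,
  polish: 60,
  dataPresentation: 65,
};

const composite = compositeScore(example);
console.log(composite, passes(composite)); // 70 true
```

In this example a weak Polish score (60) is offset by a strong Layout score (80), which is why the per-dimension breakdown matters when reading a composite that sits right at the threshold.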

## Judge panel

Quality is **not** scored by a single model. Each generation is judged by a **three-provider panel** (`claude`, `openai`, and `google`), with every judge run at **temperature 0** to keep scoring as repeatable as possible. The published score is the **panel mean**, and the **per-cell spread** across the three judges is shown alongside it.

The panel exists to neutralize single-model bias: **no model is the sole judge of its own output**, and a generous-to-self or harsh-to-rivals bias from any one judge is diluted by the other two. The visible spread is the honesty check: a wide spread on a cell signals that the judges disagree, so the mean for that cell should not be trusted blindly.
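A minimal sketch of how a panel mean and spread could be derived from three judge scores follows. The page does not define how spread is computed, so this example assumes it is the gap between the highest and lowest judge score; a standard deviation would be an equally plausible choice. The names `PanelScores` and `summarizePanel` are hypothetical.

```typescript
// Hypothetical sketch: "spread" is assumed to be max minus min, which this page does not confirm.
type Judge = "claude" | "openai" | "google";
type PanelScores = Record<Judge, number>;

interface PanelSummary {
  mean: number;   // the published score
  spread: number; // disagreement between judges (assumed: max - min)
}

function summarizePanel(scores: PanelScores): PanelSummary {
  const values = [scores.claude, scores.openai, scores.google];
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const spread = Math.max(...values) - Math.min(...values);
  return { mean, spread };
}

const summary = summarizePanel({ claude: 72, openai: 78, google: 66 });
console.log(summary); // { mean: 72, spread: 12 }
```

Under this assumption, two cells with the same mean of 72 can tell very different stories: a spread of 2 suggests the judges agree, while a spread of 30 suggests the mean is averaging over a real disagreement.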

This is why the dashboard publishes **per-cell scores, not a provider ranking**. The unit of truth is "this model, this tier, on this prompt" — rolling that up into a single "best provider" headline would discard exactly the per-cell, per-dimension detail the panel is designed to preserve.

## Corpus

The harness runs against a **fixed set of generation prompts** — representative UI shapes that exercise the contract surface: `weather-card`, `survey-form`, `kanban-board`, and others, **plus gadget commits** (renderer-side capability flows). The corpus is fixed so that night-over-night movement reflects model and triad changes, not a shifting set of prompts.

## Reproducibility

The benchmark is **source-available** — the entire harness ships in the public repo. To run it yourself:

```bash
git clone https://github.com/ggui-ai/ggui
cd ggui
pnpm install
pnpm --filter @ggui-ai/benchmark bench …
```

Running it requires a provider API key: set the relevant provider environment variable, which the harness reads the same way the live dashboard does. The benchmark **dataset is licensed CC-BY-4.0**, so you can reuse it, cite it, and build on it.

## See also

- [UI Generator](/architecture/ui-generator/) — the harness the benchmark exercises.
- [benchmarks.ggui.ai](https://benchmarks.ggui.ai) — the live dashboard.