The agent backend


A ggui-aware agent needs an HTTP backend the user’s chat client can talk to. You do not hand-roll that backend. The OSS reference implementation is @ggui-ai/agent-server — a brand-agnostic Hono server that owns every ggui-coupled host concern — plus a thin AgentAdapter that maps your LLM SDK’s event stream to a normalized message envelope.

The split is the point: the agent-server has zero LLM-SDK knowledge and the adapter has zero ggui awareness. The protocol lives entirely between the two.

Three parties

        ┌─────────────────────────┐
        │   host / chat client    │   MCP-Apps host: claude.ai, ChatGPT,
        │  (owns the chat UI,     │   or the sample chat app. Forwards
        │   forwards ui/message)  │   ui/message text to the model.
        └───────────┬─────────────┘
              chat   │  (HTTP POST /agent → SSE)
                     ▼
        ┌─────────────────────────┐
        │     agent backend       │   @ggui-ai/agent-server (Hono)
        │  agent-server + a thin  │   + your AgentAdapter (per-SDK glue).
        │     AgentAdapter        │
        └───────────┬─────────────┘
              MCP    │  (Streamable HTTP)
                     ▼
        ┌─────────────────────────┐
        │     ggui MCP server     │   ggui_handshake / ggui_render /
        │  (GguiSessions + state  │   ggui_update / ggui_consume / ggui_emit,
        │   + the iframe runtime) │   plus the iframe runtime served per session.
        └─────────────────────────┘

The iframe (running @ggui-ai/iframe-runtime) is the rendered surface that lives inside the host. It is not a fourth party — it is the ggui MCP server’s UI, mounted in the host.

Three channels

Channel	Between	Transport	Status	Carries
chat	host ↔ agent backend	HTTP `POST /agent` → SSE	Mandatory	`{kind: 'chat', prompt, chatId?}` in; a stream of `NormalizedMessage`s out
MCP	agent backend ↔ ggui MCP server	MCP Streamable HTTP	Mandatory	`ggui_handshake` / `ggui_render` / `ggui_update` / `ggui_consume` / `ggui_emit` calls
live	ggui MCP server ↔ iframe	WebSocket (+ fallback)	Optional	declared `streamSpec` deliveries (`ggui_emit` fan-out) and `props_update` frames

chat is how a turn starts. POST /agent is a single endpoint discriminated by kind. A {kind: 'chat', prompt, chatId?} body opens the SSE stream: the first event is always chat-allocated, carrying the server-allocated chatId, and every subsequent frame is a message event. The same endpoint also accepts {kind: 'tool-call', name, arguments}, the relay for tools/call requests issued by the iframe, which is answered as plain JSON rather than SSE. GET /agent?chatId=X replays the server-authoritative snapshot through the same handler for rehydration (each recorded tool result is re-inlined fresh so the snapshot reflects current server state). GET / serves a small public manifest from which the frontend reads sandboxProxyUrl.
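As a rough illustration of the chat channel from a client's point of view, the sketch below builds a chat request body and parses a buffered SSE response. The helper names (buildChatRequest, parseSseFrames, readChatId) are hypothetical and not part of @ggui-ai/agent-server. It also assumes that chat-allocated and message travel in the SSE `event:` field with a JSON payload in `data:`, which the text above does not spell out.

```typescript
// Request bodies accepted by POST /agent, as described above.
type AgentRequest =
  | { kind: "chat"; prompt: string; chatId?: string }
  | { kind: "tool-call"; name: string; arguments: Record<string, unknown> };

// One parsed SSE frame. The JSON payload shape is an assumption.
interface SseFrame {
  event: string;
  data: unknown;
}

// Hypothetical helper: build the body that opens a chat turn.
function buildChatRequest(prompt: string, chatId?: string): AgentRequest {
  return chatId === undefined
    ? { kind: "chat", prompt }
    : { kind: "chat", prompt, chatId };
}

// Split a buffered SSE body into frames; a blank line separates frames.
function parseSseFrames(body: string): SseFrame[] {
  const frames: SseFrame[] = [];
  for (const block of body.split(/\r?\n\r?\n/)) {
    let event = "message"; // SSE default event name
    const dataLines: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith("event:")) event = line.slice(6).trim();
      else if (line.startsWith("data:")) dataLines.push(line.slice(5).replace(/^ /, ""));
    }
    if (dataLines.length > 0) {
      frames.push({ event, data: JSON.parse(dataLines.join("\n")) });
    }
  }
  return frames;
}

// Check the ordering contract: the first frame is chat-allocated and carries chatId.
function readChatId(frames: SseFrame[]): string {
  const first = frames[0];
  if (!first || first.event !== "chat-allocated") {
    throw new Error("expected chat-allocated as the first SSE event");
  }
  const data = first.data;
  if (typeof data !== "object" || data === null || !("chatId" in data)) {
    throw new Error("chat-allocated frame has no chatId");
  }
  const chatId = (data as { chatId: unknown }).chatId;
  if (typeof chatId !== "string") throw new Error("chatId is not a string");
  return chatId;
}
```

A real client would parse the stream incrementally rather than buffering it; the buffered form only keeps the ordering rule easy to see.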

MCP is the agent loop — the adapter’s LLM calls ggui_render / ggui_consume / etc. against the ggui MCP server using the URL + bearer the library threads on every call.

live is the first-party fast path for streaming UI updates into the iframe over WebSocket (gated by a wsToken). The spec-compliant cross-host fallback is tool-result inlining: agent-server’s interceptor reads _meta.ui.resourceUri on each tool result, issues a resources/read, and inlines the iframe HTML under _meta.ui.resource. The iframe can then mount from the same SSE frame that delivers the tool result, with no extra round-trip.
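The inlining step can be pictured roughly as follows. This is a minimal sketch, not the library's interceptToolResult: the function name inlineUiResource, the ToolResult shape, and the ReadResource callback (standing in for an MCP resources/read call) are all assumptions made for illustration. Only the _meta.ui.resourceUri and _meta.ui.resource keys come from the text above.

```typescript
// Minimal shapes for this sketch; only the _meta.ui keys come from the docs.
interface UiMeta {
  resourceUri?: string;
  resource?: unknown;
}
interface ToolResult {
  content: unknown[];
  _meta?: { ui?: UiMeta };
}

// Hypothetical stand-in for an MCP resources/read call.
type ReadResource = (uri: string) => Promise<unknown>;

// Return a copy of the tool result with the UI resource inlined, if one is referenced.
async function inlineUiResource(
  result: ToolResult,
  read: ReadResource,
): Promise<ToolResult> {
  const ui = result._meta?.ui;
  const uri = ui?.resourceUri;
  if (ui === undefined || uri === undefined) return result; // nothing to mount
  if (ui.resource !== undefined) return result; // already inlined
  const resource = await read(uri); // one read here saves the host a round-trip
  return { ...result, _meta: { ...result._meta, ui: { ...ui, resource } } };
}
```

Results that carry no resourceUri pass through untouched, so the same function can sit in front of every tool result.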

What agent-server owns

The library owns every ggui-coupled host concern, and nothing else:

HTTP routing (Hono) + SSE streaming
MCP discovery / routing + bearer threading (the bearer defaults to GGUI_MCP_BEARER and, failing that, to the literal dev, which pairs with ggui serve --dev-allow-all)
Tool-result resource inlining (interceptToolResult) — mounting iframes from _meta.ui.resourceUri
Server-allocated chat ids (mintChatId) — the frontend never mints ids client-side
Guest + bearer auth with chat-ownership gating
The second-origin sandbox proxy boot (per the MCP Apps spec; defaults to port + 1000)
The snapshot / rehydration path
Cross-framework tool identity: with crossFramework on (the startAgentServer default), the library declares each tool’s canonical serverInfo to ggui once per process via ggui_runtime_declare_tool_catalog, so blueprint reuse stays identity-stable across agent frameworks
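Two of the defaults in this list are simple enough to write down. The sketch below assumes that GGUI_MCP_BEARER is read from the process environment and that an explicitly passed bearer wins over it. Neither detail is confirmed here, and resolveDefaults is a hypothetical helper, not a library export.

```typescript
interface DefaultsInput {
  port: number;
  bearer?: string;
  sandboxProxyPort?: number;
}

// Hypothetical helper mirroring the documented fallbacks.
function resolveDefaults(
  opts: DefaultsInput,
  env: Record<string, string | undefined>,
): { bearer: string; sandboxProxyPort: number } {
  return {
    // explicit option, then GGUI_MCP_BEARER (assumed to be an env var), then "dev"
    bearer: opts.bearer ?? env["GGUI_MCP_BEARER"] ?? "dev",
    // the second-origin sandbox proxy defaults to port + 1000
    sandboxProxyPort: opts.sandboxProxyPort ?? opts.port + 1000,
  };
}
```

With port 6790 and nothing else set, this yields the bearer dev and a sandbox proxy on port 7790.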

Crucially, agent-server is a pure prompt forwarder: the prompt is fed to the adapter verbatim, and the server synthesizes no directive and special-cases no key. The user-gesture directive that tells the model to call ggui_consume is authored in the iframe’s ui/message text and passes straight through (see the user-action flow below).

What the AgentAdapter implements

A thin per-SDK adapter implements one async-iterable method:

import { startAgentServer, type AgentAdapter } from "@ggui-ai/agent-server";

const adapter: AgentAdapter = {
  name: "my-sdk",
  async *run(input) {
    // input.prompt        — the string the LLM should see (verbatim)
    // input.chatId        — server-allocated stable id for this conversation
    // input.mcpServers    — { name → { url, bearer } } map (e.g. { ggui: {…} })
    // input.systemPrompt  — three-way: undefined = adapter default,
    //                       null = explicitly none, string = override
    // input.abortSignal   — fires on client disconnect; stop the LLM call
    // input.agentCapabilities — canonical tool catalog (from live MCP
    //                       initialize + tools/list) to stamp into the
    //                       handshake's blueprintDraft contract
    //
    // Drive your SDK's tool loop and yield NormalizedMessage values:
    //   assistant text · tool_use · tool_result (carrying the full MCP
    //   CallToolResult as `tool_use_result`) · result
  },
};

await startAgentServer({
  port: 6790,
  mcpServers: { ggui: { url: "http://localhost:6781/mcp" } }, // a `ggui` entry is required
  adapter,
  // optional: auth (default createGuestTokenAuth()), sandboxProxyPort
  // (default port + 1000), systemPrompt, bearer, chatStore, crossFramework
});

Adapters must stay brand-agnostic: no imports of @ggui-ai/protocol/integrations/mcp-apps, no awareness of sessionId / host-session / _meta.ui keys. The adapter maps its native SDK event stream onto the NormalizedMessage envelope; ggui mechanics stay in agent-server. Reference adapters ship for the Claude Agent SDK (claude-agent-sdk), the OpenAI Agents SDK (openai-agents-sdk), and Google ADK (google-adk).
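The three-way input.systemPrompt contract is easy to get subtly wrong, because undefined and null mean different things. A minimal sketch of how an adapter might resolve it is shown below; resolveSystemPrompt and the adapterDefault parameter are hypothetical names used only for illustration.

```typescript
// Resolve the prompt an adapter should hand to its SDK.
// Returns undefined when the caller explicitly asked for no system prompt.
function resolveSystemPrompt(
  requested: string | null | undefined,
  adapterDefault: string,
): string | undefined {
  if (requested === undefined) return adapterDefault; // not specified: adapter default
  if (requested === null) return undefined; // explicitly none
  return requested; // string: override, even when empty
}
```

Note that a plain truthiness check such as `requested || adapterDefault` would collapse null and the empty string into the default, which is exactly the case the three-way contract is meant to distinguish.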

Frontend pairing — `@ggui-ai/react/chat-helpers`

On the browser side, agent-server pairs with the useMcpAppsChat hook from @ggui-ai/react/chat-helpers. It:

opens the SSE stream to POST /agent and parses the chat-allocated frame, then the message frames, into one wire;
walks each tool result’s tool_use_result for _meta.ui.resourceUri (and any inlined _meta.ui.resource) and surfaces the result as sessions entries your app mounts with <AppRenderer> (imported directly from @mcp-ui/client — ggui doesn’t wrap or re-export it);
replays GET /agent?chatId through the same pipeline for rehydration;
runs the guest-token client flow (POST /auth/guest → store → Bearer on every request → retry once on 401);
forwards an iframe ui/message’s text as the next prompt via handleAppMessage and carries its _meta opaquely as data.meta — it never reads a key inside.
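The guest-token flow in the list above (mint, store, send as Bearer, retry once on 401) can be sketched as follows. This is not the hook's actual code: createGuestClient, the injected HttpFetch type, and the assumption that POST /auth/guest answers with a JSON body of the form {token} are all illustrative.

```typescript
// Minimal HTTP surface so the sketch does not depend on a particular fetch typing.
interface HttpResponse {
  status: number;
  json(): Promise<unknown>;
}
type HttpFetch = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string },
) => Promise<HttpResponse>;

function createGuestClient(baseUrl: string, http: HttpFetch) {
  let token: string | undefined; // the "store" step; kept in memory here

  // POST /auth/guest and remember the token (response shape is an assumption).
  async function mintToken(): Promise<string> {
    const res = await http(`${baseUrl}/auth/guest`, { method: "POST", headers: {} });
    const body = (await res.json()) as { token?: unknown };
    if (typeof body.token !== "string") throw new Error("guest token missing");
    token = body.token;
    return token;
  }

  // Send a request with the Bearer header; on 401, mint a fresh token and retry once.
  async function request(path: string, method: string, body?: string): Promise<HttpResponse> {
    const send = (bearer: string): Promise<HttpResponse> => {
      const headers: Record<string, string> = { Authorization: `Bearer ${bearer}` };
      return http(
        `${baseUrl}${path}`,
        body === undefined ? { method, headers } : { method, headers, body },
      );
    };
    const first = await send(token ?? (await mintToken()));
    if (first.status !== 401) return first;
    return send(await mintToken()); // exactly one retry; a second 401 is returned as-is
  }

  return { request };
}
```

Capping the retry at one attempt keeps a misconfigured server from trapping the client in a mint-and-retry loop.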

The user-action flow

When a user interacts with a rendered UI, the gesture travels back to the agent through the GguiSession’s pending-event pipe, which is the single source of truth:

Gesture in the iframe → the iframe runtime calls ggui_runtime_submit_action via the host’s tools/call relay (postMessage per the MCP Apps spec; in the sample stack, the host relays it as POST /agent {kind: 'tool-call'}).
The ggui server appends the gesture to the GguiSession’s pending-event pipe and returns {ok, consumerPresent}.
consumerPresent is computed by an active-consumer registry: ggui_consume registers itself while long-polling, so submit_action knows whether a consume loop is currently listening.
If a ggui_consume long-poll is listening, it unblocks in-turn and returns the event {intent, actionData, uiContext, actionId, firedAt} to the agent.
If nobody is listening (consumerPresent: false — e.g. the user reloaded the page after the agent’s turn ended), the iframe emits a userAction doorbell on a ui/message. The host forwards the message text to the model, which wakes a fresh turn and calls ggui_consume({sessionId}) to drain the already-enqueued gesture.

The agent retrieves the gesture exclusively via ggui_consume, so it fires exactly once. The doorbell is a pure pointer — _meta["ai.ggui/userAction"] (GguiUserActionMeta) carries only {kind: 'user-action', description, sessionId, actionId, submittedAt, intent, nextStep: {tool: 'ggui_consume', args: {sessionId}}}, never the action payload. Carrying the payload in the doorbell would risk a double-trigger.
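To make the exactly-once argument concrete, here is a toy in-memory model of the pending-event pipe, the active-consumer registry, and the doorbell pointer. It is a sketch under stated assumptions, not the ggui server's implementation: createSessionPipe and buildDoorbell are hypothetical names, timestamps are assumed to be strings, submittedAt is assumed to mirror firedAt, and the description wording is made up. A real long-poll would also time out, which this model omits.

```typescript
// Event shape as listed in the flow above.
interface GestureEvent {
  intent: string;
  actionData: unknown;
  uiContext: unknown;
  actionId: string;
  firedAt: string;
}

// Doorbell pointer: mirrors the documented fields and never carries actionData.
interface UserActionDoorbell {
  kind: "user-action";
  description: string;
  sessionId: string;
  actionId: string;
  submittedAt: string;
  intent: string;
  nextStep: { tool: "ggui_consume"; args: { sessionId: string } };
}

function createSessionPipe() {
  const pending: GestureEvent[] = []; // single source of truth
  const waiters: Array<() => void> = []; // active-consumer registry

  // Always enqueue; report whether a consume loop was listening at that moment.
  function submitAction(event: GestureEvent): { ok: true; consumerPresent: boolean } {
    pending.push(event);
    const wake = waiters.shift();
    if (wake) wake();
    return { ok: true, consumerPresent: wake !== undefined };
  }

  // Long-poll: drain one event, registering as a consumer while the pipe is empty.
  async function consume(): Promise<GestureEvent> {
    for (;;) {
      const next = pending.shift();
      if (next) return next;
      await new Promise<void>((resolve) => waiters.push(resolve));
    }
  }

  return { submitAction, consume };
}

// Build the pointer the iframe would attach when consumerPresent is false.
function buildDoorbell(sessionId: string, event: GestureEvent): UserActionDoorbell {
  return {
    kind: "user-action",
    description: `User action "${event.intent}" is waiting; call ggui_consume`,
    sessionId,
    actionId: event.actionId,
    submittedAt: event.firedAt,
    intent: event.intent,
    nextStep: { tool: "ggui_consume", args: { sessionId } },
  };
}
```

Because each event leaves the queue only through consume, both paths (in-turn long-poll and doorbell wake-up) deliver it once, and the doorbell has nothing in it that could trigger the action a second time.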

“Zero Agent Code”, redefined

Zero Agent Code now means an agent builder writes only:

MCP server config naming the ggui MCP endpoint,
a system prompt (start from GGUI_AGENT_SYSTEM_PROMPT, exported by @ggui-ai/protocol), and
a few lines wiring a thin AgentAdapter into startAgentServer().

No polling loops, no event handlers, no protocol parsing, no sessionId / host-session awareness. All of that lives inside @ggui-ai/agent-server. The adapter is intentionally brand-agnostic SDK-mapping glue — not ggui logic.

Auth posture (Preview)

agent-server ships two AuthAdapter implementations:

createGuestTokenAuth (default) — stateless signed bearer tokens (signing secret from GUEST_TOKEN_SECRET, ephemeral with a warning if omitted) that work across browser / React Native / CLI. Mounts POST /auth/guest, GET /auth/me, POST /auth/logout.
createBearerTokenAuth — static operator-configured tokens for sample apps, CI, and small self-hosts. Mounts GET /auth/me only.

Every chat row is stamped with an ownerId; reads and appends are ownership-gated (200 owner / 403 other / 404 unknown), overridable via authorizeChat for team / org semantics. Richer JWT / JWKS / OAuth + PKCE flows are deferred to a future @ggui-ai/agent-server-auth-extras (same AuthAdapter contract, no handler rewrites). For the Preview, the bundled guest-token + static-bearer paths are the supported surface.
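The ownership gate reduces to a small decision function. The sketch below is illustrative only: chatAccessStatus is a hypothetical helper, and the AuthorizeChat signature shown is an assumption about what an authorizeChat override might receive, not the library's actual contract.

```typescript
interface ChatRow {
  ownerId: string;
}

// Assumed shape of an override hook for team / org semantics.
type AuthorizeChat = (callerId: string, chat: ChatRow) => boolean;

// 404 for an unknown chat, 200 for an allowed caller, 403 for anyone else.
function chatAccessStatus(
  chats: ReadonlyMap<string, ChatRow>,
  chatId: string,
  callerId: string,
  authorizeChat?: AuthorizeChat,
): 200 | 403 | 404 {
  const chat = chats.get(chatId);
  if (chat === undefined) return 404;
  const allowed = authorizeChat
    ? authorizeChat(callerId, chat)
    : chat.ownerId === callerId; // default: strict ownership
  return allowed ? 200 : 403;
}
```

The unknown-chat check runs before any authorization hook, so the hook only ever sees chats that exist.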