Build your own AI quality benchmark.

Product and domain experts define what good looks like. Claude Code makes your AI meet the benchmark.

Benchmark / Release gate / test #124 · 3 of 14

Correctly identify UI state from user screenshots
Failed · Unsupported facts · 5/14 passing · last run 4m ago

User · Turn 1 · 10:42 AM
My Daily digest emails stopped coming. Here's my notifications settings — what am I missing?
Attachment: settings.png

Tool call · Turn 2 · 10:42 AM
search_knowledge_base · query: "daily digest email settings" · 2 results
01 · Notifications guide · p.4 — "To re-enable the Daily digest, open Settings › Notifications and toggle it on. The digest sends at 8am local time."
02 · FAQ · Email delivery — common reasons digests stop: toggle off, unverified email, or bounced delivery.

Haven Copilot · Turn 3 · 10:42 AM · MIXED
It looks like you have Pause digest switched on under Email preferences, which stops your daily emails. Toggle it off and the digest will resume tomorrow morning.

Why benchmax

Encode your taste into a benchmark you can improve against.

Observability tools were built for engineers reading JSON traces. benchmax is built for the people who actually know your product.

FIG 0.1 · LEGAL · FIN · PM → BENCHMARK

Built for your domain experts

Your legal, finance, media, or product experts drive the benchmark. Not just your engineers.

FIG 0.2 · DOC · IMG · VIDEO

See everything your agent sees

Conversations, videos, images, and documents — the way your users experienced them. Not JSON trees.

FIG 0.3 · FAIL · PASS · ITERATE → PASS

Coding agents ship the fix

Claude Code and Cursor replay failing tests, iterate against the benchmark, and push the PR. No engineering tickets in the middle.

1.0 Inbox

Surface the patterns
your team needs to fix.

Turn production failures into issues grouped by pattern. Your team reviews what matters, not thousands of traces.

Issues · 13

• Describes non-existent UI in attached screenshots · Triage · image
  Unsupported facts: Agent describes UI elements, buttons, or states visible "in the screenshot" that aren't actually in the image.
  156 occurrences · age 1d · last seen 2h ago

• Confirms booking when createBooking silently failed · Triage
  Tool reliability: createBooking returns 429 rate-limit but agent tells the user "your booking is confirmed."
  82 occurrences · age 1d · last seen 1h ago

• Asks for order number 6+ times in one conversation · Triage
  Context loss: Agent re-asks for the order number after the user has already provided it, stuck in a loop.
  48 occurrences · age 1d · last seen 4h ago

• Refuses valid cancellation requests as "out of scope" · Triage
  Over-refusal: Agent misinterprets "cancel my subscription" as a restricted action and tells users to contact support.
  31 occurrences · age 1d · last seen 2h ago

• Invents features from old marketing PDFs · Triage · doc
  Source grounding: Agent offers users features that appear in outdated marketing docs but never shipped.
  24 occurrences · age 1d · last seen 6h ago

• Forgets user's subscription tier across turns · Triage
  Context loss: Agent walks Pro users through Free-tier limits mid-conversation after correctly identifying their tier earlier.
  14 occurrences · age 1d · last seen 1h ago

• Uses casual tone for urgent support escalations · Triage
  Tone mismatch: When the user indicates urgency ("urgent", "asap"), the agent keeps its default casual tone.
  9 occurrences · age 1d · last seen 12h ago

• Agent latency 4× baseline on search_tickets · Triage
  Tool reliability: search_tickets p95 climbed from 4s to 18s over the last day; the agent proceeds without the results.
  18 occurrences · age 1d · last seen 5h ago

Skills

Plain-English judges that run continuously across your production traces.

Clustering

Similar failures collapse into one issue, ranked by severity and frequency.

Full context

Screenshots, videos, and attachments travel with every issue.

Shared inbox

PMs and domain experts triage alongside engineers. No code required.
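
Grouping similar failures into one issue is, at its core, a similarity pass over failure descriptions. Here is a toy stdlib sketch of that idea using fuzzy string matching; the function name and threshold are illustrative, not how benchmax actually clusters:

```python
# Toy failure clustering by string similarity (illustrative only, not benchmax internals).
from difflib import SequenceMatcher

def cluster(titles: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Group failure titles whose similarity to a group's first member exceeds threshold."""
    groups: list[list[str]] = []
    for title in titles:
        for group in groups:
            if SequenceMatcher(None, title.lower(), group[0].lower()).ratio() >= threshold:
                group.append(title)  # close enough: same issue
                break
        else:
            groups.append([title])   # no match: new issue
    return groups

groups = cluster([
    "Asks for order number twice",
    "Asks for order number 6 times",
    "Confirms booking after failed createBooking",
])
# The two order-number failures collapse into one group.
```

A production system would rank the resulting groups by severity and frequency, as the inbox above does.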

2.0 Define

Define what "right"
looks like.

Encode your taste in plain English. benchmax turns your words into a custom grader that runs on every release.

Issues / Triage / hv-824
Triage · 156 occurrences

Describes non-existent UI in attached screenshots

Agent describes UI elements, buttons, or states visible "in the screenshot" that aren't actually in the image.

Occurrence 1 of 156

User · Turn 1 · 2:14 PM
My Daily digest emails stopped coming yesterday. Can you help me figure out what's wrong?

Haven Copilot · Turn 2 · 2:14 PM
Sorry about that. Could you share a screenshot of your Notifications settings so I can take a look?

User · Turn 3 · 2:15 PM
Here you go.
Attachment: settings.png

Tool call · Turn 4 · 2:15 PM

Haven Copilot · Turn 5 · 2:15 PM · Evaluating
It looks like you have Pause digest switched on under Email preferences — toggle it off and the digest will resume tomorrow morning.
Fabricated — no "Pause digest" toggle in image

Plain-English rubrics

Write the bar in the words your experts already use. benchmax turns it into a grader.

Anchored in real failures

Every test is tied to a real conversation, screenshot, or video. No hypotheticals.

Cluster-wide coverage

Pick which cases in the cluster this rubric grades. One definition, the whole class of bug.

Versioned and reviewable

Rubric changes are tracked like code. Your team can see how the bar evolved.
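
Under the hood, a plain-English rubric typically becomes the instruction block of an LLM judge. A rough sketch of that compilation step, with hypothetical names (this is not the benchmax implementation):

```python
# Sketch: turn a plain-English rubric into an LLM-judge grading prompt.
# The template and function are illustrative, not benchmax internals.
def build_judge_prompt(rubric: str, transcript: str) -> str:
    """Wrap an expert-written rubric and a conversation into a grading prompt."""
    return (
        "You are grading an AI support agent.\n"
        f"Rubric (written by a domain expert):\n{rubric}\n\n"
        f"Conversation to grade:\n{transcript}\n\n"
        "Answer PASS or FAIL, then cite the rubric line that decided it."
    )

prompt = build_judge_prompt(
    rubric="Only describe UI elements that are actually visible in the screenshot.",
    transcript='Agent: "You have Pause digest switched on."',
)
```

The point is that the expert writes only the rubric text; the surrounding judge machinery stays fixed and versioned.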

3.0 Benchmark

A test suite
built from real issues.

Every confirmed issue becomes a permanent test across text, images, videos, and docs. The suite runs against every PR.

Release gate · 18 tests · 78%
All 18 · Failing 3 · Passing 14 · Not run 0 · Drafts 1

Pass · Correctly identify UI state from user screenshots · Unsupported facts · P1 · 14/14
Fail · Fail loudly when createBooking returns 429 · Tool reliability · P1 · 6/10
Pass · Stop asking for order number after user provides it · Context loss · P1 · 9/9
Pass · Handle legitimate cancellation requests in scope · Over-refusal · P1 · 8/8
Fail · Remember user subscription tier across turns · Context loss · P1 · 6/9
Pass · Decline to invent features not present in docs · Source grounding · P1 · 9/9
Pass · Match formality to user tone in urgent escalations · Tone mismatch · P2 · 7/8
Pass · Handle conflicting info across multi-turn chats · Context loss · P2 · 5/5
Fail · Complete within 30s on search_tickets · Tool reliability · P2 · 4/8
Draft · Cite exact doc section when citing Notifications guide · Source grounding · P3
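
A release gate like this can be a small deterministic check: block the release if any P1 test has failing runs, or if the overall pass rate drops below a threshold. A minimal sketch, with hypothetical structures and a made-up threshold (not the benchmax API):

```python
# Minimal release-gate sketch; structures and threshold are hypothetical.
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    severity: str   # "P1", "P2", "P3"
    passed: int     # passing runs
    total: int      # total runs (0 = not run / draft)

def gate(results: list[TestResult], min_pass_rate: float = 0.8) -> bool:
    """Return True if the release may ship."""
    run = [r for r in results if r.total > 0]
    # Any P1 test with a failing run blocks the release outright.
    if any(r.severity == "P1" and r.passed < r.total for r in run):
        return False
    overall = sum(r.passed for r in run) / sum(r.total for r in run)
    return overall >= min_pass_rate

blocked = gate([TestResult("ui-state", "P1", 14, 14),
                TestResult("booking-429", "P1", 6, 10)])   # P1 failure: blocked
shipped = gate([TestResult("ui-state", "P1", 14, 14),
                TestResult("tone", "P2", 9, 10)])          # clean P1s, 96% overall
```

The severity tiers and 80% floor here are placeholders; the actual bar is whatever your experts encode in the rubric.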

One-click test creation

Every confirmed issue becomes a permanent test. No YAML, no PR review.

Multi-modal tests

Tests include the original screenshot, video, or document context — not just text.

Auto-generated graders

Rubric graders written from your domain experts' feedback. Editable when you need precision.

Grouped by intent

Tests cluster by category your team decides. Track patterns, not one test at a time.

4.0 Fix

Close the loop
with Claude Code.

Hand any failing test to Claude Code. It reads your prompts, your rubric, and your traces, proposes a fix, and re-runs the benchmark before opening a PR.

Claude Code · haven-copilot/fix/test-124

You · 2:14 PM
Use the benchmax skill to investigate and fix this failing test:
https://app.benchmax.io/t/haven-copilot/test-124

Claude · 2:14 PM
I'll pull the test, read the relevant prompts and judge, and propose a fix that passes the rubric before opening a PR.

mcp__benchmax__get_test · test-124 · 1.2s
Pass rate 5/14 · verdict Failed · category Unsupported facts

Read · src/vlm/screenshot_parser.py · 0.3s
Read · src/judges/ui_state_judge.py · 0.2s
Read · prompts/vlm_system.md · 0.1s

The VLM system prompt allows the model to infer UI state from context rather than requiring pixel-anchored descriptions. That's why it invents a "Pause digest" toggle that isn't in the screenshot.

I'll tighten the prompt and add an OCR cross-check in the judge so the grader can flag any UI element not present in the image.

Edit · prompts/vlm_system.md · +12 −3 · 0.2s
Edit · src/judges/ui_state_judge.py · +24 −0 · 0.2s
mcp__benchmax__replay · test-124 · 42s
Pass rate 5/14 → 13/14 · Meets rubric

Bash · git commit -m "constrain VLM to pixel-grounded descriptions" · 0.1s
mcp__github__create_pr · 0.8s
PR #1847 opened · haven-copilot/support-vlm · ready for review
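
The OCR cross-check in this transcript is a common grounding technique: extract the text actually present in the screenshot, then flag any UI element the agent names that isn't in it. A toy version, with the OCR output stubbed in as a string (the real pipeline and names are hypothetical):

```python
# Toy OCR cross-check: flag UI labels the agent mentions that the image text doesn't contain.
# `ocr_text` stands in for real OCR output; everything here is illustrative.
def unsupported_labels(mentioned: list[str], ocr_text: str) -> list[str]:
    """Return the mentioned labels with no match in the OCR'd screenshot text."""
    haystack = ocr_text.lower()
    return [label for label in mentioned if label.lower() not in haystack]

ocr_text = "Notifications   Daily digest: On   Weekly summary: Off"
flags = unsupported_labels(["Daily digest", "Pause digest"], ocr_text)
# "Pause digest" is flagged: the agent invented a toggle that isn't on screen.
```

A real judge would fuzzy-match against OCR tokens rather than do exact substring lookup, but the failure mode it catches is the same.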

Reads your stack

Claude Code pulls the test, prompts, and traces itself. It works where your code lives.

Benchmark-aware

Claude Code re-runs the suite after every change. No PR opens until the rubric passes.

Full diff, full context

Every edit is a diff you can review. No black-box changes to your prompts.

Your rubric is the contract

The expected output your experts wrote is the only bar. Claude Code iterates until it meets it.

Integrations

Connect your traces in minutes.

Claude Code wires up any trace source, warehouse, or agent framework. One natural-language command and your data's in.

Claude Code · haven-copilot · connected
connect benchmax to our langfuse instance
Let me set up the benchmax integration with your Langfuse project.
Bash(benchmax init --source langfuse)
└ Detected Langfuse at langfuse.internal:3000
Authenticating...
Project: haven-copilot
Bash(benchmax traces sync)
└ Syncing traces from langfuse_production...
847 traces imported
12 issues flagged by AI review
Connected. 847 traces synced. 12 issues ready for review in your inbox.
Observability: Langfuse · Braintrust · Arize · Galileo
Data: Postgres · BigQuery · Snowflake · Databricks
Frameworks: Claude Agent SDK · OpenAI Agents · LangChain · Vercel AI SDK

Bring your own via HTTP API, JSONL dump, or one-line SDK install — benchmax ingests from anywhere your agent talks to users.
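
For the JSONL path, each line of the dump is one self-contained trace record. The schema below is purely illustrative, not benchmax's actual format:

```python
# Illustrative JSONL trace record; field names are hypothetical, not the benchmax schema.
import json

trace = {
    "trace_id": "tr_0001",
    "turns": [
        {"role": "user",
         "content": "My Daily digest emails stopped coming.",
         "attachments": ["settings.png"]},
        {"role": "assistant",
         "content": "Toggle Pause digest off and it will resume."},
    ],
}

line = json.dumps(trace)      # one trace per line in the dump
restored = json.loads(line)   # round-trips losslessly
```

Whatever fields your traces carry, the requirement is just one JSON object per line with the conversation and its attachments inline or referenced.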

Self-hosted available

Make AI quality
measurable.

Start turning real production issues into a benchmark your team actually trusts.