Build your own AI quality benchmark.

Product and domain experts define what good looks like. Claude Code makes your AI meet the benchmark.

Benchmark / Release gate / test #124 · 3 of 14

Correctly identify UI state from user screenshots
Failed · Unsupported facts · 5/14 passing · last run 4m ago

User · Turn 1 · 10:42 AM
My Daily digest emails stopped coming. Here's my notifications settings — what am I missing?
Attachment: settings.png

Tool call · Turn 2 · 10:42 AM
search_knowledge_base · query: "daily digest email settings" · 2 results
01 · Notifications guide · p.4 — "To re-enable the Daily digest, open Settings › Notifications and toggle it on. The digest sends at 8am local time."
02 · FAQ · Email delivery — common reasons digests stop: toggle off, unverified email, or bounced delivery.

Haven Copilot · Turn 3 · 10:42 AM · MIXED
It looks like you have Pause digest switched on under Email preferences, which stops your daily emails. Toggle it off and the digest will resume tomorrow morning.

Why benchmax

Encode your taste into a benchmark you can improve against.

Observability tools were built for engineers reading JSON traces. benchmax is built for the people who actually know your product.

FIG 0.1 · LEGAL · FIN · PM → BENCHMARK

Built for your domain experts

Your legal, finance, media, or product experts drive the benchmark. Not just your engineers.

FIG 0.2 · DOC · IMG · VIDEO

See everything your agent sees

Conversations, videos, images, and documents — the way your users experienced them. Not JSON trees.

FIG 0.3 · FAIL · PASS · ITERATE → PASS

Coding agents ship the fix

Claude Code and Cursor replay failing tests, iterate against the benchmark, and push the PR. No engineering tickets in the middle.

1.0 Inbox

Surface the patterns
your team needs to fix.

Turn production failures into issues grouped by pattern. Your team reviews what matters, not thousands of traces.

Issues · 13

• Describes non-existent UI in attached screenshots · Triage · image
  Unsupported facts: Agent describes UI elements, buttons, or states visible "in the screenshot" that aren't actually in the image.
  156 occurrences · age 1d · last seen 2h ago

• Confirms booking when createBooking silently failed · Triage
  Tool reliability: createBooking returns 429 rate-limit but agent tells the user "your booking is confirmed."
  82 occurrences · age 1d · last seen 1h ago

• Asks for order number 6+ times in one conversation · Triage
  Context loss: Agent re-asks for the order number after the user has already provided it, stuck in a loop.
  48 occurrences · age 1d · last seen 4h ago

• Refuses valid cancellation requests as "out of scope" · Triage
  Over-refusal: Agent misinterprets "cancel my subscription" as a restricted action and tells users to contact support.
  31 occurrences · age 1d · last seen 2h ago

• Invents features from old marketing PDFs · Triage · doc
  Source grounding: Agent offers users features that appear in outdated marketing docs but never shipped.
  24 occurrences · age 1d · last seen 6h ago

• Forgets user's subscription tier across turns · Triage
  Context loss: Agent walks Pro users through Free-tier limits mid-conversation after correctly identifying their tier earlier.
  14 occurrences · age 1d · last seen 1h ago

• Uses casual tone for urgent support escalations · Triage
  Tone mismatch: When the user indicates urgency ("urgent", "asap"), the agent keeps its default casual tone.
  9 occurrences · age 1d · last seen 12h ago

• Agent latency 4× baseline on search_tickets · Triage
  Tool reliability: search_tickets p95 climbed from 4s to 18s over the last day; the agent proceeds without the results.
  18 occurrences · age 1d · last seen 5h ago

Skills

Plain-English judges that run continuously across your production traces.

Clustering

Similar failures collapse into one issue, ranked by severity and frequency.

Full context

Screenshots, videos, and attachments travel with every issue.

Shared inbox

PMs and domain experts triage alongside engineers. No code required.
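
Grouping similar failures into one issue is, at its core, a similarity pass over failure descriptions. Here is a toy stdlib sketch of that idea using fuzzy string matching; the function name and threshold are illustrative, not how benchmax actually clusters:

```python
# Toy failure clustering by string similarity (illustrative only, not benchmax internals).
from difflib import SequenceMatcher

def cluster(titles: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Group failure titles whose similarity to a group's first member exceeds threshold."""
    groups: list[list[str]] = []
    for title in titles:
        for group in groups:
            if SequenceMatcher(None, title.lower(), group[0].lower()).ratio() >= threshold:
                group.append(title)  # close enough: same issue
                break
        else:
            groups.append([title])   # no match: new issue
    return groups

groups = cluster([
    "Asks for order number twice",
    "Asks for order number 6 times",
    "Confirms booking after failed createBooking",
])
# The two order-number failures collapse into one group.
```

A production system would rank the resulting groups by severity and frequency, as the inbox above does.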

2.0 Define

Define what "right"
looks like.

Encode your taste in plain English. benchmax turns your words into a custom grader that runs on every release.

Issues / Triage / hv-824
Triage · 156 occurrences

Describes non-existent UI in attached screenshots

Agent describes UI elements, buttons, or states visible "in the screenshot" that aren't actually in the image.

Occurrence 1 of 156

User · Turn 1 · 2:14 PM
My Daily digest emails stopped coming yesterday. Can you help me figure out what's wrong?

Haven Copilot · Turn 2 · 2:14 PM
Sorry about that. Could you share a screenshot of your Notifications settings so I can take a look?

User · Turn 3 · 2:15 PM
Here you go.
Attachment: settings.png

Tool call · Turn 4 · 2:15 PM

Haven Copilot · Turn 5 · 2:15 PM · Evaluating
It looks like you have Pause digest switched on under Email preferences — toggle it off and the digest will resume tomorrow morning.
Fabricated — no "Pause digest" toggle in image

Plain-English rubrics

Write the bar in the words your experts already use. benchmax turns it into a grader.

Anchored in real failures

Every test is tied to a real conversation, screenshot, or video. No hypotheticals.

Cluster-wide coverage

Pick which cases in the cluster this rubric grades. One definition, the whole class of bug.

Versioned and reviewable

Rubric changes are tracked like code. Your team can see how the bar evolved.
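
Under the hood, a plain-English rubric typically becomes the instruction block of an LLM judge. A rough sketch of that compilation step, with hypothetical names (this is not the benchmax implementation):

```python
# Sketch: turn a plain-English rubric into an LLM-judge grading prompt.
# The template and function are illustrative, not benchmax internals.
def build_judge_prompt(rubric: str, transcript: str) -> str:
    """Wrap an expert-written rubric and a conversation into a grading prompt."""
    return (
        "You are grading an AI support agent.\n"
        f"Rubric (written by a domain expert):\n{rubric}\n\n"
        f"Conversation to grade:\n{transcript}\n\n"
        "Answer PASS or FAIL, then cite the rubric line that decided it."
    )

prompt = build_judge_prompt(
    rubric="Only describe UI elements that are actually visible in the screenshot.",
    transcript='Agent: "You have Pause digest switched on."',
)
```

The point is that the expert writes only the rubric text; the surrounding judge machinery stays fixed and versioned.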

3.0 Benchmark

A test suite
built from real issues.

Every confirmed issue becomes a permanent test across text, images, videos, and docs. The suite runs against every PR.

Release gate · 18 tests · 78%
All 18 · Failing 3 · Passing 14 · Not run 0 · Drafts 1

Pass · Correctly identify UI state from user screenshots · Unsupported facts · P1 · 14/14
Fail · Fail loudly when createBooking returns 429 · Tool reliability · P1 · 6/10
Pass · Stop asking for order number after user provides it · Context loss · P1 · 9/9
Pass · Handle legitimate cancellation requests in scope · Over-refusal · P1 · 8/8
Fail · Remember user subscription tier across turns · Context loss · P1 · 6/9
Pass · Decline to invent features not present in docs · Source grounding · P1 · 9/9
Pass · Match formality to user tone in urgent escalations · Tone mismatch · P2 · 7/8
Pass · Handle conflicting info across multi-turn chats · Context loss · P2 · 5/5
Fail · Complete within 30s on search_tickets · Tool reliability · P2 · 4/8
Draft · Cite exact doc section when citing Notifications guide · Source grounding · P3
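
A release gate like this can be a small deterministic check: block the release if any P1 test has failing runs, or if the overall pass rate drops below a threshold. A minimal sketch, with hypothetical structures and a made-up threshold (not the benchmax API):

```python
# Minimal release-gate sketch; structures and threshold are hypothetical.
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    severity: str   # "P1", "P2", "P3"
    passed: int     # passing runs
    total: int      # total runs (0 = not run / draft)

def gate(results: list[TestResult], min_pass_rate: float = 0.8) -> bool:
    """Return True if the release may ship."""
    run = [r for r in results if r.total > 0]
    # Any P1 test with a failing run blocks the release outright.
    if any(r.severity == "P1" and r.passed < r.total for r in run):
        return False
    overall = sum(r.passed for r in run) / sum(r.total for r in run)
    return overall >= min_pass_rate

blocked = gate([TestResult("ui-state", "P1", 14, 14),
                TestResult("booking-429", "P1", 6, 10)])   # P1 failure: blocked
shipped = gate([TestResult("ui-state", "P1", 14, 14),
                TestResult("tone", "P2", 9, 10)])          # clean P1s, 96% overall
```

The severity tiers and 80% floor here are placeholders; the actual bar is whatever your experts encode in the rubric.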

One-click test creation

Every confirmed issue becomes a permanent test. No YAML, no PR review.

Multi-modal tests

Tests include the original screenshot, video, or document context — not just text.

Auto-generated graders

Rubric graders written from your domain experts' feedback. Editable when you need precision.

Grouped by intent

Tests cluster by category your team decides. Track patterns, not one test at a time.

4.0 Fix

Close the loop
with Claude Code.

Hand any failing test to Claude Code. It reads your prompts, your rubric, and your traces, proposes a fix, and re-runs the benchmark before opening a PR.

Claude Code · haven-copilot/fix/test-124

You · 2:14 PM
Use the benchmax skill to investigate and fix this failing test:
https://app.benchmax.io/t/haven-copilot/test-124

Claude · 2:14 PM
I'll pull the test, read the relevant prompts and judge, and propose a fix that passes the rubric before opening a PR.

mcp__benchmax__get_test · test-124 · 1.2s
Pass rate 5/14 · verdict Failed · category Unsupported facts

Read · src/vlm/screenshot_parser.py · 0.3s
Read · src/judges/ui_state_judge.py · 0.2s
Read · prompts/vlm_system.md · 0.1s

The VLM system prompt allows the model to infer UI state from context rather than requiring pixel-anchored descriptions. That's why it invents a "Pause digest" toggle that isn't in the screenshot.

I'll tighten the prompt and add an OCR cross-check in the judge so the grader can flag any UI element not present in the image.

Edit · prompts/vlm_system.md · +12 −3 · 0.2s
Edit · src/judges/ui_state_judge.py · +24 −0 · 0.2s
mcp__benchmax__replay · test-124 · 42s
Pass rate 5/14 → 13/14 · Meets rubric

Bash · git commit -m "constrain VLM to pixel-grounded descriptions" · 0.1s
mcp__github__create_pr · 0.8s
PR #1847 opened · haven-copilot/support-vlm · ready for review
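
The OCR cross-check in this transcript is a common grounding technique: extract the text actually present in the screenshot, then flag any UI element the agent names that isn't in it. A toy version, with the OCR output stubbed in as a string (the real pipeline and names are hypothetical):

```python
# Toy OCR cross-check: flag UI labels the agent mentions that the image text doesn't contain.
# `ocr_text` stands in for real OCR output; everything here is illustrative.
def unsupported_labels(mentioned: list[str], ocr_text: str) -> list[str]:
    """Return the mentioned labels with no match in the OCR'd screenshot text."""
    haystack = ocr_text.lower()
    return [label for label in mentioned if label.lower() not in haystack]

ocr_text = "Notifications   Daily digest: On   Weekly summary: Off"
flags = unsupported_labels(["Daily digest", "Pause digest"], ocr_text)
# "Pause digest" is flagged: the agent invented a toggle that isn't on screen.
```

A real judge would fuzzy-match against OCR tokens rather than do exact substring lookup, but the failure mode it catches is the same.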

Reads your stack

Claude Code pulls the test, prompts, and traces itself. It works where your code lives.

Benchmark-aware

Claude Code re-runs the suite after every change. No PR opens until the rubric passes.

Full diff, full context

Every edit is a diff you can review. No black-box changes to your prompts.

Your rubric is the contract

The expected output your experts wrote is the only bar. Claude Code iterates until it meets it.

Integrations

Connect your traces in minutes.

Claude Code wires up any trace source, warehouse, or agent framework. One natural-language command and your data's in.

Claude Code · haven-copilot · connected
connect benchmax to our langfuse instance
Let me set up the benchmax integration with your Langfuse project.
Bash(benchmax init --source langfuse)
└ Detected Langfuse at langfuse.internal:3000
Authenticating...
Project: haven-copilot
Bash(benchmax traces sync)
└ Syncing traces from langfuse_production...
847 traces imported
12 issues flagged by AI review
Connected. 847 traces synced. 12 issues ready for review in your inbox.
Observability: Langfuse · Braintrust · Arize · Galileo
Data: Postgres · BigQuery · Snowflake · Databricks
Frameworks: Claude Agent SDK · OpenAI Agents · LangChain · Vercel AI SDK

Bring your own via HTTP API, JSONL dump, or one-line SDK install — benchmax ingests from anywhere your agent talks to users.
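
For the JSONL path, each line of the dump is one self-contained trace record. The schema below is purely illustrative, not benchmax's actual format:

```python
# Illustrative JSONL trace record; field names are hypothetical, not the benchmax schema.
import json

trace = {
    "trace_id": "tr_0001",
    "turns": [
        {"role": "user",
         "content": "My Daily digest emails stopped coming.",
         "attachments": ["settings.png"]},
        {"role": "assistant",
         "content": "Toggle Pause digest off and it will resume."},
    ],
}

line = json.dumps(trace)      # one trace per line in the dump
restored = json.loads(line)   # round-trips losslessly
```

Whatever fields your traces carry, the requirement is just one JSON object per line with the conversation and its attachments inline or referenced.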

Self-hosted available

Make AI quality
measurable.

Start turning real production issues into a benchmark your team actually trusts.