Build your own AI quality benchmark.
Product and domain experts define what good looks like. Claude Code makes your AI meet the benchmark.
Example skill: "Correctly identify UI state from user screenshots"
Why benchmax
Encode your taste into a benchmark you can improve against.
Observability tools were built for engineers reading JSON traces. benchmax is built for the people who actually know your product.
Built for your domain experts
Your legal, finance, media, or product experts drive the benchmark. Not just your engineers.
See everything your agent sees
Conversations, videos, images, and documents — the way your users experienced them. Not JSON trees.
Coding agents ship the fix
Claude Code and Cursor replay failing tests, iterate against the benchmark, and push the PR. No engineering tickets in the middle.
Surface the patterns your team needs to fix.
Turn production failures into issues grouped by pattern. Your team reviews what matters, not thousands of traces.
Skills
Plain-English judges that run continuously across your production traces.
Clustering
Similar failures collapse into one issue, ranked by severity and frequency.
Full context
Screenshots, videos, and attachments travel with every issue.
Shared inbox
PMs and domain experts triage alongside engineers. No code required.
Define what "right" looks like.
Encode your taste in plain English. benchmax turns your words into a custom grader that runs on every release.
Example issue: "Describes non-existent UI in attached screenshots." The agent describes UI elements, buttons, or states visible "in the screenshot" that aren't actually in the image.
Plain-English rubrics
Write the bar in the words your experts already use. benchmax turns it into a grader.
Anchored in real failures
Every test is tied to a real conversation, screenshot, or video. No hypotheticals.
Cluster-wide coverage
Pick which cases in the cluster this rubric grades. One definition, the whole class of bug.
Versioned and reviewable
Rubric changes are tracked like code. Your team can see how the bar evolved.
A test suite built from real issues.
Every confirmed issue becomes a permanent test across text, images, videos, and docs. The suite runs against every PR.
One-click test creation
One click turns a confirmed issue into a permanent test. No YAML, no PR review.
Multi-modal tests
Tests include the original screenshot, video, or document context — not just text.
Auto-generated graders
Rubric graders written from your domain experts' feedback. Editable when you need precision.
Grouped by intent
Tests cluster into categories your team defines. Track patterns, not one test at a time.
Close the loop with Claude Code.
Hand any failing test to Claude Code. It reads your prompts, your rubric, and your traces, proposes a fix, and re-runs the benchmark before opening a PR.
Example: given the failing test at https://app.benchmax.io/t/haven-copilot/test-124, Claude Code responds: "I'll tighten the prompt and add an OCR cross-check in the judge so the grader can flag any UI element not present in the image."
Reads your stack
Claude Code pulls the test, prompts, and traces itself. It works where your code lives.
Benchmark-aware
Claude Code re-runs the suite after every change. No PR opens until the rubric passes.
Full diff, full context
Every edit is a diff you can review. No black-box changes to your prompts.
Your rubric is the contract
The expected output your experts wrote is the only bar. Claude Code iterates until it meets it.
Connect your traces in minutes.
Claude Code wires up any trace source, warehouse, or agent framework. One natural-language command and your data's in.
Or bring your own traces via HTTP API, JSONL dump, or a one-line SDK install; benchmax ingests from anywhere your agent talks to users.
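As a rough illustration of the JSONL path, here is a minimal sketch in Python. The endpoint URL, auth header, and payload shape are assumptions made for the example; this page does not document the real ingest API.

# Minimal sketch, assuming a hypothetical ingest endpoint and bearer-token
# auth; none of these details are documented benchmax specifics.
import json
import requests

API_KEY = "YOUR_API_KEY"                          # placeholder credential
ENDPOINT = "https://app.benchmax.io/api/traces"   # assumed ingest URL

with open("traces.jsonl") as f:
    for line in f:
        trace = json.loads(line)                  # one trace per JSONL line
        resp = requests.post(
            ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=trace,
            timeout=10,
        )
        resp.raise_for_status()                   # fail fast on bad ingest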
Self-hosted deployment is available.
Recent updates
How PMs Can Build and Maintain High-Quality AI Evaluation Sets
The common mistakes AI PMs make in managing their evaluation sets, and how to fix them.
How Product Managers Can Write Evals
Evals are the most critical skill for PMs building AI products. But we've been treating them like a coding problem.
Make AI quality measurable.
Start turning real production issues into a benchmark your team actually trusts.