| Level | Credibility for buyers | Good for | Key limitations | Best for |
|---|---|---|---|---|
| Level 1: One tool vs its own internal benchmark | Low | Monitoring and improving a product | Not useful for comparing vendors; easy to overfit to your own benchmark | Builders |
| Level 2: Multiple tools vs the same benchmark | Medium | Apples-to-apples comparisons | Results are only as good as the benchmark | Comparisons |
| Level 2a: Level 2 + humans scored on the same rubric | Medium | Adding a human comparison point | Value depends on which lawyers were tested and how | Headlines |
| Level 3: Humans using Tool A vs Tool B vs no tool | High | Measuring the human + tool system | Harder and more expensive to run well | Adoption decisions |
| Level 4: AI vs real lawyer work product | Highest (but rare) | Testing against reality | Confidentiality makes it hard to run and publish | Gold standard |
TL;DR
There are four-ish levels of testing legal AI. This piece discusses their strengths and limitations, and ranks them in order of credibility.
- Level 1 (one tool vs its own internal benchmark): great for monitoring and improving a product; mediocre for telling how well a tool works in the wild; not useful for comparing vendors.
- Level 2 (multiple tools vs the same benchmark): can be decent for comparisons (to the extent the measurements are valid); not necessarily great for understanding how well a tool works in practice; the benchmark is critical: if it’s flawed, the results are flawed too.
- Level 2a (Level 2 + humans scored on the same rubric): adds a human comparison point; makes for great headlines, but the value depends heavily on which lawyers were tested, and how they were measured.
- Level 3 (humans doing work manually vs humans using Tool A vs humans using Tool B): often the best proxy for adoption today, because most AI is not accurate enough on its own for most legal work.
- Level 4 (tests AI against actual lawyer work product or a gold standard): the most credible measure; rare because confidentiality issues are often insurmountable.
There’s never been more legal AI. And along with that, there’s now a steady stream of legal AI tests. Some are genuinely useful. Some are closer to marketing.1 Many are well-intentioned, but they measure “how did the tool do on this benchmark,” not “how will this tool work for me in practice.”
This post is for lawyers trying to decide: Is this test telling me something I can rely on? It can also be useful for legal AI builders trying to create credible proof that their tool works, for an audience that is (1) skeptical and (2) unusually skilled at asking hard questions about evidence.
Why I’m worth reading on this

I’ve been building and evaluating legal AI for 15 years, including all four levels described here.
I co-founded and led Kira Systems, likely the leading legal AI company when we sold it in 2021, and have kept working on legal AI at Zuva since then. That’s meant being involved in a lot of evaluation work across the spectrum: internal benchmarking, head-to-head comparisons against humans, comparisons against other tools, and, occasionally, tests against real work product.
Level 1: Internal benchmark
Definition: A vendor tests its system against its own benchmark.
This is how products get built. You need to know how accurate your system is in order to make it better. Without this, you’re basically doing engineering by vibes.
Buyers of legal AI almost always ask “how accurate is your system?” and you need to be able to answer, even if buyers often discount those answers.
Mini-example:
A vendor runs a regression suite every release: 1,000 commercial agreements, 25 clause/field types, scored for recall and precision. A new release improves assignment extraction but regresses on change of control. Internal testing catches it before customers do.
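For readers who want the mechanics of “scored for recall and precision” spelled out, here is a minimal sketch. The clause types, counts, and version labels are invented for illustration; no real regression suite looks exactly like this.

```python
# Minimal sketch of per-clause-type regression scoring.
# All clause types, counts, and version labels are illustrative.

def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical (true positive, false positive, false negative) counts
# for two releases on the same 1,000-agreement suite.
results = {
    "assignment":        {"v1": (180, 30, 40), "v2": (200, 25, 20)},  # improved
    "change_of_control": {"v1": (150, 20, 25), "v2": (120, 18, 55)},  # regressed
}

for clause, releases in results.items():
    for version, (tp, fp, fn) in releases.items():
        p, r = precision_recall(tp, fp, fn)
        print(f"{clause:>18} {version}: precision={p:.2f} recall={r:.2f}")
```

Run every release, this kind of scoring is what catches the change of control regression before customers do.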
Strengths of Level 1
- It gives you a consistent indication of how accurate your tech is (as you define “accurate”).
- It catches regressions.
- It lets you answer “how accurate is your tech?” questions before you have more externally credible evidence.
Limitations of Level 1
- It is only informative to outsiders if they trust the vendor’s benchmark design and scoring.
- It’s susceptible to overfitting: sometimes you just get good at the benchmark you regularly test against, without improving at the underlying task.
- It’s not comparable across tools. Tool A’s “90%” and Tool B’s “85%” can be different tasks, datasets, and scoring.
- It measures how the AI performs, not how a human using the AI performs. Currently, the latter is often more important because many AIs are not yet accurate enough to use without oversight.
- There’s an inherent tension between using a benchmark for internal improvement and using it for external marketing, even when vendors tell themselves the former comes first.
Bottom line: Level 1 is essential if you’re building. If you’re buying, don’t take Level 1 results too seriously. It can be a marketing claim wearing a lab coat.

Level 2: Shared benchmark (multiple tools, same test)
Definition: Multiple tools run the same benchmark and are scored the same way.
This enables comparisons. But now you have a new load-bearing assumption: The benchmark and scoring rubric reflect reality.
They may not.
Mini-example:
Multiple tools are tested at creating “contract risk summaries.” The testers grade summaries against a checklist. Tool A writes a long answer that hits every checklist item and some additional opinions. Tool B writes a shorter answer that an in-house lawyer would actually paste into an email to the business team, but it omits three rubric-required details. Tool A “wins.” In practice, Tool B’s output might be more useful.
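To make the verbosity problem concrete, here is a toy illustration of checklist scoring. The checklist items and summaries are invented; real rubrics are more elaborate, but the failure mode is the same.

```python
# Toy illustration of how a checklist rubric can reward verbosity.
# The checklist items and summaries are made up.

checklist = ["indemnity cap", "exclusivity", "change of control",
             "governing law", "auto-renewal"]

tool_a_summary = ("Long answer covering indemnity cap, exclusivity, change of "
                  "control, governing law, auto-renewal, plus three paragraphs "
                  "of additional opinions nobody asked for.")
tool_b_summary = ("Short answer an in-house lawyer would send: indemnity cap "
                  "and exclusivity are the real issues on this deal.")

def rubric_score(summary):
    # One point per checklist item mentioned, regardless of usefulness.
    return sum(item in summary.lower() for item in checklist) / len(checklist)

print("Tool A:", rubric_score(tool_a_summary))  # 1.0 -- "wins"
print("Tool B:", rubric_score(tool_b_summary))  # 0.4 -- "loses"
```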
Strengths of Level 2
- You can do apples-to-apples comparisons (at least within the world defined by the benchmark).
- It creates pressure for transparency and repeatability (assuming the methodology is actually disclosed).
- It’s a reasonable way to track broad progress over time, though overfitting is a real risk: tools can get better at the test without getting better at the underlying task.
Limitations of Level 2
- The benchmark can be wrong or misaligned with what lawyers actually value. If the rubric rewards verbosity, you learn who is verbose.
- Even if the benchmark is good, it still usually measures the AI in isolation. If the AI is not accurate enough to use without oversight, the benchmark misses how the tool performs in the workflow that most lawyers will actually use.
- Benchmarks often struggle with “commercially reasonable brevity.” A rubric can easily over-penalize outputs that a lawyer would consider perfectly fine.
- Benchmarks are often inflexible by design. That helps repeatability, but it can make them brittle and gameable.
- Benchmarks are susceptible to overfitting. If vendors repeatedly test and tune against the same benchmark, they can get very good at that specific test without improving at the underlying task.
- Broad benchmarks can implicitly favor generalist tools over narrow ones. A best-in-breed tool that does one thing very well may not even get measured in a broad test.
- Benchmarks that don’t include general-purpose baselines (GPT/Claude/Gemini) make it hard to tell whether specialized legal AI is actually differentiated, which matters given the price difference.
- Benchmarks that treat all errors equally miss an important reality: a missed governing law clause is not the same as a missed exclusivity clause.
Bottom line: Level 2 is better than Level 1 for comparisons, but the benchmark can define “good” in a way that doesn’t match practice. And if the AI isn’t accurate enough to use without human review, measuring the AI in isolation misses an important part of the story.
Level 2a: Level 2 + a human comparison point
Definition: Human lawyer(s) do the same tasks as the AIs and are scored on the same rubric.
This is popular because it answers the question everyone is really asking: “Is it as good as a lawyer?” It also makes for great headlines. But measuring properly is hard: the results depend heavily on which lawyers were tested, how many, and how they were measured.
Mini-example:
A study includes a “human baseline” consisting of multiple offshore reviewers, but only one reviewer on any given task. The writeup is vague on who the reviewers actually were. The AIs score higher. The headline becomes “AI beats lawyers.” Serious lawyers read “offshore reviewers” and discount the conclusion.
Strengths of Level 2a
- It makes for good headlines.
- It prompts lawyers to honestly assess their own accuracy, which is often lower than they’d expect.
- It suggests areas where tools are “good enough” for summaries, first drafts, research scaffolds, or issue spotting.
Limitations of Level 2a
- Human baselines are highly dependent on which lawyers were tested, how many, and how they were measured.
- Human accuracy can vary a lot, across individuals and across days. (AIs tend to be more consistent, which counts in their favor.)
- Even well-designed studies can conflate “rubric compliance” with “real-world usefulness.”
- There’s an instruction asymmetry problem: AI systems are often given carefully tuned prompts, while humans can just be told to do the task. That can skew results.
Bottom line: Testing AI against human lawyers is a reasonable thing to do. Level 2a can be useful, but it inherits Level 2’s benchmark assumptions and adds the complexity of human selection and variance.
Level 3: Workflow tests (humans using the tool)
Definition: Compare outcomes when people do the work manually, with Tool A, and with Tool B, measuring things like time spent and accuracy of the final work product.
This is the level lawyers should care about most right now, because it matches reality. For many legal tasks, AI outputs are not accurate enough to use as-is. For example, according to the Vals Legal AI Report, the best-performing tools tend to be 70-80% accurate on most tasks.2 So lawyers primarily use AI outputs as a starting point, reviewing, correcting, and improving along the way.
So the real question usually isn’t “How good is the raw AI output?” It’s: “How good is the human + tool system at producing the final work product?”
Level 3 measures this.
Mini-example:
Tool A has 82% raw extraction accuracy. Tool B has 76% raw extraction accuracy. But Tool B makes verification much faster (clear locators, better review workflow, fewer “where did this come from?” moments). Lawyers using Tool B finish review faster with higher final accuracy. For that workflow, Tool B wins.
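Here’s a toy version of that arithmetic, with invented numbers, to show why the human + tool outcome can diverge from raw accuracy:

```python
# Hypothetical Level 3 comparison: what matters is the human + tool outcome,
# not the raw extraction accuracy. All numbers are illustrative only.

workflows = {
    "Tool A": {"raw_accuracy": 0.82, "review_minutes_per_doc": 14, "final_accuracy": 0.96},
    "Tool B": {"raw_accuracy": 0.76, "review_minutes_per_doc": 9,  "final_accuracy": 0.98},
}

docs = 500
for name, w in workflows.items():
    total_hours = w["review_minutes_per_doc"] * docs / 60
    print(f"{name}: raw={w['raw_accuracy']:.0%}, "
          f"final={w['final_accuracy']:.0%}, review time={total_hours:.0f} hours")
# Tool B "loses" on raw accuracy but wins on the measures lawyers actually live with.
```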
This is also the right way to think about UI. Not “the UI is nice,” but: does it help users produce better work product in less time? Some evaluation frameworks include subjective UI scoring, which is better than ignoring workflow entirely, but Level 3 gets at it more directly. If workflow matters, measure it by outcomes, not vibes. (That said, it’s also good if the UI is nice!)
Strengths of Level 3
- It matches how many lawyers actually use these tools today (AI plus verification).
- It measures what actually matters: how well users produce the final work product.
- It naturally includes the “UI matters” point, without resorting to subjective UI scoring.
Limitations of Level 3
- It is harder and more expensive to run well.
- You need to control for human variance (some reviewers are just better).
- Learning effects matter: people get better at using a tool over time, which can distort results if you don’t design the study carefully.
- It still depends on a benchmark or gold standard to score final outputs, so benchmark-design problems don’t disappear.
Bottom line: For adoption decisions, Level 3 is often the best proxy for reality.
Level 4: Gold standard / real work product
Definition: Test against real work product (or a gold standard derived from it).
Lawyers generally treat lawyer work product as the gold standard. Yes, some teams are better than others, and there’s variability from matter to matter. Still, lawyer work product is a fundamental output of the profession, and it is usually produced to a very high standard.
So if an AI, or an AI-supplemented process, is meant to be an addition or alternative to how things are done today, the best thing to measure it against is real work product. And, of course, the higher-quality the lawyers who produced that work product, the more credible the comparison.
The problem is practical: Biglaw work product tends to be highly confidential. So it rarely becomes a usable test set, and almost never a publishable one.
The closest public Level 4 benchmarking is likely in eDiscovery / TAR, where evaluation norms became unusually serious. That happened partly because the community is measurement-oriented, includes many academics, and has benefited from substantial research attention and funding. Outside of eDiscovery and private tests (law firms and companies testing in-house and not talking about it publicly), Level 4-style comparisons are rare, and public results are rarer still.
Mini-example:
A general counsel works with an AI vendor to re-run elements of diligence reviews originally done by reputable Biglaw firms, comparing outputs against the actual diligence memos and disclosure schedules produced for those deals.
Strengths of Level 4
- It tests against reality.
- It’s hard to game: there’s no public benchmark to optimize against.
Limitations of Level 4
- It’s hard to line up. Confidentiality and privilege are real blockers.
- It’s expensive. Creating and validating a true gold standard takes serious expert time.
- Even when you can run the test, sharing results publicly is hard: the underlying documents are confidential.
Bottom line: Level 4 is the gold standard. It’s rare because it requires access to real work product, a willing partner, and careful design. When it does happen, it’s worth paying attention to.
Common pitfalls (across all levels)
The following issues can affect tests at any level.
Recall vs precision vs “one number”
Lawyers often care more about recall than they admit. Missing the one clause that matters is usually worse than flagging a few extra items that get cleaned up in review. That doesn’t mean precision doesn’t matter. It does mean you should be skeptical of studies that report a single blended metric without showing the underlying recall/precision tradeoff.
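A quick illustration of why a single blended metric can mislead: the F1 score below treats precision and recall symmetrically, so two tools with very different miss rates can land in roughly the same place. The numbers are invented.

```python
# Two hypothetical tools with similar blended scores but very different recall.

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

tool_a = {"precision": 0.95, "recall": 0.70}  # misses more clauses
tool_b = {"precision": 0.75, "recall": 0.92}  # flags more noise, misses less

for name, m in [("Tool A", tool_a), ("Tool B", tool_b)]:
    print(f"{name}: F1={f1(m['precision'], m['recall']):.2f} "
          f"(precision={m['precision']:.2f}, recall={m['recall']:.2f})")
# Both land around F1 of 0.81-0.83, but a recall-sensitive reviewer
# would likely prefer Tool B for first-pass review.
```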
Overfitting against the test set
If you let a tool take multiple runs at the same documents and iteratively adjust prompts or settings based on observed misses, you can end up measuring “how well did we learn this test set” rather than “how well does this generalize.”
This is especially easy to do with GenAI prompting. You run prompt version 1, score it, tweak the prompt, run version 2, score it, and so on. Eventually you have a prompt that is great on those documents. That does not mean it will be great in production.
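One simple guard, sketched below with hypothetical names: keep a holdout set of documents that prompt tuning never touches, and report only the holdout numbers.

```python
# A minimal guard against "learning the test set": tune prompts on one split,
# report results only on documents the prompts never saw. Names are illustrative.
import random

random.seed(42)
documents = [f"agreement_{i:04d}" for i in range(1000)]
random.shuffle(documents)

dev_set = documents[:300]      # iterate on prompts here, as often as you like
holdout_set = documents[300:]  # score each prompt version here at most once

def evaluate(prompt_version, docs):
    """Placeholder: run the tool with this prompt over docs and score the output."""
    ...

# evaluate("prompt_v7", holdout_set)  # the number you report
```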
Dataset contamination
If you test on a dataset that is widely circulated, there’s an increased chance that modern models have seen it during training or fine-tuning. CUAD is an example: a well-known contract dataset that has likely made its way into many models’ training data. Testing on training data doesn’t tell you how a tool will perform on unfamiliar documents.
Thanks to Dr. Adam Roegiest for his comments on this piece. Mistakes are mine though. I used GPT and Claude as writing companions in this piece. ↩︎
The Vals report was released in February 2025, and the underlying models have improved since then. That said, our sense is that legal AI systems still generally fall short of the accuracy threshold most lawyers would require to rely on outputs without human review or improvement. ↩︎