Your LLM is not a classifier

Introducing the Trarian Patent Invalidity Score — what we learned trying to grade patents with frontier AI, and what we had to build instead.

May 17, 2026

We’re excited to share Trarian’s first product: a 1–10 score that tells you how likely a US patent is to be invalidated. It’s trained on real court and patent-office decisions, it’s built for the people who buy, insure, and finance patents, and it does something that today’s most capable AI models on their own still can’t: it gives you the same answer every time, and the answer actually predicts what happens.

This post is also a case study for anyone trying to use AI as a classifier — whether you’re scoring patents, resumes, insurance applications, loan files, or anything else where the goal isn’t to write text but to put a defensible number on something. The short version: the best language models in the world are surprisingly bad at this on their own, and the fix isn’t a smarter prompt.

What’s actually at stake

A US patent is a 20-year monopoly. The government grants it after a patent examiner reads the application and decides the invention is new and not obvious. The trouble is that examiners spend, on average, less than 20 hours reviewing each patent, and the body of past inventions they’re supposed to compare it against is enormous. As a result, a huge share of patents that get granted shouldn’t have been — the examiner missed something, or the law moved, or the patent ended up covering things that were already known.

How big a share? A recent study in the Harvard Journal of Law & Technology looked at roughly 89,000 patents that ended up in litigation over two decades and found that about 4 in 10 are thrown out when a court reaches a decision on the merits. That rate has been remarkably stable over time. (There’s also a faster process at the patent office where a specialized board can throw patents out, and the rate is even higher there.)

A patent on paper, in other words, is not the same thing as a patent that will hold up. The gap between the two is the entire problem.

Why this matters as a business

If you finance, buy, sell, license, or insure patents, that 4-in-10 number is the one that has to clear your desk before money moves:

Litigation finance funds put hundreds of millions of dollars behind patent lawsuits each year. If the patent gets thrown out, the case collapses and the investment goes to zero.
Patent insurers sell policies on patent enforcement and validity. They need a defensible probability for every patent they touch.
Patent brokers and IP-backed lenders price assets that can be worth nine figures — but only if the patents hold up.
Operating companies and universities with large portfolios have to decide which patents to keep paying maintenance fees on, which to sell, and which to drop.

In every one of these workflows, the bottleneck is the same: figuring out whether a patent will hold up is slow, inconsistent, and expensive. A single patent can take a senior associate at a top law firm a week and cost tens of thousands of dollars, and two senior associates won’t always agree. There’s a real reason underwriters don’t run this analysis at scale today: the unit economics don’t work.

That bottleneck is exactly the kind of thing modern AI was supposed to clear.

The obvious AI answer — and why it doesn’t work

The natural reaction over the past two years has been to point a frontier AI model — Claude, GPT, Gemini — at a patent and ask “how strong is this, on a scale of 1 to 10?” We spent a lot of time inside that reaction. As a classification tool, it fails in two specific ways that should be familiar to anyone trying to do this in any other domain.

The answer changes when you change the prompt — even if you don’t change the question. We wrote a careful prompt: a rubric for each thing that can get a patent thrown out, a scale that defined what a 1, a 5, and a 10 should mean, and a strict output format. Then we ran a second prompt that was word-for-word identical except for one tiny change: the rubric sections appeared in a different order. Same patent. Same model. Same temperature setting (the setting that controls randomness — we had it pinned at zero).

About a quarter of patents — 26% for Claude Sonnet, 24% for Claude Opus — got a 1–10 score that moved by at least a full point just because the sections of the rubric were listed in a different order. That isn’t reliable enough to put in a deal memo. If your underwriting score depends on the order of the questions you asked, you don’t really have a score.

This isn’t a patent-specific quirk. Any time you use a language model to put a number on something (risk, fit, eligibility, severity, quality), the prompt is a hidden knob on the result. You only see one number out of the model, so you never notice that a small rewrite of your own prompt would have given you a different one.
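The order-sensitivity check described above is easy to reproduce on any scoring prompt. Here is a minimal sketch, assuming you already have two lists of scores for the same items, one per rubric ordering. The function name and the sample scores are made up for illustration and are not Trarian's data.

```python
def order_sensitivity(scores_a, scores_b, threshold=1.0):
    """Fraction of items whose score moved by at least `threshold`
    between two prompt variants (same items, same order in both lists)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    moved = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) >= threshold)
    return moved / len(scores_a)


# Illustrative scores for eight items under two rubric orderings (not real data).
rubric_order_1 = [4, 7, 3, 6, 5, 8, 2, 6]
rubric_order_2 = [4, 6, 3, 6, 5, 8, 3, 6]

# Two of the eight items moved by a full point, so this prints 0.25.
print(order_sensitivity(rubric_order_1, rubric_order_2))
```

If that fraction is far from zero at temperature zero, the number you are reporting depends on how the prompt was laid out as well as on the item being scored.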

The answer isn’t very predictive. We held out a set of patents where we already knew the real-world outcome — patents that had been through court and either survived or were thrown out — and asked: when the AI says a patent is weak, is it actually weak?

The standard way to measure that is a yardstick called AUC. AUC asks a simple question: pick, at random, one patent that was thrown out and one that survived. How often did the model give the loser a worse score than the winner? An AUC of 1.0 is perfect, 0.5 is a coin flip. On the held-out set, Claude Sonnet scored 0.594 and Claude Opus 0.601: both barely better than a coin flip, and well short of usable.
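That pairwise definition translates almost directly into code. The sketch below is for illustration only: it compares every invalidated patent against every surviving one and counts ties as half, which is the usual convention. The scores are strength scores (lower means weaker), and both the function name and the sample numbers are hypothetical.

```python
from itertools import product


def pairwise_auc(strength_invalidated, strength_survived):
    """Share of (invalidated, survived) pairs in which the invalidated patent
    received the lower strength score. Ties count as half a win."""
    pairs = list(product(strength_invalidated, strength_survived))
    if not pairs:
        raise ValueError("need at least one patent in each group")
    wins = sum(
        1.0 if lost < won else 0.5 if lost == won else 0.0
        for lost, won in pairs
    )
    return wins / len(pairs)


# Illustrative strength scores (not real data).
invalidated = [2, 3, 5, 6]
survived = [4, 6, 7, 9]

# 13.5 winning pairs out of 16, so this prints 0.84375.
print(pairwise_auc(invalidated, survived))
```

Comparing all pairs is quadratic in the number of patents, which is fine for a sanity check. Library implementations get the same number from a sorted ranking.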

This is the second thing that should be familiar from other domains. Big language models read brilliantly. They explain themselves articulately. They also score in a way that turns out, when you finally check, to be only loosely correlated with the thing you actually care about. The text is great. The number is mostly vibes.

What we built instead

Trarian uses AI — as a reader, not as a grader. The thing that produces the final 1–10 score is a more traditional kind of model: a statistical model trained on thousands of real patents with real outcomes, the kind of model that has been quietly powering credit scores and insurance underwriting for decades. The LLM’s job is to read each patent carefully and pull out the kinds of signals an experienced attorney would notice. The statistical model’s job is to weigh those signals against what has actually happened to similar patents in the past.

We look at four layers of every patent:

The claims — what the patent actually protects.
The specification — the technical description that’s supposed to support the claims.
The prior-art landscape — what came before, what cites the patent, what the patent itself cites.
The prosecution history — the back-and-forth between the inventor and the examiner that produced the patent.

The model learns the relationship between those signals and real validity decisions — a mix of patent-office rulings, ITC determinations, and federal-court decisions. It’s blind to the outcome at scoring time; it sees only what an underwriter sees on day one.

This combination, with the LLM as reader and a statistical model as grader, is the broader pattern we think anyone using language models for classification will end up at. It keeps what the LLM is good at (reading hard documents carefully, in volume) and replaces what it’s bad at (assigning a consistent, calibrated number) with something that was built for that job.
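To make the division of labor concrete, here is a toy sketch of the reader/grader split. It is not Trarian's model: the post does not say which statistical model or which features are used. Logistic regression stands in as one familiar scorecard-style choice, the two feature names are hypothetical, and `read_patent` is a stub where an LLM extraction step would go.

```python
import math


def read_patent(patent):
    """Stand-in for the LLM 'reader'. A real reader would extract these signals
    from the patent text; here they are precomputed, hypothetical fields."""
    return [patent["examiner_added_refs"], patent["claim_breadth"]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_grader(rows, outcomes, lr=0.1, epochs=2000):
    """The 'grader': logistic regression fit by plain batch gradient descent.
    Returns (weights, bias). Deterministic for a given training set."""
    w = [0.0] * len(rows[0])
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(rows, outcomes):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def p_invalid(model, patent):
    """Estimated probability of invalidation for one patent."""
    w, b = model
    x = read_patent(patent)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Made-up training rows: 1 = invalidated, 0 = survived (not real data).
training = [
    {"examiner_added_refs": 0.0, "claim_breadth": 0.9, "invalidated": 1},
    {"examiner_added_refs": 0.1, "claim_breadth": 0.8, "invalidated": 1},
    {"examiner_added_refs": 0.2, "claim_breadth": 0.7, "invalidated": 1},
    {"examiner_added_refs": 0.8, "claim_breadth": 0.3, "invalidated": 0},
    {"examiner_added_refs": 0.9, "claim_breadth": 0.2, "invalidated": 0},
    {"examiner_added_refs": 1.0, "claim_breadth": 0.1, "invalidated": 0},
]
model = fit_grader(
    [read_patent(p) for p in training],
    [p["invalidated"] for p in training],
)

weak = {"examiner_added_refs": 0.05, "claim_breadth": 0.85}
strong = {"examiner_added_refs": 0.95, "claim_breadth": 0.15}
print(p_invalid(model, weak), p_invalid(model, strong))
```

Once the signals are extracted, the number comes from fixed weights rather than from a prompt, so the same inputs always produce the same score.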

Trarian invalidity score vs raw Claude on the same patents

On the same held-out patents where raw Claude scored 0.594 and 0.601, Trarian scores 0.791 on AUC — a meaningful step up from the frontier LLMs and well into territory that is useful to underwrite against. But the clearer way to read the model is to look at how it sorts patents from strongest to weakest.

Top decile vs bottom decile — the headline result

The most useful test of an underwriting score is also the most intuitive one. Rank patents from strongest to weakest, slice the ranking into ten equal buckets, and look at the actual invalidity rate in each bucket. A good score makes the bottom bucket look very different from the top.
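The bucketing procedure just described can be sketched in a few lines. This is an illustration under simple assumptions: each patent is a `(strength_score, was_invalidated)` pair, buckets are cut by rank, and the twenty sample patents are invented.

```python
def decile_invalidity_rates(patents, n_buckets=10):
    """patents: list of (strength_score, was_invalidated) pairs, with
    was_invalidated as 0 or 1. Returns the invalidity rate in each bucket,
    ordered from the strongest bucket (index 0) to the weakest."""
    if len(patents) < n_buckets:
        raise ValueError("need at least one patent per bucket")
    ranked = sorted(patents, key=lambda p: p[0], reverse=True)
    n = len(ranked)
    rates = []
    for i in range(n_buckets):
        bucket = ranked[i * n // n_buckets:(i + 1) * n // n_buckets]
        rates.append(sum(inv for _, inv in bucket) / len(bucket))
    return rates


# Twenty invented patents, strongest score first (not real data).
outcomes = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
patents = list(zip(range(20, 0, -1), outcomes))

rates = decile_invalidity_rates(patents)
print(rates)                 # invalidity rate per bucket, strongest to weakest
print(rates[-1] - rates[0])  # spread between the weakest and strongest buckets
```

With only two patents per bucket the middle of this toy table is noisy. That is why the spread between the ends of the ranking is the more stable thing to read.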

On the cohort we tested on (1,296 patents), the strongest 10% of patents — what we call strength 10 — were thrown out 7.7% of the time. The weakest 10% — strength 1 — were thrown out 87.7% of the time. That’s an 80-percentage-point gap between the top and bottom of our score.

That gap is the product. It’s what makes the score worth running across a portfolio. And the consistency is what makes the score worth signing under. A 6 means the same thing every time. A 6 in March means the same thing as a 6 in November. A 6 on a software patent means the same expected outcome as a 6 on a hardware patent. The historical rate inside each band is published and reproducible.

A worked example — US Patent 7,068,684

Here’s how it reads on a real patent. US Patent 7,068,684 covers “quality of service in a voice over IP telephone system” — a 2001 invention on keeping voice calls smooth when they travel over the internet.

Single-patent score with attributable drivers

The model gives the patent a final strength of 4 out of 10, which places it in the 7th bucket from the top (bucket 1 is the strongest). Patents that land in the same band in our training set were thrown out about 68% of the time historically.
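The relationship between strength, bucket, and historical rate can be written down as a small lookup. This is only a sketch: the function and field names are hypothetical, and the table holds just the three rates quoted in this post (which come from different cohorts), with every other band left empty rather than guessed.

```python
def bucket_from_top(strength, n_buckets=10):
    """Strength 10 is bucket 1 (strongest); strength 1 is bucket 10 (weakest)."""
    if not 1 <= strength <= n_buckets:
        raise ValueError("strength must be between 1 and n_buckets")
    return n_buckets + 1 - strength


# Only the rates stated in the post. Strength 10 and strength 1 are from the
# 1,296-patent test cohort; strength 4 is the approximate training-set figure.
HISTORICAL_INVALIDITY_RATE = {10: 0.077, 4: 0.68, 1: 0.877}


def score_report(strength):
    """Bundle a strength score with its bucket and, where quoted, its rate."""
    return {
        "strength": strength,
        "bucket_from_top": bucket_from_top(strength),
        "historical_invalidity_rate": HISTORICAL_INVALIDITY_RATE.get(strength),
    }


print(score_report(4))
```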

We don’t expose the full feature list, but the top drivers for this patent tell a coherent story:

The examiner added essentially no prior art of their own during prosecution. That usually means a light search by the examiner — and when a patent has been examined lightly, more invalidating prior art tends to surface later.
The references that were cited are themselves heavily cited by other patents — the patent sits in a busy, well-mapped technical field where prior art has been worked over thoroughly.
The patent family is small and the forward-citation pattern looks like a contested, fast-moving space.
On the strengthening side, the legal-doctrine signals are moderate, not severe — the major grounds for throwing the patent out are present but not damning.

A 4 doesn’t mean the patent will be invalidated. It means it sits in a band where, historically, the majority have been. That’s the bet an underwriter can size.

In this particular case, the underwriter’s bet would have paid off. The ’684 patent was indeed invalidated: the patent office cancelled the challenged claims, and the Federal Circuit affirmed in In re Estech Systems IP, LLC, No. 24-1935 (Fed. Cir. Dec. 23, 2025). A single case never validates a model. But it’s the kind of outcome the score is built to flag in advance.

Who this is for

The bottleneck is the same everywhere: fast, consistent triage of patent assets.

Litigation finance funds — rank a docket or a target’s portfolio in hours, not weeks.
Patent brokers — assess transaction quality before pricing.
Insurers — underwrite patent infringement, validity, and enforcement policies on a consistent rubric.
Portfolio owners — prune, sell, or license with a defensible quality signal.
Lenders — collateralize patent assets without rebuilding the analysis on every deal.

How we work with clients

Two engagements:

Patent ranking and quality assessment — calibrated scores at scale across a portfolio, a docket, or a target list. Delivered with the score, the bucket, the historical rate in that bucket, and the top drivers in plain language.
Prior art search reports — litigation-grade prior-art and invalidity analysis delivered at the quality of traditional human searches and substantially faster.

If you have an underwriting workflow that is bottlenecked by slow patent assessment, we’d like to talk.

Caveats

The score is a ranking tool calibrated to historical outcomes. It is not a guarantee of future results. In any small sample you will see patents in the strongest bucket that turn out invalid and patents in the weakest bucket that turn out valid. We classify patents into more or less likely to be invalidated — not into “valid” and “invalid.” A 1 is not a verdict; an 8 is not a free pass. The point is consistency: a 6 means the same thing every time, every patent, every analyst — and that is the thing an underwriting model has to deliver.

Key takeaways

Frontier LLMs are excellent readers and unreliable graders. Reorder the rubric sections in an otherwise identical prompt and, for about a quarter of patents, the score moves by a full point or more.
Even when you average across runs, the AUC sits barely above a coin flip on real validity outcomes. That isn’t enough to underwrite against.
The pattern that works — for patents and, we think, for most classification problems — is to let the LLM read and let a calibrated statistical model score. The Trarian invalidity score gets to AUC 0.791 and an 80-point spread between the strongest and weakest deciles using that division of labor.
The same 1–10 score means the same thing every time it is produced. That consistency is what makes a number usable inside an underwriting, finance, or insurance workflow.

If you’re an operator in the patent space — a fund, a broker, an insurer, a lender, or a portfolio owner — and you’re interested in scoring patents quickly and consistently, we’d love to chat.

And Yet It Moves
