Where Generative AI Is Good in Contract Data Extraction … And Where It Isn’t
TL;DR: GPT-4 is impressive overall but, on contract review tasks, it’s inconsistent and makes mistakes. It’s probably not yet ready as a standalone approach if predictable accuracy matters.
Back in 2011, when we started Kira Systems, other contract review software companies were using rules- or comparison-based approaches to finding data in contracts. We used supervised machine learning to find information in contracts, and did well against our competitors. Eventually, most of the world’s leading law and audit/consulting firms (and a bunch of corporates) became Kira customers. In recent months, Generative AI solutions have become all the rage. Are they going to supplant other machine learning approaches in contract analysis, just as machine learning approaches beat out rules?
We thought it would be useful to drill into this. Our thoughts on the subject come in three parts:
- How Large Language Models perform on contract analysis tasks. Given GPT-4’s easy accessibility, we’ll discuss this with examples from it.
- The advantages and disadvantages of LLMs versus other tech.
- Where we think LLMs can be especially helpful in contract review, and where contracts AI tech is headed.
Throughout, we use “Large Language Models,” “LLMs,” and “Generative AI” interchangeably. We recognize that this isn’t a great equivalence. Generative AI uses Large Language Models, but not every Large Language Model is generative. Nonetheless, we think the shorthand works well enough here.
This piece will cover how LLMs do at contract analysis. Specifically, we tested GPT-4 on a number of contract analysis challenges.
This is a long piece, so here’s a table of contents in case you would like to skip parts:
- Why We’re (Not?) Worth Reading On This Topic
- How GPT-4 Does At Identifying Data In Contracts
Before we get to our evaluation, let’s cover why you might find our perspective on this helpful.
Why We’re (Not?) Worth Reading On This Topic
On the “worth reading” side, we have a lot of experience in contracts AI, and ⅔ of us have computer science PhDs.
- Noah was a corporate lawyer, then co-founded leading machine learning contract analysis company Kira Systems in 2011, and was its CEO until its sale to Litera in 2021. He’s now Zuva’s CEO. He has been involved in contract analysis AI longer than most.
- Adam joined Kira in 2016 as a Research Scientist, having finished his PhD in computer science at the University of Waterloo (where he worked with Professor Gordon Cormack, a longtime leader in eDiscovery). At Zuva, he leads our Research and Product Development teams.
- Sam joined Kira in 2018, after finishing his computer science postdoctoral work. He is currently a Senior Research Scientist at Zuva. His work has included a focus on differential privacy.
On the “not worth reading” side, perhaps:
- Our experience prevents us from understanding a completely different present and future. We would like to think we’re open minded, but this is possible.
- We are biased because we have a horse in the race: Zuva sells contracts AI software, and maybe we are trying to defend how we do things, as opposed to looking at the space clearly.
At Zuva, our mission is to make it dead easy to use the world’s best contracts AI. Frankly, we don’t really care about the technical approach we take to get there. Supervised ML, LLMs, even rules if appropriate - whatever. We just would like to (1) build great contracts AI and (2) make it dead easy to use. We think doing those things has the potential to get us to a good outcome. Also, even if we are biased, bias can sometimes drive crisper thinking on an issue. This is basically how the adversarial legal system works.
How GPT-4 Does At Identifying Data In Contracts
Adam, Noah, and Dr. Alexander Hudek (Kira and Zuva’s co-founder) have been testing ChatGPT on contract analysis tasks since ChatGPT’s release, but we tarried in writing up our findings. With GPT-4’s release, we thought the time was right to share what we have learned.
When we first started playing with ChatGPT, we were pretty wowed.
GPT-3.5 was very impressive, and GPT-4 (on limited testing) seems even better. We think these can be really helpful for the right use cases. However, we now have a more nuanced view of how GPT-4 performs on contract review tasks.
Due to (1) the amount of attention on GPT-4 and (2) how easy it is to try ChatGPT, we are going to discuss LLM performance with examples from GPT-4. We know there are other Large Language Models out there, including some that have been specifically trained on legal documents. It’s very likely they perform differently, but it’s hard to say whether that means they are better or worse.
Let’s get into some examples.
Header Detection?
One really tricky thing about doing accurate post-signature contract analysis is that wordings can be non-standard. Sometimes documents come in the form of poor quality scans. At other times, they are drafted in atypical ways.
To measure how GPT-4 performed on slightly altered wording, we first gave GPT-4 some contract clauses (copied from a contract filed on EDGAR) and asked it to identify what clauses were there.
While we could have done without the summary for this use case, GPT-4 accurately identified the clauses. And the summary was well done (and much more impressive than the GPT-3.5 one).
But that’s a pretty easy test. To spice things up, we ran some new contract text and changed the headers to incorrect contract terms. Here is the original contract segment:
- 7.1 Indemnification. VMware shall, at its expense, defend Distributor against and pay all costs and damages including reasonable attorneys’ fees made in settlement or finally awarded against Distributor resulting from any claim, action or allegation brought against Distributor, that a Software Product infringes any copyright or trademark of a third party in the United States, Japan or European Community (“Infringement Claim”); provided that, as conditions of VMware’s obligation so to defend and pay, Distributor: (a) promptly notifies VMware in writing of any such Infringement Claim; (b) gives VMware sole control of the defense of any such Infringement Claim and any related negotiations or settlement; and (c) gives VMware the information and assistance necessary to settle or defend such Infringement Claim. If it is adjudicatively determined, or if VMware reasonably believes that any Software Product infringes any third party copyright or trademark, then VMware may, at its option and expense: (i) modify the Software Product or infringing part thereof to be reasonably equivalent and non-infringing; (ii) procure for Distributor a license to continue distributing the Software Product or infringing part thereof; (iii) replace the Software Product or infringing part thereof with other comparable products; or (iv) terminate Distributor’s rights hereunder with respect to the infringing Software Product.
- LIMITATION OF LIABILITY. VMWARE’S LIABILITY UNDER THIS AGREEMENT, REGARDLESS OF THE FORM OF ACTION, WILL NOT EXCEED ONE HUNDRED PERCENT (100%) OF THE AMOUNTS PAID UNDER THIS AGREEMENT BY DISTRIBUTOR TO VMWARE DURING THE PREVIOUS TWELVE (12) MONTH PERIOD FOR THE PRODUCTS GIVING RISE TO THE LIABILITY. NEITHER PARTY WILL BE LIABLE FOR ANY SPECIAL, INDIRECT, CONSEQUENTIAL OR INCIDENTAL DAMAGES ARISING OUT OF THIS AGREEMENT, WHETHER OR NOT SUCH PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES, AND NOTWITHSTANDING ANY FAILURE OF ESSENTIAL PURPOSE OF ANY LIMITED REMEDY.
- 9.1 Governing Law. The rights and obligations of the parties under this Agreement shall not be governed by the 1980 U N Convention on Contracts for the International Sale of Goods. This Agreement will be governed by the laws of the State of California and the United States of America, without regard to conflict of law principles. The parties hereby consent to the non-exclusive jurisdiction of the state and federal courts located in Santa Clara County, California, for resolution of any disputes arising out of this Agreement. Either party may seek injunctions to prevent and/or stop any breach of, and otherwise enforce, the provisions of Section 6, and VMware may seek injunctions to prevent and/or stop any infringement of, and otherwise enforce, its intellectual property rights of whatever nature, in the courts of any country, state or other territory which accepts jurisdiction.
- 9.2 Assignment. This Agreement and any rights or obligations of Distributor hereunder may not be assigned, sub-contracted or otherwise transferred by Distributor without VMware’s prior written consent. Subject to the preceding sentence, this Agreement shall be binding upon and inure to the benefit of the parties’ permitted successors and assigns.
- 9.3 Indemnification by Distributor. If VMware should incur any liability to a third party caused by the non-performance of Distributor of any of its obligations under this Agreement, or resulting from any act or omission of Distributor, or if VMware incurs any liability to a third party by reason of acts of Distributor in marketing or distributing the Products, Distributor agrees to indemnify and hold VMware free and harmless from any such liability, and from all loss, claims, costs, demands, debts, and causes of action in connection therewith, including reasonable attorney’s fees.
Here’s what we pasted into GPT-4, and what we got in return:
From the look of it, it seems like GPT-4 identified contract clauses with a header-detection approach. This might work sometimes, but didn’t here. We would not recommend it if you need accurate contract clause identification. Headers only sometimes describe clause contents.
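To make the failure mode concrete, here is a minimal sketch of what a purely header-based labeler looks like. We are not claiming this is how GPT-4 works internally. The keyword map, the `label_by_header` function, and the swapped "Assignment" header are all made up for illustration; only the clause body is taken (abbreviated) from the segment above.

```python
import re

# Hypothetical keyword map: clause label -> substrings to look for in the header.
HEADER_KEYWORDS = {
    "indemnification": ("indemnif",),
    "limitation of liability": ("limitation of liability",),
    "governing law": ("governing law",),
    "assignment": ("assignment",),
}

# Optional section number (e.g. "9.3 "), then header text up to the first period.
HEADER_RE = re.compile(r"^\s*(?:\d+(?:\.\d+)*\s+)?([^.]+)\.")


def label_by_header(clause: str) -> str:
    """Label a clause by looking only at its header, never at its body."""
    match = HEADER_RE.match(clause)
    if match is None:
        return "unknown"
    header = match.group(1).strip().lower()
    for label, keywords in HEADER_KEYWORDS.items():
        if any(keyword in header for keyword in keywords):
            return label
    return "unknown"


# Abbreviated body of the distributor indemnity clause quoted above.
BODY = (
    "If VMware should incur any liability to a third party [...] Distributor "
    "agrees to indemnify and hold VMware free and harmless from any such liability."
)

original = "9.3 Indemnification by Distributor. " + BODY
swapped = "9.3 Assignment. " + BODY  # hypothetical wrong header, same body

print(label_by_header(original))
print(label_by_header(swapped))
```

The body is identical in both cases, yet the label follows the header. That is fine when headers are trustworthy and a problem when they are not.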
While we thought the previous prompt was pretty reasonable, we decided to try one additional prompt:
GPT-4 correctly identified both (pretty-standard, apart from the changed headers) indemnity clauses. This is an improvement from GPT-3.5, which found one and missed one.
Overall, this is inconsistent performance. These tests weren’t particularly hard. GPT-4 was right sometimes, wrong other times. On the one hand, it’s very cool that GPT-4 was this good out of the box. On the other hand, we would have reservations about using it in a situation where we needed more consistent and predictable accuracy.
Differently-Phrased Clauses
Contract clauses can be worded a lot of different ways. Poor quality scans, and contracts written in different jurisdictions (sometimes by non-English-first-language, non-lawyer authors) can contribute to even more disparate wordings.
Change of control clauses are among the most important to identify in connection with M&A contract review (aka, due diligence). At the time we sold Kira, 18 of the top 25 global M&A law firms were Kira customers, so we have a fair amount of experience with this use case. 30–60% of a typical M&A legal bill goes to due diligence. Not all of this is contract review, and not all the contract review segment of due diligence is about finding change of control clauses, but (accurately!) finding change of control clauses tends to be a big part of this work. Companies that buy other companies regularly spend millions of dollars with Biglaw firms to get this done right. Not only are change of control clauses important to get right, but there are a lot of ways to word them. As with other areas of contract review, this problem can be exacerbated by atypical drafting and poor quality scans.
We decided to test how GPT-4 would do if we ran some typical change of control clauses through. We took it relatively easy here, only giving pretty standard change of control language and not, say, introducing the distortion of a poor quality scan or non-English-first-language drafted clauses.
Clauses 3, 6, 7, and 9 are all change of control clauses. GPT-4 showed real room for improvement, though it did at least catch clause 6.
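If you want to put a number on a test like this, precision and recall are the usual measures: precision is the share of flagged clauses that are correct, and recall is the share of real clauses that were flagged. The sketch below assumes clauses are identified by their number. The `ACTUAL` set comes from the test above; the `HYPOTHETICAL_FOUND` set is a placeholder we made up to show the arithmetic, not a record of what GPT-4 returned.

```python
def precision_recall(found: set[int], actual: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for a set of flagged clause numbers."""
    true_positives = len(found & actual)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# The clauses that really are change of control clauses in this test.
ACTUAL = {3, 6, 7, 9}

# Placeholder result for illustration only (an assumption, not GPT-4's output).
HYPOTHETICAL_FOUND = {6}

precision, recall = precision_recall(HYPOTHETICAL_FOUND, ACTUAL)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

For due diligence, recall is usually the number to watch. A reviewer can discard a false positive quickly, but a missed change of control clause may never be seen by anyone.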
One objection to this test is that GPT-4 (or another Large Language Model) could be trained to find variations of change of control clauses, just as other contracts AI was heavily trained to find clauses. Zuva has a not-small machine learning research team, and we are always interested in how to get our tech to perform better. When we have tested using LLMs to find information in contracts, we have found we can get comparable accuracies. The catch is that they are orders of magnitude more expensive to train and use. For example, in one of our papers we compared a (non-generative) LLM to our current tech for finding named entities. The LLM was 4% more accurate, which is not nothing, but cost 10,000 times more than our baseline to achieve that¹. We’ll delve deeper into costs in part 2.
Sensitivity To Prompts
In the course of testing GPT-4 on summarizing text and question answering (which we’ll discuss in more detail in part 3), we noticed that it was sensitive to prompts. This is fairly unsurprising, given what we’ve seen with other LLMs. Be mindful of this if using an LLM in your work.
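One way to be mindful of this is to ask the same question several ways and check how much the answers overlap. Here is a rough sketch, assuming the model's answer can be reduced to a set of clause numbers. Everything in it is hypothetical: `fake_model` stands in for a real model call, and the prompts and canned answers are invented so the example runs on its own.

```python
from itertools import combinations
from typing import Callable


def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap between two answer sets: 1.0 means identical, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prompt_agreement(prompts: list[str], ask: Callable[[str], set[int]]) -> float:
    """Mean pairwise Jaccard overlap of the answers to each prompt variant."""
    answers = [ask(prompt) for prompt in prompts]
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented prompts and answers, used only so the sketch is self-contained.
CANNED_ANSWERS = {
    "Which of these clauses are change of control clauses?": {6},
    "List every clause triggered by a merger, sale, or change in ownership.": {3, 6, 7},
    "Which clauses restrict assignment on a change of control?": {6, 9},
}


def fake_model(prompt: str) -> set[int]:
    """Stand-in for a real LLM call; returns a canned set of clause numbers."""
    return CANNED_ANSWERS[prompt]


score = prompt_agreement(list(CANNED_ANSWERS), fake_model)
print(f"mean agreement across prompts: {score:.2f}")
```

A low score does not tell you which prompt is right, only that the answer depends heavily on how you asked.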
Summary
GPT-4 has generally impressed us. It appears able to do a lot very well. That said, we are unconvinced that it yet offers predictable accuracy on contract analysis tasks. (And we have not yet really pushed it, for example on poor quality scans and less standard wordings, though perhaps it would do well on these.) If you use Generative AI to help you find contracts with a change of control clause, it will find some of them. If, however, you need it to find all (or nearly all) change of control clauses over a group of contracts, we wouldn’t yet count on it². Still, this technology is improving, and our view may change as we test further iterations. Also, as we’ll discuss in a further installment, we think LLMs offer significant benefits today when combined with other machine learning contract analysis technologies. Exciting!
¹ In our experience, Zuva’s ML is far more effective on clause detection than finding contextually-relevant entities. Accordingly, we didn’t expect to “win” this comparison. But we think it’s important to test and see where we can improve.
² In situations where accurate contract review really matters, we recommend using contract analysis AI in conjunction with a user interface that includes a document viewer. A document viewer increases the odds that a human working with the AI might catch mistakes the AI makes.