
Classification and Extraction

Noah Waisberg • August 11, 2022 • 17 minute read

Contracts are generally in the form of mostly unstructured text. The core function of an Embeddable Contracts AI is converting that unstructured text into semi-structured (e.g., provisions) or structured (e.g., entities, normalized values, “answers”) data. Other Contracts AI features derive from this core. Let’s walk through some different types of data produced.

Provision Extraction

Provision extraction is the AI finding clauses in a contract, e.g., an agreement’s term, termination, indemnification, exclusivity, tenant’s right to sublet, or default or cross-default clause. This is generally a classification task: the AI reads text and decides whether a given span of text (ranging from a few words to a few paragraphs) is or is not, say, an exclusivity clause. Provision extraction generally involves taking unstructured contracts and turning them into semi-structured data.

While once upon a time vendors fought over how best to implement provision extraction, these days most provision extraction is machine learning-based. It is also possible to implement rules-based, comparison-based, or even section-header-based provision extraction. If you need to do provision extraction on unfamiliar text (like third-party paper) or poor quality scans, we would not recommend anything but machine learning.

A good provision extraction model will be more than a keyword search. Ideally it will find concepts. Here are some examples where identifying concepts over keywords matters:

  • You’re trying to find “confidentiality” language, but the contract uses “not disclose” instead of “keep confidential.” A good confidentiality provision extraction model should identify both types of language correctly.
  • A contract might mention “change of control,” “transfer of all or substantially all the stock,” “merger, consolidation, or otherwise,” “direct or indirect assignment,” or something else similar. While only the first uses the words “change of control,” all should be identified as potentially relevant language by a change of control provision extraction model. On the other hand, some contracts have a “change control” concept in them, which governs what happens when changes are proposed to an agreed-upon-spec. These are completely irrelevant to someone considering “change of control,” and should not be surfaced by a change of control provision extraction model.
  • An exclusivity clause might read “Licensor hereby grants Licensee a fully paid, worldwide, exclusive license.” Or it might read “Buyer will purchase 100% of its requirements for [good] from Seller.” Even though the latter doesn’t say “exclusive,” it contains an exclusive obligation. Both need to be identified by an exclusivity provision extraction model.
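To make the keyword problem concrete, here is a minimal sketch of a naive keyword matcher run against phrasings like those above. The function name and the sample snippets are illustrative assumptions, not taken from any Contracts AI product. The sketch only shows where keywords fall short; it says nothing about how a machine learning model works internally.

```python
import re

def keyword_hit(text: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears in the text (case-insensitive)."""
    return any(re.search(re.escape(k), text, re.IGNORECASE) for k in keywords)

# Illustrative snippets modelled on the change of control examples above.
snippets = {
    "named": "upon a change of control of the Supplier",
    "paraphrase": "transfer of all or substantially all the stock",
    "irrelevant": "changes to the spec follow the change control procedure",
}

# Strict phrase search: misses the paraphrased clause entirely.
strict = {name: keyword_hit(text, ["change of control"])
          for name, text in snippets.items()}

# Loosened search: still misses the paraphrase, and now flags the
# irrelevant "change control" clause as well.
loose = {name: keyword_hit(text, ["change", "control"])
         for name, text in snippets.items()}

print(strict)
print(loose)
```

Tightening the keywords loses relevant clauses and loosening them pulls in irrelevant ones, which is why a concept-level model is preferable.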

As the above illustrates, it’s important that provision models perform well on non-standard text. A number of super-important contract clauses (including exclusivity, non-competition, most favored treatment, change of control, pricing) are frequently phrased in highly variable ways. Note that poor quality scans can make even standard clauses (like amendment) look very different from the text a model was originally trained on. Performance on poor quality scans is conceptually similar to performance on unfamiliar text. In both cases, you are counting on a provision extraction model to work on text different to that which it was trained on.

Provision extraction is important in and of itself, as well as a foundation for other tasks. There are lots of situations where users need the verbatim provision language, or to be guided to the provision in a version of the original document. Contract interpretation can be an exceptionally complex task. As a junior lawyer, I remember struggling with two partners, a senior associate, and a summer student for days over one very value-impacting change of control clause. The original contract text is the source of truth.

Provision extraction results are a useful addition to many applications. These include:

  • They are a core feature of most contract analysis software applications.
  • The original text (and a link to the specific location of the text in the original document) can enhance a CLM, CRM, ERP, HRIS, or lease management system.
  • In contract negotiation software, they can be used to help identify clauses that have been modified, and guide what playbook section to show or which fallback language to use.
  • Provision extraction results can form the basis for a clause bank.
  • They can point to whether a provision you care about is present or not.

Even if you need to transform a contract into structured data, semi-structured provision extraction results are a critical step along the way.

  • Different models that give structured data (entity extraction, answers, normalization) all work best when run on top of accurate provision extraction results. In fact, it will be pretty hard to give accurate entity, answer, or normalization results if you aren’t already starting by looking at the right spot in the document.
  • It can be harder to get entity, answer, and normalization results to be highly accurate. Provision extraction results are much more likely to be accurately returned by an AI, and can give humans working with a system a better starting place.
    • Today, many systems built using Contracts AI are meant to enhance rather than replace humans. Kira (a leading contract analysis software that we helped build), for example, enabled lawyers, accountants, consultants, and alternative legal service provider employees to review contracts as or more accurately in 20–90% less time than pure human review. A 40–50% time savings was pretty typical. While this seems like big savings, it meant that Kira users still had to spend a lot of time (50–60% of the time they previously would have) reviewing results. Provision extraction results can be a jumping off point for users (e.g., at an Alternative Legal Services Provider) to further refine results into structured form.

[Diagram: OCR is the bottom layer of the foundation. Provision extraction is built on top of it. Entity extraction, answers, and normalization are built on top of provision extraction. Document grouping is built on top of entity extraction, and risk scoring and non-standard provision detection are built on top of these.]

If evaluating provision extraction offerings from different Contracts AI vendors, you should consider:

  • Accuracy, especially on agreements like those you will need reviewed. Note that:
    • Some provisions are easier to get right than others (e.g., assignment, term, notice, and governing law are pretty easy; change of control, exclusivity, non-compete can all be harder).
    • Make sure to test on documents that the vendor doesn’t have access to in advance of the evaluation, if you are at all worried about the vendor being able to cheat on the evaluation.
    • Poor quality scans can be a good test of accuracy on unfamiliar documents. You shouldn’t expect a Contracts AI to properly transcribe every word from a poor quality scan - accuracy here is a function of the OCR’s performance. Rather, if the OCR worked poorly, it can be interesting to see how the Contracts AI did at just finding the clause at all, despite the difficulty caused by the AI having very unfamiliar text to go off.
  • Comprehensiveness - does the Contracts AI find what you need it to, out of the box.
  • Trainability - how easy is it to train the Contracts AI to find new provisions should you imagine a need to do so.
  • Speed.

Provision extraction was among the first main features available in Contracts AI software for a reason. It’s super important. In our experience, it’s also reasonably hard to do right, at least on more complicated provisions in non-standard documents or poor quality scans.

Entity Extraction

Entity extraction is a specific type of text extraction that focuses on single units of information that a user seeks to pull out of a contract or document. Examples of entities in contracts include title, parties, dates, an agreement’s governing law, and leased square footage.

Depending on the entity, Contracts AIs tend to use different strategies to find the right one. In general, many entities are extracted by the AI first identifying the text that likely contains the entity, then extracting the value sought from that text. For example:

  • To find an agreement’s governing law, first, the AI finds the governing law section or sentence, then extracts the specific location from that text. E.g., “This agreement is governed by the laws of the state of Delaware.” A simpler machine learning model might just pull all locations from an agreement and then attempt to filter down to the correct one. That could be waaaaaay overinclusive, pulling company locations, addresses, and litigation venues too.
  • To find the parties to an agreement, an AI could first locate the agreement’s preamble, then pull things that look like person or company names from that sentence. Sometimes there are added clues (e.g., a lease might include words like “Lessor” and “Lessee” after specific party names; a credit agreement might have the same with “Lender” and “Borrower”). However, this approach is not sufficient to accurately do party identification on all agreements: lots of contracts present their parties in a non-preamble-based way.
  • To find the interest rate, first find the sentence or clause discussing the interest rate, then find the percentage number in the clause. Of course, the number may not be a number at all. It might instead be something like “[LIBOR/SOFR/SONIA] + __%.” As with governing law, a model that just pulled all numbers from a document wouldn’t be particularly useful for this task. A good entity extraction model finds the right number in the document.
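The find-then-extract pattern in the governing law example can be sketched roughly as follows. This is a toy under stated assumptions: regular expressions and a three-item location list stand in for what would be machine learning models in a real Contracts AI, and all function names and the sample text are hypothetical.

```python
import re
from typing import Optional

KNOWN_LOCATIONS = ["Delaware", "New York", "California"]  # tiny illustrative list

def find_governing_law_sentence(text: str) -> Optional[str]:
    """Stage 1: locate the sentence that discusses governing law."""
    for sentence in re.split(r"(?<=\.)\s+", text):
        if re.search(r"governed by|governing law", sentence, re.IGNORECASE):
            return sentence
    return None

def extract_location(sentence: str) -> Optional[str]:
    """Stage 2: pull a known location out of that sentence only."""
    for location in KNOWN_LOCATIONS:
        if location in sentence:
            return location
    return None

contract = (
    "Seller has its principal office in New York. "
    "This agreement is governed by the laws of the state of Delaware. "
    "Notices go to the California address below."
)

sentence = find_governing_law_sentence(contract)
governing_law = extract_location(sentence) if sentence else None

# Naive alternative: every location in the document, which is overinclusive.
all_locations = [loc for loc in KNOWN_LOCATIONS if loc in contract]

print(governing_law)
print(all_locations)
```

Restricting the second stage to the sentence found by the first is what keeps the office and notice addresses out of the result.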

If evaluating entity extraction quality across various Contracts AI offerings, you can roughly look at the same factors you would use to evaluate provision extraction performance.

Often, entities are very useful data points to feed into another system. Entity extraction is also a foundation for workflows that are looking to group contracts, in that agreement title, parties, and dates are useful inputs for this task.


Answers

Answers goes a step beyond finding contract clauses: it interprets them and converts them into structured format (similar to how your favourite search engine can generally tell you how old a public figure is rather than just linking you to Wikipedia). So, it could tell you:

  • Whether an agreement is exclusive or not.
  • Whether assignment (a) requires notice, (b) requires consent, (c) is unrestricted, (d) is not permitted.
  • What covenants exist (and maybe even what their triggers are) in a credit agreement or indenture.

As a further example, consider the following license grant:

Subject to the terms and conditions of this Agreement, Licensor grants to Customer a limited, non-exclusive, non-transferable, non-assignable (other than to a permitted assignee of this Agreement) and sub-licensable worldwide license to permit Users to access and use the Services.

It contains a number of data points, which an Answers feature could potentially pull out:

  • Is the license limited? Yes/No
  • Is the license exclusive? Yes/No
  • Is the license transferable? Yes/No
  • Is the license assignable? Yes/No
    • If yes, under what circumstances? Multiple choice, likely.
  • Is the license sublicensable? Yes/No
  • Is there a territory restriction on the license? Yes/No
    • If yes, what is it? Ideally a normalized location.
  • Is there a purpose restriction? Yes/No
    • If yes, what is it? This may need to be free text, not a structured data point.
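As a rough illustration, the data points above could be held in a structure like the one below. The class, its field names, and the filled-in values are assumptions made for this sketch. The values reflect one plausible reading of the sample license grant, not an authoritative interpretation.

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class LicenseGrantAnswers:
    """Hypothetical structured output for the sample license grant."""
    limited: bool
    exclusive: bool
    transferable: bool
    assignable: bool
    assignment_condition: Optional[str]  # likely multiple choice in practice
    sublicensable: bool
    territory_restricted: bool
    territory: Optional[str]             # ideally a normalized location
    purpose_restricted: bool
    purpose: Optional[str]               # may need to stay free text

# One plausible reading of the clause (an assumption, not legal advice).
answers = LicenseGrantAnswers(
    limited=True,
    exclusive=False,
    transferable=False,
    assignable=True,
    assignment_condition="only to a permitted assignee of the Agreement",
    sublicensable=True,
    territory_restricted=False,   # the grant is worldwide
    territory=None,
    purpose_restricted=True,
    purpose="to permit Users to access and use the Services",
)

print(asdict(answers))
```

Because each field has a fixed type, results like these can be loaded straight into a database or used in rules, unlike the raw clause text.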

Answers is a very hard feature to get the tech working correctly on.

  • Some questions can be difficult to answer properly.
    • The meaning of a clause can turn on a single word.
    • Interpretation can sometimes be hard even for senior lawyers. I remember a time in my corporate lawyer days where I (midlevel associate), two partners, a senior associate, and a summer associate spent weeks trying to figure out the correct interpretation of a critical change of control clause.
    • Sometimes the answer is not just in one place in the document. Instead of asking “Is the license exclusive?”, you might ask “Is this agreement exclusive?” This requires looking at more than the license grant - it requires looking at the whole agreement and, potentially, interpreting multiple clauses.
    • Sometimes the answer isn’t in the agreement at all. For example, answering might require knowing whether an external triggering event has happened.
  • Answers can require a lot more training data, since you need to (1) accurately find the clause(s) in question, and then (2) interpret them. Note that Answers features are unlikely to work well unless you are starting from good clause detection accuracy. E.g., if your detection accuracy for a particular clause is 75%, any Answers model that relies on finding that clause will be even less accurate, since it needs to both find the clause and interpret it, and mistakes can happen at both stages. Essentially, multiplying fractions yields even lower numbers.
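The compounding effect can be worked through with a small calculation. The 75% detection figure comes from the example above. The 90% interpretation figure is an assumed number chosen for illustration, and the sketch assumes errors in the two stages are independent.

```python
def pipeline_accuracy(detection: float, interpretation: float) -> float:
    """End-to-end accuracy of find-then-interpret.

    Assumes the two stages make mistakes independently of each other.
    """
    return detection * interpretation

# 0.75 is from the text; 0.90 is an assumed interpretation accuracy.
end_to_end = pipeline_accuracy(0.75, 0.90)

print(f"{end_to_end:.1%}")
```

Under these assumptions, even a fairly strong interpretation step leaves the overall result below the detection accuracy it started from.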

There can be ways to get this feature working better on specific questions and, especially, on more limited document types or collections.

If evaluating answers features across various Contracts AI offerings, you should consider the same factors you would use to evaluate provision extraction performance. Overall, our experience is that this is a hard feature to get right.

Answers can be a super useful feature.

  • Since results are in structured format, they can
    • Drive actions. E.g., if an MFN clause is identified, automatically send the agreement to a more senior lawyer for review.
    • Enable more granular database creation.
    • Help point human reviewers to where they should spend their time. E.g., check out agreements that have problematic clauses.
  • Help spot actual non-standard clauses. And identify which ones are functionally identical even though phrased differently.

Since Answers results can be so useful, it’s not uncommon at this stage to see human reviewers involved in the process of, e.g., converting semi-structured clause-level results into more granular and structured Answers format.


Normalization

Normalization is a feature that turns specific variable text into standardized data. So, not only does the AI find the text you’re looking for, it converts it into a consistent format and potentially removes any extraneous text. For example, if you need dates in a particular format, but agreement date fields read “January 22, 2022” or “the twenty-second day of January 2022”, you will need an extra layer of post-processing to convert them into a standardized date format (e.g., “1/22/2022” or “2022-01-22”). This is “normalization.” With normalization, you are able to get both the raw text that was extracted and a result in a standardized format.
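A minimal sketch of date normalization, using Python’s standard datetime module, might look like this. The function name and the returned dictionary shape are assumptions for illustration. It handles only the “January 22, 2022” style; spelled-out forms such as “the twenty-second day of January 2022” would need additional parsing.

```python
from datetime import datetime

def normalize_date(raw: str, fmt: str = "%B %d, %Y") -> dict:
    """Return both the raw text and an ISO 8601 date string."""
    parsed = datetime.strptime(raw, fmt).date()
    return {"raw": raw, "normalized": parsed.isoformat()}

result = normalize_date("January 22, 2022")
print(result)
```

Keeping the raw text alongside the normalized value lets a reviewer check the conversion against the original wording.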

Normalization is typically offered on certain types of information:

  • Date normalization. This is a pretty standard normalization use case. Note that US vs international date formats can add some complication here - not so much on the output, but rather on the input. I.e., it’s pretty easy to have “March 5, 2022” output in whatever format, but it can be tricky for software to tell if “5/3/22” means “May 3, 2022” (US) or “March 5, 2022” (international). This means that some human QC may be required on outputs.
  • Location normalization. There are a number of potential location normalizations possible. Two that stand out are:
    • Address normalization - put an address into a standardized format.
    • Clause location normalization - put something like governing law or dispute resolution venue into a standardized format.
  • Number normalization.
    • Currency normalization - put currency amounts into a standardized format, likely with separate results for “number” and “currency.” E.g., an amount of 975 euros as “975” and “€.”
    • Term normalization - put a duration (e.g., “five years” into “5” and “years” or “30 days” into “30” and “days”) into a standardized format. Useful for agreement term, expiry dates, payment terms, and notice periods, among other things.
    • Percentage normalization - put percentages into a standardized format. E.g., “twenty-five percent” and “25%” as “25” and “%.”
  • Party normalization. Group disparate entity names into one result. E.g., “GM,” “General Motors,” “General Motors Company,” and maybe the 455(!) different GM entities into one “GM” result. This is a hard problem, and isn’t something AI can do alone. It requires an outside database that knows that “P.T. G M AutoWorld Indonesia,” “VRP Venture Capital Rheinland-Pfalz Nr. 2 GmbH & Co. KG,” and “W. Grose Northampton Limited” are all GM entities. It is also somewhat prone to error, in that companies might have confusingly similar names. At the moment, you will likely need to rely on some system beyond a Contracts AI to get entity normalization.
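Two of the cases above lend themselves to a short sketch: detecting when a slash-separated date has more than one valid reading, and splitting a term into a number and a unit. The function names and the small word-to-number table are hypothetical, and a production system would need far broader coverage.

```python
from datetime import datetime

def date_readings(raw: str) -> set[str]:
    """Return every valid ISO date a slash-separated date could mean."""
    readings = set()
    for fmt in ("%m/%d/%y", "%d/%m/%y"):  # month-first (US), then day-first
        try:
            readings.add(datetime.strptime(raw, fmt).date().isoformat())
        except ValueError:
            pass  # this reading is not a valid date, e.g. month 22
    return readings

WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "five": 5, "ten": 10}  # tiny sample

def normalize_term(raw: str) -> tuple[int, str]:
    """Split a duration like 'five years' or '30 days' into (number, unit)."""
    number, unit = raw.lower().split()
    value = int(number) if number.isdigit() else WORD_NUMBERS[number]
    return value, unit

ambiguous = date_readings("5/3/22")      # two readings: flag for human QC
unambiguous = date_readings("22/1/22")   # only the day-first reading parses
term = normalize_term("five years")

print(sorted(ambiguous), sorted(unambiguous), term)
```

When `date_readings` returns more than one value, the software cannot tell which convention was intended, which is the point at which human QC comes in.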

Normalization yields standardized data, and this can be useful in multiple situations:

  • Normalization can be very helpful when populating databases. Database columns will have data types like numbers, integers, text strings, dates and date times which can all be standardized and formatted for the input you’re expecting.
  • Normalization can enable calculation. E.g., take a contract’s start date (not always easy to determine in our experience - often enough, the start date is a defined term, e.g., “The date when the Item is Delivered”), add its duration, and you can give the contract’s end date (assuming it hasn’t been extended or modified by another agreement - this stuff can be hard! :) ).
  • Normalization can feed workflows. E.g., if you know that an agreement expires on a set date, and it requires 60-day advance notice of cancellation or it autorenews, a CLM could use this normalized information to give a notification well before the 60-day advance notification deadline.
  • Normalization can help drive data visualizations. Normalized governing law data could power a world map feature showing where a company’s contracts are governed, or normalized end dates could give companies a visualization of when their customer contracts expire.
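The calculation and workflow cases above can be sketched with normalized values as inputs. The start date and five-year term here are made-up inputs, and the sketch assumes the end date falls on the anniversary of the start date, which a given contract may define differently.

```python
from datetime import date, timedelta

def add_years(start: date, years: int) -> date:
    """Add whole years, falling back to Feb 28 when the start is Feb 29."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)

start_date = date(2022, 1, 22)        # normalized start date (illustrative)
term_years = 5                        # normalized term (illustrative)
notice_days = 60                      # advance notice period from the example

end_date = add_years(start_date, term_years)
notice_deadline = end_date - timedelta(days=notice_days)

print(end_date.isoformat(), notice_deadline.isoformat())
```

A CLM could schedule a reminder some time before `notice_deadline`, so the cancellation decision is made before the agreement autorenews.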

If evaluating Contracts AI normalization features, consider:

  • Accuracy - does the system do a good job at normalizing data.
  • Comprehensiveness - can you get the things normalized that you would like normalized.

Document-Level Classification

Document Type Classification

Document type classification is a feature that identifies documents by type. So for example, if a user submits a document, a Contracts AI will tell you if the document is a contract (or not). It will also tell you what type of contract it is, such as a real estate agreement, or an IP agreement. Document type classification features can get even more granular, and may cover all sorts of document types beyond contracts (e.g., invoices, policies, minute book material). Some Contracts AIs may allow users to train their own document type classifiers.

Document type classification can help in multiple ways:

  • Some provision, entity, and answer AI fields may be most appropriate for certain types of documents. For example, if reviewing a lease, there may be a set of fields that are relevant, like base and additional rent, signage, or sublettability. Or, if reviewing credit agreements or commitment papers, an end user might care about covenants, redemption, or reporting requirements. Document type classification can help guide end users to the most appropriate AI fields for the work they’re doing.
  • It can be useful in workflows.
    • This includes helping triage which documents to review, and by whom. So, for example:
      • Leases can go to real estate lawyers, NDAs to an outsourced resource, credit agreements to finance lawyers.
      • You could hook Contracts AI up to a document crawler, get the crawler to pull all documents from specific places on a company’s network, then classify retrieved documents by document type, and only further review certain types of documents (e.g., contracts (as opposed to non-contracts) or license agreements).
    • In contract negotiation use cases, it can help determine which playbook to offer, and which replacement clauses might fit.
  • It can be useful metadata to include in other systems.
    • This is reasonably standard information to show in a contract management or contract analysis software system.
    • It’s really useful information to supplement a document management (DM) system, making the DM system easier to search.
    • It can help organize folders in a virtual data room. E.g., suggest which documents might go into the real estate, NDA, or employee/HR data room folders.

If evaluating different document type classification options, you should consider:

  • Accuracy.
  • Breadth and granularity of document types, and whether it covers the types of documents you foresee needing to classify.
  • Whether it is possible to train the system to identify additional document types.
    • How easy this is.
  • Whether the system offers an ability to do multiple different types of document classes simultaneously, or only one. That is, can you add multiple different types of document type labels at the same time, or just one document type label per document.
    • It can be useful to have the ability to do multiple types of document classes since different groups of classes can be useful for different purposes in the same application.
  • For example, it might be useful for a contract analysis or CLM system to have a feature that:
    • Crawls databases on a company’s network and pulls all documents, then
    • Puts those through a document type classifier that distinguishes between “contract” and “non-contract,” then
    • Passes documents identified as contracts into a further review process, then
    • As part of the further review process, classifies the documents at a more granular level, to
      • Send contracts to the most appropriate reviewer - e.g., leases to someone experienced with real estate, licenses to an IP lawyer.
      • Use the most appropriate AI fields for the specific document.
      • Add additional useful contract type metadata to the repository where results get stored.

Here, multiple levels of document classification power different functions.
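The layered flow described above could be wired together roughly like this. The two classifier functions are keyword stand-ins so the sketch runs on its own. In a real system they would be trained document type classifiers, and every name and routing rule here is hypothetical.

```python
from typing import Optional

ROUTING = {"lease": "real estate lawyer", "license": "IP lawyer"}

def is_contract(text: str) -> bool:
    """Level 1 stand-in: contract vs. non-contract."""
    lowered = text.lower()
    return "agreement" in lowered or "lease" in lowered

def contract_type(text: str) -> str:
    """Level 2 stand-in: a more granular contract type."""
    lowered = text.lower()
    if "lease" in lowered:
        return "lease"
    if "license" in lowered:
        return "license"
    return "other"

def triage(text: str) -> Optional[dict]:
    """Drop non-contracts, then attach type metadata and a reviewer."""
    if not is_contract(text):
        return None
    doc_type = contract_type(text)
    return {"type": doc_type,
            "reviewer": ROUTING.get(doc_type, "general reviewer")}

documents = [
    "Invoice #1 for services rendered",
    "Lease between Lessor and Lessee",
    "Software License Agreement",
]
results = [triage(d) for d in documents]
print(results)
```

The first classification decides whether a document enters review at all, while the second supplies both the routing decision and the metadata stored with the result.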

Language Classification

Language classification tells what language(s) a document is written in. Language classification has been around for a long time, and is pretty common (but still useful) tech.

Language classification:

  • Can be helpful for companies operating in multiple regions that need to route contracts in particular languages or from particular regions through workflows suited to the relevant legal issues.
    • E.g., contracts written in French likely need to go to someone who reads French for review and, perhaps, someone who is familiar with French law (though this will really depend on governing law - a French-language document might be under Quebec or Côte d’Ivoire law, so language-type classification might be too simplistic).
  • Can help determine which AI fields to apply to a document.
    • E.g., if a document is in German, it would be more effective to use German-language AI fields on it.

Language classification is available from a lot of different sources, not just Embeddable Contracts AI vendors. If evaluating different language classification options, things to consider are:

  • Accuracy.
  • Comprehensiveness - does it cover the languages you need it to.
  • Does it generate multiple results for documents written in multiple languages, or only the primary language of the document.