

Noah Waisberg • August 11, 2022 • 4 minute read

Clustering is the grouping of related information together. In general, contract information tends to be clustered at two levels:

  • Provision-level clustering.
  • Document clustering.

Provision-level Clustering

In provision-level clustering, similar provisions are grouped together. So, for example, clustering might enable an end user to see change of control clauses in other documents similar to one they had just reviewed.

Grouping similar provisions together can be a useful feature because it:

  • Enables end users to review pools of contracts more quickly for clauses they care about.
  • Provides end users with another tool to make sure that they don’t miss information that they might care about.
  • Can be used to cluster documents (e.g., Documents A, B, C all have Change of Control variant X, Assignment variant Y, and Exclusivity variant Z).
  • Can be a way to help build training data for teaching provision extraction models.
    • Though note that it can be dangerous to rely too heavily on this approach to generating training data because provision-level clustering can miss different ways of phrasing the same provision.
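
To make the mechanics concrete, here is a toy sketch of provision-level clustering — a token-overlap (Jaccard) similarity plus a greedy grouping pass. The clause texts and the 0.5 threshold are illustrative assumptions, not any vendor's actual method; real systems use richer text representations:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two clause texts (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_provisions(clauses, threshold=0.5):
    """Greedily add each clause to the first cluster whose leader is similar enough,
    else start a new cluster."""
    clusters = []  # each cluster is a list of clauses; clusters[i][0] is the leader
    for clause in clauses:
        for cluster in clusters:
            if jaccard(clause, cluster[0]) >= threshold:
                cluster.append(clause)
                break
        else:
            clusters.append([clause])
    return clusters

clauses = [
    "this agreement may not be assigned without prior written consent",
    "this agreement may not be assigned without the prior written consent of the other party",
    "upon a change of control the other party may terminate this agreement",
]
groups = cluster_provisions(clauses)
# The two assignment clauses land in one cluster; the change of control clause in another.
```

Note that results depend heavily on the threshold and on which clause happens to become a cluster's leader — a preview of the limitations discussed below.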

Provision-level clustering also has a risky side: it is a mediocre way to find provisions, and it can lull users into a false sense of security. Here is what I wrote years ago (2013?) on using a comparison-based (clustering) approach to finding provisions in contracts:

Comparison-based systems run into trouble when provisions in new agreements differ from ones in the provision database. This can occur because new agreements are drafted differently, which happens, especially in commercial agreements like supply and distribution contracts (some of the most common agreements in due diligence and contract management database population projects). Or it can occur because of poor quality scans leading to inexact agreement transcriptions. Comparison-based systems can cope with dissimilar agreements or difficult-to-OCR text by relaxing their comparison threshold, but this increases the odds of finding false positives. Comparison-based provision detection could work with a provision database covering all examples of how the provision is drafted (assuming no poor quality scans are reviewed). But it takes a lot of effort to build a good provision database, and it would be hard to be sure the database was actually comprehensive.
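
The threshold trade-off described above can be shown with numbers. In this toy sketch (the similarity function, clause texts, and thresholds are all hypothetical), a strict threshold misses a differently-worded assignment clause, while a threshold relaxed enough to catch it also admits an unrelated clause as a false positive:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two clause texts (0.0 to 1.0)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# One "known" assignment clause standing in for a provision database entry.
known = "neither party may assign this agreement without the prior written consent of the other party"
reworded = "no assignment of this agreement is permitted absent written consent"
unrelated = "each party shall keep confidential all information disclosed under this agreement"

def matches(candidate: str, threshold: float) -> bool:
    """Comparison-based detection: flag a candidate if it is similar enough to the known clause."""
    return jaccard(candidate, known) >= threshold

# Strict threshold: the reworded assignment clause is missed (a false negative).
# Relaxed threshold: the rewording is caught, but so is the unrelated
# confidentiality clause (a false positive).
```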

Over the years, some Contracts AI vendors have advocated for provision-level clustering approaches to contract provision detection. While provision-level clustering can be useful as a feature, it has real limitations as an approach to accurately identifying provisions in contracts. It is better used to group related provisions together and identify outlying provisions once these provisions have already been identified using a well-built provision extraction system.

Here are some much more thorough pieces I wrote on provision-level clustering years ago:

  1. This article on contract provision extraction has a bunch more detail on using provision-level clustering to identify contract provisions.
  2. This article about adding non-standard clause detection to your system has instructions on how to use a comparison-based approach to build a non-standard contract provision identification system.
  3. This article on building non-standard clause detection goes into some of the tricky issues in building non-standard clause detection.

If you only have limited time on this topic, read 3., then 1. If still interested, read 2.

Also, this article about technology fundamentals by my teammate Dr. Adam Roegiest is worthwhile. It gives a higher level overview of supervised versus unsupervised machine learning, lots of discussion on sorting fruit, and a sense of the effort trade-off with using one approach or the other.

Document Clustering

Document clustering groups similar documents together.

Document clustering results can be useful to:

  • Help users find similar subgroups of documents in a larger set of documents.
  • Enable users to more quickly take secondary actions on documents, like:
    • Assign similar documents for review to one person.
    • Identify agreements likely drafted off the same form, enabling users to bulk redline/blackline/compare these documents.
  • Provide a starting point for training document-level classifiers.
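
As one sketch of the secondary-action point, once documents are clustered, routing each cluster of similar documents to a single reviewer is straightforward. The cluster contents and reviewer names here are hypothetical:

```python
from itertools import cycle

def assign_clusters(clusters, reviewers):
    """Give each cluster of similar documents to one reviewer, round-robin,
    so related documents are reviewed by the same person."""
    assignment = {}
    for cluster, reviewer in zip(clusters, cycle(reviewers)):
        for doc in cluster:
            assignment[doc] = reviewer
    return assignment

clusters = [["supply_001.pdf", "supply_002.pdf"], ["nda_001.pdf"]]
assignments = assign_clusters(clusters, ["alice", "bob"])
# → {"supply_001.pdf": "alice", "supply_002.pdf": "alice", "nda_001.pdf": "bob"}
```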

Clustering technology has generally been around for a while, partly because it can be so useful.

There are a few challenges to look out for when considering provision or document clustering features:

  • Clustering is inherently sensitive to the nature and number of documents being clustered, since it relies on mathematical notions of similarity and usually pre-set similarity thresholds. This means that document clustering can be especially sensitive to the desired outcome (e.g., very granular clustering on a small group of documents versus coarse clustering on a larger set) and often requires tweaking to get right for the envisioned task and available data. Put another way, there is often no “one size fits all” clustering solution but rather a spectrum of possible solutions, some of which involve the vendor seeding clusters with data available to them (e.g., specific types of documents or provisions) or pre-determining a minimum (or maximum) number of clusters.
  • Adding new documents raises UI challenges. Should you re-run the clustering (generating new clusters), or add new documents to existing clusters?
  • Clustering generally needs to keep documents in the system to work. The tech underlying Contracts AI classification and extraction features can ingest documents and then delete them almost immediately after processing, which is attractive for both security and data storage reasons. In order to cluster provisions or documents, however, a system generally needs to keep them around (to measure how similar they are to each other) or provide some other mechanism to measure similarity (e.g., comparison to a hypothetical “representative” document for a cluster).
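
One way to square clustering with a delete-after-processing pipeline, hinted at in that last bullet, is to retain only a lightweight “representative” per cluster rather than the documents themselves. This is a hypothetical sketch (token sets as representatives, a 0.3 threshold), not how any particular vendor implements it:

```python
class ClusterIndex:
    """Keeps only a token-set 'representative' per cluster, so the full
    document text can be deleted after processing."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold
        self.representatives = []  # one frozenset of tokens per cluster

    def add(self, text: str) -> int:
        """Assign a document to the most similar existing cluster, or start a
        new one. Returns the cluster id; the text itself is not stored."""
        tokens = frozenset(text.lower().split())
        best_id, best_sim = None, 0.0
        for i, rep in enumerate(self.representatives):
            sim = len(tokens & rep) / len(tokens | rep)
            if sim >= self.threshold and sim > best_sim:
                best_id, best_sim = i, sim
        if best_id is None:
            self.representatives.append(tokens)
            return len(self.representatives) - 1
        return best_id

index = ClusterIndex()
a = index.add("master supply agreement between buyer and seller")
b = index.add("supply agreement between a buyer and a seller")
c = index.add("mutual nondisclosure agreement regarding confidential information")
# a == b (same cluster); c starts a new cluster.
```

Note the trade-off: in this sketch the representative stays frozen as the first document's token set, so cluster quality depends on which document arrives first — a fuller system would update representatives as clusters grow, or use richer features than token sets.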