Enhanced Document Classifier & Open Source Taxonomy

We released our first document type classifiers back in the early days of Kira. They could identify whether a document was a contract and sort documents into one of ~35 buckets. Ever since then, we’ve been working hard to build a taxonomy and expand the range of documents that our AI software can automatically classify. That work has been a heavy lift and has taken years, but now the wait is over, big time! Zuva’s new Multi-Level Document Classifier automatically classifies 225 document types.

Automatic document classification is useful for multiple reasons:

Document management systems are core organizational systems in businesses and professional service firms. Document types are important metadata, but it’s hard to get users to complete this information. AI technology applying a set taxonomy can help.
Within a contracts AI, classification tells you whether your document is a contract and, if so, what type of contract it is. That helps guide end users to the most appropriate AI fields, triage documents for review, and determine which playbooks to offer in contract negotiation. It can also provide useful metadata for other systems. We’ve written a lot more on document classification.

This automated document classifier took a lot of work. Building the taxonomy took years of thought and labor (including building on taxonomies created by others) from lawyers, paralegals, research scientists, and a knowledge manager/librarian at Zuva, Kira and Litera. Each sub-type in the taxonomy needed examples to train the AI, and we sourced and sorted tens of thousands of documents for this. We then had to refine and retrain, as well as build, implement, and optimize a lot of tech behind the scenes to make it all work.

In today’s legal and document management technology market, we know that many customers use multiple (sometimes competing) systems together. For example, a single company or law firm might use three different contract analysis AIs, plus another document management repository, potentially with its own AI. To really get a handle on its enterprise data, this customer would need data from System A to be comparable to data from Systems B, C, and D. This is extra hard if each vendor has its own private taxonomy. Our guess is that our document type taxonomy is likely to be more comprehensive and robust than that of many other vendors. Keeping it to ourselves could create a competitive advantage for us. But we think our customers are a lot better off if others use our taxonomy too, or if competitor systems’ taxonomies can be translated to ours. Enabling this kind of shared standard is what the SALI (Standards Advancement for the Legal Industry) Alliance does, and this is why we’ve (in conjunction with Litera) contributed our document classification taxonomy to SALI.

SALI is the leading global non-profit dedicated to creating and promoting legal data standards. Over the past 6+ years, SALI has made significant progress in getting the legal industry to adopt the taxonomy they’ve created within the Legal Matter Standard Specification (LMSS), a standardized framework of tags that categorize and describe legal work in a way that’s consistently structured.

Toby Brown, president of the board of The SALI Alliance, commented on the collaboration:
“Legal data standards are critical for optimizing efficiency and nurturing global collaborations. Zuva and Litera’s contribution is an exciting addition to the standards we’ve established, further paving the way for vast opportunities across the legal spectrum.” SALI leader Damien Riehl added “For Large Language Models (LLMs), an important method of increasing accuracy and reducing hallucinations is Retrieval Augmented Generation (RAG), and SALI’s 13,000+ tags can helpfully curate that document subset — for LLMs to summarize, analyze, and synthesize.”

We think this benefits the legal industry in multiple ways:

First, the SALI community can continue to build on this taxonomy. With 50+ participating member organizations and dozens more implementers, we expect and welcome suggestions for new document types, as well as ideas for how the taxonomy might integrate into the broader set of standards that SALI is building.

Second, we’ve learned that it’s very common for larger businesses to have tens or hundreds of document and contract repositories on the go, powered by technology from multiple vendors. A standard taxonomy for document classification should improve the interoperability of enterprise systems.

A standard taxonomy can contribute to improved interoperability through:

Better data management. A standard taxonomy provides a common structure and language for contract data types, simplifies data field mapping between systems, and reduces the risk of data corruption or loss because the systems understand and handle data in a consistent way.
API integrations. Contract management software vendors can design their APIs around standardized taxonomies, which makes it easier for third-party developers and other software systems to integrate with multiple contract management products.
Easier vendor selection. Organizations looking to adopt new contract management software can more easily evaluate vendors that support standardized taxonomies (i.e., they can test how multiple systems perform at the exact same task). Standardized outputs also make switching vendors easier. This should lower the risk of vendor selection, since a wrong choice becomes easier to reverse.
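
To make the data field mapping point concrete, here is a minimal sketch of translating two systems’ private document type labels into one shared taxonomy so their outputs can be counted together. Everything in it is an assumption for illustration: the system names, the vendor labels, the shared labels, and the helper functions are hypothetical, and none of them are taken from the actual Zuva or SALI taxonomy.

```python
from collections import Counter

# Hypothetical mappings from each vendor's private labels to a shared
# taxonomy. All labels here are illustrative, not real taxonomy entries.
SYSTEM_A_TO_STANDARD = {
    "NDA": "Non-Disclosure Agreement",
    "Lease": "Lease Agreement",
}
SYSTEM_B_TO_STANDARD = {
    "Confidentiality Agmt": "Non-Disclosure Agreement",
    "Real Estate Lease": "Lease Agreement",
}


def to_standard(label, mapping):
    """Translate a vendor label to the shared taxonomy; None if unmapped."""
    return mapping.get(label)


def count_by_standard_type(records):
    """Count documents per shared type.

    `records` is an iterable of (vendor_label, mapping) pairs. Labels with
    no entry in their mapping are counted under "Unmapped".
    """
    counts = Counter()
    for label, mapping in records:
        counts[to_standard(label, mapping) or "Unmapped"] += 1
    return counts


if __name__ == "__main__":
    records = [
        ("NDA", SYSTEM_A_TO_STANDARD),
        ("Confidentiality Agmt", SYSTEM_B_TO_STANDARD),
        ("Real Estate Lease", SYSTEM_B_TO_STANDARD),
        ("Mystery Doc", SYSTEM_A_TO_STANDARD),
    ]
    for doc_type, n in sorted(count_by_standard_type(records).items()):
        print(f"{doc_type}: {n}")
```

If every system emitted the shared labels directly, the per-vendor mapping tables would disappear and only the counting step would remain, which is the interoperability gain described above.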

We’re excited to see where this goes! Explore the 225 document types available in Zuva’s Multi-Level Document Classifier.

Zuva releases enhanced document classifier, open-sources its multi-level document classification taxonomy via the SALI Alliance