Optical Character Recognition (OCR)

Noah Waisberg • July 28, 2022 • 3 minute read

Many contracts come in the form of image files (i.e., they are scanned documents). To review them, the images need to be converted into correctly laid out text. This is where Optical Character Recognition (OCR) comes in: OCR converts images into text. You can upload a contract (or other document) as an image file into the Embeddable Contracts AI application you’re using, and OCR will convert that file into text, which the Embeddable Contracts AI can process further (e.g., identify document type, extract clauses).

It is possible for an Embeddable Contracts AI to work without built-in OCR, in two main ways.

First, you could OCR files using another system or API, and then pass the text results to the Embeddable Contracts AI. There are lots of OCR solutions available (some better than others). This adds a step, but end customers of your application may never notice, so it can be a perfectly reasonable approach.
- If following this approach, make sure that the Embeddable Contracts AIs you’re considering can work with different OCR software.
- Depending on what information you pass to the Embeddable Contracts AI, you may be depriving it of useful layout information. This can hurt accuracy, though the impact may not be material.

Second, the documents you need reviewed may be born digital, not needing OCR. For example, most contract negotiations happen in Word documents.
- Note that not all PDF files convert cleanly into text, and Embeddable Contracts AI vendors can use OCR to help protect against this (by running PDFs through OCR).
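
The PDF fallback mentioned above can be sketched roughly as follows. This is a minimal illustration, not any vendor's actual method: the function names, the `run_ocr` callable, and the 200-characters-per-page threshold are all assumptions made for the example.

```python
from typing import Callable, Tuple


def looks_like_clean_text(text: str, page_count: int,
                          min_chars_per_page: int = 200) -> bool:
    """Heuristic (an assumption): born-digital pages yield plenty of letters and digits."""
    if page_count <= 0:
        return False
    usable = sum(1 for ch in text if ch.isalnum())
    return usable / page_count >= min_chars_per_page


def text_for_review(embedded_text: str, page_count: int,
                    run_ocr: Callable[[], str]) -> Tuple[str, str]:
    """Return (text, source), calling OCR only when the embedded text looks too thin."""
    if looks_like_clean_text(embedded_text, page_count):
        return embedded_text, "embedded"
    # Hypothetical hook: run_ocr wraps whichever OCR system or API you use.
    return run_ocr(), "ocr"
```

Passing OCR in as a callable keeps the sketch independent of any particular OCR product, and means the slower OCR step only runs when the cheaper text extraction looks unreliable.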

OCR is a foundational technology for most Contracts AI, since systems add intelligence by reviewing text generated by OCR. OCR accuracy can vary dramatically. Inaccurate results at the OCR stage raise the odds of mistakes as the Embeddable Contracts AI attempts to add further intelligence to documents it reviews. This piece goes into a lot more detail on contract analysis on poor quality scans.

There are multiple OCRs available, and an Embeddable Contracts AI can have one integrated into it. The main legacy OCR vendors are ABBYY and Kofax’s OmniPage. Once upon a time, these two were far more accurate than other offerings, in our experience. That may have shifted in recent years, with many cloud vendors offering their own OCR services. Some Contracts AI vendors also specially enhance their third-party OCR. For example, at Zuva, our Research Team has added improved table and cell detection. Overall, our experience is that different OCRs have different tradeoffs. If seriously considering implementing your own OCR, test thoroughly and budget accordingly.

Note that OCRing documents can be time-consuming (e.g., 1–5 seconds/page). Documents that don’t have to go through OCR tend to process much faster.
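
To see what that per-page figure means for a whole batch, here is a back-of-the-envelope sketch. It uses the 1–5 seconds/page example range from above; the `parallel_workers` parameter and the assumption that work splits evenly across workers are illustrative simplifications, not a description of any particular system.

```python
from typing import Tuple

# Per-page range taken from the example figure above; real speeds vary.
SECONDS_PER_PAGE_LOW = 1.0
SECONDS_PER_PAGE_HIGH = 5.0


def ocr_hours(pages: int, parallel_workers: int = 1) -> Tuple[float, float]:
    """Estimate (best, worst) OCR time in hours, assuming pages split evenly across workers."""
    if pages < 0 or parallel_workers < 1:
        raise ValueError("pages must be >= 0 and parallel_workers >= 1")
    low = pages * SECONDS_PER_PAGE_LOW / parallel_workers / 3600
    high = pages * SECONDS_PER_PAGE_HIGH / parallel_workers / 3600
    return low, high


best, worst = ocr_hours(pages=36_000, parallel_workers=4)
print(f"{best:.1f} to {worst:.1f} hours")
```

Under these assumptions, 36,000 pages on four workers comes to roughly 2.5 to 12.5 hours, which is why skipping OCR for born-digital documents can matter.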

Because OCR is so foundational, choosing one is a very important decision. If evaluating different OCRs, consider:

- Accuracy. There can be significant differences here, including on poorer quality scans, and elements like tables.
- Speed.
- General attributes like ease of implementation and robustness.
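
For the accuracy criterion, one common way to put a number on OCR quality is character error rate: the edit distance between the OCR output and a hand-checked transcription, divided by the length of that transcription. The sketch below is one possible implementation; the function names are our own, and it measures text accuracy only, not layout or table structure.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the hand-checked reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

Running this over the same sample pages for each candidate OCR, including poor quality scans and tables, gives numbers that can be compared side by side.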

Here are some additional useful OCR features of some Embeddable Contracts AIs worth looking out for:

- OCR’d text and layout information availability. OCR’d text and layout information can be useful elsewhere in applications (e.g., to include in a document viewer), and getting this information from an Embeddable Contracts AI can save the time and expense of OCRing documents twice.
- Scan quality grade. OCR is imperfect, especially on poor quality scans: the worse the quality of the scan, the worse the OCR results are likely to be. Since OCR is such a foundational technology for other Contracts AI features (e.g., classification, clustering), these other features are likely to have more errors if the OCR stage went poorly. A scan quality grade lets end users identify documents where the Contracts AI was more likely to have made errors, so they can spend more time reviewing and cleaning up these documents.
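
As a rough illustration of how a scan quality grade might be used, the sketch below builds a review queue from per-document grades. The grade scale (0 to 1, higher is better), the 0.6 threshold, and the function name are all assumptions for the example; an actual Embeddable Contracts AI may report quality in a different form.

```python
from typing import Dict, List


def review_order(grades: Dict[str, float], threshold: float = 0.6) -> List[str]:
    """Return documents graded below the threshold, worst scan first."""
    flagged = [name for name, grade in grades.items() if grade < threshold]
    return sorted(flagged, key=lambda name: grades[name])


# Hypothetical grades for three uploaded files.
queue = review_order({"lease.pdf": 0.9, "nda_scan.pdf": 0.3, "msa_fax.pdf": 0.5})
print(queue)
```

Here the two lower-graded files are queued for closer human review, worst first, while the cleanly scanned one is left out.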