OCR

DocAI uses optical character recognition (OCR) to convert documents from diverse input formats into a single internal format (eOCR) for further processing by services such as extraction and classification. DocAI also makes the OCR’d content available in three formats: text, images, or a Protobuf (binary) layout file specifying the location of each character on the page.

The OCR engine is Kofax OmniPage. Zuva’s OCR pipeline adds layers of post-processing so that the output works best with Zuva’s machine learning stack. This gives you access to ongoing improvements made by Zuva’s Research team to the machine learning capabilities.

Using your own OCR

DocAI also supports the use of your own OCR engine if you prefer. There are two ways to use the output of your own OCR engine: provide text input in POST /files, or provide the byte content of an eOCR file. The difference between the two is that eOCR content retains the positional (bounds) information of the characters, while plain text does not. See the File submission page for more information about uploading text and eOCR files.
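
The two submission paths can be sketched as follows. This is a minimal illustration, not the documented API contract: the base URL, the bearer-token header, and both content-type values are assumptions made for the example (only POST /files comes from this page), so check the File submission page for the real values. The function only builds the request; nothing is sent.

```python
import urllib.request

# Assumptions for illustration only; none of these values come from this page.
BASE_URL = "https://example.invalid/api/v2"   # hypothetical base URL
TEXT_CONTENT_TYPE = "text/plain"              # assumed type for plain text input
EOCR_CONTENT_TYPE = "application/eocr"        # assumed type for eOCR bytes


def build_upload_request(body: bytes, content_type: str, token: str,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /files request carrying your own OCR output."""
    return urllib.request.Request(
        url=f"{base_url}/files",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": content_type,
        },
        method="POST",
    )


# Path 1: text only, so character positions are not available downstream.
text_request = build_upload_request(
    "Text produced by your OCR engine".encode("utf-8"),
    TEXT_CONTENT_TYPE,
    token="YOUR_TOKEN",
)

# Path 2: the raw bytes of an eOCR file, which keep the character bounds.
# eocr_bytes = open("document.eocr", "rb").read()   # hypothetical file name
# eocr_request = build_upload_request(eocr_bytes, EOCR_CONTENT_TYPE, token="YOUR_TOKEN")
```

Passing either request to urllib.request.urlopen would send it. Both paths share the same endpoint and differ only in the body and its declared content type.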

For an example of how to create an eOCR (ZuvaOCR) file from hOCR, please refer to our sample Python code.
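
As a rough idea of the input side of such a conversion, the sketch below reads word text and bounding boxes from hOCR using only the Python standard library. It is not the sample code referred to above and it does not write an eOCR file, because the eOCR schema is not described on this page. The class and function names are hypothetical. It relies only on the hOCR convention that a word is a span with class ocrx_word whose title attribute carries "bbox x0 y0 x1 y1", and it assumes word spans contain no nested spans.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 10 20 50 40; x_wconf 95".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside a word span (a simplification).
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr: str):
    """Return the words and pixel bounding boxes found in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

The resulting list of words with bounds is the positional information that an eOCR file retains and that plain text input loses.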
