DocAI uses optical character recognition (OCR) to convert documents from diverse input formats into a single internal format (eOCR) for further processing by services such as extraction and classification. DocAI also makes the OCR’d content available in three formats: text, images, or a Protobuf (binary) layout file that specifies the location of each character on the page.
Important notes on using the Zuva OCR service:
- Submitting a document to DocAI only uploads the document content. It does not run OCR on the document.
- For the Classification and Extraction services, you do not need to run an OCR task explicitly, because DocAI runs OCR internally by design.
- For the Training service, you are expected to run an OCR task explicitly on a document before sending training data for it.
- Once a document has been OCR’ed (by you or by DocAI), it does not go through OCR again for later requests until it expires.
- If you plan to use a document again after it expires, consider downloading a copy of its eOCR version. When you need to process the document again, submit the downloaded eOCR file instead of the original, which avoids running OCR on that document a second time.
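The rules above can be summarised as a small decision helper. The following is a minimal sketch, not part of any DocAI SDK: the function names, the service-name strings, and the idea of tracking an `already_ocred` flag on the client side are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical service labels; the notes above only state which services
# run OCR internally and which expect an explicit OCR task.
IMPLICIT_OCR_SERVICES = {"classification", "extraction"}
EXPLICIT_OCR_SERVICES = {"training"}


def needs_explicit_ocr(service: str, already_ocred: bool) -> bool:
    """Return True when the caller should start an OCR task themselves."""
    service = service.lower()
    if service not in IMPLICIT_OCR_SERVICES | EXPLICIT_OCR_SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    if already_ocred:
        # A document is not OCR'ed again until it expires.
        return False
    return service in EXPLICIT_OCR_SERVICES


def content_to_resubmit(original: bytes, saved_eocr: Optional[bytes]) -> bytes:
    """Pick what to upload once the document has expired on DocAI.

    Prefers a previously downloaded eOCR copy, since submitting it avoids
    a second OCR pass; falls back to the original document otherwise.
    """
    return saved_eocr if saved_eocr is not None else original
```

For example, `needs_explicit_ocr("training", already_ocred=False)` indicates that an OCR task should be started first, while the same call for `"extraction"` does not.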
The OCR engine used is Kofax OmniPage. Zuva’s OCR pipeline adds layers of post-processing so that the output works best with Zuva’s machine learning stack. This gives you access to ongoing improvements that Zuva’s Research team makes to the machine learning capabilities.
Using your own OCR
DocAI also supports using your own OCR engine if you prefer. There are two ways to supply its output: provide text input in POST /files, or provide the byte content of an eOCR file. The difference between the two is that eOCR content retains the positional (bounds) information of the characters, whereas plain text does not. See the File submission page for more information about uploading text and eOCR files.
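The two upload paths might be prepared as follows. This is a sketch under stated assumptions: only the `POST /files` endpoint comes from the text above. The base URL, the bearer-token header, and both `Content-Type` values are placeholders, so check the File submission page for the actual values. The code builds the request objects without sending them.

```python
import urllib.request

# Placeholder values (assumptions, not documented DocAI settings).
BASE_URL = "https://docai.example.com/api/v2"
CONTENT_TYPES = {
    "text": "text/plain",         # plain text: no positional information
    "eocr": "application/eocr",   # hypothetical type for eOCR bytes
}


def build_upload_request(content: bytes, kind: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST /files request for text or eOCR content."""
    if kind not in CONTENT_TYPES:
        raise ValueError(f"kind must be one of {sorted(CONTENT_TYPES)}")
    return urllib.request.Request(
        url=f"{BASE_URL}/files",
        data=content,
        headers={
            "Content-Type": CONTENT_TYPES[kind],
            "Authorization": f"Bearer {token}",  # assumed auth scheme
        },
        method="POST",
    )


# Text from your own OCR engine: character positions are lost.
text_request = build_upload_request(b"Output of my OCR engine", "text", "TOKEN")

# eOCR bytes: character bounds are retained.
eocr_request = build_upload_request(b"<eOCR bytes>", "eocr", "TOKEN")
```

Passing either request to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes it easy to inspect the headers before any document content leaves your machine.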