OCR

The DocAI OCR service allows you to obtain the text of a document, as well as images of the pages and the location of every character on each page.

Like most of the DocAI API, the OCR service works asynchronously, requiring you to first make a POST to create the OCR request, then use a GET endpoint to poll the status of the request. Once the status is complete, you can use additional GET endpoints to obtain the results you are interested in.

Using this guide

This guide uses plain Python 3 and the requests library for illustrative purposes. If you plan to use Python in your own code, you may want to check out our prebuilt Python wrapper.

To run the code samples, you’ll need the following imports and constants:

import os, json, requests

# Assumes you've exported your token as an environment variable
TOKEN = os.getenv('DOCAI_TOKEN')

# Change this if you are using another region
REGION_URL = "https://us.app.zuva.ai"
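
Every request in this guide sends the same Authorization header, so a missing token only shows up later as an unexpected status code. As a minimal sketch, the check below fails early instead. The helper name auth_headers is hypothetical and not part of the DocAI API; it only assumes the DOCAI_TOKEN environment variable used above.

```python
import os

def auth_headers(token=None):
    """Build the Authorization header used by the requests in this guide.

    Hypothetical helper: falls back to the DOCAI_TOKEN environment variable
    when no token is passed explicitly.
    """
    token = token or os.getenv("DOCAI_TOKEN")
    if not token:
        # Fail early rather than sending "Bearer None" to the API
        raise RuntimeError("DOCAI_TOKEN is not set")
    return {"Authorization": "Bearer " + token}
```

With this in place, headers=auth_headers() could stand in for the inline dictionary in the samples that follow.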

Step 1: Upload your file to DocAI

Follow the instructions in the File Management Workflow to upload your file to DocAI and obtain its file_id.

Step 2: Create an OCR request

To start processing your file, use the Create OCR requests endpoint, providing the file_id from step 1.

def create_ocr_requests(file_id):
    # Submit a single file for OCR; this guide expects a 202 response on success
    resp = requests.request("POST",
                            REGION_URL + "/api/v2/ocr",
                            headers={"Authorization": "Bearer " + TOKEN},
                            json={"file_ids": [file_id]})
    if resp.status_code == 202:
        return json.loads(resp.text)
    else:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')

# file_id is the ID obtained in Step 1
ocr_requests = create_ocr_requests(file_id)["file_ids"]

The response includes a request_id for each file_id. Since we submitted only one file ID, we get a single-element array. We’ll need the request_id in the next step:

request_id = ocr_requests[0]['request_id']
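
When several file IDs are submitted in one request, indexing by position gets fragile. The sketch below builds a lookup from file ID to request ID instead. It assumes each entry in the array carries a file_id field next to its request_id, which this guide does not show explicitly, and the sample values are made up for illustration.

```python
def request_ids_by_file(ocr_requests):
    """Map each file_id to its request_id.

    Assumes every entry has both a 'file_id' and a 'request_id' key.
    """
    return {entry["file_id"]: entry["request_id"] for entry in ocr_requests}

# Hypothetical sample mirroring the response shape described above
sample = [
    {"file_id": "file-a", "request_id": "req-1"},
    {"file_id": "file-b", "request_id": "req-2"},
]
lookup = request_ids_by_file(sample)
print(lookup["file-b"])
```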

Step 3: Poll OCR request status

Poll the Get OCR request status endpoint until the status is “complete”.

Note: you should also check for a “failed” status, to avoid waiting on a request that will never complete.

import time

def poll_ocr(request_id, timeout_seconds=180, interval_seconds=2):
    t_start = time.time()
    while time.time() < t_start + timeout_seconds:
        resp = requests.request("GET",
                                REGION_URL + "/api/v2/ocr/" + request_id,
                                headers={"Authorization": "Bearer " + TOKEN})
        if resp.status_code != 200:
            raise RuntimeError(f'Unexpected status code: {resp.status_code}')
        status_results = json.loads(resp.text)
        # Stop early on failure instead of waiting for the timeout
        if status_results['status'] == 'failed':
            raise RuntimeError('ocr request failed')
        if status_results['status'] == 'complete':
            return status_results
        time.sleep(interval_seconds)
    raise RuntimeError("Timed out waiting for ocr request to process")
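
For long documents, a fixed two-second interval can generate many status calls. One possible variation, sketched below, doubles the wait after each attempt up to a cap. This is not a DocAI recommendation, just a common polling pattern. The function poll_until_done and its parameters are hypothetical, and get_status stands for any zero-argument callable that returns the parsed status response.

```python
import time

def poll_until_done(get_status, timeout_seconds=180, initial_interval=1.0,
                    max_interval=15.0, sleep=time.sleep):
    """Call get_status() until it reports 'complete' or 'failed'.

    get_status: zero-argument callable returning a dict with a 'status' key.
    sleep: injectable so the wait schedule can be checked without waiting.
    """
    deadline = time.monotonic() + timeout_seconds
    interval = initial_interval
    while time.monotonic() < deadline:
        result = get_status()
        if result["status"] == "complete":
            return result
        if result["status"] == "failed":
            raise RuntimeError("ocr request failed")
        sleep(interval)
        # Double the wait each round, capped at max_interval
        interval = min(interval * 2, max_interval)
    raise TimeoutError("Timed out waiting for ocr request to process")
```

Any other status value is treated as "still in progress", so the sketch does not depend on the exact names of the intermediate states.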

Step 4: Get results

Text results

The following example retrieves the document content using the Get OCR text endpoint and prints out the first 70 characters:

def get_ocr_text(request_id):
    resp=requests.request("GET",
                          REGION_URL + "/api/v2/ocr/"+request_id+"/text",
                          headers={"Authorization": "Bearer " + TOKEN})
    if resp.status_code!=200:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')
    return json.loads(resp.text)['text']

text = get_ocr_text(request_id)
print(text[:70] + "...")

Image results

The Get OCR images endpoint returns a ZIP file containing a PNG image of each page:

def get_ocr_images(request_id):
    resp=requests.request("GET",
                          REGION_URL + "/api/v2/ocr/"+request_id+"/images",
                          headers={"Authorization": "Bearer " + TOKEN})
    if resp.status_code!=200:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')
    return resp.content

with open("images.zip", "wb") as out_file:
    out_file.write(get_ocr_images(request_id))
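
If you would rather inspect the pages without writing the archive to disk first, the returned bytes can be opened in memory with the standard zipfile module. The sketch below lists the PNG entries. The function name is hypothetical, and the naming of files inside the archive is not specified in this guide, so the code only filters on the .png extension.

```python
import io
import zipfile

def list_page_images(zip_bytes):
    """Return the sorted names of the PNG entries in an in-memory ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(name for name in archive.namelist()
                      if name.lower().endswith(".png"))
```

For example, list_page_images(get_ocr_images(request_id)) would return one entry per page image in the archive.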

Layouts

The GET layouts endpoint returns the layout of every character on each page of the document in a binary Protobuf format:

def get_ocr_layouts(request_id):
    resp=requests.request("GET",
                          REGION_URL + "/api/v2/ocr/"+request_id+"/layouts",
                          headers={"Authorization": "Bearer " + TOKEN})
    if resp.status_code!=200:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')
    return resp.content

# The layouts response is a binary Protobuf payload, not a ZIP archive
with open("layouts.pb", "wb") as out_file:
    out_file.write(get_ocr_layouts(request_id))

See the hOCR to eOCR Python example for more information on working with the binary Protobuf file.

Step 5 (optional): Delete the file from DocAI

If desired, you may now delete the file from DocAI. Otherwise, it will be removed automatically after 48 hours.