OCR
The DocAI OCR service allows you to obtain the text of a document, as well as images of the pages and the location of
every character on each page.
Like most of the DocAI API, the OCR service works asynchronously, requiring you to first make a POST to create the OCR request, then use a GET endpoint to poll the status of the request. Once the status is complete, you can use additional GET endpoints to obtain the results you are interested in.
Using this guide
This guide uses plain Python 3 and the requests library for illustrative purposes, but if you plan to use Python in your own code, you may want to check out our prebuilt Python wrapper.
To run the code samples, you’ll need the following imports and constants:
```python
import json
import os

import requests

# Assumes you've exported your token as an environment variable
TOKEN = os.getenv('DOCAI_TOKEN')

# Change this if you are using another region
REGION_URL = "https://us.app.zuva.ai"
```
Step 1: Upload your file to DocAI
Follow the instructions in the File Management Workflow to upload your file to DocAI and obtain its file_id.
Step 2: Create an OCR request
To start processing your file, use the Create OCR requests endpoint, providing the file_id from step 1.
```python
def create_ocr_requests(file_id):
    resp = requests.request("POST",
                            REGION_URL + "/api/v2/ocr",
                            headers={"Authorization": "Bearer " + TOKEN},
                            json={"file_ids": [file_id]})
    # The request is accepted for asynchronous processing
    if resp.status_code == 202:
        return json.loads(resp.text)
    else:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')


# file_id is the ID obtained in step 1
ocr_requests = create_ocr_requests(file_id)["file_ids"]
```
The response includes a request_id for each file_id. Since we included only one file ID, we get a single-element array. We’ll need the request_id in the next step:
```python
request_id = ocr_requests[0]['request_id']
```
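If you submit several file IDs in one request, it can be convenient to index the response by file ID. The sketch below assumes that each element of the returned array carries both a file_id and a request_id field; the helper name and the sample entries are hypothetical and only illustrate the shape.

```python
def index_requests_by_file(ocr_requests):
    """Map each file_id to its request_id.

    Assumes every entry has 'file_id' and 'request_id' keys.
    """
    return {entry["file_id"]: entry["request_id"] for entry in ocr_requests}


# Hypothetical response entries, for illustration only
sample = [
    {"file_id": "file-a", "request_id": "req-1"},
    {"file_id": "file-b", "request_id": "req-2"},
]
request_ids = index_requests_by_file(sample)
print(request_ids["file-b"])  # prints: req-2
```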
Step 3: Poll OCR request status
Poll the Get OCR request status endpoint until the status is “complete”.
Note: you should also check for a “failed” status, to avoid waiting on a request that will never complete.
```python
import time


def poll_ocr(request_id, timeout_seconds=180, interval_seconds=2):
    t_start = time.time()
    while time.time() < t_start + timeout_seconds:
        resp = requests.request("GET",
                                REGION_URL + "/api/v2/ocr/" + request_id,
                                headers={"Authorization": "Bearer " + TOKEN})
        if resp.status_code != 200:
            raise RuntimeError(f'Unexpected status code: {resp.status_code}')
        status_results = json.loads(resp.text)
        # Stop early on failure instead of waiting for the timeout
        if status_results['status'] == 'failed':
            raise RuntimeError('OCR request failed')
        if status_results['status'] == 'complete':
            return status_results
        time.sleep(interval_seconds)
    raise RuntimeError("Timed out waiting for OCR request to process")
```
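A fixed polling interval works well for a single short document. If you process larger batches, you may prefer to wait longer between attempts. The following is a minimal sketch of polling with exponential backoff; it is not part of the DocAI API. Here get_status is a hypothetical callable standing in for the GET request shown above, and the delay values are arbitrary defaults.

```python
import time


def poll_with_backoff(get_status, timeout_seconds=180,
                      initial_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call get_status() until it reports 'complete', doubling the wait each time.

    get_status is a hypothetical callable returning a dict with a 'status' key.
    The sleep parameter exists so the wait can be replaced in tests.
    """
    deadline = time.monotonic() + timeout_seconds
    delay = initial_delay
    while time.monotonic() < deadline:
        result = get_status()
        if result["status"] == "failed":
            raise RuntimeError("OCR request failed")
        if result["status"] == "complete":
            return result
        sleep(delay)
        # Double the delay after each attempt, up to max_delay
        delay = min(delay * 2, max_delay)
    raise TimeoutError("Timed out waiting for OCR request to complete")
```

Capping the delay keeps the worst-case wait between attempts bounded, while the doubling reduces the number of status calls for long-running requests.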
Step 4: Get results
Text results
The following example retrieves the document content using the Get OCR text endpoint and prints the first 70 characters:
```python
def get_ocr_text(request_id):
    resp = requests.request("GET",
                            REGION_URL + "/api/v2/ocr/" + request_id + "/text",
                            headers={"Authorization": "Bearer " + TOKEN})
    if resp.status_code != 200:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')
    return json.loads(resp.text)['text']


text = get_ocr_text(request_id)
print(text[:70] + "...")
```
Image results
The GET OCR images endpoint returns a ZIP file containing a PNG image of each page:
```python
def get_ocr_images(request_id):
    resp = requests.request("GET",
                            REGION_URL + "/api/v2/ocr/" + request_id + "/images",
                            headers={"Authorization": "Bearer " + TOKEN})
    if resp.status_code != 200:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')
    # The body is binary ZIP data, so return the raw bytes
    return resp.content


with open("images.zip", "wb") as out_file:
    out_file.write(get_ocr_images(request_id))
```
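If you would rather inspect the page images without writing the archive to disk first, the ZIP bytes can be opened in memory with Python's standard zipfile module. This is a sketch only: the helper names are hypothetical, and because the naming of entries inside the archive is not specified here, the code simply filters on the .png extension.

```python
import io
import zipfile


def list_page_images(zip_bytes):
    """Return the sorted names of the PNG entries in an in-memory ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return sorted(name for name in archive.namelist()
                      if name.lower().endswith(".png"))


def read_page_image(zip_bytes, name):
    """Return the raw bytes of one entry in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(name)
```

For example, list_page_images(get_ocr_images(request_id)) would give the entry names, and read_page_image could then load any one of them.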
Layouts
The GET layouts endpoint returns the layout of every character on each page of the document in a binary Protobuf format:
```python
def get_ocr_layouts(request_id):
    resp = requests.request("GET",
                            REGION_URL + "/api/v2/ocr/" + request_id + "/layouts",
                            headers={"Authorization": "Bearer " + TOKEN})
    if resp.status_code != 200:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')
    # The body is binary data, so return the raw bytes
    return resp.content


with open("layouts.zip", "wb") as out_file:
    out_file.write(get_ocr_layouts(request_id))
```
See the hOCR to eOCR Python example for more information on working with the protobuf binary file.
Step 5 (optional): Delete the file from DocAI
If desired, you may now Delete the file from DocAI.
Otherwise, it will automatically be removed after 48 hours.
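As a rough illustration of the deletion step, the sketch below builds a DELETE request using only the standard library (urllib rather than requests). The /api/v2/files/<file_id> path and the helper name are assumptions made for this example, so check the File Management Workflow for the actual endpoint before relying on it.

```python
import urllib.request


def build_delete_request(file_id, token, region_url="https://us.app.zuva.ai"):
    """Build (but do not send) a DELETE request for an uploaded file.

    The '/api/v2/files/<file_id>' path is an assumption, not confirmed here.
    """
    return urllib.request.Request(
        region_url + "/api/v2/files/" + file_id,
        method="DELETE",
        headers={"Authorization": "Bearer " + token},
    )


# To send it: urllib.request.urlopen(build_delete_request(file_id, TOKEN))
```

Separating construction from sending makes it easy to inspect the URL and headers before anything is deleted.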