Classification

The DocAI classification service automatically identifies more than 220 document types and classifies each document according to an extensive taxonomy. The taxonomy is available here: https://zuva.ai/document-classifications/

Like most of the DocAI API, the classification service works asynchronously, requiring you to first make a POST to create the classification request, then use a GET endpoint to poll the status of the request and obtain the results once complete. When you make a classification request, DocAI automatically applies OCR to the document (if necessary - see file submission for exceptions) and caches the OCR results for reuse by any of the other services (language, field extraction and OCR).

Using this guide

This guide uses plain Python 3 with standard-library modules plus the third-party requests package for illustrative purposes, but if you plan to use Python in your own code you may want to check out our prebuilt Python wrapper.

To run the code samples, you’ll need the following imports and constants:

import os, json, requests

# Assumes you've exported your token as an environment variable
TOKEN = os.getenv('DOCAI_TOKEN')

# Change this if you are using another region
REGION_URL = "https://us.app.zuva.ai"
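If DOCAI_TOKEN is not exported, os.getenv returns None and the samples fail later with a less obvious error when the header is built. The following is a minimal sketch of a guard that fails early instead; the helper name auth_headers is hypothetical and is not part of DocAI or its Python wrapper.

```python
import os

def auth_headers(token=None):
    """Build the Authorization header, failing early if no token is available.

    Hypothetical helper for illustration; falls back to the DOCAI_TOKEN
    environment variable used elsewhere in this guide.
    """
    if token is None:
        token = os.getenv("DOCAI_TOKEN")
    if not token:
        raise RuntimeError("DOCAI_TOKEN is not set")
    return {"Authorization": "Bearer " + token}
```

The returned dictionary could be passed as the headers argument in the requests calls used in the steps below.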

Step 1: Upload your file to DocAI

Follow the instructions in the File Management Workflow to upload your file to DocAI and obtain its file_id.

Step 2: Create a document classification request

To start processing your file, use the Create classification requests endpoint, providing the file_id from step 1.

def create_classification_requests(file_id):
    # Submit the file for classification; the endpoint accepts a list of file IDs
    resp = requests.post(REGION_URL + "/api/v2/mlc",
                         headers={"Authorization": "Bearer " + TOKEN},
                         json={"file_ids": [file_id]})
    # 202 Accepted means the request was queued for asynchronous processing
    if resp.status_code == 202:
        return resp.json()
    raise RuntimeError(f'Unexpected status code: {resp.status_code}')

# file_id is the ID obtained in step 1
class_requests = create_classification_requests(file_id)["file_ids"]

The response includes a request_id for each file_id - in this case, since we only asked to classify one file, we get a single-element array. We’ll need the request_id in the next step:

request_id = class_requests[0]['request_id']
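When several files are submitted in one call, it can help to keep track of which request belongs to which file. The sketch below builds that lookup from the response body. It assumes each element of "file_ids" carries a "file_id" key alongside the "request_id" described above; that key name, the helper name, and the sample payload are assumptions for illustration, so check them against an actual response.

```python
def request_ids_by_file(response_body):
    """Map each submitted file_id to the request_id created for it.

    Assumes every item in response_body["file_ids"] has both a
    "file_id" and a "request_id" key (the former is an assumption).
    """
    return {item["file_id"]: item["request_id"]
            for item in response_body["file_ids"]}

# Made-up payload, used only to show the shape of the result
sample = {"file_ids": [
    {"file_id": "file-a", "request_id": "req-1"},
    {"file_id": "file-b", "request_id": "req-2"},
]}
lookup = request_ids_by_file(sample)
```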

Step 3: Poll for the status and results

Begin polling the Get classification request status endpoint until the status is “complete”.

Note: you should also check for a “failed” status, to avoid waiting on a request that will never complete.

Once the status is complete, the response also includes the classification results under the "classifications" key, e.g. "classifications": ["Contract", "IP Agt", "License Agt"]

import time

def poll_classification(request_id, timeout_seconds=180, interval_seconds=2):
    t_start = time.time()
    # Keep polling until the request finishes or the timeout elapses
    while time.time() < t_start + timeout_seconds:
        resp = requests.get(REGION_URL + "/api/v2/mlc/" + request_id,
                            headers={"Authorization": "Bearer " + TOKEN})
        if resp.status_code != 200:
            raise RuntimeError(f'Unexpected status code: {resp.status_code}')
        status_results = resp.json()
        # Stop early on failure instead of waiting for the timeout
        if status_results['status'] == 'failed':
            raise RuntimeError('classification request failed')
        if status_results['status'] == 'complete':
            return status_results
        time.sleep(interval_seconds)
    raise RuntimeError("Timed out waiting for classification request to process")
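The polling loop above waits a fixed interval between requests. A common alternative, not something this guide or DocAI prescribes, is to lengthen the wait after each attempt so that slow requests generate fewer calls. The generator below is a minimal sketch of such a capped exponential backoff schedule; the function name and default values are arbitrary choices for illustration.

```python
def backoff_intervals(initial=1.0, factor=2.0, cap=30.0, budget=180.0):
    """Yield sleep durations that grow geometrically up to `cap`.

    The durations sum to exactly `budget` seconds, so the last one
    may be shortened to fit.
    """
    waited = 0.0
    delay = initial
    while waited < budget:
        step = min(delay, cap, budget - waited)
        yield step
        waited += step
        delay *= factor

# Example schedule with a small budget so it is easy to read
schedule = list(backoff_intervals(initial=1, factor=2, cap=4, budget=10))
```

To use it, the loop could iterate over backoff_intervals() and call time.sleep on each value after a status check that returned neither "complete" nor "failed".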

When the request is complete, the results will be included in the response.

results = poll_classification(request_id)
print("Document type level-1: " + results['classifications'][0])
print("Document type level-2: " + results['classifications'][1])
print("Document type level-3: " + results['classifications'][2])
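The snippet above indexes three taxonomy levels directly, which would raise an IndexError if a response contained fewer entries. Whether that can happen is not stated here, so the following is a defensive sketch that formats however many levels are present. The helper name is hypothetical, and the sample dictionary reuses the example values from step 3.

```python
def describe_classifications(results):
    """Return one formatted line per classification level in the results."""
    labels = results.get("classifications", [])
    return [f"Document type level-{level}: {label}"
            for level, label in enumerate(labels, start=1)]

# Sample shaped like the completed response described in step 3
sample_results = {"status": "complete",
                  "classifications": ["Contract", "IP Agt", "License Agt"]}
lines = describe_classifications(sample_results)
for line in lines:
    print(line)
```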

Step 4 (optional): Delete the file from DocAI

If desired, you may now Delete the file from DocAI. Otherwise, it will automatically be removed after 48 hours.