Field Extraction

The DocAI field extraction service allows you to extract common legal clauses, provisions and data points from unstructured documents and contracts, including ones written in non-standard language.

Like most of the DocAI API, the field extraction service works asynchronously: you first make a POST request to create the field extraction request, then poll a GET endpoint for its status. Once the status is complete, you can use additional GET endpoints to obtain the results.

When you make a field extraction request, DocAI automatically applies OCR to the document (if necessary - see file submission for exceptions) and caches the OCR results for reuse by any of the other services (classification, language and OCR).

Using this guide

This guide uses plain Python 3 with the standard library and the third-party requests package for illustrative purposes. If you plan to use Python in your own code, you may want to check out our prebuilt Python wrapper.

To run the code samples, you’ll need the following imports and constants:

import os, json, requests

# Assumes you've exported your token as an environment variable
TOKEN = os.getenv('DOCAI_TOKEN')

# Change this if you are using another region
REGION_URL = "https://us.app.zuva.ai"
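
Because os.getenv returns None when the variable is missing, a forgotten export only surfaces later as a confusing error when the Authorization header is built. A minimal sketch of an early check follows; the helper name require_token is hypothetical and not part of the DocAI API or wrapper.

```python
import os

def require_token(env_var="DOCAI_TOKEN"):
    """Return the token stored in env_var, or fail early with a clear message.

    Hypothetical helper for this guide; not part of the DocAI API.
    """
    token = os.getenv(env_var)
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return token
```

You could then write TOKEN = require_token() in place of the os.getenv call above.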

Step 1: Select field IDs

DocAI includes over 1300 built-in fields, as well as a field training API and the ability to use fields created using AI trainer.

Find fields programmatically

To obtain a list of all available fields from the API, make a Get field list request:

def get_fields():
    resp=requests.request("GET", REGION_URL + "/api/v2/fields",
                          headers={"Authorization": "Bearer " + TOKEN})
    if resp.status_code==200:
        return json.loads(resp.text)
    else:
        # status_code is an int, so format it rather than concatenating it to a str
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')

fields = get_fields()
print(json.dumps(fields, indent=4))

The returned list will include all fields available to your token, including both custom and built-in fields.
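
With more than a thousand entries, it can help to narrow the list by name rather than reading the full printout. The sketch below assumes the response is a list of objects that each carry "field_id" and "name" keys; that shape is an assumption here, so check it against the Get field list reference. The helper name find_fields_by_name is hypothetical.

```python
def find_fields_by_name(field_list, search_term):
    """Return {field_id: name} for fields whose name contains search_term.

    Assumes each entry is a dict with "field_id" and "name" keys
    (an assumption about the response shape, not confirmed by this guide).
    The match is case-insensitive.
    """
    term = search_term.lower()
    return {
        entry["field_id"]: entry["name"]
        for entry in field_list
        if term in entry.get("name", "").lower()
    }

# Illustrative data in the assumed shape, using IDs that appear in this guide
sample_fields = [
    {"field_id": "668ee3b5-e15a-439f-9475-05a21755a5c1", "name": "Title"},
    {"field_id": "c83868ae-269a-4a1b-b2af-c53e5f91efca", "name": "Governing Law"},
]
matches = find_fields_by_name(sample_fields, "law")
```

With a real response you would pass the list returned by get_fields() instead of sample_fields.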

For the purpose of this walkthrough, we will define some field IDs as constants:

fields = {
    "668ee3b5-e15a-439f-9475-05a21755a5c1": "Title",
    "f743f363-1d8b-435b-8812-204a6d883834": "Parties",
    "4d34c0ac-a3d4-4172-92d0-5fad8b3860a7": "Date",
    "c83868ae-269a-4a1b-b2af-c53e5f91efca": "Governing Law"
}

Find fields in the field library

The IDs of built-in fields are available in the Field Library (log in required).

Find field ID of a custom field in AI trainer

In AI trainer, the ID of each field is labelled “GUID” on the Field Details page.

[Screenshot: AI Trainer screen]

Step 2: Upload your file to DocAI

Follow the instructions in the File Management Workflow to upload your file to DocAI and obtain its file_id.

Step 3: Create a field extraction request

To start processing your file, use the Create field extraction requests endpoint, providing the field_ids from step 1 and the file_id from step 2.

def create_extraction_requests(file_id, field_ids):
   resp = requests.request("POST",
                           REGION_URL + "/api/v2/extraction",
                           headers={"Authorization": "Bearer " + TOKEN},
                           json={"file_ids": [file_id],
                                 "field_ids": field_ids})
   if resp.status_code==202:
       return json.loads(resp.text)
   else:
       raise RuntimeError(f'Unexpected status code: {resp.status_code}')

# file_id is the ID obtained in step 2; fields is the dict of constants from step 1
field_ids = list(fields.keys())
extraction_requests = create_extraction_requests(file_id, field_ids)["file_ids"]

The response includes a request_id for each file_id - in this case, since we included only one file ID, we get a single-element array. We’ll need the request_id in the next step:

request_id = extraction_requests[0]['request_id']
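
If you submit several files in one request, indexing by position becomes fragile. A minimal sketch of building a lookup from file ID to request ID follows. It assumes each entry in the array carries both a "file_id" and a "request_id" key; the guide confirms only the latter, so treat the former as an assumption. The helper name and the sample IDs are hypothetical.

```python
def map_requests_by_file(extraction_requests):
    """Return {file_id: request_id} from the "file_ids" array of the response.

    Assumes each entry has "file_id" and "request_id" keys
    ("file_id" is an assumption about the response shape).
    """
    return {entry["file_id"]: entry["request_id"] for entry in extraction_requests}

# Illustrative entries with made-up IDs, in the assumed shape
sample_requests = [
    {"file_id": "file-aaa", "request_id": "req-111"},
    {"file_id": "file-bbb", "request_id": "req-222"},
]
request_ids = map_requests_by_file(sample_requests)
```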

Step 4: Poll field extraction request status

Poll the Get extraction status endpoint until the status is “complete”.

Note: you should also check for a “failed” status, to avoid waiting on a request that will never complete.

import time

def poll_extraction(request_id, timeout_seconds=180, interval_seconds=2):
   t_start = time.time()
   while time.time() < t_start + timeout_seconds:
       resp=requests.request("GET",
                             REGION_URL + "/api/v2/extraction/"+request_id,
                             headers={"Authorization": "Bearer " + TOKEN})
       if resp.status_code!=200:
           raise RuntimeError(f'Unexpected status code: {resp.status_code}')
       status_results = json.loads(resp.text)
       if status_results['status'] == 'failed':
           raise RuntimeError('extraction request failed')
       if status_results['status'] == 'complete':
           return status_results
       time.sleep(interval_seconds)
   raise RuntimeError("Timed out waiting for extraction request to process")

status = poll_extraction(request_id)
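
A fixed two-second interval works for a single document, but for larger batches you may prefer to back off between polls. The sketch below is one possible variant, not the method this guide prescribes. It takes any zero-argument callable that returns the status response as a dict, so it makes no HTTP calls itself. The name poll_with_backoff and all of its parameters are hypothetical.

```python
import time

def poll_with_backoff(get_status, timeout_seconds=180, initial_interval=1.0,
                      max_interval=15.0, factor=2.0, sleep=time.sleep):
    """Call get_status() until it reports 'complete', waiting longer each time.

    get_status is any callable returning a dict with a 'status' key.
    Raises RuntimeError on 'failed' and TimeoutError once the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    interval = initial_interval
    while True:
        status_results = get_status()
        if status_results["status"] == "complete":
            return status_results
        if status_results["status"] == "failed":
            raise RuntimeError("extraction request failed")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("Timed out waiting for extraction request to process")
        # Never sleep past the deadline, then grow the interval up to the cap
        sleep(min(interval, remaining))
        interval = min(interval * factor, max_interval)
```

To use it with the API, pass a small function or lambda that performs the same GET request as poll_extraction and returns the parsed JSON. The sleep parameter exists only so that the wait can be replaced in tests.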

Step 5: Get results

The following example gets the results of the extraction:

def get_extraction_results(request_id):
    url = REGION_URL+"/api/v2/extraction/"+request_id+"/results/text"
    resp=requests.request("GET",
                          url,
                          headers={"Authorization": "Bearer " + TOKEN})
    if resp.status_code!=200:
        raise RuntimeError(f'Unexpected status code: {resp.status_code}')
    return json.loads(resp.text)

extraction_data = get_extraction_results(request_id)
print(json.dumps(extraction_data, indent=4))

The results are included in the response under the top-level 'results' key. They consist of an array with one entry for each field_id that was requested. Each entry contains an array of zero or more extractions (i.e. instances of the desired text). For example, you can print the extractions by iterating over the results, and then over the extractions for each result:

for result in extraction_data['results']:
    for ex in result['extractions']:
        print(fields[result['field_id']] + ': ' + ex['text'])
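
The loop above prints nothing for a field with zero extractions, so a missing clause is easy to overlook. A minimal sketch that collects the text per field name, keeping empty fields visible, might look like this. It relies only on the 'field_id', 'extractions' and 'text' keys used above; the helper name group_extractions and the sample text are hypothetical.

```python
def group_extractions(results, field_names):
    """Return {field name: [extracted text, ...]} for every requested field.

    Fields with no extractions map to an empty list. A field_id missing
    from field_names falls back to the ID itself.
    """
    grouped = {}
    for result in results:
        name = field_names.get(result["field_id"], result["field_id"])
        grouped.setdefault(name, []).extend(ex["text"] for ex in result["extractions"])
    return grouped

# Illustrative data in the shape described above (the text is made up)
sample_names = {
    "668ee3b5-e15a-439f-9475-05a21755a5c1": "Title",
    "c83868ae-269a-4a1b-b2af-c53e5f91efca": "Governing Law",
}
sample_results = [
    {"field_id": "668ee3b5-e15a-439f-9475-05a21755a5c1",
     "extractions": [{"text": "Services Agreement"}]},
    {"field_id": "c83868ae-269a-4a1b-b2af-c53e5f91efca", "extractions": []},
]
grouped = group_extractions(sample_results, sample_names)
```

With a real response you would call group_extractions(extraction_data['results'], fields).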

Step 6 (optional): Delete the file from DocAI

If desired, you may now Delete the file from DocAI. Otherwise, it will automatically be removed after 48 hours.