Python SDK tutorial
This document contains a step-by-step tutorial on how to leverage DocAI to obtain key metadata from your documents, as well as how this metadata can be saved to a spreadsheet for ease-of-reading.
After you have gone through this tutorial, you will be left with a working Python script that leverages multiple packages to take documents from your local folder, use machine learning (via DocAI) to extract and classify key metadata and save the content in a spreadsheet. Best of all, you will be able to explain what each component of the script does!
What will you learn?
The following is what you will learn by going through this tutorial:
- How to create an instance of the Python SDK to interact with DocAI
- How to submit your documents to DocAI
- How to create Language, Classification, and Field Extraction requests
- How to select fields from DocAI’s readily-available out-of-the-box fields
- How to output the data DocAI provides into an easy-to-read spreadsheet
To go through this tutorial, you will need:
- The Python interpreter (this tutorial uses v3.10)
- A Zuva account and token - see the Getting started guide
- A copy of Zuva’s DocAI Python SDK
Import the necessary packages
The first step is to import the necessary Python packages in your script. Below are the packages needed by this tutorial:
Get the files
Before you can run DocAI’s ML on your document, you will need to submit it (or them) to DocAI first. Below is how you can provide a folder name (upload_files, in this case) pointing to a folder that contains documents.
Going forward, docs contain the file path and file name of all of the underlying files. An example of this is:
To verify the contents of docs, you can run
print([d for d in docs]). While this tutorial uses local files, the same workflow would be followed for remote-hosted files. The only difference would be using the remote solution’s functionality to obtain the file’s content over the network.
Create an instance of the SDK
At this point in the tutorial you have imported the necessary Python packages, as well as loaded the documents that will be sent to DocAI. You should also have a token that was created, as mentioned in the requirements.
DocAI offers multiple regions to choose from, which can help you decrease latency (due to being physically closer/in the region), and data residency requirements. If you created a token on another region, provide that region’s url (e.g. japan.app.zuva.ai, eu.app.zuva.ai) instead of the one provided above (us.app.zuva.ai).
Going forward, the
sdk variable is going to be used to interact with DocAI.
Get the DocAI Fields
All DocAI users can utilize the Zuva-maintained AI model catalog in their workflow. These AI models are known as Fields in DocAI: they are used to extract entities, provisions and clauses from legal documents. DocAI is able to extract text written in a non-standard way (i.e. non-templated), which results in an offering that searches based on the AI’s understanding of legal concepts, as opposed to traditional regular expressions and database searches.
fields variable contains a reference to all of the Fields available to you. When run, the above will print how many fields were found on the region that you used when creating an instance of the Python SDK.
Submit your documents to DocAI
Submitting documents to DocAI is the first step towards obtaining metadata out of the document. Note that DocAI will not use the documents submitted to it for training purposes. These submitted documents are treated as confidential and are not used by Zuva for anything.
You can submit your documents to DocAI for analysis by running the following:
The above will go through all of your documents (from your docs variable) and submit the document to DocAI. This is done by using a function that the sdk exposes:
file.create, which takes the file content (in this case,
f.read()). The DocAI response is assigned to a variable named
file, which contains properties that can be used by you to keep track of this document.
The three properties used above are:
file.name (set locally to make it easier to keep track),
file.id (the file’s unique identifier) and
file.expiration (when it will be deleted).
These files are loaded in
docai_files, which will be used in the next steps to create requests in DocAI.
Create requests in DocAI
All requests in this script will be added to a variable named
Every request contains a unique identifier, which will be used further in this tutorial to keep track of the request’s status. Once the request completes processing, we can then obtain the results. In other words: DocAI performs requests asynchronously. When you create a request, DocAI will automatically perform OCR (if needed) behind the scenes without requiring the user to explicitly call the OCR service.
This service will tell you the dominant language of the document. The following creates the requests. One request is created per document provided. The following provides a
file_id list to the Language Classification service by using the sdk function
This service will tell you the document’s type (e.g. Real Estate Agreement). It will also tell you if it is a contract or not. The following creates the requests. One request is created per document provided. The following provides a
file_id list to the Document Classification service by using the
This service will extract the fields you have chosen to be extracted from your document. It will return the text (which can be multiple words, sentences or paragraphs), as well as where it was found in the document.
Choosing the fields
By now you have a variable named
fields that contain ~1300+ field references. You can filter these using the field names that you would like to use. Below is the list of field names that this tutorial will extract out of your documents, as well as how their unique identifiers (used by the Field Extraction service) are retrieved.
field_ids variable now contains a
field_id that represents the fields defined in
Using the fields
Now that you have both a list of documents and a list of field identifiers, you can use the sdk function
extraction.create, and provide these two lists to it.
One field extraction request is created per document in
docai_files. Each request will be responsible to search the document for the fields from
Combine the requests in one list
You now have three variables that contain numerous requests each. These variables are
Combine all of these to the requests variable:
Wait until all requests complete
When a request is created, DocAI’s workers will pick them up and process them. Since this tutorial will obtain the DocAI output and save them to a spreadsheet, we will need to form a data structure that allows us to organize DocAI’s results in a manner that makes it easy for us to retrieve them when it’s time to save them.
Thus, the following snippet performs two key things:
- Every two seconds, it checks all of the requests to see if they have completed.
- When a request completes, its metadata is added to the
resultsvariable, which will be used later in this tutorial. In addition, the request is removed from the list since it is no longer needed.
This snippet leverages numerous
sdk functions to achieve this task (also known as “polling” the requests until they complete):
.update()is used to obtain the request’s latest status.
.get_results()is used by Field Extraction requests to obtain all of the extracted text (and their locations). There is a separate function to obtain the Field Extraction results because the response varies in size (few-to-many results, depending on file size and number of fields requested), compared to the other services which are always going to contain a fixed number of data points.
.is_type()is used since the requests variable contains multiple different types of requests (language, document classification, field extraction)
.is_finished()to check if the request completed processing
.is_successful()to check if the request completed successfully
This data structure was defined for this tutorial, and has no bearing on how DocAI performs its tasks. The structure exists to collate DocAI results to their respective
file_id. This data in practice will likely be saved in a database, from where the results can be obtained. However, for this tutorial, we are setting this in-memory.
Save as a Spreadsheet
Using the organized data, we can now save it in a format that is easy to share with others.
First, we need to define the spreadsheet’s columns:
Second, we’ll need to go through the organized data and set it up for intake by a third-party package (in this case, pandas) so that our metadata maps to the columns above.
Using the new data variable, create a
And then save it as a spreadsheet:
On This Page