Python SDK tutorial
Introduction
This tutorial shows, step by step, how to use Zuva to obtain key metadata from your documents and how to save that metadata to a spreadsheet so it is easy to read.
By the end, you will have a working Python script that uses several packages to take documents from a local folder, extract and classify key metadata using machine learning (via API), and save the results in a spreadsheet. Best of all, you will be able to explain what each component of the script does!
This tutorial is also available on GitHub as plain Python and as an interactive Jupyter notebook.
What will you learn?
By going through this tutorial, you will learn:
- How to create an instance of the Python SDK to interact with Zuva
- How to submit your documents to Zuva
- How to create Language, Classification, and Field Extraction requests
- How to select fields from Zuva’s readily-available out-of-the-box fields
- How to output the data Zuva provides into an easy-to-read spreadsheet
Requirements
To go through this tutorial, you will need:
- The Python interpreter (this tutorial uses v3.10)
- A Zuva account and token - see the Getting started guide
- A copy of Zuva’s Python SDK
Let’s Build!
Import the necessary packages
The first step is to import the necessary Python packages in your script. Below are the packages needed by this tutorial:
Get the files
Before you can run Zuva’s ML on your documents, you need to submit them to Zuva. Below is how you can provide a folder name (upload_files, in this case) that points to a folder containing your documents.
Going forward, docs contains the file path and file name of each of the underlying files. An example of this is: upload_files/mydocument.pdf.
To verify the contents of docs, you can run print([d for d in docs]). While this tutorial uses local files, the same workflow applies to remote-hosted files. The only difference is that you would use the remote solution’s functionality to obtain each file’s content over the network.
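As a rough sketch of this step, the helper below collects the path of every file directly inside a folder using only the standard library. The function name list_documents is illustrative and not part of the Zuva SDK, and the tutorial's own script may gather the files differently.

```python
import os


def list_documents(folder):
    """Return the path of every regular file directly inside `folder`, sorted."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))  # skip sub-folders
    )


# "upload_files" is the folder name used in this tutorial.
if os.path.isdir("upload_files"):
    docs = list_documents("upload_files")
    print([d for d in docs])
```

Sorting the paths keeps the order stable between runs, which makes the later spreadsheet rows easier to compare.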
Create an instance of the SDK
At this point in the tutorial you have imported the necessary Python packages, as well as loaded the documents that will be sent to Zuva. You should also have a token that was created, as mentioned in the requirements.
Zuva’s API servers are hosted in both the US and Europe regions. Choosing the region closest to you can help decrease latency, and choosing the appropriate region can help you meet data residency requirements. If you created a token in another region, provide that region’s URL (e.g. eu.app.zuva.ai) instead of the one provided above (us.app.zuva.ai).
Going forward, the sdk variable is going to be used to interact with Zuva.
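One way to keep the region and token out of the script itself is sketched below. It only prepares the two values; it does not call the SDK. The environment variable names ZUVA_TOKEN and ZUVA_REGION and both helper functions are hypothetical, and the "https://" scheme is an assumption, so check the SDK documentation for the exact URL format it expects.

```python
import os

# Hostnames as given in this tutorial; the "https://" scheme is an assumption.
REGION_HOSTS = {"us": "us.app.zuva.ai", "eu": "eu.app.zuva.ai"}


def region_url(region):
    """Return the base URL for a region code such as "us" or "eu"."""
    try:
        host = REGION_HOSTS[region.lower()]
    except KeyError:
        raise ValueError(f"unknown region: {region!r}") from None
    return f"https://{host}"


def load_settings(env=os.environ):
    """Read the token and region from (hypothetical) environment variables."""
    token = env.get("ZUVA_TOKEN")
    if not token:
        raise RuntimeError("set ZUVA_TOKEN to the token created for your account")
    return {"url": region_url(env.get("ZUVA_REGION", "us")), "token": token}


print(region_url("eu"))  # prints https://eu.app.zuva.ai
```

The resulting url and token values are what you would hand to the SDK when creating the sdk instance.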
Get the Zuva Fields
All Zuva users can utilize the Zuva-maintained AI model catalog in their workflow. These AI models are known as Fields in Zuva: they are used to extract entities, provisions and clauses from legal documents. Zuva is able to extract text written in a non-standard way (i.e. non-templated), which results in an offering that searches based on the AI’s understanding of legal concepts, as opposed to traditional regular expressions and database searches.
The fields variable contains a reference to all of the Fields available to you. When run, the above will print how many fields were found in the region that you used when creating an instance of the Python SDK.
Submit your documents to Zuva
Submitting documents to Zuva is the first step towards obtaining metadata from them. Note that submitted documents are treated as confidential: Zuva does not use them for training or for any other purpose.
You can submit your documents to Zuva for analysis by running the following:
The above goes through every document in your docs variable and submits it to Zuva. This is done by using a function that the sdk exposes, file.create, which takes the file content (in this case, f.read()). The Zuva response is assigned to a variable named file, which contains properties that you can use to keep track of the document.
The three properties used above are: file.name (set locally to make it easier to keep track), file.id (the file’s unique identifier) and file.expiration (when it will be deleted).
These files are collected in zuva_files, which will be used in the next steps to create requests in Zuva.
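The flow described above can be sketched without network access by swapping in a stand-in object. FakeFileAPI below is not the real SDK: it imitates only what this section describes (a create function that takes file content and returns an object with id and expiration), and passing the content as a positional argument is an assumption. The helper submit_documents is likewise illustrative.

```python
import itertools
from types import SimpleNamespace


class FakeFileAPI:
    """Stand-in for sdk.file (NOT the real SDK); hands out made-up identifiers."""

    def __init__(self):
        self._counter = itertools.count(1)

    def create(self, content):
        # The real service would upload `content`; here we only fabricate a response.
        return SimpleNamespace(id=f"file-{next(self._counter)}",
                               expiration="(assigned by Zuva)")


def submit_documents(sdk, named_contents):
    """Submit each (name, bytes) pair and return the response objects."""
    zuva_files = []
    for name, content in named_contents:
        file = sdk.file.create(content)
        file.name = name  # set locally to make the document easier to track
        zuva_files.append(file)
    return zuva_files


fake_sdk = SimpleNamespace(file=FakeFileAPI())
zuva_files = submit_documents(fake_sdk, [("upload_files/mydocument.pdf", b"%PDF-")])
for file in zuva_files:
    print(file.name, file.id, file.expiration)
```

In a real run, the bytes would come from opening each path in docs in binary mode and calling f.read().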
Create requests in Zuva
All requests in this script will be added to a variable named requests:
Every request contains a unique identifier, which will be used further in this tutorial to keep track of the request’s status. Once the request completes processing, we can then obtain the results. In other words: Zuva performs requests asynchronously. When you create a request, Zuva will automatically perform OCR (if needed) behind the scenes without requiring the user to explicitly call the OCR service.
Language Classification
This service tells you the dominant language of each document. The following creates the requests, one per document provided, by passing a list of file_id values to the Language Classification service through the sdk function language.create.
Document Classification
This service tells you each document’s type (e.g. Real Estate Agreement) and whether or not it is a contract. The following creates the requests, one per document provided, by passing a list of file_id values to the Document Classification service through the sdk function classification.create.
Field Extraction
This service will extract the fields you have chosen to be extracted from your document. It will return the text (which can be multiple words, sentences or paragraphs), as well as where it was found in the document.
Choosing the fields
By now you have a variable named fields that contains more than 1,300 field references. You can filter these using the names of the fields that you would like to use. Below is the list of field names that this tutorial will extract from your documents, as well as how their unique identifiers (used by the Field Extraction service) are retrieved.
The field_ids variable now contains one field_id for each of the fields named in field_names.
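A minimal sketch of this name-to-identifier lookup follows. It assumes each entry in fields exposes a name and an id; the Field dataclass is a stand-in for whatever the SDK returns, and the field names and identifiers shown are placeholders rather than real catalog values.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Stand-in for one catalog entry; assumes each has an id and a name."""
    id: str
    name: str


def select_field_ids(fields, field_names):
    """Return the ids of the requested field names, in the order requested."""
    by_name = {field.name: field.id for field in fields}
    missing = [name for name in field_names if name not in by_name]
    if missing:
        # Failing early avoids silently extracting fewer fields than intended.
        raise KeyError(f"fields not found in catalog: {missing}")
    return [by_name[name] for name in field_names]


# Placeholder catalog and selection, for illustration only.
fields = [Field("id-001", "Title"), Field("id-002", "Parties"), Field("id-003", "Date")]
field_names = ["Title", "Date"]
field_ids = select_field_ids(fields, field_names)
print(field_ids)  # prints ['id-001', 'id-003']
```

Raising on a missing name is a design choice: a typo in field_names would otherwise only show up later as an empty spreadsheet column.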
Using the fields
Now that you have both a list of documents and a list of field identifiers, you can use the sdk function extraction.create and provide these two lists to it.
One field extraction request is created per document in zuva_files. Each request searches its document for the fields in field_ids.
Combine the requests in one list
You now have three variables that each contain a number of requests: languages, classifications and extractions.
Combine all of these into the requests variable:
Wait until all requests complete
When requests are created, Zuva’s workers pick them up and process them. Since this tutorial obtains the Zuva output and saves it to a spreadsheet, we need a data structure that organizes Zuva’s results so that they are easy to retrieve when it is time to save them.
Thus, the following snippet performs two key things:
- Every two seconds, it checks all of the requests to see if they have completed.
- When a request completes, its metadata is added to the results variable, which will be used later in this tutorial. In addition, the request is removed from the list since it is no longer needed.
This snippet leverages several sdk functions to achieve this task (also known as “polling” the requests until they complete):
- .update() is used to obtain the request’s latest status.
- .get_results() is used by Field Extraction requests to obtain all of the extracted text (and their locations). There is a separate function to obtain the Field Extraction results because the response varies in size (few-to-many results, depending on file size and number of fields requested), compared to the other services which are always going to contain a fixed number of data points.
- .is_type() is used since the requests variable contains multiple different types of requests (language, document classification, field extraction).
- .is_finished() is used to check if the request completed processing.
- .is_successful() is used to check if the request completed successfully.
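The polling loop described above can be sketched as follows. FakeRequest is a stand-in, not an SDK class: it implements only update(), is_finished() and is_successful(), with behaviour assumed from the descriptions in this section. The sketch leaves out is_type() and get_results() because their signatures are not shown here, and the helper name poll_until_done is illustrative.

```python
import time


class FakeRequest:
    """Stand-in request that finishes after a set number of update() calls."""

    def __init__(self, file_id, kind, polls_needed, ok=True):
        self.file_id = file_id
        self.kind = kind
        self._remaining = polls_needed
        self._ok = ok

    def update(self):
        # The real call would fetch the latest status from Zuva.
        self._remaining = max(0, self._remaining - 1)

    def is_finished(self):
        return self._remaining == 0

    def is_successful(self):
        return self.is_finished() and self._ok


def poll_until_done(requests, interval=2.0):
    """Poll every request until all have finished; return (succeeded, failed)."""
    pending, succeeded, failed = list(requests), [], []
    while pending:
        still_pending = []
        for request in pending:
            request.update()
            if not request.is_finished():
                still_pending.append(request)
            elif request.is_successful():
                succeeded.append(request)
            else:
                failed.append(request)
        pending = still_pending
        if pending:
            time.sleep(interval)  # wait before checking the remaining requests
    return succeeded, failed


reqs = [
    FakeRequest("file-1", "language", polls_needed=1),
    FakeRequest("file-1", "extraction", polls_needed=3),
    FakeRequest("file-2", "classification", polls_needed=2, ok=False),
]
done, failed = poll_until_done(reqs, interval=0)
print(len(done), len(failed))  # prints 2 1
```

Building a new pending list on each pass, rather than removing items from the list being iterated, avoids skipping requests by accident.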
Data Structure
This data structure was defined for this tutorial, and has no bearing on how Zuva performs its tasks. The structure exists to collate Zuva results under their respective file_id. In practice, this data would likely be saved in a database, from which the results can be retrieved. For this tutorial, however, we keep it in memory.
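One possible in-memory layout is sketched below: a dictionary keyed by file_id, with one slot per service. The key names (language, document_type, is_contract, extractions) and the sample values are assumptions made for illustration; the tutorial's own script may organize its results differently.

```python
from collections import defaultdict


def new_file_result():
    """Empty result record for one file (hypothetical layout)."""
    return {
        "language": None,                  # from Language Classification
        "document_type": None,             # from Document Classification
        "is_contract": None,               # from Document Classification
        "extractions": defaultdict(list),  # field name -> list of extracted texts
    }


# Looking up an unseen file_id creates its empty record automatically.
results = defaultdict(new_file_result)

results["file-1"]["language"] = "English"
results["file-1"]["document_type"] = "Real Estate Agreement"
results["file-1"]["is_contract"] = True
results["file-1"]["extractions"]["Title"].append("LEASE AGREEMENT")

print(dict(results["file-1"]["extractions"]))  # prints {'Title': ['LEASE AGREEMENT']}
```

Keying by file_id means results can be stored in whatever order the requests happen to finish.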
Save as a Spreadsheet
Using the organized data, we can now save it in a format that is easy to share with others.
First, we need to define the spreadsheet’s columns:
Second, we’ll need to go through the organized data and set it up for intake by a third-party package (in this case, pandas) so that our metadata maps to the columns above.
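To illustrate this mapping step, the sketch below flattens per-file results into one dictionary per spreadsheet row. It assumes a results dictionary keyed by file_id with the hypothetical keys language, document_type, is_contract and extractions; the column titles and the helper name build_rows are also illustrative.

```python
def build_rows(results, field_names):
    """Flatten per-file results into (columns, rows) for a spreadsheet."""
    columns = ["File ID", "Language", "Document Type", "Is Contract", *field_names]
    rows = []
    for file_id, entry in results.items():
        row = {
            "File ID": file_id,
            "Language": entry.get("language"),
            "Document Type": entry.get("document_type"),
            "Is Contract": entry.get("is_contract"),
        }
        extractions = entry.get("extractions", {})
        for name in field_names:
            # A field can match several passages; join them into a single cell.
            row[name] = "\n".join(extractions.get(name, []))
        rows.append(row)
    return columns, rows


# Sample data in the assumed layout, for illustration only.
sample = {
    "file-1": {
        "language": "English",
        "document_type": "Real Estate Agreement",
        "is_contract": True,
        "extractions": {"Title": ["LEASE AGREEMENT"]},
    },
}
columns, rows = build_rows(sample, ["Title", "Date"])
print(columns)
print(rows[0])
```

If pandas is installed, rows in this shape can be passed to pandas.DataFrame(rows, columns=columns), which keeps the column order defined here.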
Using the new data variable, create a DataFrame:
And then save it as a spreadsheet:
The Code
The full code of this tutorial is available on GitHub as plain Python and as a Jupyter notebook.