The layouts are how DocAI “sees” the document: they contain metadata about every character and page in the document.
This tutorial will cover how to request the layouts for a local PDF document, as well as parse the content in that file to see how DocAI “saw” the document.
After you have gone through this tutorial, you will be left with a working Python script that leverages multiple packages to ingest layouts content and output the relevant data.
What will you learn?
By going through this tutorial, you will learn:
- How to create an instance of the Python SDK to interact with DocAI
- How to submit a document to DocAI
- How to create an OCR request
- How to load and iterate through the contents of a layouts file in Python
To follow this tutorial, you will need:
- The Python interpreter (this tutorial uses v3.10)
- A Zuva account and token - see the Getting started guide
- A copy of Zuva’s DocAI Python SDK
- A copy of recognition_results_pb2.py
- The google.protobuf Python package (this tutorial uses v3.0.0), usually installed with pip install protobuf or pip3 install protobuf
For DocAI to perform classification and extractions on the documents submitted to it, it must first be able to “see” what content exists in the document. Whether it is a digital-native PDF (e.g. a Word document saved as a PDF) or a PDF containing a scan of a physical document, DocAI needs a method to “read” the content so that downstream processing can be performed.
Optical Character Recognition enables DocAI to achieve this. By going through every character, on every page, DocAI creates a new representation of the user-provided document. This new representation allows DocAI to have a better “understanding” of the contents provided by the user, and thus is used in DocAI’s downstream processing for all of its machine learning processes.
OCR is an optional service. However, if you would like to skip DocAI’s OCR, you will need to create your own representation using the output of your OCR engine, or provide raw text. The latter partially degrades machine learning performance, since raw text does not contain the physical locations of the characters on the page.
Layouts: The Overview
The layouts schema can be found here.
The basic overview is as follows:
- You have a Document. This is the “entry-point” into the contents of the layouts.
- A Document contains a list of Characters. Each Character is represented by its unicode value.
- A Document contains a list of Pages. Each Page has:
  - a horizontal dots-per-inch (DPI)
  - a vertical dots-per-inch (DPI)
  - a CharacterRange (e.g. “characters 500 to 1000 exist on this page”)
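As an illustrative sketch only (plain Python classes, not the actual generated protobuf code, and with field names assumed from the sample output later in this tutorial), the structure looks roughly like:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharacterRange:
    start: int  # index of the first character on the page
    end: int    # index one past the last character on the page


@dataclass
class Page:
    width: int   # pixels
    height: int  # pixels
    dpi_x: int   # horizontal dots-per-inch
    dpi_y: int   # vertical dots-per-inch
    range: CharacterRange


@dataclass
class Character:
    unicode: int  # the character's unicode code point


@dataclass
class Document:
    characters: List[Character]
    pages: List[Page]
```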
Developers who build applications with a document viewer to visualize where extractions occurred in their contracts would leverage the layouts information to map what DocAI extracted to the document/pages that the end-user (e.g. a reviewer) sees in their solution’s document viewer.
Import the necessary packages
The first step is to import the necessary Python packages in your script. Below are the packages needed by this tutorial:
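A minimal import block might look like the following. The module names for the SDK (zdai) and for the generated protobuf file are assumptions based on the requirements list above; adjust them to match your local copies.

```python
import time  # used later to poll the OCR request's status

# Module names below are assumptions; adjust them to match your local
# copies of Zuva's DocAI Python SDK and the generated protobuf file.
try:
    from zdai import ZDAISDK          # Zuva DocAI Python SDK
    import recognition_results_pb2    # generated protobuf for layouts
except ImportError:
    # Both files must be present locally (see the requirements list).
    ZDAISDK = None
    recognition_results_pb2 = None
```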
Create an instance of the SDK
At this point in the tutorial you have imported the necessary Python packages. You should also have created a token, as mentioned in the requirements.
DocAI offers multiple regions to choose from, which can help you decrease latency (by being physically closer to the region) and meet data residency requirements. If you created a token in another region, provide that region’s URL (e.g. eu.app.zuva.ai) instead of the one provided above.
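As a sketch, the SDK instance might be created as follows. The endpoint URL and the constructor’s keyword arguments are assumptions about the SDK’s interface; substitute your own token and region.

```python
# Placeholder values: use your own token, and your region's URL if it
# differs (the exact endpoint path is an assumption).
ZUVA_URL = "https://us.app.zuva.ai/api/v2"
ZUVA_TOKEN = "<your-token>"


def make_sdk(url: str, token: str):
    # ZDAISDK comes from the SDK import earlier; the constructor's
    # keyword arguments are assumptions about the SDK's interface.
    return ZDAISDK(url=url, token=token)
```

With that in place, `sdk = make_sdk(ZUVA_URL, ZUVA_TOKEN)` produces the instance used below.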
Going forward, the sdk variable is going to be used to interact with DocAI.
Before we can obtain the layouts from DocAI, we will need to submit a document and run an OCR request. To upload a file, such as the demo document available here, use the following code:
The file variable will be used going forward to refer to the document’s unique identifier, since DocAI has no concept of filenames. The unique identifier is available as file.id.
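A sketch of the upload step is below. The shape of sdk.file.create (accepting raw bytes and returning a (file, response) pair) is an assumption about the SDK; adjust to your version.

```python
def upload_document(sdk, path: str):
    """Upload a local PDF to DocAI and return the created file object.

    sdk.file.create(content=...) returning a (file, response) pair is
    an assumption about the SDK's interface.
    """
    with open(path, "rb") as f:
        file, _ = sdk.file.create(content=f.read())
    return file
```

Usage would look like `file = upload_document(sdk, "demo.pdf")`, after which `file.id` holds the unique identifier.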
Create an OCR request
The sdk variable exposes a function named ocr.create, which accepts a list of file IDs. One request is created per document provided. Since we are only running this on one document, ocr_request = ocrs[0] simply assigns the first (and only) request in ocrs to a new variable.
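A sketch of this step, assuming ocr.create takes a file_ids list and returns a (requests, response) pair as described above:

```python
def create_ocr_request(sdk, file_id: str):
    """Create one OCR request for the given file ID.

    sdk.ocr.create(file_ids=[...]) returning one request per file is an
    assumption based on the description above.
    """
    ocrs, _ = sdk.ocr.create(file_ids=[file_id])
    return ocrs[0]  # first (and only) request, since we sent one file
```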
Wait for OCR request to complete
We will need to wait for the DocAI OCR request to finish processing before we can obtain the layouts content. The following snippet checks the request’s latest status every two seconds. If the request completes successfully, it loads the layouts variable using ocr.get_layouts, which accepts the unique ID of the request.
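A polling sketch is below. The request-object method names (update, is_finished, is_successful) and the return shape of get_layouts are assumptions about the SDK; adjust to your version.

```python
import time


def wait_for_layouts(sdk, ocr_request, poll_seconds: float = 2.0):
    """Poll the OCR request until it finishes, then fetch the layouts.

    Method names (update/is_finished/is_successful) and the get_layouts
    return shape are assumptions about the SDK's interface.
    """
    while not ocr_request.is_finished():
        time.sleep(poll_seconds)
        ocr_request.update()  # refresh the request's latest status
    if not ocr_request.is_successful():
        raise RuntimeError(f"OCR request {ocr_request.id} did not succeed")
    layouts, _ = sdk.ocr.get_layouts(request_id=ocr_request.id)
    return layouts
```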
Load the Layouts
By now the OCR request will have completed, and the layouts variable contains the content needed for the next steps of this tutorial.
Using the package that we imported earlier, recognition_results_pb2.py, we can load the layouts content in a way that allows us to interact with it. Since the Document is the entry-point, we can create a new Document object and load it using the layouts that DocAI provided to our script.
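ParseFromString is the standard protobuf loader; whether the layouts arrive as raw bytes in exactly this form is an assumption, so treat this as a sketch:

```python
def load_layouts(layouts_bytes: bytes):
    """Parse raw layouts bytes into a Document protobuf message.

    recognition_results_pb2 is the generated module listed in the
    requirements; ParseFromString is the standard protobuf loader.
    """
    doc = recognition_results_pb2.Document()
    doc.ParseFromString(layouts_bytes)
    return doc
```

Then `doc = load_layouts(layouts)` gives us the object used below.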
Going forward, the doc variable is what we will use to dig into the layouts.
Get number of pages
The doc.pages field contains a list of Page objects.
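For example (a tiny stand-in for doc is used here so the snippet runs on its own; with real data, doc comes from parsing the layouts):

```python
from types import SimpleNamespace

# Stand-in for the parsed Document; with real data, doc comes from
# Document.ParseFromString(...) as shown above.
doc = SimpleNamespace(pages=[None, None])

print(f"Number of pages: {len(doc.pages)}")  # Number of pages: 2
```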
Get number of characters
The doc.characters field contains a list of Character objects.
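Likewise (again with a stand-in doc so the snippet runs on its own; the count matches the sample document’s first two pages):

```python
from types import SimpleNamespace

# Stand-in for the parsed Document, sized to match the sample output
# later in this tutorial (characters 0 through 5518 on pages 1-2).
doc = SimpleNamespace(characters=[None] * 5518)

print(f"Number of characters: {len(doc.characters)}")  # Number of characters: 5518
```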
Get the first 15 characters of document
We can use doc.characters to obtain the first 15 characters:
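A snippet along these lines would do it (the unicode field name is an assumption about the generated Character class; a stand-in doc pre-loaded with the demo document’s first 15 code points is used so the snippet runs on its own):

```python
from types import SimpleNamespace

# Stand-in for the parsed Document, pre-loaded with the demo
# document's first 15 code points so this snippet is self-contained.
codes = [69, 120, 104, 105, 98, 105, 116, 32, 49, 48, 46, 49, 48, 32, 50]
doc = SimpleNamespace(characters=[SimpleNamespace(unicode=c) for c in codes])

# Each Character stores its unicode code point (field name assumed).
print([c.unicode for c in doc.characters[:15]])
```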
The above, however, returns:
[69, 120, 104, 105, 98, 105, 116, 32, 49, 48, 46, 49, 48, 32, 50]
This is because the Character values are stored as unicode numbers. These can easily be converted by running the following:
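Python’s built-in chr converts a code point back to its character:

```python
# The unicode values returned above, converted back to text.
codes = [69, 120, 104, 105, 98, 105, 116, 32, 49, 48, 46, 49, 48, 32, 50]
text = "".join(chr(code) for code in codes)
print(text)  # Exhibit 10.10 2
```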
Exhibit 10.10 2
Get the page metadata
As mentioned earlier, each Page contains its own metadata. The following can be used to go through all of the layouts’ pages and print their metadata.
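A sketch of the loop is below; the field names mirror the sample output that follows and are assumptions about the generated Page class.

```python
def print_page_metadata(pages):
    """Print each page's dimensions, DPI and character range.

    Field names mirror the sample output below; the exact attribute
    layout of the generated Page class is an assumption.
    """
    for number, page in enumerate(pages, start=1):
        print(f"Page {number}: "
              f"width = {page.width} pixels "
              f"height = {page.height} pixels "
              f"dpi_x = {page.dpi_x} "
              f"dpi_y = {page.dpi_y} "
              f"range_start = {page.range.start} "
              f"range_end = {page.range.end}")
```

Calling `print_page_metadata(doc.pages)` walks every page.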
Below is a sample output for the first two pages:
Page 1: width = 2550 pixels height = 3300 pixels dpi_x = 300 dpi_y = 300 range_start = 0 range_end = 822
Page 2: width = 2550 pixels height = 3300 pixels dpi_x = 300 dpi_y = 300 range_start = 822 range_end = 5518
Get the Character metadata
Earlier, we printed out the first handful of characters of the layouts. The following continues with this approach; however, it also prints additional metadata for each character, and uses the pages data to locate on which page each character was found.
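A sketch of this loop is below. The bounding-box field names (bounding_box, x1, y1, x2, y2) mirror the sample output that follows and are assumptions about the generated Character class.

```python
def print_character_metadata(doc, count: int = 15):
    """Print each character with its page number and bounding box.

    The bounding-box attribute layout (bounding_box.x1 etc.) is an
    assumption based on the sample output below.
    """
    for index, char in enumerate(doc.characters[:count]):
        # Find the page whose character range contains this index.
        page_number = next(
            number
            for number, page in enumerate(doc.pages, start=1)
            if page.range.start <= index < page.range.end
        )
        box = char.bounding_box
        print(f'"{chr(char.unicode)}": Page {page_number}, '
              f"x1={box.x1}, y1={box.y1}, x2={box.x2}, y2={box.y2}")
```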
Running the above for the first 15 characters returns:
"E": Page 1, x1=2183, y1=162, x2=2209, y2=190
"x": Page 1, x1=2211, y1=170, x2=2232, y2=190
"h": Page 1, x1=2233, y1=161, x2=2256, y2=190
"i": Page 1, x1=2256, y1=161, x2=2266, y2=190
"b": Page 1, x1=2267, y1=161, x2=2289, y2=191
"i": Page 1, x1=2290, y1=161, x2=2301, y2=190
"t": Page 1, x1=2302, y1=166, x2=2316, y2=191
" ": Page 1, x1=2316, y1=162, x2=2329, y2=190
"1": Page 1, x1=2329, y1=162, x2=2345, y2=190
"0": Page 1, x1=2348, y1=162, x2=2367, y2=191
".": Page 1, x1=2369, y1=183, x2=2377, y2=191
"1": Page 1, x1=2382, y1=162, x2=2398, y2=190
"0": Page 1, x1=2401, y1=162, x2=2420, y2=191
" ": Page 1, x1=2420, y1=163, x2=2420, y2=191
"2": Page 1, x1=130, y1=266, x2=149, y2=294
The above can be interpreted as:
- The first character found on Page 1 was an E, located in a rectangle with its top-left corner at (2183, 162) and its bottom-right corner at (2209, 190).