Using Layouts
Introduction
Layouts are how Zuva “sees” a document: they contain metadata about every character and page of the document.
This tutorial will cover how to request the layouts for a local PDF document, as well as parse the content in that file to see how Zuva “saw” the document.
After you have gone through this tutorial, you will be left with a working Python script that leverages multiple packages to request and parse a document’s layouts. This script can be used to ingest layouts content and output the relevant data.
This tutorial is also available on Github as plain Python and as an interactive Jupyter notebook.
What will you learn?
The following is what you will learn by going through this tutorial:
- How to create an instance of the Python SDK to interact with Zuva
- How to submit a document to Zuva
- How to create an OCR request
- How to load and iterate through the contents of a layouts file in Python
Requirements
To follow this tutorial, you will need:
- The Python interpreter (this tutorial uses v3.10)
- A Zuva account and token - see the Getting started guide
- A copy of Zuva’s Python SDK
- A copy of recognition_results_pb2.py
- The google.protobuf Python package (this tutorial uses v3.0.0), usually installed with pip install protobuf or pip3 install protobuf
OCR
For Zuva to perform classification and extraction on the documents submitted to it, it must first be able to “see” what content exists in each document. Whether it is a digital-native PDF (e.g. a Word document saved as a PDF) or a PDF containing a scan of a physical paper document, Zuva needs a way to “read” the content so that downstream processing can be performed.
Optical Character Recognition enables Zuva to achieve this. By going through every character, on every page, Zuva creates a new representation of the user-provided document. This new representation allows Zuva to have a better “understanding” of the contents provided by the user, and thus is used in Zuva’s downstream processing for all of its machine learning processes.
OCR is an optional service. However, if you would like to skip Zuva’s OCR, you will need to create your own representation using the output of your OCR engine, or provide raw text. The latter partially degrades machine learning performance, because raw text does not contain the physical locations of the characters on the page.
Layouts: The Overview
The layouts schema can be found here.
The basic overview is as follows:
You have a Document. This is the “entry-point” into the contents of the layouts.

Each Document contains a list of Characters:
- Each Character is represented by its unicode value
- Each Character has a BoundingBox

Each Document contains a list of Pages:
- Each Page has a width (in pixels)
- Each Page has a height (in pixels)
- Each Page has a horizontal dots-per-inch (DPI)
- Each Page has a vertical dots-per-inch (DPI)
- Each Page has a CharacterRange (e.g. “characters 500 to 1000 exist on this page”)
Developers who build applications with a document viewer can leverage the layouts information to map what Zuva extracted onto the document pages that the end user (e.g. a reviewer) sees in their solution’s document viewer.
Let’s Build!
Import the necessary packages
The first step is to import the necessary Python packages in your script. Below are the packages needed by this tutorial:
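The snippet below is a sketch: time and the generated recognition_results_pb2 module come from the requirements above, while the SDK module and class names (zdai, ZDAISDK) are assumptions that may differ in your copy of the SDK.

```python
import time  # used later to poll the OCR request status

# Zuva's Python SDK; the module and class names here are assumptions
# and may differ in your copy of the SDK.
from zdai import ZDAISDK

# Generated protobuf module for the layouts schema (see the requirements).
import recognition_results_pb2
```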
Create an instance of the SDK
At this point in the tutorial, you have imported the necessary Python packages. You should also have a token, created as mentioned in the requirements.
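A minimal sketch of creating the SDK instance, assuming a ZDAISDK constructor that accepts the API URL and your token; both values below are placeholders:

```python
# Placeholders: use the region you created your token on, and your own token.
sdk = ZDAISDK(url="https://us.app.zuva.ai/api/v2", token="<your-zuva-token>")
```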
Zuva’s API servers are hosted in both the US and Europe regions, which can help you decrease latency (by being physically closer to the servers) and meet data residency requirements. If you created a token in another region, provide that region’s URL (e.g. eu.app.zuva.ai) instead of the one provided above (us.app.zuva.ai).
Going forward, the sdk
variable is going to be used to interact with Zuva.
Submit document
Before we can obtain the layouts from Zuva, we will need to submit a document and run an OCR request. To upload a file, such as the demo document available here, use the following code:
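A sketch of the upload step. The file.create call, its content parameter, and its (file, response) return shape are assumptions based on how this tutorial uses the SDK; the demo filename is a placeholder:

```python
# Read the demo document from disk and submit it to Zuva.
with open("demo.pdf", "rb") as f:
    file, _ = sdk.file.create(content=f.read())

# Zuva has no concept of filenames; the unique identifier is what matters.
print(file.id)
```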
Going forward, the file variable will be used to refer to the file by its unique identifier, since Zuva has no concept of filenames. You can obtain the file’s unique identifier by running print(file.id).
Create an OCR request
This request will run OCR on the document that was uploaded earlier. This process creates the layouts.
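A sketch of creating the OCR request, assuming sdk.ocr.create accepts a list of file IDs and returns one request per file (along with the raw response):

```python
# One OCR request is created per file id provided.
ocrs, _ = sdk.ocr.create(file_ids=[file.id])

# We submitted a single document, so keep the first (and only) request.
ocr_request = ocrs[0]
```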
The sdk variable exposes a function named ocr.create, which accepts a list of file.id values. One request is created per document provided. Since we are only running this on one document, ocr_request = ocrs[0] simply assigns the first (and only) request in ocrs to a new variable.
Wait for OCR request to complete
We will need to wait for the Zuva OCR request to finish processing before we can obtain the layouts content. The following snippet checks the request’s latest status every two seconds. If the request completes successfully, it loads the layouts variable using the sdk function ocr.get_layouts, which accepts the unique ID of the request.
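A sketch of the polling loop. The update(), is_finished(), and is_successful() helpers on the request object are assumptions; check your copy of the SDK for the exact status API.

```python
# Check the request's latest status every two seconds until it finishes.
while not ocr_request.is_finished():
    time.sleep(2)
    ocr_request.update()  # assumed helper that refreshes the status
    print(f"OCR request {ocr_request.id} status: {ocr_request.status}")

# On success, fetch the layouts content for this request.
if ocr_request.is_successful():
    layouts = sdk.ocr.get_layouts(request_id=ocr_request.id)
else:
    raise RuntimeError(f"OCR request did not succeed: {ocr_request.status}")
```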
Load the Layouts
By now, the OCR request has completed and the layouts variable contains the content needed for the next steps of this tutorial.
Using the package that we imported earlier, we can leverage recognition_results_pb2.py
to load the layouts
content in a way that allows us to interact with it.
Using the Document message (the entry point), we can create a new Document object and load it with the layouts content that Zuva provided to our script.
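A sketch of loading the layouts into the protobuf message. ParseFromString is the standard protobuf deserialization call; treating layouts as raw bytes is an assumption about what ocr.get_layouts returns.

```python
# Create an empty Document message and populate it from the layouts bytes.
doc = recognition_results_pb2.Document()
doc.ParseFromString(layouts)
```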
Going forward, the doc
variable is what we will use to dig into the layouts.
Get number of pages
doc.pages contains a list of Page objects.
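Counting the pages is a one-liner over doc.pages:

```python
# len() works on protobuf repeated fields, so this prints the page count.
print(f"Number of pages: {len(doc.pages)}")
```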
Get number of characters
doc.characters contains a list of Character objects.
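Similarly, the character count is the length of doc.characters:

```python
# Total number of Character entries across the whole document.
print(f"Number of characters: {len(doc.characters)}")
```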
Get the first 15 characters of document
We can use doc.characters
to obtain the first 15 characters:
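A sketch using Python list slicing; the unicode field name is taken from the schema overview above:

```python
# Slice the first 15 Character entries and collect their unicode values.
print([c.unicode for c in doc.characters[:15]])
```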
The above, however, returns:
[69, 120, 104, 105, 98, 105, 116, 32, 49, 48, 46, 49, 48, 32, 50]
This is because the Character values are stored as Unicode code points. These can easily be converted by running the following:
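The conversion is plain Python: chr() turns a Unicode code point into its character. The sketch below runs on the code point values shown above; the commented one-liner is the equivalent against the parsed layouts.

```python
# Code point values printed in the previous step (from the sample document).
codepoints = [69, 120, 104, 105, 98, 105, 116, 32, 49, 48, 46, 49, 48, 32, 50]

# chr() converts each Unicode code point to its character.
text = "".join(chr(cp) for cp in codepoints)
print(text)  # Exhibit 10.10 2

# Against the parsed layouts, the equivalent would be:
# print("".join(chr(c.unicode) for c in doc.characters[:15]))
```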
Which returns:
Exhibit 10.10 2
Get the page metadata
As mentioned earlier, each Page contains its own metadata. The following can be used to go through all of the layouts’ pages and print each page’s metadata.
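A sketch of the page loop. The field names mirror the sample output below; the nested range.start/range.end layout is an assumption based on the CharacterRange message in the schema.

```python
# Pages are numbered from 1 for display purposes.
for number, page in enumerate(doc.pages, start=1):
    print(f"Page {number}:")
    print(f"  width = {page.width} pixels")
    print(f"  height = {page.height} pixels")
    print(f"  dpi_x = {page.dpi_x}")
    print(f"  dpi_y = {page.dpi_y}")
    print(f"  range_start = {page.range.start}")
    print(f"  range_end = {page.range.end}")
```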
Below is a sample output for the first two pages:
Page 1:
width = 2550 pixels
height = 3300 pixels
dpi_x = 300
dpi_y = 300
range_start = 0
range_end = 822
Page 2:
width = 2550 pixels
height = 3300 pixels
dpi_x = 300
dpi_y = 300
range_start = 822
range_end = 5518
Get the Character metadata
Earlier, we printed out the first handful of characters of the layouts.
The following continues with this approach; however, it also exposes additional metadata for each character, and uses the page data to locate the page on which each character was found.
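A sketch that combines the pieces above. The bounding_box field name and its x1/y1/x2/y2 corners are assumptions based on the schema overview and the sample output; the page lookup treats each page's character range as half-open, matching the sample page metadata (page 1 ends at 822, page 2 starts at 822).

```python
for index, c in enumerate(doc.characters[:15]):
    # Find the 1-based page whose character range contains this index.
    page_number = next(
        number
        for number, page in enumerate(doc.pages, start=1)
        if page.range.start <= index < page.range.end
    )
    box = c.bounding_box
    print(
        f'"{chr(c.unicode)}": Page {page_number}, '
        f"x1={box.x1}, y1={box.y1}, x2={box.x2}, y2={box.y2}"
    )
```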
Running the above for the first 15 characters
returns:
"E": Page 1, x1=2183, y1=162, x2=2209, y2=190
"x": Page 1, x1=2211, y1=170, x2=2232, y2=190
"h": Page 1, x1=2233, y1=161, x2=2256, y2=190
"i": Page 1, x1=2256, y1=161, x2=2266, y2=190
"b": Page 1, x1=2267, y1=161, x2=2289, y2=191
"i": Page 1, x1=2290, y1=161, x2=2301, y2=190
"t": Page 1, x1=2302, y1=166, x2=2316, y2=191
" ": Page 1, x1=2316, y1=162, x2=2329, y2=190
"1": Page 1, x1=2329, y1=162, x2=2345, y2=190
"0": Page 1, x1=2348, y1=162, x2=2367, y2=191
".": Page 1, x1=2369, y1=183, x2=2377, y2=191
"1": Page 1, x1=2382, y1=162, x2=2398, y2=190
"0": Page 1, x1=2401, y1=162, x2=2420, y2=191
" ": Page 1, x1=2420, y1=163, x2=2420, y2=191
"2": Page 1, x1=130 , y1=266, x2=149 , y2=294
The above can be interpreted as:
- The first character found on Page 1 was an E, located in a rectangle with its top-left corner at (2183, 162) and its bottom-right corner at (2209, 190)
The Code
The full code of this tutorial is available on Github as plain Python and as a Jupyter notebook.