When a file is classified using the language endpoint, Zuva identifies the predominant language of the file. Zuva can classify files written in languages that use Latin characters, as well as some languages that use non-Latin characters.
Important notes when classifying files:
- Language detection relies in part on the OCR quality of the file. Lower-quality OCR may cause an incorrect language classification.
- In the case where a document has multiple languages, Zuva will identify a single language, typically the predominant language in the document.
- We have seen Zuva encounter errors with files in non-Latin-script languages, where the OCR had trouble extracting the non-Latin (specifically Asian-language) characters. Even though some documents in these scripts may appear to be processed properly in Zuva, they are not officially supported.
The following languages can be classified:
What if a file is classified incorrectly?
Once a document has been classified, you cannot change the classification.
In most instances, an incorrect language classification is caused by poor OCR quality. When two languages have very similar likelihoods for a document, small changes in the OCR output can be enough for Zuva to identify the other language instead.
The use of multiple languages throughout a document may also affect the accuracy of the language tag. In these instances, Zuva identifies only the most likely language, not every language that appears in the document.
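The effect of near-equal likelihoods can be illustrated with a minimal sketch. It assumes a classifier that assigns each candidate language a score and reports only the highest one. The function name `pick_language`, the `min_margin` threshold, and all scores below are hypothetical and are not part of Zuva's API or output.

```python
from typing import Dict, Tuple


def pick_language(scores: Dict[str, float], min_margin: float = 0.05) -> Tuple[str, bool]:
    """Return the single most likely language and whether the result is a near-tie.

    `scores` maps language names to hypothetical likelihoods. The result is
    flagged as ambiguous when the gap between the top two scores is smaller
    than `min_margin` (an illustrative threshold, not a Zuva setting).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    # Rank candidates from most to least likely.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_language, top_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return top_language, (top_score - runner_up_score) < min_margin


# Hypothetical scores for the same document with good and poor OCR.
clean_scan = {"Spanish": 0.62, "Portuguese": 0.30, "English": 0.08}
noisy_scan = {"Spanish": 0.46, "Portuguese": 0.48, "English": 0.06}

print(pick_language(clean_scan))
print(pick_language(noisy_scan))
```

In this sketch, the noisy scan shifts only a small amount of likelihood between two similar languages, yet the single reported language changes. Only one language is ever returned, even when several have non-trivial scores.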
If a document has been classified with the wrong language, you can try improving the quality of the original document and re-uploading it to Zuva.
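Because an existing classification cannot be changed, the re-upload approach amounts to classifying a new file. The sketch below shows that flow under stated assumptions: `LanguageClient`, `upload`, `classify_language`, and `FakeClient` are hypothetical stand-ins written for this example, not Zuva's SDK or endpoints, and the sketch assumes a re-uploaded file receives its own identifier.

```python
from typing import Dict, Protocol


class LanguageClient(Protocol):
    """Hypothetical client interface; not Zuva's actual SDK."""

    def upload(self, content: bytes) -> str: ...

    def classify_language(self, file_id: str) -> str: ...


def reclassify(client: LanguageClient, improved_content: bytes) -> Dict[str, str]:
    """Upload an improved copy of a document and classify it as a new file."""
    new_file_id = client.upload(improved_content)
    return {"file_id": new_file_id, "language": client.classify_language(new_file_id)}


class FakeClient:
    """In-memory stand-in so the sketch runs without a network connection."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def upload(self, content: bytes) -> str:
        file_id = f"file-{len(self._files) + 1}"
        self._files[file_id] = content
        return file_id

    def classify_language(self, file_id: str) -> str:
        # Toy rule that stands in for real language detection.
        return "French" if b"bonjour" in self._files[file_id] else "unknown"


client = FakeClient()
old_id = client.upload(b"b0nj0ur")        # poor OCR garbles the text
result = reclassify(client, b"bonjour")   # improved scan, uploaded as a new file
print(old_id, client.classify_language(old_id))
print(result)
```

The original file keeps its original classification in this sketch; the corrected result is attached to the newly uploaded file.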