File Submission

To process your documents with Zuva DocAI, you must first upload them using the POST /files endpoint.
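As a rough illustration of the upload step, the sketch below prepares a POST /files request with Python's standard library without sending it. The base URL, the bearer-token Authorization header, and the application/octet-stream content type are assumptions made for this example, not details taken from this page; confirm them against your own Zuva account and region.

```python
import urllib.request

# Assumption: base URL of the API; substitute the one for your region.
BASE_URL = "https://us.app.zuva.ai/api/v2"


def build_upload_request(content: bytes, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST /files request carrying the raw file bytes."""
    return urllib.request.Request(
        url=f"{BASE_URL}/files",
        data=content,
        headers={
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {token}",
            # Assumption: a generic binary type, leaving format detection to Zuva.
            # Without an explicit value, urllib would label the body as form data.
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )


request = build_upload_request(b"%PDF-1.7 example bytes", token="YOUR_TOKEN")
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would perform the actual upload.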

By default, uploaded documents expire and are removed from the system after 48 hours. You can specify a shorter expiration time when uploading the file, or delete the file at any time using the DELETE /files endpoint.
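The 48-hour lifetime and the early-deletion option can be sketched as follows. The base URL, the bearer-token header, and the /files/{file_id} path shape are assumptions for illustration only. How a shorter expiration is requested is not described here, so the sketch leaves it out.

```python
from datetime import datetime, timedelta, timezone
import urllib.request

# Assumption: base URL of the API; substitute the one for your region.
BASE_URL = "https://us.app.zuva.ai/api/v2"

# Stated above: uploaded files are removed after 48 hours.
DEFAULT_LIFETIME = timedelta(hours=48)


def default_expiry(uploaded_at: datetime) -> datetime:
    """Return the time at which a file uploaded at `uploaded_at` expires."""
    return uploaded_at + DEFAULT_LIFETIME


def build_delete_request(file_id: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a DELETE request for one uploaded file."""
    return urllib.request.Request(
        # Assumption: the file is addressed by its ID under /files.
        url=f"{BASE_URL}/files/{file_id}",
        # Assumption: bearer-token authentication.
        headers={"Authorization": f"Bearer {token}"},
        method="DELETE",
    )


uploaded = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(default_expiry(uploaded).isoformat())
```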

Zuva automatically recognizes over 60 common document and image file formats. Documents in these formats will automatically be processed with OCR prior to classification and field extraction.

Plain Text Content

Plain text documents are not subject to OCR. Instead, Zuva creates pages by applying word-wrapping and inserting appropriate page breaks. Only UTF-8 encoding is currently supported. Plain text will be automatically detected, but you may also choose to specify the content type text/plain.
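Because only UTF-8 is supported, it can help to check a text file's encoding locally before uploading it. The helpers below are a minimal sketch using only the Python standard library. The function names are hypothetical, and the source encoding passed to to_utf8 is something you need to know in advance.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode text from a known source encoding into UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


legacy = "café".encode("latin-1")
if not is_utf8(legacy):
    legacy = to_utf8(legacy, "latin-1")
print(is_utf8(legacy))
```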

Plain text should only be used if it is the original document format, or if conversion to plain text preserves most of the document formatting (e.g. paragraph breaks and alignment of titles). Where possible, use Zuva’s eOCR format (see below) rather than plain text, in order to retain the positional information of the characters from the original document.

eOCR Content

If your documents are already in an OCR format such as .hocr, you will need to convert them to the eOCR (Zuva OCR) format for upload. Example code to convert hOCR to eOCR is available on GitHub. The converted file should then be uploaded to the POST /files endpoint with content-type application/eocr.
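The two explicit content types mentioned so far (text/plain and application/eocr) could be selected by a small helper like the one below. This is only a sketch: the function name is hypothetical, and the .eocr file extension is an assumption made for the example, not a documented naming convention. For any other file the helper returns None, meaning no explicit type is chosen and format detection is left to Zuva.

```python
from pathlib import Path
from typing import Optional

# Content types named in this section, keyed by file extension.
# Assumption: converted eOCR files are saved with a ".eocr" extension.
EXPLICIT_CONTENT_TYPES = {
    ".txt": "text/plain",
    ".eocr": "application/eocr",
}


def explicit_content_type(filename: str) -> Optional[str]:
    """Return the content type to send for `filename`, or None to rely on detection."""
    return EXPLICIT_CONTENT_TYPES.get(Path(filename).suffix.lower())


for name in ("lease.eocr", "notes.TXT", "contract.pdf"):
    print(name, explicit_content_type(name))
```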

Supported File Types

You can upload documents for processing in over 60 formats. PDF and Word documents are recognized in both portrait and landscape orientation. The complete list is:

  • Microsoft Word 6.0/95/97/2000/XP (.doc and .dot)
  • Microsoft Word 2003 XML (.xml)
  • Microsoft Word 2007 XML (.docx, .docm, .dotx, .dotm)
  • Rich Text Format (.rtf)
  • Text CSV (.csv and .txt)
  • Unsecured Portable Document Format (.pdf)
  • Microsoft Excel 97/2000/XP (.xls, .xlw, and .xlt)
  • Microsoft Excel 4.x–5.0/95 (.xls, .xlw, and .xlt)
  • Microsoft Excel 2003 XML (.xml)
  • Microsoft Excel 2007 XML (.xlsx, .xlsm, .xltx, .xltm, .xlsb)
  • Microsoft PowerPoint 97/2000/XP (.ppt, .pps, and .pot)
  • Microsoft PowerPoint 2007 (.pptx, .pptm, .potx, .potm)
  • Email files (.msg; HTML tables in .msg files are not supported)
  • Image files (.bmp, .jpg, .jpeg, .png, .pcx, .sgv, .dxf, .met, .pgm, .ras, .svm, .xbm, .emf, .pbm, .plt, .sda, .tga, .xpm, .eps, .pcd, .sdd, .tif, .tiff, .gif, .pct, .ppm, .sgf, .vor)
  • .htm and .html files
  • OpenDocument formats (.odt, .ott, .oth, .odm, .sxw, .stw, .sxg, .ods, .ots, .sxc, .stc, .odf, .sxm)
  • Other text formats (.wpd, .wps, .sdw, .sgl, .vor, .uot, .uof, .jtd, .jtt, .hwp, .602, .pdb, .psw)
  • Other spreadsheet formats (.wk1, .wks, .123, .dif, .sdc, .dbf, .slk, .uos, .uof, .pxl, .wb2)
  • Other presentation and formula formats (.sda, .sdd, .sdp, .uop, .uof, .cgm, .smf, .mml)
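A client can screen files against the list above before uploading them. The sketch below copies the extensions from that list into a set and checks a filename's suffix. The function name is hypothetical. An extension match is only a first filter: it cannot tell, for example, whether a PDF is secured.

```python
from pathlib import Path

# Extensions transcribed from the supported file types list above.
_EXTENSIONS = """
doc dot xml docx docm dotx dotm rtf csv txt pdf
xls xlw xlt xlsx xlsm xltx xltm xlsb
ppt pps pot pptx pptm potx potm msg
bmp jpg jpeg png pcx sgv dxf met pgm ras svm xbm emf pbm plt sda tga xpm
eps pcd sdd tif tiff gif pct ppm sgf vor
htm html
odt ott oth odm sxw stw sxg ods ots sxc stc odf sxm
wpd wps sdw sgl uot uof jtd jtt hwp 602 pdb psw
wk1 wks 123 dif sdc dbf slk uos pxl wb2
sdp uop cgm smf mml
"""

SUPPORTED_EXTENSIONS = {"." + ext for ext in _EXTENSIONS.split()}


def has_supported_extension(filename: str) -> bool:
    """Return True if the filename's extension appears in the supported list."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


for name in ("Contract.PDF", "notes.602", "archive.zip"):
    print(name, has_supported_extension(name))
```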