DocumentExtraction

class indico.queries.documents.DocumentExtraction(files, json_config=None, upload_batch_size=None, ocr_engine='OMNIPAGE')

Extract raw text from PDF or TIF files.

DocumentExtraction performs Optical Character Recognition (OCR) on PDF or TIF files to extract raw text for model training and prediction.

Parameters
  • files (List[str]) – Pathnames of one or more files to OCR.

  • json_config (dict or JSON str) – Configuration settings for OCR. See Notes below.

  • upload_batch_size (int) – Number of documents per upload batch when uploading many documents.

  • ocr_engine (str) – The OCR engine to use. Defaults to 'OMNIPAGE'.
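
To illustrate what a batch size means for a list of pathnames, the sketch below splits the list into consecutive groups. It shows batching in general only: the helper name chunk_files is hypothetical and is not part of the Indico client, whose internal upload logic may differ.

```python
from typing import Iterator, List


def chunk_files(files: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `files`, each holding at most `batch_size` items.

    Hypothetical helper for illustration only; not part of the Indico client.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


# Three files with a batch size of 2 give one full batch and one partial batch.
batches = list(chunk_files(["a.pdf", "b.pdf", "c.tif"], batch_size=2))
```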

Returns

Job object

Raises:

Notes

DocumentExtraction is highly configurable. Five preset configurations are provided:

simple - Provides a simple and fast response for native PDFs (3-5x faster). Will NOT work with scanned PDFs.

legacy - Provided to mimic the behavior of Indico’s older pdf_extraction function. Use this if your model was trained with data from the older pdf_extraction.

detailed - Provides detailed bounding box information on tokens and characters. Returns data in a nested format at the document level with all metadata included.

ondocument - Provides detailed information at the page-level in an unnested format.

standard - Provides page text and block text/position in a nested format.
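
A preset is selected through the "preset_config" key of json_config, as in the Example below. The following is a minimal sketch of building that JSON string while guarding against misspelled preset names. The helper preset_json_config and the PRESETS tuple are hypothetical names introduced here for illustration; they are not part of the Indico client.

```python
import json

# Preset names as listed in the Notes above.
PRESETS = ("simple", "legacy", "detailed", "ondocument", "standard")


def preset_json_config(preset: str) -> str:
    """Return a JSON string selecting one of the documented presets.

    Hypothetical helper for illustration only; not part of the Indico client.
    """
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")
    return json.dumps({"preset_config": preset})


config = preset_json_config("legacy")
```

The resulting string can be passed as the json_config argument. Since json_config also accepts a dict, passing {"preset_config": "legacy"} directly should be equivalent.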

For more information, please reference the Indico knowledgebase article on OCR: https://docs.indicodata.ai/articles/documentation-publication/ocr

Example

Call DocumentExtraction and wait for the result:

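from indico.queries import DocumentExtraction, JobStatus, RetrieveStorageObject
# Assumes `client` is an existing IndicoClient instance and `src_path` is the path of the file to OCR.
job = client.call(DocumentExtraction(files=[src_path], json_config='{"preset_config": "legacy"}'))
# Wait for the first returned job to finish, then retrieve its stored result.
job = client.call(JobStatus(id=job[0].id, wait=True))
extracted_data = client.call(RetrieveStorageObject(job.result))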