DocumentExtraction
- class indico.queries.documents.DocumentExtraction(files, json_config=None, upload_batch_size=None, ocr_engine='OMNIPAGE')
Extract raw text from PDF or TIF files.
DocumentExtraction performs Optical Character Recognition (OCR) on PDF or TIF files to extract raw text for model training and prediction.
- Parameters
files (List[str]) – Pathnames of one or more files to OCR
json_config (dict or JSON str) – Configuration settings for OCR. See Notes below.
upload_batch_size (int) – Number of documents per upload batch when uploading many documents
ocr_engine (str) – The OCR engine to use. Defaults to 'OMNIPAGE'.
- Returns
Job object
Raises:
Notes
DocumentExtraction is highly configurable. Five preset configurations are provided:
simple - Provides a simple and fast response for native PDFs (3-5x faster). Will NOT work with scanned PDFs.
legacy - Provided to mimic the behavior of Indico’s older pdf_extraction function. Use this if your model was trained with data from the older pdf_extraction.
detailed - Provides detailed bounding box information on tokens and characters. Returns data in a nested format at the document level with all metadata included.
ondocument - Provides detailed information at the page-level in an unnested format.
standard - Provides page text and block text/position in a nested format.
For more information, please reference the Indico knowledgebase article on OCR: https://docs.indicodata.ai/articles/documentation-publication/ocr
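As a minimal sketch of how a preset might be selected, the helper below builds the json_config string for one of the presets listed above and rejects unknown names early. The helper name build_json_config and the PRESETS tuple are hypothetical, introduced here for illustration only; they are not part of the Indico client. The "preset_config" key is taken from the example in this entry.

```python
import json

# Preset names copied from the Notes above.
PRESETS = ("simple", "legacy", "detailed", "ondocument", "standard")


def build_json_config(preset):
    """Return a JSON string that selects one of the documented presets.

    Hypothetical helper, not part of the Indico client library.
    """
    if preset not in PRESETS:
        raise ValueError(f"unknown preset {preset!r}; expected one of {PRESETS}")
    return json.dumps({"preset_config": preset})


config = build_json_config("standard")
print(config)  # {"preset_config": "standard"}
```

The resulting string can be passed as the json_config argument, for example DocumentExtraction(files=[src_path], json_config=config). Because json_config is documented as accepting either a dict or a JSON string, passing the dict {"preset_config": "standard"} directly should be equivalent.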
Example
Call DocumentExtraction and wait for the result:
job = client.call(DocumentExtraction(files=[src_path], json_config='{"preset_config": "legacy"}'))
job = client.call(JobStatus(id=job[0].id, wait=True))
extracted_data = client.call(RetrieveStorageObject(job.result))
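The same pattern can be extended to several files. The sketch below is one possible way to do this, under two assumptions: that client is an already-configured IndicoClient, and that DocumentExtraction returns one job per submitted file (the example above indexes job[0], which suggests a list). The function name extract_documents is hypothetical. JobStatus and RetrieveStorageObject are assumed to be importable from indico.queries alongside DocumentExtraction.

```python
def extract_documents(client, paths, preset="standard"):
    """Submit files for OCR and return one extraction result per file.

    Hypothetical helper. Assumes `client` is a configured IndicoClient and
    that DocumentExtraction yields one job per input file.
    """
    # Imported lazily so this sketch can be loaded without the indico package.
    from indico.queries import DocumentExtraction, JobStatus, RetrieveStorageObject

    jobs = client.call(
        DocumentExtraction(files=list(paths), json_config={"preset_config": preset})
    )

    results = []
    for job in jobs:
        # Block until this file's OCR job has finished.
        finished = client.call(JobStatus(id=job.id, wait=True))
        # Fetch the stored extraction output referenced by the finished job.
        results.append(client.call(RetrieveStorageObject(finished.result)))
    return results
```

A call such as extract_documents(client, ["a.pdf", "b.pdf"], preset="legacy") would then return a list with one result per file. This sketch does not check whether a job failed before retrieving its result; production code would likely want to inspect the job's status first.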