ExtractThinker

What is the project about?

ExtractThinker is a document intelligence tool that uses Large Language Models (LLMs) to extract and classify structured data from various document formats. It works like an Object-Relational Mapper (ORM) for documents: document content is mapped onto typed models, which simplifies document processing workflows.

What problem does it solve?

ExtractThinker addresses the difficulty of extracting structured data from unstructured or semi-structured documents efficiently and accurately. It reduces the complexity of Intelligent Document Processing (IDP) by combining LLMs with specialized components for document loading, splitting, and extraction.

What are the features of the project?

Flexible Document Loaders: Supports various document loaders like Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, and PyPDF.
Customizable Contracts: Uses Pydantic models to define precise data extraction contracts.
Advanced Classification: Classifies documents or sections using custom classifications.
Asynchronous Processing: Efficiently handles large documents with asynchronous processing.
Multi-format Support: Works with PDFs, images, spreadsheets, and other formats.
ORM-style Interaction: Provides an intuitive, ORM-like development experience.
Splitting Strategies: Offers lazy or eager splitting strategies for document processing.
Integration with LLMs: Integrates with LLM providers like OpenAI, Anthropic, Cohere, and local models.
Community-driven Development: Inspired by LangChain, but focused specifically on intelligent document processing.
Batch Processing: Supports batch processing of documents.
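
The difference between lazy and eager splitting can be illustrated with a minimal sketch. It assumes that "eager" means every section is computed up front and "lazy" means sections are produced one at a time on demand. The function names and the fixed-size grouping are hypothetical and are not ExtractThinker's API; a real splitter would decide boundaries from document content.

```python
from typing import Iterator, List


def split_eager(pages: List[str], size: int) -> List[List[str]]:
    """Eager: build every group of pages up front and return them all."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def split_lazy(pages: List[str], size: int) -> Iterator[List[str]]:
    """Lazy: yield one group at a time, only when the caller asks for it."""
    for i in range(0, len(pages), size):
        yield pages[i:i + size]


# Hypothetical five-page document.
pages = [f"page {n}" for n in range(1, 6)]

eager_groups = split_eager(pages, size=2)  # all groups are in memory now
lazy_groups = split_lazy(pages, size=2)    # nothing has been computed yet
first_group = next(lazy_groups)            # computes only the first group
```

Under these assumptions, the eager form is simpler when the whole document fits comfortably in memory, while the lazy form lets processing start before the rest of a large document has been split.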

What are the technologies used in the project?

Python: The primary programming language.
Large Language Models (LLMs): Integration with OpenAI, Anthropic, Cohere, Azure OpenAI, and local models (Ollama compatible).
OCR Tools: Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI.
Pydantic: For defining data extraction contracts.
Document Loaders: PyPDF and other custom loaders.
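
To make the idea of a data extraction contract concrete, here is a rough sketch that uses only the standard library. It uses a dataclass as a stand-in for a Pydantic model, with the type checks that Pydantic would perform declaratively written by hand. The InvoiceContract class, its fields, and parse_invoice are hypothetical and do not come from ExtractThinker.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class InvoiceContract:
    """Hypothetical contract: the fields an extraction step must return."""
    invoice_number: str
    invoice_date: date

    def __post_init__(self) -> None:
        # A Pydantic model would enforce these types declaratively;
        # this stand-in checks them by hand.
        if not isinstance(self.invoice_number, str) or not self.invoice_number.strip():
            raise ValueError("invoice_number must be a non-empty string")
        if not isinstance(self.invoice_date, date):
            raise ValueError("invoice_date must be a datetime.date")


def parse_invoice(raw: dict) -> InvoiceContract:
    """Turn a raw dict (for example, parsed LLM output) into a validated contract."""
    return InvoiceContract(
        invoice_number=str(raw["invoice_number"]),
        # fromisoformat raises ValueError for strings that are not ISO dates
        invoice_date=date.fromisoformat(raw["invoice_date"]),
    )


invoice = parse_invoice({"invoice_number": "INV-001", "invoice_date": "2024-01-15"})
```

The point of the contract is that downstream code receives either a fully typed object or an error, never a partially filled dictionary.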

What are the benefits of the project?

High Accuracy: Improves data extraction and classification accuracy using LLMs.
Ease of Use: Simplifies development with intuitive APIs and ORM-style interactions.
Specialized Components: Provides tailored tools for document loading, splitting, and extraction.
Community Support: Benefits from active development and community contributions.
Flexibility: Supports multiple document formats and LLM providers.

What are the use cases of the project?

Invoice Processing: Extracting fields such as invoice numbers and dates from invoices.
Document Classification: Classifying documents into categories like invoices, driver's licenses, or receipts.
Data Extraction from Forms: Extracting information from various forms using OCR and LLMs.
Batch Processing of Documents: Handling large volumes of documents for data extraction.
Integration with Local LLMs: Using custom or local LLMs for document processing.
Splitting and Processing Large Documents: Efficiently processing large documents by splitting them into smaller parts.
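
The batch processing use case can be sketched with asyncio from the standard library. The sketch assumes that each document is handled by a slow, I/O-bound call and that a cap on concurrent calls is wanted. The extract_fields function is a hypothetical placeholder for an OCR and LLM step, and the file names are made up; none of this is ExtractThinker's own API.

```python
import asyncio
from typing import Dict, List


async def extract_fields(doc_name: str) -> Dict[str, str]:
    """Hypothetical stand-in for a slow OCR + LLM extraction call."""
    await asyncio.sleep(0.01)  # simulates network or OCR latency
    return {"source": doc_name, "status": "extracted"}


async def process_batch(doc_names: List[str], max_concurrent: int = 3) -> List[Dict[str, str]]:
    """Process all documents concurrently, at most max_concurrent at a time."""
    limiter = asyncio.Semaphore(max_concurrent)

    async def bounded(name: str) -> Dict[str, str]:
        # The semaphore blocks here once max_concurrent calls are in flight.
        async with limiter:
            return await extract_fields(name)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(name) for name in doc_names))


batch = ["invoice_1.pdf", "invoice_2.pdf", "receipt_1.png"]
results = asyncio.run(process_batch(batch))
```

Bounding concurrency with a semaphore is one common way to stay within provider rate limits while still overlapping the waiting time of individual calls.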