ExtractThinker

What is the project about?

ExtractThinker is a document intelligence tool that uses Large Language Models (LLMs) to extract and classify structured data from a range of document formats. It works like an Object-Relational Mapper (ORM) for documents: you describe the data you want as a model, and the tool maps document content onto it, which simplifies document processing workflows.

What problem does it solve?

ExtractThinker addresses the problem of extracting structured data efficiently and accurately from unstructured or semi-structured documents. It reduces the complexity of Intelligent Document Processing (IDP) by combining specialized components with LLMs.

What are the features of the project?

  • Flexible Document Loaders: Supports document loaders such as Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, and PyPDF.
  • Customizable Contracts: Uses Pydantic models to define precise data extraction contracts.
  • Advanced Classification: Classifies documents or sections using custom classifications.
  • Asynchronous Processing: Efficiently handles large documents with asynchronous processing.
  • Multi-format Support: Works with PDFs, images, spreadsheets, and other formats.
  • ORM-style Interaction: Provides an intuitive, ORM-like development experience.
  • Splitting Strategies: Offers lazy or eager splitting strategies for document processing.
  • Integration with LLMs: Integrates with LLM providers like OpenAI, Anthropic, Cohere, and local models.
  • Community-driven Development: Inspired by LangChain, but focused specifically on intelligent document processing.
  • Batch Processing: Supports batch processing of documents.
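
To make the "contract" idea above concrete, here is a minimal sketch of a typed schema that raw model output is validated against. It is an illustration only: the project itself uses Pydantic models, while this sketch swaps in standard-library dataclasses so that it runs without extra dependencies. The names InvoiceContract, parse_contract, and fake_llm_output are hypothetical and not part of ExtractThinker's API, and no LLM is called.

```python
import json
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class InvoiceContract:
    """Hypothetical contract: the fields we want pulled out of an invoice."""
    invoice_number: str
    invoice_date: date
    total_amount: float


def parse_contract(raw_json: str) -> InvoiceContract:
    """Validate a JSON string (e.g. a model response) against the contract."""
    data = json.loads(raw_json)
    expected = {f.name for f in fields(InvoiceContract)}
    missing = expected - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # Coerce each value to the type the contract declares.
    return InvoiceContract(
        invoice_number=str(data["invoice_number"]),
        invoice_date=date.fromisoformat(data["invoice_date"]),
        total_amount=float(data["total_amount"]),
    )


# Stand-in for a model response (assumption: the model returns JSON text).
fake_llm_output = (
    '{"invoice_number": "INV-001", '
    '"invoice_date": "2024-03-15", '
    '"total_amount": "1250.50"}'
)
invoice = parse_contract(fake_llm_output)
```

With Pydantic, the class would typically subclass BaseModel and the type coercion and missing-field checks would come from the library rather than being written by hand.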

What are the technologies used in the project?

  • Python: The primary programming language.
  • Large Language Models (LLMs): Integration with OpenAI, Anthropic, Cohere, Azure OpenAI, and local models (Ollama compatible).
  • OCR Tools: Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI.
  • Pydantic: For defining data extraction contracts.
  • Document Loaders: PyPDF and other custom loaders.

What are the benefits of the project?

  • High Accuracy: Uses LLMs to improve the accuracy of data extraction and classification.
  • Ease of Use: Simplifies development with intuitive APIs and ORM-style interactions.
  • Specialized Components: Provides tailored tools for document loading, splitting, and extraction.
  • Community Support: Benefits from active development and community contributions.
  • Flexibility: Supports multiple document formats and LLM providers.

What are the use cases of the project?

  • Invoice Processing: Extracting data like invoice numbers and dates from invoice documents.
  • Document Classification: Classifying documents into categories like invoices, driver's licenses, or receipts.
  • Data Extraction from Forms: Extracting information from various forms using OCR and LLMs.
  • Batch Processing of Documents: Handling large volumes of documents for data extraction.
  • Integration with Local LLMs: Using custom or local LLMs for document processing.
  • Splitting and Processing Large Documents: Efficiently processing large documents by splitting them into smaller parts.
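
The splitting and batch use cases above can be sketched generically with asyncio: a large document is cut into page chunks, and the chunks are processed concurrently under a concurrency cap. This is a sketch under stated assumptions, not ExtractThinker's implementation. The functions split_pages, extract_chunk, and process_document are hypothetical, and the "extraction" is a placeholder that only counts words where a real pipeline would call an OCR service or an LLM.

```python
import asyncio
from typing import Iterator, List


def split_pages(pages: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most chunk_size pages."""
    for start in range(0, len(pages), chunk_size):
        yield pages[start:start + chunk_size]


async def extract_chunk(chunk: List[str], limit: asyncio.Semaphore) -> int:
    """Placeholder for an OCR/LLM call; here it only counts words."""
    async with limit:
        await asyncio.sleep(0)  # stands in for network latency
        return sum(len(page.split()) for page in chunk)


async def process_document(
    pages: List[str], chunk_size: int = 2, max_concurrency: int = 3
) -> List[int]:
    """Process all chunks concurrently, at most max_concurrency at a time."""
    limit = asyncio.Semaphore(max_concurrency)
    tasks = [
        extract_chunk(chunk, limit)
        for chunk in split_pages(pages, chunk_size)
    ]
    # gather returns results in the same order as the input chunks.
    return await asyncio.gather(*tasks)


pages = [
    "invoice page one",
    "invoice page two",
    "receipt",
    "license front",
    "license back",
]
results = asyncio.run(process_document(pages))
```

The semaphore is the design choice worth noting: hosted OCR and LLM services usually enforce rate limits, so capping the number of in-flight requests tends to matter more than raw parallelism.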