ColiVara

What is the project about?

ColiVara is a state-of-the-art document retrieval system designed to enhance Retrieval Augmented Generation (RAG) for Large Language Models (LLMs). It aims to provide a delightful developer experience while retrieving information from documents accurately and efficiently. It goes beyond traditional text-based retrieval by incorporating visual cues from documents.

What problem does it solve?

Traditional RAG systems often struggle with visually rich documents because they rely primarily on text extraction and neglect visual elements such as tables, figures, and page layouts. ColiVara addresses this limitation by using vision models to generate embeddings, enabling retrieval based on both textual and visual content. This improves the quality of information fed to LLMs, especially for documents where visual context is crucial. It also outperforms traditional text-based systems even when the documents themselves are text-based.

What are the features of the project?

  • State-of-the-Art Retrieval: Provides high-quality and low-latency document retrieval, outperforming existing systems.
  • Wide Format Support: Handles over 100 file formats, including PDF, DOCX, PPTX, and more.
  • Filtering: Allows filtering of search results based on document and collection metadata (e.g., author, year, tags).
  • Convention over Configuration: Offers opinionated and optimized defaults for ease of use.
  • Modern pgvector Features: Uses half-precision vectors (the halfvec type) in Postgres with pgvector for faster search and reduced storage.
  • REST API: Provides a RESTful API with Swagger documentation for easy integration.
  • Comprehensive CRUD: Supports full Create, Read, Update, and Delete operations for documents, collections, and users.
  • SDKs: Offers Python and TypeScript SDKs for convenient interaction with the API.
  • Optional Vector Database Usage: Allows users to generate embeddings and use their own vector database if desired.
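
As a rough illustration of the filtering feature listed above, the sketch below narrows a list of scored search results by metadata. The Result class, its field names, the sample documents, and the filter_results helper are all hypothetical and are not ColiVara's API; the sketch only shows the idea of matching results against metadata key/value pairs.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical search hit: a document name, a score, and free-form metadata."""
    document: str
    score: float
    metadata: dict = field(default_factory=dict)


def filter_results(results, **criteria):
    """Keep only results whose metadata matches every key/value pair in criteria."""
    return [
        r for r in results
        if all(r.metadata.get(key) == value for key, value in criteria.items())
    ]


# Made-up sample data, for illustration only.
results = [
    Result("annual_report.pdf", 0.91, {"author": "Kim", "year": 2023}),
    Result("slides.pptx", 0.87, {"author": "Lee", "year": 2024}),
    Result("memo.docx", 0.80, {"author": "Kim", "year": 2024}),
]

kim_2024 = filter_results(results, author="Kim", year=2024)
print([r.document for r in kim_2024])  # prints ['memo.docx']
```

Calling filter_results with no criteria returns every result unchanged, which mirrors an unfiltered search.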

What are the technologies used in the project?

  • Vision Language Models: Used to generate embeddings that capture both textual and visual information.
  • Python: Used for the backend API and Python SDK.
  • TypeScript: Used for the TypeScript SDK.
  • Postgres DB with pgvector: Used for storing embeddings and document/collection metadata.
  • REST API: The core communication interface.
  • Docker: Used for local development and deployment.
  • AWS S3 (or compatible): Used for file storage.
  • Serverless GPU (for ColiVarE): The embedding service is optimized for serverless GPU workloads.
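
To make the storage side of half-precision vectors concrete, here is a minimal sketch using only the Python standard library. It does not use Postgres or pgvector; it packs one vector as 32-bit and as 16-bit floats to compare sizes, then checks how closely the half-precision copy still points in the same direction. The dimension of 128 and the random values are assumptions chosen purely for illustration.

```python
import math
import random
import struct

DIM = 128  # assumed embedding width, for illustration only

random.seed(0)
vector = [random.uniform(-1.0, 1.0) for _ in range(DIM)]

# Pack the same values as single precision (4 bytes each)
# and as half precision (2 bytes each).
full = struct.pack(f"<{DIM}f", *vector)
half = struct.pack(f"<{DIM}e", *vector)

# Read the half-precision copy back to see how much direction was lost.
restored = struct.unpack(f"<{DIM}e", half)


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(len(full), len(half))  # prints 512 256
print(cosine(vector, restored) > 0.999)
```

Halving the bytes per value is where the storage saving comes from; the rounding error introduced per value is small, which is why the similarity between the original and restored vectors stays close to 1 in this sketch.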

What are the benefits of the project?

  • Improved RAG Quality: Provides more accurate and relevant information to LLMs, leading to better output.
  • Handles Visually Rich Documents: Effectively retrieves information from documents where visual context is important.
  • Easy to Use: Simple API and SDKs make integration straightforward.
  • Efficient: Optimized for speed and storage efficiency.
  • Flexible: Can be used with or without a separate vector database.
  • No Chunking or OCR: Avoids the limitations and potential errors of traditional text extraction methods.

What are the use cases of the project?

  • Enhancing LLM Applications: Improving the performance of any application that uses LLMs and RAG, such as chatbots, question-answering systems, and content generation tools.
  • Document Search and Retrieval: Providing a powerful search engine for large document collections, especially those with visual elements.
  • Knowledge Management: Organizing and retrieving information from internal company documents, research papers, or other proprietary knowledge bases.
  • Research: Facilitating research that involves analyzing and retrieving information from large datasets of documents.
  • General Document Retrieval: Any application that needs to retrieve information from a corpus of documents.