LOTUS: A Query Engine For Processing Data with LLMs

What is the project about?

LOTUS is a query engine for processing data with Large Language Models (LLMs). It pairs a declarative programming model with an optimized query engine for building and running reasoning-intensive query pipelines over both structured and unstructured data. In effect, it lets LLMs perform database-like operations on tables that include free text, not just numbers and categories.

What problem does it solve?

Traditional data processing tools (like SQL databases or Pandas DataFrames) are primarily designed for structured data. Analyzing unstructured data (like text) often requires custom-built pipelines that combine multiple tools with manual coding. LOTUS lets users express data transformations in natural language and relies on LLMs to process structured and unstructured data within a single framework, bridging structured data processing and the reasoning capabilities of LLMs.

What are the features of the project?

Semantic Operators: LOTUS provides a set of "semantic operators" that mirror familiar relational-style operators (join, filter, map, aggregate) but are executed by LLMs. Each operator is specified with a natural language expression ("langex") rather than a hand-written predicate or function.
Pandas-like API: The API is designed to be familiar to users of the popular Pandas library, making it easy to learn and use.
Declarative Programming Model: Users specify what they want to achieve (using natural language), and LOTUS handles how to achieve it using LLMs.
Support for Structured and Unstructured Data: LOTUS can handle tables containing both traditional structured data (numbers, categories) and unstructured data (text).
Composable Operators: Semantic operators can be chained together to create complex data processing pipelines.
Optimized Query Engine: LOTUS includes an engine that optimizes the execution of these LLM-powered queries.
Support for Multiple LLMs and Retrieval Models: LOTUS is built on top of LiteLLM and SentenceTransformers, so it works with a range of LLM providers and serving backends (OpenAI, Ollama, vLLM, etc.) and retrieval models.
Reranker Support: Includes support for reranking models to improve the quality of results.
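
To make the operator idea concrete, here is a minimal, dependency-free sketch of a filter and a map that are each driven by a natural language expression and then chained. It is not LOTUS's API: the function names (filter_with_langex, map_with_langex), the {column} placeholder handling, and the toy_judge / toy_generate stand-ins for LLM calls are all assumptions made for illustration, and rows are plain dicts rather than a DataFrame.

```python
from string import Formatter
from typing import Callable, Dict, List

Row = Dict[str, str]


def langex_columns(langex: str) -> List[str]:
    """Return the column names referenced as {column} placeholders."""
    return [name for _, name, _, _ in Formatter().parse(langex) if name]


def render(langex: str, row: Row) -> str:
    """Fill the placeholders with one row's values to build the model prompt."""
    return langex.format(**{col: row[col] for col in langex_columns(langex)})


def filter_with_langex(rows: List[Row], langex: str,
                       judge: Callable[[str], bool]) -> List[Row]:
    """Keep the rows for which the (stand-in) model answers True."""
    return [row for row in rows if judge(render(langex, row))]


def map_with_langex(rows: List[Row], langex: str,
                    generate: Callable[[str], str], out_col: str) -> List[Row]:
    """Return new rows with one extra column holding the model output."""
    return [{**row, out_col: generate(render(langex, row))} for row in rows]


# Hypothetical stand-ins for LLM calls; a real system would query a model here.
def toy_judge(prompt: str) -> bool:
    return "refund" in prompt.lower()


def toy_generate(prompt: str) -> str:
    return f"[model output for: {prompt}]"


rows = [
    {"id": "1", "review": "Please issue a refund, the item arrived broken."},
    {"id": "2", "review": "Works great, very happy."},
    {"id": "3", "review": "I want a REFUND for the late delivery."},
]

# Chain two operators: filter first, then map over the surviving rows.
refund_rows = filter_with_langex(rows, "{review} asks for money back", toy_judge)
labeled = map_with_langex(
    refund_rows, "Summarize {review} in five words", toy_generate, "summary"
)
```

The caller only states what should hold for each row; how the prompt is built and how the model is called stays inside the operator, which is the declarative split described in the list above.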

What are the technologies used in the project?

Python: The primary programming language.
Large Language Models (LLMs): The core technology powering the semantic operators. Models are accessed through LiteLLM (e.g., OpenAI models, or models served via Ollama or vLLM).
LiteLLM: A library providing a unified interface to various LLM APIs.
SentenceTransformers: A library for generating sentence embeddings, used for retrieval and reranking.
Pandas: The API is inspired by and interoperable with Pandas DataFrames.
Faiss: A library for efficient similarity search, used for some semantic operators (optional, especially on Mac).
Conda: Used for environment management and dependency installation.
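
As a rough picture of what embedding-based retrieval involves, the sketch below ranks documents by cosine similarity to a query vector. It assumes small hand-written vectors in place of real sentence embeddings and uses brute-force search in place of a Faiss index; the names (cosine, top_k, toy_index) are hypothetical and nothing here is taken from LOTUS's code.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: List[float], index: Dict[str, List[float]],
          k: int) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k best matches."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_index = {
    "doc_billing": [0.9, 0.1, 0.0],
    "doc_shipping": [0.1, 0.9, 0.1],
    "doc_returns": [0.7, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, toy_index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Brute-force scoring touches every vector on each query, which is acceptable for a handful of rows; a dedicated similarity-search library becomes useful once the collection grows.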

What are the benefits of the project?

Simplified AI-Powered Data Processing: Makes it much easier to incorporate LLM reasoning into data analysis workflows.
Increased Productivity: Reduces the need for complex, custom-built pipelines.
Intuitive Interface: The Pandas-like API and natural language expressions make it easy to learn and use.
Flexibility: Supports a wide range of LLMs and retrieval models.
Extensibility: Designed to be extended with new semantic operators and optimization techniques.
Faster Development: Enables rapid prototyping and development of AI-powered data applications.

What are the use cases of the project?

Semantic Data Joining: Joining datasets based on the meaning of text fields rather than exact matches (the scenario used in the project's quickstart example).
Text Data Filtering: Filtering data based on complex natural language criteria.
Information Extraction: Extracting specific information from unstructured text data.
Data Summarization: Generating summaries of large text datasets.
Semantic Search: Searching through text data based on meaning, not just keywords.
Data Cleaning and Deduplication: Identifying and removing duplicate or similar entries based on semantic similarity.
Building AI-Powered Data Applications: Any application that requires reasoning over structured and unstructured data. Examples include:
- Analyzing customer feedback.
- Building knowledge graphs from text documents.
- Creating intelligent chatbots that can query data.
- Performing complex data analysis tasks that require understanding the meaning of text.
- Classifying documents based on content.
- Generating insights from a combination of structured and unstructured data sources.
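
The semantic join use case listed above can be sketched as a nested loop that asks a model whether a natural language statement holds for each pair of rows. This is an illustrative sketch only: join_with_langex, the column names, and the lookup-table judge standing in for an LLM are assumptions, not LOTUS's implementation.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def join_with_langex(left: List[Row], right: List[Row], langex: str,
                     judge: Callable[[str], bool]) -> List[Row]:
    """Naive semantic join: keep each (left, right) pair the judge accepts.

    Assumes the two sides use distinct column names.
    """
    matches = []
    for l_row in left:
        for r_row in right:
            prompt = langex.format(**l_row, **r_row)
            if judge(prompt):
                matches.append({**l_row, **r_row})
    return matches


# Hypothetical stand-in for an LLM: a fixed set of statements treated as true.
KNOWN_TRUE = {
    "Taking Databases will help me learn SQL",
    "Taking Linear Algebra will help me learn matrix math",
}
calls = 0


def toy_judge(prompt: str) -> bool:
    """Count how often the 'model' is called and answer from the lookup set."""
    global calls
    calls += 1
    return prompt in KNOWN_TRUE


courses = [{"course": "Linear Algebra"}, {"course": "Databases"}]
skills = [{"skill": "SQL"}, {"skill": "matrix math"}]

pairs = join_with_langex(
    courses, skills, "Taking {course} will help me learn {skill}", toy_judge
)
```

Note that this naive version issues one model call per pair, so its cost grows with len(left) * len(right). That quadratic cost is the kind of overhead a query optimizer for LLM-backed operators would aim to reduce, though this sketch makes no claim about how LOTUS does so.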