WilmerAI Project Description
What is the project about?
WilmerAI is a middleware system that processes user prompts before sending them to Large Language Model (LLM) APIs. Acting as an intermediary between a user interface (such as SillyTavern or OpenWebUI) and various LLM backends, it enhances and manages interactions with LLMs. The name stands for "What If Language Models Expertly Routed All Inference?".
What problem does it solve?
WilmerAI addresses several challenges in working with LLMs:
- Complex Prompt Routing: It categorizes incoming prompts and routes them to appropriate workflows, allowing for specialized handling of different types of requests (e.g., coding, factual questions, conversation).
- Large Context Management: It can process large conversation histories (200,000+ tokens) and generate concise summaries, enabling smaller, more efficient prompts for LLMs. This is crucial for maintaining context in long conversations, even with models that have limited context windows.
- Multi-LLM Orchestration: It allows users to leverage multiple LLMs simultaneously, potentially combining the strengths of different models for different tasks. This includes using different models for different stages of a workflow (e.g., one model for initial response, another for code review).
- Lack of "Memory" in LLMs: It simulates a "memory" for LLMs by generating and updating chat summaries, providing context from past interactions.
- Parallel Processing: It can distribute tasks across multiple computers/LLMs, improving performance and response times, especially for computationally intensive operations like memory generation.
- Extensibility: It allows for custom workflows and integration with external tools (like the Offline Wikipedia API) and custom Python modules.
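The routing idea described above can be sketched in a few lines of Python. This is a deliberately naive, keyword-based stand-in for WilmerAI's LLM-driven categorizer; the category names, keywords, and workflow names are illustrative assumptions, not the project's actual configuration.

```python
# Hypothetical sketch of WilmerAI-style prompt routing: a categorizer
# assigns each incoming prompt to a category, and each category maps
# to a named workflow. All names here are illustrative placeholders.

ROUTES = {
    "coding": "coding-workflow",
    "factual": "wikipedia-workflow",
    "conversation": "general-workflow",  # default category
}

def categorize(prompt: str) -> str:
    """Naive keyword categorizer standing in for an LLM-based router."""
    text = prompt.lower()
    if any(k in text for k in ("def ", "function", "bug", "compile")):
        return "coding"
    if any(k in text for k in ("who is", "what is", "when did")):
        return "factual"
    return "conversation"

def route(prompt: str) -> str:
    """Return the name of the workflow that should handle this prompt."""
    return ROUTES[categorize(prompt)]
```

In WilmerAI itself the categorization step is typically performed by an LLM and the category-to-workflow mapping lives in JSON routing configuration, but the control flow is the same: classify first, then dispatch to a specialized workflow.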
What are the features of the project?
- Multi-LLM Assistants: Create assistants powered by multiple LLMs working in tandem, with customizable workflows for each category of prompt.
- Offline Wikipedia Integration: Integrates with the OfflineWikipediaTextApi to provide factual context (RAG) from Wikipedia articles.
- Chat Summary Generation: Continuously generates and updates summaries of conversations to simulate a "memory" for the LLM.
- Parallel Processing: Distributes tasks (like memory generation) across multiple computers/LLMs.
- Multi-LLM Group Chats: Facilitates group chats in SillyTavern where each character can be a different LLM.
- Middleware Functionality: Sits between the user interface and LLM APIs, handling multiple backends.
- Customizable Presets: Allows users to define and customize LLM presets via JSON files.
- API Endpoints: Provides OpenAI-compatible endpoints (chat/Completions, v1/Completions) and Ollama-compatible endpoints.
- Prompt Templates: Supports prompt templates for v1/Completions endpoints.
- Vision Multi-Modal Support: Experimental support for image processing via Ollama.
- Mid-Workflow Conditional Routing: Allows for dynamic workflow branching based on conditions within a workflow.
- Workflow Locks: Prevents race conditions in asynchronous operations, enabling continuous interaction while background tasks (like memory generation) are in progress.
- Python Module Caller Node: Extends functionality by allowing calls to custom Python modules.
- Custom File Node: Allows loading and using custom text files within workflows.
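The "chat summary as memory" feature above can be illustrated with a minimal, self-contained sketch: recent turns are kept verbatim, older turns are periodically folded into a running summary, and the combined summary-plus-recent-turns is what would be sent to the LLM. The class name, batching threshold, and the `_summarize` stub are all assumptions for illustration; in WilmerAI the summarization step is an LLM call.

```python
# Minimal sketch of a rolling "chat summary" memory: older turns are
# folded into a summary so the prompt sent to the LLM stays small.
# _summarize() is a placeholder for an LLM summarization call; the
# names and thresholds are illustrative, not WilmerAI's real code.

from collections import deque

class ChatMemory:
    def __init__(self, keep_recent: int = 4):
        self.summary = ""                       # condensed older history
        self.recent = deque(maxlen=keep_recent) # verbatim recent turns
        self.overflow = []                      # turns awaiting summarization

    def add(self, turn: str) -> None:
        if len(self.recent) == self.recent.maxlen:
            self.overflow.append(self.recent[0])  # about to be evicted
        self.recent.append(turn)
        if len(self.overflow) >= 2:  # batch old turns into the summary
            self.summary = self._summarize(self.summary, self.overflow)
            self.overflow.clear()

    def _summarize(self, summary: str, turns: list) -> str:
        # Stand-in for an LLM call: just concatenates truncated turns.
        return (summary + " " + " ".join(t[:20] for t in turns)).strip()

    def prompt_context(self) -> str:
        """Summary plus recent turns: what would be sent to the LLM."""
        return "\n".join(filter(None, [self.summary, *self.recent]))
```

The design point is that `prompt_context()` stays bounded in size regardless of conversation length, which is how a model with a small context window can still "remember" a 200,000+ token history.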
What are the technologies used in the project?
- Python: The core programming language.
- Flask: Used for creating the API endpoints.
- Requests: Used for making HTTP requests to LLM APIs.
- scikit-learn, urllib3, jinja2: Supporting Python libraries for additional functionality.
- JSON: Used extensively for configuration files (endpoints, users, workflows, presets, routing).
- LLM APIs: Supports various LLM APIs, including OpenAI compatible, Ollama, and KoboldCpp.
- OfflineWikipediaTextApi: (Optional) External API for accessing Wikipedia content.
- SQLite: Used for managing workflow locks.
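To show how SQLite can serve as a lock store, here is a hedged sketch of a workflow lock built on the standard-library sqlite3 module. The table and column names are assumptions, not WilmerAI's actual schema; the mechanism shown (a PRIMARY KEY row per workflow, where a duplicate INSERT fails while the lock is held) is one common way to get mutual exclusion out of SQLite.

```python
# Illustrative workflow lock backed by SQLite (schema is hypothetical).
# A row in workflow_locks marks a workflow as busy; the PRIMARY KEY
# constraint makes a second acquisition attempt fail until release.

import sqlite3
import time

def open_lock_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflow_locks ("
        "workflow_id TEXT PRIMARY KEY, acquired_at REAL)"
    )
    return conn

def try_acquire(conn: sqlite3.Connection, workflow_id: str) -> bool:
    """Return True if the lock was free and is now held by the caller."""
    try:
        conn.execute(
            "INSERT INTO workflow_locks VALUES (?, ?)",
            (workflow_id, time.time()),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:  # row already exists: lock is held
        return False

def release(conn: sqlite3.Connection, workflow_id: str) -> None:
    conn.execute(
        "DELETE FROM workflow_locks WHERE workflow_id = ?", (workflow_id,)
    )
    conn.commit()
```

This is the shape of lock that lets a background task (such as memory generation) hold a workflow busy while foreground interaction continues, without two writers racing on the same state.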
What are the benefits of the project?
- Improved LLM Interaction: Provides more control and flexibility in interacting with LLMs.
- Enhanced Context Management: Enables longer, more coherent conversations with LLMs.
- Efficient Use of Resources: Optimizes LLM usage by routing prompts appropriately and managing context size.
- Increased Performance: Leverages parallel processing to speed up tasks.
- Extensibility: Allows for customization and integration with other tools.
- Multi-Model Capabilities: Enables the use of multiple LLMs, potentially combining their strengths.
What are the use cases of the project?
- Advanced Chatbots: Creating chatbots with improved memory and context handling.
- Intelligent Assistants: Building assistants that can handle a variety of tasks, routing requests to specialized workflows.
- Code Generation and Review: Developing workflows that combine multiple LLMs for code generation, review, and refinement.
- Factual Question Answering: Integrating with knowledge sources (like Wikipedia) to provide accurate answers.
- Roleplaying and Creative Writing: Facilitating complex roleplaying scenarios with multiple LLM-powered characters.
- Research and Development: Experimenting with different LLM configurations and workflows.
- Long-form Content Creation: Maintaining context and coherence in extended writing projects.
