
Project Description: Self-Operating Computer Framework

What is the project about?

The Self-Operating Computer Framework enables multimodal AI models to operate a computer the way a human does: the model views the screen and decides on a series of mouse and keyboard actions to reach a given objective.
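In practice this amounts to a simple loop: capture a screenshot, send it to the model together with the objective, and execute whatever action the model returns. The sketch below is a minimal illustration of that loop, not the project's actual code; ask_model is a hypothetical stand-in for whichever multimodal model is in use, and pyautogui is assumed for the mouse and keyboard side.

    import pyautogui

    def ask_model(objective: str, screenshot_path: str) -> dict:
        """Hypothetical multimodal-model call: returns one action, e.g.
        {"operation": "click", "x": 0.42, "y": 0.17} or {"operation": "done"}."""
        raise NotImplementedError

    def run(objective: str, max_steps: int = 10) -> None:
        width, height = pyautogui.size()
        for _ in range(max_steps):
            pyautogui.screenshot("screenshot.png")           # observe the screen
            action = ask_model(objective, "screenshot.png")  # ask the model for the next step
            op = action.get("operation")
            if op == "click":  # coordinates given as fractions of the screen size
                pyautogui.click(int(action["x"] * width), int(action["y"] * height))
            elif op == "write":
                pyautogui.write(action["content"])
            elif op == "press":
                pyautogui.hotkey(*action["keys"])
            elif op == "done":  # the model reports the objective as reached
                break

The actual framework is more elaborate, adding model-specific prompting and the OCR and SoM variants described below, but this observe-decide-act cycle is the core idea.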

What problem does it solve?

This project bridges the gap between AI models and direct computer operation. Instead of relying on application-specific APIs or structured data, the model interacts with the computer's graphical user interface (GUI) directly, which makes it possible to automate tasks across arbitrary applications and websites: in principle, the AI can operate any software a human can.

What are the features of the project?

  • Multimodal Model Compatibility: Works with several multimodal models, including GPT-4o, o1, Gemini Pro Vision, Claude 3, and LLaVa.
  • Human-like Interaction: The AI uses the same inputs (screen view) and outputs (mouse and keyboard actions) as a human.
  • Voice Input: Supports voice commands for specifying the objective (requires additional setup).
  • Optical Character Recognition (OCR): Integrates OCR so the model can identify and target clickable elements by their visible text (a minimal sketch of this idea follows the list).
  • Set-of-Mark (SoM) Prompting: Supports SoM prompting, a visual prompting technique, to enhance visual grounding.
  • Extensibility: Designed to be extended with additional models and features.
  • Cross-Platform: Compatible with macOS, Windows, and Linux.
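The OCR feature mentioned above can be pictured as mapping visible on-screen text to coordinates, so the model can ask for "click the Submit button" instead of guessing raw pixel positions. The snippet below is a rough sketch of that idea, assuming EasyOCR and pyautogui are available; it is not the framework's own implementation.

    import easyocr
    import pyautogui

    def click_text(target: str, screenshot_path: str = "screenshot.png") -> bool:
        """Take a screenshot, OCR it, and click the center of the first match."""
        pyautogui.screenshot(screenshot_path)
        reader = easyocr.Reader(["en"])
        for bbox, text, confidence in reader.readtext(screenshot_path):
            if target.lower() in text.lower():
                # bbox is four corner points; click its center (on high-DPI displays
                # the screenshot pixels may need scaling to screen coordinates)
                xs = [point[0] for point in bbox]
                ys = [point[1] for point in bbox]
                pyautogui.click(sum(xs) / len(xs), sum(ys) / len(ys))
                return True
        return False

Set-of-Mark prompting takes a complementary approach: candidate elements (for example, buttons found by an object detector) are drawn onto the screenshot with numeric labels, and the model simply names the label it wants to act on.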

What are the technologies used in the project?

  • Multimodal AI Models: GPT-4o, o1, Gemini Pro Vision, Claude 3, LLaVa.
  • Programming Language: Python (the framework is distributed as a pip-installable package).
  • OCR: Used for text and element recognition.
  • YOLOv8: (For SoM prompting) A model for object detection, specifically trained for button detection in this project.
  • Ollama: Used for running LLaVa locally.
  • PortAudio: (For voice input) A cross-platform audio I/O library.
  • OpenAI, Google AI Studio, and Anthropic APIs: Used to access the respective hosted models (a minimal call sketch follows this list).
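To illustrate how a screenshot reaches one of these hosted models, the hedged sketch below sends a base64-encoded screenshot and the current objective to GPT-4o through the OpenAI Python SDK. The prompt text is an assumption made for the example, not the framework's actual prompt.

    import base64
    from openai import OpenAI

    def next_action(objective: str, screenshot_path: str) -> str:
        """Send the objective plus a screenshot to GPT-4o and return the reply text."""
        with open(screenshot_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": f"Objective: {objective}. What is the next mouse or keyboard action?"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

Gemini, Claude, and a locally hosted LLaVa (via Ollama) would follow the same pattern, each with its own client library and message format.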

What are the benefits of the project?

  • Automation of Complex Tasks: Enables automation of tasks that require visual understanding and interaction with GUIs.
  • General-Purpose Interaction: Can potentially interact with any software or website a human can use.
  • Research Platform: Provides a framework for exploring and developing advanced AI agents capable of operating computers.
  • Accessibility: Could potentially be used to improve computer accessibility for users with disabilities.

What are the use cases of the project?

  • Automated Web Browsing: Performing tasks like booking flights, ordering products, or gathering information from websites.
  • Software Testing: Automating UI testing for applications.
  • Robotic Process Automation (RPA): Automating repetitive tasks across different software applications.
  • Data Entry and Processing: Automating data entry into forms or spreadsheets.
  • AI Assistant: Creating a general-purpose AI assistant that can perform a wide range of computer-based tasks.
  • Accessibility Tools: Assisting users with disabilities in operating computers.
[Screenshot: self-operating-computer]