ShowUI Project Description
What is the project about?
ShowUI is an open-source, end-to-end, lightweight vision-language-action (VLA) model designed specifically for creating GUI (Graphical User Interface) agents. It's a system that can understand and interact with computer interfaces visually, just like a human user would.
What problem does it solve?
ShowUI addresses the challenge of automating interactions with graphical user interfaces. Traditional approaches often rely on brittle, hard-coded rules or on access to an application's underlying code. As a VLA model, ShowUI aims to offer a more robust and flexible alternative: it learns to "see" and "understand" the GUI and performs actions based on visual and textual cues alone, automating tasks that would otherwise require manual human interaction with a computer's visual interface. It also moves beyond simple navigation to general computer control.
What are the features of the project?
- End-to-end VLA model: Combines vision, language, and action capabilities in a single model.
- Lightweight: Designed for efficiency, making it potentially deployable in resource-constrained environments.
- GUI Agent Focus: Specifically tailored for interacting with graphical user interfaces.
- Open-source: Allows for community contributions, modifications, and extensions.
- Grounding and Navigation Training: Supports training on datasets like Mind2Web, AITW, and Miniwob.
- Customizable Model Support: Works with ShowUI and Qwen2VL models.
- Efficient Training: Includes features like DeepSpeed, BF16, QLoRA, SDPA/FlashAttention2, and Liger-Kernel for optimized training.
- Multi-Dataset Training: Can train on multiple datasets simultaneously.
- Interleaved Data Streaming: Streams samples from multiple datasets in an interleaved fashion for efficient data loading during training.
- Image Resizing: Supports random image resizing (cropping and padding).
- Training Monitoring: Integrates with Wandb for monitoring training progress.
- Multi-GPU/Node Training: Supports distributed training across multiple GPUs and nodes.
- UI-Guided Token Selection: A method that exploits the structured layout of screenshots to skip redundant visual tokens, reducing training and inference cost.
- API Calling: Supports programmatic calls to the model via the Gradio Client (a sketch follows this list).
- vLLM Inference: Supports fast inference with vLLM.
- Iterative Refinement: Refines predictions over multiple rounds to improve grounding accuracy.
- GPT-4o Annotation Recaptioning: Provides scripts that use GPT-4o to recaption and improve dataset annotations.
- Computer Use Integration: Integrates with Computer Use OOTB for direct computer control.
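
Below is a minimal sketch of the API-calling feature using the Gradio Client. The Space name, the two inputs, and the endpoint name are assumptions for illustration; consult the project's demo or README for the actual interface.

```python
# Hedged sketch: calling a hosted ShowUI demo through the Gradio Client.
# The Space id "showlab/ShowUI", the inputs, and api_name are assumed,
# not confirmed by the project documentation.
from gradio_client import Client, handle_file

client = Client("showlab/ShowUI")  # hypothetical Gradio Space or server URL
result = client.predict(
    handle_file("screenshot.png"),                # GUI screenshot (assumed input)
    "Click the search box and type 'weather'",    # natural-language instruction (assumed input)
    api_name="/predict",                          # default Gradio endpoint; may differ
)
print(result)  # e.g., a predicted action or click coordinate
```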
What are the technologies used in the project?
- Deep Learning Framework: PyTorch (implied by the use of DeepSpeed, FlashAttention2, and Hugging Face Transformers).
- Hugging Face Transformers: Used for model implementation and access to pre-trained checkpoints (a loading sketch follows this list).
- Gradio: For creating interactive demos and API interfaces.
- vLLM: For fast inference.
- DeepSpeed, QLoRA, FlashAttention2, Liger-Kernel: Techniques for efficient training of large models.
- Wandb: For experiment tracking and visualization.
- GPT-4o: Used for data annotation improvements.
- Python: The primary programming language.
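
As a rough illustration of the Hugging Face Transformers usage mentioned above, the sketch below loads a ShowUI-style checkpoint with the Qwen2VL classes and asks it to ground an instruction on a screenshot. The checkpoint id, prompt wording, and image path are assumptions for illustration, not the project's documented workflow.

```python
# Hedged sketch: loading a ShowUI-style checkpoint with Hugging Face Transformers.
# The model id "showlab/ShowUI-2B", the prompt, and "screenshot.png" are assumptions.
import torch
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "showlab/ShowUI-2B"  # assumed Hugging Face model id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("screenshot.png")  # a GUI screenshot to ground against
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Where is the search box? Answer with a click point."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g., a click point predicted by the model
```

In the actual repository, inference may instead go through vLLM or the Gradio demo; this only illustrates the Transformers path.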
What are the benefits of the project?
- Automation: Automates tasks that require interaction with GUIs, saving time and effort.
- Flexibility: Can adapt to different GUIs and tasks without requiring code changes to the underlying applications.
- Robustness: Less prone to breaking due to minor UI changes compared to rule-based systems.
- Accessibility: Could potentially be used to make software more accessible to users with disabilities.
- Research Platform: Provides a valuable tool for research in VLA models and GUI automation.
- Open Source: Fosters collaboration and innovation.
What are the use cases of the project?
- Automated Software Testing: Automating UI testing procedures.
- Robotic Process Automation (RPA): Automating repetitive tasks across different applications.
- Personal Assistant: Creating intelligent assistants that can perform tasks on a user's computer.
- Web Navigation and Interaction: Automating web browsing and form filling.
- Data Extraction: Extracting information from GUIs.
- Accessibility Tools: Developing tools to help users with disabilities interact with computers.
- General Computer Control: Performing a wide range of tasks on a computer through visual interaction.
- API Calling: Interacting with applications through their APIs based on visual context.
