Skyvern-AI/skyvern | Public Repos

What is the project about?

Skyvern automates browser-based workflows using Large Language Models (LLMs) and computer vision. It aims to replace traditional, brittle automation methods that rely on DOM parsing and hard-coded XPaths.

What problem does it solve?

Skyvern addresses the limitations of traditional browser automation, which is fragile and tends to break when a website's layout changes. It removes the need to write a custom script for each website, making automation more robust and adaptable. It also handles scenarios that require reasoning, such as inferring information that is not stated explicitly or dealing with product variations.
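
To make the fragility concrete, here is a minimal sketch using only Python's standard library. The two page snippets, the selector, and the helper names are hypothetical. The text-based lookup is only a stand-in for identifying an element by what it means rather than where it sits; it is not how Skyvern itself locates elements.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page. The redesign (V2) wraps the
# form in an extra <div> and adds a <span> before the button.
PAGE_V1 = "<html><body><form><button>Download invoice</button></form></body></html>"
PAGE_V2 = (
    "<html><body><div><form><span/>"
    "<button>Download invoice</button></form></div></body></html>"
)

# Position-based selector written against PAGE_V1 (assumed for illustration).
HARDCODED_PATH = "./body/form/button"


def find_by_path(page: str):
    """Look the button up by its fixed position in the tree."""
    return ET.fromstring(page).find(HARDCODED_PATH)


def find_by_text(page: str, label: str):
    """Look an element up by its visible text, wherever it sits in the tree."""
    for element in ET.fromstring(page).iter():
        if (element.text or "").strip() == label:
            return element
    return None


if __name__ == "__main__":
    print(find_by_path(PAGE_V1) is not None)   # selector matches the old layout
    print(find_by_path(PAGE_V2) is not None)   # selector misses after the redesign
    print(find_by_text(PAGE_V2, "Download invoice").tag)
```

The hard-coded path stops matching as soon as the markup is rearranged, while the lookup keyed on the button's visible label keeps working on both versions.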

What are the features of the project?

Zero-shot automation: Operates on websites it has never seen before.
Resilience to layout changes: Doesn't rely on pre-determined XPaths.
Cross-website workflow application: Applies a single workflow to multiple websites.
LLM-powered reasoning: Handles complex situations and infers information.
Multi-agent architecture: Uses a swarm of specialized agents (Interactable Element, Navigation, Data Extraction, Password, 2FA, Dynamic Auto-complete).
Task and Workflow support: Tasks cover single actions; Workflows chain multiple tasks together.
Livestreaming: Allows users to see Skyvern's actions in real-time.
Form filling: Automatically fills out forms based on provided information.
Data extraction: Extracts structured data from websites based on a user-defined schema.
File downloading: Downloads files and uploads them to block storage.
Authentication (Beta): Supports password manager integrations (Bitwarden) and various 2FA methods.
Skyvern Cloud: A managed cloud version for running Skyvern at scale, with anti-bot detection, proxy network, and CAPTCHA solving.
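
As an illustration of the data extraction feature above, a user-defined schema could be written in JSON Schema style and used to check an extracted record. This is a sketch under assumptions: the schema format, the field names, and the `schema_errors` helper are hypothetical and not taken from Skyvern's API.

```python
from typing import Any

# Hypothetical user-defined schema (JSON Schema style) for invoice data.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
        "paid": {"type": "boolean"},
    },
    "required": ["invoice_number", "total"],
}

# Mapping from the schema's type names to Python types.
_PY_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def schema_errors(record: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of problems found in `record`; an empty list means it conforms."""
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            continue
        value = record[name]
        type_ok = isinstance(value, _PY_TYPES[spec["type"]])
        # bool is a subclass of int in Python, so reject it explicitly for numbers.
        if spec["type"] == "number" and isinstance(value, bool):
            type_ok = False
        if not type_ok:
            errors.append(f"{name}: expected {spec['type']}")
    return errors


if __name__ == "__main__":
    print(schema_errors({"invoice_number": "INV-1", "total": 12.5}, INVOICE_SCHEMA))
    print(schema_errors({"total": "12.5"}, INVOICE_SCHEMA))
```

Checking extracted records against the schema before they are stored is one way to catch a misread page early, whatever tool produced the record.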

What are the technologies used in the project?

LLMs: OpenAI (GPT-4 Turbo, GPT-4o, GPT-4o-mini), Anthropic (Claude 3 family, Claude 3.5 Sonnet), Azure OpenAI, AWS Bedrock, Gemini (coming soon), Llama 3.2 (coming soon), Novita AI.
Computer Vision: Used for parsing website elements in real-time.
Browser Automation Libraries: Playwright.
Task-Driven Autonomous Agent Design: Inspired by BabyAGI and AutoGPT.
Docker Compose: For easy setup and deployment.
Python 3.11: Primary programming language.
PostgreSQL: Database.
Frontend: React-based UI.

What are the benefits of the project?

Robustness: Less prone to breakage due to website changes.
Scalability: Can automate workflows across many websites.
Flexibility: Handles complex scenarios and variations.
Efficiency: Reduces the need for manual scripting and maintenance.
Ease of Use: Simple API and UI for creating and managing tasks/workflows.
Cost Savings (Potential): Optimized context and prompt caching could lower LLM costs (future feature).

What are the use cases of the project?

Invoice downloading: Automating invoice retrieval from various websites.
Job application automation: Filling out job applications.
Materials procurement: Automating the process of finding and ordering materials.
Government website interaction: Registering accounts, filling out forms.
Contact form filling: Automating contact form submissions.
Insurance quote retrieval: Getting quotes from multiple insurance providers.
Competitor analysis: Extracting data from competitor websites.
E-commerce automation: Purchasing products from online stores.
Any repetitive browser-based task: General-purpose automation for tasks that involve interacting with websites.
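
Several of these use cases follow the same pattern: one workflow definition applied to many websites. The sketch below shows that pattern with plain dataclasses. The `Task` and `Workflow` shapes, the `quote_workflow` helper, and the provider URLs are all hypothetical illustrations, not Skyvern's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """One unit of work: a start URL plus a natural-language goal (assumed shape)."""
    url: str
    goal: str


@dataclass
class Workflow:
    """An ordered chain of tasks, meant to run one after another."""
    name: str
    tasks: list[Task] = field(default_factory=list)


def quote_workflow(site: str, details: str) -> Workflow:
    """Build the same two-step workflow for any provider site."""
    return Workflow(
        name=f"quote:{site}",
        tasks=[
            Task(url=site, goal=f"Fill out the quote form using: {details}"),
            Task(url=site, goal="Extract the quoted monthly price"),
        ],
    )


# Placeholder provider sites; only the URL differs between workflows.
PROVIDERS = ["https://insurer-a.example", "https://insurer-b.example"]
workflows = [quote_workflow(site, "driver age 30, sedan") for site in PROVIDERS]

if __name__ == "__main__":
    for workflow in workflows:
        print(workflow.name, [task.goal for task in workflow.tasks])
```

Because the goals are stated in natural language rather than as site-specific selectors, the workflow definition stays identical across providers and only the starting URL changes.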