
WebWalker: Benchmarking LLMs in Web Traversal

What is the project about?

WebWalker is a project focused on evaluating and improving the ability of Large Language Models (LLMs) to navigate and extract information from the web. It introduces a new benchmark, WebWalkerQA, and a multi-agent framework, WebWalker, designed to handle the complexities of web traversal.

What problem does it solve?

The project addresses the challenge of assessing and enhancing LLMs' capabilities in realistic web-based information-seeking tasks. Existing benchmarks often don't fully capture the intricacies of navigating real-world websites, which involve multiple hops, diverse content, and long contexts. WebWalkerQA provides a more challenging and realistic environment for evaluating LLMs. The WebWalker framework addresses the specific problem of managing long contexts during web navigation.

What are the features of the project?

  • WebWalkerQA Dataset: A benchmark of 680 queries drawn from four real-world scenarios and spanning 1,373 webpages. Each query is annotated with its answer, source URLs, navigation path, and difficulty level. A "silver" set of roughly 14k additional QA pairs is also provided.
  • WebWalker Framework: A multi-agent framework designed to improve LLM performance on web traversal tasks, particularly those requiring long context management.
  • Online Demo: Interactive demos available on ModelScope and Hugging Face allow users to test web traversal capabilities.
  • Leaderboard: A public leaderboard on Hugging Face tracks the performance of different models and methods on the WebWalkerQA dataset.
  • Evaluation Script: A script that uses GPT-4 as a judge to score answer accuracy.
  • RAG-System Support: Includes code for running and evaluating Retrieval-Augmented Generation (RAG) systems on the WebWalkerQA dataset.
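To make the dataset and evaluation items above concrete, here is a sketch of what a single WebWalkerQA-style record and a scoring pass might look like. The field names, sample values, and the exact-match scorer are illustrative assumptions, not the dataset's actual schema (the project's own evaluation uses GPT-4 as a judge rather than string matching):

```python
# Hypothetical WebWalkerQA-style record; field names and values are
# illustrative, not the dataset's exact schema.
example = {
    "question": "When is the main conference session held?",
    "answer": "July 28-30",
    "root_url": "https://example-conference.org",
    "navigation_path": [
        "https://example-conference.org",
        "https://example-conference.org/program",
    ],
    "difficulty": "multi_source",
}

def exact_match(prediction: str, gold: str) -> bool:
    """Naive string comparison; the project's script uses GPT-4 as judge instead."""
    return prediction.strip().lower() == gold.strip().lower()

# Score a batch of model predictions against gold answers.
predictions = {example["question"]: "July 28-30"}
correct = sum(exact_match(pred, example["answer"]) for pred in predictions.values())
accuracy = correct / len(predictions)
```

A real evaluation would replace `exact_match` with an LLM-judged comparison, since web answers often differ from the gold string in phrasing while being semantically correct.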

What are the technologies used in the project?

  • Large Language Models (LLMs): The project focuses on benchmarking and improving LLMs. It is built to work with Qwen-Agent and requires an API key for OpenAI or DashScope (Qwen).
  • Python: The primary programming language.
  • Hugging Face Datasets & Spaces: Used for hosting the dataset, leaderboard, and online demo.
  • ModelScope: Used for hosting an online demo.
  • Streamlit: Used for creating the local web application demo.
  • ReAct, Qwen-Agent, LangChain: Acknowledged as foundational frameworks/libraries.
  • ai4crawl: Used to crawl webpages and convert them to a Markdown-like format.
  • conda: Used for environment management.
  • GPT-4: Used in the evaluation script.
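Given the stack above (conda for environments, Streamlit for the local demo, an OpenAI or DashScope key for the LLM), a local setup might look like the following. The environment name, requirements file, entry-point script, and key placeholder are all assumptions, not the repo's documented commands:

```shell
# Hypothetical setup; environment name, file names, and entry point are assumptions.
conda create -n webwalker python=3.10 -y
conda activate webwalker
pip install -r requirements.txt

# API key for the backing LLM (OpenAI, or DashScope for Qwen models)
export OPENAI_API_KEY="sk-..."

# Launch the local Streamlit demo
streamlit run app.py
```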

What are the benefits of the project?

  • Improved LLM Evaluation: Provides a more realistic and challenging benchmark for evaluating LLMs' web navigation abilities.
  • Advancement in Web-Based Agents: The WebWalker framework offers a potential solution for improving LLM performance in complex web tasks.
  • Open-Source Resources: The dataset, code, and demos are publicly available, fostering research and development in this area.
  • Community Engagement: The leaderboard encourages community participation and comparison of different approaches.

What are the use cases of the project?

  • Benchmarking LLMs: Researchers can use WebWalkerQA to evaluate the web traversal capabilities of their LLMs.
  • Developing Web Agents: Developers can use the WebWalker framework and dataset to build and train more effective web agents.
  • Improving Information Retrieval: The project contributes to advancements in information retrieval from the web, particularly in complex, multi-hop scenarios.
  • Question Answering Systems: The dataset and framework can be used to enhance question-answering systems that rely on web-based information.
  • Research on Long Context Handling: WebWalker provides a testbed for exploring techniques for managing long contexts in LLMs.
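The long-context idea behind the framework — an agent that explores pages while a second stage distills only query-relevant evidence into a running memory, rather than concatenating every visited page — can be sketched as below. The toy site graph, function names, and keyword-based relevance check are hypothetical stand-ins, not the project's actual API:

```python
# Hedged sketch of a two-stage traversal loop: an "explorer" picks links to
# follow, a "critic" keeps only query-relevant snippets so the working
# context stays short. All names and the toy site graph are hypothetical.

SITE = {
    "/": {"text": "Welcome. See /program for the schedule.", "links": ["/program"]},
    "/program": {"text": "Main session: July 28-30.", "links": []},
}

def explore(page: str) -> list[str]:
    """Explorer stage: propose which links to follow next (stubbed: all of them)."""
    return SITE[page]["links"]

def critique(query: str, text: str, memory: list[str]) -> None:
    """Critic stage: keep text only if it looks relevant (stubbed keyword check)."""
    if any(word in text.lower() for word in query.lower().split()):
        memory.append(text)

def walk(query: str, start: str = "/", max_steps: int = 5) -> list[str]:
    """Breadth-first traversal that returns a short evidence list,
    not the full concatenation of every visited page."""
    memory: list[str] = []
    frontier, seen = [start], set()
    for _ in range(max_steps):
        if not frontier:
            break
        page = frontier.pop(0)
        if page in seen:
            continue
        seen.add(page)
        critique(query, SITE[page]["text"], memory)
        frontier.extend(explore(page))
    return memory

evidence = walk("main session dates")
```

In a real agent, `explore` and `critique` would each be LLM calls; the point of the split is that the answering model only ever sees the distilled `memory`, which is what keeps multi-hop traversal within a bounded context window.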
[Image: WebWalker screenshot]