
What is the project about?

The project, GPT Crawler, is a tool designed to crawl a website or a set of URLs and generate knowledge files. These files can then be used to create custom GPTs (Generative Pre-trained Transformers) or assistants within the OpenAI ecosystem.

What problem does it solve?

It simplifies the process of creating custom GPTs or assistants with specific knowledge bases. Instead of manually compiling data, users can point the crawler at a website, and it will automatically extract the relevant content into a knowledge file ready for upload to OpenAI.

What are the features of the project?

  • Crawls a website from a starting URL, following links that match configurable URL patterns (see the config sketch after this list).
  • Extracts text content from specified CSS selectors.
  • Limits the number of pages to crawl.
  • Outputs the crawled data to a JSON file.
  • Configurable to exclude specific file types.
  • Can limit output by file size or token count.
  • Can be run locally, in a Docker container, or as an API server.
  • Sitemap support.
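
Most of these features correspond to options in the crawler's configuration file. The sketch below is illustrative only: the field names (url, match, selector, maxPagesToCrawl, outputFileName, resourceExclusions, maxFileSize, maxTokens) are assumed from the feature list above rather than taken from the project's documentation, so the README should be consulted for the exact schema.

    // config.ts -- illustrative sketch; field names are assumed, not confirmed against the repo
    import { Config } from "./src/config";

    export const defaultConfig: Config = {
      // Starting URL and the glob-style pattern for links to follow
      url: "https://www.builder.io/c/docs/developers",
      match: "https://www.builder.io/c/docs/**",

      // CSS selector whose inner text is extracted from each page
      selector: ".docs-builder-container",

      // Stop after this many pages
      maxPagesToCrawl: 50,

      // The crawled data is written here as JSON
      outputFileName: "output.json",

      // Skip resource/file types the crawler does not need
      resourceExclusions: ["png", "jpg", "jpeg", "gif", "svg", "pdf", "zip"],

      // Optional caps on output file size (MB) and token count
      maxFileSize: 1,
      maxTokens: 2000000,
    };

With a configuration like this in place, running the crawler locally (e.g. with npm start, per the project's README) produces the JSON knowledge file named in outputFileName.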

What are the technologies used in the project?

  • Node.js
  • TypeScript
  • Docker (optional)
  • Express.js (for the API server; sketched after this list)
  • Swagger (for API documentation)
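
Express and Swagger come into play in the optional API deployment mode, where the crawler is exposed over HTTP instead of being run as a one-off script. The snippet below is a hypothetical illustration of that architecture, not the project's actual route or handler names; crawlAndCollect is a stand-in for the crawler's real entry point.

    // Hypothetical sketch of an Express endpoint wrapping the crawler.
    import express from "express";

    // Stand-in for the project's real crawl entry point.
    async function crawlAndCollect(config: unknown): Promise<unknown[]> {
      // In the real project this would run the crawler with the given config
      // and return the extracted pages; it is stubbed out here.
      return [];
    }

    const app = express();
    app.use(express.json());

    // Accept a crawl config in the request body and return the crawled pages as JSON.
    app.post("/crawl", async (req, res) => {
      try {
        const pages = await crawlAndCollect(req.body);
        res.json(pages);
      } catch (err) {
        res.status(500).json({ error: String(err) });
      }
    });

    app.listen(3000, () => console.log("Crawler API listening on port 3000"));

Swagger then documents endpoints like this one so the API can be explored and tested from a browser.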

What are the benefits of the project?

  • Automates the creation of knowledge bases for custom GPTs.
  • Saves time and effort compared to manual data compilation.
  • Flexible configuration options.
  • Multiple deployment methods.

What are the use cases of the project?

  • Creating a custom GPT that answers questions about a specific product or service (like the Builder.io example).
  • Building a knowledge base for an OpenAI assistant to be integrated into a product (see the upload sketch after this list).
  • Generating training data for custom AI models.
  • Extracting information from websites for research or analysis.
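
For the custom GPT and assistant use cases, the generated JSON file is uploaded to OpenAI as a knowledge file. Below is a minimal sketch using the official openai Node SDK; the output.json file name is assumed, and attaching the uploaded file to a specific GPT or assistant happens afterwards through the Assistants API or the OpenAI dashboard.

    // Upload the crawler's output so it can serve as an assistant's knowledge file.
    // Requires the "openai" npm package and OPENAI_API_KEY in the environment.
    import fs from "node:fs";
    import OpenAI from "openai";

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    async function uploadKnowledgeFile(): Promise<string> {
      // "output.json" is the crawler's output file (name assumed here).
      const file = await openai.files.create({
        file: fs.createReadStream("output.json"),
        purpose: "assistants",
      });
      console.log(`Uploaded knowledge file: ${file.id}`);
      return file.id; // attach this id to an assistant via the Assistants API or dashboard
    }

    uploadKnowledgeFile().catch(console.error);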