MediaCrawler Project Description

What is the project about?

MediaCrawler is a web scraping project focused on collecting public data from various social media platforms.

What problem does it solve?

The project simplifies gathering public data from social media. Instead of reverse-engineering each platform's JavaScript parameter encryption, it keeps a logged-in browser session and evaluates JavaScript expressions to obtain the encrypted parameters. This makes public information easier to access for research and analysis.

What are the features of the project?

Multi-Platform Support: Crawls data from platforms like Xiaohongshu, Douyin, Kuaishou, Bilibili, Weibo, Tieba, and Zhihu.
Data Retrieval Modes:
- Keyword search.
- Specific post ID crawling.
- Creator's profile page crawling.
- Second-level (reply) comment crawling.
Session Management: Maintains login state using Playwright.
IP Proxy Pool: Supports routing requests through a pool of IP proxies.
Comment Word Cloud: Generates word clouds from comments.
Data Storage: Saves data to MySQL, CSV, or JSON.
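The CSV and JSON storage options listed above can be illustrated with a minimal sketch that uses only the Python standard library. The function name `save_records`, its parameters, and the record fields are hypothetical and are not taken from the project's code; MySQL storage is left out.

```python
import csv
import json
from pathlib import Path


def save_records(records, out_dir, basename, fmt="json"):
    """Write a list of flat dicts to <out_dir>/<basename>.<fmt> and return the path.

    Hypothetical helper for illustration; not MediaCrawler's actual storage API.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{basename}.{fmt}"

    if fmt == "json":
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable in the file.
        with path.open("w", encoding="utf-8") as f:
            json.dump(records, f, ensure_ascii=False, indent=2)
    elif fmt == "csv":
        # Use the union of all keys, in first-seen order, as the header row.
        fieldnames = []
        for record in records:
            for key in record:
                if key not in fieldnames:
                    fieldnames.append(key)
        # utf-8-sig writes a BOM so spreadsheet tools detect the encoding.
        with path.open("w", encoding="utf-8-sig", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)  # missing fields are written as empty cells
    else:
        raise ValueError(f"unsupported format: {fmt!r}")
    return path
```

Calling `save_records(rows, "data", "comments", fmt="csv")` would write `data/comments.csv`. Keeping the format behind a single argument is one simple way to let the same crawl output go to either file type.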

What are the technologies used in the project?

Python: The primary programming language.
Playwright: A library to automate browser actions and maintain login sessions.
JavaScript: Used for executing expressions to obtain encrypted parameters.
MySQL: Optional database for storing crawled data.
Node.js (implied requirement): Version 16 or later, needed for Douyin and Zhihu crawling.
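How Playwright can keep a login session and evaluate a JavaScript expression in the page can be sketched roughly as follows. This is an illustration under assumptions, not the project's implementation: the helper names, the profile directory, and the proxy address are hypothetical, and it assumes Playwright and a Chromium build are installed.

```python
def build_context_options(headless=True, proxy_server=None):
    """Build keyword arguments for Playwright's launch_persistent_context.

    Hypothetical helper; proxy_server is an address such as "http://127.0.0.1:8080".
    """
    options = {"headless": headless}
    if proxy_server:
        # Playwright expects proxy settings as a dict with a "server" key.
        options["proxy"] = {"server": proxy_server}
    return options


def evaluate_in_saved_session(url, expression, user_data_dir, **kwargs):
    """Open url in a persistent browser profile and evaluate a JS expression there."""
    # Imported lazily so build_context_options works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # A persistent context stores cookies and local storage in user_data_dir,
        # so a login performed once is still present on later runs.
        context = p.chromium.launch_persistent_context(
            user_data_dir, **build_context_options(**kwargs)
        )
        try:
            page = context.pages[0] if context.pages else context.new_page()
            page.goto(url)
            # The expression runs inside the page, so the site's own scripts
            # are available to it.
            return page.evaluate(expression)
        finally:
            context.close()
```

For example, `evaluate_in_saved_session("https://example.com", "document.title", "./browser_data")` would return the page title as a string. Running expressions in the live page is what avoids re-implementing a site's signing logic in Python.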

What are the benefits of the project?

Simplified Data Collection: Reduces the complexity of scraping social media data.
Learning Resource: Provides a practical example of web scraping techniques.
Open Source: Freely available for use and modification.
Multiple Data Formats: Offers flexibility in how the data is stored.
Clear Disclaimer: States the terms of legal use and the limits of liability.

What are the use cases of the project?

Academic Research: Analyzing trends and public opinion on social media.
Data Analysis: Gathering data for market research or social listening.
Content Aggregation: Collecting content for personal use or study, subject to strong ethical considerations.
Learning Web Scraping: A practical project to learn about web scraping and related technologies.
Comment Word Clouds: Generating word clouds from crawled comments.
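A word cloud is driven by word frequencies, so the counting step behind it can be sketched with the standard library alone. The function below is a hypothetical illustration, not the project's code. It splits on word characters, which works for space-delimited languages; Chinese comments would first need a word segmenter, and drawing the image itself would need a third-party rendering library.

```python
import re
from collections import Counter


def comment_word_frequencies(comments, stopwords=frozenset(), top_n=50):
    """Return the top_n most frequent words across comments as {word: count}.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for text in comments:
        # Lowercase so "Great" and "great" are counted together.
        for token in re.findall(r"\w+", text.lower()):
            # Skip stopwords and single-character tokens.
            if token not in stopwords and len(token) > 1:
                counts[token] += 1
    return dict(counts.most_common(top_n))
```

The resulting dictionary maps each word to its count, which is the usual input for sizing words in a cloud.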