MinerU

What is the project about?

MinerU is an open-source tool that converts PDF documents, particularly scientific literature, into machine-readable formats such as Markdown and JSON. Its goal is to make the content of PDFs directly usable by downstream applications, especially workflows built around large language models (LLMs).

What problem does it solve?

MinerU addresses the challenge of extracting structured information from PDFs, which are laid out for human reading rather than machine processing. In particular, it handles:

Removing irrelevant content (headers, footers, page numbers).
Preserving the logical reading order in complex layouts (multi-column, figures, tables).
Maintaining document structure (headings, paragraphs, lists).
Extracting non-textual elements (images, tables, formulas).
Handling scanned documents and different languages.
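
To make the first item in the list above concrete, here is a minimal sketch of one common heuristic for spotting headers, footers, and page numbers: lines that recur on most pages are treated as boilerplate. This is an illustration only, not MinerU's actual method (MinerU relies on layout detection models); the function names, the `min_fraction` threshold, and the sample pages are all assumptions made for the example.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse digits and whitespace so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def strip_repeated_lines(pages: list[list[str]], min_fraction: float = 0.6) -> list[list[str]]:
    """Drop lines whose normalized form recurs on at least min_fraction of pages.

    Hypothetical helper: a frequency heuristic, not MinerU's model-based approach.
    """
    counts = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({normalize(line) for line in page})
    # Require at least two occurrences so single-page documents are left alone.
    threshold = max(2, round(min_fraction * len(pages)))
    boilerplate = {key for key, n in counts.items() if n >= threshold}
    return [[line for line in page if normalize(line) not in boilerplate] for page in pages]


# Made-up sample: each page has a running header and a page number.
pages = [
    ["Journal of Examples", "Introduction", "PDFs are hard to parse.", "1"],
    ["Journal of Examples", "Methods", "We detect the layout first.", "2"],
    ["Journal of Examples", "Results", "Extraction quality improves.", "3"],
]
cleaned = strip_repeated_lines(pages)
```

A frequency heuristic like this can misfire on short documents or on body lines that repeat legitimately, which is one reason a layout model is the more robust choice.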

What are the features of the project?

Semantic Coherence: Removes headers, footers, footnotes, and page numbers so that the extracted text reads as continuous content.
Reading Order Preservation: Outputs text in a human-readable order, even with complex layouts.
Structure Preservation: Retains headings, paragraphs, and lists.
Multimedia Extraction: Extracts images, image descriptions, tables, table titles, and footnotes.
Formula Recognition: Converts formulas to LaTeX format.
Table Recognition: Converts tables to HTML format.
OCR Capability: Automatically detects and processes scanned PDFs, supporting 84 languages.
Multiple Output Formats: Supports multimodal and NLP Markdown, JSON (sorted by reading order), and rich intermediate formats.
Visualization: Provides layout and span visualizations to check output quality.
Hardware Flexibility: Runs on CPU, GPU (CUDA), NPU (CANN), and MPS (Apple Silicon).
Platform Compatibility: Works on Windows, Linux, and macOS.
Heading Classification: Supports hierarchical classification of headings.
Hybrid OCR: Combines text and OCR modes for improved accuracy in complex scenarios.
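
The features above describe output in which text follows reading order, formulas are LaTeX, and tables are HTML. The sketch below shows how such pieces could be assembled into a single Markdown string. The block schema (`type`, `page`, `order`, `level`, `content`) is hypothetical and chosen for illustration; it is not MinerU's actual intermediate format.

```python
def render_markdown(blocks: list[dict]) -> str:
    """Join blocks into Markdown, ordered by (page, order).

    Assumed schema (hypothetical): each block has 'type', 'page', 'order',
    'content', and headings may carry a 'level'.
    """
    parts = []
    for block in sorted(blocks, key=lambda b: (b["page"], b["order"])):
        kind, content = block["type"], block["content"]
        if kind == "heading":
            parts.append("#" * block.get("level", 1) + " " + content)
        elif kind == "formula":
            # LaTeX goes into a display-math block.
            parts.append("$$\n" + content + "\n$$")
        else:
            # Plain text is emitted as-is; HTML tables pass through unchanged,
            # since Markdown allows inline HTML.
            parts.append(content)
    return "\n\n".join(parts)


# Made-up blocks, deliberately out of order.
blocks = [
    {"type": "formula", "page": 0, "order": 2, "content": r"E = mc^2"},
    {"type": "heading", "page": 0, "order": 0, "level": 2, "content": "Energy"},
    {"type": "table", "page": 1, "order": 0,
     "content": "<table><tr><td>m</td><td>1 kg</td></tr></table>"},
    {"type": "text", "page": 0, "order": 1, "content": "Mass and energy are related by:"},
]
markdown = render_markdown(blocks)
```

Keeping tables as HTML rather than Markdown pipe tables preserves merged cells, which pipe tables cannot express.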

What are the technologies used in the project?

Python: The primary programming language.
PDF-Extract-Kit: A toolkit for PDF content extraction.
DocLayout-YOLO / layoutlmv3: Models for layout detection.
StructEqTable / RapidTable / tablemaster: Models for table recognition.
PaddleOCR: For optical character recognition.
PyMuPDF: For PDF processing (currently used, but planned to be replaced).
layoutreader: For reading order sorting.
fast-langdetect: For language detection.
unimernet: Model for formula parsing.
Docker: For containerized deployment.
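
To show what "reading order sorting" has to accomplish, here is a purely geometric sketch for a two-column page. It is a stand-in heuristic, not the layoutreader model listed above. The `Box` type, the `two_column_order` function, and the coordinate convention (origin at the top-left, y growing downward) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A text block with its bounding box (hypothetical type for this sketch)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def two_column_order(boxes: list[Box], page_width: float) -> list[Box]:
    """Order boxes top to bottom, reading the left column before the right.

    Boxes that cross the page midline (titles, wide figures) act as
    separators: everything above them is emitted first.
    """
    middle = page_width / 2
    ordered, left, right = [], [], []

    def flush() -> None:
        # Emit the pending band: whole left column, then whole right column.
        ordered.extend(left)
        ordered.extend(right)
        left.clear()
        right.clear()

    for box in sorted(boxes, key=lambda b: b.y0):
        if box.x0 < middle < box.x1:
            flush()
            ordered.append(box)
        elif box.x1 <= middle:
            left.append(box)
        else:
            right.append(box)
    flush()
    return ordered


# Made-up layout: a full-width title above two columns, given in shuffled order.
boxes = [
    Box("D", 310, 270, 550, 500),
    Box("A", 50, 100, 290, 300),
    Box("Title", 50, 40, 550, 80),
    Box("C", 310, 100, 550, 250),
    Box("B", 50, 320, 290, 500),
]
ordered = two_column_order(boxes, page_width=600)
```

A rule like this breaks down on three-column pages, sidebars, and figures that straddle the gutter, which is why a learned model is used for reading order in practice.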

What are the benefits of the project?

Open Source: Freely available and modifiable.
Accurate Extraction: Provides high-quality extraction of text and non-textual elements.
Structured Output: Makes PDF data usable for machine learning and other applications.
Multilingual Support: OCR supports 84 languages.
Hardware Acceleration: Can use GPU (CUDA), NPU (CANN), or MPS (Apple Silicon) for faster processing, and also runs on CPU alone.
Easy to Use: Provides a command-line interface and an API.
Extensible: Allows for the development of derived projects.

What are the use cases of the project?

Large Language Model Training: Preparing data for pre-training LLMs.
Scientific Literature Analysis: Extracting information from research papers.
Document Archiving and Indexing: Creating searchable archives of PDF documents.
Data Mining: Extracting specific data points from PDFs.
Accessibility: Converting PDFs to more accessible formats.
Content Repurposing: Reusing PDF content in other applications.
RAG (Retrieval-Augmented Generation): Enhancing LLMs with information from PDFs.
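
For the RAG use case, the Markdown output usually has to be split into retrievable chunks. Below is a minimal sketch that splits on headings and keeps the heading trail as context for each chunk. It uses only the standard library and assumes ATX-style headings (`#`, `##`, ...); the function name and chunk format are hypothetical and not part of MinerU.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading section.

    Each chunk records the trail of headings above it, so a retriever can
    show where the text came from. Hypothetical helper for illustration.
    """
    chunks = []
    path = []  # current heading trail, e.g. ["Paper", "Methods"]
    body = []  # lines collected since the last heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"headings": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then add this one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks


# Made-up document for the example.
doc = (
    "# Paper\n\nAbstract text.\n\n"
    "## Methods\n\nWe parse the layout.\n\n"
    "## Results\n\nTables are converted to HTML."
)
chunks = chunk_by_heading(doc)
```

This sketch does not skip fenced code blocks, so a `#` comment line inside a fence would be mistaken for a heading; a production splitter should track fences or use a Markdown parser.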