
What is the project about?

OCRmyPDF is a command-line tool that adds an Optical Character Recognition (OCR) text layer to scanned PDF files, making them searchable and copy-pasteable.

What problem does it solve?

Scanned PDFs contain only page images, so their text cannot be searched or selected. OCRmyPDF converts these image-only PDFs into text-searchable PDFs while preserving the original image quality and layout as far as possible. It also addresses limitations of other OCR tools, such as misplaced text, poor handling of non-English characters, resolution changes, large file sizes, and lack of PDF/A support.

What are the features of the project?

  • Generates searchable PDF/A files from scanned PDFs.
  • Accurately places OCR text below the image.
  • Preserves the original image resolution.
  • Inserts OCR information losslessly when possible.
  • Optimizes PDF images, often reducing file size.
  • Optionally deskews and cleans images before OCR.
  • Validates input and output files.
  • Uses multi-core processing for speed.
  • Supports over 100 languages via the Tesseract OCR engine.
  • Privacy-focused: processes files locally.
  • Scales to handle large files with thousands of pages.
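
Several of these features correspond to command-line options. The sketch below assembles an `ocrmypdf` invocation as an argument list. The helper name `build_ocrmypdf_command` and its parameters are hypothetical, and the flags it emits (`--output-type`, `--language`, `--deskew`, `--clean`, `--jobs`) are assumed to match the installed version, so check `ocrmypdf --help` before relying on them.

```python
from typing import List, Optional, Sequence


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Sequence[str] = ("eng",),
    deskew: bool = False,
    clean: bool = False,
    jobs: Optional[int] = None,
    output_type: str = "pdfa",
) -> List[str]:
    """Assemble an ocrmypdf argument list (hypothetical helper, not part of OCRmyPDF)."""
    cmd = ["ocrmypdf", "--output-type", output_type]
    # Tesseract language codes are joined with '+', e.g. "eng+deu".
    cmd += ["--language", "+".join(languages)]
    if deskew:
        cmd.append("--deskew")
    if clean:
        cmd.append("--clean")  # assumed to need the optional unpaper tool
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    cmd += [input_pdf, output_pdf]
    return cmd


command = build_ocrmypdf_command(
    "scan.pdf", "scan_ocr.pdf", languages=["eng", "deu"], deskew=True, jobs=4
)
print(" ".join(command))
```

Once OCRmyPDF is installed, the resulting list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a shell string avoids quoting problems with file names that contain spaces.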

What are the technologies used in the project?

  • Python: The core programming language.
  • Tesseract OCR: The OCR engine used for text recognition.
  • Ghostscript: Used to rasterize pages and to produce PDF/A output.
  • jbig2enc (optional): For JBIG2 image encoding.
  • unpaper (optional): For image cleaning and deskewing.
  • pikepdf, qpdf: For PDF manipulation.
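
Because several of these components are external programs, it can help to check which ones are on the `PATH` before running a job. The following is a minimal sketch: the function `find_tools` is hypothetical, and the executable names (`tesseract`, `gs`, `jbig2`, `unpaper`) are assumptions for a typical Unix install that may differ on other platforms.

```python
import shutil
from typing import Dict, Optional

# Executable names are assumptions for a typical Unix install;
# they can differ by platform (for example, Ghostscript on Windows).
REQUIRED_TOOLS = {"Tesseract OCR": "tesseract", "Ghostscript": "gs"}
OPTIONAL_TOOLS = {"jbig2enc": "jbig2", "unpaper": "unpaper"}


def find_tools(tools: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each tool's display name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


for label, tools in (("required", REQUIRED_TOOLS), ("optional", OPTIONAL_TOOLS)):
    for name, path in find_tools(tools).items():
        print(f"{name} ({label}): {path or 'not found'}")
```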

What are the benefits of the project?

  • Makes scanned documents searchable.
  • Enables text selection and copying from scanned documents.
  • Creates archival-quality PDF/A files.
  • Improves accessibility of scanned documents.
  • Often reduces file size compared to the original scanned PDF.
  • Automates the OCR process through a command-line interface.
  • Open source and free to use.

What are the use cases of the project?

  • Digitizing paper documents for archiving and searching.
  • Making scanned lecture notes or books searchable.
  • Converting scanned forms into editable documents.
  • Creating accessible PDFs from scanned materials.
  • Batch processing of scanned documents.
  • Integrating OCR into document management workflows.
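
For the batch-processing use case, one possible approach is to plan the work first and then run OCRmyPDF once per file. The sketch below only computes input and output paths. The function `plan_batch` and the directory layout are hypothetical, not something OCRmyPDF provides.

```python
from pathlib import Path
from typing import Iterator, Tuple


def plan_batch(input_dir: Path, output_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (input, output) pairs for PDFs in input_dir that have no output yet."""
    for pdf in sorted(input_dir.glob("*.pdf")):
        target = output_dir / pdf.name
        if not target.exists():  # skip files that were already processed
            yield pdf, target
```

Each yielded pair can then be handed to an `ocrmypdf` invocation, for example with `subprocess.run(["ocrmypdf", str(src), str(dst)], check=True)`. Skipping existing outputs makes the batch safe to re-run after an interruption.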