What are the best document processing engine repositories on GitHub?

Question 1

Accepted Answer

High-performance pipelines for converting large volumes of narrative text into machine-readable data.

**Distinguishing note:** Focuses on the high-performance pipeline and parallel execution aspects of document processing.

Four GitHub repositories match Data & Databases · Document Processing Engines. Top picks: google/langextract, VikParuchuri/marker, deepseek-ai/DeepSeek-OCR, bytedance/Dolphin.

Question 2

Why is google/langextract a recommended document processing engine repository?

Accepted Answer

Executes parallel extraction passes to convert large volumes of narrative text into machine-readable data.
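As a rough illustration of what "parallel extraction passes" can mean in general, the sketch below splits a text into chunks and runs an extraction function over the chunks concurrently. It is not langextract's API: `split_into_chunks`, `extract_dates`, and `parallel_extract` are hypothetical names, and the regex-based date extractor is only a stand-in for whatever model-backed extraction a real pipeline would call.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(text, max_chars=200):
    """Split text into chunks of roughly max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def extract_dates(chunk):
    """Hypothetical extraction pass: return ISO dates found in one chunk."""
    return [{"type": "date", "value": m}
            for m in re.findall(r"\d{4}-\d{2}-\d{2}", chunk)]


def parallel_extract(text, max_workers=4):
    """Run the extraction pass over all chunks concurrently and merge results."""
    chunks = split_into_chunks(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(extract_dates, chunks)  # map preserves chunk order
        return [record for records in results for record in records]
```

Threads suit this shape when each pass mostly waits on I/O, such as a remote model call; a CPU-bound extractor would more likely use a process pool.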

Question 3

Why is VikParuchuri/marker a recommended document processing engine repository?

Accepted Answer

Employs high-performance pipelines to process large batches of PDF files in parallel on GPUs.
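Batch processing of this kind usually means grouping input files into fixed-size batches and handing each batch to a worker. The sketch below shows that grouping step only. It does not use marker's API: `batched`, `convert_batch`, and `convert_all` are hypothetical names, and `convert_batch` returns a placeholder string where a real pipeline would run a GPU-backed converter.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def convert_batch(paths):
    """Hypothetical stand-in for a converter that handles one batch of PDFs."""
    return {path: f"converted:{path}" for path in paths}


def convert_all(paths, batch_size=8, max_workers=2):
    """Convert every path, one batch per worker task, and merge the outputs."""
    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for result in pool.map(convert_batch, batched(paths, batch_size)):
            merged.update(result)
    return merged
```

The batch size is the main tuning knob in a setup like this: larger batches amortize per-call overhead, while smaller ones keep memory use per worker down.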

Question 4

Why is deepseek-ai/DeepSeek-OCR a recommended document processing engine repository?

Accepted Answer

Provides high-performance pipelines for batch processing and text extraction from documents.

Question 5

Why is bytedance/Dolphin a recommended document processing engine repository?

Accepted Answer

Provides high-performance pipelines for converting large volumes of images into structured data through parallel execution.

Awesome GitHub Repositories · Document Processing Engines

google/langextract

VikParuchuri/marker

deepseek-ai/DeepSeek-OCR

bytedance/Dolphin