What are the best Awesome Document Processing GitHub Repositories?

Methods and services for parsing, analyzing, translating, and rendering complex document structures. Explore 243 awesome GitHub repositories matching content management & publishing · Document Processing. Refine with filters or upvote what's useful. Top picks: microsoft/markitdown, pewdiepie-archdaemon/odysseus, fffaraz/awesome-cpp, binary-husky/gpt_academic, unclecode/crawl4ai, opendatalab/mineru, ds4sd/docling, docling-project/docling, adam-p/markdown-here, marktext/marktext.

Why is microsoft/markitdown a recommended Document Processing GitHub repository?

Converts diverse file formats into structured Markdown syntax to facilitate automated document processing and data integration.

Why is pewdiepie-archdaemon/odysseus a recommended Document Processing GitHub repository?

Renders PDF pages in a viewer panel and provides tools for document interaction.

Why is fffaraz/awesome-cpp a recommended Document Processing GitHub repository?

Exposes libraries for parsing, creating, and modifying common office document formats like spreadsheets.

Why is binary-husky/gpt_academic a recommended Document Processing GitHub repository?

Translates document content automatically by triggering specialized file processing plugins configured with service credentials.

Why is unclecode/crawl4ai a recommended Document Processing GitHub repository?

Converts complex web page content into clean Markdown files, including automated filtering and citation formatting.

Why is opendatalab/mineru a recommended Document Processing GitHub repository?

Applies geometric heuristics and spatial analysis to reassemble fragmented text blocks into a coherent reading order.

Why is ds4sd/docling a recommended Document Processing GitHub repository?

Analyzes visual cell boundaries and alignments to transform complex PDF tables into structured machine-readable data.

Why is docling-project/docling a recommended Document Processing GitHub repository?

Organizes document content into a hierarchical tree structure that preserves the semantic and spatial relationships between individual elements.

Why is adam-p/markdown-here a recommended Document Processing GitHub repository?

Uses DOM-based parsing to replace plain text nodes with rendered HTML elements, updating content dynamically within the browser.

Why is marktext/marktext a recommended Document Processing GitHub repository?

Transforms input text into structured tree representations to enable efficient document parsing and rendering.

243 repositories

Awesome GitHub Repositories · Document Processing

microsoft/markitdown
154,485
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
Converts diverse file formats into structured Markdown syntax to facilitate automated document processing and data integration.
Python · autogen · autogen-extension · langchain
pewdiepie-archdaemon/odysseus
72,184
Odysseus is a self-hosted AI workspace and autonomous agent framework designed for deploying and managing large language models. It serves as a centralized platform for orchestrating agentic tasks, utilizing a model context protocol server to connect AI models to external system utilities, browser automation, and local hardware. The system distinguishes itself through a combination of retrieval-augmented generation and a RAG knowledge base, using vector stores and local embeddings to provide persistent semantic memory. It further integrates AI-driven communication management to triage email i
Renders PDF pages in a viewer panel and provides tools for document interaction.
Python
fffaraz/awesome-cpp
71,817
This project is a comprehensive, curated directory of high-quality libraries, tools, and educational resources for C and C++ development. It serves as an ecosystem discovery index, helping developers navigate the vast landscape of third-party components, frameworks, and technical documentation available for the language. The collection is distinguished by its focus on high-performance systems programming and technical mastery. It provides deep coverage of specialized domains including SIMD-accelerated data processing, compile-time template metaprogramming, and asynchronous event-driven archit
Exposes libraries for parsing, creating, and modifying common office document formats like spreadsheets.
awesome · awesome-list · c
binary-husky/gpt_academic
70,912
This project provides a self-hosted, web-based interface designed to integrate large language models into academic and research workflows. It functions as a modular platform for document analysis, literature processing, and data handling, allowing users to maintain full control over their data and model connectivity through private server or local deployments. The system is distinguished by its extensible architecture, which enables users to inject custom Python scripts to automate repetitive tasks and extend core functionality. It also features a voice-enabled interaction layer that captures
Translates document content automatically by triggering specialized file processing plugins configured with service credentials.
Python · academic · chatglm-6b · chatgpt
unclecode/crawl4ai
68,644
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
Converts complex web page content into clean Markdown files, including automated filtering and citation formatting.
Python
opendatalab/MinerU
67,734
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences with geometric heuristics, the system reconstructs the reading order and structural hierarchy of documents to ensure accurate data representation. The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recogn
Applies geometric heuristics and spatial analysis to reassemble fragmented text blocks into a coherent reading order.
Python · ai4science · document-analysis · extract-data
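
The reading-order idea in the MinerU entry above can be illustrated with a small sketch. This is not MinerU's algorithm. It is a hypothetical, simplified heuristic that assumes each text block carries a bounding box, groups blocks into columns by horizontal overlap, and then reads each column top to bottom. The names `Block` and `reading_order` are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text fragment with its bounding box (hypothetical structure)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward, as in most page coordinates)
    x1: float  # right edge
    y1: float  # bottom edge


def reading_order(blocks):
    """Order blocks column by column (left to right), top to bottom in each."""
    columns = []  # each column is a list of horizontally overlapping blocks
    for block in sorted(blocks, key=lambda b: b.x0):
        if columns and block.x0 < max(b.x1 for b in columns[-1]):
            # The block starts before the current column ends: same column.
            columns[-1].append(block)
        else:
            # No horizontal overlap with the current column: start a new one.
            columns.append([block])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return ordered


if __name__ == "__main__":
    page = [
        Block("right column, second", 120, 50, 220, 90),
        Block("left column, second", 0, 50, 100, 90),
        Block("right column, first", 120, 0, 220, 40),
        Block("left column, first", 0, 0, 100, 40),
    ]
    for block in reading_order(page):
        print(block.text)
```

A full-width block such as a page title would merge every column under this rule, which is one reason real pipelines combine such heuristics with a layout model.
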
DS4SD/docling
62,172
Docling is a multimodal content converter and document parser designed to transform PDFs, Office files, and HTML into structured Markdown or JSON for generative AI applications. It functions as an OCR document processor and a PDF layout analyzer that extracts tables, charts, and hierarchical structures while preserving the original page layout. The system operates as a local-first inference engine, allowing for the processing of sensitive data in air-gapped environments without external network connectivity. It can also be deployed as an API or a Model Context Protocol server to provide parsi
Analyzes visual cell boundaries and alignments to transform complex PDF tables into structured machine-readable data.
Python
docling-project/docling
61,674
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
Organizes document content into a hierarchical tree structure that preserves the semantic and spatial relationships between individual elements.
Python · ai · convert · document-parser
adam-p/markdown-here
60,218
Markdown Here is a browser extension that enables rich text composition within web-based editors that lack native formatting support. By transforming plain text markdown syntax into rendered HTML, it allows users to draft professional emails and documents using standard markup, including headers, tables, and footnotes, directly inside their browser. The tool distinguishes itself through a bidirectional transformation engine that supports both the conversion of markdown to HTML and the reversion of rendered content back into its original source code. This state-preserving functionality allows
Uses DOM-based parsing to replace plain text nodes with rendered HTML elements, updating content dynamically within the browser.
JavaScript
marktext/marktext
57,443
Marktext is a cross-platform desktop application designed for markdown document authoring and structured note-taking. It functions as a WYSIWYG text processor, providing a distraction-free interface that renders formatted content in real-time while hiding the underlying markup syntax. The application utilizes a multi-process architecture that separates system integration from the user interface, ensuring consistent performance across Windows, macOS, and Linux. By employing a custom editor core built on native browser capabilities and a structured syntax tree, it manages complex document eleme
Transforms input text into structured tree representations to enable efficient document parsing and rendering.
TypeScript · dark-mode · editor · electron
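
To make "structured tree representations" concrete, here is a minimal sketch of turning Markdown text into a heading tree. It is not Marktext's implementation. It is a hypothetical illustration that recognizes only ATX headings (`#` to `######`) and treats every other non-empty line as body text. `Node` and `parse_outline` are names invented for this example.

```python
import re
from dataclasses import dataclass, field

# Matches ATX headings such as "## Title" (one to six leading '#').
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Node:
    """One heading in the outline, with its body lines and sub-headings."""
    title: str
    level: int
    body: list = field(default_factory=list)
    children: list = field(default_factory=list)


def parse_outline(text):
    """Build a tree of headings; deeper headings nest under shallower ones."""
    root = Node("root", 0)
    stack = [root]  # path from the root to the most recent heading
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            node = Node(match.group(2).strip(), len(match.group(1)))
            # Pop until the top of the stack is a strictly shallower heading.
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif line.strip():
            stack[-1].body.append(line)
    return root


if __name__ == "__main__":
    doc = "# Guide\nIntro text\n## Install\n## Usage\n# Appendix\n"
    tree = parse_outline(doc)
    for section in tree.children:
        print(section.title, [child.title for child in section.children])
```

The sketch ignores fenced code blocks, so a `#` comment inside a fence would be misread as a heading. A real parser tracks block context before classifying lines.
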
zylon-ai/private-gpt
57,278
This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to provide context-aware responses for chat and completion requests. The system distinguishes itself through a database-agnostic abstraction layer that supports various storage backends, ranging from local disk storage to enterprise-grade vector databases. It offers flexible deployment
Processes raw text into searchable document representations to support retrieval-augmented generation workflows.
Python
willmcgugan/rich
56,640
Rich is a Python terminal formatting library and user interface framework. It provides tools for rendering rich text, colors, and complex layouts within a terminal environment, including specialized formatters for markdown and source code syntax highlighting. The library distinguishes itself through high-level UI components such as tables with unicode borders, hierarchical tree views for nested data structures, and a system for building structured terminal user interfaces. It also includes a debugging visualizer for pretty-printing complex data and formatting error tracebacks. The capability
Translates markdown syntax into styled sequences of text objects for formatted console output.
Python
typst/typst
54,320
Typst is a programmable, markup-based typesetting engine designed for professional document creation. It functions as a scriptable publishing toolchain that transforms plain text and code into complex, paginated outputs. By utilizing a high-performance compiler, the system automates document assembly, mathematical rendering, and dynamic content generation, providing a unified workflow for academic and technical authoring. The engine distinguishes itself through a declarative layout framework that uses cascading rules to manage document structure and visual styling. Unlike traditional systems,
Uses a domain-specific language to handle programmatic document layout and complex mathematical typesetting.
Rust · compiler · markup · typesetting
santifer/career-ops
54,119
Career-ops is an AI-driven job search automation system designed to manage the entire application lifecycle, from discovery to tracking. It functions as a career copilot that utilizes autonomous agents to identify vacancies, evaluate professional fit, and generate tailored application materials. The project distinguishes itself through a multi-archetype persona management system and writing style calibration, allowing users to maintain different professional identities and a consistent voice across documents. It employs a multi-dimensional weighted scoring system to evaluate job suitability a
Transforms structured professional profile data into ATS-compatible PDF resumes and cover letters using templates.
JavaScript · ai-agent · anthropic · automation
mozilla/pdf.js
53,454
This project is a portable document rendering engine designed to parse and display complex document layouts directly within standard web browser environments. It functions as a web-native viewer that enables the presentation of documents without requiring external software or browser plugins. The engine utilizes a canvas-based rendering layer to map document page data onto standard web drawing surfaces, ensuring high-fidelity visual output. To maintain interface responsiveness, it offloads heavy parsing and object extraction tasks to background threads. The system also employs asynchronous by
Manages document loading, page navigation, and text extraction to facilitate seamless file viewing.
JavaScript
gogs/gogs
47,606
Gogs is a self-hosted Git service and collaborative code hosting platform. It functions as a version control manager that lets users store and manage source code on their own infrastructure over the SSH, HTTP, and HTTPS protocols. The platform distinguishes itself through comprehensive mirroring capabilities, acting as a tool to synchronize and mirror repositories and wikis from external hosting providers to a local instance. It is designed for secure, containerized deployment, supporting non-root user configurations to meet strict security requirements. Beyond basic hosting, it offers a suite of collaboration tools, including pull requests, issue tracking, wikis, and peer code reviews. The system incorporates workflow automation through webhooks and Git hooks, manages oversized binary files through Large File Storage, and provides granular access control for managing private repositories. The service can be deployed as a container image for consistent behavior across different hosting environments.
Displays Jupyter Notebooks and PDF files directly within the web interface for seamless viewing.
Go · docker · git · go
videojs/video.js
39,805
Video.js is a customizable HTML5 video player framework and JavaScript media plugin system. It provides a foundation for rendering and controlling web video playback across different devices and screen sizes, supporting both standard HTML5 integration and adaptive bitrate streaming for live or on-demand content. The project is distinguished by a modular architecture that allows for the extension of playback functionality through a class-based plugin system and middleware-based method interception. This enables the development of tailored media interfaces and the integration of specialized beh
Overlays captions and subtitles on top of the video element using a programmable DOM layer for visual styling.
JavaScript · dash · hls · html
chatchat-space/Langchain-Chatchat
38,211
Langchain-Chatchat is a system for building retrieval-augmented generation applications and autonomous AI agents. It integrates a knowledge base management system and an agent framework to enable language models to interact with private documents and execute multi-step tasks through external tools. The platform supports local deployment of language models on private infrastructure to operate without an internet connection. It includes a multimodal AI platform that combines vision models for image analysis with text-to-image generation capabilities. The system provides a web-based conversatio
Provides infrastructure for loading, updating, and organizing local documents for subsequent information retrieval.
Python · chatbot · chatchat · chatglm
exacity/deeplearningbook-chinese
37,285
This project is a comprehensive Chinese translation of a technical deep learning textbook, providing an educational resource on the theory and implementation of neural networks. It functions as a collaborative technical translation project designed to make complex academic AI literature accessible to non-English speakers. The project utilizes a community-driven translation model that integrates external suggestions and pull requests to refine linguistic accuracy and reduce bias. It employs standardized terminology mapping to ensure a uniform vocabulary throughout the translated content. To i
Transforms structured source files into Markdown format to facilitate web rendering and indexing.
TeX
VikParuchuri/marker
36,164
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabi
Transforms various document formats into clean markdown including formatted tables, equations, and code blocks.
Python