awesome-repositories.com
© 2026 Bringes Technology SRL · VAT RO45896025 · hello@bringes.io
Document Processing and Conversion · Awesome GitHub Repositories

15 repos

Engines and APIs that automate the conversion and processing of documents between various file formats.

Explore 15 awesome GitHub repositories matching Content Management & Publishing · Document Processing and Conversion.

Awesome Document Processing and Conversion GitHub Repositories

  • avelino/awesome-go

    165,543 · GitHub

    This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently di

    Go · awesome · awesome-list · go
  • oven-sh/bun

    87,491 · GitHub

    Bun is a high-performance runtime environment designed to execute JavaScript and TypeScript applications with minimal latency and high throughput. Built on a native core implemented in Zig, it provides a unified execution engine that leverages JavaScriptCore for efficient memory management and low-latency startup. The

    Zig · bun · bundler · javascript
  • microsoft/markitdown

    87,305 · GitHub

    This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine

    Python · autogen · autogen-extension · langchain
  • Stirling-Tools/Stirling-PDF

    74,357 · GitHub

    Stirling-PDF is a self-hosted document processing suite designed for secure, private file management. It functions as a comprehensive transformation engine that executes complex operations—such as merging, splitting, converting, and redacting documents—directly on the host machine. The platform provides both a browser-

    TypeScript · docker · hacktoberfest · java
  • infiniflow/ragflow

    73,425 · GitHub

    This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasonin

    Python · agent · agentic · agentic-ai
  • tesseract-ocr/tesseract

    72,460 · GitHub

    Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into d

    C++ · hacktoberfest · lstm · machine-learning
  • binary-husky/gpt_academic

    70,112 · GitHub

    This project provides a self-hosted, web-based interface designed to integrate large language models into academic and research workflows. It functions as a modular platform for document analysis, literature processing, and data handling, allowing users to maintain full control over their data and model connectivity th

    Python · academic · chatglm-6b · chatgpt
  • fffaraz/awesome-cpp

    69,832 · GitHub

    This project is a comprehensive, curated directory of high-quality libraries, tools, and educational resources for C and C++ development. It serves as an ecosystem discovery index, helping developers navigate the vast landscape of third-party components, frameworks, and technical documentation available for the languag

    awesome · awesome-list · c
  • adam-p/markdown-here

    60,151 · GitHub

    Markdown Here is a browser extension that enables rich text composition within web-based editors that lack native formatting support. By transforming plain text markdown syntax into rendered HTML, it allows users to draft professional emails and documents using standard markup, including headers, tables, and footnotes,

    JavaScript
  • zylon-ai/private-gpt

    57,116 · GitHub

    This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov

    Python
  • opendatalab/MinerU

    54,523 · GitHub

    MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences w

    Python · ai4science · document-analysis · extract-data
  • marktext/marktext

    53,968 · GitHub

    Marktext is a cross-platform desktop application designed for markdown document authoring and structured note-taking. It functions as a WYSIWYG text processor, providing a distraction-free interface that renders formatted content in real-time while hiding the underlying markup syntax. The application utilizes a multi-

    JavaScript · dark-mode · editor · electron
  • docling-project/docling

    53,584 · GitHub

    Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing

    Python · ai · convert · document-parser
  • mozilla/pdf.js

    52,848 · GitHub

    This project is a portable document rendering engine designed to parse and display complex document layouts directly within standard web browser environments. It functions as a web-native viewer that enables the presentation of documents without requiring external software or browser plugins. The engine utilizes a can

    JavaScript
  • typst/typst

    51,468 · GitHub

    Typst is a programmable, markup-based typesetting engine designed for professional document creation. It functions as a scriptable publishing toolchain that transforms plain text and code into complex, paginated outputs. By utilizing a high-performance compiler, the system automates document assembly, mathematical rend

    Rust · compiler · markup · typesetting

Explore sub-tags

  • Content Processing Utilities (2 sub-tags): Low-level utilities for identifying document elements and managing hidden metadata within files.
  • Document Conversion (1 sub-tag): Automated engines that transform documents from one file format or structure into another.
  • Document Processing (7 sub-tags): Methods and services for parsing, analyzing, translating, and rendering complex document structures.
  • Document Processing APIs (1 sub-tag): Programmatic interfaces that allow developers to control and automate the parsing of document content.
  • Document Processing Engines (4 sub-tags): Core processing engines that handle document compilation, layout rendering, and state management.
  • Document Processing Tools (6 sub-tags): Tools for automating document workflows, including format conversion, data extraction, and structure parsing.