Engines and APIs that automate the conversion and processing of documents between various file formats.
Explore 15 awesome GitHub repositories matching content management & publishing · Document Processing and Conversion.
This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently di
Bun is a high-performance runtime environment designed to execute JavaScript and TypeScript applications with minimal latency and high throughput. Built on a native core implemented in Zig, it provides a unified execution engine that leverages JavaScriptCore for efficient memory management and low-latency startup. The
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine
Stirling-PDF is a self-hosted document processing suite designed for secure, private file management. It functions as a comprehensive transformation engine that executes complex operations—such as merging, splitting, converting, and redacting documents—directly on the host machine. The platform provides both a browser-
This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasonin
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into d
This project provides a self-hosted, web-based interface designed to integrate large language models into academic and research workflows. It functions as a modular platform for document analysis, literature processing, and data handling, allowing users to maintain full control over their data and model connectivity th
This project is a comprehensive, curated directory of high-quality libraries, tools, and educational resources for C and C++ development. It serves as an ecosystem discovery index, helping developers navigate the vast landscape of third-party components, frameworks, and technical documentation available for the languag
Markdown Here is a browser extension that enables rich text composition within web-based editors that lack native formatting support. By transforming plain text markdown syntax into rendered HTML, it allows users to draft professional emails and documents using standard markup, including headers, tables, and footnotes,
This project is a privacy-first backend service designed to facilitate retrieval-augmented generation by processing local documents into searchable vector representations. It provides a modular architecture that allows users to ingest diverse file formats, manage document metadata, and perform semantic searches to prov
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences w
MarkText is a cross-platform desktop application designed for markdown document authoring and structured note-taking. It functions as a WYSIWYG text processor, providing a distraction-free interface that renders formatted content in real-time while hiding the underlying markup syntax. The application utilizes a multi-
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing
This project is a portable document rendering engine designed to parse and display complex document layouts directly within standard web browser environments. It functions as a web-native viewer that enables the presentation of documents without requiring external software or browser plugins. The engine utilizes a can
Typst is a programmable, markup-based typesetting engine designed for professional document creation. It functions as a scriptable publishing toolchain that transforms plain text and code into complex, paginated outputs. By utilizing a high-performance compiler, the system automates document assembly, mathematical rend