Open-source tools and libraries for programmatically converting between various document, spreadsheet, and presentation file formats.
Chandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSON schemas, with citations back to source locations. The service supports form filling in PDF and image documents, document generation from Markdown, and extraction of tracked changes from Word files. The platform distinguishes itself with pipelines that chain multiple processing steps into versioned, reusable units, managed through draft, saved, and published states. These pipelines can execute as single requests with runtime parameter overrides and webhook callbacks for asynchronous completion. For batch workloads, multiple documents can be processed in a single request to improve throughput, and PDF segmentation splits combined or batch-scanned documents into logical sections. Security controls include API key management, data usage preferences, result auto-expiration, and authenticated webhook delivery with cryptographic signatures. Additional capabilities include a typed Python SDK, automatic request retry with exponential backoff, file collection management, API health checks, and request analytics monitoring for self-hosted deployments. The service can be deployed on-premises in a containerized setup with restricted network access, TLS termination, and authentication.
Chandra is a self-hostable document processing platform that provides an API-driven interface for converting between various formats, generating PDFs, and handling batch processing, making it a comprehensive solution for your requirements.
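The entry above mentions authenticated webhook delivery with cryptographic signatures but does not say which scheme is used. As a minimal sketch, assuming an HMAC-SHA256 signature computed over the raw request body with a shared secret (a common convention, not confirmed for Chandra), a receiver could check a callback as follows. The function names and the hex encoding are hypothetical.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Return the hex HMAC-SHA256 of the raw body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Check a received signature against one recomputed locally.

    compare_digest runs in constant time, so the comparison does not
    leak how many leading characters of the signature matched.
    """
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

The signature should be verified against the exact bytes received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the digest.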
Pandoc is a universal document converter that translates content between a wide range of markup and binary formats. It functions by parsing input documents into a unified intermediate abstract syntax tree, which serves as the foundation for consistent manipulation and transformation across diverse output types. The system is distinguished by its modular reader-writer pipeline, which decouples input parsing from output generation to allow for granular control over document structure. Users can programmatically manipulate this intermediate tree through a filter system, supporting both external JSON-based interop and an integrated Lua scripting environment for custom transformations. This architecture enables complex document processing tasks, such as automated scholarly publishing, where citations, bibliographies, and mathematical expressions are managed through a specialized toolchain. Beyond core conversion, the project provides a templating engine that merges structured document data with customizable templates to produce final outputs with specific styling and layout requirements. It also offers a network-based server mode for API-driven and batch processing, allowing the tool to be integrated into automated technical content pipelines. The software is primarily operated via a command-line interface, which provides extensive configuration options for managing input formats, citation styles, and document metadata.
Pandoc is a versatile document conversion engine that supports a vast array of formats, provides a server mode for API-driven workflows, and handles complex tasks like PDF generation and batch processing.
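To illustrate the JSON-based filter interface mentioned above: Pandoc serializes its abstract syntax tree as JSON, where each element is an object with a type tag "t" and optional contents "c", and a filter reads that JSON on standard input and writes the transformed tree to standard output. The sketch below uses only the Python standard library. The helper names (walk, upper_str, apply_filter, main) are illustrative and not part of Pandoc.

```python
import json
import sys


def walk(node, action):
    """Rebuild the tree, giving action a chance to replace each element."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replaced = action(node)
            if replaced is not None:
                node = replaced
        return {key: walk(value, action) for key, value in node.items()}
    return node


def upper_str(elem):
    """Upper-case plain text runs; return None to leave other elements alone."""
    if elem.get("t") == "Str":
        return {"t": "Str", "c": elem["c"].upper()}
    return None


def apply_filter(doc):
    """Transform the document body and keep metadata and version as-is."""
    return {**doc, "blocks": walk(doc["blocks"], upper_str)}


def main():
    # Call main() at the bottom of the script to use it as a filter:
    # Pandoc pipes the JSON tree in on stdin and reads the result from stdout.
    json.dump(apply_filter(json.load(sys.stdin)), sys.stdout)
```

With a call to main() added at the bottom, a script like this can be passed to Pandoc through its --filter option. The same transformation can also be written as a Lua filter, which runs inside Pandoc without an external interpreter.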
jsPDF is a document creation engine designed to generate professional PDF files through a unified programming interface. It functions as a cross-platform graphics library that enables the programmatic assembly of data into structured layouts, supporting both client-side generation within web browsers and server-side rendering in backend environments. The library utilizes a canvas-based drawing API that translates high-level geometric and text instructions into standardized PDF vector primitives. By employing a cross-platform runtime abstraction, it decouples document generation logic from environment-specific constraints, ensuring consistent behavior whether the engine is running in a browser or on a server. The engine includes comprehensive support for internationalized document publishing, featuring a Unicode-compliant text renderer that maps custom character sets and scripts onto document pages. To maintain document quality and efficiency, it incorporates font-subset-based embedding to include only necessary glyphs, alongside layered graphical state management to handle complex content composition and visual ordering.
jsPDF is a PDF generation library focused on drawing graphics and text, in the browser or on the server, rather than a document conversion engine capable of transforming existing office formats like DOCX or ODT into other types.
This project is a Laravel integration for the Dompdf rendering engine, providing a tool to convert HTML and CSS templates into PDF documents. It functions as a wrapper that allows Laravel applications to generate downloadable or streamable PDF files from web-standard content. The library includes specialized tools for producing PDF/A-3b compliant documents intended for long-term electronic preservation. This archival capability includes the ability to embed XML metadata and attachments, which supports electronic invoicing standards for digital business transactions. The software covers a broad range of document generation tasks, including the conversion of HTML strings, PDF file export to filesystems, and the delivery of documents via browser streams. It leverages template-driven generation and standardized storage interfaces to manage the output of rendered files.
This is a specialized PDF rendering library for the Laravel framework rather than a general-purpose document conversion engine capable of handling multiple office formats like DOCX or ODT.
This project is a web-based resume builder and document designer used to create professional CVs. It integrates a visual editor with a template system and PDF generation to transform structured professional history into polished documents. The tool features a template-based editor with predefined professional design themes, allowing users to switch layouts and color schemes without losing content. It supports custom template creation and the embedding of industry-standard icon sets to personalize the visual presentation. The system manages document design through a reactive interface and style preprocessing to handle professional layouts and custom typography. These designs are converted into portable document format files for consistent printing and distribution.
This project is a specialized resume builder and visual editor rather than a general-purpose document conversion engine for handling diverse office file formats like DOCX or ODT.
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy through validation against predefined templates. Its pipeline-based architecture supports pluggable processing backends, allowing for the dynamic integration of specialized engines for optical character recognition and complex visual layout analysis. Users can control parsing behavior and extraction parameters through declarative configuration files, facilitating integration into automated workflows and server-based architectures. The library provides both a programmatic interface and a command-line toolkit to support automated document processing and format conversion. It utilizes optional dependency management to allow for modular installation of specific features, such as media rendering or advanced processing capabilities, depending on the requirements of the application.
Docling is a powerful document parsing and extraction framework that supports converting various formats like PDF, DOCX, and HTML into structured data, though its primary focus is on semantic analysis rather than general-purpose office document format conversion.
Dompdf is a PHP library that functions as a document rendering engine, transforming HTML and CSS markup into portable document files. It operates by parsing web-based layout attributes and visual properties to generate static documents suitable for reports, invoices, or archival purposes. The library distinguishes itself by integrating a resource-fetching pipeline that retrieves external stylesheets and images to maintain visual fidelity. It also supports the execution of server-side scripts during the document creation process, allowing for the injection of dynamic data and custom logic into the final output. The rendering process involves converting web markup into a structured tree of geometric boxes, which are then translated into low-level vector instructions and text streams. This workflow includes calculating element dimensions, margins, and padding, as well as mapping font metrics to ensure accurate text wrapping and document flow.
This is a specialized HTML-to-PDF rendering library rather than a general-purpose document conversion engine capable of handling diverse formats like DOCX or ODT.
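The role of font metrics in text wrapping, as described above, can be sketched with a greedy line breaker. This is not Dompdf's algorithm or API. It is a simplified illustration that assumes a hypothetical table of per-character advance widths in 1/1000 em units (a common convention in font metric files) and a fallback width for characters missing from the table.

```python
def text_width(text, char_widths, font_size, default_width=500):
    """Width of text in points, from per-character advances in 1/1000 em."""
    units = sum(char_widths.get(ch, default_width) for ch in text)
    return units * font_size / 1000


def wrap_text(text, max_width, char_widths, font_size):
    """Greedy wrap: add words to a line until the next one would overflow."""
    lines = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and text_width(candidate, char_widths, font_size) > max_width:
            lines.append(current)  # the line is full, start a new one
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines
```

A word wider than the available box is kept on its own line rather than split. A real renderer would also handle hyphenation, kerning, and mixed fonts within a line.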
LapisCV is a PDF document generator and resume builder designed to convert structured Markdown text into professional curriculum vitae. It functions as a rendering pipeline that transforms simple markup and variables into print-ready documents using a headless browser engine or LaTeX templates. The project provides a collection of professional themes and visual styles to customize the typography and branding of academic and professional resumes. It utilizes variable-based style injection to allow for the adjustment of fonts, colors, and margins based on content volume and user preferences. The system manages print layout and document structure through precise control over page breaks, dimensions, and pagination. It also supports the insertion of visual elements, such as profile avatars and specialized icons, to enhance the professional presentation of the final output.
This tool is a specialized resume builder designed to render Markdown into PDFs, rather than a general-purpose document conversion engine capable of handling diverse office formats like DOCX or ODT.
This project is a browser rendering service and headless Chrome PDF generator built on Puppeteer. It functions as a backend tool for converting web pages and raw HTML content into PDF documents and screenshots. The service distinguishes itself through browser session control, allowing for the injection of session cookies and the configuration of navigation timeouts to handle authenticated pages. It also includes viewport-based layout scaling to adjust browser dimensions and device scale factors during the rendering process. The broader capability surface covers HTML content export and automated web screenshot capture. Operational stability is supported by a dedicated health check endpoint used to verify the status of the rendering engine.
This tool is a specialized headless browser service for converting web pages and HTML to PDF, but it lacks the broader document-to-document conversion capabilities for formats like DOCX or ODT requested by the visitor.
PyMuPDF is a PDF manipulation library and document analysis tool. It provides text extraction, OCR support, and image conversion, with a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including C bindings to the underlying MuPDF library for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. Its broader capability surface covers optical character recognition for creating searchable text layers, detailed data extraction of tables and key-value pairs, and security operations including AES/RC4 encryption and permanent content redaction. The library also handles complex document geometry, layout analysis, and the generation of PDFs from HTML and CSS. It can load PDF, EPUB, MOBI, SVG, and Office files, including from in-memory streams.
This library provides a robust programmatic interface for manipulating, converting, and extracting data from PDFs and various document formats, serving as a powerful engine for document processing workflows.
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or Markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a pipeline-based architecture that allows users to chain analysis tasks into versioned, reusable sequences. It supports high-volume operations through batch processing and provides granular control over data extraction via schema management and confidence scoring. For enterprise requirements, it offers containerized deployment options that allow for on-premises execution, ensuring data privacy and security while maintaining consistent performance across environments. Beyond core analysis, the system includes integrated management for document lifecycles, storage, and event-driven notifications via webhooks. It provides a strongly-typed software development kit to facilitate programmatic interaction, alongside monitoring tools that track system health and usage metrics. Security is maintained through API access controls, request throttling, and payload validation for event notifications.
This is a document analysis and data extraction platform focused on converting unstructured files into structured data, rather than a general-purpose office document conversion engine for formats like DOCX or ODT.
Illa-builder is a low-code internal tool builder and API integration platform used to create business applications and admin panels. It functions as a database GUI dashboard and visual workflow automator, allowing users to connect to databases and external APIs to manage data and automate business processes. The platform provides a self-hosted app framework that can be deployed on private infrastructure via Docker. It enables the creation of custom dashboards and CRMs while maintaining full control over data and hosting. The system includes a visual drag-and-drop canvas for designing user interfaces with pre-built components. It covers data integration for SQL and NoSQL sources, real-time collaborative editing, and event-driven workflow automation triggered by schedules or webhooks.
This is a low-code platform for building internal business applications and dashboards, which is a different category than a dedicated document conversion engine for file format transformation.
pdfkit is a JavaScript PDF generation library used to programmatically create binary PDF documents in Node.js and browser environments. It functions as a vector graphics engine for rendering paths, shapes, gradients, and tiling patterns, and as a tool for producing rich text and tagged documents that follow international accessibility standards for screen reader compatibility. The library includes a security and encryption utility for applying document encryption and restricting user permissions regarding printing, copying, or editing. It also serves as a form and annotation tool, enabling the embedding of fillable fields, interactive hyperlinks, and document annotations. The system covers a broad range of capabilities including typography with automatic line wrapping and custom font embedding, as well as media asset management for inserting images and external file attachments. It further supports the creation of structural elements such as tables and internal navigation links.
This is a PDF generation library for creating documents from scratch rather than a conversion engine designed to transform existing office formats like DOCX or ODT into other types.
unioffice is a document processing suite that provides a PDF document processor, an Open XML document library, a document security toolkit, and a document content extractor. It is designed to programmatically create, read, and modify Word, Excel, and PowerPoint files, as well as generate and edit PDF documents. The project is distinguished by its pure Go implementation of the Open XML standard, which removes native binary dependencies to simplify container deployments. It features advanced capabilities for digital document security, including hardware-based PDF signing, content encryption, and sensitive information redaction using regular expressions. The library covers a broad range of capabilities including the generation and manipulation of spreadsheets with formulas and charts, the creation of presentations, and the editing of Word documents. It also provides tools for PDF form automation, HTML-to-PDF conversion, PDF/A compliance validation, and AI-powered structured data extraction from unstructured documents.
This library provides a robust, native Go implementation for programmatically creating, modifying, and converting between various office formats and PDF, making it a powerful engine for document processing tasks.