# funstory-ai/babeldoc

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/funstory-ai-babeldoc).**

7,752 stars · 602 forks · Python · AGPL-3.0

## Links

- GitHub: https://github.com/funstory-ai/BabelDOC
- Homepage: https://funstory-ai.github.io/BabelDOC/
- awesome-repositories: https://awesome-repositories.com/repository/funstory-ai-babeldoc.md

## Description

BabelDOC translates PDF files while preserving their original layout and styling. It uses large language models to translate content into a target language and is tailored to scientific and technical documents.

BabelDOC gives academic content special handling: it identifies mathematical formulas and complex layout structures and preserves them in the output. To keep terminology consistent, it enforces a glossary of source-to-target term mappings across the translated text.
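
As a rough illustration of glossary-driven terminology enforcement, the sketch below finds which glossary terms occur in a paragraph and lists only those mappings in the prompt sent to the model. It is not BabelDOC's code: `find_glossary_hits`, `build_prompt`, and the prompt wording are hypothetical names chosen for this example.

```python
import re


def find_glossary_hits(text: str, glossary: dict[str, str]) -> dict[str, str]:
    """Return the glossary entries whose source term occurs in `text`."""
    hits = {}
    for source, target in glossary.items():
        # Whole-term, case-insensitive match so "attention" does not hit "inattention".
        pattern = r"(?<!\w)" + re.escape(source) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits[source] = target
    return hits


def build_prompt(text: str, glossary: dict[str, str], target_lang: str) -> str:
    """Build a translation prompt that lists only the relevant term mappings."""
    hits = find_glossary_hits(text, glossary)
    lines = [f"Translate the following text into {target_lang}."]
    if hits:
        lines.append("Use these term translations exactly:")
        lines.extend(f"- {source} -> {target}" for source, target in hits.items())
    lines.append("")
    lines.append(text)
    return "\n".join(lines)


glossary = {"attention": "注意力", "gradient descent": "梯度下降"}
prompt = build_prompt("Attention layers are costly.", glossary, "Chinese")
```

Sending only the matching entries keeps the prompt short when the glossary is large, at the cost of one regex scan per term.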

The software covers PDF content extraction, layout detection, and spatial text reconstruction. It generates both monolingual and bilingual PDFs; the bilingual output sets original and translated content side by side for comparison, using coordinate-normalized layout reflow to fit the translated text.
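
Spatial text reconstruction can be pictured as grouping individual characters into lines by their coordinates. The sketch below is a simplified stand-in, not BabelDOC's algorithm: the `Char` type, the baseline-based grouping rule, and the `tolerance` value are all assumptions made for illustration. It assumes the usual PDF convention of an origin at the bottom-left with y increasing upward.

```python
from dataclasses import dataclass


@dataclass
class Char:
    text: str
    x: float     # left edge of the glyph
    y: float     # baseline position; larger y is higher on the page
    size: float  # font size in points


def group_into_lines(chars: list[Char], tolerance: float = 0.5) -> list[str]:
    """Group characters into text lines, top to bottom, left to right."""
    lines: list[list[Char]] = []
    # Visit characters from the top of the page downward.
    for ch in sorted(chars, key=lambda c: (-c.y, c.x)):
        # Same line if the baseline is close to that of the line's first character.
        if lines and abs(lines[-1][0].y - ch.y) <= tolerance * ch.size:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, order characters by horizontal position.
    return ["".join(c.text for c in sorted(line, key=lambda c: c.x)) for line in lines]


chars = [
    Char("H", 10, 700.0, 10), Char("i", 16, 700.2, 10),
    Char("O", 10, 686.0, 10), Char("k", 16, 686.0, 10),
]
lines = group_into_lines(chars)
```

A real implementation also has to deal with columns, superscripts, and rotated text, which a single baseline threshold does not cover.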

The system uses TOML-based configuration files to manage processing pipelines and supports offline asset management for deployment in air-gapped environments.

## Tags

### Artificial Intelligence & ML

- [Document Translators](https://awesome-repositories.com/f/artificial-intelligence-ml/artificial-intelligence-tooling/language-model-integrations/translation-services/document-translators.md) — Provides an LLM-powered system to translate PDF documents while preserving original layout and styling. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/))
- [Bilingual Scientific Translators](https://awesome-repositories.com/f/artificial-intelligence-ml/artificial-intelligence-tooling/language-model-integrations/translation-services/document-translators/bilingual-scientific-translators.md) — Parses scientific layouts and formulas to generate bilingual comparison documents for academic translation. ([source](https://cdn.jsdelivr.net/gh/funstory-ai/babeldoc@main/README.md))
- [Technical Terminology Preservation](https://awesome-repositories.com/f/artificial-intelligence-ml/artificial-intelligence-tooling/language-model-integrations/translation-services/document-translators/technical-terminology-preservation.md) — Enforces consistent technical terminology across translations using specialized glossary-driven source-to-target mappings. ([source](https://funstory-ai.github.io/BabelDOC/))
- [Document Structure Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/document-structure-analysis.md) — Extracts structured information and hierarchical relationships between text, figures, and graphics from unstructured PDFs. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/PDFParsing/PDFParsing/))
- [Translated Text Reflow](https://awesome-repositories.com/f/artificial-intelligence-ml/image-translation-pipelines/image-text-translators/translated-text-reflow.md) — Adjusts text placement and scaling to ensure translated content fits within the original document bounding boxes. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/Typesetting/Typesetting/))
- [LLM Translation Integrations](https://awesome-repositories.com/f/artificial-intelligence-ml/llm-translation-integrations.md) — Integrates large language models to perform high-quality, context-aware text translations of technical documents. ([source](https://cdn.jsdelivr.net/gh/funstory-ai/babeldoc@main/README.md))
- [Document Layout Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/document-layout-analysis.md) — Analyzes complex PDF documents to identify layout types like titles and captions and their spatial hierarchies. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/ParagraphFinding/ParagraphFinding/))
- [Multilingual Content Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-content-translation.md) — Converts documents between global languages using specialized logic to handle complex scripts and ligatures. ([source](https://funstory-ai.github.io/BabelDOC/supported_languages/))
- [Formula Classification](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-content-translation/formula-classification.md) — Distinguishes between text-based formulas requiring translation and numeric or symbolic formulas that must remain unchanged. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/StylesAndFormulas/StylesAndFormulas/))

### Content Management & Publishing

- [Multilingual PDF Generation](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/pdf-manipulation-utilities/pdf-editors/dynamic-pdf-generators/multilingual-pdf-generation.md) — Produces bilingual PDFs with original and translated content displayed side-by-side or in alternating pages. ([source](https://cdn.jsdelivr.net/gh/funstory-ai/babeldoc@main/README.md))
- [Structural Text Extractors](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/pdf-manipulation-utilities/pdf-editors/pdf-content-converters/structural-text-extractors.md) — Implements a utility to parse characters and mathematical formulas from PDFs to reconstruct document structures.
- [Hierarchical Document Models](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/hierarchical-document-models.md) — Constructs a hierarchical document model to preserve the spatial and semantic relationships between text and graphics.
- [Layout Preservation Systems](https://awesome-repositories.com/f/content-management-publishing/layout-preservation-systems.md) — Reinserts formulas and styled text into translated paragraphs using placeholders to maintain original document layout. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/ILTranslator/ILTranslator/))
- [Layout Reflow Engines](https://awesome-repositories.com/f/content-management-publishing/layout-reflow-engines.md) — Implements coordinate-normalized layout reflow to ensure translated text fits within original PDF bounding boxes.
- [Spatial Text Reconstruction](https://awesome-repositories.com/f/content-management-publishing/spatial-text-reconstruction.md) — Reconstructs paragraphs and lines from individual characters by analyzing their spatial coordinates and visual boundaries.
- [Translation Management](https://awesome-repositories.com/f/content-management-publishing/content-management-systems/translation-management.md) — Manages translation request volumes using rate limits and worker pools to prevent system overload. ([source](https://cdn.jsdelivr.net/gh/funstory-ai/babeldoc@main/README.md))
- [Content Preservation Placeholders](https://awesome-repositories.com/f/content-management-publishing/content-preservation-placeholders.md) — Uses token-based placeholders to protect non-translatable elements like mathematical formulas during the translation process.
- [CID-Font Management](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/rendering-visualization/document-rendering/multilingual-rendering/multilingual-font-support/cid-font-management.md) — Manages complex CID-font resources and character mapping for accurate multilingual rendering in PDF outputs.
- [Layout-Based Text Segmentation](https://awesome-repositories.com/f/content-management-publishing/layout-based-text-segmentation.md) — Identifies paragraph boundaries by analyzing visual cues like line width and table of contents entries. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/ParagraphFinding/ParagraphFinding/))
- [PDF Text Normalizers](https://awesome-repositories.com/f/content-management-publishing/pdf-text-normalizers.md) — Cleanses extracted PDF text by removing artificial line breaks and trailing spaces to improve machine translation quality. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/ParagraphFinding/ParagraphFinding/))
- [Text Style Analysis](https://awesome-repositories.com/f/content-management-publishing/text-style-analysis.md) — Extracts typographic properties like font names and sizes to group characters by visual appearance. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/StylesAndFormulas/StylesAndFormulas/))
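
The placeholder technique mentioned in the list above can be sketched as a protect/restore pair: formulas are swapped for tokens before translation and swapped back afterwards. This is an illustration only. It assumes formulas appear as `$...$` spans in plain text, which is a simplification (BabelDOC works from PDF characters), and the `{v0}` token format and function names are hypothetical.

```python
import re

# Assumption for this sketch: inline formulas are written as $...$ in the text.
FORMULA = re.compile(r"\$[^$]+\$")
PLACEHOLDER = re.compile(r"\{v(\d+)\}")


def protect(text: str) -> tuple[str, list[str]]:
    """Replace each formula with a numbered placeholder and return the originals."""
    saved: list[str] = []

    def stash(match: re.Match) -> str:
        saved.append(match.group(0))
        return f"{{v{len(saved) - 1}}}"

    return FORMULA.sub(stash, text), saved


def restore(text: str, saved: list[str]) -> str:
    """Put the saved formulas back in place of their placeholders."""
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], text)


masked, saved = protect("Energy $E=mc^2$ holds for mass $m$.")
# The masked text is what a translator would see; a translated version with the
# same placeholders can then be restored.
restored = restore("能量 {v0} 适用于质量 {v1}。", saved)
```

Because the placeholders are numbered, the translator is free to reorder them to suit the target language's word order.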

### Part of an Awesome List

- [Text Extraction](https://awesome-repositories.com/f/awesome-lists/media/pdf/text-extraction.md) — Retrieves raw text and structural metadata from PDF layers while preserving font styles and spatial boundaries. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/PDFParsing/PDFParsing/))
- [Bilingual PDF Generators](https://awesome-repositories.com/f/awesome-lists/productivity/pdf-generation/bilingual-pdf-generators.md) — Provides a system for creating side-by-side or alternating page comparisons of original and translated PDF content.
- [PDF Generation](https://awesome-repositories.com/f/awesome-lists/media/pdf-generation.md) — Creates optimized PDFs containing only translated text with comprehensive resource cleanup and compression. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/PDFCreation/PDFCreation/))

### Data & Databases

- [Layout Preservation](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/layout-preservation.md) — Extracts text from PDFs and reflows translated content back into original spatial bounding boxes.

### Scientific & Mathematical Computing

- [Formula Locators](https://awesome-repositories.com/f/scientific-mathematical-computing/formula-evaluators/symbolic-formula-parsers/formula-locators.md) — Provides a mechanism to identify the spatial coordinates and presence of mathematical formulas within PDF documents. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/StylesAndFormulas/StylesAndFormulas/))
- [Scientific Document Processing](https://awesome-repositories.com/f/scientific-mathematical-computing/research-analysis-workflows/scientific-document-processing.md) — Specializes in translating academic papers by identifying and preserving complex mathematical notation and layouts.
- [Formula Layout Preservation](https://awesome-repositories.com/f/scientific-mathematical-computing/formula-evaluators/symbolic-formula-parsers/formula-locators/formula-layout-preservation.md) — Preserves the original layout and vertical alignment of mathematical formulas relative to surrounding text. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/StylesAndFormulas/StylesAndFormulas/))

### Software Engineering & Architecture

- [Translation Term Mapping](https://awesome-repositories.com/f/software-engineering-architecture/glossaries/glossary-and-acronym-managers/translation-term-mapping.md) — Applies source-to-target terminology mappings from external glossaries to maintain technical consistency across documents.
- [Translation](https://awesome-repositories.com/f/software-engineering-architecture/glossaries/translation.md) — Integrates custom terminology lists and LLMs to ensure consistent automated translation of technical documents.
- [Concurrent Request Pools](https://awesome-repositories.com/f/software-engineering-architecture/object-pooling/task-pools/concurrent-request-pools.md) — Uses managed worker pools to execute multiple translation requests in parallel for higher throughput. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/ILTranslator/ILTranslator/))
- [Worker Pool Models](https://awesome-repositories.com/f/software-engineering-architecture/worker-pool-models.md) — Uses a background worker-pool execution model to process intensive document conversions without blocking the main application.
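
A worker pool combined with a rate limit, as described in the entries above, might look like the following sketch. It uses Python's standard `ThreadPoolExecutor`; the `RateLimiter` class, the `translate_all` function, and the `qps` and `workers` parameters are hypothetical names for this example and do not reflect BabelDOC's internals.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class RateLimiter:
    """Space out calls so that at most `qps` start per second."""

    def __init__(self, qps: float) -> None:
        self._interval = 1.0 / qps
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def translate_all(
    paragraphs: list[str],
    translate: Callable[[str], str],
    qps: float = 4,
    workers: int = 4,
) -> list[str]:
    """Translate paragraphs in parallel, returning results in input order."""
    limiter = RateLimiter(qps)

    def task(paragraph: str) -> str:
        limiter.wait()
        return translate(paragraph)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task, paragraphs))


# str.upper stands in for a real translation call.
results = translate_all(["a", "b", "c"], str.upper, qps=100, workers=2)
```

`pool.map` returns results in input order, so translated paragraphs can be matched back to their sources without extra bookkeeping.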

### Development Tools & Productivity

- [Asynchronous PDF Translation](https://awesome-repositories.com/f/development-tools-productivity/asynchronous-task-processing/asynchronous-pdf-translation.md) — Implements an asynchronous processing pipeline specifically for translating scientific PDF documents. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/AsyncTranslate/AsyncTranslate/))

### Graphics & Multimedia

- [Page Coordinate Mapping](https://awesome-repositories.com/f/graphics-multimedia/visualization-mapping/visualization-frameworks/coordinate-systems/page-coordinate-mapping.md) — Implements absolute page positioning and coordinate mapping to ensure translated elements align with original document boundaries. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/PDFParsing/PDFParsing/))

### User Interface & Experience

- [CID Font Management](https://awesome-repositories.com/f/user-interface-experience/font-configurations/font-overrides/pdf-font-optimizers/cid-font-management.md) — Manages CID fonts and resources to ensure correct character rendering across various languages in output PDFs. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/PDFCreation/PDFCreation/))
- [Technical Font Mapping](https://awesome-repositories.com/f/user-interface-experience/font-configurations/font-overrides/pdf-font-optimizers/technical-font-mapping.md) — Standardizes diverse PDF fonts into a consistent set with specialized handling for mathematical notations. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/StylesAndFormulas/StylesAndFormulas/))
- [PDF and HTML Content Extraction](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/pdf-and-html-content-extraction.md) — Extracts text from PDFs into structured document objects by grouping characters into logical paragraphs based on layout. ([source](https://funstory-ai.github.io/BabelDOC/ImplementationDetails/ParagraphFinding/ParagraphFinding/))