Dots.ocr | Awesome Repository

dots.ocr is a suite of software utilities for document layout analysis, multilingual optical character recognition, and scene text digitization. It functions as an engine for extracting digital text and structured layout data from images and PDFs across various human scripts.

The project includes a specialized transformer for converting charts, diagrams, and chemical formulas from raster images into scalable vector graphics. It also provides a pipeline to transform extracted text and structural layout from documents and web screenshots into formatted Markdown files.

The system identifies the bounding boxes and categories of layout elements and produces structured JSON representations of them. It also includes tools for scene text detection in natural images and an evaluation framework that measures text and table extraction accuracy against ground truth data.
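To make the idea of measuring text extraction accuracy against ground truth concrete, here is a minimal sketch of one widely used OCR metric, normalized edit distance. The source does not say which metrics the dots.ocr evaluation framework uses, so the choice of metric and the function names `edit_distance` and `normalized_edit_distance` are assumptions made for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(predicted: str, truth: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 0.0  # two empty strings are treated as a perfect match
    return edit_distance(predicted, truth) / longest
```

A lower score means the extracted text is closer to the ground truth. Dividing by the longer string keeps the score between 0.0 and 1.0, which makes pages of different lengths comparable.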

Features

Document Analysis Tools - Detects bounding boxes and layout categories and outputs the document structure as JSON.
Bounding Box Detection - Locates page elements and distinguishes body text from headers and footers.
Multilingual OCR Systems - Extracts digital text and structured layout data from images and PDFs across various human scripts.
Document Layout Analysis - Identifies bounding boxes and categories for layout elements within images and PDF files.
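
The features above describe layout elements as bounding boxes with categories, serialized as JSON. The sketch below shows how a consumer might use such output to drop headers and footers and render the remaining text as Markdown. The schema is hypothetical: the field names `bbox`, `category`, and `text`, the category labels, and the `[x1, y1, x2, y2]` box layout are assumptions for illustration, not the confirmed dots.ocr output format, and `layout_to_markdown` is not a function from the project.

```python
import json

# Hypothetical layout output; the schema is assumed, not taken from dots.ocr.
SAMPLE = """
[
  {"bbox": [50, 700, 550, 730], "category": "Page-footer", "text": "Page 1"},
  {"bbox": [50, 40, 550, 80], "category": "Title", "text": "Quarterly Report"},
  {"bbox": [50, 100, 550, 300], "category": "Text",
   "text": "Revenue grew in the third quarter."},
  {"bbox": [50, 10, 550, 30], "category": "Page-header", "text": "ACME Corp"}
]
"""

# Categories treated as page furniture rather than body content (assumed labels).
SKIPPED = {"Page-header", "Page-footer"}


def layout_to_markdown(layout_json: str) -> str:
    """Render body elements as Markdown, ordered top to bottom, left to right."""
    elements = json.loads(layout_json)
    body = [e for e in elements if e["category"] not in SKIPPED]
    # Sort by the top edge (y1), then the left edge (x1), of each box.
    body.sort(key=lambda e: (e["bbox"][1], e["bbox"][0]))
    blocks = []
    for element in body:
        prefix = "# " if element["category"] == "Title" else ""
        blocks.append(prefix + element["text"])
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(layout_to_markdown(SAMPLE))
```

Sorting by box position is a simplification that only suits single-column pages; multi-column layouts need a real reading-order step.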
