GOT-OCR2.0

GOT-OCR2.0 is an end-to-end optical character recognition system and document text extractor. It utilizes a unified transformer architecture to recognize and extract plain and formatted text from diverse images and documents.

The system features a multi-crop processing method that divides high-resolution or dense documents into smaller sections to maintain recognition detail. It also includes a renderer that transforms recognized text into HTML to preserve the original structure and layout of the document.
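The multi-crop idea can be illustrated with a minimal sketch that computes crop boxes for a large page. This is not the project's own implementation: the function name `tile_boxes`, the 1024-pixel tile size, and the 128-pixel overlap are assumptions chosen only for illustration.

```python
def tile_boxes(width, height, tile=1024, overlap=128):
    """Return (left, top, right, bottom) boxes that cover a width x height image.

    Hypothetical helper: tile size and overlap are illustrative defaults,
    not values taken from GOT-OCR2.0.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    stride = tile - overlap

    def starts(length):
        # A single crop suffices when the image fits inside one tile.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        # Pin the last crop to the far edge so no pixels are dropped.
        positions.append(length - tile)
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


if __name__ == "__main__":
    for box in tile_boxes(2000, 1000):
        print(box)
```

Each box could then be passed to an image library's crop function and recognized separately. The overlap means that text falling on a crop boundary appears whole in at least one neighboring crop, provided it is narrower than the overlap.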

The project provides a framework for fine-tuning pre-trained models on custom datasets for specialized domains. It further includes utilities for model performance evaluation and benchmarking using multi-GPU acceleration.

Features

End-to-End Architecture - Uses a unified transformer architecture to convert images directly into structured text sequences, without a multi-stage pipeline.

High-Resolution Document OCR - Captures high-detail text from large or dense documents by processing the image in smaller cropped sections.

Image Tiling - Divides high-resolution images into smaller overlapping tiles to maintain pixel density for fine-grained text recognition.

Document Text Extractors - Captures plain and formatted text from diverse document types and complex image layouts.

Multi-Crop Processing - Divides large or dense documents into smaller sections to maintain high recognition detail.

Text Extraction and OCR - Extracts plain and formatted text from diverse images and documents using a unified deep learning model.

Optical Character Recognition - Recognizes plain and formatted text from diverse document types using a unified deep learning model.

Layout Recovery - Converts scanned images of complex documents into HTML outputs that preserve the original layout and formatting.

Multi-Crop Processing - Captures high-detail text across large or dense documents by dividing complex images into smaller sections.

Visual-Textual Alignment - Maps image features and text tokens into a shared latent space to correlate visual structure with linguistic meaning.

GPU-Accelerated Inference - Implements multi-GPU acceleration to increase throughput during large-scale document processing.

Vision Model Fine-Tuning - Provides a framework for adjusting pre-trained vision transformers using specialized datasets to improve domain-specific vocabulary recognition.

OCR Model Fine-Tuners - Provides procedures for retraining OCR models on custom datasets to improve recognition accuracy for specialized domains.

HTML Document Renderers - Transforms recognized text and layout coordinates into structured HTML to preserve the original document formatting.

Data Processing - Optical character recognition model for document understanding.
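The HTML rendering feature listed above, which turns recognized text and layout coordinates into structured HTML, can be sketched as follows. The project's actual renderer is not described here, so the function name `render_html`, the `(text, left, top, width, height)` region format, and the absolute-positioning approach are all assumptions made for illustration.

```python
from html import escape


def render_html(regions, page_width, page_height):
    """Render recognized text regions as absolutely positioned HTML blocks.

    Hypothetical sketch: `regions` is an iterable of
    (text, left, top, width, height) tuples in pixels. This format is an
    assumption, not the output format of GOT-OCR2.0.
    """
    parts = [
        f'<div style="position:relative;width:{page_width}px;height:{page_height}px">'
    ]
    for text, left, top, width, height in regions:
        # Escape the text so characters such as "<" display literally.
        parts.append(
            f'<div style="position:absolute;left:{left}px;top:{top}px;'
            f'width:{width}px;height:{height}px">{escape(text)}</div>'
        )
    parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    page = render_html(
        [("Invoice #12", 40, 30, 300, 24), ("Total: a < b", 40, 80, 300, 24)],
        page_width=800,
        page_height=600,
    )
    print(page)
```

Absolute positioning keeps each block where it sat on the scanned page. A renderer that emits semantic elements such as headings and tables would be a different design with different trade-offs.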

Ucas-HaoranWei/GOT-OCR2.0
