Nougat

Nougat is a neural OCR system and transformer-based document parser that converts images of academic PDF pages into structured markdown, including mathematical formulas. It works as a PDF-to-markdown converter that uses deep learning to handle layout and formula recognition.

The project provides a training pipeline for generating datasets and training neural networks to recognize specific academic document styles. It includes utilities for training dataset generation, model training, and checkpoint management to support reproducible deployment.

The system supports academic document digitization and automated text extraction. It also includes tools for evaluating model accuracy, testing checkpoints, and logging training metrics to monitor convergence and stability.
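As a rough illustration of what accuracy evaluation for markdown predictions can look like, the sketch below computes a normalized character-level edit distance and a token-overlap F1 score between a predicted string and a reference. The function names are hypothetical and chosen for this example; this is not the project's own evaluation code, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    longest = max(len(pred), len(ref))
    return edit_distance(pred, ref) / longest if longest else 0.0


def token_f1(pred: str, ref: str) -> float:
    """F1 over whitespace-separated tokens, counting repeated tokens."""
    pred_counts = Counter(pred.split())
    ref_counts = Counter(ref.split())
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

A lower normalized edit distance and a higher F1 both indicate a prediction closer to the ground truth. Reporting the two together helps separate character-level noise from missing or extra words.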

Programmatic access to these capabilities is available via web service endpoints for document conversion, text prediction, and structured OCR extraction.
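A client for such an endpoint might look like the sketch below, which uploads a PDF as a multipart form and returns the response body. The URL, the form field name `file`, and both function names are assumptions made for illustration; check the running service for the actual address and parameters.

```python
import urllib.request
import uuid


def build_multipart(field_name: str, filename: str, payload: bytes,
                    content_type: str = "application/pdf"):
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(pdf_bytes: bytes,
                url: str = "http://127.0.0.1:8503/predict/") -> str:
    """POST a PDF to a conversion endpoint (URL and field name are assumed)."""
    body, header = build_multipart("file", "paper.pdf", pdf_bytes)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": header, "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a service running locally, reading a file's bytes and passing them to `convert_pdf` would return whatever the endpoint responds with. Building the multipart body by hand keeps the example free of third-party dependencies.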

Features

PDF to Markdown Converters - Transforms academic document images into structured markdown while preserving complex mathematical formulas and tables.

End-to-End Document Parsers - Implements an end-to-end neural architecture that maps document images directly to text without intermediate OCR.

Image-to-Text Transformers - Employs a transformer-based neural network to map document image pixels directly to structured markdown text.

Neural Network Training - Enables training neural networks on custom datasets to improve the accuracy of academic document recognition.

OCR Engines - Implements a transformer-based OCR engine for converting complex academic document images into machine-readable text.

Scholarly Document Digitization - Digitizes scholarly papers into structured markdown while preserving mathematical formulas and complex tables.

Vision-Based Document Parsers - Uses multimodal vision models to interpret academic document layouts and convert them into structured markdown.

Neural Text Extraction - Provides a neural network-based process to transform academic document images into machine-readable text.

Document Pair Generation - Includes a utility to pair PDF pages with HTML sources to create indexed training datasets in JSONL format.

Image-Text Pair Mappings - Uses JSONL-based mapping to pair document images with ground-truth text targets for efficient training.

Document Parsing Model Training - Provides tools for training and evaluating neural network models on custom academic datasets.

Model Checkpointing - Implements a system for saving and restoring neural network weights to ensure reproducible training and deployment.

Visual Encoders - Utilizes Swin-Transformer visual encoding to extract hierarchical features from document images.

Training Pipelines - Provides an automated workflow for managing the end-to-end training and evaluation of document parsing models.

Image-to-Tensor Conversions - Converts PDF document pages into image tensors to serve as primary input for the neural model.

Model Testing - Tests model checkpoints against datasets to evaluate prediction accuracy across different text modalities, such as plain text, mathematical expressions, and tables.

Model Evaluation - Calculates model performance by comparing predicted markdown against ground truth using edit distance and overlap scores such as BLEU, METEOR, precision, recall, and F1.

Data Preprocessing - Parses academic PDFs, handling LaTeX math and complex tables.
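The JSONL pairing described in the feature list can be sketched as follows: each line holds one image-text record, and a list of byte offsets lets a loader jump straight to any record without scanning the file. The field names `image` and `markdown` and all function names here are assumptions for illustration, not the project's confirmed schema or utilities.

```python
import json


def write_pairs(path, pairs):
    """Write (image_path, markdown) tuples as one JSON object per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for image_path, markdown in pairs:
            record = {"image": image_path, "markdown": markdown}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def build_offsets(path):
    """Return the byte offset at which each record (line) starts."""
    offsets = []
    with open(path, "rb") as handle:
        while True:
            position = handle.tell()
            if not handle.readline():
                break
            offsets.append(position)
    return offsets


def read_pair(path, offsets, index):
    """Seek directly to record `index` and decode only that line."""
    with open(path, "rb") as handle:
        handle.seek(offsets[index])
        return json.loads(handle.readline().decode("utf-8"))
```

Keeping the offsets separate from the data means a training loop can sample records in random order while reading only one line per example.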

facebookresearch/nougat