Surya | Awesome Repository

Surya is a document processing platform that transforms unstructured files into structured, machine-readable data. It provides tools for text recognition, layout analysis, and reading order detection, and converts PDFs and images into formats such as JSON, HTML, or Markdown. The platform is built to handle complex document workflows, with capabilities for data extraction, document segmentation, and automated form completion.

The platform is built around a pipeline architecture in which users chain analysis tasks into versioned, reusable sequences. Batch processing supports high-volume operations, and schema management combined with confidence scoring gives granular control over data extraction. For enterprise requirements, containerized deployment allows on-premises execution, which keeps data private while maintaining consistent behavior across environments.
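
The idea of a versioned, reusable pipeline with confidence filtering can be illustrated with a minimal Python sketch. This is not Surya's API: the Pipeline class, the recognize_text stand-in, and the drop_low_confidence step are all hypothetical names invented for illustration, and the recognized lines are hard-coded sample data rather than real OCR output.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# A step takes a document record and returns an updated copy.
Step = Callable[[Dict], Dict]


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical pipeline: a named, versioned, immutable chain of steps."""
    name: str
    version: int
    steps: Tuple[Step, ...]

    def run(self, document: Dict) -> Dict:
        result = dict(document)
        for step in self.steps:
            result = step(result)
        return result


def recognize_text(doc: Dict) -> Dict:
    # Stand-in for a real recognition step; the lines below are sample data.
    lines = [
        {"text": "Invoice 42", "confidence": 0.97},
        {"text": "T0tal", "confidence": 0.41},
    ]
    return {**doc, "lines": lines}


def drop_low_confidence(threshold: float) -> Step:
    # Build a step that keeps only lines at or above the given confidence.
    def step(doc: Dict) -> Dict:
        kept = [line for line in doc["lines"] if line["confidence"] >= threshold]
        return {**doc, "lines": kept}
    return step


pipeline = Pipeline("invoice-ocr", 1, (recognize_text, drop_low_confidence(0.8)))
result = pipeline.run({"source": "invoice.pdf"})
```

Because the pipeline object is frozen, changing a step means creating a new Pipeline with a higher version number, which is one simple way to keep earlier sequences reproducible.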

Beyond core analysis, the system includes integrated management for document lifecycles, storage, and event-driven notifications via webhooks. It provides a strongly typed software development kit (SDK) for programmatic interaction, alongside monitoring tools that track system health and usage metrics. Security is maintained through API access controls, request throttling, and payload validation for event notifications.
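
Payload validation for webhook notifications is commonly done by checking a signature computed over the request body. The source does not say which scheme the platform uses, so the sketch below assumes HMAC-SHA256 with a shared secret; the function names are hypothetical, and only the Python standard library is used.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Return the hex HMAC-SHA256 of the payload (assumed signing scheme)."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def is_valid_webhook(secret: bytes, payload: bytes, signature: str) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected, signature)
```

A receiver would call is_valid_webhook with the raw request body and the signature sent alongside it, and reject the event when the check fails. The constant-time comparison avoids leaking how many leading characters of a guessed signature were correct.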

Features

Document Analysis - Performs text recognition, layout analysis, and reading order detection using typed clients and asynchronous requests.
Document Conversion - Transforms PDFs, images, and other files into structured formats like Markdown, HTML, or JSON for automated data systems.
Document and Unstructured Extraction - Transforms unstructured documents like PDFs and images into structured machine-readable formats for business pipelines.
Structured Data Extraction - Parses unstructured document content into predefined fields using centralized schemas for consistent machine-readable output.
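
As a rough illustration of extraction into predefined fields, the sketch below conforms a raw extraction result to a centrally defined schema. The schema layout, the INVOICE_SCHEMA name, and the conform function are assumptions made for this example and do not describe the platform's actual schema format.

```python
from typing import Any, Dict

# Hypothetical centralized schema: field name -> expected Python type.
INVOICE_SCHEMA: Dict[str, type] = {
    "invoice_number": str,
    "total": float,
    "currency": str,
}


def conform(raw: Dict[str, Any], schema: Dict[str, type]) -> Dict[str, Any]:
    """Keep only schema fields, coerce each to its declared type,
    and raise KeyError if a required field is missing."""
    record: Dict[str, Any] = {}
    for name, expected_type in schema.items():
        if name not in raw:
            raise KeyError(f"missing field: {name}")
        record[name] = expected_type(raw[name])
    return record
```

Fields that are not in the schema are dropped, so every consumer of the output sees the same set of keys regardless of what the upstream extraction step returned.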
