# datalab-to/surya

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/datalab-to-surya).**

19,291 stars · 1,326 forks · Python · gpl-3.0

## Links

- GitHub: https://github.com/datalab-to/surya
- Homepage: https://www.datalab.to
- awesome-repositories: https://awesome-repositories.com/repository/datalab-to-surya.md

## Description

Surya is a document processing platform that transforms unstructured files into structured, machine-readable data. It provides tools for text recognition, layout analysis, and reading order detection, and converts PDFs and images into formats such as JSON, HTML, or markdown. It also covers more complex document workflows, including data extraction, document segmentation, and automated form completion.

Its pipeline-based architecture lets users chain analysis tasks into versioned, reusable sequences. Batch processing supports high-volume operations, while schema management and confidence scoring give fine-grained control over data extraction. For enterprise use, containerized deployment allows on-premises execution, which keeps documents on private infrastructure and gives consistent behavior across environments.

Beyond core analysis, the system manages document lifecycles and storage and sends event-driven notifications via webhooks. A strongly-typed software development kit supports programmatic interaction, and monitoring tools track system health and usage metrics. Security relies on API access controls, request throttling, and payload validation for event notifications.

## Tags

### Artificial Intelligence & ML

- [Document Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/document-analysis.md) — Performs text recognition, layout analysis, and reading order detection using typed clients and asynchronous requests. ([source](https://documentation.datalab.to/platform/versioning))
- [Analysis SDKs](https://awesome-repositories.com/f/artificial-intelligence-ml/document-analysis/analysis-sdks.md) — Provides a typed interface for integrating advanced text recognition and document conversion into custom software.

### Content Management & Publishing

- [Document Conversion](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-conversion.md) — Transforms PDFs, images, and other files into structured formats like markdown, HTML, or JSON for automated data systems. ([source](https://documentation.datalab.to/llms.txt))
- [Document Automation Pipelines](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/document-automation-interfaces/document-automation-pipelines.md) — Chains multiple analysis tasks into versioned and reusable workflows to automate complex document transformation.
- [Form Automation](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/form-automation.md) — Injects structured data into native PDF fields or visual document overlays to generate completed forms. ([source](https://documentation.datalab.to/docs/recipes/form-filling/form-filling-api-overview))
- [Document Management Systems](https://awesome-repositories.com/f/content-management-publishing/content-management-systems/content-management-platforms/enterprise-specialized-systems/document-management-systems.md) — Manages document lifecycles through centralized storage, batch processing, and automated notifications.
- [Confidence Scoring](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/document-data-extraction/confidence-scoring.md) — Calculates and returns numerical reliability ratings for each extracted field to assess recognition accuracy. ([source](https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview))
- [Documentation Generators](https://awesome-repositories.com/f/content-management-publishing/documentation-knowledge-management/documentation-generators.md) — Creates structured DOCX files from markdown input while maintaining support for tracked changes and custom content tags. ([source](https://documentation.datalab.to/llms.txt))
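
The Confidence Scoring entry above describes a numerical reliability rating returned for each extracted field. A minimal sketch of how a caller might act on such scores is shown below. The response layout (`value` and `confidence` keys), the `split_by_confidence` helper, and the 0.8 threshold are all assumptions made for illustration, not the platform's documented format.

```python
from typing import Any


def split_by_confidence(
    fields: dict[str, dict[str, Any]], threshold: float = 0.8
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Separate extracted fields into accepted values and values needing review."""
    accepted: dict[str, Any] = {}
    review: dict[str, Any] = {}
    for name, field in fields.items():
        # A missing score is treated as zero confidence, so the field is reviewed.
        score = field.get("confidence", 0.0)
        target = accepted if score >= threshold else review
        target[name] = field.get("value")
    return accepted, review


# Hypothetical extraction result; the real field layout may differ.
extracted = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.97},
    "total": {"value": "1,250.00", "confidence": 0.62},
    "due_date": {"value": "2024-07-01"},  # no score reported
}

accepted, review = split_by_confidence(extracted)
```

Routing low-scoring fields to a review queue, rather than discarding them, keeps the extracted value available for a human to confirm or correct.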

### Data & Databases

- [Document and Unstructured Extraction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction.md) — Transforms unstructured documents like PDFs and images into structured machine-readable formats for business pipelines.
- [Structured Data Extraction](https://awesome-repositories.com/f/data-databases/structured-data-extraction.md) — Parses unstructured document content into predefined fields using centralized schemas for consistent machine-readable output.
- [Data Pipeline Orchestration](https://awesome-repositories.com/f/data-databases/data-pipeline-orchestration.md) — Chains multiple document analysis steps into versioned and reusable sequences to automate complex data extraction workflows.
- [Document Processing Pipelines](https://awesome-repositories.com/f/data-databases/document-processing-pipelines.md) — Chains document analysis, text recognition, and layout segmentation tasks into versioned, automated, and reusable workflows.
- [Batch Processing](https://awesome-repositories.com/f/data-databases/batch-processing.md) — Executes analysis tasks across large document collections simultaneously to improve throughput for high-volume workloads. ([source](https://documentation.datalab.to/llms.txt))
- [Data Schema Management](https://awesome-repositories.com/f/data-databases/data-schema-management.md) — Defines and stores data structures centrally to reference them by identifier across multiple extraction requests. ([source](https://documentation.datalab.to/docs/recipes/structured-extraction/api-overview))
- [Document Processing Platforms](https://awesome-repositories.com/f/data-databases/document-processing-platforms.md) — Provides a strongly-typed interface for executing document conversion, structured data extraction, and pipeline management. ([source](https://documentation.datalab.to/docs/welcome/quickstart))
- [File Storage Management](https://awesome-repositories.com/f/data-databases/file-storage-management.md) — Handles file lifecycle operations including uploading, listing, metadata retrieval, and deletion in remote storage. ([source](https://documentation.datalab.to/docs/welcome/sdk))

### DevOps & Infrastructure

- [Pipeline Orchestration](https://awesome-repositories.com/f/devops-infrastructure/pipeline-orchestration.md) — Connects multiple document processing tasks into versioned and reusable pipelines for complex extraction. ([source](https://documentation.datalab.to/docs/welcome/quickstart))
- [Self-Hosted Deployment Platforms](https://awesome-repositories.com/f/devops-infrastructure/self-hosted-deployment-platforms.md) — Supports containerized deployment on private infrastructure to provide full control over document processing environments. ([source](https://documentation.datalab.to/docs/on-prem/overview))
- [Document Analysis Services](https://awesome-repositories.com/f/devops-infrastructure/on-premise-deployment/document-analysis-services.md) — Deploys containerized services to perform local text recognition and layout analysis with strict data privacy.
- [Service Containerization](https://awesome-repositories.com/f/devops-infrastructure/service-containerization.md) — Packages document processing logic into isolated containers for consistent local execution and secure on-premises deployment.

### Business & Productivity Software

- [Form Automation](https://awesome-repositories.com/f/business-productivity-software/document-digitization-tools/form-automation.md) — Programmatically injects structured data into PDF fields and visual overlays to streamline reporting.

### Security & Cryptography

- [API Access Control](https://awesome-repositories.com/f/security-cryptography/api-access-control.md) — Limits usage and spending by assigning unique keys to different environments and rotating them. ([source](https://documentation.datalab.to/platform/security))
- [Webhook Security](https://awesome-repositories.com/f/security-cryptography/webhook-security.md) — Includes a configurable secret in event payloads to allow receiving servers to validate incoming notifications. ([source](https://documentation.datalab.to/platform/security))
- [Container Isolation](https://awesome-repositories.com/f/security-cryptography/security/infrastructure-and-hardware/infrastructure-system-hardening/deployment-security-hardening/container-isolation.md) — Isolates document processing containers behind reverse proxies and firewalls to restrict network access. ([source](https://documentation.datalab.to/platform/security))
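
The Webhook Security entry above says a configurable secret is included in event payloads so that receivers can validate notifications. One way a receiving server might perform that check is sketched below. The JSON payload shape, the `webhook_secret` field name, and the `is_trusted_webhook` helper are assumptions for illustration; the actual field name should be taken from the platform's security documentation.

```python
import hmac
import json


def is_trusted_webhook(
    raw_body: bytes, expected_secret: str, secret_field: str = "webhook_secret"
) -> bool:
    """Return True only if the payload carries the secret we configured."""
    try:
        payload = json.loads(raw_body)
    except ValueError:
        # Covers malformed JSON and undecodable bytes.
        return False
    if not isinstance(payload, dict):
        return False
    received = payload.get(secret_field)
    if not isinstance(received, str):
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(
        received.encode("utf-8"), expected_secret.encode("utf-8")
    )


# Hypothetical notification body; the field names are assumed.
body = b'{"request_id": "abc123", "status": "complete", "webhook_secret": "s3cret"}'
trusted = is_trusted_webhook(body, "s3cret")
```

Requests that fail this check should be rejected before any of the payload is processed.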

### User Interface & Experience

- [Form and Input Management](https://awesome-repositories.com/f/user-interface-experience/form-and-input-management.md) — Programmatically populates digital forms and injects structured data into document overlays for automated reporting.

### Development Tools & Productivity

- [Webhook Notifications](https://awesome-repositories.com/f/development-tools-productivity/webhook-notifications.md) — Triggers automated callbacks to specified endpoints upon task completion to eliminate manual polling. ([source](https://documentation.datalab.to/docs/common/limits))
- [Revision Extraction](https://awesome-repositories.com/f/development-tools-productivity/change-tracking/revision-extraction.md) — Identifies and extracts revision history, redlines, and tracked changes from word processing files into structured output. ([source](https://documentation.datalab.to/docs/welcome/sdk/cli))

### Software Engineering & Architecture

- [Webhook Event Notifications](https://awesome-repositories.com/f/software-engineering-architecture/integration-extensibility/programmatic-interfaces/webhook-event-notifications.md) — Triggers automated callbacks to external endpoints upon task completion to eliminate manual polling.
- [Request Throttling](https://awesome-repositories.com/f/software-engineering-architecture/request-throttling.md) — Limits the size and volume of document processing requests to ensure system stability during high-traffic periods. ([source](https://documentation.datalab.to/platform/billing))
- [Type-Safe Development](https://awesome-repositories.com/f/software-engineering-architecture/type-safe-development.md) — Provides a strongly-typed software development kit to simplify programmatic interaction and ensure reliable data structures.
- [Workflow Versioning](https://awesome-repositories.com/f/software-engineering-architecture/workflow-versioning.md) — Maintains fixed snapshots of processing configurations to ensure production stability during iterative pipeline development.
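
The Workflow Versioning entry above describes fixed snapshots of processing configurations that stay stable while a pipeline is still being edited. The idea can be sketched with a small in-memory registry. The `PipelineRegistry` class and the step names are hypothetical and only illustrate the snapshot behavior, not the platform's API.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineRegistry:
    """Keeps an editable draft plus immutable, numbered snapshots of it."""

    draft: list[str] = field(default_factory=list)
    _versions: list[tuple[str, ...]] = field(default_factory=list)

    def publish(self) -> int:
        # Copy the draft into a tuple so later edits cannot alter the snapshot.
        self._versions.append(tuple(self.draft))
        return len(self._versions)  # version numbers start at 1

    def get(self, version: int) -> tuple[str, ...]:
        if not 1 <= version <= len(self._versions):
            raise KeyError(f"unknown pipeline version: {version}")
        return self._versions[version - 1]


# Hypothetical step names, used only for illustration.
registry = PipelineRegistry(draft=["convert", "segment"])
v1 = registry.publish()

# Continue iterating on the draft; version 1 stays exactly as published.
registry.draft.append("extract")
v2 = registry.publish()
```

Production callers would pin a specific version number, so edits to the draft only take effect once a new version is published and explicitly adopted.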

### System Administration & Monitoring

- [Performance Monitoring](https://awesome-repositories.com/f/system-administration-monitoring/performance-monitoring.md) — Tracks request volumes, processing latency, and system health to provide visibility into operational status. ([source](https://documentation.datalab.to/docs/on-prem/api))
- [Usage Analytics](https://awesome-repositories.com/f/system-administration-monitoring/usage-analytics.md) — Queries historical request volumes and success rates to inform infrastructure capacity planning. ([source](https://documentation.datalab.to/docs/on-prem/usage-analytics))

### Graphics & Multimedia

- [Document Segmentation](https://awesome-repositories.com/f/graphics-multimedia/media-processing-analysis/media-manipulation/media-processing-workflows/computer-vision-pipelines/document-segmentation.md) — Identifies and isolates distinct sections within documents to improve data extraction accuracy. ([source](https://documentation.datalab.to/llms.txt))
