# datalab-to/chandra

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/datalab-to-chandra).**

4,833 stars · 546 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/datalab-to/chandra
- Homepage: https://www.datalab.to
- awesome-repositories: https://awesome-repositories.com/repository/datalab-to-chandra.md

## Topics

`ai` `ocr`

## Description

Chandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSON schemas, with citations back to source locations. The service supports form filling in PDF and image documents, document generation from Markdown, and extraction of tracked changes from Word files.
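To make the schema-driven extraction idea concrete, here is a minimal sketch of what a user-defined schema for invoice fields could look like, written as a JSON Schema style Python dict. The field names, the `missing_required` helper, and the sample result are all hypothetical illustrations; the service's actual request and response formats are not described in this listing.

```python
import json

# Hypothetical JSON Schema describing the fields to pull from an invoice.
# Field names are illustrative assumptions, not taken from the service docs.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "due_date": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}


def missing_required(schema: dict, extracted: dict) -> list[str]:
    """Return the required field names that are absent from an extracted result."""
    return [name for name in schema.get("required", []) if name not in extracted]


if __name__ == "__main__":
    # Made-up result for illustration only; not real service output.
    sample = {"invoice_number": "INV-001", "total_amount": 125.5}
    print(json.dumps(INVOICE_SCHEMA, indent=2))
    print(missing_required(INVOICE_SCHEMA, sample))
```

A check like this only confirms that required keys are present; full validation of types and formats would need a JSON Schema validator.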

The platform distinguishes itself with pipelines that chain multiple processing steps into versioned, reusable units, managed through draft, saved, and published states. A pipeline executes as a single request, with runtime parameter overrides and webhook callbacks for asynchronous completion. For batch workloads, multiple documents can be processed in a single request to improve throughput, and PDF segmentation splits combined or batch-scanned documents into logical sections. Security controls include API key management, data usage preferences, result auto-expiration, and authenticated webhook delivery with cryptographic signatures.
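Authenticated webhook delivery is commonly implemented by signing the request body with a shared secret. The sketch below shows one generic way a receiver might verify such a signature using HMAC-SHA256 from the Python standard library. The signing scheme, the hex encoding, and the function names are assumptions for illustration; this listing does not specify which algorithm or header the service actually uses.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute a hex-encoded HMAC-SHA256 signature over the raw request body.

    Assumption: the sender signs the exact bytes of the body with a shared secret.
    """
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Return True if the received signature matches the one we compute.

    compare_digest runs in constant time, which avoids leaking how many
    leading characters of the signature matched.
    """
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode(), received_signature.encode())


if __name__ == "__main__":
    secret = b"example-shared-secret"      # hypothetical secret
    body = b'{"status": "complete"}'       # hypothetical webhook payload
    signature = sign_payload(secret, body)
    print(verify_webhook(secret, body, signature))
    print(verify_webhook(secret, body + b" ", signature))
```

Verification should run against the raw bytes received, before any JSON parsing or re-serialization, since even whitespace changes alter the signature.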

Additional capabilities include a typed Python SDK, automatic request retry with exponential backoff, file collection management, API health checks, and request analytics monitoring for self-hosted deployments. The service can be deployed on-premises in a containerized setup with restricted network access, TLS termination, and authentication.
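Retry with exponential backoff can be sketched in a few lines. The example below is a generic illustration, not the SDK's implementation: the function names, default values, the choice of `ConnectionError` as the retryable error, and the use of random jitter are all assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base: float = 0.5, cap: float = 30.0) -> list[float]:
    """Upper bounds for each retry delay: base * 2**attempt, capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(max_retries)]


def call_with_retry(
    func: Callable[[], T],
    max_retries: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on ConnectionError with exponentially growing delays."""
    delays = backoff_delays(max_retries, base, cap)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            # Sleep a random duration up to the current bound ("full jitter")
            # so that many clients do not retry in lockstep.
            sleep(random.uniform(0, delays[attempt]))
    raise RuntimeError("unreachable")  # loop always returns or raises


if __name__ == "__main__":
    print(backoff_delays(4))
```

Passing `sleep` as a parameter keeps the helper easy to test, since a no-op function can replace `time.sleep`.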

## Tags

### Content Management & Publishing

- [Document Conversion](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-conversion.md) — Converts PDFs, images, Office files, and ebooks into structured HTML, Markdown, or JSON while preserving layout. ([source](https://documentation.datalab.to/docs/common/supportedfiletypes))
- [Document Generation from Markdown](https://awesome-repositories.com/f/content-management-publishing/content-management-systems/content-management-platforms/enterprise-specialized-systems/document-management-systems/pdf-form-filling/document-generation-from-markdown.md) — Generates Word documents from Markdown with tracked changes and rich text formatting.
- [Document Processing Pipelines](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/pdf-processing-engines/pdf-processing/document-processing-pipelines.md) — Chains multiple document processing steps into versioned, reusable pipelines that execute as single requests with webhook notifications.
- [Multi-Format Output Converters](https://awesome-repositories.com/f/content-management-publishing/multi-format-output-converters.md) — Converts a wide range of document formats into structured HTML, Markdown, or JSON while preserving layout.

### Artificial Intelligence & ML

- [Structured Document Extraction](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/structured-document-extraction.md) — Converts PDFs, images, and Office files into structured HTML, Markdown, or JSON while preserving layout. ([source](https://documentation.datalab.to/docs/recipes/conversion/conversion-api-overview))
- [Document AI Containers](https://awesome-repositories.com/f/artificial-intelligence-ml/self-hosted-ai-platforms/document-ai-containers.md) — Ships a self-hosted container for on-premises document conversion, extraction, and analytics.

### Data & Databases

- [Schema-Driven Extraction](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-processing/document-unstructured-extraction/schema-driven-extraction.md) — Extracts structured data from documents by applying user-defined JSON schemas and returning citations to source locations.
- [Field Value Extraction](https://awesome-repositories.com/f/data-databases/field-value-extraction.md) — Extracts specific data fields from invoices, contracts, or reports using a user-defined schema and returns them with source citations.

### Development Tools & Productivity

- [Pipeline Version Storages](https://awesome-repositories.com/f/development-tools-productivity/platform-versioning-tools/pipeline-task-versioning/pipeline-version-storages.md) — Manages pipeline configurations through draft, saved, and published versions with immutable snapshots.
- [Document Processing Pipelines](https://awesome-repositories.com/f/development-tools-productivity/platform-versioning-tools/pipeline-task-versioning/pipeline-version-storages/document-processing-pipelines.md) — Chains document processors into versioned pipelines with runtime overrides and webhook notifications.
- [Change Tracking](https://awesome-repositories.com/f/development-tools-productivity/change-tracking.md) — Extracts insertions, deletions, and comments from Word documents as structured markup. ([source](https://documentation.datalab.to/docs/recipes/extract-redlines-and-comments/track-changes-from-word-documents))
- [Revision Extraction](https://awesome-repositories.com/f/development-tools-productivity/change-tracking/revision-extraction.md) — Extracts tracked changes, insertions, deletions, and comments from Word documents as structured HTML or Markdown.
- [Draft-Save-Publish Lifecycle](https://awesome-repositories.com/f/development-tools-productivity/platform-versioning-tools/pipeline-task-versioning/pipeline-version-storages/draft-save-publish-lifecycle.md) — Manages pipeline configurations through draft, saved, and published states with immutable snapshots. ([source](https://documentation.datalab.to))
- [Python API Clients](https://awesome-repositories.com/f/development-tools-productivity/rest-apis/rest-api-clients/python-api-clients.md) — Provides a typed Python client library for simplified API calls, authentication, and response management. ([source](https://documentation.datalab.to/docs/welcome/api))
- [Webhook Notifications](https://awesome-repositories.com/f/development-tools-productivity/webhook-notifications.md) — Sends automatic HTTP POST notifications to user-configured endpoints when jobs complete. ([source](https://documentation.datalab.to/docs/common/limits))

### DevOps & Infrastructure

- [Self-Hosted Deployments](https://awesome-repositories.com/f/devops-infrastructure/self-hosted-infrastructure/document-processing/self-hosted-deployments.md) — Provides a containerized on-premises deployment option with TLS, authentication, and network restrictions.

### Security & Cryptography

- [Secure Web Service Deployment](https://awesome-repositories.com/f/security-cryptography/secure-web-service-deployment.md) — Secures self-hosted containers with network restrictions, TLS termination, and authentication. ([source](https://documentation.datalab.to/platform/security))

### Software Engineering & Architecture

- [Batch Document Processing](https://awesome-repositories.com/f/software-engineering-architecture/batch-document-processing.md) — Processes multiple documents in a single batch to improve throughput and reduce per-document overhead.
- [Pipeline Chaining Frameworks](https://awesome-repositories.com/f/software-engineering-architecture/pipeline-chaining-frameworks.md) — Chains multiple document processing steps into versioned, reusable pipelines executed as single requests.
- [Pipeline Execution with Overrides](https://awesome-repositories.com/f/software-engineering-architecture/default-configuration-values/execution-parameter-configurations/application-parameter-configurators/build-time-parameter-overrides/runtime-parameter-overrides/pipeline-execution-with-overrides.md) — Runs pipelines with runtime parameter overrides and sends webhook notifications on completion. ([source](https://documentation.datalab.to/docs/recipes/pipelines/pipeline-overview))

### Web Development

- [OCR Document Conversion](https://awesome-repositories.com/f/web-development/document-conversion-apis/ocr-document-conversion.md) — Converts images and PDFs into structured output using OCR while preserving layout and tables.
