Why is datalab-to/marker a recommended Document Processing Platforms repository?

A comprehensive service for converting, extracting, and structuring data from complex files through automated and scalable workflows.

Why is datalab-to/surya a recommended Document Processing Platforms repository?

Provides a strongly-typed interface for executing document conversion, structured data extraction, and pipeline management.

Why is anthropics/claude-quickstarts a recommended Document Processing Platforms repository?

Provides integrated document processing capabilities for analyzing and visualizing diverse file formats.

Why is Unstructured-IO/unstructured a recommended Document Processing Platforms repository?

Converts unstructured files into structured elements using configurable strategies like OCR and vision-language models.

Why is lokesh/color-thief a recommended Document Processing Platforms repository?

Allows stopping active color extraction processes mid-execution to free up system resources.

Why is FasterXML/jackson a recommended Document Processing Platforms repository?

Detects and maps sealed class hierarchies to their specific subtypes during data conversion.

Why is teivah/100-go-mistakes a recommended Document Processing Platforms repository?

Covers conscious use of Go type embedding to promote behaviors without exposing hidden internals.

Why is MessagePack-CSharp/MessagePack-CSharp a recommended Document Processing Platforms repository?

Embeds .NET type names in binary for polymorphic deserialization without explicit type arguments.

Why is typedb/typedb a recommended Document Processing Platforms repository?

Supports the definition of polymorphic type hierarchies where specialized types inherit properties from supertypes.

Why is imapsync/imapsync a recommended Document Processing Platforms repository?

Restores an account hierarchy by moving it from a source subfolder back to the root level.

10 repositories

Awesome GitHub Repositories: Document Processing Platforms

Comprehensive systems for automated and scalable document data extraction and structuring.

Distinguishing note: Provides a full platform for document workflows rather than single-purpose extraction or conversion tools.

Explore 10 awesome GitHub repositories matching data & databases · Document Processing Platforms.


datalab-to/marker
36,137 · View on GitHub
Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale. The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized…
A comprehensive service for converting, extracting, and structuring data from complex files through automated and scalable workflows.
Python
datalab-to/surya
20,889 · View on GitHub
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks…
Provides a strongly-typed interface for executing document conversion, structured data extraction, and pipeline management.
Python
anthropics/claude-quickstarts
17,085 · View on GitHub
Claude Quickstarts is a development framework and collection of reference implementations designed for building autonomous agents. It provides the foundational patterns necessary to orchestrate multi-agent workflows, enabling models to perform complex, multi-step tasks across software engineering, customer support, and computer-use domains. The platform distinguishes itself through specialized capabilities for desktop and browser automation, allowing agents to interact with graphical interfaces by capturing visual context and executing precise mouse and keyboard inputs. It includes robust inf…
Provides integrated document processing capabilities for analyzing and visualizing diverse file formats.
Python
Unstructured-IO/unstructured
14,019 · View on GitHub
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t…
Converts unstructured files into structured elements using configurable strategies like OCR and vision-language models.
HTML · data-pipelines · deep-learning · document-image-analysis
lokesh/color-thief
13,596 · View on GitHub
Color-thief is a color quantization library and image palette extraction tool designed to identify the most prominent colors in visual media. It works as a semantic color classifier and color space converter, providing tools to extract dominant colors and generate representative palettes from images, videos, and canvas elements. The project uses a WebAssembly color processor and background workers for high-performance pixel analysis. It implements a WCAG contrast analyzer to calculate color contrast ratios and determine accessible foreground text colors based on accessibility standards. The library covers a wide range of analysis capabilities, including semantic swatch extraction to classify colors as vibrant, muted, dark, or light, and real-time sampling from live video streams. It also includes a command-line interface for programmatic image analysis and color data export.
Allows stopping active color extraction processes mid-execution to free up system resources.
TypeScript
FasterXML/jackson
9,740 · View on GitHub
Jackson is a Java data binding framework and multi-format data serializer used to translate data structures into native language objects. It functions as a JSON data binding library and a streaming parser that reads and writes data as discrete tokens to process large datasets with minimal memory. The project distinguishes itself through a bytecode serialization accelerator that replaces standard reflection with generated bytecode to increase data binding speed. It employs a module-based extensibility model to support a wide range of formats beyond JSON, including XML, YAML, CSV, TOML, and bin…
Detects and maps sealed class hierarchies to their specific subtypes during data conversion.
hacktoberfest · jackson · java-json
teivah/100-go-mistakes
7,915 · View on GitHub
100 Go Mistakes is a reference book and code review companion that catalogues frequent Go programming anti-patterns and provides corrected implementations for each one. It covers a wide range of common pitfalls, from range loop variable capture and interface nil handling to error wrapping and map iteration randomization, helping developers recognize and avoid these issues in their own code. The project distinguishes itself by offering a structured, example-driven approach to learning idiomatic Go. It covers core design decisions such as when to use pointer versus value receivers, how to apply…
Covers conscious use of Go type embedding to promote behaviors without exposing hidden internals.
Go · book · chinese · documentation
MessagePack-CSharp/MessagePack-CSharp
6,607 · View on GitHub
MessagePack-CSharp is a high-performance binary serializer for .NET that converts C# objects to and from the compact MessagePack format. It uses compile-time source generation to produce AOT-safe formatters and resolvers, eliminating runtime reflection and enabling ahead-of-time compilation scenarios. The serializer encodes object fields as integer indices instead of string keys, producing compact binary output with deterministic field ordering, and provides stack-allocated reader and writer structs for direct encoding and decoding of MessagePack primitives without heap allocations. The libra…
Embeds .NET type names in binary for polymorphic deserialization without explicit type arguments.
C# · c-sharp · lz4 · messagepack
typedb/typedb
4,353 · View on GitHub
TypeDB is a strongly typed graph database and knowledge base management system. It works as a multi-model data store that unifies relational, document, and graph structures in a single environment, and as an ACID-compliant database and declarative query engine. The system is distinguished by its n-ary hypergraph modeling and polymorphic type hierarchies. It uses a strongly typed schema to enforce structural rules and validate data integrity, allowing type-based polymorphic inference and role-based interface polymorphism to resolve complex relationships automatically during query execution. The platform covers a wide range of capabilities, including recursive relation computation via tabling, snapshot-isolation transactions, and declarative data retrieval. It also supports high availability through consensus-based cluster replication, role-based access control, and integration with AI agents for structured data retrieval. Administration is supported through a command-line interface, and the system provides tools for visualizing graph schemas and auditing administrative activity.
Supports the definition of polymorphic type hierarchies where specialized types inherit properties from supertypes.
Rust · database · inference · knowledge-base
imapsync/imapsync
3,945 · View on GitHub
imapsync is an IMAP mailbox synchronization tool and data migration utility designed to copy and synchronize email messages and folder structures between two IMAP servers. It functions as a migration manager for transferring bulk email accounts between different hosting providers, preserving folder hierarchies and message metadata. The tool is distinguished by its ability to automate the transfer of multiple mailboxes sequentially from delimited lists using administrative credentials or user-specific authentication. It supports advanced authentication methods including OAuth2 and XOAUTH2, and…
Restores an account hierarchy by moving it from a source subfolder back to the root level.
Shell · emails · imap · imaps
