10 repositories
Comprehensive systems for automated and scalable document data extraction and structuring.
Distinguishing note: Provides a full platform for document workflows rather than single-purpose extraction or conversion tools.
Explore 10 awesome GitHub repositories matching data & databases · Document Processing Platforms.
Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale. The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized
A comprehensive service for converting, extracting, and structuring data from complex files through automated and scalable workflows.
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks
Provides a strongly-typed interface for executing document conversion, structured data extraction, and pipeline management.
Claude Quickstarts is a development framework and collection of reference implementations designed for building autonomous agents. It provides the foundational patterns necessary to orchestrate multi-agent workflows, enabling models to perform complex, multi-step tasks across software engineering, customer support, and computer-use domains. The platform distinguishes itself through specialized capabilities for desktop and browser automation, allowing agents to interact with graphical interfaces by capturing visual context and executing precise mouse and keyboard inputs. It includes robust inf
Provides integrated document processing capabilities for analyzing and visualizing diverse file formats.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Converts unstructured files into structured elements using configurable strategies like OCR and vision-language models.
Color-thief is a color quantization library and image palette extraction tool designed to identify the most prominent colors in visual media. It functions as a semantic color classifier and color space converter, and provides tools for extracting dominant colors and generating representative palettes from images, videos, and canvas elements. The project uses a WebAssembly color processor and background workers to perform high-performance pixel analysis. It implements a WCAG contrast analyzer to calculate color contrast ratios and determine accessible foreground text colors based on accessibility standards. The library covers a wide range of analysis capabilities, including semantic swatch extraction to classify colors as vibrant, muted, dark, or light, and real-time sampling from live video streams. It also includes a command-line interface for programmatic image analysis and color data export.
Allows stopping active color extraction processes mid-execution to free up system resources.
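The WCAG contrast analysis mentioned in the Color-thief description can be illustrated with a minimal sketch. This is not Color-thief's API: the function names below are hypothetical, and the code only applies the relative luminance and contrast ratio formulas published in WCAG 2.x to choose between black and white text.

```python
# Hypothetical helper names; this is not Color-thief's own API.
# Formulas follow the WCAG 2.x definitions of relative luminance
# and contrast ratio for 8-bit sRGB colors.

def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = channel_8bit / 255.0
    # 0.03928 is the threshold stated in WCAG 2.x.
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """Relative luminance of an (r, g, b) tuple with 0..255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(rgb_a, rgb_b):
    """Contrast ratio between two colors, from 1.0 up to 21.0."""
    la, lb = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)

def pick_text_color(background):
    """Return black or white, whichever contrasts more with the background."""
    black, white = (0, 0, 0), (255, 255, 255)
    if contrast_ratio(black, background) >= contrast_ratio(white, background):
        return black
    return white
```

A ratio of at least 4.5 is the WCAG AA threshold for normal-size text, so a caller could compare the result of `contrast_ratio` against that value before accepting a foreground color.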
Jackson is a Java data binding framework and multi-format data serializer used to translate data structures into native language objects. It functions as a JSON data binding library and a streaming parser that reads and writes data as discrete tokens to process large datasets with minimal memory. The project distinguishes itself through a bytecode serialization accelerator that replaces standard reflection with generated bytecode to increase data binding speed. It employs a module-based extensibility model to support a wide range of formats beyond JSON, including XML, YAML, CSV, TOML, and bin
Detects and maps sealed class hierarchies to their specific subtypes during data conversion.
100 Go Mistakes is a reference book and code review companion that catalogues frequent Go programming anti-patterns and provides corrected implementations for each one. It covers a wide range of common pitfalls, from range loop variable capture and interface nil handling to error wrapping and map iteration randomization, helping developers recognize and avoid these issues in their own code. The project distinguishes itself by offering a structured, example-driven approach to learning idiomatic Go. It covers core design decisions such as when to use pointer versus value receivers, how to apply
Covers conscious use of Go type embedding to promote behaviors without exposing hidden internals.
MessagePack-CSharp is a high-performance binary serializer for .NET that converts C# objects to and from the compact MessagePack format. It uses compile-time source generation to produce AOT-safe formatters and resolvers, eliminating runtime reflection and enabling ahead-of-time compilation scenarios. The serializer encodes object fields as integer indices instead of string keys, producing compact binary output with deterministic field ordering, and provides stack-allocated reader and writer structs for direct encoding and decoding of MessagePack primitives without heap allocations. The libra
Embeds .NET type names in binary for polymorphic deserialization without explicit type arguments.
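To show why the MessagePack-CSharp description says that encoding fields as integer indices yields compact output, the sketch below hand-encodes a tiny subset of the MessagePack wire format. It is an illustration only, not the library's API: `pack_small` is a hypothetical helper that supports just small non-negative integers, short strings, and short lists and dicts.

```python
# Hypothetical, hand-rolled encoder for a tiny MessagePack subset.
# It is not MessagePack-CSharp and handles only the "fix" formats.

def pack_small(value):
    """Encode ints 0..127, strings up to 31 bytes, and lists/dicts up to 15 items."""
    if isinstance(value, bool):
        raise TypeError("bool is outside this sketch's subset")
    if isinstance(value, int):
        if not 0 <= value <= 0x7F:
            raise ValueError("only positive fixint (0..127) is supported")
        return bytes([value])  # positive fixint: the value itself
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("only fixstr (up to 31 bytes) is supported")
        return bytes([0xA0 | len(raw)]) + raw  # fixstr header + bytes
    if isinstance(value, list):
        if len(value) > 15:
            raise ValueError("only fixarray (up to 15 items) is supported")
        return bytes([0x90 | len(value)]) + b"".join(pack_small(v) for v in value)
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("only fixmap (up to 15 pairs) is supported")
        out = bytes([0x80 | len(value)])  # fixmap header
        for key, item in value.items():
            out += pack_small(key) + pack_small(item)
        return out
    raise TypeError(f"unsupported type: {type(value).__name__}")

# Same record, once keyed by field name and once by field position.
keyed = pack_small({"id": 1, "name": "a"})
indexed = pack_small([1, "a"])
```

Here `keyed` carries the field names in every message, while `indexed` relies on position alone, which is the trade-off the description refers to: smaller payloads in exchange for a fixed field order.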
TypeDB is a strongly typed graph database and knowledge base management system. It functions as a multi-model data store that unifies relational, document, and graph structures in a single environment, and serves as an ACID-compliant database and declarative query engine. The system is distinguished by its use of n-ary hypergraph modeling and polymorphic type hierarchies. It uses a strongly typed schema to enforce structural rules and validate data integrity, enabling type-based polymorphic inference and role-based interface polymorphism to resolve complex relationships automatically during query execution. The platform covers a wide range of capabilities, including recursive relation computation via tabling, snapshot-isolation transactions, and declarative data retrieval. It also supports high availability through consensus-based cluster replication, role-based access control, and integration with AI agents for structured data retrieval. Administration is supported through a command-line interface, and the system provides tools for visualizing graph schemas and auditing administrative activity.
Supports the definition of polymorphic type hierarchies where specialized types inherit properties from supertypes.
imapsync is an IMAP mailbox synchronization tool and data migration utility designed to copy and synchronize email messages and folder structures between two IMAP servers. It functions as a migration manager for transferring bulk email accounts between different hosting providers, preserving folder hierarchies and message metadata. The tool is distinguished by its ability to automate the transfer of multiple mailboxes sequentially from delimited lists using administrative credentials or user-specific authentication. It supports advanced authentication methods including OAuth2 and XOAUTH2, and
Restores an account hierarchy by moving it from a source subfolder back to the root level.