10 repositories
Comprehensive systems for automated and scalable document data extraction and structuring.
Distinguishing note: Provides a full platform for document workflows rather than single-purpose extraction or conversion tools.
Explore 10 GitHub repositories matching data & databases · Document Processing Platforms.
Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale. The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized
A comprehensive service for converting, extracting, and structuring data from complex files through automated and scalable workflows.
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks
Provides a strongly-typed interface for executing document conversion, structured data extraction, and pipeline management.
Claude Quickstarts is a development framework and collection of reference implementations designed for building autonomous agents. It provides the foundational patterns necessary to orchestrate multi-agent workflows, enabling models to perform complex, multi-step tasks across software engineering, customer support, and computer-use domains. The platform distinguishes itself through specialized capabilities for desktop and browser automation, allowing agents to interact with graphical interfaces by capturing visual context and executing precise mouse and keyboard inputs. It includes robust inf
Provides integrated document processing capabilities for analyzing and visualizing diverse file formats.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Converts unstructured files into structured elements using configurable strategies like OCR and vision-language models.
Color-thief is a color quantization library and color palette extractor for images, designed to identify the most prominent colors in visual media. It functions as a semantic color classifier and color space converter, providing tools to extract dominant colors and generate representative palettes from images, videos, and canvas elements. The project uses a WebAssembly color processor and background workers to perform high-performance pixel analysis. It implements a WCAG contrast analyzer to calculate color contrast ratios and determine accessible foreground text colors based on accessibility standards. The library covers a wide range of analysis capabilities, including semantic swatch extraction for categorizing colors as vibrant, muted, dark, or light, and real-time sampling from live video streams. It also includes a command-line interface for programmatic image analysis and exporting color data.
Allows stopping active color extraction processes mid-execution to free up system resources.
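The WCAG contrast calculation mentioned in the Color-thief entry can be illustrated with a minimal sketch. It follows the relative luminance and contrast ratio formulas published in WCAG 2.x; the function names (`relative_luminance`, `contrast_ratio`, `pick_text_color`) are hypothetical and are not taken from Color-thief's own API, which may work differently.

```python
# Sketch of the WCAG 2.x contrast ratio, not Color-thief's implementation.
# All function names here are illustrative assumptions.

def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to its linear-light value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance as defined by WCAG 2.x (0.0 = black, 1.0 = white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(a: tuple[int, int, int], b: tuple[int, int, int]) -> float:
    """Contrast ratio between two colors, ranging from 1.0 to 21.0."""
    la, lb = relative_luminance(a), relative_luminance(b)
    lighter, darker = max(la, lb), min(la, lb)
    return (lighter + 0.05) / (darker + 0.05)


def pick_text_color(background: tuple[int, int, int]) -> tuple[int, int, int]:
    """Choose black or white text, whichever contrasts more with the background."""
    black, white = (0, 0, 0), (255, 255, 255)
    if contrast_ratio(black, background) >= contrast_ratio(white, background):
        return black
    return white
```

WCAG level AA asks for a ratio of at least 4.5:1 for normal-size text, so a caller could compare the result of `contrast_ratio` against that threshold before accepting a foreground color.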
Jackson is a Java data binding framework and multi-format data serializer used to translate data structures into native language objects. It functions as a JSON data binding library and a streaming parser that reads and writes data as discrete tokens to process large datasets with minimal memory. The project distinguishes itself through a bytecode serialization accelerator that replaces standard reflection with generated bytecode to increase data binding speed. It employs a module-based extensibility model to support a wide range of formats beyond JSON, including XML, YAML, CSV, TOML, and bin
Detects and maps sealed class hierarchies to their specific subtypes during data conversion.
100 Go Mistakes is a reference book and code review companion that catalogues frequent Go programming anti-patterns and provides corrected implementations for each one. It covers a wide range of common pitfalls, from range loop variable capture and interface nil handling to error wrapping and map iteration randomization, helping developers recognize and avoid these issues in their own code. The project distinguishes itself by offering a structured, example-driven approach to learning idiomatic Go. It covers core design decisions such as when to use pointer versus value receivers, how to apply
Covers conscious use of Go type embedding to promote behaviors without exposing hidden internals.
MessagePack-CSharp is a high-performance binary serializer for .NET that converts C# objects to and from the compact MessagePack format. It uses compile-time source generation to produce AOT-safe formatters and resolvers, eliminating runtime reflection and enabling ahead-of-time compilation scenarios. The serializer encodes object fields as integer indices instead of string keys, producing compact binary output with deterministic field ordering, and provides stack-allocated reader and writer structs for direct encoding and decoding of MessagePack primitives without heap allocations. The libra
Embeds .NET type names in binary for polymorphic deserialization without explicit type arguments.
TypeDB is a graph database and strongly typed knowledge management system (knowledge graph). It serves as a multi-model data store that unifies relational, document, and graph structures in a single environment, functioning as both an ACID-compliant database and a declarative query engine. The system distinguishes itself through its use of n-ary hypergraph modeling and polymorphic type hierarchies. It uses a strongly typed schema to enforce structural rules and validate data integrity, enabling type-based polymorphic inference and role-based interface polymorphism to automatically resolve complex relationships during query execution. The platform covers a wide range of capabilities, including computing recursive relationships through tabling, snapshot-isolation transactions, and declarative data retrieval. It also supports high availability through consensus-based cluster replication, role-based access control, and integration with AI agents for structured data retrieval. Management is supported through a command-line interface, and the system provides tools for visualizing graph schemas and auditing administrative activity.
Supports the definition of polymorphic type hierarchies where specialized types inherit properties from supertypes.
imapsync is an IMAP mailbox synchronization tool and data migration utility designed to copy and synchronize email messages and folder structures between two IMAP servers. It functions as a migration manager for transferring bulk email accounts between different hosting providers, preserving folder hierarchies and message metadata. The tool is distinguished by its ability to automate the transfer of multiple mailboxes sequentially from delimited lists using administrative credentials or user-specific authentication. It supports advanced authentication methods including OAuth2 and XOAUTH2, and
Restores an account hierarchy by moving it from a source subfolder back to the root level.