6 repositories
Tools for parsing, cleaning, and structuring unstructured data formats for downstream analysis or model consumption.
Distinguishing note: Focuses on the transformation of unstructured documents into structured nodes, distinct from general database management.
6 GitHub repositories in Data & Databases · Document Processing Pipelines.
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
Converting complex documents like PDFs, tables, and charts into clean, structured formats that are ready for analysis and model consumption.
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks
Chains document analysis, text recognition, and layout segmentation tasks into versioned, automated, and reusable workflows.
JSDoc is a JavaScript API documentation generator that parses comments in source code to produce structured documentation files for a project interface. It functions as a source code documentation tool that extracts metadata from code comments to automate the creation of technical API references. The system operates as a template-based documentation engine, supporting external templates to customize the visual presentation and layout of the output. It also serves as a Markdown documentation exporter, transforming extracted documentation into Markdown files for use on alternative publishing pl
Implements a pluggable pipeline allowing external plugins to intercept and modify documentation data before output.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Partitions, enriches, and transforms unstructured documents into structured formats for AI and retrieval-augmented generation workflows.
Vespa is a distributed search engine, vector database, and machine learning ranking engine. It serves as an AI search platform designed to handle large-scale document indexing and complex query processing across a cluster of nodes, combining keyword retrieval with high-dimensional embedding storage for semantic similarity search. The platform distinguishes itself by integrating machine learning models directly into the search pipeline to perform real-time inference and ranking. It converts these models into ranking expressions to score and order results based on relevance, while providing a s
Routes raw data through a series of chainable processors to transform and cleanse documents before indexing.
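As a rough illustration of what "chainable processors" means here, the following is a minimal sketch in plain Python. It is not Vespa's document-processing API; the processor names (`strip_whitespace`, `drop_empty_fields`) and the `run_chain` helper are hypothetical and exist only to show how a document might pass through an ordered series of transform and cleanse steps before it is handed to an indexer.

```python
from typing import Callable, Dict, List

# A document is modeled as a plain dict; a processor takes one and returns one.
Document = Dict[str, object]
Processor = Callable[[Document], Document]


def strip_whitespace(doc: Document) -> Document:
    """Trim leading/trailing whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def drop_empty_fields(doc: Document) -> Document:
    """Remove fields whose value is an empty string or None."""
    return {k: v for k, v in doc.items() if v not in ("", None)}


def run_chain(doc: Document, processors: List[Processor]) -> Document:
    """Apply each processor in order; the output of one feeds the next."""
    for process in processors:
        doc = process(doc)
    return doc


# Hypothetical raw input, cleaned before it would be indexed.
raw = {"title": "  Hello World ", "body": "", "lang": None, "id": 7}
clean = run_chain(raw, [strip_whitespace, drop_empty_fields])
```

Order matters in a chain like this: trimming first means a field containing only spaces becomes empty and is then dropped by the second step.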
This project is a change data capture (CDC) system and synchronization layer that moves data from MySQL databases into Elasticsearch indexes. It works as a relational-to-document mapper, turning database tables into searchable documents to enable real-time data integration and full-text search. The synchronizer stands out by supporting denormalization of relational data, which turns one-to-many database joins into parent-child document structures. It also supports aggregating partitioned tables, using regular expressions to group multiple database tables into a single search index. The system covers full data mapping and transformation, including field type conversion, schema mapping, and filtering of synchronized fields. It uses a pipeline-based processing model to decode and merge fields, with snapshot-based initial loading for baselines and binary log streaming for real-time updates.
Passes data through processing nodes to decode JSON or merge fields before the final indexing step.
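Two ideas from the description above, regex-based grouping of partitioned tables and one-to-many denormalization, can be sketched in a few lines of Python. This is an illustrative sketch only, not the project's configuration format or code: the table names, the `orders_<number>` naming scheme, and the helpers `index_for` and `denormalize` are all assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical routing rules: every table named orders_<digits> feeds one index.
TABLE_TO_INDEX = [(re.compile(r"orders_\d+"), "orders")]


def index_for(table: str) -> Optional[str]:
    """Return the target index for a table name, or None if no rule matches."""
    for pattern, index in TABLE_TO_INDEX:
        if pattern.fullmatch(table):
            return index
    return None


def denormalize(
    parents: List[Dict], children: List[Dict], fk: str, child_field: str
) -> List[Dict]:
    """Embed child rows under their parent row, keyed by the parent's 'id'."""
    grouped: Dict[object, List[Dict]] = defaultdict(list)
    for child in children:
        # Drop the foreign key from the embedded copy; the nesting implies it.
        grouped[child[fk]].append({k: v for k, v in child.items() if k != fk})
    return [{**p, child_field: grouped.get(p["id"], [])} for p in parents]


orders = [{"id": 1, "customer": "acme"}, {"id": 2, "customer": "globex"}]
items = [
    {"id": 10, "order_id": 1, "sku": "x"},
    {"id": 11, "order_id": 1, "sku": "y"},
]
docs = denormalize(orders, items, fk="order_id", child_field="items")
```

Here each parent row becomes one document with its child rows nested in a list, and a parent with no children gets an empty list rather than being omitted.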