6 repositories
Tools for parsing, cleaning, and structuring unstructured data formats for downstream analysis or model consumption.
Distinguishing note: Focuses on the transformation of unstructured documents into structured nodes, distinct from general database management.
Explore 6 GitHub repositories matching Data & Databases · Document Processing Pipelines.
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
Converting complex documents like PDFs, tables, and charts into clean, structured formats that are ready for analysis and model consumption.
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks
Chains document analysis, text recognition, and layout segmentation tasks into versioned, automated, and reusable workflows.
JSDoc is a JavaScript API documentation generator that parses comments in source code to produce structured documentation files for a project interface. It functions as a source code documentation tool that extracts metadata from code comments to automate the creation of technical API references. The system operates as a template-based documentation engine, supporting external templates to customize the visual presentation and layout of the output. It also serves as a Markdown documentation exporter, transforming extracted documentation into Markdown files for use on alternative publishing pl
Implements a pluggable pipeline allowing external plugins to intercept and modify documentation data before output.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Partitions, enriches, and transforms unstructured documents into structured formats for AI and retrieval-augmented generation workflows.
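As a rough illustration of what partitioning a document into structured elements can look like, the sketch below splits plain text on blank lines and labels each block with simple heuristics. It is not Unstructured's API or algorithm: `Element`, `partition_text`, and the title rule are hypothetical names and assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """A structured node produced from raw text (hypothetical type)."""
    kind: str  # "Title", "ListItem", or "NarrativeText"
    text: str


def partition_text(raw: str) -> List[Element]:
    """Split raw text on blank lines and label each block heuristically."""
    elements = []
    for block in raw.split("\n\n"):
        text = " ".join(block.split())  # collapse internal whitespace
        if not text:
            continue  # skip empty blocks
        if text.startswith(("- ", "* ")):
            kind = "ListItem"
            text = text[2:]  # drop the bullet marker
        elif len(text) < 60 and not text.endswith((".", "?", "!")):
            # Assumption: short blocks without end punctuation are titles.
            kind = "Title"
        else:
            kind = "NarrativeText"
        elements.append(Element(kind, text))
    return elements


sample = (
    "Quarterly Report\n\n"
    "Revenue grew in the third quarter.\n\n"
    "- Hire two engineers"
)
elements = partition_text(sample)
for element in elements:
    print(element.kind, "|", element.text)
```

Real partitioners typically combine layout and model-based signals; the fixed length threshold here only stands in for that step.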
Vespa is a distributed search engine, vector database, and machine learning ranking engine. It serves as an AI search platform designed to handle large-scale document indexing and complex query processing across a cluster of nodes, combining keyword retrieval with high-dimensional embedding storage for semantic similarity search. The platform distinguishes itself by integrating machine learning models directly into the search pipeline to perform real-time inference and ranking. It converts these models into ranking expressions to score and order results based on relevance, while providing a s
Routes raw data through a series of chainable processors to transform and cleanse documents before indexing.
This project is a change data capture system and synchronization layer that moves data from MySQL databases into Elasticsearch indices. It functions as a relational-to-document mapper, transforming database tables into searchable documents to enable real-time data integration and full-text search. The synchronizer differentiates itself by supporting relational data denormalization, which transforms one-to-many database joins into parent-child document structures. It also allows for partitioned table aggregation, using regular expression patterns to group multiple database tables into a single
Passes data through processing nodes to decode JSON or merge fields before the final indexing step.
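The idea of passing a row through processing nodes before indexing can be sketched as a list of small functions applied in order. This is a minimal illustration, not this project's actual configuration or code: `decode_json`, `merge_fields`, `run_pipeline`, and the sample field names are all hypothetical.

```python
import json
from typing import Callable, Dict, List

Row = Dict[str, object]
Node = Callable[[Row], Row]


def decode_json(field: str) -> Node:
    """Build a node that parses a JSON string stored in `field`."""
    def node(row: Row) -> Row:
        out = dict(row)  # copy so the input row is left untouched
        value = out.get(field)
        if isinstance(value, str):
            out[field] = json.loads(value)
        return out
    return node


def merge_fields(sources: List[str], target: str, sep: str = " ") -> Node:
    """Build a node that joins `sources` into `target` and drops the sources."""
    def node(row: Row) -> Row:
        out = dict(row)
        parts = [str(out.pop(name)) for name in sources if name in out]
        out[target] = sep.join(parts)
        return out
    return node


def run_pipeline(row: Row, nodes: List[Node]) -> Row:
    """Apply each node in order and return the final document."""
    for node in nodes:
        row = node(row)
    return row


row = {
    "first_name": "Ada",
    "last_name": "Lovelace",
    "attrs": '{"role": "admin"}',
}
doc = run_pipeline(
    row,
    [
        decode_json("attrs"),
        merge_fields(["first_name", "last_name"], "full_name"),
    ],
)
print(doc)
```

Each node returns a new dictionary, so nodes can be reordered or reused without side effects on the source row; the resulting `doc` is what a final indexing step would receive.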