6 repositories
Tools for parsing, cleaning, and structuring unstructured data formats for downstream analysis or model consumption.
Distinguishing note: Focuses on the transformation of unstructured documents into structured nodes, distinct from general database management.
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
Converting complex documents like PDFs, tables, and charts into clean, structured formats that are ready for analysis and model consumption.
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks
Chains document analysis, text recognition, and layout segmentation tasks into versioned, automated, and reusable workflows.
JSDoc is a JavaScript API documentation generator that parses comments in source code to produce structured documentation files for a project interface. It functions as a source code documentation tool that extracts metadata from code comments to automate the creation of technical API references. The system operates as a template-based documentation engine, supporting external templates to customize the visual presentation and layout of the output. It also serves as a Markdown documentation exporter, transforming extracted documentation into Markdown files for use on alternative publishing pl
Implements a pluggable pipeline allowing external plugins to intercept and modify documentation data before output.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Partitions, enriches, and transforms unstructured documents into structured formats for AI and retrieval-augmented generation workflows.
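As a rough illustration of what partitioning a document into structured elements can mean, here is a minimal Python sketch that labels each line of plain text. It does not use the Unstructured library or its API; the `Element` class, the `partition_text` function, the three labels, and the 60-character title heuristic are all hypothetical choices made for this example.

```python
import re
from dataclasses import dataclass

# Matches common list markers such as "- ", "* ", "1. " or "2) ".
LIST_MARKER = re.compile(r"^(?:[-*•]|\d+[.)])\s+")


@dataclass
class Element:
    kind: str  # hypothetical labels: "Title", "ListItem", "NarrativeText"
    text: str


def classify(line: str) -> Element:
    """Label one non-empty line using simple, assumed heuristics."""
    if LIST_MARKER.match(line):
        return Element("ListItem", LIST_MARKER.sub("", line))
    # Assumption: short lines without closing punctuation are headings.
    if len(line) <= 60 and not line.endswith((".", "!", "?", ":", ";", ",")):
        return Element("Title", line)
    return Element("NarrativeText", line)


def partition_text(raw: str) -> list:
    """Turn raw text into a flat list of labeled elements."""
    return [classify(ln.strip()) for ln in raw.splitlines() if ln.strip()]
```

A real partitioner would also need layout, table, and image handling; this sketch only shows the shape of the output, a list of typed nodes that downstream code can filter or index.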
Vespa is a distributed search engine, vector database, and machine learning ranking engine. It serves as an AI search platform designed to handle large-scale document indexing and complex query processing across a cluster of nodes, combining keyword retrieval with high-dimensional embedding storage for semantic similarity search. The platform distinguishes itself by integrating machine learning models directly into the search pipeline to perform real-time inference and ranking. It converts these models into ranking expressions to score and order results based on relevance, while providing a s
Routes raw data through a series of chainable processors to transform and cleanse documents before indexing.
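The idea of routing documents through chainable processors before indexing can be sketched in a few lines of Python. This is a generic illustration, not Vespa's document processing API; the `Processor` type, the `run_chain` function, and the sample processors are hypothetical names introduced here.

```python
from typing import Callable, Iterable, Iterator, Optional

Document = dict
# A processor returns a (possibly modified) document, or None to drop it.
Processor = Callable[[Document], Optional[Document]]


def strip_whitespace(doc: Document) -> Document:
    """Trim leading and trailing whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in doc.items()}


def drop_empty_body(doc: Document) -> Optional[Document]:
    """Discard documents whose 'body' field is missing or empty."""
    return doc if doc.get("body") else None


def run_chain(docs: Iterable[Document], processors: Iterable[Processor]) -> Iterator[Document]:
    """Pass each document through every processor in order."""
    chain = list(processors)
    for doc in docs:
        for processor in chain:
            doc = processor(doc)
            if doc is None:
                break  # a processor rejected the document
        else:
            yield doc  # reached only when no processor dropped it
```

Because each processor takes a document and returns a document, the order of the chain is explicit and steps can be added, removed, or reused without touching the others.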
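This project is a change data capture (CDC) system and synchronization layer that moves data from MySQL databases into Elasticsearch indices. It acts as a relational-to-document mapper, turning database tables into searchable documents for real-time data integration and full-text search. The synchronizer stands out for its support of relational data denormalization, converting one-to-many database joins into parent-child document structures. It also supports partitioned-table aggregation, using regular expression patterns to group multiple database tables into a single search index. The system covers comprehensive data mapping and transformation, including field type conversion, schema mapping, and sync field filtering. It uses a pipeline-based processing model to decode and merge fields, relying on a snapshot-based initial load as a baseline and binary log streaming for real-time updates.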
Passes data through processing nodes to decode JSON or merge fields before the final indexing step.
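To make the processing-node idea concrete, the following Python sketch decodes a JSON string column, merges two fields into one, and maps table names to an index with a regular expression. It is an assumption-laden illustration, not this project's configuration format or code; every function name, the sample row, and the `orders_\d{4}` pattern are hypothetical.

```python
import json
import re


def decode_json_field(field):
    """Build a node that parses a JSON string stored in `field`."""
    def node(row):
        out = dict(row)
        value = out.get(field)
        if isinstance(value, str):
            try:
                out[field] = json.loads(value)
            except json.JSONDecodeError:
                pass  # leave malformed values untouched
        return out
    return node


def merge_fields(sources, target, sep=" "):
    """Build a node that joins `sources` into `target` and removes them."""
    def node(row):
        out = dict(row)
        out[target] = sep.join(str(out.pop(s)) for s in sources if s in out)
        return out
    return node


def run_nodes(row, nodes):
    """Apply each node in order to produce the document to index."""
    for node in nodes:
        row = node(row)
    return row


def index_for_table(table, rules):
    """Return the index of the first (pattern, index) rule matching `table`."""
    for pattern, index in rules:
        if re.fullmatch(pattern, table):
            return index
    return None
```

For example, a row with `first_name`, `last_name`, and a JSON-encoded `attrs` column could be passed through `run_nodes(row, [decode_json_field("attrs"), merge_fields(["first_name", "last_name"], "full_name")])`, while `index_for_table("orders_2023", [(r"orders_\d{4}", "orders")])` would route every yearly table to a single `orders` index.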