Why is run-llama/llama_index a recommended repository for Document Processing Pipelines?

It converts complex documents such as PDFs, tables, and charts into clean, structured formats ready for analysis and model consumption.

Why is datalab-to/surya a recommended repository for Document Processing Pipelines?

It chains document analysis, text recognition, and layout segmentation tasks into versioned, automated, and reusable workflows.

Why is jsdoc/jsdoc a recommended repository for Document Processing Pipelines?

It implements a pluggable pipeline in which external plugins can intercept and modify documentation data before output.

Why is Unstructured-IO/unstructured a recommended repository for Document Processing Pipelines?

It partitions, enriches, and transforms unstructured documents into structured formats for AI and retrieval-augmented generation workflows.

Why is vespa-engine/vespa a recommended repository for Document Processing Pipelines?

It routes raw data through a series of chainable processors that transform and cleanse documents before indexing.

Why is go-mysql-org/go-mysql-elasticsearch a recommended repository for Document Processing Pipelines?

It passes data through processing nodes that decode JSON or merge fields before the final indexing step.

6 repositories

Awesome GitHub Repositories: Document Processing Pipelines

Tools for parsing, cleaning, and structuring unstructured data formats for downstream analysis or model consumption.

Distinguishing note: Focuses on the transformation of unstructured documents into structured nodes, distinct from general database management.

Explore 6 awesome GitHub repositories matching data & databases · Document Processing Pipelines. Refine with filters or upvote what's useful.


run-llama/llama_index
50,306 · View on GitHub
LlamaIndex is a development framework for connecting private or external data sources to large language models. It functions as a data-centric toolkit for building retrieval-augmented generation systems, so developers can build applications that give context-aware answers grounded in an organization's own information. The project distinguishes itself through an agentic orchestration engine that supports autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, …
Converts complex documents such as PDFs, tables, and charts into clean, structured formats ready for analysis and model consumption.
Python · agents · application · data
datalab-to/surya
20,889 · View on GitHub
Surya is a document processing platform that transforms unstructured files into structured, machine-readable data. It provides tools for text recognition, layout analysis, and reading order detection, converting PDFs and images into formats such as JSON, HTML, or Markdown. The platform handles complex document workflows, with capabilities for data extraction, document segmentation, and automated form completion. It distinguishes itself through a pipeline-based architecture that allows users to chain analysis tasks …
Chains document analysis, text recognition, and layout segmentation tasks into versioned, automated, and reusable workflows.
Python
jsdoc/jsdoc
15,442 · View on GitHub
JSDoc is a JavaScript API documentation generator that parses comments in source code to produce structured documentation for a project's interface. It extracts metadata from code comments to automate the creation of technical API references. The system operates as a template-based documentation engine, supporting external templates that customize the visual presentation and layout of the output. It also serves as a Markdown documentation exporter, transforming extracted documentation into Markdown files for use on alternative publishing …
Implements a pluggable pipeline allowing external plugins to intercept and modify documentation data before output.
JavaScript
Unstructured-IO/unstructured
14,019 · View on GitHub
Unstructured is a data orchestration engine that transforms raw, unstructured files into structured, machine-readable formats. It functions as a platform for document ingestion, partitioning, and enrichment, built to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture …
Partitions, enriches, and transforms unstructured documents into structured formats for AI and retrieval-augmented generation workflows.
HTML · data-pipelines · deep-learning · document-image-analysis
vespa-engine/vespa
6,961 · View on GitHub
Vespa is a distributed search engine, vector database, and machine learning ranking engine. It serves as an AI search platform for large-scale document indexing and complex query processing across a cluster of nodes, combining keyword retrieval with high-dimensional embedding storage for semantic similarity search. The platform distinguishes itself by integrating machine learning models directly into the search pipeline to perform real-time inference and ranking. It converts these models into ranking expressions that score and order results by relevance, while providing a …
Routes raw data through a series of chainable processors to transform and cleanse documents before indexing.
Java
go-mysql-org/go-mysql-elasticsearch
4,154 · View on GitHub
This project is a change data capture system and synchronization layer that moves data from MySQL databases into Elasticsearch indices. It functions as a relational-to-document mapper, transforming database tables into searchable documents to enable real-time data integration and full-text search. The synchronizer differentiates itself by supporting relational data denormalization, which transforms one-to-many database joins into parent-child document structures. It also supports partitioned table aggregation, using regular expression patterns to group multiple database tables into a single …
Passes data through processing nodes to decode JSON or merge fields before the final indexing step.
Go

