3 repositories
Tools for separating collection items into distinct documents within a stream.
Distinct from Document Subscriptions: this is a core data processing function.
Explore 3 awesome GitHub repositories matching data & databases · Document Splitters. Refine with filters or upvote what's useful.
This project is a PDF data extraction tool and document preprocessor designed to convert PDF files into structured formats such as Markdown, JSON, and HTML. It functions as an OCR document parser for scanned files, an accessibility automator for generating PDF/UA compliant metadata, and a loader for AI orchestration frameworks like LangChain. The software distinguishes itself through specialized handling of complex document elements, including the conversion of mathematical formulas into LaTeX and the generation of natural-language descriptions for charts and images. It utilizes recursive seg
Functions as a document loader that integrates structured PDF content into the LangChain orchestration framework.
This tool is a command-line processor designed for querying, updating, and transforming structured data files. It functions as a versatile engine for manipulating YAML, JSON, TOML, and XML documents, allowing users to perform complex operations directly from the terminal. By utilizing a path-based expression language, it enables precise navigation and modification of data structures within configuration files and infrastructure-as-code workflows. What distinguishes this tool is its ability to perform in-place document mutations while preserving original formatting, comments, and metadata. It
Separates individual items from a collection into distinct documents within a single output stream.
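As a rough illustration of what separating collection items into distinct documents within one stream can look like, here is a minimal Python sketch. It is not the tool's own implementation or syntax: the function name `split_into_documents` is hypothetical, and it assumes a YAML-style stream in which documents are separated by `---` lines, with each item serialized as JSON (which YAML 1.2 is designed to accept).

```python
import json


def split_into_documents(items):
    """Render each item of a collection as its own document in one stream.

    Hypothetical helper for illustration only; documents are separated
    by a YAML-style '---' marker line.
    """
    # Serialize every item independently so each becomes a standalone document.
    docs = [json.dumps(item, sort_keys=True) for item in items]
    # Join the documents with the separator and end the stream with a newline.
    return "\n---\n".join(docs) + "\n"


stream = split_into_documents([{"name": "a"}, {"name": "b"}])
print(stream, end="")
```

The input is a single collection with two items; the output stream holds two separate documents, one per item.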
langchaingo is an LLM application framework for Go designed for building language model-powered applications and autonomous agents. It serves as an orchestration library and tool integration framework that allows developers to link prompt sequences and model calls into complex, multi-step workflows. The project provides a toolkit for implementing retrieval-augmented generation pipelines by processing unstructured documents and retrieving relevant context via vector search. It includes a dedicated integration layer for indexing high-dimensional embeddings and performing similarity searches acr
Ships a pipeline of loaders and text splitters to transform diverse file formats into chunked data.
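To make the idea of a text splitter concrete, the following Python sketch cuts a string into fixed-size chunks with an optional overlap between neighbours. It is a generic illustration and does not reproduce langchaingo's Go API; the name `split_text` and its parameters `chunk_size` and `overlap` are assumptions chosen for this example.

```python
def split_text(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters. Hypothetical helper
    for illustration; real splitters often cut on separators instead.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

The overlap keeps a little shared context between adjacent chunks, which is a common choice when chunks are later embedded and retrieved independently.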