2 repositories
Components that extract raw text from various file formats and web sources.
Distinct from Document Splitters: Focuses on the ingestion of diverse file formats, whereas document splitters focus on dividing existing text into chunks.
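To make the loader/splitter distinction concrete, here is a minimal sketch using only the Python standard library. The names `load_document` and `split_text`, the supported extensions, and the fixed chunk size are illustrative assumptions. They are not the API of either repository listed here.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Keep only non-empty text fragments.
        if data.strip():
            self.parts.append(data.strip())


def load_document(path):
    """Loader step (hypothetical helper): file of a supported format -> raw text."""
    path = Path(path)
    suffix = path.suffix.lower()
    raw = path.read_text(encoding="utf-8")
    if suffix in {".html", ".htm"}:
        parser = _TextExtractor()
        parser.feed(raw)
        parser.close()  # flush any buffered text
        return " ".join(parser.parts)
    if suffix in {".txt", ".md"}:
        return raw
    raise ValueError(f"unsupported format: {suffix}")


def split_text(text, chunk_size=200):
    """Splitter step (hypothetical helper): existing text -> fixed-size chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

The loader is the only part that knows about file formats. The splitter receives plain text and never touches the file system, which is why the two are listed as separate categories.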
2 GitHub repositories in Data & Databases · Document Loaders.
This project is a PDF data extraction tool and document preprocessor designed to convert PDF files into structured formats such as Markdown, JSON, and HTML. It functions as an OCR document parser for scanned files, an accessibility automator for generating PDF/UA compliant metadata, and a loader for AI orchestration frameworks like LangChain. The software distinguishes itself through specialized handling of complex document elements, including the conversion of mathematical formulas into LaTeX and the generation of natural-language descriptions for charts and images. It utilizes recursive seg
Functions as a document loader that integrates structured PDF content into the LangChain orchestration framework.
langchaingo is an LLM application framework for Go designed for building language model-powered applications and autonomous agents. It serves as an orchestration library and tool integration framework that allows developers to link prompt sequences and model calls into complex, multi-step workflows. The project provides a toolkit for implementing retrieval-augmented generation pipelines by processing unstructured documents and retrieving relevant context via vector search. It includes a dedicated integration layer for indexing high-dimensional embeddings and performing similarity searches acr
Ships a pipeline of loaders and text splitters to transform diverse file formats into chunked data.