22 repositories
Utilities for parsing, segmenting, and extracting structured data from complex file formats for downstream analysis.
Distinguishing note: Focuses on the extraction and structural normalization of unstructured document data, distinct from general database management.
22 GitHub repositories matching data & databases · Document Extraction Tools.
Daytona is a cloud-native development environment platform designed to orchestrate ephemeral, containerized workspaces. It provides a centralized system for managing reproducible coding environments as code, ensuring consistency across distributed teams by abstracting the underlying infrastructure. By utilizing declarative configuration, the platform automates the entire lifecycle of development sandboxes, from initial provisioning to resource governance. The platform distinguishes itself through its infrastructure-agnostic runner layer, which allows development environments to be deployed ac
Extracts code symbols to facilitate navigation and structural analysis within the development environment.
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
Provides specialized parsing and extraction pipelines that convert complex document formats into structured nodes for data analysis.
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Parses Word documents into structured objects by converting embedded tables to CSV.
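As a rough illustration of the table-to-CSV step mentioned above, the following sketch assumes the table cells have already been read out of the Word file into a list of rows. Reading the file itself is not shown, the function name `table_to_csv` and the sample rows are hypothetical, and this is not Kotaemon's own code.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a table (a list of rows, each a list of cell strings) to CSV text."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output free of "\r\n" line endings.
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample table; a cell containing a comma gets quoted by the csv module.
rows = [["Item", "Qty"], ["Widget, large", "2"]]
csv_text = table_to_csv(rows)
print(csv_text)
```

Using the standard `csv` module rather than joining cells by hand means commas and quotes inside cells are escaped correctly.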
Letta is a framework for building, deploying, and managing autonomous AI agents that maintain persistent state across long-term interactions. It provides a comprehensive suite of primitives for defining agents with configurable personas, modular memory blocks, and tool-use capabilities, enabling them to retain user preferences and conversation history over extended sessions. The platform distinguishes itself through its advanced memory management and orchestration capabilities. It allows agents to autonomously update their own memory, perform retrieval-augmented generation, and coordinate com
Parses text from PDF files to enable context-aware question answering by agents.
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVino. It features a specialized ability to translate natural lang
Parses and extracts structured elements like images, tables, and headers from complex file formats.
ExplainShell is a shell command explainer and syntax analyzer that matches command line arguments to manual page documentation. It functions as a man page parser and documentation extraction tool, converting roff-formatted manual pages into a structured database of command options and metadata. The project uses a combination of large language models and roff-macro parsing to identify specific line ranges that define flags and arguments. It employs a command syntax analyzer to deconstruct shell commands into tokens, which are then mapped against documented entries to provide plain language exp
Extracts structured flag and argument definitions from man pages using LLMs and roff macros.
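The token-to-documentation matching described above can be sketched roughly as follows. This is a minimal illustration rather than ExplainShell's implementation: the option table is hand-written and stands in for entries that would be extracted from a man page, and the names `TAR_OPTIONS` and `explain` are assumptions made for this example.

```python
import shlex

# Hypothetical, hand-written option table standing in for entries parsed from a man page.
TAR_OPTIONS = {
    "-x": "extract files from an archive",
    "-z": "filter the archive through gzip",
    "-f": "use the given archive file",
}


def explain(command, options):
    """Split a command into tokens and pair each token with a documented meaning, if any."""
    tokens = shlex.split(command)
    if not tokens:
        return []
    # The first token is the program name; the rest are looked up in the option table.
    explained = [(tokens[0], "program name")]
    for token in tokens[1:]:
        meaning = options.get(token, "no documented entry (treated as an argument)")
        explained.append((token, meaning))
    return explained


result = explain("tar -x -z -f backup.tar.gz", TAR_OPTIONS)
for token, meaning in result:
    print(f"{token}: {meaning}")
```

A real analyzer would also have to handle combined short flags such as `-xzf`, pipes, and subcommands, all of which this sketch ignores.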
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Captures and maps source-level access control lists into metadata to track permissions.
Bytebot is an LLM desktop automation framework and virtual Linux desktop environment. It enables AI agents to plan and execute mouse and keyboard actions on a virtual computer using natural language, allowing for autonomous desktop automation and the integration of legacy systems that lack native APIs. The system operates as an LLM API gateway and a Model Context Protocol server, routing requests across multiple language model providers with integrated load balancing and rate limiting. It provides isolated, containerized environments where agents use visual reasoning to interpret screenshots
Extracts structured information from uploaded PDFs for data cross-referencing and document generation.
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Provides the ability to generate N-dimensional feature representations of documents for downstream similarity searches.
pypdf is a Python library for parsing, manipulating, and generating PDF documents. It provides high-level operations for document processing, such as merging multiple files into one or splitting a single document into smaller files. The project includes specialized tools for managing interactive elements, including the creation and modification of annotations, hyperlinks, and form fields. It also supports advanced metadata management, allowing for the extraction and modification of standard document properties and XML-based XMP metadata. Beyond basic structural changes, the library covers pa
Retrieves permission flags from encrypted files to determine available user actions.
Semantic is a Haskell-based library and command-line tool designed for polyglot source code analysis. It functions as a static program analysis framework and a polyglot abstract syntax tree parser that converts multiple programming languages into structured syntax trees based on grammar definitions. The system distinguishes itself through a semantic code comparison engine that detects structural and meaningful changes between code versions rather than relying on textual differences. It further enables analysis across different programming syntaxes by translating surface languages into a unifi
Provides specialized tools for identifying and indexing named identifiers and types within source code files.
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
Identifies and retrieves tabular data and key-value pairs from document pages.
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Offers a CLI tool installable via Homebrew or Docker for extracting document content.
This is a graph convolutional network library designed for performing node and graph classification on graph-structured data. It functions as a framework for generating graph embeddings and implementing spectral convolutional neural networks to predict labels for nodes and entire graph structures. The library provides specialized tools for spectral graph convolutions, utilizing Chebyshev polynomial approximations to perform feature aggregation. It includes a multi-graph processing framework that manages batches of different graph instances through block-diagonal adjacency matrices and pooling
Generates low-dimensional vector representations of nodes based on their structural connectivity within a graph.
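The block-diagonal batching idea mentioned above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's code: graphs are given as dense square adjacency matrices (lists of lists), and the function name `block_diagonal` is hypothetical.

```python
def block_diagonal(adjacencies):
    """Combine square adjacency matrices into a single block-diagonal matrix.

    Each input graph keeps its own edges, and no edges connect different
    graphs, so one matrix can represent a whole batch of graphs.
    """
    total = sum(len(adj) for adj in adjacencies)
    batch = [[0] * total for _ in range(total)]
    offset = 0
    for adj in adjacencies:
        n = len(adj)
        for i in range(n):
            for j in range(n):
                # Copy each graph's block onto the diagonal at its own offset.
                batch[offset + i][offset + j] = adj[i][j]
        offset += n
    return batch


# Two small example graphs: a triangle (3 nodes) and a single edge (2 nodes).
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
edge = [[0, 1], [1, 0]]
batch = block_diagonal([triangle, edge])
for row in batch:
    print(row)
```

Because the off-diagonal blocks are all zero, neighborhood aggregation on the batched matrix never mixes features between graphs.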
Layout-parser is a deep learning framework for document layout analysis and image analysis. It provides a toolkit for extracting structural information and layout patterns from scanned documents and digital images, converting them into programmatic data structures for automated analysis. The framework integrates layout detection with optical character recognition to convert tabular regions into machine-readable data. It uses neural networks to identify and classify structural elements within document images without relying on manual rule-based systems. The system covers a broad range of document analysis capabilities, including document structure analysis, automated table extraction, and hierarchical layout representation. It also includes visualization tools for drawing detected elements and hierarchies over the original images to verify results.
Offers a library for parsing document images into programmatic data structures for downstream analysis.
pdf2htmlEX is a PDF to HTML converter that transforms documents into web pages while preserving the original layout, fonts, and formatting. It functions as a layout engine and text extractor, mapping PDF coordinate data to HTML and CSS to maintain visual fidelity. The tool converts PDF content into searchable and selectable native HTML text by embedding original document fonts. It maintains document interactivity by preserving internal links, bookmarks, and outlines, converting them into functional web navigation. The conversion process supports flexible output structures, allowing documents
Converts the PDF table of contents into a structured web outline for easier navigation.
pdfminer is a Python library for parsing PDF files to extract text, analyze layouts, decrypt content, and convert documents to HTML or XML. It functions as a text extraction engine and layout analysis tool designed to retrieve characters and words while preserving the structural organization of the original document. The project provides tools for converting PDF content into structured HTML or XML to preserve the visual layout, along with a decryption tool for unlocking restricted documents using encryption keys. It identifies the positions and groupings of text elements to reconstruct page organization and retrieve hierarchical outlines. The library covers a broad area of PDF processing, including metadata extraction, document layout analysis, and dumping internal PDF objects for debugging. It handles text retrieval together with coordinates, font metadata, and writing direction.
Extracts hierarchical bookmark trees and table of contents from PDF documents.
nvim-surround is a Lua-based extension for Neovim designed to add, change, and delete surrounding delimiter pairs around text and code. It functions as a text object manipulator that wraps or removes brackets, quotes, and tags using motions and selections. The plugin integrates with Tree-sitter to identify structural code nodes, allowing precise surrounding of syntax elements based on the structural syntax tree. It also supports custom surround definitions, enabling users to define specialized delimiter pairs and aliases. The core capability surface covers the basic surround operations, including adding, changing, and deleting delimiters. It includes support for repeating the most recent surround action to keep formatting consistent across different text selections.
Uses Tree-sitter structural node querying to precisely identify and surround complex code blocks.
LuaSnip is a scriptable text expansion framework and Lua-based snippet engine. It allows for the creation of reusable text templates and complex nested structures that expand into a buffer using triggers and jumpable tabstops. The system distinguishes itself by using abstract syntax trees to trigger expansions based on structural code patterns rather than simple text matching. It features a multi-format importer capable of parsing snippet definitions from community standards such as LSP and SnipMate. The framework covers dynamic code generation through Lua functions, regex-based capture grou
Triggers a postfix snippet only when a specific tree-sitter node sits in front of the trigger.
render-markdown.nvim is a Neovim plugin that transforms raw markdown syntax into a visually formatted layout directly inside the editor. It acts as a component visualizer and syntax highlighter, replacing standard markdown elements with custom symbols, icons, and formatted blocks to improve document readability. The plugin provides a toggle between rendered visual layouts and raw text views, allowing users to switch based on their current needs. It also applies markdown styling to injected content sections found within non-markdown file types. The system covers the visualization of various d
Uses tree-sitter grammars to precisely identify markdown elements for styling and icon placement.