22 repositories
Utilities for parsing, segmenting, and extracting structured data from complex file formats for downstream analysis.
Distinguishing note: Focuses on the extraction and structural normalization of unstructured document data, distinct from general database management.
Explore 22 GitHub repositories in Data & Databases · Document Extraction Tools.
Daytona is a cloud-native development environment platform designed to orchestrate ephemeral, containerized workspaces. It provides a centralized system for managing reproducible coding environments as code, ensuring consistency across distributed teams by abstracting the underlying infrastructure. By utilizing declarative configuration, the platform automates the entire lifecycle of development sandboxes, from initial provisioning to resource governance. The platform distinguishes itself through its infrastructure-agnostic runner layer, which allows development environments to be deployed ac
Extracts code symbols to facilitate navigation and structural analysis within the development environment.
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
Provides specialized parsing and extraction pipelines that convert complex document formats into structured nodes for data analysis.
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Parses Word documents into structured objects by converting embedded tables to CSV.
Letta is a framework for building, deploying, and managing autonomous AI agents that maintain persistent state across long-term interactions. It provides a comprehensive suite of primitives for defining agents with configurable personas, modular memory blocks, and tool-use capabilities, enabling them to retain user preferences and conversation history over extended sessions. The platform distinguishes itself through its advanced memory management and orchestration capabilities. It allows agents to autonomously update their own memory, perform retrieval-augmented generation, and coordinate com
Parses text from PDF files to enable context-aware question answering by agents.
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVino. It features a specialized ability to translate natural lang
Parses and extracts structured elements like images, tables, and headers from complex file formats.
ExplainShell is a shell command explainer and syntax analyzer that matches command line arguments to manual page documentation. It functions as a man page parser and documentation extraction tool, converting roff-formatted manual pages into a structured database of command options and metadata. The project uses a combination of large language models and roff-macro parsing to identify specific line ranges that define flags and arguments. It employs a command syntax analyzer to deconstruct shell commands into tokens, which are then mapped against documented entries to provide plain language exp
Extracts structured flag and argument definitions from man pages using LLMs and roff macros.
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Captures and maps source-level access control lists into metadata to track permissions.
Bytebot is an LLM desktop automation framework and virtual Linux desktop environment. It enables AI agents to plan and execute mouse and keyboard actions on a virtual computer using natural language, allowing for autonomous desktop automation and the integration of legacy systems that lack native APIs. The system operates as an LLM API gateway and a Model Context Protocol server, routing requests across multiple language model providers with integrated load balancing and rate limiting. It provides isolated, containerized environments where agents use visual reasoning to interpret screenshots
Extracts structured information from uploaded PDFs for data cross-referencing and document generation.
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Provides the ability to generate N-dimensional feature representations of documents for downstream similarity searches.
pypdf is a Python library for parsing, manipulating, and generating PDF documents. It provides high-level operations for document processing, such as merging multiple files into one or splitting a single document into smaller files. The project includes specialized tools for managing interactive elements, including the creation and modification of annotations, hyperlinks, and form fields. It also supports advanced metadata management, allowing for the extraction and modification of standard document properties and XML-based XMP metadata. Beyond basic structural changes, the library covers pa
Retrieves permission flags from encrypted files to determine available user actions.
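The permission flags mentioned above are stored, per the PDF specification, as individual bits of a signed 32-bit integer in the file's encryption dictionary. The following is a minimal sketch of decoding such a value using only the standard library. The helper name `decode_permissions` and the sample value are illustrative assumptions and are not part of pypdf's API.

```python
# Bit positions (1-indexed) follow the user access permissions table
# in the PDF specification.
PERMISSION_BITS = {
    "print": 3,
    "modify": 4,
    "extract": 5,
    "annotate": 6,
    "fill_forms": 9,
    "extract_for_accessibility": 10,
    "assemble": 11,
    "print_high_quality": 12,
}


def decode_permissions(p_value: int) -> dict[str, bool]:
    """Map a signed 32-bit permission integer to named booleans (hypothetical helper)."""
    unsigned = p_value & 0xFFFFFFFF  # reinterpret two's complement as unsigned
    return {
        name: bool(unsigned & (1 << (bit - 1)))
        for name, bit in PERMISSION_BITS.items()
    }


if __name__ == "__main__":
    # Sample value: print and extract bits set, modify and annotate bits cleared.
    flags = decode_permissions(-44)
    print([name for name, allowed in flags.items() if allowed])
```

In practice a library would read this integer from the document itself; the sketch only shows how the bit layout translates into available user actions.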
Semantic is a Haskell-based library and command-line tool designed for polyglot source code analysis. It functions as a static analysis framework and abstract syntax tree (AST) parser capable of converting multiple programming languages into structured trees based on grammar definitions. The system distinguishes itself through its semantic diffing engine, which detects structural and functional changes between code versions rather than only textual differences. It enables cross-language analysis by translating source languages into a unified polyglot intermediate representation. The framework offers a wide range of capabilities for parsing languages such as Rust, Go, Python, Ruby, PHP, TypeScript, and TSX. It covers semantic analysis via scope mapping, symbol extraction, and semantic graph generation, as well as tools for pattern analysis and for evaluating program behavior. The toolset includes command-line utilities for standardizing the layout of Haskell source files.
Provides specialized tools for identifying and indexing named identifiers and types within source code files.
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
Identifies and retrieves tabular data and key-value pairs from document pages.
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Offers a CLI tool installable via Homebrew or Docker for extracting document content.
This is a graph convolutional network library designed for performing node and graph classification on graph-structured data. It functions as a framework for generating graph embeddings and implementing spectral convolutional neural networks to predict labels for nodes and entire graph structures. The library provides specialized tools for spectral graph convolutions, utilizing Chebyshev polynomial approximations to perform feature aggregation. It includes a multi-graph processing framework that manages batches of different graph instances through block-diagonal adjacency matrices and pooling
Generates low-dimensional vector representations of nodes based on their structural connectivity within a graph.
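The block-diagonal batching mentioned in the description above can be illustrated with a short sketch: several small graphs are merged into one adjacency matrix in which no edges cross between graphs, and a parallel index records which graph each node belongs to so that pooling can later be done per graph. The function names `block_diagonal` and `graph_index` are assumptions chosen for illustration, and plain nested lists stand in for whatever tensor type the library actually uses.

```python
def block_diagonal(adjacencies: list[list[list[int]]]) -> list[list[int]]:
    """Combine square adjacency matrices into one block-diagonal matrix."""
    total = sum(len(adj) for adj in adjacencies)
    batch = [[0] * total for _ in range(total)]
    offset = 0
    for adj in adjacencies:
        n = len(adj)
        for i in range(n):
            for j in range(n):
                # Copy each graph into its own diagonal block.
                batch[offset + i][offset + j] = adj[i][j]
        offset += n
    return batch


def graph_index(adjacencies: list[list[list[int]]]) -> list[int]:
    """For each node in the batch, record the index of its source graph."""
    return [g for g, adj in enumerate(adjacencies) for _ in range(len(adj))]


if __name__ == "__main__":
    edge = [[0, 1], [1, 0]]                     # two nodes, one edge
    path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # three nodes in a line
    for row in block_diagonal([edge, path]):
        print(row)
    print(graph_index([edge, path]))
```

Because the off-diagonal blocks stay zero, a convolution over the combined matrix never mixes features between different graphs in the batch.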
Layout-parser is a deep-learning-based image analysis framework and document layout analyzer. It provides a toolkit for extracting structural information and layout patterns from scanned documents and digital images, transforming them into programmatic data structures for automated analysis. The framework integrates layout detection with optical character recognition to convert tabular regions into machine-readable data. It uses neural networks to identify and classify structural elements within document images without relying on hand-written rule-based systems. The system covers a broad range of document analysis features, including document structure analysis, automated table extraction, and hierarchical layout representation. It also includes visualization tools that render detected elements and hierarchies on the original images so that results can be verified.
Offers a library for parsing document images into programmatic data structures for downstream analysis.
pdf2htmlEX is a PDF to HTML converter that transforms documents into web pages while preserving the original layout, fonts, and formatting. It functions as a layout engine and text extractor, mapping PDF coordinate data to HTML and CSS to maintain visual fidelity. The tool converts PDF content into searchable and selectable native HTML text by embedding original document fonts. It maintains document interactivity by preserving internal links, bookmarks, and outlines, converting them into functional web navigation. The conversion process supports flexible output structures, allowing documents
Converts the PDF table of contents into a structured web outline for easier navigation.
pdfminer is a Python library for parsing PDF files to extract text, analyze layouts, decrypt content, and convert documents to HTML or XML. It functions as a text extraction engine and layout analysis tool designed to recover characters and words while preserving the structural organization of the original document. The project provides utilities for converting PDF content into structured HTML or XML to maintain the visual layout, as well as a decryption tool for unlocking restricted documents using encryption keys. It identifies the positions and groupings of text elements to reconstruct page organization and recover hierarchical outlines. The library covers a broad spectrum of PDF processing, including metadata extraction, document layout analysis, and dumping internal PDF objects for debugging. It handles text retrieval along with coordinates, font metadata, and writing direction.
Extracts hierarchical bookmark trees and table of contents from PDF documents.
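Bookmark extraction of this kind typically produces a flat sequence of entries, each tagged with a nesting level, which then has to be folded into a tree. The sketch below shows one way to do that with a stack, assuming the entries arrive as `(level, title)` pairs with level 1 as the top level. The `OutlineNode` class, the `build_outline_tree` function, and the sample entries are hypothetical and are not taken from pdfminer.

```python
from dataclasses import dataclass, field


@dataclass
class OutlineNode:
    title: str
    children: list["OutlineNode"] = field(default_factory=list)


def build_outline_tree(entries: list[tuple[int, str]]) -> list[OutlineNode]:
    """Nest flat (level, title) pairs into a tree of OutlineNode objects."""
    roots: list[OutlineNode] = []
    stack: list[tuple[int, OutlineNode]] = []  # path from a root to the latest entry
    for level, title in entries:
        node = OutlineNode(title)
        # Pop until the top of the stack is a strict ancestor of this entry.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            stack[-1][1].children.append(node)
        else:
            roots.append(node)
        stack.append((level, node))
    return roots


if __name__ == "__main__":
    # Hypothetical outline entries for illustration only.
    sample = [(1, "Intro"), (2, "Scope"), (2, "Terms"),
              (1, "Methods"), (2, "Data"), (3, "Sources")]
    for root in build_outline_tree(sample):
        print(root.title, [child.title for child in root.children])
```

The stack keeps only the current ancestor chain, so the tree is built in a single pass over the entries.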
nvim-surround is a Lua-based plugin for Neovim designed to add, change, and delete pairs of delimiters surrounding text and code. It functions as a text-object manipulator that wraps or removes brackets, quotes, and tags using motions and selections. The plugin integrates with Tree-sitter to identify structural code nodes, allowing syntax elements to be surrounded precisely based on the syntax tree. It also supports custom surround definitions, letting users define specialized delimiter pairs and aliases. The core capability surface covers basic surround operations, including adding, changing, and deleting delimiters. It includes support for repeating the last surround action to keep formatting consistent across different text selections.
Uses Tree-sitter structural node querying to precisely identify and surround complex code blocks.
LuaSnip is a scriptable text expansion framework and Lua-based snippet engine. It allows for the creation of reusable text templates and complex nested structures that expand into a buffer using triggers and jumpable tabstops. The system distinguishes itself by using abstract syntax trees to trigger expansions based on structural code patterns rather than simple text matching. It features a multi-format importer capable of parsing snippet definitions from community standards such as LSP and SnipMate. The framework covers dynamic code generation through Lua functions, regex-based capture grou
Triggers a postfix snippet only when a specific tree-sitter node immediately precedes the trigger.
render-markdown.nvim is a Neovim plugin that transforms raw markdown syntax into a visually formatted layout directly inside the editor. It acts as a component visualizer and syntax highlighter, replacing standard markdown elements with custom symbols, icons, and formatted blocks to improve document readability. The plugin provides a toggle between rendered visual layouts and raw text views, allowing users to switch based on their current needs. It also applies markdown styling to injected content sections found within non-markdown file types. The system covers the visualization of various d
Uses tree-sitter grammars to precisely identify markdown elements for styling and icon placement.