22 Repos
Utilities for parsing, segmenting, and extracting structured data from complex file formats for downstream analysis.
Distinguishing note: Focuses on the extraction and structural normalization of unstructured document data, distinct from general database management.
Explore 22 awesome GitHub repositories matching data & databases · Document Extraction Tools.
Daytona is a cloud-native development environment platform designed to orchestrate ephemeral, containerized workspaces. It provides a centralized system for managing reproducible coding environments as code, ensuring consistency across distributed teams by abstracting the underlying infrastructure. By utilizing declarative configuration, the platform automates the entire lifecycle of development sandboxes, from initial provisioning to resource governance. The platform distinguishes itself through its infrastructure-agnostic runner layer, which allows development environments to be deployed ac
Extracts code symbols to facilitate navigation and structural analysis within the development environment.
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
Provides specialized parsing and extraction pipelines that convert complex document formats into structured nodes for data analysis.
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Parses Word documents into structured objects by converting embedded tables to CSV.
Letta is a framework for building, deploying, and managing autonomous AI agents that maintain persistent state across long-term interactions. It provides a comprehensive suite of primitives for defining agents with configurable personas, modular memory blocks, and tool-use capabilities, enabling them to retain user preferences and conversation history over extended sessions. The platform distinguishes itself through its advanced memory management and orchestration capabilities. It allows agents to autonomously update their own memory, perform retrieval-augmented generation, and coordinate com
Parses text from PDF files to enable context-aware question answering by agents.
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVino. It features a specialized ability to translate natural lang
Parses and extracts structured elements like images, tables, and headers from complex file formats.
ExplainShell is a shell command explainer and syntax analyzer that matches command line arguments to manual page documentation. It functions as a man page parser and documentation extraction tool, converting roff-formatted manual pages into a structured database of command options and metadata. The project uses a combination of large language models and roff-macro parsing to identify specific line ranges that define flags and arguments. It employs a command syntax analyzer to deconstruct shell commands into tokens, which are then mapped against documented entries to provide plain language exp
Extracts structured flag and argument definitions from man pages using LLMs and roff macros.
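The token-to-documentation mapping described above can be illustrated with a minimal sketch. This is not ExplainShell's actual code: the `TAR_OPTIONS` table and the `explain` function are hypothetical stand-ins for entries that would be extracted from a man page, and the tokenizer is Python's standard `shlex` module rather than the project's own syntax analyzer.

```python
import shlex

# Hypothetical option table, standing in for entries extracted from a man page.
TAR_OPTIONS = {
    "-x": "extract files from an archive",
    "-z": "filter the archive through gzip",
    "-f": "use the given archive file",
}

UNDOCUMENTED = "no documented entry (likely an operand)"


def explain(command, options):
    """Split a shell command into tokens and pair each with its documentation."""
    tokens = shlex.split(command)
    if not tokens:
        return []
    explained = [(tokens[0], "program name")]
    for token in tokens[1:]:
        is_cluster = (
            token.startswith("-")
            and not token.startswith("--")
            and len(token) > 2
        )
        # Expand clustered short flags, e.g. -xzf becomes -x, -z, -f.
        parts = ["-" + ch for ch in token[1:]] if is_cluster else [token]
        for part in parts:
            explained.append((part, options.get(part, UNDOCUMENTED)))
    return explained


for token, meaning in explain("tar -xzf archive.tar.gz", TAR_OPTIONS):
    print(f"{token:16} {meaning}")
```

A real implementation would also need to handle options that take arguments, long options, pipes, and redirections, which this sketch ignores.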
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Captures and maps source-level access control lists into metadata to track permissions.
Bytebot is an LLM desktop automation framework and virtual Linux desktop environment. It enables AI agents to plan and execute mouse and keyboard actions on a virtual computer using natural language, allowing for autonomous desktop automation and the integration of legacy systems that lack native APIs. The system operates as an LLM API gateway and a Model Context Protocol server, routing requests across multiple language model providers with integrated load balancing and rate limiting. It provides isolated, containerized environments where agents use visual reasoning to interpret screenshots
Extracts structured information from uploaded PDFs for data cross-referencing and document generation.
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Provides the ability to generate N-dimensional feature representations of documents for downstream similarity searches.
pypdf is a Python library for parsing, manipulating, and generating PDF documents. It provides high-level operations for document processing, such as merging multiple files into one or splitting a single document into smaller files. The project includes specialized tools for managing interactive elements, including the creation and modification of annotations, hyperlinks, and form fields. It also supports advanced metadata management, allowing for the extraction and modification of standard document properties and XML-based XMP metadata. Beyond basic structural changes, the library covers pa
Retrieves permission flags from encrypted files to determine available user actions.
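For readers unfamiliar with PDF permission flags, the sketch below decodes the integer `/P` entry of a PDF encryption dictionary into named permissions, following the bit layout in the PDF specification. It uses only the standard library and does not call pypdf; `PERMISSION_BITS` and `decode_permissions` are hypothetical names chosen for illustration, not part of pypdf's API.

```python
# Bit masks for the /P entry of a PDF encryption dictionary.
# The specification numbers bits from 1 (lowest), so "bit 3" is 1 << 2.
PERMISSION_BITS = {
    "print": 1 << 2,                       # bit 3
    "modify": 1 << 3,                      # bit 4
    "extract": 1 << 4,                     # bit 5
    "annotate": 1 << 5,                    # bit 6
    "fill_forms": 1 << 8,                  # bit 9
    "extract_for_accessibility": 1 << 9,   # bit 10
    "assemble": 1 << 10,                   # bit 11
    "print_high_quality": 1 << 11,         # bit 12
}


def decode_permissions(p_value):
    """Map a signed 32-bit /P value to a dict of permission name -> bool."""
    unsigned = p_value & 0xFFFFFFFF  # /P is stored as a signed 32-bit integer
    return {name: bool(unsigned & mask) for name, mask in PERMISSION_BITS.items()}


# -3904 has every permission bit cleared; setting bit 3 allows printing only.
print(decode_permissions(-3904 | 4))
```

These flags describe what a conforming viewer should allow; they are advisory and do not by themselves prevent other software from performing the action.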
Semantic is a Haskell-based library and command-line tool for polyglot source code analysis. It serves as a static program analysis framework and a polyglot abstract syntax tree parser that converts various programming languages into structured syntax trees based on grammar definitions. The system distinguishes itself through a semantic code comparison engine that detects structural and content changes between code versions instead of relying on purely textual differences. It also enables analysis across different programming syntaxes by translating surface languages into a unified, polyglot intermediate representation. The framework offers a broad range of parsing features for languages such as Rust, Go, Python, Ruby, PHP, TypeScript, and TSX. It covers semantic analysis through code scope mapping, symbol extraction, and semantic graph generation, complemented by tools for pattern analysis and program behavior evaluation. The toolset includes command-line utilities for standardizing Haskell source files.
Provides specialized tools for identifying and indexing named identifiers and types within source code files.
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
Identifies and retrieves tabular data and key-value pairs from document pages.
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Offers a CLI tool installable via Homebrew or Docker for extracting document content.
This is a graph convolutional network library designed for performing node and graph classification on graph-structured data. It functions as a framework for generating graph embeddings and implementing spectral convolutional neural networks to predict labels for nodes and entire graph structures. The library provides specialized tools for spectral graph convolutions, utilizing Chebyshev polynomial approximations to perform feature aggregation. It includes a multi-graph processing framework that manages batches of different graph instances through block-diagonal adjacency matrices and pooling
Generates low-dimensional vector representations of nodes based on their structural connectivity within a graph.
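The two techniques named in this entry, block-diagonal batching and Chebyshev polynomial filtering, can be sketched in plain Python. This is an illustrative sketch, not the library's implementation: `block_diagonal`, `matvec`, and `chebyshev_filter` are hypothetical names, the coefficients are fixed numbers rather than learned weights, and the operator passed in is assumed to be already rescaled so its eigenvalues lie in [-1, 1], as is usual for a Chebyshev expansion.

```python
def block_diagonal(adjacencies):
    """Place several square adjacency matrices along one larger diagonal."""
    total = sum(len(adj) for adj in adjacencies)
    batch = [[0.0] * total for _ in range(total)]
    offset = 0
    for adj in adjacencies:
        size = len(adj)
        for i in range(size):
            for j in range(size):
                batch[offset + i][offset + j] = float(adj[i][j])
        offset += size
    return batch


def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def chebyshev_filter(operator, signal, coefficients):
    """Compute sum_k c_k * T_k(operator) @ signal.

    Uses the recurrence T_k = 2 * operator * T_{k-1} - T_{k-2},
    with T_0 x = x and T_1 x = operator x.
    """
    if not coefficients:
        return [0.0] * len(signal)
    t_prev = list(signal)
    out = [coefficients[0] * v for v in t_prev]
    if len(coefficients) == 1:
        return out
    t_curr = matvec(operator, signal)
    out = [o + coefficients[1] * v for o, v in zip(out, t_curr)]
    for coeff in coefficients[2:]:
        scaled = matvec(operator, t_curr)
        t_next = [2 * a - b for a, b in zip(scaled, t_prev)]
        out = [o + coeff * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out


# Two small graphs batched into one matrix; no entries connect the blocks.
batch = block_diagonal([[[0, 1], [1, 0]], [[0]]])
print(batch)
print(chebyshev_filter([[0, 1], [1, 0]], [1, 0], [1, 1, 1]))
```

Because the batched matrix has no entries linking one block to another, a filter applied to it never mixes features between different graphs, which is what makes this form of batching safe.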
Layout-parser is a deep learning document layout parser and image analysis framework. It provides a toolkit for extracting structural information and layout patterns from scanned documents and digital images, transforming them into programmatic data structures for automated analysis. The framework integrates layout detection with optical character recognition (OCR) to convert tabular regions into machine-readable data. It uses neural networks to identify and classify structural elements within document images without relying on manual rule-based systems. The system covers a broad range of document analysis features, including document structure parsing, automated table extraction, and hierarchical layout representation. It also includes visualization tools that overlay detected elements and hierarchies on the original images for result verification.
Offers a library for parsing document images into programmatic data structures for downstream analysis.
pdf2htmlEX is a PDF to HTML converter that transforms documents into web pages while preserving the original layout, fonts, and formatting. It functions as a layout engine and text extractor, mapping PDF coordinate data to HTML and CSS to maintain visual fidelity. The tool converts PDF content into searchable and selectable native HTML text by embedding original document fonts. It maintains document interactivity by preserving internal links, bookmarks, and outlines, converting them into functional web navigation. The conversion process supports flexible output structures, allowing documents
Converts the PDF table of contents into a structured web outline for easier navigation.
pdfminer is a Python library for parsing PDF files to extract text, analyze layouts, decrypt content, and convert documents into HTML or XML formats. It functions as a text extraction engine and layout analysis tool designed to retrieve characters and words while preserving the structural organization of the original document. The project provides utilities for converting PDF content into structured HTML or XML to maintain visual layout and a decryption tool for unlocking restricted documents using encryption keys. It identifies the positions and groupings of text elements to reconstruct page
Extracts hierarchical bookmark trees and table of contents from PDF documents.
nvim-surround is a Lua-based extension for Neovim designed to add, change, and delete surrounding delimiter pairs around text and code. It functions as a text object manipulator that wraps or removes brackets, quotes, and tags using motions and selections. The plugin integrates with Tree-sitter to identify structural code nodes, allowing for the precise surrounding of syntax elements based on the structural syntax tree. It also supports custom surround definitions, enabling users to define specialized delimiter pairs and aliases. The core capability surface covers basic surrounding operation
Uses Tree-sitter structural node querying to precisely identify and surround complex code blocks.
LuaSnip is a scriptable text expansion framework and Lua-based snippet engine. It allows for the creation of reusable text templates and complex nested structures that expand into a buffer using triggers and jumpable tabstops. The system distinguishes itself by using abstract syntax trees to trigger expansions based on structural code patterns rather than simple text matching. It features a multi-format importer capable of parsing snippet definitions from community standards such as LSP and SnipMate. The framework covers dynamic code generation through Lua functions, regex-based capture grou
Triggers a postfix snippet only when a specific Tree-sitter node sits in front of the trigger.
render-markdown.nvim is a Neovim plugin that transforms raw markdown syntax into a visually formatted layout directly inside the editor. It acts as a component visualizer and syntax highlighter, replacing standard markdown elements with custom symbols, icons, and formatted blocks to improve document readability. The plugin provides a toggle between rendered visual layouts and raw text views, allowing users to switch based on their current needs. It also applies markdown styling to injected content sections found within non-markdown file types. The system covers the visualization of various d
Uses tree-sitter grammars to precisely identify markdown elements for styling and icon placement.