Why is daytonaio/daytona a recommended Document Extraction Tools repository?

Extracts code symbols to facilitate navigation and structural analysis within the development environment.

Why is run-llama/llama_index a recommended Document Extraction Tools repository?

Provides specialized parsing and extraction pipelines that convert complex document formats into structured nodes for data analysis.

Why is Cinnamon/kotaemon a recommended Document Extraction Tools repository?

Parses Word documents into structured objects by converting embedded tables to CSV.

Why is letta-ai/letta a recommended Document Extraction Tools repository?

Parses text from PDF files to enable context-aware question answering by agents.

Why is llmware-ai/llmware a recommended Document Extraction Tools repository?

Parses and extracts structured elements like images, tables, and headers from complex file formats.

Why is idank/explainshell a recommended Document Extraction Tools repository?

Extracts structured flag and argument definitions from man pages using LLMs and roff macros.

Why is Unstructured-IO/unstructured a recommended Document Extraction Tools repository?

Captures and maps source-level access control lists into metadata to track permissions.

Why is bytebot-ai/bytebot a recommended Document Extraction Tools repository?

Extracts structured information from uploaded PDFs for data cross-referencing and document generation.

Why is autogluon/autogluon a recommended Document Extraction Tools repository?

Provides the ability to generate N-dimensional feature representations of documents for downstream similarity searches.

Why is py-pdf/pypdf a recommended Document Extraction Tools repository?

Retrieves permission flags from encrypted files to determine available user actions.

22 Repos

Awesome GitHub Repositories · Document Extraction Tools

Utilities for parsing, segmenting, and extracting structured data from complex file formats for downstream analysis.

Distinguishing note: Focuses on the extraction and structural normalization of unstructured document data, distinct from general database management.

Explore 22 awesome GitHub repositories matching data & databases · Document Extraction Tools.

daytonaio/daytona
72,416
Daytona is a cloud-native development environment platform designed to orchestrate ephemeral, containerized workspaces. It provides a centralized system for managing reproducible coding environments as code, ensuring consistency across distributed teams by abstracting the underlying infrastructure. By utilizing declarative configuration, the platform automates the entire lifecycle of development sandboxes, from initial provisioning to resource governance. The platform distinguishes itself through its infrastructure-agnostic runner layer, which allows development environments to be deployed ac
Extracts code symbols to facilitate navigation and structural analysis within the development environment.
TypeScript · agentic-workflow, ai, ai-agents
run-llama/llama_index
50,306
LlamaIndex is a comprehensive development framework designed to connect private or external data sources to large language models. It functions as a data-centric toolkit that enables the construction of retrieval-augmented generation systems, allowing developers to build applications that provide context-aware answers based on specific organizational information. The project distinguishes itself through a robust agentic orchestration engine that supports the creation of autonomous agents capable of multi-step reasoning, memory management, and complex tool execution. Beyond simple retrieval, i
Provides specialized parsing and extraction pipelines that convert complex document formats into structured nodes for data analysis.
Python · agents, application, data
Cinnamon/kotaemon
25,139
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Parses Word documents into structured objects by converting embedded tables to CSV.
Python · chatbot, llms, open-source
letta-ai/letta
21,168
Letta is a framework for building, deploying, and managing autonomous AI agents that maintain persistent state across long-term interactions. It provides a comprehensive suite of primitives for defining agents with configurable personas, modular memory blocks, and tool-use capabilities, enabling them to retain user preferences and conversation history over extended sessions. The platform distinguishes itself through its advanced memory management and orchestration capabilities. It allows agents to autonomously update their own memory, perform retrieval-augmented generation, and coordinate com
Parses text from PDF files to enable context-aware question answering by agents.
Python · ai, ai-agents, llm
llmware-ai/llmware
14,838
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVino. It features a specialized ability to translate natural lang
Parses and extracts structured elements like images, tables, and headers from complex file formats.
Python
idank/explainshell
14,084
ExplainShell is a shell command explainer and syntax analyzer that matches command line arguments to manual page documentation. It functions as a man page parser and documentation extraction tool, converting roff-formatted manual pages into a structured database of command options and metadata. The project uses a combination of large language models and roff-macro parsing to identify specific line ranges that define flags and arguments. It employs a command syntax analyzer to deconstruct shell commands into tokens, which are then mapped against documented entries to provide plain language exp
Extracts structured flag and argument definitions from man pages using LLMs and roff macros.
Python
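The roff-macro half of this extraction can be illustrated with a minimal standard-library sketch. It assumes options are written as `.TP` tagged paragraphs, a common man-page convention, and it leaves out the LLM step entirely. The names `clean`, `extract_options`, and `SAMPLE` are hypothetical and are not taken from ExplainShell's code.

```python
import re

# Matches roff font-change escapes such as \fB, \fI, \fR, \fP.
_FONT_ESCAPE = re.compile(r"\\f[BIRP]")


def clean(text: str) -> str:
    """Strip font escapes and turn roff's escaped hyphen into a plain '-'."""
    return _FONT_ESCAPE.sub("", text).replace("\\-", "-").strip()


def extract_options(roff_source: str) -> list[dict]:
    """Collect flag/description pairs from .TP tagged paragraphs.

    In the man macro package, the line after .TP is the tag (here, the flag)
    and the text lines that follow, up to the next macro line, are its body.
    """
    options = []
    lines = roff_source.splitlines()
    i = 0
    while i < len(lines):
        if lines[i].startswith(".TP") and i + 1 < len(lines):
            tag = clean(lines[i + 1])
            i += 2
            body = []
            # The body runs until the next line that starts with a macro dot.
            while i < len(lines) and not lines[i].startswith("."):
                body.append(clean(lines[i]))
                i += 1
            if tag.startswith("-"):  # keep only tags that look like flags
                options.append({"flags": tag, "description": " ".join(body)})
        else:
            i += 1
    return options


# Hypothetical man-page fragment used only for illustration.
SAMPLE = r"""
.SH OPTIONS
.TP
\fB\-a\fR, \fB\-\-all\fR
do not ignore entries starting with .
.TP
\fB\-l\fR
use a long listing format
"""

for option in extract_options(SAMPLE):
    print(option["flags"], "=>", option["description"])
```

This sketch does not handle tags that are themselves written with a macro (for example a tag line beginning with `.B`), which real man pages also use.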
Unstructured-IO/unstructured
14,019
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Captures and maps source-level access control lists into metadata to track permissions.
HTML · data-pipelines, deep-learning, document-image-analysis
bytebot-ai/bytebot
10,413
Bytebot is an LLM desktop automation framework and virtual Linux desktop environment. It enables AI agents to plan and execute mouse and keyboard actions on a virtual computer using natural language, allowing for autonomous desktop automation and the integration of legacy systems that lack native APIs. The system operates as an LLM API gateway and a Model Context Protocol server, routing requests across multiple language model providers with integrated load balancing and rate limiting. It provides isolated, containerized environments where agents use visual reasoning to interpret screenshots
Extracts structured information from uploaded PDFs for data cross-referencing and document generation.
TypeScript · agent, agentic-ai, agents
autogluon/autogluon
9,997
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Provides the ability to generate N-dimensional feature representations of documents for downstream similarity searches.
Python · autogluon, automated-machine-learning, automl
py-pdf/pypdf
9,818
pypdf is a Python library for parsing, manipulating, and generating PDF documents. It provides high-level operations for document processing, such as merging multiple files into one or splitting a single document into smaller files. The project includes specialized tools for managing interactive elements, including the creation and modification of annotations, hyperlinks, and form fields. It also supports advanced metadata management, allowing for the extraction and modification of standard document properties and XML-based XMP metadata. Beyond basic structural changes, the library covers pa
Retrieves permission flags from encrypted files to determine available user actions.
Python · help-wanted, pdf, pdf-documents
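Permission flags of this kind are stored in a PDF's encryption dictionary as a signed 32-bit integer (the `/P` entry), with one bit per user action. The following is a minimal standard-library sketch of decoding such a value. It does not use pypdf's own API, and the names `Permission` and `decode_permissions` are hypothetical.

```python
from enum import IntFlag


class Permission(IntFlag):
    """User-access bits of the /P entry (bit positions per the PDF specification)."""

    PRINT = 1 << 2                  # bit 3
    MODIFY = 1 << 3                 # bit 4
    EXTRACT = 1 << 4                # bit 5
    ANNOTATE = 1 << 5               # bit 6
    FILL_FORMS = 1 << 8             # bit 9
    EXTRACT_ACCESSIBILITY = 1 << 9  # bit 10
    ASSEMBLE = 1 << 10              # bit 11
    PRINT_HIGH_QUALITY = 1 << 11    # bit 12


def decode_permissions(p_value: int) -> set[str]:
    """Return the names of the actions allowed by a signed 32-bit /P value."""
    unsigned = p_value & 0xFFFFFFFF  # reinterpret two's complement as unsigned
    return {flag.name for flag in Permission if unsigned & flag}


print(sorted(decode_permissions(-44)))
```

For a `/P` value of -44, the sketch reports printing and extraction as allowed but not modification or annotation.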
github/semantic
9,041
Semantic is a Haskell-based library and command-line tool for polyglot source code analysis. It functions as a framework for static program analysis and as a polyglot parser for abstract syntax trees, converting various programming languages into structured syntax trees based on grammar definitions. The system is distinguished by a semantic code comparison engine that detects structural and content changes between code versions rather than relying on purely textual differences. It also enables analysis across different programming syntaxes by translating surface languages into a unified, polyglot intermediate representation. The framework offers a broad range of functionality for parsing languages such as Rust, Go, Python, Ruby, PHP, TypeScript, and TSX. It covers semantic analysis through code scope mapping, symbol extraction, and the generation of semantic graphs, complemented by tools for pattern analysis and the evaluation of program behavior. The toolset includes command-line utilities for standardizing Haskell source code files.
Provides specialized tools for identifying and indexing named identifiers and types within source code files.
Haskell
pymupdf/PyMuPDF
9,086
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
Identifies and retrieves tabular data and key-value pairs from document pages.
Python · data-science, epub, extract-data
kreuzberg-dev/kreuzberg
8,527
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Offers a CLI tool installable via Homebrew or Docker for extracting document content.
Rust · document-intelligence, elixir, ffi
tkipf/gcn
7,361
This is a graph convolutional network library designed for performing node and graph classification on graph-structured data. It functions as a framework for generating graph embeddings and implementing spectral convolutional neural networks to predict labels for nodes and entire graph structures. The library provides specialized tools for spectral graph convolutions, utilizing Chebyshev polynomial approximations to perform feature aggregation. It includes a multi-graph processing framework that manages batches of different graph instances through block-diagonal adjacency matrices and pooling
Generates low-dimensional vector representations of nodes based on their structural connectivity within a graph.
Python
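As a rough illustration of how node representations can be derived from connectivity, here is a pure-Python sketch of one graph-convolution step using the first-order rule ReLU(D^-1/2 (A + I) D^-1/2 · H · W). It is a simplification: it does not cover the Chebyshev polynomial variant mentioned above, it is not code from the repository, and the helper names `matmul`, `normalized_adjacency`, and `gcn_layer` are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    return [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]


def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ features @ weights)."""
    hidden = matmul(matmul(normalized_adjacency(adj), features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


# Two mutually connected nodes with one feature each and an identity weight.
embeddings = gcn_layer([[0, 1], [1, 0]], [[1.0], [3.0]], [[1.0]])
print(embeddings)
```

In this two-node example each output is the average of the two input features, approximately 2.0, which shows how a node's representation mixes in its neighbours' features.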
Layout-Parser/layout-parser
5,749
Layout-parser is a deep learning document layout parser and image analysis framework. It provides a toolkit for extracting structural information and layout patterns from scanned documents and digital images, transforming them into programmatic data structures for automated analysis. The framework integrates layout detection with optical character recognition (OCR) to convert tabular regions into machine-readable data. It uses neural networks to identify and classify structural elements within document images without relying on manual rule-based systems. The system covers a broad spectrum of document analysis functions, including document structure parsing, automated table extraction, and hierarchical layout representation. It also includes visualization tools that overlay detected elements and hierarchies on the original images for result verification.
Offers a library for parsing document images into programmatic data structures for downstream analysis.
Python
pdf2htmlEX/pdf2htmlEX
5,412
pdf2htmlEX is a PDF to HTML converter that transforms documents into web pages while preserving the original layout, fonts, and formatting. It functions as a layout engine and text extractor, mapping PDF coordinate data to HTML and CSS to maintain visual fidelity. The tool converts PDF content into searchable and selectable native HTML text by embedding original document fonts. It maintains document interactivity by preserving internal links, bookmarks, and outlines, converting them into functional web navigation. The conversion process supports flexible output structures, allowing documents
Converts the PDF table of contents into a structured web outline for easier navigation.
HTML · html, pdf, pdf-document-processor
euske/pdfminer
5,290
pdfminer is a Python library for parsing PDF files to extract text, analyze layouts, decrypt content, and convert documents into HTML or XML formats. It functions as a text extraction engine and layout analysis tool designed to retrieve characters and words while preserving the structural organization of the original document. The project provides utilities for converting PDF content into structured HTML or XML to maintain visual layout and a decryption tool for unlocking restricted documents using encryption keys. It identifies the positions and groupings of text elements to reconstruct page
Extracts hierarchical bookmark trees and table of contents from PDF documents.
Python
kylechui/nvim-surround
4,228
nvim-surround is a Lua-based extension for Neovim designed to add, change, and delete surrounding delimiter pairs around text and code. It functions as a text object manipulator that wraps or removes brackets, quotes, and tags using motions and selections. The plugin integrates with Tree-sitter to identify structural code nodes, allowing for the precise surrounding of syntax elements based on the structural syntax tree. It also supports custom surround definitions, enabling users to define specialized delimiter pairs and aliases. The core capability surface covers basic surrounding operation
Uses Tree-sitter structural node querying to precisely identify and surround complex code blocks.
Lua
L3MON4D3/LuaSnip
4,276
LuaSnip is a scriptable text expansion framework and Lua-based snippet engine. It allows for the creation of reusable text templates and complex nested structures that expand into a buffer using triggers and jumpable tabstops. The system distinguishes itself by using abstract syntax trees to trigger expansions based on structural code patterns rather than simple text matching. It features a multi-format importer capable of parsing snippet definitions from community standards such as LSP and SnipMate. The framework covers dynamic code generation through Lua functions, regex-based capture grou
Triggers a postfix snippet only when a specific tree-sitter node sits in front of the trigger.
Lua · lua, neovim, snippet-engine
MeanderingProgrammer/render-markdown.nvim
4,146
render-markdown.nvim is a Neovim plugin that transforms raw markdown syntax into a visually formatted layout directly inside the editor. It acts as a component visualizer and syntax highlighter, replacing standard markdown elements with custom symbols, icons, and formatted blocks to improve document readability. The plugin provides a toggle between rendered visual layouts and raw text views, allowing users to switch based on their current needs. It also applies markdown styling to injected content sections found within non-markdown file types. The system covers the visualization of various d
Uses tree-sitter grammars to precisely identify markdown elements for styling and icon placement.
Lua · lua, markdown, neovim