What are the best GitHub repositories for document and unstructured extraction?

Automated processes for parsing unstructured text, documents, or web content into structured, machine-readable formats. This list covers 51 GitHub repositories in the category data & databases · Document and Unstructured Extraction. Top picks: unclecode/crawl4ai, docling-project/docling, embedchain/embedchain, imartinez/privategpt, soxoj/maigret, supermemoryai/supermemory, cinnamon/kotaemon, openai/chatgpt-retrieval-plugin, datalab-to/surya, camel-ai/camel.


unclecode/crawl4ai
68,644 stars
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
Maps unstructured web content into predefined data structures using automated path selection or intelligent language model analysis.
Python
docling-project/docling
61,674 stars
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
Defines specific input types and file formats to ensure that documents are processed according to custom requirements.
Python · ai · convert · document-parser
embedchain/embedchain
58,769 stars
Embedchain is an LLM memory management framework and RAG orchestration engine designed to provide AI agents with a persistent storage layer. It functions as a long-term memory pipeline that extracts facts from unstructured interactions and stores them as permanent knowledge base entries to retain user preferences and interaction history across sessions. The system employs a hybrid vector database interface that combines semantic embeddings with traditional keyword search. It utilizes an entity-linking knowledge graph to connect related information points and applies temporal ranking to distin
Provides a pipeline to process unstructured interactions and isolate confirmed facts as permanent long-term memory entries.
Python
imartinez/privateGPT
57,281 stars
PrivateGPT is a private AI document assistant and local knowledge base manager designed for querying private files and documents using retrieval-augmented generation. It functions as a local language model application and API gateway, allowing users to obtain cited answers from unstructured data without sending information to external servers. The system differentiates itself by acting as a tool integrator that connects language models to external functions, including web search, tabular data analysis, and custom action extensions. It provides a standardized API layer that allows local infere
Parses various file formats and transforms unstructured text into machine-readable formats for local indexing.
Python
soxoj/maigret
33,154 stars
Maigret is an open-source intelligence framework designed for automated digital footprint discovery and identity investigation. It functions as a search engine that aggregates profile metadata by querying thousands of websites for specific usernames, mapping an individual's online presence across diverse platforms. The tool distinguishes itself through recursive discovery capabilities, which identify links within discovered profiles to expand the scope of an investigation automatically. It supports cross-platform identity correlation by mapping disparate accounts and pseudonymous personas, in
Custom parsing logic maps unstructured HTML and API responses into a unified data format for consistent cross-platform analysis.
Python · blueteam · cli · cybersecurity
supermemoryai/supermemory
27,334 stars
Supermemory is an artificial intelligence memory management platform designed to provide autonomous agents with persistent, long-term knowledge bases. It functions as a centralized repository that synchronizes multimodal data, enabling agents to maintain context and historical information across complex, multi-session workflows. By serving as a knowledge graph engine and vector database orchestrator, the platform ensures that information remains accessible and relevant for automated tasks. The system distinguishes itself through its hybrid indexing approach, which combines vector similarity s
Parses unstructured text into individual, standalone data points to ensure information is stored in a granular, searchable format.
TypeScript · cloudflare-kv · cloudflare-pages · cloudflare-workers
Cinnamon/kotaemon
25,139 stars
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Extracts text content from various unstructured file formats including office documents, images, and emails.
Python · chatbot · llms · open-source
openai/chatgpt-retrieval-plugin
21,192 stars
This project is a retrieval-augmented generation pipeline designed for building custom ChatGPT plugins that allow language models to query private or professional documents. It implements a full retrieval workflow, from processing and indexing document chunks to retrieving relevant context for natural language queries. The system distinguishes itself through a hybrid retrieval approach that combines dense vector embeddings with sparse keyword matching, further refined by a two-stage semantic re-ranking process. It includes specialized data privacy tools for screening personally identifiable i
Parses key information like authors and dates from unstructured text using a model to return structured JSON.
Python · chatgpt · chatgpt-plugins
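The hybrid retrieval idea mentioned for this entry (dense vector similarity combined with sparse keyword matching) can be illustrated with a small, generic sketch. This is not the plugin's code: the function names, the `alpha` weight, the document dictionaries, and the toy two-dimensional vectors are all assumptions made for illustration, and the re-ranking stage is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(query, text):
    """Fraction of distinct query words that also occur in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank docs by a weighted sum of dense and sparse scores.

    `alpha` is a hypothetical tuning knob: 1.0 means dense only, 0.0 sparse only.
    """
    def score(doc):
        dense = cosine(query_vec, doc["vec"])
        sparse = keyword_score(query, doc["text"])
        return alpha * dense + (1 - alpha) * sparse

    return [doc["id"] for doc in sorted(docs, key=score, reverse=True)]


# Toy corpus with made-up embeddings, purely for demonstration.
docs = [
    {"id": "b", "text": "holiday photos", "vec": [0.0, 1.0]},
    {"id": "a", "text": "invoice total due", "vec": [1.0, 0.0]},
]
ranking = hybrid_rank("invoice total", [1.0, 0.0], docs)
```

In a real pipeline the vectors would come from an embedding model and the sparse score from an index such as BM25; the weighted sum is only one of several ways to fuse the two signals.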
datalab-to/surya
20,889 stars
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks
Transforms unstructured documents like PDFs and images into structured machine-readable formats for business pipelines.
Python
camel-ai/camel
17,253 stars
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
Parses complex documents and images using OCR to convert unstructured files into machine-readable formats.
Python · agent · ai-societies · artificial-intelligence
h4ckf0r0day/obscura
16,110 stars
Obscura is a web scraping infrastructure and headless browser server designed for AI agents. It provides a system for AI models to control browser sessions, interact with websites, and extract web data using a WebSocket implementation of the Chrome DevTools Protocol. The project focuses on bot detection evasion by randomizing browser fingerprints, masking native functions, and blocking tracking scripts to mimic human behavior. It further secures identities through a traffic layer that routes network requests via HTTP or SOCKS5 proxies. The system supports large-scale data extraction through
Transforms structured HTML trees into flattened markdown to optimize token usage for large language models.
Rust · antidetect · antidetect-browser · browser
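As a rough illustration of flattening an HTML tree into Markdown to save tokens, the sketch below uses only Python's standard `html.parser`. It is not Obscura's implementation (the project is written in Rust); the class name, the small set of handled tags, and the sample page are assumptions chosen to keep the example short.

```python
from html.parser import HTMLParser


class MarkdownFlattener(HTMLParser):
    """Collects text from a few block tags and emits one Markdown line per block."""

    # Hypothetical, deliberately small tag-to-prefix mapping.
    BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "li": "- ", "p": ""}

    def __init__(self):
        super().__init__()
        self.lines = []    # finished Markdown lines
        self._prefix = ""  # prefix of the block currently being collected
        self._words = []   # words of the current block
        self._skip = 0     # depth inside <script>/<style>

    def _flush(self):
        if self._words:
            self.lines.append(self._prefix + " ".join(self._words))
        self._words = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in self.BLOCK_PREFIX:
            self._flush()
            self._prefix = self.BLOCK_PREFIX[tag]

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag in self.BLOCK_PREFIX:
            self._flush()

    def handle_data(self, data):
        # Drop script/style bodies; collapse whitespace by splitting into words.
        if self._skip == 0:
            self._words.extend(data.split())


def html_to_markdown(html):
    parser = MarkdownFlattener()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block tag
    return "\n".join(parser.lines)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>Hello <b>world</b></p>"
    "<ul><li>one</li><li>two</li></ul>"
    "<script>var x=1;</script></body></html>"
)
markdown = html_to_markdown(page)
```

Inline tags such as `<b>` are dropped and their text is joined with single spaces, so punctuation directly after an inline element picks up an extra space; a production converter would handle links, tables, and nesting as well.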
langbot-app/LangBot
15,311 stars
LangBot is an orchestration platform designed for building, managing, and deploying AI agents. It functions as a comprehensive framework for integrating large language models with custom workflows, enabling developers to connect intelligent agents to various messaging platforms and external tools. The platform distinguishes itself through a modular, plugin-based architecture that allows for the extension of agent capabilities via custom tools and file parsers. It features a secure, sandbox-isolated runtime environment that executes untrusted code and plugin logic within resource-constrained c
Converts binary files into structured text content to prepare data for indexing and retrieval.
Python · agent · coze · deepseek
getmaxun/maxun
15,049 stars
Maxun is an open-source web scraping and automation platform designed to transform dynamic website content into structured data. By leveraging artificial intelligence to interpret natural language prompts, the system identifies page elements and extracts information without requiring manual selector configuration. It serves as a bridge between raw web content and intelligent workflows, providing structured outputs in formats optimized for large language model ingestion and agent-based applications. The platform distinguishes itself through its ability to handle complex, authenticated, and dyn
Provides automated parsing of unstructured web content and documents into structured, machine-readable formats.
TypeScript · agents · api · automation
codelucas/newspaper
14,982 stars
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
Provides global and instance-specific settings for customizing extraction parameters like timeouts and content filtering.
HTML · crawler · crawling · news
OthmanAdi/planning-with-files
14,139 stars
Planning with files is an enterprise knowledge graph platform designed to transform unstructured organizational data into a searchable, interconnected network. By utilizing a graph-based retrieval-augmented generation engine, the system grounds language model outputs in verified internal data, ensuring that responses are explainable, traceable, and free from hallucinations. The platform distinguishes itself through a focus on data sovereignty and secure, private infrastructure deployment. It enables organizations to maintain full control over sensitive information by processing data locally o
Extracts entities and relationships from documents, emails, and tickets to build a structured network of organizational knowledge.
Python · adal · agent · agent-skills
Unstructured-IO/unstructured
14,019 stars
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Provides automated processes for parsing unstructured documents into structured, machine-readable formats for AI workflows.
HTML · data-pipelines · deep-learning · document-image-analysis
567-labs/instructor
13,176 stars
Instructor is a framework designed for structured data extraction, validation, and language model integration. It functions as a library that transforms unstructured text into validated, type-safe objects by leveraging schema definitions and model-specific tool-calling capabilities. By acting as a validation middleware, the project ensures that language model outputs strictly conform to defined data structures. The library distinguishes itself through a robust validation-based retry loop that automatically re-submits failed responses with error feedback to iteratively correct schema complianc
Configures underlying protocols like tool calling or constrained grammar sampling to optimize data extraction.
Python · openai · openai-function-calli · openai-functions
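The validation-based retry loop described for this entry can be sketched in plain Python. This does not use Instructor's API: the `Invoice` schema, `parse_invoice`, `extract_with_retries`, and the `fake_model` stand-in are all hypothetical names invented for illustration, and a real setup would call an actual language model and typically validate with a schema library.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invoice:
    # Hypothetical target schema, used only for this illustration.
    vendor: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Validate a raw model reply; raise ValueError describing the problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if not isinstance(data.get("vendor"), str):
        raise ValueError("field 'vendor' must be a string")
    total = data.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("field 'total' must be a number")
    return Invoice(vendor=data["vendor"], total=float(total))


def extract_with_retries(
    prompt: str, call_model: Callable[[str], str], max_retries: int = 2
) -> Invoice:
    """Call the model, validate the reply, and re-submit with error feedback."""
    message = prompt
    last_error = None
    for _ in range(max_retries + 1):
        raw = call_model(message)
        try:
            return parse_invoice(raw)
        except ValueError as exc:
            last_error = exc
            # Feed the validation error back so the next attempt can correct it.
            message = (
                f"{prompt}\n\nYour previous reply was rejected: {exc}\n"
                "Return corrected JSON only."
            )
    raise RuntimeError(f"no valid reply after {max_retries + 1} attempts: {last_error}")


def fake_model(message: str) -> str:
    # Stand-in for a real model call: fails first, succeeds once it sees feedback.
    if "rejected" in message:
        return '{"vendor": "Acme", "total": 12.5}'
    return '{"vendor": "Acme", "total": "twelve"}'


result = extract_with_retries("Extract the invoice.", fake_model)
```

Capping the number of attempts keeps a persistently malformed reply from looping forever; the final error carries the last validation message for debugging.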
nlp-compromise/compromise
12,122 stars
Compromise is a natural language processing library and rule-based engine designed for English text manipulation, analysis, and parsing. It provides a toolkit for tokenizing text, identifying parts of speech, and performing linguistic analysis to achieve semantic understanding of unstructured strings. The project distinguishes itself through its ability to programmatically transform grammar, such as modifying verb tenses, noun plurality, and adjective forms. It also functions as a named entity recognizer capable of extracting people, places, organizations, dates, and contact information from
Converts unstructured strings into organized data by identifying named entities, dates, and grammatical components.
JavaScript
h2oai/h2ogpt
12,016 stars
h2oGPT is a self-hosted platform designed for running large language models and executing retrieval-augmented generation workflows locally. It provides a comprehensive web interface that allows users to index private document collections into searchable databases, enabling context-aware question answering and summarization without exposing sensitive data to external services. The platform distinguishes itself by offering a modular architecture that supports both local model execution and connections to external inference servers. It facilitates the development of autonomous agents capable of
Extracts structured data from unstructured documents using optical character recognition and machine learning.
Python · ai · chatgpt · embeddings
tmc/langchaingo
9,416 stars
langchaingo is an LLM application framework for Go designed for building language model-powered applications and autonomous agents. It serves as an orchestration library and tool integration framework that allows developers to link prompt sequences and model calls into complex, multi-step workflows. The project provides a toolkit for implementing retrieval-augmented generation pipelines by processing unstructured documents and retrieving relevant context via vector search. It includes a dedicated integration layer for indexing high-dimensional embeddings and performing similarity searches acr
Provides automated processes for parsing unstructured text and documents into formats suitable for indexing.
Go