What are the best Schema-Driven Extraction GitHub repositories?

Tools that map unstructured web content into predefined data structures using automated path selection. Explore 11 awesome GitHub repositories matching data & databases · Schema-Driven Extraction. Refine with filters or upvote what's useful. Top picks: unclecode/crawl4ai, soxoj/maigret, axa-group/parsr, mishushakov/llm-scraper, datalab-to/chandra, lixin4ever/conference-acceptance-rate, zjunlp/deepke, seriouscache/uabe, browserbase/mcp-server-browserbase, any4ai/anycrawl.
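
As a rough illustration of the category described above, the sketch below maps a small HTML fragment into a predefined record. It is a minimal sketch using only the Python standard library and is not taken from any of the listed repositories; the schema layout, the `SchemaExtractor` class, the `extract` helper, and the sample markup are all hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

# Hypothetical schema: field name -> (tag, CSS class, type converter).
SCHEMA = {
    "title": ("h1", "product-title", str),
    "price": ("span", "price", float),
    "in_stock": ("span", "stock", lambda s: s.strip().lower() == "in stock"),
}


class SchemaExtractor(HTMLParser):
    """Collects the text of the first element matching each schema entry."""

    def __init__(self, schema):
        super().__init__()
        self.schema = schema
        self.raw = {}
        self._current = None  # field whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for field, (want_tag, want_class, _) in self.schema.items():
            if field not in self.raw and tag == want_tag and want_class in classes:
                self._current = field
                self.raw[field] = ""
                break

    def handle_data(self, data):
        if self._current is not None:
            self.raw[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: capture stops at the first closing tag of the same name.
        if self._current is not None and tag == self.schema[self._current][0]:
            self._current = None


def extract(html, schema):
    """Return a typed record, or raise ValueError if a schema field is absent."""
    parser = SchemaExtractor(schema)
    parser.feed(html)
    parser.close()
    record = {}
    for field, (_, _, convert) in schema.items():
        if field not in parser.raw:
            raise ValueError(f"missing field: {field}")
        record[field] = convert(parser.raw[field].strip())
    return record


SAMPLE_HTML = """
<div><h1 class="product-title">Widget</h1>
<span class="price">19.99</span>
<span class="stock">In stock</span></div>
"""

if __name__ == "__main__":
    print(extract(SAMPLE_HTML, SCHEMA))
```

The schema is the only part that changes from site to site in this sketch; the listed tools typically replace the hand-written tag and class rules with generated selectors or a language model.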

Why is soxoj/maigret a recommended Schema-Driven Extraction repository?

Custom parsing logic maps unstructured HTML and API responses into a unified data format for consistent cross-platform analysis.

Why is axa-group/parsr a recommended Schema-Driven Extraction repository?

Organizes extracted document fragments into a structured hierarchy based on target data definitions.

Why is mishushakov/llm-scraper a recommended Schema-Driven Extraction repository?

Defines the shape of data to extract from webpages with Zod or JSON schemas.

Why is datalab-to/chandra a recommended Schema-Driven Extraction repository?

Extracts structured data from documents by applying user-defined JSON schemas and returning citations to source locations.
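
To make the idea of schema-based extraction with citations concrete, here is a minimal sketch that returns each extracted value together with the character offsets it came from. It does not use or reproduce Chandra's API; the regex-based schema, the `extract_with_citations` function, and the sample invoice text are assumptions made purely for illustration.

```python
import re

# Hypothetical schema: field name -> (regex with one capture group, converter).
INVOICE_SCHEMA = {
    "invoice_number": (r"Invoice\s+#(\d+)", str),
    "total": (r"Total:\s*\$([\d.]+)", float),
}


def extract_with_citations(text, schema):
    """Return {field: {"value": ..., "source": {"start": ..., "end": ...}}}.

    Fields that cannot be located are mapped to None.
    """
    result = {}
    for field, (pattern, convert) in schema.items():
        match = re.search(pattern, text)
        if match is None:
            result[field] = None
            continue
        start, end = match.span(1)  # offsets of the captured value only
        result[field] = {
            "value": convert(match.group(1)),
            "source": {"start": start, "end": end},
        }
    return result


DOCUMENT = "Invoice #1042\nDue: 2024-01-31\nTotal: $250.00\n"

if __name__ == "__main__":
    for name, hit in extract_with_citations(DOCUMENT, INVOICE_SCHEMA).items():
        print(name, hit)
```

Slicing the original text with the reported offsets reproduces the cited span, which is what lets a reviewer check each extracted value against its source.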

Why is lixin4ever/conference-acceptance-rate a recommended Schema-Driven Extraction repository?

Employs schema-driven extraction to consistently parse submission and acceptance numbers from raw text files.

Why is zjunlp/deepke a recommended Schema-Driven Extraction repository?

Maps unstructured text to predefined structured formats and task descriptions for domain-specific knowledge extraction.

Why is seriouscache/uabe a recommended Schema-Driven Extraction repository?

Uses predefined data structures to isolate specific asset types from monolithic bundle files.

Why is browserbase/mcp-server-browserbase a recommended Schema-Driven Extraction repository?

Validates unstructured web content against predefined schemas to ensure consistent, typed data extraction.

Why is any4ai/anycrawl a recommended Schema-Driven Extraction repository?

Uses language models and JSON schemas to pull specific information from web pages into validated formats.

11 Repos

unclecode/crawl4ai
68,644
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
Maps unstructured web content into predefined data structures using automated path selection or intelligent language model analysis.
Python
soxoj/maigret
33,154
Maigret is an open-source intelligence framework designed for automated digital footprint discovery and identity investigation. It functions as a search engine that aggregates profile metadata by querying thousands of websites for specific usernames, mapping an individual's online presence across diverse platforms. The tool distinguishes itself through recursive discovery capabilities, which identify links within discovered profiles to expand the scope of an investigation automatically. It supports cross-platform identity correlation by mapping disparate accounts and pseudonymous personas, in
Custom parsing logic maps unstructured HTML and API responses into a unified data format for consistent cross-platform analysis.
Python · blueteam, cli, cybersecurity
axa-group/Parsr
6,178
Parsr is an unstructured-data extractor and document-parsing pipeline that converts raw files and images into cleaned, machine-readable formats. It functions as a document layout analyzer and a pipeline for extracting structured data and labels using large language models. The system includes a document-parsing visualizer that provides a graphical interface for uploading documents and inspecting the resulting structured output. The project covers document digitization workflows, including layout analysis to detect headings, tables, and lists, as well as automated data entry through cleaning and enriching unstructured content.
Organizes extracted document fragments into a structured hierarchy based on target data definitions.
JavaScript
mishushakov/llm-scraper
6,190
Defines the shape of data to extract from webpages with Zod or JSON schemas.
TypeScript · ai, artificial-intelligence, browser
datalab-to/chandra
4,833
Chandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSON schemas, with citations back to source locations. The service supports form filling in PDF and image documents, document generation from Markdown, and extraction of tracked changes from Word files. The platform distinguishes itself with pipeline-based processing chains that combine multiple proces
Extracts structured data from documents by applying user-defined JSON schemas and returning citations to source locations.
Python · ai, ocr
lixin4ever/Conference-Acceptance-Rate
4,757
This project is a tracker for academic competition and a metrics repository that provides historical submission and acceptance rates for major AI research conferences. It serves as a dataset of AI conference statistics for monitoring trends in research competition. The repository makes it possible to track conference acceptance rates in order to analyze historical data, assess publication competitiveness, and monitor the yearly growth of submission volumes across machine learning venues. The project is implemented as a static website that uses Markdown-based data storage and schema-driven parsing to render responsive data tables.
Employs schema-driven extraction to consistently parse submission and acceptance numbers from raw text files.
Jupyter Notebook
zjunlp/DeepKE
4,433
DeepKE is a knowledge extraction toolkit and framework designed to transform unstructured text into structured knowledge graphs. It provides a pipeline for identifying and classifying named entities, semantic relations, and events, and converts raw datasets into structured triples. The project uses large language models as tool callers through a standardized context protocol to drive automated data extraction processes. It supports schema-driven extraction across multiple domains and bilingual text, and uses joint entity and relation extraction to identify components in a single structured output. The toolkit includes functions for model training and fine-tuning, hyperparameter optimization, and data preparation via distant supervision and automated relation labeling. It also offers distributed GPU training, model memory optimization through quantization, and the ability to deploy trained models as inference services via API endpoints.
Maps unstructured text to predefined structured formats and task descriptions for domain-specific knowledge extraction.
Python
SeriousCache/UABE
4,137
UABE is a specialized toolset for extracting, modifying, and converting assets stored within Unity engine bundle files. It functions as an asset bundle extractor and a game modding utility designed to alter 3D meshes, textures, and audio within Unity games. The project includes an asset format converter that transforms internal Unity data into common file formats for external editing. It also features a mod installer generator to create standalone installation packages from modified asset bundle files. The software provides capabilities for game resource extraction and asset conversion, allo
Uses predefined data structures to isolate specific asset types from monolithic bundle files.
C++ · unity, unity3d
browserbase/mcp-server-browserbase
3,139
This project is an MCP browser automation server that connects large language models to headless cloud browsers. It functions as an autonomous web workflow engine and an LLM web agent interface, enabling the translation of natural language instructions into browser actions and structured data retrieval. The system distinguishes itself through a managed headless browser cloud API that supports concurrent Chromium sessions with integrated stealth modes, CAPTCHA solving, and proxy traffic routing. It utilizes self-healing element selection to maintain automation resilience when page structures c
Validates unstructured web content against predefined schemas to ensure consistent, typed data extraction.
TypeScript · ai, browser, chrome
any4ai/AnyCrawl
2,742
AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol. The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It utilizes language models and JSON schemas to pull specific information into validated formats and provides capabilities for AI page summarization and LLM-optimized content extraction. The system manages comprehensi
Uses language models and JSON schemas to pull specific information from web pages into validated formats.
TypeScript · ai-scraping, aitools, crawl
negokaz/excel-mcp-server
973
This project is a Model Context Protocol server that lets AI assistants interact directly with Microsoft Excel files. It acts as a bridge through which external systems can read, write, and edit spreadsheet data via a standardized interface. By supporting both direct file manipulation and headless automation, the server offers a comprehensive tool for programmatic workbook management. The server stands out by combining data processing functions with a visual rendering pipeline. It can generate image snapshots of specific sheet ranges and capture screenshots of the active application window, which provides visual context for automated tasks and reports. These functions let users extract structured data while also visualizing and documenting the state of complex workbooks. Beyond basic data operations, the tool supports extensive workflows for restructuring and formatting workbooks. It allows worksheets to be created, duplicated, and organized, and cell styles and number formats to be applied programmatically. To keep large-scale operations stable, the system implements pagination-based data streaming that optimizes memory use and data transfer for large files.
Parses raw spreadsheet content into structured formats to enable programmatic interpretation and analysis of tabular data.
Go