What are the best Awesome Schema-Driven Extraction GitHub Repositories?

Tools that map unstructured web content into predefined data structures using automated path selection. Explore 11 awesome GitHub repositories matching data & databases · Schema-Driven Extraction. Refine with filters or upvote what's useful. Top picks: unclecode/crawl4ai, soxoj/maigret, axa-group/parsr, mishushakov/llm-scraper, datalab-to/chandra, lixin4ever/conference-acceptance-rate, zjunlp/deepke, seriouscache/uabe, browserbase/mcp-server-browserbase, any4ai/anycrawl.

Why is soxoj/maigret a recommended Schema-Driven Extraction repository?

Custom parsing logic maps unstructured HTML and API responses into a unified data format for consistent cross-platform analysis.

Why is axa-group/parsr a recommended Schema-Driven Extraction repository?

Organizes extracted document fragments into a structured hierarchy based on target data definitions.

Why is mishushakov/llm-scraper a recommended Schema-Driven Extraction repository?

Defines the shape of data to extract from webpages with Zod or JSON schemas.

Why is datalab-to/chandra a recommended Schema-Driven Extraction repository?

Extracts structured data from documents by applying user-defined JSON schemas and returning citations to source locations.

Why is lixin4ever/conference-acceptance-rate a recommended Schema-Driven Extraction repository?

Employs schema-driven extraction to consistently parse submission and acceptance numbers from raw text files.

Why is zjunlp/deepke a recommended Schema-Driven Extraction repository?

Maps unstructured text to predefined structured formats and task descriptions for domain-specific knowledge extraction.

Why is seriouscache/uabe a recommended Schema-Driven Extraction repository?

Uses predefined data structures to isolate specific asset types from monolithic bundle files.

Why is browserbase/mcp-server-browserbase a recommended Schema-Driven Extraction repository?

Validates unstructured web content against predefined schemas to ensure consistent, typed data extraction.

Why is any4ai/anycrawl a recommended Schema-Driven Extraction repository?

Uses language models and JSON schemas to pull specific information from web pages into validated formats.

11 repositories

Awesome GitHub Repositories · Schema-Driven Extraction

Tools that map unstructured web content into predefined data structures using automated path selection.

Explore 11 awesome GitHub repositories matching data & databases · Schema-Driven Extraction. Refine with filters or upvote what's useful.


unclecode/crawl4ai
68,644 · View on GitHub
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
Maps unstructured web content into predefined data structures using automated path selection or intelligent language model analysis.
Python
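The "predefined data structures using automated path selection" idea in the summary above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not Crawl4AI's own API: the `SCHEMA` layout, the `extract` helper, and the sample markup are all hypothetical, and the sample is deliberately well-formed so that `xml.etree.ElementTree` can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: one path selecting each record, plus one path per field.
SCHEMA = {
    "base": ".//li[@class='product']",
    "fields": {
        "name": "./h2",
        "price": "./span[@class='price']",
    },
}

# Invented, well-formed sample markup (real-world HTML usually needs a lenient parser).
PAGE = """
<html><body><ul>
  <li class="product"><h2>Widget</h2><span class="price">9.99</span></li>
  <li class="product"><h2>Gadget</h2><span class="price">19.50</span></li>
</ul></body></html>
"""

def extract(markup, schema):
    """Apply the schema's paths to the markup and return one dict per matched item."""
    root = ET.fromstring(markup.strip())
    records = []
    for item in root.findall(schema["base"]):
        record = {}
        for field, path in schema["fields"].items():
            node = item.find(path)
            # Missing nodes become None so every record has the same keys.
            record[field] = node.text.strip() if node is not None and node.text else None
        records.append(record)
    return records

records = extract(PAGE, SCHEMA)
```

Because the schema is plain data, the same `extract` function can be pointed at a different page layout by swapping the paths, which is the main appeal of the schema-driven approach.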
soxoj/maigret
33,154 · View on GitHub
Maigret is an open-source intelligence framework designed for automated digital footprint discovery and identity investigation. It functions as a search engine that aggregates profile metadata by querying thousands of websites for specific usernames, mapping an individual's online presence across diverse platforms. The tool distinguishes itself through recursive discovery capabilities, which identify links within discovered profiles to expand the scope of an investigation automatically. It supports cross-platform identity correlation by mapping disparate accounts and pseudonymous personas, in
Custom parsing logic maps unstructured HTML and API responses into a unified data format for consistent cross-platform analysis.
Python · blueteam · cli · cybersecurity
axa-group/Parsr
6,178 · View on GitHub
Parsr is an unstructured data extractor and document parsing pipeline that converts raw files and images into clean, machine-readable formats. It functions as a document layout analyzer and a pipeline for extracting structured data and labels using large language models (LLMs). The system includes a document parsing viewer, providing a graphical interface for uploading documents and inspecting the structured data output. The project covers document digitization workflows, including layout analysis to detect headings, tables, and lists, as well as automated data entry through cleaning and enriching unstructured content.
Organizes extracted document fragments into a structured hierarchy based on target data definitions.
JavaScript
mishushakov/llm-scraper
6,190 · View on GitHub
Defines the shape of data to extract from webpages with Zod or JSON schemas.
TypeScript · ai · artificial-intelligence · browser
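Declaring the shape of the data up front, as the llm-scraper summary describes, means any extracted record can be checked against that shape before it is used. The sketch below shows the idea in Python with a tiny hand-written checker rather than Zod or a full JSON Schema library. `ARTICLE_SCHEMA` and `validate` are hypothetical names, and only the `type`, `properties`, `required`, and `items` keywords are handled.

```python
# Hypothetical schema written in a small subset of JSON Schema.
ARTICLE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "points": {"type": "integer"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["title", "points"],
}

_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}

def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value conforms."""
    errors = []
    expected = schema.get("type")
    if expected:
        ok = isinstance(value, _TYPES[expected])
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            return [f"{path}: expected {expected}, got {type(value).__name__}"]
    if expected == "object":
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub, f"{path}.{key}"))
    elif expected == "array" and "items" in schema:
        for i, item in enumerate(value):
            errors.extend(validate(item, schema["items"], f"{path}[{i}]"))
    return errors
```

A caller would typically run `validate(record, ARTICLE_SCHEMA)` on each model-produced record and retry or discard the record when the returned list is not empty.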
datalab-to/chandra
4,833 · View on GitHub
Chandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSON schemas, with citations back to source locations. The service supports form filling in PDF and image documents, document generation from Markdown, and extraction of tracked changes from Word files. The platform distinguishes itself with pipeline-based processing chains that combine multiple proces
Extracts structured data from documents by applying user-defined JSON schemas and returning citations to source locations.
Python · ai · ocr
lixin4ever/Conference-Acceptance-Rate
4,757 · View on GitHub
This project is an academic competition tracker and metrics repository that provides historical submission and acceptance rates for major artificial intelligence research conferences. It serves as a dataset of AI conference statistics for monitoring trends in research competition. The repository supports tracking conference acceptance rates to analyze historical data, assess publication competitiveness, and monitor year-over-year growth in submission volumes across machine learning conferences. The project is implemented as a static site that uses Markdown-based data storage and schema-based parsing to render responsive data tables.
Employs schema-driven extraction to consistently parse submission and acceptance numbers from raw text files.
Jupyter Notebook
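Parsing submission and acceptance numbers out of raw text into one fixed record type can be sketched as follows. The line format, the venue names, and the figures in `SAMPLE` are invented for illustration and are not taken from the repository; `AcceptanceRecord` and `parse_lines` are likewise hypothetical names.

```python
import re
from dataclasses import dataclass

@dataclass
class AcceptanceRecord:
    venue: str
    accepted: int
    submitted: int

    @property
    def rate(self) -> float:
        # Recompute the rate from the counts instead of trusting the printed percentage.
        return self.accepted / self.submitted

# Assumed line format: "<venue> <rate>% (<accepted>/<submitted>)"
LINE_RE = re.compile(
    r"^(?P<venue>\S+)\s+[\d.]+%\s+\((?P<accepted>\d+)/(?P<submitted>\d+)\)"
)

def parse_lines(text):
    """Return one AcceptanceRecord per line that matches the assumed format."""
    records = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            records.append(
                AcceptanceRecord(m["venue"], int(m["accepted"]), int(m["submitted"]))
            )
    return records

# Invented sample data; lines that do not match the format are skipped.
SAMPLE = """
VenueA'20 25.0% (50/200)
not a data line
VenueB'21 10.0% (30/300)
"""
records = parse_lines(SAMPLE)
```

Keeping the counts as integers and deriving the rate on demand avoids rounding drift between the printed percentage and the underlying numbers.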
zjunlp/DeepKE
4,433 · View on GitHub
DeepKE is a knowledge extraction toolkit and framework designed to transform unstructured text into structured knowledge graphs. It provides a pipeline for identifying and classifying named entities, semantic relations, and events, converting raw datasets into structured triples. The project utilizes large language models as tool callers through a standardized context protocol to drive automated data extraction processes. It supports schema-driven extraction across multiple domains and bilingual text, employing joint entity and relation extraction to identify components in a single structured
Maps unstructured text to predefined structured formats and task descriptions for domain-specific knowledge extraction.
Python
SeriousCache/UABE
4,137 · View on GitHub
UABE is a specialized toolkit for extracting, modifying, and converting assets stored in Unity engine bundle files. It functions as an asset bundle extractor and a game modding utility, designed for modifying 3D meshes, textures, and audio in Unity games. The project includes an asset format converter that turns internal Unity data into common file formats for external editing. It also has a mod installer generator for creating standalone installation packages from modified asset bundle files. The software provides capabilities for game resource extraction and asset conversion, letting users recover embedded files and convert proprietary internal assets into standard media types.
Uses predefined data structures to isolate specific asset types from monolithic bundle files.
C++ · unity · unity3d
browserbase/mcp-server-browserbase
3,139 · View on GitHub
This project is an MCP browser automation server that connects large language models to headless cloud browsers. It functions as an autonomous web workflow engine and an LLM web agent interface, enabling the translation of natural language instructions into browser actions and structured data retrieval. The system distinguishes itself through a managed headless browser cloud API that supports concurrent Chromium sessions with integrated stealth modes, CAPTCHA solving, and proxy traffic routing. It utilizes self-healing element selection to maintain automation resilience when page structures c
Validates unstructured web content against predefined schemas to ensure consistent, typed data extraction.
TypeScript · ai · browser · chrome
any4ai/AnyCrawl
2,742 · View on GitHub
AnyCrawl is an AI-powered data extractor, automated web crawler, and headless browser orchestrator. It serves as a web content extraction API and a gateway that connects crawling and scraping tools to language models using a standardized API protocol. The project specializes in converting unstructured website content into structured JSON or markdown optimized for AI assistants. It utilizes language models and JSON schemas to pull specific information into validated formats and provides capabilities for AI page summarization and LLM-optimized content extraction. The system manages comprehensi
Uses language models and JSON schemas to pull specific information from web pages into validated formats.
TypeScript · ai-scraping · aitools · crawl
negokaz/excel-mcp-server
973 · View on GitHub
This project is a Model Context Protocol server that lets AI assistants interact directly with Microsoft Excel files. It acts as a bridge, allowing external systems to read, write, and modify spreadsheet data through a standardized interface. Supporting both direct file manipulation and headless application automation, the server provides a comprehensive utility for programmatic workbook management. The server distinguishes itself by combining data processing capabilities with a visual rendering pipeline. It can generate screenshots of specific spreadsheet ranges and capture the active application interface, providing visual context for automated tasks and reporting. These features let users extract structured data while retaining the ability to view and document the state of complex workbooks. Beyond basic operations, the tool supports extensive restructuring and formatting workflows. It allows creating, duplicating, and organizing worksheets, as well as programmatically applying cell styles and number formats. To ensure stability during large-scale operations, the system implements pagination-based data streaming, which optimizes memory use and data transfer when handling large files.
Parses raw spreadsheet content into structured formats to enable programmatic interpretation and analysis of tabular data.
Go