Why is mendableai/firecrawl a recommended Structured Data Extractors repository?

Transforms unstructured web pages and documents into standardized, machine-readable formats using natural language prompts.

Why is docling-project/docling a recommended Structured Data Extractors repository?

Identifies and transforms complex document layouts into standardized, machine-readable information.

Why is guardrails-ai/guardrails a recommended Structured Data Extractors repository?

Generates validated JSON or schema-based structured data from free-form model responses using function calling.

Why is matthewmueller/x-ray a recommended Structured Data Extractors repository?

Provides a selector-based parser to retrieve text and attributes from HTML as structured nested objects or arrays.

Why is rchipka/node-osmosis a recommended Structured Data Extractors repository?

Implements CSS and XPath selectors to extract structured data from HTML and XML documents.

Why is oblac/jodd a recommended Structured Data Extractors repository?

Provides an HTML parser that allows element extraction using CSS3 selector patterns.

Why is EpicenterHQ/epicenter a recommended Structured Data Extractors repository?

Projects language model outputs into structured, schema-validated tables for advanced knowledge management.

8 repositories

Awesome GitHub Repositories: Structured Data Extractors

Tools that identify and transform unstructured document content into standardized, machine-readable formats.

Explore 8 GitHub repositories in Data & Databases · Structured Data Extractors.


mendableai/firecrawl · 139,399 on GitHub
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Transforms unstructured web pages and documents into standardized, machine-readable formats using natural language prompts.
TypeScript
opendatalab/MinerU · 67,734 on GitHub
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences with geometric heuristics, the system reconstructs the reading order and structural hierarchy of documents to ensure accurate data representation. The project distinguishes itself through a multi-stage processing workflow that integrates layout detection, optical character recogn
Transforms unstructured document content into standardized, machine-readable formats for automated information retrieval.
Python · ai4science · document-analysis · extract-data
docling-project/docling · 61,674 on GitHub
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
Identifies and transforms complex document layouts into standardized, machine-readable information.
Python · ai · convert · document-parser
guardrails-ai/guardrails · 7,033 on GitHub
Guardrails is a Python SDK that wraps calls to large language models with configurable validation pipelines, corrective actions, and structured output generation. It provides a unified API layer that connects to over 100 language models, applying consistent validation, streaming, and error-handling across providers. The framework validates and corrects model responses against safety and quality rules, detecting and mitigating risks in both inputs and outputs using pre-built and custom validators. The project distinguishes itself through a validator-pipeline architecture that sequentially appl
Generates validated JSON or schema-based structured data from free-form model responses using function calling.
Python · ai · foundation-model · gpt-3
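
The idea of turning a free-form model response into validated structured data can be sketched in plain Python. This is a minimal illustration that uses only the standard library and does not use the Guardrails API; the `SCHEMA` mapping, the `extract_json` and `validate` helpers, and the sample response are all hypothetical names chosen for this example.

```python
import json
import re

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"name": str, "price": float, "in_stock": bool}


def extract_json(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in free-form text."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))


def validate(record: dict, schema: dict) -> dict:
    """Return the record restricted to schema fields, or raise on a mismatch."""
    clean = {}
    for field, expected in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        value = record[field]
        # Simple corrective action: accept an int where a float is expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise ValueError(f"{field}: expected {expected.__name__}")
        clean[field] = value
    return clean


# A made-up model response with prose around the JSON payload.
response = (
    "Sure! Here is the data: "
    '{"name": "Widget", "price": 12, "in_stock": true, "note": "n/a"}'
)
result = validate(extract_json(response), SCHEMA)
```

Fields outside the schema (here `note`) are dropped, and a missing or wrongly typed field raises `ValueError`, which a caller could use as the trigger for a retry.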
matthewmueller/x-ray · 5,904 on GitHub
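X-ray is a headless-browser web crawler and HTML content scraper designed to extract structured data from websites. It works as a stream-based data scraper and structured data extractor, using selectors to retrieve text and attributes from HTML as nested objects or arrays. The project includes a request rate controller that manages network traffic through concurrency limits, throttling, and timeouts. It handles dynamic websites by rendering JavaScript in a headless browser, and it crawls sites automatically using breadth-first link following and pagination management. The system provides a data pipeline that applies functional value transformations to raw strings and writes the results to a readable stream, which prevents memory exhaustion during large-scale scraping jobs.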
Provides a selector-based parser to retrieve text and attributes from HTML as structured nested objects or arrays.
JavaScript
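
Selector-based extraction of text and attributes into a list of objects can be sketched as follows. X-ray itself is a JavaScript library; this is a Python standard-library illustration of the general technique, not X-ray's API. The `ClassTextExtractor` class and the sample HTML are hypothetical, and the "selector" is reduced to a single class name.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect the text and href of every element carrying a given class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.items = []       # one dict per matching element
        self._current = None  # item being filled while inside a match
        self._tag = None      # tag name of the matching element
        self._depth = 0       # nesting depth of same-named tags

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if self._current is not None:
            if tag == self._tag:
                self._depth += 1  # nested element with the same tag name
            return
        if self.css_class in (attr_map.get("class") or "").split():
            self._current = {"text": "", "href": attr_map.get("href")}
            self._tag, self._depth = tag, 1

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current["text"] = self._current["text"].strip()
            self.items.append(self._current)
            self._current = None


# Made-up markup for the example.
html = """
<ul>
  <li><a class="repo" href="/a/one">One</a></li>
  <li><a class="repo" href="/b/two"> Two </a></li>
  <li><a class="other" href="/c/three">Three</a></li>
</ul>
"""
parser = ClassTextExtractor("repo")
parser.feed(html)
parser.close()
items = parser.items
```

Each match becomes a small dictionary, so the output is already in the nested object-or-array shape that the description above refers to.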
rchipka/node-osmosis · 4,110 on GitHub
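This project is a Node.js web scraping framework designed to automate data extraction through programmatic workflows of requests, parsing, and document interaction. It serves as a headless web crawler, an HTTP request manager, and a DOM parser and extractor. The framework stands out by combining a JavaScript execution engine for interacting with dynamic content with a hybrid selection system that uses both CSS and XPath selectors. It includes dedicated middleware for proxy rotation and cookie-jar session management to maintain authentication state and manage automated traffic. Its broader functionality covers recursive link crawling, pagination handling, and web form automation. The tool also provides traffic management features such as request rate limiting through timed delays and custom HTTP header configuration.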
Implements CSS and XPath selectors to extract structured data from HTML and XML documents.
JavaScript
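
To show what XPath-driven extraction from an XML document looks like, here is a minimal Python sketch. It is not node-osmosis code (which is JavaScript). It relies on the limited XPath subset supported by the standard library's `xml.etree.ElementTree`, and it does not cover CSS selectors. The catalog document and field names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up XML document for the example.
xml_doc = """
<catalog>
  <book id="b1"><title>Parsing</title><price>30</price></book>
  <book id="b2"><title>Crawling</title><price>45</price></book>
</catalog>
"""
root = ET.fromstring(xml_doc)

# Select every <book> descendant and map each one to a flat record.
records = [
    {
        "id": book.get("id"),
        "title": book.findtext("title"),
        "price": int(book.findtext("price")),
    }
    for book in root.findall(".//book")
]

# A predicate narrows the selection to books whose <price> text is '45'.
expensive = [book.get("id") for book in root.findall(".//book[price='45']")]
```

The path expression does the selection and the comprehension does the shaping, which keeps the two concerns separate.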
oblac/jodd · 4,059 on GitHub
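Jodd is a set of lightweight Java extensions and standard-library utilities designed for application configuration, database mapping, dependency injection, and HTML parsing. It provides a consolidated set of core tools to ease Java development, with a zero-dependency core that ensures compatibility across environments and a small footprint. The project features a pragmatic dependency injection container for managing object lifecycles, and a database mapper that uses SQL templates to map result sets directly to Java objects. It includes a dedicated configuration manager that supports profiles, sections, and macros, and an HTML parser that extracts elements using CSS3 selectors. Other features cover network communication through a lightweight HTTP client, JSON serialization, and sending and receiving email. The toolkit also provides utilities for data validation, type conversion, transaction management, and generating dynamic proxies for behavior interception.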
Provides an HTML parser that allows element extraction using CSS3 selector patterns.
Java · aop · database · html-parser
EpicenterHQ/epicenter · 4,091 on GitHub
Epicenter is a local-first knowledge management system and data orchestrator designed to structure information generated by large language models into validated schemas. It functions as a storage architecture that persists application data in human-readable files and databases to ensure user ownership and portability. The system distinguishes itself by projecting language model outputs into structured, schema-validated tables and utilizing conflict-free replicated data types to synchronize application state across multiple devices without a central server. This allows for offline access and c
Projects language model outputs into structured, schema-validated tables for advanced knowledge management.
TypeScript · svelte · sveltekit · tailwindcss

