15 repositories

Awesome GitHub Repositories: Data Extraction Tools

Utilities for querying and exporting structured data from application workspaces.

Distinguishing note: Focuses on the extraction of workspace data into external formats, distinct from general data storage.

Explore 15 awesome GitHub repositories matching data & databases · Data Extraction Tools.


datalab-to/marker
36,137
Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale. The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized
A specialized engine that identifies and maps specific information from unstructured documents into predefined schemas for programmatic use.
Python
simstudioai/sim
28,796
This project is an AI agent orchestration platform that provides a visual environment for building, testing, and deploying complex automation workflows. It functions as a low-code development interface where users can chain discrete functional blocks into dependency-aware pipelines to integrate artificial intelligence with external data and services. The platform supports the creation of intelligent conversational agents, automated business processes, and multi-service API orchestrations within a unified workspace. The platform distinguishes itself through its event-driven integration engine,
A workflow platform that scrapes, searches, crawls, maps, and extracts structured data from websites to support web-based information gathering and content processing.
TypeScript · agent-workflow · agentic-workflow · agents
beekeeper-studio/beekeeper-studio
22,030
Beekeeper Studio is a cross-platform desktop application designed for database management and SQL development. It provides a unified graphical interface to connect to, query, and modify data across a wide range of relational and NoSQL database systems. The application functions as a comprehensive workspace, integrating tools for schema design, record editing, and data visualization. The project distinguishes itself through a focus on secure, flexible connectivity and AI-assisted workflows. It supports advanced authentication methods, including enterprise single sign-on, multi-factor authentic
Extracts individual table rows from query results for use in external tools.
TypeScript · bigquery · cassandra · cockroachdb
microsoft/AirSim
17,956
AirSim is a high-fidelity simulation platform designed for the development and testing of autonomous vehicles. Built as a plugin for game engines, it provides a physics-based environment that models vehicle dynamics and sensor data, serving as a foundation for robotics research, computer vision training, and reinforcement learning. The platform distinguishes itself through its support for hardware-in-the-loop and software-in-the-loop testing, allowing developers to validate control logic and firmware against real-world signals or concurrent processes. It offers extensive programmatic control
Exports static geometry and mesh data from the simulation environment for external analysis.
C++ · ai · airsim · artificial-intelligence
laramies/theHarvester
15,687
theHarvester is a command-line utility designed for gathering open-source intelligence and mapping an organization's external attack surface. It functions as a security information gathering framework that automates the collection of publicly available data to assist in reconnaissance and threat analysis. The tool utilizes a plugin-based architecture to execute isolated queries against various search engines and public databases. It employs asynchronous task execution to run multiple discovery operations in parallel, while a centralized pipeline aggregates and deduplicates findings from these
Automates the collection of public-facing digital assets and intelligence to map an organization's external attack surface.
Python · blueteam · discovery · emails
shengqiangzhang/examples-of-web-crawlers
14,651
This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data archiving. The toolset includes specialized mechanisms for bypassing anti-scraping measures through IP proxy pool rotation and multi-threaded crawlers. It also features capabilities for simulating browser sessions to handle authentication, intercepting session cookies, and decrypting network payloads
Extracts and analyzes user profile data, friendship statistics, and social interactions from messaging platforms.
HTML · agent-pool · crawler · example
dataabc/weiboSpider
9,630
weiboSpider is a Python web scraper and social media crawler designed to extract user profiles, posts, and engagement metrics from Sina Weibo. It functions as an automated data pipeline for academic research and trend analysis, collecting long-form text and multimedia content. The tool distinguishes itself through the use of browser session cookies to authenticate requests and access protected profiles. It implements randomized request pacing and global pauses to manage traffic and avoid platform rate limits, while supporting incremental crawling to capture only new content based on timestamp
Gathers detailed user profiles and published posts from Sina Weibo for academic research and trend analysis.
Python · help-wanted · python · python3
olifolkerd/tabulator
7,550
Tabulator is an interactive data table library and virtual DOM data grid used to create high-performance tables from JSON or arrays. It functions as a hierarchical data viewer and a spreadsheet interface component, capable of rendering thousands of records efficiently through viewport-based virtualization and progressive loading. The library distinguishes itself by providing a full spreadsheet interface mode with multi-sheet management, cell range selection, and bulk copy-paste capabilities. It supports complex data architectures, including nested data field mapping, expandable tree structure
Extracts the current contents of a sheet into array format for storage or processing.
JavaScript · ajax · cdnjs · data
subzeroid/instagrapi
6,366
Fetches public Threads profile metadata, posts, replies, and network relationships for analytics and monitoring.
Python · api-wrapper · instabot · instagram
JustAnotherArchivist/snscrape
5,398
snscrape is a Python-based social media scraper and crawler designed to extract public posts, profiles, and hashtags from social networks without using official APIs. It functions as an archival tool and a utility for open-source intelligence (OSINT) data collection, gathering publicly available information for researching trends and people. The tool supports social media data extraction for research and archival purposes, enabling the creation of historical records of conversations and user activity. It supports workflows for academic social analysis and the export of large sets of metadata and messages to local files. Capabilities include scraping several social media platforms and limiting the volume of extracted results. The system can export discovered items as lists of URLs or as detailed files containing content and timestamps.
Gathers public posts, user profiles, and hashtags from social platforms for analysis and archival.
Python
SpiderClub/weibospider
4,787
Weibospider is a distributed web crawler designed to extract posts, profiles, and interaction data from the Weibo social network. It functions as a social media data extractor that uses a distributed task queue to scale scraping operations across multiple worker nodes. The system includes a graphical administrative interface for configuring crawler settings, target user identifiers, and search keywords. It employs a distributed architecture to increase data throughput and manage large-scale collection of social media content. The tool covers a wide range of data collection capabilities, including user profile harvesting, keyword-based search extraction, and social graph mapping through follower lists, comments, and reposts. It also has mechanisms for request rate regulation, account rotation, and automation of recurring tasks to maintain session persistence and continuous data collection.
Extracts public social platform profile metadata, posts, and network relationships for large-scale analysis.
Python · data-analysis · distributed-crawler · python3
FxEmbed/FxEmbed
4,737
FxEmbed is a collection of specialized services providing a social media data API, a social media embed gateway, and a URL unshortener and sanitizer. It functions as an edge-deployed content proxy designed to programmatically fetch posts, threads, profiles, and search results from various social platforms. The project transforms social media links into rich media previews and interactive embeds for messaging platforms. It also expands shortened links to their original destinations while removing tracking parameters to improve user privacy and security. The system includes capabilities for so
Retrieves posts, threads, and profile metadata from social platforms via a standardized interface.
TypeScript
dataabc/weibo-crawler
4,541
This project is a Sina Weibo web scraper and social media data pipeline designed to extract user profiles, posts, comments, and multimedia assets. It functions as a containerized data crawler that automates the collection and local storage of social media content and interaction metrics. The system includes a processing layer that uses large language models (LLMs) to analyze the extracted text, generating summaries and sentiment analysis. It distinguishes itself with a deployment-ready container model that provides an HTTP interface for managing extraction tasks and monitoring job progress. The crawler covers a wide range of capabilities, including social media monitoring through scheduled incremental updates, archiving multimedia assets to local disks, and exporting data in multiple formats to flat files or databases. It also captures detailed social interactions, such as first-level comments and reposts.
Retrieves detailed post information including timestamps, interaction counts, hashtags, and publication tools.
Python · crawler · weibo · weibo-spider
nghuyong/WeiboSpider
4,086
WeiboSpider is a social media scraper designed to extract user profiles, posts, and interaction data from the Sina Weibo platform. It functions as a web-based data crawler that retrieves information through external interfaces rather than parsing the visual frontend. The tool includes a content lineage tracker for following shared posts back to their original sources. It also has a social engagement analyzer that collects view counts and nested comment threads to measure user interaction metrics. The system provides capabilities for keyword-based social monitoring and search result filtering to track specific topics over time. It handles large datasets through pagination-based iteration and recursive traversal of interaction threads.
Extracts user profiles, posts, and interaction lists from external platform interfaces to gather raw activity data.
Python · python · scrapy · weibo
WithSecureLabs/chainsaw
3,446
Chainsaw is a Windows forensic analysis tool used for parsing system databases and extracting security artefacts. It functions as a forensic artefact extractor and a scanner for identifying security threats and log tampering within Windows event logs. The project distinguishes itself by implementing a Sigma rule forensic scanner that applies standardized detection logic and custom rule sets to event logs and forensic artefacts. It enables threat hunting workflows by matching event data against patterns to identify malicious activity, lateral movement, and brute force attacks. The tool's capa
Extracts raw data from internal file tables and databases to facilitate deep manual analysis in external software.
Rust · attack · blueteam · chainsaw
