Why is datalab-to/marker a recommended Data Extraction Tools repository?

A specialized engine that identifies and maps specific information from unstructured documents into predefined schemas for programmatic use.

Why is simstudioai/sim a recommended Data Extraction Tools repository?

A workflow platform that scrapes, searches, crawls, maps, and extracts structured data from websites for web-based information gathering and content processing.

Why is beekeeper-studio/beekeeper-studio a recommended Data Extraction Tools repository?

Extracts individual table rows from query results for use in external tools.

Why is microsoft/AirSim a recommended Data Extraction Tools repository?

Exports static geometry and mesh data from the simulation environment for external analysis.

Why is laramies/theHarvester a recommended Data Extraction Tools repository?

Automates the collection of public-facing digital assets and intelligence to map an organization's external attack surface.

Why is shengqiangzhang/examples-of-web-crawlers a recommended Data Extraction Tools repository?

Extracts and analyzes user profile data, friendship statistics, and social interactions from messaging platforms.

Why is dataabc/weiboSpider a recommended Data Extraction Tools repository?

Gathers detailed user profiles and published posts from Sina Weibo for academic research and trend analysis.

Why is olifolkerd/tabulator a recommended Data Extraction Tools repository?

Extracts the current contents of a sheet into array format for storage or processing.

Why is subzeroid/instagrapi a recommended Data Extraction Tools repository?

Fetches public Threads profile metadata, posts, replies, and network relationships for analytics and monitoring.

Why is JustAnotherArchivist/snscrape a recommended Data Extraction Tools repository?

Gathers public posts, user profiles, and hashtags from social platforms for analysis and archival.

15 repositories

Awesome GitHub Repositories: Data Extraction Tools

Utilities for querying and exporting structured data from application workspaces.

Distinguishing note: Focuses on the extraction of workspace data into external formats, distinct from general data storage.

Explore 15 awesome GitHub repositories matching data & databases · Data Extraction Tools. Refine with filters or upvote what's useful.


datalab-to/marker
36,137 stars
Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale. The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized
A specialized engine that identifies and maps specific information from unstructured documents into predefined schemas for programmatic use.
Python
simstudioai/sim
28,796 stars
This project is an AI agent orchestration platform that provides a visual environment for building, testing, and deploying complex automation workflows. It functions as a low-code development interface where users can chain discrete functional blocks into dependency-aware pipelines to integrate artificial intelligence with external data and services. The platform supports the creation of intelligent conversational agents, automated business processes, and multi-service API orchestrations within a unified workspace. The platform distinguishes itself through its event-driven integration engine,
A workflow platform that scrapes, searches, crawls, maps, and extracts structured data from websites for web-based information gathering and content processing.
TypeScript · agent-workflow · agentic-workflow · agents
beekeeper-studio/beekeeper-studio
22,030 stars
Beekeeper Studio is a cross-platform desktop application designed for database management and SQL development. It provides a unified graphical interface to connect to, query, and modify data across a wide range of relational and NoSQL database systems. The application functions as a comprehensive workspace, integrating tools for schema design, record editing, and data visualization. The project distinguishes itself through a focus on secure, flexible connectivity and AI-assisted workflows. It supports advanced authentication methods, including enterprise single sign-on, multi-factor authentic
Extracts individual table rows from query results for use in external tools.
TypeScript · bigquery · cassandra · cockroachdb
microsoft/AirSim
17,956 stars
AirSim is a high-fidelity simulation platform designed for the development and testing of autonomous vehicles. Built as a plugin for game engines, it provides a physics-based environment that models vehicle dynamics and sensor data, serving as a foundation for robotics research, computer vision training, and reinforcement learning. The platform distinguishes itself through its support for hardware-in-the-loop and software-in-the-loop testing, allowing developers to validate control logic and firmware against real-world signals or concurrent processes. It offers extensive programmatic control
Exports static geometry and mesh data from the simulation environment for external analysis.
C++ · ai · airsim · artificial-intelligence
laramies/theHarvester
15,687 stars
theHarvester is a command-line utility designed for gathering open-source intelligence and mapping an organization's external attack surface. It functions as a security information gathering framework that automates the collection of publicly available data to assist in reconnaissance and threat analysis. The tool utilizes a plugin-based architecture to execute isolated queries against various search engines and public databases. It employs asynchronous task execution to run multiple discovery operations in parallel, while a centralized pipeline aggregates and deduplicates findings from these
Automates the collection of public-facing digital assets and intelligence to map an organization's external attack surface.
Python · blueteam · discovery · emails
shengqiangzhang/examples-of-web-crawlers
14,651 stars
This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data archiving. The toolset includes specialized mechanisms for bypassing anti-scraping measures through IP proxy pool rotation and multi-threaded crawlers. It also features capabilities for simulating browser sessions to handle authentication, intercepting session cookies, and decrypting network payloads
Extracts and analyzes user profile data, friendship statistics, and social interactions from messaging platforms.
HTML · agent-pool · crawler · example
dataabc/weiboSpider
9,630 stars
weiboSpider is a Python web scraper and social media crawler designed to extract user profiles, posts, and engagement metrics from Sina Weibo. It functions as an automated data pipeline for academic research and trend analysis, collecting long-form text and multimedia content. The tool distinguishes itself through the use of browser session cookies to authenticate requests and access protected profiles. It implements randomized request pacing and global pauses to manage traffic and avoid platform rate limits, while supporting incremental crawling to capture only new content based on timestamp
Gathers detailed user profiles and published posts from Sina Weibo for academic research and trend analysis.
Python · help-wanted · python · python3
olifolkerd/tabulator
7,550 stars
Tabulator is an interactive data table library and virtual DOM data grid used to create high-performance tables from JSON or arrays. It functions as a hierarchical data viewer and a spreadsheet interface component, capable of rendering thousands of records efficiently through viewport-based virtualization and progressive loading. The library distinguishes itself by providing a full spreadsheet interface mode with multi-sheet management, cell range selection, and bulk copy-paste capabilities. It supports complex data architectures, including nested data field mapping, expandable tree structure
Extracts the current contents of a sheet into array format for storage or processing.
JavaScript · ajax · cdnjs · data
subzeroid/instagrapi
6,366 stars
Fetches public Threads profile metadata, posts, replies, and network relationships for analytics and monitoring.
Python · api-wrapper · instabot · instagram
JustAnotherArchivist/snscrape
5,398 stars
snscrape is a Python-based social media scraper and crawler designed to extract public posts, profiles, and hashtags from social networks without using official APIs. It functions as an archival tool and a utility for open-source intelligence (OSINT) gathering, enabling the collection of publicly accessible information to investigate trends and people. The tool facilitates the extraction of social media data for research and archival purposes, enabling the creation of historical records of conversations and user activity. It supports workflows for academic social analysis and the export of large sets of metadata and messages to local files. Capabilities include scraping various social media platforms and limiting the volume of extracted results. The system can export discovered items as URL lists or as detailed files containing content and timestamps.
Gathers public posts, user profiles, and hashtags from social platforms for analysis and archival.
Python
SpiderClub/weibospider
4,787 stars
Weibospider is a distributed web crawler designed to extract posts, profiles, and interaction data from the Weibo social network. It functions as a social media data extractor that uses a distributed task queue to scale scraping operations across multiple worker nodes. The system includes a graphical administrative interface for configuring crawler parameters, target user IDs, and search keywords. It employs a distributed architecture to increase data throughput and handle large-scale collection of social media content. The tool covers a wide range of data collection capabilities, including user profile harvesting, keyword-based search extraction, and social graph mapping via follower lists, comments, and reposts. It also has mechanisms for request rate limiting, account rotation, and automation of recurring tasks to maintain session persistence and continuous data collection.
Extracts public social platform profile metadata, posts, and network relationships for large-scale analysis.
Python · data-analysis · distributed-crawler · python3
FxEmbed/FxEmbed
4,737 stars
FxEmbed is a collection of specialized services providing a social media data API, a social media embed gateway, and a URL unshortener and sanitizer. It functions as an edge-deployed content proxy designed to programmatically fetch posts, threads, profiles, and search results from various social platforms. The project transforms social media links into rich media previews and interactive embeds for messaging platforms. It also expands shortened links to their original destinations while removing tracking parameters to improve user privacy and security. The system includes capabilities for so
Retrieves posts, threads, and profile metadata from social platforms via a standardized interface.
TypeScript
dataabc/weibo-crawler
4,541 stars
This project is a web scraper for Sina Weibo and a social media data pipeline designed to extract user profiles, posts, comments, and multimedia assets. It functions as a containerized data crawler that automates the collection and local storage of social media content and engagement metrics. The system includes a processing layer that uses large language models (LLMs) to analyze the extracted text, generating summaries and sentiment analysis. It distinguishes itself through a ready-to-use container deployment model with an HTTP interface for managing extraction tasks and monitoring job progress. The crawler covers a wide range of capabilities, including social media monitoring via scheduled incremental updates, archiving of multimedia assets to local disks, and multi-format data export to flat files or databases. It also captures detailed social interactions, such as top-level comments and reposts.
Retrieves detailed post information including timestamps, interaction counts, hashtags, and publication tools.
Python · crawler · weibo · weibo-spider
nghuyong/WeiboSpider
4,086 stars
WeiboSpider is a social media scraper designed to extract user profiles, posts, and interaction data from the Sina Weibo platform. It functions as a web data crawler that retrieves information through external interfaces rather than by parsing the visual frontend. The tool includes a content lineage tracer that follows shared posts back to their original sources. It also has a social engagement analyzer that collects view counts and nested comment threads to measure user interaction metrics. The system provides keyword-based social monitoring and search result filtering to track specific topics over time. It handles large datasets through pagination-based iteration and recursive traversal of engagement threads.
Extracts user profiles, posts, and interaction lists from external platform interfaces to gather raw activity data.
Python · python · scrapy · weibo
WithSecureLabs/chainsaw
3,446 stars
Chainsaw is a Windows forensic analysis tool used for parsing system databases and extracting security artefacts. It functions as a forensic artefact extractor and a scanner for identifying security threats and log tampering within Windows event logs. The project distinguishes itself by implementing a Sigma rule forensic scanner that applies standardized detection logic and custom rule sets to event logs and forensic artefacts. It enables threat hunting workflows by matching event data against patterns to identify malicious activity, lateral movement, and brute force attacks. The tool's capa
Extracts raw data from internal file tables and databases to facilitate deep manual analysis in external software.
Rust · attack · blueteam · chainsaw

Awesome Data Extraction Tools GitHub Repositories

datalab-to/marker

simstudioai/sim

beekeeper-studio/beekeeper-studio

microsoft/AirSim

laramies/theHarvester

shengqiangzhang/examples-of-web-crawlers

dataabc/weiboSpider

olifolkerd/tabulator

subzeroid/instagrapi

JustAnotherArchivist/snscrape

SpiderClub/weibospider

FxEmbed/FxEmbed

dataabc/weibo-crawler

nghuyong/WeiboSpider

WithSecureLabs/chainsaw
