Why is datalab-to/marker a recommended Data Extraction Tools repository?

A specialized engine that identifies and maps specific information from unstructured documents into predefined schemas for programmatic use.

Why is simstudioai/sim a recommended Data Extraction Tools repository?

Scrapes, searches, crawls, maps, and extracts structured data from websites to support web-based information gathering and content processing.

Why is beekeeper-studio/beekeeper-studio a recommended Data Extraction Tools repository?

Extracts individual table rows from query results for use in external tools.

Why is microsoft/AirSim a recommended Data Extraction Tools repository?

Exports static geometry and mesh data from the simulation environment for external analysis.

Why is laramies/theHarvester a recommended Data Extraction Tools repository?

Automates the collection of public-facing digital assets and intelligence to map an organization's external attack surface.

Why is shengqiangzhang/examples-of-web-crawlers a recommended Data Extraction Tools repository?

Extracts and analyzes user profile data, friendship statistics, and social interactions from messaging platforms.

Why is dataabc/weiboSpider a recommended Data Extraction Tools repository?

Gathers detailed user profiles and published posts from Sina Weibo for academic research and trend analysis.

Why is olifolkerd/tabulator a recommended Data Extraction Tools repository?

Extracts the current contents of a sheet into array format for storage or processing.

Why is subzeroid/instagrapi a recommended Data Extraction Tools repository?

Fetches public Threads profile metadata, posts, replies, and network relationships for analytics and monitoring.

Why is JustAnotherArchivist/snscrape a recommended Data Extraction Tools repository?

Gathers public posts, user profiles, and hashtags from social platforms for analysis and archival.

15 Repos

Awesome GitHub Repositories · Data Extraction Tools

Utilities for querying and exporting structured data from application workspaces.

Distinguishing note: Focuses on the extraction of workspace data into external formats, distinct from general data storage.

Explore 15 awesome GitHub repositories matching data & databases · Data Extraction Tools. Refine with filters or upvote what's useful.


datalab-to/marker
36,137
Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale. The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized
A specialized engine that identifies and maps specific information from unstructured documents into predefined schemas for programmatic use.
Python
simstudioai/sim
28,796
This project is an AI agent orchestration platform that provides a visual environment for building, testing, and deploying complex automation workflows. It functions as a low-code development interface where users can chain discrete functional blocks into dependency-aware pipelines to integrate artificial intelligence with external data and services. The platform supports the creation of intelligent conversational agents, automated business processes, and multi-service API orchestrations within a unified workspace. The platform distinguishes itself through its event-driven integration engine,
Scrapes, searches, crawls, maps, and extracts structured data from websites to support web-based information gathering and content processing.
TypeScript · agent-workflow · agentic-workflow · agents
beekeeper-studio/beekeeper-studio
22,030
Beekeeper Studio is a cross-platform desktop application designed for database management and SQL development. It provides a unified graphical interface to connect to, query, and modify data across a wide range of relational and NoSQL database systems. The application functions as a comprehensive workspace, integrating tools for schema design, record editing, and data visualization. The project distinguishes itself through a focus on secure, flexible connectivity and AI-assisted workflows. It supports advanced authentication methods, including enterprise single sign-on, multi-factor authentic
Extracts individual table rows from query results for use in external tools.
TypeScript · bigquery · cassandra · cockroachdb
microsoft/AirSim
17,956
AirSim is a high-fidelity simulation platform designed for the development and testing of autonomous vehicles. Built as a plugin for game engines, it provides a physics-based environment that models vehicle dynamics and sensor data, serving as a foundation for robotics research, computer vision training, and reinforcement learning. The platform distinguishes itself through its support for hardware-in-the-loop and software-in-the-loop testing, allowing developers to validate control logic and firmware against real-world signals or concurrent processes. It offers extensive programmatic control
Exports static geometry and mesh data from the simulation environment for external analysis.
C++ · ai · airsim · artificial-intelligence
laramies/theHarvester
15,687
theHarvester is a command-line utility designed for gathering open-source intelligence and mapping an organization's external attack surface. It functions as a security information gathering framework that automates the collection of publicly available data to assist in reconnaissance and threat analysis. The tool utilizes a plugin-based architecture to execute isolated queries against various search engines and public databases. It employs asynchronous task execution to run multiple discovery operations in parallel, while a centralized pipeline aggregates and deduplicates findings from these
Automates the collection of public-facing digital assets and intelligence to map an organization's external attack surface.
Python · blueteam · discovery · emails
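The theHarvester description mentions running discovery operations in parallel and then aggregating and deduplicating the findings. A minimal sketch of that general pattern follows, using only the standard library. It is not theHarvester's code: `search_engine_a`, `search_engine_b`, and `gather_findings` are hypothetical names, and the two sources return canned values instead of querying anything.

```python
import asyncio

# Hypothetical stand-ins for discovery plugins; real ones would make network calls.
async def search_engine_a(domain):
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    return [f"mail@{domain}", f"www.{domain}"]

async def search_engine_b(domain):
    await asyncio.sleep(0)
    return [f"www.{domain}", f"vpn.{domain}"]

async def gather_findings(domain, sources):
    # Run every source concurrently, then merge the batches and drop duplicates.
    batches = await asyncio.gather(*(source(domain) for source in sources))
    return sorted({item for batch in batches for item in batch})

findings = asyncio.run(
    gather_findings("example.com", [search_engine_a, search_engine_b])
)
print(findings)
```

Both sources report `www.example.com`, so the merged list contains it only once.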
shengqiangzhang/examples-of-web-crawlers
14,651
This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data archiving. The toolset includes specialized mechanisms for bypassing anti-scraping measures through IP proxy pool rotation and multi-threaded crawlers. It also features capabilities for simulating browser sessions to handle authentication, intercepting session cookies, and decrypting network payloads
Extracts and analyzes user profile data, friendship statistics, and social interactions from messaging platforms.
HTML · agent-pool · crawler · example
dataabc/weiboSpider
9,630
weiboSpider is a Python web scraper and social media crawler designed to extract user profiles, posts, and engagement metrics from Sina Weibo. It functions as an automated data pipeline for academic research and trend analysis, collecting long-form text and multimedia content. The tool distinguishes itself through the use of browser session cookies to authenticate requests and access protected profiles. It implements randomized request pacing and global pauses to manage traffic and avoid platform rate limits, while supporting incremental crawling to capture only new content based on timestamp
Gathers detailed user profiles and published posts from Sina Weibo for academic research and trend analysis.
Python · help-wanted · python · python3
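The weiboSpider description refers to randomized request pacing and global pauses. One way such pacing could look is sketched below. This is an illustration under stated assumptions, not weiboSpider's implementation: the `paced` helper, its parameter names, and its default values are all hypothetical.

```python
import random
import time

def paced(items, min_delay=1.0, max_delay=3.0, pause_every=10,
          pause_seconds=30.0, sleep=time.sleep, rng=random):
    """Yield items, waiting a random interval after each one and a
    longer fixed pause after every `pause_every` items."""
    for index, item in enumerate(items, start=1):
        yield item
        if index % pause_every == 0:
            sleep(pause_seconds)                      # global pause after a batch
        else:
            sleep(rng.uniform(min_delay, max_delay))  # randomized gap

# Usage: record the delays instead of actually sleeping.
delays = []
pages = list(paced(range(1, 6), pause_every=5, sleep=delays.append))
print(pages, delays)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a crawler would leave the default `time.sleep` in place.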
olifolkerd/tabulator
7,550
Tabulator is an interactive data table library and virtual DOM data grid used to create high-performance tables from JSON or arrays. It functions as a hierarchical data viewer and a spreadsheet interface component, capable of rendering thousands of records efficiently through viewport-based virtualization and progressive loading. The library distinguishes itself by providing a full spreadsheet interface mode with multi-sheet management, cell range selection, and bulk copy-paste capabilities. It supports complex data architectures, including nested data field mapping, expandable tree structure
Extracts the current contents of a sheet into array format for storage or processing.
JavaScript · ajax · cdnjs · data
subzeroid/instagrapi
6,366
Fetches public Threads profile metadata, posts, replies, and network relationships for analytics and monitoring.
Python · api-wrapper · instabot · instagram
JustAnotherArchivist/snscrape
5,398
snscrape is a Python-based social media web scraper and crawler designed to extract public posts, profiles, and hashtags from social networks without using official APIs. It functions as an archiving tool and data collection utility for open-source intelligence, enabling the gathering of publicly available information to study trends and individuals. The tool facilitates the extraction of social media data for research and archival purposes and enables the creation of historical records of conversations and user activity. It supports workflows for academic social analysis and the export of large volumes of metadata and messages to local files. Features include the ability to scrape various social networks and to limit the volume of extracted results. The system can export discovered items as lists of URLs or as detailed files with content and timestamps.
Gathers public posts, user profiles, and hashtags from social platforms for analysis and archival.
Python
SpiderClub/weibospider
4,787
Weibospider is a distributed web crawler designed to extract posts, profiles, and interaction data from the Weibo social network. It functions as a social media data extractor that uses a distributed task queue to scale scraping operations across multiple worker nodes. The system includes a graphical administration interface for configuring crawler settings, target user IDs, and search terms. It uses a distributed architecture to increase data throughput and manage large-scale collection of social media content. The tool covers a broad range of data collection functions, including harvesting user profiles, extraction based on search terms, and mapping social graphs through follower lists, comments, and reposts. It also provides mechanisms for request rate regulation, account rotation, and automation of recurring tasks to maintain session persistence and continuous data collection.
Extracts public social platform profile metadata, posts, and network relationships for large-scale analysis.
Python · data-analysis · distributed-crawler · python3
FxEmbed/FxEmbed
4,737
FxEmbed is a collection of specialized services providing a social media data API, a social media embed gateway, and a URL unshortener and sanitizer. It functions as an edge-deployed content proxy designed to programmatically fetch posts, threads, profiles, and search results from various social platforms. The project transforms social media links into rich media previews and interactive embeds for messaging platforms. It also expands shortened links to their original destinations while removing tracking parameters to improve user privacy and security. The system includes capabilities for so
Retrieves posts, threads, and profile metadata from social platforms via a standardized interface.
TypeScript
dataabc/weibo-crawler
4,541
This project is a Sina Weibo web scraper and social media data pipeline designed to extract user profiles, posts, comments, and multimedia assets. It functions as a containerized data crawler that automates the collection and local storage of social media content and engagement metrics. The system includes a processing layer that uses large language models to analyze the scraped text and generate summaries and sentiment analyses. It distinguishes itself through a ready-to-deploy container model with an HTTP interface for managing extraction tasks and monitoring progress. The crawler covers a broad range of functions, including social media monitoring via scheduled incremental updates, archiving of multimedia assets to local disks, and data export in various formats to flat files or databases. It also captures detailed social interactions such as first-level comments and reposts.
Retrieves detailed post information including timestamps, interaction counts, hashtags, and publication tools.
Python · crawler · weibo · weibo-spider
nghuyong/WeiboSpider
4,086
WeiboSpider is a social media scraper designed to extract user profiles, posts, and interaction data from the Sina Weibo platform. It functions as a web-based data crawler that retrieves information through external interfaces rather than parsing the visual frontend. The tool includes a content lineage tracer to trace shared posts back to their original sources. It also provides a social engagement analyzer to capture view counts and nested comment threads and to measure interaction metrics. The system offers functions for keyword-based social monitoring and search result filtering to track specific topics over time. It manages large datasets through pagination-based iteration and recursive traversal of engagement threads.
Extracts user profiles, posts, and interaction lists from external platform interfaces to gather raw activity data.
Python · python · scrapy · weibo
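The WeiboSpider description mentions pagination-based iteration combined with recursive traversal of engagement threads. The sketch below shows that combination on an in-memory stand-in for a paginated comments endpoint. It is not taken from the project: `PAGES`, `fetch_page`, `walk_comments`, and the response shape are all hypothetical and do not reflect any real Weibo interface.

```python
# Hypothetical paginated responses, keyed by (parent id, page number).
PAGES = {
    ("post", 1): {"items": [{"id": "c1", "replies": 1}], "next": 2},
    ("post", 2): {"items": [{"id": "c2", "replies": 0}], "next": None},
    ("c1", 1): {"items": [{"id": "c1.1", "replies": 0}], "next": None},
}

def fetch_page(parent_id, page):
    # Stand-in for an HTTP request to a comments endpoint.
    return PAGES[(parent_id, page)]

def walk_comments(parent_id, depth=0):
    """Yield (depth, comment id) pairs, following pages and nested replies."""
    page = 1
    while page is not None:
        data = fetch_page(parent_id, page)
        for comment in data["items"]:
            yield depth, comment["id"]
            if comment["replies"]:
                # Recurse into the reply thread before moving to the next comment.
                yield from walk_comments(comment["id"], depth + 1)
        page = data["next"]

thread = list(walk_comments("post"))
print(thread)
```

The outer loop follows the `next` page pointer, while the recursion descends into each comment's replies, so nested threads come out in depth-first order.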
WithSecureLabs/chainsaw
3,446
Chainsaw is a Windows forensic analysis tool used for parsing system databases and extracting security artefacts. It functions as a forensic artefact extractor and a scanner for identifying security threats and log tampering within Windows event logs. The project distinguishes itself by implementing a Sigma rule forensic scanner that applies standardized detection logic and custom rule sets to event logs and forensic artefacts. It enables threat hunting workflows by matching event data against patterns to identify malicious activity, lateral movement, and brute force attacks. The tool's capa
Extracts raw data from internal file tables and databases to facilitate deep manual analysis in external software.
Rust · attack · blueteam · chainsaw