15 repositories
Utilities for querying and exporting structured data from application workspaces.
Distinguishing note: Focuses on the extraction of workspace data into external formats, distinct from general data storage.
Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale. The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized
A specialized engine that identifies and maps specific information from unstructured documents into predefined schemas for programmatic use.
This project is an AI agent orchestration platform that provides a visual environment for building, testing, and deploying complex automation workflows. It functions as a low-code development interface where users can chain discrete functional blocks into dependency-aware pipelines to integrate artificial intelligence with external data and services. The platform supports the creation of intelligent conversational agents, automated business processes, and multi-service API orchestrations within a unified workspace. The platform distinguishes itself through its event-driven integration engine,
Workflow Platform scrapes, searches, crawls, maps, and extracts structured data from websites to facilitate web-based information gathering and content processing tasks.
Beekeeper Studio is a cross-platform desktop application designed for database management and SQL development. It provides a unified graphical interface to connect to, query, and modify data across a wide range of relational and NoSQL database systems. The application functions as a comprehensive workspace, integrating tools for schema design, record editing, and data visualization. The project distinguishes itself through a focus on secure, flexible connectivity and AI-assisted workflows. It supports advanced authentication methods, including enterprise single sign-on, multi-factor authentic
Extracts individual table rows from query results for use in external tools.
AirSim is a high-fidelity simulation platform designed for the development and testing of autonomous vehicles. Built as a plugin for game engines, it provides a physics-based environment that models vehicle dynamics and sensor data, serving as a foundation for robotics research, computer vision training, and reinforcement learning. The platform distinguishes itself through its support for hardware-in-the-loop and software-in-the-loop testing, allowing developers to validate control logic and firmware against real-world signals or concurrent processes. It offers extensive programmatic control
Exports static geometry and mesh data from the simulation environment for external analysis.
theHarvester is a command-line utility designed for gathering open-source intelligence and mapping an organization's external attack surface. It functions as a security information gathering framework that automates the collection of publicly available data to assist in reconnaissance and threat analysis. The tool utilizes a plugin-based architecture to execute isolated queries against various search engines and public databases. It employs asynchronous task execution to run multiple discovery operations in parallel, while a centralized pipeline aggregates and deduplicates findings from these
Automates the collection of public-facing digital assets and intelligence to map an organization's external attack surface.
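The pattern described for theHarvester (independent source queries run in parallel, then merged and deduplicated) can be sketched roughly as follows. This is a minimal illustration, not theHarvester's actual code: the source functions, their return values, and the name `gather_hosts` are all hypothetical stand-ins.

```python
import asyncio


async def search_engine_source(domain: str) -> list[str]:
    # Hypothetical plugin: stands in for a query against a search engine.
    await asyncio.sleep(0)
    return [f"www.{domain}", f"mail.{domain}"]


async def public_database_source(domain: str) -> list[str]:
    # Hypothetical plugin: stands in for a query against a public database.
    await asyncio.sleep(0)
    return [f"mail.{domain}", f"VPN.{domain}"]


async def gather_hosts(domain: str, sources) -> list[str]:
    # Run every source concurrently; a failing source returns its exception
    # instead of cancelling the others, which keeps the queries isolated.
    results = await asyncio.gather(
        *(source(domain) for source in sources), return_exceptions=True
    )
    seen: set[str] = set()
    merged: list[str] = []
    for result in results:
        if isinstance(result, BaseException):
            continue  # skip sources that failed
        for host in result:
            key = host.strip().lower()  # normalise before comparing
            if key and key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


if __name__ == "__main__":
    sources = [search_engine_source, public_database_source]
    print(asyncio.run(gather_hosts("example.com", sources)))
```

Normalising case before comparing is one design choice among several; a real tool might also validate hostnames or keep track of which source reported each finding.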
This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data archiving. The toolset includes specialized mechanisms for bypassing anti-scraping measures through IP proxy pool rotation and multi-threaded crawlers. It also features capabilities for simulating browser sessions to handle authentication, intercepting session cookies, and decrypting network payloads
Extracts and analyzes user profile data, friendship statistics, and social interactions from messaging platforms.
weiboSpider is a Python web scraper and social media crawler designed to extract user profiles, posts, and engagement metrics from Sina Weibo. It functions as an automated data pipeline for academic research and trend analysis, collecting long-form text and multimedia content. The tool distinguishes itself through the use of browser session cookies to authenticate requests and access protected profiles. It implements randomized request pacing and global pauses to manage traffic and avoid platform rate limits, while supporting incremental crawling to capture only new content based on timestamp
Gathers detailed user profiles and published posts from Sina Weibo for academic research and trend analysis.
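The pacing and incremental-crawl behaviour described for weiboSpider can be illustrated with a short sketch. It assumes posts are dictionaries with a `created_at` timestamp; the function names, the delay values, and the `pause_every` parameter are hypothetical and are not taken from the project itself.

```python
import random
import time
from datetime import datetime


def filter_new_posts(posts: list[dict], last_seen: datetime) -> list[dict]:
    # Incremental crawl: keep only posts newer than the last recorded timestamp.
    return [post for post in posts if post["created_at"] > last_seen]


def paced(items, min_delay=1.0, max_delay=3.0, pause_every=5,
          pause_for=10.0, sleep=time.sleep):
    # Yield items with a random delay between them, plus a longer
    # pause after every `pause_every` items.
    for index, item in enumerate(items):
        if index > 0:
            sleep(random.uniform(min_delay, max_delay))
            if index % pause_every == 0:
                sleep(pause_for)
        yield item


if __name__ == "__main__":
    posts = [
        {"id": 1, "created_at": datetime(2024, 1, 1)},
        {"id": 2, "created_at": datetime(2024, 3, 1)},
    ]
    fresh = filter_new_posts(posts, last_seen=datetime(2024, 2, 1))
    for post in paced(fresh, min_delay=0.1, max_delay=0.2):
        print(post["id"])
```

Passing `sleep` as a parameter keeps the pacing logic testable without real waiting.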
Tabulator is an interactive data table library and virtual DOM data grid used to create high-performance tables from JSON or arrays. It functions as a hierarchical data viewer and a spreadsheet interface component, capable of rendering thousands of records efficiently through viewport-based virtualization and progressive loading. The library distinguishes itself by providing a full spreadsheet interface mode with multi-sheet management, cell range selection, and bulk copy-paste capabilities. It supports complex data architectures, including nested data field mapping, expandable tree structure
Extracts the current contents of a sheet into array format for storage or processing.
Fetches public Threads profile metadata, posts, replies, and network relationships for analytics and monitoring.
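snscrape is a Python-based social media web crawler and scraping tool designed to extract public posts, profiles, and hashtags from social networks without using official APIs. It serves as an archiving tool and open-source intelligence data collection utility, allowing publicly available information to be gathered to investigate trends and people. The tool facilitates social media data extraction for research and archival purposes, enabling the creation of records of conversations and user activity history. It supports workflows for academic social analysis and exports large volumes of metadata and messages to local files. Features include scraping a variety of social networking platforms and the ability to limit the number of extracted results. The system can export discovered items as a list of URLs or as detailed files containing content and timestamps.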
Gathers public posts, user profiles, and hashtags from social platforms for analysis and archival.
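Weibospider is a distributed web crawler designed to extract posts, profiles, and interaction data from the Weibo social network. It serves as a social media data extractor that uses a distributed task queue to scale crawling operations across multiple worker nodes. The system includes a graphical management interface for configuring crawler settings, target user IDs, and search keywords. It adopts a distributed architecture to increase data throughput and manage the collection of large-scale social media content. The tool covers a broad range of data collection features, including user profile scraping, keyword-based search extraction, and mapping social graphs through follow lists, comments, and reposts. It also provides request rate limiting, account rotation, and recurring task automation to maintain session persistence and continuous data collection.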
Extracts public social platform profile metadata, posts, and network relationships for large-scale analysis.
FxEmbed is a collection of specialized services providing a social media data API, a social media embed gateway, and a URL unshortener and sanitizer. It functions as an edge-deployed content proxy designed to programmatically fetch posts, threads, profiles, and search results from various social platforms. The project transforms social media links into rich media previews and interactive embeds for messaging platforms. It also expands shortened links to their original destinations while removing tracking parameters to improve user privacy and security. The system includes capabilities for so
Retrieves posts, threads, and profile metadata from social platforms via a standardized interface.
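This is a Sina Weibo web crawler and social media data pipeline designed to extract user profiles, posts, comments, and multimedia assets. It serves as a containerized data crawler that automates the collection of social media content and engagement metrics and stores them locally. The system includes a processing layer that uses large language models to analyze scraped text, producing summaries and sentiment analysis. It distinguishes itself through a deployment-ready container model with an HTTP interface for managing extraction tasks and monitoring job progress. The crawler covers a broad range of features, including social media monitoring through scheduled incremental updates, archiving multimedia assets to local disk, and multi-format data export to flat files or databases. It also captures detailed social interactions such as first-level comments and reposts.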
Retrieves detailed post information including timestamps, interaction counts, hashtags, and publication tools.
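WeiboSpider is a social media crawler designed to extract user profiles, posts, and interaction data from the Sina Weibo platform. It serves as a web-based data crawler that retrieves information through external interfaces rather than parsing the visual front end. The tool includes a content lineage tracker for tracing shared posts back to their original source. It also has a social engagement analyzer that collects view counts and nested comment threads to measure user interaction metrics. The system provides keyword-based social monitoring and search result filtering to track specific topics over time. It manages large datasets through pagination-based iteration and recursive traversal of engagement threads.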
Extracts user profiles, posts, and interaction lists from external platform interfaces to gather raw activity data.
Chainsaw is a Windows forensic analysis tool used for parsing system databases and extracting security artefacts. It functions as a forensic artefact extractor and a scanner for identifying security threats and log tampering within Windows event logs. The project distinguishes itself by implementing a Sigma rule forensic scanner that applies standardized detection logic and custom rule sets to event logs and forensic artefacts. It enables threat hunting workflows by matching event data against patterns to identify malicious activity, lateral movement, and brute force attacks. The tool's capa
Extracts raw data from internal file tables and databases to facilitate deep manual analysis in external software.