15 repositories
Utilities for querying and exporting structured data from application workspaces.
Distinguishing note: Focuses on the extraction of workspace data into external formats, distinct from general data storage.
Explore 15 GitHub repositories matching Data & Databases · Data Extraction Tools.
Marker is a comprehensive document processing platform designed to automate the conversion, extraction, and structuring of data from complex files. It functions as an orchestration engine that chains modular processing steps into versioned, reusable pipelines, allowing organizations to standardize document handling and automate repetitive business tasks at scale. The platform distinguishes itself through its support for secure, private infrastructure deployment, enabling users to run containerized services within their own environments to maintain strict data privacy. It features specialized
A specialized engine that identifies and maps specific information from unstructured documents into predefined schemas for programmatic use.
This project is an AI agent orchestration platform that provides a visual environment for building, testing, and deploying complex automation workflows. It functions as a low-code development interface where users can chain discrete functional blocks into dependency-aware pipelines to integrate artificial intelligence with external data and services. The platform supports the creation of intelligent conversational agents, automated business processes, and multi-service API orchestrations within a unified workspace. The platform distinguishes itself through its event-driven integration engine,
Workflow Platform scrapes, searches, crawls, maps, and extracts structured data from websites to facilitate web-based information gathering and content processing tasks.
Beekeeper Studio is a cross-platform desktop application designed for database management and SQL development. It provides a unified graphical interface to connect to, query, and modify data across a wide range of relational and NoSQL database systems. The application functions as a comprehensive workspace, integrating tools for schema design, record editing, and data visualization. The project distinguishes itself through a focus on secure, flexible connectivity and AI-assisted workflows. It supports advanced authentication methods, including enterprise single sign-on, multi-factor authentic
Extracts individual table rows from query results for use in external tools.
AirSim is a high-fidelity simulation platform designed for the development and testing of autonomous vehicles. Built as a plugin for game engines, it provides a physics-based environment that models vehicle dynamics and sensor data, serving as a foundation for robotics research, computer vision training, and reinforcement learning. The platform distinguishes itself through its support for hardware-in-the-loop and software-in-the-loop testing, allowing developers to validate control logic and firmware against real-world signals or concurrent processes. It offers extensive programmatic control
Exports static geometry and mesh data from the simulation environment for external analysis.
theHarvester is a command-line utility designed for gathering open-source intelligence and mapping an organization's external attack surface. It functions as a security information gathering framework that automates the collection of publicly available data to assist in reconnaissance and threat analysis. The tool utilizes a plugin-based architecture to execute isolated queries against various search engines and public databases. It employs asynchronous task execution to run multiple discovery operations in parallel, while a centralized pipeline aggregates and deduplicates findings from these
Automates the collection of public-facing digital assets and intelligence to map an organization's external attack surface.
This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data archiving. The toolset includes specialized mechanisms for bypassing anti-scraping measures through IP proxy pool rotation and multi-threaded crawlers. It also features capabilities for simulating browser sessions to handle authentication, intercepting session cookies, and decrypting network payloads
Extracts and analyzes user profile data, friendship statistics, and social interactions from messaging platforms.
weiboSpider is a Python web scraper and social media crawler designed to extract user profiles, posts, and engagement metrics from Sina Weibo. It functions as an automated data pipeline for academic research and trend analysis, collecting long-form text and multimedia content. The tool distinguishes itself through the use of browser session cookies to authenticate requests and access protected profiles. It implements randomized request pacing and global pauses to manage traffic and avoid platform rate limits, while supporting incremental crawling to capture only new content based on timestamp
Gathers detailed user profiles and published posts from Sina Weibo for academic research and trend analysis.
Tabulator is an interactive data table library and virtual DOM data grid used to create high-performance tables from JSON or arrays. It functions as a hierarchical data viewer and a spreadsheet interface component, capable of rendering thousands of records efficiently through viewport-based virtualization and progressive loading. The library distinguishes itself by providing a full spreadsheet interface mode with multi-sheet management, cell range selection, and bulk copy-paste capabilities. It supports complex data architectures, including nested data field mapping, expandable tree structure
Extracts the current contents of a sheet into array format for storage or processing.
Fetches public Threads profile metadata, posts, replies, and network relationships for analytics and monitoring.
snscrape is a Python-based social media web scraper and crawler designed to extract public posts, profiles, and hashtags from social networks without the use of official APIs. It functions as an archival tool and a utility for open-source intelligence data collection, allowing for the gathering of publicly available information to investigate trends and people. The tool facilitates social media data extraction for research and archival purposes, enabling the creation of historical records of conversations and user activity. It supports workflows for academic social analysis and the export of
Gathers public posts, user profiles, and hashtags from social platforms for analysis and archival.
Weibospider is a distributed web crawler designed to extract posts, profiles, and interaction data from the Weibo social network. It functions as a social media data extractor that uses a distributed task queue to scale scraping operations across multiple worker nodes. The system includes a graphical administrative interface for configuring crawler settings, target user identifiers, and search keywords. It uses a distributed architecture to increase data throughput and handle large-scale collection of social media content. The tool covers a broad range of data collection capabilities, including user profile harvesting, keyword-based search extraction, and social graph mapping through follower lists, comments, and reposts. It also provides mechanisms for request rate throttling, account rotation, and automation of recurring tasks to maintain session persistence and continuous data collection.
Extracts public social platform profile metadata, posts, and network relationships for large-scale analysis.
FxEmbed is a collection of specialized services providing a social media data API, a social media embed gateway, and a URL unshortener and sanitizer. It functions as an edge-deployed content proxy designed to programmatically fetch posts, threads, profiles, and search results from various social platforms. The project transforms social media links into rich media previews and interactive embeds for messaging platforms. It also expands shortened links to their original destinations while removing tracking parameters to improve user privacy and security. The system includes capabilities for so
Retrieves posts, threads, and profile metadata from social platforms via a standardized interface.
This project is a web scraper for Sina Weibo and a social media data pipeline designed to extract user profiles, posts, comments, and multimedia assets. It functions as a containerized data crawler that automates the collection and local storage of social media content and engagement metrics. The system includes a processing layer that uses large language models (LLMs) to analyze the extracted text, generating summaries and sentiment analyses. It distinguishes itself through a deployment-ready container model that provides an HTTP interface for managing extraction tasks and monitoring job progress. The crawler covers a broad range of capabilities, including social media monitoring through scheduled incremental updates, archiving multimedia assets to local disks, and exporting data in multiple formats to flat files or databases. It also captures detailed social interactions, such as first-level comments and reposts.
Retrieves detailed post information including timestamps, interaction counts, hashtags, and publication tools.
WeiboSpider is a social media scraper designed to extract user profiles, posts, and interaction data from the Sina Weibo platform. It functions as a web-based data crawler that retrieves information via external interfaces rather than parsing the visual frontend. The tool includes a content lineage tracer to follow shared posts back to their original sources. It also features a social engagement analyzer to collect view counts and nested comment threads to measure user interaction metrics. The system provides capabilities for keyword-based social monitoring and search result filtering to tra
Extracts user profiles, posts, and interaction lists from external platform interfaces to gather raw activity data.
Chainsaw is a Windows forensic analysis tool used for parsing system databases and extracting security artefacts. It functions as a forensic artefact extractor and a scanner for identifying security threats and log tampering within Windows event logs. The project distinguishes itself by implementing a Sigma rule forensic scanner that applies standardized detection logic and custom rule sets to event logs and forensic artefacts. It enables threat hunting workflows by matching event data against patterns to identify malicious activity, lateral movement, and brute force attacks. The tool's capa
Extracts raw data from internal file tables and databases to facilitate deep manual analysis in external software.