33 repositories
Systems for collecting, normalizing, and unifying data from disparate web sources.
Distinguishing note: Focuses on the aggregation and standardization of data, distinct from raw storage.
Explore 33 awesome GitHub repositories matching data & databases · Data Aggregation Pipelines.
MediaCrawler is an automated web scraping framework designed to extract public posts, comments, and creator metadata from various social media platforms. It functions as a headless browser automator, utilizing real browser instances to render dynamic content and execute the client-side scripts necessary for interacting with modern web interfaces. The system distinguishes itself through a focus on session persistence and network flexibility. It supports remote debugging to reuse active browser sessions and cookies, which helps minimize the risk of triggering platform security challenges. To ma
Standardizes data retrieval from multiple services into a unified format for consistent processing.
Multica is an autonomous coding agent manager and LLM agent orchestration platform. It coordinates teams of autonomous agents to execute coding tasks and manage their lifecycles through a centralized dashboard. The system provides multi-tenant agent workspaces that isolate agents, settings, and project issues into distinct organizational boundaries. The platform distinguishes itself through an agent skill library that captures successful task solutions as reusable, versioned skills. These skills are shared across the agent team and pinned using content hashes to ensure consistent behavior acr
Aggregates hourly token consumption and task usage metrics to populate operational dashboards.
Vector is a high-performance observability data pipeline designed to collect, transform, and route logs, metrics, and traces across distributed infrastructure. It functions as a modular engine that decouples data ingestion from processing and transmission, utilizing a component-based architecture to connect diverse sources to multiple destinations. The project distinguishes itself through a focus on reliability and flow control. It implements backpressure-aware data movement to prevent data loss during traffic spikes and utilizes disk-backed event buffering to ensure durability during network
Deploys dedicated nodes to receive, process, and route data from multiple upstream sources for optimized performance.
Parse Server is a backend-as-a-service solution and Node.js framework that provides a ready-to-use REST and GraphQL API for mobile and web applications. It functions as a core backend infrastructure for managing database schemas, user authentication, and API routing. The system distinguishes itself with a real-time data engine that pushes database updates to clients via WebSockets and a GraphQL server that automatically generates schemas based on application data models. It also features an adapter-based storage layer that abstracts interactions with various cloud and local backends. The pla
Processes large datasets through transformation stages using native database aggregation frameworks.
Fx is a command-line processing suite designed for the transformation, conversion, exploration, and visualization of structured data. It functions as a terminal-based utility that handles both automated shell pipelines and interactive navigation of complex, nested data hierarchies. The tool distinguishes itself by integrating a JavaScript-based engine that executes user-provided logic to filter, map, or modify data fields within a sandboxed runtime. It maintains a responsive interface by decoupling data processing from the display loop, allowing users to explore large datasets through an inte
Buffers incoming line-delimited data into a unified memory structure for complex transformations.
Cube is a semantic data layer that provides a unified framework for defining business metrics, dimensions, and relationships across diverse data sources. By acting as a headless business intelligence engine, it transforms raw data into a governed model that can be queried via SQL, REST, and GraphQL interfaces. This architecture ensures consistent data definitions and logic across all downstream analytical applications and reporting tools. The platform distinguishes itself through its integrated conversational AI capabilities, which allow users to explore data using natural language. It orches
Compiles data models and builds pre-aggregations before a deployment begins serving traffic to ensure optimal query performance.
This project provides a collection of structured, binary-encoded routing datasets designed for proxy software to automate network traffic management. By mapping domain names and IP addresses to specific functional categories, it enables proxy clients to make granular, policy-based connection decisions. The repository serves as a centralized source for routing metadata, ensuring that traffic steering logic remains consistent across various networking implementations. The project distinguishes itself through an automated aggregation pipeline that processes community-maintained datasets into a u
Automates the collection and normalization of community-maintained datasets into unified routing rule files.
StatsD is a metrics aggregator and UDP collection server that collects system counters and timers. It functions as a time-series data forwarder, receiving high-frequency metric updates via a lightweight line protocol and summarizing them before flushing the data to a backend. The project features a pluggable metrics backend framework, allowing aggregated statistics to be routed to various third-party monitoring services or time-series databases such as Graphite. It supports horizontal scaling and high availability through a proxy ring distribution system that forwards incoming packets across
Summarizes incoming counters and timers in memory over fixed time windows before flushing to backends.
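As a rough illustration of windowed in-memory aggregation, the sketch below parses lines in the `name:value|type` shape used by the StatsD line protocol, accumulates counters and timers, and summarizes them on flush. The `WindowAggregator` class and its method names are hypothetical and are not StatsD's own code; a real server would call `flush()` on a timer and hand the summary to a backend.

```python
from collections import defaultdict


class WindowAggregator:
    """Accumulates counters and timers in memory until flush() is called."""

    def __init__(self):
        self.counters = defaultdict(float)
        self.timers = defaultdict(list)

    def ingest(self, line: str) -> None:
        # Lines look like "<name>:<value>|<type>", e.g. "hits:1|c" or "db.query:320|ms".
        name, rest = line.strip().split(":", 1)
        value, metric_type = rest.split("|")[:2]
        if metric_type == "c":
            self.counters[name] += float(value)
        elif metric_type == "ms":
            self.timers[name].append(float(value))
        # Other metric types are ignored in this sketch.

    def flush(self) -> dict:
        """Summarize the current window, then reset state for the next one."""
        summary = {"counters": dict(self.counters), "timers": {}}
        for name, values in self.timers.items():
            summary["timers"][name] = {
                "count": len(values),
                "min": min(values),
                "max": max(values),
                "mean": sum(values) / len(values),
            }
        self.counters.clear()
        self.timers.clear()
        return summary


agg = WindowAggregator()
for line in ["hits:1|c", "hits:2|c", "db.query:300|ms", "db.query:340|ms"]:
    agg.ingest(line)
window = agg.flush()
```

Resetting state inside `flush()` is what makes each summary cover exactly one window; a second `flush()` with no new input returns empty results.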
Telegraf is a modular, cross-platform telemetry pipeline designed to collect, process, and route metrics from diverse infrastructure, applications, and hardware. It functions as a server-side middleware that normalizes heterogeneous data into a unified format, enabling consistent monitoring across complex environments. By utilizing a plugin-driven architecture, the agent manages the entire lifecycle of telemetry data from initial ingestion to final transmission. The project distinguishes itself through a declarative, configuration-driven execution model that allows users to define complex dat
Combines individual data points into summary statistics over time windows to reduce data volume.
Gensim is a natural language processing toolkit designed for large-scale text analysis and the training of semantic vector embeddings. It provides a framework for identifying latent thematic structures within document collections and calculating semantic similarity between text segments using unsupervised statistical algorithms. The project is distinguished by its ability to handle datasets that exceed available system memory through incremental corpus streaming, which processes documents one at a time from disk. It utilizes sparse vector representations and dictionary-based token mapping to
Aggregates data points across disparate sources to create unified representations.
theHarvester is a command-line utility designed for gathering open-source intelligence and mapping an organization's external attack surface. It functions as a security information gathering framework that automates the collection of publicly available data to assist in reconnaissance and threat analysis. The tool utilizes a plugin-based architecture to execute isolated queries against various search engines and public databases. It employs asynchronous task execution to run multiple discovery operations in parallel, while a centralized pipeline aggregates and deduplicates findings from these
Implements a pipeline to collect, normalize, and deduplicate data from multiple disparate sources.
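A collect, normalize, and deduplicate pipeline of this kind can be sketched in a few lines. The two source functions below are hypothetical stand-ins for real lookups and return canned data; nothing here reflects theHarvester's actual modules or API.

```python
import asyncio


async def source_a(domain: str) -> list[str]:
    # Hypothetical source: stands in for a search-engine query.
    return [f"WWW.{domain}", f"mail.{domain}"]


async def source_b(domain: str) -> list[str]:
    # Hypothetical source: stands in for a public-database lookup.
    return [f"www.{domain}.", f"vpn.{domain}"]


def normalize(host: str) -> str:
    # Lowercase and drop a trailing dot so equivalent names compare equal.
    return host.strip().lower().rstrip(".")


async def gather_hosts(domain: str) -> list[str]:
    # Run all sources concurrently, then normalize and deduplicate.
    batches = await asyncio.gather(source_a(domain), source_b(domain))
    unique = {normalize(h) for batch in batches for h in batch}
    return sorted(unique)


hosts = asyncio.run(gather_hosts("example.com"))
```

Normalizing before building the set matters: `WWW.example.com` and `www.example.com.` only collapse into one entry once case and the trailing dot are removed.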
This project serves as a comprehensive cybersecurity training platform and resource repository focused on web application security. It functions as a centralized hub for security practitioners, providing both a curated collection of technical documentation and research, and a system for deploying isolated, containerized environments to practice security analysis and exploitation techniques. The platform distinguishes itself by integrating automated data aggregation with hands-on, container-based orchestration. It maintains a current knowledge base of industry research and digital threats whil
Automates the collection and indexing of external research through periodic scripts.
This project serves as a comprehensive resource hub for individuals navigating remote work, functioning as both a job aggregator and a professional development guide. It provides a centralized platform for discovering remote employment opportunities across diverse industries while offering structured insights into the challenges and requirements of a home-based career. The platform distinguishes itself by combining automated job discovery with a curated knowledge base focused on the nuances of distributed work. It enables users to filter listings based on specific professional taxonomies and
Automates the fetching and normalization of job data from multiple external sources into a unified database.
This project is a command-line utility designed to monitor and analyze token consumption and financial expenditure for AI coding assistants. By parsing local session logs directly on the user's machine, it provides a privacy-focused way to track development activity without transmitting sensitive data to external servers. The tool distinguishes itself through its ability to aggregate disparate log formats from multiple coding assistants into a unified, schema-agnostic representation. It features a decoupled pricing engine that allows users to apply custom model-specific cost multipliers, over
Consolidates local log files from multiple coding assistants into a unified report to track token consumption and estimated costs across different tools.
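One way such consolidation might work is to map each tool's log fields onto a shared record shape and then price the totals separately. In the sketch below, the tool names, field names, model name, and prices are all hypothetical and chosen only for illustration; they are not taken from this project or from any real assistant's log format.

```python
import json

# Hypothetical per-tool field names; real log schemas will differ.
FIELD_MAPS = {
    "tool_a": {"model": "model", "input": "input_tokens", "output": "output_tokens"},
    "tool_b": {"model": "engine", "input": "prompt", "output": "completion"},
}

# Hypothetical prices in USD per million tokens.
PRICES = {"model-x": {"input": 3.0, "output": 15.0}}


def normalize(tool: str, raw_line: str) -> dict:
    """Map one JSON log line from a given tool onto the shared record shape."""
    record = json.loads(raw_line)
    fields = FIELD_MAPS[tool]
    return {
        "tool": tool,
        "model": record[fields["model"]],
        "input": int(record[fields["input"]]),
        "output": int(record[fields["output"]]),
    }


def report(entries: list[dict]) -> dict:
    """Sum tokens and estimated cost per model; unknown models cost 0."""
    totals = {}
    for e in entries:
        row = totals.setdefault(e["model"], {"input": 0, "output": 0, "cost": 0.0})
        row["input"] += e["input"]
        row["output"] += e["output"]
        price = PRICES.get(e["model"], {"input": 0.0, "output": 0.0})
        row["cost"] += (e["input"] * price["input"] + e["output"] * price["output"]) / 1_000_000
    return totals


entries = [
    normalize("tool_a", '{"model": "model-x", "input_tokens": 1000, "output_tokens": 200}'),
    normalize("tool_b", '{"engine": "model-x", "prompt": 500, "completion": 100}'),
]
totals = report(entries)
```

Keeping `PRICES` apart from the parsing step means the cost table can be swapped or overridden without touching how logs are read.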
wavesurfer.js is a WebAudio playback library and interactive waveform visualizer that renders audio data onto an HTML5 canvas. It enables users to see and navigate sound files through a visual representation of audio peaks, allowing for direct seeking and playback control within a web browser. The project is distinguished by its flexible rendering model, which can use precomputed peak data to display waveforms without downloading or decoding the full audio file. It utilizes a plugin-based extension model to integrate advanced tools such as spectrograms, interactive audio timelines, and real-t
Allows waveform rendering without downloading full audio files by using precomputed arrays of maximum amplitude values.
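Precomputed peak data of this kind can be produced ahead of time by reducing the decoded samples to one maximum amplitude per slice. The sketch below is an assumed offline step with a hypothetical `compute_peaks` helper; it is not part of the wavesurfer.js API, which only consumes the resulting array.

```python
def compute_peaks(samples: list[float], buckets: int) -> list[float]:
    """Return the maximum absolute amplitude for up to `buckets` equal slices."""
    if buckets <= 0 or not samples:
        return []
    size = -(-len(samples) // buckets)  # ceiling division
    return [
        max(abs(s) for s in samples[i:i + size])
        for i in range(0, len(samples), size)
    ]


peaks = compute_peaks([0.1, -0.8, 0.3, 0.2, -0.5, 0.4, 0.9, -0.1], buckets=4)
```

The resulting array is far smaller than the audio itself, which is why a player can draw the waveform from it before any audio has been fetched or decoded.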
The mongo-go-driver is a Go library for building applications that integrate with a MongoDB document store. It enables the storage and retrieval of flexible document data by providing a bridge between Go backends and the database. The driver implements specialized capabilities for semantic vector search, allowing the handling and execution of high-dimensional vector data for similarity-based retrieval. It also supports full-text search via linguistic analysis and programmatic search index management. The project covers a broad range of database operations, including document-based CRUD, bulk
Implements aggregation pipelines to transform and summarize documents for computed results.
Texture is an iOS framework for building user interfaces that render on background threads using thread-safe node abstractions. It provides an asynchronous display node architecture that constructs and composites view hierarchies off the main thread, then synchronises the final bitmap for presentation, enabling smooth and responsive apps. The framework replaces UIKit's standard view system with node-based hierarchies that can be created, configured, and mutated on any queue without locking the main thread. The framework distinguishes itself through a precomputed rendering pipeline that decode
Implements a precomputed rendering pipeline that caches image decoding and text sizing ahead of display.
InfoSpider is a personal data aggregator and digital footprint analyzer. It extracts user activity and history from social platforms and local browser database files to consolidate information into a unified format. The system functions as a social media archiving tool that converts feed data and albums from external links into downloadable PDF documents for offline preservation. It also serves as a browser history extractor that reads local SQLite database files to retrieve and analyze web navigation history. The project covers capabilities for data aggregation, digital footprint analysis,
Standardizes disparate information from various third-party digital services into a single unified format for consistent processing.
This project is a MongoDB database driver and object-relational mapper that brings MongoDB support to the Laravel Eloquent model and query builder. It provides a NoSQL model mapper that allows MongoDB collections to be mapped to object-oriented models using the Active Record pattern. The integration enables the use of a fluent query builder for constructing queries and aggregation pipelines without writing raw database syntax. It supports schema-less model integration, allowing applications to manage unstructured data while maintaining compatibility with standard object-oriented patterns. Th
Executes native MongoDB aggregation pipelines to compute summarized results from complex documents.
xorm is a relational mapper and object-relational mapping tool for Go. It translates Go structures into SQL queries and maps database rows back into native objects, providing a multi-dialect database driver that supports MySQL, PostgreSQL, SQLite, Oracle, SQL Server, and TiDB. The project features a read-write splitting manager that routes modification requests to a primary database and read requests to replicas. It includes a database schema synchronizer to automatically align table structures and indexes with application data models, as well as a fluent SQL query builder for constructing co
Provides native database aggregation functions to calculate sums and totals across filtered record sets.