79 repositories
Techniques for processing data streams that exceed available system memory.
Distinguishing note: Focuses on memory-efficient streaming rather than batch processing.
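As a rough illustration of the general technique, the sketch below reads a binary stream in fixed-size chunks so that only one chunk is held in memory at a time. The helper names `iter_chunks` and `count_lines` and the 64 KiB default chunk size are assumptions made for this example; they are not taken from any repository listed here.

```python
import io
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks from a binary stream until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def count_lines(stream: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Count newline bytes while holding only one chunk in memory at a time."""
    total = 0
    for chunk in iter_chunks(stream, chunk_size):
        total += chunk.count(b"\n")
    return total


if __name__ == "__main__":
    # A small in-memory stream stands in for a file larger than RAM.
    sample = io.BytesIO(b"alpha\nbeta\ngamma\n")
    print(count_lines(sample, chunk_size=4))  # prints 3
```

The same loop works unchanged on a file opened with `open(path, "rb")`, since peak memory depends on `chunk_size` rather than on the size of the input.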
Explore 79 GitHub repositories matching Data & Databases · Incremental Data Streaming.
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Processes data blocks incrementally to handle datasets that exceed total cluster memory capacity.
Vegeta is an HTTP load testing tool and library designed to measure the performance and stability of web services. It functions as a command-line utility, a programmable package for integration into other applications, and a distributed load generator capable of splitting request rates across multiple machines. The tool is distinguished by its constant-rate request scheduler, which dispatches requests at a fixed frequency regardless of target response times. It employs lazy target streaming to maintain low memory usage during large tests and uses a binary-encoded storage format to minimize di
Reads request definitions from files or standard input as a stream to keep memory usage low during large tests.
Toon is a data serialization library and toolkit designed to convert complex objects into compact, human-readable formats optimized for large language models. By focusing on token efficiency, the library minimizes the context window footprint of structured data through techniques like key folding and tabular layout optimization. It provides a streaming-capable processor that handles the encoding and decoding of hierarchical data while maintaining structural integrity. The project distinguishes itself through its path-aware transformation pipeline and configurable serialization logic, which al
Handles massive data records incrementally through event-driven stream processing to maintain memory efficiency.
Guzzle is a PHP HTTP client used for sending synchronous and asynchronous requests to web services. It serves as a concurrent HTTP request manager, an HTTP stream handler, and a middleware-based HTTP pipeline. The project is a PSR-7 compliant client, utilizing standardized PHP interfaces for requests, responses, and streams. The library differentiates itself through a customizable functional handler stack that allows for the interception and modification of the request and response lifecycle. It features an adapter-based transport system that enables swapping between network implementations,
Transfers large file uploads and downloads incrementally to avoid system memory exhaustion.
Vector is a high-performance observability data pipeline designed to collect, transform, and route logs, metrics, and traces across distributed infrastructure. It functions as a modular engine that decouples data ingestion from processing and transmission, utilizing a component-based architecture to connect diverse sources to multiple destinations. The project distinguishes itself through a focus on reliability and flow control. It implements backpressure-aware data movement to prevent data loss during traffic spikes and utilizes disk-backed event buffering to ensure durability during network
Transforms incremental metrics into absolute values by tracking changes over time to simplify historical data analysis.
Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams. The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
Enables memory-efficient processing of massive datasets by streaming records on demand from remote or local sources.
Readest is a comprehensive digital reading platform designed to manage, annotate, and consume electronic books across multiple devices. It functions as a versatile library manager and reading environment, supporting a wide range of user needs from standard ebook consumption to specialized study and accessibility-focused workflows. The platform distinguishes itself through advanced features like parallel text study, which enables side-by-side document rendering with synchronized scrolling, and a robust text-to-speech engine that provides hands-free reading with synchronized visual highlighting
Fetches remote book content in small chunks to allow immediate access to large documents without requiring full file downloads.
Excelize is a Go library designed for reading, writing, and modifying Microsoft Excel files in XML-based formats. It functions as a spreadsheet file parser and generator that enables the programmatic extraction and modification of data. The library includes a streaming spreadsheet processor to handle massive datasets incrementally, preventing system memory exhaustion during large-scale read and write operations. It also provides a chart generator to convert worksheet values or external data sources into visual representations within the spreadsheet. Beyond core file processing, the project c
Handles massive spreadsheet files through incremental streaming to minimize memory usage.
uWebSockets is a high-performance networking engine providing an HTTP web server and a WebSocket server framework. It implements a multi-threaded event loop architecture to deploy isolated application instances across multiple CPU cores and includes an SSL/TLS network layer for secure, encrypted communication. The project features a dedicated WebSocket pub/sub engine for distributing messages to specific groups of connected clients. It optimizes network throughput through syscall corking to reduce kernel overhead and employs payload compression to minimize data transfer sizes. The system cov
Processes large data payloads in chunks to minimize memory consumption during high-volume transfers.
brpc is a high-performance C++ RPC framework and network programming library designed for building distributed systems. It functions as a multi-protocol RPC server capable of hosting and detecting multiple communication protocols, including gRPC, Thrift, HTTP, Redis, and Memcached, on a single TCP port. The project distinguishes itself through high-throughput data transport and memory efficiency, utilizing RDMA-based transport to bypass the kernel TCP stack and zero-copy memory management to eliminate data duplication. It also implements the Raft algorithm for consensus-based state replicatio
Sends and receives data in continuous streams to handle large datasets efficiently.
Gensim is a natural language processing toolkit designed for large-scale text analysis and the training of semantic vector embeddings. It provides a framework for identifying latent thematic structures within document collections and calculating semantic similarity between text segments using unsupervised statistical algorithms. The project is distinguished by its ability to handle datasets that exceed available system memory through incremental corpus streaming, which processes documents one at a time from disk. It utilizes sparse vector representations and dictionary-based token mapping to
Enables processing of massive datasets that exceed system memory by streaming documents incrementally from disk.
This project is an asynchronous network framework for Python that provides both a client and a server for HTTP communication. It is designed to handle high-concurrency network operations by leveraging cooperative multitasking, allowing for the management of thousands of simultaneous connections without the overhead of traditional thread-per-request models. The framework distinguishes itself through its focus on efficient resource management and persistent communication. It utilizes connection pooling to reuse network sockets, which reduces latency during sequential requests, and supports full
Enables incremental stream processing to handle large data payloads with constant memory usage.
Hyper is a low-level networking library designed for building high-performance HTTP clients and servers. It provides a foundational toolkit for creating network services that leverage asynchronous execution and memory-safe data handling, supporting both HTTP/1 and HTTP/2 protocols. The library distinguishes itself through a protocol-agnostic architecture that separates transport logic from HTTP semantics. It utilizes a service-trait abstraction to decouple network logic from the underlying transport, enabling developers to inject custom middleware for request interception and response transfo
Processes large network payloads incrementally as asynchronous streams to maintain low memory usage.
ImageMagick is a comprehensive software suite for the creation, editing, composition, and conversion of digital images. It functions as both a command-line utility for batch processing and automation, and as a programming library that allows developers to integrate advanced image manipulation capabilities into external applications. The project is distinguished by its modular architecture, which supports hundreds of image formats through a pluggable coder system and external delegate libraries. It is designed for high-performance environments, utilizing memory-mapped pixel caching, stream-ori
Processes image pixels incrementally as they are read or written to minimize memory usage for massive files.
Cesium is a JavaScript geospatial visualization library and 3D globe engine designed to render world-scale environments and precision spatial data. It functions as a web-based mapping tool that displays hardware-accelerated 3D globes and 2D maps directly in a browser without requiring external plugins. The project operates as a 3D tiles renderer, utilizing the 3D Tiles open standard to stream and display large-scale geospatial datasets. It enables the visualization and analysis of high-accuracy spatial information across global environments. The library covers a broad range of capabilities,
Streams large-scale 3D tiles, terrain, and imagery from cloud or offline sources using open standards.
ExcelJS is a Node.js spreadsheet engine and manipulation library used for reading, writing, and modifying XLSX and CSV files. It functions as a formatting tool and asynchronous streaming parser for generating complex workbooks containing formulas, rich text, and custom styles. The library is distinguished by its ability to process large datasets using asynchronous data streaming and incremental processing, which minimizes memory usage during data extraction and file generation. Its capability surface covers comprehensive data management, including structured tables, named ranges, and cell da
Implements memory-efficient streaming by writing data in chunks to avoid memory overflow when generating large files.
This project is a comprehensive Python network request framework designed for both synchronous and asynchronous HTTP communication. It provides a high-performance client capable of executing non-blocking requests within event-driven applications, while also supporting standard blocking calls for simpler scripts. The library is built to operate natively across diverse asynchronous runtimes, automatically detecting and utilizing the underlying event loop for concurrency. What distinguishes this library is its modular architecture, which decouples request construction from network execution thro
Handles large request and response bodies as incremental byte chunks to maintain low memory usage.
PapaParse is a delimited text processing library that converts CSV files into JSON objects or arrays. It provides a suite of tools for parsing delimited text and transforming structured data objects back into CSV formats through bidirectional serialization. The library is characterized by its ability to process massive datasets using incremental streaming and chunk-based processing to prevent memory overload. It includes an automatic delimiter detector to identify separator characters without manual configuration and utilizes web workers to offload parsing logic to background threads, keeping
Emits parsed results incrementally to allow processing of data streams that exceed available memory.
This project is a Node.js client for PostgreSQL databases, providing a protocol parser to translate raw binary streams into JavaScript objects. It serves as a driver for executing queries, managing data, and integrating Node.js applications with PostgreSQL backends. The library includes a connection pool manager to reduce network overhead by caching reusable connections and a result streamer that uses cursors to retrieve large datasets incrementally. It also functions as an event listener for subscribing to asynchronous server-side notifications to trigger real-time application events. Broad
Retrieves large datasets incrementally using database cursors to prevent application memory overflow.
This project provides a lossless compression algorithm and a byte-oriented compression library designed for high-speed data reduction and maximum decompression speed. It functions as a stream-oriented compression engine, a software library for encoding and decoding data blocks, and a command-line tool for managing interoperable compressed frames. The system distinguishes itself through the use of predefined pattern dictionaries to improve compression ratios for small data sets and small packets. It supports multiple processing modes, including high-speed block compression for minimal latency
Processes data incrementally to handle large datasets without requiring the entire payload in memory.
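The incremental pattern described for the compression entry above can be sketched with a streaming compressor that accepts input one chunk at a time. This example uses `zlib` from the Python standard library as a stand-in; it is not the listed project's own API, and the function names and chunk size are assumptions chosen for illustration.

```python
import io
import zlib
from typing import BinaryIO


def compress_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 64 * 1024) -> None:
    """Compress src into dst chunk by chunk, never loading the full input."""
    compressor = zlib.compressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(compressor.compress(chunk))
    # flush() emits any data still buffered inside the compressor.
    dst.write(compressor.flush())


def decompress_stream(src: BinaryIO, dst: BinaryIO, chunk_size: int = 64 * 1024) -> None:
    """Decompress src into dst chunk by chunk.

    Note: the output produced for one input chunk can be larger than the chunk.
    """
    decompressor = zlib.decompressobj()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decompressor.decompress(chunk))
    dst.write(decompressor.flush())


if __name__ == "__main__":
    original = b"incremental data streaming " * 1000
    packed, unpacked = io.BytesIO(), io.BytesIO()
    compress_stream(io.BytesIO(original), packed, chunk_size=256)
    packed.seek(0)
    decompress_stream(packed, unpacked, chunk_size=256)
    print(unpacked.getvalue() == original)  # prints True
```

Because the compressor object carries its state between calls, the input never has to be resident in memory as a whole, which is the property the entry highlights.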