22 repositories
Techniques for processing large datasets in small chunks to prevent memory overload.
Distinct from Stream Processing: focuses on local memory efficiency and chunking rather than real-time, high-velocity data analysis.
Explore 22 awesome GitHub repositories matching data & databases · Memory-Efficient Data Streaming.
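As a rough illustration of the chunking idea this category describes, the following Python sketch hashes a file while holding only one block in memory at a time. The function names `read_in_chunks` and `sha256_of_file` and the 64 KiB default chunk size are illustrative assumptions, not taken from any listed repository.

```python
import hashlib
from typing import BinaryIO, Iterator


def read_in_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield blocks of at most chunk_size bytes until the stream is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return
        yield chunk


def sha256_of_file(path: str, chunk_size: int = 64 * 1024) -> str:
    """Hash a file of any size; peak memory is roughly one chunk (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in read_in_chunks(f, chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use depends on `chunk_size` rather than on the file size, which is the property that separates this approach from reading the whole file with a single `read()` call.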
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
Divides large matrices into smaller blocks to balance memory bandwidth and maximize hardware compute utilization.
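A minimal sketch of block-wise (tiled) matrix multiplication, written in plain Python for readability. It only illustrates the general idea of working on small sub-blocks; the function name `blocked_matmul` and the `block` parameter are assumptions for this example and do not reflect AISystem's actual code.

```python
from typing import List

Matrix = List[List[float]]


def blocked_matmul(a: Matrix, b: Matrix, block: int = 2) -> Matrix:
    """Multiply a (n x k) by b (k x m), visiting the operands one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # The three outer loops pick a tile; the three inner loops stay inside it,
    # so each step only touches block-sized regions of a, b, and c.
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c
```

On real hardware the tile size would be chosen to fit a cache or on-chip buffer; in pure Python the tiling shows the access pattern only and gives no speedup.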
PHPExcel is a PHP spreadsheet library used for programmatically reading and writing spreadsheet files in various formats. It utilizes an in-memory spreadsheet model that maps spreadsheet structures to a hierarchy of objects for programmatic manipulation. The library functions as an Office Open XML processor for generating and manipulating XLSX documents and serves as a reader for extracting data and structure from legacy binary XLS files. It also includes tools for CSV data integration and importing. The project provides capabilities for automated report generation and spreadsheet data extra
Implements chunk-based processing to minimize memory consumption when reading or writing large spreadsheet datasets.
This project is a structured Node.js programming course and educational guide designed to teach JavaScript backend development. It provides a sequence of workshops and interactive tutorials that focus on the fundamentals of the Node.js runtime and its core modules. The material emphasizes asynchronous programming, specifically covering non-blocking I/O, callback patterns, and event-driven architecture. It includes a practical exploration of the core API for managing network applications, file system operations, and binary data. The curriculum covers module management and dependency resolutio
Teaches how to process large datasets using streams to avoid loading entire files into memory.
This project is a software engineering style guide and a curated collection of architectural patterns and coding standards. It provides a multi-language coding standard to ensure maintainable software across Ruby, Python, JavaScript, and Swift. The project establishes a development workflow specification for version control, continuous integration, and peer review to maintain a linear project history. It also includes a web accessibility framework based on ARIA and WCAG standards, using design tokens and semantic HTML patterns to build inclusive interfaces. The guides cover a broad range of
Implements sequential chunk processing for infinite event streams to prevent memory overflows.
YARA is a pattern matching engine and binary analysis tool used to identify and classify malware samples. It functions as a malware research framework that allows for the definition of file descriptions and detection rules to find indicators of compromise within binaries. The system enables the creation of custom detection rules using strings, wildcards, and regular expressions. These rules use boolean logic to match textual or binary patterns, allowing for the classification of files into specific malware families and the automation of threat intelligence. The engine utilizes Aho-Corasick s
Processes large binaries in memory-efficient chunks to prevent system memory overload during scans.
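One common way to scan a large binary in chunks without missing matches that straddle a chunk boundary is to carry over the last `len(pattern) - 1` bytes of each window. The sketch below shows that idea for a single literal byte pattern. It is not YARA's implementation (which the description says is based on Aho-Corasick), and `find_pattern_offsets` is a hypothetical name.

```python
from typing import BinaryIO, List


def find_pattern_offsets(stream: BinaryIO, pattern: bytes,
                         chunk_size: int = 1 << 20) -> List[int]:
    """Return absolute offsets of every occurrence of pattern, reading in chunks."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    overlap = len(pattern) - 1
    offsets: List[int] = []
    tail = b""   # bytes carried over from the previous window
    base = 0     # absolute offset of the first byte of the current window
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        pos = window.find(pattern)
        while pos != -1:
            offsets.append(base + pos)
            pos = window.find(pattern, pos + 1)
        # Keep just enough bytes to complete a match that starts near the end.
        # The tail is shorter than the pattern, so no match is reported twice.
        tail = window[-overlap:] if overlap else b""
        base += len(window) - len(tail)
    return offsets
```

Peak memory is roughly `chunk_size + len(pattern)` bytes regardless of the size of the scanned file.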
llrt is a low-latency JavaScript runtime based on the QuickJS engine, specifically designed for executing asynchronous functions in serverless environments. It provides a lightweight execution layer optimized for fast startup times and minimal memory usage when running ES2023 workloads. The project differentiates itself by bundling natively optimized cloud service SDKs directly into the runtime binary to eliminate external dependency loading. To further reduce cold start latency, it implements parallel connection warming for TLS and network handshakes during the startup phase. The runtime co
Processes continuous data flows using buffers and stream interfaces for efficient memory management.
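Higress is an AI-native and cloud-native API gateway for routing, securing, and optimizing traffic between clients and large language model services. It serves as a centralized entry point for microservices, acting as both a Kubernetes Ingress controller and an AI gateway orchestrator. The project stands out by managing traffic across multiple AI providers through a unified protocol, combining token-aware rate limiting and response caching to optimize model inference. It coordinates communication between AI models and external tools to provide real-time context and data, and also hosts server endpoints for AI agents. Broader capabilities include API security enforcement through a web application firewall (WAF), automated TLS certificate management, and dynamic service discovery. The gateway supports custom request processing through sandboxed WebAssembly plugins, allowing traffic transformations via hot reloading. The system implements the standardized Ingress API to manage network routing within containerized clusters with low resource overhead.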
Processes request and response bodies as continuous data streams to minimize memory overhead for AI responses.
CloudSaver is a multi-cloud file transfer manager and storage aggregator designed to discover remote resources and save them directly to cloud drives. It functions as a cloud file downloader and management platform that enables the movement of data between different cloud storage providers without requiring files to be downloaded to a local device first. The system uses OAuth authentication to manage secure connections to third-party cloud drives, facilitating direct server-to-server data transfers. It incorporates asynchronous streaming to move data between remote sources and destinations, p
Uses memory-efficient data streaming to move large files between remote servers without loading them into RAM.
The C++ REST SDK is a library for asynchronous HTTP and RESTful communication in native C++ applications. It provides a non-blocking network client for sending requests and receiving responses, a JSON parser for serializing and deserializing data, and a WebSocket client library for real-time, full-duplex communication. The project includes a dedicated OAuth2 authentication client to manage access tokens and authorization flows for secure communication with protected cloud resources. It utilizes a task-based asynchronous model to coordinate background operations and keep application interfaces
Processes large network payloads in incremental chunks to maintain memory efficiency.
elasticsearch-dump is a command line tool for importing, exporting, and transferring data between Elasticsearch and OpenSearch instances. It functions as an index dump utility that saves documents, mappings, and analyzers to local files or standard output. The tool enables the movement of data between clusters using local files as an intermediary and can flatten nested JSON documents into CSV files for external analysis. It allows for the modification or anonymization of documents during the transfer process through the use of custom JavaScript functions. The utility covers data extraction a
Processes documents in sequential chunks to move data without overloading system memory.
This project is a learning guide and collection of study notes designed to teach Node.js backend development. It provides a comprehensive core API reference and practical demonstrations for implementing server-side logic, network programming, and system APIs. The guide specifically covers advanced technical domains including process management for scaling applications via clusters and child processes, as well as network programming for building TCP, UDP, and HTTP services. It also includes detailed instructional material on security implementation, focusing on cryptographic hashing and encryp
Processes large datasets incrementally in small chunks to maintain low memory overhead.
DbGate is a universal database management tool and SQL client that provides a unified interface for querying and administering multiple SQL and NoSQL databases. It functions as a multi-database administration GUI and SQL IDE, allowing users to write and execute scripts and manage database schemas. The project distinguishes itself by acting as an API client and explorer for REST, GraphQL, and OData services, enabling users to fetch and export data from these endpoints. It also serves as a data integration tool, facilitating the movement of records between diverse databases and file formats suc
Moves records between sources and destinations using a pipeline of readers and writers to handle large datasets efficiently.
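A reader/writer pipeline of this kind can be sketched with Python generators: the reader yields one record at a time and the writer flushes fixed-size batches to a sink. This is an illustrative assumption about the general pattern, not DbGate's code; `read_json_lines`, `write_in_batches`, and the `write_batch` callback are hypothetical names.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List, TextIO

Record = Dict[str, object]


def read_json_lines(source: TextIO) -> Iterator[Record]:
    """Reader stage: parse one JSON document per line, skipping blank lines."""
    for line in source:
        line = line.strip()
        if line:
            yield json.loads(line)


def write_in_batches(records: Iterable[Record],
                     write_batch: Callable[[List[Record]], None],
                     batch_size: int = 1000) -> int:
    """Writer stage: hand records to write_batch in groups of at most batch_size."""
    batch: List[Record] = []
    total = 0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            write_batch(batch)
            total += len(batch)
            batch = []  # start a fresh list so the sink may keep the old one
    if batch:
        write_batch(batch)
        total += len(batch)
    return total
```

Because the reader is lazy, at most one batch of records is held in memory at a time; in practice `write_batch` might wrap a bulk insert into the destination database.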
Lit-llama is a PyTorch-based implementation framework for the LLaMA language model, providing a system for pre-training, fine-tuning, and high-performance inference. It includes a pre-training pipeline for creating foundational language models from scratch and tools for running pretrained weights to generate natural text and predict sequences. The project provides specialized toolkits for parameter-efficient fine-tuning using low-rank adaptation and lightweight adapters. It also includes a quantization library that reduces model memory footprints through four-bit and eight-bit precision to en
Processes massive datasets in small chunks from disk to prevent system memory overload during pre-training.
CppGuide is a curated collection of educational resources and practical guides focused on C++ server development, Linux kernel internals, concurrent programming, network protocols, and security exploitation. It provides structured learning paths for backend developers, covering everything from interview preparation to building high-performance network servers and understanding operating system fundamentals. The guide distinguishes itself by offering in-depth, hands-on tutorials that walk through real-world implementations, including building a Redis-like server from scratch, designing custom
Streams results through worker pools and pipelines to handle high-volume data efficiently.
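X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It works as an HTML data extractor that uses CSS-style selectors to convert raw page content into defined schemas. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles site content discovery through a breadth-first crawling strategy and automatic pagination discovery to traverse multi-page result sets. The framework manages the web data pipeline with a concurrency-limited request queue and request rate control to regulate outgoing network calls. Extracted results are processed through stream-based data persistence so that large datasets can be handled without exhausting system memory.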
Writes extracted data to streams to process large datasets without overloading system memory.
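This library is a CSV data serialization and stringification tool for converting structured records into comma-separated values. It provides tools for converting data records to plain text through synchronous, callback-based, or stream-based implementations. The project is distinguished by a streaming implementation built on the native Node.js Transform API, which allows large datasets to be processed without loading all records into memory. It also includes a flexible formatting system for defining specific delimiters, quotes, escape characters, and header configurations. The toolset covers data export automation and record-to-string mapping, supporting programmatic file generation from database records or API responses.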
Utilizes a streaming pipeline to transform records into CSV format while minimizing memory usage.
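The library described above targets Node.js. As a language-neutral illustration of the same idea, this sketch uses Python's standard `csv` module to write records row by row from any iterable, so the full dataset never has to be held in memory. `stream_records_to_csv` is a hypothetical helper name.

```python
import csv
from typing import Dict, Iterable, Sequence, TextIO


def stream_records_to_csv(records: Iterable[Dict[str, object]],
                          fieldnames: Sequence[str],
                          out: TextIO) -> int:
    """Write a header, then one CSV row per record as each record arrives."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for record in records:  # records may be a generator or a database cursor
        writer.writerow(record)
        count += 1
    return count
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=""` so that the writer controls line endings.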
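more-itertools is a Python iterable utility library that provides advanced functions for manipulating, filtering, and transforming data sequences. It serves as a data stream processing toolkit and a set of tools for iterator state management, extending the functionality of the standard Python itertools module. The library includes a combinatorics toolkit for generating permutations, combinations, and power sets, as well as routines for number-theoretic computations and matrix operations. It also provides tools for stream state management that let users peek at upcoming elements or seek within a sequence to control how data is consumed. Additional functionality covers data processing routines for chunking, interleaving, and flattening complex sequences. The toolkit also includes functions for analyzing iterable properties and synchronizing concurrent data streams.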
Offers a toolkit for chunking, interleaving, and flattening sequences to process large datasets with minimal memory overhead.
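The chunking idea can be sketched with only the standard library. The helper below, `iter_chunks` (a hypothetical name), behaves much like the library's `chunked` function for the simple case shown here, but it is a standalone approximation, not the library's source.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_chunks(iterable: Iterable[T], size: int) -> Iterator[List[T]]:
    """Lazily yield lists of up to `size` items; the last list may be shorter."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

For example, `list(iter_chunks(range(7), 3))` gives `[[0, 1, 2], [3, 4, 5], [6]]`. Because the input is consumed lazily, it can be a generator over a source far larger than memory.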
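This project is a framework for generating synthetic tabular data that preserves the statistical properties and relational integrity of the original source dataset. It works as a metadata-driven engine that uses language models to synthesize information, even when original training samples are limited. The system is designed to maintain logical consistency across complex multi-table structures while ensuring that generated output conforms to defined schema requirements. The platform stands out through its focus on privacy-preserving synthesis, integrating tools that quantify and mitigate re-identification risk through differential privacy and anonymization techniques. It supports modular extension, allowing the integration of custom generative models and data connectors. In addition, the framework includes automated validation routines that compare the distributions and correlation patterns of synthetic output against the source data to verify statistical fidelity. Beyond core generation, the system provides data augmentation and feature engineering capabilities by deriving new columns from learned patterns. It incorporates operational monitoring tools to track resource utilization and processing efficiency during large-scale jobs. The library is designed to handle large-scale datasets through memory-efficient stream processing and iterative batching to ensure stability.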
Processes large-scale datasets in memory-efficient chunks to maintain system stability during high-volume generation.
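Swift OpenAPI Generator is a build-time tool that generates type-safe Swift client and server code directly from OpenAPI specification documents. By integrating with the build system through a native plugin, it automates the creation of strongly typed interfaces and protocol stubs, mapping network operations to native methods and keeping application code strictly consistent with the defined data schemas. The project stands out through a protocol-oriented architecture that decouples business logic from specific transport implementations. It uses a pluggable transport layer and middleware-based request interception to handle cross-cutting concerns such as authentication, logging, and metrics collection. This design lets developers keep the communication layer consistent without depending on the underlying web framework or network transport details. The generator supports a broad range of features, including schema-driven data mapping and content negotiation for various formats. It provides memory-efficient handling of large payloads through incremental stream processing, allowing complex data to be exchanged without loading the full content into memory. The toolset also includes tools for automated contract testing and generating interactive documentation to help validate endpoint requirements.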
Handles large request and response payloads incrementally to maintain memory efficiency during network exchanges.
Kotlinx-io is a multiplatform library designed for input and output operations, providing a unified interface for streaming data, managing byte buffers, and interacting with local filesystems. It serves as a cross-platform abstraction layer that standardizes how applications handle data movement across different operating systems and hardware architectures. The library distinguishes itself by providing high-performance tools for both mutable and immutable byte sequences. It utilizes segmented memory pools and direct memory access to minimize allocation overhead and prevent unnecessary data co
Processes large datasets in continuous flows to minimize memory usage.