22 repositories
Techniques for processing large datasets in small chunks to prevent memory overload.
Distinct from Stream Processing: focuses on local memory efficiency and chunking rather than real-time, high-velocity data analysis.
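As a rough illustration of the chunking idea described above, the sketch below hashes a byte stream one fixed-size chunk at a time, so peak memory stays near the chunk size rather than the file size. The helper names `iter_chunks` and `sha256_of_stream` and the 64 KiB default are assumptions made for this example, not taken from any repository listed here.

```python
import hashlib
from typing import BinaryIO, Iterator


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes (hypothetical helper)."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means end of stream
            return
        yield chunk


def sha256_of_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> str:
    """Hash a stream incrementally; only one chunk is held in memory at a time."""
    digest = hashlib.sha256()
    for chunk in iter_chunks(stream, chunk_size):
        digest.update(chunk)
    return digest.hexdigest()
```

The same loop works for any file opened in binary mode, for example `with open(path, "rb") as f: sha256_of_stream(f)`.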
Explore 22 awesome GitHub repositories matching data & databases · Memory-Efficient Data Streaming.
AISystem is a comprehensive AI full-stack infrastructure project covering the entire pipeline from AI chip architecture to high-level training frameworks. It encompasses the development of AI compiler frameworks, inference engines, and distributed training orchestrators designed to coordinate workloads across a heterogeneous compute stack of CPUs, GPUs, and NPUs. The project focuses on the deep integration of software and hardware, employing software-hardware co-design to align tensor layouts with physical memory structures. It provides specialized capabilities for accelerating Transformer mo
Divides large matrices into smaller blocks to balance memory bandwidth and maximize hardware compute utilization.
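The blocking idea can be sketched in a few lines. This is not AISystem's code; it is a plain-Python illustration, under the assumption of list-of-lists matrices and a hypothetical `block` parameter, of how a matrix product is split into tiles so that each tile's working set stays small. Real implementations do this in compiled kernels; only the loop structure carries over.

```python
def blocked_matmul(a, b, block=2):
    """Multiply matrices a (n x k) and b (k x m) tile by tile.

    Hypothetical helper for illustration: `block` is the tile edge length.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # The outer three loops walk over tiles; the inner three stay inside one tile.
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik = a[i][kk]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += a_ik * b[kk][j]
    return c
```

The result is the same as an untiled product; only the order in which elements are visited changes.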
PHPExcel is a PHP spreadsheet library used for programmatically reading and writing spreadsheet files in various formats. It utilizes an in-memory spreadsheet model that maps spreadsheet structures to a hierarchy of objects for programmatic manipulation. The library functions as an Office Open XML processor for generating and manipulating XLSX documents and serves as a reader for extracting data and structure from legacy binary XLS files. It also includes tools for CSV data integration and importing. The project provides capabilities for automated report generation and spreadsheet data extra
Implements chunk-based processing to minimize memory consumption when reading or writing large spreadsheet datasets.
This project is a structured Node.js programming course and educational guide designed to teach JavaScript backend development. It provides a sequence of workshops and interactive tutorials that focus on the fundamentals of the Node.js runtime and its core modules. The material emphasizes asynchronous programming, specifically covering non-blocking I/O, callback patterns, and event-driven architecture. It includes a practical exploration of the core API for managing network applications, file system operations, and binary data. The curriculum covers module management and dependency resolutio
Teaches how to process large datasets using streams to avoid loading entire files into memory.
This project is a software engineering style guide and a curated collection of architectural patterns and coding standards. It provides a multi-language coding standard to ensure maintainable software across Ruby, Python, JavaScript, and Swift. The project establishes a development workflow specification for version control, continuous integration, and peer review to maintain a linear project history. It also includes a web accessibility framework based on ARIA and WCAG standards, using design tokens and semantic HTML patterns to build inclusive interfaces. The guides cover a broad range of
Implements sequential chunk processing for infinite event streams to prevent memory overflows.
YARA is a pattern matching engine and binary analysis tool used to identify and classify malware samples. It functions as a malware research framework that allows for the definition of file descriptions and detection rules to find indicators of compromise within binaries. The system enables the creation of custom detection rules using strings, wildcards, and regular expressions. These rules use boolean logic to match textual or binary patterns, allowing for the classification of files into specific malware families and the automation of threat intelligence. The engine utilizes Aho-Corasick s
Processes large binaries in memory-efficient chunks to prevent system memory overload during scans.
llrt is a low-latency JavaScript runtime based on the QuickJS engine, specifically designed for executing asynchronous functions in serverless environments. It provides a lightweight execution layer optimized for fast startup times and minimal memory usage when running ES2023 workloads. The project differentiates itself by bundling natively optimized cloud service SDKs directly into the runtime binary to eliminate external dependency loading. To further reduce cold start latency, it implements parallel connection warming for TLS and network handshakes during the startup phase. The runtime co
Processes continuous data flows using buffers and stream interfaces for efficient memory management.
Higress is an AI-native and cloud-native API gateway that routes, secures, and optimizes traffic between clients and large language model services. It functions as a centralized entry point for microservices, serving as both a Kubernetes ingress controller and an AI gateway orchestrator. The project distinguishes itself by managing traffic across multiple AI providers using a unified protocol, incorporating token-aware rate limiting and response caching to optimize model inference. It coordinates communication between AI models and external tools to provide real-time context and data, while a
Processes request and response bodies as continuous data streams to minimize memory overhead for AI responses.
CloudSaver is a multi-cloud file transfer manager and storage aggregator designed to discover remote resources and save them directly to cloud drives. It functions as a cloud file downloader and management platform that enables the movement of data between different cloud storage providers without requiring files to be downloaded to a local device first. The system uses OAuth authentication to manage secure connections to third-party cloud drives, facilitating direct server-to-server data transfers. It incorporates asynchronous streaming to move data between remote sources and destinations, p
Uses memory-efficient data streaming to move large files between remote servers without loading them into RAM.
The C++ REST SDK is a library for asynchronous HTTP and RESTful communication in native C++ applications. It provides a non-blocking network client for sending requests and receiving responses, a JSON parser for serializing and deserializing data, and a WebSocket client library for real-time, full-duplex communication. The project includes a dedicated OAuth2 authentication client to manage access tokens and authorization flows for secure communication with protected cloud resources. It utilizes a task-based asynchronous model to coordinate background operations and keep application interfaces
Processes large network payloads in incremental chunks to maintain memory efficiency.
elasticsearch-dump is a command line tool for importing, exporting, and transferring data between Elasticsearch and OpenSearch instances. It functions as an index dump utility that saves documents, mappings, and analyzers to local files or standard output. The tool enables the movement of data between clusters using local files as an intermediary and can flatten nested JSON documents into CSV files for external analysis. It allows for the modification or anonymization of documents during the transfer process through the use of custom JavaScript functions. The utility covers data extraction a
Processes documents in sequential chunks to move data without overloading system memory.
This project is a learning guide and collection of study notes designed to teach Node.js backend development. It provides a comprehensive core API reference and practical demonstrations for implementing server-side logic, network programming, and system APIs. The guide specifically covers advanced technical domains including process management for scaling applications via clusters and child processes, as well as network programming for building TCP, UDP, and HTTP services. It also includes detailed instructional material on security implementation, focusing on cryptographic hashing and encryp
Processes large datasets incrementally in small chunks to maintain low memory overhead.
DbGate is a universal database management tool and SQL client that provides a unified interface for querying and administering multiple SQL and NoSQL databases. It functions as a multi-database administration GUI and SQL IDE, allowing users to write and execute scripts and manage database schemas. The project distinguishes itself by acting as an API client and explorer for REST, GraphQL, and OData services, enabling users to fetch and export data from these endpoints. It also serves as a data integration tool, facilitating the movement of records between diverse databases and file formats suc
Moves records between sources and destinations using a pipeline of readers and writers to handle large datasets efficiently.
Lit-llama is a PyTorch-based implementation framework for the LLaMA language model, providing a system for pre-training, fine-tuning, and high-performance inference. It includes a pre-training pipeline for building foundation language models from scratch and tools for running pretrained weights to generate natural text and predict sequences. The project provides specialized toolkits for parameter-efficient fine-tuning using low-rank adaptation (LoRA) and lightweight adapters. It also includes a quantization library that reduces model memory footprints through 4-bit and 8-bit precision to enable execution on resource-constrained hardware. The framework integrates a simplified transformer design and employs flash attention to optimize memory and speed. It also manages large-scale datasets through streaming data formats to avoid loading entire text corpora into system memory.
Processes massive datasets in small chunks from disk to prevent system memory overload during pre-training.
CppGuide is a curated collection of educational resources and practical guides focused on C++ server development, Linux kernel internals, concurrent programming, network protocols, and security exploitation. It provides structured learning paths for backend developers, covering everything from interview preparation to building high-performance network servers and understanding operating system fundamentals. The guide distinguishes itself by offering in-depth, hands-on tutorials that walk through real-world implementations, including building a Redis-like server from scratch, designing custom
Streams results through worker pools and pipelines to handle high-volume data efficiently.
X-Ray is a web scraping framework and asynchronous web crawler designed to extract structured data from websites. It works as an HTML data extractor that converts raw page content into a defined schema using CSS-style selectors. The project implements a headless browser crawler capable of executing JavaScript to render dynamic content. It handles site content discovery through a breadth-first crawling strategy and automatic pagination detection to traverse multi-page result sets. The framework manages web data pipelines using a concurrency-limited request queue and request rate control to regulate outbound network calls. Extracted results are handled through stream-based data persistence to process large datasets without overloading system memory.
Writes extracted data to streams to process large datasets without overloading system memory.
This library is a CSV data serializer and stringifier for transforming structured records into comma-separated values. It provides tools for converting data records into plain text via synchronous, callback-based, or stream-based implementations. The project distinguishes itself by offering a streaming implementation through the native Node.js Transform API, which allows for the processing of large datasets without loading all records into memory. It also includes a flexible formatting system to define specific delimiters, quotes, escape characters, and header configurations. The toolset cov
Utilizes a streaming pipeline to transform records into CSV format while minimizing memory usage.
more-itertools is a Python iterable utility library providing advanced functions for manipulating, filtering, and transforming data sequences. It serves as a data stream processing toolkit and a set of utilities for iterator state management, extending the capabilities of the standard Python itertools module. The library includes a combinatorial math toolkit for generating permutations, combinations, and powersets, alongside routines for number theory calculations and matrix operations. It also provides tools for stream state management, allowing users to peek at upcoming elements or seek wit
Offers a toolkit for chunking, interleaving, and flattening sequences to process large datasets with minimal memory overhead.
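A minimal standard-library sketch of the chunking pattern mentioned above, assuming only `itertools.islice`. It is written in the spirit of a chunking helper such as the one more-itertools offers, but it is not the library's implementation, and the name `chunked` here is local to this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(iterable: Iterable[T], n: int) -> Iterator[List[T]]:
    """Yield lists of up to n items, pulling from the source lazily."""
    if n < 1:
        raise ValueError("n must be at least 1")
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))  # consumes at most n items per step
        if not chunk:
            return
        yield chunk
```

Because the source is consumed lazily, it can be a generator over a large file or query result; only one chunk is materialized at a time.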
This project is a framework for generating synthetic tabular data that preserves the statistical properties and relational integrity of the original source datasets. It operates as a metadata-driven engine and uses language models to synthesize information even when the original training samples are restricted. The system is designed to maintain logical consistency across complex multi-table structures while ensuring that generated output adheres to the specified schema requirements. The platform is distinguished by its focus on privacy-preserving synthesis, integrating tools to measure and mitigate re-identification risk through differential privacy and anonymization techniques. It supports modular extensibility, allowing custom generation models and data connectors to be integrated. The framework also includes automated validation procedures that compare the distribution and correlation patterns of synthetic output against the source data to verify statistical fidelity. Beyond basic generation, the system provides data enrichment and feature engineering capabilities by deriving new columns from learned patterns. It also integrates operational monitoring tools to track resource usage and processing efficiency during high-volume jobs. The library is designed to handle large-scale datasets through memory-efficient stream processing and iterative batching to ensure stability.
Processes large-scale datasets in memory-efficient chunks to maintain system stability during high-volume generation.
Swift OpenAPI Generator is a build-time tool that produces type-safe Swift client and server code directly from OpenAPI specification documents. By integrating with build systems through native plugins, it automates the creation of strongly typed interfaces and protocol stubs that map network operations to native methods, ensuring application code stays fully consistent with the defined data schemas. The project features a protocol-oriented architecture that separates business logic from specific transport implementations. It uses a pluggable transport layer and middleware-based request interception to handle cross-cutting concerns such as authentication, logging, and metrics collection. This design lets developers maintain a consistent communication layer while remaining agnostic to the underlying web frameworks or network transport details. The generator supports a wide range of capabilities, including schema-based data mapping and content negotiation for different formats. It provides memory-efficient handling of large payloads through incremental stream processing, allowing complex data exchange without loading the full contents into memory. The toolkit also includes tools for automated contract testing and generating interactive documentation to help verify endpoint requirements.
Handles large request and response payloads incrementally to maintain memory efficiency during network exchanges.
Kotlinx-io is a multiplatform library designed for input and output operations, providing a unified interface for streaming data, managing byte buffers, and interacting with local filesystems. It serves as a cross-platform abstraction layer that standardizes how applications handle data movement across different operating systems and hardware architectures. The library distinguishes itself by providing high-performance tools for both mutable and immutable byte sequences. It utilizes segmented memory pools and direct memory access to minimize allocation overhead and prevent unnecessary data co
Processes large datasets in continuous flows to minimize memory usage.