25 repositories
Techniques for structuring and summarizing raw data into compact formats to optimize storage and latency.
Distinguishing note: Focuses on schema-based summarization for AI memory, distinct from general file compression.
Explore 25 awesome GitHub repositories matching data & databases · Data Compression.
Mem0 is an agent-agnostic memory layer designed to provide intelligent agents with long-term persistence and cross-session state management. By acting as a centralized service, it allows diverse AI agents to recall user preferences, past interactions, and historical context, ensuring continuity across multiple workflows and independent agent systems. The platform distinguishes itself through a multi-signal retrieval engine that combines semantic vectors, keyword matching, and entity-linked metadata to surface the most relevant information. It employs an adaptive memory engine that automatical
Summarizes and structures raw interaction data into compact, machine-readable formats to optimize storage efficiency and retrieval latency.
LevelDB is an embedded database library and persistent storage engine that provides a sorted key-value store. It uses a log-structured merge-tree architecture to map byte arrays to values, running directly within a process to provide storage without the need for a separate server process. The system is distinguished by its use of custom comparison functions to define key ordering, enabling efficient range scans and sequenced lookups. It ensures data reliability through atomic batch execution, consistent snapshot generation, and log-based recovery after failures. The engine covers broad capab
Compresses and decompresses data to balance processing performance with disk space reduction.
XGBoost is a distributed machine learning library for implementing scalable gradient boosting decision trees used for regression, classification, and ranking. It functions as a predictive model framework and a cross-language toolkit, providing a core implementation with native bindings for Python, R, Java, Scala, and C++. The system is designed as a GPU-accelerated library that utilizes CUDA and NCCL to speed up the training of decision tree ensembles. It operates as a distributed framework capable of scaling training and prediction across multi-node clusters and GPU environments to process m
Handles massive datasets by storing data in compressed on-disk blocks and loading them as needed.
This project is a reactive, offline-first NoSQL database engine designed for JavaScript applications. It provides a robust framework for managing application state by synchronizing data across browsers, mobile devices, and server-side runtimes. By treating local storage as the primary source of truth, it enables applications to remain functional without network connectivity, automatically reconciling changes with remote backends once a connection is restored. The database distinguishes itself through a modular architecture that supports cross-environment synchronization and high-performance d
Reduces storage footprint by automatically mapping long attribute names to shorter keys based on schema.
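The schema-driven key shortening described above can be illustrated with a minimal sketch. It assumes flat documents and a hypothetical short-key format (`|0`, `|1`, ...); the function names are illustrative and are not taken from the project's API.

```python
import json


def build_key_table(attribute_names):
    """Derive a long-name -> short-key table from the schema's attribute names.

    Sorting makes the table deterministic, so every client that holds the
    same schema derives the same mapping without exchanging it.
    The "|<hex index>" key format is an assumption made for this sketch.
    """
    return {name: f"|{i:x}" for i, name in enumerate(sorted(attribute_names))}


def compress_doc(doc, table):
    """Replace known attribute names with their short keys (flat docs only)."""
    return {table.get(key, key): value for key, value in doc.items()}


def decompress_doc(doc, table):
    """Restore the original attribute names from short keys."""
    reverse = {short: long for long, short in table.items()}
    return {reverse.get(key, key): value for key, value in doc.items()}


schema_attributes = ["firstName", "lastName", "shoppingCartItems"]
table = build_key_table(schema_attributes)

doc = {"firstName": "Ada", "lastName": "Lovelace", "shoppingCartItems": 3}
packed = compress_doc(doc, table)
restored = decompress_doc(packed, table)

print(len(json.dumps(doc)), "->", len(json.dumps(packed)), "bytes as JSON")
```

The saving grows with the number of stored documents, because the long names are kept once in the schema instead of once per document.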
Forem is an open-source platform designed for building and managing technical communities. It functions as a social publishing engine that enables members to share long-form content, participate in threaded discussions, and engage through social interactions. The platform provides tools for organizations to maintain branded profiles, host community hackathons, and facilitate collaborative learning through structured educational tracks. Beyond its social features, Forem integrates advanced capabilities for AI agent workflow orchestration and codebase knowledge graphing. It allows developers to
Provides tools to trigger compression tasks on demand for better resource management and storage optimization.
TimescaleDB is an open-source PostgreSQL extension that adds native time-series capabilities to the database. At its core, it transforms standard PostgreSQL tables into hypertables—automatically partitioned by time intervals—so data is stored in fixed-size chunks without manual sharding. The extension includes a library of over 200 built-in SQL functions purpose-built for time-series workloads, such as time bucketing, gap filling, percentile estimation, and time-weighted averages. What distinguishes TimescaleDB from generic PostgreSQL is its set of integrated time-series features that work th
Compresses time-series data by converting row-oriented data to a columnar format with type-specific compression.
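To make the row-to-columnar idea concrete, here is a small sketch in Python. The two encoders (delta encoding for integer timestamps, run-length encoding for repeated labels) are stand-ins chosen for illustration; they are not claimed to be the algorithms the extension actually uses, and the column names are hypothetical.

```python
def rows_to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def delta_encode(values):
    """Keep the first value, then only the difference to the previous value.

    Assumes a non-empty list of integers.
    """
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = [deltas[0]]
    for delta in deltas[1:]:
        out.append(out[-1] + delta)
    return out


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


rows = [
    {"time": 1700000000 + 10 * i, "device": "sensor-a", "temp": 20 + i % 3}
    for i in range(6)
]
columns = rows_to_columns(rows)

# Each column gets an encoder suited to its type and value distribution.
encoded_time = delta_encode(columns["time"])
encoded_device = run_length_encode(columns["device"])
```

Grouping values by column is what makes the type-specific step possible: regularly spaced timestamps turn into a run of small, identical deltas, and a repeated device label collapses into a single run.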
Planning with files is an enterprise knowledge graph platform designed to transform unstructured organizational data into a searchable, interconnected network. By utilizing a graph-based retrieval-augmented generation engine, the system grounds language model outputs in verified internal data, ensuring that responses are explainable, traceable, and free from hallucinations. The platform distinguishes itself through a focus on data sovereignty and secure, private infrastructure deployment. It enables organizations to maintain full control over sensitive information by processing data locally o
Optimizes internal data formats to reduce computational overhead and lower the cost of processing large-scale organizational knowledge.
StockSharp is an algorithmic trading platform and quantitative framework used for developing and deploying trading robots across stock, forex, and cryptocurrency markets. It functions as a multi-asset trading gateway and a dedicated development environment for building, debugging, and scheduling automated strategies. The platform includes a visual strategy workflow editor that maps logic blocks to executable code and a simulation engine that replays historical tick data to validate trading logic. It utilizes a plugin-based broker integration system to normalize diverse exchange protocols into
Uses specialized compression formats to reduce disk usage and increase read speed for tick-level data.
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Utilizes AI-enabled compression to represent large-scale volumes more efficiently.
better-sqlite3 is a high-performance SQLite3 client for Node.js that executes queries synchronously, returning results directly without callbacks or promises. It compiles as a native addon using N-API, binding directly to the SQLite3 C library for immediate query execution and zero-copy result serialization into native JavaScript objects. The library is optimized for Write-Ahead Logging (WAL) mode, enabling faster concurrent reads and writes in web applications. It provides durability level tuning through the synchronous pragma, allowing adjustments between FULL, NORMAL, and OFF modes to bala
Processes queries efficiently on multi-gigabyte databases using proper indexing and joins.
Logos is a curated collection of optimized SVG logos for developer tools and brands, stored as individual SVG files in a flat directory structure. The collection is manually selected and optimized to ensure quality and consistency, with each logo served as a raw SVG file that browsers and tools can render natively. The collection supports direct file-system access through its flat directory storage, and includes a lightweight index of brand names and file paths for fast keyword-based logo lookup. Logos are delivered as static assets over HTTP, relying on standard web server caching for perfor
Applies SVG optimization to reduce file size while preserving visual fidelity of logos.
zlib is a lossless data compression library that implements the deflate compression algorithm, combining LZ77 sliding window and Huffman coding. It provides the core compression and decompression engines, along with support for gzip, zlib, and raw deflate stream formats, enabling data to be compressed and restored without any loss of information. The library offers a range of capabilities for handling compressed data, including single-call memory and file operations, as well as incremental stream-based processing for working with data larger than available memory. It includes mechanisms for a
Supports processing data exceeding 4 GB without loss or corruption.
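The incremental, stream-based mode mentioned in the description can be tried through Python's standard `zlib` module, which wraps the library. This sketch compresses and restores data chunk by chunk so the full payload never has to sit in one buffer; the helper names and the chunk generator are illustrative.

```python
import zlib


def compress_stream(chunks, level=6):
    """Compress an iterable of byte chunks incrementally."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        data = compressor.compress(chunk)
        if data:  # the compressor may buffer input and emit nothing yet
            yield data
    yield compressor.flush()  # emit whatever is still buffered


def decompress_stream(chunks):
    """Decompress an iterable of compressed byte chunks incrementally."""
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


def generate_chunks(count, size=64 * 1024):
    """Produce synthetic input: `count` chunks of `size` repeated bytes."""
    for i in range(count):
        yield bytes([i % 256]) * size


original = b"".join(generate_chunks(16))
compressed = list(compress_stream(generate_chunks(16)))
restored = b"".join(decompress_stream(compressed))
```

Only the joins used for the final comparison hold the whole payload in memory; in real use each yielded chunk would be written to a file or socket as it is produced.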
Apache IoTDB is a time-series database designed for the Internet of Things, purpose-built to ingest high-volume data from millions of low-power devices and store timestamp-value pairs with configurable data types and encoding schemes. It organizes time series data and device metadata in a tree-like hierarchy, enabling efficient management of complex industrial sensor networks. The database supports rich querying capabilities, including time-aligned data retrieval across multiple devices, time-based aggregation like downsampling, and frequency-domain signal analysis. It provides high-throughpu
Compresses time-series data with high-ratio algorithms to reduce hardware storage costs.
Reduces the storage and memory footprint of sparse 3D volumes like smoke and clouds using neural compression techniques.
CppGuide is a curated collection of educational resources and practical guides focused on C++ server development, Linux kernel internals, concurrent programming, network protocols, and security exploitation. It provides structured learning paths for backend developers, covering everything from interview preparation to building high-performance network servers and understanding operating system fundamentals. The guide distinguishes itself by offering in-depth, hands-on tutorials that walk through real-world implementations, including building a Redis-like server from scratch, designing custom
Compresses swapped-out pages in RAM before they reach disk to reduce memory pressure.
Applies SVGO compression to SVG logos on demand without modifying the stored originals.
Loro is a conflict-free replicated data type (CRDT) framework and collaborative state engine designed for building real-time collaborative applications. It provides a distributed data synchronizer that enables multiple users to edit shared documents and complex nested structures—such as maps, lists, trees, and counters—with automatic state convergence without requiring a central server. The project distinguishes itself through a versioned document store that supports branching, forking, and merging via a directed acyclic graph of causal operation history. It enables advanced version control c
Computes compact diffs between document versions by removing canceling operations to optimize network data transfer.
GluonTS is a probabilistic time series forecasting framework designed to predict future values as probability distributions with confidence intervals. It supports both conventional model training and zero-shot forecasting, in which pretrained models generate forecasts for new series without additional training. The project is distinguished by integrating a wide range of forecasting approaches into a unified workflow. This includes deep learning architectures such as recurrent neural networks and causal convolutions, as well as integration of external statistical models, the Prophet library, and R packages. The toolkit provides a comprehensive surface for time series data engineering, covering dataset augmentation, splitting, and conversion of raw temporal data into tensors. It also includes a set of evaluation tools for measuring forecast accuracy and uncertainty intervals, along with tools for dataset persistence using formats such as Arrow and Parquet. The framework supports deploying forecasting models within cloud infrastructure.
Writes datasets to binary Arrow or Parquet files using configurable compression and array flattening.
OpenTSDB is a distributed time series database and metrics engine designed to store and manage massive volumes of high-cardinality system metrics. It functions as a data store and analytics platform that enables large-scale metric ingestion and infrastructure performance monitoring across a distributed cluster. The system is distinguished by a distributed storage abstraction that supports multiple backends such as HBase, Cassandra, and Google Bigtable. It uses a hierarchical metric tree to organize time series and numeric identifier indexing to reduce storage footprint and speed up lookups for tagged metrics. The project covers broad capability areas including time series data analysis with distributed percentile calculations and downsampling, as well as comprehensive metadata management. It provides API integration for data ingestion and querying, off-heap caching to improve performance, and tools for auditing data integrity and analyzing anomalies. The system is managed through a command-line interface for database administration and metric tree synchronization.
Merges multiple columns within a row into a single column to reduce the physical disk space usage.
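A rough sketch of the column-merging step follows. The byte layout is an assumption made for illustration (a 2-byte offset as the column qualifier and an 8-byte float as the value) and is not the project's actual on-disk format.

```python
import struct

# Hypothetical fixed-width layout, chosen only for this sketch.
QUALIFIER = struct.Struct(">H")  # seconds offset of the point within the row
VALUE = struct.Struct(">d")      # the measurement as a big-endian double


def compact_row(points):
    """Merge many {offset: value} columns into one (qualifier, value) pair.

    The qualifiers are concatenated in time order, and so are the values,
    so one wide cell replaces many narrow ones.
    """
    ordered = sorted(points.items())
    qualifier = b"".join(QUALIFIER.pack(offset) for offset, _ in ordered)
    value = b"".join(VALUE.pack(v) for _, v in ordered)
    return qualifier, value


def expand_row(qualifier, value):
    """Split a compacted cell back into its individual data points."""
    offsets = [offset for (offset,) in QUALIFIER.iter_unpack(qualifier)]
    values = [v for (v,) in VALUE.iter_unpack(value)]
    return dict(zip(offsets, values))


points = {30: 0.5, 0: 1.25, 15: 2.0}
qualifier, value = compact_row(points)
restored = expand_row(qualifier, value)
```

The payload bytes themselves do not shrink here. The saving in a wide-column store such as HBase comes from per-cell overhead: each stored cell carries its own copy of the row key and other metadata, so one merged cell pays that cost once instead of once per data point.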
m3 is a distributed time series database designed for high-resolution metrics and high-cardinality data management. It functions as a scalable storage system and a multi-cluster query engine, providing a distributed metrics aggregator capable of downsampling and summarizing data before it is committed to storage. The project distinguishes itself through a coordinated cluster model using etcd for node membership and shard placement. It supports multiple ingestion protocols, including the Prometheus remote write protocol, InfluxDB line protocol, and Graphite Carbon plaintext protocol, and provi
Implements specialized compression algorithms and hybrid encoding to reduce the memory and disk footprint of time series.
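The listing does not name the algorithms involved, so the following is only a generic illustration of one common time-series technique: XOR-ing each float sample with the previous one, in the style of Gorilla-type codecs. It is an assumption that something along these lines applies here; the function names are illustrative.

```python
import struct


def float_to_bits(value):
    """Reinterpret a float as its 64-bit IEEE 754 pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def bits_to_float(bits):
    """Reinterpret a 64-bit pattern as a float."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


def xor_encode(values):
    """XOR each sample's bit pattern with the previous sample's pattern.

    Unchanged samples become 0, and slowly changing samples yield words
    with long runs of zero bits, which a bit-packer can store compactly.
    """
    encoded, previous = [], 0
    for value in values:
        bits = float_to_bits(value)
        encoded.append(bits ^ previous)
        previous = bits
    return encoded


def xor_decode(encoded):
    """Invert xor_encode by XOR-accumulating the words."""
    values, previous = [], 0
    for word in encoded:
        previous ^= word
        values.append(bits_to_float(previous))
    return values


samples = [21.5, 21.5, 21.5, 21.75, 21.75]
encoded = xor_encode(samples)
decoded = xor_decode(encoded)
```

This sketch stops at the XOR step; a full codec would follow it with a variable-length bit encoding so that a zero word costs a single bit.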