25 repositories
Techniques for structuring and summarizing raw data into compact formats to optimize storage and latency.
Distinguishing note: Focuses on schema-based summarization for AI memory, distinct from general file compression.
Explore 25 GitHub repositories matching Data & Databases · Data Compression.
Mem0 is an agent-agnostic memory layer designed to provide intelligent agents with long-term persistence and cross-session state management. By acting as a centralized service, it allows diverse AI agents to recall user preferences, past interactions, and historical context, ensuring continuity across multiple workflows and independent agent systems. The platform distinguishes itself through a multi-signal retrieval engine that combines semantic vectors, keyword matching, and entity-linked metadata to surface the most relevant information. It employs an adaptive memory engine that automatical
Summarizes and structures raw interaction data into compact, machine-readable formats to optimize storage efficiency and retrieval latency.
LevelDB is an embedded database library and persistent storage engine that provides a sorted key-value store. It uses a log-structured merge-tree architecture to map byte arrays to values, running directly within a process to provide storage without the need for a separate server process. The system is distinguished by its use of custom comparison functions to define key ordering, enabling efficient range scans and sequenced lookups. It ensures data reliability through atomic batch execution, consistent snapshot generation, and log-based recovery after failures. The engine covers broad capab
Compresses and decompresses data to balance processing performance with disk space reduction.
XGBoost is a distributed machine learning library for implementing scalable gradient boosting decision trees used for regression, classification, and ranking. It functions as a predictive model framework and a cross-language toolkit, providing a core implementation with native bindings for Python, R, Java, Scala, and C++. The system is designed as a GPU-accelerated library that utilizes CUDA and NCCL to speed up the training of decision tree ensembles. It operates as a distributed framework capable of scaling training and prediction across multi-node clusters and GPU environments to process m
Handles massive datasets by storing data in compressed on-disk blocks and loading them as needed.
This project is a reactive, offline-first NoSQL database engine designed for JavaScript applications. It provides a robust framework for managing application state by synchronizing data across browsers, mobile devices, and server-side runtimes. By treating local storage as the primary source of truth, it enables applications to remain functional without network connectivity, automatically reconciling changes with remote backends once a connection is restored. The database distinguishes itself through a modular architecture that supports cross-environment synchronization and high-performance d
Reduces storage footprint by automatically mapping long attribute names to shorter keys based on schema.
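As an illustration of schema-based key compression, here is a minimal sketch. It is not this project's actual implementation: the function names, the `|0`-style short keys, and the sample fields are all hypothetical, and only flat documents are handled.

```python
import json


def build_key_map(field_names):
    """Assign each schema field a short key; sorting keeps the mapping deterministic."""
    return {name: f"|{i:x}" for i, name in enumerate(sorted(field_names))}


def compress_doc(doc, key_map):
    """Replace long attribute names with their short keys before storage."""
    return {key_map.get(k, k): v for k, v in doc.items()}


def decompress_doc(doc, key_map):
    """Restore the original attribute names after reading from storage."""
    reverse = {short: name for name, short in key_map.items()}
    return {reverse.get(k, k): v for k, v in doc.items()}


# Hypothetical schema and document, for illustration only.
schema_fields = ["firstName", "lastName", "shoppingCartItems"]
key_map = build_key_map(schema_fields)

doc = {"firstName": "Ada", "lastName": "Lovelace", "shoppingCartItems": []}
stored = compress_doc(doc, key_map)
restored = decompress_doc(stored, key_map)

print(len(json.dumps(doc)), "->", len(json.dumps(stored)), "bytes as JSON")
```

Because the mapping is derived from the schema alone, it does not need to be stored with each document. A real implementation would also have to handle nested fields and field names that collide with the short keys.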
Forem is an open-source platform designed for building and managing technical communities. It functions as a social publishing engine that enables members to share long-form content, participate in threaded discussions, and engage through social interactions. The platform provides tools for organizations to maintain branded profiles, host community hackathons, and facilitate collaborative learning through structured educational tracks. Beyond its social features, Forem integrates advanced capabilities for AI agent workflow orchestration and codebase knowledge graphing. It allows developers to
Provides tools to trigger compression tasks on demand for better resource management and storage optimization.
TimescaleDB is an open-source PostgreSQL extension that adds native time-series capabilities to the database. At its core, it transforms standard PostgreSQL tables into hypertables—automatically partitioned by time intervals—so data is stored in fixed-size chunks without manual sharding. The extension includes a library of over 200 built-in SQL functions purpose-built for time-series workloads, such as time bucketing, gap filling, percentile estimation, and time-weighted averages. What distinguishes TimescaleDB from generic PostgreSQL is its set of integrated time-series features that work th
Compresses time-series data by converting row-oriented data to a columnar format with type-specific compression.
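The row-to-columnar idea can be sketched in a few lines. This is a simplified illustration and not TimescaleDB's code: it assumes delta encoding for integer timestamps and run-length encoding for repeated values as the "type-specific" schemes, and all function names and sample rows are hypothetical.

```python
def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def delta_encode(values):
    """Keep the first value, then store successive differences."""
    return values[:1] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


# Hypothetical sensor rows, for illustration only.
rows = [
    {"ts": 1000, "device": "a", "temp": 20},
    {"ts": 1010, "device": "a", "temp": 20},
    {"ts": 1020, "device": "a", "temp": 21},
]
columns = to_columns(rows)
ts_deltas = delta_encode(columns["ts"])             # regular intervals become small, repeated numbers
device_runs = run_length_encode(columns["device"])  # a constant column collapses to a single run
```

Grouping values by column places similar values next to each other, which is what lets a per-type encoding shrink them.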
Planning with files is an enterprise knowledge graph platform designed to transform unstructured organizational data into a searchable, interconnected network. By utilizing a graph-based retrieval-augmented generation engine, the system grounds language model outputs in verified internal data, ensuring that responses are explainable, traceable, and free from hallucinations. The platform distinguishes itself through a focus on data sovereignty and secure, private infrastructure deployment. It enables organizations to maintain full control over sensitive information by processing data locally o
Optimizes internal data formats to reduce computational overhead and lower the cost of processing large-scale organizational knowledge.
StockSharp is an algorithmic trading platform and quantitative framework used for developing and deploying trading robots across stock, forex, and cryptocurrency markets. It functions as a multi-asset trading gateway and a dedicated development environment for building, debugging, and scheduling automated strategies. The platform includes a visual strategy workflow editor that maps logic blocks to executable code and a simulation engine that replays historical tick data to validate trading logic. It utilizes a plugin-based broker integration system to normalize diverse exchange protocols into
Uses specialized compression formats to reduce disk usage and increase read speed for tick-level data.
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Utilizes AI-enabled compression to represent large-scale volumes more efficiently.
better-sqlite3 is a high-performance SQLite3 client for Node.js that executes queries synchronously, returning results directly without callbacks or promises. It compiles as a native addon using N-API, binding directly to the SQLite3 C library for immediate query execution and zero-copy result serialization into native JavaScript objects. The library is optimized for Write-Ahead Logging (WAL) mode, enabling faster concurrent reads and writes in web applications. It provides durability level tuning through the synchronous pragma, allowing adjustments between FULL, NORMAL, and OFF modes to bala
Processes queries efficiently on multi-gigabyte databases using proper indexing and joins.
Logos is a curated collection of optimized SVG logos for developer tools and brands, stored as individual SVG files in a flat directory structure. The collection is manually selected and optimized to ensure quality and consistency, with each logo served as a raw SVG file that browsers and tools can render natively. The collection supports direct file-system access through its flat directory storage, and includes a lightweight index of brand names and file paths for fast keyword-based logo lookup. Logos are delivered as static assets over HTTP, relying on standard web server caching for perfor
Applies SVG optimization to reduce file size while preserving visual fidelity of logos.
zlib is a lossless data compression library that implements the deflate compression algorithm, combining LZ77 sliding window and Huffman coding. It provides the core compression and decompression engines, along with support for gzip, zlib, and raw deflate stream formats, enabling data to be compressed and restored without any loss of information. The library offers a range of capabilities for handling compressed data, including single-call memory and file operations, as well as incremental stream-based processing for working with data larger than available memory. It includes mechanisms for a
Supports processing data exceeding 4 GB without loss or corruption.
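The incremental, stream-based mode described above can be sketched with Python's standard `zlib` bindings. The helper names and the sample data are illustrative assumptions; `compressobj`, `decompressobj`, and `flush` are the standard-library calls.

```python
import zlib


def compress_chunks(chunks, level=6):
    """Compress an iterable of byte chunks without holding the whole input in memory."""
    compressor = zlib.compressobj(level)
    for chunk in chunks:
        data = compressor.compress(chunk)
        if data:  # the compressor may buffer input and emit nothing yet
            yield data
    yield compressor.flush()  # emit whatever is still buffered


def decompress_chunks(chunks):
    """Decompress an iterable of compressed chunks incrementally."""
    decompressor = zlib.decompressobj()
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if data:
            yield data
    tail = decompressor.flush()
    if tail:
        yield tail


# Hypothetical input: ten chunks of repetitive sensor text.
source = [b"sensor-reading;" * 1000 for _ in range(10)]
compressed = list(compress_chunks(source))
restored = b"".join(decompress_chunks(compressed))
```

In practice the chunks would come from a file or socket read loop rather than a list, so only one chunk needs to be in memory at a time.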
Apache IoTDB is a time-series database designed for the Internet of Things, purpose-built to ingest high-volume data from millions of low-power devices and store timestamp-value pairs with configurable data types and encoding schemes. It organizes time series data and device metadata in a tree-like hierarchy, enabling efficient management of complex industrial sensor networks. The database supports rich querying capabilities, including time-aligned data retrieval across multiple devices, time-based aggregation like downsampling, and frequency-domain signal analysis. It provides high-throughpu
Compresses time series data with high-ratio algorithms to reduce hardware storage costs.
Reduces the storage and memory footprint of sparse 3D volumes like smoke and clouds using neural compression techniques.
CppGuide is a curated collection of educational resources and practical guides focused on C++ server development, Linux kernel internals, concurrent programming, network protocols, and security exploitation. It provides structured learning paths for backend developers, covering everything from interview preparation to building high-performance network servers and understanding operating system fundamentals. The guide distinguishes itself by offering in-depth, hands-on tutorials that walk through real-world implementations, including building a Redis-like server from scratch, designing custom
Compresses swapped-out pages in RAM before they reach disk to reduce memory pressure.
Applies SVGO compression to SVG logos on demand without modifying the stored originals.
Loro is a conflict-free replicated data type (CRDT) framework and collaborative state engine designed for building real-time collaborative applications. It provides a distributed data synchronizer that enables multiple users to edit shared documents and complex nested structures—such as maps, lists, trees, and counters—with automatic state convergence without requiring a central server. The project distinguishes itself through a versioned document store that supports branching, forking, and merging via a directed acyclic graph of causal operation history. It enables advanced version control c
Computes compact diffs between document versions by removing canceling operations to optimize network data transfer.
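A toy sketch of the idea of dropping canceling operations from a diff. This is not Loro's algorithm or API: the operation format (dicts with `type`, `id`, and `value`) is hypothetical, and the model assumes that an element both inserted and deleted within the same diff range has no net effect.

```python
def compact_ops(ops):
    """Drop insert/delete pairs that target the same element id."""
    inserted = {op["id"] for op in ops if op["type"] == "insert"}
    deleted = {op["id"] for op in ops if op["type"] == "delete"}
    cancelled = inserted & deleted  # created and removed within this diff
    return [op for op in ops if op["id"] not in cancelled]


# Hypothetical operation log between two document versions.
ops = [
    {"type": "insert", "id": "x1", "value": "H"},
    {"type": "insert", "id": "x2", "value": "i"},
    {"type": "delete", "id": "x2"},
    {"type": "insert", "id": "x3", "value": "!"},
]
compacted = compact_ops(ops)
```

A delete whose matching insert lies before the diff range is kept, because the receiving peer already holds that element and still needs to remove it.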
GluonTS is a framework for probabilistic time series forecasting, designed to predict future values as probability distributions with confidence intervals. It supports both traditional model training and zero-shot forecasting, in which pretrained models generate predictions for new series without additional training. The project distinguishes itself by integrating a wide variety of forecasting approaches into a unified workflow. These include deep learning architectures such as recurrent neural networks and causal convolutions, as well as integration with external statistical models, the Prophet library, and R packages. The toolkit provides a comprehensive surface for time series data engineering, covering dataset scaling, splitting, and the transformation of raw temporal data into tensors. It also includes a set of evaluation tools for measuring forecast accuracy and uncertainty intervals, as well as utilities for persisting datasets in formats such as Arrow and Parquet. The framework supports deploying forecasting models within cloud infrastructure.
Writes datasets to binary Arrow or Parquet files using configurable compression and array flattening.
OpenTSDB is a distributed time series database and metrics engine designed to store and manage massive volumes of high-cardinality system metrics. It functions as a data store and analytics platform that enables large-scale metric ingestion and infrastructure performance monitoring across a distributed cluster. The system is distinguished by a distributed storage abstraction that supports multiple backends such as HBase, Cassandra, and Google Bigtable. It uses a hierarchical metric tree to organize time series and employs numeric identifier indexing to reduce the storage footprint and speed up lookups of tagged metrics. The project covers broad capability areas, including time series data analysis with distributed percentile calculations and downsampling, as well as comprehensive metadata management. It provides API integration for data ingestion and querying, off-heap caching for performance optimization, and tools for data integrity auditing and anomaly analysis. The system is managed through a command-line interface for database administration and metric tree synchronization.
Merges multiple columns within a row into a single column to reduce the physical disk space usage.
m3 is a distributed time series database designed for high-resolution metrics and high-cardinality data management. It functions as a scalable storage system and multi-cluster query engine, providing a distributed metrics aggregator that can downsample and summarize data before it is committed to storage. The project distinguishes itself through a coordinated cluster model that uses etcd for node membership and shard placement. It supports multiple ingestion protocols, including the Prometheus remote write protocol, the InfluxDB line protocol, and the Graphite Carbon plaintext protocol, and provides compatible query interfaces for PromQL and Graphite. The system covers broad capability areas, including columnar time series storage, synchronous data replication, and distributed query fan-out. It incorporates data lifecycle automation, quorum-based consistency tuning, and tag-based series indexing to maintain data integrity and retrieval speed across isolated namespaces. Cluster orchestration and component placement are managed through automated tools and operators to ensure high availability and balanced data distribution.
Implements specialized compression algorithms and hybrid encoding to reduce the memory and disk footprint of time series.