49 repositories
Tools and techniques for restructuring and compressing data formats to improve storage efficiency and retrieval speed.
Distinguishing note: Focuses on structural decomposition of semi-structured data for storage optimization, distinct from general-purpose database engines.
Explore 49 awesome GitHub repositories matching data & databases · Data Storage Optimizers.
ClickHouse is a high-performance, columnar analytical database designed for real-time query execution and large-scale data aggregation. It functions as a distributed data warehouse capable of processing petabytes of information, while also providing an embedded engine that integrates directly into applications for native query capabilities without external dependencies. The system is built to handle high-throughput ingestion and complex analytical workloads, delivering millisecond-level latency for interactive dashboards and operational monitoring. The platform distinguishes itself through ad
Automatically decomposes semi-structured data into dense internal columns to optimize storage space and query performance.
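The decomposition idea in the summary above can be illustrated with a small sketch. This is not ClickHouse's implementation; the function names `decompose` and `reconstruct` and the sample `events` are hypothetical, and the sketch only handles flat records. Each key becomes its own column that stores just the values present, together with the row indexes they belong to.

```python
from collections import defaultdict


def decompose(records):
    """Split a list of flat dicts into one dense column per key.

    Each column keeps only the values that are present, plus the row
    indexes they came from, so a missing key costs nothing in that row.
    """
    columns = defaultdict(lambda: {"rows": [], "values": []})
    for row_index, record in enumerate(records):
        for key, value in record.items():
            columns[key]["rows"].append(row_index)
            columns[key]["values"].append(value)
    return dict(columns)


def reconstruct(columns, row_count):
    """Rebuild the original records from the columnar form."""
    records = [{} for _ in range(row_count)]
    for key, column in columns.items():
        for row_index, value in zip(column["rows"], column["values"]):
            records[row_index][key] = value
    return records


# Hypothetical sample data: the keys vary from record to record.
events = [
    {"user": "a", "latency_ms": 12},
    {"user": "b"},
    {"user": "a", "latency_ms": 7, "region": "eu"},
]
columns = decompose(events)
```

A query that only needs `latency_ms` can then read one short column instead of parsing every record.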
RocksDB is a high-performance, embeddable persistent key-value library and storage engine based on Log-Structured Merge-trees. It is designed to provide durable storage for large-scale datasets, integrating directly into applications to manage data on flash and RAM-based hardware. The engine is distinguished by its focus on minimizing read and write amplification through multi-threaded compaction and custom memory allocators. It features specialized optimizations for flash storage, including support for zoned block devices, and provides the ability to extend store behavior via external plugin
Optimizes remote storage access using asynchronous I/O and prefetching to reduce latency on network filesystems.
OpenZeppelin Contracts is a library of modular, secure, and reusable smart contract components designed for the development of decentralized applications. It provides a foundational framework for building standard-compliant contracts, offering battle-tested implementations for token standards, access control, and common utility patterns. The project distinguishes itself through its comprehensive support for complex architectural patterns, including proxy-based upgradeability, role-based access control, and account abstraction. It enables developers to implement modular logic injection via hoo
Writes data to specific storage slots and packs short strings to reduce gas costs and prevent storage conflicts.
This project is a reactive, offline-first NoSQL database engine designed for JavaScript applications. It provides a robust framework for managing application state by synchronizing data across browsers, mobile devices, and server-side runtimes. By treating local storage as the primary source of truth, it enables applications to remain functional without network connectivity, automatically reconciling changes with remote backends once a connection is restored. The database distinguishes itself through a modular architecture that supports cross-environment synchronization and high-performance d
Compresses data keys and values to minimize disk space and network bandwidth usage.
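As a rough illustration of key and value compression, the sketch below replaces long field names with short aliases and then compresses the serialized document with `zlib`. It is written in Python for brevity and is not the project's actual API; `build_key_table`, `pack`, `unpack`, and the sample documents are assumptions made for this example.

```python
import json
import zlib


def build_key_table(documents):
    """Assign a short hexadecimal alias to every distinct field name."""
    names = sorted({key for doc in documents for key in doc})
    return {name: format(index, "x") for index, name in enumerate(names)}


def pack(doc, key_table):
    """Swap field names for aliases, then compress the JSON bytes."""
    short = {key_table[key]: value for key, value in doc.items()}
    raw = json.dumps(short, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


def unpack(blob, key_table):
    """Decompress and restore the original field names."""
    reverse = {alias: name for name, alias in key_table.items()}
    short = json.loads(zlib.decompress(blob).decode("utf-8"))
    return {reverse[key]: value for key, value in short.items()}


# Hypothetical documents with verbose field names.
docs = [
    {"customerFirstName": "Ada", "customerLastName": "Lovelace"},
    {"customerFirstName": "Alan", "customerLoyaltyPoints": 42},
]
key_table = build_key_table(docs)
blobs = [pack(doc, key_table) for doc in docs]
```

The key table is stored once per collection, so the saving from aliases grows with the number of documents. For very small documents the `zlib` header can outweigh the gain, which is why real systems often make value compression optional.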
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Implements indexing strategies and storage engines tailored for transactional and analytical workloads.
This project serves as a comprehensive technical reference and educational platform for the Ethereum ecosystem. It provides a deep dive into the fundamental architecture of decentralized ledger systems, covering the core mechanisms that enable trustless state transitions, cryptographic security, and network consensus. The documentation distinguishes itself by bridging high-level conceptual frameworks with practical implementation details. It details the lifecycle of smart contract development, from source code compilation and bytecode analysis to deployment and interaction patterns. Furthermo
Pre-declares accessed addresses and storage slots in an access list to reduce the gas cost of complex smart contract interactions.
This project is a deep learning framework designed for constructing, training, and deploying neural networks across diverse hardware environments. It functions as a high-performance tensor computation library that provides both imperative and symbolic programming interfaces, allowing developers to balance flexible, step-by-step model building with the efficiency of compiled computation graphs. The framework distinguishes itself through a hybrid execution engine that integrates declarative graph compilation with imperative runtime logic. It supports scalable, distributed training across multip
Restructures and compresses data formats to improve storage efficiency and read performance for large datasets.
LangChain.js is a framework for building, executing, and monitoring stateful agentic applications. It provides an orchestration engine that models workflows as directed graphs, allowing developers to connect language models, data sources, and external tools into modular, multi-step processes. The platform distinguishes itself through its focus on stateful execution and human-in-the-loop control. It manages agent lifecycles by persisting execution state across threads, enabling fault tolerance and the ability to pause workflows at designated breakpoints for manual review or modification. This
Minimizes storage footprint for large data histories through checkpoint pruning and delta-based updates.
This project is a comprehensive framework for engineering financial data pipelines, designed to automate the collection, cleaning, and synchronization of large-scale market datasets. It functions as a quantitative trading data engine, providing the infrastructure necessary to manage historical and real-time asset pricing information for research and machine learning workflows. The system distinguishes itself through a configuration-driven approach to orchestration, allowing users to manage complex data acquisition tasks across multiple financial providers. It features resilient middleware tha
Optimizes storage performance by partitioning large datasets into time-based chunks.
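Time-based chunking can be sketched as follows. This is a simplified illustration rather than the project's code: the monthly granularity, the `ts` field name, and the helper names are all assumptions. Rows are grouped by calendar month so that a date-range query only has to open the chunks that overlap the range.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_month(rows):
    """Group rows into chunks keyed by 'YYYY-MM' of their timestamp."""
    chunks = defaultdict(list)
    for row in rows:
        timestamp = datetime.fromisoformat(row["ts"])
        chunks[timestamp.strftime("%Y-%m")].append(row)
    return dict(chunks)


def chunks_for_range(chunks, start_month, end_month):
    """Return the chunk keys that a query over [start, end] must read."""
    # 'YYYY-MM' strings sort in chronological order, so plain string
    # comparison is enough to select the overlapping chunks.
    return [key for key in sorted(chunks) if start_month <= key <= end_month]


# Hypothetical price rows.
rows = [
    {"ts": "2024-01-15T10:00:00", "close": 101.5},
    {"ts": "2024-01-16T10:00:00", "close": 102.0},
    {"ts": "2024-02-01T10:00:00", "close": 99.8},
    {"ts": "2024-04-03T10:00:00", "close": 104.2},
]
chunks = partition_by_month(rows)
needed = chunks_for_range(chunks, "2024-02", "2024-04")
```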
VictoriaMetrics is a high-performance, scalable time series database and observability platform designed for long-term storage and analysis of metric, log, and trace data. It functions as a unified backend for monitoring ecosystems, offering full compatibility with industry-standard protocols and query languages. The system is built to handle massive data volumes through a distributed architecture that supports horizontal scaling and efficient data lifecycle management. The platform distinguishes itself through a storage engine that utilizes consistent hashing for data sharding and log-struct
Reduces storage footprint and improves query performance by automatically deduplicating data and downsampling older metrics.
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture lets the system automate the distribution of workloads across the available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It incorporates memory-aware data spilling to avoid system failures when processing datasets that exceed available memory, and uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform provides a comprehensive capability surface for large-scale data analysis, including support for distributed machine learning, high-performance computing integration, and parallel data processing. It offers extensive tools for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments on diverse infrastructure, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Provides tools for optimizing data storage formats and compression schemes to improve performance across distributed datasets.
LoRA is a framework for parameter-efficient fine-tuning of large-scale neural networks. It functions by injecting trainable low-rank decomposition matrices into frozen model layers, allowing for task-specific adaptation while preserving the integrity of the original base model weights. The project distinguishes itself by enabling the direct merging of these trained low-rank matrices into primary model weights. This process eliminates additional computational overhead during inference, ensuring that adapted models maintain the same performance characteristics as the original architecture. Furt
Minimizes storage requirements by saving only small task-specific adaptation matrices instead of storing entire sets of original model weights.
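The storage saving follows from simple arithmetic. For a weight matrix of shape d × k, a rank-r adaptation stores two factors of shapes d × r and r × k, that is r · (d + k) values instead of d · k. The sketch below works through one case; the dimensions 4096 and rank 8 are illustrative assumptions, not figures from the project.

```python
def full_matrix_params(d, k):
    """Number of values in a dense d x k weight matrix."""
    return d * k


def low_rank_params(d, k, r):
    """Number of values in the two low-rank factors (d x r and r x k)."""
    return r * (d + k)


# Illustrative sizes only.
d, k, r = 4096, 4096, 8
full = full_matrix_params(d, k)      # 16,777,216 values
adapter = low_rank_params(d, k, r)   # 65,536 values
ratio = full // adapter              # 256 times fewer values to store
```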
The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane. The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It
Provides automated data partitioning and storage format optimization for improved query performance.
This PHP library provides tools for generating and validating universally unique identifiers according to RFC 4122 standards. It implements a generation tool for creating version 1, 3, 4, and 5 identifiers, as well as sequential and Nil UUIDs. The library features specialized capabilities for transforming identifiers between hexadecimal strings, binary bytes, integers, and date objects. It supports the generation of sequential identifiers to improve database indexing and storage performance, as well as deterministic name-based identifiers using MD5 or SHA-1 hashing. The project includes a va
Implements sequential identifier generation and binary encoding to optimize database indexing and reduce storage fragmentation.
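To show why sequential identifiers and binary encoding help, here is a minimal Python sketch. It does not use or reproduce the PHP library's API: `sequential_id` is a hypothetical helper that puts a 48-bit millisecond timestamp in front of 80 random bits, and it does not set RFC version bits. Identifiers created later sort later, which keeps index inserts near the end of a B-tree, and the 16-byte binary form is less than half the size of the 36-character string.

```python
import os
import uuid


def sequential_id(timestamp_ms):
    """Build a 128-bit identifier whose leading 6 bytes are a timestamp.

    Hypothetical layout for illustration: 48-bit big-endian milliseconds
    followed by 80 random bits. No RFC version or variant bits are set.
    """
    prefix = timestamp_ms.to_bytes(6, "big")
    return uuid.UUID(bytes=prefix + os.urandom(10))


earlier = sequential_id(1_700_000_000_000)
later = sequential_id(1_700_000_000_001)

text_form = str(earlier)       # 36 characters, e.g. for display
binary_form = earlier.bytes    # 16 bytes, e.g. for a BINARY(16) column
restored = uuid.UUID(bytes=binary_form)
```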
Kopia is a backup utility designed to create encrypted, deduplicated, and compressed snapshots of files and directories. It functions as a client-side tool that secures data locally before transmitting it to various storage targets, ensuring that sensitive information remains protected throughout the backup process. The system utilizes content-addressable block storage and metadata-driven versioning to identify and remove redundant data across multiple snapshots. By employing a pluggable storage abstraction layer, it supports a wide range of local, network, and cloud-based storage providers,
Optimizes storage efficiency by applying compression and deduplication before writing to disk.
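Content-addressable storage with deduplication and compression can be sketched in a few lines. This is a toy model under stated assumptions, not Kopia's implementation: `BlockStore` is a hypothetical class, blocks are split at a fixed size, and everything lives in memory. Each block is keyed by its SHA-256 digest, so identical blocks across snapshots are stored once, and each stored block is compressed with `zlib`.

```python
import hashlib
import zlib


class BlockStore:
    """Toy content-addressable store: digest -> compressed block."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}

    def put(self, data):
        """Store data and return its manifest (ordered list of digests)."""
        manifest = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # Only new content is compressed and written.
            if digest not in self.blocks:
                self.blocks[digest] = zlib.compress(block)
            manifest.append(digest)
        return manifest

    def get(self, manifest):
        """Reassemble the original bytes from a manifest."""
        return b"".join(zlib.decompress(self.blocks[d]) for d in manifest)


store = BlockStore(block_size=4)
first = store.put(b"aaaabbbbaaaa")   # three blocks, two of them identical
second = store.put(b"bbbbcccc")      # 'bbbb' is already stored
```

After both snapshots the store holds three unique blocks even though five were written. The tiny block size is only there to make the deduplication visible.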
Citus is a PostgreSQL extension that transforms a standard database into a distributed system. It functions as a sharding framework and distributed SQL engine, enabling horizontal scaling by partitioning tables across a cluster of nodes. By utilizing a coordinator-worker topology, the system manages metadata and routes queries to the appropriate nodes, allowing for parallel execution of complex operations across distributed data shards. The platform distinguishes itself through its specialized support for multi-tenant architectures and real-time analytical processing. It enables tenant-based
Incorporates semi-structured data formats into distributed tables to track complex metrics without requiring rigid schema changes for every new attribute.
ZFS is an enterprise-grade file system and logical volume manager that integrates storage pooling with advanced data protection. It functions as a storage engine that aggregates multiple physical devices into a unified resource pool, allowing for the dynamic allocation of capacity across individual file systems. The system utilizes a transactional, copy-on-write architecture that ensures file system consistency through intent logging and atomic operations. It maintains data integrity by organizing blocks into a hierarchical tree structure, where cryptographic checksums are used to detect and
Adjusts input and output scheduling and workload parameters to maximize data transfer speeds.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Analyzes usage patterns to identify and remove redundant datasets, optimizing cloud storage costs.
Peewee is a SQL object-relational mapper and query builder that provides an object-oriented interface for mapping application classes to relational database tables. It functions as a relational database toolkit for managing schemas, executing migrations, and handling complex table relationships. The project distinguishes itself by providing an asyncio database driver for non-blocking database operations, ensuring event loop responsiveness. It also supports semi-structured data storage, allowing the storage and querying of flexible JSON documents within traditional relational database systems.
Integrates flexible JSON and HStore data formats into relational tables for semi-structured storage.
Realm Java is a NoSQL mobile object database and reactive database engine. It provides a persistent local data store that saves native objects directly to disk, replacing traditional SQL storage and object-relational mapping layers. The system functions as a real-time data synchronizer, coordinating local database changes with a cloud backend across multiple devices. It integrates a reactive engine that uses change listeners and asynchronous event streams to automatically update user interfaces when underlying data changes. The project covers object-oriented data modeling, CRUD operations, a
Includes tools for reducing database file size through compaction and managing memory via frozen data snapshots.