23 repositories
Data structures that provide approximate answers to membership and frequency queries with high memory efficiency.
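As an illustration of the approximate membership queries this category describes, here is a minimal Bloom filter sketch in Python. It is not taken from any repository listed here; the class name `BloomFilter`, its parameters, and the SHA-256 double-hashing scheme are assumptions chosen for clarity. The sizing uses the standard formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(
            8, math.ceil(-expected_items * math.log(false_positive_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so steps vary
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

An item that was added always reports `True`; an item that was never added usually reports `False`, with occasional false positives at roughly the configured rate. The filter stores only bits, never the items themselves, which is where the memory saving comes from.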
Distinguishing note: This category covers memory-efficient approximate data structures such as Bloom filters.
Explore 23 awesome GitHub repositories matching data & databases · Probabilistic Data Structures.
Guava is a Java standard library extension and utility toolkit that provides optimized data structures, concurrency tools, and core extensions. It serves as a comprehensive set of helpers for Java development, focusing on reducing repetitive boilerplate logic. The project is distinguished by its specialized implementations of immutable collections, which ensure thread safety and data consistency by preventing accidental modification. It also includes a dedicated graph data structure library for modeling and traversing networks of interconnected nodes and edges, alongside advanced collection t
Implements Bloom filters for memory-efficient probabilistic membership checking.
Dragonfly is a high-performance, multi-model in-memory data store designed to serve as a drop-in replacement for existing database infrastructures. By utilizing a multi-threaded, shared-nothing architecture and a fiber-based concurrency model, it maximizes CPU utilization and minimizes latency for read and write operations. The system supports a wide range of data structures, including strings, hashes, lists, sets, sorted sets, and JSON documents, while maintaining full compatibility with standard industry wire protocols and client libraries. What distinguishes Dragonfly is its focus on effic
Dragonfly performs probabilistic membership testing using Bloom filters to efficiently determine whether an element is likely present in a large dataset without scanning all records.
Hutool is a comprehensive suite of Java extensions designed to serve as a standard library extension. Its primary purpose is to reduce development boilerplate for common programming tasks and data manipulation through a collection of utility classes. The project provides specialized toolkits for database management using active record patterns and connection pooling, as well as network communication via a simplified HTTP client and asynchronous socket management. It includes security and identity capabilities such as symmetric and asymmetric encryption, image captcha generation, and JWT token
Implements memory-efficient probabilistic structures like Bloom filters for fast membership verification.
This project is a comprehensive collection of common computer science algorithms and data structures implemented in Swift. It serves as an educational reference and library for studying computational complexity, algorithmic logic, and data structure engineering through practical code examples. The repository provides a wide suite of data structure implementations, including various types of linked lists, heaps, hash tables, and an extensive range of hierarchical trees such as Red-Black, B-Tree, and Splay trees. It also covers diverse sorting and searching techniques, from basic bubble sort to
A space-efficient method, such as a Bloom filter, for checking if an element is likely present in a set.
Redisson is a Java client library for Redis and Valkey that provides a distributed data structure library, a distributed lock manager, and a distributed MapReduce framework. It enables application instances in a cluster to share state through thread-safe collections and objects. The project implements a JCache compliant caching layer for standardized data storage and retrieval. It also functions as a probabilistic data store, providing memory-efficient structures such as Bloom filters and HyperLogLog for high-volume data membership testing. The library covers distributed state management usi
Implements memory-efficient probabilistic structures such as Bloom filters for membership testing and HyperLogLog for cardinality estimation on high-volume data.
Redisson is a Java library and Redis client that functions as a distributed Java object mapper, caching provider, and locking framework. It maps Java collections and concurrency primitives to distributed implementations backed by Redis and Valkey, providing synchronous, asynchronous, and reactive APIs for interacting with these data stores. The project distinguishes itself by providing a comprehensive suite of distributed coordination tools, including a locking framework for managing semaphores and countdown latches across multiple application nodes. It also serves as a distributed messaging
Uses Bloom filters, HyperLogLog, and cuckoo filters for memory-efficient membership and cardinality checks.
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Use probabilistic data structures to quickly determine if a key is absent from a storage segment to reduce unnecessary disk input-output during read operations.
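The idea of skipping storage segments that cannot contain a key can be sketched as follows. This is a hypothetical Python illustration, not code from the referenced project: `TinyBloom`, `Segment`, and `lookup` are invented names, a dictionary stands in for the on-disk file, and the `disk_reads` counter exists only to make the saved I/O visible.

```python
import hashlib


class TinyBloom:
    """Small fixed-size Bloom filter (sizes are arbitrary, for illustration)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Segment:
    """Hypothetical storage segment; a dict stands in for the file on disk."""

    def __init__(self, records: dict):
        self._on_disk = dict(records)
        self.filter = TinyBloom()
        self.disk_reads = 0
        for key in records:
            self.filter.add(key)

    def get(self, key: str):
        if not self.filter.might_contain(key):
            return None  # definitely absent: skip the disk read entirely
        self.disk_reads += 1  # possible hit (or false positive): read the segment
        return self._on_disk.get(key)


def lookup(segments, key: str):
    # Assumed order: segments are checked one after another, first match wins.
    for segment in segments:
        value = segment.get(key)
        if value is not None:
            return value
    return None
```

Because the filter never yields a false negative, skipping a segment is always safe. A false positive costs only one unnecessary read, so correctness does not depend on the filter's accuracy.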
This project is a feature-rich Go client library designed for interacting with Redis. It serves as a comprehensive interface for managing remote data stores, enabling developers to execute standard database commands, handle complex data structures, and perform asynchronous operations within Go applications. The library distinguishes itself through its support for advanced Redis capabilities, including connection pooling, pipelining, and transactional integrity. It provides specialized primitives for managing distributed clusters, including automated topology updates and request routing to sha
Provides memory-efficient data structures for approximate membership and frequency queries.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Filters records based on membership within specified lists or subqueries.
Instaloader is a Python library and command-line utility designed for the automated retrieval, archiving, and analysis of Instagram content. It provides a programmatic interface to fetch media, captions, and metadata from public or private profiles, hashtags, and stories, while maintaining persistent user sessions for authorized access. The tool distinguishes itself through robust archive management and traffic control mechanisms. It supports incremental synchronization, allowing users to resume interrupted downloads and update local collections without redundant requests. To ensure reliable
Downloads only the most recent post from each unique user by tracking processed creators.
Redis is a high-performance in-memory key-value store that functions as a distributed cache, message broker, and NoSQL database. It provides sub-millisecond read and write access to data stored in RAM and can operate as a vector database for indexing high-dimensional embeddings. The system supports a wide range of data storage and synchronization primitives, including the management of strings, hashes, lists, sets, and JSON documents. It enables real-time data operations through atomic transactions, hybrid persistence using snapshots and append-only logs, and high-availability configurations
Implements Bloom filters for memory-efficient membership tracking and behavioral profiling.
This project is a WeChat LLM bot framework and messaging gateway designed to connect WeChat accounts to language models for automated responses and group chat interactions. It functions as an orchestration layer that routes incoming messages to AI agents and returns generated responses to users. The system distinguishes itself through a provider-agnostic routing mechanism that distributes messages across various cloud-based and local language model services. It includes a command-line interface for managing login sessions, searching chat history, and sending messages, as well as a whitelist-b
Restricts AI response triggers to specific authorized users or group IDs via a predefined whitelist.
Boost is a collection of portable, high-performance source libraries that extend the C++ standard library. It provides a wide range of reusable components, data structures, and algorithms designed to add capabilities to the base language across different platforms. The project is distinguished by its extensive focus on compile-time template metaprogramming and generic programming. It implements advanced architectural patterns such as policy-based design, concept-based type validation, and the use of SFINAE for conditional template resolution to minimize runtime overhead. The library covers a
Implements Bloom filters and other probabilistic data structures for memory-efficient membership testing.
RedisInsight is a graphical user interface and management tool for browsing, analyzing, and administering Redis databases. It provides a visual environment for exploring key-value data structures, managing database instances, and performing data analysis across different operating systems and deployments. The tool distinguishes itself by providing dedicated visual managers for complex operations, including a vector database manager for configuring embeddings and similarity searches, a query workbench for executing raw commands and Lua scripts, and a performance monitoring dashboard for tracki
Provides visual inspection and management of memory-efficient probabilistic data structures like Bloom filters.
This project is a comprehensive collection of computer science implementations and an algorithm tutorial repository. It serves as a study guide and reference for competitive programming, providing executable code examples that demonstrate fundamental algorithmic problem solving and mathematical computation. The library covers a wide range of specialized domains, including cryptography and security primitives, lossless data compression techniques, and computational geometry for spatial analysis. It also features implementations of machine learning models, linear algebra operations, and formal
Implements a Bloom filter for space-efficient probabilistic membership testing.
Ristretto is a high-performance in-memory cache and concurrent key-value store for Go applications. It provides a thread-safe memory store that manages strict memory bounds and employs probabilistic set filters to reduce lookup overhead. The system is distinguished by an admission-policy cache that utilizes frequency sketches and cost-based eviction to maximize hit ratios. It minimizes contention and improves throughput through the use of striped ring buffers and concurrent map sharding. The project covers a broad range of data management capabilities, including time-based expiration, item f
Employs memory-efficient structures such as Bloom filters to check whether an item exists in a set without storing full keys.
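A frequency sketch feeding an admission decision, as mentioned in the description above, might look roughly like the following Count-Min sketch in Python. This is an assumption-laden illustration rather than Ristretto's actual code (which is written in Go): the names `CountMinSketch` and `should_admit`, the table dimensions, and the simple "admit if estimated more frequent than the eviction victim" rule are all chosen here for demonstration.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter: may overestimate, never underestimates."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _columns(self, item: str):
        # One independent-looking hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row, col in self._columns(item):
            self.rows[row][col] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.rows[row][col] for row, col in self._columns(item))


def should_admit(sketch: CountMinSketch, candidate: str, victim: str) -> bool:
    # Hypothetical admission rule: keep the item that appears to be used more often.
    return sketch.estimate(candidate) > sketch.estimate(victim)
```

The sketch uses a fixed amount of memory regardless of how many distinct keys pass through it, which is what makes frequency tracking affordable on a hot cache path.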
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters one observation at a time, eliminating the need to hold complete training datasets in memory. The library is distinguished by a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before using them for training. The system covers a wide range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning via incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for ingesting streaming data from sources such as CSV files and APIs, as well as tools for computing running statistics and memory-efficient data sketches.
Implements memory-efficient probabilistic structures to track statistics of high-volume data streams.
This is a collection of classical algorithms and data structures implemented as a header-only C++ library. It provides a suite of tools for general algorithm implementation, including data structure management, graph theory analysis, and string processing. The library is distinguished by its specialized toolkits for cryptographic hashing and encoding, featuring implementations of MD5, SHA-1, and Base64. It also includes advanced capabilities for high-performance string processing via suffix trees and arrays, as well as computational number theory for primality testing and arbitrary-precision
Provides Bloom filters for memory-efficient, probabilistic set membership testing.
Kvrocks is a disk-based NoSQL database and distributed key-value store that uses the RocksDB storage engine to persist large datasets on physical disk. It is designed to be Redis-compatible, using the standard Redis communication protocol to ensure interoperability with existing client libraries and tools. The project distinguishes itself by combining a persistent on-disk storage model with advanced retrieval capabilities, including vector search for k-nearest-neighbor queries, full-text search indexing, and geospatial query execution. It supports distributed clustering with slot-based data distribution and topology management to enable horizontal scaling and high availability. The system covers a wide range of data storage types, including JSON documents, streams, sorted sets, hash maps, and bitmaps. It provides comprehensive data management tools such as atomic transactions, log-based replication, and probabilistic data structures for cardinality estimation and membership checking. In addition, it includes server-side scripting, pub/sub messaging, and detailed monitoring of server health and storage engine performance.
Provides memory-efficient probabilistic filters to verify set membership with minimal false positives.
Kvrocks is a distributed key-value store and Redis-compatible NoSQL database. It uses a RocksDB storage engine to provide disk-based persistence, enabling high-capacity data storage at lower memory cost than in-memory systems. The system functions as a vector database and full-text search engine, supporting nearest-neighbor searches over vector embeddings and complex document queries via text matching. It employs a proxyless cluster architecture with slot-based routing to distribute data and scale capacity across multiple nodes. The platform covers a wide range of data management capabilities, including JSON document handling, time series data, and real-time stream processing. It provides advanced search and indexing through geospatial queries, secondary indexing, and query plan analysis, while offering probabilistic data sketching for memory-efficient estimation of cardinality and membership. Additional operational features include atomic transactions, pub/sub messaging, and namespace-based data isolation for multi-tenant environments.
Employs Bloom filters and HyperLogLog for memory-efficient cardinality estimation and membership testing.
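For readers unfamiliar with the cardinality estimation mentioned above, the following is a simplified HyperLogLog sketch in Python. It is an illustration under stated assumptions, not Kvrocks or Redis code: the class name `HyperLogLog`, the use of SHA-256 as the hash, and the default precision of 10 (1,024 registers) are choices made here, and the bias correction is the basic one from the original algorithm rather than the refinements production systems apply.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog: estimates the number of distinct items seen."""

    def __init__(self, precision: int = 10):
        self.p = precision
        self.m = 1 << precision  # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        width = 64 - self.p
        index = x >> width               # first p bits select a register
        rest = x & ((1 << width) - 1)    # remaining bits
        # Rank = 1-based position of the leftmost 1-bit within the remaining bits.
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With 1,024 registers the typical relative error is around 3%, and adding the same item repeatedly does not change the estimate, since each register only keeps a maximum.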