23 Repos
Data structures that provide approximate answers to membership and frequency queries with high memory efficiency.
Explore 23 GitHub repositories matching Data & Databases · Probabilistic Data Structures.
Guava is a Java standard library extension and utility toolkit that provides optimized data structures, concurrency tools, and core extensions. It serves as a comprehensive set of helpers for Java development, focusing on reducing repetitive boilerplate logic. The project is distinguished by its specialized implementations of immutable collections, which ensure thread safety and data consistency by preventing accidental modification. It also includes a dedicated graph data structure library for modeling and traversing networks of interconnected nodes and edges, alongside advanced collection t
Implements Bloom filters for memory-efficient probabilistic membership checking.
Dragonfly is a high-performance, multi-model in-memory data store designed to serve as a drop-in replacement for existing database infrastructures. By utilizing a multi-threaded, shared-nothing architecture and a fiber-based concurrency model, it maximizes CPU utilization and minimizes latency for read and write operations. The system supports a wide range of data structures, including strings, hashes, lists, sets, sorted sets, and JSON documents, while maintaining full compatibility with standard industry wire protocols and client libraries. What distinguishes Dragonfly is its focus on effic
Performs probabilistic membership testing with Bloom filters to determine whether an element is likely present in a large dataset without scanning all records.
Hutool is a comprehensive suite of Java extensions designed to serve as a standard library extension. Its primary purpose is to reduce development boilerplate for common programming tasks and data manipulation through a collection of utility classes. The project provides specialized toolkits for database management using active record patterns and connection pooling, as well as network communication via a simplified HTTP client and asynchronous socket management. It includes security and identity capabilities such as symmetric and asymmetric encryption, image captcha generation, and JWT token
Implements memory-efficient probabilistic structures like Bloom filters for fast membership verification.
This project is a comprehensive collection of common computer science algorithms and data structures implemented in Swift. It serves as an educational reference and library for studying computational complexity, algorithmic logic, and data structure engineering through practical code examples. The repository provides a wide suite of data structure implementations, including various types of linked lists, heaps, hash tables, and an extensive range of hierarchical trees such as Red-Black, B-Tree, and Splay trees. It also covers diverse sorting and searching techniques, from basic bubble sort to
Provides a space-efficient method, such as a Bloom filter, for checking whether an element is likely present in a set.
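To make the idea of a "likely present" check concrete, here is a minimal Bloom filter sketch in Python. It is not taken from any repository listed here; the class name `BloomFilter`, its parameters, and the choice of SHA-256 with double hashing are assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so the step is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

After `add("x")`, `might_contain("x")` always returns `True`; for items never added it usually returns `False`, with occasional false positives at roughly the configured rate.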
Redisson is a Java client library for Redis and Valkey that provides a distributed data structure library, a distributed lock manager, and a distributed MapReduce framework. It enables application instances in a cluster to share state through thread-safe collections and objects. The project implements a JCache compliant caching layer for standardized data storage and retrieval. It also functions as a probabilistic data store, providing memory-efficient structures such as Bloom filters and HyperLogLog for high-volume data membership testing. The library covers distributed state management usi
Implements memory-efficient probabilistic structures like Bloom filters and HyperLogLog for high-volume membership testing and cardinality estimation.
Redisson is a Java library and Redis client that functions as a distributed Java object mapper, caching provider, and locking framework. It maps Java collections and concurrency primitives to distributed implementations backed by Redis and Valkey, providing synchronous, asynchronous, and reactive APIs for interacting with these data stores. The project distinguishes itself by providing a comprehensive suite of distributed coordination tools, including a locking framework for managing semaphores and countdown latches across multiple application nodes. It also serves as a distributed messaging
Uses Bloom filters, HyperLogLog, and cuckoo filters for memory-efficient membership and cardinality checks.
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi
Explains how probabilistic data structures can quickly determine that a key is absent from a storage segment, avoiding unnecessary disk I/O during read operations.
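The read-path optimization described above can be sketched as follows. This is a simplified illustration, not code from the referenced project: the `Segment` class, the `lookup` helper, the fixed filter size, and the use of a Bloom filter as the probabilistic structure are all assumptions, and the "disk" is simulated with an in-memory dictionary.

```python
import hashlib


class Segment:
    """Hypothetical storage segment guarded by a small in-memory Bloom filter."""

    NUM_BITS = 1024
    NUM_HASHES = 3

    def __init__(self, records: dict):
        self._records = dict(records)  # stands in for data stored on disk
        self._bits = 0                 # filter bits packed into a single int
        self.disk_reads = 0            # counts simulated disk accesses
        for key in self._records:
            for pos in self._positions(key):
                self._bits |= 1 << pos

    def _positions(self, key: str):
        # Use separate 4-byte slices of one SHA-256 digest as the hash functions.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return [
            int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.NUM_BITS
            for i in range(self.NUM_HASHES)
        ]

    def get(self, key: str):
        # Any unset bit proves the key is absent, so the disk read is skipped.
        if any(((self._bits >> pos) & 1) == 0 for pos in self._positions(key)):
            return None
        self.disk_reads += 1  # the filter answered "maybe", so pay for the read
        return self._records.get(key)


def lookup(segments, key):
    """Check segments in order (for example newest first) and return the first hit."""
    for segment in segments:
        value = segment.get(key)
        if value is not None:
            return value
    return None
```

In this sketch a lookup for a missing key almost never touches the simulated disk, because the filter rules the segment out first; a false positive costs only one wasted read, never a wrong answer.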
This project is a feature-rich Go client library designed for interacting with Redis. It serves as a comprehensive interface for managing remote data stores, enabling developers to execute standard database commands, handle complex data structures, and perform asynchronous operations within Go applications. The library distinguishes itself through its support for advanced Redis capabilities, including connection pooling, pipelining, and transactional integrity. It provides specialized primitives for managing distributed clusters, including automated topology updates and request routing to sha
Provides memory-efficient data structures for approximate membership and frequency queries.
Presto is a distributed SQL query engine designed for high-performance analytical processing across heterogeneous data sources. It functions as a data federation platform and massively parallel processing engine, allowing users to execute interactive queries against diverse storage systems without requiring data migration. By mapping remote metadata and structures to a unified relational namespace, it enables seamless cross-platform analysis through a standard SQL interface. The engine distinguishes itself through a pluggable connector architecture and a shared-nothing distributed processing
Filters records based on membership within specified lists or subqueries.
Instaloader is a Python library and command-line utility designed for the automated retrieval, archiving, and analysis of Instagram content. It provides a programmatic interface to fetch media, captions, and metadata from public or private profiles, hashtags, and stories, while maintaining persistent user sessions for authorized access. The tool distinguishes itself through robust archive management and traffic control mechanisms. It supports incremental synchronization, allowing users to resume interrupted downloads and update local collections without redundant requests. To ensure reliable
Downloads only the most recent post from each unique user by tracking processed creators.
Redis is a high-performance in-memory key-value store that functions as a distributed cache, message broker, and NoSQL database. It provides sub-millisecond read and write access to data stored in RAM and can operate as a vector database for indexing high-dimensional embeddings. The system supports a wide range of data storage and synchronization primitives, including the management of strings, hashes, lists, sets, and JSON documents. It enables real-time data operations through atomic transactions, hybrid persistence using snapshots and append-only logs, and high-availability configurations
Implements Bloom filters for memory-efficient membership tracking and behavioral profiling.
This project is a WeChat LLM bot framework and messaging gateway designed to connect WeChat accounts to language models for automated responses and group chat interactions. It functions as an orchestration layer that routes incoming messages to AI agents and returns generated responses to users. The system distinguishes itself through a provider-agnostic routing mechanism that distributes messages across various cloud-based and local language model services. It includes a command-line interface for managing login sessions, searching chat history, and sending messages, as well as a whitelist-b
Restricts AI response triggers to specific authorized users or group IDs via a predefined whitelist.
Boost is a collection of portable, high-performance source libraries that extend the C++ standard library. It provides a wide range of reusable components, data structures, and algorithms designed to add capabilities to the base language across different platforms. The project is distinguished by its extensive focus on compile-time template metaprogramming and generic programming. It implements advanced architectural patterns such as policy-based design, concept-based type validation, and the use of SFINAE for conditional template resolution to minimize runtime overhead. The library covers a
Implements Bloom filters and other probabilistic data structures for memory-efficient membership testing.
RedisInsight is a graphical user interface and management tool for browsing, analyzing, and administering Redis databases. It provides a visual environment for exploring key-value data structures, managing database instances, and performing data analysis across different operating systems and deployments. The tool distinguishes itself by providing dedicated visual managers for complex operations, including a vector database manager for configuring embeddings and similarity searches, a query workbench for executing raw commands and Lua scripts, and a performance monitoring dashboard for tracki
Provides visual inspection and management of memory-efficient probabilistic data structures like Bloom filters.
This project is a comprehensive collection of computer science implementations and an algorithm tutorial repository. It serves as a study guide and reference for competitive programming, providing executable code examples that demonstrate fundamental algorithmic problem solving and mathematical computation. The library covers a wide range of specialized domains, including cryptography and security primitives, lossless data compression techniques, and computational geometry for spatial analysis. It also features implementations of machine learning models, linear algebra operations, and formal
Implements a Bloom filter for space-efficient probabilistic membership testing.
Ristretto is a high-performance in-memory cache and concurrent key-value store for Go applications. It provides a thread-safe memory store that manages strict memory bounds and employs probabilistic set filters to reduce lookup overhead. The system is distinguished by an admission-policy cache that utilizes frequency sketches and cost-based eviction to maximize hit ratios. It minimizes contention and improves throughput through the use of striped ring buffers and concurrent map sharding. The project covers a broad range of data management capabilities, including time-based expiration, item f
Employs memory-efficient structures like Bloom filters to check if an item exists in a set without storing full keys.
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters with each observation, which removes the need to hold complete training datasets in memory. The library is distinguished by a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before they are used for training. The system covers a broad range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for streaming data from sources such as CSV files and APIs, along with tools for computing running statistics and memory-efficient data sketches.
Implements memory-efficient probabilistic structures to track statistics of high-volume data streams.
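One common structure for tracking frequencies over a high-volume stream is a Count-Min Sketch. The sketch below is a generic illustration and does not use or reproduce River's API; the class name `CountMinSketch`, its default `width` and `depth`, and the use of salted BLAKE2b hashes are assumptions.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch: estimates never undercount, may overcount."""

    def __init__(self, width: int = 2048, depth: int = 4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # A per-row salt turns one hash function into `depth` different ones.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "big")
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions only inflate counters, so the minimum is the tightest estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

Memory use is fixed at `width * depth` counters regardless of how many distinct items the stream contains, which is the trade-off that makes the estimate approximate.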
This is a collection of classical algorithms and data structures implemented as a header-only C++ library. It provides a suite of tools for general algorithm implementation, including data structure management, graph theory analysis, and string processing. The library is distinguished by its specialized toolkits for cryptographic hashing and encoding, featuring implementations of MD5, SHA-1, and Base64. It also includes advanced capabilities for high-performance string processing via suffix trees and arrays, as well as computational number theory for primality testing and arbitrary-precision
Provides Bloom filters for memory-efficient, probabilistic set membership testing.
Kvrocks is a disk-based NoSQL database and distributed key-value store that leverages the RocksDB storage engine to persist large datasets to physical disk. It is designed to be a Redis-compatible database, utilizing the standard Redis communication protocol to ensure interoperability with existing client libraries and tools. The project distinguishes itself by combining a disk-persistent storage model with advanced retrieval capabilities, including vector search for k-nearest neighbor queries, full-text search indexing, and geospatial query execution. It supports distributed clustering with
Provides memory-efficient probabilistic filters to verify set membership with a low false-positive rate.
Kvrocks is a distributed key-value store and Redis-compatible NoSQL database. It utilizes a RocksDB storage engine to provide disk-based persistence, allowing for high-capacity data storage with reduced memory costs compared to in-memory systems. The system functions as a vector database and full-text search engine, supporting nearest-neighbor searches on vector embeddings and complex document queries via text matching. It employs a proxyless cluster architecture with slot-based routing to distribute data and scale capacity across multiple nodes. The platform covers a wide range of data mana
Employs Bloom filters and HyperLogLog for memory-efficient cardinality estimation and membership testing.
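For readers unfamiliar with HyperLogLog, the following is a compact sketch of how cardinality estimation with it can work. It is not Kvrocks's or Redis's implementation; the class name `HyperLogLog`, the default `precision`, and the SHA-256-based 64-bit hash are assumptions chosen to keep the example self-contained.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog cardinality estimator (precision should be >= 7)."""

    def __init__(self, precision: int = 12):
        self.p = precision
        self.m = 1 << precision                      # number of registers
        self.registers = bytearray(self.m)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
        index = h >> (64 - self.p)                   # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

With `precision=12` the estimator keeps 4096 one-byte registers and has a typical relative error of about 1.6 percent; adding the same item repeatedly does not change the estimate.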