27 repositories
Algorithms for identifying unusual patterns in data streams automatically.
Distinguishing note: Focuses on automated statistical detection rather than manual threshold monitoring.
Explore 27 awesome GitHub repositories matching data & databases · Anomaly Detection.
PostHog is a comprehensive product analytics and feature management platform designed to capture, process, and visualize user behavior data. It provides a unified suite for tracking application events, managing feature rollouts, and monitoring system health through session recordings and error tracking. By leveraging a columnar-storage-optimized architecture, the platform enables high-performance aggregation and filtering across massive event datasets. What distinguishes PostHog is its integrated approach to data pipelines and application control. It features a robust event ingestion system t
Identifies unusual data patterns using statistical algorithms to trigger alerts without manual threshold configuration.
FinceptTerminal is a quantitative finance platform and financial engineering library designed for asset valuation, risk management, and fixed-income analytics. It provides a comprehensive suite for algorithmic trading and investment strategy automation, integrating specialized language model agents and node-based workflows to automate market research and alpha generation. The project distinguishes itself with a dedicated game theory analysis engine for calculating Nash equilibria and simulating strategic interactions in competitive markets. It also features a specialized credit risk modeling
Identifies outliers in feature matrices to perform fraud detection and ensure financial data quality.
NetworkX is a Python library designed for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. It provides a comprehensive framework for modeling relationships between entities as graphs, directed graphs, or multigraphs, allowing users to attach arbitrary metadata and properties to nodes and edges. The library distinguishes itself through a modular architecture that decouples graph analysis logic from data storage, utilizing nested dictionaries and adjacency lists to manage topology. It features a pluggable backend system that delegates computat
Detects clusters within a graph using spectral bipartitioning or greedy node-swapping algorithms to reveal underlying structural groupings.
This project is a comprehensive framework for engineering financial data pipelines, designed to automate the collection, cleaning, and synchronization of large-scale market datasets. It functions as a quantitative trading data engine, providing the infrastructure necessary to manage historical and real-time asset pricing information for research and machine learning workflows. The system distinguishes itself through a configuration-driven approach to orchestration, allowing users to manage complex data acquisition tasks across multiple financial providers. It features resilient middleware tha
Identifies statistical outliers and irregularities in time series data to maintain high-fidelity inputs.
Neo4j is a native graph database management system designed to store and query highly connected data using a property-graph model. It provides an ACID-compliant transaction engine that ensures data integrity, supported by a distributed cluster architecture that maintains causal consistency across nodes. Users interact with the system through a declarative query language, which allows for complex pattern matching and path traversal without requiring manual traversal logic. The platform distinguishes itself through its hybrid approach to data retrieval, combining traditional graph-based queries
Executes advanced graph algorithms for centrality, pathfinding, and community detection on connected datasets.
Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin
Discovers groups of related nodes within graphs using community detection algorithms or path exploration.
This project is an LLM knowledge base builder and personal knowledge management tool. It is a desktop application designed to transform diverse documents into a persistent, interlinked wiki through LLM analysis and incremental ingestion. The system distinguishes itself with a knowledge graph visualizer that uses community detection algorithms to map relationships between concepts and identify topical clusters. It features a hybrid retrieval system that combines keyword matching, vector embeddings, and graph relevance to locate information. The platform covers a wide range of capabilities inc
Uses modularity-based community detection algorithms to automatically discover and group related knowledge clusters.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
Monitors data patterns using artificial intelligence to automatically configure and adapt quality checks as underlying data structures change.
Cleanlab is a data-centric AI library and toolkit designed to improve machine learning model performance by detecting label errors and increasing overall dataset quality. It implements a confident learning framework that iteratively refines label noise estimates by comparing model predictions with estimated label probabilities to identify mislabeled examples. The project provides specialized utilities for active learning optimization, allowing for the selection of the most impactful examples for labeling or re-labeling. It also includes an outlier detection tool to identify atypical data poin
Finds atypical data points that fall outside the expected distribution to remove or investigate anomalies.
sktime is a machine learning framework designed for time series analysis. It provides a unified interface for performing time series forecasting, classification, and anomaly detection, integrating these capabilities into a standardized toolkit compatible with the scikit-learn API. The framework allows for the construction of complex analysis workflows through model pipelining and ensemble-based aggregation. It uses adapter-based integration to wrap external time series libraries, providing a single entry point for diverse algorithmic implementations. Its capabilities cover temporal data tran
Identifies unusual data points or significant shifts in the underlying properties of temporal sequences.
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Implements graph neural networks to identify anomalous user and device behavior patterns for fraud detection.
This is a Python machine learning library featuring a collection of core algorithms implemented from scratch to demonstrate foundational AI concepts. It provides a comprehensive toolkit for supervised learning, unsupervised learning, and neural network development. The project is distinguished by its custom implementation of a neural network framework, which includes multi-layer perceptrons with backpropagation, gradient descent, and weight regularization. It also includes a specialized anomaly detection toolkit that identifies outliers and rare events using Gaussian probability distributions
Provides statistical anomaly detection for identifying outliers and rare events in datasets.
This repository is a collection of implementation references and solved notebooks covering supervised, unsupervised, and reinforcement learning techniques. It provides practical guides for building predictive models, clustering algorithms, and autonomous agents. The project includes specific implementations for neural network architectures, such as multi-layer perceptrons for digit recognition, and recommender systems using collaborative and content-based filtering. It also features reinforcement learning systems that utilize deep Q-learning to optimize decision-making policies. The codebase
Implements Gaussian anomaly detection to identify outliers by modeling normal data distributions.
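As a rough illustration of the Gaussian approach named above (not code taken from the repository), the sketch below fits a per-feature mean and variance to data assumed to be normal, then flags points whose estimated density falls below a cutoff. The sample data, the `EPSILON` value, and all function names are hypothetical, and features are assumed to be independent.

```python
import math

def fit_gaussian(rows):
    """Estimate per-feature mean and variance from rows of 'normal' data."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    variances = [
        sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)
    ]
    return means, variances

def density(x, means, variances):
    """Product of independent univariate Gaussian densities."""
    p = 1.0
    for xj, mu, var in zip(x, means, variances):
        p *= math.exp(-((xj - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
    return p

def is_anomaly(x, means, variances, epsilon):
    """A point is flagged when its density is below the cutoff epsilon."""
    return density(x, means, variances) < epsilon

# Hypothetical two-feature training set of normal observations.
normal = [[10.0, 5.0], [10.5, 5.2], [9.5, 4.8], [10.2, 5.1], [9.8, 4.9]]
means, variances = fit_gaussian(normal)

EPSILON = 1e-3  # assumed cutoff; in practice tuned on labeled validation data

print(is_anomaly([10.0, 5.0], means, variances, EPSILON))  # typical point
print(is_anomaly([14.0, 5.0], means, variances, EPSILON))  # far from the mean
```

The cutoff is the only tunable here; a common practice is to pick it by maximizing a metric such as F1 on a small labeled validation set.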
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Identifies rare or suspicious data points that deviate significantly from the majority using anomaly detection algorithms.
Kats is a framework and library for time series analysis that provides tools for statistical characterization, anomaly detection, and trend forecasting. It works as a toolkit for predicting future values from historical data and for identifying irregular patterns or structural change points in temporal sequences. The project includes a temporal feature extraction tool that computes descriptive statistics and features summarizing time series behavior. It also provides a system for tuning model hyperparameters using self-supervised learning to improve the scale and generalization of predictions.
Identifies irregular patterns or significant shifts in data to flag outliers and structural breaks.
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters with each observation, eliminating the need to hold complete training datasets in memory. The library distinguishes itself through a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also offers a progressive validation framework that simulates real-time deployment by testing models on samples before using them for training. The system covers a wide range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for ingesting streaming data from sources such as CSV files and APIs, as well as tools for computing rolling statistics and memory-efficient data sketches.
Identifies unusual observations in live data streams by scoring samples based on evolving distributions.
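To make "scoring samples based on evolving distributions" concrete, here is a minimal plain-Python sketch that does not use the library's API. It keeps a running mean and variance with Welford's update and scores each new observation by its absolute z-score before learning from it. The class name `RunningZScore` and the sample stream are hypothetical.

```python
import math

class RunningZScore:
    """Online scorer: absolute z-score against a running mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def score(self, x):
        """Score x against what has been seen so far; 0.0 until 2 samples exist."""
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) / std if std > 0 else 0.0

    def update(self, x):
        """Fold x into the running statistics in O(1) memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

detector = RunningZScore()
stream = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0, 10.1]  # hypothetical readings

scores = []
for x in stream:
    scores.append(detector.score(x))  # score first...
    detector.update(x)                # ...then learn from the observation

for x, s in zip(stream, scores):
    print(f"{x:6.1f}  score={s:.2f}")
```

Scoring before updating mirrors the test-then-train order described above: each sample is judged only against data that arrived earlier.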
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Defines dataclasses holding input images, ground truth, masks, and predictions for each sample.
Arroyo is a high-performance stream processing platform built in Rust. It executes continuous SQL queries on streaming data with event-time semantics, enabling accurate windowed aggregations, joins, and stateful computations on unbounded event streams. The platform uses native Rust execution for high throughput and low latency, with periodic checkpointing for exactly-once fault tolerance and horizontal scaling across distributed workers. The system integrates deeply with Kafka for reading and writing topics with exactly-once delivery and supports change data capture (CDC) from MySQL and Postg
Groups streaming data by key and time window, counts events, and filters for thresholds to flag suspicious activity.
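The group, count, and filter pattern described above can be sketched in plain Python over a finite list of events. This is only an illustration of the logic, not the platform's SQL or runtime: it uses fixed (tumbling) windows and ignores late data, watermarks, and checkpointing. The function name `flag_bursts`, the event tuples, and the threshold are hypothetical.

```python
from collections import Counter

def flag_bursts(events, window_seconds, threshold):
    """Count events per (key, tumbling window) and keep counts above threshold.

    events: iterable of (key, timestamp_in_seconds) pairs.
    Returns a sorted list of (key, window_start, count) tuples.
    """
    counts = Counter()
    for key, ts in events:
        # Align each event to the start of its fixed-size window.
        window_start = int(ts // window_seconds) * window_seconds
        counts[(key, window_start)] += 1
    return sorted(
        (key, start, count)
        for (key, start), count in counts.items()
        if count > threshold
    )

# Hypothetical login events: (user, event time in seconds).
events = [
    ("alice", 1), ("alice", 5), ("bob", 7), ("alice", 20),
    ("alice", 42), ("alice", 61), ("bob", 65),
]

flagged = flag_bursts(events, window_seconds=60, threshold=3)
print(flagged)  # [('alice', 0, 4)]
```

A streaming engine would emit each window's result once the window closes rather than after all input is read, but the grouping key, window alignment, and threshold filter are the same.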
Tiny Universe is an educational monorepo that delivers multiple independent implementations of core AI subsystems as self-contained Jupyter notebooks. It provides from-scratch constructions of foundational architectures including a complete Transformer model built from the original paper specification, a denoising diffusion probabilistic model for image generation, and a ReAct-style autonomous agent framework that equips an LLM with tools for planning and multi-step task execution. The project distinguishes itself by covering the full lifecycle of modern AI systems through hands-on implementa
Partitions knowledge graphs into nested communities and generates LLM summaries for each level.
Memgraph is an in-memory, distributed graph database designed for high-performance labeled property graph management. It utilizes a Cypher query engine for declarative data retrieval and manipulation, providing a scalable knowledge graph backend that integrates vector search and graph traversals. The system distinguishes itself as a real-time graph analytics platform, employing native C++ and CUDA implementations to execute complex network analysis and dynamic community detection on streaming data. It provides specialized support for AI integration, including GraphRAG capabilities, the constr
Identifies clusters of related nodes in real-time using the LabelRankT algorithm.