30 open-source projects similar to stumpy-dev/stumpy, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best Stumpy alternative.
Nixtla is a time series analysis platform centered on a transformer-based foundation model. It provides zero-shot inference for forecasting and anomaly detection, allowing the system to predict future values for new time series without requiring model retraining. The project is designed for large-scale analysis, using distributed inference scaling and forecast parallelization to process millions of data series. It supports fine-tuning adaptation to adjust pretrained weights for domain-specific datasets and offers deployment options ranging from local execution and private containers to integr
cuDF is a GPU-accelerated dataframe library and data processing engine designed for manipulating and analyzing large tabular datasets. It provides a high-level API for executing filtering, joining, and aggregating operations directly on GPU hardware. The project integrates the Apache Arrow memory format to enable zero-copy data transfers and includes a just-in-time compiler for executing custom user-defined functions on the GPU. The library features specialized acceleration for existing workflows by redirecting standard Pandas dataframe calls and Polars query plans to a GPU backend. It also p
cuML is a GPU-accelerated machine learning library and framework that uses CUDA to accelerate tabular data preprocessing and model execution. It provides a suite of tools for training and deploying classification, regression, and clustering models on NVIDIA GPUs and GPU clusters. The library is designed for scalability, offering a distributed GPU machine learning environment that can spread computation and data across multiple hardware accelerators and nodes to handle datasets exceeding single-device memory. It mirrors standard estimator interfaces to allow the replacement of CPU-based models
Darts is a Python time series library designed for forecasting, anomaly detection, and the preprocessing of univariate and multivariate temporal data. It serves as a comprehensive framework for training and evaluating a wide range of statistical, machine learning, and deep learning models to predict future numerical values. The toolkit is distinguished by its support for global time series modeling, allowing a single model to be trained across multiple different series to leverage shared patterns. It also features a hierarchical time series manager to ensure consistency between aggregate and
statsforecast is a high-performance statistical time series forecasting library designed to generate point forecasts and prediction intervals. It functions as a distributed time series framework that utilizes a C-based forecasting engine and an automated model selector to identify and fit the optimal statistical model for every unique series in a dataset. The system also includes a time series anomaly detector to identify unusual data points by comparing observed values against probabilistic forecast intervals. The project is distinguished by its ability to handle massive-scale parallel forec
sktime is a machine learning framework for time series analysis. It provides a unified toolkit for implementing time series classification, forecasting, and anomaly detection using standardized machine learning interfaces. The library serves as a collection of tools for assigning categorical labels to temporal sequences, predicting future values based on historical patterns, and identifying outliers or unusual patterns within temporal data. The framework includes capabilities for panel-data handling and pipeline-based transformations. It utilizes a unified API wrapper and plugin-based model
Merlion is a time series machine learning framework designed for anomaly detection and forecasting. It provides a unified interface for implementing and applying various statistical and machine learning models to temporal data streams. The project includes a benchmarking dashboard that allows visual testing and evaluation of models against historical ground truth datasets. This web interface enables experimentation with different models on custom datasets without manual coding. The framework covers capabilities for identifying outliers, predicting future time series values, and mea
NuPIC is a machine learning framework that implements Hierarchical Temporal Memory (HTM) theory, a neuroscience-inspired approach to artificial intelligence. It models principles of the neocortex to build systems capable of learning patterns from streaming data, performing sequence prediction, and detecting anomalies in real-time data streams. The framework is built around a Cortical Learning Algorithm that combines spatial pooling and temporal memory to process streaming input. It uses Sparse Distributed Representations to encode input patterns, a Spatial Pooler to convert dense input into s
This PyTorch-based deep learning library provides a framework for analyzing and forecasting temporal data. It implements specialized architectures for time series forecasting, anomaly detection, data imputation, and classification. The project distinguishes itself through the inclusion of zero-shot inference capabilities, allowing large-scale temporal models to be evaluated on unseen datasets without requiring task-specific fine-tuning. The framework covers a broad range of analytical capabilities, including the recovery of missing values in incomplete datasets, the identification of irregul
TimesFM is a time series foundation model designed to generalize across diverse temporal datasets for forecasting and anomaly detection. It functions as a pretrained model for predicting future values in univariate time series data, eliminating the need for manual training from scratch. The project includes a framework for adapting pretrained weights to specific datasets using low-rank adaptation to improve accuracy. It also provides specialized capabilities for integrating time-series predictions as tools within autonomous AI agent architectures and automated workflows. The system supports
Kats is a time series analysis framework and library providing tools for statistical characterization, anomaly detection, and trend forecasting. It functions as a toolkit for predicting future values based on historical data and identifying irregular patterns or structural change points within temporal sequences. The project includes a temporal feature extraction tool to calculate descriptive statistics and characteristics that summarize time series behavior. It also provides a system for model hyperparameter tuning using self-supervised learning to improve the scale and generalization of pre
This repository provides a curated collection of self-contained Python code examples that demonstrate the core capabilities of the PyTorch deep learning framework. The examples cover automatic differentiation, dynamic computational graphs, GPU‑accelerated tensor operations, and training of neural network models using gradient‑based optimization. The code samples illustrate PyTorch’s dynamic graph construction, where models can change structure with native control flow, and its automatic gradient computation through reverse‑mode differentiation. Additional examples show how to work with tensor
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabl
Ignite is a distributed in-memory data grid and compute platform. It functions as a distributed SQL database and storage engine designed to store and process large datasets in RAM to minimize latency and increase calculation speed. The system is distinguished by a multi-tier storage engine that manages data placement across memory and disk to balance high-speed access with large capacity. It features a distributed compute grid that executes custom logic directly on the nodes where data resides to reduce network traffic. The platform provides a broad set of capabilities including ACID transac
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
PaddleX is a PaddlePaddle-based framework for building, deploying, and fine-tuning AI model pipelines, with pre-built support for computer vision, OCR, document analysis, and time series tasks. It offers a toolkit of ready-to-use pipelines for image classification, object detection, segmentation, and pose estimation, alongside an end-to-end OCR document analysis pipeline that extracts text, tables, formulas, and layout information. The platform also includes a dedicated time series forecasting pipeline for analyzing historical data to detect anomalies, classify patterns, and predict future val
This project is an educational resource and technical manual for Apache Spark, focused on the architecture and practical application of large-scale data processing. It serves as a guide for big data engineering and distributed computing, covering the principles of parallel processing and fault-tolerant data distribution. The material provides instructional content on designing distributed ETL pipelines and implementing data analysis workflows. It includes tutorials for polyglot data processing, offering patterns and examples for using Python, Scala, and Java within a unified environment. The
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Thrust is a heterogeneous computing library and C++ template library that provides a collection of high-level templates for executing data-parallel operations. It functions as a parallel algorithms library designed to work across different hardware backends, including multicore CPUs and NVIDIA GPU hardware. The framework utilizes a header-only implementation and a generic-programming policy interface to abstract the differences between CPU and GPU memory and execution models. It employs an iterator-based data abstraction to provide a uniform interface for accessing elements across host RAM an
Hazelcast is a distributed data platform that combines an in-memory data grid with a stream processing engine to support real-time analytics and event-driven applications. It functions as a partitioned, distributed key-value store that replicates data across cluster nodes to provide low-latency access and high availability. The platform also serves as a distributed SQL query engine, allowing users to execute standard SQL statements against both in-memory datasets and external data sources. What distinguishes Hazelcast is its use of a distributed consensus subsystem to maintain strongly consis
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Storm is a distributed stream processing framework and fault-tolerant compute engine designed for executing real-time continuous computations across a cluster of machines. It functions as a stateful stream processor and cluster topology manager, enabling the deployment and monitoring of distributed data flow configurations. The system ensures exactly-once semantics by utilizing transactional state management to guarantee that every message in a data stream is processed exactly one time. It further operates as a distributed RPC system, allowing for the integration of non-native languages throu
MMLSpark is a distributed framework for executing machine learning models, data transformations, and AI service integrations across Apache Spark clusters. It functions as a distributed machine learning library and pipeline orchestrator, allowing users to integrate pre-trained cognitive services and custom models into large-scale batch and streaming workflows. The project is distinguished by its ability to incorporate external AI services and web APIs directly into big data pipelines for text and vision analysis. It provides a scalable model training framework that coordinates gradient boostin
SynapseML is an Apache Spark machine learning library designed for building and scaling machine learning workflows and data pipelines across distributed clusters. It serves as a distributed machine learning pipeline framework and a distributed inference engine for executing hardware-accelerated predictions and deep learning tasks on large-scale datasets. The project functions as a cloud AI integration layer, allowing users to apply pretrained artificial intelligence services for text, vision, and speech within distributed pipelines. It also includes a dedicated suite of tools for distributed
SparkInternals is a technical reference and architecture guide detailing the internal design and implementation of the Apache Spark distributed computing engine. It serves as a study of the engine's internals, focusing on how the system manages cluster execution and the interaction between driver nodes, executors, and workers. The project provides a detailed breakdown of how logical plans are converted into physical execution stages. It specifically analyzes the mechanics of data shuffle operations, memory management, and the coordination of distributed job scheduling. The documentation co
Daft is a distributed dataframe library and multimodal data processor designed to handle large-scale structured and unstructured data. It functions as a vectorized execution engine that processes tables alongside images, audio, and video, utilizing a unified schema to manage diverse data types. The project distinguishes itself by combining distributed data engineering with large-scale AI inference. It provides an AI data pipeline for batch-optimizing model prompts and generating high-dimensional text embeddings, while utilizing zero-copy memory sharing to execute custom Python functions witho
This project is a curated directory of software, frameworks, and educational resources designed for building, scaling, and maintaining distributed data processing and storage architectures. It serves as a comprehensive index for the distributed computing ecosystem, helping users identify the appropriate tools for managing large-scale information systems. The repository functions as a central hub for data engineering, offering categorized access to technologies that support batch and stream processing, machine learning, and interactive querying. By organizing these resources, it assists in the
This project serves as a comprehensive technical reference for the architecture and design of data-intensive applications. It provides a structured analysis of the fundamental principles required to build reliable, scalable, and maintainable software systems, covering the core trade-offs inherent in modern data infrastructure. The repository explores the mechanics of distributed data management, including strategies for replication, partitioning, and achieving consensus across multiple nodes. It details the design of storage engines, indexing techniques, and transaction management models, whi