40 repositories
Systems designed to execute complex data analysis and computation graphs across distributed clusters.
Distinct from Data Processing and Analysis: Focuses on the general domain of large-scale distributed computation rather than specific ML training or image processing.
40 GitHub repositories matching data & databases · Large-Scale Data Computation.
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Executes complex computation graphs across distributed clusters to process massive datasets.
Gensim is an unsupervised natural language processing toolkit designed for topic modeling, word embedding training, and the processing of large-scale text corpora. It provides a framework for discovering latent themes and semantic structures in text without the need for labeled data. The toolkit is distinguished by its ability to handle datasets that exceed system memory through iterator-based data streaming from disk. It also supports distributed model training, allowing complex modeling tasks to be executed across computer clusters. The library covers a broad range of analysis capabilities
Processes massive text datasets that exceed system memory through distributed computation and streaming.
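The iterator-based streaming idea in the Gensim entry can be illustrated with a minimal standard-library sketch. It does not use Gensim's own API; the `StreamedCorpus` class, the `count_tokens` helper, and the one-document-per-line file layout are assumptions made for illustration only.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Hypothetical corpus wrapper: yields one tokenized document per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on each pass so the corpus can be iterated repeatedly
        # without ever holding more than one line in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


def count_tokens(corpus):
    """Single pass over the stream; only the counter lives in memory."""
    counts = Counter()
    for tokens in corpus:
        counts.update(tokens)
    return counts


# Tiny demo file standing in for a corpus too large to load at once.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Spark runs on clusters\n\nHadoop stores data on clusters\n")
    demo_path = tmp.name

corpus = StreamedCorpus(demo_path)
doc_count = sum(1 for _ in corpus)   # first pass over the file
token_counts = count_tokens(corpus)  # second pass re-reads from disk
os.remove(demo_path)
```

Because `__iter__` reopens the file each time, multi-pass algorithms can consume the same corpus object repeatedly while memory use stays bounded by the size of one document plus the model state.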
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Implements a distributed framework for executing complex data analysis and computation across large clusters.
Apache Druid is a real-time analytics database and distributed columnar time-series store designed for sub-second analytical queries. It functions as a data platform featuring a distributed SQL query engine and a real-time data ingestion system for moving historical and streaming data from external sources. The system is distinguished by its ability to provide low-latency analytics under high concurrency to power operational dashboards. It implements a Kerberos-secured environment for user authentication and employs a shared-nothing cluster architecture to enable horizontal scaling. The plat
Executes complex data analysis and multi-stage SQL transformations across distributed clusters for massive datasets.
Azure Docs is the official technical documentation repository for Microsoft Azure, the cloud computing platform. It provides comprehensive guidance on the full spectrum of Azure services, covering everything from core infrastructure components like virtual machines, Kubernetes clusters, and serverless computing to platform services for AI, machine learning, data analytics, and storage. The documentation details how to provision, manage, and govern cloud resources at scale, including policy enforcement, identity management, and cost optimization. The documentation distinguishes Azure through i
Documents Azure Batch for scheduling and executing large-scale parallel workloads on managed clusters.
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Processes datasets that exceed system memory using distributed execution engines and out-of-core computation.
Boto3 is the AWS SDK for Python, providing a programmatic interface for managing and automating AWS cloud infrastructure and services. It serves as a cloud management API client and resource manager for provisioning, configuring, and scaling virtual servers, databases, and storage. The library enables the implementation of infrastructure-as-code through declarative templates and scripts, allowing for the deployment of identical resource stacks across multiple accounts and geographic regions. It also provides a framework for coordinating distributed workflows, serverless functions, and contain
Processes large-scale data analytics using Apache Spark code in a managed distributed environment.
tsfresh is an automated feature engineering tool and library designed to extract statistical characteristics from raw time series data. It transforms sequential data into tabular datasets, converting time series into a flat format where each row represents a unique entity and columns represent extracted features. The project distinguishes itself through a parallel data processing framework that distributes heavy computational workloads across multiple CPU cores. It also implements hypothesis-based feature selection to identify the most predictive characteristics and filter out irrelevant ones
Processes massive time series datasets by distributing heavy computational workloads across multiple CPU cores.
Spring AI is an application framework for Java that provides a portable, fluent API for integrating AI models, tools, and vector stores into applications. It wraps multiple AI providers behind a common interface, allowing developers to switch between chat, embedding, image, and speech models without changing application code. The framework includes a chainable chat client API similar to WebClient or RestClient, supports both synchronous and streaming interactions, and offers structured output conversion that transforms unstructured AI responses into strongly-typed Java objects. The framework
Splits large document collections into smaller batches to fit within embedding model token limits.
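Batching documents under a token limit, as the line above describes, could look roughly like the following sketch. It is written in Python rather than Java and is not Spring AI's implementation; the function name `batch_by_token_budget`, the greedy packing strategy, and the whitespace-based token count are all assumptions (real embedding models use their own tokenizers).

```python
def batch_by_token_budget(documents, max_tokens,
                          count_tokens=lambda text: len(text.split())):
    """Greedily pack documents into batches whose token total stays <= max_tokens.

    count_tokens is a stand-in: whitespace splitting is only an approximation
    of a model tokenizer.
    """
    batches = []
    current = []
    used = 0
    for doc in documents:
        cost = count_tokens(doc)
        if cost > max_tokens:
            # A single oversized document cannot fit in any batch.
            raise ValueError("document exceeds the token budget on its own")
        if current and used + cost > max_tokens:
            # Close the current batch before it would overflow.
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += cost
    if current:
        batches.append(current)
    return batches


docs = ["a b c", "d e", "f g h i", "j"]
batches = batch_by_token_budget(docs, max_tokens=5)
```

With a budget of 5 tokens, the four sample documents end up in two batches of two documents each, and document order is preserved.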
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Allows adding new columns and transforming data at scale to extend tables vertically and horizontally.
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Provides mathematical transformations such as scaling, centering, and logarithmic changes to prepare model variables.
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Performs large-scale filtering, joining, and transformations on massive dataframes via lazy evaluation.
The 1BRC (One Billion Row Challenge) is a Java performance benchmarking exercise that processes one billion temperature records from a text file to compute the minimum, mean, and maximum temperature per weather station. At its core, it is a large-scale data aggregation challenge designed to test how efficiently a Java program can parse and aggregate structured data from a plain text file, serving as both a programming exercise and a benchmark for Java performance optimization. The project distinguishes itself through a collection of performance-oriented architectural patterns for high-through
A programming exercise that processes one billion temperature records from a text file to compute per-station statistics.
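The aggregation at the heart of the challenge can be sketched in a few lines. This is a plain, unoptimized Python illustration rather than one of the Java entries; the `station;temperature` line format and the `aggregate` function name are assumptions made here.

```python
import io


def aggregate(lines):
    """Return {station: (min, mean, max)} from 'station;temperature' lines."""
    stats = {}  # station -> [min, max, sum, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        station, _, raw = line.partition(";")
        value = float(raw)
        entry = stats.get(station)
        if entry is None:
            stats[station] = [value, value, value, 1]
        else:
            if value < entry[0]:
                entry[0] = value
            if value > entry[1]:
                entry[1] = value
            entry[2] += value
            entry[3] += 1
    # Derive the mean only once per station, after the single pass.
    return {s: (lo, total / n, hi) for s, (lo, hi, total, n) in stats.items()}


# In-memory stand-in for the input text file.
sample = io.StringIO("Oslo;-2.0\nCairo;30.5\nOslo;4.0\nCairo;29.5\n")
result = aggregate(sample)
```

Keeping only four numbers per station means memory grows with the number of stations, not with the number of rows, which is what makes a single streaming pass over a very large file feasible.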
Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets. The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
Executes feature engineering and transformations across massive datasets using distributed processing.
Featuretools is a Python data science library and automated feature engineering framework designed to create predictive features from multiple related datasets. It automates the data preparation and transformation steps required for machine learning models through deep feature synthesis. The library enables the automatic generation of comprehensive feature tables by applying recursive transformations to relational data. It supports the transformation of unstructured text into structured numeric features and allows users to define custom primitives to extend the synthesis process with specific
Processes massive volumes of data by distributing feature computation across distributed clusters.
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
Processes massive datasets across a cluster using distributed key-value stores and map-reduce computation.
oneTBB is a C++ parallelism library and framework designed to add multi-core parallelism to applications. It provides a task-based parallelism model that maps logical computational tasks onto the available hardware cores, eliminating the need for manual thread management. The library functions as a multi-core scaling tool, using generic templates to scale data-parallel operations across processors for portable performance. It employs a task-based framework to ensure that computational workloads are distributed across hardware resources. The project covers shared-memory parallelism, multi-core task scheduling, and data-parallel scaling. It uses a work-stealing task scheduler, recursive range splitting, and dynamic load balancing to manage the distribution of work across cores at runtime.
Enables running operations across large datasets using templates to ensure portable performance across different multi-core processors.
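Recursive range splitting, one of the techniques named in the oneTBB entry, can be illustrated conceptually as follows. This is a Python sketch, not oneTBB's C++ API: the names `split_range` and `parallel_sum_of_squares` and the `grain` parameter are hypothetical, and a shared thread-pool queue stands in for a real work-stealing scheduler. CPython threads also do not give true CPU parallelism for this kind of work, so the example shows only the decomposition pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, grain):
    """Recursively halve [start, stop) until each piece has at most `grain` items."""
    if stop - start <= grain:
        return [(start, stop)]
    mid = start + (stop - start) // 2
    return split_range(start, mid, grain) + split_range(mid, stop, grain)


def parallel_sum_of_squares(values, grain=4, workers=4):
    """Apply a reduction over the leaf ranges using a pool of workers."""
    chunks = split_range(0, len(values), grain)

    def work(bounds):
        lo, hi = bounds
        return sum(v * v for v in values[lo:hi])

    # Idle workers pull the next leaf range from the pool's queue, which gives
    # a simple form of dynamic load balancing.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(work, chunks))


data = list(range(10))
chunks = split_range(0, len(data), 4)
total = parallel_sum_of_squares(data)
```

The grain size controls the trade-off: smaller leaves balance load more evenly across workers, while larger leaves reduce scheduling overhead.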
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Feast executes feature computation DAGs across a cluster, automatically scaling workers and managing resources for large-scale processing.
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Provides scalers, encoders, and imputers to transform raw data for statistical analysis and modeling.
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters one observation at a time, eliminating the need to hold complete training datasets in memory. The library is distinguished by a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before using them for training. The system covers a broad range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for ingesting streaming data from sources such as CSV files and APIs, along with tools for computing running statistics and memory-efficient data sketches.
Scales numeric values and encodes categories in real time to ensure data compatibility with algorithms.
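The test-then-train loop behind progressive validation can be sketched without River itself. The `RunningMean` model and the `progressive_validation` function below are hypothetical stand-ins written for this illustration and are not part of River's API.

```python
class RunningMean:
    """Toy incremental model: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def predict(self):
        return self.mean

    def learn(self, y):
        # Incremental mean update: one observation at a time, no stored history.
        self.count += 1
        self.mean += (y - self.mean) / self.count


def progressive_validation(stream, model):
    """Score each observation before the model is allowed to learn from it."""
    total_abs_error = 0.0
    seen = 0
    for y in stream:
        prediction = model.predict()        # test first
        total_abs_error += abs(y - prediction)
        seen += 1
        model.learn(y)                      # then train
    return total_abs_error / seen if seen else 0.0


stream = [2.0, 4.0, 6.0]
model = RunningMean()
mae = progressive_validation(stream, model)
```

Because every prediction is made before the corresponding observation is learned, the resulting mean absolute error reflects how the model would have performed had it been deployed on the live stream.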