41 Repos
Systems designed to execute complex data analysis and computation graphs across distributed clusters.
Distinct from Data Processing and Analysis: Focuses on the general domain of large-scale distributed computation rather than specific ML training or image processing.
Explore 41 GitHub repositories matching Data & Databases · Large-Scale Data Computation.
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Executes complex computation graphs across distributed clusters to process massive datasets.
Gensim is an unsupervised natural language processing toolkit designed for topic modeling, word embedding training, and the processing of large-scale text corpora. It provides a framework for discovering latent themes and semantic structures in text without the need for labeled data. The toolkit is distinguished by its ability to handle datasets that exceed system memory through iterator-based data streaming from disk. It also supports distributed model training, allowing complex modeling tasks to be executed across computer clusters. The library covers a broad range of analysis capabilities
Processes massive text datasets that exceed system memory through distributed computation and streaming.
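The iterator-based streaming idea mentioned for Gensim can be illustrated with a minimal sketch in plain Python. This is not Gensim's API: `stream_documents` and `document_frequencies` are hypothetical helper names, and whitespace tokenization is an assumption made for brevity. The point is only that a generator lets a corpus be consumed one document at a time, so it never has to fit in memory.

```python
from collections import Counter


def stream_documents(lines):
    """Yield one tokenized document per non-empty line, lazily.

    `lines` can be any iterable of strings, for example an open file
    handle, so only one line is held in memory at a time.
    """
    for line in lines:
        tokens = line.lower().split()  # assumption: whitespace tokenization
        if tokens:
            yield tokens


def document_frequencies(documents):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # set() so a token counts once per document
    return df
```

Passing an open file handle to `stream_documents` and the result to `document_frequencies` would process a file line by line in a single pass.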
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Implements a distributed framework for executing complex data analysis and computation across large clusters.
Apache Druid is a real-time analytics database and distributed columnar time-series store designed for sub-second analytical queries. It functions as a data platform featuring a distributed SQL query engine and a real-time data ingestion system for moving historical and streaming data from external sources. The system is distinguished by its ability to provide low-latency analytics under high concurrency to power operational dashboards. It implements a Kerberos-secured environment for user authentication and employs a shared-nothing cluster architecture to enable horizontal scaling. The plat
Executes complex data analysis and multi-stage SQL transformations across distributed clusters for massive datasets.
Azure Docs is the official technical documentation repository for Microsoft Azure, the cloud computing platform. It provides comprehensive guidance on the full spectrum of Azure services, covering everything from core infrastructure components like virtual machines, Kubernetes clusters, and serverless computing to platform services for AI, machine learning, data analytics, and storage. The documentation details how to provision, manage, and govern cloud resources at scale, including policy enforcement, identity management, and cost optimization. The documentation distinguishes Azure through i
Documents Azure Batch for scheduling and executing large-scale parallel workloads on managed clusters.
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Processes datasets that exceed system memory using distributed execution engines and out-of-core computation.
Boto3 is the AWS SDK for Python, providing a programmatic interface for managing and automating AWS cloud infrastructure and services. It serves as a cloud management API client and resource manager for provisioning, configuring, and scaling virtual servers, databases, and storage. The library enables the implementation of infrastructure-as-code through declarative templates and scripts, allowing for the deployment of identical resource stacks across multiple accounts and geographic regions. It also provides a framework for coordinating distributed workflows, serverless functions, and contain
Processes large-scale data analytics using Apache Spark code in a managed distributed environment.
tsfresh is an automated feature engineering tool and library designed to extract statistical characteristics from raw time series data. It transforms sequential data into tabular datasets, converting time series into a flat format where each row represents a unique entity and columns represent extracted features. The project distinguishes itself through a parallel data processing framework that distributes heavy computational workloads across multiple CPU cores. It also implements hypothesis-based feature selection to identify the most predictive characteristics and filter out irrelevant ones
Processes massive time series datasets by distributing heavy computational workloads across multiple CPU cores.
Spring AI is an application framework for Java that provides a portable, fluent API for integrating AI models, tools, and vector stores into applications. It wraps multiple AI providers behind a common interface, allowing developers to switch between chat, embedding, image, and speech models without changing application code. The framework includes a chainable chat client API similar to WebClient or RestClient, supports both synchronous and streaming interactions, and offers structured output conversion that transforms unstructured AI responses into strongly-typed Java objects. The framework
Splits large document collections into smaller batches to fit within embedding model token limits.
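Splitting a document collection into batches under a token limit can be sketched as a greedy pass. This is not Spring AI's implementation (which is a Java framework): the function name `batch_by_token_budget` is hypothetical, and counting tokens by whitespace is a stand-in assumption for a real model tokenizer.

```python
def batch_by_token_budget(documents, max_tokens, count_tokens=None):
    """Greedily group documents so each batch stays within max_tokens.

    count_tokens is a pluggable function; the default counts
    whitespace-separated words, which only approximates a real tokenizer.
    """
    if count_tokens is None:
        count_tokens = lambda text: len(text.split())

    batches, current, used = [], [], 0
    for doc in documents:
        n = count_tokens(doc)
        if n > max_tokens:
            # A single oversized document cannot fit in any batch.
            raise ValueError("document exceeds the token budget on its own")
        if current and used + n > max_tokens:
            # Close the current batch and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        batches.append(current)
    return batches
```

Greedy grouping preserves document order and needs only one pass, at the cost of not always producing the smallest possible number of batches.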
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Allows adding new columns and transforming data at scale to extend tables vertically and horizontally.
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Provides mathematical transformations such as scaling, centering, and logarithmic changes to prepare model variables.
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Performs large-scale filtering, joining, and transformations on massive dataframes via lazy evaluation.
The 1BRC (One Billion Row Challenge) is a Java performance benchmarking exercise that processes one billion temperature records from a text file to compute the minimum, mean, and maximum temperature per weather station. At its core, it is a large-scale data aggregation challenge designed to test how efficiently a Java program can parse and aggregate structured data from a plain text file, serving as both a programming exercise and a benchmark for Java performance optimization. The project distinguishes itself through a collection of performance-oriented architectural patterns for high-through
A programming exercise that processes one billion temperature records from a text file to compute per-station statistics.
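The aggregation at the heart of the challenge can be sketched in a few lines. The challenge itself is a Java exercise; the version below is an unoptimized Python illustration, and it assumes each record has the form `station;temperature` on its own line. The function name `aggregate` is hypothetical.

```python
def aggregate(lines):
    """Return {station: (min, mean, max)} from 'station;temperature' lines."""
    stats = {}  # station -> [min, max, sum, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        station, _, value = line.partition(";")
        t = float(value)
        s = stats.get(station)
        if s is None:
            stats[station] = [t, t, t, 1]
        else:
            if t < s[0]:
                s[0] = t
            if t > s[1]:
                s[1] = t
            s[2] += t
            s[3] += 1
    # Convert running sums into means only once, at the end.
    return {
        name: (mn, total / n, mx)
        for name, (mn, mx, total, n) in stats.items()
    }
```

Keeping a running sum and count per station means the mean never requires storing individual readings, which is what makes a single streaming pass over a billion rows feasible.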
Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets. The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
Executes feature engineering and transformations across massive datasets using distributed processing.
Featuretools is a Python data science library and automated feature engineering framework designed to create predictive features from multiple related datasets. It automates the data preparation and transformation steps required for machine learning models through deep feature synthesis. The library enables the automatic generation of comprehensive feature tables by applying recursive transformations to relational data. It supports the transformation of unstructured text into structured numeric features and allows users to define custom primitives to extend the synthesis process with specific
Processes massive volumes of data by distributing feature computation across clusters.
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
Processes massive datasets across a cluster using distributed key-value stores and map-reduce computation.
oneTBB is a C++ parallelism library and framework designed to add multi-core parallelism to applications. It provides a task-based parallelism model that maps logical computation tasks onto available hardware cores, eliminating manual thread management. The library acts as a multi-core scaling tool and uses generic templates to scale data-parallel operations for portable performance across processors. It uses a task-based framework to ensure that computational loads are distributed across hardware resources. The project covers shared-memory parallelism, multi-core task scheduling, and the scaling of data parallelism. It uses a work-stealing task scheduler, recursive range splitting, and dynamic load balancing to manage work distribution across cores at runtime.
Enables running operations across large datasets using templates to ensure portable performance across different multi-core processors.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Feast executes feature computation DAGs across a cluster, automatically scaling workers and managing resources for large-scale processing.
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Provides scalers, encoders, and imputers to transform raw data for statistical analysis and modeling.
River is a Python framework for online machine learning designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters per observation, eliminating the need to hold full training datasets in memory. The library is distinguished by a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before they are used for training. The system covers a broad range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for streaming data from sources such as CSV files and APIs, along with tools for computing running statistics and memory-efficient data sketches.
Scales numeric values and encodes categories in real time to ensure data compatibility with algorithms.
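Scaling values in real time relies on statistics that are updated one observation at a time. The sketch below shows the general idea using Welford's running mean and variance in plain Python. It is not River's API: the class name `RunningScaler` and its methods are hypothetical, and the choice of population variance is an assumption.

```python
import math


class RunningScaler:
    """Standardize values using a mean and variance updated per observation."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: numerically stable, constant memory.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 until at least one value is seen.
        return self.m2 / self.n if self.n else 0.0

    def transform(self, x):
        std = math.sqrt(self.variance)
        # Avoid division by zero when all values seen so far are equal.
        return (x - self.mean) / std if std > 0 else 0.0
```

In a streaming loop, each incoming value would typically be transformed with the statistics seen so far and then passed to `update`, so the scaler never needs the full dataset.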