Why is apache/spark a recommended Large-Scale Data Computation repository?

Executes complex computation graphs across distributed clusters to process massive datasets.

Why is RaRe-Technologies/gensim a recommended Large-Scale Data Computation repository?

Processes massive text datasets that exceed system memory through distributed computation and streaming.

Why is apache/hadoop a recommended Large-Scale Data Computation repository?

Implements a distributed framework for executing complex data analysis and computation across large clusters.

Why is apache/druid a recommended Large-Scale Data Computation repository?

Executes complex data analysis and multi-stage SQL transformations across distributed clusters for massive datasets.

Why is MicrosoftDocs/azure-docs a recommended Large-Scale Data Computation repository?

Documents Azure Batch for scheduling and executing large-scale parallel workloads on managed clusters.

Why is modin-project/modin a recommended Large-Scale Data Computation repository?

Processes datasets that exceed system memory using distributed execution engines and out-of-core computation.

Why is boto/boto3 a recommended Large-Scale Data Computation repository?

Processes large-scale data analytics using Apache Spark code in a managed distributed environment.

Why is blue-yonder/tsfresh a recommended Large-Scale Data Computation repository?

Processes massive time series datasets by distributing heavy computational workloads across multiple CPU cores.

Why is spring-projects/spring-ai a recommended Large-Scale Data Computation repository?

Splits large document collections into smaller batches to fit within embedding model token limits.

Why is lancedb/lancedb a recommended Large-Scale Data Computation repository?

Allows adding new columns and transforming data at scale to extend tables vertically and horizontally.


Awesome GitHub Repositories: Large-Scale Data Computation

Systems designed to execute complex data analysis and computation graphs across distributed clusters.

Distinct from Data Processing and Analysis: this category covers general-purpose large-scale distributed computation rather than specific ML training or image processing.

Explore 41 awesome GitHub repositories matching data & databases · Large-Scale Data Computation.


apache/spark
43,467 stars
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Executes complex computation graphs across distributed clusters to process massive datasets.
Scala · big-data, java, jdbc
RaRe-Technologies/gensim
16,442 stars
Gensim is an unsupervised natural language processing toolkit designed for topic modeling, word embedding training, and the processing of large-scale text corpora. It provides a framework for discovering latent themes and semantic structures in text without the need for labeled data. The toolkit is distinguished by its ability to handle datasets that exceed system memory through iterator-based data streaming from disk. It also supports distributed model training, allowing complex modeling tasks to be executed across computer clusters. The library covers a broad range of analysis capabilities
Processes massive text datasets that exceed system memory through distributed computation and streaming.
Python
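The iterator-based streaming mentioned in the Gensim description can be illustrated without the library itself. What follows is a minimal plain-Python sketch, not Gensim's API: `stream_tokens`, `count_tokens`, and `demo` are hypothetical names, and the corpus is assumed to be a text file with one whitespace-tokenized document per line.

```python
import os
import tempfile
from collections import Counter


def stream_tokens(path):
    """Yield one tokenized document per line without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.lower().split()
            if tokens:  # skip blank lines
                yield tokens


def count_tokens(path):
    """Build token counts in a single pass; only one line is held at a time."""
    counts = Counter()
    for tokens in stream_tokens(path):
        counts.update(tokens)
    return counts


def demo():
    """Write a tiny corpus to a temporary file and count its tokens."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("Spark runs jobs\n\nGensim streams text\nspark streams data\n")
        path = tmp.name
    try:
        return count_tokens(path)
    finally:
        os.remove(path)
```

Because `stream_tokens` is a generator, memory use depends on the longest line rather than on the size of the file, which is the property the description attributes to streaming from disk.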
apache/hadoop
15,567 stars
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Implements a distributed framework for executing complex data analysis and computation across large clusters.
Java
apache/druid
14,020 stars
Apache Druid is a real-time analytics database and distributed columnar time-series store designed for sub-second analytical queries. It functions as a data platform featuring a distributed SQL query engine and a real-time data ingestion system for moving historical and streaming data from external sources. The system is distinguished by its ability to provide low-latency analytics under high concurrency to power operational dashboards. It implements a Kerberos-secured environment for user authentication and employs a shared-nothing cluster architecture to enable horizontal scaling. The plat
Executes complex data analysis and multi-stage SQL transformations across distributed clusters for massive datasets.
Java · druid
MicrosoftDocs/azure-docs
10,894 stars
Azure Docs is the official technical documentation repository for Microsoft Azure, the cloud computing platform. It provides comprehensive guidance on the full spectrum of Azure services, covering everything from core infrastructure components like virtual machines, Kubernetes clusters, and serverless computing to platform services for AI, machine learning, data analytics, and storage. The documentation details how to provision, manage, and govern cloud resources at scale, including policy enforcement, identity management, and cost optimization. The documentation distinguishes Azure through i
Documents Azure Batch for scheduling and executing large-scale parallel workloads on managed clusters.
Markdown · skilling
modin-project/modin
10,389 stars
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Processes datasets that exceed system memory using distributed execution engines and out-of-core computation.
Python · analytics, data-science, dataframe
boto/boto3
9,834 stars
Boto3 is the AWS SDK for Python, providing a programmatic interface for managing and automating AWS cloud infrastructure and services. It serves as a cloud management API client and resource manager for provisioning, configuring, and scaling virtual servers, databases, and storage. The library enables the implementation of infrastructure-as-code through declarative templates and scripts, allowing for the deployment of identical resource stacks across multiple accounts and geographic regions. It also provides a framework for coordinating distributed workflows, serverless functions, and contain
Processes large-scale data analytics using Apache Spark code in a managed distributed environment.
Python · aws, aws-sdk, cloud
blue-yonder/tsfresh
9,249 stars
tsfresh is an automated feature engineering tool and library designed to extract statistical characteristics from raw time series data. It transforms sequential data into tabular datasets, converting time series into a flat format where each row represents a unique entity and columns represent extracted features. The project distinguishes itself through a parallel data processing framework that distributes heavy computational workloads across multiple CPU cores. It also implements hypothesis-based feature selection to identify the most predictive characteristics and filter out irrelevant ones
Processes massive time series datasets by distributing heavy computational workloads across multiple CPU cores.
Jupyter Notebook · data-science, feature-extraction, time-series
spring-projects/spring-ai
9,001 stars
Spring AI is an application framework for Java that provides a portable, fluent API for integrating AI models, tools, and vector stores into applications. It wraps multiple AI providers behind a common interface, allowing developers to switch between chat, embedding, image, and speech models without changing application code. The framework includes a chainable chat client API similar to WebClient or RestClient, supports both synchronous and streaming interactions, and offers structured output conversion that transforms unstructured AI responses into strongly-typed Java objects. The framework
Splits large document collections into smaller batches to fit within embedding model token limits.
Java · artificial-intelligence, java, spring-ai
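Splitting a document collection into batches that fit a token limit can be sketched in a few lines. This is a hedged illustration of the general idea, not Spring AI's implementation: `batch_by_token_limit` is a hypothetical helper, and the default token count is a crude whitespace word count standing in for a real tokenizer.

```python
def batch_by_token_limit(documents, max_tokens, count_tokens=None):
    """Group documents into consecutive batches whose token totals stay within max_tokens."""
    if count_tokens is None:
        # Assumption: whitespace word count as a rough token estimate.
        def count_tokens(text):
            return len(text.split())

    batches, current, used = [], [], 0
    for doc in documents:
        size = count_tokens(doc)
        if size > max_tokens:
            raise ValueError("a single document exceeds the token limit")
        # Close the current batch when adding this document would overflow it.
        if current and used + size > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += size
    if current:
        batches.append(current)
    return batches
```

The sketch preserves document order and rejects oversized documents instead of truncating them; a real pipeline might split such documents into chunks first.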
lancedb/lancedb
9,031 stars
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Allows adding new columns and transforming data at scale to extend tables vertically and horizontally.
HTML · approximate-nearest-neighbor-search, image-search, nearest-neighbor-search
iamseancheney/python_for_data_analysis_2nd_chinese_version
8,937 stars
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Provides mathematical transformations such as scaling, centering, and logarithmic changes to prepare model variables.
matplotlib, numpy, pandas
vaexio/vaex
8,506 stars
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Performs large-scale filtering, joining, and transformations on massive dataframes via lazy evaluation.
Python
gunnarmorling/1brc
8,062 stars
The 1BRC (One Billion Row Challenge) is a Java performance benchmarking exercise that processes one billion temperature records from a text file to compute the minimum, mean, and maximum temperature per weather station. At its core, it is a large-scale data aggregation challenge designed to test how efficiently a Java program can parse and aggregate structured data from a plain text file, serving as both a programming exercise and a benchmark for Java performance optimization. The project distinguishes itself through a collection of performance-oriented architectural patterns for high-through
A programming exercise that processes one billion temperature records from a text file to compute per-station statistics.
Java · 1brc, challenges
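The aggregation at the heart of the challenge (minimum, mean, and maximum per station) fits in a single streaming pass. The sketch below is written in Python for brevity, although the challenge itself targets Java, and it makes no attempt at the performance techniques the project is about. The function name `aggregate` and the `station;temperature` line layout are assumptions made for illustration.

```python
def aggregate(lines):
    """Return {station: (min, mean, max)} from 'station;temperature' lines."""
    stats = {}  # station -> [min, max, total, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        station, _, raw = line.partition(";")
        value = float(raw)
        entry = stats.get(station)
        if entry is None:
            stats[station] = [value, value, value, 1]
        else:
            entry[0] = min(entry[0], value)
            entry[1] = max(entry[1], value)
            entry[2] += value
            entry[3] += 1
    return {
        name: (low, total / count, high)
        for name, (low, high, total, count) in stats.items()
    }
```

Any iterable of lines works as input, including an open file object, so the full file never has to be held in memory; only one small record per station is kept.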
alteryx/featuretools
7,658 stars
Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets. The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
Executes feature engineering and transformations across massive datasets using distributed processing.
Python
featuretools/featuretools
7,655 stars
Featuretools is a Python data science library and automated feature engineering framework designed to create predictive features from multiple related datasets. It automates the data preparation and transformation steps required for machine learning models through deep feature synthesis. The library enables the automatic generation of comprehensive feature tables by applying recursive transformations to relational data. It supports the transformation of unstructured text into structured numeric features and allows users to define custom primitives to extend the synthesis process with specific
Processes massive volumes of data by distributing feature computation across distributed clusters.
Python
h2oai/h2o-3
7,493 stars
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
Processes massive datasets across a cluster using distributed key-value stores and map-reduce computation.
Jupyter Notebook · automl, big-data, data-science
oneapi-src/oneTBB
6,683 stars
oneTBB is a C++ parallelism library and framework designed to add multi-core parallelism to applications. It provides a task-based parallelism model that maps logical computational tasks onto the available hardware cores, which removes the need for manual thread management. The library serves as a multi-core scaling tool and uses generic templates to scale data-parallel operations for portable performance across processors. Its task-based framework ensures that computational loads are distributed across hardware resources. The project covers shared-memory parallelism, multi-core task scheduling, and the scaling of data parallelism. It uses a work-stealing task scheduler, recursive range splitting, and dynamic load balancing to manage the distribution of work across cores at runtime.
Enables running operations across large datasets using templates to ensure portable performance across different multi-core processors.
C++
feast-dev/feast
6,727 stars
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Feast executes feature computation DAGs across a cluster, automatically scaling workers and managing resources for large-scale processing.
Python · big-data, data-engineering, data-quality
haifengl/smile
6,387 stars
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Provides scalers, encoders, and imputers to transform raw data for statistical analysis and modeling.
Java
online-ml/river
5,853 stars
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters per observation, which removes the need to hold complete training datasets in memory. The library is distinguished by a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also offers a progressive validation framework that simulates real-time deployment by testing models on samples before those samples are used for training. The system covers a broad range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for streaming data from sources such as CSV files and APIs, along with tools for computing running statistics and memory-efficient data sketches.
Scales numeric values and encodes categories in real time to ensure data compatibility with algorithms.
Python
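Scaling numeric values in real time depends on running statistics that are updated one observation at a time. The following is a minimal sketch of that idea using Welford's online algorithm; it is not River's API, and the class name `RunningScaler` and its methods are hypothetical.

```python
import math


class RunningScaler:
    """Standardize values one at a time using a running mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one observation into the running statistics (Welford's method)."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def scale(self, x):
        """Return x standardized with the statistics seen so far."""
        if self.count < 2:
            return 0.0
        std = math.sqrt(self.m2 / self.count)  # population standard deviation
        return (x - self.mean) / std if std > 0 else 0.0
```

Only three numbers are stored regardless of how many observations arrive, which is what makes this kind of preprocessing usable on an unbounded stream.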