40 repositories

Awesome GitHub Repositories · Large-Scale Data Computation

Systems designed to execute complex data analysis and computation graphs across distributed clusters.

Distinct from Data Processing and Analysis: Focuses on the general domain of large-scale distributed computation rather than specific ML training or image processing.

Explore 40 awesome GitHub repositories matching data & databases · Large-Scale Data Computation. Refine with filters or upvote what's useful.


apache/spark
43,467
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
Executes complex computation graphs across distributed clusters to process massive datasets.
Scala, big-data, java, jdbc
RaRe-Technologies/gensim
16,442
Gensim is an unsupervised natural language processing toolkit designed for topic modeling, word embedding training, and the processing of large-scale text corpora. It provides a framework for discovering latent themes and semantic structures in text without the need for labeled data. The toolkit is distinguished by its ability to handle datasets that exceed system memory through iterator-based data streaming from disk. It also supports distributed model training, allowing complex modeling tasks to be executed across computer clusters. The library covers a broad range of analysis capabilities
Processes massive text datasets that exceed system memory through distributed computation and streaming.
Python
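
The iterator-based streaming mentioned in the Gensim description can be illustrated with the standard library alone. This is a minimal sketch, not Gensim's API: the class name `StreamingCorpus`, the whitespace tokenizer, and the one-document-per-line file layout are all assumptions made for illustration.

```python
import os
import tempfile
from collections import Counter


class StreamingCorpus:
    """Hypothetical helper: yields one tokenized document per line of a text file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly
        # while only one line is held in memory at a time.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.lower().split()
                if tokens:  # skip blank lines
                    yield tokens


def count_tokens(corpus):
    """Single streaming pass that accumulates token frequencies."""
    counts = Counter()
    for tokens in corpus:
        counts.update(tokens)
    return counts


def demo():
    """Write a tiny corpus to a temporary file and stream it back twice."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("Spark processes data\n\nGensim streams data from disk\n")
        path = tmp.name
    try:
        corpus = StreamingCorpus(path)
        first = count_tokens(corpus)
        second = count_tokens(corpus)  # works because __iter__ re-opens the file
        return first, second
    finally:
        os.remove(path)
```

Making the corpus an iterable object rather than a one-shot generator is the design choice that matters here: algorithms that need several passes over the data can simply iterate again without loading the file into memory.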
apache/hadoop
15,567
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
Implements a distributed framework for executing complex data analysis and computation across large clusters.
Java
apache/druid
14,020
Apache Druid is a real-time analytics database and distributed columnar time-series store designed for sub-second analytical queries. It functions as a data platform featuring a distributed SQL query engine and a real-time data ingestion system for moving historical and streaming data from external sources. The system is distinguished by its ability to provide low-latency analytics under high concurrency to power operational dashboards. It implements a Kerberos-secured environment for user authentication and employs a shared-nothing cluster architecture to enable horizontal scaling. The plat
Executes complex data analysis and multi-stage SQL transformations across distributed clusters for massive datasets.
Java, druid
MicrosoftDocs/azure-docs
10,894
Azure Docs is the official technical documentation repository for Microsoft Azure, the cloud computing platform. It provides comprehensive guidance on the full spectrum of Azure services, covering everything from core infrastructure components like virtual machines, Kubernetes clusters, and serverless computing to platform services for AI, machine learning, data analytics, and storage. The documentation details how to provision, manage, and govern cloud resources at scale, including policy enforcement, identity management, and cost optimization. The documentation distinguishes Azure through i
Documents Azure Batch for scheduling and executing large-scale parallel workloads on managed clusters.
Markdown, skilling
modin-project/modin
10,389
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
Processes datasets that exceed system memory using distributed execution engines and out-of-core computation.
Python, analytics, data-science, dataframe
boto/boto3
9,834
Boto3 is the AWS SDK for Python, providing a programmatic interface for managing and automating AWS cloud infrastructure and services. It serves as a cloud management API client and resource manager for provisioning, configuring, and scaling virtual servers, databases, and storage. The library enables the implementation of infrastructure-as-code through declarative templates and scripts, allowing for the deployment of identical resource stacks across multiple accounts and geographic regions. It also provides a framework for coordinating distributed workflows, serverless functions, and contain
Processes large-scale data analytics using Apache Spark code in a managed distributed environment.
Python, aws, aws-sdk, cloud
blue-yonder/tsfresh
9,249
tsfresh is an automated feature engineering tool and library designed to extract statistical characteristics from raw time series data. It transforms sequential data into tabular datasets, converting time series into a flat format where each row represents a unique entity and columns represent extracted features. The project distinguishes itself through a parallel data processing framework that distributes heavy computational workloads across multiple CPU cores. It also implements hypothesis-based feature selection to identify the most predictive characteristics and filter out irrelevant ones
Processes massive time series datasets by distributing heavy computational workloads across multiple CPU cores.
Jupyter Notebook, data-science, feature-extraction, time-series
spring-projects/spring-ai
9,001
Spring AI is an application framework for Java that provides a portable, fluent API for integrating AI models, tools, and vector stores into applications. It wraps multiple AI providers behind a common interface, allowing developers to switch between chat, embedding, image, and speech models without changing application code. The framework includes a chainable chat client API similar to WebClient or RestClient, supports both synchronous and streaming interactions, and offers structured output conversion that transforms unstructured AI responses into strongly-typed Java objects. The framework
Splits large document collections into smaller batches to fit within embedding model token limits.
Java, artificial-intelligence, java, spring-ai
lancedb/lancedb
9,031
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Allows adding new columns and transforming data at scale to extend tables vertically and horizontally.
HTML, approximate-nearest-neighbor-search, image-search, nearest-neighbor-search
iamseancheney/python_for_data_analysis_2nd_chinese_version
8,937
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Provides mathematical transformations such as scaling, centering, and logarithmic changes to prepare model variables.
matplotlib, numpy, pandas
vaexio/vaex
8,506
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Performs large-scale filtering, joining, and transformations on massive dataframes via lazy evaluation.
Python
gunnarmorling/1brc
8,062
The 1BRC (One Billion Row Challenge) is a Java performance benchmarking exercise that processes one billion temperature records from a text file to compute the minimum, mean, and maximum temperature per weather station. At its core, it is a large-scale data aggregation challenge designed to test how efficiently a Java program can parse and aggregate structured data from a plain text file, serving as both a programming exercise and a benchmark for Java performance optimization. The project distinguishes itself through a collection of performance-oriented architectural patterns for high-through
A programming exercise that processes one billion temperature records from a text file to compute per-station statistics.
Java, 1brc, challenges
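
The aggregation at the heart of the challenge (minimum, mean, and maximum per weather station) can be sketched in a few lines. The challenge itself targets Java; the version below is an unoptimized Python illustration only. It assumes one `station;temperature` record per line, and the function name `aggregate` and the sample records are hypothetical.

```python
import io
from collections import defaultdict


def aggregate(lines):
    """Return {station: (min, mean, max)} from 'station;temperature' lines."""
    # Per-station running state: [min, max, sum, count]
    stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
    for line in lines:
        line = line.strip()
        if not line:
            continue
        station, _, value = line.partition(";")
        temperature = float(value)
        entry = stats[station]
        entry[0] = min(entry[0], temperature)
        entry[1] = max(entry[1], temperature)
        entry[2] += temperature
        entry[3] += 1
    # Emit stations in alphabetical order with the mean derived from sum / count.
    return {
        station: (low, total / count, high)
        for station, (low, high, total, count) in sorted(stats.items())
    }


# Illustrative records only; a real input file would be streamed from disk.
SAMPLE = io.StringIO(
    "Hamburg;12.0\nBulawayo;8.9\nHamburg;14.0\nBulawayo;9.1\nPalembang;38.8\n"
)
RESULT = aggregate(SAMPLE)
```

Because only four numbers are kept per station, memory use depends on the number of stations rather than the number of rows, which is what makes a single streaming pass over a billion records feasible.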
alteryx/featuretools
7,658
Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets. The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
Executes feature engineering and transformations across massive datasets using distributed processing.
Python
featuretools/featuretools
7,655
Featuretools is a Python data science library and automated feature engineering framework designed to create predictive features from multiple related datasets. It automates the data preparation and transformation steps required for machine learning models through deep feature synthesis. The library enables the automatic generation of comprehensive feature tables by applying recursive transformations to relational data. It supports the transformation of unstructured text into structured numeric features and allows users to define custom primitives to extend the synthesis process with specific
Processes massive volumes of data by distributing feature computation across distributed clusters.
Python
h2oai/h2o-3
7,493
h2o-3 is a distributed machine learning platform and automated machine learning framework designed for training and deploying predictive models using distributed in-memory computing. It functions as a deep learning framework and a distributed model scoring engine, capable of operating as a Kubernetes ML cluster to process large datasets in parallel. The platform distinguishes itself through automated machine learning capabilities that automatically select the best algorithms and hyperparameters to optimize model performance. It provides specialized deep learning toolkits for tasks including i
Processes massive datasets across a cluster using distributed key-value stores and map-reduce computation.
Jupyter Notebook, automl, big-data, data-science
oneapi-src/oneTBB
6,683
oneTBB is a C++ parallelism library and framework designed to add multi-core parallelism to applications. It provides a task-based parallelism model that maps logical computational tasks onto the available hardware cores, eliminating the need for manual thread management. The library works as a multi-core scaling tool, using generic templates to scale data-parallel operations across processors for portable performance. It employs a task-based framework to ensure that computational workloads are distributed across hardware resources. The project covers shared-memory parallelism, multi-core task scheduling, and data-parallelism scaling. It uses a work-stealing task scheduler, recursive range splitting, and dynamic load balancing to manage the distribution of work across cores at runtime.
Enables running operations across large datasets using templates to ensure portable performance across different multi-core processors.
C++
feast-dev/feast
6,727
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Feast executes feature computation DAGs across a cluster, automatically scaling workers and managing resources for large-scale processing.
Python, big-data, data-engineering, data-quality
haifengl/smile
6,387
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Provides scalers, encoders, and imputers to transform raw data for statistical analysis and modeling.
Java
online-ml/river
5,853
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters one observation at a time, eliminating the need to hold complete training datasets in memory. The library is distinguished by a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before using them for training. The system covers a wide range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning via incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for ingesting streaming data from sources such as CSV files and APIs, along with tools for computing running statistics and memory-efficient data sketches.
Scales numeric values and encodes categories in real time to ensure data compatibility with algorithms.
Python
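
The idea of updating statistics one observation at a time, and scaling values in real time from them, can be sketched without River. The code below is not River's API: `RunningStats` and `stream_scaled` are hypothetical names, and Welford's algorithm is used here as one common way to keep a running mean and variance in constant memory.

```python
class RunningStats:
    """Hypothetical helper: running mean and variance via Welford's algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold a single observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        return self

    @property
    def variance(self):
        # Sample variance; reported as 0.0 until two observations have been seen.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def stream_scaled(values):
    """Yield each value standardized with the statistics seen so far.

    The current value is included in the statistics before it is scaled.
    """
    stats = RunningStats()
    for x in values:
        stats.update(x)
        std = stats.variance ** 0.5
        yield (x - stats.mean) / std if std > 0 else 0.0
```

Nothing in this sketch stores past observations, so it can run over an unbounded stream; the trade-off is that early values are scaled with less reliable statistics than later ones.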
