9 repositories
Techniques for processing data in batches to improve computational efficiency.
Distinguishing note: Focuses on batch-oriented processing rather than row-level iteration.
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Processes datasets in vectorized batches to achieve higher performance compared to row-by-row operations.
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Groups processed features and labels into minibatches to facilitate efficient training and testing loops.
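As a rough illustration of the minibatch grouping described above, the following sketch slices paired features and labels into fixed-size batches. The function name `iter_minibatches` and its parameters are hypothetical and are not taken from the project; the project itself relies on deep learning libraries for this step.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_minibatches(
    features: Sequence[Sequence[float]],
    labels: Sequence[int],
    batch_size: int,
) -> Iterator[Tuple[List[Sequence[float]], List[int]]]:
    """Yield (features, labels) minibatches of at most batch_size samples.

    Hypothetical helper for illustration only.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Step through the dataset in strides of batch_size; the final batch
    # may be smaller when the dataset size is not a multiple of batch_size.
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        yield list(features[start:end]), list(labels[start:end])


if __name__ == "__main__":
    X = [[1.0], [2.0], [3.0], [4.0], [5.0]]
    y = [0, 1, 0, 1, 0]
    for batch_X, batch_y in iter_minibatches(X, y, batch_size=2):
        print(batch_X, batch_y)
```

A training loop would typically shuffle the sample order before each pass; that step is left out here to keep the sketch short.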
This project is a linear algebra tutorial and educational resource focused on the mathematical foundations of machine learning. It serves as a technical guide and instructional material for understanding how matrix calculations and linear operations power predictive algorithms. The resource emphasizes the transition from basic arithmetic to the implementation of predictive models. It focuses on linear algebra visualization to demonstrate how matrix operations translate into the geometric transformations used in data science. The material covers the implementation of machine learning logic th
Demonstrates processing multiple data samples simultaneously using vectorized matrix operations to increase throughput.
This repository provides a complete framework for training generative adversarial networks (GANs) that produce high-resolution photorealistic images, up to 1024 by 1024 pixels. The core technique is progressive layer growth, where both the generator and discriminator networks start training at low resolution and gradually add new layers to model finer details, enabling stable synthesis of large images. The framework includes a high-resolution image generator, an image quality metric evaluator, a latent space interpolation tool for creating smooth transition videos, and a multi-resolution datas
Implements minibatch standard deviation to help the discriminator detect mode collapse during training.
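The idea behind minibatch standard deviation can be sketched in plain Python: compute each feature's standard deviation across the batch, average those values into one scalar, and append that scalar to every sample as an extra feature. This is a simplified sketch on flat feature vectors; the repository's version operates on convolutional feature maps inside a deep learning framework. The name `minibatch_stddev_feature` and the `eps` default are assumptions made for illustration.

```python
import math
from typing import List


def minibatch_stddev_feature(
    batch: List[List[float]], eps: float = 1e-8
) -> List[List[float]]:
    """Append the batch-wide average feature std to every sample.

    Hypothetical, simplified helper: works on flat vectors, not feature maps.
    """
    n = len(batch)
    dim = len(batch[0])
    stds = []
    for j in range(dim):
        column = [sample[j] for sample in batch]
        mean = sum(column) / n
        variance = sum((x - mean) ** 2 for x in column) / n
        # eps keeps the square root well-behaved when variance is zero.
        stds.append(math.sqrt(variance + eps))
    avg_std = sum(stds) / dim
    # Every sample receives the same scalar, so a discriminator can see
    # how much variation the whole batch contains.
    return [sample + [avg_std] for sample in batch]


if __name__ == "__main__":
    collapsed = [[1.0, 2.0], [1.0, 2.0]]  # identical samples
    diverse = [[0.0, 0.0], [2.0, 4.0]]
    print(minibatch_stddev_feature(collapsed)[0][-1])  # close to zero
    print(minibatch_stddev_feature(diverse)[0][-1])    # close to 1.5
```

A batch of near-identical generated samples yields a value close to zero for the appended feature, which is the signal that lets a discriminator flag low variation.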
dplyr is a data manipulation library for R that provides a grammar for transforming tabular data frames. It functions as an in-memory data frame processor and relational algebra tool, using a consistent set of verbs to filter, select, and summarize data. The project includes a SQL translation engine that converts high-level data manipulation expressions into optimized queries. This lets users perform transformations directly on remote relational databases and cloud storage without pulling the data down locally. The library covers a wide range of tabular operations, including column mutation, row subsetting, and relational joins. It also offers capabilities for grouped data analysis, allowing datasets to be partitioned for independent aggregations and summaries.
Applies functions across entire columns simultaneously to maximize computational efficiency within the R memory model.
ArrayFire is a hardware-agnostic computing framework and JIT-compiled tensor engine designed for high-performance numerical computing. It serves as a GPU numerical computing library and parallel signal processing toolkit that abstracts hardware backends, allowing the same codebase to run on a variety of GPU and CPU architectures. The project distinguishes itself through a JIT engine that uses expression compilation to fuse operations and minimize memory overhead. It employs a deferred execution graph to optimize computation chains and provides interoperability primitives for sharing data and execution contexts with external compute platforms such as CUDA and OpenCL. The library covers a broad range of capabilities, including parallel linear algebra, digital signal processing, and accelerated computer vision. It provides tools for implementing machine learning, financial modeling simulation, and solving partial differential equations for physical system simulations. Its tensor management system handles multidimensional array allocation, slicing, and host-device data transfers.
Executes operations across N-dimensional arrays by tiling data and parallelizing loop iterations on hardware.
ZLinq is a zero-allocation LINQ library and memory-efficient collection toolkit for C#. It provides a high-performance replacement for standard query operations by using value-type enumerators and pooled memory to eliminate heap allocations and reduce garbage collection overhead. The library features a C# source generator that automatically routes standard query method calls to these zero-allocation implementations. It further accelerates data processing through a SIMD accelerated data library, using hardware vectorization for numeric aggregations and bulk operations on primitive arrays and s
Processes array and span elements using hardware vector widths via lambda expressions for high-performance iteration.
USearch is a high-performance vector similarity search engine and approximate nearest neighbor index designed for dense embeddings. It functions as a low-level vector database core and high-dimensional vector indexer, providing the primitives necessary to store and retrieve vectors across massive datasets. The engine distinguishes itself through hardware-level SIMD acceleration for distance kernels and a proximity-graph indexing system that enables fast retrieval across billions of vectors. It supports multi-precision vector quantization to balance memory usage and accuracy, and utilizes memo
Processes multiple query vectors simultaneously using flattened arrays to maximize throughput for bulk similarity searches.
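To show what a flattened batch of queries looks like, the sketch below packs several query vectors into one flat list and recovers each query by slicing at multiples of the dimension. It uses brute-force squared Euclidean distance, not USearch's proximity-graph index or its API; `batch_nearest` and its parameters are hypothetical names chosen for this example.

```python
from typing import List


def batch_nearest(
    flat_queries: List[float], dim: int, vectors: List[List[float]]
) -> List[int]:
    """Return the index of the nearest stored vector for each query.

    flat_queries holds all queries back to back: query k occupies
    positions [k * dim, (k + 1) * dim). Hypothetical brute-force sketch.
    """
    if dim <= 0 or len(flat_queries) % dim != 0:
        raise ValueError("flat_queries length must be a multiple of dim")
    results = []
    for start in range(0, len(flat_queries), dim):
        query = flat_queries[start:start + dim]
        # Squared Euclidean distance is enough for ranking neighbors.
        best = min(
            range(len(vectors)),
            key=lambda i: sum((q - v) ** 2 for q, v in zip(query, vectors[i])),
        )
        results.append(best)
    return results


if __name__ == "__main__":
    stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    queries = [0.1, 0.0, 4.0, 4.5, 0.9, 1.2]  # three 2-D queries, flattened
    print(batch_nearest(queries, dim=2, vectors=stored))  # [0, 2, 1]
```

Keeping the queries in one contiguous buffer avoids per-query container overhead, which is the property a SIMD-accelerated engine can exploit when it processes the whole batch in one call.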
This project is a multi-purpose toolkit comprising a static site generator, a predictive modeling tool, and a sports analytics dashboard. It functions as a content syndication engine that converts source files into static HTML and machine-readable XML streams for blogs and professional portfolios. The system features a data processing engine designed for sports performance analytics, using linear and logistic regression to estimate season win totals and calculate win probabilities. It includes a time-series visualization framework that renders these performance trends using high-contrast them
Processes large datasets using vectorization and row-by-row application to increase computation speed.