9 repositories
Techniques for processing data in batches to improve computational efficiency.
Distinguishing note: Focuses on batch-oriented processing rather than row-level iteration.
Explore 9 awesome GitHub repositories matching data & databases · Vectorized Data Processing.
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Processes datasets in vectorized batches to achieve higher performance compared to row-by-row operations.
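To illustrate the batch-versus-row idea in the blurb above, here is a minimal sketch in plain Python. It is not Ray's API: the helper name `map_batches` and the example function `double_all` are hypothetical, chosen only for illustration. The point is that the user function is called once per batch rather than once per row, so per-call overhead is amortized. The actual speedup in a real system depends on the function being a vectorized kernel.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def map_batches(
    rows: Iterable[T],
    fn: Callable[[List[T]], List[U]],
    batch_size: int,
) -> Iterator[U]:
    """Apply fn to fixed-size batches of rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[T] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            # One call handles the whole batch.
            yield from fn(batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield from fn(batch)


def double_all(batch: List[int]) -> List[int]:
    """Example batch function: doubles every element."""
    return [2 * x for x in batch]
```

With five rows and a batch size of two, `double_all` is called three times instead of five.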
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Groups processed features and labels into minibatches to facilitate efficient training and testing loops.
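A minimal sketch of minibatch grouping, assuming features and labels are held in equal-length sequences. The function name `iter_minibatches` is hypothetical and is not taken from the project. Shuffling, which training loops usually add, is left out to keep the sketch short.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

F = TypeVar("F")
L = TypeVar("L")


def iter_minibatches(
    features: Sequence[F],
    labels: Sequence[L],
    batch_size: int,
) -> Iterator[Tuple[List[F], List[L]]]:
    """Yield aligned (features, labels) minibatches in order."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(features), batch_size):
        end = start + batch_size
        # Slicing past the end is safe; the last batch may be smaller.
        yield list(features[start:end]), list(labels[start:end])
```

A training or evaluation loop would then iterate `for x_batch, y_batch in iter_minibatches(X, y, 32): ...`.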
This project is a linear algebra tutorial and educational resource focused on the mathematical foundations of machine learning. It serves as a technical guide and instructional material for understanding how matrix calculations and linear operations power predictive algorithms. The resource emphasizes the transition from basic arithmetic to the implementation of predictive models. It focuses on linear algebra visualization to demonstrate how matrix operations translate into the geometric transformations used in data science. The material covers the implementation of machine learning logic th
Demonstrates processing multiple data samples simultaneously using vectorized matrix operations to increase throughput.
This repository provides a complete framework for training generative adversarial networks (GANs) that produce high-resolution photorealistic images, up to 1024 by 1024 pixels. The core technique is progressive layer growth, where both the generator and discriminator networks start training at low resolution and gradually add new layers to model finer details, enabling stable synthesis of large images. The framework includes a high-resolution image generator, an image quality metric evaluator, a latent space interpolation tool for creating smooth transition videos, and a multi-resolution datas
Implements minibatch standard deviation to help the discriminator detect mode collapse during training.
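The idea behind minibatch standard deviation can be sketched as follows. This is a simplified illustration, not the repository's code: it assumes each sample is a flat list of feature values, whereas the real layer operates on image feature maps. The standard deviation of every feature is computed across the batch, the results are averaged into one scalar, and that scalar is appended to each sample as an extra feature. A batch of near-identical samples yields a value close to zero, which gives the discriminator a signal for collapsed output.

```python
import math
from typing import List


def minibatch_stddev(batch: List[List[float]]) -> List[List[float]]:
    """Append the mean per-feature std over the batch to every sample."""
    if not batch or not batch[0]:
        raise ValueError("batch must contain at least one non-empty sample")
    n = len(batch)
    num_features = len(batch[0])
    stds = []
    for j in range(num_features):
        column = [sample[j] for sample in batch]
        mean = sum(column) / n
        # Population variance of feature j across the batch.
        variance = sum((x - mean) ** 2 for x in column) / n
        stds.append(math.sqrt(variance))
    # Collapse all per-feature deviations into a single statistic.
    stat = sum(stds) / num_features
    return [sample + [stat] for sample in batch]
```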
dplyr is an R library for data manipulation that provides a grammar for transforming tabular data frames. It works as an in-memory data frame processor and a relational algebra tool, using a consistent set of verbs to filter, select, and summarize data. The project includes a SQL translation engine that converts high-level data manipulation expressions into optimized queries. This lets users perform transformations directly on remote relational databases and cloud storage without downloading the data locally. The library covers a wide range of tabular operations, including mutating columns, subsetting rows, and relational data joins. It also provides capabilities for grouped data analysis, allowing datasets to be partitioned for independent aggregations and summaries.
Applies functions across entire columns simultaneously to maximize computational efficiency within the R memory model.
ArrayFire is a hardware-agnostic computing framework and JIT-compiled tensor engine designed for high-performance numerical computing. It serves as a GPU numerical computing library and parallel signal processing toolkit that abstracts hardware backends, allowing the same code to run on various GPU and CPU architectures. The project distinguishes itself through a JIT engine that uses expression compilation to fuse operations and minimize memory consumption. It uses a deferred execution graph to optimize computation chains and provides interoperability primitives for sharing data and execution contexts with external computing platforms such as CUDA and OpenCL. The library covers a wide range of capabilities, including parallel linear algebra, digital signal processing, and accelerated computer vision. It provides tools for implementing machine learning, simulating financial models, and solving partial differential equations for physical system simulations. Its tensor management system handles multidimensional array allocation, slicing, and host-device data transfers.
Executes operations across N-dimensional arrays by tiling data and parallelizing loop iterations on hardware.
ZLinq is a zero-allocation LINQ library and memory-efficient collection toolkit for C#. It provides a high-performance replacement for standard query operations by using value-type enumerators and pooled memory to eliminate heap allocations and reduce garbage collection overhead. The library features a C# source generator that automatically routes standard query method calls to these zero-allocation implementations. It further accelerates data processing through a SIMD accelerated data library, using hardware vectorization for numeric aggregations and bulk operations on primitive arrays and s
Processes array and span elements using hardware vector widths via lambda expressions for high-performance iteration.
USearch is a high-performance vector similarity search engine and approximate nearest neighbor index designed for dense embeddings. It functions as a low-level vector database core and high-dimensional vector indexer, providing the primitives necessary to store and retrieve vectors across massive datasets. The engine distinguishes itself through hardware-level SIMD acceleration for distance kernels and a proximity-graph indexing system that enables fast retrieval across billions of vectors. It supports multi-precision vector quantization to balance memory usage and accuracy, and utilizes memo
Processes multiple query vectors simultaneously using flattened arrays to maximize throughput for bulk similarity searches.
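A minimal sketch of what a flattened batch query looks like, assuming vectors are stored back to back in one flat list with a known dimension. It uses exhaustive squared-Euclidean search in plain Python rather than USearch's approximate index or SIMD kernels, and the function name `batch_nearest` is hypothetical. It shows only the memory layout and the one-call-many-queries shape of the interface.

```python
from typing import List


def batch_nearest(
    flat_queries: List[float],
    flat_index: List[float],
    dim: int,
) -> List[int]:
    """Return, for each query, the position of its nearest indexed vector."""
    if dim <= 0:
        raise ValueError("dim must be positive")
    if len(flat_queries) % dim or len(flat_index) % dim:
        raise ValueError("flat arrays must be a multiple of dim in length")
    n_queries = len(flat_queries) // dim
    n_index = len(flat_index) // dim
    results = []
    for q in range(n_queries):
        q_off = q * dim  # start of query q inside the flat array
        best_id, best_dist = -1, float("inf")
        for i in range(n_index):
            i_off = i * dim  # start of indexed vector i
            dist = sum(
                (flat_queries[q_off + k] - flat_index[i_off + k]) ** 2
                for k in range(dim)
            )
            if dist < best_dist:
                best_id, best_dist = i, dist
        results.append(best_id)
    return results
```

Keeping all queries in one contiguous buffer is what lets a native engine process them in a single call without per-query conversion overhead.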
This project is a multi-purpose toolkit comprising a static site generator, a predictive modeling tool, and a sports analytics dashboard. It functions as a content syndication engine that converts source files into static HTML and machine-readable XML streams for blogs and professional portfolios. The system features a data processing engine designed for sports performance analytics, using linear and logistic regression to estimate season win totals and calculate win probabilities. It includes a time-series visualization framework that renders these performance trends using high-contrast them
Processes large datasets using vectorization and row-by-row application to increase computation speed.