8 repositories
Functions for filtering, mapping, and manipulating distributed data.
Distinguishing note: Focuses on row-level and batch-level data manipulation.
Explore 8 awesome GitHub repositories matching data & databases · Dataset Transformations.
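As a rough illustration of the row-level versus batch-level distinction this category draws, here is a minimal sketch in plain Python. It does not use the API of any repository listed here; the helper names `map_rows`, `batched`, and `map_batches` are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, int]


def map_rows(rows: Iterable[Row], fn: Callable[[Row], Row]) -> Iterator[Row]:
    """Row-level transform: call fn once per record (hypothetical helper)."""
    for row in rows:
        yield fn(row)


def batched(rows: Iterable, batch_size: int) -> Iterator[List]:
    """Group records into lists of at most batch_size items."""
    batch: List = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def map_batches(
    rows: Iterable[Row],
    fn: Callable[[List[Row]], List[Row]],
    batch_size: int,
) -> Iterator[Row]:
    """Batch-level transform: call fn once per batch; fn may drop or add rows."""
    for batch in batched(rows, batch_size):
        yield from fn(batch)


records = [{"id": i, "value": i * 10} for i in range(5)]

# Row-level map: output has the same number of records as the input.
doubled = list(map_rows(records, lambda r: {**r, "value": r["value"] * 2}))


# Batch-level filter: each call sees a list of records and returns a subset.
def keep_large(batch: List[Row]) -> List[Row]:
    return [r for r in batch if r["value"] >= 20]


filtered = list(map_batches(records, keep_large, batch_size=2))
```

A row-level map keeps the dataset size unchanged, while a batch-level function receives several records at once, which lets it filter rows or amortize per-call overhead. Distributed engines typically run such functions in parallel over partitions rather than in a single loop as shown here.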
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Applies functions to rows or batches to filter, map, or manipulate data for downstream processing tasks.
This project is a comprehensive research platform designed for the end-to-end lifecycle of robotic learning. It provides a modular framework for training neural network policies—specifically through imitation and reinforcement learning—and deploying them onto physical robotic hardware. By offering a unified interface for hardware abstraction, the platform decouples high-level control logic from the specific sensors and actuators of diverse robotic systems. The framework distinguishes itself through a standardized approach to data and policy management. It utilizes a consistent schema for reco
Applies coordinate transformations to historical data to ensure compatibility with updated hardware.
Vega is a reactive visualization engine that translates structured specifications into interactive, browser-based graphical representations. It functions as a declarative grammar for data visualization, allowing users to define complex charts and maps through a JSON-based configuration format rather than imperative code. The system operates on a dataflow-based reactive graph that automatically propagates updates through the visualization whenever input data or user interactions change. By integrating a modular transformation pipeline, the engine handles data filtering, sorting, and aggregatio
Filters, sorts, and aggregates datasets directly within the visualization specification before rendering.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Applies operations like LLM inference and repartitioning across entire datasets using distributed engines.
Flashlight is a standalone C++ machine learning and tensor library used to build and train neural networks. It functions as a complete neural network framework and automatic differentiation engine, providing the tools to build computation graphs and compute gradients via backpropagation. The project serves as a distributed training framework, using all-reduce operations to synchronize gradients and parameters across multiple compute nodes and devices. It distinguishes itself through deep integration of high-performance tensor manipulation, native device-memory interoperability, and a system for synchronizing weights between distributed workers to accelerate large-scale model training. The framework covers a wide range of deep learning capabilities, including modular layer composition for designing complex architectures such as residual blocks and recurrent cells. It provides extensive data-handling utilities for ingestion and prefetching, as well as serialization systems for persisting model state. In addition, it includes a suite of monitoring and observability tools for tracking training metrics and measuring sequence errors. The library is implemented in C++.
Provides functions for mapping and manipulating dataset values while preserving the original data size.
SparkInternals is a technical reference and architecture guide detailing the internal design and implementation of the Apache Spark distributed computing engine. It serves as a study in big data engine analysis, focusing on cluster execution management and the interaction between driver nodes, executors, and workers. The project provides a detailed breakdown of how logical plans are converted into physical execution stages. It specifically analyzes the mechanics of shuffle operations, memory management, and the coordination of distributed job scheduling. The documentation covers a wide range of distributed computing capabilities, including query execution planning, data dependency management, and in-memory caching strategies. It also examines task distribution, parallel execution, and the processes used for error recovery and data persistence.
Provides distributed functions for mapping, filtering, and manipulating records to produce new datasets.
This is an interactive notebook-based course that teaches machine learning from Python fundamentals through deep learning and natural language processing. It uses real datasets and multiple frameworks within a structured, hands-on curriculum that combines concise explanations with executable code cells, built-in datasets, and embedded exercise checkpoints. Learning progresses through data preparation and exploration, classical machine learning workflows, computer vision with convolutional neural networks, and natural language processing with deep learning, all delivered as a cohesive progressi
Provides functions for mapping and manipulating data using custom functions and lambdas across columns.
This is a structured deep learning curriculum for programmers, delivered as a collection of Jupyter notebooks. It teaches the fundamentals of training neural networks for computer vision, natural language processing, tabular data analysis, and collaborative filtering using PyTorch and the fastai library. The course is designed to be hands-on, guiding learners from building a training loop from scratch to fine-tuning pretrained models for a variety of practical tasks. The curriculum distinguishes itself by covering the full lifecycle of a deep learning project, from data preparation and augmen
Wraps custom dataset logic into Transform objects so they integrate with the data pipeline system.