38 repositories
Accessing data using explicit index labels.
Distinguishing note: Focuses on label-based access patterns.
Explore 38 GitHub repositories matching data & databases · Label-Based Data Selection.
Developer Roadmap is a community-driven platform that provides structured, graph-based learning paths for software engineering. It serves as a comprehensive knowledge repository where technical domains are organized into visual sequences to guide professional skill acquisition and career growth. The project distinguishes itself through a collaborative ecosystem that lets users contribute to roadmaps, curate industry best practices, and maintain professional profiles. It integrates diagnostic assessment frameworks to evaluate technical proficiency, helping developers identify knowledge gaps and prepare for job interviews through targeted learning sequences. Beyond its core mapping capabilities, the platform offers practical project ideas and interactive tutoring to reinforce engineering concepts. It provides a centralized space where the community can share resources, track progressive skill development, and navigate complex technical landscapes.
Returns named data structures for improved code readability.
Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized
Provides intuitive access to data rows and columns via index labels.
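As a brief illustration of label-based access, the sketch below uses the pandas `.loc` indexer on a small DataFrame. The data and the variable names are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame(
    {"price": [3.5, 1.2, 4.0], "stock": [10, 0, 7]},
    index=["apple", "banana", "cherry"],
)

# Select a single row by its index label (returns a Series).
apple_row = df.loc["apple"]

# Select one cell by row label and column label.
cherry_stock = df.loc["cherry", "stock"]

# Label slices include both endpoints, unlike positional slices.
subset = df.loc["apple":"banana", ["price"]]
```

Because `.loc` works on labels rather than positions, the same expressions keep selecting the same rows if the DataFrame is reordered.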
This project is a comprehensive Chinese translation of a technical deep learning textbook, providing an educational resource on the theory and implementation of neural networks. It functions as a collaborative technical translation project designed to make complex academic AI literature accessible to non-English speakers. The project utilizes a community-driven translation model that integrates external suggestions and pull requests to refine linguistic accuracy and reduce bias. It employs standardized terminology mapping to ensure a uniform vocabulary throughout the translated content. To i
Provides guidance on using label smoothing to prevent neural networks from becoming overconfident in their predictions.
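A minimal sketch of label smoothing follows, assuming the common convention of mixing the one-hot target with a uniform distribution over all classes. The function name `smooth_labels` and the default `epsilon` are assumptions made for this example and are not taken from the project.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return a smoothed target distribution as a list of probabilities.

    Assumed convention: (1 - epsilon) * one_hot + epsilon / num_classes.
    """
    if not 0 <= target_index < num_classes:
        raise ValueError("target_index must be in [0, num_classes)")
    off_value = epsilon / num_classes
    return [
        (1.0 - epsilon) + off_value if k == target_index else off_value
        for k in range(num_classes)
    ]


# Example: class 2 out of 4 classes with the default epsilon.
smoothed = smooth_labels(2, 4)
```

The true class keeps most of the probability mass while every other class receives a small nonzero share, so the training target never demands a fully confident prediction.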
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Implements label shift correction to adjust training data weighting when label distributions change.
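One simple way to picture label shift correction is importance weighting: each training example is weighted by the ratio of the target label frequency to the training label frequency. The sketch below assumes the target label distribution is already known, which in practice usually has to be estimated. The function name `label_shift_weights` and the sample data are hypothetical.

```python
from collections import Counter


def label_shift_weights(train_labels, target_distribution):
    """Compute per-label weights q(y) / p(y).

    p(y) is the empirical label frequency in the training set and
    q(y) is the (assumed known) label frequency at deployment time.
    """
    counts = Counter(train_labels)
    total = len(train_labels)
    weights = {}
    for label, count in counts.items():
        p_train = count / total
        q_target = target_distribution.get(label, 0.0)
        weights[label] = q_target / p_train
    return weights


# Hypothetical example: cats dominate training but not deployment.
train_labels = ["cat", "cat", "cat", "dog"]
weights = label_shift_weights(train_labels, {"cat": 0.5, "dog": 0.5})
```

Here the under-represented `dog` examples receive a weight of 2.0 and the over-represented `cat` examples a weight of about 0.67, so a weighted loss reflects the deployment label mix.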
Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models. The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza
Adjusts target labels during training to prevent model overconfidence and improve generalization.
Label Studio is a multi-type data labeling tool and data annotation workspace designed to prepare datasets for machine learning training. It functions as a cloud-integrated data pipeline that imports raw data from storage, manages the annotation process, and exports labels in standardized formats. The platform includes a machine learning model integration framework that connects to external model servers. This enables model-assisted annotation and active learning, allowing the system to pre-label data and refine predictions based on human feedback. The software provides project management tools for organizing datasets and assigning tasks to users through role-based access. It supports a variety of data types and uses backend-agnostic storage adapters to connect to local file systems or cloud storage providers. The application can be installed through manual configuration or one-click deployments on cloud infrastructure.
Integrates machine learning models to automatically generate initial annotations and refine training data.
Label Studio is a multi-modal data annotation platform designed to create and manage high-quality training datasets for machine learning. It functions as a self-hosted, containerized environment that supports secure, private deployments, including air-gapped configurations. The platform provides a centralized workspace for labeling diverse media types, such as images, text, audio, and time-series data, to support supervised and reinforcement learning workflows. The platform distinguishes itself through deep integration with machine learning backends, enabling active learning loops, automated
Integrating machine learning models to provide automated predictions and active learning loops that accelerate the manual data annotation process.
labelImg is a computer vision labeling tool and image bounding box annotator used to create training datasets for machine learning models. It functions as a desktop utility for drawing rectangular labels on images and saving object coordinates and class names in common machine learning formats. The tool is specifically designed to generate and edit PascalVOC formatted XML files and create image labels in the text-based format required by YOLO object detection pipelines. The software covers object detection annotation and training data preparation, including the ability to manage label catego
Transforms image labels between XML, text, and CSV formats for use in cloud training platforms.
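To make the format difference concrete, the sketch below converts one PascalVOC-style pixel box (`xmin`, `ymin`, `xmax`, `ymax`) into a YOLO-style text line with a normalized center, width, and height. It is a standalone illustration and not labelImg's own code; the function names are hypothetical.

```python
def voc_to_yolo(xmin, ymin, xmax, ymax, image_width, image_height):
    """Convert a pixel-coordinate corner box to normalized YOLO values."""
    x_center = (xmin + xmax) / 2.0 / image_width
    y_center = (ymin + ymax) / 2.0 / image_height
    width = (xmax - xmin) / image_width
    height = (ymax - ymin) / image_height
    return x_center, y_center, width, height


def yolo_line(class_id, box):
    """Format one annotation as 'class x_center y_center width height'."""
    return f"{class_id} " + " ".join(f"{value:.6f}" for value in box)


# Hypothetical box on a 400x500 image.
box = voc_to_yolo(100, 50, 300, 250, image_width=400, image_height=500)
line = yolo_line(0, box)
```

The class names themselves are stored separately in YOLO pipelines, so only the integer class index appears in each line.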
Grounded-Segment-Anything is a suite of specialized tools for multimodal visual analysis, text-based segmentation, and generative image editing. It integrates text-to-bounding-box detection and high-precision image segmentation masks to function as a text-based image segmenter and an automated visual labeling tool. The project enables text-driven image editing by identifying objects through natural language to perform inpainting and element replacement. It further extends visual analysis into three dimensions, allowing for 3D human reconstruction and the generation of 3D bounding boxes from t
Automatically creates image pseudo-labels, bounding boxes, and masks using recognition and captioning models.
CVAT is an open-source computer vision annotation tool and visual dataset management platform. It provides a self-hosted interface for labeling images, videos, and 3D data to create datasets for vision AI models. The platform offers AI-assisted data labeling to automate the creation of masks and bounding boxes, using a plug-in system to connect external machine learning models. It includes a consensus-based quality assurance system that checks label accuracy by comparing independent annotations. The system covers collaborative team management, project organization through task decomposition, and remote cloud storage integration. It also provides a REST API for programmatic workflow control and for importing and exporting data in industry-standard formats.
Utilizes machine learning models to automatically generate initial bounding boxes and masks for visual data.
CVAT is an open-source, web-based platform designed for annotating images, videos, and 3D point clouds to create high-quality training datasets for machine learning. It functions as a containerized server that orchestrates the entire lifecycle of computer vision data, from initial task creation and manual labeling to quality assurance and final dataset export. The platform distinguishes itself through deep integration with machine learning models, allowing users to deploy custom AI models as serverless functions for automated object detection, tracking, and skeleton annotation. It supports co
Applies pre-trained machine learning models to generate initial annotations or suggest labels, reducing manual effort.
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computational logic by representing tasks and their dependencies as directed acyclic graphs. This architecture lets the system automate the distribution of workloads across available hardware while handling complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It integrates memory-aware data spilling to avoid system crashes when processing datasets that exceed available memory, and it uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform provides a comprehensive capability surface for large-scale data analysis, including support for distributed machine learning, high-performance computing integration, and parallel data processing. It offers extensive tools for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments on a variety of infrastructures, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Retrieves specific rows or columns using index labels, boolean masks, or partial-string matching to filter large datasets.
h2oGPT is a self-hosted platform designed for running large language models and executing retrieval-augmented generation workflows locally. It provides a comprehensive web interface that allows users to index private document collections into searchable databases, enabling context-aware question answering and summarization without exposing sensitive data to external services. The platform distinguishes itself by offering a modular architecture that supports both local model execution and connections to external inference servers. It facilitates the development of autonomous agents capable of
Generate labels for documents and provide tools to validate, correct, and manage annotation workflows for training machine learning models.
This project is a PyTorch-based generative framework and implementation template for building Generative Adversarial Networks. It provides a collection of foundational toolkits and architectural patterns designed to synthesize high-quality artificial data while focusing on the stability of adversarial neural networks. The framework distinguishes itself through a specialized toolkit for conditional image generation, which integrates discrete labels and auxiliary classification into the training process. It utilizes specific mechanisms to guide the generative process toward target classes by co
Provides utilities to adjust target labels with random noise to prevent discriminator overconfidence.
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Increases model accuracy by iteratively predicting and filtering confident samples from unlabeled data to expand the training set.
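The filtering step of such a pseudo-labeling loop can be sketched in a few lines. This is a generic illustration and not AutoGluon's API: the function name `select_confident`, the 0.9 threshold, and the prediction data are all assumptions made for the example.

```python
def select_confident(predictions, threshold=0.9):
    """Keep unlabeled samples whose top predicted class is confident enough.

    `predictions` maps a sample id to a dict of class -> probability.
    Returns a list of (sample_id, pseudo_label) pairs.
    """
    selected = []
    for sample_id, class_probs in predictions.items():
        label, confidence = max(class_probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            selected.append((sample_id, label))
    return selected


# Hypothetical model outputs on three unlabeled samples.
predictions = {
    "a": {"pos": 0.97, "neg": 0.03},
    "b": {"pos": 0.55, "neg": 0.45},
    "c": {"pos": 0.05, "neg": 0.95},
}
pseudo_labeled = select_confident(predictions)
```

In an iterative setup, the selected pairs would be added to the training set, the model retrained, and the remaining unlabeled samples scored again.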
This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora. The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-
Links specific questions to corresponding answers using category labels for building knowledge-based systems.
This project is a Transformer machine translation model and attention-based neural network implemented using the PyTorch deep learning framework. It functions as a text-to-text translation tool designed to convert source sequences into target language text. The implementation focuses on neural machine translation, covering the development of sequence-to-sequence architectures. It includes the full pipeline for translation, from text sequence preprocessing and vocabulary creation to model training and text generation inference. The system incorporates standard transformer components such as a
Includes utilities for label smoothing to distribute probability mass and prevent overconfidence.
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Explains how to use explicit axis labels to match and align data points across different tabular objects.
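Label-based alignment can be seen in a short pandas sketch: arithmetic between two Series matches values by index label rather than by position. The region names and numbers are hypothetical.

```python
import pandas as pd

# Two Series with partially overlapping index labels (illustrative data).
q1 = pd.Series({"north": 10, "south": 20, "east": 5})
q2 = pd.Series({"south": 1, "east": 2, "west": 3})

# Plain addition aligns on labels; labels missing on either side yield NaN.
total = q1 + q2

# Series.add with fill_value treats a missing side as 0 instead.
filled = q1.add(q2, fill_value=0)
```

The result index is the union of both inputs, so no value is silently paired with the wrong label even when the two objects are ordered differently.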
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Runs deep learning models to automatically label datasets with GPU-accelerated pre- and post-processing.
X-AnyLabeling is an AI-assisted annotation platform and computer vision labeling tool. It provides an interface for annotating images and videos using polygons and rectangles to create training sets for machine learning models. The project distinguishes itself through the integration of external AI models via a plugin-based inference backend, allowing for automated generation of candidate labels and the execution of specialized tasks like pose estimation and object detection. It also functions as an optical character recognition tool for extracting text and layout information from document im
Translates annotations between different industry-standard data formats to ensure cross-tool compatibility.