What are the best Label-Based Data Selection GitHub repositories?

Accessing data using explicit index labels. **Distinguishing note:** Focuses on label-based access patterns. Explore 38 awesome GitHub repositories matching data & databases · Label-Based Data Selection. Refine with filters or upvote what's useful. Top picks: kamranahmedse/developer-roadmap, pandas-dev/pandas, exacity/deeplearningbook-chinese, d2l-ai/d2l-en, fastai/fastai, heartexlabs/label-studio, humansignal/label-studio, humansignal/labelimg, idea-research/grounded-segment-anything, opencv…

Why is kamranahmedse/developer-roadmap a recommended Label-Based Data Selection repository?

Returns named data structures for improved code readability.

Why is pandas-dev/pandas a recommended Label-Based Data Selection repository?

Provides intuitive access to data rows and columns via index labels.

Why is exacity/deeplearningbook-chinese a recommended Label-Based Data Selection repository?

Provides guidance on using label smoothing to prevent neural networks from becoming overconfident in their predictions.

Why is d2l-ai/d2l-en a recommended Label-Based Data Selection repository?

Implements label shift correction to adjust training data weighting when label distributions change.

Why is fastai/fastai a recommended Label-Based Data Selection repository?

Adjusts target labels during training to prevent model overconfidence and improve generalization.

Why is heartexlabs/label-studio a recommended Label-Based Data Selection repository?

Integrates machine learning models to automatically generate initial annotations and refine training data.

Why is humansignal/label-studio a recommended Label-Based Data Selection repository?

Integrates machine learning models to provide automated predictions and active learning loops that accelerate manual data annotation.

Why is humansignal/labelimg a recommended Label-Based Data Selection repository?

Transforms image labels between XML, text, and CSV formats for use in cloud training platforms.

Why is idea-research/grounded-segment-anything a recommended Label-Based Data Selection repository?

Automatically creates image pseudo-labels, bounding boxes, and masks using recognition and captioning models.

Why is opencv/cvat a recommended Label-Based Data Selection repository?

Uses machine learning models to automatically generate initial bounding boxes and masks for visual data.


kamranahmedse/developer-roadmap
357,434
Developer Roadmap is a community-driven platform that provides structured, graph-based learning paths for software engineering. It serves as a comprehensive knowledge repository in which technical domains are organized into visual sequences to guide professional skill acquisition and career growth. The project is distinguished by a collaborative ecosystem that lets users contribute roadmaps, curate industry best practices, and maintain professional profiles. It integrates diagnostic assessment frameworks to evaluate technical competence, helping developers identify knowledge gaps and prepare for professional interviews through targeted learning sequences. Beyond its core mapping features, the platform offers practical project ideas and interactive tutoring to reinforce engineering concepts. It provides a central space for the community to share resources, track progressive skill development, and navigate complex technical landscapes.
Returns named data structures for improved code readability.
TypeScript · angular-roadmap · backend-roadmap · blockchain-roadmap
pandas-dev/pandas
49,039
Pandas is a high-performance data analysis library that provides a comprehensive framework for manipulating, cleaning, and transforming structured datasets. It centers on labeled one-dimensional and two-dimensional data structures, allowing users to construct, filter, and reshape tabular information while performing complex arithmetic and logical operations. The library distinguishes itself through a sophisticated indexing engine that enables automatic data alignment during calculations and relational merges. By utilizing a block-based memory layout, it optimizes cache locality for vectorized
Provides intuitive access to data rows and columns via index labels.
Python · alignment · data-analysis · data-science
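
To make "access to data rows and columns via index labels" concrete, here is a minimal sketch using pandas' `.loc` indexer. The sample table, its row labels, and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the labels and values are illustrative only.
df = pd.DataFrame(
    {"price": [3.5, 2.0, 4.25], "stock": [10, 0, 7]},
    index=["apple", "banana", "cherry"],
)

# Select a single cell by row label and column label.
apple_price = df.loc["apple", "price"]

# Label slices include both endpoints, unlike positional slices.
first_two = df.loc["apple":"banana", ["price"]]

# A boolean mask that shares the frame's index can also be passed to .loc.
in_stock = df.loc[df["stock"] > 0]
```

Because `.loc` works on labels rather than positions, these selections should keep returning the same rows even if the table is reordered.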
exacity/deeplearningbook-chinese
37,285
This project is a comprehensive Chinese translation of a technical deep learning textbook, providing an educational resource on the theory and implementation of neural networks. It functions as a collaborative technical translation project designed to make complex academic AI literature accessible to non-English speakers. The project utilizes a community-driven translation model that integrates external suggestions and pull requests to refine linguistic accuracy and reduce bias. It employs standardized terminology mapping to ensure a uniform vocabulary throughout the translated content. To i
Provides guidance on using label smoothing to prevent neural networks from becoming overconfident in their predictions.
TeX
d2l-ai/d2l-en
29,001
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
Implements label shift correction to adjust training data weighting when label distributions change.
Python · book · computer-vision · data-science
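
One common way to adjust training data weighting under label shift is to weight each example by the ratio of target to training label frequencies. The sketch below illustrates that idea only and is not taken from the d2l-en code. It assumes the target label distribution is already known, whereas in practice it usually has to be estimated. The function name `label_shift_weights` and the sample labels are hypothetical.

```python
from collections import Counter


def label_shift_weights(train_labels, target_dist):
    """Return per-class weights q(y) / p(y).

    p(y) is the empirical label frequency in the training set and
    q(y) is the (assumed known) label frequency in the target data.
    """
    counts = Counter(train_labels)
    n = len(train_labels)
    return {y: target_dist.get(y, 0.0) / (c / n) for y, c in counts.items()}


# Hypothetical training labels: p(cat) = 0.75, p(dog) = 0.25.
train_labels = ["cat", "cat", "cat", "dog"]
# Assumed target distribution q(y).
target_dist = {"cat": 0.5, "dog": 0.5}

weights = label_shift_weights(train_labels, target_dist)
# One weight per training example, usable as sample weights in a loss.
sample_weights = [weights[y] for y in train_labels]
```

Classes that are rarer in training than in the target data get weights above 1, and over-represented classes get weights below 1.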
fastai/fastai
27,862
Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models. The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza
Adjusts target labels during training to prevent model overconfidence and improve generalization.
Jupyter Notebook · colab · deep-learning · fastai
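
Adjusting target labels to curb overconfidence is commonly done with label smoothing. The sketch below shows one widely used formulation, which mixes the one-hot target with a uniform distribution. It is an illustration in plain Python, not fastai's implementation, and the function name `smooth_labels` and the `eps` default are assumptions.

```python
def smooth_labels(class_index, num_classes, eps=0.1):
    """Return (1 - eps) * one_hot(class_index) + eps / num_classes."""
    off_value = eps / num_classes
    target = [off_value] * num_classes
    # The true class keeps most of the probability mass.
    target[class_index] += 1.0 - eps
    return target


# Example: class 2 out of 4 classes with eps = 0.1.
smoothed = smooth_labels(class_index=2, num_classes=4, eps=0.1)
```

Other variants spread `eps` over only the incorrect classes, so the exact values depend on the convention in use.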
heartexlabs/label-studio
27,626
Label Studio is a tool for annotating various data types and a data annotation workspace designed to prepare datasets for machine learning training. It functions as a cloud-integrated data pipeline that imports raw data from storage, manages the annotation process, and exports labels in standardized formats. The platform includes a framework for integrating machine learning models that connects to external model servers. This enables model-assisted annotation and active learning, allowing the system to perform pre-labeling and refine predictions based on human feedback. The software offers project management tools for organizing datasets and assigning tasks to users through role-based access. It supports various data types and uses storage-agnostic adapters to connect to local file systems or cloud storage providers. The application can be installed through manual setup or one-click deployments on cloud infrastructure.
Integrates machine learning models to automatically generate initial annotations and refine training data.
TypeScript
HumanSignal/label-studio
27,619
Label Studio is a multi-modal data annotation platform designed to create and manage high-quality training datasets for machine learning. It functions as a self-hosted, containerized environment that supports secure, private deployments, including air-gapped configurations. The platform provides a centralized workspace for labeling diverse media types, such as images, text, audio, and time-series data, to support supervised and reinforcement learning workflows. The platform distinguishes itself through deep integration with machine learning backends, enabling active learning loops, automated
Integrates machine learning models to provide automated predictions and active learning loops that accelerate manual data annotation.
TypeScript · annotation · annotation-tool · annotations
HumanSignal/labelImg
25,015
labelImg is a computer vision labeling tool and image bounding box annotator used to create training datasets for machine learning models. It functions as a desktop utility for drawing rectangular labels on images and saving object coordinates and class names in common machine learning formats. The tool is specifically designed to generate and edit PascalVOC formatted XML files and create image labels in the text-based format required by YOLO object detection pipelines. The software covers object detection annotation and training data preparation, including the ability to manage label catego
Transforms image labels between XML, text, and CSV formats for use in cloud training platforms.
Python · annotations · deep-learning · detection
IDEA-Research/Grounded-Segment-Anything
17,633
Grounded-Segment-Anything is a suite of specialized tools for multimodal visual analysis, text-based segmentation, and generative image editing. It integrates text-to-bounding-box detection and high-precision image segmentation masks to function as a text-based image segmenter and an automated visual labeling tool. The project enables text-driven image editing by identifying objects through natural language to perform inpainting and element replacement. It further extends visual analysis into three dimensions, allowing for 3D human reconstruction and the generation of 3D bounding boxes from t
Automatically creates image pseudo-labels, bounding boxes, and masks using recognition and captioning models.
Jupyter Notebook · 3d-whole-body-pose-estimation · automatic-labeling-system · caption
opencv/cvat
16,086
CVAT is an open-source annotation tool for computer vision and a platform for managing visual datasets. It provides a self-hosted interface for labeling images, videos, and 3D data to create datasets for vision AI models. The platform offers AI-assisted data labeling features that automate the creation of masks and bounding boxes, and it uses a plug-in system to connect external machine learning models. It includes a consensus-based quality assurance system that verifies label accuracy by comparing independent annotations. The system covers collaborative team management, project organization through task decomposition, and integration of remote cloud storage. It also provides a REST API for programmatic workflow control as well as import and export of data in industry-standard formats.
Uses machine learning models to automatically generate initial bounding boxes and masks for visual data.
Python
cvat-ai/cvat
15,317
CVAT is an open-source, web-based platform designed for annotating images, videos, and 3D point clouds to create high-quality training datasets for machine learning. It functions as a containerized server that orchestrates the entire lifecycle of computer vision data, from initial task creation and manual labeling to quality assurance and final dataset export. The platform distinguishes itself through deep integration with machine learning models, allowing users to deploy custom AI models as serverless functions for automated object detection, tracking, and skeleton annotation. It supports co
Applies pre-trained machine learning models to generate initial annotations or suggest labels, reducing manual effort.
Python · annotation · annotation-tool · annotations
dask/dask
13,746
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It functions as a cluster resource manager that orchestrates computation logic by representing tasks and their dependencies as directed acyclic graphs. This architecture allows the system to automate the distribution of workloads across available hardware while managing complex execution requirements. The project is distinguished by a lazy evaluation engine that defers data operations until they are explicitly requested, enabling global graph optimization and efficient resource allocation. It integrates memory-aware data spilling to prevent system crashes when processing datasets that exceed available memory, and it uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform offers a comprehensive surface for large-scale data analysis, including support for distributed machine learning, integration with high-performance computing, and parallel data processing. It provides extensive tooling for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments across various infrastructures, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Retrieves specific rows or columns using index labels, boolean masks, or partial-string matching to filter large datasets.
Python · dask · numpy · pandas
h2oai/h2ogpt
12,016
h2oGPT is a self-hosted platform designed for running large language models and executing retrieval-augmented generation workflows locally. It provides a comprehensive web interface that allows users to index private document collections into searchable databases, enabling context-aware question answering and summarization without exposing sensitive data to external services. The platform distinguishes itself by offering a modular architecture that supports both local model execution and connections to external inference servers. It facilitates the development of autonomous agents capable of
Generates labels for documents and provides tools to validate, correct, and manage annotation workflows for training machine learning models.
Python · ai · chatgpt · embeddings
soumith/ganhacks
11,619
This project is a PyTorch-based generative framework and implementation template for building Generative Adversarial Networks. It provides a collection of foundational toolkits and architectural patterns designed to synthesize high-quality artificial data while focusing on the stability of adversarial neural networks. The framework distinguishes itself through a specialized toolkit for conditional image generation, which integrates discrete labels and auxiliary classification into the training process. It utilizes specific mechanisms to guide the generative process toward target classes by co
Provides utilities to adjust target labels with random noise to prevent discriminator overconfidence.
autogluon/autogluon
9,997
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Increases model accuracy by iteratively predicting and filtering confident samples from unlabeled data to expand the training set.
Python · autogluon · automated-machine-learning · automl
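
The "predict, then keep only confident samples" step described above is the filtering stage of pseudo-labeling. The sketch below shows just that stage in plain Python. It is not AutoGluon's API, and the function name `select_confident`, the threshold, and the sample probabilities are hypothetical.

```python
def select_confident(probabilities, threshold=0.9):
    """Return (sample_index, predicted_class) for confident predictions.

    `probabilities` holds one list of class probabilities per unlabeled
    sample, as produced by some already-trained classifier.
    """
    selected = []
    for i, probs in enumerate(probabilities):
        # Index of the most probable class for this sample.
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected


# Hypothetical model outputs for three unlabeled samples.
unlabeled_probs = [[0.95, 0.05], [0.55, 0.45], [0.02, 0.98]]
pseudo_labeled = select_confident(unlabeled_probs, threshold=0.9)
```

In an iterative setup, the selected samples would be added to the training set with their predicted classes, the model retrained, and the filter applied again to the remaining unlabeled data.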
brightmart/nlp_chinese_corpus
9,903
This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora. The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-
Links specific questions to corresponding answers using category labels for building knowledge-based systems.
bert · chinese · chinese-corpus
jadore801120/attention-is-all-you-need-pytorch
9,742
This project is a Transformer machine translation model and attention-based neural network implemented using the PyTorch deep learning framework. It functions as a text-to-text translation tool designed to convert source sequences into target language text. The implementation focuses on neural machine translation, covering the development of sequence-to-sequence architectures. It includes the full pipeline for translation, from text sequence preprocessing and vocabulary creation to model training and text generation inference. The system incorporates standard transformer components such as a
Includes utilities for label smoothing to distribute probability mass and prevent overconfidence.
Python · attention · attention-is-all-you-need · deep-learning
iamseancheney/python_for_data_analysis_2nd_chinese_version
8,937
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Explains how to use explicit axis labels to match and align data points across different tabular objects.
matplotlib · numpy · pandas
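
Matching data points across objects by their axis labels can be illustrated with pandas `Series` arithmetic, which aligns on the index before computing. This is a minimal sketch with made-up region labels and values, not an excerpt from the book's materials.

```python
import pandas as pd

# Hypothetical quarterly figures indexed by region label.
q1 = pd.Series([100, 200, 300], index=["north", "south", "east"])
q2 = pd.Series([10, 20, 30], index=["south", "east", "west"])

# Addition pairs values by label, not by position.
# Labels present in only one Series produce NaN.
total = q1 + q2

# Treat a label missing from one side as 0 instead of NaN.
total_filled = q1.add(q2, fill_value=0)
```

Here `south` and `east` are summed across both objects, while `north` and `west` appear in only one Series and are either left missing or filled, depending on the call.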
dusty-nv/jetson-inference
8,734
jetson-inference is a set of libraries and tools for executing optimized deep learning models on embedded GPU hardware. Its primary purpose is to enable real-time computer vision and AI inference at the edge with low latency and high throughput. The project distinguishes itself through high-performance streaming analytics and the ability to execute concurrent AI pipelines on auto-grade silicon. It provides specialized support for multi-sensor stream processing, utilizing zero-copy data transport to load camera frames directly into GPU memory. The codebase covers a broad surface of capabiliti
Runs deep learning models to automatically label datasets with GPU-accelerated pre- and post-processing.
C++ · caffe · computer-vision · deep-learning
CVHub520/X-AnyLabeling
8,193
X-AnyLabeling is an AI-assisted annotation platform and computer vision labeling tool. It provides an interface for annotating images and videos using polygons and rectangles to create training sets for machine learning models. The project distinguishes itself through the integration of external AI models via a plugin-based inference backend, allowing for automated generation of candidate labels and the execution of specialized tasks like pose estimation and object detection. It also functions as an optical character recognition tool for extracting text and layout information from document im
Translates annotations between different industry-standard data formats to ensure cross-tool compatibility.
Python · artificial-intelligence · clip · computer-vision