12 repositories
Tools for cleaning, transforming, and encoding data for model consumption.
Distinguishing note: Focuses on categorical encoding.
Explore 12 GitHub repositories matching Artificial Intelligence & ML · Data Preprocessing.
This project is an interactive data science environment that combines code execution, rich media visualization, and narrative documentation into a persistent, browser-based platform. It serves as a comprehensive educational resource for scientific computing, providing a framework for iterative data analysis and machine learning prototyping. The environment is distinguished by its focus on high-performance numerical computing, utilizing vectorized array operations and memory-mapped data structures to handle large-scale computations efficiently. It features a unified estimator interface that st
Converts categorical data into numerical formats for model input.
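As an illustration of what converting categorical data into numerical form can look like, here is a minimal one-hot encoding sketch in plain Python. It is not taken from any repository in this list; the function name `one_hot_encode` and the choice to order categories alphabetically are assumptions made for the example.

```python
def one_hot_encode(values):
    """Return (categories, rows), where each row is a one-hot vector.

    Hypothetical helper for illustration only. Categories are ordered
    by sorted label so the column order is deterministic.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows


categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
print(categories)
for row in encoded:
    print(row)
```

One-hot encoding is only one option; libraries often also offer ordinal or target-based encodings, which trade column count against the risk of implying an order that does not exist.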
DeepSeek-Coder is a large language model and foundational neural network architecture designed specifically for software development tasks. It functions as an artificial intelligence assistant capable of interpreting complex programming instructions to generate, transpile, and structure source code. The system distinguishes itself through its ability to perform project-level code generation, analyzing broader context and patterns across entire software projects rather than isolated files. It supports multimodal input processing, allowing for the integration of text and visual data to inform i
Formats raw data through truncation, padding, and token insertion to meet model architecture requirements.
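The three steps named above (truncation, padding, token insertion) can be sketched as follows. This is a generic illustration, not the listed project's code: the function `format_sequence` and the special token ids `bos_id=1`, `eos_id=2`, `pad_id=0` are assumed values, and real models define their own.

```python
def format_sequence(token_ids, max_len, bos_id=1, eos_id=2, pad_id=0):
    """Fit a list of token ids to exactly max_len positions.

    Hypothetical helper; the special token ids are placeholder assumptions.
    Returns the formatted ids and a mask marking real (1) vs padded (0) slots.
    """
    if max_len < 2:
        raise ValueError("max_len must leave room for the two special tokens")
    # Truncation: keep only as many tokens as fit between the special tokens.
    body = list(token_ids[: max_len - 2])
    # Token insertion: wrap the body in begin/end markers.
    seq = [bos_id] + body + [eos_id]
    # Padding: fill the remaining positions and record them in the mask.
    pad_count = max_len - len(seq)
    attention_mask = [1] * len(seq) + [0] * pad_count
    return seq + [pad_id] * pad_count, attention_mask


print(format_sequence([5, 6, 7], max_len=6))
print(format_sequence([5, 6, 7, 8, 9], max_len=4))
```

The first call pads a short input; the second truncates a long one so that the end marker still fits.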
This project is a cross-platform machine learning inference engine designed to execute pre-trained models across diverse operating systems and hardware environments. It functions as a standardized execution framework that manages the entire lifecycle of model inference, from loading and graph optimization to hardware-accelerated execution and generative sequence management. The runtime distinguishes itself through a highly modular architecture that decouples model logic from hardware-specific kernels. By utilizing an execution provider abstraction, it enables developers to offload computation
Transforms raw inputs like text or images into tensor formats required by models using integrated operators.
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Stores transformed data to skip the preprocessing stage during repeated prediction calls.
CatBoost is a gradient boosting machine learning library used to train decision tree ensembles for regression, classification, and ranking tasks. It functions as a high-performance framework that provides a categorical data processor for transforming non-numeric features, a distributed trainer for large-scale datasets, and GPU acceleration to speed up model construction. The library distinguishes itself through native handling of categorical data and text features, removing the need for manual encoding. It includes a specialized model interpretability tool that leverages SHAP values and featu
Uses specialized categorical data types during input preparation to speed up the preprocessing of categorical features.
This project is a manifold learning and non-linear dimensionality reduction library used to project high-dimensional data into lower-dimensional spaces while preserving topological structure. It functions as a parametric embedding framework and a topological data visualization library for identifying clusters and patterns within complex datasets. The library distinguishes itself through parametric neural mapping, which uses neural networks to learn functional mappings that allow for out-of-sample projections and the reconstruction of original data. It supports supervised and semi-supervised d
Reduces high-dimensional data to a lower-dimensional manifold to improve density-based clustering performance.
This project is a machine learning educational resource and implementation guide for Python. It provides a collection of executable code and notebooks that demonstrate predictive modeling, data analysis workflows, and the implementation of various machine learning algorithms. The repository features practical examples of classification, regression, and clustering tasks using Scikit-Learn, alongside tutorials for building and training deep learning architectures with TensorFlow. These include implementations of convolutional and recurrent networks. The content covers a broad range of capabili
Provides workflows for cleaning, scaling, and encoding raw datasets to prepare them for machine learning.
This is a comprehensive educational curriculum designed to teach machine learning fundamentals using the Python programming language. It provides a structured course covering the implementation and theory of supervised learning, unsupervised learning, and deep learning. The curriculum is delivered through interactive notebooks that combine executable code with technical tutorials. It includes dedicated guides for building neural network architectures, implementing classification and regression models, and utilizing clustering techniques for pattern discovery in unlabeled data. The materials
Provides a comprehensive workflow for cleaning, transforming, and encoding data to prepare it for machine learning models.
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Anomalib applies transformations to raw images before passing them to the anomaly detection model.
Orange3 is a visual data mining platform that provides an interactive canvas for building data analysis workflows without writing code. At its core, it offers a widget-based visual programming environment where users connect configurable components to perform data preprocessing, machine learning model training, statistical evaluation, and interactive visualization. The platform is built on NumPy-backed data tables with domain descriptors that define variable names, types, and roles, and includes a lazy SQL query proxy for working with database tables without loading all data into memory. The
Applies transformations such as normalization, imputation, or feature selection to prepare data for modeling.
This project is a comprehensive educational resource and course for building neural networks with PyTorch. It covers the fundamentals of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing (NLP) workflows involving transformers, recurrent networks, and generative models. It also includes a reference for generative AI, focusing specifically on image synthesis with diffusion models and adversarial networks. The material extends to model optimization and deployment pipelines. It covers techniques for reducing model size and increasing inference speed through quantization and exporting models to formats such as ONNX and TensorRT. Other capability areas include data engineering for parallel loading, model evaluation with custom metrics, and deployment of open-source large language models (LLMs). The project is delivered primarily as a series of Jupyter Notebooks.
Provides tools for cleaning, transforming, and encoding raw data to prepare it for model consumption.
This project is a TensorFlow meta-learning framework and research toolkit designed to implement and train learned optimizers. It provides a library of tools for developing neural networks that learn how to optimize other models, replacing traditional gradient-based optimization algorithms. The framework includes a problem ensemble manager that allows several distinct optimization tasks to be combined into a single weighted loss function for simultaneous training. It uses a model factory for network instantiation and supports defining custom objective functions and loss graphs as targets for the learning algorithms. The toolkit covers a wide range of capabilities, including gradient-based meta-optimization, model benchmarking, and the execution of training loops with configurable unroll lengths. It also provides utilities for gradient preprocessing, serialized state persistence, and reporting of experiment statistics such as mean final error and epoch duration.
Transforms input gradients using logarithmic scaling and sign extraction to prepare them for model consumption.
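A rough sketch of gradient preprocessing by logarithmic scaling and sign extraction follows. It uses one formulation common in the learned-optimizer literature: large gradients become a (scaled log-magnitude, sign) pair and very small ones are rescaled linearly. Whether the listed project uses exactly this formula is an assumption; the function name `preprocess_gradient` and the default `p=10.0` are illustrative choices.

```python
import math


def preprocess_gradient(g, p=10.0):
    """Map a scalar gradient to a two-component feature.

    Assumed formulation, for illustration only:
      |g| >= exp(-p): (log|g| / p, sign(g))
      otherwise:      (-1, exp(p) * g)
    """
    if abs(g) >= math.exp(-p):
        # Log scaling compresses magnitudes; the sign travels separately.
        return (math.log(abs(g)) / p, math.copysign(1.0, g))
    # Near zero the log would diverge, so fall back to a linear rescale.
    return (-1.0, math.exp(p) * g)


for g in (1.0, -1.0, 1e-3, 0.0):
    print(g, preprocess_gradient(g))
```

Splitting magnitude and sign this way keeps both components in a small numeric range even when raw gradients span many orders of magnitude, which is the usual motivation for feeding them to a neural optimizer in this form.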