6 repositories
Tools for transforming and normalizing raw information into structured formats optimized for machine learning models.
Explore 6 GitHub repositories in Data & Databases · Feature Engineering Tools.
Scikit-learn is a machine learning library for predictive data analysis that provides a collection of algorithms for supervised and unsupervised learning. It functions as a comprehensive toolkit for data preprocessing, dimensionality reduction, and model selection, allowing users to classify data objects, predict continuous values, and cluster similar items based on historical patterns. The project is defined by a unified interface design where objects either learn from data, transform data, or chain these operations into sequential workflows. To ensure performance on large or high-dimensiona
Transforms raw information into structured formats optimized for analysis and machine learning model performance.
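The unified interface described in the scikit-learn entry (objects that learn from data, transform data, or chain both into a sequential workflow) can be illustrated with a minimal sketch. The toy data and the step names "scale" and "classify" are made up for illustration; `Pipeline`, `StandardScaler`, and `LogisticRegression` are existing scikit-learn classes.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, invented for this example: two numeric features and a binary label.
X_train = [
    [1.0, 200.0], [2.0, 180.0], [3.0, 220.0],
    [8.0, 900.0], [9.0, 950.0], [10.0, 880.0],
]
y_train = [0, 0, 0, 1, 1, 1]

# A transformer (learns mean and variance, then rescales) chained with an
# estimator (learns a decision boundary). The step names are arbitrary labels.
pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])

# One fit call runs fit_transform on the scaler, then fit on the classifier.
pipeline.fit(X_train, y_train)

# predict applies the already-fitted scaler before the classifier sees the rows.
predictions = pipeline.predict([[2.5, 210.0], [9.5, 920.0]])
print(list(predictions))
```

Because the scaler sits inside the pipeline, the same preprocessing is applied at training and prediction time, which is the main reason to chain the steps rather than call them separately.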
This project is a Python-based framework that functions as a generative AI agent for programmatic data analysis. It enables users to interact with structured data sources through natural language prompts, translating these requests into executable code to perform analysis, data cleaning, and visualization. By maintaining conversational context across multi-turn interactions, the system allows for iterative exploration and the building of complex data narratives. The framework distinguishes itself through a robust semantic layer and secure execution model. It maps raw datasets to descriptive m
Provides tools for transforming and normalizing raw information into structured formats optimized for machine learning models.
h2oGPT is a self-hosted platform designed for running large language models and executing retrieval-augmented generation workflows locally. It provides a comprehensive web interface that allows users to index private document collections into searchable databases, enabling context-aware question answering and summarization without exposing sensitive data to external services. The platform distinguishes itself by offering a modular architecture that supports both local model execution and connections to external inference servers. It facilitates the development of autonomous agents capable of
Automates the identification and calculation of data features to enhance predictive model performance.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Provides a specialized SDK for transforming raw data into formats optimized for machine learning and vector storage.
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Provides high-speed feature transformation and incremental training tools to prepare massive datasets for machine learning.
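The lazy evaluation idea in the Vaex entry, where transformations are deferred until a result is required, can be sketched in plain Python. This is an illustration of the general technique only: `LazyColumn` and `read_values` are hypothetical names, not part of Vaex's API, and a small in-memory range stands in for rows streamed from disk.

```python
from typing import Callable, Iterable, Iterator, List


class LazyColumn:
    """Hypothetical lazily evaluated column; not Vaex's actual API."""

    def __init__(self, source: Callable[[], Iterable[float]]) -> None:
        # source returns a fresh iterable over the raw values on every call.
        self._source = source
        self._steps: List[Callable[[float], float]] = []

    def map(self, fn: Callable[[float], float]) -> "LazyColumn":
        # Only record the transformation; no data is read here.
        derived = LazyColumn(self._source)
        derived._steps = self._steps + [fn]
        return derived

    def __iter__(self) -> Iterator[float]:
        # Values are read and transformed one at a time, so the full
        # column never has to be held in memory.
        for value in self._source():
            for step in self._steps:
                value = step(value)
            yield value

    def mean(self) -> float:
        # An aggregation is the point at which work actually happens.
        total, count = 0.0, 0
        for value in self:
            total += value
            count += 1
        if count == 0:
            raise ValueError("mean of an empty column")
        return total / count


reads: List[str] = []


def read_values() -> Iterable[float]:
    # Stand-in for streaming rows from a memory-mapped file.
    reads.append("read")
    return range(1, 6)


raw = LazyColumn(read_values)
scaled = raw.map(lambda v: v * 2.0).map(lambda v: v + 1.0)
reads_before_mean = len(reads)  # still 0: nothing has been evaluated yet
result = scaled.mean()          # triggers a single pass over the data
print(reads_before_mean, result)
```

Defining `scaled` costs nothing. The source is read once, when `mean()` asks for a result, and values flow through both recorded transformations in that single pass.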
Aerosolve is a machine learning framework designed for training and deploying interpretable models. It serves as a feature engineering tool and model trainer that uses sparse feature modeling to simplify weight debugging and speed up data iteration. The system includes a specialized transformation language (DSL) for converting raw data into model-ready representations. It also provides capabilities for analyzing visual content by mapping images into dense, high-dimensional vector spaces to classify and organize data by style or content. The framework allows human-centered training by injecting prior beliefs and specified weights into the model's learning process. For deployment, it uses a simple inference runtime to run lightweight predictions and a shared-context scoring mechanism to process multiple items in a single operation.
Provides tools to transform and normalize raw information into structured formats optimized for machine learning models.