30 open-source projects similar to featuretools/featuretools, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Featuretools alternative.
Featuretools is an automated feature engineering library and data transformation framework written in Python. It automatically generates machine learning feature vectors from multi-table datasets by applying synthesis patterns to relational and timestamped data. The system functions as a distributed feature synthesis engine, allowing the process of creating feature vectors to scale across multiple cores or clusters to handle large-scale datasets. The library supports the synthesis of multi-table datasets, time series feature generation, and the creation of custom machine learning primitives
Modin is a distributed dataframe library and parallel data processing engine designed to handle large datasets that exceed system memory. It functions as a distributed computing framework that parallelizes data manipulation tasks across multiple CPU cores or clusters to increase throughput and avoid memory errors. The project mirrors the Pandas API, allowing for the distribution of data workflows without changing core code logic. It utilizes a pluggable backend interface, which enables users to switch between different distributed execution engines to optimize performance based on available h
This is an interactive notebook-based course that teaches machine learning from Python fundamentals through deep learning and natural language processing. It uses real datasets and multiple frameworks within a structured, hands-on curriculum that combines concise explanations with executable code cells, built-in datasets, and embedded exercise checkpoints. Learning progresses through data preparation and exploration, classical machine learning workflows, computer vision with convolutional neural networks, and natural language processing with deep learning, all delivered as a cohesive progressi
This project is a structured learning curriculum and technical reference for mastering deep learning with TensorFlow. It provides a comprehensive guide for building, training, and deploying neural networks, combining theoretical fundamentals with practical implementation examples. The repository distinguishes itself by covering the end-to-end machine learning workflow, from low-level tensor mathematics and linear algebra to the creation of complex model architectures. It includes specific guidance on developing data pipelines for diverse data types, such as images, text, and time-series seque
Gensim is an unsupervised natural language processing toolkit designed for topic modeling, word embedding training, and the processing of large-scale text corpora. It provides a framework for discovering latent themes and semantic structures in text without the need for labeled data. The toolkit is distinguished by its ability to handle datasets that exceed system memory through iterator-based data streaming from disk. It also supports distributed model training, allowing complex modeling tasks to be executed across computer clusters. The library covers a broad range of analysis capabilities
tsfresh is an automated feature engineering tool and library designed to extract statistical characteristics from raw time series data. It transforms sequential data into tabular datasets, converting time series into a flat format where each row represents a unique entity and columns represent extracted features. The project distinguishes itself through a parallel data processing framework that distributes heavy computational workloads across multiple CPU cores. It also implements hypothesis-based feature selection to identify the most predictive characteristics and filter out irrelevant ones
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
This project is a comprehensive pandas data analysis tutorial and instructional guide designed for learning data manipulation and analysis. It serves as a tabular data processing guide and a manual for time series analysis, providing a structured approach to cleaning, merging, and transforming datasets. The repository functions as a data feature engineering course, providing tutorials on constructing and selecting dataset features to improve machine learning model performance. It also includes a vectorized data operations guide for performing element-wise mathematical computations and matrix
Vaex is a high-performance Apache Arrow DataFrame library and out-of-core data processing engine designed to handle billion-row tabular datasets in Python. It functions as a lazy evaluation framework that defers computations and transformations until results are required, enabling the processing of datasets that exceed available system RAM by mapping files directly from disk. The project distinguishes itself as a tool for big data visualization and exploration, specifically integrated for use within interactive notebooks. It provides specialized capabilities for machine learning feature engin
Apache Spark is a unified distributed data processing engine designed for large-scale data analysis and computation graphs. It functions as a distributed machine learning framework, a graph processing system, a real-time stream processor, and a SQL analytics engine. The system enables the execution of distributed SQL querying, large-scale graph analysis, and real-time stream analytics across clusters of machines. It also provides a scalable environment for implementing machine learning algorithms and predictive model development on massive datasets. The engine incorporates relational query e
This project is a comprehensive knowledge base and study resource designed for mastering technical interviews. It provides structured guides, roadmaps, and curricula focused on data structures, algorithms, system design, and frontend engineering to help candidates prepare for software engineering screenings. The repository distinguishes itself by offering a holistic approach to professional advancement. Beyond technical drills, it includes a career development handbook covering resume optimization, salary benchmarking, and strategic negotiation coaching. It also provides detailed methodologie
This project is a comprehensive machine learning interview guide and technical study resource designed for individuals preparing for machine learning and AI engineering roles. It provides a collection of materials and practice problems covering core algorithms, theoretical fundamentals, and the implementation of neural network architectures. The resource serves as a technical reference for generative AI development, focusing on the design and optimization of large language models and diffusion systems. It includes frameworks for system design, covering the architecture of production machine l
Hadoop is a big data infrastructure suite and distributed data processing framework designed to store and process massive datasets across clusters of computers. It consists of a distributed storage system for managing large files across multiple nodes and a parallel computing engine for processing data across a distributed cluster. The framework implements a distributed file system to ensure fault tolerance and high throughput, paired with a programming model that processes large datasets in parallel. It manages the underlying hardware and software environment required for distributed big dat
This project is a comprehensive instructional resource and course for building neural networks using PyTorch. It covers the fundamental building blocks of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving transformers, recurrent networks, and gen
The TensorFlow Cookbook is a collection of code examples and recipes for building, training, and deploying machine learning models using TensorFlow. It covers the full model lifecycle, from constructing neural networks and training them with configurable parameters to packaging trained models for production deployment with unit tests and multi-device support. The project also integrates TensorBoard for logging and visualizing computational graphs, scalar summaries, and histograms during training. The cookbook demonstrates a wide range of machine learning techniques, including convolutional ne
This project is a machine learning implementation library featuring a collection of code examples that implement supervised, unsupervised, and reinforcement learning algorithms from scratch. It provides a comprehensive set of toolkits for core machine learning components, including a natural language processing toolkit, a reinforcement learning framework, and suites for data dimensionality reduction and pattern mining. The library includes specialized implementations for reinforcement learning, such as Q-Learning, Deep Q-Networks, and Actor-Critic agents. The natural language processing capab
This repository is a comprehensive collection of instructional guides and practical examples for Python development, focusing on machine learning, data science, and web scraping. It provides implementations for neural networks, reinforcement learning algorithms, and deep learning architectures using PyTorch, alongside detailed manuals for scientific computing and data visualization. The project distinguishes itself by offering specialized tutorials on concurrent programming to optimize CPU performance and guides for setting up Linux development environments. It covers the implementation of ad
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
LightGBM is a high-performance machine learning framework designed for constructing gradient-boosted decision tree ensembles. It provides a platform for training classification, regression, and ranking models, with a focus on memory efficiency and large-scale distributed computing. The framework distinguishes itself through specialized algorithmic strategies, including leaf-wise tree growth and histogram-based decision learning, which prioritize convergence speed. It optimizes memory usage by bundling mutually exclusive features and employs gradient-based sampling to reduce training complexit
This project is a structured educational curriculum designed to teach the fundamentals of building and training deep learning models. It provides a comprehensive guide for implementing neural networks using high-level machine learning frameworks and the Python programming language, focusing on practical, hands-on exercises for beginners. The tutorial distinguishes itself by covering the full lifecycle of model development, from initial construction to production-ready optimization. It includes specific modules on refining model performance through weight quantization and addressing data bias
Fastai is a high-level deep learning library built on PyTorch that provides a unified interface for managing the entire machine learning lifecycle. It functions as a comprehensive training toolkit, abstracting hardware management and automating complex training loops to simplify the construction and execution of neural network models. The framework is distinguished by its notebook-centric development environment and a type-dispatching data pipeline that automatically applies transformations based on input data formats. It emphasizes transfer learning through discriminative layer-wise optimiza
cuml is a GPU-accelerated machine learning library and framework that uses CUDA to accelerate tabular data preprocessing and model execution. It provides a suite of tools for training and deploying classification, regression, and clustering models on NVIDIA GPUs and GPU clusters. The library is designed for scalability, offering a distributed GPU machine learning environment that can spread computation and data across multiple hardware accelerators and nodes to handle datasets exceeding single-device memory. It mirrors standard estimator interfaces to allow the replacement of CPU-based models
A set of tools for creating and testing machine learning features, with a scikit-learn compatible API
open-source feature selection repository in python
scikit-image is a Python image processing library and scientific image analysis toolkit. It provides a framework for digital image processing and computer vision, utilizing numerical arrays for pixel-level manipulations. The library enables the quantification of image properties and the detection of visual features, such as edges and blobs. It includes tools for image segmentation and the extraction of textures and patterns to characterize objects within visual data. Capabilities cover image manipulation through color space conversion, geometric transformations, and digital restoration. It a
A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.
Data search & enrichment library for Machine Learning → Easily find and add relevant features to your ML & AI pipeline from hundreds of public and premium external data sources, including open & commercial LLMs
This project is an educational platform and research toolkit designed to teach deep learning through a combination of mathematical theory, visual diagrams, and executable code. It provides a comprehensive environment for building, training, and evaluating neural networks, grounding complex concepts in interactive computational notebooks that allow for hands-on experimentation. The framework distinguishes itself by interleaving theoretical foundations—including linear algebra, calculus, and probability—with practical implementations across multiple industry-standard libraries. It supports flex
This is a Python automated machine learning framework designed to automate the design and optimization of machine learning pipelines. It functions as a genetic programming pipeline optimizer and an automated feature selection tool, using evolutionary search to discover the most effective sequences of data processing and model steps. The project focuses on multi-objective optimization to balance competing performance metrics simultaneously. It employs a genetic selection process to identify impactful variables and remove noise from raw datasets, ensuring the resulting machine learning solution
This project is a structured educational program and machine learning engineering course. It provides a comprehensive curriculum and learning path focused on data science, the development of predictive models, and the operational aspects of MLOps. The instructional material covers the full machine learning lifecycle, moving from basic data engineering to production deployment. This includes guides on wrapping models in APIs, utilizing container-based packaging, and implementing serverless architectures to host models in cloud environments. The program encompasses technical training in predic