Machine Learning

This project is a collection of supervised and unsupervised machine learning algorithms implemented from scratch using Python. It serves as an educational resource for studying model training, parameter optimization, and the implementation of core predictive models.

The library provides a variety of supervised learning tools, including linear and logistic regression, decision trees, and support vector machines. It also features unsupervised learning capabilities for discovering patterns in unlabeled datasets through clustering algorithms.

Broad capability areas include ensemble learning through bagging and boosting, a text classification workflow with support for Chinese text segmentation, and comprehensive model performance evaluation through error analysis and the visualization of decision boundaries. The project also covers data preprocessing tasks such as feature normalization, vectorization, and the parsing of tabular data.

Features

Machine Learning Implementations - Provides from-scratch Python implementations of core supervised learning algorithms like linear regression and SVMs.

Categorical Classifiers - Provides multiple algorithms like Naive Bayes and SVMs to classify categorical data.

Classification Models - Implements multi-class classification using support vector machines to assign each sample to one of several classes.

Clustering Algorithms - Implements clustering algorithms to discover patterns and group similar unlabeled data points.

Decision Trees - Implements recursive-partitioning decision trees using information gain and squared error minimization.

Regression Trees - Implements decision trees for predicting continuous numerical outcomes by minimizing squared error.

Ensemble Learning - Implements ensemble methods including bagging and boosting to improve predictive accuracy.

Bagging Ensembles - Implements bagging ensembles to reduce model variance and overfitting through bootstrap sampling.

K-Means Clustering - Implements k-means clustering to group unlabeled data points by minimizing distance between samples and centroids.
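The k-means loop alternates between assigning points to their nearest centroid and recomputing each centroid as the mean of its members. The sketch below is illustrative only: the function name `k_means` and the naive "first k points" initialization are assumptions, not the project's actual code, which may initialize centroids differently.

```python
from math import dist

def k_means(points, k, max_iter=100):
    """Minimal k-means sketch; returns (centroids, assignment)."""
    # Assumption: naive initialization from the first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for each point.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment
```

Random or k-means++ style initialization would usually be preferred in practice; the deterministic start here just keeps the example reproducible.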

Clustering Algorithms - Implements unsupervised learning algorithms like K-Means to discover patterns in unlabeled data.

K-Nearest Neighbor Classifiers - Implements a k-nearest neighbors classifier to predict data categories based on majority vote of the closest samples.
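As a rough illustration of the majority-vote idea, the following sketch ranks training samples by Euclidean distance and returns the most common label among the k closest. The name `knn_classify` and the choice of Euclidean distance are assumptions made for this example.

```python
from collections import Counter
from math import dist

def knn_classify(query, samples, labels, k=3):
    """Predict a label for `query` by majority vote of its k nearest samples."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(samples)), key=lambda i: dist(query, samples[i]))
    # Count the labels of the k closest samples and take the most frequent.
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```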

Linear and Logistic Regression - Implements linear regression for continuous outcomes and logistic regression for predicting binary outcomes using a sigmoid function.

Linear Regression - Implements linear regression to calculate optimal coefficients by minimizing squared errors.

Linear Regression Models - Implements linear regression models to predict continuous numerical outcomes.

Logistic Regression Models - Implements logistic regression models with support for regularization and class weighting.

Recursive Partitioning Trees - Implements regression trees that split data into recursive partitions to predict continuous values.

Boosting Algorithms - Implements boosting algorithms like AdaBoost that iteratively train models by re-weighting difficult samples.

Binary Classification Training - Implements binary classification using a logistic function to map input features to a binary outcome probability.

Naive Bayes Classifiers - Predicts categories using probabilistic models based on the conditional independence of features.
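One way to picture the conditional-independence assumption is a multinomial Naive Bayes over tokenized documents, where per-word log-probabilities are simply added. The sketch below is hypothetical: the function names, the model tuple, and the add-one (Laplace) smoothing are assumptions chosen for the example rather than details taken from the project.

```python
from collections import Counter
from math import log

def train_naive_bayes(documents, labels):
    """Collect the vocabulary, per-class document counts, and per-class word counts."""
    vocabulary = {word for doc in documents for word in doc}
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)
    return vocabulary, class_counts, word_counts

def classify_naive_bayes(model, document):
    """Return the class with the highest log-posterior for a tokenized document."""
    vocabulary, class_counts, word_counts = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in document:
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing avoids log(0) for words unseen in this class.
                score += log((word_counts[label][word] + 1)
                             / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.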

SMO Algorithms - Implements the Sequential Minimal Optimization algorithm to train support vector machines.

Supervised Learning - Provides a variety of predictive models for labeled datasets, including regression and classification.

Support Vector Machines - Implements support vector machines to separate data classes by finding the maximum-margin hyperplane.

Text Classifiers - Implements text classifiers to predict document categories by calculating class probabilities from trained models.

Unsupervised Learning - Implements unsupervised learning capabilities for discovering hidden patterns in unlabeled datasets.

Machine Learning Educational Resources - Serves as an educational resource with practical implementations of core machine learning algorithms.

Ordinary Least Squares - Implements ordinary least squares to find the best-fitting line between input features and target values.
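For a single input feature, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x. The sketch below shows that special case; `fit_line` is a hypothetical name, and the project itself may well solve the general multi-feature problem with matrix operations instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # undefined if all x values are identical
    intercept = mean_y - slope * mean_x
    return slope, intercept
```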

Regression Trees - Implements regression trees to estimate numerical outcomes by partitioning data into recursive segments.

Model Evaluation - Provides tools for measuring model accuracy through error analysis and decision boundary visualization.

Automated Feature Selection Tools - Implements feature selection by calculating information gain and Shannon entropy to identify discriminative attributes.

Classifier Accuracy Metrics - Provides tools to measure classifier accuracy by calculating the error rate, that is, the percentage of incorrect predictions.

Dataset Distribution Analysis - Provides visual analysis of dataset distributions to evaluate data separability.

Decision Boundary Visualizations - Plots the separating boundaries between classes to evaluate how a classifier partitions feature space.

Decision Stumps - Implements single-level decision stumps to serve as weak learners for ensemble algorithms.
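A decision stump tests one feature against one threshold. A minimal sketch, assuming labels of -1 and +1 (the usual AdaBoost convention) and a hypothetical function name and signature:

```python
def stump_classify(samples, feature_index, threshold, inequality="lt"):
    """Label each sample -1 or +1 by thresholding a single feature."""
    predictions = []
    for sample in samples:
        value = sample[feature_index]
        if inequality == "lt":
            is_negative = value <= threshold  # "lt": values at or below the threshold are -1
        else:
            is_negative = value > threshold   # otherwise values above the threshold are -1
        predictions.append(-1 if is_negative else 1)
    return predictions
```

A boosting routine would typically search over features, thresholds, and both inequality directions to find the stump with the lowest weighted error.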

Pruning Techniques - Implements decision tree pruning to reduce model complexity and prevent overfitting.

Tree Visualizers - Generates visual representations of the paths and splitting logic used by decision tree models.

Feature Scale Normalization - Scales numeric features by mean and variance to ensure stable model convergence.
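One common reading of scaling "by mean and variance" is z-score standardization: subtract the column mean and divide by the standard deviation. The sketch below assumes that interpretation and uses the population variance; the function name is hypothetical.

```python
from math import sqrt

def standardize(column):
    """Rescale a list of numbers to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    variance = sum((v - mean) ** 2 for v in column) / n  # population variance
    std = sqrt(variance)
    if std == 0:
        return [0.0] * n  # constant column: nothing to scale
    return [(v - mean) / std for v in column]
```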

Iterative Parameter Optimizations - Implements iterative parameter optimization using gradient ascent to maximize an objective function such as the log-likelihood.

Kernel-Based Feature Mapping - Provides kernel-based feature mapping to project data into high-dimensional spaces for non-linear classification.

Regularization Techniques - Implements Ridge and stepwise regression to reduce model complexity and prevent overfitting.

Text Classification - Implements a text classification workflow utilizing Naive Bayes and feature vectorization.

Model Evaluation Metrics - Implements model evaluation metrics including confusion matrices, precision, recall, and ROC curves.
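Precision and recall both derive from the four cells of a binary confusion matrix. The following sketch counts those cells directly; the function name and the returned dictionary layout are assumptions for illustration.

```python
def binary_metrics(actual, predicted, positive=1):
    """Return confusion-matrix counts plus precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}
```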

Model Ensembling - Implements model ensembling techniques to improve predictive accuracy and robustness.

AdaBoost Implementations - Implements AdaBoost to improve classifier performance by iteratively training weak learners.

Text Tokenization - Segments raw text strings into lowercase word lists by removing non-alphanumeric characters.
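A tokenizer of this kind can be sketched with a single regular-expression split. The function name is hypothetical, and the pattern is an assumption: `\W+` treats underscores as word characters, so it is only an approximation of "non-alphanumeric".

```python
import re

def tokenize(text):
    """Split text on runs of non-word characters and lowercase the pieces."""
    tokens = re.split(r"\W+", text)
    # Drop the empty strings produced at the start or end of the split.
    return [token.lower() for token in tokens if token]
```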

Gradient Ascent Algorithms - Implements gradient ascent to optimize model parameters by maximizing the objective function (for example, the log-likelihood in logistic regression).
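For logistic regression, the quantity being maximized is the log-likelihood, whose gradient with respect to the weights is the sum of (label - predicted probability) times the feature vector. A minimal batch sketch follows; the names, learning rate, and iteration count are illustrative assumptions.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def gradient_ascent(features, labels, alpha=0.1, iterations=500):
    """Fit logistic-regression weights by batch gradient ascent."""
    weights = [0.0] * len(features[0])
    for _ in range(iterations):
        # Gradient of the log-likelihood: sum over samples of (y - p) * x.
        gradient = [0.0] * len(weights)
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += (y - p) * xi
        # Step uphill, in the direction that increases the log-likelihood.
        weights = [w + alpha * g for w, g in zip(weights, gradient)]
    return weights
```

Each feature vector here is assumed to start with a constant 1.0 so that the first weight acts as the intercept. A stochastic variant would update the weights after each sample instead of after a full pass.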

Weighted Regression - Performs locally weighted regression using Gaussian kernels to reduce underfitting.
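Locally weighted regression fits a separate weighted least-squares line for every query point, giving nearby training samples more influence through a Gaussian kernel. The single-feature sketch below is an assumption-laden illustration: the name `lwlr_predict` and the bandwidth parameter `k` are hypothetical.

```python
from math import exp

def lwlr_predict(x0, xs, ys, k=1.0):
    """Predict y at x0 from a line fitted with Gaussian weights centred on x0."""
    # Gaussian kernel: weight decays with squared distance from the query point.
    weights = [exp(-((x - x0) ** 2) / (2.0 * k ** 2)) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    # Weighted least-squares slope about the weighted means.
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return mean_y + slope * (x0 - mean_x)
```

A smaller `k` makes the fit more local and can overfit; a very small value can also drive the weights toward zero and make the division unstable.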

Regression Scoring Evaluation - Evaluates regression performance by computing the total squared difference between real and predicted values.

Stochastic Gradient Ascent - Implements stochastic gradient ascent to reduce computational complexity during model optimization.

Text Feature Extraction - Transforms unstructured text into structured numerical features based on word frequency ranking.

Vocabulary Generators - Implements vocabulary building to create master feature lists for text vectorization.
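A vocabulary can be built as the union of all words across the tokenized documents, after which each document maps to a fixed-length vector. The sketch below uses a set-of-words (presence or absence) encoding; both function names are hypothetical.

```python
def build_vocabulary(documents):
    """Return a sorted list of every distinct word in the tokenized documents."""
    vocabulary = set()
    for document in documents:
        vocabulary |= set(document)
    return sorted(vocabulary)

def to_vector(vocabulary, document):
    """Encode a document as 1/0 flags, one per vocabulary word."""
    present = set(document)
    # Words outside the vocabulary are ignored.
    return [1 if word in present else 0 for word in vocabulary]
```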

Feature Vectorizations - Provides symmetric-matrix feature vectorization for transforming raw text and images into numerical formats.

Tabular Data Preprocessing - Converts raw text files into feature matrices and label vectors for use in classifiers.

Chinese Language Segmenters - Processes Chinese sentences into distinct words to handle the absence of whitespace during tokenization.

Entropy Calculators - Measures dataset uncertainty by calculating the empirical entropy of class label distributions.
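Empirical entropy is computed from the relative frequency of each class label: H = -sum(p * log2(p)). A small sketch, with a hypothetical function name:

```python
from collections import Counter
from math import log2

def shannon_entropy(labels):
    """Empirical entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Each class contributes -p * log2(p), where p is its relative frequency.
    return -sum((n / total) * log2(n / total) for n in counts.values())
```

Information gain for a candidate split is then the parent's entropy minus the size-weighted average entropy of the resulting subsets.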

Kernel-Based Feature Mapping - Uses kernel functions to project non-linear data into higher-dimensional spaces for linear separation.

Ridge Regression - Implements ridge regression with L2 norm penalties to reduce collinearity and prevent model overfitting.

Feature Interaction Visualizations - Provides tools to visualize feature distributions and relationships using scatter plots.

Jack-Cherish/Machine-Learning
