Data Science From Scratch

This project is a collection of foundational machine learning algorithms and data science tools implemented in Python. It focuses on building the logic of these tools using basic programming primitives rather than relying on specialized libraries.

The implementation covers several core domains, including a linear algebra library for matrix and vector operations, a statistical analysis toolkit for probability and hypothesis testing, and a map-reduce framework for distributed processing. It also includes implementations for natural language processing, graph theory for network analysis, and various machine learning models.

The capabilities extend to building specific models such as feed-forward neural networks, decision trees, and recommender systems. It provides tools for mathematical optimization via gradient descent, the calculation of model performance metrics, and data processing utilities for parsing structured data and extracting content from HTML.

Features

Data Science Algorithms - Implements fundamental machine learning and data science algorithms using basic programming primitives.

Machine Learning Implementations - Provides code-based implementations of core supervised and unsupervised machine learning algorithms using basic programming primitives.

Python Machine Learning Libraries - Implements foundational machine learning algorithms and data science tools from scratch using Python.

Decision Trees - Constructs tree-based models whose splits test data attributes and whose leaves assign a classification.

Feed-Forward Neural Networks - Implements the fundamental architecture of neural networks where data flows in a single direction from input to output.

Linear Regression - Implements statistical methods for modeling relationships between variables using linear equations and gradient descent.

Distributed Data Processing Engines - Implements distributed data processing systems using map-reduce techniques to handle large datasets.

MapReduce Processing Engines - Provides a framework for executing batch computations by mapping and reducing records into aggregated results.
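The map-and-reduce flow described above can be sketched in a single process. This is a minimal illustration, not the project's actual engine: the names map_reduce, wc_mapper, and wc_reducer are hypothetical, and a real distributed engine would spread the mapping and grouping across machines.

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Run mapper over every input, group values by key, then reduce each group."""
    collector = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]

def wc_mapper(document):
    """Emit (word, 1) for each whitespace-separated word (hypothetical example mapper)."""
    for word in document.lower().split():
        yield (word, 1)

def wc_reducer(word, counts):
    """Sum the counts collected for one word."""
    yield (word, sum(counts))

word_counts = dict(map_reduce(["data science", "big data"], wc_mapper, wc_reducer))
```

Keeping the mapper and reducer as plain functions means the same map_reduce driver can be reused for other aggregations by swapping those two arguments.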

Linear Algebra Libraries - Provides foundational matrix and vector operation primitives for mathematical modeling and scientific computing.

Linear Algebra Routines - Implements fundamental linear algebra operations, including matrix and vector calculations required for scientific computing.

Matrix Manipulations - Extracts specific rows and columns or generates identity matrices from custom grids.

Linear Algebra - Executes fundamental vector and matrix calculations, such as dot products, for mathematical modeling.
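As a rough sketch of the matrix helpers listed above, matrices can be represented as lists of row lists. The function names below (shape, get_row, get_column, make_matrix, identity_matrix) are assumptions chosen for illustration and may differ from the project's own.

```python
def shape(A):
    """Return (number of rows, number of columns) of a list-of-lists matrix."""
    num_rows = len(A)
    num_cols = len(A[0]) if A else 0
    return num_rows, num_cols

def get_row(A, i):
    """Return row i of A."""
    return A[i]

def get_column(A, j):
    """Return column j of A as a list."""
    return [row[j] for row in A]

def make_matrix(num_rows, num_cols, entry_fn):
    """Build a matrix whose (i, j) entry is entry_fn(i, j)."""
    return [[entry_fn(i, j) for j in range(num_cols)]
            for i in range(num_rows)]

def identity_matrix(n):
    """Return the n x n identity matrix."""
    return make_matrix(n, n, lambda i, j: 1 if i == j else 0)
```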

Statistical Analysis Libraries - Provides a toolkit for applying probability models and statistical inference to derive insights from datasets.

Statistical Metric Calculators - Computes probability and hypothesis tests to analyze data distributions and trends.

Statistical Analysis Libraries - Provides a toolkit for computing probability distributions, hypothesis tests, and central tendency metrics.

Hypothesis Testing - Provides statistical procedures and tools for executing hypothesis tests to determine the significance of observations.

Gradient Computation - Provides tools for calculating function gradients to support model training and optimization via backpropagation.

Numerical Gradient Approximations - Implements numerical gradient approximations using finite difference methods to estimate function slopes.
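A finite-difference estimate of a gradient can be sketched as follows. This assumes a forward difference with a small step h; the names estimate_gradient and sum_of_squares are illustrative only, and the project may use a different difference scheme.

```python
def estimate_gradient(f, v, h=1e-5):
    """Approximate the gradient of f at v using forward differences."""
    base = f(v)
    gradient = []
    for i in range(len(v)):
        # Nudge only the i-th coordinate by h.
        shifted = [v_j + (h if j == i else 0.0) for j, v_j in enumerate(v)]
        gradient.append((f(shifted) - base) / h)
    return gradient

def sum_of_squares(v):
    """Example function whose exact gradient is 2 * v."""
    return sum(x * x for x in v)

approx = estimate_gradient(sum_of_squares, [3.0, -2.0])
```

The estimate carries an error on the order of h, so it is best used to sanity-check an analytic gradient rather than to replace it.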

Iterative Parameter Optimizations - Provides logic for repeatedly updating model weights using loss functions and gradient descent to fit data.

K-Nearest Neighbor Classifiers - Implements a supervised learning model that assigns classes based on the majority vote of the closest training samples.
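The majority-vote idea can be sketched in a few lines. This is an illustrative version, assuming Euclidean distance and labeled points stored as (point, label) pairs; knn_classify is a hypothetical name, and ties are broken by whichever label is encountered first among the nearest neighbors.

```python
import math
from collections import Counter

def distance(v, w):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def knn_classify(k, labeled_points, new_point):
    """Return the most common label among the k points nearest to new_point."""
    by_distance = sorted(labeled_points,
                         key=lambda pair: distance(pair[0], new_point))
    k_nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

points = [([0, 0], "a"), ([0, 1], "a"),
          ([5, 5], "b"), ([6, 5], "b"), ([5, 6], "b")]
```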

Linear Model Performance Metrics - Measures linear fit quality using squared errors and the R-squared coefficient.
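For a simple linear model y = beta * x + alpha, the squared-error and R-squared calculations might look like the sketch below. The parameter names alpha and beta and all function names are assumptions made for this example.

```python
def mean(xs):
    return sum(xs) / len(xs)

def predict(alpha, beta, x):
    """Prediction of the line with intercept alpha and slope beta."""
    return beta * x + alpha

def sum_of_squared_errors(alpha, beta, xs, ys):
    """Total squared difference between actual and predicted values."""
    return sum((y - predict(alpha, beta, x)) ** 2 for x, y in zip(xs, ys))

def total_sum_of_squares(ys):
    """Total squared variation of ys around their mean."""
    y_bar = mean(ys)
    return sum((y - y_bar) ** 2 for y in ys)

def r_squared(alpha, beta, xs, ys):
    """Fraction of the variation in ys captured by the model."""
    return 1.0 - (sum_of_squared_errors(alpha, beta, xs, ys)
                  / total_sum_of_squares(ys))
```

A model that always predicts the mean of ys scores 0, and a perfect fit scores 1.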

Logistic Regression Models - Implements algorithms for predicting binary outcomes using the sigmoid function and weight optimization.
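The sigmoid at the heart of logistic regression maps any real-valued score to a probability between 0 and 1. A minimal sketch, with hypothetical function names and no guard against overflow for very large negative inputs:

```python
import math

def logistic(x):
    """Sigmoid function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_probability(weights, features):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features))
    return logistic(score)
```

Fitting the weights is then a matter of minimizing the negative log likelihood of the training labels, for example with gradient descent.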

Naive Bayes Classifiers - Implements probabilistic classification models based on statistical feature distributions to categorize text.

Natural Language Processing - Provides foundational techniques for analyzing and processing human language data.

Natural Language Processing Implementations - Provides reference implementations for text analysis, n-gram generation, and text classification.

N-Gram Generators - Implements tools that generate word sequences by sampling from n-gram models built from source text.
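One way to sketch n-gram generation is with bigrams: record which words follow each word, then walk those transitions at random. The names build_bigram_transitions and generate are hypothetical, and passing in an explicit random.Random instance is a choice made here to keep runs reproducible.

```python
import random
from collections import defaultdict

def build_bigram_transitions(words):
    """Map each word to the list of words that follow it in the source."""
    transitions = defaultdict(list)
    for prev, current in zip(words, words[1:]):
        transitions[prev].append(current)
    return transitions

def generate(transitions, start, max_words, rng):
    """Follow random transitions from start until max_words or a dead end."""
    result = [start]
    current = start
    while len(result) < max_words and transitions.get(current):
        current = rng.choice(transitions[current])
        result.append(current)
    return result

transitions = build_bigram_transitions("the cat saw the dog".split())
sentence = generate(transitions, "the", 5, random.Random(0))
```

Because repeated followers appear multiple times in each list, frequent word pairs are sampled proportionally more often.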

Gradient Descent Algorithms - Implements iterative optimization algorithms that update model parameters by moving in the direction of the negative gradient.
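Moving against the gradient can be illustrated with a fixed step size on a function whose gradient is known exactly. This is a sketch under those assumptions; gradient_step, minimize, and the default step size are illustrative choices, not the project's confirmed interface.

```python
def gradient_step(v, gradient, step_size):
    """Move v a small distance in the direction of the negative gradient."""
    return [v_i - step_size * g_i for v_i, g_i in zip(v, gradient)]

def sum_of_squares_gradient(v):
    """Exact gradient of f(v) = sum(v_i ** 2)."""
    return [2 * v_i for v_i in v]

def minimize(v, gradient_fn, step_size=0.1, num_steps=200):
    """Apply a fixed number of gradient steps starting from v."""
    for _ in range(num_steps):
        v = gradient_step(v, gradient_fn(v), step_size)
    return v

minimum = minimize([3.0, -4.0], sum_of_squares_gradient)
```

With this step size each coordinate shrinks by a factor of 0.8 per step, so the result ends up very close to the true minimum at the origin.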

Perceptrons - Implements basic linear classifiers that learn a weight vector to separate two classes of data.
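To illustrate how a perceptron can learn a separating weight vector, the sketch below uses the classic perceptron learning rule on the logical AND function. The training loop, learning rate, and function names are assumptions for this example and are not taken from the project.

```python
def perceptron_output(weights, bias, x):
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * x_i for w, x_i in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(data, num_epochs=50, learning_rate=1.0):
    """Adjust weights and bias whenever a training example is misclassified."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(num_epochs):
        for x, target in data:
            error = target - perceptron_output(weights, bias, x)
            weights = [w + learning_rate * error * x_i
                       for w, x_i in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias

and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(and_data)
```

This rule only converges when the two classes are linearly separable, which AND is; a function such as XOR would not be learnable this way.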

Performance Metrics - Provides tools for calculating essential statistical performance indicators like accuracy, precision, and recall.
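These metrics are all derived from the four cells of a confusion matrix. A minimal sketch, assuming counts of true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn) are already available and that the denominators are nonzero:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def precision(tp, fp, fn, tn):
    """Fraction of positive predictions that were correct."""
    return tp / (tp + fp)

def recall(tp, fp, fn, tn):
    """Fraction of actual positives that were identified."""
    return tp / (tp + fn)

def f1_score(tp, fp, fn, tn):
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)
```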

Popularity-Based Recommendations - Suggests items based on global frequency while filtering out items already owned.

Recommender Systems - Implements algorithms designed to predict user preferences and suggest relevant items based on historical patterns.

User-to-User Similarity - Identifies similar user profiles using cosine similarity to suggest items based on peer preferences.
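Cosine similarity compares the direction of two interest vectors regardless of their length. The sketch below assumes each user is represented as a numeric vector with at least one nonzero entry; the function names are illustrative.

```python
import math

def dot(v, w):
    """Sum of component-wise products."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def cosine_similarity(v, w):
    """Cosine of the angle between v and w (1 = same direction, 0 = orthogonal)."""
    return dot(v, w) / math.sqrt(dot(v, v) * dot(w, w))

# Hypothetical interest vectors: 1 means the user has that interest.
alice = [1, 1, 0, 0]
bob = [1, 1, 0, 0]
carol = [0, 0, 1, 1]
```

Users with the highest similarity to a given user can then supply candidate items that the user does not yet have.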

Data Visualization - Provides tools to render numeric datasets into visual formats like charts and graphs.

Distribution Histograms - Generates frequency histograms to visualize the distribution of numeric data.

Graph Libraries - Implements graph algorithms and data structures for network analysis and centrality calculations.

Standard Deviation Calculators - Computes the standard deviation of a dataset to measure value dispersion.

Central Tendency Measures - Computes mean, median, and mode to identify the center of a distribution.
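The central tendency and dispersion measures above can be sketched with the standard library alone. This version assumes a non-empty list of numbers and uses the sample standard deviation (dividing by n - 1), which may or may not match the project's convention.

```python
import math
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    """Middle value, or the average of the two middle values."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def mode(xs):
    """Return a list, since more than one value can be most common."""
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x for x, count in counts.items() if count == max_count]

def standard_deviation(xs):
    """Sample standard deviation; requires at least two values."""
    x_bar = mean(xs)
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1))
```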

Correlation Matrices - Generates correlation matrices to identify linear relationships between multiple data vectors.

Discrete Event Simulators - Generates outcomes for Bernoulli and Binomial trials to model discrete random events.

Inverse Cumulative Distribution Functions - Finds the value for a given probability in a normal distribution using binary search.
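Because the normal CDF is continuous and strictly increasing, its inverse can be approximated by bisection. The following sketch assumes the CDF is computed with math.erf and searches the interval from -10 to 10 standard deviations; the function names and the tolerance default are illustrative.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability that a normal(mu, sigma) variable is at most x."""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p, mu=0.0, sigma=1.0, tolerance=1e-5):
    """Approximate the z with normal_cdf(z) = p using binary search."""
    if mu != 0 or sigma != 1:
        # Solve for the standard normal, then rescale.
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    low_z, hi_z = -10.0, 10.0
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        if normal_cdf(mid_z) < p:
            low_z = mid_z   # midpoint too low, search above it
        else:
            hi_z = mid_z    # midpoint too high, search below it
    return (low_z + hi_z) / 2
```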

Network Graph Analysis - Provides capabilities to study connections between entities in a network to identify clusters and influence.

Normal Distribution Probability Estimation - Computes the probability that a value falls within a specific range of a Gaussian distribution.

Normal Distribution Bounds - Calculates boundaries that contain a specified probability for a normal distribution.

Shortest Path Algorithms - Implements algorithms to calculate the most efficient path between nodes in a graph.
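For an unweighted graph, breadth-first search finds a path with the fewest edges. The sketch below assumes the graph is stored as a dictionary mapping each node to a list of neighbors; shortest_path is a hypothetical name, and weighted graphs would need a different algorithm such as Dijkstra's.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Return one fewest-edges path from start to goal, or None if unreachable."""
    previous = {start: None}          # also serves as the visited set
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        if node == goal:
            # Walk the predecessor links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                frontier.append(neighbor)
    return None

graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"],
         "d": ["b", "c", "e"], "e": ["d"], "z": []}
```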

Distribution Function Calculators - Implements functions for computing cumulative distribution functions and their inverses for uniform and normal distributions.

P-Value Calculations - Determines the probability of observing extreme values in either direction under a normal distribution.
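A two-sided p-value under a normal distribution doubles the tail probability on whichever side of the mean the observation falls. A minimal sketch, assuming the normal CDF is computed with math.erf and using the hypothetical name two_sided_p_value:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Probability that a normal(mu, sigma) variable is at most x."""
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def two_sided_p_value(x, mu=0.0, sigma=1.0):
    """Probability of a value at least as far from mu as x, in either direction."""
    if x >= mu:
        # x is above the mean, so the tail is everything greater than x.
        return 2 * (1 - normal_cdf(x, mu, sigma))
    # x is below the mean, so the tail is everything less than x.
    return 2 * normal_cdf(x, mu, sigma)
```

For a standard normal, an observation near 1.96 gives a p-value close to the conventional 5% significance threshold.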

Parameter Estimation - Infers population parameters, such as mean and standard deviation, from sample data modeled with binomial distributions.

Quantile and Percentile Calculators - Provides mathematical tools for dividing sorted data into equal groups via quartiles and percentiles.
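A simple quantile can be read directly off the sorted data. This sketch takes the value at index int(p * n) with no interpolation and clamps the index so that p = 1.0 returns the maximum; both choices are assumptions, and other quantile definitions exist.

```python
def quantile(xs, p):
    """Return the value below which roughly a fraction p of the data lies."""
    index = min(int(p * len(xs)), len(xs) - 1)
    return sorted(xs)[index]

def interquartile_range(xs):
    """Difference between the 75th and 25th percentile values."""
    return quantile(xs, 0.75) - quantile(xs, 0.25)
```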

Dot Product Computation - Computes dot products and Euclidean distances to analyze geometric relationships between vectors.

Vector Operations - Calculates element-wise addition, subtraction, and means for numeric vectors.
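With vectors represented as plain lists of numbers, the operations above reduce to short comprehensions. The function names here are illustrative and assume all vectors passed together have the same length.

```python
import math

def add(v, w):
    """Element-wise sum of two vectors."""
    return [v_i + w_i for v_i, w_i in zip(v, w)]

def subtract(v, w):
    """Element-wise difference of two vectors."""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def vector_mean(vectors):
    """Element-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]

def dot(v, w):
    """Sum of component-wise products."""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def distance(v, w):
    """Euclidean distance, computed from the dot product of the difference."""
    diff = subtract(v, w)
    return math.sqrt(dot(diff, diff))
```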

Betweenness Centrality - Implements betweenness centrality to measure node influence via shortest path counts.

Closeness Centrality - Implements closeness centrality to determine node importance based on path lengths.
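Closeness centrality can be sketched as the reciprocal of a node's farness, the sum of its shortest-path distances to every reachable node. The version below assumes an unweighted graph stored as a dictionary of neighbor lists and uses breadth-first search for the distances; the names and the exact normalization are assumptions.

```python
from collections import deque

def distances_from(graph, start):
    """Fewest-edges distance from start to every reachable node."""
    dist = {start: 0}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                frontier.append(neighbor)
    return dist

def closeness_centrality(graph, node):
    """Return 1 / farness, or 0.0 for a node with no reachable neighbors."""
    farness = sum(distances_from(graph, node).values())
    return 1 / farness if farness else 0.0

# A three-node path: a - b - c
path_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
```

In the path graph the middle node scores highest, since it is one step from both ends.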

joelgrus/data-science-from-scratch
