# jakevdp/PythonDataScienceHandbook

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/jakevdp-pythondatasciencehandbook).**

46,802 stars · 18,727 forks · Jupyter Notebook · MIT

## Links

- GitHub: https://github.com/jakevdp/PythonDataScienceHandbook
- Homepage: http://jakevdp.github.io/PythonDataScienceHandbook
- awesome-repositories: https://awesome-repositories.com/repository/jakevdp-pythondatasciencehandbook.md

## Topics

`jupyter-notebook` `matplotlib` `numpy` `pandas` `python` `scikit-learn`

## Description

This repository contains the Python Data Science Handbook as a collection of Jupyter notebooks that combine executable code, rich media visualization, and narrative documentation in a browser-based format. It serves as a comprehensive educational resource for scientific computing and as a setting for iterative data analysis and machine learning prototyping.

The material focuses on high-performance numerical computing, using vectorized array operations on contiguous in-memory array buffers to handle large computations efficiently. It presents a unified estimator interface that standardizes machine learning workflows, so users can build, train, and evaluate predictive models through consistent pipelines. It also covers configuration-driven plot styling that separates aesthetic style definitions from data rendering, which supports publication-quality graphical output.
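
The unified estimator interface described above can be illustrated with a minimal sketch, assuming scikit-learn is installed. The choice of `GaussianNB` and `SVC`, the iris dataset, and the `random_state` value are assumptions made for this illustration; the point is only that two different algorithms are driven through the same `fit` and `predict` calls.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Features matrix X (n_samples, n_features) and target vector y (n_samples,)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

scores = {}
for model in (GaussianNB(), SVC(kernel="rbf")):
    model.fit(X_train, y_train)        # same training call for every estimator
    y_pred = model.predict(X_test)     # same prediction call for every estimator
    scores[type(model).__name__] = accuracy_score(y_test, y_pred)

print(scores)
```

Because both models expose the same methods, swapping one algorithm for another changes a single line rather than the whole workflow.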

Beyond its core modeling capabilities, the project provides an extensive exploratory programming toolkit. This includes dynamic namespace introspection, performance profiling, and interactive debugging tools that let users inspect object metadata and navigate code in real time. The repository is structured as a collection of executable notebooks and technical documentation, designed to facilitate hands-on learning of data science techniques and programming workflows.
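
The profiling tools mentioned above are IPython magics such as `%timeit` and `%prun`, which only work inside an IPython or Jupyter session. As a rough sketch of the same idea in plain Python, the standard-library `timeit` and `cProfile` modules can be used instead. The function `sum_of_squares` is a hypothetical workload invented for this illustration.

```python
import cProfile
import io
import pstats
import timeit


def sum_of_squares(n):
    """Hypothetical workload used only for this illustration."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Rough stand-in for `%timeit sum_of_squares(10_000)`:
# time 100 repeated calls and report the total elapsed seconds.
elapsed = timeit.timeit(lambda: sum_of_squares(10_000), number=100)
print(f"100 calls took {elapsed:.4f} s")

# Rough stand-in for `%prun sum_of_squares(100_000)`:
# record per-function call counts and cumulative times.
profiler = cProfile.Profile()
profiler.enable()
result = sum_of_squares(100_000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The timing numbers depend on the machine, so they are best read as relative comparisons between alternatives rather than as absolute figures.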

## Tags

### Development Tools & Productivity

- [Interactive Data Science Environments](https://awesome-repositories.com/f/development-tools-productivity/interactive-data-science-environments.md) — Combines code execution, rich media visualization, and narrative documentation for iterative data analysis.
- [Interactive Notebooks](https://awesome-repositories.com/f/development-tools-productivity/interactive-notebooks.md) — Combines code execution, rich media visualization, and narrative documentation for iterative analysis.
- [Interactive Shells](https://awesome-repositories.com/f/development-tools-productivity/interactive-shells.md) — Provides an interactive environment for executing code and building programming proficiency. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/01.00-ipython-beyond-normal-python.html))
- [Interactive Data Exploration Tools](https://awesome-repositories.com/f/development-tools-productivity/interactive-data-exploration-tools.md) — Enables real-time code execution for analyzing datasets, visualizing results, and documenting analytical workflows.
- [Exploratory Programming Toolkits](https://awesome-repositories.com/f/development-tools-productivity/exploratory-programming-toolkits.md) — Provides introspection and debugging tools to inspect object metadata, profile performance, and navigate code interactively.
- [Shell Command Runners](https://awesome-repositories.com/f/development-tools-productivity/shell-command-runners.md) — Supports running system-level shell commands directly from the programming environment. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html))
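
The introspection and shell features listed above rely on IPython-specific syntax (`?` after a name, `!` before a shell command) that is not valid in plain Python. A minimal sketch of comparable behaviour with the standard library, assuming nothing beyond `inspect` and `subprocess`, might look like this. The `square` function and the child-process command are hypothetical examples.

```python
import inspect
import subprocess
import sys


def square(x):
    """Return the square of x."""
    return x ** 2


# Rough stand-in for `square?` in IPython: signature plus docstring.
signature = str(inspect.signature(square))
doc = inspect.getdoc(square)
print(f"square{signature}: {doc}")

# Rough stand-in for `lines = !some_command` in IPython:
# run a child process and capture its output as a list of lines.
# The Python interpreter itself is used as the command so the
# sketch behaves the same on every operating system.
completed = subprocess.run(
    [sys.executable, "-c", "print('hello from a child process')"],
    capture_output=True,
    text=True,
    check=True,
)
lines = completed.stdout.splitlines()
print(lines)
```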

### Scientific & Mathematical Computing

- [Vectorized Numerical Computing Frameworks](https://awesome-repositories.com/f/scientific-mathematical-computing/vectorized-numerical-computing-frameworks.md) — Uses compiled routines to perform efficient mathematical operations on large, contiguous array buffers.
- [Vectorized Computation Libraries](https://awesome-repositories.com/f/scientific-mathematical-computing/vectorized-computation-libraries.md) — Uses compiled routines to perform operations on entire data arrays, bypassing interpreted loops.
- [High Performance Computing Frameworks](https://awesome-repositories.com/f/scientific-mathematical-computing/high-performance-computing-frameworks.md) — Optimizes data processing tasks through vectorized operations and memory-efficient structures for large-scale numerical computation.
- [Statistical Analysis Libraries](https://awesome-repositories.com/f/scientific-mathematical-computing/statistical-analysis-libraries.md) — Applies probability models and clustering techniques to uncover hidden structures within complex datasets.
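
To make the vectorization entries above concrete, here is a minimal sketch, assuming NumPy is installed. It computes the same reciprocals twice: once with an interpreted Python loop and once with a single array expression that runs the loop in compiled code. The function name `compute_reciprocals`, the array size, and the random seed are choices made for this illustration.

```python
import numpy as np


def compute_reciprocals(values):
    """Element-by-element loop executed by the Python interpreter."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output


rng = np.random.default_rng(0)
values = rng.integers(1, 10, size=1000)   # integers in [1, 9], so no division by zero

looped = compute_reciprocals(values)
vectorized = 1.0 / values                  # one expression applied to the whole array

print(np.allclose(looped, vectorized))
print(values.flags["C_CONTIGUOUS"])        # the array lives in one contiguous buffer
```

Both paths return the same values; the vectorized form avoids per-element type checks and function dispatch in the interpreter, which is where the speed difference on large arrays comes from.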

### Artificial Intelligence & ML

- [Machine Learning Interfaces](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-interfaces.md) — Standardizes machine learning workflows by enforcing a consistent API across different algorithms.
- [Machine Learning Workflow Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-workflow-libraries.md) — Provides a standardized interface for building, training, and evaluating predictive models through consistent pipelines.
- [Dimensionality Reduction Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/dimensionality-reduction-techniques.md) — Implements unsupervised techniques to simplify high-dimensional data for efficient analysis and visualization. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html))
- [Machine Learning Pipelines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-pipelines.md) — Chains multiple data transformation and modeling steps into a single workflow to ensure consistent execution. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html))
- [Machine Learning Prototyping Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning-prototyping-frameworks.md) — Provides standardized interfaces for building, training, and evaluating predictive models during the prototyping phase.
- [Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training.md) — Implements support vector machine training for binary classification. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.14-image-features.html))
- [Classification Algorithms](https://awesome-repositories.com/f/artificial-intelligence-ml/classification-algorithms.md) — Builds models to predict discrete labels from input data. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html))
- [Ensemble Methods](https://awesome-repositories.com/f/artificial-intelligence-ml/ensemble-methods.md) — Aggregates multiple models to reduce overfitting and improve the overall accuracy of predictive results. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html))
- [Centroid Clustering](https://awesome-repositories.com/f/artificial-intelligence-ml/centroid-clustering.md) — Groups similar data points together using centroid-based algorithms to discover natural segments in the data. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html))
- [Clustering](https://awesome-repositories.com/f/artificial-intelligence-ml/clustering.md) — Uses unsupervised clustering to infer labels on unlabeled datasets. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html))
- [Dimensionality Reduction](https://awesome-repositories.com/f/artificial-intelligence-ml/dimensionality-reduction.md) — Simplifies complex datasets by extracting essential structures while discarding redundant or noisy information. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html))
- [Object Detection](https://awesome-repositories.com/f/artificial-intelligence-ml/object-detection.md) — Applies trained models to identify specific patterns in new images. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.14-image-features.html))
- [Clustering Algorithms](https://awesome-repositories.com/f/artificial-intelligence-ml/clustering-algorithms.md) — Identifies clusters in data by fitting a mixture of multiple probability distributions to the input set. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html))
- [Data Imputation](https://awesome-repositories.com/f/artificial-intelligence-ml/data-imputation.md) — Fills in or handles gaps in datasets to ensure complete information for training and analysis. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html))
- [Data Preprocessing](https://awesome-repositories.com/f/artificial-intelligence-ml/data-preprocessing.md) — Converts categorical data into numerical formats for model input. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html))
- [Data Representation](https://awesome-repositories.com/f/artificial-intelligence-ml/data-representation.md) — Formats data into structured arrays for model compatibility. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html))
- [Dataset Preparation Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-preparation-tools.md) — Acquires a collection of background (negative) examples that do not contain the target feature, for use in model training. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.14-image-features.html))
- [Feature Engineering](https://awesome-repositories.com/f/artificial-intelligence-ml/feature-engineering.md) — Computes descriptive features from combined sample sets. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.14-image-features.html))
- [Image Classification](https://awesome-repositories.com/f/artificial-intelligence-ml/image-classification.md) — Trains models to recognize handwritten digits from image datasets. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html))
- [Kernel Methods](https://awesome-repositories.com/f/artificial-intelligence-ml/kernel-methods.md) — Uses kernel methods to enable linear models to solve nonlinear problems. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html))
- [Model APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/model-apis.md) — Demonstrates the use of a consistent estimator interface for model workflows. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html))
- [Model Diagnostics](https://awesome-repositories.com/f/artificial-intelligence-ml/model-diagnostics.md) — Visualizes model performance relative to training data size to diagnose bias and variance issues. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html))
- [Support Vector Machines](https://awesome-repositories.com/f/artificial-intelligence-ml/support-vector-machines.md) — Trains a support vector classifier on data to establish decision boundaries between different classes. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html))
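
The pipeline, imputation, and preprocessing entries above can be combined in one short sketch, assuming scikit-learn and NumPy are installed. The toy data, the mean imputation strategy, and the degree-2 polynomial features are assumptions chosen for illustration, not a recommended configuration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy feature matrix with missing values (np.nan); values chosen for illustration.
X = np.array([[np.nan, 0, 3],
              [3, 7, 9],
              [3, 5, 2],
              [4, np.nan, 6],
              [8, 8, 1]])
y = np.array([14, 16, -1, 8, -5])

# Each step's output feeds the next: impute -> expand features -> fit regressor.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    PolynomialFeatures(degree=2),
    LinearRegression(),
)

model.fit(X, y)                  # the pipeline behaves like a single estimator
predictions = model.predict(X)

print([name for name, _ in model.steps])
print(predictions.shape)
```

Bundling the steps means the same imputation and feature expansion are applied at prediction time as at training time, which is the consistency the pipeline entry refers to.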

### Testing & Quality Assurance

- [Model Validation](https://awesome-repositories.com/f/testing-quality-assurance/model-validation.md) — Reserves a portion of data for testing to obtain an unbiased estimate of how models perform. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html))
- [Cross Validation](https://awesome-repositories.com/f/testing-quality-assurance/cross-validation.md) — Splits data into multiple subsets to rigorously evaluate model performance and maximize data usage. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html))
- [Performance Profilers](https://awesome-repositories.com/f/testing-quality-assurance/performance-profilers.md) — Provides timing utilities to measure and evaluate code execution performance. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/01.03-magic-commands.html))
- [Interactive Debuggers](https://awesome-repositories.com/f/testing-quality-assurance/interactive-debuggers.md) — Steps through code execution line by line to identify and resolve complex logic errors. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/01.06-errors-and-debugging.html))
- [Memory Profilers](https://awesome-repositories.com/f/testing-quality-assurance/memory-profilers.md) — Provides tools to monitor memory usage and detect resource-heavy operations. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/01.07-timing-and-profiling.html))
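
The holdout and cross-validation entries above can be sketched as follows, assuming scikit-learn is installed. The one-nearest-neighbour classifier, the 50/50 split, the five folds, and the `random_state` value are assumptions picked for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=1)

# Holdout validation: train on one half, score on the unseen half.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0
)
holdout_accuracy = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: every sample is used for testing exactly once.
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"holdout accuracy: {holdout_accuracy:.3f}")
print(f"cross-validation mean accuracy: {cv_scores.mean():.3f}")
```

Scoring on data the model has never seen is what keeps the estimate honest; scoring a one-nearest-neighbour model on its own training data would report perfect accuracy regardless of how well it generalizes.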

### Education & Learning Resources

- [Supervised Learning Examples](https://awesome-repositories.com/f/education-learning-resources/supervised-learning-examples.md) — Trains and tests a model on flower measurements to demonstrate the supervised learning workflow. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html))
- [Linear Regression Tutorials](https://awesome-repositories.com/f/education-learning-resources/linear-regression-tutorials.md) — Models the relationship between variables by finding the best-fitting straight line through data points. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html))
- [Principal Component Analysis Tutorials](https://awesome-repositories.com/f/education-learning-resources/principal-component-analysis-tutorials.md) — Projects data onto a lower-dimensional space by keeping only the most significant variance components. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html))
- [Regression Examples](https://awesome-repositories.com/f/education-learning-resources/regression-examples.md) — Executes a basic linear regression to predict continuous outcomes from a single input variable. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html))
- [Data Visualization Tutorials](https://awesome-repositories.com/f/education-learning-resources/data-visualization-tutorials.md) — Projects complex, high-dimensional data into two dimensions to make patterns and clusters easier to see. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html))
- [Decision Tree Tutorials](https://awesome-repositories.com/f/education-learning-resources/decision-tree-tutorials.md) — Constructs hierarchical models that make predictions by splitting data based on feature values. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html))
- [Model Regularization Tutorials](https://awesome-repositories.com/f/education-learning-resources/model-regularization-tutorials.md) — Applies constraints to model parameters to prevent overfitting and improve generalization on new data. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.06-linear-regression.html))
- [Model Selection Guides](https://awesome-repositories.com/f/education-learning-resources/model-selection-guides.md) — Compares different model configurations and hyperparameters to identify the best performer for a specific task. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html))
- [Regression Analysis Tutorials](https://awesome-repositories.com/f/education-learning-resources/regression-analysis-tutorials.md) — Uses ensembles of decision trees to predict continuous numerical values instead of discrete categories. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html))
- [Regression Modeling Guides](https://awesome-repositories.com/f/education-learning-resources/regression-modeling-guides.md) — Builds models that estimate numerical values based on input features for regression analysis tasks. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.01-what-is-machine-learning.html))
- [Support Vector Machine Guides](https://awesome-repositories.com/f/education-learning-resources/support-vector-machine-guides.md) — Explains the theoretical advantages of using support vector machines for high-dimensional classification tasks. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html))
- [Clustering Tutorials](https://awesome-repositories.com/f/education-learning-resources/clustering-tutorials.md) — Explains the limitations of k-means clustering to motivate more advanced modeling techniques. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html))
- [Density Estimation Guides](https://awesome-repositories.com/f/education-learning-resources/density-estimation-guides.md) — Evaluates different methods for modeling probability distributions to choose the best approach. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html))
- [Dimensionality Reduction Tutorials](https://awesome-repositories.com/f/education-learning-resources/dimensionality-reduction-tutorials.md) — Projects high-dimensional data into lower dimensions while preserving the relative distances between points. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.10-manifold-learning.html))
- [Hyperparameter Tuning Guides](https://awesome-repositories.com/f/education-learning-resources/hyperparameter-tuning-guides.md) — Adjusts model parameters to allow for some classification errors when dealing with overlapping data. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html))
- [Interactive Learning Platforms](https://awesome-repositories.com/f/education-learning-resources/interactive-learning-platforms.md) — Offers educational content through executable notebooks and interactive code environments. ([source](https://cdn.jsdelivr.net/gh/jakevdp/PythonDataScienceHandbook@master/README.md))
- [Manifold Learning Guides](https://awesome-repositories.com/f/education-learning-resources/manifold-learning-guides.md) — Preserves local relationships between data points to uncover complex structures that linear methods miss. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.10-manifold-learning.html))
- [Noise Filtering Guides](https://awesome-repositories.com/f/education-learning-resources/noise-filtering-guides.md) — Removes noise from datasets by reconstructing them using only the most significant principal components. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html))
- [Probabilistic Classifier Guides](https://awesome-repositories.com/f/education-learning-resources/probabilistic-classifier-guides.md) — Uses probabilistic classifiers suitable for discrete feature counts, such as word frequencies in text. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html))
- [Text Processing Tutorials](https://awesome-repositories.com/f/education-learning-resources/text-processing-tutorials.md) — Transforms raw text into numerical representations that machine learning models can process and analyze. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html))
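
As an illustration of the principal component analysis entries above, here is a minimal sketch, assuming scikit-learn is installed. It projects the bundled handwritten-digits data from 64 dimensions down to 2 and then maps the projection back to the original space. Keeping exactly two components is an assumption made to keep the example small.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()                        # 8x8 images flattened to 64 features
print(digits.data.shape)

pca = PCA(n_components=2)
projected = pca.fit_transform(digits.data)    # coordinates along the top 2 components
print(projected.shape)

# Fraction of the total variance retained by the two components.
retained = pca.explained_variance_ratio_.sum()
print(f"variance retained: {retained:.2%}")

# Reconstruction from the retained components only. With more components kept,
# the same call is the basis of the noise-filtering idea listed above.
reconstructed = pca.inverse_transform(projected)
print(reconstructed.shape)
```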

### User Interface & Experience

- [Visualization Frameworks](https://awesome-repositories.com/f/user-interface-experience/visualization-frameworks.md) — Separates style definitions from data rendering logic to manage aesthetic properties.
- [Declarative Visualization Engines](https://awesome-repositories.com/f/user-interface-experience/declarative-visualization-engines.md) — Separates aesthetic style definitions from data rendering to produce complex, publication-quality graphical outputs.
- [Visualization Configurations](https://awesome-repositories.com/f/user-interface-experience/visualization-configurations.md) — Provides a centralized system for managing visualization aesthetics and style definitions.
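
The separation of style from rendering described above can be sketched with Matplotlib style sheets, assuming Matplotlib and NumPy are installed. The `draw` helper is a hypothetical function written for this illustration, and the `ggplot` style is just one of the style sheets that ship with Matplotlib.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)


def draw(ax):
    """Data rendering only: no colours, fonts, or grid settings appear here."""
    ax.plot(x, np.sin(x), label="sin(x)")
    ax.plot(x, np.cos(x), label="cos(x)")
    ax.legend()


grid_before = plt.rcParams["axes.grid"]

# The style sheet overrides rcParams only inside the `with` block.
with plt.style.context("ggplot"):
    grid_inside = plt.rcParams["axes.grid"]
    fig, ax = plt.subplots()
    draw(ax)
    plt.close(fig)

grid_after = plt.rcParams["axes.grid"]
print(grid_before, grid_inside, grid_after)
```

The plotting function never mentions appearance, so the same figure can be restyled by changing only the style name passed to the context manager.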

### Web Development

- [Notebook Interfaces](https://awesome-repositories.com/f/web-development/notebook-interfaces.md) — Supports browser-based notebook interfaces for executing code and visualizing data. ([source](https://jakevdp.github.io/PythonDataScienceHandbook/01.00-ipython-beyond-normal-python.html))

### Programming Languages & Runtimes

- [Execution Kernels](https://awesome-repositories.com/f/programming-languages-runtimes/execution-kernels.md) — Decouples the interactive user interface from the language interpreter for persistent sessions.

### Data & Databases

- [Memory-Mapped Data Structures](https://awesome-repositories.com/f/data-databases/memory-mapped-data-structures.md) — Stores data in contiguous memory buffers to optimize cache locality and reduce overhead.
- [Data Manipulation Patterns](https://awesome-repositories.com/f/data-databases/data-manipulation-patterns.md) — Organizes data manipulation by partitioning datasets into groups, applying functions, and merging results.
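
The partition, apply, and merge pattern described in the last entry is commonly called split-apply-combine. A minimal sketch with pandas `groupby`, assuming pandas is installed and using a small made-up table, looks like this:

```python
import pandas as pd

# Small made-up table: six rows belonging to three groups.
df = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": range(6),
})

# Split by `key`, apply a sum to each group, combine into one Series.
totals = df.groupby("key")["data"].sum()
print(totals)

# Same split, but the result keeps the original row layout:
# each value is centred on its own group's mean.
centered = df.groupby("key")["data"].transform(lambda s: s - s.mean())
print(centered)
```

Aggregation returns one row per group, while `transform` returns one row per original record, which is the practical difference between the two calls.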