# stas00/ml-engineering

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/stas00-ml-engineering).**

16,914 stars · 1,058 forks · Python · cc-by-sa-4.0

## Links

- GitHub: https://github.com/stas00/ml-engineering
- Homepage: https://stasosphere.com/machine-learning/
- awesome-repositories: https://awesome-repositories.com/repository/stas00-ml-engineering.md

## Topics

`ai` `debugging` `gpus` `inference` `large-language-models` `llm` `machine-learning` `machine-learning-engineering` `mlops` `network` `pytorch` `scalability` `slurm` `storage` `training` `transformers`

## Description

This project is a technical reference for managing, scaling, and optimizing distributed machine learning infrastructure. It collects methodologies, automation scripts, and diagnostic tools for large-scale model training and inference on GPU and TPU clusters.

Its focus is diagnostics and infrastructure optimization in multi-node environments: hardware maintenance, network topology configuration, and orchestration of containerized workloads. Performance benchmarking, numerical stability validation, and automated fault detection help engineers locate bottlenecks and hardware failures in distributed systems.

It also covers distributed file system management, automated checkpointing, storage lifecycle optimization, training performance tuning, inference scaling, and enforcement of structured outputs, with the aim of keeping both training and deployment pipelines efficient and reliable.

## Tags

### Artificial Intelligence & ML

- [Distributed Training Orchestration](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-orchestration.md) — Manages containerized workloads and job scheduling across compute clusters to execute large-scale machine learning model training tasks efficiently.
- [Machine Learning Optimization](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/training-algorithms/machine-learning-optimization.md) — Provides technical references and automation scripts for configuring high-speed network interconnects, parallel storage, and containerized AI deployment pipelines.
- [Distributed Training Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-optimizers.md) — Implements communication-computation overlapping and collective operation acceleration to maximize throughput during multi-node model training.
- [Inference Scaling](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-scaling.md) — Optimizes request throughput and memory utilization through continuous batching and parallel execution strategies for high-concurrency model deployment.
- [Training Checkpointing](https://awesome-repositories.com/f/artificial-intelligence-ml/training-checkpointing.md) — Implements automated checkpointing and recovery routines to resume training sessions after hardware failures. ([source](https://github.com/stas00/ml-engineering/tree/master/training))
- [Batch Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/batch-inference-engines.md) — Processes large volumes of prompts in a single application run to maximize throughput and reduce compute costs for benchmarking. ([source](https://github.com/stas00/ml-engineering/tree/master/inference))
- [Model Training Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-optimizers.md) — Provides configuration tools for data types and hyper-parameters to balance speed, memory, and convergence. ([source](https://github.com/stas00/ml-engineering/tree/master/training))
- [Weight Distribution](https://awesome-repositories.com/f/artificial-intelligence-ml/model-weight-management/weight-distribution.md) — Distributes model weights and computations across multiple hardware devices to accommodate large-scale models. ([source](https://github.com/stas00/ml-engineering/tree/master/network))
- [Numerical Stability Techniques](https://awesome-repositories.com/f/artificial-intelligence-ml/numerical-stability-techniques.md) — Monitors training processes for underflow or overflow to ensure mathematical precision and prevent model divergence. ([source](https://github.com/stas00/ml-engineering/blob/master/debug))
- [Structured Output Parsers](https://awesome-repositories.com/f/artificial-intelligence-ml/structured-output-parsers.md) — Constrains model generation to specific formats like JSON by selecting tokens that adhere to a predefined schema. ([source](https://github.com/stas00/ml-engineering/blob/master/inference))
- [Inference Context Injection](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/generative-ai/grounded-answer-generation/inference-context-injection.md) — Provides additional context or external data to the model during inference to improve the relevance and accuracy of generated outputs. ([source](https://github.com/stas00/ml-engineering/blob/master/inference))
- [Performance Benchmarks](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-evaluation-and-validation/performance-benchmarks.md) — Measures inter-node and intra-node throughput and storage latency to identify bottlenecks in distributed training. ([source](https://github.com/stas00/ml-engineering#machine-learning-engineering-open-book))
- [Inference Optimizations](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/serving-and-runtime/inference-optimizations.md) — Groups queries into batches and manages requests to maximize inference throughput and minimize idle compute time. ([source](https://github.com/stas00/ml-engineering/blob/master/inference))
- [Infrastructure Debugging](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/training-algorithms/machine-learning-optimization/infrastructure-debugging.md) — Diagnoses distributed training failures, numerical instabilities, and hardware performance bottlenecks using low-level system tracing and diagnostic reporting tools.
- [Dataset Sampling Utilities](https://awesome-repositories.com/f/artificial-intelligence-ml/dataset-preparation-tools/dataset-sampling-utilities.md) — Produces reduced-size versions of large datasets by sampling records or synthesizing representative data to accelerate training and iteration cycles. ([source](https://github.com/stas00/ml-engineering/blob/master/debug/make-tiny-models-tokenizers-datasets.md))
- [Distributed Training Runtimes](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-training-runtimes.md) — Simulates multi-node distributed training setups on single-node hardware to validate scaling logic. ([source](https://github.com/stas00/ml-engineering/tree/master/training))
- [Speculative Decoding Strategies](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/inference-optimization/inference-acceleration-techniques/speculative-decoding-strategies.md) — Uses smaller draft models to predict tokens and verify them against the main model to reduce latency. ([source](https://github.com/stas00/ml-engineering/blob/master/inference))
- [Training Log Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/training-log-analysis.md) — Provides tools to review historical parameter logs to inform hyper-parameter selection and identify training instabilities. ([source](https://github.com/stas00/ml-engineering/tree/master/resources))
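
The Numerical Stability Techniques entry above refers to monitoring values for underflow or overflow. The following is a minimal sketch of that idea using only the Python standard library, assuming IEEE 754 half precision (fp16) as the target format. The names `to_fp16` and `check_values` are hypothetical and are not taken from the repository; real training code would inspect framework tensors instead of Python floats.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    try:
        # The "e" format code packs a 16-bit float.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values too large for fp16; treat them as inf.
        return math.copysign(math.inf, x)


def check_values(values):
    """Count values that would become inf, zero, or NaN when cast to fp16."""
    report = {"overflow": 0, "underflow": 0, "nan": 0}
    for v in values:
        if math.isnan(v):
            report["nan"] += 1
            continue
        h = to_fp16(v)
        if math.isinf(h):
            # Includes inputs that were already infinite.
            report["overflow"] += 1
        elif h == 0.0 and v != 0.0:
            # A non-zero value was flushed to zero.
            report["underflow"] += 1
    return report


if __name__ == "__main__":
    print(check_values([1.0, 1e5, 1e-10, float("nan"), 0.0]))
```

A check like this is usually run on activations, gradients, or weights at selected steps, so that the first step that produces a non-finite or flushed value can be located.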

### Data & Databases

- [Parallel](https://awesome-repositories.com/f/data-databases/distributed-file-systems/parallel.md) — Implements parallel distributed file systems to handle high-throughput data loading, burst checkpoint writing, and shared codebase access. ([source](https://github.com/stas00/ml-engineering/blob/master/insights/ai-battlefield.md))
- [Collective Communication Operations](https://awesome-repositories.com/f/data-databases/collective-communication-operations.md) — Offloads data reduction and aggregation tasks to network hardware to improve throughput for distributed training. ([source](https://github.com/stas00/ml-engineering/tree/master/network))

### DevOps & Infrastructure

- [Cluster Management](https://awesome-repositories.com/f/devops-infrastructure/cluster-management.md) — Orchestrates distributed training workloads, manages job scheduling, and optimizes hardware utilization across high-performance computing environments.
- [Node Health Diagnostics](https://awesome-repositories.com/f/devops-infrastructure/node-orchestrators/node-health-diagnostics.md) — Identifies and isolates broken hardware by running diagnostic checks to prevent future jobs from being scheduled on those specific nodes. ([source](https://github.com/stas00/ml-engineering/blob/master/orchestration/slurm/users.md))
- [Cluster Job Schedulers](https://awesome-repositories.com/f/devops-infrastructure/cluster-job-schedulers.md) — Maintains job scheduling policies and hardware access permissions to optimize workload distribution across shared computing infrastructure. ([source](https://github.com/stas00/ml-engineering/blob/master/README.md))
- [Job Scheduling](https://awesome-repositories.com/f/devops-infrastructure/job-scheduling.md) — Queues jobs to start at specific future times or after relative delays to manage resource availability. ([source](https://github.com/stas00/ml-engineering/blob/master/orchestration/slurm/users.md))
- [Task & Job Management](https://awesome-repositories.com/f/devops-infrastructure/automation-orchestration/task-execution-frameworks/task-job-management.md) — Executes a sequence of identical tasks as a single unit to allow for controlled concurrency and batch processing. ([source](https://github.com/stas00/ml-engineering/blob/master/orchestration/slurm/users.md))
- [Containerized Application Deployment](https://awesome-repositories.com/f/devops-infrastructure/containerized-application-deployment.md) — Automates the deployment, scaling, and lifecycle management of containerized workloads across diverse cloud and on-premises infrastructure. ([source](https://github.com/stas00/ml-engineering/tree/master/orchestration))
- [System Resource Provisioning](https://awesome-repositories.com/f/devops-infrastructure/system-resource-provisioning.md) — Calculates and allocates compute, storage, and networking resources to support large-scale machine learning training and inference. ([source](https://github.com/stas00/ml-engineering/blob/master/insights/ai-battlefield.md))
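
The Job Scheduling and Task & Job Management entries above describe delayed starts and job arrays with controlled concurrency. As a rough sketch, assuming a Slurm cluster, the helper below only builds an `sbatch` argument list with the `--begin` and `--array` options; it does not submit anything. The function name `build_sbatch_command` and the script name `train.slurm` are hypothetical.

```python
from typing import List, Optional


def build_sbatch_command(
    script: str,
    begin: Optional[str] = None,
    array: Optional[str] = None,
    max_concurrent: Optional[int] = None,
) -> List[str]:
    """Build an sbatch argument list without submitting the job."""
    cmd = ["sbatch"]
    if begin is not None:
        # Absolute ("16:00") or relative ("now+2hours") start time.
        cmd.append(f"--begin={begin}")
    if array is not None:
        # "%N" caps how many array tasks may run at the same time.
        spec = array if max_concurrent is None else f"{array}%{max_concurrent}"
        cmd.append(f"--array={spec}")
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_sbatch_command(
        "train.slurm", begin="now+2hours", array="1-10", max_concurrent=1)))
```

Returning a list rather than a shell string lets the caller pass it to `subprocess.run` without shell quoting concerns.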

### Development Tools & Productivity

- [Debugging and Diagnostics](https://awesome-repositories.com/f/development-tools-productivity/debugging-profiling-testing/debugging-diagnostics.md) — Provides diagnostic procedures, stack trace inspection, and troubleshooting tools to resolve failures, hangs, and performance issues in deep learning applications. ([source](https://github.com/stas00/ml-engineering/blob/master/debug/pytorch.md))
- [Diagnostic Toolkits](https://awesome-repositories.com/f/development-tools-productivity/diagnostic-toolkits.md) — Provides a suite of utilities for debugging multi-node communication, monitoring accelerator performance, and resolving numerical instabilities.
- [CLI Process Controls](https://awesome-repositories.com/f/development-tools-productivity/cli-process-controls.md) — Signals running processes to save state and exit cleanly before a job reaches its time limit or is preempted. ([source](https://github.com/stas00/ml-engineering/blob/master/orchestration/slurm/users.md))
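
The CLI Process Controls entry above describes signalling a running process so that it saves state and exits before a time limit or preemption. A minimal sketch of the receiving side follows, assuming a POSIX system and that the scheduler delivers `SIGUSR1` (with Slurm this can be requested through the `--signal` option of `sbatch`). The class `GracefulExit`, the function `train`, and the `on_step` hook are hypothetical names used only for illustration.

```python
import signal


class GracefulExit:
    """Record that a stop signal arrived so the loop can exit at a safe point."""

    def __init__(self, signum=signal.SIGUSR1):
        self.requested = False
        signal.signal(signum, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler tiny: only set a flag, do the real work in the loop.
        self.requested = True


def train(steps, stopper, save_checkpoint, on_step=None):
    """Run up to `steps` iterations; checkpoint and stop once a signal is seen."""
    for step in range(steps):
        if on_step is not None:
            on_step(step)  # placeholder for the real training work
        if stopper.requested:
            save_checkpoint(step)
            return step
    return steps
```

Checking the flag once per step, rather than saving inside the handler, avoids writing a checkpoint while the model state is in the middle of an update.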

### Hardware & IoT

- [Faulty Node Management](https://awesome-repositories.com/f/hardware-iot/integration-performance/hardware-interfacing-integration/hardware-interfacing/faulty-node-management.md) — Provides interfaces to identify and swap faulty hardware nodes without manual intervention to maintain cluster performance and availability. ([source](https://github.com/stas00/ml-engineering/blob/master/insights/how-to-choose-cloud-provider.md))

### Scientific & Mathematical Computing

- [High-Performance Computing](https://awesome-repositories.com/f/scientific-mathematical-computing/high-performance-execution-environments/high-performance-and-parallel-computing/high-performance-computing.md) — Configures and maintains specialized hardware, network interconnects, and parallel storage systems to support intensive scientific and machine learning workloads.

### System Administration & Monitoring

- [Cluster Monitoring](https://awesome-repositories.com/f/system-administration-monitoring/cluster-monitoring.md) — Retrieves real-time metrics, resource usage, and historical accounting data for active or completed jobs and hardware components. ([source](https://github.com/stas00/ml-engineering/blob/master/compute/accelerator))

### Education & Learning Resources

- [Machine Learning Guides](https://awesome-repositories.com/f/education-learning-resources/machine-learning-guides.md) — Provides a comprehensive collection of best practices, methodologies, and diagnostic tools for scaling, training, and deploying large-scale models.

### Networking & Communication

- [Connectivity Verifiers](https://awesome-repositories.com/f/networking-communication/network-clients/connectivity-verifiers.md) — Verifies inter-node communication and network health to ensure hardware is correctly configured for multi-GPU training. ([source](https://github.com/stas00/ml-engineering/blob/master/README.md))

### Operating Systems & Systems Programming

- [Resource Paging](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/allocation-strategies/dynamic-memory-allocation/gpu-memory-allocators/resource-paging.md) — Allocates accelerator memory using paging techniques to prevent fragmentation and improve utilization during inference. ([source](https://github.com/stas00/ml-engineering/tree/master/inference))

### Security & Cryptography

- [Homomorphic Encryption](https://awesome-repositories.com/f/security-cryptography/homomorphic-encryption.md) — Performs computations on encrypted data using homomorphic encryption to protect privacy and intellectual property. ([source](https://github.com/stas00/ml-engineering/tree/master/inference))

### Testing & Quality Assurance

- [Test Execution Controls](https://awesome-repositories.com/f/testing-quality-assurance/general-testing-utilities/test-utilities-assertions/test-lifecycle-execution-control/test-execution-controls.md) — Applies conditional logic to skip or expect failures for specific tests based on hardware availability, platform requirements, or known bugs. ([source](https://github.com/stas00/ml-engineering/tree/master/testing))
- [Test Suite Filters](https://awesome-repositories.com/f/testing-quality-assurance/testing-infrastructure-management/test-execution-management/test-suite-filters.md) — Runs specific test suites, classes, or individual functions using keyword filtering and logical operators to isolate relevant code paths. ([source](https://github.com/stas00/ml-engineering/blob/master/testing))
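
The Test Execution Controls entry above describes skipping tests or expecting failures depending on hardware availability or known bugs. The sketch below illustrates the idea with the standard library `unittest` decorators so that it has no third-party dependencies; this is a substitution made for the example and not a statement about which test runner the repository uses. The accelerator check via `nvidia-smi` and all test names are assumptions.

```python
import io
import shutil
import unittest

# Assumption for illustration: an NVIDIA GPU is present if nvidia-smi is on PATH.
HAS_ACCELERATOR = shutil.which("nvidia-smi") is not None


class HardwareAwareTests(unittest.TestCase):
    @unittest.skipUnless(HAS_ACCELERATOR, "requires an accelerator")
    def test_needs_accelerator(self):
        # Placeholder body; a real test would exercise device code here.
        self.assertTrue(HAS_ACCELERATOR)

    @unittest.expectedFailure
    def test_known_bug(self):
        # Stand-in for a known issue: binary floats make this comparison fail.
        self.assertEqual(0.1 + 0.2, 0.3)

    def test_always_runs(self):
        self.assertEqual(sum([1, 2, 3]), 6)


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(HardwareAwareTests)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

Skipped tests and expected failures are reported separately from real failures, so the suite stays green on machines that lack the hardware while the gaps remain visible in the report.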