# NVIDIA/TensorRT-LLM

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/nvidia-tensorrt-llm).**

12,913 stars · 2,115 forks · Python · other

## Links

- GitHub: https://github.com/NVIDIA/TensorRT-LLM
- Homepage: https://nvidia.github.io/TensorRT-LLM
- awesome-repositories: https://awesome-repositories.com/repository/nvidia-tensorrt-llm.md

## Topics

`blackwell` `cuda` `llm-serving` `moe` `pytorch`

## Description

TensorRT-LLM is a toolkit for compiling, optimizing, and serving transformer-based models on NVIDIA GPUs. It transforms machine learning models into efficient execution graphs and tunes them for specific hardware to maximize throughput and minimize latency during text generation.

The project applies optimizations across the entire inference pipeline. Kernel fusion and static graph execution streamline mathematical operations and computational flow, while paged attention memory management handles long sequences without memory fragmentation. These are combined with in-flight request batching and with custom decoding logic that implements sampling strategies directly within the execution pipeline, reducing data transfer overhead.

The toolkit supports both online model serving for scalable, concurrent request handling and offline batch inference for high-volume, non-interactive processing. It exposes controls for managing attention memory and configuring decoding parameters, so hardware utilization can be kept efficient across deployment environments.

## Tags

### Artificial Intelligence & ML

- [GPU-Accelerated](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/engines-runtimes-servers/gpu-accelerated.md) — Provides a high-performance deployment platform for serving optimized language models with advanced batching.
- [Model Compilation](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/serving-and-runtime/inference-optimization-utilities/model-compilation.md) — Transforms machine learning models into highly efficient execution graphs for accelerated text generation.
- [Large Language Model Optimization](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-optimization-and-inference/serving-and-runtime/large-language-model-optimization.md) — Compiles and refines machine learning models for specific hardware to maximize throughput and reduce latency.
- [Model Optimization](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization.md) — Compiles and refines machine learning models for specific hardware to maximize processing throughput and reduce latency. ([source](https://nvidia.github.io/TensorRT-LLM/))
- [Model Compilers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/frameworks/training-systems/model-performance-optimizations/model-compilers.md) — Transforms high-level neural network definitions into hardware-specific execution kernels to maximize throughput.
- [High-Throughput Model Serving](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/high-throughput-model-serving.md) — Deploys optimized models as scalable services to handle concurrent user requests while maintaining low latency.
- [Online Model Servers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/online-model-servers.md) — Deploys optimized models as scalable services, handling concurrent user requests with advanced batching and scheduling. ([source](https://nvidia.github.io/TensorRT-LLM/))
- [Attention Backends](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/attention-backends.md) — Allocates and retains memory for attention mechanisms to support long sequence processing and data reuse. ([source](https://nvidia.github.io/TensorRT-LLM/))
- [Offline Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/model-integration-pipelines/model-inference/offline-inference-engines.md) — Provides offline batch inference to process large volumes of data through optimized models in non-interactive environments. ([source](https://nvidia.github.io/TensorRT-LLM/))
- [Batch Inference Engines](https://awesome-repositories.com/f/artificial-intelligence-ml/batch-inference-engines.md) — Processes large volumes of data through optimized models in non-interactive environments to maximize hardware utilization.
- [Custom Decoding Strategies](https://awesome-repositories.com/f/artificial-intelligence-ml/decoder-architectures/custom-decoding-strategies.md) — Implements custom sampling strategies directly within the execution pipeline to minimize data transfer overhead.
- [Generation Controls](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-ai-resources/decoding-generation-controls/generation-controls.md) — Adjusts sampling strategies and decoding logic to manage generated text quality and inference speed.
- [Inference Configuration Parameters](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/model-integration-pipelines/model-inference/inference-configuration-parameters.md) — Provides configuration settings to adjust sampling strategies and logic for controlling generated text quality. ([source](https://nvidia.github.io/TensorRT-LLM/))
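
The generation controls listed above adjust how the next token is drawn from the model's output scores. As an illustration only, the following is a minimal pure-Python sketch of temperature-scaled top-k sampling. The function `sample_top_k` and its parameters are hypothetical names chosen for this example and are not TensorRT-LLM's API, which performs this step inside its own execution pipeline.

```python
import math
import random


def sample_top_k(logits, k, temperature=1.0, rng=None):
    """Draw one token index from the k highest-scoring logits.

    Hypothetical helper for illustration; not part of TensorRT-LLM.
    """
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    if rng is None:
        rng = random.Random()
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature scaling: lower values sharpen the distribution.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 3.0]
token = sample_top_k(logits, k=2, temperature=0.7, rng=random.Random(0))
print(token)  # always 0 or 3, the two highest-scoring indices
```

With `k=1` the function reduces to greedy decoding, since only the highest-scoring index survives the filter.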

### Operating Systems & Systems Programming

- [PagedAttention Memory Management](https://awesome-repositories.com/f/operating-systems-systems-programming/kernel-core-internals/process-and-memory-management/memory-management/buffer-and-cache-management/pagedattention-memory-management.md) — Allocates non-contiguous memory blocks for key-value caches to eliminate fragmentation and support long sequence lengths.
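
The block-based allocation described above can be sketched with a toy block table. This is a simplified illustration under stated assumptions, not TensorRT-LLM's implementation: the class `PagedKVCache`, its fields, and the block sizes are all hypothetical, and the sketch tracks only which block holds which token rather than any real key or value tensors.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block-table allocator; every name here is hypothetical."""

    num_blocks: int
    block_size: int  # tokens stored per physical block
    free_blocks: list = field(init=False)
    block_tables: dict = field(default_factory=dict)  # seq id -> block ids
    seq_lens: dict = field(default_factory=dict)      # seq id -> token count

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical block, offset)."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("a")
cache.append_token("b")
print(cache.block_tables)  # sequence "a" spans two blocks that need not be adjacent
cache.release("a")
print(len(cache.free_blocks))  # blocks from "a" are reusable immediately
```

Because a sequence only ever holds whole blocks and returns them on completion, the waste per sequence is bounded by one partially filled block, which is the property that avoids fragmentation.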

### Data & Databases

- [Long-Context Sequence Processors](https://awesome-repositories.com/f/data-databases/text-processing-pipelines/long-context-sequence-processors.md) — Allocates and retains memory for attention mechanisms to support processing long sequences and data reuse.

### System Administration & Monitoring

- [Inference Batching Schedulers](https://awesome-repositories.com/f/system-administration-monitoring/concurrency-management-systems/inference-batching-schedulers.md) — Groups multiple concurrent inference requests into a single execution pass to optimize hardware utilization.
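
To make the scheduling idea concrete, here is a minimal sketch of in-flight batching, in which a queued request joins the running batch as soon as another request finishes instead of waiting for the whole batch to drain. The names `Request` and `run_inflight_batching` are hypothetical, and `tokens_left` is an assumed stand-in for "decode steps until end of sequence"; a real scheduler would also account for memory limits and the prefill phase.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # assumed positive; stands in for steps until EOS


def run_inflight_batching(requests, max_batch_size):
    """Return the names of the requests that were active at each decode step.

    Note: this mutates the tokens_left field of the given requests.
    """
    waiting = deque(requests)
    active = []
    schedule = []
    while waiting or active:
        # Admit queued requests whenever a batch slot is free.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        schedule.append([r.name for r in active])
        # One decode step produces one token for every active request.
        for r in active:
            r.tokens_left -= 1
        # Finished requests leave the batch right away, freeing their slots.
        active = [r for r in active if r.tokens_left > 0]
    return schedule


steps = run_inflight_batching(
    [Request("A", 3), Request("B", 1), Request("C", 2)], max_batch_size=2
)
for i, names in enumerate(steps, start=1):
    print(i, names)
```

In this example request C takes over the slot freed by B after the first step, so all three requests finish in three steps. A static scheduler that waited for the first batch of A and B to drain before starting C would need five.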

### Programming Languages & Runtimes

- [Static Graph Execution](https://awesome-repositories.com/f/programming-languages-runtimes/runtime-execution-environments/runtime-environments/execution-engines/static-graph-execution.md) — Pre-calculates computational flow and memory requirements before runtime to ensure predictable performance.
- [Kernel Fusion Operations](https://awesome-repositories.com/f/programming-languages-runtimes/runtime-execution-environments/runtime-environments/runtimes/graph-symbolic-execution-engines/operation-kernels/kernel-fusion-operations.md) — Combines multiple sequential mathematical operations into single optimized GPU kernels to reduce memory bandwidth overhead.
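
The idea behind kernel fusion can be shown in plain Python, with the caveat that this sketch only mirrors the data flow and demonstrates no GPU speedup. The two functions below are hypothetical examples, not TensorRT-LLM code: the unfused version materializes an intermediate list, which stands in for a tensor written to and read back from GPU memory, while the fused version computes the bias add and GELU activation in a single pass.

```python
import math


def gelu(t):
    """Exact GELU activation using the Gaussian error function."""
    return 0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0)))


def unfused_bias_gelu(x, bias):
    # Pass 1: write the full intermediate result (one "kernel").
    tmp = [xi + bi for xi, bi in zip(x, bias)]
    # Pass 2: read it back and apply the activation (a second "kernel").
    return [gelu(t) for t in tmp]


def fused_bias_gelu(x, bias):
    # Single pass: the intermediate value never leaves the loop body,
    # just as a fused kernel keeps it in registers.
    return [gelu(xi + bi) for xi, bi in zip(x, bias)]


x = [0.5, -1.0, 2.0]
bias = [0.1, 0.2, -0.3]
print(unfused_bias_gelu(x, bias) == fused_bias_gelu(x, bias))  # True
```

Both versions perform the same arithmetic, so the results match exactly; the difference is that the fused form avoids storing and reloading the intermediate values.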
