# thu-pacman/chitu

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/thu-pacman-chitu).**

3,265 stars · 530 forks · Python · apache-2.0

## Links

- GitHub: https://github.com/thu-pacman/chitu
- awesome-repositories: https://awesome-repositories.com/repository/thu-pacman-chitu.md

## Topics

`deepseek` `gpu` `llm` `llm-serving` `model-serving` `pytorch`

## Description

Chitu is a distributed serving platform and orchestrator for large language model inference. It acts as a compute manager that deploys and scales model workloads across diverse hardware, including GPUs, CPUs, and heterogeneous clusters.

Deployment targets include NVIDIA GPUs, regional chipsets, and legacy hardware. Running models across these environments increases available computing capacity and improves resource utilization.

The system supports distributed inference orchestration and heterogeneous hardware scaling, so models can run on configurations ranging from a single device to large production clusters. It also manages concurrent traffic and queues requests to stay stable under high-demand workloads.
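The request queueing idea can be illustrated with a small sketch. This is not Chitu's code or API: `BoundedRequestGate`, `fake_model`, and the `max_concurrent` / `max_queued` parameters are hypothetical names chosen for illustration. It assumes a simple policy where a fixed number of requests run at once, a fixed number wait, and anything beyond that is rejected.

```python
import asyncio


class BoundedRequestGate:
    """Hypothetical gate (not part of Chitu): at most `max_concurrent`
    requests run at once, at most `max_queued` wait, the rest are rejected."""

    def __init__(self, max_concurrent: int, max_queued: int) -> None:
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_pending = max_concurrent + max_queued
        self._pending = 0  # requests currently running or waiting

    async def submit(self, handler, prompt: str) -> str:
        # Reject early instead of letting the backlog grow without bound.
        if self._pending >= self._max_pending:
            raise RuntimeError("queue full")
        self._pending += 1
        try:
            async with self._slots:  # wait here until a slot is free
                return await handler(prompt)
        finally:
            self._pending -= 1


async def fake_model(prompt: str) -> str:
    # Stand-in for an inference call; assumed for illustration only.
    await asyncio.sleep(0.01)
    return prompt.upper()


async def demo() -> list:
    gate = BoundedRequestGate(max_concurrent=2, max_queued=2)
    calls = [gate.submit(fake_model, f"req{i}") for i in range(5)]
    # return_exceptions=True keeps the rejected request in the result list.
    return await asyncio.gather(*calls, return_exceptions=True)
```

Calling `asyncio.run(demo())` submits five requests against a gate that admits four: two run, two wait, and the fifth is rejected. Rejecting at the door is one possible design choice; a real server might instead block the caller or return a retry hint.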

## Tags

### Artificial Intelligence & ML

- [LLM Serving Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/llm-serving-architectures.md) — Provides a high-performance engineering architecture for deploying and serving large language models at scale across clusters.
- [Cross-Hardware Model Inference](https://awesome-repositories.com/f/artificial-intelligence-ml/cross-hardware-model-inference.md) — Enables the execution of large language models across diverse hardware, including NVIDIA GPUs, regional chipsets, and legacy systems. ([source](https://github.com/thu-pacman/chitu/tree/public-main/docs/en))
- [Distributed Model Execution](https://awesome-repositories.com/f/artificial-intelligence-ml/distributed-model-execution.md) — Distributes model inference workloads across multiple compute devices to increase processing speed and resource utilization.
- [Heterogeneous Hardware Scaling](https://awesome-repositories.com/f/artificial-intelligence-ml/heterogeneous-hardware-scaling.md) — Increases computing capacity by running inference across mixed hardware clusters and non-NVIDIA chipsets.
- [Inference Scaling](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-scaling.md) — Scales model inference capacity across configurations ranging from single devices to large-scale production clusters. ([source](https://cdn.jsdelivr.net/gh/thu-pacman/chitu@public-main/README.md))
- [Inference Deployment Orchestrators](https://awesome-repositories.com/f/artificial-intelligence-ml/llm-orchestrators/inference-deployment-orchestrators.md) — Orchestrates the deployment and scaling of large language models across heterogeneous hardware clusters.
- [Heterogeneous Orchestrators](https://awesome-repositories.com/f/artificial-intelligence-ml/local-model-orchestrators/heterogeneous-orchestrators.md) — Coordinates the distribution of LLM workloads across varying CPU and GPU architectures to optimize resource utilization.
- [Hardware-Agnostic Deployment](https://awesome-repositories.com/f/artificial-intelligence-ml/model-optimization/inference-deployment/model-deployment-toolkits/hardware-agnostic-deployment.md) — Provides a runtime that abstracts chip-specific instructions to allow models to run across GPUs, CPUs, and regional accelerators.
- [High-Throughput Model Serving](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-deployment-and-serving/inference-servers-and-runtimes/high-throughput-model-serving.md) — Employs architectures designed to handle large volumes of concurrent LLM inference requests with stable performance.

### Scientific & Mathematical Computing

- [Distributed Inference Orchestrators](https://awesome-repositories.com/f/scientific-mathematical-computing/high-performance-execution-environments/high-performance-and-parallel-computing/parallel-processing/distributed-inference-orchestrators.md) — Implements a system for distributing model weights and computation tasks across multiple devices and network nodes to scale processing.

### DevOps & Infrastructure

- [Hardware Kernel Switching](https://awesome-repositories.com/f/devops-infrastructure/apple-silicon-deployment/multi-backend-execution/hardware-kernel-switching.md) — Switches between hardware-specific kernels based on the target compute device to maintain optimal inference performance.
- [Model Inference Deployment](https://awesome-repositories.com/f/devops-infrastructure/deployment-management/model-inference-deployment.md) — Deploys large language models into production environments across diverse hardware including GPUs and CPUs.
- [Traffic Management](https://awesome-repositories.com/f/devops-infrastructure/traffic-management.md) — Controls request throughput and connection concurrency to ensure stable operation in high-demand production environments. ([source](https://github.com/thu-pacman/chitu/tree/public-main/docs/en))

### Operating Systems & Systems Programming

- [Model Tensor Mapping](https://awesome-repositories.com/f/operating-systems-systems-programming/virtual-memory-management/cross-platform-memory-abstraction-layers/model-tensor-mapping.md) — Provides specialized memory mapping to allocate model tensors across diverse hardware architectures for optimized inference performance.

### Software Engineering & Architecture

- [Cluster Load Balancing](https://awesome-repositories.com/f/software-engineering-architecture/cluster-load-balancing.md) — Balances computational loads across available cluster nodes to ensure maximum hardware utilization during model inference.
- [Concurrent Request Limits](https://awesome-repositories.com/f/software-engineering-architecture/traffic-management/concurrent-request-limits.md) — Manages simultaneous inference requests through a structured buffer to maintain system stability during traffic spikes.
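The cluster load balancing tag above can be illustrated with a greedy least-loaded policy. This is a sketch under stated assumptions, not Chitu's scheduler: the function `assign_requests`, the node names, and the idea of a scalar per-request cost are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def assign_requests(nodes: List[str], costs: List[int]) -> Dict[str, List[int]]:
    """Greedy least-loaded assignment (illustrative only).

    Each request, identified by its index in `costs`, goes to the node
    with the smallest accumulated cost so far; ties break by node name.
    """
    heap: List[Tuple[int, str]] = [(0, name) for name in nodes]
    heapq.heapify(heap)
    plan: Dict[str, List[int]] = {name: [] for name in nodes}
    for request_id, cost in enumerate(costs):
        load, name = heapq.heappop(heap)          # currently least-loaded node
        plan[name].append(request_id)
        heapq.heappush(heap, (load + cost, name))  # record its new load
    return plan
```

For example, `assign_requests(["a", "b"], [5, 1, 1, 1])` sends the expensive first request to `a` and the three cheap ones to `b`. A production balancer would also account for memory, in-flight tokens, and node health, none of which this sketch models.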
