# microsoft/minference

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/microsoft-minference).**

1,221 stars · 78 forks · Python · MIT

## Links

- GitHub: https://github.com/microsoft/MInference
- Homepage: https://aka.ms/MInference
- awesome-repositories: https://awesome-repositories.com/repository/microsoft-minference.md

## Description

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] MInference speeds up long-context LLM inference by computing attention with approximate, dynamic sparse patterns, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
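To give a rough feel for what "sparse attention" means here, the following is a minimal pure-Python sketch of generic top-k sparse attention for a single query. It is not MInference's algorithm or API: the function names `dense_attention` and `topk_sparse_attention` are hypothetical, and the sketch still scores every key before selecting, so it shows only the approximation, not the speedup. A practical system would need a cheaper way to decide which keys to keep.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_scores(q, keys):
    # Scaled dot-product score of one query against every key.
    d = len(q)
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]


def dense_attention(q, keys, values):
    # Standard attention: softmax over all keys, then a weighted sum of values.
    weights = softmax(scaled_scores(q, keys))
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def topk_sparse_attention(q, keys, values, k):
    # Illustrative assumption: keep only the k highest-scoring keys and
    # renormalise the softmax over that subset.
    scores = scaled_scores(q, keys)
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in keep])
    dim = len(values[0])
    return [sum(w * values[i][j] for w, i in zip(weights, keep)) for j in range(dim)]


if __name__ == "__main__":
    q = [1.0, 0.0]
    keys = [[10.0, 0.0], [0.0, 1.0], [-10.0, 0.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(dense_attention(q, keys, values))
    print(topk_sparse_attention(q, keys, values, k=1))
```

When one key dominates the scores, as in the toy input above, the top-1 result stays close to the dense result, which is the intuition behind trading a small approximation error for less computation.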

## Tags

### Part of an Awesome List

- [Attention Optimization](https://awesome-repositories.com/f/awesome-lists/ai/attention-optimization.md) — Dynamic sparse attention for accelerating long-context pre-filling.
- [Inference and Serving](https://awesome-repositories.com/f/awesome-lists/ai/inference-and-serving.md) — Sparse attention calculation to reduce latency in long-context models.