Gpt Fast

Gpt Fast - run low-latency LLM inference | Awesome Repos

Open-source alternatives to Gpt Fast

Similar open-source projects, ranked by how many features they share with Gpt Fast.

meta-pytorch/gpt-fast
6,223 stars on GitHub
gpt-fast is a transformer inference engine for text generation, implemented natively in PyTorch. It runs large language models without external C++ extensions. To accelerate generation, the project implements speculative decoding, in which a small draft model predicts tokens and a larger model verifies them. It further improves performance with a compiled prefill stage and a multi-GPU tensor parallelism library that shards linear layers across GPUs. Memory efficiency is managed throu
Python
mistralai/mistral-src
10,821 stars on GitHub
This project is a large language model inference library and framework designed to run models for text generation, problem solving, and coding assistance. It includes a multimodal framework for processing combined image and text inputs and a tool-use implementation that enables the execution of external functions based on model reasoning. The system features a distributed GPU inference engine that spreads large model workloads across multiple graphics processors to increase processing speed and meet memory requirements. It also provides containerized model deployment through pre-packaged imag
Jupyter Notebook
intel/ipex-llm
8,836 stars on GitHub
Intel XPU LLM Acceleration Library is a toolkit designed to accelerate large language model inference and finetuning on Intel CPUs, GPUs, and NPUs. It provides a distributed inference engine for scaling models across multiple accelerators, a multimodal model runtime for vision and speech tasks, and a low-bit model quantization tool for converting weights into INT4, FP8, and GGUF formats. The project features a parameter-efficient finetuning framework that enables model adaptation using QLoRA and DPO on Intel hardware. It distinguishes itself by providing specialized optimizations for Intel XP
Python
OpenNMT/CTranslate2
4,319 stars on GitHub
CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models. The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model di
C++ (topics: avx, avx2, cpp)
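
The gpt-fast entry above mentions speculative decoding, where a small draft model predicts tokens and a larger model verifies them. The idea can be sketched roughly as follows. This is a toy, greedy-only illustration and not gpt-fast's code: `target_model`, `draft_model`, and `speculative_generate` are hypothetical names, the "models" are simple arithmetic stand-ins, and a real engine would verify all drafted positions in one batched forward pass and would typically use a probabilistic acceptance rule when sampling.

```python
from typing import Callable, List

# A toy "model" maps a token prefix to its greedy next token (hypothetical).
NextToken = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Stand-in for the large model: next token is the prefix sum modulo 7.
    return sum(prefix) % 7


def draft_model(prefix: List[int]) -> int:
    # Stand-in for the small model: agrees with the target except when the
    # prefix length is a multiple of 4, where it is off by one.
    guess = sum(prefix) % 7
    return guess if len(prefix) % 4 else (guess + 1) % 7


def speculative_generate(prompt: List[int], draft: NextToken,
                         target: NextToken, num_new_tokens: int,
                         k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each proposed position in order.
        accepted: List[int] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            if guess == expected:
                accepted.append(guess)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every guess matched, so the target adds one extra token.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]


def greedy_generate(prompt: List[int], model: NextToken,
                    num_new_tokens: int) -> List[int]:
    # Baseline: plain one-token-at-a-time greedy decoding.
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(model(tokens))
    return tokens
```

In this greedy variant every emitted token is one the target model would have chosen itself, so the output matches plain greedy decoding with the target; any speed-up comes only from accepting several drafted tokens per verification round.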

