CTranslate2

CTranslate2 is a C++ inference engine and runtime for Transformer models, designed to execute models on both CPU and GPU with optimizations for speed and memory efficiency. It functions as a model format converter, quantization tool, and REST API server, enabling deployment of neural machine translation, automatic speech recognition, and text generation models.

The engine distinguishes itself through a suite of runtime optimizations including layer fusion, weight-matrix quantization, batch-by-length grouping, and a caching allocator that reuses GPU memory. It supports tensor-parallel model distribution across multiple GPUs, static prompt state caching to avoid re-encoding repeated inputs, and CPU instruction set dispatch that selects the optimal code path for the hardware. An asynchronous inference queue allows overlapping computation with other work, while the OpenAI-compatible REST API enables drop-in integration with existing applications.

CTranslate2 provides model conversion tools for frameworks including Fairseq, Hugging Face Transformers, Marian, OpenNMT-py, OpenNMT-tf, and OPUS-MT, transforming trained models into an optimized binary format. It supports a range of quantization types such as INT8, FP16, and BF16, with automatic compute type selection based on the available hardware. The engine handles text translation, text generation with configurable decoding strategies like beam search and sampling, sequence scoring, text encoding, and speech transcription, all with streaming input and output capabilities.
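The convert-then-run workflow described above can be sketched with the Python package. This is a minimal sketch, assuming the `ctranslate2` and `transformers` packages are installed; the function name and its arguments are hypothetical placeholders, and the source sentence is expected to be tokenized already.

```python
def convert_and_translate(model_name, output_dir, source_tokens):
    """Convert a Hugging Face model, then translate one tokenized sentence."""
    # Imported lazily so the sketch can be read without the packages installed.
    import ctranslate2
    from ctranslate2.converters import TransformersConverter

    # Conversion-time quantization: store the weights as INT8 on disk.
    TransformersConverter(model_name).convert(output_dir, quantization="int8")

    # compute_type="auto" asks the runtime to pick the fastest supported type.
    translator = ctranslate2.Translator(
        output_dir, device="cpu", compute_type="auto"
    )
    results = translator.translate_batch([source_tokens], beam_size=4)

    # Each result holds one or more hypotheses; the first is the best one.
    return results[0].hypotheses[0]
```

The function is not called here because it needs a real model and a matching tokenizer; the model name, output directory, and tokens are left to the caller.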

Features

Model Inference Runtimes - Executes Transformer models with lower memory use and higher speed by applying optimizations such as layer fusion and quantization at runtime.

Transformer Inference Engines - C++ runtime for executing Transformer models on CPU and GPU with optimized performance.

OpenAI-Compatible APIs - Exposes an OpenAI-compatible REST API for drop-in integration of optimized models.

Device and Precision Selectors - Selects the compute device and data type to balance speed and memory usage, such as float32, float16, int8, or bfloat16.

Inference Accelerators - Applies optimizations such as layer fusion and batch reordering to run models faster and with fewer resources.

GPU-Accelerated Inference - Runs Transformer models on NVIDIA GPUs with Compute Capability 3.5 or higher, leveraging CUDA for accelerated execution.

CPU Inference Runtimes - Runs Transformer models on x86-64 and ARM64 processors with automatic backend and instruction set selection.

Top-K Token Sampling - CTranslate2 draws tokens from the model's output distribution, optionally restricting to the top-K candidates, for diverse outputs.
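As an illustration of the technique (not the engine's own code), top-K sampling can be sketched in pure Python. The function name `sample_top_k` and the toy logits are assumptions made for this example.

```python
import math
import random

def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token id from the k highest-scoring logits."""
    # Keep the ids of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (shifted by the max for stability).
    scaled = [logits[i] / temperature for i in top_ids]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    # Draw one id in proportion to its renormalized probability.
    return rng.choices(top_ids, weights=weights, k=1)[0]

logits = [2.0, 0.5, 1.5, -1.0, 0.1]
rng = random.Random(0)
# With k=2 only token ids 0 and 2 (the two largest logits) can be drawn.
samples = {sample_top_k(logits, k=2, rng=rng) for _ in range(200)}
```

Setting `k=1` reduces the sampler to greedy selection of the single best token.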

Decoding Strategies - CTranslate2 controls text generation through beam search, sampling, and repetition penalties to balance quality and diversity.

Beam Search Runtimes - CTranslate2 keeps multiple candidate hypotheses at each step to find a better final translation at the cost of speed and memory.
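The trade-off can be seen in a small, self-contained sketch of beam search over a toy model. The names `beam_search`, `toy_step`, and `TABLE` are hypothetical, and the probabilities are invented so that the greedy path (beam size 1) misses the best sequence.

```python
import math

def beam_search(step_fn, beam_size, max_len, eos="</s>"):
    """Return the best (tokens, score) while keeping beam_size hypotheses."""
    beams = [([], 0.0)]  # live hypotheses: (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((tokens + [token], score + logp))
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy next-token distributions; any other prefix ends the sequence.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 1.0},
}

def toy_step(tokens):
    probs = TABLE.get(tuple(tokens), {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy = beam_search(toy_step, beam_size=1, max_len=5)
wide = beam_search(toy_step, beam_size=2, max_len=5)
```

Here the greedy search commits to "a" at the first step and ends with probability 0.3, while the wider beam keeps "b" alive and finds the sequence with probability 0.4, at the cost of scoring more candidates per step.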

Decoding Strategy Implementations - CTranslate2 applies beam search, sampling, and other decoding strategies during text generation or translation.

Inference Execution Models - Executes Transformer models for tasks like translation, summarization, and generation using a custom runtime optimized for speed and memory.

Accelerated Speech Recognizers - CTranslate2 transcribes audio to text using Transformer-based speech recognition models with accelerated inference.

Speech Recognition Engines - CTranslate2 transcribes audio into text using an optimized speech recognition model.

Neural Machine Translation - Translates text between languages using optimized Transformer models, with support for beam search, streaming, and batch processing.

Model Format Converters - Transforms models from Fairseq, Marian, and OpenNMT-tf into an optimized binary format for efficient execution.

Translation Model Runners - CTranslate2 runs a converted Fairseq translation model on tokenized input to produce translated output sequences.

Model Optimization Frameworks - Transforms trained models from Fairseq and other supported frameworks into the engine's optimized format.

Tensor-Parallel Inference Distributions - CTranslate2 splits a large model across several GPUs using tensor parallelism to handle models that exceed a single device's memory.

Framework-Specific Model Converters - Converts trained models from frameworks like Fairseq and Hugging Face Transformers into an optimized binary format, with optional weight quantization for efficient deployment.

Model Quantization - Reduces model weights to lower-precision formats to shrink memory footprint and accelerate inference.

Model Serving APIs - Serves Transformer models through an OpenAI-compatible HTTP endpoint for application integration.

Asynchronous Inference - CTranslate2 starts a generation task and retrieves the result later, allowing other work to proceed while the model computes.

Model Format Converters - Transforms trained models from supported frameworks into the optimized CTranslate2 format for faster inference.

Prefix Bias Control - CTranslate2 encourages the model to follow a given prefix but allows it to diverge when the model is confident in a different token.

Quantized Inference Accelerators - CTranslate2 runs neural network computations in reduced precision like INT8 or FP16 to speed up execution on both CPU and GPU.

Weight Quantization - Transforms model weights into lower-precision formats during conversion and loading to shrink memory footprint and accelerate operations.
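A common way to do this is symmetric INT8 quantization with one scale per weight row. The sketch below shows that general scheme in pure Python; it is an assumption for illustration and not necessarily the exact arithmetic the engine uses.

```python
def quantize_int8(row):
    """Symmetric INT8 quantization of one weight row: returns (ints, scale)."""
    max_abs = max(abs(w) for w in row)
    # Map the largest magnitude in the row to 127.
    scale = 127.0 / max_abs if max_abs else 1.0
    return [round(w * scale) for w in row], scale

def dequantize(ints, scale):
    """Recover approximate float weights from the quantized row."""
    return [q / scale for q in ints]

weights = [0.02, -0.5, 0.25, 0.1]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding bounds the per-weight error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight then needs one byte instead of four, plus a single float scale per row.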

Decoder-Only Inference - CTranslate2 runs text generation using decoder-only Transformer models converted from OpenNMT-py format.

Greedy Decoding Strategies - CTranslate2 selects the highest-probability token at each step for the fastest possible decoding with no branching.

Autoregressive Text Generation - Generates output tokens one step at a time using beam search or sampling for summarization or dialogue.

Speech Recognition - CTranslate2 transcribes audio into text using a speech recognition model with language detection and task prompting.

Target Prefix Forcing - CTranslate2 forces the start of the generated sequence to match a given prefix, completing the rest freely.

Text Generation - CTranslate2 generates text from a batch of prompts or start tokens using a generative language model.

Text Translation Inference - CTranslate2 translates sequences of source tokens into target tokens using a pre-trained Transformer model, outputting the most likely hypothesis.

Text Translation Services - CTranslate2 translates text between languages using an optimized Transformer model on CPU or GPU.

Weight Quantization Tools - Reduces model weights to lower-precision formats to shrink memory footprint and accelerate inference.

Length-Based Batch Groupers - CTranslate2 groups input sequences by length and processes them in fixed-size chunks to maximize hardware utilization and throughput.
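The idea can be sketched as follows: sort inputs by length, cut the sorted list into fixed-size batches, and restore the original order afterwards. The helper names and the `process_batch` callback are hypothetical stand-ins for the model call.

```python
def batches_by_length(sequences, max_batch_size):
    """Yield (original_indices, batch) with similarly sized sequences together."""
    order = sorted(
        range(len(sequences)), key=lambda i: len(sequences[i]), reverse=True
    )
    for start in range(0, len(order), max_batch_size):
        idx = order[start:start + max_batch_size]
        yield idx, [sequences[i] for i in idx]

def run_sorted(sequences, max_batch_size, process_batch):
    """Process length-sorted batches and return results in input order."""
    results = [None] * len(sequences)
    for idx, batch in batches_by_length(sequences, max_batch_size):
        for i, out in zip(idx, process_batch(batch)):
            results[i] = out
    return results

sentences = [["a"], ["b", "c", "d", "e"], ["f", "g"], ["h", "i", "j"]]
grouped = [batch for _, batch in batches_by_length(sentences, max_batch_size=2)]
# Stand-in for a model call: returns the length of each sequence in the batch.
lengths = run_sorted(sentences, 2, lambda batch: [len(s) for s in batch])
```

Because neighbours in the sorted order have similar lengths, each batch needs little padding, and callers still receive results in the order they submitted them.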

Compute Type Selectors - Automatically selects the fastest supported computation type for the current hardware when loading a model.

Speech-to-Text Transcription - CTranslate2 runs a speech-to-text model to convert audio input into written text output in real time.

Compute Type Auto-Selectors - Automatically selects the fastest supported quantization type for the current hardware and backend when loading a model.

Inference Performance Optimizers - Applies environment variables and runtime settings to maximize inference speed on the available hardware.

Alternative Sequence Generation - CTranslate2 returns multiple most-likely tokens immediately after a forced prefix, completing each alternative independently.

Neural Network Layer Fusions - Combines adjacent neural network layers into single fused operations to reduce memory bandwidth and kernel launch overhead.

Beam Search Overhead Reducers - CTranslate2 disables score tracking and skips the final softmax layer when beam size is 1 and scores are not needed.

Decoder Prompt Forwarders - Passes input prompts directly into the decoder at once instead of re-encoding them, reducing redundant computation.

Dynamic Model Loaders - Switches between Transformer models at runtime by loading or releasing them from memory as needed.

Automatic Speech Recognition - Transcribes audio into text in real time using optimized speech recognition models on CPU or GPU.

Sequence Encoders - Runs encoder-only models like BERT to transform input text into dense vector representations for downstream tasks.

Fairseq Converters - Converts PyTorch models trained with Fairseq into an optimized format for faster inference and reduced memory usage.

Hugging Face Converters - Converts Hugging Face Transformer models into an optimized format for faster inference and reduced memory usage.

Marian Converters - Converts Marian-trained Transformer models into an optimized format for accelerated inference.

OpenNMT-py Converters - Converts PyTorch Transformer models trained with OpenNMT-py into an optimized format for inference.

OpenNMT-tf Converters - Converts Transformer models trained with OpenNMT-tf into the CTranslate2 format using a YAML configuration file.

OPUS-MT Converters - Converts pretrained OPUS-MT Transformer models into an optimized format for faster inference using a dedicated converter tool.

Multi-GPU Inference Runtimes - Distributes model execution across multiple GPUs using tensor parallelism to handle large models that exceed single-device memory.

Multi-GPU Distribution - Splits a large model across multiple GPUs using tensor parallelism to handle models that exceed single-device memory.

Disk Size Reducers - Converts model weights to lower-precision types during conversion to shrink file size while preserving accuracy.

OpenAI-Compatible Model Servers - Exposes optimized Transformer models through an OpenAI-compatible REST API for integration into external applications.

Multilingual Content Translation - CTranslate2 translates text between multiple languages using a converted Fairseq multilingual model by prefixing language tokens.

Parallel Inference Orchestrators - Distributes model execution across multiple CPU threads or GPU streams to increase throughput.

Static Prompt Caching - Pre-computes and stores the transformer state for a fixed system prompt so subsequent calls skip re-encoding.

On-Load Quantizers - Selects or changes the computation precision at load time, overriding the quantization used during conversion.

Conversion-Time Quantizers - Applies a chosen quantization type during model conversion to shrink the on-disk model file size.

Length-Based Feature Grouping - Sorts and groups input sequences by length before processing to maximize hardware utilization and minimize padding waste.

Tensor Parallelism - Splits model weights across multiple GPU devices so inference can proceed on models larger than a single device's memory.

Output Length Modifiers - CTranslate2 limits the minimum and maximum number of tokens the decoder generates, excluding the end-of-sequence token.

Translation Pair Scoring - Computes the likelihood of a given translation pair for evaluating model confidence or quality.

Sequence-to-Sequence Scoring - Computes log-probabilities for target token sequences to evaluate how well a model fits the input.

Interactive Sequence Generation - CTranslate2 autocompletes partial sequences or returns alternative tokens at a specific position during generation.

Transformer Text Encoders - Computes dense vector representations of input text using a Transformer encoder for downstream tasks.

Dynamic Inference Batching - Processes multiple requests in parallel across CPU cores or GPUs, with dynamic memory allocation per batch size.

Token Streaming - CTranslate2 returns tokens as they are generated by the model, enabling interactive output display.

Model Memory Managers - Controls the allocation and release of model weights and intermediate buffers to fit large models into limited device memory.

GPU Memory Allocators - Manages GPU memory through a custom allocator that caches and reuses allocations to avoid expensive cudaMalloc calls.
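The caching idea can be sketched in Python with host memory standing in for device memory. The class `CachingAllocator` is hypothetical, and it reuses blocks only on an exact size match, which is a simplification: real caching allocators typically round sizes into bins.

```python
class CachingAllocator:
    """Reuse released blocks instead of calling the raw allocator again."""

    def __init__(self, raw_alloc):
        self._raw_alloc = raw_alloc  # stand-in for an expensive call like cudaMalloc
        self._free = {}              # block size -> list of cached blocks
        self.raw_calls = 0

    def allocate(self, size):
        cached = self._free.get(size)
        if cached:
            return cached.pop()      # cache hit: no raw allocation needed
        self.raw_calls += 1
        return self._raw_alloc(size)

    def release(self, block):
        # Keep the block for later reuse rather than freeing it.
        self._free.setdefault(len(block), []).append(block)

alloc = CachingAllocator(bytearray)
first = alloc.allocate(1024)
alloc.release(first)
second = alloc.allocate(1024)  # served from the cache
```

The second request returns the same block as the first, so the raw allocator is called only once.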

Asynchronous Task Queues - Launches generation tasks that return results later, allowing the caller to overlap computation with other work.
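The calling pattern can be sketched with Python's standard `concurrent.futures` module. This is an analogy under stated assumptions, not the engine's queue: `fake_generate` is a hypothetical placeholder for a model call.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_generate(prompt):
    """Placeholder for a model call that returns generated tokens."""
    return prompt + ["<generated>"]

with ThreadPoolExecutor(max_workers=1) as pool:
    # Submit the task; the call returns immediately with a future.
    future = pool.submit(fake_generate, ["Hello"])
    # The caller is free to do unrelated work in the meantime.
    other_work = sum(range(1000))
    # Block only at the point where the result is actually needed.
    output = future.result()
```

The design choice is to separate submitting a request from waiting for it, so the waiting can be postponed until the result is needed.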

CPU Instruction Set Detection - Selects the optimal CPU code path at runtime based on detected hardware capabilities.

Sequence Likelihood Scores - Computes log-probability scores for token sequences to evaluate model confidence or quality.
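A sequence score of this kind is the sum of the per-token log-probabilities. The sketch below shows the arithmetic on invented distributions; `score_sequence` and the toy values are assumptions for illustration.

```python
import math

def score_sequence(step_probs, tokens):
    """Return (total log-probability, per-token log-probabilities)."""
    # step_probs[i] is the model's distribution at step i; tokens[i] is the
    # token whose probability is looked up at that step.
    log_probs = [math.log(dist[tok]) for dist, tok in zip(step_probs, tokens)]
    return sum(log_probs), log_probs

step_probs = [
    {"Hallo": 0.5, "Hi": 0.5},
    {"Welt": 0.8, "Erde": 0.2},
]
total, per_token = score_sequence(step_probs, ["Hallo", "Welt"])
```

Exponentiating the total gives the probability of the whole sequence, here 0.5 × 0.8.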

OpenAI-Compatible API Servers - Exposes an OpenAI-compatible API for integrating optimized models into external applications.

AI & Machine Learning - Fast inference engine for Transformer models.

Model Serving Engines - Fast C++ inference engine for Transformer-based models.

Inference Frameworks - C++-based inference engine for CPU and GPU acceleration.
