Mistral Inference is a library for running Mistral large language models on a GPU, generating text from prompts with token streaming. It loads pretrained model weights from local disk or a remote registry into GPU memory, then produces output tokens one by one for real-time display in interactive applications. The library supports multimodal prompts that accept image URLs alongside text, enabling visual description and reasoning. It includes content safety guardrails that scan generated text against predefined policies to block or flag policy violations. For structured interactions, it provid
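The token-by-token streaming described in the Mistral Inference entry can be illustrated with a small generator. This is a minimal sketch, not the library's API: `stream_tokens`, `make_toy_model`, and the `EOS` marker are hypothetical names, and a canned reply stands in for a real model.

```python
from typing import Callable, Iterator, List

EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example


def stream_tokens(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    max_new_tokens: int = 16,
) -> Iterator[str]:
    """Yield generated tokens one at a time, stopping at EOS or the length cap."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)
        if token == EOS:
            return
        context.append(token)
        yield token  # the caller can display this before the next token exists


def make_toy_model(prompt_len: int, reply: List[str]) -> Callable[[List[str]], str]:
    """Stand-in for a real model: replays a canned reply, then emits EOS."""

    def next_token(context: List[str]) -> str:
        produced = len(context) - prompt_len  # tokens generated so far
        return reply[produced] if produced < len(reply) else EOS

    return next_token


def demo() -> str:
    prompt = ["Say", " hi"]
    model = make_toy_model(len(prompt), ["Hello", ",", " world", "!"])
    pieces = []
    for token in stream_tokens(prompt, model):
        pieces.append(token)  # an interactive app would print and flush here
    return "".join(pieces)
```

Because `stream_tokens` is a generator, the consumer receives each token as soon as it is produced instead of waiting for the full completion, which is what makes real-time display possible.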
Long Llama is a transformer-based language model and fine-tuning framework designed to process input sequences far longer than standard context limits while maintaining logical coherence. Using the Focused Transformer (FoT) approach, the project trains attention layers to track distant tokens, which lets models handle massive documents or entire books. Its specialized attention mechanisms allow it to process hundreds of thousands of tokens. It incorporates memory-efficient inference techniques, such as key-value caching and
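Key-value caching, mentioned in the Long Llama entry, can be shown with a toy single-head attention in pure Python. This is a sketch under stated assumptions, not the project's implementation: `attend`, `KVCache`, and `full_recompute` are hypothetical names, and the query, key, and value vectors are given directly rather than projected from hidden states.

```python
import math
from typing import List

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class KVCache:
    """Keeps keys and values of past tokens so each step appends only one entry."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, query: Vector, key: Vector, value: Vector) -> Vector:
        # In a real model, key and value would be computed once for the new
        # token and reused on every later step instead of being recomputed.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


def full_recompute(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Baseline: causal attention rebuilt from scratch at every position."""
    return [
        attend(queries[t], keys[: t + 1], values[: t + 1])
        for t in range(len(queries))
    ]
```

Feeding tokens through `KVCache.step` one at a time gives the same outputs as `full_recompute`. The cache trades memory, which grows with sequence length, for not recomputing keys and values of earlier tokens.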
This project is a collection of reference implementations and recipes for deploying, fine-tuning, and running inference with Llama large language models. It serves as a toolkit and implementation guide for adapting pre-trained models to specific tasks and domain-specific datasets. The repository provides frameworks for developing retrieval augmented generation pipelines to ground model responses in external data. It includes guides for executing quantized inference to reduce memory usage and increase processing speed. The toolkit covers a broad range of capabilities including parameter-effic
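The retrieval augmented generation pipelines mentioned in this entry follow a retrieve-then-prompt pattern, sketched below. This is an illustrative toy, not one of the repository's recipes: `tokenize`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers, and bag-of-words cosine similarity stands in for a learned embedding model and vector store.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts; a real pipeline would use embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Ground the question in retrieved text before it is sent to the model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The resulting string would be passed to the language model as its prompt, so the answer is drawn from the retrieved passages rather than from the model's parameters alone.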
GLM-4 is a large language model and fine-tuning framework designed for human-like text production, complex reasoning, and multilingual conversation. It functions as a multimodal system capable of processing high-resolution visual content and as a long-context model designed to analyze documents with a context window of up to one million tokens. The project differentiates itself through a function calling interface that enables AI agent development by connecting the model to external APIs and real-time web browsing. It includes specialized capabilities for generating functional programming cod
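A function calling interface of the kind described in the GLM-4 entry generally has the model emit a structured call that the host program executes. The sketch below shows that dispatch step only. The JSON shape (`name` plus `arguments`), the `tool` decorator, `dispatch`, and `get_weather` are all assumptions made for illustration, not GLM-4's actual wire format or API.

```python
import json
from typing import Any, Callable, Dict

# Registry of callable tools, keyed by the name the model is told about.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a Python function so the dispatcher can call it by name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical tool; a real one would query an external API."""
    return {"city": city, "forecast": "sunny"}


def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and return a JSON result."""
    try:
        call = json.loads(model_output)
        name = call["name"]
        args = call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return json.dumps({"error": f"malformed tool call: {exc}"})
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps({"name": name, "result": TOOLS[name](**args)})
```

In an agent loop, the string returned by `dispatch` would be appended to the conversation so the model can use the tool result in its next turn. Errors are returned as data rather than raised, which gives the model a chance to correct a malformed call.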