huggingface/transformers

Transformers

Features

  • API Frameworks: A comprehensive training API for models that supports distributed training, mixed precision, and integration with various hardware accelerators.
  • Hybrid Parallelism Strategies: A training approach combining data, pipeline, and tensor parallelism to scale large language models across multi-node, multi-GPU clusters.
  • Byte Pair Encodings: A subword tokenization algorithm that iteratively merges the most frequent adjacent character pairs to build a vocabulary.
  • Vision Transformers: A computer vision model that processes images by splitting them into fixed-size patches and treating the patches as a sequence of tokens.
  • Chat Template Formatters: A method for formatting chat history into the specific token sequences and control tokens required by a model's chat structure.
  • Large Model Optimizations: Optimizations for large models including automatic device mapping, half-precision weight support, and quantization to reduce memory footprint and accelerate inference.
  • Qwen2 Language Models: A family of pretrained and instruction-tuned large language models featuring group query attention, rotary positional embeddings, and support for long context lengths.
  • Checkpoint Resumption: A capability to resume training from a specific checkpoint path, restoring optimizer, scheduler, and random number generator states.
  • Batched Inference Mechanisms: A batch-processing mechanism that accepts lists of conversation sequences to enable efficient inference across multiple chat sessions in a single forward pass.
  • Attention Mechanisms: A registry-based interface for managing and extending attention functions, allowing models to register custom implementations or locally overwrite existing mechanisms.
  • Model Quantization: A collection of quantization methods to reduce model memory requirements by storing weights in lower precision while balancing accuracy and compression.
  • Tokenizer Base Interfaces: A base class providing a unified interface for tokenization, encoding, decoding, and vocabulary management across different tokenizer backends.
  • Tool Calling Patterns: A pattern for tool invocation that appends assistant-generated function requests and subsequent tool-role results to the conversation message list.
  • Transformers Integration Layers: A model loader that integrates with standard transformer libraries to handle device mapping, quantization, and attention backends, while extending the training loop with custom mixins.
  • Multimodal Input Handlers: A capability for multimodal models to process mixed-modality inputs, such as images, video, or audio, by specifying input types within the content structure.
  • Data Parallelism: A training strategy that evenly distributes data across multiple GPUs, where each GPU holds a model copy and synchronizes results to reduce training time.
  • Sequence-to-Sequence Translation Tasks: A sequence-to-sequence framework for converting text between languages, supporting model fine-tuning, dataset preprocessing, evaluation, and inference.
  • Tool Calling Supports: Native support for structured function calling, allowing models to generate function requests that can be executed by the host application.
  • Chunked Prefill Mechanisms: A technique that splits long prompt processing across multiple forward passes to prevent blocking other requests during generation.
  • Text Classification Tasks: A machine learning task that assigns labels to text sequences, commonly used for sentiment analysis or categorization.
  • Mixture of Experts: A workflow for mixture-of-experts models that captures expert routing indices during inference and replays them during training passes to maintain consistent expert paths.
  • Document Question Answering Pipelines: A high-level pipeline interface for performing document question answering inference by passing image and question inputs to a model.
  • Distributed Training Integrations: An integration layer for loading models directly into a distributed training framework, leveraging native components while utilizing parallelization and optimization techniques.
  • Generation Continuation Modes: A configuration option that allows the model to continue generating from the last message in the chat history rather than initiating a new assistant turn.
  • Asynchronous Batching Execution: An execution strategy that overlaps CPU request preparation with GPU computation using multiple streams and graph-based execution to improve performance.
  • Prompt Lookup Decoding: An optimization technique that proposes candidate tokens by identifying and copying repeating n-grams from the input prompt, avoiding the need for an external assistant model.
  • Edge Model Inference Runtimes: A lightweight runtime for edge device model inference that exports models into a portable format with ahead-of-time memory planning and hardware-specific operation dispatch.
  • Parallel Loading: Integration with tensor parallelism that shards tensors during materialization, allowing each rank to load only the necessary portion of the weight data.
  • Byte Level Encodings: A variant of subword tokenization that uses byte values as the base vocabulary, ensuring every word can be tokenized without requiring an unknown token.
  • Memory Efficient Evaluation: A technique for memory-efficient evaluation by offloading accumulated predictions to the CPU and preprocessing logits at the batch level.
  • Paged KV Cache Management: A memory management system using fixed-size blocks to store key-value cache states, enabling efficient memory sharing and preventing fragmentation.
  • Configuration Management: A configuration class that centralizes hyperparameters, optimization settings, logging preferences, and infrastructure choices.
  • Training Flow Managers: A built-in callback that manages logging, evaluation, and checkpointing schedules based on training arguments, with support for customization.
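
The Byte Pair Encodings entry above describes a merge loop that repeatedly fuses the most frequent adjacent pair. The following is a minimal pure-Python sketch of that loop, not the library's tokenizer code. The function name learn_bpe_merges and its parameters are hypothetical, chosen only for illustration, and real tokenizers add pre-tokenization, special tokens, and a deterministic tie-breaking rule that this sketch leaves out.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (illustrative only)."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges
```

Calling learn_bpe_merges(["low", "low", "lower", "lowest"], 2) returns the first two merge rules as pairs of symbols. The learned vocabulary is the set of base characters plus one new symbol per merge.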
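
The Prompt Lookup Decoding entry above describes proposing candidate tokens by copying from a repeated n-gram in the prompt. The sketch below covers only that proposal step, under the assumption that tokens are plain integers in a list. The name propose_from_prompt and the parameters ngram_size and num_candidates are hypothetical, and the step where the model verifies or rejects the proposed tokens is omitted.

```python
def propose_from_prompt(token_ids, ngram_size=2, num_candidates=3):
    """Return tokens that followed an earlier copy of the trailing n-gram, or []."""
    if len(token_ids) < ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan backwards so the most recent earlier occurrence wins.
    # The starting index excludes the trailing n-gram itself.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            # Copy the tokens that came right after the earlier occurrence.
            return token_ids[follow:follow + num_candidates]
    return []
```

For the sequence [1, 2, 3, 4, 1, 2], the trailing pair [1, 2] also appears at the start, so the function proposes the tokens that followed it there. When no earlier match exists, the empty result signals a fall back to ordinary one-token-at-a-time decoding.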
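
The Paged KV Cache Management entry above describes storing cache states in fixed-size blocks. The sketch below illustrates only the bookkeeping side of that idea: a per-sequence block table that maps token positions to (block, slot) pairs. The class BlockAllocator and its methods are hypothetical names, no tensors are stored, and block sharing between sequences is not modeled.

```python
class BlockAllocator:
    """Toy allocator that hands out fixed-size cache blocks on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of block ids, in order
        self.seq_lengths = {}   # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (block_id, slot_in_block)."""
        length = self.seq_lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lengths.pop(seq_id, None)
```

Because a sequence claims a block only when its previous block is full, at most block_size - 1 slots per sequence sit unused, and freed blocks can be reused by any other sequence regardless of where they sit in memory.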