TransformerLens is a library for mechanistic interpretability research, designed to reverse-engineer the algorithms that large language models learn. It wraps diverse transformer architectures in a standardized framework, so researchers can extract, manipulate, and analyze internal activations and weights through one consistent interface.
The project's distinguishing feature is a system of activation hooks that can capture, patch, and ablate internal tensors during the forward pass. It also includes specialized utilities for decomposing fused projections, materializing attention matrices from state-space models, and mapping the internal activations of multimodal vision and audio encoders.
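To make the capture, patch, and ablate vocabulary concrete, here is a minimal sketch of the hook-point pattern in dependency-free Python. It is a toy, not the TransformerLens API: `ToyModel`, its hook names (`"hidden"`, `"shifted"`), and the helper functions are all assumptions made up for illustration, and plain lists of floats stand in for tensors.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]
# A hook receives the activation and the hook point's name. Returning a
# value replaces the activation; returning None leaves it unchanged.
HookFn = Callable[[Vector, str], Optional[Vector]]


class ToyModel:
    """Hypothetical two-step 'network' with a named hook point after each step."""

    def __init__(self) -> None:
        self.hooks: Dict[str, List[HookFn]] = {}

    def _hook_point(self, name: str, value: Vector) -> Vector:
        # Run every hook registered under this name, in order.
        for fn in self.hooks.get(name, []):
            result = fn(value, name)
            if result is not None:
                value = result
        return value

    def forward(self, x: Vector) -> float:
        hidden = self._hook_point("hidden", [2.0 * v for v in x])
        shifted = self._hook_point("shifted", [v + 1.0 for v in hidden])
        return sum(shifted)

    def run_with_hooks(self, x: Vector, fwd_hooks: List[Tuple[str, HookFn]]) -> float:
        # Hooks are temporary: installed for one forward pass, then removed.
        self.hooks = {}
        for name, fn in fwd_hooks:
            self.hooks.setdefault(name, []).append(fn)
        try:
            return self.forward(x)
        finally:
            self.hooks = {}


model = ToyModel()
cache: Dict[str, Vector] = {}


def capture(value: Vector, name: str) -> None:
    # Capture: store a copy of the activation without changing it.
    cache[name] = list(value)


def ablate(value: Vector, name: str) -> Vector:
    # Ablate: zero the activation to measure how much the output relies on it.
    return [0.0] * len(value)


def make_patch(source: Vector) -> HookFn:
    # Patch: overwrite the activation with one recorded on a different input.
    return lambda value, name: list(source)


clean_out = model.run_with_hooks([1.0, 2.0], [("hidden", capture)])
ablated_out = model.run_with_hooks([1.0, 2.0], [("hidden", ablate)])
corrupt_out = model.run_with_hooks([5.0, 5.0], [])
patched_out = model.run_with_hooks([5.0, 5.0], [("hidden", make_patch(cache["hidden"]))])
```

In this toy, patching the cached `"hidden"` activation into the run on the second input restores the clean output exactly, because everything downstream depends only on that activation. In a real transformer, the size of that recovery is the quantity a patching experiment measures.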
The framework supports causal interventions, attention circuit analysis, and weight conversion from various pretrained formats. It also provides tools for token salience analysis, gradient computation, and the generation of interpretability benchmarks.
The library supports a wide array of model families through a system of architecture adapters, so the same analysis workflow applies to models including Llama, Mistral, Gemma, and various Mixture-of-Experts architectures.
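One way to picture an architecture adapter is as a per-family mapping from a checkpoint's parameter names to a shared naming scheme. The sketch below is a hypothetical illustration of that idea only: `ArchitectureAdapter`, `LlamaStyleAdapter`, `get_adapter`, and the target names are assumptions, not the library's actual classes. It also covers renaming only, whereas real weight conversion typically involves reshaping tensors as well.

```python
import re
from typing import Any, Dict, List, Tuple, Type


class ArchitectureAdapter:
    """Hypothetical base class: renames family-specific keys to shared ones."""

    # (regex pattern, replacement) pairs; the first matching rule wins.
    rename_rules: List[Tuple[str, str]] = []

    def convert(self, state_dict: Dict[str, Any]) -> Dict[str, Any]:
        converted: Dict[str, Any] = {}
        for key, value in state_dict.items():
            new_key = key  # keys with no matching rule pass through unchanged
            for pattern, replacement in self.rename_rules:
                candidate, count = re.subn(pattern, replacement, key)
                if count:
                    new_key = candidate
                    break
            converted[new_key] = value
        return converted


class LlamaStyleAdapter(ArchitectureAdapter):
    # Illustrative rules only; a full adapter would cover every parameter.
    rename_rules = [
        (r"^model\.embed_tokens\.weight$", "embed.W_E"),
        (r"^model\.layers\.(\d+)\.self_attn\.q_proj\.weight$", r"blocks.\1.attn.W_Q"),
    ]


# Registry that selects an adapter by model family name (hypothetical).
ADAPTERS: Dict[str, Type[ArchitectureAdapter]] = {"llama": LlamaStyleAdapter}


def get_adapter(family: str) -> ArchitectureAdapter:
    if family not in ADAPTERS:
        raise ValueError(f"no adapter registered for family {family!r}")
    return ADAPTERS[family]()


# Strings stand in for weight tensors to keep the sketch dependency-free.
converted = get_adapter("llama").convert({
    "model.embed_tokens.weight": "E",
    "model.layers.3.self_attn.q_proj.weight": "Q3",
    "lm_head.bias": "b",
})
```

Keeping the family-specific knowledge in one small class means analysis code can address parameters by the shared names regardless of which checkpoint was loaded.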