TransformerLens | Awesome Repository

TransformerLens is a library for mechanistic interpretability research designed to reverse engineer the learned algorithms within large language models. It provides a standardized framework for wrapping diverse transformer architectures, allowing researchers to extract, manipulate, and analyze internal activations and weights through a consistent interface.

The project distinguishes itself through a comprehensive system of activation hooks that can capture, patch, and ablate internal tensors during the forward pass. It includes specialized utilities for decomposing fused projections, materializing attention matrices from state space models, and mapping internal activations of multimodal vision and audio encoders.
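As a rough illustration of the capture and ablate pattern described above, here is a minimal, dependency-free sketch. It does not use TransformerLens itself: `HookPoint`, `ToyModel`, `run_with_hooks`, and the hook names `layer0.out` and `layer1.out` are hypothetical stand-ins chosen for this example, and the two "layers" are simple arithmetic rather than transformer blocks.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
# A hook receives the activation and the hook point's name. It may return a
# replacement activation, or None to leave the activation unchanged.
HookFn = Callable[[Vector, str], Optional[Vector]]


class HookPoint:
    """Named identity step; registered hooks may read or replace its value."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.hooks: List[HookFn] = []

    def __call__(self, value: Vector) -> Vector:
        for hook in self.hooks:
            result = hook(value, self.name)
            if result is not None:
                value = result
        return value


class ToyModel:
    """Two fixed arithmetic 'layers' with a hook point after each (hypothetical)."""

    def __init__(self) -> None:
        self.hook_points = {
            name: HookPoint(name) for name in ("layer0.out", "layer1.out")
        }

    def forward(self, x: Vector) -> Vector:
        h = self.hook_points["layer0.out"]([2.0 * v for v in x])
        return self.hook_points["layer1.out"]([v + 1.0 for v in h])

    def run_with_hooks(self, x: Vector, hooks: Dict[str, HookFn]) -> Vector:
        # Attach hooks for this single forward pass, then always detach them.
        for name, fn in hooks.items():
            self.hook_points[name].hooks.append(fn)
        try:
            return self.forward(x)
        finally:
            for point in self.hook_points.values():
                point.hooks.clear()


model = ToyModel()
cache: Dict[str, Vector] = {}


def capture(value: Vector, name: str) -> None:
    """Store a copy of the activation without changing the forward pass."""
    cache[name] = list(value)


def zero_ablate(value: Vector, name: str) -> Vector:
    """Replace the activation with zeros of the same length."""
    return [0.0 for _ in value]


baseline = model.run_with_hooks([1.0, 2.0], {"layer0.out": capture})
ablated = model.run_with_hooks([1.0, 2.0], {"layer0.out": zero_ablate})
print(baseline, cache, ablated)
```

Comparing `baseline` with `ablated` shows how much the final output depends on the ablated activation. Detaching the hooks in a `finally` block keeps one experiment from leaking into the next.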

The framework covers a broad range of analysis capabilities, including causal interventions, attention circuit analysis, and weight conversion from various pretrained formats. It also provides tools for token salience analysis, gradient computation, and the generation of interpretability benchmarks.
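One common form of causal intervention is activation patching: cache activations from a "clean" run, splice them one at a time into a "corrupted" run, and measure how much of the clean output each one restores. The sketch below shows the idea on a toy function, not on the library. `toy_forward`, the component names `head_a` and `head_b`, and the inputs are all assumptions made up for illustration.

```python
from typing import Dict, List, Optional


def toy_forward(
    x: List[float],
    patch: Optional[Dict[str, float]] = None,
    cache: Optional[Dict[str, float]] = None,
) -> float:
    """Hypothetical two-component model: the output is head_a + head_b."""
    patch = patch or {}
    acts = {"head_a": 2.0 * x[0], "head_b": 3.0 * x[1]}
    for name in acts:
        if name in patch:
            acts[name] = patch[name]  # overwrite with the supplied activation
        if cache is not None:
            cache[name] = acts[name]  # record the (possibly patched) activation
    return acts["head_a"] + acts["head_b"]


clean_input = [1.0, 1.0]
corrupted_input = [1.0, 0.0]  # differs from the clean input only in x[1]

clean_cache: Dict[str, float] = {}
clean_out = toy_forward(clean_input, cache=clean_cache)
corrupted_out = toy_forward(corrupted_input)

# Patch each clean activation into the corrupted run and compute the fraction
# of the clean-vs-corrupted output gap that the patch recovers.
effects: Dict[str, float] = {}
for name, clean_value in clean_cache.items():
    patched_out = toy_forward(corrupted_input, patch={name: clean_value})
    effects[name] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

print(effects)
```

In this toy setup only `head_b` reads the input that was corrupted, so patching it recovers the whole gap while patching `head_a` recovers none. The normalization assumes the clean and corrupted outputs differ; otherwise the denominator is zero.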

The library supports a wide array of model families through a system of architecture adapters, so the same analysis tools apply across models including Llama, Mistral, Gemma, and various Mixture of Experts architectures.

Features

Internal Activation Hooks - Provides hooks on internal activations that capture and modify tensors during the model forward pass.
Activation and Gradient Hooking - Creates forward and backward hooks to intercept and store model activations and gradients for analysis.
Interpretable ML Libraries - Provides a comprehensive suite of tools for reverse engineering learned algorithms in LLMs through internal activation analysis.
