# harvardnlp/annotated-transformer

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/harvardnlp-annotated-transformer).**

7,325 stars · 1,549 forks · Jupyter Notebook · MIT

## Links

- GitHub: https://github.com/harvardnlp/annotated-transformer
- Homepage: http://nlp.seas.harvard.edu/annotated-transformer
- awesome-repositories: https://awesome-repositories.com/repository/harvardnlp-annotated-transformer.md

## Topics

`annotated` `notebook` `python`

## Description

The Annotated Transformer is an educational resource that presents an annotated PyTorch implementation of the Transformer architecture for sequence-to-sequence tasks. It teaches attention mechanisms, in particular scaled dot-product attention and multi-head attention, through executable examples that walk through each component of the model.

The project covers the full Transformer pipeline, including stacked encoder-decoder layers with residual connections and layer normalization, sinusoidal positional encoding for order-aware representation, and masked self-attention decoding for auto-regressive generation. It also demonstrates label smoothing regularization to reduce overconfidence during training, and provides a framework for neural machine translation that encodes input sequences and decodes output sequences using attention mechanisms.
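
As a rough illustration of the sinusoidal positional encoding mentioned above, the following dependency-free Python sketch builds the table of position signals. It is not the notebook's PyTorch code: the function name `sinusoidal_positional_encoding` and its parameters are assumptions made for this example, and the sketch assumes an even `d_model`.

```python
import math


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of fixed sinusoidal position signals.

    Hypothetical helper for illustration; assumes d_model is even.
    """
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Each sine/cosine pair shares one frequency; frequencies fall
            # geometrically as the dimension index i grows.
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)      # even dimensions: sine
            table[pos][i + 1] = math.cos(angle)  # odd dimensions: cosine
    return table


# Each row would be added to the token embedding at that position.
pe = sinusoidal_positional_encoding(max_len=4, d_model=6)
```

Because the table depends only on position and dimension, it can be computed once and reused for every input sequence up to `max_len` tokens.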

The annotated code explains how multi-head self-attention works, how positional encoding injects sequence-order information, and how the model processes sequences through stacked self-attention and feed-forward layers. The resource is designed to help learners implement and train Transformer-based sequence models for translation tasks, with an explanation of each architectural component from the original paper, "Attention Is All You Need" (Vaswani et al., 2017).

## Tags

### Artificial Intelligence & ML

- [Encoder-Decoder Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/encoder-decoder-architectures.md) — Processes input through identical encoder layers and generates output via decoder layers with cross-attention.
- [Sequence Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/speech-processing/sequence-to-sequence-tasks/sequence-encoders.md) — Processes an input sequence through stacked self-attention and feed-forward layers to produce continuous representations. ([source](http://nlp.seas.harvard.edu/annotated-transformer/))
- [Causal Masking](https://awesome-repositories.com/f/artificial-intelligence-ml/masked-language-modeling/causal-masking.md) — Implements causal masking to prevent future token leakage during auto-regressive sequence generation.
- [Multi-Head Attention Mechanisms](https://awesome-repositories.com/f/artificial-intelligence-ml/multi-head-attention-mechanisms.md) — Splits queries, keys, and values across multiple parallel heads to learn distinct representation subspaces.
- [Sinusoidal Encodings](https://awesome-repositories.com/f/artificial-intelligence-ml/positional-encoding-techniques/sinusoidal-encodings.md) — Injects fixed sinusoidal signals into token embeddings to encode absolute and relative position information.
- [Residual Connection Implementations](https://awesome-repositories.com/f/artificial-intelligence-ml/residual-networks/residual-connection-implementations.md) — Wraps each sub-layer in a skip connection combined with layer normalization to stabilize the training of deep layer stacks.
- [Sequence Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-decoding-models/sequence-decoders.md) — Generates an output sequence token by token using masked self-attention and encoder-decoder attention. ([source](http://nlp.seas.harvard.edu/annotated-transformer/))
- [Sequence Learning Models](https://awesome-repositories.com/f/artificial-intelligence-ml/sequence-learning-models.md) — Builds models that transform input sequences into output sequences using attention mechanisms.
- [Transformer Architecture Implementation](https://awesome-repositories.com/f/artificial-intelligence-ml/transformer-architecture-implementation.md) — Implements the full Transformer architecture from the original paper for sequence-to-sequence tasks.
- [Positional Encodings](https://awesome-repositories.com/f/artificial-intelligence-ml/transformer-architecture-implementation/positional-encodings.md) — Injects sinusoidal signals into token embeddings to provide the model with information about token order. ([source](http://nlp.seas.harvard.edu/annotated-transformer/))
- [Educational Tutorials](https://awesome-repositories.com/f/artificial-intelligence-ml/multi-head-attention-mechanisms/educational-tutorials.md) — Provides annotated code examples that teach how multi-head self-attention and scaled dot-product attention work.
- [Neural Machine Translation](https://awesome-repositories.com/f/artificial-intelligence-ml/neural-machine-translation.md) — Trains sequence models with label smoothing and positional encoding for translation tasks.
- [Neural Machine Translation Frameworks](https://awesome-repositories.com/f/artificial-intelligence-ml/neural-machine-translation-frameworks.md) — Provides a framework for encoding input sequences and decoding output sequences using transformer layers.
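
The causal masking listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's code: the helper names `causal_mask` and `apply_mask` are assumptions made for this example, and the mask is represented as nested lists rather than tensors.

```python
def causal_mask(size):
    """Return a size x size matrix where [i][j] is True if position i may attend to j.

    Hypothetical helper: position i sees itself and earlier positions only.
    """
    return [[j <= i for j in range(size)] for i in range(size)]


def apply_mask(scores, mask, fill=float("-inf")):
    """Replace disallowed attention scores with `fill` (hypothetical helper).

    After a softmax, a score of -inf becomes a weight of exactly zero,
    so future tokens contribute nothing to the output.
    """
    return [
        [score if keep else fill for score, keep in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


mask = causal_mask(3)
masked_scores = apply_mask([[1.0, 2.0, 3.0]] * 3, mask)
```

The mask is lower triangular, so the first position attends only to itself while the last position attends to the whole prefix.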

### Education & Learning Resources

- [Paper Implementations](https://awesome-repositories.com/f/education-learning-resources/paper-implementations.md) — Provides an annotated code implementation of the Transformer architecture with explanations of each component.
- [PyTorch Deep Learning Examples](https://awesome-repositories.com/f/education-learning-resources/deep-learning-education/deep-learning-platforms/pytorch-deep-learning-examples.md) — Ships a deep learning model built with PyTorch for processing sequential data using attention mechanisms.

### Scientific & Mathematical Computing

- [Scaled Attention Computations](https://awesome-repositories.com/f/scientific-mathematical-computing/vector-dot-product-kernels/dot-product-computation/scaled-attention-computations.md) — Computes attention by dividing query-key dot products by the square root of the key dimension before the softmax, so that large dot products in high-dimensional spaces do not push the softmax into regions with very small gradients. ([source](http://nlp.seas.harvard.edu/annotated-transformer/))
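
To make the scaled attention computation concrete, here is a minimal sketch in plain Python with no tensor library. It is an illustration under stated assumptions rather than the notebook's implementation: the function names are chosen for this example, inputs are nested lists of floats, and masking and dropout are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors.

    Hypothetical helper for illustration; no masking or dropout.
    """
    d_k = len(keys[0])
    scale = math.sqrt(d_k)
    outputs = []
    for q in queries:
        # One scaled dot product per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# A zero query scores every key equally, so the output is the mean of the values.
out = scaled_dot_product_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

Dividing by `sqrt(d_k)` keeps the score magnitudes roughly independent of the key dimension, which is the point of the scaling step.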

### Data & Databases

- [Label Smoothing Techniques](https://awesome-repositories.com/f/data-databases/label-based-data-selection/metadata-labelers/label-smoothing-utilities/label-smoothing-techniques.md) — Demonstrates label smoothing regularization to reduce overconfidence during transformer training.
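
As a sketch of what label smoothing does to a training target, the snippet below builds a smoothed distribution in plain Python. It uses a common textbook formulation in which the off-target mass is spread evenly over the other classes; the notebook's own version may differ in details such as padding handling. The function name `smooth_labels` and its parameters are assumptions made for this example.

```python
def smooth_labels(target_index, num_classes, smoothing=0.1):
    """Return a smoothed target distribution over num_classes classes.

    Hypothetical helper: the true class gets 1 - smoothing and the
    remaining probability mass is shared equally by the other classes.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= smoothing < 1.0:
        raise ValueError("smoothing must be in [0, 1)")
    off_value = smoothing / (num_classes - 1)
    dist = [off_value] * num_classes
    dist[target_index] = 1.0 - smoothing
    return dist


# With five classes and smoothing 0.1, the target is no longer one-hot.
dist = smooth_labels(target_index=2, num_classes=5, smoothing=0.1)
```

Training against this softer target penalizes the model for assigning all of its probability to a single token, which is the overconfidence the technique is meant to reduce.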