# karpathy/minbpe

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/karpathy-minbpe).**

10,582 stars · 1,071 forks · Python · MIT

## Links

- GitHub: https://github.com/karpathy/minbpe
- awesome-repositories: https://awesome-repositories.com/repository/karpathy-minbpe.md

## Description

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

## Tags

### Artificial Intelligence & ML

- [Toolkits](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-algorithms/byte-pair-encodings/vocabulary-training/toolkits.md) — Provides a complete toolkit for learning BPE merge rules and building a vocabulary from text corpora.
- [Tokenizer Persisters](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/data-and-checkpointing/model-loading/model-persistence/tokenizer-persisters.md) — Ships built-in serialization for saving and loading trained tokenizer models to disk. ([source](https://cdn.jsdelivr.net/gh/karpathy/minbpe@master/README.md))
- [Byte Level Encodings](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-algorithms/byte-level-encodings.md) — Encodes UTF-8 text into byte-level token sequences using BPE merges for subword tokenization.
- [Byte Pair Encodings](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-algorithms/byte-pair-encodings.md) — Provides a clean implementation of the BPE algorithm used for subword tokenization in large language models.
- [Greedy Merge Encoding](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-algorithms/byte-pair-encodings/greedy-merge-encoding.md) — Encodes text by greedily applying learned byte pair merges until no more merges apply.
- [Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-algorithms/byte-pair-encodings/training.md) — Trains a BPE tokenizer on text data to learn merge rules and build a vocabulary for subword tokenization.
- [Byte-Level Vocabulary Trainers](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-algorithms/byte-pair-encodings/vocabulary-training/byte-level-vocabulary-trainers.md) — Trains byte-level BPE vocabularies and merge rules on text corpora for subword tokenization. ([source](https://cdn.jsdelivr.net/gh/karpathy/minbpe@master/README.md))
- [Text Tokenization](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/text-tokenization.md) — Converts raw text into integer token IDs using a trained BPE tokenizer for language model pipelines.
- [Byte-Level Tokenizers](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/tokenizers/byte-level-tokenizers.md) — Implements a byte-level tokenizer that maps integer IDs to raw byte sequences for full UTF-8 coverage.
- [Training](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/tokenizers/byte-level-tokenizers/training.md) — Provides a clean implementation of the BPE training algorithm to learn merge rules from text corpora.
- [Subword Tokenization](https://awesome-repositories.com/f/artificial-intelligence-ml/subword-tokenization.md) — Provides a framework for splitting text into subword units with support for special tokens and regex pre-tokenization.
- [Byte-Level Encoders](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers/byte-level-encoders.md) — Encodes raw text into integer token IDs using byte-level BPE merges for language model pipelines. ([source](https://cdn.jsdelivr.net/gh/karpathy/minbpe@master/README.md))
- [Token Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers/token-decoders.md) — Provides a decoder that reconstructs human-readable text from integer token IDs using the trained vocabulary. ([source](https://cdn.jsdelivr.net/gh/karpathy/minbpe@master/README.md))
- [Greedy Merge Application](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/architectures/instruction-tuned-language-models/weight-space-merging-techniques/weight-merging-utilities/merge-pipelines/greedy-merge-application.md) — Implements greedy merge application that repeatedly applies the highest-priority learned byte pair merge.
- [Merge Tables](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-algorithms/byte-pair-encodings/merge-tables.md) — Ships a priority-ordered merge table that stores learned byte pair rules for sequential application.
- [Category-Based Splitters](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/text-tokenization/recursive-text-splitting/category-based-splitters.md) — Splits text by character category before tokenization to prevent cross-category BPE merges.
- [Persistence](https://awesome-repositories.com/f/artificial-intelligence-ml/vocabulary-management/tokenizer-vocabulary-merging/persistence.md) — Saves and loads a trained BPE tokenizer's vocabulary and merge rules to disk for reuse across sessions.
- [Serialization](https://awesome-repositories.com/f/artificial-intelligence-ml/vocabulary-management/tokenizer-vocabulary-merging/serialization.md) — Provides a module for saving and loading trained tokenizer models with vocabulary and merge tables to disk.
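
The training, greedy merge encoding, and decoding steps described in the tags above can be sketched in a few functions. This is a minimal illustration of byte-level BPE in general, not the repository's own code: the function names (`pair_counts`, `merge`, `train`, `encode`, `decode`) and the tie-breaking behaviour are assumptions made for this example.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token IDs occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn up to (vocab_size - 256) merges on top of the 256 byte values."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (id, id) -> new id; a lower new id means an earlier merge
    vocab = {i: bytes([i]) for i in range(256)}
    for new_id in range(256, vocab_size):
        counts = pair_counts(ids)
        if not counts:
            break  # nothing left to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return merges, vocab


def encode(text, merges):
    """Greedily apply the earliest-learned merge until none applies."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        pair = min(pair_counts(ids), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge matches any adjacent pair
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, vocab):
    """Concatenate the bytes behind each ID and decode them as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges, vocab = train("aaabdaaabac", vocab_size=259)
ids = encode("aaabdaaabac", merges)
assert decode(ids, vocab) == "aaabdaaabac"  # encoding round-trips
```

Storing the merges in a dictionary keyed by pair, with the new ID as the value, lets the new ID double as the merge priority: encoding always picks the pair with the lowest ID, which reproduces the order in which merges were learned.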

### Data & Databases

- [Category-Based Splitters](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-transformation/text-nlp-preprocessing/text-preprocessing/category-based-splitters.md) — Implements regex-based text splitting by category to prevent cross-category BPE merges during tokenization. ([source](https://cdn.jsdelivr.net/gh/karpathy/minbpe@master/README.md))
- [Pre-Tokenization Splitters](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/text-segmentation/pattern-based-text-segmenters/regex-based-ocr-cleaners/pre-tokenization-splitters.md) — Ships a regex-based pre-splitter that segments text into categories before BPE merges.
- [Regex Column Splitting](https://awesome-repositories.com/f/data-databases/wide-column-stores/column-oriented-disk-storage/regex-column-splitting.md) — Splits input text into categories like letters, numbers, and punctuation using a configurable regex pattern.
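
As a rough illustration of category-based pre-splitting, the sketch below cuts text into runs of letters, digits, punctuation, and whitespace before any merges would be applied. The pattern is a simplified stand-in written for the standard-library `re` module and is an assumption of this example, not the pattern the repository uses; `split_chunks` and `to_byte_chunks` are hypothetical names.

```python
import re

# Simplified, assumed pattern: an optional leading space followed by a run of
# letters, digits, or punctuation, or else a run of whitespace.
SPLIT_PATTERN = re.compile(r" ?[^\W\d_]+| ?\d+| ?(?:[^\s\w]|_)+|\s+")


def split_chunks(text):
    """Split text into category-homogeneous chunks."""
    return SPLIT_PATTERN.findall(text)


def to_byte_chunks(text):
    """Turn each chunk into its own list of byte IDs.

    BPE merges would then run inside each list separately, so a merge can
    never join, say, a letter with a following digit or punctuation mark.
    """
    return [list(chunk.encode("utf-8")) for chunk in split_chunks(text)]


chunks = split_chunks("Hello world 123!!")
assert "".join(chunks) == "Hello world 123!!"  # splitting loses no characters
```

Because every character falls into exactly one alternative of the pattern, joining the chunks reproduces the input, so the split only constrains where merges may happen and never alters the text.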

### Programming Languages & Runtimes

- [Special Token Registries](https://awesome-repositories.com/f/programming-languages-runtimes/regular-expression-engines/tokenizers/special-token-registries.md) — Provides a mechanism to register custom tokens with reserved IDs that are never split or merged. ([source](https://cdn.jsdelivr.net/gh/karpathy/minbpe@master/README.md))
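
One way to keep registered special tokens from being split or merged is to cut the text at exact matches of those tokens first and only pass the remaining pieces to the ordinary encoder. The sketch below shows that idea under stated assumptions: `encode_with_specials` and the `encode_ordinary` callback are hypothetical names, and the reserved ID in the usage lines is an arbitrary example value.

```python
import re


def encode_with_specials(text, special_tokens, encode_ordinary):
    """Encode text, mapping each special token string to its reserved ID.

    special_tokens:  dict of token string -> reserved integer ID
    encode_ordinary: callable that encodes text containing no special tokens
    """
    if not special_tokens:
        return encode_ordinary(text)
    # A capturing group makes re.split keep the matched special tokens.
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in special_tokens:
            ids.append(special_tokens[part])  # emitted whole, never merged
        elif part:
            ids.extend(encode_ordinary(part))
    return ids


# Usage with a stand-in ordinary encoder that emits raw UTF-8 bytes.
specials = {"<|endoftext|>": 100257}
raw_bytes = lambda s: list(s.encode("utf-8"))
ids = encode_with_specials("hi<|endoftext|>", specials, raw_bytes)
assert ids == [104, 105, 100257]
```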

### Software Engineering & Architecture

- [Tokenizer State Persistence](https://awesome-repositories.com/f/software-engineering-architecture/architectural-design-patterns/state-management/persistence-and-serialization/tokenizer-state-persistence.md) — Saves and restores the full tokenizer state including vocabulary, merge table, and special tokens to disk.
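
Persisting a tokenizer of this kind mostly comes down to writing out the split pattern, the special tokens, and the ordered merge table, since the vocabulary can be rebuilt from the merges. The sketch below uses a JSON layout invented for this example; it is not the repository's file format, and `save_tokenizer` and `load_tokenizer` are hypothetical names.

```python
import json


def save_tokenizer(path, merges, special_tokens, pattern=""):
    """Write the tokenizer state to a JSON file (hypothetical layout)."""
    state = {
        "pattern": pattern,
        "special_tokens": special_tokens,
        # JSON has no tuple keys, so store each merge as [left, right, new_id].
        "merges": [[a, b, new_id] for (a, b), new_id in merges.items()],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)


def load_tokenizer(path):
    """Read the state back and rebuild the vocabulary from the merges."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    merges = {(a, b): new_id for a, b, new_id in state["merges"]}
    vocab = {i: bytes([i]) for i in range(256)}
    # Replay merges in the order they were learned (ascending new ID) so that
    # both halves of every pair already exist in the vocabulary.
    for (a, b), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[a] + vocab[b]
    return merges, vocab, state["special_tokens"], state["pattern"]
```

Only the merges need to be stored because each vocabulary entry is the concatenation of the two entries it was merged from; replaying the merges in order reconstructs the same byte strings on load.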
