# google/sentencepiece

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/google-sentencepiece).**

11,657 stars · 1,325 forks · C++ · apache-2.0

## Links

- GitHub: https://github.com/google/sentencepiece
- awesome-repositories: https://awesome-repositories.com/repository/google-sentencepiece.md

## Topics

`natural-language-processing` `neural-machine-translation` `word-segmentation`

## Description

SentencePiece is a text segmentation engine and tokenization library for machine learning workflows. It converts raw text into subword pieces or integer identifiers, giving neural networks a consistent input representation for training and inference. Segmentation models are trained directly from raw text, so a vocabulary can be tailored to a specific domain.

The project distinguishes itself through byte fallback: characters missing from the vocabulary are decomposed into UTF-8 bytes, so every input can be represented without unknown tokens. It uses probabilistic subword modeling and stochastic sampling of alternative segmentations to improve model robustness during training. To handle large-scale datasets, the engine uses memory-mapped model loading and thread-safe, parallelized processing, which distributes encoding and decoding tasks across multiple CPU cores.
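
The byte-fallback idea can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the greedy longest-match strategy, the function names, and the toy vocabulary are assumptions made for illustration. Only the `<0xNN>` spelling of byte pieces follows the convention SentencePiece uses.

```python
def encode_with_byte_fallback(text, vocab):
    """Greedy longest-match segmentation (illustrative only).

    Characters that no vocabulary piece covers are emitted as one
    <0xNN> piece per UTF-8 byte, so nothing maps to an unknown token.
    """
    pieces = []
    max_len = max(len(p) for p in vocab)
    i = 0
    while i < len(text):
        # Try the longest candidate piece starting at position i first.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No piece matched: fall back to the character's UTF-8 bytes.
            pieces.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return pieces


def decode_with_byte_fallback(pieces):
    """Reassemble text, turning <0xNN> pieces back into raw bytes."""
    out = bytearray()
    for p in pieces:
        if len(p) == 6 and p.startswith("<0x") and p.endswith(">"):
            out.append(int(p[3:5], 16))
        else:
            out.extend(p.encode("utf-8"))
    return out.decode("utf-8")
```

With the hypothetical vocabulary `{"he", "llo"}`, the space and the accented character in `"hello é"` are not covered, so they come out as byte pieces, and decoding restores the original string.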

Beyond core segmentation, the library includes a deterministic normalization pipeline that manages Unicode transformations and whitespace formatting to ensure consistent text representation. It also provides granular control over vocabulary composition, including the reservation of special control symbols, the enforcement of atomic token definitions, and the ability to map tokens back to their original character positions for precise alignment.
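
Mapping tokens back to character positions can be sketched as follows. This is a simplified illustration rather than the library's API: the helper name `piece_spans` is hypothetical, and it assumes the pieces were produced from the text with each space replaced by the `▁` (U+2581) marker, one marker prepended as a dummy prefix, and no other normalization applied.

```python
WS = "\u2581"  # the whitespace marker SentencePiece uses in pieces


def piece_spans(text, pieces):
    """Return (piece, begin, end) spans so that text[begin:end] is the
    surface form of each piece (illustrative only)."""
    spans = []
    pos = 0
    for k, piece in enumerate(pieces):
        surface = piece.replace(WS, " ")
        if k == 0 and surface.startswith(" "):
            # The dummy prefix has no corresponding source character.
            surface = surface[1:]
        end = pos + len(surface)
        if text[pos:end] != surface:
            raise ValueError(f"piece {piece!r} does not align at offset {pos}")
        spans.append((piece, pos, end))
        pos = end
    return spans
```

Because the spans are contiguous and cover the input, slicing the original text with them recovers the exact substring behind each token, which is what alignment-sensitive tasks such as span extraction need.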

## Tags

### Artificial Intelligence & ML

- [Subword Tokenization](https://awesome-repositories.com/f/artificial-intelligence-ml/subword-tokenization.md) — Trains and applies subword segmentation models using Unigram or BPE algorithms for natural language processing.
- [Text Tokenizers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers.md) — Transforms raw text into subword units or numerical identifiers using trained segmentation models. ([source](https://github.com/google/sentencepiece/blob/master/python/README.md))
- [Natural Language Processing](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing.md) — Converts raw text into subword units or numerical identifiers to prepare data for large language models.
- [Machine Learning Training](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/machine-learning-training.md) — Transforms raw text into numerical identifiers and manages vocabulary constraints for neural network training.
- [Natural Language Processing Libraries](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing-libraries.md) — Provides a collection of tools for normalizing, encoding, and decoding text into subword units.
- [Byte-Level Tokenizers](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/tokenizers/byte-level-tokenizers.md) — Decomposes unknown characters into UTF-8 byte sequences to ensure full vocabulary coverage without unknown tokens. ([source](https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md))
- [Data Preprocessing](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-training-and-tuning/data-and-checkpointing/data-preprocessing.md) — Cleans and normalizes text inputs while managing vocabulary constraints for neural network models.
- [Byte Level Encodings](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/tokenization-algorithms/byte-level-encodings.md) — Ensures full vocabulary coverage by falling back to UTF-8 byte pieces for characters that are not in the vocabulary.
- [Subword Regularization Methods](https://awesome-repositories.com/f/artificial-intelligence-ml/subword-tokenization/subword-regularization-methods.md) — Improves model robustness during training by sampling multiple possible token sequences for a single input. ([source](https://github.com/google/sentencepiece/blob/master/python/README.md))
- [Token Alignment Trackers](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers/token-alignment-trackers.md) — Maps individual tokens back to their original character or byte positions for precise text alignment and extraction. ([source](https://github.com/google/sentencepiece/blob/master/python/README.md))
- [Segmentation Model Training](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-segmentation/segmentation-model-training.md) — Creates new tokenization models from raw text data using flexible input sources. ([source](https://github.com/google/sentencepiece/blob/master/python/README.md))
- [Vocabulary Management](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/language-tools/dictionary-management-utilities/vocabulary-management.md) — Provides tools for defining specialized control symbols and atomic tokens to handle domain-specific requirements.
- [Token Decoders](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers/token-decoders.md) — Reconstructs original raw text from sequences of subword pieces or numerical identifiers. ([source](https://github.com/google/sentencepiece/blob/master/README.md))
- [Vocabulary Management](https://awesome-repositories.com/f/artificial-intelligence-ml/vocabulary-management.md) — Applies rules on subword length and character boundaries to control the structure and composition of the generated vocabulary. ([source](https://github.com/google/sentencepiece/blob/master/doc/options.md))
- [Segmentation Boundary Enforcers](https://awesome-repositories.com/f/artificial-intelligence-ml/computer-vision-systems/image-segmentation/segmentation-model-training/segmentation-boundary-enforcers.md) — Enforces hard segmentation boundaries using custom delimiters to influence tokenization logic. ([source](https://github.com/google/sentencepiece/blob/master/doc/piece_constraints.md))
- [Memory-Mapped Weight Loaders](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-inference-serving/inference-optimization/memory-mapped-weight-loaders.md) — Maps model weight files directly into process memory to reduce RAM usage and improve load times.
- [Vocabulary Usage Restrictions](https://awesome-repositories.com/f/artificial-intelligence-ml/vocabulary-management/vocabulary-usage-restrictions.md) — Limits tokenization output to a specific subset of allowed symbols to control the active vocabulary during inference. ([source](https://github.com/google/sentencepiece/blob/master/README.md))
- [Control Symbol Reservoirs](https://awesome-repositories.com/f/artificial-intelligence-ml/control-symbol-reservoirs.md) — Reserves specific vocabulary identifiers for model control flow that do not participate in text segmentation. ([source](https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md))
- [Atomic Token Definitions](https://awesome-repositories.com/f/artificial-intelligence-ml/custom-model-definitions/tokenization-definitions/atomic-token-definitions.md) — Preserves the integrity of specific character sequences by treating them as indivisible tokens during encoding. ([source](https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md))
- [Language Model Architectures](https://awesome-repositories.com/f/artificial-intelligence-ml/language-model-architectures.md) — Uses probabilistic models to determine the most likely subword segmentation by evaluating token combinations.
- [Model Training Optimizers](https://awesome-repositories.com/f/artificial-intelligence-ml/model-training-optimizers.md) — Controls corpus loading, shuffling, and character coverage thresholds to optimize model training. ([source](https://github.com/google/sentencepiece/blob/master/doc/options.md))
- [Segmentation Restriction Utilities](https://awesome-repositories.com/f/artificial-intelligence-ml/subword-tokenization/segmentation-restriction-utilities.md) — Enforces constraints on token formation to ensure valid vocabulary generation. ([source](https://github.com/google/sentencepiece/blob/master/doc/piece_constraints.md))
- [Whitespace Formatting Utilities](https://awesome-repositories.com/f/artificial-intelligence-ml/text-tokenizers/whitespace-formatting-utilities.md) — Configures whitespace handling rules to ensure consistent text representation during the tokenization process. ([source](https://github.com/google/sentencepiece/blob/master/doc/normalization.md))
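
The probabilistic segmentation mentioned in the tags above can be sketched as a Viterbi search over a unigram vocabulary. This is a minimal illustration under stated assumptions, not the library's code: the function name `viterbi_segment` and the log-probability table are hypothetical, and a real model would also handle normalization, unknown characters, and sampling.

```python
import math


def viterbi_segment(text, log_probs):
    """Find the segmentation of `text` with the highest total log-probability.

    `log_probs` maps each vocabulary piece to its unigram log-probability.
    Returns (pieces, score).
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(p) for p in log_probs)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best[n]
```

Subword regularization replaces the single best path with a segmentation sampled from the same lattice, so a model sees several plausible tokenizations of one input during training.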

### Data & Databases

- [Text Segmentation](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/text-segmentation.md) — Splits text into subword pieces with support for byte-level fallback and stochastic sampling.
- [Text Normalization](https://awesome-repositories.com/f/data-databases/text-normalization.md) — Applies Unicode normalization rules and custom whitespace handling to ensure consistent text representation. ([source](https://github.com/google/sentencepiece#readme))
- [Unicode Normalization Pipelines](https://awesome-repositories.com/f/data-databases/text-normalization/unicode-normalization-pipelines.md) — Applies Unicode transformations and whitespace rules to ensure consistent text representation before segmentation.
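
A normalization step of the kind described above can be sketched with the Python standard library. This is an approximation for illustration: plain NFKC stands in for the library's own normalization rules, and the function names are hypothetical. The `▁` (U+2581) whitespace marker and the idea of a dummy prefix follow SentencePiece's conventions.

```python
import re
import unicodedata

WS = "\u2581"  # visible stand-in for a space inside pieces


def normalize(text, add_dummy_prefix=True):
    """Apply NFKC, collapse whitespace runs, and escape spaces as U+2581."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text).strip()
    if add_dummy_prefix and text:
        # A leading marker makes a word look the same at the start of the
        # input as it does after a space.
        text = " " + text
    return text.replace(" ", WS)


def denormalize(normalized):
    """Invert the whitespace escaping; the dummy prefix is dropped."""
    return normalized.replace(WS, " ").lstrip(" ")
```

Because whitespace is kept as an ordinary symbol, detokenization reduces to concatenating pieces and replacing the marker with a space, with no language-specific rules.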

### Programming Languages & Runtimes

- [Special Symbol Managers](https://awesome-repositories.com/f/programming-languages-runtimes/symbolic-identifiers/special-symbol-managers.md) — Customizes the surface strings and integer identifiers of reserved tokens such as the unknown, beginning-of-sequence, end-of-sequence, and padding markers. ([source](https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md))

### Scientific & Mathematical Computing

- [High-Performance and Parallel Computing](https://awesome-repositories.com/f/scientific-mathematical-computing/high-performance-execution-environments/high-performance-and-parallel-computing.md) — Distributes tokenization and detokenization tasks across multiple CPU threads to rapidly handle massive datasets.
- [Parallel Processing](https://awesome-repositories.com/f/scientific-mathematical-computing/high-performance-execution-environments/high-performance-and-parallel-computing/parallel-processing.md) — Distributes encoding and decoding workloads across multiple CPU threads to increase processing speed. ([source](https://github.com/google/sentencepiece/blob/master/python/README.md))
