Tokenizers

This project is a high-performance library that converts raw text into tokens and their integer IDs for machine learning models. It serves as both a fast text encoder and a preprocessing pipeline, turning strings into numerical representations at a throughput suitable for research and production.

The library includes a trainer that analyzes text datasets and builds custom subword vocabularies with algorithms such as byte-pair encoding (BPE) and WordPiece. It also tracks character offsets through normalization, so every processed token can be mapped back to its original position in the raw text.
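
The offset tracking described above can be illustrated with a small standard-library sketch. This is not the library's implementation: the function names `normalize_with_alignment` and `token_spans` are hypothetical, and the normalizer (lowercasing plus whitespace collapsing) and the space-based splitting are simplifying assumptions chosen only to show how an alignment table lets tokens point back into the raw text.

```python
def normalize_with_alignment(text):
    """Lowercase `text` and collapse whitespace runs to single spaces.

    Returns the normalized string plus a list where entry i is the index in
    `text` of the character that produced normalized character i.
    """
    chars, alignment = [], []
    pending_space = None  # index of the first whitespace char in the current run
    for i, ch in enumerate(text):
        if ch.isspace():
            # Leading whitespace is dropped; inner runs are remembered once.
            if chars and pending_space is None:
                pending_space = i
            continue
        if pending_space is not None:
            chars.append(" ")
            alignment.append(pending_space)
            pending_space = None
        # Lowercasing can yield several characters; all map to the same source index.
        for lowered in ch.lower():
            chars.append(lowered)
            alignment.append(i)
    return "".join(chars), alignment


def token_spans(text):
    """Split the normalized text on spaces and return (token, start, end)
    triples, where start/end index into the ORIGINAL text."""
    normalized, alignment = normalize_with_alignment(text)
    spans, start = [], 0
    for token in normalized.split(" "):
        end = start + len(token)
        if token:
            spans.append((token, alignment[start], alignment[end - 1] + 1))
        start = end + 1
    return spans
```

For the input "  Hello   WORLD ", the token "hello" maps to the original span 2 to 7, so slicing the raw text with those offsets recovers "Hello" with its original casing.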

The library covers the common preprocessing steps in natural language processing: text normalization, insertion of special tokens, and padding and truncation to meet model input requirements. It also supports building custom tokenization pipelines and downloading pre-trained tokenizer assets.
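
As a rough illustration of special-token insertion, truncation, and padding, the sketch below operates on plain lists of token IDs. It is an assumption-laden example rather than the library's API: `prepare_batch` is a hypothetical helper, and the ID values 101, 102, and 0 are placeholder choices for the start, end, and padding tokens.

```python
def prepare_batch(sequences, max_length, cls_id=101, sep_id=102, pad_id=0):
    """Wrap each ID sequence in two special tokens, truncate it so the total
    length is at most `max_length`, then right-pad to the longest sequence.

    The default IDs are illustrative placeholders, not values from any
    particular vocabulary.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")

    wrapped = []
    for ids in sequences:
        body = list(ids)[: max_length - 2]  # truncation keeps room for both specials
        wrapped.append([cls_id] + body + [sep_id])

    target = max((len(seq) for seq in wrapped), default=0)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        pad = target - len(seq)
        input_ids.append(seq + [pad_id] * pad)
        # 1 marks real tokens, 0 marks padding positions.
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask
```

Padding to the longest sequence in the batch, instead of always padding to `max_length`, is one possible design choice; it keeps batches of short inputs small.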

The core logic is implemented in Rust for memory safety and performance, with bindings provided for Python.

Features

Text Tokenizers - Converts raw text into a sequence of tokens and IDs using high-performance segmentation algorithms.

Sequence Padding Utilities - Provides utilities for standardizing input sequence lengths through truncation and padding with special tokens.

Vocabulary Builders - Generates token-to-index mappings from datasets to prepare text for machine learning models.

Byte Pair Encodings - Builds token sets using byte-pair encoding, which starts from individual characters or bytes and iteratively merges the most frequent pair of adjacent symbols.
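
The merge loop behind byte-pair encoding can be sketched in a few lines of standard-library Python. This is a teaching example under stated assumptions, not the library's trainer: `merge_pair` and `train_bpe` are hypothetical names, the input is a word-to-frequency mapping, words start as character sequences, and ties between equally frequent pairs are broken lexicographically.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} mapping."""
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Highest count wins; ties go to the lexicographically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_corpus = {}
        for symbols, freq in corpus.items():
            merged = merge_pair(symbols, best)
            new_corpus[merged] = new_corpus.get(merged, 0) + freq
        corpus = new_corpus
    return merges, corpus
```

With the toy counts {"low": 5, "lower": 2, "newest": 6, "widest": 3} and three merges, the learned rules are ("e", "s"), then ("es", "t"), then ("l", "o"), so "newest" ends up segmented as n, e, w, est.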

Vocabulary Learning - Learns a vocabulary of characters and subwords from raw text corpora using the byte-pair encoding or WordPiece algorithm.

Vocabulary Training - Analyzes text datasets to create custom subword vocabularies using the byte-pair encoding and WordPiece algorithms.

Pipeline Construction - Enables the construction of custom tokenization pipelines by combining normalization and model components.

Subword Tokenization - Uses the byte-pair encoding and WordPiece algorithms to handle out-of-vocabulary words by splitting them into known subword units.
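
To show how subword units cover words that are missing from the vocabulary, here is a minimal sketch of greedy longest-match-first splitting in the WordPiece style. The function name `wordpiece_tokenize` is hypothetical, and the "##" continuation prefix and "[UNK]" fallback token are assumed conventions borrowed from BERT-style vocabularies, not details stated in this document.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into vocabulary pieces, always taking the longest
    piece that matches at the current position.

    Pieces that do not start the word carry `prefix`. If some position has
    no matching piece at all, the whole word becomes `unk_token`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces
```

With a vocabulary containing "token", "##ization", and "##izer", the unseen word "tokenization" splits into "token" and "##ization", while "tokenizers" falls back to "[UNK]" because no piece covers the trailing "s".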

Vocabulary Trainers - Includes a trainer that analyzes text datasets and creates custom subword vocabularies using byte-pair encoding and WordPiece.

Text Preprocessing Pipelines - Provides automated workflows for cleaning and normalizing raw text data to prepare it for machine learning models.

Token Alignment Trackers - Tracks alignments during normalization to identify the exact original text characters corresponding to a specific token.

Text Processing Pipelines - Utilizes a modular pipeline to sequentially apply normalization, pre-tokenization, and splitting for structured text processing.
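
The modular structure can be pictured as function composition. The sketch below is only an illustration with hypothetical names (`normalize`, `pre_tokenize`, `make_lookup_model`, `make_pipeline`); the NFKC-plus-lowercase normalizer, the regex-based pre-tokenizer, and the whole-word lookup model are stand-ins assumed for the example, not the components the library ships.

```python
import re
import unicodedata


def normalize(text):
    """Normalization stage: Unicode NFKC folding followed by lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    """Pre-tokenization stage: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def make_lookup_model(vocab, unk_token="[UNK]"):
    """Model stage: keep a word if it is in the vocabulary, else emit `unk_token`."""
    def model(word):
        return [word] if word in vocab else [unk_token]
    return model


def make_pipeline(normalizer, pre_tokenizer, model):
    """Chain the three stages so each one consumes the previous stage's output."""
    def run(text):
        pieces = []
        for word in pre_tokenizer(normalizer(text)):
            pieces.extend(model(word))
        return pieces
    return run
```

Because each stage is an ordinary callable, any one of them can be swapped, for example replacing the lookup model with a subword model, without touching the others.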

Machine Learning Pipelines - Orchestrates a preprocessing pipeline that includes normalization, padding, and truncation to prepare text datasets for models.

Rust-Implemented Tools - Implements core tokenization logic in Rust to ensure memory safety and high-performance execution.

Custom Text Normalizers - Cleans input text while tracking character offsets to enable mapping tokens back to the original sentence.

Alignment-Aware Normalization - Cleans input text while tracking character offsets to maintain traceability between original text and processed tokens.

Offset Tracking - Provides high-precision tracking of original character offsets through normalization to allow tokens to be mapped back to raw text.

Large Language Model Input Tools - Prepares text for model consumption by applying final padding, truncation, and special token insertion.

Input Buffer Formatting - Provides utilities to trim, pad, and inject special tokens into tokenized output to meet specific model input requirements.

Token-to-Text Alignments - Tracks character offsets during normalization so processed tokens can be mapped back to original positions in the raw text.

Python Bindings - Exposes a high-performance Rust implementation to Python via low-overhead foreign function interface bindings.

Parallel Text Processing - Implements multi-threaded parallel processing of text to increase encoding throughput across multiple CPU cores.

Multi-Pattern Matching Algorithms - Uses trie-based structures to efficiently find and replace the longest matching subwords in text streams.
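
A trie makes longest-match lookup efficient because all patterns that share a prefix are checked in a single left-to-right walk. The following sketch uses hypothetical names (`Trie`, `longest_match`, `split_longest`) and nested dictionaries as nodes; it only finds and splits on matches, and it is an illustration of the idea, not the data structure the library uses internally.

```python
class Trie:
    """Character trie built from nested dicts; the key None marks a word end."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True

    def longest_match(self, text, start):
        """Return the end index of the longest inserted word that begins at
        `start` in `text`, or None if no word matches there."""
        node = self.root
        best = None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if None in node:
                best = i + 1  # remember the longest complete word seen so far
        return best


def split_longest(text, trie):
    """Scan left to right, emitting the longest trie match at each position;
    characters with no match are emitted one at a time."""
    out, i = [], 0
    while i < len(text):
        end = trie.longest_match(text, i)
        if end is None:
            end = i + 1
        out.append(text[i:end])
        i = end
    return out
```

Given the patterns "a", "ab", "abc", and "cd", the text "abcdx" is split into "abc", "d", and "x": the walk prefers "abc" over the shorter "a" and "ab", which then leaves no room for "cd".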

Artificial Intelligence - Modern NLP tokenization pipelines with high-performance implementations.

Model Serving Engines - High-performance tokenization library for research and production.

Natural Language Processing - High-performance tokenization library optimized for research and production.

Lexical Analysis Tools - High-performance tokenization library for NLP models.

Python NLP Libraries - High-performance tokenization for research and production.

huggingface/tokenizers
