This project is a high-performance library that converts raw text into tokens and token IDs for machine learning models. It serves as both a fast text encoder and a preprocessing pipeline, turning strings into numerical representations at high throughput for research and production use.
The library includes a subword tokenizer trainer that analyzes text datasets and builds custom vocabularies using algorithms such as byte-pair encoding (BPE) and WordPiece. It also supports text alignment: character offsets are tracked through normalization, so each processed token can be mapped back to its original position in the raw text.
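To make the idea of vocabulary training concrete, here is a minimal pure-Python sketch of how byte-pair encoding can learn merge rules from a word list. It is an illustration only, not this library's implementation or API: the function name `learn_bpe_merges`, its parameters, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words.

    Hypothetical helper for illustration; not part of the library.
    """
    # Represent each distinct word as a tuple of symbols, weighted by frequency.
    word_freqs = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties fall back to comparing the pairs
        # themselves so the result is deterministic (an assumption here).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        merged = Counter()
        for symbols, freq in word_freqs.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        word_freqs = merged
    return merges
```

For example, `learn_bpe_merges(["hug"] * 3 + ["pug"] * 2, 2)` first merges `u` and `g` (the pair seen five times) and then `h` and `ug`. A production trainer would typically add pre-tokenization, byte-level fallbacks, and far more efficient pair counting.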
The system covers a broad range of NLP preprocessing tasks, including text normalization, insertion of special tokens, and padding and truncation to meet model input requirements. It also supports building custom tokenization pipelines and downloading pre-trained tokenizer assets.
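The post-processing steps named above can be sketched in a few lines of plain Python. This is a hedged illustration rather than the library's API: the function `prepare_ids`, the ID values chosen for the begin, end, and padding tokens, and the accompanying attention mask are all assumptions introduced for the example.

```python
def prepare_ids(token_ids, max_length, bos_id=1, eos_id=2, pad_id=0):
    """Add special tokens, then truncate or pad to exactly `max_length`.

    Hypothetical helper; the special-token IDs are placeholder values.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for both special tokens")
    # Truncate the content first so the two special tokens still fit.
    body = list(token_ids[: max_length - 2])
    ids = [bos_id] + body + [eos_id]
    # Mask is 1 for real tokens (including special ones) and 0 for padding.
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [pad_id] * pad, mask + [0] * pad
```

Calling `prepare_ids([5, 6, 7], 6)` pads the sequence on the right, while `prepare_ids([5, 6, 7, 8, 9], 4)` truncates the content to two tokens so the special tokens are preserved. Right-side padding and truncation are a choice made for this sketch; other models may expect left-side handling.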
The core logic is implemented in Rust for memory safety and performance, with bindings provided for Python.