Class BpeTokenizer
- Namespace
- AiDotNet.Tokenization.Algorithms
- Assembly
- AiDotNet.dll
Byte-Pair Encoding (BPE) tokenizer implementation for subword tokenization.
public class BpeTokenizer : TokenizerBase, ITokenizer
- Inheritance
- TokenizerBase → BpeTokenizer
- Implements
- ITokenizer
Remarks
BPE is a data compression algorithm adapted for NLP that learns to merge frequent character sequences into subword units. It's used by GPT, GPT-2, GPT-3, and many other modern language models.
For Beginners: BPE is like learning common letter combinations. Imagine you're creating shorthand notes:
- Start with individual letters: "t", "h", "e", " ", "c", "a", "t"
- Notice "th" appears often, so create a symbol for it: "th", "e", " ", ...
- Notice "the" appears often, merge again: "the", " ", "cat"
- Keep merging until you have a good vocabulary size
This way, common words like "the" become single tokens, while rare words like "cryptocurrency" might be split into "crypt" + "ocurrency" or similar subwords.
Benefits:
- No out-of-vocabulary words (any text can be tokenized)
- Common words are single tokens (efficient)
- Rare words are split into meaningful subwords (handles new words)
Example tokenization of "unhappiness":
- Full word not in vocabulary, so split into subwords
- Possible result: ["un", "happiness"] or ["un", "happy", "ness"]
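Examples
The following sketch illustrates the merge idea described above in plain C#. It is illustrative only and does not reflect this class's internal implementation: it counts adjacent symbol pairs in a toy corpus and fuses the most frequent pair into a new symbol.
```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative only: one BPE merge step on a toy corpus.
// Each "word" starts as a list of single-character symbols.
var words = new List<List<string>>
{
    new() { "t", "h", "e" },
    new() { "t", "h", "a", "t" },
};

// Count every adjacent symbol pair across all words.
var pairCounts = new Dictionary<(string, string), int>();
foreach (var word in words)
    for (int i = 0; i < word.Count - 1; i++)
    {
        var pair = (word[i], word[i + 1]);
        pairCounts[pair] = pairCounts.GetValueOrDefault(pair) + 1;
    }

// The most frequent pair here is ("t", "h"); fuse it into "th" everywhere.
var best = pairCounts.OrderByDescending(kv => kv.Value).First().Key;
foreach (var word in words)
    for (int i = 0; i < word.Count - 1; i++)
        if ((word[i], word[i + 1]) == best)
        {
            word[i] += word[i + 1];
            word.RemoveAt(i + 1);
        }
// Result after one step: ["th", "e"] and ["th", "a", "t"].
// Real training repeats this until the vocabulary reaches the target size.
```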
Constructors
BpeTokenizer(IVocabulary, Dictionary<(string, string), int>, SpecialTokens?, string?)
Creates a new BPE tokenizer with the specified vocabulary and merge rules.
public BpeTokenizer(IVocabulary vocabulary, Dictionary<(string, string), int> merges, SpecialTokens? specialTokens = null, string? pattern = null)
Parameters
vocabulary (IVocabulary): The vocabulary containing all valid tokens.
merges (Dictionary<(string, string), int>): The BPE merges (pairs of tokens to merge and their priority order).
specialTokens (SpecialTokens): The special tokens configuration. Defaults to GPT-style tokens.
pattern (string): The regex pattern for pre-tokenization. Defaults to GPT-2 pattern.
Remarks
For Beginners: Most users should use the Train method or load a pretrained tokenizer instead of calling this constructor directly. The merges dictionary contains rules like ("t", "h") -> 0 meaning "merge t and h first" (lower number = higher priority).
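Examples
A minimal sketch of what the merges argument looks like. The pair values and the vocab variable are hypothetical; an IVocabulary instance would normally come from Train or a pretrained tokenizer, so the constructor call is shown commented out.
```csharp
using System.Collections.Generic;

// Hypothetical merge rules: fuse "t"+"h" first, then "th"+"e".
// Lower number = higher priority, as noted above.
var merges = new Dictionary<(string, string), int>
{
    { ("t", "h"), 0 },
    { ("th", "e"), 1 },
};

// `vocab` would be an IVocabulary built elsewhere (for example by Train
// or loaded from a pretrained model):
// var tokenizer = new BpeTokenizer(vocab, merges);
```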
Methods
CleanupTokens(List<string>)
Cleans up tokens and converts them back to text.
protected override string CleanupTokens(List<string> tokens)
Parameters
tokens (List<string>): The tokens to clean up and convert back to text.
Returns
- string
The reconstructed text.
Tokenize(string)
Tokenizes text into BPE tokens.
public override List<string> Tokenize(string text)
Parameters
text (string): The text to tokenize.
Returns
- List<string>
The BPE tokens for the input text.
Train(IEnumerable<string>, int, SpecialTokens?, string?)
Trains a BPE tokenizer from a text corpus by learning merge rules.
public static BpeTokenizer Train(IEnumerable<string> corpus, int vocabSize, SpecialTokens? specialTokens = null, string? pattern = null)
Parameters
corpus (IEnumerable<string>): The training corpus, a collection of text strings.
vocabSize (int): The desired vocabulary size (number of unique tokens).
specialTokens (SpecialTokens): The special tokens configuration. Defaults to GPT-style tokens.
pattern (string): The regex pattern for pre-tokenization. Defaults to GPT-2 pattern.
Returns
- BpeTokenizer
A trained BPE tokenizer ready to tokenize text.
Remarks
For Beginners: Training learns which letter combinations appear most frequently in your text. For example, if training on English text:
- The algorithm starts with all individual characters as tokens
- It counts all adjacent character pairs in the corpus
- The most frequent pair (e.g., "t" + "h") becomes a new token "th"
- This repeats until reaching the desired vocabulary size
A larger vocabulary means longer character sequences become single tokens, so token sequences are shorter (faster inference) at the cost of more memory. Typical sizes: 30,000-50,000 tokens.
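Examples
A minimal usage sketch based on the signatures documented on this page; the corpus contents and the vocabSize value are illustrative.
```csharp
using System;
using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

class Demo
{
    static void Main()
    {
        // Tiny illustrative corpus; real training data would be far larger.
        var corpus = new[]
        {
            "the cat sat on the mat",
            "the dog chased the cat",
        };

        // Learn merge rules until the vocabulary reaches the requested size.
        BpeTokenizer tokenizer = BpeTokenizer.Train(corpus, vocabSize: 500);

        // Tokenize new text using the learned subword units.
        List<string> tokens = tokenizer.Tokenize("the cat ran");
        Console.WriteLine(string.Join(" | ", tokens));
    }
}
```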