Class BpeTokenizer
- Namespace
- AiDotNet.Tokenization.Algorithms
- Assembly
- AiDotNet.dll
Byte-Pair Encoding (BPE) tokenizer implementation for subword tokenization.
public class BpeTokenizer : TokenizerBase, ITokenizer
- Inheritance
- TokenizerBase → BpeTokenizer
- Implements
- ITokenizer
Remarks
BPE is a data compression algorithm adapted for NLP that learns to merge frequent character sequences into subword units. It's used by GPT, GPT-2, GPT-3, and many other modern language models.
For Beginners: BPE is like learning common letter combinations. Imagine you're creating shorthand notes:
- Start with individual letters: "t", "h", "e", " ", "c", "a", "t"
- Notice "th" appears often, so create a symbol for it: "th", "e", " ", ...
- Notice "the" appears often, merge again: "the", " ", "cat"
- Keep merging until you have a good vocabulary size
This way, common words like "the" become single tokens, while rare words like "cryptocurrency" might be split into "crypt" + "ocurrency" or similar subwords.
Benefits:
- No out-of-vocabulary words (any text can be tokenized)
- Common words are single tokens (efficient)
- Rare words are split into meaningful subwords (handles new words)
Example tokenization of "unhappiness":
- Full word not in vocabulary, so split into subwords
- Possible result: ["un", "happiness"] or ["un", "happy", "ness"]
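Examples
The following sketch illustrates the merge idea described above in plain C#. It is illustrative only and does not reflect this class's internal implementation: it counts adjacent symbol pairs in a toy corpus and fuses the most frequent pair into a new symbol.
```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative only: one BPE merge step on a toy corpus.
// Each "word" starts as a list of single-character symbols.
var words = new List<List<string>>
{
    new() { "t", "h", "e" },
    new() { "t", "h", "a", "t" },
};

// Count every adjacent symbol pair across all words.
var pairCounts = new Dictionary<(string, string), int>();
foreach (var word in words)
    for (int i = 0; i < word.Count - 1; i++)
    {
        var pair = (word[i], word[i + 1]);
        pairCounts[pair] = pairCounts.GetValueOrDefault(pair) + 1;
    }

// The most frequent pair here is ("t", "h"); fuse it into "th" everywhere.
var best = pairCounts.OrderByDescending(kv => kv.Value).First().Key;
foreach (var word in words)
    for (int i = 0; i < word.Count - 1; i++)
        if ((word[i], word[i + 1]) == best)
        {
            word[i] += word[i + 1];
            word.RemoveAt(i + 1);
        }
// Result after one step: ["th", "e"] and ["th", "a", "t"].
// Real training repeats this until the vocabulary reaches the target size.
```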
Constructors
BpeTokenizer(IVocabulary, Dictionary<(string, string), int>, SpecialTokens?, string?)
Creates a new BPE tokenizer with the specified vocabulary and merge rules.
public BpeTokenizer(IVocabulary vocabulary, Dictionary<(string, string), int> merges, SpecialTokens? specialTokens = null, string? pattern = null)
Parameters
vocabulary (IVocabulary): The vocabulary containing all valid tokens.
merges (Dictionary<(string, string), int>): The BPE merges (pairs of tokens to merge and their priority order).
specialTokens (SpecialTokens): The special tokens configuration. Defaults to GPT-style tokens.
pattern (string): The regex pattern for pre-tokenization. Defaults to GPT-2 pattern.
Remarks
For Beginners: Most users should use the Train method or load a pretrained tokenizer instead of calling this constructor directly. The merges dictionary contains rules like ("t", "h") -> 0 meaning "merge t and h first" (lower number = higher priority).
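Examples
A minimal sketch of what the merges argument looks like. The pair values and the vocab variable are hypothetical; an IVocabulary instance would normally come from Train or a pretrained tokenizer, so the constructor call is shown commented out.
```csharp
using System.Collections.Generic;

// Hypothetical merge rules: fuse "t"+"h" first, then "th"+"e".
// Lower number = higher priority, as noted above.
var merges = new Dictionary<(string, string), int>
{
    { ("t", "h"), 0 },
    { ("th", "e"), 1 },
};

// `vocab` would be an IVocabulary built elsewhere (for example by Train
// or loaded from a pretrained model):
// var tokenizer = new BpeTokenizer(vocab, merges);
```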
Methods
CleanupTokens(List<string>)
Cleans up tokens and converts them back to text.
protected override string CleanupTokens(List<string> tokens)
Parameters
tokens (List<string>): The tokens to clean up and convert back to text.
Returns
- string
The reconstructed text.
Tokenize(string)
Tokenizes text into BPE tokens.
public override List<string> Tokenize(string text)
Parameters
text (string): The text to tokenize.
Returns
- List<string>
The BPE tokens for the input text.
Train(IEnumerable<string>, int, SpecialTokens?, string?)
Trains a BPE tokenizer from a text corpus by learning merge rules.
public static BpeTokenizer Train(IEnumerable<string> corpus, int vocabSize, SpecialTokens? specialTokens = null, string? pattern = null)
Parameters
corpus (IEnumerable<string>): The training corpus, a collection of text strings.
vocabSize (int): The desired vocabulary size (number of unique tokens).
specialTokens (SpecialTokens): The special tokens configuration. Defaults to GPT-style tokens.
pattern (string): The regex pattern for pre-tokenization. Defaults to GPT-2 pattern.
Returns
- BpeTokenizer
A trained BPE tokenizer ready to tokenize text.
Remarks
For Beginners: Training learns which letter combinations appear most frequently in your text. For example, if training on English text:
- The algorithm starts with all individual characters as tokens
- It counts all adjacent character pairs in the corpus
- The most frequent pair (e.g., "t" + "h") becomes a new token "th"
- This repeats until reaching the desired vocabulary size
A larger vocabulary means longer character sequences become single tokens, so token sequences are shorter (faster inference) at the cost of more memory. Typical sizes: 30,000-50,000 tokens.
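Examples
A minimal usage sketch based on the signatures documented on this page; the corpus contents and the vocabSize value are illustrative.
```csharp
using System;
using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

class Demo
{
    static void Main()
    {
        // Tiny illustrative corpus; real training data would be far larger.
        var corpus = new[]
        {
            "the cat sat on the mat",
            "the dog chased the cat",
        };

        // Learn merge rules until the vocabulary reaches the requested size.
        BpeTokenizer tokenizer = BpeTokenizer.Train(corpus, vocabSize: 500);

        // Tokenize new text using the learned subword units.
        List<string> tokens = tokenizer.Tokenize("the cat ran");
        Console.WriteLine(string.Join(" | ", tokens));
    }
}
```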