Class BpeTokenizer

Namespace
AiDotNet.Tokenization.Algorithms
Assembly
AiDotNet.dll

Byte-Pair Encoding (BPE) tokenizer implementation for subword tokenization.

public class BpeTokenizer : TokenizerBase, ITokenizer
Inheritance
object → TokenizerBase → BpeTokenizer

Implements
ITokenizer

Remarks

BPE is a data compression algorithm adapted for NLP that learns to merge frequent character sequences into subword units. It's used by GPT, GPT-2, GPT-3, and many other modern language models.

For Beginners: BPE is like learning common letter combinations. Imagine you're creating shorthand notes:

  1. Start with individual letters: "t", "h", "e", " ", "c", "a", "t"
  2. Notice "th" appears often, so create a symbol for it: "th", "e", " ", ...
  3. Notice "the" appears often, merge again: "the", " ", "cat"
  4. Keep merging until you have a good vocabulary size

This way, common words like "the" become single tokens, while rare words like "cryptocurrency" might be split into "crypt" + "ocurrency" or similar subwords.

Benefits:

  • No out-of-vocabulary words (any text can be tokenized)
  • Common words are single tokens (efficient)
  • Rare words are split into meaningful subwords (handles new words)

Example tokenization of "unhappiness":

  • Full word not in vocabulary, so split into subwords
  • Possible result: ["un", "happiness"] or ["un", "happy", "ness"]
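
The sketch below shows how these pieces fit together, using the Train and Tokenize members documented on this page. The corpus, the vocabulary size, and the subword splits mentioned in the comments are illustrative assumptions; the actual splits depend on the merges learned from your corpus.

using System;
using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

// Illustrative corpus and vocabulary size (assumptions, not library defaults).
var corpus = new List<string>
{
    "the cat sat on the mat",
    "the dog chased the cat",
    "unhappiness is the opposite of happiness"
};

// Learn merge rules until the vocabulary reaches the requested size.
BpeTokenizer tokenizer = BpeTokenizer.Train(corpus, vocabSize: 1000);

// Common words tend to come back as single tokens; rare words are
// split into subwords, e.g. "unhappiness" -> ["un", "happiness"].
List<string> tokens = tokenizer.Tokenize("the cat discovered unhappiness");
Console.WriteLine(string.Join(" | ", tokens));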

Constructors

BpeTokenizer(IVocabulary, Dictionary<(string, string), int>, SpecialTokens?, string?)

Creates a new BPE tokenizer with the specified vocabulary and merge rules.

public BpeTokenizer(IVocabulary vocabulary, Dictionary<(string, string), int> merges, SpecialTokens? specialTokens = null, string? pattern = null)

Parameters

vocabulary IVocabulary

The vocabulary containing all valid tokens.

merges Dictionary<(string, string), int>

The BPE merges (pairs of tokens to merge and their priority order).

specialTokens SpecialTokens

The special tokens configuration. Defaults to GPT-style tokens.

pattern string

The regex pattern for pre-tokenization. Defaults to GPT-2 pattern.

Remarks

For Beginners: Most users should use the Train method or load a pretrained tokenizer instead of calling this constructor directly. The merges dictionary contains rules like ("t", "h") -> 0 meaning "merge t and h first" (lower number = higher priority).
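
If you do call the constructor directly, the merges dictionary looks like the sketch below. Building an IVocabulary is library-specific and not shown here; the vocabulary variable is assumed to already exist.

using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

// Lower number = higher priority: ("t", "h") is applied before ("th", "e").
var merges = new Dictionary<(string, string), int>
{
    { ("t", "h"), 0 },   // merge "t" + "h" into "th" first
    { ("th", "e"), 1 },  // then merge "th" + "e" into "the"
    { ("c", "a"), 2 }
};

// 'vocabulary' is an existing IVocabulary instance (construction not shown).
var tokenizer = new BpeTokenizer(vocabulary, merges);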

Methods

CleanupTokens(List<string>)

Cleans up tokens and converts them back to text.

protected override string CleanupTokens(List<string> tokens)

Parameters

tokens List<string>

Returns

string

Tokenize(string)

Tokenizes text into BPE tokens.

public override List<string> Tokenize(string text)

Parameters

text string

Returns

List<string>

Train(IEnumerable<string>, int, SpecialTokens?, string?)

Trains a BPE tokenizer from a text corpus by learning merge rules.

public static BpeTokenizer Train(IEnumerable<string> corpus, int vocabSize, SpecialTokens? specialTokens = null, string? pattern = null)

Parameters

corpus IEnumerable<string>

The training corpus - a collection of text strings.

vocabSize int

The desired vocabulary size (number of unique tokens).

specialTokens SpecialTokens

The special tokens configuration. Defaults to GPT-style tokens.

pattern string

The regex pattern for pre-tokenization. Defaults to GPT-2 pattern.

Returns

BpeTokenizer

A trained BPE tokenizer ready to tokenize text.

Remarks

For Beginners: Training learns which letter combinations appear most frequently in your text. For example, if training on English text:

  1. The algorithm starts with all individual characters as tokens
  2. It counts all adjacent character pairs in the corpus
  3. The most frequent pair (e.g., "t" + "h") becomes a new token "th"
  4. This repeats until reaching the desired vocabulary size

A larger vocabulary means longer character sequences become single tokens, so tokenized sequences are shorter (faster inference) but the model needs more memory. Typical sizes: 30,000-50,000 tokens.
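
The following standalone sketch illustrates steps 1-4 above; it is not the library's internal implementation. Each word is represented as a list of symbols, and every round merges the most frequent adjacent pair into a new symbol.

using System.Collections.Generic;
using System.Linq;

static Dictionary<(string, string), int> LearnMerges(List<List<string>> words, int numMerges)
{
    var merges = new Dictionary<(string, string), int>();
    for (int round = 0; round < numMerges; round++)
    {
        // Step 2: count all adjacent symbol pairs in the corpus.
        var pairCounts = new Dictionary<(string, string), int>();
        foreach (var word in words)
            for (int i = 0; i < word.Count - 1; i++)
            {
                var pair = (word[i], word[i + 1]);
                pairCounts[pair] = pairCounts.GetValueOrDefault(pair) + 1;
            }
        if (pairCounts.Count == 0) break;

        // Step 3: the most frequent pair becomes a new token.
        var best = pairCounts.OrderByDescending(p => p.Value).First().Key;
        merges[best] = round; // lower number = higher merge priority

        // Replace every occurrence of the pair with the merged symbol.
        foreach (var word in words)
            for (int i = 0; i < word.Count - 1; i++)
                if (word[i] == best.Item1 && word[i + 1] == best.Item2)
                {
                    word[i] = best.Item1 + best.Item2;
                    word.RemoveAt(i + 1);
                }
    }
    return merges; // Step 4: repeated until the desired vocabulary size is reached
}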