Namespace AiDotNet.Tokenization.Algorithms

Classes

BpeTokenizer

Byte-Pair Encoding (BPE) tokenizer implementation for subword tokenization.

CharacterTokenizer

Character-level tokenizer that splits text into individual characters. Useful for character-based language models and some RNN architectures.

SentencePieceTokenizer

SentencePiece tokenizer implementation using a Unigram language model. Used for multilingual models and language-agnostic tokenization.

UnigramTokenizer

Unigram language model tokenizer that uses probabilistic segmentation.

WordPieceTokenizer

WordPiece tokenizer implementation. Used by BERT and similar models.