Class SentencePieceTokenizer

Namespace
AiDotNet.Tokenization.Algorithms
Assembly
AiDotNet.dll

SentencePiece tokenizer implementation using the Unigram language model. Suitable for multilingual models and language-agnostic tokenization.

public class SentencePieceTokenizer : TokenizerBase, ITokenizer
Inheritance
object ← TokenizerBase ← SentencePieceTokenizer
Implements
ITokenizer

Constructors

SentencePieceTokenizer(IVocabulary, Dictionary<string, double>, SpecialTokens?, bool)

Creates a new SentencePiece tokenizer.

public SentencePieceTokenizer(IVocabulary vocabulary, Dictionary<string, double> pieceScores, SpecialTokens? specialTokens = null, bool treatWhitespaceAsSpecialToken = true)

Parameters

vocabulary IVocabulary

The vocabulary.

pieceScores Dictionary<string, double>

The scores for each piece (used for unigram segmentation).

specialTokens SpecialTokens

The special tokens.

treatWhitespaceAsSpecialToken bool

Whether to treat whitespace as a special token.
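A minimal construction sketch. It assumes you already have an `IVocabulary` instance (here `vocab`) and a score table for each piece; the scores below are illustrative log-probabilities, not values from a real model, and the `▁` pieces follow the usual SentencePiece whitespace convention.

```csharp
using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

// `vocab` is assumed to be a pre-built IVocabulary covering these pieces.
IVocabulary vocab = /* an existing vocabulary instance */;

// Illustrative piece scores (higher = more likely) used by the
// Unigram segmenter to pick the best split of the input.
var pieceScores = new Dictionary<string, double>
{
    ["▁the"] = -2.1,
    ["▁cat"] = -4.7,
    ["▁"]    = -1.3,
    ["ca"]   = -5.2,
    ["t"]    = -3.9,
};

var tokenizer = new SentencePieceTokenizer(
    vocab,
    pieceScores,
    specialTokens: null,                 // use the tokenizer's defaults
    treatWhitespaceAsSpecialToken: true  // encode spaces via the ▁ marker
);
```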

Methods

CleanupTokens(List<string>)

Cleans up tokens and converts them back to text.

protected override string CleanupTokens(List<string> tokens)

Parameters

tokens List<string>

Returns

string

Tokenize(string)

Tokenizes text into SentencePiece tokens.

public override List<string> Tokenize(string text)

Parameters

text string

Returns

List<string>
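A usage sketch for `Tokenize`, assuming a constructed or trained `tokenizer` instance. The exact segmentation depends on the learned piece scores, so the comment shows a plausible result rather than a guaranteed one.

```csharp
// Segment raw text into SentencePiece pieces. With
// treatWhitespaceAsSpecialToken enabled, leading spaces typically
// surface as the ▁ meta-symbol on the following piece.
List<string> pieces = tokenizer.Tokenize("the cat sat");
// e.g. something like ["▁the", "▁cat", "▁sat"]
```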

Train(IEnumerable<string>, int, SpecialTokens?, double)

Trains a SentencePiece tokenizer using Unigram language model.

public static SentencePieceTokenizer Train(IEnumerable<string> corpus, int vocabSize, SpecialTokens? specialTokens = null, double characterCoverage = 0.9995)

Parameters

corpus IEnumerable<string>

The training corpus.

vocabSize int

The desired vocabulary size.

specialTokens SpecialTokens

The special tokens.

characterCoverage double

The fraction of corpus characters that must be representable by the learned vocabulary; rarer characters beyond this coverage may map to an unknown token (default: 0.9995).

Returns

SentencePieceTokenizer

A trained SentencePiece tokenizer.
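An end-to-end training sketch using the static `Train` method. The in-memory corpus and the `vocabSize` value are illustrative; a real corpus would be far larger, and a tiny corpus like this may not support a vocabulary of the requested size.

```csharp
using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

// Toy corpus for illustration only.
var corpus = new List<string>
{
    "the quick brown fox",
    "the lazy dog",
    "a quick brown dog",
};

// Train a Unigram-model tokenizer; defaults are used for
// specialTokens and characterCoverage except where noted.
SentencePieceTokenizer tokenizer = SentencePieceTokenizer.Train(
    corpus,
    vocabSize: 200,
    specialTokens: null,
    characterCoverage: 0.9995);

// The trained tokenizer can then segment unseen text.
List<string> pieces = tokenizer.Tokenize("the quick dog");
```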