Class SentencePieceTokenizer
- Namespace
- AiDotNet.Tokenization.Algorithms
- Assembly
- AiDotNet.dll
SentencePiece tokenizer implementation based on the Unigram language model. Suited to multilingual models and language-agnostic tokenization.
public class SentencePieceTokenizer : TokenizerBase, ITokenizer
- Inheritance
- object → TokenizerBase → SentencePieceTokenizer
- Implements
- ITokenizer
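SentencePiece treats the input as a raw character stream and marks word boundaries with the U+2581 ("▁") meta symbol instead of relying on language-specific whitespace splitting, which makes tokenization reversible and language-agnostic. A minimal Python sketch of that convention (an illustration of the general SentencePiece technique, not AiDotNet's internals):

```python
# Sketch of SentencePiece-style whitespace handling: spaces become the
# U+2581 meta symbol so segmentation is lossless and language-agnostic.
WHITESPACE_MARKER = "\u2581"  # '▁'

def pretokenize(text: str) -> str:
    """Replace spaces with the meta symbol and prefix the text with one."""
    return WHITESPACE_MARKER + text.replace(" ", WHITESPACE_MARKER)

def detokenize(pieces: list[str]) -> str:
    """Invert pretokenize: join pieces, turn meta symbols back into spaces."""
    return "".join(pieces).replace(WHITESPACE_MARKER, " ").lstrip(" ")

print(pretokenize("Hello world"))               # → ▁Hello▁world
print(detokenize(["\u2581Hello", "\u2581wor", "ld"]))  # → Hello world
```

Because the marker survives inside pieces, any segmentation of the pretokenized text can be detokenized back to the exact original string.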
Constructors
SentencePieceTokenizer(IVocabulary, Dictionary<string, double>, SpecialTokens?, bool)
Creates a new SentencePiece tokenizer.
public SentencePieceTokenizer(IVocabulary vocabulary, Dictionary<string, double> pieceScores, SpecialTokens? specialTokens = null, bool treatWhitespaceAsSpecialToken = true)
Parameters
vocabulary (IVocabulary): The vocabulary.
pieceScores (Dictionary&lt;string, double&gt;): The score for each piece, used for Unigram segmentation.
specialTokens (SpecialTokens?): The special tokens (default: null).
treatWhitespaceAsSpecialToken (bool): Whether to treat whitespace as a special token (default: true).
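The pieceScores map drives segmentation: under the Unigram model each candidate piece carries a log-probability score, and the tokenizer picks the segmentation whose total score is maximal, typically via dynamic programming. A hedged Python sketch of Viterbi-style segmentation over such a score table (names and scores are illustrative, not AiDotNet's implementation):

```python
import math

def viterbi_segment(text: str, piece_scores: dict[str, float]) -> list[str]:
    """Find the highest-scoring segmentation of `text` into known pieces.

    piece_scores maps each piece to its Unigram log-probability; the best
    segmentation maximizes the sum of the scores of its pieces.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_scores and best[start] + piece_scores[piece] > best[end]:
                best[end] = best[start] + piece_scores[piece]
                back[end] = start
    # Walk the back pointers to recover the winning pieces.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy score table: multi-character pieces score better than single characters.
scores = {"un": -2.0, "igram": -3.0, "uni": -2.0, "gram": -2.0,
          "u": -6.0, "n": -6.0, "i": -6.0, "g": -6.0,
          "r": -6.0, "a": -6.0, "m": -6.0}
print(viterbi_segment("unigram", scores))  # → ['uni', 'gram'] (score -4.0)
```

Here "uni" + "gram" (score -4.0) beats "un" + "igram" (score -5.0), so the higher-probability split wins even though both are valid segmentations.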
Methods
CleanupTokens(List<string>)
Cleans up tokens and converts them back to text.
protected override string CleanupTokens(List<string> tokens)
Parameters
tokens (List&lt;string&gt;): The tokens to convert back to text.
Returns
string: The reconstructed text.
Tokenize(string)
Tokenizes text into SentencePiece tokens.
public override List<string> Tokenize(string text)
Parameters
text (string): The text to tokenize.
Returns
List&lt;string&gt;: The list of SentencePiece tokens.
Train(IEnumerable<string>, int, SpecialTokens?, double)
Trains a SentencePiece tokenizer using Unigram language model.
public static SentencePieceTokenizer Train(IEnumerable<string> corpus, int vocabSize, SpecialTokens? specialTokens = null, double characterCoverage = 0.9995)
Parameters
corpus (IEnumerable&lt;string&gt;): The training corpus.
vocabSize (int): The desired vocabulary size.
specialTokens (SpecialTokens?): The special tokens (default: null).
characterCoverage (double): The fraction of corpus characters the learned vocabulary must cover (default: 0.9995).
Returns
SentencePieceTokenizer: A trained SentencePiece tokenizer.
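Character coverage controls how many distinct characters become base pieces: the most frequent characters are kept until the requested fraction of the corpus is covered, and anything rarer falls outside the vocabulary (and would typically map to an unknown token). This keeps vocabularies compact for large scripts such as CJK. A hedged Python sketch of that selection step (an illustration of the general idea, not AiDotNet's Train internals):

```python
from collections import Counter

def covered_characters(corpus: list[str], coverage: float = 0.9995) -> set[str]:
    """Return the smallest set of most-frequent characters whose combined
    frequency reaches `coverage` of all characters in the corpus."""
    counts = Counter(ch for line in corpus for ch in line)
    total = sum(counts.values())
    kept, cumulative = set(), 0
    for ch, freq in counts.most_common():
        if cumulative / total >= coverage:
            break  # coverage target already reached; drop the remaining tail
        kept.add(ch)
        cumulative += freq
    return kept

# With coverage below 1.0, rare characters are excluded from the base set.
print(covered_characters(["aaab"], coverage=0.7))  # → {'a'}
```

A default near 1.0 (such as 0.9995) keeps virtually every character for alphabetic languages while trimming extremely rare symbols from very large character inventories.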