Class UnigramTokenizer
- Namespace: AiDotNet.Tokenization.Algorithms
- Assembly: AiDotNet.dll
Unigram Language Model tokenizer using probabilistic segmentation.
public class UnigramTokenizer : TokenizerBase, ITokenizer
- Inheritance: TokenizerBase → UnigramTokenizer
- Implements: ITokenizer
Constructors
UnigramTokenizer(IVocabulary, Dictionary<string, double>, SpecialTokens, int)
Creates a new unigram tokenizer.
public UnigramTokenizer(IVocabulary vocabulary, Dictionary<string, double> tokenScores, SpecialTokens specialTokens, int maxTokenLength = 16)
Parameters
vocabulary (IVocabulary)
tokenScores (Dictionary<string, double>)
specialTokens (SpecialTokens)
maxTokenLength (int, default 16)
Methods
CleanupTokens(List<string>)
Cleans up tokens and converts them back to text. This overrides the method that the base class requires derived classes to implement.
protected override string CleanupTokens(List<string> tokens)
Parameters
tokens (List<string>)
Returns
string
Tokenize(string)
Tokenizes text using the Viterbi algorithm to find the highest-scoring segmentation.
public override List<string> Tokenize(string text)
Parameters
text (string)
Returns
List<string>
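To illustrate what Viterbi segmentation over a table of token scores looks like, here is a minimal standalone sketch. It is not the AiDotNet implementation: the class `ViterbiSketch`, the method `Segment`, and the `unknownScore` fallback are hypothetical names introduced for this example. It assumes the scores behave like log-probabilities (higher is better, and the score of a segmentation is the sum of its piece scores), and it reuses the `maxTokenLength` idea from the constructor as an upper bound on piece length.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of AiDotNet.
public static class ViterbiSketch
{
    // Returns the split of `text` into pieces whose summed score is highest.
    // Assumption: `scores` holds log-probability-like values (higher is better).
    public static List<string> Segment(
        string text,
        Dictionary<string, double> scores,
        int maxTokenLength = 16,
        double unknownScore = -100.0)
    {
        int n = text.Length;
        var best = new double[n + 1]; // best[i]: best total score for text[0..i)
        var start = new int[n + 1];   // start[i]: start index of the last piece ending at i

        for (int i = 1; i <= n; i++)
        {
            best[i] = double.NegativeInfinity;
            for (int j = Math.Max(0, i - maxTokenLength); j < i; j++)
            {
                string piece = text.Substring(j, i - j);
                double score;
                if (scores.TryGetValue(piece, out double known))
                    score = known;
                else if (i - j == 1)
                    score = unknownScore; // unseen single character: heavy penalty
                else
                    continue;             // unseen multi-character piece: not allowed

                double candidate = best[j] + score;
                if (candidate > best[i])
                {
                    best[i] = candidate;
                    start[i] = j;
                }
            }
        }

        // Walk the back-pointers from the end of the text, then reverse.
        var tokens = new List<string>();
        for (int i = n; i > 0; i = start[i])
            tokens.Add(text.Substring(start[i], i - start[i]));
        tokens.Reverse();
        return tokens;
    }
}
```

With scores `a = -1.0`, `b = -1.0`, `c = -1.0`, `ab = -1.5`, `bc = -1.2`, calling `ViterbiSketch.Segment("abc", scores)` yields `a`, `bc` (total -2.2), which beats `ab`, `c` (-2.5) and `a`, `b`, `c` (-3.0). The single-character fallback guarantees that every input has at least one valid segmentation; a real tokenizer would more likely map such characters to an unknown token.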
Train(IEnumerable<string>, int, SpecialTokens?)
Trains a unigram tokenizer from a corpus.
public static UnigramTokenizer Train(IEnumerable<string> corpus, int vocabSize = 8000, SpecialTokens? specialTokens = null)
Parameters
corpus (IEnumerable<string>)
vocabSize (int, default 8000)
specialTokens (SpecialTokens?, default null)
Returns
UnigramTokenizer