Class UnigramTokenizer

Namespace
AiDotNet.Tokenization.Algorithms
Assembly
AiDotNet.dll

Unigram Language Model tokenizer using probabilistic segmentation.

public class UnigramTokenizer : TokenizerBase, ITokenizer
Inheritance
TokenizerBase
UnigramTokenizer
Implements
ITokenizer

Constructors

UnigramTokenizer(IVocabulary, Dictionary<string, double>, SpecialTokens, int)

Creates a new unigram tokenizer.

public UnigramTokenizer(IVocabulary vocabulary, Dictionary<string, double> tokenScores, SpecialTokens specialTokens, int maxTokenLength = 16)

Parameters

vocabulary IVocabulary
tokenScores Dictionary<string, double>
specialTokens SpecialTokens
maxTokenLength int

Methods

CleanupTokens(List<string>)

Cleans up tokens and converts them back to text. This override provides the unigram-specific implementation of the member that TokenizerBase requires derived classes to implement.

protected override string CleanupTokens(List<string> tokens)

Parameters

tokens List<string>

Returns

string

Tokenize(string)

Tokenizes text using the Viterbi algorithm to find the optimal (highest-scoring) segmentation.

public override List<string> Tokenize(string text)

Parameters

text string

Returns

List<string>
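
The following is a minimal, self-contained sketch of Viterbi segmentation over a token-score table. It illustrates the general technique only and is not the library's implementation. The class ViterbiSketch, the method Segment, and the unknownScore fallback for characters missing from the table are hypothetical names and choices introduced for illustration. The tokenScores and maxTokenLength parameters mirror the constructor parameters above, on the assumption that scores are log-probabilities (higher is better).

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; not part of AiDotNet.
public static class ViterbiSketch
{
    // Returns the segmentation of `text` whose token scores sum to the maximum.
    // Assumption: scores are log-probabilities, so they add along a segmentation.
    public static List<string> Segment(
        string text,
        IReadOnlyDictionary<string, double> tokenScores,
        int maxTokenLength = 16,
        double unknownScore = -100.0) // hypothetical penalty for unseen characters
    {
        int n = text.Length;
        var best = new double[n + 1]; // best[i]: best total score for text[0..i)
        var start = new int[n + 1];   // start[i]: start index of the last token in that segmentation
        for (int i = 1; i <= n; i++) best[i] = double.NegativeInfinity;

        for (int end = 1; end <= n; end++)
        {
            int minBegin = Math.Max(0, end - maxTokenLength);
            for (int begin = minBegin; begin < end; begin++)
            {
                string piece = text.Substring(begin, end - begin);
                double score;
                if (!tokenScores.TryGetValue(piece, out score))
                {
                    // Only single characters may fall back to the unknown penalty,
                    // which guarantees that every position stays reachable.
                    if (piece.Length != 1) continue;
                    score = unknownScore;
                }

                double candidate = best[begin] + score;
                if (candidate > best[end])
                {
                    best[end] = candidate;
                    start[end] = begin;
                }
            }
        }

        // Walk the back-pointers from the end of the text to recover the tokens.
        var tokens = new List<string>();
        for (int end = n; end > 0; end = start[end])
            tokens.Add(text.Substring(start[end], end - start[end]));
        tokens.Reverse();
        return tokens;
    }
}
```

With a table that scores "uni" and "gram" at -2.0 each, "un" at -3.0, and "i" at -4.0, Segment("unigram", scores) prefers the split into "uni" and "gram" (total -4.0) over "un", "i", "gram" (total -9.0). The sketch indexes by UTF-16 code units, so it can split surrogate pairs; a production tokenizer would need to handle that.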

Train(IEnumerable<string>, int, SpecialTokens?)

Trains a unigram tokenizer from a corpus.

public static UnigramTokenizer Train(IEnumerable<string> corpus, int vocabSize = 8000, SpecialTokens? specialTokens = null)

Parameters

corpus IEnumerable<string>
vocabSize int
specialTokens SpecialTokens?

Returns

UnigramTokenizer
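
To give a feel for where a token-score table can come from, here is a simplified sketch of one ingredient of unigram training: counting candidate substrings in a corpus and converting the most frequent ones into log-probability scores. This is not how Train is implemented. Full unigram training typically refines such a seed vocabulary iteratively (for example with expectation-maximization and pruning), which this sketch omits. The class UnigramSeedSketch and the method SeedScores are hypothetical names, and the maxTokenLength parameter is borrowed from the constructor as an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only; not part of AiDotNet.
public static class UnigramSeedSketch
{
    // Builds a seed score table: log relative frequency of the most common substrings.
    public static Dictionary<string, double> SeedScores(
        IEnumerable<string> corpus, int vocabSize = 8000, int maxTokenLength = 16)
    {
        // Count every substring of length 1..maxTokenLength in every line.
        var counts = new Dictionary<string, long>();
        foreach (string line in corpus)
        {
            for (int begin = 0; begin < line.Length; begin++)
            {
                int maxLen = Math.Min(maxTokenLength, line.Length - begin);
                for (int len = 1; len <= maxLen; len++)
                {
                    string piece = line.Substring(begin, len);
                    counts.TryGetValue(piece, out long seen);
                    counts[piece] = seen + 1;
                }
            }
        }

        // Keep the vocabSize most frequent candidates; ties are broken by
        // ordinal string order so the result is deterministic.
        var kept = counts
            .OrderByDescending(kv => kv.Value)
            .ThenBy(kv => kv.Key, StringComparer.Ordinal)
            .Take(vocabSize)
            .ToList();

        // Score each kept candidate by the log of its share of the kept counts.
        double total = kept.Sum(kv => (double)kv.Value);
        return kept.ToDictionary(kv => kv.Key, kv => Math.Log(kv.Value / total));
    }
}
```

A table produced this way has the same shape as the tokenScores argument of the constructor (Dictionary<string, double>). Whether Train uses a comparable seeding step internally is not stated in this reference.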