Class CharacterTokenizer

Namespace
AiDotNet.Tokenization.Algorithms
Assembly
AiDotNet.dll

Character-level tokenizer that splits text into individual characters. Useful for character-based language models and some RNN architectures.

public class CharacterTokenizer : TokenizerBase, ITokenizer
Inheritance
TokenizerBase
CharacterTokenizer
Implements
ITokenizer

Constructors

CharacterTokenizer(IVocabulary, SpecialTokens, bool, bool)

Creates a new character tokenizer.

public CharacterTokenizer(IVocabulary vocabulary, SpecialTokens specialTokens, bool lowercase = false, bool includeWhitespace = true)

Parameters

vocabulary IVocabulary
specialTokens SpecialTokens
lowercase bool
includeWhitespace bool

Methods

CleanupTokens(List<string>)

Cleans up tokens and converts them back to text.

protected override string CleanupTokens(List<string> tokens)

Parameters

tokens List<string>

Returns

string

CreateAscii(SpecialTokens?, bool)

Creates a character tokenizer with ASCII printable characters.

public static CharacterTokenizer CreateAscii(SpecialTokens? specialTokens = null, bool lowercase = false)

Parameters

specialTokens SpecialTokens
lowercase bool

Returns

CharacterTokenizer
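
A minimal usage sketch for the factory method, relying only on the signatures listed on this page. It assumes the AiDotNet package is referenced and that the program uses C# top-level statements; the input string is illustrative, and the exact tokens returned depend on the library's behavior, which this page does not spell out.

```csharp
using System;
using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

// Build a tokenizer over printable ASCII characters.
// specialTokens is left at its default (null); lowercase is enabled.
CharacterTokenizer tokenizer = CharacterTokenizer.CreateAscii(lowercase: true);

// Tokenize returns a List<string>, per the signature documented on this page.
List<string> tokens = tokenizer.Tokenize("Hello, world");

Console.WriteLine($"{tokens.Count} tokens: {string.Join(" ", tokens)}");
```

CreateAscii is convenient when the input is known to be plain ASCII, since no vocabulary has to be supplied by hand.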

Tokenize(string)

Tokenizes text into individual characters.

public override List<string> Tokenize(string text)

Parameters

text string

Returns

List<string>
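
To make the idea of character-level tokenization concrete, here is a standalone sketch that does not use AiDotNet at all. The helper name SplitIntoCharacters is hypothetical, and the meaning given to lowercase and includeWhitespace is an assumption based on the constructor's parameter names; the library's actual implementation may differ (for example in how it treats surrogate pairs or unknown characters).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of AiDotNet: one token per UTF-16 char.
static List<string> SplitIntoCharacters(string text, bool lowercase = false, bool includeWhitespace = true)
{
    var tokens = new List<string>();
    if (string.IsNullOrEmpty(text)) return tokens;

    // Assumed meaning of lowercase: normalize case before splitting.
    if (lowercase) text = text.ToLowerInvariant();

    foreach (char c in text)
    {
        // Assumed meaning of includeWhitespace: when false, whitespace is dropped.
        if (!includeWhitespace && char.IsWhiteSpace(c)) continue;
        tokens.Add(c.ToString());
    }
    return tokens;
}

List<string> sample = SplitIntoCharacters("Hi there", lowercase: true);
Console.WriteLine(string.Join("|", sample)); // prints: h|i| |t|h|e|r|e
```

Iterating over char keeps the sketch short, but it splits characters outside the Basic Multilingual Plane into two tokens; a Unicode-aware variant would iterate over runes or text elements instead.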

Train(IEnumerable<string>, SpecialTokens?, bool, int)

Trains a character tokenizer from a corpus.

public static CharacterTokenizer Train(IEnumerable<string> corpus, SpecialTokens? specialTokens = null, bool lowercase = false, int minFrequency = 1)

Parameters

corpus IEnumerable<string>
specialTokens SpecialTokens
lowercase bool
minFrequency int

Returns

CharacterTokenizer
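
A short sketch of training from a corpus, again using only the signatures documented above. The two-sentence corpus is made up for illustration, and the comment on minFrequency reflects an assumption drawn from the parameter name (that characters seen fewer times are left out of the vocabulary); this page does not describe the parameter.

```csharp
using System;
using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

// Illustrative corpus; any IEnumerable<string> is accepted per the signature.
var corpus = new List<string>
{
    "the quick brown fox",
    "jumps over the lazy dog"
};

// Assumption: minFrequency: 2 excludes characters that occur only once in the corpus.
CharacterTokenizer tokenizer = CharacterTokenizer.Train(corpus, minFrequency: 2);

List<string> tokens = tokenizer.Tokenize("the fox");
Console.WriteLine(string.Join(" ", tokens));
```

Train suits text whose character set is not known in advance, whereas CreateAscii fixes the character set up front.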