Class CharacterTokenizer
- Namespace
- AiDotNet.Tokenization.Algorithms
- Assembly
- AiDotNet.dll
Character-level tokenizer that splits text into individual characters. Useful for character-based language models and some RNN architectures.
public class CharacterTokenizer : TokenizerBase, ITokenizer
- Inheritance
- TokenizerBase
- CharacterTokenizer
- Implements
- ITokenizer
Constructors
CharacterTokenizer(IVocabulary, SpecialTokens, bool, bool)
Creates a new character tokenizer.
public CharacterTokenizer(IVocabulary vocabulary, SpecialTokens specialTokens, bool lowercase = false, bool includeWhitespace = true)
Parameters
- vocabulary IVocabulary
- specialTokens SpecialTokens
- lowercase bool
- includeWhitespace bool
Methods
CleanupTokens(List<string>)
Cleans up tokens and converts them back to text.
protected override string CleanupTokens(List<string> tokens)
Parameters
- tokens List<string>
Returns
- string
CreateAscii(SpecialTokens?, bool)
Creates a character tokenizer with ASCII printable characters.
public static CharacterTokenizer CreateAscii(SpecialTokens? specialTokens = null, bool lowercase = false)
Parameters
- specialTokens SpecialTokens
- lowercase bool
Returns
- CharacterTokenizer
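As an illustration of what "ASCII printable characters" usually means, the sketch below enumerates the code points from space (0x20) through tilde (0x7E). It is a standalone approximation, not the library's code: the class and method names are hypothetical, and whether CreateAscii includes the space character or drops uppercase letters when lowercase is true is an assumption.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class AsciiVocabSketch
{
    // Returns the printable ASCII characters (0x20 to 0x7E) as one-character strings.
    // Assumption: when lowercase is true, 'A'..'Z' are omitted because input
    // text would be lowercased before lookup.
    public static List<string> PrintableAscii(bool lowercase = false)
    {
        var chars = new List<string>();
        for (char c = ' '; c <= '~'; c++)
        {
            if (lowercase && c >= 'A' && c <= 'Z') continue;
            chars.Add(c.ToString());
        }
        return chars;
    }
}
```

Under these assumptions the full set has 95 entries, and 69 when uppercase letters are skipped.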
Tokenize(string)
Tokenizes text into individual characters.
public override List<string> Tokenize(string text)
Parameters
- text string
Returns
- List<string>
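The following standalone sketch shows one way character-level splitting can work. It does not use or reproduce the AiDotNet implementation: the class name is hypothetical, and the lowercase and includeWhitespace parameters only mirror the constructor flags by name, with their semantics assumed.

```csharp
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class CharTokenizeSketch
{
    // Splits text into one-character tokens.
    // Note: iterating over char yields UTF-16 code units, so a character
    // outside the Basic Multilingual Plane would be split into two tokens.
    public static List<string> Tokenize(string text, bool lowercase = false, bool includeWhitespace = true)
    {
        var tokens = new List<string>();
        if (string.IsNullOrEmpty(text)) return tokens;

        // Assumed meaning of lowercase: normalize case before splitting.
        if (lowercase) text = text.ToLowerInvariant();

        foreach (char c in text)
        {
            // Assumed meaning of includeWhitespace: drop whitespace when false.
            if (!includeWhitespace && char.IsWhiteSpace(c)) continue;
            tokens.Add(c.ToString());
        }
        return tokens;
    }
}
```

With this sketch, `CharTokenizeSketch.Tokenize("Hi there", lowercase: true)` yields the tokens h, i, a space, t, h, e, r, e. Going by the signatures on this page, the equivalent library call would be `CharacterTokenizer.CreateAscii().Tokenize(text)`. Its exact output, including any special-token handling, is not documented here.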
Train(IEnumerable<string>, SpecialTokens?, bool, int)
Trains a character tokenizer from a corpus.
public static CharacterTokenizer Train(IEnumerable<string> corpus, SpecialTokens? specialTokens = null, bool lowercase = false, int minFrequency = 1)
Parameters
- corpus IEnumerable<string>
- specialTokens SpecialTokens
- lowercase bool
- minFrequency int
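Training a character tokenizer generally comes down to collecting the distinct characters in the corpus. The sketch below shows that idea under stated assumptions: the class and method names are hypothetical, a character is kept when its count is at least minFrequency, and special tokens are ignored. The library's actual Train method may differ on each of these points.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class CharVocabSketch
{
    // Counts every character in the corpus and returns those seen at least
    // minFrequency times, sorted by code unit for a stable ordering.
    public static List<string> BuildVocabulary(IEnumerable<string> corpus, bool lowercase = false, int minFrequency = 1)
    {
        var counts = new Dictionary<char, int>();
        foreach (string line in corpus)
        {
            if (line == null) continue;
            string text = lowercase ? line.ToLowerInvariant() : line;
            foreach (char c in text)
            {
                counts.TryGetValue(c, out int n); // n is 0 when c is not yet present
                counts[c] = n + 1;
            }
        }

        return counts
            .Where(kv => kv.Value >= minFrequency) // assumed inclusive threshold
            .OrderBy(kv => kv.Key)
            .Select(kv => kv.Key.ToString())
            .ToList();
    }
}
```

For the corpus { "aab", "bc" } with minFrequency set to 2, this sketch keeps a and b (two occurrences each) and drops c (one occurrence).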