Class WordPieceTokenizer
- Namespace
- AiDotNet.Tokenization.Algorithms
- Assembly
- AiDotNet.dll
WordPiece tokenizer implementation. Used by BERT and similar models.
public class WordPieceTokenizer : TokenizerBase, ITokenizer
- Inheritance
- TokenizerBase
- WordPieceTokenizer
- Implements
- ITokenizer
Constructors
WordPieceTokenizer(IVocabulary, SpecialTokens?, string, int)
Creates a new WordPiece tokenizer.
public WordPieceTokenizer(IVocabulary vocabulary, SpecialTokens? specialTokens = null, string continuingSubwordPrefix = "##", int maxInputCharsPerWord = 100)
Parameters
- vocabulary IVocabulary
The vocabulary.
- specialTokens SpecialTokens
The special tokens (optional; default: null).
- continuingSubwordPrefix string
The prefix for continuing subwords (default: "##").
- maxInputCharsPerWord int
Maximum characters per word (default: 100).
Methods
CleanupTokens(List<string>)
Cleans up tokens and converts them back to text.
protected override string CleanupTokens(List<string> tokens)
Parameters
- tokens List<string>
The tokens to clean up.
Returns
- string
The text reconstructed from the tokens.
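This page does not describe how CleanupTokens rebuilds the text, so the following is only a minimal sketch of the convention commonly used with BERT-style WordPiece output: a token that starts with the continuing-subword prefix is glued onto the previous token, and every other token starts a new space-separated word. The class WordPieceCleanupSketch and the method JoinTokens are hypothetical names for illustration and are not part of AiDotNet; the library's own method may also handle special tokens or punctuation differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of AiDotNet.
public static class WordPieceCleanupSketch
{
    // Joins WordPiece tokens back into text, assuming "##"-style continuation markers.
    public static string JoinTokens(
        IEnumerable<string> tokens,
        string continuingSubwordPrefix = "##")
    {
        var text = new StringBuilder();
        foreach (var token in tokens)
        {
            if (token.StartsWith(continuingSubwordPrefix, StringComparison.Ordinal))
            {
                // Continuation piece: drop the prefix and append without a space.
                text.Append(token.Substring(continuingSubwordPrefix.Length));
            }
            else
            {
                // Word-initial piece: separate it from the preceding word.
                if (text.Length > 0)
                {
                    text.Append(' ');
                }
                text.Append(token);
            }
        }
        return text.ToString();
    }
}
```

Under these assumptions, the tokens "un", "##aff", "##able", "words" are joined into two words, with the three leading pieces merged into one.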
Tokenize(string)
Tokenizes text into WordPiece tokens.
public override List<string> Tokenize(string text)
Parameters
- text string
The text to tokenize.
Returns
- List<string>
The WordPiece tokens.
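To illustrate what the continuingSubwordPrefix and maxInputCharsPerWord parameters typically control, here is a minimal sketch of the greedy longest-match-first lookup used by BERT-style WordPiece tokenizers, applied to a single word. It is not taken from the AiDotNet source: the class WordPieceSketch, the method TokenizeWord, the plain ISet<string> vocabulary, and the "[UNK]" unknown token are all assumptions made for the example, and the library's Tokenize may pre-process or split text differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of AiDotNet.
public static class WordPieceSketch
{
    // Splits one word into pieces by greedy longest-match-first vocabulary lookup.
    public static List<string> TokenizeWord(
        string word,
        ISet<string> vocab,
        string continuingSubwordPrefix = "##",
        int maxInputCharsPerWord = 100,
        string unknownToken = "[UNK]") // assumed BERT-style unknown token
    {
        // Words longer than the limit are mapped to the unknown token outright.
        if (word.Length > maxInputCharsPerWord)
        {
            return new List<string> { unknownToken };
        }

        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            int end = word.Length;
            string candidate = "";
            // Shrink the candidate from the right until it is in the vocabulary.
            while (end > start)
            {
                candidate = word.Substring(start, end - start);
                if (start > 0)
                {
                    // Pieces that do not start the word carry the prefix.
                    candidate = continuingSubwordPrefix + candidate;
                }
                if (vocab.Contains(candidate))
                {
                    break;
                }
                end--;
            }

            // No piece matched at this position: the whole word is unknown.
            if (end == start)
            {
                return new List<string> { unknownToken };
            }

            pieces.Add(candidate);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing "un", "##aff", and "##able", the word "unaffable" splits into those three pieces in order, while a word with no matching pieces collapses to the single unknown token.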
Train(IEnumerable<string>, int, SpecialTokens?, string)
Trains a WordPiece tokenizer from a corpus.
public static WordPieceTokenizer Train(IEnumerable<string> corpus, int vocabSize, SpecialTokens? specialTokens = null, string continuingSubwordPrefix = "##")
Parameters
- corpus IEnumerable<string>
The training corpus.
- vocabSize int
The desired vocabulary size.
- specialTokens SpecialTokens
The special tokens (optional; default: null).
- continuingSubwordPrefix string
The prefix for continuing subwords (default: "##").
Returns
- WordPieceTokenizer
A trained WordPiece tokenizer.