
Class WordPieceTokenizer

Namespace
AiDotNet.Tokenization.Algorithms
Assembly
AiDotNet.dll

Implements WordPiece subword tokenization, the scheme used by BERT and similar models.

public class WordPieceTokenizer : TokenizerBase, ITokenizer
Inheritance
TokenizerBase
WordPieceTokenizer
Implements
ITokenizer

Constructors

WordPieceTokenizer(IVocabulary, SpecialTokens?, string, int)

Creates a new WordPiece tokenizer.

public WordPieceTokenizer(IVocabulary vocabulary, SpecialTokens? specialTokens = null, string continuingSubwordPrefix = "##", int maxInputCharsPerWord = 100)

Parameters

vocabulary IVocabulary

The vocabulary of known tokens used by the tokenizer.

specialTokens SpecialTokens

The special tokens to use. Optional (default: null).

continuingSubwordPrefix string

The prefix for continuing subwords (default: "##").

maxInputCharsPerWord int

The maximum number of characters allowed in a single word (default: 100).

Methods

CleanupTokens(List<string>)

Cleans up tokens and converts them back to text.

protected override string CleanupTokens(List<string> tokens)

Parameters

tokens List<string>

The tokens to clean up and convert back to text.

Returns

string

The text reconstructed from the tokens.
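The reference does not describe how the cleanup is performed. As a minimal sketch of what converting WordPiece tokens back to text usually involves, the following standalone code strips the continuing-subword prefix and attaches such pieces to the preceding token. The class `WordPieceJoinSketch` and its `Join` method are hypothetical names for illustration and are not part of AiDotNet; the actual `CleanupTokens` override may behave differently (for example, in how it treats special tokens or punctuation).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of AiDotNet.
public static class WordPieceJoinSketch
{
    // Joins WordPiece tokens into text: pieces that start with the
    // continuing-subword prefix are glued to the previous token,
    // all other tokens are separated by a single space.
    public static string Join(IEnumerable<string> tokens, string continuingSubwordPrefix = "##")
    {
        var sb = new StringBuilder();
        foreach (var token in tokens)
        {
            if (token.StartsWith(continuingSubwordPrefix, StringComparison.Ordinal))
            {
                // Continuation piece: drop the prefix and append directly.
                sb.Append(token.Substring(continuingSubwordPrefix.Length));
            }
            else
            {
                // Word-initial piece: start a new space-separated word.
                if (sb.Length > 0) sb.Append(' ');
                sb.Append(token);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var tokens = new List<string> { "un", "##aff", "##able", "words" };
        Console.WriteLine(Join(tokens)); // prints "unaffable words"
    }
}
```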

Tokenize(string)

Tokenizes text into WordPiece tokens.

public override List<string> Tokenize(string text)

Parameters

text string

The text to tokenize.

Returns

List<string>

The WordPiece tokens produced from the text.
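To illustrate how the `continuingSubwordPrefix` and `maxInputCharsPerWord` settings typically come into play, here is a standalone sketch of the greedy longest-match-first splitting that BERT-style WordPiece tokenizers commonly use. It is an assumption that this class works the same way; the reference does not say. The names `WordPieceSketch`, `TokenizeWord`, and the `"[UNK]"` unknown token are hypothetical and not part of AiDotNet.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of AiDotNet.
public static class WordPieceSketch
{
    // Splits one word into pieces by repeatedly taking the longest
    // vocabulary entry that matches at the current position.
    public static List<string> TokenizeWord(
        string word,
        ISet<string> vocab,
        string unkToken = "[UNK]",              // assumed unknown-token text
        string continuingSubwordPrefix = "##",
        int maxInputCharsPerWord = 100)
    {
        // Overlong words are mapped to the unknown token without searching.
        if (word.Length > maxInputCharsPerWord)
            return new List<string> { unkToken };

        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            int end = word.Length;
            bool found = false;
            while (start < end)
            {
                string candidate = word.Substring(start, end - start);
                // Pieces that do not start the word carry the prefix.
                if (start > 0) candidate = continuingSubwordPrefix + candidate;
                if (vocab.Contains(candidate))
                {
                    pieces.Add(candidate);
                    found = true;
                    break;
                }
                end--; // shrink the candidate and try again
            }
            // If nothing matches at this position, the whole word is unknown.
            if (!found) return new List<string> { unkToken };
            start = end;
        }
        return pieces;
    }

    public static void Main()
    {
        var vocab = new HashSet<string> { "un", "##aff", "##able" };
        Console.WriteLine(string.Join(" ", TokenizeWord("unaffable", vocab))); // prints "un ##aff ##able"
        Console.WriteLine(string.Join(" ", TokenizeWord("xyz", vocab)));       // prints "[UNK]"
    }
}
```

Greedy matching keeps the split deterministic and cheap, at the cost of occasionally missing a segmentation that a smaller first piece would have allowed.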

Train(IEnumerable<string>, int, SpecialTokens?, string)

Trains a WordPiece tokenizer from a corpus.

public static WordPieceTokenizer Train(IEnumerable<string> corpus, int vocabSize, SpecialTokens? specialTokens = null, string continuingSubwordPrefix = "##")

Parameters

corpus IEnumerable<string>

The training corpus.

vocabSize int

The desired vocabulary size.

specialTokens SpecialTokens

The special tokens to use. Optional (default: null).

continuingSubwordPrefix string

The prefix for continuing subwords (default: "##").

Returns

WordPieceTokenizer

A trained WordPiece tokenizer.
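A short usage sketch, relying only on the `Train` and `Tokenize` signatures documented on this page, might look like the following. The corpus contents and the vocabulary size of 100 are arbitrary illustrative values, and the class name `TrainUsageSketch` is hypothetical. The exact tokens produced depend on the training implementation, so no output is shown.

```csharp
using System;
using System.Collections.Generic;
using AiDotNet.Tokenization.Algorithms;

public static class TrainUsageSketch
{
    public static void Main()
    {
        // Tiny illustrative corpus; real training data would be much larger.
        var corpus = new List<string>
        {
            "the quick brown fox",
            "the lazy dog",
            "quick brown dogs"
        };

        // Train with the default special tokens and the default "##" prefix.
        WordPieceTokenizer tokenizer = WordPieceTokenizer.Train(corpus, vocabSize: 100);

        // Tokenize new text with the trained tokenizer.
        List<string> tokens = tokenizer.Tokenize("the quick dog");
        Console.WriteLine(string.Join(" ", tokens));
    }
}
```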