Class WhisperTokenizer
Tokenizer for Whisper speech recognition model.
public class WhisperTokenizer
- Inheritance
- object
- WhisperTokenizer
Remarks
Whisper uses a byte pair encoding (BPE) tokenizer extended with special tokens that control transcription behavior (language, task, timestamps).
For Beginners: A tokenizer converts text to numbers (tokens) and back. Whisper's tokenizer has special tokens for:
- Language codes (to specify which language to transcribe)
- Task tokens (transcribe vs translate)
- Timestamp tokens (to mark when a piece of speech starts and ends, in 0.02-second steps)
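To show how these special tokens fit together, the sketch below assembles the prefix that Whisper's decoder conventionally receives: start of transcript, then language, then task, then an optional no-timestamps token. `WhisperPromptSketch` and `BuildPrompt` are hypothetical names used only for illustration and are not part of this class. The method takes plain token IDs so that it compiles on its own.

```csharp
using System;
using System.Collections.Generic;

public static class WhisperPromptSketch
{
    // Builds the decoder prefix in Whisper's conventional order:
    // <|startoftranscript|>, language token, task token, optional <|notimestamps|>.
    public static List<long> BuildPrompt(
        int startOfTranscript,
        int languageToken,
        int taskToken,
        int? noTimestampsToken = null)
    {
        var prompt = new List<long> { startOfTranscript, languageToken, taskToken };

        // Append <|notimestamps|> only when the caller wants plain text output.
        if (noTimestampsToken.HasValue)
        {
            prompt.Add(noTimestampsToken.Value);
        }

        return prompt;
    }
}
```

Given a tokenizer instance (its construction is not covered on this page), a call would look like `BuildPrompt(tokenizer.StartOfTranscript, tokenizer.GetLanguageToken("en"), tokenizer.TranscribeToken, tokenizer.NoTimestampsToken)`.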
Properties
EndOfText
Gets the end of text token ID.
public int EndOfText { get; }
Property Value
- int
NoSpeechToken
Gets the no speech token ID.
public int NoSpeechToken { get; }
Property Value
- int
NoTimestampsToken
Gets the no timestamps token ID.
public int NoTimestampsToken { get; }
Property Value
- int
StartOfTranscript
Gets the start of transcript token ID.
public int StartOfTranscript { get; }
Property Value
- int
SupportedLanguages
Gets all supported language codes.
public static IReadOnlyList<string> SupportedLanguages { get; }
Property Value
- IReadOnlyList<string>
TranscribeToken
Gets the transcribe task token ID.
public int TranscribeToken { get; }
Property Value
- int
TranslateToken
Gets the translate task token ID.
public int TranslateToken { get; }
Property Value
- int
Methods
Decode(IEnumerable<long>)
Decodes a sequence of token IDs to text.
public string Decode(IEnumerable<long> tokenIds)
Parameters
- tokenIds IEnumerable<long>
The token IDs to decode.
Returns
- string
The decoded text.
Remarks
This is a simplified decoder. A full implementation would use the actual BPE vocabulary from the Whisper model.
Encode(string)
Encodes text to token IDs.
public List<long> Encode(string text)
Parameters
- text string
The text to encode.
Returns
- List<long>
The token IDs for the text.
Remarks
This is a placeholder. A full implementation would use BPE encoding.
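For readers wondering what the BPE step involves, here is a minimal sketch of the merge loop at its core. It is not this class's implementation. `BpeSketch`, `Merge`, and the `mergeRanks` table are hypothetical, and the sketch works on characters for readability, whereas Whisper's real tokenizer works on bytes with a merge table shipped with the model.

```csharp
using System;
using System.Collections.Generic;

public static class BpeSketch
{
    // Applies ranked merges to a single word and returns the resulting symbols.
    // A lower rank means the pair was learned earlier and is merged first.
    public static List<string> Merge(
        string word,
        IReadOnlyDictionary<(string, string), int> mergeRanks)
    {
        // Start with one symbol per character.
        var symbols = new List<string>();
        foreach (char c in word)
        {
            symbols.Add(c.ToString());
        }

        while (symbols.Count > 1)
        {
            // Find the adjacent pair with the lowest rank (leftmost on ties).
            int bestIndex = -1;
            int bestRank = int.MaxValue;
            for (int i = 0; i < symbols.Count - 1; i++)
            {
                if (mergeRanks.TryGetValue((symbols[i], symbols[i + 1]), out int rank)
                    && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }

            // Stop when no learned merge applies to any adjacent pair.
            if (bestIndex < 0)
            {
                break;
            }

            // Replace the pair with its concatenation.
            symbols[bestIndex] = symbols[bestIndex] + symbols[bestIndex + 1];
            symbols.RemoveAt(bestIndex + 1);
        }

        return symbols;
    }
}
```

A complete encoder would then look up each resulting symbol in the model's vocabulary to obtain its token ID.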
GetLanguageToken(string)
Gets the token ID for a language code.
public int GetLanguageToken(string languageCode)
Parameters
- languageCode string
Two-letter language code (e.g., "en", "es").
Returns
- int
The token ID for the language.
GetTimeFromToken(int)
Converts a timestamp token ID to time in seconds.
public double GetTimeFromToken(int tokenId)
Parameters
- tokenId int
The timestamp token ID.
Returns
- double
Time in seconds.
GetTimestampToken(double)
Gets the timestamp token ID for a given time in seconds.
public int GetTimestampToken(double timeSeconds)
Parameters
- timeSeconds double
Time in seconds (must be a multiple of 0.02).
Returns
- int
The timestamp token ID.
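The relationship between a time and its timestamp token can be sketched as simple arithmetic on a 0.02-second grid. This is an illustration, not this class's code: `TimestampTokenSketch` and its members are hypothetical, and the base ID 50364 is an assumption taken from the original multilingual Whisper vocabulary (the `<|0.00|>` token), so the value used by this library may differ.

```csharp
using System;

public static class TimestampTokenSketch
{
    // Assumption: ID of <|0.00|> in the original multilingual Whisper vocabulary.
    public const int TimestampBegin = 50364;

    // Each successive timestamp token advances time by 0.02 seconds.
    public const double SecondsPerToken = 0.02;

    // Maps a time in seconds to a timestamp token ID.
    public static int TimeToToken(double timeSeconds)
    {
        if (timeSeconds < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(timeSeconds));
        }

        // Round to the nearest step so floating-point noise does not shift the result.
        int steps = (int)Math.Round(timeSeconds / SecondsPerToken);
        return TimestampBegin + steps;
    }

    // Maps a timestamp token ID back to a time in seconds.
    public static double TokenToTime(int tokenId)
    {
        if (tokenId < TimestampBegin)
        {
            throw new ArgumentOutOfRangeException(nameof(tokenId));
        }

        return (tokenId - TimestampBegin) * SecondsPerToken;
    }
}
```

Under these assumptions, 1.5 seconds is 75 steps past the base, so `TimeToToken(1.5)` yields 50364 + 75 = 50439, and `TokenToTime` inverts that mapping.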
IsSpecialToken(int)
Checks if a token ID is a special token.
public bool IsSpecialToken(int tokenId)
Parameters
- tokenId int
The token ID to check.
Returns
- bool
True if the token is a special token.
IsTimestampToken(int)
Checks if a token ID is a timestamp token.
public bool IsTimestampToken(int tokenId)
Parameters
- tokenId int
The token ID to check.
Returns
- bool
True if the token is a timestamp token.