Enum PretrainedTokenizerModel
Namespace: AiDotNet.Tokenization.Configuration
Assembly: AiDotNet.dll
Specifies the pretrained tokenizer models available from the HuggingFace Hub.
public enum PretrainedTokenizerModel
Fields
AlbertBaseV2 = 14
ALBERT Base v2 - A Lite BERT with parameter sharing. Much smaller model size with competitive performance.

BertBaseCased = 1
BERT Base Cased - Preserves case information. Best when capitalization matters (e.g., named entity recognition).

BertBaseUncased = 0
BERT Base Uncased - The default choice for most NLP tasks. Vocabulary: 30,522 tokens. Case-insensitive.

BertLargeCased = 3
BERT Large Cased - Large model preserving case. Best accuracy for case-sensitive tasks.

BertLargeUncased = 2
BERT Large Uncased - Larger model with better accuracy. Vocabulary: 30,522 tokens. More compute intensive.

CodeBertBase = 18
CodeBERT Base - BERT for programming languages. Best for code understanding tasks.

DistilBertBaseCased = 10
DistilBERT Base Cased - Distilled BERT preserving case. Fast and case-sensitive.

DistilBertBaseUncased = 9
DistilBERT Base Uncased - Distilled BERT (40% smaller, 60% faster). Good balance of speed and accuracy.

ElectraBase = 17
Electra Base - Efficient pretraining approach (base size). Good accuracy with efficient training.

ElectraSmall = 16
Electra Small - Efficient pretraining approach. Very efficient for its size.

Gpt2 = 4
GPT-2 - OpenAI's text generation model. Vocabulary: 50,257 tokens. Best for text generation.

Gpt2Large = 6
GPT-2 Large - Even larger GPT-2 variant. High quality generation for demanding applications.

Gpt2Medium = 5
GPT-2 Medium - Larger GPT-2 variant. Better quality generation, more compute required.

GraphCodeBert = 20
GraphCodeBERT - Code model with data flow. Enhanced code understanding with graph structure.

MicrosoftCodeBert = 19
Microsoft CodeBERT - Multi-language code model. Supports multiple programming languages.

RobertaBase = 7
RoBERTa Base - Robustly optimized BERT. Often outperforms BERT on benchmarks.

RobertaLarge = 8
RoBERTa Large - Large RoBERTa model. State-of-the-art performance on many tasks.

T5Base = 12
T5 Base - Text-to-Text Transfer Transformer (base). Good balance of performance and efficiency.

T5Large = 13
T5 Large - Text-to-Text Transfer Transformer (large). High performance for complex tasks.

T5Small = 11
T5 Small - Text-to-Text Transfer Transformer (small). Versatile for many NLP tasks.

XlnetBaseCased = 15
XLNet Base Cased - Autoregressive pretraining. Strong performance on long-context tasks.
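Each value corresponds to a well-known HuggingFace Hub repository. The sketch below illustrates how a consumer could resolve a few of these values to their standard Hub repository ids; the HubIdSketch class and ToHubId method are illustrative assumptions for this example, not AiDotNet APIs, and the library presumably performs an equivalent mapping internally.

```csharp
using System;
using AiDotNet.Tokenization.Configuration;

// Illustrative sketch only: ToHubId is not an AiDotNet API. The Hub repository
// ids shown are the standard public ones for these models; the remaining enum
// values are omitted for brevity.
public static class HubIdSketch
{
    public static string ToHubId(PretrainedTokenizerModel model) => model switch
    {
        PretrainedTokenizerModel.BertBaseUncased       => "bert-base-uncased",
        PretrainedTokenizerModel.BertBaseCased         => "bert-base-cased",
        PretrainedTokenizerModel.Gpt2                  => "gpt2",
        PretrainedTokenizerModel.RobertaBase           => "roberta-base",
        PretrainedTokenizerModel.DistilBertBaseUncased => "distilbert-base-uncased",
        PretrainedTokenizerModel.T5Small               => "t5-small",
        _ => throw new ArgumentOutOfRangeException(nameof(model), model, "Value not covered in this sketch.")
    };
}
```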
Remarks
For Beginners: These are industry-standard tokenizers that have been trained on large text corpora. Each is designed for different use cases:
- BERT models: Best for understanding text (classification, Q&A, NER)
- GPT models: Best for text generation
- RoBERTa: Improved BERT with better training
- T5: Versatile text-to-text model
- DistilBERT: Faster, smaller BERT
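To make that guidance concrete, here is a minimal sketch of a task-to-model helper. The TextTask enum, TokenizerDefaults class, and the default choices are illustrative assumptions, not AiDotNet types; only the PretrainedTokenizerModel values are the ones documented above.

```csharp
using AiDotNet.Tokenization.Configuration;

// Hypothetical task categories used only for this sketch.
public enum TextTask { Classification, QuestionAnswering, NamedEntityRecognition, Generation, Summarization, CodeUnderstanding }

public static class TokenizerDefaults
{
    // Maps a task to a reasonable default model, following the remarks above.
    public static PretrainedTokenizerModel For(TextTask task, bool preferSpeed = false) => task switch
    {
        TextTask.Generation             => PretrainedTokenizerModel.Gpt2,                  // text generation
        TextTask.Summarization          => PretrainedTokenizerModel.T5Base,                // text-to-text tasks
        TextTask.CodeUnderstanding      => PretrainedTokenizerModel.CodeBertBase,          // programming languages
        TextTask.NamedEntityRecognition => PretrainedTokenizerModel.BertBaseCased,         // capitalization matters
        _ when preferSpeed              => PretrainedTokenizerModel.DistilBertBaseUncased, // smaller and faster
        _                               => PretrainedTokenizerModel.BertBaseUncased        // general understanding
    };
}
```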