Enum PretrainedTokenizerModel

Namespace
AiDotNet.Tokenization.Configuration
Assembly
AiDotNet.dll

Specifies pretrained tokenizer models available from HuggingFace Hub.

public enum PretrainedTokenizerModel
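
Assembled from the numeric values documented under Fields below, the declaration is equivalent to the following sketch (members are listed in numeric order; the order in the actual source file may differ):

public enum PretrainedTokenizerModel
{
    // BERT family
    BertBaseUncased = 0,
    BertBaseCased = 1,
    BertLargeUncased = 2,
    BertLargeCased = 3,
    // GPT-2 family
    Gpt2 = 4,
    Gpt2Medium = 5,
    Gpt2Large = 6,
    // RoBERTa
    RobertaBase = 7,
    RobertaLarge = 8,
    // DistilBERT
    DistilBertBaseUncased = 9,
    DistilBertBaseCased = 10,
    // T5
    T5Small = 11,
    T5Base = 12,
    T5Large = 13,
    // Other encoders
    AlbertBaseV2 = 14,
    XlnetBaseCased = 15,
    ElectraSmall = 16,
    ElectraBase = 17,
    // Code models
    CodeBertBase = 18,
    MicrosoftCodeBert = 19,
    GraphCodeBert = 20
}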
Fields

AlbertBaseV2 = 14

ALBERT Base v2 - A Lite BERT with parameter sharing. Much smaller model size with competitive performance.

BertBaseCased = 1

BERT Base Cased - Preserves case information. Best when capitalization matters (e.g., named entity recognition).

BertBaseUncased = 0

BERT Base Uncased - The default choice for most NLP tasks. Vocabulary: 30,522 tokens. Case-insensitive.

BertLargeCased = 3

BERT Large Cased - Large model preserving case. Best accuracy for case-sensitive tasks.

BertLargeUncased = 2

BERT Large Uncased - Larger model with better accuracy. Vocabulary: 30,522 tokens. More compute-intensive.

CodeBertBase = 18

CodeBERT Base - BERT for programming languages. Best for code understanding tasks.

DistilBertBaseCased = 10

DistilBERT Base Cased - Distilled BERT preserving case. Fast and case-sensitive.

DistilBertBaseUncased = 9

DistilBERT Base Uncased - Distilled BERT (40% smaller, 60% faster). Good balance of speed and accuracy.

ElectraBase = 17

ELECTRA Base - Base-sized model pretrained with ELECTRA's efficient replaced-token-detection objective. Good accuracy at a modest training cost.

ElectraSmall = 16

ELECTRA Small - Small model pretrained with the same efficient replaced-token-detection objective. Strong accuracy for its size.

Gpt2 = 4

GPT-2 - OpenAI's autoregressive language model. Vocabulary: 50,257 tokens (byte-level BPE). Best for text generation.

Gpt2Large = 6

GPT-2 Large - The largest GPT-2 variant listed here. High-quality generation for demanding applications.

Gpt2Medium = 5

GPT-2 Medium - Larger GPT-2 variant. Better quality generation, more compute required.

GraphCodeBert = 20

GraphCodeBERT - Code model that incorporates data-flow information. Enhanced code understanding through graph structure.

MicrosoftCodeBert = 19

Microsoft CodeBERT - Microsoft's code model, pretrained on multiple programming languages.

RobertaBase = 7

RoBERTa Base - Robustly optimized BERT. Often outperforms BERT on benchmarks.

RobertaLarge = 8

RoBERTa Large - Large RoBERTa model. State-of-the-art performance on many tasks.

T5Base = 12

T5 Base - Text-to-Text Transfer Transformer (base). Good balance of performance and efficiency.

T5Large = 13

T5 Large - Text-to-Text Transfer Transformer (large). High performance for complex tasks.

T5Small = 11

T5 Small - Text-to-Text Transfer Transformer (small). Versatile for many NLP tasks.

XlnetBaseCased = 15

XLNet Base Cased - Permutation-based autoregressive pretraining. Strong performance on long-context tasks.
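
Each member corresponds to a well-known checkpoint on the HuggingFace Hub. The mapping below is only an illustrative sketch using the standard Hub model IDs; whether AiDotNet resolves the enum to exactly these IDs internally is an assumption, not something documented here.

using System.Collections.Generic;
using AiDotNet.Tokenization.Configuration;

public static class HubModelIds
{
    // Illustrative mapping from enum members to the standard HuggingFace Hub model IDs.
    // The actual IDs used by AiDotNet may differ; treat this table as an assumption.
    public static readonly IReadOnlyDictionary<PretrainedTokenizerModel, string> Map =
        new Dictionary<PretrainedTokenizerModel, string>
        {
            [PretrainedTokenizerModel.BertBaseUncased] = "bert-base-uncased",
            [PretrainedTokenizerModel.BertBaseCased] = "bert-base-cased",
            [PretrainedTokenizerModel.BertLargeUncased] = "bert-large-uncased",
            [PretrainedTokenizerModel.BertLargeCased] = "bert-large-cased",
            [PretrainedTokenizerModel.Gpt2] = "gpt2",
            [PretrainedTokenizerModel.Gpt2Medium] = "gpt2-medium",
            [PretrainedTokenizerModel.Gpt2Large] = "gpt2-large",
            [PretrainedTokenizerModel.RobertaBase] = "roberta-base",
            [PretrainedTokenizerModel.RobertaLarge] = "roberta-large",
            [PretrainedTokenizerModel.DistilBertBaseUncased] = "distilbert-base-uncased",
            [PretrainedTokenizerModel.DistilBertBaseCased] = "distilbert-base-cased",
            [PretrainedTokenizerModel.T5Small] = "t5-small",
            [PretrainedTokenizerModel.T5Base] = "t5-base",
            [PretrainedTokenizerModel.T5Large] = "t5-large",
            [PretrainedTokenizerModel.AlbertBaseV2] = "albert-base-v2",
            [PretrainedTokenizerModel.XlnetBaseCased] = "xlnet-base-cased",
            [PretrainedTokenizerModel.ElectraSmall] = "google/electra-small-discriminator",
            [PretrainedTokenizerModel.ElectraBase] = "google/electra-base-discriminator",
            // Assumption: the code-model members map to Microsoft's Hub checkpoints.
            // The distinction between CodeBertBase and MicrosoftCodeBert is not clear
            // from the field documentation above.
            [PretrainedTokenizerModel.MicrosoftCodeBert] = "microsoft/codebert-base",
            [PretrainedTokenizerModel.GraphCodeBert] = "microsoft/graphcodebert-base"
        };
}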

Remarks

For Beginners: These are industry-standard tokenizers that have been trained on large text corpora. Each is designed for different use cases (a selection sketch follows this list):

  • BERT models: Best for understanding text (classification, Q&A, NER)
  • GPT models: Best for text generation
  • RoBERTa: Improved BERT with better training
  • T5: Versatile text-to-text model
  • DistilBERT: Faster, smaller BERT
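
A minimal sketch of acting on this guidance. The TokenizationTask enum and the selector class are hypothetical helpers invented for this example only; they are not part of AiDotNet. Only the PretrainedTokenizerModel values come from the API documented above.

using AiDotNet.Tokenization.Configuration;

// Hypothetical task categories used only for this example.
public enum TokenizationTask
{
    Classification,
    NamedEntityRecognition,
    TextGeneration,
    TextToText,
    CodeUnderstanding
}

public static class TokenizerModelSelector
{
    // Maps a task to a reasonable default model, following the guidance in the remarks above.
    public static PretrainedTokenizerModel Select(TokenizationTask task) => task switch
    {
        TokenizationTask.NamedEntityRecognition => PretrainedTokenizerModel.BertBaseCased,   // capitalization matters for NER
        TokenizationTask.Classification         => PretrainedTokenizerModel.BertBaseUncased, // default for understanding tasks
        TokenizationTask.TextGeneration         => PretrainedTokenizerModel.Gpt2,            // text generation
        TokenizationTask.TextToText             => PretrainedTokenizerModel.T5Base,          // versatile text-to-text tasks
        TokenizationTask.CodeUnderstanding      => PretrainedTokenizerModel.CodeBertBase,    // source code
        _ => PretrainedTokenizerModel.BertBaseUncased
    };
}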