Class AutoTokenizer
- Namespace
- AiDotNet.Tokenization.HuggingFace
- Assembly
- AiDotNet.dll
AutoTokenizer provides HuggingFace-style automatic tokenizer loading. This class automatically detects and loads the appropriate tokenizer type based on the model configuration.
public static class AutoTokenizer
- Inheritance
- object
AutoTokenizer
Remarks
Usage mirrors the HuggingFace transformers library:
// Load from HuggingFace Hub
var tokenizer = AutoTokenizer.FromPretrained("bert-base-uncased");
// Load from a local directory (a distinct variable name, so both lines compile in the same scope)
var localTokenizer = AutoTokenizer.FromPretrained("./my-model");
Methods
ClearCache(string?, string?)
Clears the cache for a specific model or all models.
public static void ClearCache(string? modelName = null, string? cacheDir = null)
Parameters
- modelName string
Optional model name to clear. If null, clears all cached tokenizers.
- cacheDir string
Optional cache directory. Uses the default if not specified.
FromPretrained(string, string?)
Loads a tokenizer from a pretrained model name or path.
public static ITokenizer FromPretrained(string modelNameOrPath, string? cacheDir = null)
Parameters
- modelNameOrPath string
Either a HuggingFace model name (e.g., "bert-base-uncased", "gpt2") or a local directory path containing tokenizer files.
- cacheDir string
Optional cache directory for downloaded files. Defaults to ~/.cache/huggingface/tokenizers.
Returns
- ITokenizer
The loaded tokenizer.
Exceptions
- ArgumentException
Thrown when modelNameOrPath is empty.
- InvalidOperationException
Thrown when the tokenizer cannot be loaded.
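The two documented exceptions suggest a natural error-handling pattern. The following is a minimal sketch, not the library's recommended usage: `TryLoadTokenizer` is a hypothetical helper introduced here for illustration, and the result is held in a `var` because the members of ITokenizer are not described on this page.

```csharp
using System;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical helper (not part of AiDotNet): returns true when loading succeeds.
static bool TryLoadTokenizer(string modelNameOrPath)
{
    try
    {
        var tokenizer = AutoTokenizer.FromPretrained(modelNameOrPath);
        Console.WriteLine($"Loaded {tokenizer.GetType().Name} for '{modelNameOrPath}'.");
        return true;
    }
    catch (ArgumentException ex)
    {
        // Documented case: modelNameOrPath is empty.
        Console.WriteLine($"Invalid model name or path: {ex.Message}");
        return false;
    }
    catch (InvalidOperationException ex)
    {
        // Documented case: the tokenizer cannot be loaded.
        Console.WriteLine($"Could not load tokenizer: {ex.Message}");
        return false;
    }
}

// "bert-base-uncased" is one of the example names from the parameter description.
TryLoadTokenizer("bert-base-uncased");
```

Catching the two exception types separately keeps a caller's mistake (an empty name) distinct from a load failure.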
FromPretrainedAsync(string, string?)
Asynchronously loads a tokenizer from a pretrained model name or path.
public static Task<ITokenizer> FromPretrainedAsync(string modelNameOrPath, string? cacheDir = null)
Parameters
- modelNameOrPath string
Either a HuggingFace model name (e.g., "bert-base-uncased", "gpt2") or a local directory path containing tokenizer files.
- cacheDir string
Optional cache directory for downloaded files. Defaults to ~/.cache/huggingface/tokenizers.
Returns
- Task<ITokenizer>
The loaded tokenizer.
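As a rough illustration of the asynchronous overload, the sketch below awaits the returned task from a top-level program. The cache directory name is a hypothetical choice made for this example, and the code assumes the model files can be reached (from the cache or over the network). This page does not list exceptions for the asynchronous method, so none are handled here.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical cache location, chosen for this example only.
string cacheDir = Path.Combine(Path.GetTempPath(), "aidotnet-tokenizers");

// Awaiting the task avoids blocking the calling thread while the load is in progress.
var tokenizer = await AutoTokenizer.FromPretrainedAsync("gpt2", cacheDir);
Console.WriteLine($"Loaded {tokenizer.GetType().Name}.");
```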
GetDefaultCacheDir()
Gets the default cache directory for tokenizer files.
public static string GetDefaultCacheDir()
Returns
- string
The default cache directory path.
IsCached(string, string?)
Checks if a tokenizer is cached locally.
public static bool IsCached(string modelName, string? cacheDir = null)
Parameters
- modelName string
The model name to check.
- cacheDir string
Optional cache directory. Uses the default if not specified.
Returns
- bool
true if the tokenizer is cached; otherwise, false.
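One plausible use of IsCached is to decide whether a load is likely to need network access. The sketch below only reports the result. The scratch directory is a hypothetical location chosen for this example, and creating it up front is a precaution, since this page does not say how a missing directory is handled.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical scratch cache directory, used so the default cache is not involved.
string cacheDir = Path.Combine(Path.GetTempPath(), "aidotnet-tokenizers-demo");
Directory.CreateDirectory(cacheDir);

const string modelName = "gpt2";
bool cached = AutoTokenizer.IsCached(modelName, cacheDir);

Console.WriteLine(cached
    ? $"'{modelName}' is already cached in {cacheDir}."
    : $"'{modelName}' is not cached; loading it would need to fetch the files.");
```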
ListCachedModels(string?)
Lists all cached tokenizer models.
public static string[] ListCachedModels(string? cacheDir = null)
Parameters
- cacheDir string
Optional cache directory. Uses the default if not specified.
Returns
- string[]
Array of cached model names.
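The cache-management methods can be combined into a simple inspect-and-clean routine. This is a sketch under assumptions: the scratch directory is hypothetical and is used so that the default cache is left untouched, and the expectation that nothing is listed after clearing everything is inferred from the method descriptions, not stated on this page.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

Console.WriteLine($"Default cache directory: {AutoTokenizer.GetDefaultCacheDir()}");

// Hypothetical scratch cache directory, so the default cache is not modified.
string cacheDir = Path.Combine(Path.GetTempPath(), "aidotnet-tokenizers-demo");
Directory.CreateDirectory(cacheDir);

// List whatever is currently cached in the scratch directory.
string[] before = AutoTokenizer.ListCachedModels(cacheDir);
Console.WriteLine($"{before.Length} cached model(s) before clearing.");
foreach (string name in before)
{
    Console.WriteLine($"  {name}");
}

// Clear a single model, then clear everything by leaving modelName null.
AutoTokenizer.ClearCache("gpt2", cacheDir);
AutoTokenizer.ClearCache(cacheDir: cacheDir);

string[] after = AutoTokenizer.ListCachedModels(cacheDir);
Console.WriteLine($"{after.Length} cached model(s) after clearing.");
```

Passing an explicit cacheDir to every call keeps the example from deleting tokenizers that other applications may have stored in the default location.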