Class HuggingFaceTokenizerLoader
- Namespace
- AiDotNet.Tokenization.HuggingFace
- Assembly
- AiDotNet.dll
Loads HuggingFace pretrained tokenizers.
public static class HuggingFaceTokenizerLoader
- Inheritance
- object
- HuggingFaceTokenizerLoader
Methods
LoadFromDirectory(string)
Loads a HuggingFace tokenizer from a directory.
public static ITokenizer LoadFromDirectory(string modelPath)
Parameters
- modelPath string
The path to the tokenizer directory.
Returns
- ITokenizer
The loaded tokenizer.
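A minimal sketch of calling LoadFromDirectory(string) from a console program with top-level statements. The directory location is hypothetical, and the existence check is a defensive assumption because this page does not document how a missing directory is handled.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical location of a locally downloaded model; adjust to your setup.
string modelPath = Path.Combine(AppContext.BaseDirectory, "models", "bert-base-uncased");

// Assumption: the loader's behavior for a missing directory is undocumented,
// so check up front and fail with a clear message.
if (!Directory.Exists(modelPath))
{
    Console.Error.WriteLine($"Tokenizer directory not found: {modelPath}");
    return;
}

// 'var' avoids assuming which namespace declares ITokenizer.
var tokenizer = HuggingFaceTokenizerLoader.LoadFromDirectory(modelPath);
Console.WriteLine($"Loaded tokenizer of type {tokenizer.GetType().Name}");
```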
LoadFromHub(string, string?)
Loads a tokenizer from HuggingFace Hub by model name.
public static ITokenizer LoadFromHub(string modelName, string? cacheDir = null)
Parameters
- modelName string
The model name (e.g., "bert-base-uncased", "gpt2").
- cacheDir string
Optional cache directory.
Returns
- ITokenizer
The loaded tokenizer.
Remarks
Warning: This method uses sync-over-async internally and may deadlock in environments that install a synchronization context, such as UI applications or ASP.NET. Prefer LoadFromHubAsync(string, string?) when possible.
Files are cached locally, so subsequent calls will not make network requests.
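The blocking overload is a reasonable fit where no synchronization context is installed, for example in a plain console program. The sketch below assumes such a program; the cache directory is a hypothetical path chosen for illustration.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical cache location; any writable directory should do.
string cacheDir = Path.Combine(Path.GetTempPath(), "hf-tokenizer-cache");

// A console program has no synchronization context by default,
// so the sync-over-async deadlock described above does not apply here.
var first = HuggingFaceTokenizerLoader.LoadFromHub("gpt2", cacheDir);

// Per the remarks, files are cached locally, so this second call
// is expected to be served from cacheDir without a network request.
var second = HuggingFaceTokenizerLoader.LoadFromHub("gpt2", cacheDir);

Console.WriteLine($"Loaded {first.GetType().Name} twice using cache at {cacheDir}");
```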
LoadFromHubAsync(string, string?)
Asynchronously loads a tokenizer from HuggingFace Hub.
public static Task<ITokenizer> LoadFromHubAsync(string modelName, string? cacheDir = null)
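Parameters
- modelName string
The model name (e.g., "bert-base-uncased", "gpt2").
- cacheDir string
Optional cache directory.
Returns
- Task<ITokenizer>
A task whose result is the loaded tokenizer.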
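A sketch of awaiting LoadFromHubAsync(string, string?) from an async entry point. The class name TokenizerStartup is hypothetical, and the broad catch is an assumption because this page does not list the exceptions the loader can throw.

```csharp
using System;
using System.Threading.Tasks;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical entry point used only for illustration.
public static class TokenizerStartup
{
    public static async Task Main()
    {
        try
        {
            // cacheDir is omitted, so the documented default (null) is used.
            var tokenizer = await HuggingFaceTokenizerLoader.LoadFromHubAsync("bert-base-uncased");
            Console.WriteLine($"Loaded tokenizer of type {tokenizer.GetType().Name}");
        }
        catch (Exception ex)
        {
            // Assumption: failure modes (network, missing model) are undocumented,
            // so this sketch catches broadly and reports the message.
            Console.Error.WriteLine($"Failed to load tokenizer: {ex.Message}");
        }
    }
}
```

Awaiting the task rather than blocking on it with .Result or .Wait() is what avoids the deadlock risk noted for LoadFromHub.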
LoadFromTokenizerJson(string)
Loads a tokenizer from a tokenizer.json file.
public static ITokenizer LoadFromTokenizerJson(string tokenizerJsonPath)
Parameters
- tokenizerJsonPath string
The path to the tokenizer.json file.
Returns
- ITokenizer
The loaded tokenizer.
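When only a single tokenizer.json file is available, it can be passed directly. The path below is hypothetical, and the file check is a defensive assumption rather than documented behavior.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical path to a tokenizer.json file obtained separately.
string tokenizerJsonPath = Path.Combine("models", "gpt2", "tokenizer.json");

if (File.Exists(tokenizerJsonPath))
{
    var tokenizer = HuggingFaceTokenizerLoader.LoadFromTokenizerJson(tokenizerJsonPath);
    Console.WriteLine($"Loaded tokenizer of type {tokenizer.GetType().Name}");
}
else
{
    Console.Error.WriteLine($"File not found: {tokenizerJsonPath}");
}
```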
SaveToDirectory(ITokenizer, string)
Saves a tokenizer to HuggingFace format.
public static void SaveToDirectory(ITokenizer tokenizer, string outputPath)
Parameters
tokenizerITokenizerThe tokenizer to save.
outputPathstringThe output directory path.
Remarks
Limitation: This method saves the vocabulary and configuration but does not save BPE merge rules. BPE tokenizers saved with this method will not fully round-trip: they must be retrained or loaded from a different source to recover the merge information.
For full BPE tokenizer serialization, consider using the original HuggingFace tokenizer files.
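One way to work within this limitation is to keep the original tokenizer files next to the exported ones, so a BPE tokenizer can later be reloaded from the originals. The sketch below assumes hypothetical directory names and an "original" subfolder convention that is not part of the library.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical directories chosen for illustration.
string sourceDir = Path.Combine("models", "gpt2");
string exportDir = Path.Combine("export", "gpt2");
string originalsDir = Path.Combine(exportDir, "original");

// Create the output directories up front; whether SaveToDirectory
// creates a missing directory itself is not documented here.
Directory.CreateDirectory(originalsDir);

var tokenizer = HuggingFaceTokenizerLoader.LoadFromDirectory(sourceDir);

// Saves vocabulary and configuration only (no BPE merge rules).
HuggingFaceTokenizerLoader.SaveToDirectory(tokenizer, exportDir);

// Copy the untouched source files so merge information is not lost.
foreach (string file in Directory.GetFiles(sourceDir))
{
    string target = Path.Combine(originalsDir, Path.GetFileName(file));
    File.Copy(file, target, overwrite: true);
}

Console.WriteLine($"Saved tokenizer to {exportDir}; originals kept in {originalsDir}");
```

To recover a fully functional BPE tokenizer later, load from the "original" subfolder with LoadFromDirectory(string) rather than from the exported files.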