Class HuggingFaceTokenizerLoader

Namespace
AiDotNet.Tokenization.HuggingFace
Assembly
AiDotNet.dll

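Loads pretrained HuggingFace tokenizers from a local directory, a tokenizer.json file, or the HuggingFace Hub, and saves tokenizers in HuggingFace format.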

public static class HuggingFaceTokenizerLoader
Inheritance
HuggingFaceTokenizerLoader

Methods

LoadFromDirectory(string)

Loads a HuggingFace tokenizer from a directory.

public static ITokenizer LoadFromDirectory(string modelPath)

Parameters

modelPath string

The path to the tokenizer directory.

Returns

ITokenizer

The loaded tokenizer.
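
A minimal usage sketch, assuming a tokenizer has already been downloaded to a local folder. The path below is hypothetical, and `var` is used so the example does not have to assume which namespace declares ITokenizer.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical location of the tokenizer files; substitute your own path.
string modelPath = Path.Combine("models", "bert-base-uncased");

if (Directory.Exists(modelPath))
{
    // Uses only the signature documented above: LoadFromDirectory(string).
    var tokenizer = HuggingFaceTokenizerLoader.LoadFromDirectory(modelPath);
    Console.WriteLine($"Loaded tokenizer of type {tokenizer.GetType().Name}");
}
else
{
    Console.Error.WriteLine($"Tokenizer directory not found: {modelPath}");
}
```

Checking for the directory first is a design choice of this sketch. This page does not document how the loader itself reports a missing path.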

LoadFromHub(string, string?)

Loads a tokenizer from HuggingFace Hub by model name.

public static ITokenizer LoadFromHub(string modelName, string? cacheDir = null)

Parameters

modelName string

The model name (e.g., "bert-base-uncased", "gpt2").

cacheDir string

Optional cache directory for the downloaded tokenizer files. Defaults to null.

Returns

ITokenizer

The loaded tokenizer.

Remarks

Warning: This method blocks on asynchronous work internally (sync-over-async) and may deadlock when called from a thread that has a synchronization context, such as a UI thread or an ASP.NET request context that uses one. Prefer LoadFromHubAsync(string, string?) where possible.

Downloaded files are cached locally, so subsequent calls for the same model do not make network requests.
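
The caching behaviour described in these remarks could be exercised as in the sketch below. It assumes a console application, which has no synchronization context, so the deadlock warning does not apply. The cache location is a hypothetical choice.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical cache location; any writable directory should do.
string cacheDir = Path.Combine(Path.GetTempPath(), "hf-tokenizer-cache");
Directory.CreateDirectory(cacheDir); // no-op if the directory already exists

// The first call may download tokenizer files from the Hub into cacheDir.
var first = HuggingFaceTokenizerLoader.LoadFromHub("gpt2", cacheDir);

// Per the remarks above, a repeat call is expected to be served from the cache.
var second = HuggingFaceTokenizerLoader.LoadFromHub("gpt2", cacheDir);

Console.WriteLine($"Loaded {first.GetType().Name} twice using cache at {cacheDir}");
```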

LoadFromHubAsync(string, string?)

Asynchronously loads a tokenizer from HuggingFace Hub.

public static Task<ITokenizer> LoadFromHubAsync(string modelName, string? cacheDir = null)

Parameters

modelName string

The model name (e.g., "bert-base-uncased", "gpt2").

cacheDir string

Optional cache directory.

Returns

Task<ITokenizer>

A task that completes with the loaded tokenizer.
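
As a sketch of the asynchronous path recommended in the LoadFromHub remarks, the call can be awaited from an async method. `TokenizerStartup` and `LoadAsync` are hypothetical names used for illustration only and are not part of AiDotNet.

```csharp
using System;
using System.Threading.Tasks;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical helper class for illustration; not part of the library.
public static class TokenizerStartup
{
    public static async Task LoadAsync()
    {
        // Awaiting avoids blocking the calling thread while files are fetched.
        var tokenizer = await HuggingFaceTokenizerLoader.LoadFromHubAsync("bert-base-uncased");
        Console.WriteLine($"Loaded tokenizer of type {tokenizer.GetType().Name}");
    }
}
```

From a UI event handler or a controller action, `await TokenizerStartup.LoadAsync();` keeps the whole call chain asynchronous, which is what avoids the deadlock scenario.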

LoadFromTokenizerJson(string)

Loads a tokenizer from a tokenizer.json file.

public static ITokenizer LoadFromTokenizerJson(string tokenizerJsonPath)

Parameters

tokenizerJsonPath string

The path to the tokenizer.json file.

Returns

ITokenizer

The loaded tokenizer.

SaveToDirectory(ITokenizer, string)

Saves a tokenizer to HuggingFace format.

public static void SaveToDirectory(ITokenizer tokenizer, string outputPath)

Parameters

tokenizer ITokenizer

The tokenizer to save.

outputPath string

The output directory path.

Remarks

Limitation: This method saves the vocabulary and configuration but not the BPE merge rules. BPE tokenizers saved with this method do not fully round-trip: to recover the merge information, they must be retrained or loaded from a different source.

For full BPE tokenizer serialization, consider using the original HuggingFace tokenizer files.
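
A minimal sketch of saving a tokenizer with this limitation in mind. Both directory paths are hypothetical, and the sketch assumes the source directory already contains the original tokenizer files.

```csharp
using System;
using System.IO;
using AiDotNet.Tokenization.HuggingFace;

// Hypothetical paths; substitute your own.
string sourceDir = Path.Combine("models", "gpt2");
string outputDir = Path.Combine("models", "gpt2-export");

var tokenizer = HuggingFaceTokenizerLoader.LoadFromDirectory(sourceDir);

// Creating the output directory up front is a precaution of this sketch.
Directory.CreateDirectory(outputDir);
HuggingFaceTokenizerLoader.SaveToDirectory(tokenizer, outputDir);

// For BPE tokenizers, keep sourceDir around: the export in outputDir lacks the
// merge rules, so reload from the original files when full fidelity matters.
Console.WriteLine($"Saved vocabulary and configuration to {outputDir}");
```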