Class TextAugmenterBase<T>
- Namespace
- AiDotNet.Augmentation.Text
- Assembly
- AiDotNet.dll
Base class for text data augmentations.
public abstract class TextAugmenterBase<T> : AugmentationBase<T, string[]>, IAugmentation<T, string[]>
Type Parameters
- T
The numeric type for calculations.
- Inheritance
- AugmentationBase<T, string[]>
- TextAugmenterBase<T>
- Implements
- IAugmentation<T, string[]>
Remarks
For Beginners: Text augmentation creates variations of text to improve model robustness to different phrasings and writing styles. Common techniques include:
- Synonym replacement (replacing words with similar meanings)
- Random deletion (removing random words)
- Random swap (swapping word positions)
- Random insertion (adding synonyms of random words)
- Back-translation (translate to another language and back)
Text data is represented as an array of strings (sentences/documents).
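Two of the techniques listed above, random deletion and random swap, can be sketched on a plain token array. This is only an illustration under stated assumptions: RandomDeletion and RandomSwap are hypothetical helper names, not members of AiDotNet, and the library's own augmenters may behave differently.

```csharp
using System;
using System.Linq;

// Hypothetical helpers for illustration; not part of AiDotNet.

// Random deletion: drop each token with probability p,
// but never return an empty sentence.
static string[] RandomDeletion(string[] tokens, double p, Random rng)
{
    if (tokens.Length <= 1) return tokens;
    var kept = tokens.Where(_ => rng.NextDouble() >= p).ToArray();
    // If every token was dropped, keep one at random.
    return kept.Length > 0 ? kept : new[] { tokens[rng.Next(tokens.Length)] };
}

// Random swap: exchange two randomly chosen positions, repeated `swaps` times.
// Works on a copy so the input array is left untouched.
static string[] RandomSwap(string[] tokens, int swaps, Random rng)
{
    var result = (string[])tokens.Clone();
    if (result.Length < 2) return result;
    for (int i = 0; i < swaps; i++)
    {
        int a = rng.Next(result.Length);
        int b = rng.Next(result.Length);
        (result[a], result[b]) = (result[b], result[a]);
    }
    return result;
}

var rng = new Random(42); // fixed seed so runs are repeatable on one runtime
string[] sentence = "the quick brown fox jumps over the lazy dog".Split(' ');
Console.WriteLine(string.Join(" ", RandomDeletion(sentence, 0.2, rng)));
Console.WriteLine(string.Join(" ", RandomSwap(sentence, 2, rng)));
```

Both helpers operate on a single tokenized sentence; applying them to a string[] of documents would mean tokenizing each element, augmenting it, and joining it back.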
Constructors
TextAugmenterBase(double, string)
Initializes a new text augmentation.
protected TextAugmenterBase(double probability = 1, string languageCode = "en")
Parameters
- probability (double)
The probability of applying this augmentation (0.0 to 1.0).
- languageCode (string)
The language code for language-specific operations.
Properties
LanguageCode
Gets or sets the language code for language-specific operations.
public string LanguageCode { get; set; }
Property Value
- string
Remarks
Default: "en" (English)
Used for synonym lookup, tokenization, etc.
PreserveCase
Gets or sets whether to preserve case when modifying text.
public bool PreserveCase { get; set; }
Property Value
- bool
Remarks
Default: true
When true, replaced words will match the case of the original word.
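As a rough sketch of what case preservation can look like, the helper below copies the casing pattern of the original word onto its replacement. MatchCase is a hypothetical name introduced here for illustration; how AiDotNet actually applies PreserveCase is not documented on this page.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not AiDotNet's implementation.
static string MatchCase(string original, string replacement)
{
    if (string.IsNullOrEmpty(original) || string.IsNullOrEmpty(replacement))
        return replacement;

    // ALL-CAPS original (longer than one character): upper-case the replacement.
    bool hasLetter = original.Any(c => char.IsLetter(c));
    bool allUpper = original.All(c => !char.IsLetter(c) || char.IsUpper(c));
    if (original.Length > 1 && hasLetter && allUpper)
        return replacement.ToUpperInvariant();

    // Capitalized original: capitalize the first character of the replacement.
    if (char.IsUpper(original[0]))
        return char.ToUpperInvariant(replacement[0]) + replacement.Substring(1);

    // Otherwise leave the replacement unchanged.
    return replacement;
}

Console.WriteLine(MatchCase("Quick", "fast")); // Fast
Console.WriteLine(MatchCase("QUICK", "fast")); // FAST
Console.WriteLine(MatchCase("quick", "fast")); // fast
```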
Methods
Detokenize(string[])
Joins tokens back into text.
protected virtual string Detokenize(string[] tokens)
Parameters
- tokens (string[])
The tokens to join.
Returns
- string
The joined text.
GetParameters()
Gets the parameters of this augmentation.
public override IDictionary<string, object> GetParameters()
Returns
- IDictionary<string, object>
A dictionary of parameter names to values.
IsStopword(string)
Checks if a word is a stopword (common word to skip during augmentation).
protected virtual bool IsStopword(string word)
Parameters
- word (string)
The word to check.
Returns
- bool
True if the word is a stopword.
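A stopword check is typically used to pick which tokens are eligible for replacement or deletion. The sketch below assumes a tiny, made-up English stopword list and a case-insensitive lookup; the list the library actually uses, and whether it varies with LanguageCode, is not stated here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, deliberately small stopword list; illustration only.
var stopwords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "a", "an", "the", "and", "or", "of", "to", "in", "is", "it"
};

// Stand-in for an IsStopword override: a case-insensitive set lookup.
bool IsStopword(string word) => stopwords.Contains(word);

string[] tokens = { "The", "cat", "is", "in", "the", "garden" };

// Indices of tokens that an augmentation would be allowed to modify.
int[] candidates = Enumerable.Range(0, tokens.Length)
    .Where(i => !IsStopword(tokens[i]))
    .ToArray();

Console.WriteLine(string.Join(", ", candidates.Select(i => tokens[i]))); // cat, garden
```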
Tokenize(string)
Tokenizes text into words.
protected virtual string[] Tokenize(string text)
Parameters
- text (string)
The text to tokenize.
Returns
- string[]
An array of word tokens.
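Tokenize and Detokenize are meant to be used as a pair: split the text, modify the tokens, join them back. The sketch below assumes the simplest possible behavior (split on whitespace, join with single spaces); the default implementations in AiDotNet may handle punctuation or language-specific rules differently, and both methods are virtual so derived classes can override them.

```csharp
using System;

// Assumed behavior for illustration only: whitespace tokenization.
static string[] Tokenize(string text) =>
    text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

// Assumed behavior for illustration only: join tokens with single spaces.
static string Detokenize(string[] tokens) => string.Join(" ", tokens);

string input = "  Text   augmentation is\tuseful ";
string[] words = Tokenize(input);
Console.WriteLine(words.Length);      // 4
Console.WriteLine(Detokenize(words)); // Text augmentation is useful
```

With this scheme the round trip normalizes whitespace, so Detokenize(Tokenize(text)) is not guaranteed to reproduce the input exactly.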