Class TextAugmenterBase<T>

Namespace
AiDotNet.Augmentation.Text
Assembly
AiDotNet.dll

Base class for text data augmentations.

public abstract class TextAugmenterBase<T> : AugmentationBase<T, string[]>, IAugmentation<T, string[]>

Type Parameters

T

The numeric type for calculations.

Inheritance
AugmentationBase<T, string[]>
TextAugmenterBase<T>
Implements
IAugmentation<T, string[]>

Remarks

For Beginners: Text augmentation creates variations of text to improve model robustness to different phrasings and writing styles. Common techniques include:

  • Synonym replacement (replacing words with similar meanings)
  • Random deletion (removing random words)
  • Random swap (swapping word positions)
  • Random insertion (adding synonyms of random words)
  • Back-translation (translating to another language and back)

Text data is represented as an array of strings, where each element is one sentence or document.
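
To make two of the techniques listed above concrete, here is a minimal sketch of random swap and random deletion on an already tokenized sentence. The class `TextAugmentationSketch` and its methods are hypothetical stand-alone helpers written for illustration. They are not part of AiDotNet and do not show how the library's derived classes implement these techniques.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-alone helpers for illustration; not part of AiDotNet.
public static class TextAugmentationSketch
{
    // Random swap: exchange the tokens at two randomly chosen positions.
    // The two positions may coincide, in which case the copy is unchanged.
    public static string[] RandomSwap(string[] tokens, Random rng)
    {
        var result = (string[])tokens.Clone();
        if (result.Length < 2) return result;

        int i = rng.Next(result.Length);
        int j = rng.Next(result.Length);
        (result[i], result[j]) = (result[j], result[i]);
        return result;
    }

    // Random deletion: drop each token with probability p,
    // but keep at least one token so the sentence is never empty.
    public static string[] RandomDeletion(string[] tokens, double p, Random rng)
    {
        if (tokens.Length == 0) return Array.Empty<string>();

        var kept = new List<string>();
        foreach (var token in tokens)
        {
            if (rng.NextDouble() >= p) kept.Add(token);
        }

        if (kept.Count == 0) kept.Add(tokens[rng.Next(tokens.Length)]);
        return kept.ToArray();
    }
}
```

Both helpers return a new array and leave the input untouched, which keeps the original sample available alongside its augmented variant.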

Constructors

TextAugmenterBase(double, string)

Initializes a new text augmentation.

protected TextAugmenterBase(double probability = 1, string languageCode = "en")

Parameters

probability double

The probability of applying this augmentation, from 0.0 (never) to 1.0 (always). Defaults to 1.

languageCode string

The language code for language-specific operations. Defaults to "en".
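
This page does not show how the base class uses `probability` internally. As a rough sketch, assuming the gate is evaluated once per call, a probability of 0.0 would leave the input unchanged and 1.0 would always transform it. `ProbabilityGateSketch` and `MaybeApply` are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical illustration of a probability gate; not the AiDotNet implementation.
public static class ProbabilityGateSketch
{
    // Applies `transform` to every text with the given probability,
    // otherwise returns the input array unchanged.
    public static string[] MaybeApply(
        string[] texts, double probability, Random rng, Func<string, string> transform)
    {
        if (probability < 0.0 || probability > 1.0)
            throw new ArgumentOutOfRangeException(nameof(probability));

        // NextDouble() returns a value in [0, 1), so 0.0 never fires and 1.0 always fires.
        if (rng.NextDouble() >= probability) return texts;

        var result = new string[texts.Length];
        for (int i = 0; i < texts.Length; i++)
        {
            result[i] = transform(texts[i]);
        }
        return result;
    }
}
```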

Properties

LanguageCode

Gets or sets the language code for language-specific operations.

public string LanguageCode { get; set; }

Property Value

string

Remarks

Default: "en" (English)

Used for language-specific operations such as synonym lookup and tokenization.

PreserveCase

Gets or sets whether to preserve case when modifying text.

public bool PreserveCase { get; set; }

Property Value

bool

Remarks

Default: true

When true, replaced words will match the case of the original word.
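
One way case matching could work is sketched below, assuming three cases: all-uppercase words, capitalized words, and everything else. `CaseSketch.MatchCase` is a hypothetical helper, and the rules the library actually applies may differ.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of AiDotNet.
public static class CaseSketch
{
    // Returns `replacement` re-cased to follow the casing pattern of `original`.
    public static string MatchCase(string original, string replacement)
    {
        if (string.IsNullOrEmpty(original) || string.IsNullOrEmpty(replacement))
            return replacement;

        var letters = original.Where(c => char.IsLetter(c)).ToArray();

        // "QUICK" -> "FAST": more than one letter and all of them uppercase.
        if (letters.Length > 1 && letters.All(c => char.IsUpper(c)))
            return replacement.ToUpperInvariant();

        // "Quick" -> "Fast": only the first character is capitalized.
        if (char.IsUpper(original[0]))
            return char.ToUpperInvariant(replacement[0])
                   + replacement.Substring(1).ToLowerInvariant();

        // "quick" -> "fast": default to lowercase.
        return replacement.ToLowerInvariant();
    }
}
```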

Methods

Detokenize(string[])

Joins tokens back into text.

protected virtual string Detokenize(string[] tokens)

Parameters

tokens string[]

The tokens to join.

Returns

string

The joined text.
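
The default joining behavior is not described on this page. As an illustration of what an override might do, the sketch below joins tokens with single spaces and attaches common punctuation marks to the preceding token. `DetokenizeSketch` is a hypothetical stand-alone class, and the punctuation set is an assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not the AiDotNet implementation.
public static class DetokenizeSketch
{
    // Joins tokens with spaces, without a space before single-character punctuation.
    public static string Detokenize(string[] tokens)
    {
        var sb = new StringBuilder();
        foreach (var token in tokens)
        {
            if (string.IsNullOrEmpty(token)) continue;

            // Assumed punctuation set that attaches to the previous token.
            bool attach = token.Length == 1 && ".,!?;:".IndexOf(token[0]) >= 0;

            if (sb.Length > 0 && !attach) sb.Append(' ');
            sb.Append(token);
        }
        return sb.ToString();
    }
}
```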

GetParameters()

Gets the parameters of this augmentation.

public override IDictionary<string, object> GetParameters()

Returns

IDictionary<string, object>

A dictionary of parameter names to values.
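
The keys in the returned dictionary are not listed on this page. The sketch below only illustrates the shape of such a dictionary; the key names `probability`, `language_code`, and `preserve_case` are assumptions, and `ParametersSketch` is a hypothetical class.

```csharp
using System.Collections.Generic;

// Hypothetical illustration; the key names are assumed, not taken from AiDotNet.
public static class ParametersSketch
{
    // Builds a name-to-value map describing an augmentation's configuration.
    public static IDictionary<string, object> Describe(
        double probability, string languageCode, bool preserveCase)
    {
        return new Dictionary<string, object>
        {
            ["probability"] = probability,
            ["language_code"] = languageCode,
            ["preserve_case"] = preserveCase
        };
    }
}
```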

IsStopword(string)

Checks if a word is a stopword (common word to skip during augmentation).

protected virtual bool IsStopword(string word)

Parameters

word string

The word to check.

Returns

bool

true if the word is a stopword; otherwise, false.
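
A stopword check is often a case-insensitive set lookup. The sketch below assumes a small hard-coded English list; the list the library actually uses, and how it depends on LanguageCode, is not documented here. `StopwordSketch` is a hypothetical class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; the word list is an assumed sample.
public static class StopwordSketch
{
    private static readonly HashSet<string> EnglishStopwords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "the", "and", "or", "of", "to", "in", "is", "it"
        };

    // Returns true when the word appears in the stopword set, ignoring case.
    public static bool IsStopword(string word)
    {
        return !string.IsNullOrEmpty(word) && EnglishStopwords.Contains(word);
    }
}
```

Skipping stopwords keeps techniques such as synonym replacement focused on the content words that carry most of a sentence's meaning.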

Tokenize(string)

Tokenizes text into words.

protected virtual string[] Tokenize(string text)

Parameters

text string

The text to tokenize.

Returns

string[]

An array of word tokens.
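
The simplest tokenizer splits on whitespace. The sketch below assumes that behavior purely for illustration; the library's default may treat punctuation or language-specific rules differently. `TokenizeSketch` is a hypothetical class.

```csharp
using System;

// Hypothetical helper for illustration; not the AiDotNet implementation.
public static class TokenizeSketch
{
    // Splits text on any run of whitespace and discards empty entries.
    public static string[] Tokenize(string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return Array.Empty<string>();

        // Passing a null separator array makes Split use whitespace as the delimiter.
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

A whitespace tokenizer leaves punctuation attached to words ("fox." stays one token), so an override that needs word-level synonym lookup would likely separate punctuation first.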