Interface IDataPreprocessor<T, TInput, TOutput>

Namespace: AiDotNet.Interfaces

Assembly: AiDotNet.dll

public interface IDataPreprocessor<T, TInput, TOutput>

Type Parameters

T
TInput
TOutput

Methods

PreprocessData(TInput, TOutput)

Preprocesses the input data by applying normalization and other transformations.

(TInput X, TOutput y, NormalizationInfo<T, TInput, TOutput> normInfo) PreprocessData(TInput X, TOutput y)

Parameters

X TInput: The input features where each row represents a sample and each column represents a feature.
y TOutput: The target values corresponding to each sample in the input data.

Returns

(TInput X, TOutput y, NormalizationInfo<T, TInput, TOutput> normInfo)

A tuple containing:

The preprocessed feature data
The preprocessed target data
Normalization information that can be used to transform new data consistently

Remarks

For Beginners: This method cleans and transforms your raw data to make it suitable for machine learning.

Parameters explained:

X: Your input data organized as a matrix (think of it as a table or spreadsheet)
- Each row is one example or data point
- Each column is one feature or characteristic
y: The target values you want to predict (like prices, categories, etc.)

The method returns three things:

Your transformed input data (X)
Your transformed target values (y)
Information about how the transformation was done (normInfo)

The third item (normInfo) is important because when you get new data later, you need to transform it in exactly the same way as your training data.

For example, if you're predicting house prices:

During training, you might divide all prices by $1,000,000 to normalize them.
For new predictions, you then need to apply the same division.
normInfo stores these details so you can apply the same transformation consistently.
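
The house-price example above can be sketched in a few lines of C#. This is only an illustration of the idea, not the library's implementation: the helpers FitAndScale and ApplyScale are hypothetical, and a plain double stands in for the NormalizationInfo<T, TInput, TOutput> value, whose members are not described on this page.

```csharp
using System;
using System.Linq;

// Hypothetical helper: scales the training prices and returns the divisor
// that was used, so the caller can keep it for later.
static (double[] Scaled, double Scale) FitAndScale(double[] prices)
{
    // Assumption: a single fixed divisor, as in the house-price example.
    const double scale = 1_000_000.0;
    return (prices.Select(p => p / scale).ToArray(), scale);
}

// Hypothetical helper: applies a previously stored divisor to new data.
static double[] ApplyScale(double[] prices, double scale) =>
    prices.Select(p => p / scale).ToArray();

double[] trainingPrices = { 250_000, 500_000, 1_250_000 };
var (scaledTraining, storedScale) = FitAndScale(trainingPrices);

// Later, new data is transformed with the stored divisor, not a new one.
double[] newPrices = { 750_000 };
double[] scaledNew = ApplyScale(newPrices, storedScale);

Console.WriteLine(string.Join(" ", scaledTraining));
Console.WriteLine(string.Join(" ", scaledNew));
```

The point of the sketch is that the divisor is produced once, alongside the training data, and then passed explicitly to every later transformation. That is the role the third tuple element (normInfo) plays in the PreprocessData return value.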

SplitData(TInput, TOutput)

Splits the dataset into training, validation, and test sets.

(TInput XTrain, TOutput yTrain, TInput XValidation, TOutput yValidation, TInput XTest, TOutput yTest) SplitData(TInput X, TOutput y)

Parameters

X TInput: The input features.
y TOutput: The target values.

Returns

(TInput XTrain, TOutput yTrain, TInput XValidation, TOutput yValidation, TInput XTest, TOutput yTest)

A tuple containing six elements:

XTrain: Feature data for training
yTrain: Target data for training
XValidation: Feature data for validation
yValidation: Target data for validation
XTest: Feature data for testing
yTest: Target data for testing

Remarks

For Beginners: This method divides your data into three separate sets, each with a specific purpose.

Imagine you're learning to cook a new recipe:

Training set: This is where you practice and learn the recipe (typically 70-80% of your data)
Validation set: This is where you taste and adjust your cooking (typically 10-15% of your data)
Test set: This is the final taste test with fresh ingredients (typically 10-15% of your data)

Why split the data?

Training: The model learns patterns from this data
Validation: You use this to tune your model settings and to check for overfitting while you do so
Testing: You use this to get an honest estimate of how well your model will perform on new data

"Overfitting" is like memorizing test answers instead of understanding the subject. The model performs well on data it has seen but fails on new data.

Each set contains both features (X) and targets (y), keeping the relationship between input data and expected outputs intact for each portion of the data.
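
A three-way split that keeps each feature row paired with its target might look like the sketch below. It is not the library's implementation: SplitRows is a hypothetical helper, the data is held in plain arrays rather than the generic TInput and TOutput types, and the 70/15/15 percentages and the fixed seed are assumptions chosen to match the typical proportions mentioned above.

```csharp
using System;
using System.Linq;

// Hypothetical helper: shuffles row indices once, then slices X and y with
// the same indices so each feature row stays paired with its target.
static (double[][] XTrain, double[] yTrain,
        double[][] XValidation, double[] yValidation,
        double[][] XTest, double[] yTest)
    SplitRows(double[][] X, double[] y, int trainPercent, int validationPercent, int seed)
{
    if (X.Length != y.Length)
        throw new ArgumentException("X and y must have the same number of rows.");

    // Fisher-Yates shuffle of the row indices, seeded for repeatability.
    var rng = new Random(seed);
    int[] order = Enumerable.Range(0, X.Length).ToArray();
    for (int i = order.Length - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        (order[i], order[j]) = (order[j], order[i]);
    }

    // Integer arithmetic avoids floating-point rounding surprises.
    int trainCount = X.Length * trainPercent / 100;
    int validationCount = X.Length * validationPercent / 100;

    int[] trainIdx = order.Take(trainCount).ToArray();
    int[] valIdx = order.Skip(trainCount).Take(validationCount).ToArray();
    int[] testIdx = order.Skip(trainCount + validationCount).ToArray(); // remainder

    return (trainIdx.Select(i => X[i]).ToArray(), trainIdx.Select(i => y[i]).ToArray(),
            valIdx.Select(i => X[i]).ToArray(), valIdx.Select(i => y[i]).ToArray(),
            testIdx.Select(i => X[i]).ToArray(), testIdx.Select(i => y[i]).ToArray());
}

// Toy data: 20 rows, where each target is twice the first feature.
double[][] features = Enumerable.Range(0, 20).Select(i => new double[] { i, i * 10.0 }).ToArray();
double[] targets = Enumerable.Range(0, 20).Select(i => i * 2.0).ToArray();

var (XTrain, yTrain, XValidation, yValidation, XTest, yTest) =
    SplitRows(features, targets, 70, 15, seed: 42);

Console.WriteLine($"{XTrain.Length} / {XValidation.Length} / {XTest.Length}"); // prints "14 / 3 / 3"
```

Shuffling a single index array and reusing it for both X and y is what keeps the input-output relationship intact in every portion. The deconstruction on the calling side mirrors the six-element tuple returned by SplitData.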
