Interface IStreamingDataLoader<T, TInput, TOutput>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for streaming data loaders that process data on demand without loading the entire dataset into memory.

public interface IStreamingDataLoader<T, TInput, TOutput> : IDataLoader<T>, IResettable, ICountable

Type Parameters

T

The numeric type used for calculations, typically float or double.

TInput

The input data type for each sample.

TOutput

The output/label data type for each sample.


Remarks

IStreamingDataLoader is designed for datasets that are too large to fit in memory. Unlike IInputOutputDataLoader, which exposes Features and Labels properties holding the entire dataset, streaming loaders read data on demand and yield batches through iteration.

For Beginners: When your dataset is too large to fit in RAM (like millions of images or text documents), you can't load it all at once. Streaming data loaders solve this by reading data piece by piece as needed during training.

Example usage:

var loader = DataLoaders.FromCsv<float>("huge_dataset.csv", parseRow);
loader.BatchSize = 32; // Set batch size before iteration

await foreach (var (inputs, labels) in loader.GetBatchesAsync())
{
    await model.TrainOnBatchAsync(inputs, labels);
}

Properties

BatchSize

Gets or sets the batch size for iteration.

int BatchSize { get; set; }

Property Value

int

NumWorkers

Gets the number of parallel workers for sample loading.

int NumWorkers { get; }

Property Value

int

PrefetchCount

Gets the number of batches to prefetch for improved throughput.

int PrefetchCount { get; }

Property Value

int

SampleCount

Gets the total number of samples in the dataset.

int SampleCount { get; }

Property Value

int

Remarks

The count may be known upfront (e.g., from file metadata) or estimated. For truly streaming sources where the count is unknown, this property may return -1, so callers should not assume a non-negative value.
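For example, a progress display can guard against the -1 sentinel and fall back to a running count when the total is unknown. This is a minimal sketch; `loader` and `processed` stand in for a concrete loader instance and a counter maintained by the caller:

```csharp
// SampleCount may be -1 for unbounded streams; check before using it.
int total = loader.SampleCount;
string progress = total >= 0
    ? $"{processed}/{total} samples"
    : $"{processed} samples (total unknown)";
```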

Methods

GetBatches(bool, bool, int?)

Iterates through the dataset in batches synchronously.

IEnumerable<(TInput[] Inputs, TOutput[] Outputs)> GetBatches(bool shuffle = true, bool dropLast = false, int? seed = null)

Parameters

shuffle bool

Whether to shuffle the data before batching.

dropLast bool

Whether to drop the last incomplete batch.

seed int?

Optional random seed for reproducibility.

Returns

IEnumerable<(TInput[] Inputs, TOutput[] Outputs)>

An enumerable of batches, each containing arrays of inputs and outputs.

Remarks

For Beginners: Use this when you want simple, synchronous iteration. Each iteration gives you a batch of inputs and their corresponding outputs.
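A minimal synchronous loop might look like the following. The `Evaluate` method is a hypothetical placeholder for whatever per-batch work the caller performs:

```csharp
// Synchronous iteration: suitable for scripts and evaluation loops
// where overlapping I/O with compute is not needed.
loader.BatchSize = 64;
foreach (var (inputs, outputs) in loader.GetBatches(shuffle: false, dropLast: true, seed: 42))
{
    // inputs and outputs are parallel arrays: outputs[i] is the label for inputs[i].
    Evaluate(inputs, outputs);
}
```

Passing a fixed `seed` makes shuffled runs reproducible; `dropLast: true` guarantees every batch has exactly `BatchSize` elements.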

GetBatchesAsync(bool, bool, int?, CancellationToken)

Iterates through the dataset in batches asynchronously with prefetching.

IAsyncEnumerable<(TInput[] Inputs, TOutput[] Outputs)> GetBatchesAsync(bool shuffle = true, bool dropLast = false, int? seed = null, CancellationToken cancellationToken = default)

Parameters

shuffle bool

Whether to shuffle the data before batching.

dropLast bool

Whether to drop the last incomplete batch.

seed int?

Optional random seed for reproducibility.

cancellationToken CancellationToken

Token to cancel the iteration.

Returns

IAsyncEnumerable<(TInput[] Inputs, TOutput[] Outputs)>

An async enumerable of batches, each containing arrays of inputs and outputs.

Remarks

For Beginners: This is the recommended method for training. It uses async/await to overlap data loading with model training, keeping your GPU busy while the next batch is being prepared.

Example:

await foreach (var batch in loader.GetBatchesAsync(shuffle: true))
{
    await model.TrainOnBatchAsync(batch.Inputs, batch.Outputs);
}