Interface IStreamingDataLoader<T, TInput, TOutput>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for streaming data loaders that process data on-demand without loading all data into memory.
public interface IStreamingDataLoader<T, TInput, TOutput> : IDataLoader<T>, IResettable, ICountable
Type Parameters
T: The numeric type used for calculations, typically float or double.
TInput: The input data type for each sample.
TOutput: The output/label data type for each sample.
Remarks
IStreamingDataLoader is designed for datasets that are too large to fit in memory. Unlike IInputOutputDataLoader which provides Features and Labels properties for all data, streaming loaders read data on-demand and yield batches through iteration.
For Beginners: When your dataset is too large to fit in RAM (like millions of images or text documents), you can't load it all at once. Streaming data loaders solve this by reading data piece by piece as needed during training.
Example usage:
var loader = DataLoaders.FromCsv<float>("huge_dataset.csv", parseRow);
loader.BatchSize = 32; // Set batch size before iteration
await foreach (var (inputs, labels) in loader.GetBatchesAsync())
{
    await model.TrainOnBatchAsync(inputs, labels);
}
Properties
BatchSize
Gets or sets the batch size for iteration.
int BatchSize { get; set; }
Property Value
int
NumWorkers
Gets the number of parallel workers for sample loading.
int NumWorkers { get; }
Property Value
int
PrefetchCount
Gets the number of batches to prefetch for improved throughput.
int PrefetchCount { get; }
Property Value
int
SampleCount
Gets the total number of samples in the dataset.
int SampleCount { get; }
Property Value
int
Remarks
This may be known upfront (e.g., from file metadata) or estimated. For truly streaming sources where the count is unknown, this may return -1.
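As a sketch, consuming code can guard for the unknown-count case before computing batches per epoch (the loader and its construction are assumed, as in the usage example above):

```csharp
int count = loader.SampleCount;
if (count < 0)
{
    // Truly streaming source: the count is unknown, so iterate
    // until the enumerable ends rather than sizing the epoch upfront.
}
else
{
    // Ceiling division gives the number of batches per epoch.
    int batchesPerEpoch = (count + loader.BatchSize - 1) / loader.BatchSize;
}
```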
Methods
GetBatches(bool, bool, int?)
Iterates through the dataset in batches synchronously.
IEnumerable<(TInput[] Inputs, TOutput[] Outputs)> GetBatches(bool shuffle = true, bool dropLast = false, int? seed = null)
Parameters
shuffle (bool): Whether to shuffle the data before batching.
dropLast (bool): Whether to drop the last incomplete batch.
seed (int?): Optional random seed for reproducibility.
Returns
- IEnumerable<(TInput[] Inputs, TOutput[] Outputs)>
An enumerable of batches, each containing arrays of inputs and outputs.
Remarks
For Beginners: Use this when you want simple, synchronous iteration. Each iteration gives you a batch of inputs and their corresponding outputs.
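A minimal synchronous sketch, fixing the seed so the shuffle order is reproducible across runs (the loader is assumed to exist; model.TrainOnBatch is a hypothetical synchronous counterpart of TrainOnBatchAsync):

```csharp
loader.BatchSize = 64;

// dropLast: true discards a trailing partial batch, so every
// batch passed to the model has exactly BatchSize samples.
foreach (var (inputs, labels) in loader.GetBatches(shuffle: true, dropLast: true, seed: 42))
{
    model.TrainOnBatch(inputs, labels);
}
```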
GetBatchesAsync(bool, bool, int?, CancellationToken)
Iterates through the dataset in batches asynchronously with prefetching.
IAsyncEnumerable<(TInput[] Inputs, TOutput[] Outputs)> GetBatchesAsync(bool shuffle = true, bool dropLast = false, int? seed = null, CancellationToken cancellationToken = default)
Parameters
shuffle (bool): Whether to shuffle the data before batching.
dropLast (bool): Whether to drop the last incomplete batch.
seed (int?): Optional random seed for reproducibility.
cancellationToken (CancellationToken): Token to cancel the iteration.
Returns
- IAsyncEnumerable<(TInput[] Inputs, TOutput[] Outputs)>
An async enumerable of batches, each containing arrays of inputs and outputs.
Remarks
For Beginners: This is the recommended method for training. It uses async/await to overlap data loading with model training, keeping your GPU busy while the next batch is being prepared.
Example:
await foreach (var batch in loader.GetBatchesAsync(shuffle: true))
{
    await model.TrainOnBatchAsync(batch.Inputs, batch.Outputs);
}
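The cancellationToken parameter lets you bound a training run. A sketch under the same assumptions as the example above, using a time-based CancellationTokenSource:

```csharp
// Cancel iteration automatically after a 30-minute training budget.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30));
try
{
    await foreach (var batch in loader.GetBatchesAsync(cancellationToken: cts.Token))
    {
        await model.TrainOnBatchAsync(batch.Inputs, batch.Outputs);
    }
}
catch (OperationCanceledException)
{
    // Budget elapsed mid-epoch: checkpoint the model and exit cleanly.
}
```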