Class StreamingDataLoader<T, TInput, TOutput>
A data loader that streams data from disk or other sources without loading all data into memory.
public class StreamingDataLoader<T, TInput, TOutput> : StreamingDataLoaderBase<T, TInput, TOutput>, IStreamingDataLoader<T, TInput, TOutput>, IDataLoader<T>, IResettable, ICountable
Type Parameters
T: The numeric type used for calculations, typically float or double.
TInput: The type of input data.
TOutput: The type of output/label data.
Inheritance
StreamingDataLoaderBase<T, TInput, TOutput>
StreamingDataLoader<T, TInput, TOutput>
Implements
IStreamingDataLoader<T, TInput, TOutput>
IDataLoader<T>
Remarks
StreamingDataLoader is designed for datasets that don't fit in memory. Instead of loading all data upfront, it reads samples on demand from a source, processes them, and yields batches.
For Beginners: When your dataset is too large to fit in RAM (e.g., millions of images or text documents), you can't load it all at once. StreamingDataLoader solves this by reading data piece by piece as needed.
Example:
// Define how to read individual samples
var loader = new StreamingDataLoader<float, Tensor<float>, int>(
    sampleCount: 1000000, // 1 million samples
    sampleReader: async (index, ct) =>
    {
        var image = await LoadImageFromDisk(index, ct);
        var label = await LoadLabelFromDisk(index, ct);
        return (image, label);
    },
    batchSize: 32
);
await foreach (var (inputs, labels) in loader.GetBatchesAsync())
{
    await model.TrainOnBatchAsync(inputs, labels);
}
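To make the "read on demand, yield batches" idea concrete, the following is a minimal, library-independent sketch of the pattern. It is not the implementation of StreamingDataLoader; `StreamingSketch` and `StreamBatchesAsync` are hypothetical names introduced only for illustration, and the sketch assumes the final partial batch is kept.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class StreamingSketch
{
    // Hypothetical helper (not part of the library): reads samples one at a
    // time through `reader` and yields them in groups of at most `batchSize`.
    // Only the batch currently being built is held in memory.
    public static async IAsyncEnumerable<List<(TInput Input, TOutput Output)>> StreamBatchesAsync<TInput, TOutput>(
        int sampleCount,
        Func<int, CancellationToken, Task<(TInput, TOutput)>> reader,
        int batchSize,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var batch = new List<(TInput Input, TOutput Output)>(batchSize);
        for (int index = 0; index < sampleCount; index++)
        {
            ct.ThrowIfCancellationRequested();
            batch.Add(await reader(index, ct));

            if (batch.Count == batchSize)
            {
                yield return batch;
                // Start a fresh list so the consumer owns the one just yielded.
                batch = new List<(TInput Input, TOutput Output)>(batchSize);
            }
        }

        // Assumption: a trailing partial batch is yielded rather than dropped.
        if (batch.Count > 0) yield return batch;
    }
}
```

With 10 samples and a batch size of 4, this sketch yields batches of 4, 4, and 2 samples.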
Constructors
StreamingDataLoader(int, Func<int, CancellationToken, Task<(TInput, TOutput)>>, int, string?, int, int)
Initializes a new instance of the StreamingDataLoader class.
public StreamingDataLoader(int sampleCount, Func<int, CancellationToken, Task<(TInput, TOutput)>> sampleReader, int batchSize, string? name = null, int prefetchCount = 2, int numWorkers = 4)
Parameters
sampleCount (int): Total number of samples in the dataset.
sampleReader (Func<int, CancellationToken, Task<(TInput, TOutput)>>): Async function that reads a single sample by index.
batchSize (int): Number of samples per batch.
name (string): Optional name for the data loader.
prefetchCount (int): Number of batches to prefetch. Default is 2.
numWorkers (int): Number of parallel workers for sample loading. Default is 4.
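How prefetchCount and numWorkers interact can be illustrated with a small standalone sketch. This page does not describe how the library implements either setting, so the code below is an assumption-laden illustration only: `PrefetchSketch` and `PrefetchBatchesAsync` are hypothetical names, a bounded `System.Threading.Channels` channel stands in for the prefetch buffer, and a `SemaphoreSlim` caps the number of concurrent sample reads.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class PrefetchSketch
{
    // Hypothetical helper (not the library's code): a background producer
    // builds batches ahead of the consumer. The bounded channel holds at most
    // `prefetchCount` finished batches; the producer waits when it is full.
    public static IAsyncEnumerable<TSample[]> PrefetchBatchesAsync<TSample>(
        int sampleCount,
        Func<int, CancellationToken, Task<TSample>> reader,
        int batchSize,
        int prefetchCount = 2,
        int numWorkers = 4,
        CancellationToken ct = default)
    {
        var channel = Channel.CreateBounded<TSample[]>(prefetchCount);

        _ = Task.Run(async () =>
        {
            try
            {
                // At most `numWorkers` reads are in flight at any time.
                using var gate = new SemaphoreSlim(numWorkers);
                for (int start = 0; start < sampleCount; start += batchSize)
                {
                    int count = Math.Min(batchSize, sampleCount - start);
                    var reads = Enumerable.Range(start, count).Select(async index =>
                    {
                        await gate.WaitAsync(ct);
                        try { return await reader(index, ct); }
                        finally { gate.Release(); }
                    });

                    // Task.WhenAll returns results in the order of the input tasks.
                    TSample[] batch = await Task.WhenAll(reads);
                    await channel.Writer.WriteAsync(batch, ct);
                }
                channel.Writer.TryComplete();
            }
            catch (Exception ex)
            {
                // Surface read failures and cancellation to the consumer.
                channel.Writer.TryComplete(ex);
            }
        });

        return channel.Reader.ReadAllAsync(ct);
    }
}
```

In this sketch, a larger prefetchCount lets the producer run further ahead of a slow consumer at the cost of more buffered batches in memory, while numWorkers mainly helps when individual reads are I/O-bound.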
Properties
Name
Gets the human-readable name of this data loader.
public override string Name { get; }
Property Value
string
Remarks
Examples: "MNIST", "Cora Citation Network", "IMDB Reviews"
SampleCount
Gets the total number of samples in the dataset.
public override int SampleCount { get; }
Property Value
int
Remarks
This may be known upfront (e.g., from file metadata) or estimated. For truly streaming sources where the count is unknown, this may return -1.
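Code that derives a batch count from SampleCount should account for the -1 case. The helper below is a hypothetical sketch (`BatchMath.BatchCount` is not a library member) and assumes the last partial batch is kept; whether the loader itself keeps or drops it is not stated on this page.

```csharp
using System;

public static class BatchMath
{
    // Returns the number of batches for a known sample count, or null when
    // the count is negative (e.g. -1, meaning "unknown").
    // Assumption: a trailing partial batch counts as one batch (ceiling division).
    public static int? BatchCount(int sampleCount, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        if (sampleCount < 0) return null;

        // Use long arithmetic so the addition cannot overflow for large counts.
        return (int)(((long)sampleCount + batchSize - 1) / batchSize);
    }
}
```

For example, 1,000,000 samples with a batch size of 32 give exactly 31,250 batches, while 1,000,001 samples give 31,251.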
Methods
ReadSampleAsync(int, CancellationToken)
Reads a single sample by index.
protected override Task<(TInput Input, TOutput Output)> ReadSampleAsync(int index, CancellationToken cancellationToken = default)
Parameters
index (int): The index of the sample to read.
cancellationToken (CancellationToken): Cancellation token.
Returns
Task<(TInput Input, TOutput Output)>
Remarks
Derived classes must implement this to read a single sample from the data source. This method is called by the batching infrastructure to build batches.