Class CsvStreamingDataLoader<T, TInput, TOutput>
A streaming data loader that reads from a CSV file line by line.
public class CsvStreamingDataLoader<T, TInput, TOutput> : StreamingDataLoaderBase<T, TInput, TOutput>, IStreamingDataLoader<T, TInput, TOutput>, IDataLoader<T>, IResettable, ICountable
Type Parameters
T: The numeric type used for calculations.
TInput: The type of input data.
TOutput: The type of output/label data.
Inheritance
StreamingDataLoaderBase<T, TInput, TOutput>
CsvStreamingDataLoader<T, TInput, TOutput>
Implements
IStreamingDataLoader<T, TInput, TOutput>
IDataLoader<T>
Remarks
CsvStreamingDataLoader reads a CSV file line by line without loading the entire file into memory. This is ideal for large tabular datasets.
For Beginners: If you have a large CSV file (gigabytes of data), this loader will read it row by row as needed during training.
Example:
// Requires: using System.Linq; (for Take and Select)
var loader = new CsvStreamingDataLoader<float, float[], float>(
    filePath: "large_dataset.csv",
    lineParser: (line, lineNumber) =>
    {
        // Columns 0-9 are features; column 10 is the label.
        var parts = line.Split(',');
        var features = parts.Take(10).Select(float.Parse).ToArray();
        var label = float.Parse(parts[10]);
        return (features, label);
    },
    batchSize: 256,
    hasHeader: true
);
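The inline parser in the example uses the current culture for float.Parse and assumes every row has at least 11 columns. A slightly more defensive variant might look like the sketch below. The class name CsvLineParsers and the method ParseFeaturesAndLabel are hypothetical and not part of the library; the method only matches the Func<string, int, (TInput, TOutput)> shape that the constructor expects. It also assumes a plain comma-separated file with no quoted fields.

```csharp
using System;
using System.Globalization;

public static class CsvLineParsers
{
    // Hypothetical helper (not a library API): parses "f0,...,f9,label"
    // into a (features, label) pair using culture-invariant number parsing.
    public static (float[] Features, float Label) ParseFeaturesAndLabel(string line, int lineNumber)
    {
        const int featureCount = 10;

        // Naive split: assumes no quoted fields containing commas.
        var parts = line.Split(',');
        if (parts.Length < featureCount + 1)
        {
            throw new FormatException(
                $"Line {lineNumber}: expected at least {featureCount + 1} columns, found {parts.Length}.");
        }

        var features = new float[featureCount];
        for (int i = 0; i < featureCount; i++)
        {
            // InvariantCulture ensures "1.5" parses the same on every machine.
            features[i] = float.Parse(parts[i], CultureInfo.InvariantCulture);
        }

        var label = float.Parse(parts[featureCount], CultureInfo.InvariantCulture);
        return (features, label);
    }
}
```

Under these assumptions, the method group could be passed directly as lineParser: CsvLineParsers.ParseFeaturesAndLabel.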
Constructors
CsvStreamingDataLoader(string, Func<string, int, (TInput, TOutput)>, int, bool, int, int)
Initializes a new instance of the CsvStreamingDataLoader class.
public CsvStreamingDataLoader(string filePath, Func<string, int, (TInput, TOutput)> lineParser, int batchSize, bool hasHeader = true, int prefetchCount = 2, int numWorkers = 4)
Parameters
filePath (string): Path to the CSV file.
lineParser (Func<string, int, (TInput, TOutput)>): Function that parses a line into (input, output).
batchSize (int): Number of samples per batch.
hasHeader (bool): Whether the CSV has a header row to skip.
prefetchCount (int): Number of batches to prefetch.
numWorkers (int): Number of parallel workers.
Properties
Name
Gets the human-readable name of this data loader.
public override string Name { get; }
Property Value
string
Remarks
Examples: "MNIST", "Cora Citation Network", "IMDB Reviews"
SampleCount
Gets the total number of samples in the dataset.
public override int SampleCount { get; }
Property Value
int
Remarks
This may be known upfront (e.g., from file metadata) or estimated. For truly streaming sources where the count is unknown, this may return -1.
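When a count is needed but not known upfront, one way to obtain it for a CSV file is a single streaming pass over the lines. The sketch below illustrates that idea with the base class library only. CsvCountSketch and CountDataRows are hypothetical names, and this is not necessarily how SampleCount is computed by this class.

```csharp
using System.IO;

public static class CsvCountSketch
{
    // Hypothetical helper (not a library API): counts data rows with one
    // lazy pass over the file, so the file is never fully held in memory.
    public static int CountDataRows(string filePath, bool hasHeader = true)
    {
        int count = 0;
        bool first = true;

        foreach (var line in File.ReadLines(filePath))
        {
            if (first)
            {
                first = false;
                if (hasHeader) continue; // skip the header row
            }

            if (string.IsNullOrWhiteSpace(line)) continue; // ignore blank lines
            count++;
        }

        return count;
    }
}
```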
Methods
GetSequentialBatches(int?, bool)
Iterates through the CSV file sequentially without loading all lines into memory.
public IEnumerable<(TInput[] Inputs, TOutput[] Outputs)> GetSequentialBatches(int? batchSize = null, bool dropLast = false)
Parameters
batchSize (int?)
dropLast (bool)
Returns
IEnumerable<(TInput[] Inputs, TOutput[] Outputs)>
An enumerable of batches.
Remarks
This method provides true streaming iteration without caching all lines. Use this when memory is constrained and you don't need shuffling.
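To make the idea of streaming iteration concrete, here is a minimal sketch of lazy, sequential batching over a CSV file using only the base class library. It is not the library's implementation. CsvBatchSketch and ReadBatches are hypothetical names, and the sketch assumes that dropLast means "discard a final batch smaller than batchSize" and that the parser receives a 1-based file line number.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CsvBatchSketch
{
    // Hypothetical sketch (not a library API): lazily yields batches of parsed rows.
    public static IEnumerable<(TInput[] Inputs, TOutput[] Outputs)> ReadBatches<TInput, TOutput>(
        string filePath,
        Func<string, int, (TInput, TOutput)> lineParser,
        int batchSize,
        bool hasHeader = true,
        bool dropLast = false)
    {
        // Note: in an iterator, this check runs when enumeration starts.
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var inputs = new List<TInput>(batchSize);
        var outputs = new List<TOutput>(batchSize);
        int lineNumber = 0;

        // File.ReadLines enumerates lazily, so only the current batch is buffered.
        foreach (var line in File.ReadLines(filePath))
        {
            lineNumber++;
            if (hasHeader && lineNumber == 1) continue;     // skip the header row
            if (string.IsNullOrWhiteSpace(line)) continue;  // skip blank lines

            var (input, output) = lineParser(line, lineNumber);
            inputs.Add(input);
            outputs.Add(output);

            if (inputs.Count == batchSize)
            {
                yield return (inputs.ToArray(), outputs.ToArray());
                inputs.Clear();
                outputs.Clear();
            }
        }

        // Emit the final partial batch unless the caller asked to drop it.
        if (inputs.Count > 0 && !dropLast)
        {
            yield return (inputs.ToArray(), outputs.ToArray());
        }
    }
}
```

Because rows are consumed in file order and never cached, this pattern cannot shuffle samples, which matches the trade-off described in the remarks.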
ReadSampleAsync(int, CancellationToken)
Reads a single sample by index.
protected override Task<(TInput Input, TOutput Output)> ReadSampleAsync(int index, CancellationToken cancellationToken = default)
Parameters
index (int): The index of the sample to read.
cancellationToken (CancellationToken): Cancellation token.
Returns
Task<(TInput Input, TOutput Output)>
Remarks
Derived classes must implement this to read a single sample from the data source. This method is called by the batching infrastructure to build batches.
UnloadDataCore()
Core data unloading implementation to be provided by derived classes.
protected override void UnloadDataCore()
Remarks
Derived classes should implement this to release resources:
- Clear internal data structures
- Release file handles or connections
- Allow garbage collection of loaded data