Class MemoryMappedStreamingDataLoader<T, TInput, TOutput>
A streaming data loader that uses memory-mapped files for efficient random access to large binary datasets.
public class MemoryMappedStreamingDataLoader<T, TInput, TOutput> : StreamingDataLoaderBase<T, TInput, TOutput>, IStreamingDataLoader<T, TInput, TOutput>, IDataLoader<T>, IResettable, ICountable, IDisposable
Type Parameters
T
The numeric type used for calculations.
TInput
The type of input data.
TOutput
The type of output/label data.
- Inheritance
StreamingDataLoaderBase<T, TInput, TOutput>
MemoryMappedStreamingDataLoader<T, TInput, TOutput>
- Implements
IStreamingDataLoader<T, TInput, TOutput>
IDataLoader<T>
Remarks
MemoryMappedStreamingDataLoader uses MemoryMappedFile for efficient random access to large datasets stored in binary format. The operating system handles paging data in and out of physical memory as needed, enabling efficient access to datasets larger than available RAM.
File Format Requirements:
- Binary file with fixed-size samples
- Each sample is inputSizeBytes + outputSizeBytes bytes
- Samples are stored contiguously with optional header
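The layout above implies a simple offset calculation for random access. The following is a minimal sketch, not the loader's actual code: `SampleLayout` and `GetSampleOffset` are hypothetical names, and the sketch assumes the header comes first and that each sample stores its input bytes before its output bytes.

```csharp
using System;

public static class SampleLayout
{
    // Byte offset of the sample at `index`, assuming the file is laid out as
    // [header][sample 0][sample 1]... with fixed-size samples.
    public static long GetSampleOffset(long headerSizeBytes, int inputSizeBytes, int outputSizeBytes, int index)
    {
        if (index < 0) throw new ArgumentOutOfRangeException(nameof(index));

        // Widen to long before multiplying so large datasets do not overflow int.
        long sampleSizeBytes = (long)inputSizeBytes + outputSizeBytes;
        return headerSizeBytes + index * sampleSizeBytes;
    }
}
```

For 28x28 float images with an int label, each sample is 784 * 4 + 4 = 3140 bytes, so with a 16-byte header the sample at index 2 would start at byte 16 + 2 * 3140 = 6296.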
For Beginners: Memory-mapped files let the operating system manage which parts of a large file are in memory. When you access a sample, the OS automatically loads that portion of the file into RAM. This is very efficient for random access patterns like shuffled batch iteration on datasets too large to fit in memory.
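To make the idea concrete, here is a small sketch of reading a byte range through a memory-mapped view with the standard `System.IO.MemoryMappedFiles` API. `MappedReadSketch` and `ReadBytes` are hypothetical names for illustration; a real loader would presumably keep the mapping open across reads rather than reopening the file each time.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public static class MappedReadSketch
{
    // Reads `count` bytes starting at `offset` from a file via a memory-mapped view.
    // The OS pages in only the parts of the file that the view touches.
    public static byte[] ReadBytes(string path, long offset, int count)
    {
        // mapName = null (unnamed map), capacity = 0 (use the file's current size).
        using var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
        using var view = mmf.CreateViewAccessor(offset, count, MemoryMappedFileAccess.Read);

        var buffer = new byte[count];
        view.ReadArray(0, buffer, 0, count); // position 0 is relative to the view
        return buffer;
    }
}
```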
Example:
// Create a memory-mapped loader for binary image data
var loader = new MemoryMappedStreamingDataLoader<float, float[], int>(
    filePath: "images.bin",
    sampleCount: 60000,
    inputSizeBytes: 784 * sizeof(float), // 28x28 image
    outputSizeBytes: sizeof(int),        // Label
    inputDeserializer: (bytes) =>
    {
        var floats = new float[784];
        for (int i = 0; i < 784; i++)
            floats[i] = BitConverter.ToSingle(bytes, i * 4);
        return floats;
    },
    outputDeserializer: (bytes) => BitConverter.ToInt32(bytes, 0),
    batchSize: 32
);

await foreach (var batch in loader.GetBatchesAsync())
{
    await model.TrainOnBatchAsync(batch.Inputs, batch.Outputs);
}
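A file that matches the deserializers in this example could be produced along the following lines. This is a sketch under assumptions: `SampleFileWriter` is a hypothetical helper, no header is written, and each sample is written as its floats followed by its label. `BinaryWriter` always writes little-endian, while `BitConverter` uses the platform's byte order, so the two agree only on little-endian platforms.

```csharp
using System;
using System.IO;

public static class SampleFileWriter
{
    // Writes samples back to back as [float32 x inputLength][int32 label], with no header.
    public static void Write(string path, float[][] inputs, int[] labels)
    {
        if (inputs.Length != labels.Length)
            throw new ArgumentException("inputs and labels must have the same length.");

        using var stream = File.Create(path);
        using var writer = new BinaryWriter(stream);
        for (int i = 0; i < inputs.Length; i++)
        {
            foreach (float value in inputs[i])
                writer.Write(value);  // 4 bytes per float
            writer.Write(labels[i]);  // 4 bytes per label
        }
    }
}
```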
Constructors
MemoryMappedStreamingDataLoader(string, int, int, int, Func<byte[], TInput>, Func<byte[], TOutput>, int, long, int, int)
Initializes a new instance of the MemoryMappedStreamingDataLoader class.
public MemoryMappedStreamingDataLoader(string filePath, int sampleCount, int inputSizeBytes, int outputSizeBytes, Func<byte[], TInput> inputDeserializer, Func<byte[], TOutput> outputDeserializer, int batchSize, long headerSizeBytes = 0, int prefetchCount = 2, int numWorkers = 4)
Parameters
filePath string
Path to the binary data file.
sampleCount int
Total number of samples in the dataset.
inputSizeBytes int
Size of input data per sample in bytes.
outputSizeBytes int
Size of output/label data per sample in bytes.
inputDeserializer Func<byte[], TInput>
Function to deserialize input bytes to TInput.
outputDeserializer Func<byte[], TOutput>
Function to deserialize output bytes to TOutput.
batchSize int
Number of samples per batch.
headerSizeBytes long
Size of file header to skip in bytes. Default is 0.
prefetchCount int
Number of batches to prefetch. Default is 2.
numWorkers int
Number of parallel workers. Default is 4.
Exceptions
- ArgumentNullException
Thrown when filePath, inputDeserializer, or outputDeserializer is null.
- ArgumentOutOfRangeException
Thrown when sizes are invalid.
- FileNotFoundException
Thrown when the file does not exist.
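The documentation does not spell out which size checks the constructor performs, so callers may want to verify their arguments against the file before constructing the loader. The sketch below is a hypothetical pre-check (`LoaderArgumentCheck.Validate` is not part of the library). Treating non-positive counts and sizes as invalid, and rejecting a file shorter than the declared layout, are assumptions made for this example.

```csharp
using System;
using System.IO;

public static class LoaderArgumentCheck
{
    // Hypothetical pre-check; the rules enforced by the real constructor may differ.
    public static void Validate(string filePath, int sampleCount, int inputSizeBytes,
        int outputSizeBytes, long headerSizeBytes = 0)
    {
        if (filePath is null) throw new ArgumentNullException(nameof(filePath));
        if (sampleCount <= 0) throw new ArgumentOutOfRangeException(nameof(sampleCount));
        if (inputSizeBytes <= 0) throw new ArgumentOutOfRangeException(nameof(inputSizeBytes));
        if (outputSizeBytes <= 0) throw new ArgumentOutOfRangeException(nameof(outputSizeBytes));
        if (headerSizeBytes < 0) throw new ArgumentOutOfRangeException(nameof(headerSizeBytes));
        if (!File.Exists(filePath)) throw new FileNotFoundException("Data file not found.", filePath);

        // The file must hold the header plus every declared sample.
        long required = headerSizeBytes + (long)sampleCount * ((long)inputSizeBytes + outputSizeBytes);
        long actual = new FileInfo(filePath).Length;
        if (actual < required)
            throw new InvalidDataException(
                $"File has {actual} bytes but the declared layout needs {required}.");
    }
}
```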
Properties
HeaderSizeBytes
Gets the size of the file header in bytes.
public long HeaderSizeBytes { get; }
Property Value
long
Name
Gets the human-readable name of this data loader.
public override string Name { get; }
Property Value
string
Remarks
Examples: "MNIST", "Cora Citation Network", "IMDB Reviews"
SampleCount
Gets the total number of samples in the dataset.
public override int SampleCount { get; }
Property Value
int
Remarks
This may be known upfront (e.g., from file metadata) or estimated. For truly streaming sources where the count is unknown, this may return -1.
SampleSizeBytes
Gets the size of each sample in bytes (input + output).
public int SampleSizeBytes { get; }
Property Value
int
Methods
Dispose()
Releases all resources used by the memory-mapped data loader.
public void Dispose()
Dispose(bool)
Releases the unmanaged resources and optionally releases the managed resources.
protected void Dispose(bool disposing)
Parameters
disposing bool
True to release both managed and unmanaged resources.
~MemoryMappedStreamingDataLoader()
Finalizer to ensure resources are released.
~MemoryMappedStreamingDataLoader()
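Dispose(), Dispose(bool), and the finalizer together follow the shape of the standard .NET dispose pattern. The sketch below illustrates that general pattern on a hypothetical `MappedResourceOwner` class; it is not the loader's actual implementation, and the fields it releases are assumptions.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public class MappedResourceOwner : IDisposable
{
    private readonly MemoryMappedFile _file;
    private bool _disposed;

    public MappedResourceOwner(string path)
    {
        _file = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read);
    }

    public bool IsDisposed => _disposed;

    public void Dispose()
    {
        Dispose(disposing: true);
        GC.SuppressFinalize(this); // resources are already released, so skip the finalizer
    }

    protected virtual void Dispose(bool disposing)
    {
        if (_disposed) return;  // safe to call more than once
        if (disposing)
        {
            _file.Dispose();    // managed wrapper around the OS mapping handle
        }
        _disposed = true;
    }

    // Safety net in case Dispose() is never called.
    ~MappedResourceOwner() => Dispose(disposing: false);
}
```

In calling code, a `using` statement or `using var` declaration disposes the object deterministically, so the file handle is released without waiting for the finalizer.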
ReadSampleAsync(int, CancellationToken)
Reads a single sample by index.
protected override Task<(TInput Input, TOutput Output)> ReadSampleAsync(int index, CancellationToken cancellationToken = default)
Parameters
index int
The index of the sample to read.
cancellationToken CancellationToken
Cancellation token.
Returns
Task<(TInput Input, TOutput Output)>
Remarks
Derived classes must implement this to read a single sample from the data source. This method is called by the batching infrastructure to build batches.
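For a fixed-size binary layout, reading one sample plausibly comes down to splitting the sample's bytes into an input part and an output part and passing each to its deserializer. The following sketch shows that step only. `SampleDecoder.DecodeAsync` is a hypothetical helper, and the input-then-output byte order is an assumption rather than documented behavior.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SampleDecoder
{
    // Splits one sample's raw bytes into input and output buffers and deserializes both.
    public static Task<(TInput Input, TOutput Output)> DecodeAsync<TInput, TOutput>(
        byte[] sampleBytes, int inputSizeBytes, int outputSizeBytes,
        Func<byte[], TInput> inputDeserializer, Func<byte[], TOutput> outputDeserializer,
        CancellationToken cancellationToken = default)
    {
        cancellationToken.ThrowIfCancellationRequested();
        if (sampleBytes.Length != inputSizeBytes + outputSizeBytes)
            throw new ArgumentException("Unexpected sample size.", nameof(sampleBytes));

        var inputBytes = new byte[inputSizeBytes];
        var outputBytes = new byte[outputSizeBytes];
        Buffer.BlockCopy(sampleBytes, 0, inputBytes, 0, inputSizeBytes);
        Buffer.BlockCopy(sampleBytes, inputSizeBytes, outputBytes, 0, outputSizeBytes);

        (TInput Input, TOutput Output) sample =
            (inputDeserializer(inputBytes), outputDeserializer(outputBytes));
        return Task.FromResult(sample); // the work is synchronous, so wrap the result
    }
}
```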
UnloadDataCore()
Core data unloading implementation to be provided by derived classes.
protected override void UnloadDataCore()
Remarks
Derived classes should implement this to release resources:
- Clear internal data structures
- Release file handles or connections
- Allow garbage collection of loaded data