Class BlipNeuralNetwork<T>

Namespace
AiDotNet.NeuralNetworks
Assembly
AiDotNet.dll

BLIP (Bootstrapped Language-Image Pre-training) neural network for vision-language tasks.

public class BlipNeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IBlipModel<T>, IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object ← NeuralNetworkBase<T> ← BlipNeuralNetwork<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
IBlipModel<T>
IMultimodalEmbedding<T>

Remarks

BLIP extends CLIP's capabilities with image captioning, image-text matching, and visual question answering. It uses a unified framework with both understanding and generation tasks. This implementation supports both ONNX pretrained models and native library layers.

For Beginners: BLIP is a more powerful version of CLIP!

CLIP can:

  • Match images with text descriptions
  • Classify images without task-specific training (zero-shot)

BLIP adds:

  • Generate captions ("a dog playing in the park")
  • Answer questions ("What color is the car?" → "Red")
  • More accurate image-text matching

Training innovation:

  • BLIP was trained on noisy web data
  • It learned to filter out bad captions automatically
  • Then it generated better captions to train on!
  • This "bootstrapping" creates a cleaner dataset

Use cases:

  • Accessibility (auto-generate alt-text for images)
  • Content moderation (answer "is there violence in this image?")
  • Visual search (find images matching a description)
  • Image organization (auto-tag photos)
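For a quick feel of the API, here is a minimal usage sketch. BuildArchitecture and LoadImageTensor are hypothetical helpers standing in for your own architecture setup and image preprocessing; the BLIP calls themselves match the signatures documented on this page.

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper
var blip = new BlipNeuralNetwork<double>(architecture);

Tensor<double> image = LoadImageTensor("photo.jpg"); // hypothetical helper: [3, 384, 384], normalized

string caption = blip.GenerateCaption(image);
string answer = blip.AnswerQuestion(image, "What color is the car?");
double match = blip.ComputeImageTextMatch(image, "a red car on a street");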

Constructors

BlipNeuralNetwork(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int, int, int, int, int, ITokenizer?, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a BLIP network using native library layers.

public BlipNeuralNetwork(NeuralNetworkArchitecture<T> architecture, int imageSize = 384, int channels = 3, int patchSize = 16, int vocabularySize = 30522, int maxSequenceLength = 35, int embeddingDimension = 256, int hiddenDim = 768, int numEncoderLayers = 12, int numDecoderLayers = 12, int numHeads = 12, int mlpDim = 3072, ITokenizer? tokenizer = null, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

imageSize int

Expected image size (default 384 for BLIP).

channels int

Number of image channels (default 3 for RGB).

patchSize int

Patch size for vision transformer.

vocabularySize int

Text vocabulary size (default 30522, the BERT vocabulary size).

maxSequenceLength int

Maximum text sequence length.

embeddingDimension int

Dimension of shared embedding space.

hiddenDim int

Hidden dimension for transformers.

numEncoderLayers int

Number of encoder transformer layers.

numDecoderLayers int

Number of decoder transformer layers.

numHeads int

Number of attention heads.

mlpDim int

MLP hidden dimension.

tokenizer ITokenizer

Optional tokenizer for text processing.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for training.

lossFunction ILossFunction<T>

Optional loss function.

Remarks

This constructor creates a fully trainable BLIP network using the library's native layers. All operations use the Engine for CPU/GPU acceleration.
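As an illustration, the sketch below builds a smaller native-layer variant by overriding a few of the defaults; architecture is an assumed, already-configured NeuralNetworkArchitecture<double>.

var blip = new BlipNeuralNetwork<double>(
    architecture,
    imageSize: 224,          // smaller input than the 384 default
    patchSize: 16,           // 224 / 16 = 14 x 14 = 196 patches
    numEncoderLayers: 6,     // half-depth encoder...
    numDecoderLayers: 6);    // ...and decoder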

BlipNeuralNetwork(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a BLIP network using pretrained ONNX models.

public BlipNeuralNetwork(NeuralNetworkArchitecture<T> architecture, string visionEncoderPath, string textEncoderPath, string textDecoderPath, ITokenizer tokenizer, int embeddingDimension = 256, int maxSequenceLength = 35, int imageSize = 384, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

visionEncoderPath string

Path to the vision encoder ONNX model.

textEncoderPath string

Path to the text encoder ONNX model.

textDecoderPath string

Path to the text decoder ONNX model.

tokenizer ITokenizer

The tokenizer for text processing.

embeddingDimension int

Dimension of the shared embedding space.

maxSequenceLength int

Maximum text sequence length.

imageSize int

Expected image size.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for fine-tuning.

lossFunction ILossFunction<T>

Optional loss function.
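A hypothetical loading sketch: the ONNX file names are placeholders, and tokenizer is an assumed ITokenizer that matches the pretrained models' vocabulary.

var blip = new BlipNeuralNetwork<double>(
    architecture,
    visionEncoderPath: "blip_vision_encoder.onnx",  // placeholder paths
    textEncoderPath: "blip_text_encoder.onnx",
    textDecoderPath: "blip_text_decoder.onnx",
    tokenizer: tokenizer);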

Properties

EmbeddingDimension

Gets the dimensionality of the embedding space.

public int EmbeddingDimension { get; }

Property Value

int

ImageSize

Gets the expected image size (square images: ImageSize x ImageSize pixels).

public int ImageSize { get; }

Property Value

int

MaxSequenceLength

Gets the maximum sequence length for text input.

public int MaxSequenceLength { get; }

Property Value

int

ParameterCount

Gets the total number of trainable parameters.

public override int ParameterCount { get; }

Property Value

int

SupportsTraining

Indicates whether this network supports training (learning from data).

public override bool SupportsTraining { get; }

Property Value

bool

Remarks

For Beginners: Not all neural networks can learn. Some are designed only for making predictions with pre-set parameters. This property tells you if the network can learn from data.

Methods

AnswerQuestion(Tensor<T>, string, int)

Answers a question about an image's content.

public string AnswerQuestion(Tensor<T> image, string question, int maxLength = 20)

Parameters

image Tensor<T>

The preprocessed image tensor.

question string

The question to answer (e.g., "What color is the car?").

maxLength int

Maximum length of the answer.

Returns

string

The generated answer.

Remarks

Visual Question Answering (VQA) generates natural language answers to questions about image content. The model uses cross-attention to focus on relevant image regions when generating the answer.

For Beginners: Ask questions about images and get answers!

Examples:

  • Image: Photo of a kitchen
  • "What appliances are visible?" → "refrigerator, microwave, and stove"
  • "What color are the cabinets?" → "white"
  • "Is there a window?" → "yes, above the sink"

This is useful for:

  • Accessibility (describe images for visually impaired users)
  • Content moderation (is there alcohol in this photo?)
  • Data extraction (what brand is this product?)
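A small VQA sketch, assuming blip is a constructed BlipNeuralNetwork<double> and image is a preprocessed Tensor<double> with shape [channels, height, width]:

string[] questions =
{
    "What appliances are visible?",
    "What color are the cabinets?",
    "Is there a window?"
};

foreach (string question in questions)
{
    string answer = blip.AnswerQuestion(image, question, maxLength: 20);
    Console.WriteLine($"{question} -> {answer}");
}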

ComputeImageTextMatch(Tensor<T>, string)

Determines whether a given text accurately describes an image.

public T ComputeImageTextMatch(Tensor<T> image, string text)

Parameters

image Tensor<T>

The preprocessed image tensor.

text string

The text description to evaluate.

Returns

T

A probability score between 0 and 1 indicating match quality.

Remarks

Uses the Image-Text Matching (ITM) head with cross-attention between image patches and text tokens for fine-grained matching. This is more accurate than simple embedding similarity for detailed matching.

For Beginners: This checks if a caption accurately describes an image.

Unlike simple similarity (dot product), this uses "cross-attention" which:

  • Looks at specific parts of the image
  • Compares them to specific words in the text
  • Gives a more accurate yes/no answer

Example:

  • Image: A red car parked on a street
  • "A red vehicle on pavement" → 0.92 (accurate!)
  • "A blue car in a garage" → 0.15 (wrong color and location)

Use this when you need precise matching, not just "related content."
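A sketch of threshold-based matching; the 0.5 cutoff is an illustrative choice, not a library constant.

double score = blip.ComputeImageTextMatch(image, "a red vehicle on pavement");
bool isAccurate = score > 0.5; // illustrative threshold
Console.WriteLine($"Match score: {score:F2}, accurate: {isAccurate}");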

ComputeSimilarity(Vector<T>, Vector<T>)

Computes similarity between two embeddings.

public T ComputeSimilarity(Vector<T> textEmbedding, Vector<T> imageEmbedding)

Parameters

textEmbedding Vector<T>

The text embedding vector.

imageEmbedding Vector<T>

The image embedding vector.

Returns

T

Similarity score (cosine similarity for normalized embeddings).
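A sketch comparing a text embedding against an image embedding, assuming imageData is a preprocessed CHW array (see EncodeImage below):

Vector<double> textEmbedding = blip.EncodeText("a dog playing fetch");
Vector<double> imageEmbedding = blip.EncodeImage(imageData);

// Embeddings are normalized, so this is effectively cosine similarity.
double similarity = blip.ComputeSimilarity(textEmbedding, imageEmbedding);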

CreateNewInstance()

Creates a new instance of the same type as this neural network.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance of the same neural network type.

Remarks

For Beginners: This creates a blank version of the same type of neural network.

It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data that was not covered by the general deserialization process.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The BinaryReader to read the data from.

Remarks

This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.

For Beginners: Continuing the suitcase analogy, this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.

Dispose(bool)

Protected Dispose pattern implementation.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True if called from Dispose(), false if called from finalizer.

EmbedAsync(string)

Asynchronously encodes text into an embedding vector.

public Task<Vector<T>> EmbedAsync(string text)

Parameters

text string

Returns

Task<Vector<T>>

EmbedBatchAsync(IEnumerable<string>)

Asynchronously encodes multiple texts into embedding vectors in a batch.

public Task<Matrix<T>> EmbedBatchAsync(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

Task<Matrix<T>>
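A sketch of the async text-embedding calls, assuming they run inside an async method:

Vector<double> single = await blip.EmbedAsync("a photo of a dog");
Matrix<double> batch = await blip.EmbedBatchAsync(new[] { "a dog", "a cat", "a bird" });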

EncodeImage(double[])

Encodes an image into an embedding vector.

public Vector<T> EncodeImage(double[] imageData)

Parameters

imageData double[]

The preprocessed image data as a flattened array in CHW format.

Returns

Vector<T>

A normalized embedding vector.
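The sketch below illustrates the CHW (channel-major) layout this method expects. The pixel value is a placeholder; real inputs should use whatever normalization the model was trained with.

int size = 384, channels = 3;
var imageData = new double[channels * size * size];

// CHW layout: all red values first, then all green, then all blue.
// index = c * (size * size) + y * size + x
imageData[0 * size * size + 0 * size + 0] = 0.5; // red channel, pixel (0, 0), placeholder value

Vector<double> embedding = blip.EncodeImage(imageData);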

EncodeImageBatch(IEnumerable<double[]>)

Encodes multiple images into embedding vectors in a batch.

public Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)

Parameters

imageDataBatch IEnumerable<double[]>

The preprocessed images as flattened arrays.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding image.

EncodeText(string)

Encodes text into an embedding vector.

public Vector<T> EncodeText(string text)

Parameters

text string

The text to encode.

Returns

Vector<T>

A normalized embedding vector.

EncodeTextBatch(IEnumerable<string>)

Encodes multiple texts into embedding vectors in a batch.

public Matrix<T> EncodeTextBatch(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

The texts to encode.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding text.

GenerateCaption(Tensor<T>, int, int)

Generates a caption describing the content of an image.

public string GenerateCaption(Tensor<T> image, int maxLength = 30, int numBeams = 3)

Parameters

image Tensor<T>

The preprocessed image tensor with shape [channels, height, width].

maxLength int

Maximum number of tokens to generate. Default is 30.

numBeams int

Number of beams for beam search. Default is 3 for quality/speed balance.

Returns

string

A generated caption describing the image.

Remarks

Uses the image-grounded text decoder to generate descriptive captions. The generation uses beam search by default for higher quality outputs.

For Beginners: This automatically describes what's in an image!

Example:

  • Input: Photo of a dog playing fetch in a park
  • Output: "a brown dog catching a frisbee on a grassy field"

Parameters:

  • maxLength: How long the caption can be (30 = roughly 25 words)
  • numBeams: More beams = better captions but slower (3 is a good balance)

Uses "beam search" - it explores multiple possible captions and picks the best one.

GenerateCaptions(Tensor<T>, int, int)

Generates multiple candidate captions for an image.

public IEnumerable<string> GenerateCaptions(Tensor<T> image, int numCaptions = 5, int maxLength = 30)

Parameters

image Tensor<T>

The preprocessed image tensor.

numCaptions int

Number of captions to generate.

maxLength int

Maximum length per caption.

Returns

IEnumerable<string>

A collection of candidate captions.

Remarks

Uses nucleus (top-p) sampling to generate diverse captions. Useful for getting multiple perspectives on an image's content.
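A sketch that samples several diverse captions for one image:

foreach (string caption in blip.GenerateCaptions(image, numCaptions: 5, maxLength: 30))
{
    Console.WriteLine(caption); // sampling means each run can produce different captions
}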

GetImageEmbedding(Tensor<T>)

Gets the embedding vector for a preprocessed image tensor.

public Vector<T> GetImageEmbedding(Tensor<T> image)

Parameters

image Tensor<T>

Returns

Vector<T>

GetImageEmbeddings(IEnumerable<Tensor<T>>)

Gets embedding vectors for multiple preprocessed image tensors.

public IEnumerable<Vector<T>> GetImageEmbeddings(IEnumerable<Tensor<T>> images)

Parameters

images IEnumerable<Tensor<T>>

Returns

IEnumerable<Vector<T>>

GetModelMetadata()

Retrieves metadata about the BLIP neural network model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

A ModelMetadata<T> object containing information about the network.

GetParameters()

Gets all trainable parameters of the network as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all parameters of the network.

Remarks

For Beginners: Neural networks learn by adjusting their "parameters" (also called weights and biases). This method collects all those adjustable values into a single list so they can be updated during training.

GetTextEmbedding(string)

Gets the embedding vector for a text string.

public Vector<T> GetTextEmbedding(string text)

Parameters

text string

Returns

Vector<T>

GetTextEmbeddings(IEnumerable<string>)

Gets embedding vectors for multiple text strings.

public IEnumerable<Vector<T>> GetTextEmbeddings(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

IEnumerable<Vector<T>>

InitializeLayers()

Initializes the layers of the neural network based on the architecture.

protected override void InitializeLayers()

Remarks

For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.

Predict(Tensor<T>)

Makes a prediction using the neural network.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input data to process.

Returns

Tensor<T>

The network's prediction.

Remarks

For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).

RankCaptions(Tensor<T>, IEnumerable<string>)

Ranks a set of candidate captions by how well they match an image.

public IEnumerable<(string Caption, T Score)> RankCaptions(Tensor<T> image, IEnumerable<string> candidates)

Parameters

image Tensor<T>

The preprocessed image tensor.

candidates IEnumerable<string>

The candidate captions to rank.

Returns

IEnumerable<(string Caption, T Score)>

Captions ranked by match score, from best to worst.

Remarks

Uses the ITM head to score each candidate, then returns them in descending order. Useful for caption reranking in retrieval applications.
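A reranking sketch using hypothetical candidate captions:

var candidates = new[]
{
    "a dog catching a frisbee",
    "a cat sleeping on a couch",
    "an animal outdoors"
};

foreach (var (caption, score) in blip.RankCaptions(image, candidates))
{
    Console.WriteLine($"{score}: {caption}"); // best match printed first
}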

RetrieveImages(string, IEnumerable<Vector<T>>, int)

Retrieves the most relevant images for a text query from a collection.

public IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Vector<T>> imageEmbeddings, int topK = 10)

Parameters

query string

The text query describing desired images.

imageEmbeddings IEnumerable<Vector<T>>

Pre-computed image embeddings.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of the top-K matching images with their scores.

Remarks

Performs efficient text-to-image retrieval using embedding similarity. For large collections, pre-compute and cache image embeddings.
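A retrieval sketch following that advice: embeddings are computed once up front, then reused for every query. allImages is an assumed collection of preprocessed image tensors, and ToList requires System.Linq.

// Pre-compute once; reuse for every query.
List<Vector<double>> index = blip.GetImageEmbeddings(allImages).ToList();

foreach (var (i, score) in blip.RetrieveImages("a sunset over the ocean", index, topK: 5))
{
    Console.WriteLine($"image #{i}: score {score}");
}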

RetrieveTexts(Tensor<T>, IEnumerable<Vector<T>>, int)

Retrieves the most relevant texts for an image from a collection.

public IEnumerable<(int Index, T Score)> RetrieveTexts(Tensor<T> image, IEnumerable<Vector<T>> textEmbeddings, int topK = 10)

Parameters

image Tensor<T>

The preprocessed image tensor.

textEmbeddings IEnumerable<Vector<T>>

Pre-computed text embeddings.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of the top-K matching texts with their scores.

Remarks

Performs efficient image-to-text retrieval using embedding similarity. Useful for finding relevant captions or descriptions for images.

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data that is not covered by the general serialization process.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The BinaryWriter to write the data to.

Remarks

This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.

For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.

Train(Tensor<T>, Tensor<T>)

Trains the neural network on a single input-output pair.

public override void Train(Tensor<T> input, Tensor<T> target)

Parameters

input Tensor<T>

The input data.

target Tensor<T>

The expected output for the given input.
Remarks

This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.

For Beginners: This is how your neural network learns. You provide:

  • An input (what the network should process)
  • The expected output (what the correct answer should be)

The network then:

  1. Makes a prediction based on the input
  2. Compares its prediction to the expected output
  3. Calculates how wrong it was (the loss)
  4. Adjusts its internal values to do better next time

After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
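A sketch of a training loop, assuming input/target tensor pairs appropriate for the task; GetLastLoss() is the loss-inspection method mentioned above.

for (int epoch = 0; epoch < 10; epoch++)
{
    blip.Train(input, target);
    Console.WriteLine($"epoch {epoch}: loss = {blip.GetLastLoss()}");
}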

UpdateParameters(Vector<T>)

Updates the network's parameters with new values.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The new parameter values to set.

Remarks

For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.

This is typically used by optimization algorithms that calculate better parameter values based on training data.
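A round-trip sketch of the kind an optimizer performs:

Vector<double> parameters = blip.GetParameters();
// ... an optimizer would compute improved values here ...
blip.UpdateParameters(parameters); // must have the same length GetParameters() returned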

ZeroShotClassify(Tensor<T>, IEnumerable<string>)

Performs zero-shot classification of an image tensor against text labels.

public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels)

Parameters

image Tensor<T>

The preprocessed image tensor.

classLabels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.

ZeroShotClassify(double[], IEnumerable<string>)

Performs zero-shot classification of an image against text labels.

public Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)

Parameters

imageData double[]

The preprocessed image data.

labels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.
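A classification sketch using the tensor overload; image is an assumed preprocessed tensor, and OrderByDescending requires System.Linq.

string[] labels = { "dog", "cat", "bird" };
Dictionary<string, double> scores = blip.ZeroShotClassify(image, labels);

foreach (var pair in scores.OrderByDescending(p => p.Value))
{
    Console.WriteLine($"{pair.Key}: {pair.Value:P1}"); // higher score = better match
}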