Class BlipNeuralNetwork<T>

Namespace
AiDotNet.NeuralNetworks
Assembly
AiDotNet.dll

BLIP (Bootstrapped Language-Image Pre-training) neural network for vision-language tasks.

public class BlipNeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IBlipModel<T>, IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object ← NeuralNetworkBase<T> ← BlipNeuralNetwork<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
IBlipModel<T>
IMultimodalEmbedding<T>

Remarks

BLIP extends CLIP's capabilities with image captioning, image-text matching, and visual question answering. It uses a unified framework with both understanding and generation tasks. This implementation supports both ONNX pretrained models and native library layers.

For Beginners: BLIP is a more powerful version of CLIP!

CLIP can:

  • Match images with text descriptions
  • Classify images without task-specific training (zero-shot)

BLIP adds:

  • Generate captions ("a dog playing in the park")
  • Answer questions ("What color is the car?" → "Red")
  • More accurate image-text matching

Training innovation:

  • BLIP was trained on noisy web data
  • It learned to filter out bad captions automatically
  • Then it generated better captions to train on!
  • This "bootstrapping" creates a cleaner dataset

Use cases:

  • Accessibility (auto-generate alt-text for images)
  • Content moderation (answer "is there violence in this image?")
  • Visual search (find images matching a description)
  • Image organization (auto-tag photos)
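For a quick feel of the API, here is a minimal usage sketch. BuildArchitecture and LoadImageTensor are hypothetical helpers standing in for your own architecture setup and image preprocessing; the BLIP calls themselves match the signatures documented on this page.

NeuralNetworkArchitecture<double> architecture = BuildArchitecture(); // hypothetical helper
var blip = new BlipNeuralNetwork<double>(architecture);

Tensor<double> image = LoadImageTensor("photo.jpg"); // hypothetical helper: [3, 384, 384], normalized

string caption = blip.GenerateCaption(image);
string answer = blip.AnswerQuestion(image, "What color is the car?");
double match = blip.ComputeImageTextMatch(image, "a red car on a street");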

Constructors

BlipNeuralNetwork(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int, int, int, int, int, ITokenizer?, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a BLIP network using native library layers.

public BlipNeuralNetwork(NeuralNetworkArchitecture<T> architecture, int imageSize = 384, int channels = 3, int patchSize = 16, int vocabularySize = 30522, int maxSequenceLength = 35, int embeddingDimension = 256, int hiddenDim = 768, int numEncoderLayers = 12, int numDecoderLayers = 12, int numHeads = 12, int mlpDim = 3072, ITokenizer? tokenizer = null, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

imageSize int

Expected image size (default 384 for BLIP).

channels int

Number of image channels (default 3 for RGB).

patchSize int

Patch size for vision transformer.

vocabularySize int

Text vocabulary size (default 30522, the BERT vocabulary size).

maxSequenceLength int

Maximum text sequence length.

embeddingDimension int

Dimension of shared embedding space.

hiddenDim int

Hidden dimension for transformers.

numEncoderLayers int

Number of encoder transformer layers.

numDecoderLayers int

Number of decoder transformer layers.

numHeads int

Number of attention heads.

mlpDim int

MLP hidden dimension.

tokenizer ITokenizer

Optional tokenizer for text processing.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for training.

lossFunction ILossFunction<T>

Optional loss function.

Remarks

This constructor creates a fully trainable BLIP network using the library's native layers. All operations use the Engine for CPU/GPU acceleration.
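As an illustration, the sketch below builds a smaller native-layer variant by overriding a few of the defaults; architecture is an assumed, already-configured NeuralNetworkArchitecture<double>.

var blip = new BlipNeuralNetwork<double>(
    architecture,
    imageSize: 224,          // smaller input than the 384 default
    patchSize: 16,           // 224 / 16 = 14 x 14 = 196 patches
    numEncoderLayers: 6,     // half-depth encoder...
    numDecoderLayers: 6);    // ...and decoder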

BlipNeuralNetwork(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a BLIP network using pretrained ONNX models.

public BlipNeuralNetwork(NeuralNetworkArchitecture<T> architecture, string visionEncoderPath, string textEncoderPath, string textDecoderPath, ITokenizer tokenizer, int embeddingDimension = 256, int maxSequenceLength = 35, int imageSize = 384, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

visionEncoderPath string

Path to the vision encoder ONNX model.

textEncoderPath string

Path to the text encoder ONNX model.

textDecoderPath string

Path to the text decoder ONNX model.

tokenizer ITokenizer

The tokenizer for text processing.

embeddingDimension int

Dimension of the shared embedding space.

maxSequenceLength int

Maximum text sequence length.

imageSize int

Expected image size.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for fine-tuning.

lossFunction ILossFunction<T>

Optional loss function.
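A hypothetical loading sketch: the ONNX file names are placeholders, and tokenizer is an assumed ITokenizer that matches the pretrained models' vocabulary.

var blip = new BlipNeuralNetwork<double>(
    architecture,
    visionEncoderPath: "blip_vision_encoder.onnx",  // placeholder paths
    textEncoderPath: "blip_text_encoder.onnx",
    textDecoderPath: "blip_text_decoder.onnx",
    tokenizer: tokenizer);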

Properties

EmbeddingDimension

Gets the dimensionality of the embedding space.

public int EmbeddingDimension { get; }

Property Value

int

ImageSize

Gets the expected image size (square images: ImageSize x ImageSize pixels).

public int ImageSize { get; }

Property Value

int

MaxSequenceLength

Gets the maximum sequence length for text input.

public int MaxSequenceLength { get; }

Property Value

int

ParameterCount

Gets the total number of trainable parameters.

public override int ParameterCount { get; }

Property Value

int

SupportsTraining

Indicates whether this network supports training (learning from data).

public override bool SupportsTraining { get; }

Property Value

bool

Remarks

For Beginners: Not all neural networks can learn. Some are designed only for making predictions with pre-set parameters. This property tells you if the network can learn from data.

Methods

AnswerQuestion(Tensor<T>, string, int)

Answers a question about an image's content.

public string AnswerQuestion(Tensor<T> image, string question, int maxLength = 20)

Parameters

image Tensor<T>

The preprocessed image tensor.

question string

The question to answer (e.g., "What color is the car?").

maxLength int

Maximum length of the answer.

Returns

string

The generated answer.

Remarks

Visual Question Answering (VQA) generates natural language answers to questions about image content. The model uses cross-attention to focus on relevant image regions when generating the answer.

For Beginners: Ask questions about images and get answers!

Examples:

  • Image: Photo of a kitchen
  • "What appliances are visible?" → "refrigerator, microwave, and stove"
  • "What color are the cabinets?" → "white"
  • "Is there a window?" → "yes, above the sink"

This is useful for:

  • Accessibility (describe images for visually impaired users)
  • Content moderation (is there alcohol in this photo?)
  • Data extraction (what brand is this product?)
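A small VQA sketch, assuming blip is a constructed BlipNeuralNetwork<double> and image is a preprocessed Tensor<double> with shape [channels, height, width]:

string[] questions =
{
    "What appliances are visible?",
    "What color are the cabinets?",
    "Is there a window?"
};

foreach (string question in questions)
{
    string answer = blip.AnswerQuestion(image, question, maxLength: 20);
    Console.WriteLine($"{question} -> {answer}");
}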

ComputeImageTextMatch(Tensor<T>, string)

Determines whether a given text accurately describes an image.

public T ComputeImageTextMatch(Tensor<T> image, string text)

Parameters

image Tensor<T>

The preprocessed image tensor.

text string

The text description to evaluate.

Returns

T

A probability score between 0 and 1 indicating match quality.

Remarks

Uses the Image-Text Matching (ITM) head with cross-attention between image patches and text tokens for fine-grained matching. This is more accurate than simple embedding similarity for detailed matching.

For Beginners: This checks if a caption accurately describes an image.

Unlike simple similarity (dot product), this uses "cross-attention" which:

  • Looks at specific parts of the image
  • Compares them to specific words in the text
  • Gives a more accurate yes/no answer

Example:

  • Image: A red car parked on a street
  • "A red vehicle on pavement" → 0.92 (accurate!)
  • "A blue car in a garage" → 0.15 (wrong color and location)

Use this when you need precise matching, not just "related content."
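A sketch of threshold-based matching; the 0.5 cutoff is an illustrative choice, not a library constant.

double score = blip.ComputeImageTextMatch(image, "a red vehicle on pavement");
bool isAccurate = score > 0.5; // illustrative threshold
Console.WriteLine($"Match score: {score:F2}, accurate: {isAccurate}");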

ComputeSimilarity(Vector<T>, Vector<T>)

Computes similarity between two embeddings.

public T ComputeSimilarity(Vector<T> textEmbedding, Vector<T> imageEmbedding)

Parameters

textEmbedding Vector<T>

The text embedding vector.

imageEmbedding Vector<T>

The image embedding vector.

Returns

T

Similarity score (cosine similarity for normalized embeddings).
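A sketch comparing a text embedding against an image embedding, assuming imageData is a preprocessed CHW array (see EncodeImage below):

Vector<double> textEmbedding = blip.EncodeText("a dog playing fetch");
Vector<double> imageEmbedding = blip.EncodeImage(imageData);

// Embeddings are normalized, so this is effectively cosine similarity.
double similarity = blip.ComputeSimilarity(textEmbedding, imageEmbedding);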

CreateNewInstance()

Creates a new instance of the same type as this neural network.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance of the same neural network type.

Remarks

For Beginners: This creates a blank version of the same type of neural network.

It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data that was not covered by the general deserialization process.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The BinaryReader to read the data from.

Remarks

This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.

For Beginners: Continuing the suitcase analogy, this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.

Dispose(bool)

Protected Dispose pattern implementation.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True if called from Dispose(), false if called from finalizer.

EmbedAsync(string)

Asynchronously encodes text into an embedding vector.

public Task<Vector<T>> EmbedAsync(string text)

Parameters

text string

Returns

Task<Vector<T>>

EmbedBatchAsync(IEnumerable<string>)

Asynchronously encodes multiple texts into embedding vectors in a batch.

public Task<Matrix<T>> EmbedBatchAsync(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

Task<Matrix<T>>
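A sketch of the async text-embedding calls, assuming they run inside an async method:

Vector<double> single = await blip.EmbedAsync("a photo of a dog");
Matrix<double> batch = await blip.EmbedBatchAsync(new[] { "a dog", "a cat", "a bird" });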

EncodeImage(double[])

Encodes an image into an embedding vector.

public Vector<T> EncodeImage(double[] imageData)

Parameters

imageData double[]

The preprocessed image data as a flattened array in CHW format.

Returns

Vector<T>

A normalized embedding vector.
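The sketch below illustrates the CHW (channel-major) layout this method expects. The pixel value is a placeholder; real inputs should use whatever normalization the model was trained with.

int size = 384, channels = 3;
var imageData = new double[channels * size * size];

// CHW layout: all red values first, then all green, then all blue.
// index = c * (size * size) + y * size + x
imageData[0 * size * size + 0 * size + 0] = 0.5; // red channel, pixel (0, 0), placeholder value

Vector<double> embedding = blip.EncodeImage(imageData);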

EncodeImageBatch(IEnumerable<double[]>)

Encodes multiple images into embedding vectors in a batch.

public Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)

Parameters

imageDataBatch IEnumerable<double[]>

The preprocessed images as flattened arrays.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding image.

EncodeText(string)

Encodes text into an embedding vector.

public Vector<T> EncodeText(string text)

Parameters

text string

The text to encode.

Returns

Vector<T>

A normalized embedding vector.

EncodeTextBatch(IEnumerable<string>)

Encodes multiple texts into embedding vectors in a batch.

public Matrix<T> EncodeTextBatch(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

The texts to encode.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding text.

GenerateCaption(Tensor<T>, int, int)

Generates a caption describing the content of an image.

public string GenerateCaption(Tensor<T> image, int maxLength = 30, int numBeams = 3)

Parameters

image Tensor<T>

The preprocessed image tensor with shape [channels, height, width].

maxLength int

Maximum number of tokens to generate. Default is 30.

numBeams int

Number of beams for beam search. Default is 3 for quality/speed balance.

Returns

string

A generated caption describing the image.

Remarks

Uses the image-grounded text decoder to generate descriptive captions. The generation uses beam search by default for higher quality outputs.

For Beginners: This automatically describes what's in an image!

Example:

  • Input: Photo of a dog playing fetch in a park
  • Output: "a brown dog catching a frisbee on a grassy field"

Parameters:

  • maxLength: How long the caption can be (30 = roughly 25 words)
  • numBeams: More beams = better captions but slower (3 is a good balance)

Uses "beam search" - it explores multiple possible captions and picks the best one.

GenerateCaptions(Tensor<T>, int, int)

Generates multiple candidate captions for an image.

public IEnumerable<string> GenerateCaptions(Tensor<T> image, int numCaptions = 5, int maxLength = 30)

Parameters

image Tensor<T>

The preprocessed image tensor.

numCaptions int

Number of captions to generate.

maxLength int

Maximum length per caption.

Returns

IEnumerable<string>

A collection of candidate captions.

Remarks

Uses nucleus (top-p) sampling to generate diverse captions. Useful for getting multiple perspectives on an image's content.
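A sketch that samples several diverse captions for one image:

foreach (string caption in blip.GenerateCaptions(image, numCaptions: 5, maxLength: 30))
{
    Console.WriteLine(caption); // sampling means each run can produce different captions
}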

GetImageEmbedding(Tensor<T>)

Gets the embedding vector for a preprocessed image tensor.

public Vector<T> GetImageEmbedding(Tensor<T> image)

Parameters

image Tensor<T>

Returns

Vector<T>

GetImageEmbeddings(IEnumerable<Tensor<T>>)

Gets embedding vectors for multiple preprocessed image tensors.

public IEnumerable<Vector<T>> GetImageEmbeddings(IEnumerable<Tensor<T>> images)

Parameters

images IEnumerable<Tensor<T>>

Returns

IEnumerable<Vector<T>>

GetModelMetadata()

Retrieves metadata about the BLIP neural network model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

A ModelMetadata<T> object containing information about the network.

GetParameters()

Gets all trainable parameters of the network as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all parameters of the network.

Remarks

For Beginners: Neural networks learn by adjusting their "parameters" (also called weights and biases). This method collects all those adjustable values into a single list so they can be updated during training.

GetTextEmbedding(string)

Gets the embedding vector for a text string.

public Vector<T> GetTextEmbedding(string text)

Parameters

text string

Returns

Vector<T>

GetTextEmbeddings(IEnumerable<string>)

Gets embedding vectors for multiple text strings.

public IEnumerable<Vector<T>> GetTextEmbeddings(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

IEnumerable<Vector<T>>

InitializeLayers()

Initializes the layers of the neural network based on the architecture.

protected override void InitializeLayers()

Remarks

For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.

Predict(Tensor<T>)

Makes a prediction using the neural network.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input data to process.

Returns

Tensor<T>

The network's prediction.

Remarks

For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).

RankCaptions(Tensor<T>, IEnumerable<string>)

Ranks a set of candidate captions by how well they match an image.

public IEnumerable<(string Caption, T Score)> RankCaptions(Tensor<T> image, IEnumerable<string> candidates)

Parameters

image Tensor<T>

The preprocessed image tensor.

candidates IEnumerable<string>

The candidate captions to rank.

Returns

IEnumerable<(string Caption, T Score)>

Captions ranked by match score, from best to worst.

Remarks

Uses the ITM head to score each candidate, then returns them in descending order. Useful for caption reranking in retrieval applications.
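A reranking sketch using hypothetical candidate captions:

var candidates = new[]
{
    "a dog catching a frisbee",
    "a cat sleeping on a couch",
    "an animal outdoors"
};

foreach (var (caption, score) in blip.RankCaptions(image, candidates))
{
    Console.WriteLine($"{score}: {caption}"); // best match printed first
}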

RetrieveImages(string, IEnumerable<Vector<T>>, int)

Retrieves the most relevant images for a text query from a collection.

public IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Vector<T>> imageEmbeddings, int topK = 10)

Parameters

query string

The text query describing desired images.

imageEmbeddings IEnumerable<Vector<T>>

Pre-computed image embeddings.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of the top-K matching images with their scores.

Remarks

Performs efficient text-to-image retrieval using embedding similarity. For large collections, pre-compute and cache image embeddings.
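A retrieval sketch following that advice: embeddings are computed once up front, then reused for every query. allImages is an assumed collection of preprocessed image tensors, and ToList requires System.Linq.

// Pre-compute once; reuse for every query.
List<Vector<double>> index = blip.GetImageEmbeddings(allImages).ToList();

foreach (var (i, score) in blip.RetrieveImages("a sunset over the ocean", index, topK: 5))
{
    Console.WriteLine($"image #{i}: score {score}");
}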

RetrieveTexts(Tensor<T>, IEnumerable<Vector<T>>, int)

Retrieves the most relevant texts for an image from a collection.

public IEnumerable<(int Index, T Score)> RetrieveTexts(Tensor<T> image, IEnumerable<Vector<T>> textEmbeddings, int topK = 10)

Parameters

image Tensor<T>

The preprocessed image tensor.

textEmbeddings IEnumerable<Vector<T>>

Pre-computed text embeddings.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of the top-K matching texts with their scores.

Remarks

Performs efficient image-to-text retrieval using embedding similarity. Useful for finding relevant captions or descriptions for images.

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data that is not covered by the general serialization process.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The BinaryWriter to write the data to.

Remarks

This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.

For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.

Train(Tensor<T>, Tensor<T>)

Trains the neural network on a single input-output pair.

public override void Train(Tensor<T> input, Tensor<T> target)

Parameters

input Tensor<T>

The input data.

target Tensor<T>

The expected output for the given input.
Remarks

This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.

For Beginners: This is how your neural network learns. You provide:

  • An input (what the network should process)
  • The expected output (what the correct answer should be)

The network then:

  1. Makes a prediction based on the input
  2. Compares its prediction to the expected output
  3. Calculates how wrong it was (the loss)
  4. Adjusts its internal values to do better next time

After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
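A sketch of a training loop, assuming input/target tensor pairs appropriate for the task; GetLastLoss() is the loss-inspection method mentioned above.

for (int epoch = 0; epoch < 10; epoch++)
{
    blip.Train(input, target);
    Console.WriteLine($"epoch {epoch}: loss = {blip.GetLastLoss()}");
}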

UpdateParameters(Vector<T>)

Updates the network's parameters with new values.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The new parameter values to set.

Remarks

For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.

This is typically used by optimization algorithms that calculate better parameter values based on training data.
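A round-trip sketch of the kind an optimizer performs:

Vector<double> parameters = blip.GetParameters();
// ... an optimizer would compute improved values here ...
blip.UpdateParameters(parameters); // must have the same length GetParameters() returned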

ZeroShotClassify(Tensor<T>, IEnumerable<string>)

Performs zero-shot classification of an image tensor against text labels.

public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels)

Parameters

image Tensor<T>

The preprocessed image tensor.

classLabels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.

ZeroShotClassify(double[], IEnumerable<string>)

Performs zero-shot classification of an image against text labels.

public Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)

Parameters

imageData double[]

The preprocessed image data.

labels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.
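A classification sketch using the tensor overload; image is an assumed preprocessed tensor, and OrderByDescending requires System.Linq.

string[] labels = { "dog", "cat", "bird" };
Dictionary<string, double> scores = blip.ZeroShotClassify(image, labels);

foreach (var pair in scores.OrderByDescending(p => p.Value))
{
    Console.WriteLine($"{pair.Key}: {pair.Value:P1}"); // higher score = better match
}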