Class Blip2NeuralNetwork<T>

Namespace: AiDotNet.NeuralNetworks
Assembly: AiDotNet.dll

BLIP-2 (Bootstrapped Language-Image Pre-training 2) neural network for vision-language tasks.

public class Blip2NeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IBlip2Model<T>, IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object
NeuralNetworkBase<T>
Blip2NeuralNetwork<T>
Implements
INeuralNetworkModel<T>
INeuralNetwork<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>
IInterpretableModel<T>
IInputGradientComputable<T>
IDisposable
IBlip2Model<T>
IMultimodalEmbedding<T>

Remarks

BLIP-2 uses a Q-Former (Querying Transformer) to efficiently bridge frozen image encoders with frozen large language models. The Q-Former uses learnable query tokens that interact with frozen image features through cross-attention layers.

For Beginners: BLIP-2 is the next evolution of vision-language models!

Architecture overview:

  1. Frozen Image Encoder (ViT-G): Extracts image patch features
  2. Q-Former: Small trainable transformer that bridges vision and language
    • Uses 32 learnable "query" tokens
    • Queries attend to image features via cross-attention
    • Output goes to the language model
  3. Frozen LLM (OPT/Flan-T5): Generates text from visual features

Why this architecture is brilliant:

  • Only trains the small Q-Former (~188M parameters)
  • Image encoder stays frozen (no GPU memory for gradients)
  • LLM stays frozen (can use huge 66B+ models)
  • Much cheaper to train than end-to-end models

Training stages:

  1. Vision-Language Representation Learning (Q-Former + ViT)
  2. Vision-to-Language Generative Learning (Q-Former + LLM)

Constructors

Blip2NeuralNetwork(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int, int, int, int, int, int, int, LanguageModelBackbone, ITokenizer?, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a BLIP-2 network using native library layers.

public Blip2NeuralNetwork(NeuralNetworkArchitecture<T> architecture, int imageSize = 224, int channels = 3, int patchSize = 14, int vocabularySize = 30522, int maxSequenceLength = 32, int embeddingDimension = 256, int qformerHiddenDim = 768, int visionHiddenDim = 1408, int lmHiddenDim = 2560, int numQformerLayers = 12, int numQueryTokens = 32, int numHeads = 12, int numLmDecoderLayers = 6, LanguageModelBackbone languageModelBackbone = LanguageModelBackbone.OPT, ITokenizer? tokenizer = null, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

imageSize int

Expected image size (default 224 for BLIP-2).

channels int

Number of image channels (default 3 for RGB).

patchSize int

Patch size for vision transformer.

vocabularySize int

Text vocabulary size (BERT: 30522).

maxSequenceLength int

Maximum text sequence length.

embeddingDimension int

Dimension of shared embedding space.

qformerHiddenDim int

Q-Former hidden dimension.

visionHiddenDim int

Vision encoder hidden dimension.

lmHiddenDim int

Language model hidden dimension.

numQformerLayers int

Number of Q-Former layers.

numQueryTokens int

Number of learnable query tokens.

numHeads int

Number of attention heads.

numLmDecoderLayers int

Number of language model decoder layers for text generation.

languageModelBackbone LanguageModelBackbone

Type of LLM backbone (default: OPT).

tokenizer ITokenizer

Optional tokenizer for text processing. If null, creates a default based on backbone.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for training.

lossFunction ILossFunction<T>

Optional loss function.

Remarks

For training from scratch, the tokenizer defaults to a basic implementation matching the language model backbone's special tokens. For production use, load a pretrained tokenizer using AutoTokenizer.
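
A minimal construction sketch, assuming an existing NeuralNetworkArchitecture<float> instance named architecture and the relevant AiDotNet namespaces; float is only an example numeric type, and every argument shown is the documented default:

// architecture is an existing NeuralNetworkArchitecture<float> (setup not shown).
// The arguments below repeat the documented defaults and could be omitted.
var blip2 = new Blip2NeuralNetwork<float>(
    architecture,
    imageSize: 224,
    channels: 3,
    patchSize: 14,
    numQueryTokens: 32,
    languageModelBackbone: LanguageModelBackbone.OPT);
// tokenizer, optimizer, and lossFunction are left null, so defaults matching
// the backbone are created.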

Blip2NeuralNetwork(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, LanguageModelBackbone, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a BLIP-2 network using pretrained ONNX models.

public Blip2NeuralNetwork(NeuralNetworkArchitecture<T> architecture, string visionEncoderPath, string qformerPath, string languageModelPath, ITokenizer tokenizer, LanguageModelBackbone languageModelBackbone = LanguageModelBackbone.OPT, int embeddingDimension = 256, int maxSequenceLength = 32, int imageSize = 224, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>

The neural network architecture configuration.

visionEncoderPath string

Path to the vision encoder ONNX model.

qformerPath string

Path to the Q-Former ONNX model.

languageModelPath string

Path to the language model ONNX model.

tokenizer ITokenizer

Tokenizer for text processing. REQUIRED - must match the language model backbone.

languageModelBackbone LanguageModelBackbone

Type of LLM backbone (default: OPT).

embeddingDimension int

Dimension of the shared embedding space.

maxSequenceLength int

Maximum text sequence length.

imageSize int

Expected image size.

optimizer IOptimizer<T, Tensor<T>, Tensor<T>>

Optional optimizer for fine-tuning.

lossFunction ILossFunction<T>

Optional loss function.

Remarks

When loading pretrained ONNX models, you MUST provide a tokenizer that matches the language model backbone. Use AutoTokenizer to load the correct tokenizer, or use GetHuggingFaceModelName(LanguageModelBackbone) to get the model name for your backbone.
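
A loading sketch under stated assumptions: the ONNX file paths are placeholders, and the tokenizer-loading call is left as a placeholder because only the requirement (it must match the backbone) is documented here:

// The ONNX paths are placeholders for your exported model files.
// LoadMatchingTokenizer is a placeholder; load a tokenizer that matches the OPT
// backbone, e.g. via AutoTokenizer using the name from
// GetHuggingFaceModelName(LanguageModelBackbone.OPT).
ITokenizer tokenizer = LoadMatchingTokenizer();

var blip2 = new Blip2NeuralNetwork<float>(
    architecture,
    visionEncoderPath: "models/vision_encoder.onnx",
    qformerPath: "models/qformer.onnx",
    languageModelPath: "models/language_model.onnx",
    tokenizer: tokenizer,
    languageModelBackbone: LanguageModelBackbone.OPT);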

Properties

EmbeddingDimension

Gets the dimensionality of the embedding space.

public int EmbeddingDimension { get; }

Property Value

int

ImageSize

Gets the expected image size (square images: ImageSize x ImageSize pixels).

public int ImageSize { get; }

Property Value

int

LanguageModelBackbone

Gets the type of language model backbone used for generation.

public LanguageModelBackbone LanguageModelBackbone { get; }

Property Value

LanguageModelBackbone

Remarks

BLIP-2 can use different LLM backbones:

  • OPT: decoder-only, good for general generation
  • FlanT5: encoder-decoder, better for instruction-following

The choice of backbone affects generation capabilities and quality.

MaxSequenceLength

Gets the maximum sequence length for text input.

public int MaxSequenceLength { get; }

Property Value

int

NumQueryTokens

Gets the number of learnable query tokens used by the Q-Former.

public int NumQueryTokens { get; }

Property Value

int

Remarks

The query tokens are learnable embeddings that interact with the frozen image encoder through cross-attention to extract visual features. Typically 32 queries are used.

ParameterCount

Gets the total number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

For Beginners: This tells you how many adjustable values (weights and biases) your neural network has. More complex networks typically have more parameters and can learn more complex patterns, but also require more data to train effectively. This is part of the IFullModel interface for consistency with other model types.

Performance: This property uses caching to avoid recomputing the sum on every access. The cache is invalidated when layers are modified.

Methods

AnswerQuestion(Tensor<T>, string, int)

Answers a question about an image using the LLM backend.

public string AnswerQuestion(Tensor<T> image, string question, int maxLength = 30)

Parameters

image Tensor<T>

The preprocessed image tensor.

question string

The question to answer about the image.

maxLength int

Maximum answer length.

Returns

string

The generated answer.

Remarks

Formats the question appropriately for the LLM backend and generates an answer conditioned on both the visual features and the question. BLIP-2's LLM backend typically provides more detailed and accurate answers than BLIP's decoder.

For Beginners: Ask any question about an image!

BLIP-2 is better at VQA because:

  • Uses a powerful LLM (OPT/Flan-T5) for generation
  • LLM has more world knowledge
  • Can give more detailed, reasoned answers

Examples:

  • "What is the person doing?" -> "The person is riding a bicycle down a street"
  • "What color is the car?" -> "The car is red"
  • "Is it raining?" -> "No, it appears to be a sunny day"

Backward(Tensor<T>)

Backward pass through Q-Former (vision encoder is frozen).

public Tensor<T> Backward(Tensor<T> gradient)

Parameters

gradient Tensor<T>

Returns

Tensor<T>

ComputeContrastiveSimilarity(Tensor<T>, string)

Computes image-text contrastive similarity using Q-Former features.

public T ComputeContrastiveSimilarity(Tensor<T> image, string text)

Parameters

image Tensor<T>

The preprocessed image tensor.

text string

The text to compare.

Returns

T

Contrastive similarity score.

Remarks

Uses the Q-Former's image-text contrastive (ITC) learning objective. Computes similarity between the CLS token of query outputs and text features. Faster than ITM but less accurate for fine-grained matching.

For Beginners: Quick similarity check between image and text!

Difference from ITM (Image-Text Matching):

  • ITC: Fast, uses embedding similarity (like CLIP)
  • ITM: Slower, uses cross-attention for deeper analysis

Use ITC for:

  • Large-scale retrieval (searching millions of images)
  • Quick filtering before detailed matching

Use ITM for:

  • Final ranking of candidates
  • When accuracy matters more than speed

ComputeImageTextMatch(Tensor<T>, string)

Computes image-text matching score using the Q-Former's ITM head.

public T ComputeImageTextMatch(Tensor<T> image, string text)

Parameters

image Tensor<T>

The preprocessed image tensor.

text string

The text to match against the image.

Returns

T

Matching probability between 0 and 1.

Remarks

Uses the Q-Former's image-text matching head which applies cross-attention between query features and text features to determine if they match. This is trained with hard negative mining for better discrimination.
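
A sketch contrasting the two scores on the same image-text pair (image loading is a placeholder); ITC suits filtering many candidates, ITM suits final ranking:

Tensor<float> image = LoadImageTensor("photo.jpg");
string text = "a red car parked on the street";

// Fast contrastive (ITC) score based on embedding similarity.
float itcScore = blip2.ComputeContrastiveSimilarity(image, text);

// Slower cross-attention (ITM) score: a matching probability between 0 and 1.
float itmScore = blip2.ComputeImageTextMatch(image, text);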

ComputeSimilarity(Vector<T>, Vector<T>)

Computes similarity between two embeddings.

public T ComputeSimilarity(Vector<T> textEmbedding, Vector<T> imageEmbedding)

Parameters

textEmbedding Vector<T>
imageEmbedding Vector<T>

Returns

T

Similarity score (cosine similarity for normalized embeddings).

CreateNewInstance()

Creates a new instance of the same type as this neural network.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance of the same neural network type.

Remarks

For Beginners: This creates a blank version of the same type of neural network.

It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data that was not covered by the general deserialization process.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The BinaryReader to read the data from.

Remarks

This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.

For Beginners: Continuing the suitcase analogy, this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.

Dispose(bool)

Protected Dispose pattern implementation.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True if called from Dispose(), false if called from finalizer.

EmbedAsync(string)

public Task<Vector<T>> EmbedAsync(string text)

Parameters

text string

Returns

Task<Vector<T>>

EmbedBatchAsync(IEnumerable<string>)

public Task<Matrix<T>> EmbedBatchAsync(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

Task<Matrix<T>>

EncodeImage(double[])

Encodes an image into an embedding vector.

public Vector<T> EncodeImage(double[] imageData)

Parameters

imageData double[]

The preprocessed image data as a flattened array in CHW format.

Returns

Vector<T>

A normalized embedding vector.

EncodeImageBatch(IEnumerable<double[]>)

Encodes multiple images into embedding vectors in a batch.

public Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)

Parameters

imageDataBatch IEnumerable<double[]>

The preprocessed images as flattened arrays.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding image.

EncodeText(string)

Encodes text into an embedding vector.

public Vector<T> EncodeText(string text)

Parameters

text string

The text to encode.

Returns

Vector<T>

A normalized embedding vector.

EncodeTextBatch(IEnumerable<string>)

Encodes multiple texts into embedding vectors in a batch.

public Matrix<T> EncodeTextBatch(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

The texts to encode.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding text.
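
A sketch of the embedding workflow, assuming the image has already been preprocessed into a flattened CHW array (LoadImageChw is a placeholder helper):

double[] imageData = LoadImageChw("photo.jpg");

Vector<float> imageEmbedding = blip2.EncodeImage(imageData);
Vector<float> textEmbedding = blip2.EncodeText("a dog playing in the park");

// Cosine similarity between the normalized embeddings.
float similarity = blip2.ComputeSimilarity(textEmbedding, imageEmbedding);

// Batch variants return one embedding per row.
Matrix<float> textEmbeddings = blip2.EncodeTextBatch(new[] { "a dog", "a cat", "a car" });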

ExtractQFormerFeatures(Tensor<T>)

Extracts visual features using the Q-Former's learnable queries.

public Tensor<T> ExtractQFormerFeatures(Tensor<T> image)

Parameters

image Tensor<T>

The preprocessed image tensor with shape [channels, height, width].

Returns

Tensor<T>

Query output features with shape [numQueries, queryDim].

Remarks

The Q-Former uses cross-attention between learnable query tokens and the frozen image encoder output to extract one visual feature vector per query token (NumQueryTokens in total). These features are then projected to match the LLM's input dimension.

For Beginners: Think of this as asking 32 questions about the image!

Process:

  1. Image goes through frozen ViT encoder -> patch features
  2. Query tokens attend to patch features via cross-attention
  3. Each query learns to focus on different aspects
  4. Output: 32 feature vectors summarizing the image

These 32 features are what gets sent to the language model.
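
A brief sketch; the result has one feature vector per query token, 32 by default (image loading is a placeholder):

Tensor<float> image = LoadImageTensor("photo.jpg");

// Shape: [numQueries, queryDim], i.e. 32 feature vectors by default.
Tensor<float> queryFeatures = blip2.ExtractQFormerFeatures(image);
Console.WriteLine($"Query tokens: {blip2.NumQueryTokens}");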

Forward(Tensor<T>)

Forward pass through Q-Former and vision encoder.

public Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Returns

Tensor<T>

GenerateCaption(Tensor<T>, string?, int, int, double)

Generates a caption for an image using the LLM backend.

public string GenerateCaption(Tensor<T> image, string? prompt = null, int maxLength = 30, int numBeams = 5, double temperature = 1)

Parameters

image Tensor<T>

The preprocessed image tensor.

prompt string

Optional prompt to guide generation (e.g., "a photo of").

maxLength int

Maximum number of tokens to generate.

numBeams int

Number of beams for beam search.

temperature double

Sampling temperature (lower = more deterministic).

Returns

string

The generated caption.

Remarks

Uses the Q-Former to extract visual features, projects them to the LLM space, and then uses the LLM to generate text conditioned on these visual tokens.

For Beginners: This generates descriptions using a powerful language model!

The prompt helps guide the style:

  • "a photo of" -> descriptive captions
  • "Question: What is this? Answer:" -> Q&A style
  • No prompt -> model's default behavior

Temperature controls randomness:

  • 0.0-0.3: Very focused, deterministic
  • 0.7-1.0: More creative, varied
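
A captioning sketch showing how the prompt and temperature arguments steer the output (image loading is a placeholder):

Tensor<float> image = LoadImageTensor("photo.jpg");

// Low temperature plus beam search gives focused, descriptive captions.
string caption = blip2.GenerateCaption(
    image,
    prompt: "a photo of",
    maxLength: 30,
    numBeams: 5,
    temperature: 0.3);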

GenerateCaptions(Tensor<T>, int, string?, int, double, double)

Generates multiple diverse captions for an image.

public IEnumerable<(string Caption, T Score)> GenerateCaptions(Tensor<T> image, int numCaptions = 5, string? prompt = null, int maxLength = 30, double temperature = 0.9, double topP = 0.95)

Parameters

image Tensor<T>

The preprocessed image tensor.

numCaptions int

Number of captions to generate.

prompt string

Optional prompt to guide generation.

maxLength int

Maximum length per caption.

temperature double

Sampling temperature for diversity.

topP double

Nucleus sampling probability threshold.

Returns

IEnumerable<(string Caption, T Score)>

Collection of generated captions with their log probabilities.

Remarks

Uses nucleus (top-p) sampling with temperature to generate diverse captions. Returns captions with their generation scores for ranking.
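
A sketch that samples several captions and keeps the highest-scoring one (image and blip2 are as in the earlier sketches; System.Linq is assumed); scores are log probabilities, so larger is better:

var candidates = blip2.GenerateCaptions(image, numCaptions: 5, temperature: 0.9, topP: 0.95);

var best = candidates.OrderByDescending(c => c.Score).First();
Console.WriteLine($"{best.Caption} (score: {best.Score})");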

GenerateWithInstruction(Tensor<T>, string, int)

Generates text conditioned on both image and text context (instructed generation).

public string GenerateWithInstruction(Tensor<T> image, string instruction, int maxLength = 100)

Parameters

image Tensor<T>

The preprocessed image tensor.

instruction string

The instruction or context for generation.

maxLength int

Maximum generation length.

Returns

string

The generated response.

Remarks

Enables instruction-following behavior where the model generates text based on both visual input and textual instructions. This is particularly powerful with instruction-tuned LLM backends like Flan-T5.

For Beginners: Give instructions about what to do with the image!

Examples:

  • "Describe this image in detail" -> Detailed description
  • "List all the objects in this image" -> Bulleted list
  • "Write a story based on this image" -> Creative narrative
  • "Explain what is happening" -> Scene analysis

This is more flexible than simple captioning because you can customize the output format and content through instructions.
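
A sketch of instructed generation (image and blip2 are as in the earlier sketches); the instruction text is freeform and works best with an instruction-tuned backbone such as Flan-T5:

string response = blip2.GenerateWithInstruction(
    image,
    "List all the objects in this image",
    maxLength: 100);
Console.WriteLine(response);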

GetImageEmbedding(Tensor<T>)

public Vector<T> GetImageEmbedding(Tensor<T> image)

Parameters

image Tensor<T>

Returns

Vector<T>

GetImageEmbeddings(IEnumerable<Tensor<T>>)

public IEnumerable<Vector<T>> GetImageEmbeddings(IEnumerable<Tensor<T>> images)

Parameters

images IEnumerable<Tensor<T>>

Returns

IEnumerable<Vector<T>>

GetModelMetadata()

Gets the metadata for this neural network model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

A ModelMetadata<T> object containing information about the model.

GetParameters()

Gets all trainable parameters of the network as a single vector.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all parameters of the network.

Remarks

For Beginners: Neural networks learn by adjusting their "parameters" (also called weights and biases). This method collects all those adjustable values into a single list so they can be updated during training.

GetTextEmbedding(string)

public Vector<T> GetTextEmbedding(string text)

Parameters

text string

Returns

Vector<T>

GetTextEmbeddings(IEnumerable<string>)

public IEnumerable<Vector<T>> GetTextEmbeddings(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

IEnumerable<Vector<T>>

GroundText(Tensor<T>, string)

Performs visual grounding to locate objects described in text.

public Vector<T> GroundText(Tensor<T> image, string description)

Parameters

image Tensor<T>

The preprocessed image tensor.

description string

Text description of the object to locate.

Returns

Vector<T>

Bounding box coordinates [x1, y1, x2, y2] normalized to [0, 1].

Remarks

Uses the Q-Former's attention patterns to identify which image regions correspond to the text description. Returns a bounding box for the most likely region.

For Beginners: Find where something is in an image!

Given text like "the red car on the left", this finds and returns the bounding box coordinates for that object.

The output is normalized coordinates:

  • [0, 0, 1, 1] would be the entire image
  • [0.5, 0.5, 1, 1] would be the bottom-right quarter

Use cases:

  • Object detection from natural language
  • Referring expression comprehension
  • Interactive image editing ("remove the person on the right")
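
A grounding sketch (image and blip2 are as in the earlier sketches); the returned vector holds normalized [x1, y1, x2, y2] coordinates, converted here to pixel space using ImageSize (an indexer on Vector<T> is assumed):

Vector<float> box = blip2.GroundText(image, "the red car on the left");

// Scale the normalized coordinates back to pixels for a square input image.
int x1 = (int)(box[0] * blip2.ImageSize);
int y1 = (int)(box[1] * blip2.ImageSize);
int x2 = (int)(box[2] * blip2.ImageSize);
int y2 = (int)(box[3] * blip2.ImageSize);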

InitializeLayers()

Initializes layers for both modes.

protected override void InitializeLayers()

Predict(Tensor<T>)

Makes a prediction using the neural network.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input data to process.

Returns

Tensor<T>

The network's prediction.

Remarks

For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).

RetrieveImages(string, IEnumerable<Tensor<T>>, int, bool, int)

Retrieves the most relevant images for a text query.

public IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Tensor<T>> imageFeatures, int topK = 10, bool useItmReranking = true, int rerankTopN = 100)

Parameters

query string

The text query.

imageFeatures IEnumerable<Tensor<T>>

Pre-computed Q-Former features for images.

topK int

Number of results to return.

useItmReranking bool

Whether to rerank top results using ITM.

rerankTopN int

Number of candidates to rerank with ITM.

Returns

IEnumerable<(int Index, T Score)>

Indices of top-K matching images with scores.

Remarks

Two-stage retrieval:

  1. Fast ITC-based retrieval to get candidates
  2. Optional ITM reranking for higher precision
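
A retrieval sketch; imageTensors is a placeholder collection of preprocessed images, features are computed once so many queries can reuse them, and System.Linq is assumed:

// Pre-compute Q-Former features once for the whole collection.
List<Tensor<float>> imageFeatures = imageTensors
    .Select(img => blip2.ExtractQFormerFeatures(img))
    .ToList();

// Stage 1 (ITC) selects candidates; stage 2 (ITM) reranks the top 100.
foreach (var (index, score) in blip2.RetrieveImages(
    "a dog catching a frisbee", imageFeatures, topK: 10, useItmReranking: true, rerankTopN: 100))
{
    Console.WriteLine($"image #{index}: {score}");
}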

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data that is not covered by the general serialization process.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The BinaryWriter to write the data to.

Remarks

This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.

For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.

SetParameters(Vector<T>)

Sets the parameters of the neural network.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameters to set.

Remarks

This method distributes the parameters to all layers in the network. The parameters should be in the same format as returned by GetParameters.

Train(Tensor<T>, Tensor<T>)

Trains the neural network on a single input-output pair.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The input data.

expectedOutput Tensor<T>

The expected output for the given input.

Remarks

This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.

For Beginners: This is how your neural network learns. You provide:

  • An input (what the network should process)
  • The expected output (what the correct answer should be)

The network then:

  1. Makes a prediction based on the input
  2. Compares its prediction to the expected output
  3. Calculates how wrong it was (the loss)
  4. Adjusts its internal values to do better next time

After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
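
A minimal training-loop sketch; trainingPairs is a placeholder sequence of preprocessed (input, expectedOutput) tensor pairs whose exact shapes depend on your training objective:

foreach (var (input, expectedOutput) in trainingPairs)
{
    // One optimization step per pair: predict, compare, and update parameters.
    blip2.Train(input, expectedOutput);
}

Console.WriteLine($"Last loss: {blip2.GetLastLoss()}");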

UpdateParameters(Vector<T>)

Updates the network's parameters with new values.

public override void UpdateParameters(Vector<T> gradients)

Parameters

gradients Vector<T>

Remarks

For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.

This is typically used by optimization algorithms that calculate better parameter values based on training data.

ZeroShotClassify(Tensor<T>, IEnumerable<string>)

Performs zero-shot image classification against the provided class labels.

public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels)

Parameters

image Tensor<T>
classLabels IEnumerable<string>

Returns

Dictionary<string, T>

ZeroShotClassify(Tensor<T>, IEnumerable<string>, bool)

Performs zero-shot image classification with optional ITM scoring.

public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels, bool useItm)

Parameters

image Tensor<T>
classLabels IEnumerable<string>
useItm bool

Returns

Dictionary<string, T>

ZeroShotClassify(double[], IEnumerable<string>)

Performs zero-shot classification of an image against text labels.

public Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)

Parameters

imageData double[]

The preprocessed image data.

labels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.
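
A zero-shot classification sketch (LoadImageChw is a placeholder preprocessing helper, and System.Linq is assumed); each label receives a probability score, and the highest-scoring label is the prediction:

double[] imageData = LoadImageChw("photo.jpg");

Dictionary<string, float> scores = blip2.ZeroShotClassify(
    imageData, new[] { "a photo of a dog", "a photo of a cat", "a photo of a car" });

foreach (var entry in scores.OrderByDescending(kv => kv.Value))
{
    Console.WriteLine($"{entry.Key}: {entry.Value}");
}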