Class Blip2NeuralNetwork<T>
- Namespace: AiDotNet.NeuralNetworks
- Assembly: AiDotNet.dll
BLIP-2 (Bootstrapped Language-Image Pre-training 2) neural network for vision-language tasks.
public class Blip2NeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IBlip2Model<T>, IMultimodalEmbedding<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance: NeuralNetworkBase<T> -> Blip2NeuralNetwork<T>
- Implements: IBlip2Model<T>, IMultimodalEmbedding<T>
Remarks
BLIP-2 uses a Q-Former (Querying Transformer) to efficiently bridge frozen image encoders with frozen large language models. The Q-Former uses learnable query tokens that interact with frozen image features through cross-attention layers.
For Beginners: BLIP-2 is the next evolution of vision-language models!
Architecture overview:
- Frozen Image Encoder (ViT-G): Extracts image patch features
- Q-Former: Small trainable transformer that bridges vision and language
- Uses 32 learnable "query" tokens
- Queries attend to image features via cross-attention
- Output goes to the language model
- Frozen LLM (OPT/Flan-T5): Generates text from visual features
Why this architecture is brilliant:
- Only trains the small Q-Former (~188M parameters)
- Image encoder stays frozen (no GPU memory for gradients)
- LLM stays frozen (can use huge 66B+ models)
- Much cheaper to train than end-to-end models
Training stages:
- Vision-Language Representation Learning (Q-Former + ViT)
- Vision-to-Language Generative Learning (Q-Former + LLM)
Constructors
Blip2NeuralNetwork(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int, int, int, int, int, int, int, LanguageModelBackbone, ITokenizer?, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a BLIP-2 network using native library layers.
public Blip2NeuralNetwork(NeuralNetworkArchitecture<T> architecture, int imageSize = 224, int channels = 3, int patchSize = 14, int vocabularySize = 30522, int maxSequenceLength = 32, int embeddingDimension = 256, int qformerHiddenDim = 768, int visionHiddenDim = 1408, int lmHiddenDim = 2560, int numQformerLayers = 12, int numQueryTokens = 32, int numHeads = 12, int numLmDecoderLayers = 6, LanguageModelBackbone languageModelBackbone = LanguageModelBackbone.OPT, ITokenizer? tokenizer = null, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
imageSize (int): Expected image size (default 224 for BLIP-2).
channels (int): Number of image channels (default 3 for RGB).
patchSize (int): Patch size for the vision transformer.
vocabularySize (int): Text vocabulary size (BERT: 30522).
maxSequenceLength (int): Maximum text sequence length.
embeddingDimension (int): Dimension of the shared embedding space.
qformerHiddenDim (int): Q-Former hidden dimension.
visionHiddenDim (int): Vision encoder hidden dimension.
lmHiddenDim (int): Language model hidden dimension.
numQformerLayers (int): Number of Q-Former layers.
numQueryTokens (int): Number of learnable query tokens.
numHeads (int): Number of attention heads.
numLmDecoderLayers (int): Number of language model decoder layers for text generation.
languageModelBackbone (LanguageModelBackbone): Type of LLM backbone (default: OPT).
tokenizer (ITokenizer): Optional tokenizer for text processing. If null, creates a default based on the backbone.
optimizer (IOptimizer<T, Tensor<T>, Tensor<T>>): Optional optimizer for training.
lossFunction (ILossFunction<T>): Optional loss function.
Remarks
For training from scratch, the tokenizer defaults to a basic implementation matching the language model backbone's special tokens. For production use, load a pretrained tokenizer using AutoTokenizer.
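A minimal construction sketch, assuming T = double and that architecture is a NeuralNetworkArchitecture<double> you have already configured for your project; only a few of the defaults listed above are overridden:
// Sketch only: 'architecture' is assumed to be a configured NeuralNetworkArchitecture<double>.
var blip2 = new Blip2NeuralNetwork<double>(
    architecture,
    imageSize: 224,
    patchSize: 14,
    numQueryTokens: 32,
    languageModelBackbone: LanguageModelBackbone.OPT);
// With tokenizer, optimizer, and lossFunction left null, defaults are created as described in the remarks above.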
Blip2NeuralNetwork(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, LanguageModelBackbone, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a BLIP-2 network using pretrained ONNX models.
public Blip2NeuralNetwork(NeuralNetworkArchitecture<T> architecture, string visionEncoderPath, string qformerPath, string languageModelPath, ITokenizer tokenizer, LanguageModelBackbone languageModelBackbone = LanguageModelBackbone.OPT, int embeddingDimension = 256, int maxSequenceLength = 32, int imageSize = 224, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
visionEncoderPath (string): Path to the vision encoder ONNX model.
qformerPath (string): Path to the Q-Former ONNX model.
languageModelPath (string): Path to the language model ONNX model.
tokenizer (ITokenizer): Tokenizer for text processing. REQUIRED - must match the language model backbone.
languageModelBackbone (LanguageModelBackbone): Type of LLM backbone (default: OPT).
embeddingDimension (int): Dimension of the shared embedding space.
maxSequenceLength (int): Maximum text sequence length.
imageSize (int): Expected image size.
optimizer (IOptimizer<T, Tensor<T>, Tensor<T>>): Optional optimizer for fine-tuning.
lossFunction (ILossFunction<T>): Optional loss function.
Remarks
When loading pretrained ONNX models, you MUST provide a tokenizer that matches the language model backbone. Use AutoTokenizer to load the correct tokenizer, or use GetHuggingFaceModelName(LanguageModelBackbone) to get the model name for your backbone.
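A hedged sketch of the pretrained path; the ONNX file paths are placeholders, and the tokenizer is assumed to have been loaded separately (for example via AutoTokenizer, as described above) so that it matches the OPT backbone:
// Sketch only: 'architecture' is an assumed NeuralNetworkArchitecture<double>, and 'tokenizer'
// is an assumed pretrained ITokenizer matching the chosen backbone; paths are placeholders.
var blip2 = new Blip2NeuralNetwork<double>(
    architecture,
    visionEncoderPath: "models/blip2_vision.onnx",
    qformerPath: "models/blip2_qformer.onnx",
    languageModelPath: "models/blip2_opt.onnx",
    tokenizer: tokenizer,
    languageModelBackbone: LanguageModelBackbone.OPT);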
Properties
EmbeddingDimension
Gets the dimensionality of the embedding space.
public int EmbeddingDimension { get; }
Property Value
- int
ImageSize
Gets the expected image size (square images: ImageSize x ImageSize pixels).
public int ImageSize { get; }
Property Value
- int
LanguageModelBackbone
Gets the type of language model backbone used for generation.
public LanguageModelBackbone LanguageModelBackbone { get; }
Property Value
- LanguageModelBackbone
Remarks
BLIP-2 can use different LLM backbones:
- OPT: decoder-only, good for general generation
- FlanT5: encoder-decoder, better for instruction-following
The choice affects generation capabilities and quality.
MaxSequenceLength
Gets the maximum sequence length for text input.
public int MaxSequenceLength { get; }
Property Value
- int
NumQueryTokens
Gets the number of learnable query tokens used by the Q-Former.
public int NumQueryTokens { get; }
Property Value
- int
Remarks
The query tokens are learnable embeddings that interact with the frozen image encoder through cross-attention to extract visual features. Typically 32 queries are used.
ParameterCount
Gets the total number of parameters in the model.
public override int ParameterCount { get; }
Property Value
- int
Remarks
For Beginners: This tells you how many adjustable values (weights and biases) your neural network has. More complex networks typically have more parameters and can learn more complex patterns, but also require more data to train effectively. This is part of the IFullModel interface for consistency with other model types.
Performance: This property uses caching to avoid recomputing the sum on every access. The cache is invalidated when layers are modified.
Methods
AnswerQuestion(Tensor<T>, string, int)
Answers a question about an image using the LLM backend.
public string AnswerQuestion(Tensor<T> image, string question, int maxLength = 30)
Parameters
image (Tensor<T>): The preprocessed image tensor.
question (string): The question to answer about the image.
maxLength (int): Maximum answer length.
Returns
- string
The generated answer.
Remarks
Formats the question appropriately for the LLM backend and generates an answer conditioned on both the visual features and the question. BLIP-2's LLM backend typically provides more detailed and accurate answers than BLIP's decoder.
For Beginners: Ask any question about an image!
BLIP-2 is better at VQA because:
- Uses a powerful LLM (OPT/Flan-T5) for generation
- LLM has more world knowledge
- Can give more detailed, reasoned answers
Examples:
- "What is the person doing?" -> "The person is riding a bicycle down a street"
- "What color is the car?" -> "The car is red"
- "Is it raining?" -> "No, it appears to be a sunny day"
Backward(Tensor<T>)
Backward pass through Q-Former (vision encoder is frozen).
public Tensor<T> Backward(Tensor<T> gradient)
Parameters
gradient (Tensor<T>)
Returns
- Tensor<T>
ComputeContrastiveSimilarity(Tensor<T>, string)
Computes image-text contrastive similarity using Q-Former features.
public T ComputeContrastiveSimilarity(Tensor<T> image, string text)
Parameters
image (Tensor<T>): The preprocessed image tensor.
text (string): The text to compare.
Returns
- T
Contrastive similarity score.
Remarks
Uses the Q-Former's image-text contrastive (ITC) learning objective. Computes similarity between the CLS token of query outputs and text features. Faster than ITM but less accurate for fine-grained matching.
For Beginners: Quick similarity check between image and text!
Difference from ITM (Image-Text Matching):
- ITC: Fast, uses embedding similarity (like CLIP)
- ITM: Slower, uses cross-attention for deeper analysis
Use ITC for:
- Large-scale retrieval (searching millions of images)
- Quick filtering before detailed matching
Use ITM for:
- Final ranking of candidates
- When accuracy matters more than speed
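A sketch of the two-stage pattern described above: fast ITC first, slower ITM only for promising candidates. T = double is assumed, and the threshold is for illustration only:
// Cheap ITC score for coarse filtering...
double itc = blip2.ComputeContrastiveSimilarity(image, "a dog playing in the park");
// ...then the cross-attention-based ITM score for the final decision.
if (itc > 0.3) // illustrative threshold, not a recommended value
{
    double itm = blip2.ComputeImageTextMatch(image, "a dog playing in the park");
    Console.WriteLine($"ITM match probability: {itm}");
}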
ComputeImageTextMatch(Tensor<T>, string)
Computes image-text matching score using the Q-Former's ITM head.
public T ComputeImageTextMatch(Tensor<T> image, string text)
Parameters
image (Tensor<T>): The preprocessed image tensor.
text (string): The text to match against the image.
Returns
- T
Matching probability between 0 and 1.
Remarks
Uses the Q-Former's image-text matching head which applies cross-attention between query features and text features to determine if they match. This is trained with hard negative mining for better discrimination.
ComputeSimilarity(Vector<T>, Vector<T>)
Computes similarity between two embeddings.
public T ComputeSimilarity(Vector<T> textEmbedding, Vector<T> imageEmbedding)
Parameters
textEmbedding (Vector<T>)
imageEmbedding (Vector<T>)
Returns
- T
Similarity score (cosine similarity for normalized embeddings).
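For example, a sketch with T = double and a preprocessed image tensor:
// Embed both modalities into the shared space, then compare with cosine similarity.
Vector<double> imageEmb = blip2.GetImageEmbedding(image);
Vector<double> textEmb = blip2.GetTextEmbedding("a red sports car");
double similarity = blip2.ComputeSimilarity(textEmb, imageEmb);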
CreateNewInstance()
Creates a new instance of the same type as this neural network.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
A new instance of the same neural network type.
Remarks
For Beginners: This creates a blank version of the same type of neural network.
It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data that was not covered by the general deserialization process.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
reader (BinaryReader): The BinaryReader to read the data from.
Remarks
This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.
For Beginners: Continuing the suitcase analogy, this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.
Dispose(bool)
Protected Dispose pattern implementation.
protected override void Dispose(bool disposing)
Parameters
disposing (bool): True if called from Dispose(), false if called from finalizer.
EmbedAsync(string)
public Task<Vector<T>> EmbedAsync(string text)
Parameters
text (string)
Returns
- Task<Vector<T>>
EmbedBatchAsync(IEnumerable<string>)
public Task<Matrix<T>> EmbedBatchAsync(IEnumerable<string> texts)
Parameters
texts (IEnumerable<string>)
Returns
- Task<Matrix<T>>
EncodeImage(double[])
Encodes an image into an embedding vector.
public Vector<T> EncodeImage(double[] imageData)
Parameters
imageData (double[]): The preprocessed image data as a flattened array in CHW format.
Returns
- Vector<T>
A normalized embedding vector.
EncodeImageBatch(IEnumerable<double[]>)
Encodes multiple images into embedding vectors in a batch.
public Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)
Parameters
imageDataBatch (IEnumerable<double[]>): The preprocessed images as flattened arrays.
Returns
- Matrix<T>
A matrix where each row is an embedding for the corresponding image.
EncodeText(string)
Encodes text into an embedding vector.
public Vector<T> EncodeText(string text)
Parameters
text (string): The text to encode.
Returns
- Vector<T>
A normalized embedding vector.
EncodeTextBatch(IEnumerable<string>)
Encodes multiple texts into embedding vectors in a batch.
public Matrix<T> EncodeTextBatch(IEnumerable<string> texts)
Parameters
texts (IEnumerable<string>): The texts to encode.
Returns
- Matrix<T>
A matrix where each row is an embedding for the corresponding text.
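A batch-encoding sketch; imageArrays is assumed to be an IEnumerable<double[]> of flattened CHW images already preprocessed for this model:
// Each row of the returned matrices is one normalized embedding.
Matrix<double> textEmbeddings = blip2.EncodeTextBatch(new[] { "a cat", "a dog", "a sunset" });
Matrix<double> imageEmbeddings = blip2.EncodeImageBatch(imageArrays);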
ExtractQFormerFeatures(Tensor<T>)
Extracts visual features using the Q-Former's learnable queries.
public Tensor<T> ExtractQFormerFeatures(Tensor<T> image)
Parameters
image (Tensor<T>): The preprocessed image tensor with shape [channels, height, width].
Returns
- Tensor<T>
Query output features with shape [numQueries, queryDim].
Remarks
The Q-Former uses cross-attention between learnable query tokens and the frozen image encoder output to extract a fixed set of visual features, one per query token. These features are then projected to match the LLM's input dimension.
For Beginners: Think of this as asking 32 questions about the image!
Process:
- Image goes through frozen ViT encoder -> patch features
- Query tokens attend to patch features via cross-attention
- Each query learns to focus on different aspects
- Output: 32 feature vectors summarizing the image
These 32 features are what gets sent to the language model.
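A sketch, assuming the constructor defaults above:
// Returns query features with shape [numQueryTokens, queryDim], e.g. [32, 768] with the defaults.
Tensor<double> queryFeatures = blip2.ExtractQFormerFeatures(image); // image: preprocessed [channels, height, width]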
Forward(Tensor<T>)
Forward pass through Q-Former and vision encoder.
public Tensor<T> Forward(Tensor<T> input)
Parameters
input (Tensor<T>)
Returns
- Tensor<T>
GenerateCaption(Tensor<T>, string?, int, int, double)
Generates a caption for an image using the LLM backend.
public string GenerateCaption(Tensor<T> image, string? prompt = null, int maxLength = 30, int numBeams = 5, double temperature = 1)
Parameters
image (Tensor<T>): The preprocessed image tensor.
prompt (string): Optional prompt to guide generation (e.g., "a photo of").
maxLength (int): Maximum number of tokens to generate.
numBeams (int): Number of beams for beam search.
temperature (double): Sampling temperature (lower = more deterministic).
Returns
- string
The generated caption.
Remarks
Uses the Q-Former to extract visual features, projects them to the LLM space, and then uses the LLM to generate text conditioned on these visual tokens.
For Beginners: This generates descriptions using a powerful language model!
The prompt helps guide the style:
- "a photo of" -> descriptive captions
- "Question: What is this? Answer:" -> Q&A style
- No prompt -> model's default behavior
Temperature controls randomness:
- 0.0-0.3: Very focused, deterministic
- 0.7-1.0: More creative, varied
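A usage sketch with a preprocessed image tensor:
// Beam search with a mild temperature for a focused, descriptive caption.
string caption = blip2.GenerateCaption(image, prompt: "a photo of", maxLength: 30, numBeams: 5, temperature: 0.7);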
GenerateCaptions(Tensor<T>, int, string?, int, double, double)
Generates multiple diverse captions for an image.
public IEnumerable<(string Caption, T Score)> GenerateCaptions(Tensor<T> image, int numCaptions = 5, string? prompt = null, int maxLength = 30, double temperature = 0.9, double topP = 0.95)
Parameters
image (Tensor<T>): The preprocessed image tensor.
numCaptions (int): Number of captions to generate.
prompt (string): Optional prompt to guide generation.
maxLength (int): Maximum length per caption.
temperature (double): Sampling temperature for diversity.
topP (double): Nucleus sampling probability threshold.
Returns
- IEnumerable<(string Caption, T Score)>
Collection of generated captions with their log probabilities.
Remarks
Uses nucleus (top-p) sampling with temperature to generate diverse captions. Returns captions with their generation scores for ranking.
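For example, a sketch with T = double:
// Sample several diverse captions and print them with their generation scores.
foreach (var (caption, score) in blip2.GenerateCaptions(image, numCaptions: 5, temperature: 0.9, topP: 0.95))
{
    Console.WriteLine($"{score}: {caption}");
}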
GenerateWithInstruction(Tensor<T>, string, int)
Generates text conditioned on both image and text context (instructed generation).
public string GenerateWithInstruction(Tensor<T> image, string instruction, int maxLength = 100)
Parameters
image (Tensor<T>): The preprocessed image tensor.
instruction (string): The instruction or context for generation.
maxLength (int): Maximum generation length.
Returns
- string
The generated response.
Remarks
Enables instruction-following behavior where the model generates text based on both visual input and textual instructions. This is particularly powerful with instruction-tuned LLM backends like Flan-T5.
For Beginners: Give instructions about what to do with the image!
Examples:
- "Describe this image in detail" -> Detailed description
- "List all the objects in this image" -> Bulleted list
- "Write a story based on this image" -> Creative narrative
- "Explain what is happening" -> Scene analysis
This is more flexible than simple captioning because you can customize the output format and content through instructions.
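A usage sketch:
// Instruction-style generation; works best with an instruction-tuned backbone such as Flan-T5.
string response = blip2.GenerateWithInstruction(image, "Describe this image in detail", maxLength: 100);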
GetImageEmbedding(Tensor<T>)
public Vector<T> GetImageEmbedding(Tensor<T> image)
Parameters
image (Tensor<T>)
Returns
- Vector<T>
GetImageEmbeddings(IEnumerable<Tensor<T>>)
public IEnumerable<Vector<T>> GetImageEmbeddings(IEnumerable<Tensor<T>> images)
Parameters
images (IEnumerable<Tensor<T>>)
Returns
- IEnumerable<Vector<T>>
GetModelMetadata()
Gets the metadata for this neural network model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
A ModelMetadata<T> object containing information about the model.
GetParameters()
Gets all trainable parameters of the network as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all parameters of the network.
Remarks
For Beginners: Neural networks learn by adjusting their "parameters" (also called weights and biases). This method collects all those adjustable values into a single list so they can be updated during training.
GetTextEmbedding(string)
public Vector<T> GetTextEmbedding(string text)
Parameters
text (string)
Returns
- Vector<T>
GetTextEmbeddings(IEnumerable<string>)
public IEnumerable<Vector<T>> GetTextEmbeddings(IEnumerable<string> texts)
Parameters
texts (IEnumerable<string>)
Returns
- IEnumerable<Vector<T>>
GroundText(Tensor<T>, string)
Performs visual grounding to locate objects described in text.
public Vector<T> GroundText(Tensor<T> image, string description)
Parameters
image (Tensor<T>): The preprocessed image tensor.
description (string): Text description of the object to locate.
Returns
- Vector<T>
Bounding box coordinates [x1, y1, x2, y2] normalized to [0, 1].
Remarks
Uses the Q-Former's attention patterns to identify which image regions correspond to the text description. Returns a bounding box for the most likely region.
For Beginners: Find where something is in an image!
Given text like "the red car on the left", this finds and returns the bounding box coordinates for that object.
The output is normalized coordinates:
- [0, 0, 1, 1] would be the entire image
- [0.5, 0.5, 1, 1] would be the bottom-right quarter
Use cases:
- Object detection from natural language
- Referring expression comprehension
- Interactive image editing ("remove the person on the right")
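A sketch; the returned vector holds [x1, y1, x2, y2] in normalized coordinates:
// Locate an object from a natural-language description.
Vector<double> box = blip2.GroundText(image, "the red car on the left");
// Multiply by the original image width/height to convert to pixel coordinates.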
InitializeLayers()
Initializes layers for both modes.
protected override void InitializeLayers()
Predict(Tensor<T>)
Makes a prediction using the neural network.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
input (Tensor<T>): The input data to process.
Returns
- Tensor<T>
The network's prediction.
Remarks
For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).
RetrieveImages(string, IEnumerable<Tensor<T>>, int, bool, int)
Retrieves the most relevant images for a text query.
public IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Tensor<T>> imageFeatures, int topK = 10, bool useItmReranking = true, int rerankTopN = 100)
Parameters
query (string): The text query.
imageFeatures (IEnumerable<Tensor<T>>): Pre-computed Q-Former features for images.
topK (int): Number of results to return.
useItmReranking (bool): Whether to rerank top results using ITM.
rerankTopN (int): Number of candidates to rerank with ITM.
Returns
- IEnumerable<(int Index, T Score)>
Indices of top-K matching images with scores.
Remarks
Two-stage retrieval:
1. Fast ITC-based retrieval to get candidates.
2. Optional ITM reranking for higher precision.
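A retrieval sketch; galleryFeatures is assumed to be a collection of Q-Former feature tensors, one per gallery image (for example, produced with ExtractQFormerFeatures):
// Two-stage text-to-image retrieval: fast ITC shortlist, then ITM reranking.
var results = blip2.RetrieveImages("a sunset over the ocean", galleryFeatures, topK: 10, useItmReranking: true);
foreach (var (index, score) in results)
{
    Console.WriteLine($"gallery image #{index}: {score}");
}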
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data that is not covered by the general serialization process.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
writer (BinaryWriter): The BinaryWriter to write the data to.
Remarks
This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.
For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.
SetParameters(Vector<T>)
Sets the parameters of the neural network.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): The parameters to set.
Remarks
This method distributes the parameters to all layers in the network. The parameters should be in the same format as returned by GetParameters.
Train(Tensor<T>, Tensor<T>)
Trains the neural network on a single input-output pair.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
input (Tensor<T>): The input data.
expectedOutput (Tensor<T>): The expected output for the given input.
Remarks
This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.
For Beginners: This is how your neural network learns. You provide:
- An input (what the network should process)
- The expected output (what the correct answer should be)
The network then:
- Makes a prediction based on the input
- Compares its prediction to the expected output
- Calculates how wrong it was (the loss)
- Adjusts its internal values to do better next time
After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
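A single-step training sketch; input and expectedOutput are assumed to be tensors already shaped for this network:
// One training step on one example; repeat over your dataset and epochs.
blip2.Train(input, expectedOutput);
var lastLoss = blip2.GetLastLoss(); // GetLastLoss() is referenced in the remarks above
Console.WriteLine($"loss: {lastLoss}");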
UpdateParameters(Vector<T>)
Updates the network's parameters with new values.
public override void UpdateParameters(Vector<T> gradients)
Parameters
gradients (Vector<T>)
Remarks
For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.
This is typically used by optimization algorithms that calculate better parameter values based on training data.
ZeroShotClassify(Tensor<T>, IEnumerable<string>)
public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels)
Parameters
image (Tensor<T>)
classLabels (IEnumerable<string>)
Returns
- Dictionary<string, T>
ZeroShotClassify(Tensor<T>, IEnumerable<string>, bool)
Performs zero-shot image classification with optional ITM scoring.
public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels, bool useItm)
Parameters
image (Tensor<T>)
classLabels (IEnumerable<string>)
useItm (bool)
Returns
- Dictionary<string, T>
ZeroShotClassify(double[], IEnumerable<string>)
Performs zero-shot classification of an image against text labels.
public Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)
Parameters
imageData (double[]): The preprocessed image data.
labels (IEnumerable<string>): The candidate class labels.
Returns
- Dictionary<string, T>
A dictionary mapping each label to its probability score.
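A zero-shot classification sketch using the overload with ITM scoring and a preprocessed image tensor:
// Score an image against candidate labels; higher values mean a better match.
var labels = new[] { "a photo of a cat", "a photo of a dog", "a photo of a bird" };
Dictionary<string, double> scores = blip2.ZeroShotClassify(image, labels, useItm: true);
foreach (var pair in scores)
{
    Console.WriteLine($"{pair.Key}: {pair.Value}");
}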