Interface IBlip2Model<T>
- Namespace: AiDotNet.Interfaces
- Assembly: AiDotNet.dll
Defines the contract for BLIP-2 (Bootstrapped Language-Image Pre-training 2) models.
public interface IBlip2Model<T> : IMultimodalEmbedding<T>
Type Parameters
- T: The numeric type used for calculations.
Remarks
BLIP-2 is a more efficient and powerful successor to BLIP that uses a Q-Former (Querying Transformer) to bridge frozen image encoders with frozen large language models. This architecture enables better vision-language understanding with significantly less training compute.
For Beginners: BLIP-2 is like having a smart translator between images and language!
Key innovation - the Q-Former:
- Uses special "query tokens" to ask questions about the image
- These queries learn to extract the most useful visual information
- The extracted features then connect to powerful language models (LLMs)
Why BLIP-2 is special:
- Uses frozen (pre-trained) image encoders like ViT-G
- Uses frozen LLMs like OPT or Flan-T5
- Only trains the small Q-Former bridge (much cheaper!)
- Gets state-of-the-art results with less compute
Use cases (same as BLIP but better):
- More accurate image captioning
- Better visual question answering
- More nuanced image-text understanding
- Can leverage larger LLMs for better generation
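The sketch below is a minimal consumption example, not part of the interface itself: it assumes a concrete IBlip2Model<float> implementation (model) and an already-preprocessed Tensor<float> image are supplied by the caller, since neither is defined here. The shorter snippets under the individual members below reuse the same model/image conventions.

```csharp
using System;
using AiDotNet.Interfaces;   // IBlip2Model<T>; also add whichever using directive exposes Tensor<T> in your AiDotNet version

// Hypothetical wrapper class for the examples on this page.
public static class Blip2Examples
{
    public static void DescribeImage(IBlip2Model<float> model, Tensor<float> image)
    {
        // Both properties are defined on this interface.
        Console.WriteLine($"LLM backbone: {model.LanguageModelBackbone}");
        Console.WriteLine($"Query tokens: {model.NumQueryTokens}");   // typically 32

        // Generation members are documented in the Methods section below.
        string caption = model.GenerateCaption(image, prompt: "a photo of");
        string answer  = model.AnswerQuestion(image, "What is the main subject?");
        Console.WriteLine($"Caption: {caption}");
        Console.WriteLine($"Answer:  {answer}");
    }
}
```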
Properties
LanguageModelBackbone
Gets the type of language model backbone used for generation.
LanguageModelBackbone LanguageModelBackbone { get; }
Property Value
- LanguageModelBackbone
Remarks
BLIP-2 can use different LLM backbones:
- OPT: decoder-only, good for general generation
- FlanT5: encoder-decoder, better for instruction-following
The choice affects generation capabilities and quality.
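As a hedged sketch, the prompt style could be adapted to the backbone in use. The enum member names below (OPT, FlanT5) are taken from the remarks above but their exact spelling is an assumption; check the LanguageModelBackbone enum in your AiDotNet version.

```csharp
// Hypothetical enum member names (OPT, FlanT5), mirroring the backbones listed above.
string prompt = model.LanguageModelBackbone switch
{
    LanguageModelBackbone.OPT    => "a photo of",                      // decoder-only: plain prefix prompt
    LanguageModelBackbone.FlanT5 => "Describe this image in detail.",  // encoder-decoder: instruction-style prompt
    _                            => string.Empty
};
string caption = model.GenerateCaption(image, prompt);
```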
NumQueryTokens
Gets the number of learnable query tokens used by the Q-Former.
int NumQueryTokens { get; }
Property Value
- int
Remarks
The query tokens are learnable embeddings that interact with the frozen image encoder through cross-attention to extract visual features. Typically 32 queries are used.
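The number of visual tokens handed to the language model is fixed by this property rather than by the image resolution. A quick illustration, using the same model/image conventions as the overview sketch above:

```csharp
// The Q-Former always produces NumQueryTokens feature vectors per image
// (typically 32), regardless of the input image resolution.
Tensor<float> queryFeatures = model.ExtractQFormerFeatures(image);   // shape [NumQueryTokens, queryDim]
Console.WriteLine($"Visual tokens passed to the LLM: {model.NumQueryTokens}");
```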
Methods
AnswerQuestion(Tensor<T>, string, int)
Answers a question about an image using the LLM backend.
string AnswerQuestion(Tensor<T> image, string question, int maxLength = 30)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- question (string): The question to answer about the image.
- maxLength (int): Maximum answer length.
Returns
- string
The generated answer.
Remarks
Formats the question appropriately for the LLM backend and generates an answer conditioned on both the visual features and the question. BLIP-2's LLM backend typically provides more detailed and accurate answers than BLIP's decoder.
For Beginners: Ask any question about an image!
BLIP-2 is better at VQA because:
- Uses a powerful LLM (OPT/Flan-T5) for generation
- LLM has more world knowledge
- Can give more detailed, reasoned answers
Examples:
- "What is the person doing?" -> "The person is riding a bicycle down a street"
- "What color is the car?" -> "The car is red"
- "Is it raining?" -> "No, it appears to be a sunny day"
ComputeContrastiveSimilarity(Tensor<T>, string)
Computes image-text contrastive similarity using Q-Former features.
T ComputeContrastiveSimilarity(Tensor<T> image, string text)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- text (string): The text to compare.
Returns
- T
Contrastive similarity score.
Remarks
Uses the Q-Former's image-text contrastive (ITC) learning objective. Computes similarity between the CLS token of query outputs and text features. Faster than ITM but less accurate for fine-grained matching.
For Beginners: Quick similarity check between image and text!
Difference from ITM (Image-Text Matching):
- ITC: Fast, uses embedding similarity (like CLIP)
- ITM: Slower, uses cross-attention for deeper analysis
Use ITC for:
- Large-scale retrieval (searching millions of images)
- Quick filtering before detailed matching
Use ITM for:
- Final ranking of candidates
- When accuracy matters more than speed
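A sketch of ITC as a quick filter: score a handful of candidate descriptions and keep the best one (for large galleries, prefer RetrieveImages further below). The candidate strings are illustrative.

```csharp
string[] candidates =
{
    "a dog playing in the park",
    "a city skyline at night",
    "a plate of pasta"
};

string bestText = candidates[0];
float bestScore = float.MinValue;
foreach (string text in candidates)
{
    float score = model.ComputeContrastiveSimilarity(image, text);   // fast ITC score
    if (score > bestScore)
    {
        bestScore = score;
        bestText = text;
    }
}
Console.WriteLine($"Best ITC match: \"{bestText}\" (score {bestScore})");
```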
ComputeImageTextMatch(Tensor<T>, string)
Computes image-text matching score using the Q-Former's ITM head.
T ComputeImageTextMatch(Tensor<T> image, string text)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- text (string): The text to match against the image.
Returns
- T
Matching probability between 0 and 1.
Remarks
Uses the Q-Former's image-text matching head which applies cross-attention between query features and text features to determine if they match. This is trained with hard negative mining for better discrimination.
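A sketch of ITM as a final check on a single candidate. The 0.5 cut-off is an arbitrary illustration, not a library default.

```csharp
// ITM returns a matching probability in [0, 1]; the 0.5 threshold here is illustrative only.
float matchProbability = model.ComputeImageTextMatch(image, "a dog playing in the park");
bool isMatch = matchProbability > 0.5f;
Console.WriteLine($"ITM probability: {matchProbability:0.000} -> {(isMatch ? "match" : "no match")}");
```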
ExtractQFormerFeatures(Tensor<T>)
Extracts visual features using the Q-Former's learnable queries.
Tensor<T> ExtractQFormerFeatures(Tensor<T> image)
Parameters
- image (Tensor<T>): The preprocessed image tensor with shape [channels, height, width].
Returns
- Tensor<T>
Query output features with shape [numQueries, queryDim].
Remarks
The Q-Former uses cross-attention between the learnable query tokens and the frozen image encoder's output to extract NumQueryTokens visual feature vectors. These features are then projected to match the LLM's input dimension.
For Beginners: Think of this as asking 32 questions about the image!
Process:
- Image goes through frozen ViT encoder -> patch features
- Query tokens attend to patch features via cross-attention
- Each query learns to focus on different aspects
- Output: 32 feature vectors summarizing the image
These 32 features are what gets sent to the language model.
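A sketch of precomputing Q-Former features for a whole gallery of images, for example so they can be reused later with RetrieveImages. The method name is hypothetical and the gallery of preprocessed tensors is assumed to be built elsewhere.

```csharp
// Requires: using System.Collections.Generic;
// Hypothetical helper: precompute Q-Former features for already-preprocessed gallery images.
static List<Tensor<float>> PrecomputeGalleryFeatures(IBlip2Model<float> model, IEnumerable<Tensor<float>> gallery)
{
    var features = new List<Tensor<float>>();
    foreach (Tensor<float> galleryImage in gallery)
    {
        // Each call returns a [NumQueryTokens, queryDim] tensor for one image.
        features.Add(model.ExtractQFormerFeatures(galleryImage));
    }
    return features;
}
```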
GenerateCaption(Tensor<T>, string?, int, int, double)
Generates a caption for an image using the LLM backend.
string GenerateCaption(Tensor<T> image, string? prompt = null, int maxLength = 30, int numBeams = 5, double temperature = 1)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- prompt (string): Optional prompt to guide generation (e.g., "a photo of").
- maxLength (int): Maximum number of tokens to generate.
- numBeams (int): Number of beams for beam search.
- temperature (double): Sampling temperature (lower = more deterministic).
Returns
- string
The generated caption.
Remarks
Uses the Q-Former to extract visual features, projects them to the LLM space, and then uses the LLM to generate text conditioned on these visual tokens.
For Beginners: This generates descriptions using a powerful language model!
The prompt helps guide the style:
- "a photo of" -> descriptive captions
- "Question: What is this? Answer:" -> Q&A style
- No prompt -> model's default behavior
Temperature controls randomness:
- 0.0-0.3: Very focused, deterministic
- 0.7-1.0: More creative, varied
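A captioning sketch showing the effect of the prompt and temperature parameters described above (same assumed model and image as the overview sketch):

```csharp
// Focused, deterministic caption guided by a descriptive prompt.
string focused = model.GenerateCaption(image, prompt: "a photo of", numBeams: 5, temperature: 0.2);

// More creative caption with no prompt and a higher temperature.
string creative = model.GenerateCaption(image, maxLength: 50, temperature: 0.9);

Console.WriteLine($"Focused:  {focused}");
Console.WriteLine($"Creative: {creative}");
```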
GenerateCaptions(Tensor<T>, int, string?, int, double, double)
Generates multiple diverse captions for an image.
IEnumerable<(string Caption, T Score)> GenerateCaptions(Tensor<T> image, int numCaptions = 5, string? prompt = null, int maxLength = 30, double temperature = 0.9, double topP = 0.95)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- numCaptions (int): Number of captions to generate.
- prompt (string): Optional prompt to guide generation.
- maxLength (int): Maximum length per caption.
- temperature (double): Sampling temperature for diversity.
- topP (double): Nucleus sampling probability threshold.
Returns
- IEnumerable<(string Caption, T Score)>
Collection of generated captions with their log probabilities.
Remarks
Uses nucleus (top-p) sampling with temperature to generate diverse captions. Returns captions with their generation scores for ranking.
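A sketch that samples several diverse captions and prints them with their scores (log probabilities), highest first:

```csharp
// Requires: using System.Linq;
var captions = model.GenerateCaptions(image, numCaptions: 5, temperature: 0.9, topP: 0.95);

foreach (var (caption, score) in captions.OrderByDescending(c => c.Score))
{
    Console.WriteLine($"{score:0.000}  {caption}");
}
```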
GenerateWithInstruction(Tensor<T>, string, int)
Generates text conditioned on both image and text context (instructed generation).
string GenerateWithInstruction(Tensor<T> image, string instruction, int maxLength = 100)
Parameters
imageTensor<T>The preprocessed image tensor.
instructionstringThe instruction or context for generation.
maxLengthintMaximum generation length.
Returns
- string
The generated response.
Remarks
Enables instruction-following behavior where the model generates text based on both visual input and textual instructions. This is particularly powerful with instruction-tuned LLM backends like Flan-T5.
For Beginners: Give instructions about what to do with the image!
Examples:
- "Describe this image in detail" -> Detailed description
- "List all the objects in this image" -> Bulleted list
- "Write a story based on this image" -> Creative narrative
- "Explain what is happening" -> Scene analysis
This is more flexible than simple captioning because you can customize the output format and content through instructions.
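An instructed-generation sketch using instructions from the examples above (same assumed model and image as the overview sketch):

```csharp
string[] instructions =
{
    "Describe this image in detail",
    "List all the objects in this image",
    "Explain what is happening"
};

foreach (string instruction in instructions)
{
    string response = model.GenerateWithInstruction(image, instruction, maxLength: 100);
    Console.WriteLine($"> {instruction}");
    Console.WriteLine(response);
    Console.WriteLine();
}
```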
GroundText(Tensor<T>, string)
Performs visual grounding to locate objects described in text.
Vector<T> GroundText(Tensor<T> image, string description)
Parameters
imageTensor<T>The preprocessed image tensor.
descriptionstringText description of the object to locate.
Returns
- Vector<T>
Bounding box coordinates [x1, y1, x2, y2] normalized to [0, 1].
Remarks
Uses the Q-Former's attention patterns to identify which image regions correspond to the text description. Returns a bounding box for the most likely region.
For Beginners: Find where something is in an image!
Given text like "the red car on the left", this finds and returns the bounding box coordinates for that object.
The output is normalized coordinates:
- [0, 0, 1, 1] would be the entire image
- [0.5, 0.5, 1, 1] would be the bottom-right quarter
Use cases:
- Object detection from natural language
- Referring expression comprehension
- Interactive image editing ("remove the person on the right")
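A grounding sketch that converts the normalized box into pixel coordinates. The indexer access on Vector<T> and the pixel dimensions are assumptions for illustration, not guarantees of the library's API.

```csharp
// Assumes Vector<T> exposes an indexer; imageWidth/imageHeight are the original
// (pre-preprocessing) pixel dimensions, supplied by your own image-loading code.
Vector<float> box = model.GroundText(image, "the red car on the left");

int imageWidth = 640, imageHeight = 480;   // illustrative values
float x1 = box[0] * imageWidth;
float y1 = box[1] * imageHeight;
float x2 = box[2] * imageWidth;
float y2 = box[3] * imageHeight;
Console.WriteLine($"Box in pixels: ({x1:0}, {y1:0}) to ({x2:0}, {y2:0})");
```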
RetrieveImages(string, IEnumerable<Tensor<T>>, int, bool, int)
Retrieves the most relevant images for a text query.
IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Tensor<T>> imageFeatures, int topK = 10, bool useItmReranking = true, int rerankTopN = 100)
Parameters
querystringThe text query.
imageFeaturesIEnumerable<Tensor<T>>Pre-computed Q-Former features for images.
topKintNumber of results to return.
useItmRerankingboolWhether to rerank top results using ITM.
rerankTopNintNumber of candidates to rerank with ITM.
Returns
- IEnumerable<(int Index, T Score)>
Indices of top-K matching images with scores.
Remarks
Two-stage retrieval:
1. Fast ITC-based retrieval to get candidates.
2. Optional ITM reranking for higher precision.
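A retrieval sketch over precomputed Q-Former features, for instance produced with the hypothetical PrecomputeGalleryFeatures helper sketched under ExtractQFormerFeatures above:

```csharp
// `galleryFeatures` holds precomputed Q-Former features, one tensor per gallery image.
var results = model.RetrieveImages(
    query: "a dog playing in the park",
    imageFeatures: galleryFeatures,
    topK: 10,
    useItmReranking: true,   // rerank the best candidates with the slower, more accurate ITM head
    rerankTopN: 100);

foreach (var (index, score) in results)
{
    Console.WriteLine($"Gallery image #{index}: score {score}");
}
```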
ZeroShotClassify(Tensor<T>, IEnumerable<string>, bool)
Performs zero-shot image classification using text prompts.
Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels, bool useItm = false)
Parameters
imageTensor<T>The preprocessed image tensor.
classLabelsIEnumerable<string>The candidate class labels.
useItmboolIf true, use ITM for scoring; if false, use ITC.
Returns
- Dictionary<string, T>
Dictionary mapping class labels to probability scores.
Remarks
Classifies images into categories without any training on those specific categories. Can use either ITC (faster) or ITM (more accurate) for scoring.
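A zero-shot classification sketch that scores a few candidate labels and reports the most likely one. The label list is illustrative.

```csharp
// Requires: using System.Linq;
var labels = new[] { "cat", "dog", "car", "pizza" };

// useItm: false -> fast ITC scoring; true -> slower but more accurate ITM scoring.
var scores = model.ZeroShotClassify(image, labels, useItm: false);

var best = scores.OrderByDescending(kv => kv.Value).First();
Console.WriteLine($"Predicted label: {best.Key} (p = {best.Value:0.000})");
```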