Interface IBlip2Model<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for BLIP-2 (Bootstrapped Language-Image Pre-training 2) models.

public interface IBlip2Model<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

BLIP-2 is a more efficient and powerful successor to BLIP that uses a Q-Former (Querying Transformer) to bridge frozen image encoders with frozen large language models. This architecture enables better vision-language understanding with significantly less training compute.

For Beginners: BLIP-2 is like having a smart translator between images and language!

Key innovation - the Q-Former:

  • Uses special "query tokens" to ask questions about the image
  • These queries learn to extract the most useful visual information
  • The extracted features then connect to powerful language models (LLMs)

Why BLIP-2 is special:

  • Uses frozen (pre-trained) image encoders like ViT-G
  • Uses frozen LLMs like OPT or Flan-T5
  • Only trains the small Q-Former bridge (much cheaper!)
  • Gets state-of-the-art results with less compute

Use cases (same as BLIP but better):

  • More accurate image captioning
  • Better visual question answering
  • More nuanced image-text understanding
  • Can leverage larger LLMs for better generation

Properties

LanguageModelBackbone

Gets the type of language model backbone used for generation.

LanguageModelBackbone LanguageModelBackbone { get; }

Property Value

LanguageModelBackbone

Remarks

BLIP-2 can use different LLM backbones:

  • OPT - decoder-only, good for general generation
  • FlanT5 - encoder-decoder, better for instruction-following

The choice of backbone affects generation capabilities and quality.
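
For illustration, a minimal sketch of branching on the backbone to pick a prompt style. It assumes a concrete IBlip2Model<float> instance named model and enum members named OPT and FlanT5 (member names inferred from the remarks above, not confirmed by the source):

// Choose a prompt style suited to the backbone (enum member names are assumed).
string defaultPrompt = model.LanguageModelBackbone == LanguageModelBackbone.FlanT5
    ? "Describe the image."   // instruction-tuned encoder-decoder follows directives well
    : "a photo of";           // decoder-only OPT works well with completion-style prompts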

NumQueryTokens

Gets the number of learnable query tokens used by the Q-Former.

int NumQueryTokens { get; }

Property Value

int

Remarks

The query tokens are learnable embeddings that interact with the frozen image encoder through cross-attention to extract visual features. Typically 32 queries are used.

Methods

AnswerQuestion(Tensor<T>, string, int)

Answers a question about an image using the LLM backend.

string AnswerQuestion(Tensor<T> image, string question, int maxLength = 30)

Parameters

image Tensor<T>

The preprocessed image tensor.

question string

The question to answer about the image.

maxLength int

Maximum answer length.

Returns

string

The generated answer.

Remarks

Formats the question appropriately for the LLM backend and generates an answer conditioned on both the visual features and the question. BLIP-2's LLM backend typically provides more detailed and accurate answers than BLIP's decoder.

For Beginners: Ask any question about an image!

BLIP-2 is better at VQA because:

  • Uses a powerful LLM (OPT/Flan-T5) for generation
  • LLM has more world knowledge
  • Can give more detailed, reasoned answers

Examples:

  • "What is the person doing?" -> "The person is riding a bicycle down a street"
  • "What color is the car?" -> "The car is red"
  • "Is it raining?" -> "No, it appears to be a sunny day"

ComputeContrastiveSimilarity(Tensor<T>, string)

Computes image-text contrastive similarity using Q-Former features.

T ComputeContrastiveSimilarity(Tensor<T> image, string text)

Parameters

image Tensor<T>

The preprocessed image tensor.

text string

The text to compare.

Returns

T

Contrastive similarity score.

Remarks

Uses the Q-Former's image-text contrastive (ITC) learning objective. Computes similarity between the CLS token of query outputs and text features. Faster than ITM but less accurate for fine-grained matching.

For Beginners: Quick similarity check between image and text!

Difference from ITM (Image-Text Matching):

  • ITC: Fast, uses embedding similarity (like CLIP)
  • ITM: Slower, uses cross-attention for deeper analysis

Use ITC for:

  • Large-scale retrieval (searching millions of images)
  • Quick filtering before detailed matching

Use ITM for:

  • Final ranking of candidates
  • When accuracy matters more than speed
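
A sketch of the coarse-to-fine pattern described above, using the same hypothetical model and image variables; the threshold is illustrative, not a library default:

// Cheap ITC score first; run the more expensive ITM check only for promising pairs.
float itc = model.ComputeContrastiveSimilarity(image, "a dog catching a frisbee");
if (itc > 0.3f) // illustrative cut-off
{
    float itm = model.ComputeImageTextMatch(image, "a dog catching a frisbee");
    Console.WriteLine($"ITC = {itc:F3}, ITM probability = {itm:F3}");
}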

ComputeImageTextMatch(Tensor<T>, string)

Computes image-text matching score using the Q-Former's ITM head.

T ComputeImageTextMatch(Tensor<T> image, string text)

Parameters

image Tensor<T>

The preprocessed image tensor.

text string

The text to match against the image.

Returns

T

Matching probability between 0 and 1.

Remarks

Uses the Q-Former's image-text matching head which applies cross-attention between query features and text features to determine if they match. This is trained with hard negative mining for better discrimination.
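
For example, ITM can rank a handful of candidate descriptions (hypothetical model and image variables as before):

// Score each candidate; the highest matching probability is the best description.
string[] candidates = { "a red sports car", "a blue pickup truck", "an empty street" };
foreach (string text in candidates)
{
    float probability = model.ComputeImageTextMatch(image, text);
    Console.WriteLine($"{text}: {probability:F3}");
}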

ExtractQFormerFeatures(Tensor<T>)

Extracts visual features using the Q-Former's learnable queries.

Tensor<T> ExtractQFormerFeatures(Tensor<T> image)

Parameters

image Tensor<T>

The preprocessed image tensor with shape [channels, height, width].

Returns

Tensor<T>

Query output features with shape [numQueries, queryDim].

Remarks

The Q-Former applies cross-attention between its learnable query tokens and the frozen image encoder's output to extract one visual feature vector per query token. These features are then projected to match the LLM's input dimension.

For Beginners: Think of this as asking 32 questions about the image!

Process:

  1. Image goes through frozen ViT encoder -> patch features
  2. Query tokens attend to patch features via cross-attention
  3. Each query learns to focus on different aspects
  4. Output: 32 feature vectors summarizing the image

These 32 features are what gets sent to the language model.
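
A short sketch tying the output shape to NumQueryTokens, with the usual hypothetical model and image variables:

// One feature vector per learnable query token, shape [NumQueryTokens, queryDim].
Tensor<float> queryFeatures = model.ExtractQFormerFeatures(image);
Console.WriteLine($"Query tokens used by the Q-Former: {model.NumQueryTokens}");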

GenerateCaption(Tensor<T>, string?, int, int, double)

Generates a caption for an image using the LLM backend.

string GenerateCaption(Tensor<T> image, string? prompt = null, int maxLength = 30, int numBeams = 5, double temperature = 1)

Parameters

image Tensor<T>

The preprocessed image tensor.

prompt string

Optional prompt to guide generation (e.g., "a photo of").

maxLength int

Maximum number of tokens to generate.

numBeams int

Number of beams for beam search.

temperature double

Sampling temperature (lower = more deterministic).

Returns

string

The generated caption.

Remarks

Uses the Q-Former to extract visual features, projects them to the LLM space, and then uses the LLM to generate text conditioned on these visual tokens.

For Beginners: This generates descriptions using a powerful language model!

The prompt helps guide the style:

  • "a photo of" -> descriptive captions
  • "Question: What is this? Answer:" -> Q&A style
  • No prompt -> model's default behavior

Temperature controls randomness:

  • 0.0-0.3: Very focused, deterministic
  • 0.7-1.0: More creative, varied
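
A usage sketch combining a guiding prompt with beam search and a low temperature (hypothetical model and image variables):

// Prompt-guided, mostly deterministic caption via beam search.
string caption = model.GenerateCaption(
    image,
    prompt: "a photo of",
    maxLength: 30,
    numBeams: 5,
    temperature: 0.2);
Console.WriteLine(caption);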

GenerateCaptions(Tensor<T>, int, string?, int, double, double)

Generates multiple diverse captions for an image.

IEnumerable<(string Caption, T Score)> GenerateCaptions(Tensor<T> image, int numCaptions = 5, string? prompt = null, int maxLength = 30, double temperature = 0.9, double topP = 0.95)

Parameters

image Tensor<T>

The preprocessed image tensor.

numCaptions int

Number of captions to generate.

prompt string

Optional prompt to guide generation.

maxLength int

Maximum length per caption.

temperature double

Sampling temperature for diversity.

topP double

Nucleus sampling probability threshold.

Returns

IEnumerable<(string Caption, T Score)>

Collection of generated captions with their log probabilities.

Remarks

Uses nucleus (top-p) sampling with temperature to generate diverse captions. Returns captions with their generation scores for ranking.
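
A sketch that samples several captions and prints them with their scores (hypothetical model and image variables):

// Nucleus sampling with temperature produces diverse candidate captions.
var candidates = model.GenerateCaptions(image, numCaptions: 5, temperature: 0.9, topP: 0.95);
foreach (var (caption, score) in candidates)
{
    Console.WriteLine($"{score:F2}  {caption}");
}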

GenerateWithInstruction(Tensor<T>, string, int)

Generates text conditioned on both image and text context (instructed generation).

string GenerateWithInstruction(Tensor<T> image, string instruction, int maxLength = 100)

Parameters

image Tensor<T>

The preprocessed image tensor.

instruction string

The instruction or context for generation.

maxLength int

Maximum generation length.

Returns

string

The generated response.

Remarks

Enables instruction-following behavior where the model generates text based on both visual input and textual instructions. This is particularly powerful with instruction-tuned LLM backends like Flan-T5.

For Beginners: Give instructions about what to do with the image!

Examples:

  • "Describe this image in detail" -> Detailed description
  • "List all the objects in this image" -> Bulleted list
  • "Write a story based on this image" -> Creative narrative
  • "Explain what is happening" -> Scene analysis

This is more flexible than simple captioning because you can customize the output format and content through instructions.
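
A usage sketch (hypothetical model and image variables); the instruction string controls the output format:

// Instructed generation: the response follows the textual instruction.
string response = model.GenerateWithInstruction(
    image,
    "List all the objects in this image.",
    maxLength: 100);
Console.WriteLine(response);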

GroundText(Tensor<T>, string)

Performs visual grounding to locate objects described in text.

Vector<T> GroundText(Tensor<T> image, string description)

Parameters

image Tensor<T>

The preprocessed image tensor.

description string

Text description of the object to locate.

Returns

Vector<T>

Bounding box coordinates [x1, y1, x2, y2] normalized to [0, 1].

Remarks

Uses the Q-Former's attention patterns to identify which image regions correspond to the text description. Returns a bounding box for the most likely region.

For Beginners: Find where something is in an image!

Given text like "the red car on the left", this finds and returns the bounding box coordinates for that object.

The output is normalized coordinates:

  • [0, 0, 1, 1] would be the entire image
  • [0.5, 0.5, 1, 1] would be the bottom-right quarter

Use cases:

  • Object detection from natural language
  • Referring expression comprehension
  • Interactive image editing ("remove the person on the right")
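
A sketch of grounding a referring expression; it uses the hypothetical model and image variables and assumes Vector<T> supports indexing:

// Locate the described object; the result holds [x1, y1, x2, y2] in [0, 1].
Vector<float> box = model.GroundText(image, "the red car on the left");
Console.WriteLine($"Box: ({box[0]:F2}, {box[1]:F2}) to ({box[2]:F2}, {box[3]:F2})");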

RetrieveImages(string, IEnumerable<Tensor<T>>, int, bool, int)

Retrieves the most relevant images for a text query.

IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Tensor<T>> imageFeatures, int topK = 10, bool useItmReranking = true, int rerankTopN = 100)

Parameters

query string

The text query.

imageFeatures IEnumerable<Tensor<T>>

Pre-computed Q-Former features for images.

topK int

Number of results to return.

useItmReranking bool

Whether to rerank top results using ITM.

rerankTopN int

Number of candidates to rerank with ITM.

Returns

IEnumerable<(int Index, T Score)>

Indices of top-K matching images with scores.

Remarks

Two-stage retrieval:

  1. Fast ITC-based retrieval to get candidates
  2. Optional ITM reranking for higher precision
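
A sketch of the two-stage retrieval, assuming a hypothetical model plus a collection named images of preprocessed tensors (System.Linq is used for Select):

// Pre-compute Q-Former features once per image, then query them repeatedly.
var imageFeatures = images.Select(model.ExtractQFormerFeatures).ToList();

var results = model.RetrieveImages(
    "a cat sleeping on a sofa",
    imageFeatures,
    topK: 10,
    useItmReranking: true,  // rerank the best ITC candidates with ITM
    rerankTopN: 100);

foreach (var (index, score) in results)
{
    Console.WriteLine($"images[{index}]: {score:F3}");
}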

ZeroShotClassify(Tensor<T>, IEnumerable<string>, bool)

Performs zero-shot image classification using text prompts.

Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels, bool useItm = false)

Parameters

image Tensor<T>

The preprocessed image tensor.

classLabels IEnumerable<string>

The candidate class labels.

useItm bool

If true, use ITM for scoring; if false, use ITC.

Returns

Dictionary<string, T>

Dictionary mapping class labels to probability scores.

Remarks

Classifies images into categories without any training on those specific categories. Can use either ITC (faster) or ITM (more accurate) for scoring.
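
A usage sketch with a custom label set (hypothetical model and image variables); ITC scoring keeps it fast, while useItm: true trades speed for accuracy:

// Classify against labels the model was never explicitly trained on.
var labels = new[] { "cat", "dog", "horse", "bird" };
Dictionary<string, float> scores = model.ZeroShotClassify(image, labels, useItm: false);
foreach (var kvp in scores)
{
    Console.WriteLine($"{kvp.Key}: {kvp.Value:F3}");
}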