Interface IBlipModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for BLIP (Bootstrapped Language-Image Pre-training) models.

public interface IBlipModel<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations (typically float or double).

Remarks

BLIP extends CLIP's capabilities with additional vision-language tasks: image captioning, image-text matching, and visual question answering. This interface extends IMultimodalEmbedding<T> with these features.

For Beginners: BLIP is like CLIP but with extra superpowers!

What CLIP can do:

  • Compare images and text (are they related?)
  • Zero-shot classification (classify without training)

What BLIP adds:

  • Generate captions for images (describe what you see)
  • Answer questions about images (VQA)
  • Better image-text matching with cross-attention

BLIP was trained on a larger, cleaner dataset using a special "bootstrapping" technique that improves the quality of training data automatically.
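
A minimal usage sketch of these additional capabilities is shown below. It assumes the caller already has a concrete IBlipModel&lt;float&gt; implementation and a preprocessed image tensor (Tensor&lt;T&gt; is an AiDotNet type; the helper method name is illustrative):

using System;
using AiDotNet.Interfaces;

// Sketch only: the caller supplies any IBlipModel<float> implementation and an
// already-preprocessed image tensor with shape [channels, height, width].
static void DescribeImage(IBlipModel<float> model, Tensor<float> image)
{
    // Capabilities BLIP adds on top of CLIP-style embedding comparison:
    string caption = model.GenerateCaption(image);                           // image captioning
    string answer  = model.AnswerQuestion(image, "What color is the car?");  // visual question answering
    float  match   = model.ComputeImageTextMatch(image, "a red car");        // fine-grained image-text matching

    Console.WriteLine($"Caption: {caption}");
    Console.WriteLine($"Answer:  {answer}");
    Console.WriteLine($"Match:   {match:F2}");
}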

Methods

AnswerQuestion(Tensor<T>, string, int)

Answers a question about an image's content.

string AnswerQuestion(Tensor<T> image, string question, int maxLength = 20)

Parameters

image Tensor<T>

The preprocessed image tensor.

question string

The question to answer (e.g., "What color is the car?").

maxLength int

Maximum length of the answer. Default is 20.

Returns

string

The generated answer.

Remarks

Visual Question Answering (VQA) generates natural language answers to questions about image content. The model uses cross-attention to focus on relevant image regions when generating the answer.

For Beginners: Ask questions about images and get answers!

Examples:

  • Image: Photo of a kitchen
  • "What appliances are visible?" → "refrigerator, microwave, and stove"
  • "What color are the cabinets?" → "white"
  • "Is there a window?" → "yes, above the sink"

This is useful for:

  • Accessibility (describe images for visually impaired users)
  • Content moderation (is there alcohol in this photo?)
  • Data extraction (what brand is this product?)
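
The sketch below shows one way VQA might be used in practice, assuming a float-based model and a preprocessed image supplied by the caller (the question list is illustrative):

using System;
using AiDotNet.Interfaces;

// Sketch: ask several questions about the same image.
static void AskAboutImage(IBlipModel<float> model, Tensor<float> image)
{
    string[] questions =
    {
        "What appliances are visible?",
        "What color are the cabinets?",
        "Is there a window?"
    };

    foreach (string question in questions)
    {
        // maxLength caps the length of the generated answer (default 20).
        string answer = model.AnswerQuestion(image, question, maxLength: 20);
        Console.WriteLine($"{question} -> {answer}");
    }
}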

ComputeImageTextMatch(Tensor<T>, string)

Determines whether a given text accurately describes an image.

T ComputeImageTextMatch(Tensor<T> image, string text)

Parameters

image Tensor<T>

The preprocessed image tensor.

text string

The text description to evaluate.

Returns

T

A probability score between 0 and 1 indicating match quality.

Remarks

Uses the Image-Text Matching (ITM) head with cross-attention between image patches and text tokens for fine-grained matching. This is more accurate than simple embedding similarity for detailed matching.

For Beginners: This checks if a caption accurately describes an image.

Unlike simple similarity (dot product), this uses "cross-attention" which:

  • Looks at specific parts of the image
  • Compares them to specific words in the text
  • Produces a more accurate match score

Example:

  • Image: A red car parked on a street
  • "A red vehicle on pavement" → 0.92 (accurate!)
  • "A blue car in a garage" → 0.15 (wrong color and location)

Use this when you need precise matching, not just "related content."
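
For example, the ITM score can serve as a simple accept/reject filter for candidate captions. The sketch below assumes a float-based model; the 0.5 threshold is an illustrative choice, not a library default:

using AiDotNet.Interfaces;

// Sketch: treat the ITM probability as a yes/no decision with a chosen threshold.
static bool CaptionMatches(IBlipModel<float> model, Tensor<float> image, string caption)
{
    float score = model.ComputeImageTextMatch(image, caption); // probability in [0, 1]
    return score >= 0.5f;                                      // threshold is illustrative
}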

GenerateCaption(Tensor<T>, int, int)

Generates a caption describing the content of an image.

string GenerateCaption(Tensor<T> image, int maxLength = 30, int numBeams = 3)

Parameters

image Tensor<T>

The preprocessed image tensor with shape [channels, height, width].

maxLength int

Maximum number of tokens to generate. Default is 30.

numBeams int

Number of beams for beam search. Default is 3 for quality/speed balance.

Returns

string

A generated caption describing the image.

Remarks

Uses the image-grounded text decoder to generate descriptive captions. The generation uses beam search by default for higher quality outputs.

For Beginners: This automatically describes what's in an image!

Example:

  • Input: Photo of a dog playing fetch in a park
  • Output: "a brown dog catching a frisbee on a grassy field"

Parameters:

  • maxLength: How long the caption can be (30 = roughly 25 words)
  • numBeams: More beams = better captions but slower (3 is a good balance)

Uses "beam search" - it explores multiple possible captions and picks the best one.

GenerateCaptions(Tensor<T>, int, int)

Generates multiple candidate captions for an image.

IEnumerable<string> GenerateCaptions(Tensor<T> image, int numCaptions = 5, int maxLength = 30)

Parameters

image Tensor<T>

The preprocessed image tensor.

numCaptions int

Number of captions to generate. Default is 5.

maxLength int

Maximum length per caption. Default is 30.

Returns

IEnumerable<string>

A collection of candidate captions.

Remarks

Uses nucleus (top-p) sampling to generate diverse captions. Useful for getting multiple perspectives on an image's content.
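
A minimal sketch, assuming a float-based model and a preprocessed image supplied by the caller:

using System;
using AiDotNet.Interfaces;

// Sketch: print several diverse candidate captions for one image.
static void PrintCandidateCaptions(IBlipModel<float> model, Tensor<float> image)
{
    foreach (string caption in model.GenerateCaptions(image, numCaptions: 5, maxLength: 30))
    {
        Console.WriteLine(caption);
    }
}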

RankCaptions(Tensor<T>, IEnumerable<string>)

Ranks a set of candidate captions by how well they match an image.

IEnumerable<(string Caption, T Score)> RankCaptions(Tensor<T> image, IEnumerable<string> candidates)

Parameters

image Tensor<T>

The preprocessed image tensor.

candidates IEnumerable<string>

The candidate captions to rank.

Returns

IEnumerable<(string Caption, T Score)>

Captions ranked by match score, from best to worst.

Remarks

Uses the ITM head to score each candidate, then returns them in descending order. Useful for caption reranking in retrieval applications.
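
A common pattern is to generate several candidates and keep the top-ranked one. The sketch below assumes a float-based model:

using System.Collections.Generic;
using System.Linq;
using AiDotNet.Interfaces;

// Sketch: generate diverse candidates, rerank them with the ITM head, keep the best.
static string BestCaption(IBlipModel<float> model, Tensor<float> image)
{
    IEnumerable<string> candidates = model.GenerateCaptions(image, numCaptions: 5);
    return model.RankCaptions(image, candidates).First().Caption; // results come back best-to-worst
}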

RetrieveImages(string, IEnumerable<Vector<T>>, int)

Retrieves the most relevant images for a text query from a collection.

IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Vector<T>> imageEmbeddings, int topK = 10)

Parameters

query string

The text query describing desired images.

imageEmbeddings IEnumerable<Vector<T>>

Pre-computed image embeddings.

topK int

Number of results to return. Default is 10.

Returns

IEnumerable<(int Index, T Score)>

Indices of the top-K matching images with their scores.

Remarks

Performs efficient text-to-image retrieval using embedding similarity. For large collections, pre-compute and cache image embeddings.
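
The sketch below assumes the image embeddings were pre-computed (for example with the embedding methods inherited from IMultimodalEmbedding&lt;T&gt;) and cached; Vector&lt;T&gt; is an AiDotNet type:

using System;
using System.Collections.Generic;
using AiDotNet.Interfaces;

// Sketch: text-to-image search over a cached embedding collection.
static void SearchImages(IBlipModel<float> model, IEnumerable<Vector<float>> cachedImageEmbeddings)
{
    foreach (var (index, score) in model.RetrieveImages("a red car on a street", cachedImageEmbeddings, topK: 5))
    {
        Console.WriteLine($"image #{index}: score {score:F3}");
    }
}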

RetrieveTexts(Tensor<T>, IEnumerable<Vector<T>>, int)

Retrieves the most relevant texts for an image from a collection.

IEnumerable<(int Index, T Score)> RetrieveTexts(Tensor<T> image, IEnumerable<Vector<T>> textEmbeddings, int topK = 10)

Parameters

image Tensor<T>

The preprocessed image tensor.

textEmbeddings IEnumerable<Vector<T>>

Pre-computed text embeddings.

topK int

Number of results to return. Default is 10.

Returns

IEnumerable<(int Index, T Score)>

Indices of the top-K matching texts with their scores.

Remarks

Performs efficient image-to-text retrieval using embedding similarity. Useful for finding relevant captions or descriptions for images.
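
The sketch below assumes the text embeddings were pre-computed and stored alongside their original caption strings (the parallel list is an illustrative arrangement):

using System.Collections.Generic;
using System.Linq;
using AiDotNet.Interfaces;

// Sketch: image-to-text retrieval, mapping result indices back to caption strings.
static IEnumerable<string> BestDescriptions(
    IBlipModel<float> model,
    Tensor<float> image,
    IReadOnlyList<string> captions,
    IEnumerable<Vector<float>> captionEmbeddings)
{
    return model.RetrieveTexts(image, captionEmbeddings, topK: 3)
                .Select(result => captions[result.Index]);
}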