Interface IBlipModel<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for BLIP (Bootstrapped Language-Image Pre-training) models.
public interface IBlipModel<T> : IMultimodalEmbedding<T>
Type Parameters
- T: The numeric type used for calculations.
Remarks
BLIP extends CLIP's capabilities with additional vision-language tasks: image captioning, image-text matching, and visual question answering. This interface extends IMultimodalEmbedding<T> with these features.
For Beginners: BLIP is like CLIP but with extra superpowers!
What CLIP can do:
- Compare images and text (are they related?)
- Zero-shot classification (classify without training)
What BLIP adds:
- Generate captions for images (describe what you see)
- Answer questions about images (VQA)
- Better image-text matching with cross-attention
BLIP was trained on a larger, cleaner dataset using a special "bootstrapping" technique that improves the quality of training data automatically.
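A minimal usage sketch, assuming a hypothetical BlipModel<float> implementation of this interface and a hypothetical LoadImageTensor preprocessing helper (neither name is part of the documented API):
// Hypothetical concrete type and preprocessing helper; names are illustrative only.
IBlipModel<float> blip = new BlipModel<float>();
Tensor<float> image = LoadImageTensor("photo.jpg"); // preprocessed [channels, height, width]
string caption = blip.GenerateCaption(image);                         // image captioning
string answer = blip.AnswerQuestion(image, "What color is the car?"); // visual question answering
float match = blip.ComputeImageTextMatch(image, "a red car");         // image-text matching
The sketches under the methods below reuse the blip and image variables from this setup.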
Methods
AnswerQuestion(Tensor<T>, string, int)
Answers a question about an image's content.
string AnswerQuestion(Tensor<T> image, string question, int maxLength = 20)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- question (string): The question to answer (e.g., "What color is the car?").
- maxLength (int): Maximum length of the answer.
Returns
- string: The generated answer.
Remarks
Visual Question Answering (VQA) generates natural language answers to questions about image content. The model uses cross-attention to focus on relevant image regions when generating the answer.
For Beginners: Ask questions about images and get answers!
Examples:
- Image: Photo of a kitchen
- "What appliances are visible?" → "refrigerator, microwave, and stove"
- "What color are the cabinets?" → "white"
- "Is there a window?" → "yes, above the sink"
This is useful for:
- Accessibility (describe images for visually impaired users)
- Content moderation (is there alcohol in this photo?)
- Data extraction (what brand is this product?)
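A short sketch, reusing the blip instance and image tensor from the setup example above:
// Ask free-form questions about a single preprocessed image.
string appliances = blip.AnswerQuestion(image, "What appliances are visible?");
string cabinets = blip.AnswerQuestion(image, "What color are the cabinets?", maxLength: 10);
Console.WriteLine($"{appliances} / {cabinets}");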
ComputeImageTextMatch(Tensor<T>, string)
Determines whether a given text accurately describes an image.
T ComputeImageTextMatch(Tensor<T> image, string text)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- text (string): The text description to evaluate.
Returns
- T: A probability score between 0 and 1 indicating match quality.
Remarks
Uses the Image-Text Matching (ITM) head with cross-attention between image patches and text tokens for fine-grained matching. This is more accurate than simple embedding similarity for detailed matching.
For Beginners: This checks if a caption accurately describes an image.
Unlike simple similarity (dot product), this uses "cross-attention" which:
- Looks at specific parts of the image
- Compares them to specific words in the text
- Gives a more accurate yes/no answer
Example:
- Image: A red car parked on a street
- "A red vehicle on pavement" → 0.92 (accurate!)
- "A blue car in a garage" → 0.15 (wrong color and location)
Use this when you need precise matching, not just "related content."
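A sketch of thresholding the match score, reusing blip and image from the setup example; the 0.5 cutoff is an illustrative choice, not a documented default:
// Score how well a caption describes the image, then apply a simple cutoff.
float score = blip.ComputeImageTextMatch(image, "a red vehicle on pavement");
bool isAccurate = score > 0.5f; // illustrative threshold; tune for your data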
GenerateCaption(Tensor<T>, int, int)
Generates a caption describing the content of an image.
string GenerateCaption(Tensor<T> image, int maxLength = 30, int numBeams = 3)
Parameters
- image (Tensor<T>): The preprocessed image tensor with shape [channels, height, width].
- maxLength (int): Maximum number of tokens to generate. Default is 30.
- numBeams (int): Number of beams for beam search. Default is 3 for a quality/speed balance.
Returns
- string: A generated caption describing the image.
Remarks
Uses the image-grounded text decoder to generate descriptive captions. The generation uses beam search by default for higher quality outputs.
For Beginners: This automatically describes what's in an image!
Example:
- Input: Photo of a dog playing fetch in a park
- Output: "a brown dog catching a frisbee on a grassy field"
Parameters:
- maxLength: How long the caption can be (30 = roughly 25 words)
- numBeams: More beams = better captions but slower (3 is a good balance)
Uses "beam search" - it explores multiple possible captions and picks the best one.
GenerateCaptions(Tensor<T>, int, int)
Generates multiple candidate captions for an image.
IEnumerable<string> GenerateCaptions(Tensor<T> image, int numCaptions = 5, int maxLength = 30)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- numCaptions (int): Number of captions to generate.
- maxLength (int): Maximum length per caption.
Returns
- IEnumerable<string>: A collection of candidate captions.
Remarks
Uses nucleus (top-p) sampling to generate diverse captions. Useful for getting multiple perspectives on an image's content.
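A sketch that prints several sampled captions for one image (blip and image as in the setup example):
// Nucleus sampling produces varied captions for the same image.
foreach (string candidate in blip.GenerateCaptions(image, numCaptions: 5))
{
    Console.WriteLine(candidate);
}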
RankCaptions(Tensor<T>, IEnumerable<string>)
Ranks a set of candidate captions by how well they match an image.
IEnumerable<(string Caption, T Score)> RankCaptions(Tensor<T> image, IEnumerable<string> candidates)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- candidates (IEnumerable<string>): The candidate captions to rank.
Returns
- IEnumerable<(string Caption, T Score)>: Captions ranked by match score, from best to worst.
Remarks
Uses the ITM head to score each candidate, then returns them in descending order. Useful for caption reranking in retrieval applications.
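A sketch of a generate-then-rerank pipeline (blip and image as in the setup example; First() requires System.Linq):
// Generate diverse candidates, then rerank them with the ITM head.
IEnumerable<string> candidates = blip.GenerateCaptions(image, numCaptions: 5);
var (bestCaption, bestScore) = blip.RankCaptions(image, candidates).First(); // results are ordered best to worst
Console.WriteLine($"{bestCaption} ({bestScore})");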
RetrieveImages(string, IEnumerable<Vector<T>>, int)
Retrieves the most relevant images for a text query from a collection.
IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Vector<T>> imageEmbeddings, int topK = 10)
Parameters
- query (string): The text query describing desired images.
- imageEmbeddings (IEnumerable<Vector<T>>): Pre-computed image embeddings.
- topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score)>: Indices of the top-K matching images with their scores.
Remarks
Performs efficient text-to-image retrieval using embedding similarity. For large collections, pre-compute and cache image embeddings.
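A sketch of text-to-image retrieval over cached embeddings; LoadCachedImageEmbeddings is a hypothetical helper standing in for however your application precomputes and stores image embeddings:
// One precomputed embedding per image in the collection.
List<Vector<float>> imageEmbeddings = LoadCachedImageEmbeddings(); // hypothetical cache helper
foreach (var (index, score) in blip.RetrieveImages("a dog playing fetch", imageEmbeddings, topK: 5))
{
    Console.WriteLine($"image #{index}: {score}");
}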
RetrieveTexts(Tensor<T>, IEnumerable<Vector<T>>, int)
Retrieves the most relevant texts for an image from a collection.
IEnumerable<(int Index, T Score)> RetrieveTexts(Tensor<T> image, IEnumerable<Vector<T>> textEmbeddings, int topK = 10)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- textEmbeddings (IEnumerable<Vector<T>>): Pre-computed text embeddings.
- topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score)>: Indices of the top-K matching texts with their scores.
Remarks
Performs efficient image-to-text retrieval using embedding similarity. Useful for finding relevant captions or descriptions for images.
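A matching sketch for the image-to-text direction (blip and image as in the setup example; LoadCachedTextEmbeddings is a hypothetical helper):
// One precomputed embedding per candidate description.
List<Vector<float>> textEmbeddings = LoadCachedTextEmbeddings(); // hypothetical cache helper
foreach (var (index, score) in blip.RetrieveTexts(image, textEmbeddings, topK: 3))
{
    Console.WriteLine($"text #{index}: {score}");
}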