Interface IFlamingoModel<T>
- Namespace: AiDotNet.Interfaces
- Assembly: AiDotNet.dll
Defines the contract for Flamingo-style models with in-context visual learning capabilities.
public interface IFlamingoModel<T> : IMultimodalEmbedding<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Flamingo is a visual language model that excels at few-shot learning - it can learn new tasks from just a few examples provided in the context. It uses gated cross-attention layers interleaved with frozen LLM layers to integrate visual information.
For Beginners: Flamingo learns new visual tasks from examples you show it!
Key innovation - In-context learning:
- Show Flamingo a few example image-text pairs
- It learns the pattern from these examples
- Apply the pattern to new images WITHOUT any training
Architecture:
- Vision Encoder + Perceiver Resampler: The encoder extracts image features; the Perceiver Resampler compresses them into a fixed number of visual tokens
- Gated Cross-Attention: Injects visual info into language model
- Frozen LLM: Chinchilla-based language model
Example use case:
- Show 3 examples: [image1] "A red apple" [image2] "A blue car" [image3] "A green tree"
- Ask about new image: [image4] "What color?"
- Flamingo infers from the examples that the answer should name the color, and responds in the same pattern
Why Flamingo is revolutionary:
- No fine-tuning needed for new tasks
- Adapts to new visual concepts on-the-fly
- Strong performance with minimal examples
Properties
LanguageModelBackbone
Gets the language model backbone used for generation.
LanguageModelBackbone LanguageModelBackbone { get; }
Property Value
- LanguageModelBackbone
Remarks
The original Flamingo models use a frozen Chinchilla language model as the backbone.
MaxImagesInContext
Gets the maximum number of images that can be processed in a single context.
int MaxImagesInContext { get; }
Property Value
- int
NumPerceiverTokens
Gets the number of visual tokens per image after the Perceiver Resampler.
int NumPerceiverTokens { get; }
Property Value
- int
Remarks
The Perceiver Resampler compresses visual features to a fixed number of tokens (typically 64) regardless of input image size. This enables efficient processing of multiple images in context.
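Because every image costs the same number of tokens, the visual part of the context can be budgeted with simple arithmetic. The sketch below is illustrative only: the two values are hypothetical placeholders, and in practice you would read them from MaxImagesInContext and NumPerceiverTokens on your model instance.

```csharp
using System;

// Hypothetical values for illustration; in practice read them from
// model.MaxImagesInContext and model.NumPerceiverTokens.
int maxImagesInContext = 5;
int numPerceiverTokens = 64;

// Every image costs the same number of visual tokens, whatever its resolution.
int visualTokenBudget = checked(maxImagesInContext * numPerceiverTokens);

Console.WriteLine($"Up to {visualTokenBudget} visual tokens per context"); // 320
```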
Methods
DescribeVideo(IEnumerable<Tensor<T>>, string?, int)
Generates a description of a video represented as a sequence of frames.
string DescribeVideo(IEnumerable<Tensor<T>> frames, string? prompt = null, int maxLength = 256)
Parameters
frames (IEnumerable<Tensor<T>>): Sequence of video frame tensors.
prompt (string): Optional prompt to guide generation.
maxLength (int): Maximum tokens to generate.
Returns
- string
Generated video description.
Remarks
Flamingo can process multiple frames as separate images interleaved in context, enabling basic video understanding.
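Since each frame counts as one image in context, a long video may contain more frames than MaxImagesInContext allows. Whether DescribeVideo subsamples frames internally is not stated here, so the sketch below shows one way to do it yourself before the call. SampleFrames is a hypothetical helper, not part of the library; it is generic so it runs without Tensor<T>.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: picks at most maxFrames items, evenly spaced
// across the input, preserving their original order.
static List<TFrame> SampleFrames<TFrame>(IReadOnlyList<TFrame> frames, int maxFrames)
{
    if (maxFrames <= 0) throw new ArgumentOutOfRangeException(nameof(maxFrames));
    if (frames.Count <= maxFrames) return frames.ToList();

    var sampled = new List<TFrame>(maxFrames);
    for (int i = 0; i < maxFrames; i++)
    {
        // Map slot i onto the index range [0, frames.Count - 1].
        int index = (int)((long)i * (frames.Count - 1) / Math.Max(1, maxFrames - 1));
        sampled.Add(frames[index]);
    }
    return sampled;
}

// Integers stand in for frame tensors so the sketch runs on its own.
var frameIds = Enumerable.Range(0, 100).ToList();
var picked = SampleFrames(frameIds, 4);
Console.WriteLine(string.Join(", ", picked)); // 0, 33, 66, 99
```

With a real model, the same helper could be called with maxFrames set to model.MaxImagesInContext and the result passed to DescribeVideo.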
ExtractPerceiverFeatures(Tensor<T>)
Extracts visual features using the Perceiver Resampler.
Tensor<T> ExtractPerceiverFeatures(Tensor<T> image)
Parameters
image (Tensor<T>): The preprocessed image tensor.
Returns
- Tensor<T>
Resampled visual tokens with shape [numPerceiverTokens, hiddenDim].
Remarks
The Perceiver Resampler uses cross-attention with learnable queries to compress variable-length visual features into a fixed number of tokens.
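The idea of fixed-size output from variable-length input can be sketched with a single cross-attention step. This is a deliberately simplified illustration, not the library's implementation: it uses one head, no learned key/value projections, no stacked layers, and plain double arrays in place of Tensor<T>. The function name Resample and all values are assumptions made for the example.

```csharp
using System;

// Simplified single-head cross-attention without learned projections:
// output[m] = sum over n of softmax_n(q_m . f_n / sqrt(d)) * f_n
static double[][] Resample(double[][] queries, double[][] features)
{
    int dim = queries[0].Length;
    var output = new double[queries.Length][];

    for (int m = 0; m < queries.Length; m++)
    {
        // Scaled dot-product score of query m against every feature vector.
        var weights = new double[features.Length];
        double maxScore = double.NegativeInfinity;
        for (int n = 0; n < features.Length; n++)
        {
            double dot = 0;
            for (int k = 0; k < dim; k++) dot += queries[m][k] * features[n][k];
            weights[n] = dot / Math.Sqrt(dim);
            maxScore = Math.Max(maxScore, weights[n]);
        }

        // Numerically stable softmax over the features.
        double sum = 0;
        for (int n = 0; n < features.Length; n++)
        {
            weights[n] = Math.Exp(weights[n] - maxScore);
            sum += weights[n];
        }

        // The weighted average of the features becomes output token m.
        output[m] = new double[dim];
        for (int n = 0; n < features.Length; n++)
            for (int k = 0; k < dim; k++)
                output[m][k] += weights[n] / sum * features[n][k];
    }
    return output;
}

// Two "learned" queries (fixed here) applied to inputs of different lengths.
var learnedQueries = new[] { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 } };
var shortInput = new[] { new[] { 0.5, 0.1 }, new[] { 0.2, 0.9 }, new[] { 0.7, 0.3 } };
var longInput = new double[50][];
for (int i = 0; i < longInput.Length; i++) longInput[i] = new[] { i / 50.0, 1 - i / 50.0 };

Console.WriteLine(Resample(learnedQueries, shortInput).Length); // 2
Console.WriteLine(Resample(learnedQueries, longInput).Length);  // 2
```

The number of output rows depends only on the number of queries, which is why the token count per image stays constant regardless of how many visual features go in.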
FewShotGenerate(IEnumerable<(Tensor<T> Image, string Text)>, Tensor<T>, string?, int)
Performs few-shot visual learning with interleaved image-text examples.
string FewShotGenerate(IEnumerable<(Tensor<T> Image, string Text)> examples, Tensor<T> queryImage, string? queryPrompt = null, int maxLength = 256)
Parameters
examples (IEnumerable<(Tensor<T> Image, string Text)>): Few-shot examples as (image, text) pairs.
queryImage (Tensor<T>): The new image to process.
queryPrompt (string): Optional prompt for the query (e.g., "What is this?").
maxLength (int): Maximum tokens to generate.
Returns
- string
The generated response based on learned pattern.
Remarks
For Beginners: Learn a task from examples, then apply it!
Example - Learning to identify dog breeds:
Examples:
- [image of labrador] "This is a Labrador Retriever"
- [image of poodle] "This is a Poodle"
- [image of beagle] "This is a Beagle"
Query: [image of golden retriever] "This is a..."
Response: "Golden Retriever"
Flamingo learned the pattern from examples without any training!
FewShotImageRetrieval(IEnumerable<Tensor<T>>, string?, IEnumerable<Tensor<T>>, int)
Retrieves the most similar images from a database using few-shot context.
IEnumerable<(int Index, T Score)> FewShotImageRetrieval(IEnumerable<Tensor<T>> queryExamples, string? queryDescription, IEnumerable<Tensor<T>> candidateImages, int topK = 10)
Parameters
queryExamples (IEnumerable<Tensor<T>>): Example images representing what you're looking for.
queryDescription (string): Optional text description of desired images.
candidateImages (IEnumerable<Tensor<T>>): Database of images to search.
topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score)>
Indices of the most similar candidate images, each paired with its similarity score.
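The returned (Index, Score) shape corresponds to a top-K selection over per-candidate scores. How the model computes those scores is not specified here, so the sketch below covers only the selection step, with double standing in for T. TopK is a hypothetical helper and the similarity values are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: pairs each score with its position in the
// candidate list, then keeps the topK highest-scoring entries.
static List<(int Index, double Score)> TopK(IEnumerable<double> scores, int topK)
{
    return scores
        .Select((score, index) => (Index: index, Score: score))
        .OrderByDescending(pair => pair.Score)
        .Take(topK)
        .ToList();
}

// Made-up similarity scores, one per candidate image.
var similarities = new[] { 0.12, 0.87, 0.45, 0.91, 0.30 };
foreach (var hit in TopK(similarities, 3))
    Console.WriteLine($"candidate {hit.Index}: {hit.Score}");
// Candidates are listed in the order 3, 1, 2.
```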
FewShotVQA(IEnumerable<(Tensor<T> Image, string Question, string Answer)>, Tensor<T>, string)
Performs visual question answering with few-shot examples.
string FewShotVQA(IEnumerable<(Tensor<T> Image, string Question, string Answer)> examples, Tensor<T> queryImage, string question)
Parameters
examples (IEnumerable<(Tensor<T> Image, string Question, string Answer)>): Example (image, question, answer) tuples.
queryImage (Tensor<T>): The image to ask about.
question (string): The question to answer.
Returns
- string
The generated answer.
GenerateWithMultipleImages(IEnumerable<Tensor<T>>, string, int)
Generates text for multiple images interleaved in a single context.
string GenerateWithMultipleImages(IEnumerable<Tensor<T>> images, string prompt, int maxLength = 512)
Parameters
images (IEnumerable<Tensor<T>>): Sequence of images to process.
prompt (string): Prompt that may reference images using special tokens.
maxLength (int): Maximum tokens to generate.
Returns
- string
Generated text response.
Remarks
Supports prompts like: "<image> shows a cat and <image> shows a dog. Compare them." where <image> tokens are replaced with corresponding image features.
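Because each <image> token is paired with one supplied image, it can help to check that the counts agree before calling the method. How the library handles a mismatch is not documented here, so the following is only a defensive sketch. CountImageTokens is a hypothetical helper, and the token spelling is taken from the remark above.

```csharp
using System;

const string ImageToken = "<image>";

// Hypothetical helper: counts non-overlapping occurrences of token in prompt.
static int CountImageTokens(string prompt, string token)
{
    int count = 0;
    int position = 0;
    while ((position = prompt.IndexOf(token, position, StringComparison.Ordinal)) >= 0)
    {
        count++;
        position += token.Length;
    }
    return count;
}

string userPrompt = "<image> shows a cat and <image> shows a dog. Compare them.";
int imageCount = 2; // number of tensors you plan to pass as `images`

int placeholders = CountImageTokens(userPrompt, ImageToken);
if (placeholders != imageCount)
    throw new ArgumentException(
        $"Prompt has {placeholders} {ImageToken} tokens but {imageCount} images were supplied.");

Console.WriteLine("Prompt and image count agree.");
```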
InContextClassify(IEnumerable<(Tensor<T> Image, string Label)>, Tensor<T>)
Performs in-context visual classification from a few labeled examples, without any training.
Dictionary<string, T> InContextClassify(IEnumerable<(Tensor<T> Image, string Label)> labeledExamples, Tensor<T> queryImage)
Parameters
labeledExamples (IEnumerable<(Tensor<T> Image, string Label)>): Examples as (image, label) pairs.
queryImage (Tensor<T>): The image to classify.
Returns
- Dictionary<string, T>
Dictionary mapping labels to confidence scores.
Remarks
For Beginners: Classify images using just a few examples!
Instead of training a classifier on thousands of images:
- Show a few examples per class
- Flamingo learns the categories
- It can now classify new images
This is "few-shot classification" - works with any categories!
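One plausible way to obtain per-label confidence scores from a generative model is to score each label as a completion and normalize the results with a softmax. Whether InContextClassify works this way is an assumption; the sketch below only illustrates the normalization step, with double standing in for T. ToConfidences and the log-probability values are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: softmax over per-label log probabilities,
// so the resulting confidences are positive and sum to 1.
static Dictionary<string, double> ToConfidences(IReadOnlyDictionary<string, double> scores)
{
    double max = scores.Values.Max(); // subtract the max for numerical stability
    var exp = scores.ToDictionary(kv => kv.Key, kv => Math.Exp(kv.Value - max));
    double total = exp.Values.Sum();
    return exp.ToDictionary(kv => kv.Key, kv => kv.Value / total);
}

// Made-up log probabilities for three candidate labels.
var labelLogProbs = new Dictionary<string, double>
{
    ["cat"] = -1.2,
    ["dog"] = -0.4,
    ["bird"] = -3.0,
};

var confidences = ToConfidences(labelLogProbs);
string best = confidences.OrderByDescending(kv => kv.Value).First().Key;
Console.WriteLine(best); // dog
```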
ScoreImageText(Tensor<T>, string)
Computes the log probability of a given text completion for an image.
T ScoreImageText(Tensor<T> image, string text)
Parameters
image (Tensor<T>): The preprocessed image tensor.
text (string): The text to score.
Returns
- T
Log probability of the text given the image.
Remarks
Useful for ranking candidate captions or performing discriminative tasks with a generative model.
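Ranking candidate captions comes down to scoring each one and sorting by log probability. The sketch below takes the scoring function as a delegate so it runs without a model. RankCaptions and the stand-in scorer are hypothetical; with an IFlamingoModel<double> instance, model.ScoreImageText has a matching shape and could be passed instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: scores every candidate caption for one image and
// returns them from most to least likely.
static List<(string Caption, double LogProb)> RankCaptions<TImage>(
    Func<TImage, string, double> scoreImageText,
    TImage image,
    IEnumerable<string> candidates)
{
    return candidates
        .Select(caption => (Caption: caption, LogProb: scoreImageText(image, caption)))
        .OrderByDescending(pair => pair.LogProb)
        .ToList();
}

// Stand-in scorer so the sketch runs on its own; it simply prefers captions
// that mention "dog". A string stands in for the image tensor.
Func<string, string, double> fakeScorer =
    (img, text) => text.Contains("dog") ? -0.5 : -4.0;

var ranked = RankCaptions(fakeScorer, "photo.png",
    new[] { "a cat on a sofa", "a dog in a park" });
Console.WriteLine(ranked[0].Caption); // a dog in a park
```

Log probabilities are negative, so the value closest to zero marks the most likely caption.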