Interface IFlamingoModel<T>

Namespace: AiDotNet.Interfaces

Assembly: AiDotNet.dll

Defines the contract for Flamingo-style models with in-context visual learning capabilities.

public interface IFlamingoModel<T> : IMultimodalEmbedding<T>

Type Parameters

T: The numeric type used for calculations.

Inherited Members

IMultimodalEmbedding<T>.EncodeText(string)

IMultimodalEmbedding<T>.EncodeTextBatch(IEnumerable<string>)

IMultimodalEmbedding<T>.EncodeImage(double[])

IMultimodalEmbedding<T>.EncodeImageBatch(IEnumerable<double[]>)

IMultimodalEmbedding<T>.ComputeSimilarity(Vector<T>, Vector<T>)

IMultimodalEmbedding<T>.ZeroShotClassify(double[], IEnumerable<string>)

IMultimodalEmbedding<T>.EmbeddingDimension

IMultimodalEmbedding<T>.MaxSequenceLength

IMultimodalEmbedding<T>.ImageSize

Remarks

Flamingo is a visual language model that excels at few-shot learning - it can learn new tasks from just a few examples provided in the context. It uses gated cross-attention layers interleaved with frozen LLM layers to integrate visual information.

For Beginners: Flamingo learns new visual tasks from examples you show it!

Key innovation - In-context learning:

Show Flamingo a few example image-text pairs
It learns the pattern from these examples
Apply the pattern to new images WITHOUT any training

Architecture:

Vision Encoder: Extracts image features, which the Perceiver Resampler then compresses into a fixed number of visual tokens
Gated Cross-Attention: Injects visual info into language model
Frozen LLM: Chinchilla-based language model

Example use case:

Show 3 examples: [image1] "A red apple" [image2] "A blue car" [image3] "A green tree"
Ask about new image: [image4] "What color?"
Flamingo infers from the examples that color is the attribute you care about and answers accordingly!

Why Flamingo is revolutionary:

No fine-tuning needed for new tasks
Adapts to new visual concepts on-the-fly
Strong performance with minimal examples

Properties

LanguageModelBackbone

Gets the language model backbone used for generation.

LanguageModelBackbone LanguageModelBackbone { get; }

Property Value

LanguageModelBackbone

Remarks

Flamingo typically uses Chinchilla as the backbone.

MaxImagesInContext

Gets the maximum number of images that can be processed in a single context.

int MaxImagesInContext { get; }

Property Value

int

NumPerceiverTokens

Gets the number of visual tokens per image after the Perceiver Resampler.

int NumPerceiverTokens { get; }

Property Value

int

Remarks

The Perceiver Resampler compresses visual features to a fixed number of tokens (typically 64) regardless of input image size. This enables efficient processing of multiple images in context.
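As a worked example of the arithmetic this implies, the sketch below multiplies the number of images in context by the per-image token count. VisualTokenBudget and TotalVisualTokens are hypothetical names used only for illustration, not part of AiDotNet; in practice the second argument would come from NumPerceiverTokens.

```csharp
using System;

public static class VisualTokenBudget
{
    // Total visual tokens produced for a context in which every image is
    // resampled to the same fixed number of tokens.
    public static int TotalVisualTokens(int imageCount, int numPerceiverTokens)
    {
        if (imageCount < 0)
            throw new ArgumentOutOfRangeException(nameof(imageCount));
        if (numPerceiverTokens <= 0)
            throw new ArgumentOutOfRangeException(nameof(numPerceiverTokens));

        // checked() turns a silent integer overflow into an exception.
        return checked(imageCount * numPerceiverTokens);
    }
}
```

With 4 images and 64 tokens per image this yields 256 visual tokens, whatever the resolution of the input images.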

Methods

DescribeVideo(IEnumerable<Tensor<T>>, string?, int)

Generates a text description of a video represented as a sequence of frames.

string DescribeVideo(IEnumerable<Tensor<T>> frames, string? prompt = null, int maxLength = 256)

Parameters

frames IEnumerable<Tensor<T>>: Sequence of video frame tensors.
prompt string?: Optional prompt to guide generation.
maxLength int: Maximum tokens to generate.

Returns

string: Generated video description.

Remarks

Flamingo can process multiple frames as separate images interleaved in context, enabling basic video understanding.
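Because each frame counts as a separate image, a long video may contain more frames than MaxImagesInContext allows. Whether an implementation subsamples frames internally is not specified here, so the following is only a sketch of one way a caller might pick evenly spaced frames beforehand. FrameSampling and SampleFrameIndices are hypothetical helpers, not part of AiDotNet.

```csharp
using System;
using System.Collections.Generic;

public static class FrameSampling
{
    // Returns up to maxFrames frame indices spread evenly over the video.
    // If the video already fits, every index is returned unchanged.
    public static List<int> SampleFrameIndices(int frameCount, int maxFrames)
    {
        if (frameCount < 0)
            throw new ArgumentOutOfRangeException(nameof(frameCount));
        if (maxFrames <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxFrames));

        var indices = new List<int>();
        if (frameCount <= maxFrames)
        {
            for (int i = 0; i < frameCount; i++) indices.Add(i);
            return indices;
        }

        for (int i = 0; i < maxFrames; i++)
        {
            // Split the video into maxFrames equal segments and take the
            // frame at the midpoint of segment i (integer arithmetic).
            long index = ((2L * i + 1) * frameCount) / (2L * maxFrames);
            indices.Add((int)index);
        }
        return indices;
    }
}
```

For a 10-frame clip and a limit of 5 images, this selects frames 1, 3, 5, 7 and 9; the caller would then pass only those frames to DescribeVideo.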

ExtractPerceiverFeatures(Tensor<T>)

Extracts visual features using the Perceiver Resampler.

Tensor<T> ExtractPerceiverFeatures(Tensor<T> image)

Parameters

image Tensor<T>: The preprocessed image tensor.

Returns

Tensor<T>: Resampled visual tokens with shape [numPerceiverTokens, hiddenDim].

Remarks

The Perceiver Resampler uses cross-attention with learnable queries to compress variable-length visual features into a fixed number of tokens.
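To make the fixed-output-size property concrete, here is a deliberately simplified sketch of cross-attention with a fixed set of queries, written over plain arrays. It is not the AiDotNet implementation: it uses a single head, no learned projections and no feed-forward layers, and the queries are passed in rather than learned. ResamplerSketch and Resample are hypothetical names.

```csharp
using System;

public static class ResamplerSketch
{
    // queries:  [numQueries][dim]  (stand-in for the learnable queries)
    // features: [numFeatures][dim] (variable-length visual features)
    // returns:  [numQueries][dim], independent of numFeatures
    public static double[][] Resample(double[][] queries, double[][] features)
    {
        if (features.Length == 0)
            throw new ArgumentException("At least one feature is required.", nameof(features));

        int dim = features[0].Length;
        double scale = 1.0 / Math.Sqrt(dim);
        var output = new double[queries.Length][];

        for (int q = 0; q < queries.Length; q++)
        {
            if (queries[q].Length != dim)
                throw new ArgumentException("Query and feature dimensions must match.", nameof(queries));

            // Scaled dot-product score between this query and every feature.
            var weights = new double[features.Length];
            double max = double.NegativeInfinity;
            for (int n = 0; n < features.Length; n++)
            {
                double dot = 0.0;
                for (int d = 0; d < dim; d++) dot += queries[q][d] * features[n][d];
                weights[n] = dot * scale;
                if (weights[n] > max) max = weights[n];
            }

            // Softmax over the features, shifted by the max for stability.
            double sum = 0.0;
            for (int n = 0; n < features.Length; n++)
            {
                weights[n] = Math.Exp(weights[n] - max);
                sum += weights[n];
            }

            // Each output token is an attention-weighted average of the features.
            var token = new double[dim];
            for (int n = 0; n < features.Length; n++)
            {
                double w = weights[n] / sum;
                for (int d = 0; d < dim; d++) token[d] += w * features[n][d];
            }
            output[q] = token;
        }
        return output;
    }
}
```

The number of rows in the result equals the number of queries, whether the input has 5 features or 5,000, which is the property the remark above describes.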

FewShotGenerate(IEnumerable<(Tensor<T> Image, string Text)>, Tensor<T>, string?, int)

Performs few-shot visual learning with interleaved image-text examples.

string FewShotGenerate(IEnumerable<(Tensor<T> Image, string Text)> examples, Tensor<T> queryImage, string? queryPrompt = null, int maxLength = 256)

Parameters

examples IEnumerable<(Tensor<T> Image, string Text)>: Few-shot examples as (image, text) pairs.
queryImage Tensor<T>: The new image to process.
queryPrompt string?: Optional prompt for the query (e.g., "What is this?").
maxLength int: Maximum tokens to generate.

Returns

string: The generated response based on learned pattern.

Remarks

For Beginners: Learn a task from examples, then apply it!

Example - Learning to identify dog breeds:

Examples:

[image of labrador] "This is a Labrador Retriever"
[image of poodle] "This is a Poodle"
[image of beagle] "This is a Beagle"

Query: [image of golden retriever] "This is a..."

Response: "Golden Retriever"

Flamingo learned the pattern from examples without any training!
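The dog-breed example can be pictured as one interleaved sequence in which each image slot is followed by its text and the final slot is left open for the model to complete. How FewShotGenerate lays out the sequence internally is not documented here, so the sketch below is only an assumption: it borrows the <image> placeholder that GenerateWithMultipleImages describes and separates examples with newlines. FewShotPromptSketch and Build are hypothetical names.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;

public static class FewShotPromptSketch
{
    // Assumed placeholder marking where an image's features are inserted.
    public const string ImagePlaceholder = "<image>";

    // Builds "<image>text\n" for each example, then an open slot for the query.
    public static string Build(IEnumerable<string> exampleTexts, string? queryPrompt = null)
    {
        var builder = new StringBuilder();
        foreach (string text in exampleTexts)
        {
            builder.Append(ImagePlaceholder).Append(text).Append('\n');
        }

        // The query image comes last; the optional prompt is the text the
        // model is expected to continue.
        builder.Append(ImagePlaceholder);
        if (!string.IsNullOrEmpty(queryPrompt))
        {
            builder.Append(queryPrompt);
        }
        return builder.ToString();
    }
}
```

Calling Build with the three breed captions and the query prompt "This is a" produces four image slots, the last one ending in the unfinished sentence the model should complete.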

FewShotImageRetrieval(IEnumerable<Tensor<T>>, string?, IEnumerable<Tensor<T>>, int)

Retrieves the most similar images from a database using few-shot context.

IEnumerable<(int Index, T Score)> FewShotImageRetrieval(IEnumerable<Tensor<T>> queryExamples, string? queryDescription, IEnumerable<Tensor<T>> candidateImages, int topK = 10)

Parameters

queryExamples IEnumerable<Tensor<T>>: Example images representing what you're looking for.
queryDescription string?: Optional text description of desired images.
candidateImages IEnumerable<Tensor<T>>: Database of images to search.
topK int: Number of results to return.

Returns

IEnumerable<(int Index, T Score)>: Indices of most similar images with scores.
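The scoring method behind this call is not specified on this page. One plausible approach, shown here purely as a sketch, is to average the embeddings of the query examples and rank candidates by cosine similarity to that mean. The embeddings are plain double arrays standing in for whatever the model produces, and RetrievalSketch and RankByMeanQuery are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RetrievalSketch
{
    // Ranks candidates by cosine similarity to the mean query embedding
    // and returns the topK best as (Index, Score) pairs, best first.
    public static List<(int Index, double Score)> RankByMeanQuery(
        IReadOnlyList<double[]> queryEmbeddings,
        IReadOnlyList<double[]> candidateEmbeddings,
        int topK = 10)
    {
        if (queryEmbeddings.Count == 0)
            throw new ArgumentException("At least one query example is required.", nameof(queryEmbeddings));

        // Average the query examples into a single prototype vector.
        int dim = queryEmbeddings[0].Length;
        var mean = new double[dim];
        foreach (double[] embedding in queryEmbeddings)
        {
            for (int d = 0; d < dim; d++) mean[d] += embedding[d] / queryEmbeddings.Count;
        }

        return candidateEmbeddings
            .Select((embedding, index) => (Index: index, Score: Cosine(mean, embedding)))
            .OrderByDescending(result => result.Score)
            .Take(topK)
            .ToList();
    }

    private static double Cosine(double[] a, double[] b)
    {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int d = 0; d < a.Length; d++)
        {
            dot += a[d] * b[d];
            normA += a[d] * a[d];
            normB += b[d] * b[d];
        }
        double denominator = Math.Sqrt(normA) * Math.Sqrt(normB);
        // Treat a zero vector as having no similarity to anything.
        return denominator == 0.0 ? 0.0 : dot / denominator;
    }
}
```

The returned Index values refer to positions in the candidate list, matching the shape of the tuples that FewShotImageRetrieval returns.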

FewShotVQA(IEnumerable<(Tensor<T> Image, string Question, string Answer)>, Tensor<T>, string)

Performs visual question answering with few-shot examples.

string FewShotVQA(IEnumerable<(Tensor<T> Image, string Question, string Answer)> examples, Tensor<T> queryImage, string question)

Parameters

examples IEnumerable<(Tensor<T> Image, string Question, string Answer)>: Example (image, question, answer) tuples.
queryImage Tensor<T>: The image to ask about.
question string: The question to answer.

Returns

string: The generated answer.

GenerateWithMultipleImages(IEnumerable<Tensor<T>>, string, int)

Generates text for multiple images interleaved in a single context.

string GenerateWithMultipleImages(IEnumerable<Tensor<T>> images, string prompt, int maxLength = 512)

Parameters

images IEnumerable<Tensor<T>>: Sequence of images to process.
prompt string: Prompt that may reference images using special tokens.
maxLength int: Maximum tokens to generate.

Returns

string: Generated text response.

Remarks

Supports prompts like: "<image> shows a cat and <image> shows a dog. Compare them." where <image> tokens are replaced with corresponding image features.
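Since each <image> token stands for one entry in the images sequence, a mismatch between the two is an easy mistake to make. How the interface reacts to a mismatch is not documented here, so a caller may want to check before calling. The sketch below counts placeholders in a prompt; ImagePlaceholderCheck and its methods are hypothetical helpers, not part of AiDotNet.

```csharp
using System;

public static class ImagePlaceholderCheck
{
    // Counts non-overlapping occurrences of the placeholder in the prompt.
    public static int CountPlaceholders(string prompt, string placeholder = "<image>")
    {
        if (prompt == null) throw new ArgumentNullException(nameof(prompt));
        if (string.IsNullOrEmpty(placeholder))
            throw new ArgumentException("Placeholder must not be empty.", nameof(placeholder));

        int count = 0;
        int position = 0;
        while ((position = prompt.IndexOf(placeholder, position, StringComparison.Ordinal)) >= 0)
        {
            count++;
            position += placeholder.Length; // continue after this match
        }
        return count;
    }

    // True when the prompt has exactly one placeholder per supplied image.
    public static bool MatchesImageCount(string prompt, int imageCount)
    {
        return CountPlaceholders(prompt) == imageCount;
    }
}
```

For the prompt quoted above, CountPlaceholders returns 2, so it pairs with a sequence of exactly two images.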

InContextClassify(IEnumerable<(Tensor<T> Image, string Label)>, Tensor<T>)

Performs in-context visual classification from a few labeled examples, without training a classifier.

Dictionary<string, T> InContextClassify(IEnumerable<(Tensor<T> Image, string Label)> labeledExamples, Tensor<T> queryImage)

Parameters

labeledExamples IEnumerable<(Tensor<T> Image, string Label)>: Examples with (image, label) pairs.
queryImage Tensor<T>: The image to classify.

Returns

Dictionary<string, T>: Dictionary mapping labels to confidence scores.

Remarks

For Beginners: Classify images using just a few examples!

Instead of training a classifier on thousands of images:

Show a few examples per class
Flamingo learns the categories
It can now classify new images

This is "few-shot classification": the categories are whatever your examples define!
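How the confidence scores are derived is not stated on this page. A common way to turn per-label log-probabilities from a generative model into confidences is a softmax over the labels, sketched below on plain doubles. That InContextClassify works this way is an assumption, and ConfidenceSketch and Normalize are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ConfidenceSketch
{
    // Converts per-label log scores into confidences that sum to 1.
    public static Dictionary<string, double> Normalize(
        IReadOnlyDictionary<string, double> logScores)
    {
        if (logScores.Count == 0)
            return new Dictionary<string, double>();

        // Subtract the maximum before exponentiating so that very negative
        // log-probabilities do not underflow to zero.
        double max = logScores.Values.Max();
        double sum = logScores.Values.Sum(value => Math.Exp(value - max));

        return logScores.ToDictionary(
            pair => pair.Key,
            pair => Math.Exp(pair.Value - max) / sum);
    }
}
```

If "cat" has log score ln 3 and "dog" has log score 0, the normalized confidences are 0.75 and 0.25.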

ScoreImageText(Tensor<T>, string)

Computes the log probability of a given text completion for an image.

T ScoreImageText(Tensor<T> image, string text)

Parameters

image Tensor<T>: The preprocessed image tensor.
text string: The text to score.

Returns

T: Log probability of the text given the image.

Remarks

Useful for ranking candidate captions or performing discriminative tasks with a generative model.
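When ranking captions of different lengths, raw log-probabilities tend to favor shorter text, because every extra token adds a negative term. One common remedy is to compare per-token averages. The sketch below assumes the caller already has a log-probability (for example from ScoreImageText) and a token count for each caption. The token counts are an assumed input, since this interface does not expose a tokenizer, and CaptionRanking and RankByPerTokenLogProb are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CaptionRanking
{
    // Orders captions by average log-probability per token, best first.
    public static List<(string Caption, double PerTokenLogProb)> RankByPerTokenLogProb(
        IEnumerable<(string Caption, double LogProb, int TokenCount)> scoredCaptions)
    {
        return scoredCaptions
            .Where(item => item.TokenCount > 0) // skip empty captions
            .Select(item => (Caption: item.Caption,
                             PerTokenLogProb: item.LogProb / item.TokenCount))
            .OrderByDescending(item => item.PerTokenLogProb)
            .ToList();
    }
}
```

A 2-token caption scoring -4.0 averages -2.0 per token, while a 6-token caption scoring -9.0 averages -1.5, so the longer caption ranks first even though its raw score is lower.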
