Interface IFlamingoModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for Flamingo-style models with in-context visual learning capabilities.

public interface IFlamingoModel<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Flamingo is a visual language model that excels at few-shot learning: it can pick up a new task from a handful of examples supplied in the context, with no weight updates. It interleaves gated cross-attention layers with frozen LLM layers to integrate visual information.

For Beginners: Flamingo learns new visual tasks from examples you show it!

Key innovation - In-context learning:

  • Show Flamingo a few example image-text pairs
  • It learns the pattern from these examples
  • Apply the pattern to new images WITHOUT any training

Architecture:

  1. Vision Encoder and Perceiver Resampler: Extract image features and compress them into a fixed number of visual tokens
  2. Gated Cross-Attention: Injects visual info into language model
  3. Frozen LLM: Chinchilla-based language model

Example use case:

  • Show 3 examples: [image1] "A red apple" [image2] "A blue car" [image3] "A green tree"
  • Ask about new image: [image4] "What color?"
  • Flamingo infers from the examples that you care about color and answers in the same style

Why Flamingo is revolutionary:

  • No fine-tuning needed for new tasks
  • Adapts to new visual concepts on-the-fly
  • Strong performance with minimal examples

Properties

LanguageModelBackbone

Gets the language model backbone used for generation.

LanguageModelBackbone LanguageModelBackbone { get; }

Property Value

LanguageModelBackbone

Remarks

The original Flamingo models use a frozen Chinchilla-family language model as the backbone.

MaxImagesInContext

Gets the maximum number of images that can be processed in a single context.

int MaxImagesInContext { get; }

Property Value

int

NumPerceiverTokens

Gets the number of visual tokens per image after the Perceiver Resampler.

int NumPerceiverTokens { get; }

Property Value

int

Remarks

The Perceiver Resampler compresses visual features to a fixed number of tokens (typically 64) regardless of input image size. This enables efficient processing of multiple images in context.
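Because every image contributes the same number of tokens, the visual part of the context grows linearly with the number of images. The sketch below shows that arithmetic. `VisualTokenBudget` is a hypothetical helper, not part of AiDotNet; in practice the inputs would come from NumPerceiverTokens and the number of images you plan to pass.

```csharp
using System;

public static class VisualTokenBudget
{
    // Total visual tokens contributed by imageCount images when each image
    // is resampled to a fixed tokensPerImage tokens.
    public static int TotalVisualTokens(int imageCount, int tokensPerImage)
    {
        if (imageCount < 0) throw new ArgumentOutOfRangeException(nameof(imageCount));
        if (tokensPerImage <= 0) throw new ArgumentOutOfRangeException(nameof(tokensPerImage));
        return checked(imageCount * tokensPerImage);
    }
}
```

For example, four images at 64 tokens each occupy 256 visual tokens regardless of the original image resolutions.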

Methods

DescribeVideo(IEnumerable<Tensor<T>>, string?, int)

Generates a description of a video represented as a sequence of frames.

string DescribeVideo(IEnumerable<Tensor<T>> frames, string? prompt = null, int maxLength = 256)

Parameters

frames IEnumerable<Tensor<T>>

Sequence of video frame tensors.

prompt string

Optional prompt to guide generation.

maxLength int

Maximum tokens to generate.

Returns

string

Generated video description.

Remarks

Flamingo can process multiple frames as separate images interleaved in context, enabling basic video understanding.
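Since each frame counts as an image in context, a long video can exceed MaxImagesInContext. The interface does not say how an implementation handles that case, so the following is only one possible caller-side approach: sample frames evenly before calling DescribeVideo. `FrameSampler` is a hypothetical helper and is generic over the frame type so it does not depend on how tensors are constructed.

```csharp
using System;
using System.Collections.Generic;

public static class FrameSampler
{
    // Picks at most maxFrames items, evenly spaced across the input, preserving order.
    public static List<TFrame> SampleEvenly<TFrame>(IReadOnlyList<TFrame> frames, int maxFrames)
    {
        if (frames == null) throw new ArgumentNullException(nameof(frames));
        if (maxFrames <= 0) throw new ArgumentOutOfRangeException(nameof(maxFrames));

        var result = new List<TFrame>();
        if (frames.Count <= maxFrames)
        {
            // Nothing to drop: keep every frame.
            result.AddRange(frames);
            return result;
        }

        for (int i = 0; i < maxFrames; i++)
        {
            // Map slot i to a source index spread across [0, Count - 1].
            int index = maxFrames == 1
                ? 0
                : (int)((long)i * (frames.Count - 1) / (maxFrames - 1));
            result.Add(frames[index]);
        }
        return result;
    }
}
```

A caller could pass `model.MaxImagesInContext` as `maxFrames` and hand the sampled list to DescribeVideo.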

ExtractPerceiverFeatures(Tensor<T>)

Extracts visual features using the Perceiver Resampler.

Tensor<T> ExtractPerceiverFeatures(Tensor<T> image)

Parameters

image Tensor<T>

The preprocessed image tensor.

Returns

Tensor<T>

Resampled visual tokens with shape [numPerceiverTokens, hiddenDim].

Remarks

The Perceiver Resampler uses cross-attention with learnable queries to compress variable-length visual features into a fixed number of tokens.
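The core idea can be sketched as a single cross-attention step in which a fixed set of query vectors attends over however many feature vectors the vision encoder produced. This is a simplified illustration on plain arrays, not the AiDotNet implementation: `ResamplerSketch` is a hypothetical name, and the sketch omits the learned key/value projections, multiple heads, stacked layers, and feed-forward blocks a real resampler would have.

```csharp
using System;

public static class ResamplerSketch
{
    // Single-head cross-attention: each query row attends over all feature rows.
    // queries: [numQueries][dim], features: [numFeatures][dim] -> output: [numQueries][dim]
    public static double[][] CrossAttend(double[][] queries, double[][] features)
    {
        int dim = queries[0].Length;
        double scale = 1.0 / Math.Sqrt(dim);
        var output = new double[queries.Length][];

        for (int q = 0; q < queries.Length; q++)
        {
            // Scaled dot-product scores between this query and every feature.
            var scores = new double[features.Length];
            double max = double.NegativeInfinity;
            for (int f = 0; f < features.Length; f++)
            {
                double dot = 0.0;
                for (int d = 0; d < dim; d++) dot += queries[q][d] * features[f][d];
                scores[f] = dot * scale;
                if (scores[f] > max) max = scores[f];
            }

            // Numerically stable softmax over the features.
            double sum = 0.0;
            for (int f = 0; f < features.Length; f++)
            {
                scores[f] = Math.Exp(scores[f] - max);
                sum += scores[f];
            }

            // Output row is the attention-weighted average of the feature rows.
            output[q] = new double[dim];
            for (int f = 0; f < features.Length; f++)
            {
                double w = scores[f] / sum;
                for (int d = 0; d < dim; d++) output[q][d] += w * features[f][d];
            }
        }
        return output;
    }
}
```

The output always has one row per query, so the number of visual tokens is set by the query count rather than by the number of input features.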

FewShotGenerate(IEnumerable<(Tensor<T> Image, string Text)>, Tensor<T>, string?, int)

Performs few-shot visual learning with interleaved image-text examples.

string FewShotGenerate(IEnumerable<(Tensor<T> Image, string Text)> examples, Tensor<T> queryImage, string? queryPrompt = null, int maxLength = 256)

Parameters

examples IEnumerable<(Tensor<T> Image, string Text)>

Few-shot examples as (image, text) pairs.

queryImage Tensor<T>

The new image to process.

queryPrompt string

Optional prompt for the query (e.g., "What is this?").

maxLength int

Maximum tokens to generate.

Returns

string

The generated response based on learned pattern.

Remarks

For Beginners: Learn a task from examples, then apply it!

Example: learning to identify dog breeds.

Examples:

  • [image of labrador] "This is a Labrador Retriever"
  • [image of poodle] "This is a Poodle"
  • [image of beagle] "This is a Beagle"

Query: [image of golden retriever] "This is a..."

Response: "Golden Retriever"

Flamingo learned the pattern from examples without any training!
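To make "interleaved" concrete, the dog-breed example can be laid out as a single text sequence with a placeholder wherever an image's visual tokens would go. This is a sketch under assumptions: `FewShotPromptSketch` is a hypothetical helper, and both the `<image>` placeholder and the one-example-per-line layout are illustrative choices, since the interface does not expose the internal prompt format used by FewShotGenerate.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;

public static class FewShotPromptSketch
{
    // Assumed placeholder marking where each image's visual tokens are inserted.
    public const string ImageToken = "<image>";

    // Lays out example texts and the optional query prompt, one image slot per entry.
    public static string Build(IEnumerable<string> exampleTexts, string? queryPrompt)
    {
        var sb = new StringBuilder();
        foreach (var text in exampleTexts)
        {
            sb.Append(ImageToken).Append(' ').Append(text).Append('\n');
        }

        // The query image comes last, followed by the text the model should continue.
        sb.Append(ImageToken);
        if (!string.IsNullOrEmpty(queryPrompt)) sb.Append(' ').Append(queryPrompt);
        return sb.ToString();
    }
}
```

The model then continues the final line, which is why ending the query prompt with an unfinished phrase such as "This is a" steers it toward the same answer format as the examples.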

FewShotImageRetrieval(IEnumerable<Tensor<T>>, string?, IEnumerable<Tensor<T>>, int)

Retrieves the most similar images from a database using few-shot context.

IEnumerable<(int Index, T Score)> FewShotImageRetrieval(IEnumerable<Tensor<T>> queryExamples, string? queryDescription, IEnumerable<Tensor<T>> candidateImages, int topK = 10)

Parameters

queryExamples IEnumerable<Tensor<T>>

Example images representing what you're looking for.

queryDescription string

Optional text description of desired images.

candidateImages IEnumerable<Tensor<T>>

Database of images to search.

topK int

Number of results to return.

Returns

IEnumerable<(int Index, T Score)>

Indices of most similar images with scores.
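The interface does not specify how similarity is computed. As one plausible illustration, the sketch below averages the embeddings of the query examples and ranks candidates by cosine similarity. `RetrievalSketch` is a hypothetical helper working on plain `double[]` embeddings; it ignores the optional text description and is not the AiDotNet implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RetrievalSketch
{
    // Cosine similarity with a small epsilon to avoid division by zero.
    public static double Cosine(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb) + 1e-12);
    }

    // Averages the query example embeddings, then ranks candidates by cosine similarity.
    public static List<(int Index, double Score)> TopK(
        IReadOnlyList<double[]> queryEmbeddings,
        IReadOnlyList<double[]> candidateEmbeddings,
        int topK)
    {
        int dim = queryEmbeddings[0].Length;
        var mean = new double[dim];
        foreach (var q in queryEmbeddings)
            for (int d = 0; d < dim; d++) mean[d] += q[d] / queryEmbeddings.Count;

        return candidateEmbeddings
            .Select((c, i) => (Index: i, Score: Cosine(mean, c)))
            .OrderByDescending(r => r.Score)
            .Take(topK)
            .ToList();
    }
}
```

The returned indices refer to positions in the candidate list, matching the `(int Index, T Score)` shape of the method's return value.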

FewShotVQA(IEnumerable<(Tensor<T> Image, string Question, string Answer)>, Tensor<T>, string)

Performs visual question answering with few-shot examples.

string FewShotVQA(IEnumerable<(Tensor<T> Image, string Question, string Answer)> examples, Tensor<T> queryImage, string question)

Parameters

examples IEnumerable<(Tensor<T> Image, string Question, string Answer)>

Example (image, question, answer) tuples.

queryImage Tensor<T>

The image to ask about.

question string

The question to answer.

Returns

string

The generated answer.

GenerateWithMultipleImages(IEnumerable<Tensor<T>>, string, int)

Generates text for multiple images interleaved in a single context.

string GenerateWithMultipleImages(IEnumerable<Tensor<T>> images, string prompt, int maxLength = 512)

Parameters

images IEnumerable<Tensor<T>>

Sequence of images to process.

prompt string

Prompt that may reference images using special tokens.

maxLength int

Maximum tokens to generate.

Returns

string

Generated text response.

Remarks

Supports prompts like: "<image> shows a cat and <image> shows a dog. Compare them." where <image> tokens are replaced with corresponding image features.
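Since each `<image>` token is paired with an image, it can help to check that the prompt and the image list agree before calling the method. The interface does not state what happens on a mismatch, so this is a defensive caller-side sketch; `ImageTokenCheck` is a hypothetical helper.

```csharp
using System;

public static class ImageTokenCheck
{
    public const string ImageToken = "<image>";

    // Counts non-overlapping occurrences of "<image>" in the prompt.
    public static int CountImageTokens(string prompt)
    {
        if (prompt == null) throw new ArgumentNullException(nameof(prompt));

        int count = 0, index = 0;
        while ((index = prompt.IndexOf(ImageToken, index, StringComparison.Ordinal)) >= 0)
        {
            count++;
            index += ImageToken.Length;
        }
        return count;
    }
}
```

A caller might compare the result with the number of images and raise an error early if they differ.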

InContextClassify(IEnumerable<(Tensor<T> Image, string Label)>, Tensor<T>)

Performs in-context visual classification from a few labeled examples, without training a classifier.

Dictionary<string, T> InContextClassify(IEnumerable<(Tensor<T> Image, string Label)> labeledExamples, Tensor<T> queryImage)

Parameters

labeledExamples IEnumerable<(Tensor<T> Image, string Label)>

Examples with (image, label) pairs.

queryImage Tensor<T>

The image to classify.

Returns

Dictionary<string, T>

Dictionary mapping labels to confidence scores.

Remarks

For Beginners: Classify images using just a few examples!

Instead of training a classifier on thousands of images:

  1. Show a few examples per class
  2. Flamingo learns the categories
  3. It can now classify new images

This is "few-shot classification", and it works with any set of categories you can provide examples for.
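How the confidence scores are derived is not specified by the interface. One common way to get per-label confidences from a generative model is to score each candidate label as a text completion and normalize the scores with a softmax. The sketch below shows only that normalization step; `LabelScoring` is a hypothetical helper, and the assumption that the inputs are log scores is mine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LabelScoring
{
    // Converts per-label log scores into probabilities that sum to 1 (softmax).
    public static Dictionary<string, double> Normalize(IReadOnlyDictionary<string, double> logScores)
    {
        // Subtract the maximum before exponentiating for numerical stability.
        double max = logScores.Values.Max();
        var exp = logScores.ToDictionary(kv => kv.Key, kv => Math.Exp(kv.Value - max));
        double sum = exp.Values.Sum();
        return exp.ToDictionary(kv => kv.Key, kv => kv.Value / sum);
    }
}
```

With this normalization, a label whose log score is higher by ln(3) ends up three times as probable as the other.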

ScoreImageText(Tensor<T>, string)

Computes the log probability of a given text completion for an image.

T ScoreImageText(Tensor<T> image, string text)

Parameters

image Tensor<T>

The preprocessed image tensor.

text string

The text to score.

Returns

T

Log probability of the text given the image.

Remarks

Useful for ranking candidate captions or performing discriminative tasks with a generative model.
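Ranking candidate captions then amounts to scoring each one and sorting by log probability, highest first. In the sketch below the `score` delegate stands in for a call such as `text => model.ScoreImageText(image, text)` converted to `double`; `CaptionRanking` is a hypothetical helper, not part of AiDotNet.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CaptionRanking
{
    // Scores every caption with the supplied function and orders them
    // from most to least likely (higher log probability first).
    public static List<(string Caption, double LogProb)> Rank(
        IEnumerable<string> captions,
        Func<string, double> score)
    {
        return captions
            .Select(c => (Caption: c, LogProb: score(c)))
            .OrderByDescending(r => r.LogProb)
            .ToList();
    }
}
```

Note that raw log probabilities tend to favor shorter texts, so comparing captions of very different lengths may call for some form of length normalization.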