Interface ILLaVAModel<T>

Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll

Defines the contract for LLaVA (Large Language and Vision Assistant) models.

public interface ILLaVAModel<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

LLaVA connects a vision encoder (like CLIP ViT) with a large language model (like LLaMA/Vicuna) through a simple projection layer, enabling visual instruction-following and conversational AI about images.

For Beginners: LLaVA is like giving eyes to ChatGPT!

Architecture:

  1. Vision Encoder (CLIP ViT): Converts images to feature vectors
  2. Projection Layer: Maps visual features to LLM's text embedding space
  3. Large Language Model (LLaMA/Vicuna): Generates responses

Key capabilities:

  • Visual conversations: "What's in this image?" followed by "What color is the car?"
  • Visual reasoning: Understanding relationships, counting, spatial awareness
  • Instruction following: "Describe this image as if you were a poet"
  • Multi-turn dialogue: Context-aware conversations about images

Why LLaVA is popular:

  • Simple but effective architecture
  • Open-source and reproducible
  • Strong performance on visual understanding benchmarks
  • Efficient training with visual instruction tuning
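
The three architecture stages above map directly onto this interface: ExtractVisualFeatures produces the vision encoder's patch features, ProjectToLanguageSpace maps them into the language model's embedding space, and Generate runs the full pipeline to produce text. Below is a minimal sketch, assuming you already have an ILLaVAModel<float> implementation and a preprocessed image tensor; the concrete model class and the namespaces for Tensor<T> are not specified by this interface and must be supplied by your setup.

using System;
using AiDotNet.Interfaces;

public static class LLaVAPipelineSketch
{
    public static void Describe(ILLaVAModel<float> model, Tensor<float> image)
    {
        // Stage 1: vision encoder output, shape [numPatches, hiddenDim]
        // (e.g. 576 patch tokens for CLIP ViT-L/14 at 336x336).
        Tensor<float> visualFeatures = model.ExtractVisualFeatures(image);

        // Stage 2: projection into the LLM's text embedding space.
        Tensor<float> projected = model.ProjectToLanguageSpace(visualFeatures);

        // Stage 3: the language model backbone generates the answer.
        // Generate performs all three stages internally; the explicit calls
        // above only show where each stage lives on the interface.
        string answer = model.Generate(image, "What is happening in this image?");

        Console.WriteLine($"Encoder: {model.VisionEncoderType}, backbone: {model.LanguageModelBackbone}, visual tokens: {model.NumVisualTokens}");
        Console.WriteLine(answer);
    }
}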

Properties

LanguageModelBackbone

Gets the language model backbone used for generation.

LanguageModelBackbone LanguageModelBackbone { get; }

Property Value

LanguageModelBackbone

Remarks

Common backbones include LLaMA, Vicuna, and Mistral.

NumVisualTokens

Gets the maximum number of visual tokens used per image.

int NumVisualTokens { get; }

Property Value

int

Remarks

The number of patch tokens extracted from the vision encoder. For CLIP ViT-L/14 at 336x336, this is typically 576 tokens: 336 / 14 = 24 patches per side, and 24 x 24 = 576.

VisionEncoderType

Gets the vision encoder type.

string VisionEncoderType { get; }

Property Value

string

Remarks

Typically CLIP ViT-L/14 or similar vision transformer models.

Methods

Chat(Tensor<T>, IEnumerable<(string Role, string Content)>, string, int, double)

Continues a multi-turn conversation about an image.

string Chat(Tensor<T> image, IEnumerable<(string Role, string Content)> conversationHistory, string userMessage, int maxLength = 512, double temperature = 0.7)

Parameters

image Tensor<T>

The preprocessed image tensor.

conversationHistory IEnumerable<(string Role, string Content)>

Previous turns as (role, content) pairs.

userMessage string

The new user message.

maxLength int

Maximum tokens to generate.

temperature double

Sampling temperature.

Returns

string

The assistant's response.

Remarks

Enables multi-turn visual dialogue where context is preserved across turns.

For Beginners: Have a conversation about an image!

Example conversation:

  User: "What's in this image?"
  Assistant: "A dog playing in a park with a red ball."
  User: "What breed is the dog?"
  Assistant: "It appears to be a Golden Retriever based on its golden fur and size."
  User: "Is it a sunny day?"
  Assistant: "Yes, there are shadows indicating bright sunlight and clear skies."
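
A minimal sketch of driving a conversation like the one above through Chat, assuming an ILLaVAModel<float> instance and a preprocessed image. The "user"/"assistant" role labels are an assumption; check the conventions of your concrete implementation.

using System;
using System.Collections.Generic;
using AiDotNet.Interfaces;

public static class VisualChatSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> image)
    {
        var history = new List<(string Role, string Content)>();

        // First turn: empty history, default maxLength and temperature.
        string first = model.Chat(image, history, "What's in this image?");
        history.Add(("user", "What's in this image?"));
        history.Add(("assistant", first));

        // Second turn: the history lets the model resolve references like "the dog".
        string second = model.Chat(image, history, "What breed is the dog?", maxLength: 256, temperature: 0.5);
        Console.WriteLine(second);
    }
}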

CompareImages(Tensor<T>, Tensor<T>, IEnumerable<string>?)

Compares two images and describes their differences.

string CompareImages(Tensor<T> image1, Tensor<T> image2, IEnumerable<string>? aspectsToCompare = null)

Parameters

image1 Tensor<T>

First preprocessed image tensor.

image2 Tensor<T>

Second preprocessed image tensor.

aspectsToCompare IEnumerable<string>

Optional specific aspects to compare.

Returns

string

A description of the differences between the images.
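
A short usage sketch, assuming two preprocessed image tensors; the aspect strings are free-form text and purely illustrative.

using System;
using AiDotNet.Interfaces;

public static class ImageComparisonSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> before, Tensor<float> after)
    {
        // Open-ended comparison: the model decides which differences matter.
        Console.WriteLine(model.CompareImages(before, after));

        // Focused comparison restricted to specific aspects.
        Console.WriteLine(model.CompareImages(before, after, new[] { "lighting", "number of people", "background objects" }));
    }
}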

DescribeRegions(Tensor<T>, IEnumerable<Vector<T>>)

Generates a detailed description of specific regions in an image.

IEnumerable<string> DescribeRegions(Tensor<T> image, IEnumerable<Vector<T>> regions)

Parameters

image Tensor<T>

The preprocessed image tensor.

regions IEnumerable<Vector<T>>

List of bounding boxes [x1, y1, x2, y2] to describe.

Returns

IEnumerable<string>

Descriptions for each region.
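
A small sketch that pairs each returned description with its region index, assuming the caller already has the bounding boxes as Vector<float> values built with whatever constructor the library provides.

using System;
using System.Collections.Generic;
using AiDotNet.Interfaces;

public static class RegionDescriptionSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> image, IReadOnlyList<Vector<float>> regions)
    {
        // Each region is an [x1, y1, x2, y2] box; descriptions come back in the same order.
        int index = 0;
        foreach (string description in model.DescribeRegions(image, regions))
        {
            Console.WriteLine($"Region {index++}: {description}");
        }
    }
}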

ExtractVisualFeatures(Tensor<T>)

Extracts visual features before projection to LLM space.

Tensor<T> ExtractVisualFeatures(Tensor<T> image)

Parameters

image Tensor<T>

The preprocessed image tensor.

Returns

Tensor<T>

Visual feature tensor with shape [numPatches, hiddenDim].

Remarks

These are the raw CLIP features before being projected to match the LLM's embedding dimension. Useful for analysis or custom processing.
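
A brief sketch of pulling out the raw, pre-projection features, for example to cache or inspect them; the model and image are assumed to exist, and the shape comment reflects the typical CLIP ViT-L/14 setup described above.

using AiDotNet.Interfaces;

public static class VisualFeatureSketch
{
    // Returns the raw CLIP features so they can be cached or analyzed separately.
    public static Tensor<float> ExtractRaw(ILLaVAModel<float> model, Tensor<float> image)
    {
        // Shape [numPatches, hiddenDim], typically 576 x 1024 for CLIP ViT-L/14 at 336x336.
        Tensor<float> clipFeatures = model.ExtractVisualFeatures(image);

        // Note: these are not yet in the LLM's embedding space; pass them
        // through ProjectToLanguageSpace before feeding them to the language model.
        return clipFeatures;
    }
}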

Generate(Tensor<T>, string, int, double, double)

Generates a response to a text prompt about an image.

string Generate(Tensor<T> image, string prompt, int maxLength = 512, double temperature = 0.7, double topP = 0.9)

Parameters

image Tensor<T>

The preprocessed image tensor.

prompt string

The user's question or instruction about the image.

maxLength int

Maximum number of tokens to generate.

temperature double

Sampling temperature (0 = deterministic, higher = more creative).

topP double

Nucleus sampling probability threshold.

Returns

string

The generated response.

Remarks

For Beginners: Ask any question about an image!

Examples:

  • "What is happening in this image?" → Detailed scene description
  • "How many people are in the photo?" → Counting and recognition
  • "What emotion does the person show?" → Emotional understanding
  • "Write a caption for social media" → Creative generation

GenerateMultiple(Tensor<T>, string, int, double)

Generates multiple diverse responses for the same prompt.

IEnumerable<(string Response, T Score)> GenerateMultiple(Tensor<T> image, string prompt, int numResponses = 5, double temperature = 0.9)

Parameters

image Tensor<T>

The preprocessed image tensor.

prompt string

The user's question or instruction.

numResponses int

Number of different responses to generate.

temperature double

Sampling temperature for diversity.

Returns

IEnumerable<(string Response, T Score)>

Collection of generated responses with their log probabilities.
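
A short sketch that generates several candidates and keeps the highest-scoring one; the scores are log probabilities, so larger (less negative) values indicate responses the model considers more likely. An ILLaVAModel<float> instance and a preprocessed image are assumed.

using System;
using System.Linq;
using AiDotNet.Interfaces;

public static class DiverseResponseSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> image)
    {
        var candidates = model.GenerateMultiple(image, "Describe this image in one sentence.", numResponses: 5, temperature: 0.9).ToList();

        // Pick the response the model itself scores highest.
        var best = candidates.OrderByDescending(c => c.Score).First();
        Console.WriteLine($"Best ({best.Score}): {best.Response}");

        foreach (var (response, score) in candidates)
        {
            Console.WriteLine($"  {score}: {response}");
        }
    }
}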

GroundObject(Tensor<T>, string)

Performs visual grounding to locate objects described by text.

Vector<T> GroundObject(Tensor<T> image, string description)

Parameters

image Tensor<T>

The preprocessed image tensor.

description string

Description of the object to locate.

Returns

Vector<T>

Bounding box coordinates [x1, y1, x2, y2] normalized to [0, 1].

Remarks

For Beginners: Find where something is in an image!

Example:

  • Description: "the red car on the left"
  • Returns: [0.1, 0.3, 0.4, 0.7] representing the car's bounding box
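
A minimal sketch that converts the normalized box back to pixel coordinates; the indexer access on Vector<float> and the width/height parameters are assumptions for illustration.

using System;
using AiDotNet.Interfaces;

public static class GroundingSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> image, int imageWidth, int imageHeight)
    {
        // Normalized [x1, y1, x2, y2], each value in [0, 1].
        Vector<float> box = model.GroundObject(image, "the red car on the left");

        // Scale back to pixel coordinates (assumes Vector<float> supports indexing).
        int x1 = (int)(box[0] * imageWidth);
        int y1 = (int)(box[1] * imageHeight);
        int x2 = (int)(box[2] * imageWidth);
        int y2 = (int)(box[3] * imageHeight);

        Console.WriteLine($"Found at pixels ({x1}, {y1}) to ({x2}, {y2})");
    }
}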

ProjectToLanguageSpace(Tensor<T>)

Projects visual features to the LLM's embedding space.

Tensor<T> ProjectToLanguageSpace(Tensor<T> visualFeatures)

Parameters

visualFeatures Tensor<T>

Visual features from ExtractVisualFeatures.

Returns

Tensor<T>

Projected features matching LLM embedding dimension.
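
A brief sketch of pairing this method with ExtractVisualFeatures: the projection changes the last dimension from the vision encoder's hidden size to the LLM's embedding size while keeping one row per visual token. The model and image are assumed to exist.

using AiDotNet.Interfaces;

public static class ProjectionSketch
{
    public static Tensor<float> ToLanguageSpace(ILLaVAModel<float> model, Tensor<float> image)
    {
        // [numPatches, visionHiddenDim] -> [numPatches, llmEmbeddingDim]
        Tensor<float> visualFeatures = model.ExtractVisualFeatures(image);
        return model.ProjectToLanguageSpace(visualFeatures);
    }
}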