Interface ILLaVAModel<T>

Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll

Defines the contract for LLaVA (Large Language and Vision Assistant) models.

public interface ILLaVAModel<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

LLaVA connects a vision encoder (like CLIP ViT) with a large language model (like LLaMA/Vicuna) through a simple projection layer, enabling visual instruction-following and conversational AI about images.

For Beginners: LLaVA is like giving eyes to ChatGPT!

Architecture:

  1. Vision Encoder (CLIP ViT): Converts images to feature vectors
  2. Projection Layer: Maps visual features to LLM's text embedding space
  3. Large Language Model (LLaMA/Vicuna): Generates responses

Key capabilities:

  • Visual conversations: "What's in this image?" followed by "What color is the car?"
  • Visual reasoning: Understanding relationships, counting, spatial awareness
  • Instruction following: "Describe this image as if you were a poet"
  • Multi-turn dialogue: Context-aware conversations about images

Why LLaVA is popular:

  • Simple but effective architecture
  • Open-source and reproducible
  • Strong performance on visual understanding benchmarks
  • Efficient training with visual instruction tuning
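
The three architecture stages above map directly onto this interface: ExtractVisualFeatures produces the vision encoder's patch features, ProjectToLanguageSpace maps them into the language model's embedding space, and Generate runs the full pipeline to produce text. Below is a minimal sketch, assuming you already have an ILLaVAModel<float> implementation and a preprocessed image tensor; the concrete model class and the namespaces for Tensor<T> are not specified by this interface and must be supplied by your setup.

using System;
using AiDotNet.Interfaces;

public static class LLaVAPipelineSketch
{
    public static void Describe(ILLaVAModel<float> model, Tensor<float> image)
    {
        // Stage 1: vision encoder output, shape [numPatches, hiddenDim]
        // (e.g. 576 patch tokens for CLIP ViT-L/14 at 336x336).
        Tensor<float> visualFeatures = model.ExtractVisualFeatures(image);

        // Stage 2: projection into the LLM's text embedding space.
        Tensor<float> projected = model.ProjectToLanguageSpace(visualFeatures);

        // Stage 3: the language model backbone generates the answer.
        // Generate performs all three stages internally; the explicit calls
        // above only show where each stage lives on the interface.
        string answer = model.Generate(image, "What is happening in this image?");

        Console.WriteLine($"Encoder: {model.VisionEncoderType}, backbone: {model.LanguageModelBackbone}, visual tokens: {model.NumVisualTokens}");
        Console.WriteLine(answer);
    }
}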

Properties

LanguageModelBackbone

Gets the language model backbone used for generation.

LanguageModelBackbone LanguageModelBackbone { get; }

Property Value

LanguageModelBackbone

Remarks

Common backbones include LLaMA, Vicuna, and Mistral.

NumVisualTokens

Gets the maximum number of visual tokens used per image.

int NumVisualTokens { get; }

Property Value

int

Remarks

The number of patch tokens extracted from the vision encoder. For CLIP ViT-L/14 at 336x336, this is typically 576 tokens: 336 / 14 = 24 patches per side, and 24 x 24 = 576.

VisionEncoderType

Gets the vision encoder type.

string VisionEncoderType { get; }

Property Value

string

Remarks

Typically CLIP ViT-L/14 or similar vision transformer models.

Methods

Chat(Tensor<T>, IEnumerable<(string Role, string Content)>, string, int, double)

Continues a multi-turn conversation about an image.

string Chat(Tensor<T> image, IEnumerable<(string Role, string Content)> conversationHistory, string userMessage, int maxLength = 512, double temperature = 0.7)

Parameters

image Tensor<T>

The preprocessed image tensor.

conversationHistory IEnumerable<(string Role, string Content)>

Previous turns as (role, content) pairs.

userMessage string

The new user message.

maxLength int

Maximum tokens to generate.

temperature double

Sampling temperature.

Returns

string

The assistant's response.

Remarks

Enables multi-turn visual dialogue where context is preserved across turns.

For Beginners: Have a conversation about an image!

Example conversation:

  User: "What's in this image?"
  Assistant: "A dog playing in a park with a red ball."
  User: "What breed is the dog?"
  Assistant: "It appears to be a Golden Retriever based on its golden fur and size."
  User: "Is it a sunny day?"
  Assistant: "Yes, there are shadows indicating bright sunlight and clear skies."
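
A minimal sketch of driving a conversation like the one above through Chat, assuming an ILLaVAModel<float> instance and a preprocessed image. The "user"/"assistant" role labels are an assumption; check the conventions of your concrete implementation.

using System;
using System.Collections.Generic;
using AiDotNet.Interfaces;

public static class VisualChatSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> image)
    {
        var history = new List<(string Role, string Content)>();

        // First turn: empty history, default maxLength and temperature.
        string first = model.Chat(image, history, "What's in this image?");
        history.Add(("user", "What's in this image?"));
        history.Add(("assistant", first));

        // Second turn: the history lets the model resolve references like "the dog".
        string second = model.Chat(image, history, "What breed is the dog?", maxLength: 256, temperature: 0.5);
        Console.WriteLine(second);
    }
}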

CompareImages(Tensor<T>, Tensor<T>, IEnumerable<string>?)

Compares two images and describes their differences.

string CompareImages(Tensor<T> image1, Tensor<T> image2, IEnumerable<string>? aspectsToCompare = null)

Parameters

image1 Tensor<T>

First preprocessed image tensor.

image2 Tensor<T>

Second preprocessed image tensor.

aspectsToCompare IEnumerable<string>

Optional specific aspects to compare.

Returns

string

A description of the differences between the images.
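
A short usage sketch, assuming two preprocessed image tensors; the aspect strings are free-form text and purely illustrative.

using System;
using AiDotNet.Interfaces;

public static class ImageComparisonSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> before, Tensor<float> after)
    {
        // Open-ended comparison: the model decides which differences matter.
        Console.WriteLine(model.CompareImages(before, after));

        // Focused comparison restricted to specific aspects.
        Console.WriteLine(model.CompareImages(before, after, new[] { "lighting", "number of people", "background objects" }));
    }
}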

DescribeRegions(Tensor<T>, IEnumerable<Vector<T>>)

Generates a detailed description of specific regions in an image.

IEnumerable<string> DescribeRegions(Tensor<T> image, IEnumerable<Vector<T>> regions)

Parameters

image Tensor<T>

The preprocessed image tensor.

regions IEnumerable<Vector<T>>

List of bounding boxes [x1, y1, x2, y2] to describe.

Returns

IEnumerable<string>

Descriptions for each region.
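
A small sketch that pairs each returned description with its region index, assuming the caller already has the bounding boxes as Vector<float> values built with whatever constructor the library provides.

using System;
using System.Collections.Generic;
using AiDotNet.Interfaces;

public static class RegionDescriptionSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> image, IReadOnlyList<Vector<float>> regions)
    {
        // Each region is an [x1, y1, x2, y2] box; descriptions come back in the same order.
        int index = 0;
        foreach (string description in model.DescribeRegions(image, regions))
        {
            Console.WriteLine($"Region {index++}: {description}");
        }
    }
}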

ExtractVisualFeatures(Tensor<T>)

Extracts visual features before projection to LLM space.

Tensor<T> ExtractVisualFeatures(Tensor<T> image)

Parameters

image Tensor<T>

The preprocessed image tensor.

Returns

Tensor<T>

Visual feature tensor with shape [numPatches, hiddenDim].

Remarks

These are the raw CLIP features before being projected to match the LLM's embedding dimension. Useful for analysis or custom processing.
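
A brief sketch of pulling out the raw, pre-projection features, for example to cache or inspect them; the model and image are assumed to exist, and the shape comment reflects the typical CLIP ViT-L/14 setup described above.

using AiDotNet.Interfaces;

public static class VisualFeatureSketch
{
    // Returns the raw CLIP features so they can be cached or analyzed separately.
    public static Tensor<float> ExtractRaw(ILLaVAModel<float> model, Tensor<float> image)
    {
        // Shape [numPatches, hiddenDim], typically 576 x 1024 for CLIP ViT-L/14 at 336x336.
        Tensor<float> clipFeatures = model.ExtractVisualFeatures(image);

        // Note: these are not yet in the LLM's embedding space; pass them
        // through ProjectToLanguageSpace before feeding them to the language model.
        return clipFeatures;
    }
}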

Generate(Tensor<T>, string, int, double, double)

Generates a response to a text prompt about an image.

string Generate(Tensor<T> image, string prompt, int maxLength = 512, double temperature = 0.7, double topP = 0.9)

Parameters

image Tensor<T>

The preprocessed image tensor.

prompt string

The user's question or instruction about the image.

maxLength int

Maximum number of tokens to generate.

temperature double

Sampling temperature (0 = deterministic, higher = more creative).

topP double

Nucleus sampling probability threshold.

Returns

string

The generated response.

Remarks

For Beginners: Ask any question about an image!

Examples:

  • "What is happening in this image?" → Detailed scene description
  • "How many people are in the photo?" → Counting and recognition
  • "What emotion does the person show?" → Emotional understanding
  • "Write a caption for social media" → Creative generation

GenerateMultiple(Tensor<T>, string, int, double)

Generates multiple diverse responses for the same prompt.

IEnumerable<(string Response, T Score)> GenerateMultiple(Tensor<T> image, string prompt, int numResponses = 5, double temperature = 0.9)

Parameters

image Tensor<T>

The preprocessed image tensor.

prompt string

The user's question or instruction.

numResponses int

Number of different responses to generate.

temperature double

Sampling temperature for diversity.

Returns

IEnumerable<(string Response, T Score)>

Collection of generated responses with their log probabilities.
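
A short sketch that generates several candidates and keeps the highest-scoring one; the scores are log probabilities, so larger (less negative) values indicate responses the model considers more likely. An ILLaVAModel<float> instance and a preprocessed image are assumed.

using System;
using System.Linq;
using AiDotNet.Interfaces;

public static class DiverseResponseSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> image)
    {
        var candidates = model.GenerateMultiple(image, "Describe this image in one sentence.", numResponses: 5, temperature: 0.9).ToList();

        // Pick the response the model itself scores highest.
        var best = candidates.OrderByDescending(c => c.Score).First();
        Console.WriteLine($"Best ({best.Score}): {best.Response}");

        foreach (var (response, score) in candidates)
        {
            Console.WriteLine($"  {score}: {response}");
        }
    }
}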

GroundObject(Tensor<T>, string)

Performs visual grounding to locate objects described by text.

Vector<T> GroundObject(Tensor<T> image, string description)

Parameters

image Tensor<T>

The preprocessed image tensor.

description string

Description of the object to locate.

Returns

Vector<T>

Bounding box coordinates [x1, y1, x2, y2] normalized to [0, 1].

Remarks

For Beginners: Find where something is in an image!

Example:

  • Description: "the red car on the left"
  • Returns: [0.1, 0.3, 0.4, 0.7] representing the car's bounding box
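
A minimal sketch that converts the normalized box back to pixel coordinates; the indexer access on Vector<float> and the width/height parameters are assumptions for illustration.

using System;
using AiDotNet.Interfaces;

public static class GroundingSketch
{
    public static void Run(ILLaVAModel<float> model, Tensor<float> image, int imageWidth, int imageHeight)
    {
        // Normalized [x1, y1, x2, y2], each value in [0, 1].
        Vector<float> box = model.GroundObject(image, "the red car on the left");

        // Scale back to pixel coordinates (assumes Vector<float> supports indexing).
        int x1 = (int)(box[0] * imageWidth);
        int y1 = (int)(box[1] * imageHeight);
        int x2 = (int)(box[2] * imageWidth);
        int y2 = (int)(box[3] * imageHeight);

        Console.WriteLine($"Found at pixels ({x1}, {y1}) to ({x2}, {y2})");
    }
}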

ProjectToLanguageSpace(Tensor<T>)

Projects visual features to the LLM's embedding space.

Tensor<T> ProjectToLanguageSpace(Tensor<T> visualFeatures)

Parameters

visualFeatures Tensor<T>

Visual features from ExtractVisualFeatures.

Returns

Tensor<T>

Projected features matching LLM embedding dimension.
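
A brief sketch of pairing this method with ExtractVisualFeatures: the projection changes the last dimension from the vision encoder's hidden size to the LLM's embedding size while keeping one row per visual token. The model and image are assumed to exist.

using AiDotNet.Interfaces;

public static class ProjectionSketch
{
    public static Tensor<float> ToLanguageSpace(ILLaVAModel<float> model, Tensor<float> image)
    {
        // [numPatches, visionHiddenDim] -> [numPatches, llmEmbeddingDim]
        Tensor<float> visualFeatures = model.ExtractVisualFeatures(image);
        return model.ProjectToLanguageSpace(visualFeatures);
    }
}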