Interface ILLaVAModel<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for LLaVA (Large Language and Vision Assistant) models.
public interface ILLaVAModel<T> : IMultimodalEmbedding<T>
Type Parameters
T: The numeric type used for calculations.
- Inherited Members
Remarks
LLaVA connects a vision encoder (like CLIP ViT) with a large language model (like LLaMA/Vicuna) through a simple projection layer, enabling visual instruction-following and conversational AI about images.
For Beginners: LLaVA is like giving eyes to ChatGPT!
Architecture:
- Vision Encoder (CLIP ViT): Converts images to feature vectors
- Projection Layer: Maps visual features to LLM's text embedding space
- Large Language Model (LLaMA/Vicuna): Generates responses
Key capabilities:
- Visual conversations: "What's in this image?" followed by "What color is the car?"
- Visual reasoning: Understanding relationships, counting, spatial awareness
- Instruction following: "Describe this image as if you were a poet"
- Multi-turn dialogue: Context-aware conversations about images
Why LLaVA is popular:
- Simple but effective architecture
- Open-source and reproducible
- Strong performance on visual understanding benchmarks
- Efficient training with visual instruction tuning
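The interface below mirrors this three-stage pipeline. The following is a minimal sketch of how the stages map onto its members, assuming a concrete ILLaVAModel<double> instance and an already-preprocessed Tensor<double> image (model loading and image preprocessing are not part of this interface, and Tensor<double> is assumed to be in scope from the library's tensor type used throughout this page):
using AiDotNet.Interfaces;

public static class LlavaPipelineSketch
{
    public static string AskAboutImage(ILLaVAModel<double> model, Tensor<double> image)
    {
        // Stage 1 - vision encoder: image -> patch features [numPatches, hiddenDim].
        Tensor<double> visualFeatures = model.ExtractVisualFeatures(image);

        // Stage 2 - projection layer: map visual features into the LLM's embedding space.
        Tensor<double> projected = model.ProjectToLanguageSpace(visualFeatures);

        // Stage 3 - language model: in everyday use you call Generate (or Chat) directly;
        // it runs stages 1-2 internally and conditions the LLM on the projected tokens.
        return model.Generate(image, "What is happening in this image?");
    }
}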
Properties
LanguageModelBackbone
Gets the language model backbone used for generation.
LanguageModelBackbone LanguageModelBackbone { get; }
Property Value
- LanguageModelBackbone
NumVisualTokens
Gets the maximum number of visual tokens used per image.
int NumVisualTokens { get; }
Property Value
- int
Remarks
The number of patch tokens extracted from the vision encoder. For CLIP ViT-L/14 at 336x336, this is typically 576 tokens (24x24 patches).
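The count follows directly from the patch grid; a quick arithmetic check for the CLIP ViT-L/14 example above:
int imageSize = 336;                                   // input resolution
int patchSize = 14;                                    // ViT-L/14 patch size
int patchesPerSide = imageSize / patchSize;            // 24
int visualTokens = patchesPerSide * patchesPerSide;    // 576, matching NumVisualTokens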
VisionEncoderType
Gets the vision encoder type.
string VisionEncoderType { get; }
Property Value
- string
Remarks
Typically CLIP ViT-L/14 or similar vision transformer models.
Methods
Chat(Tensor<T>, IEnumerable<(string Role, string Content)>, string, int, double)
Continues a multi-turn conversation about an image.
string Chat(Tensor<T> image, IEnumerable<(string Role, string Content)> conversationHistory, string userMessage, int maxLength = 512, double temperature = 0.7)
Parameters
image (Tensor<T>): The preprocessed image tensor.
conversationHistory (IEnumerable<(string Role, string Content)>): Previous turns as (role, content) pairs.
userMessage (string): The new user message.
maxLength (int): Maximum tokens to generate.
temperature (double): Sampling temperature.
Returns
- string
The assistant's response.
Remarks
Enables multi-turn visual dialogue where context is preserved across turns.
For Beginners: Have a conversation about an image!
Example conversation:
User: "What's in this image?"
Assistant: "A dog playing in a park with a red ball."
User: "What breed is the dog?"
Assistant: "It appears to be a Golden Retriever based on its golden fur and size."
User: "Is it a sunny day?"
Assistant: "Yes, there are shadows indicating bright sunlight and clear skies."
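A usage sketch, assuming model is an ILLaVAModel<double> and image is an already-preprocessed Tensor<double> in scope:
var history = new List<(string Role, string Content)>
{
    ("user", "What's in this image?"),
    ("assistant", "A dog playing in a park with a red ball.")
};

// The model sees the image plus the prior turns, so "the dog" resolves correctly.
string reply = model.Chat(image, history, "What breed is the dog?", maxLength: 256, temperature: 0.7);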
CompareImages(Tensor<T>, Tensor<T>, IEnumerable<string>?)
Compares two images and describes their differences.
string CompareImages(Tensor<T> image1, Tensor<T> image2, IEnumerable<string>? aspectsToCompare = null)
Parameters
image1 (Tensor<T>): First preprocessed image tensor.
image2 (Tensor<T>): Second preprocessed image tensor.
aspectsToCompare (IEnumerable<string>?): Optional specific aspects to compare.
Returns
- string
A description of the differences between the images.
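A usage sketch, where image1 and image2 stand for two preprocessed Tensor<double> images already in scope:
// Focused comparison on a few caller-chosen aspects.
string differences = model.CompareImages(image1, image2,
    aspectsToCompare: new[] { "objects present", "lighting", "colors" });

// Omitting the aspects yields a general comparison.
string generalComparison = model.CompareImages(image1, image2);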
DescribeRegions(Tensor<T>, IEnumerable<Vector<T>>)
Generates a detailed description of specific regions in an image.
IEnumerable<string> DescribeRegions(Tensor<T> image, IEnumerable<Vector<T>> regions)
Parameters
image (Tensor<T>): The preprocessed image tensor.
regions (IEnumerable<Vector<T>>): List of bounding boxes [x1, y1, x2, y2] to describe.
Returns
- IEnumerable<string>
Descriptions for each region.
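A usage sketch; how a Vector<double> bounding box is constructed depends on the library's vector API, so MakeBox below is a hypothetical placeholder:
// Hypothetical helper that builds a 4-element Vector<double> [x1, y1, x2, y2].
Vector<double> MakeBox(double x1, double y1, double x2, double y2) =>
    throw new NotImplementedException("Replace with the library's Vector<T> construction.");

var regions = new List<Vector<double>>
{
    MakeBox(0.10, 0.30, 0.40, 0.70),   // first region of interest
    MakeBox(0.55, 0.20, 0.90, 0.60)    // second region of interest
};

foreach (string description in model.DescribeRegions(image, regions))
    Console.WriteLine(description);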
ExtractVisualFeatures(Tensor<T>)
Extracts visual features before projection to LLM space.
Tensor<T> ExtractVisualFeatures(Tensor<T> image)
Parameters
image (Tensor<T>): The preprocessed image tensor.
Returns
- Tensor<T>
Visual feature tensor with shape [numPatches, hiddenDim].
Remarks
These are the raw CLIP features before being projected to match the LLM's embedding dimension. Useful for analysis or custom processing.
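A short sketch of pulling the raw features for inspection (model and image as in the earlier examples):
// Raw CLIP patch features, shape [numPatches, hiddenDim], before projection.
Tensor<double> features = model.ExtractVisualFeatures(image);

// The patch count should line up with the model's visual token budget.
Console.WriteLine($"Encoder: {model.VisionEncoderType}, visual tokens: {model.NumVisualTokens}");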
Generate(Tensor<T>, string, int, double, double)
Generates a response to a text prompt about an image.
string Generate(Tensor<T> image, string prompt, int maxLength = 512, double temperature = 0.7, double topP = 0.9)
Parameters
image (Tensor<T>): The preprocessed image tensor.
prompt (string): The user's question or instruction about the image.
maxLength (int): Maximum number of tokens to generate.
temperature (double): Sampling temperature (0 = deterministic, higher = more creative).
topP (double): Nucleus sampling probability threshold.
Returns
- string
The generated response.
Remarks
For Beginners: Ask any question about an image!
Examples:
- "What is happening in this image?" → Detailed scene description
- "How many people are in the photo?" → Counting and recognition
- "What emotion does the person show?" → Emotional understanding
- "Write a caption for social media" → Creative generation
GenerateMultiple(Tensor<T>, string, int, double)
Generates multiple diverse responses for the same prompt.
IEnumerable<(string Response, T Score)> GenerateMultiple(Tensor<T> image, string prompt, int numResponses = 5, double temperature = 0.9)
Parameters
image (Tensor<T>): The preprocessed image tensor.
prompt (string): The user's question or instruction.
numResponses (int): Number of different responses to generate.
temperature (double): Sampling temperature for diversity.
Returns
- IEnumerable<(string Response, T Score)>
Collection of generated responses with their log probabilities.
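A usage sketch that prints each candidate with its score (here T is double, so scores format directly):
var candidates = model.GenerateMultiple(image, "Describe this image.", numResponses: 5, temperature: 0.9);

foreach (var (response, score) in candidates)
    Console.WriteLine($"{score:F3}: {response}");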
GroundObject(Tensor<T>, string)
Performs visual grounding to locate objects described by text.
Vector<T> GroundObject(Tensor<T> image, string description)
Parameters
image (Tensor<T>): The preprocessed image tensor.
description (string): Description of the object to locate.
Returns
- Vector<T>
Bounding box coordinates [x1, y1, x2, y2] normalized to [0, 1].
Remarks
For Beginners: Find where something is in an image!
Example:
- Description: "the red car on the left"
- Returns: [0.1, 0.3, 0.4, 0.7] representing the car's bounding box
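A usage sketch that scales the normalized box back to pixel coordinates; it assumes Vector<T> supports integer indexing, and the original image size is a caller-supplied value:
Vector<double> box = model.GroundObject(image, "the red car on the left");

// Scale [0, 1] coordinates to the original image resolution.
int imageWidth = 1280, imageHeight = 720;
int x1 = (int)(box[0] * imageWidth);
int y1 = (int)(box[1] * imageHeight);
int x2 = (int)(box[2] * imageWidth);
int y2 = (int)(box[3] * imageHeight);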
ProjectToLanguageSpace(Tensor<T>)
Projects visual features to the LLM's embedding space.
Tensor<T> ProjectToLanguageSpace(Tensor<T> visualFeatures)
Parameters
visualFeatures (Tensor<T>): Visual features from ExtractVisualFeatures.
Returns
- Tensor<T>
Projected features matching LLM embedding dimension.
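A sketch of the two-step feature path this method completes; Generate and Chat perform both steps internally, so the explicit calls are only needed for analysis or custom processing:
Tensor<double> visualFeatures = model.ExtractVisualFeatures(image);          // [numPatches, CLIP hidden dim]
Tensor<double> visualTokens = model.ProjectToLanguageSpace(visualFeatures);  // [numPatches, LLM embedding dim]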