Class LLaVANeuralNetwork<T>

Namespace
AiDotNet.NeuralNetworks
Assembly
AiDotNet.dll

LLaVA (Large Language and Vision Assistant) neural network for visual instruction following.

public class LLaVANeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, ILLaVAModel<T>, IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object
NeuralNetworkBase<T>
LLaVANeuralNetwork<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

LLaVA connects a vision encoder (CLIP ViT) with a large language model (LLaMA/Vicuna) through a simple projection layer, enabling visual conversations and instruction following.

For Beginners: LLaVA is like giving eyes to ChatGPT!

Architecture overview:

  1. Vision Encoder (CLIP ViT-L/14): Extracts image patch features
  2. Projection Layer (MLP): Maps visual features to LLM's embedding space
  3. Large Language Model (LLaMA/Vicuna): Generates text responses

Key capabilities:

  • Visual conversations: "What's in this image?" followed by "What color is the car?"
  • Visual reasoning: Understanding relationships, counting, spatial awareness
  • Instruction following: "Describe this image as if you were a poet"
  • Multi-turn dialogue: Context-aware conversations about images
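
Examples

The three stages above map onto the public API roughly as follows. This is a minimal sketch, not a full program: it assumes an already-constructed LLaVANeuralNetwork<double> named llava (see the constructors below) and a caller-supplied helper LoadPreprocessedImage that returns a preprocessed [3, 336, 336] image tensor; both names are placeholders, and using directives are omitted.

// Stage 1: vision encoder - raw CLIP patch features, shape [numPatches, visionHiddenDim].
Tensor<double> image = LoadPreprocessedImage("photo.jpg");   // hypothetical helper
Tensor<double> visualFeatures = llava.ExtractVisualFeatures(image);

// Stage 2: projection layer - map the visual features into the LLM's embedding space.
Tensor<double> projected = llava.ProjectToLanguageSpace(visualFeatures);

// Stage 3: language model - Generate runs the full pipeline (stages 1-3) in one call.
string answer = llava.Generate(image, "What's in this image?");
Console.WriteLine(answer);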

Constructors

LLaVANeuralNetwork(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int, int, int, int, LanguageModelBackbone, string, ITokenizer?, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a LLaVA network using native library layers.

public LLaVANeuralNetwork(NeuralNetworkArchitecture<T> architecture, int imageSize = 336, int channels = 3, int patchSize = 14, int vocabularySize = 32000, int maxSequenceLength = 2048, int embeddingDimension = 4096, int visionHiddenDim = 1024, int numVisionLayers = 24, int numLmLayers = 32, int numHeads = 16, LanguageModelBackbone languageModelBackbone = LanguageModelBackbone.LLaMA, string visionEncoderType = "clip-vit-l", ITokenizer? tokenizer = null, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>
imageSize int
channels int
patchSize int
vocabularySize int
maxSequenceLength int
embeddingDimension int
visionHiddenDim int
numVisionLayers int
numLmLayers int
numHeads int
languageModelBackbone LanguageModelBackbone
visionEncoderType string
tokenizer ITokenizer
optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>
lossFunction ILossFunction<T>

Remarks

In native training mode, a default tokenizer appropriate for the language model backbone will be created if not provided. For production inference, consider using HuggingFace.AutoTokenizer.FromPretrained(string, string?) with the model name from GetHuggingFaceModelName(LanguageModelBackbone).
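
Examples

For illustration, the sketch below constructs a smaller-than-default configuration for local experimentation. The NeuralNetworkArchitecture<double> setup is a placeholder (shown as a hypothetical CreateArchitecture helper), and the reduced dimensions are arbitrary examples rather than recommended values.

// Hypothetical helper supplying the base architecture configuration.
NeuralNetworkArchitecture<double> architecture = CreateArchitecture();

// Native training mode: no tokenizer is passed, so a default one matching the
// backbone is created internally, per the remarks above.
var llava = new LLaVANeuralNetwork<double>(
    architecture,
    imageSize: 336,
    patchSize: 14,
    embeddingDimension: 2048,   // smaller than the 4096 default, for experimentation
    numLmLayers: 16);           // fewer LLM layers than the 32 default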

LLaVANeuralNetwork(NeuralNetworkArchitecture<T>, string, string, ITokenizer, LanguageModelBackbone, string, int, int, int, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)

Creates a LLaVA network using pretrained ONNX models.

public LLaVANeuralNetwork(NeuralNetworkArchitecture<T> architecture, string visionEncoderPath, string languageModelPath, ITokenizer tokenizer, LanguageModelBackbone languageModelBackbone = LanguageModelBackbone.LLaMA, string visionEncoderType = "clip-vit-l", int embeddingDimension = 4096, int maxSequenceLength = 2048, int imageSize = 336, IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>
visionEncoderPath string
languageModelPath string
tokenizer ITokenizer
languageModelBackbone LanguageModelBackbone
visionEncoderType string
embeddingDimension int
maxSequenceLength int
imageSize int
optimizer IGradientBasedOptimizer<T, Tensor<T>, Tensor<T>>
lossFunction ILossFunction<T>

Remarks

For ONNX mode, you MUST provide a tokenizer that matches your language model backbone. Use HuggingFace.AutoTokenizer.FromPretrained(string, string?) to load the correct pretrained tokenizer.
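
Examples

A hedged sketch of ONNX-mode construction is shown below. The model-file paths and the tokenizer identifier are placeholders, CreateArchitecture is a hypothetical helper, and the tokenizer returned by HuggingFace.AutoTokenizer.FromPretrained is assumed to be compatible with the ITokenizer parameter.

// Load a pretrained tokenizer that matches the language model backbone (required in ONNX mode).
// The model name here is a placeholder, e.g. the value returned by
// GetHuggingFaceModelName for your backbone.
var tokenizer = HuggingFace.AutoTokenizer.FromPretrained("<hf-model-name>");

var llava = new LLaVANeuralNetwork<double>(
    CreateArchitecture(),                               // hypothetical helper
    visionEncoderPath: "models/clip-vit-l.onnx",        // placeholder path
    languageModelPath: "models/llama-7b.onnx",          // placeholder path
    tokenizer: tokenizer,
    languageModelBackbone: LanguageModelBackbone.LLaMA);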

Properties

EmbeddingDimension

Gets the dimensionality of the embedding space.

public int EmbeddingDimension { get; }

Property Value

int

ImageSize

Gets the expected image size (square images: ImageSize x ImageSize pixels).

public int ImageSize { get; }

Property Value

int

LanguageModelBackbone

Gets the language model backbone used for generation.

public LanguageModelBackbone LanguageModelBackbone { get; }

Property Value

LanguageModelBackbone

Remarks

Common backbones include LLaMA, Vicuna, and Mistral.

MaxSequenceLength

Gets the maximum sequence length for text input.

public int MaxSequenceLength { get; }

Property Value

int

NumVisualTokens

Gets the maximum number of visual tokens used per image.

public int NumVisualTokens { get; }

Property Value

int

Remarks

The number of patch tokens extracted from the vision encoder. For CLIP ViT-L/14 at 336x336, this is typically 576 tokens (24x24 patches).

ParameterCount

Gets the total number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

For Beginners: This tells you how many adjustable values (weights and biases) your neural network has. More complex networks typically have more parameters and can learn more complex patterns, but also require more data to train effectively. This is part of the IFullModel interface for consistency with other model types.

Performance: This property uses caching to avoid recomputing the sum on every access. The cache is invalidated when layers are modified.

VisionEncoderType

Gets the vision encoder type.

public string VisionEncoderType { get; }

Property Value

string

Remarks

Typically CLIP ViT-L/14 or similar vision transformer models.

Methods

Backward(Tensor<T>)

Backward pass through projection layers (vision encoder is frozen).

public Tensor<T> Backward(Tensor<T> gradient)

Parameters

gradient Tensor<T>

The gradient tensor from the loss function.

Returns

Tensor<T>

The gradient after backward propagation.

Chat(Tensor<T>, IEnumerable<(string Role, string Content)>, string, int, double)

Continues a multi-turn conversation about an image.

public string Chat(Tensor<T> image, IEnumerable<(string Role, string Content)> conversationHistory, string userMessage, int maxLength = 512, double temperature = 0.7)

Parameters

image Tensor<T>

The preprocessed image tensor.

conversationHistory IEnumerable<(string Role, string Content)>

Previous turns as (role, content) pairs.

userMessage string

The new user message.

maxLength int

Maximum tokens to generate.

temperature double

Sampling temperature.

Returns

string

The assistant's response.

Remarks

Enables multi-turn visual dialogue where context is preserved across turns.

For Beginners: Have a conversation about an image!

Example conversation:

  User: "What's in this image?"
  Assistant: "A dog playing in a park with a red ball."
  User: "What breed is the dog?"
  Assistant: "It appears to be a Golden Retriever based on its golden fur and size."
  User: "Is it a sunny day?"
  Assistant: "Yes, there are shadows indicating bright sunlight and clear skies."
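
Examples

A minimal multi-turn sketch, assuming an existing LLaVANeuralNetwork<double> named llava and a preprocessed Tensor<double> image. The role strings "user" and "assistant" are an assumption about the expected conversation format.

var history = new List<(string Role, string Content)>
{
    ("user", "What's in this image?"),
    ("assistant", "A dog playing in a park with a red ball.")
};

// Ask a follow-up question; the history gives the model the earlier turns as context.
string reply = llava.Chat(image, history, "What breed is the dog?");

// Keep the history up to date for the next turn.
history.Add(("user", "What breed is the dog?"));
history.Add(("assistant", reply));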

CompareImages(Tensor<T>, Tensor<T>, IEnumerable<string>?)

Compares two images and describes their differences.

public string CompareImages(Tensor<T> image1, Tensor<T> image2, IEnumerable<string>? aspectsToCompare = null)

Parameters

image1 Tensor<T>

First preprocessed image tensor.

image2 Tensor<T>

Second preprocessed image tensor.

aspectsToCompare IEnumerable<string>

Optional specific aspects to compare.

Returns

string

A description of the differences between the images.

ComputeSimilarity(Vector<T>, Vector<T>)

Computes similarity between two embeddings.

public T ComputeSimilarity(Vector<T> textEmbedding, Vector<T> imageEmbedding)

Parameters

textEmbedding Vector<T>
imageEmbedding Vector<T>

Returns

T

Similarity score (cosine similarity for normalized embeddings).

CreateNewInstance()

Creates a new instance of the same type as this neural network.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance of the same neural network type.

Remarks

For Beginners: This creates a blank version of the same type of neural network.

It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.

DescribeRegions(Tensor<T>, IEnumerable<Vector<T>>)

Generates a detailed description of specific regions in an image.

public IEnumerable<string> DescribeRegions(Tensor<T> image, IEnumerable<Vector<T>> regions)

Parameters

image Tensor<T>

The preprocessed image tensor.

regions IEnumerable<Vector<T>>

List of bounding boxes [x1, y1, x2, y2] to describe.

Returns

IEnumerable<string>

Descriptions for each region.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data that was not covered by the general deserialization process.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The BinaryReader to read the data from.

Remarks

This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.

For Beginners: Continuing the suitcase analogy from SerializeNetworkSpecificData(BinaryWriter), this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.

Dispose(bool)

Protected Dispose pattern implementation.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True if called from Dispose(), false if called from finalizer.

EncodeImage(double[])

Encodes an image into an embedding vector.

public Vector<T> EncodeImage(double[] imageData)

Parameters

imageData double[]

The preprocessed image data as a flattened array in CHW format.

Returns

Vector<T>

A normalized embedding vector.
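
Examples

A sketch of image encoding and cross-modal similarity, assuming an existing LLaVANeuralNetwork<double> named llava with the default 336x336, 3-channel input size. The pixel values below are placeholders; real input should be preprocessed (resized and normalized) to match the vision encoder.

// Flattened CHW layout: channels * height * width = 3 * 336 * 336 values.
double[] imageData = new double[3 * 336 * 336];
// ... fill imageData with preprocessed pixel values (placeholder) ...

Vector<double> imageEmbedding = llava.EncodeImage(imageData);
Vector<double> textEmbedding = llava.EncodeText("a dog playing with a red ball");

// Cosine similarity between the two normalized embeddings (see ComputeSimilarity).
double similarity = llava.ComputeSimilarity(textEmbedding, imageEmbedding);
Console.WriteLine($"Similarity: {similarity}");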

EncodeImageBatch(IEnumerable<double[]>)

Encodes multiple images into embedding vectors in a batch.

public Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)

Parameters

imageDataBatch IEnumerable<double[]>

The preprocessed images as flattened arrays.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding image.

EncodeText(string)

Encodes text into an embedding vector.

public Vector<T> EncodeText(string text)

Parameters

text string

The text to encode.

Returns

Vector<T>

A normalized embedding vector.

EncodeTextBatch(IEnumerable<string>)

Encodes multiple texts into embedding vectors in a batch.

public Matrix<T> EncodeTextBatch(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

The texts to encode.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding text.

ExtractVisualFeatures(Tensor<T>)

Extracts visual features before projection to LLM space.

public Tensor<T> ExtractVisualFeatures(Tensor<T> image)

Parameters

image Tensor<T>

The preprocessed image tensor.

Returns

Tensor<T>

Visual feature tensor with shape [numPatches, hiddenDim].

Remarks

These are the raw CLIP features before being projected to match the LLM's embedding dimension. Useful for analysis or custom processing.

Generate(Tensor<T>, string, int, double, double)

Generates a response to a text prompt about an image.

public string Generate(Tensor<T> image, string prompt, int maxLength = 512, double temperature = 0.7, double topP = 0.9)

Parameters

image Tensor<T>

The preprocessed image tensor.

prompt string

The user's question or instruction about the image.

maxLength int

Maximum number of tokens to generate.

temperature double

Sampling temperature (0 = deterministic, higher = more creative).

topP double

Nucleus sampling probability threshold.

Returns

string

The generated response.

Remarks

For Beginners: Ask any question about an image!

Examples:

  • "What is happening in this image?" → Detailed scene description
  • "How many people are in the photo?" → Counting and recognition
  • "What emotion does the person show?" → Emotional understanding
  • "Write a caption for social media" → Creative generation

GenerateMultiple(Tensor<T>, string, int, double)

Generates multiple diverse responses for the same prompt.

public IEnumerable<(string Response, T Score)> GenerateMultiple(Tensor<T> image, string prompt, int numResponses = 5, double temperature = 0.9)

Parameters

image Tensor<T>

The preprocessed image tensor.

prompt string

The user's question or instruction.

numResponses int

Number of different responses to generate.

temperature double

Sampling temperature for diversity.

Returns

IEnumerable<(string Response, T Score)>

Collection of generated responses with their log probabilities.
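
Examples

A small sketch iterating over the returned responses and their log-probability scores, assuming an existing LLaVANeuralNetwork<double> named llava and a preprocessed Tensor<double> image.

// Generate three diverse responses and print each with its score.
foreach (var (response, score) in llava.GenerateMultiple(image, "Describe this image.", numResponses: 3))
{
    Console.WriteLine($"score={score}: {response}");
}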

GetImageEmbedding(Tensor<T>)

Gets the embedding vector for a preprocessed image tensor.

public Vector<T> GetImageEmbedding(Tensor<T> image)

Parameters

image Tensor<T>

Returns

Vector<T>

GetImageEmbeddings(IEnumerable<Tensor<T>>)

Gets embedding vectors for a collection of preprocessed image tensors.

public IEnumerable<Vector<T>> GetImageEmbeddings(IEnumerable<Tensor<T>> images)

Parameters

images IEnumerable<Tensor<T>>

Returns

IEnumerable<Vector<T>>

GetModelMetadata()

Gets the metadata for this neural network model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

A ModelMetadata<T> object containing information about the model.

GetTextEmbedding(string)

Gets the embedding vector for a text string.

public Vector<T> GetTextEmbedding(string text)

Parameters

text string

Returns

Vector<T>

GetTextEmbeddings(IEnumerable<string>)

Gets embedding vectors for a collection of text strings.

public IEnumerable<Vector<T>> GetTextEmbeddings(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

IEnumerable<Vector<T>>

GroundObject(Tensor<T>, string)

Performs visual grounding to locate objects described by text.

public Vector<T> GroundObject(Tensor<T> image, string description)

Parameters

image Tensor<T>

The preprocessed image tensor.

description string

Description of the object to locate.

Returns

Vector<T>

Bounding box coordinates [x1, y1, x2, y2] normalized to [0, 1].

Remarks

For Beginners: Find where something is in an image!

Example:

  • Description: "the red car on the left"
  • Returns: [0.1, 0.3, 0.4, 0.7] representing the car's bounding box
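
Examples

A sketch, assuming an existing LLaVANeuralNetwork<double> named llava, a preprocessed Tensor<double> image, and that Vector<T> exposes an integer indexer (an assumption made here for illustration).

Vector<double> box = llava.GroundObject(image, "the red car on the left");

// Coordinates are normalized to [0, 1] and ordered [x1, y1, x2, y2].
double x1 = box[0], y1 = box[1], x2 = box[2], y2 = box[3];
Console.WriteLine($"Box: ({x1}, {y1}) - ({x2}, {y2})");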

InitializeLayers()

Initializes the layers of the neural network based on the architecture.

protected override void InitializeLayers()

Remarks

For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.

Predict(Tensor<T>)

Makes a prediction using the neural network.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input data to process.

Returns

Tensor<T>

The network's prediction.

Remarks

For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).

ProjectToLanguageSpace(Tensor<T>)

Projects visual features to the LLM's embedding space.

public Tensor<T> ProjectToLanguageSpace(Tensor<T> visualFeatures)

Parameters

visualFeatures Tensor<T>

Visual features from ExtractVisualFeatures.

Returns

Tensor<T>

Projected features matching LLM embedding dimension.

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data that is not covered by the general serialization process.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The BinaryWriter to write the data to.

Remarks

This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.

For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.

Train(Tensor<T>, Tensor<T>)

Trains the neural network on a single input-output pair.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The input data.

expectedOutput Tensor<T>

The expected output for the given input.

Remarks

This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.

For Beginners: This is how your neural network learns. You provide:

  • An input (what the network should process)
  • The expected output (what the correct answer should be)

The network then:

  1. Makes a prediction based on the input
  2. Compares its prediction to the expected output
  3. Calculates how wrong it was (the loss)
  4. Adjusts its internal values to do better next time

After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
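
Examples

A minimal training-loop sketch. It assumes an existing LLaVANeuralNetwork<double> named llava, a caller-supplied sequence of (Tensor<double>, Tensor<double>) input/target pairs named trainingPairs, and that GetLastLoss() (mentioned above) takes no arguments and returns the loss of the most recent step.

foreach (var (input, expected) in trainingPairs)
{
    // One training step per input/target pair.
    llava.Train(input, expected);
    Console.WriteLine($"loss = {llava.GetLastLoss()}");
}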

UpdateParameters(Vector<T>)

Updates the network's parameters with new values.

public override void UpdateParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The new parameter values to set.

Remarks

For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.

This is typically used by optimization algorithms that calculate better parameter values based on training data.

ZeroShotClassify(Tensor<T>, IEnumerable<string>)

Performs zero-shot classification of an image tensor against text labels.

public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels)

Parameters

image Tensor<T>

The preprocessed image tensor.

classLabels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.

ZeroShotClassify(double[], IEnumerable<string>)

Performs zero-shot classification of an image against text labels.

public Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)

Parameters

imageData double[]

The preprocessed image data.

labels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.
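
Examples

A sketch, assuming an existing LLaVANeuralNetwork<double> named llava and preprocessed image data in a flattened CHW double[] named imageData (see EncodeImage(double[])).

var labels = new[] { "a photo of a dog", "a photo of a cat", "a photo of a car" };

// Score the image against each candidate label.
Dictionary<string, double> scores = llava.ZeroShotClassify(imageData, labels);

foreach (var kvp in scores)
{
    Console.WriteLine($"{kvp.Key}: {kvp.Value:P1}");
}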