Class Gpt4VisionNeuralNetwork<T>

Namespace
AiDotNet.NeuralNetworks
Assembly
AiDotNet.dll

GPT-4V-style neural network that combines vision understanding with large language model capabilities.

public class Gpt4VisionNeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IGpt4VisionModel<T>, IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
Gpt4VisionNeuralNetwork<T>
Implements
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IParameterizable<T, Tensor<T>, Tensor<T>>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>

Remarks

This implementation provides a vision-language model that can understand images and generate text responses, similar to GPT-4V, LLaVA, or other vision-language models.

Architecture Overview:

  1. Vision Encoder: ViT-based encoder to extract visual features
  2. Vision-Language Projector: Maps visual features to LLM embedding space
  3. Language Model: Transformer decoder for text generation
  4. Multi-modal Attention: Allows text to attend to visual features

Constructors

Gpt4VisionNeuralNetwork(NeuralNetworkArchitecture<T>, ITokenizer, int, int, int, int, int, int, int, int, int, int, int, int, ILossFunction<T>?)

Creates a GPT-4 Vision network using native layers (for training or when ONNX is not available).

public Gpt4VisionNeuralNetwork(NeuralNetworkArchitecture<T> architecture, ITokenizer tokenizer, int embeddingDimension = 4096, int visionEmbeddingDim = 1024, int maxSequenceLength = 2048, int contextWindowSize = 128000, int imageSize = 336, int hiddenDim = 4096, int numVisionLayers = 24, int numLanguageLayers = 32, int numHeads = 32, int patchSize = 14, int vocabularySize = 128256, int maxImagesPerRequest = 10, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>
tokenizer ITokenizer
embeddingDimension int
visionEmbeddingDim int
maxSequenceLength int
contextWindowSize int
imageSize int
hiddenDim int
numVisionLayers int
numLanguageLayers int
numHeads int
patchSize int
vocabularySize int
maxImagesPerRequest int
lossFunction ILossFunction<T>
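
The default values above imply some simple sizing arithmetic. The following is a minimal sketch, assuming the vision encoder splits each image into non-overlapping square patches (as a standard ViT does) and that the hidden dimension is divided evenly across attention heads. VisionTokenMath and its methods are hypothetical helpers written for illustration; they are not part of AiDotNet, and the class may add extra tokens (for example a class token) that this sketch does not count.

```csharp
using System;

public static class VisionTokenMath
{
    // Number of non-overlapping square patches per image (assumed ViT-style patching).
    public static int PatchesPerImage(int imageSize, int patchSize)
    {
        if (imageSize <= 0 || patchSize <= 0 || imageSize % patchSize != 0)
            throw new ArgumentException("imageSize must be a positive multiple of patchSize.");
        int perSide = imageSize / patchSize;
        return perSide * perSide;
    }

    // Width of each attention head when hiddenDim is split evenly across heads.
    public static int HeadDimension(int hiddenDim, int numHeads)
    {
        if (hiddenDim <= 0 || numHeads <= 0 || hiddenDim % numHeads != 0)
            throw new ArgumentException("hiddenDim must be a positive multiple of numHeads.");
        return hiddenDim / numHeads;
    }
}
```

With the defaults imageSize = 336 and patchSize = 14, PatchesPerImage returns 576 (24 patches per side); with hiddenDim = 4096 and numHeads = 32, HeadDimension returns 128.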

Gpt4VisionNeuralNetwork(NeuralNetworkArchitecture<T>, string, string, ITokenizer, int, int, int, int, int, int, ILossFunction<T>?)

Creates a GPT-4 Vision network using pretrained ONNX models.

public Gpt4VisionNeuralNetwork(NeuralNetworkArchitecture<T> architecture, string visionEncoderPath, string languageModelPath, ITokenizer tokenizer, int embeddingDimension = 4096, int visionEmbeddingDim = 1024, int maxSequenceLength = 2048, int contextWindowSize = 128000, int imageSize = 336, int maxImagesPerRequest = 10, ILossFunction<T>? lossFunction = null)

Parameters

architecture NeuralNetworkArchitecture<T>
visionEncoderPath string
languageModelPath string
tokenizer ITokenizer
embeddingDimension int
visionEmbeddingDim int
maxSequenceLength int
contextWindowSize int
imageSize int
maxImagesPerRequest int
lossFunction ILossFunction<T>

Properties

ContextWindowSize

Gets the context window size in tokens.

public int ContextWindowSize { get; }

Property Value

int

EmbeddingDimension

Gets the dimensionality of the embedding space.

public int EmbeddingDimension { get; }

Property Value

int

ImageEmbeddingDimension

public int ImageEmbeddingDimension { get; }

Property Value

int

ImageSize

Gets the expected image size (square images: ImageSize x ImageSize pixels).

public int ImageSize { get; }

Property Value

int

MaxImageResolution

Gets the maximum resolution supported for input images.

public (int Width, int Height) MaxImageResolution { get; }

Property Value

(int Width, int Height)

MaxImagesPerRequest

Gets the maximum number of images that can be processed in a single request.

public int MaxImagesPerRequest { get; }

Property Value

int

MaxSequenceLength

Gets the maximum sequence length for text input.

public int MaxSequenceLength { get; }

Property Value

int

ParameterCount

Gets the total number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

For Beginners: This tells you how many adjustable values (weights and biases) your neural network has. More complex networks typically have more parameters and can learn more complex patterns, but also require more data to train effectively. This is part of the IFullModel interface for consistency with other model types.

Performance: This property uses caching to avoid recomputing the sum on every access. The cache is invalidated when layers are modified.

SupportedDetailLevels

Gets the supported image detail levels.

public IReadOnlyList<string> SupportedDetailLevels { get; }

Property Value

IReadOnlyList<string>

TextEmbeddingDimension

public int TextEmbeddingDimension { get; }

Property Value

int

Methods

AnalyzeChart(Tensor<T>)

Analyzes a chart or graph and extracts data.

public (string ChartType, Dictionary<string, object> Data, string Interpretation) AnalyzeChart(Tensor<T> chartImage)

Parameters

chartImage Tensor<T>

Image of a chart or graph.

Returns

(string ChartType, Dictionary<string, object> Data, string Interpretation)

Chart analysis including type, data points, and interpretation.

AnalyzeDocument(Tensor<T>, string, string?)

Analyzes a document image (PDF page, screenshot, etc.).

public string AnalyzeDocument(Tensor<T> documentImage, string analysisType = "summary", string? additionalPrompt = null)

Parameters

documentImage Tensor<T>

The document image.

analysisType string

Type: "summary", "extract_text", "answer_questions", "analyze_structure".

additionalPrompt string

Optional additional instructions.

Returns

string

Analysis result.

Remarks

Specialized for understanding structured documents like:

  • PDF pages and scanned documents
  • Charts and graphs
  • Tables and spreadsheets
  • Forms and invoices

AnswerVisualQuestion(Tensor<T>, string)

Answers a visual question and returns a confidence score.

public (string Answer, T Confidence) AnswerVisualQuestion(Tensor<T> image, string question)

Parameters

image Tensor<T>

The input image.

question string

Question about the image.

Returns

(string Answer, T Confidence)

Answer and confidence score.

Chat(Tensor<T>, IEnumerable<(string Role, string Content)>, string, int)

Conducts a multi-turn conversation about an image.

public string Chat(Tensor<T> image, IEnumerable<(string Role, string Content)> conversationHistory, string userMessage, int maxTokens = 1024)

Parameters

image Tensor<T>

The image being discussed.

conversationHistory IEnumerable<(string Role, string Content)>

Previous turns as (role, content) pairs.

userMessage string

The new user message.

maxTokens int

Maximum tokens to generate.

Returns

string

Generated assistant response.
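
The conversationHistory parameter is a sequence of (Role, Content) tuples. The following is a minimal sketch of how such a history could be built and flattened into a single prompt string. The role names "user" and "assistant", the "role: content" line format, and the ChatHistory helper are all assumptions made for illustration; the prompt template that Chat actually uses is not documented here.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ChatHistory
{
    // Flattens prior turns plus the new user message into one prompt string.
    // The "role: content" layout is a hypothetical format, not the library's template.
    public static string Flatten(IEnumerable<(string Role, string Content)> history, string userMessage)
    {
        var sb = new StringBuilder();
        foreach (var (role, content) in history)
        {
            sb.Append(role).Append(": ").Append(content).Append('\n');
        }
        sb.Append("user: ").Append(userMessage).Append('\n');
        sb.Append("assistant:"); // left open for the model to continue
        return sb.ToString();
    }
}
```

A history such as new List<(string Role, string Content)> { ("user", "Hi"), ("assistant", "Hello") } has the same shape as the conversationHistory argument of Chat.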

CompareImages(Tensor<T>, Tensor<T>, string)

Compares two images and describes their differences.

public string CompareImages(Tensor<T> image1, Tensor<T> image2, string comparisonType = "detailed")

Parameters

image1 Tensor<T>

First image.

image2 Tensor<T>

Second image.

comparisonType string

Type: "visual", "semantic", "detailed".

Returns

string

Comparison description.

ComputeSimilarity(Tensor<T>, string)

public T ComputeSimilarity(Tensor<T> image, string text)

Parameters

image Tensor<T>
text string

Returns

T

ComputeSimilarity(Vector<T>, Vector<T>)

Computes similarity between two embeddings.

public T ComputeSimilarity(Vector<T> textEmbedding, Vector<T> imageEmbedding)

Parameters

textEmbedding Vector<T>
imageEmbedding Vector<T>

Returns

T

Similarity score (cosine similarity for normalized embeddings).
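
For reference, cosine similarity can be sketched as follows. This is a standalone illustration on double arrays, not the library's implementation, and EmbeddingMath is a hypothetical helper name. Returning 0 for a zero-length vector is a convention chosen for this sketch.

```csharp
using System;

public static class EmbeddingMath
{
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    public static double CosineSimilarity(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }

        // Convention for this sketch: similarity with a zero vector is 0.
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

When both embeddings are already normalized to unit length, the denominator is 1 and the score reduces to the dot product.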

CreateNewInstance()

Creates a new instance of the same type as this neural network.

protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance of the same neural network type.

Remarks

For Beginners: This creates a blank version of the same type of neural network.

It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.

DescribeImage(Tensor<T>, string, string)

Describes an image with specified style and detail level.

public string DescribeImage(Tensor<T> image, string style = "factual", string detailLevel = "medium")

Parameters

image Tensor<T>

The input image.

style string

Description style: "factual", "poetic", "technical", "accessibility".

detailLevel string

Detail level: "low", "medium", "high".

Returns

string

Generated description.

DeserializeNetworkSpecificData(BinaryReader)

Deserializes network-specific data that was not covered by the general deserialization process.

protected override void DeserializeNetworkSpecificData(BinaryReader reader)

Parameters

reader BinaryReader

The BinaryReader to read the data from.

Remarks

This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.

For Beginners: Using the suitcase analogy from SerializeNetworkSpecificData, this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.

DetectObjects(Tensor<T>, string?)

Identifies and locates objects in an image with bounding boxes.

public IEnumerable<(string Label, T Confidence, int X, int Y, int Width, int Height)> DetectObjects(Tensor<T> image, string? objectQuery = null)

Parameters

image Tensor<T>

The input image.

objectQuery string

Optional specific objects to find, or null for all objects.

Returns

IEnumerable<(string Label, T Confidence, int X, int Y, int Width, int Height)>

List of detected objects with bounding boxes and confidence scores.
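
Boxes in the (X, Y, Width, Height) shape returned here are often compared using intersection over union (IoU), for example to merge duplicate detections. The following is a minimal sketch, assuming X and Y are the top-left corner in pixels; the documentation above does not state the coordinate convention. BoundingBoxes is a hypothetical helper, not part of AiDotNet.

```csharp
using System;

public static class BoundingBoxes
{
    // Intersection over union of two axis-aligned boxes.
    // Assumes (X, Y) is the top-left corner and sizes are in pixels.
    public static double IoU(
        (int X, int Y, int Width, int Height) a,
        (int X, int Y, int Width, int Height) b)
    {
        int left = Math.Max(a.X, b.X);
        int top = Math.Max(a.Y, b.Y);
        int right = Math.Min(a.X + a.Width, b.X + b.Width);
        int bottom = Math.Min(a.Y + a.Height, b.Y + b.Height);

        // Clamp to zero so that disjoint boxes have no intersection.
        long intersection = (long)Math.Max(0, right - left) * Math.Max(0, bottom - top);
        long union = (long)a.Width * a.Height + (long)b.Width * b.Height - intersection;

        return union == 0 ? 0 : (double)intersection / union;
    }
}
```

Two identical boxes score 1, and boxes that do not overlap score 0.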

Dispose(bool)

Protected Dispose pattern implementation.

protected override void Dispose(bool disposing)

Parameters

disposing bool

True if called from Dispose(), false if called from finalizer.

EncodeImage(double[])

Encodes an image into an embedding vector.

public Vector<T> EncodeImage(double[] imageData)

Parameters

imageData double[]

The preprocessed image data as a flattened array in CHW format.

Returns

Vector<T>

A normalized embedding vector.
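
CHW format stores all values of channel 0 first, then channel 1, and so on, with rows laid out inside each channel. The following is a minimal sketch of converting interleaved HWC pixel data (for example RGBRGB...) into that layout. ImageLayout is a hypothetical helper; the value range and any mean or standard deviation normalization expected by EncodeImage are not specified here and are left out.

```csharp
using System;

public static class ImageLayout
{
    // Converts interleaved HWC data to planar CHW order.
    // CHW index: c * height * width + y * width + x.
    public static double[] HwcToChw(double[] hwc, int height, int width, int channels)
    {
        if (hwc.Length != height * width * channels)
            throw new ArgumentException("Array length does not match height * width * channels.");

        var chw = new double[hwc.Length];
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                for (int c = 0; c < channels; c++)
                {
                    chw[c * height * width + y * width + x] = hwc[(y * width + x) * channels + c];
                }
            }
        }
        return chw;
    }
}
```

For a 1 x 2 image with three channels, the interleaved input { 1, 2, 3, 4, 5, 6 } becomes { 1, 4, 2, 5, 3, 6 }.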

EncodeImageBatch(IEnumerable<double[]>)

Encodes multiple images into embedding vectors in a batch.

public Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)

Parameters

imageDataBatch IEnumerable<double[]>

The preprocessed images as flattened arrays.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding image.

EncodeText(string)

Encodes text into an embedding vector.

public Vector<T> EncodeText(string text)

Parameters

text string

The text to encode.

Returns

Vector<T>

A normalized embedding vector.

EncodeTextBatch(IEnumerable<string>)

Encodes multiple texts into embedding vectors in a batch.

public Matrix<T> EncodeTextBatch(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

The texts to encode.

Returns

Matrix<T>

A matrix where each row is an embedding for the corresponding text.

EvaluateImageQuality(Tensor<T>)

Evaluates image quality and provides improvement suggestions.

public (Dictionary<string, T> QualityScores, IEnumerable<string> Suggestions) EvaluateImageQuality(Tensor<T> image)

Parameters

image Tensor<T>

The image to evaluate.

Returns

(Dictionary<string, T> QualityScores, IEnumerable<string> Suggestions)

Quality assessment with scores and suggestions.

ExtractStructuredData(Tensor<T>, string)

Extracts structured data from an image.

public string ExtractStructuredData(Tensor<T> image, string schema)

Parameters

image Tensor<T>

The input image.

schema string

JSON schema describing expected output structure.

Returns

string

Extracted data as JSON string.

Remarks

For Beginners: Get structured data from images!

Example schema: {"name": "string", "price": "number", "in_stock": "boolean"}

From a product image, extracts: {"name": "Widget", "price": 29.99, "in_stock": true}
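
Because the result is returned as a JSON string, callers typically parse it before use. The following is a minimal sketch using System.Text.Json against the example output above. StructuredDataReader and ParseProduct are hypothetical names, and the property names come from the example schema only. Generated text is not guaranteed to be valid JSON, so real code should expect JsonException and KeyNotFoundException.

```csharp
using System;
using System.Text.Json;

public static class StructuredDataReader
{
    // Parses the product example into a typed tuple.
    // Throws JsonException on invalid JSON and KeyNotFoundException on a missing property.
    public static (string Name, double Price, bool InStock) ParseProduct(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        string name = root.GetProperty("name").GetString() ?? string.Empty;
        double price = root.GetProperty("price").GetDouble();
        bool inStock = root.GetProperty("in_stock").GetBoolean();

        return (name, price, inStock);
    }
}
```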

ExtractText(Tensor<T>, bool)

Performs OCR with layout understanding.

public (string Text, Dictionary<string, object>? LayoutInfo) ExtractText(Tensor<T> image, bool preserveLayout = false)

Parameters

image Tensor<T>

Image containing text.

preserveLayout bool

Whether to preserve spatial layout in output.

Returns

(string Text, Dictionary<string, object> LayoutInfo)

Extracted text with optional layout information.

Generate(Tensor<T>, string, int, double)

Generates a response based on an image and a text prompt.

public string Generate(Tensor<T> image, string prompt, int maxTokens = 1024, double temperature = 0.7)

Parameters

image Tensor<T>

The input image tensor [channels, height, width].

prompt string

The text prompt or question about the image.

maxTokens int

Maximum tokens to generate.

temperature double

Sampling temperature (0-2).

Returns

string

Generated text response.
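
To illustrate what a sampling temperature does, the sketch below applies the common temperature-scaled softmax to a set of logits. It is a generic illustration, not this class's decoding code; Sampling is a hypothetical helper, and how Generate treats a temperature of exactly 0 is not documented here, so the sketch rejects it.

```csharp
using System;
using System.Linq;

public static class Sampling
{
    // Converts logits to probabilities after dividing by the temperature.
    // Lower temperatures sharpen the distribution; higher ones flatten it.
    public static double[] SoftmaxWithTemperature(double[] logits, double temperature)
    {
        if (logits.Length == 0)
            throw new ArgumentException("logits must not be empty.");
        if (temperature <= 0)
            throw new ArgumentOutOfRangeException(nameof(temperature), "Temperature must be positive in this sketch.");

        // Subtract the maximum logit for numerical stability.
        double max = logits.Max();
        double[] exps = logits.Select(l => Math.Exp((l - max) / temperature)).ToArray();
        double sum = exps.Sum();
        return exps.Select(e => e / sum).ToArray();
    }
}
```

For the same logits, the most likely token receives a larger share of the probability at temperature 0.7 than at temperature 2.0.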

GenerateCodeFromUI(Tensor<T>, string, string?)

Generates code from a UI screenshot.

public string GenerateCodeFromUI(Tensor<T> uiScreenshot, string targetFramework = "html_css", string? additionalInstructions = null)

Parameters

uiScreenshot Tensor<T>

Screenshot of a user interface.

targetFramework string

Target framework: "html_css", "react", "flutter", "swiftui".

additionalInstructions string

Optional styling or functionality instructions.

Returns

string

Generated code.

GenerateEditInstructions(Tensor<T>, string)

Generates image editing instructions based on a modification request.

public string GenerateEditInstructions(Tensor<T> image, string editRequest)

Parameters

image Tensor<T>

The original image.

editRequest string

Description of desired edit.

Returns

string

Structured editing instructions.

GenerateFromMultipleImages(IEnumerable<Tensor<T>>, string, int, double)

Generates a response based on multiple images and a text prompt.

public string GenerateFromMultipleImages(IEnumerable<Tensor<T>> images, string prompt, int maxTokens = 1024, double temperature = 0.7)

Parameters

images IEnumerable<Tensor<T>>

Multiple input images.

prompt string

The text prompt referencing the images.

maxTokens int

Maximum tokens to generate.

temperature double

Sampling temperature.

Returns

string

Generated text response.

Remarks

For Beginners: Compare and analyze multiple images!

Examples:

  • "What are the differences between these two images?"
  • "Which of these products looks more appealing?"
  • "Describe how these images are related."

GenerateStory(Tensor<T>, string, string)

Generates a creative story or narrative based on an image.

public string GenerateStory(Tensor<T> image, string genre = "general", string length = "medium")

Parameters

image Tensor<T>

The inspiring image.

genre string

Story genre: "fantasy", "mystery", "romance", "scifi", "general".

length string

Approximate length: "short", "medium", "long".

Returns

string

Generated story.

GetAttentionMap(Tensor<T>, string)

Gets attention weights showing which image regions influenced the response.

public Tensor<T> GetAttentionMap(Tensor<T> image, string prompt)

Parameters

image Tensor<T>

The input image.

prompt string

The prompt used.

Returns

Tensor<T>

Attention map tensor [height, width] showing importance weights.

GetImageEmbedding(Tensor<T>)

public Vector<T> GetImageEmbedding(Tensor<T> image)

Parameters

image Tensor<T>

Returns

Vector<T>

GetImageEmbeddings(IEnumerable<Tensor<T>>)

public IEnumerable<Vector<T>> GetImageEmbeddings(IEnumerable<Tensor<T>> images)

Parameters

images IEnumerable<Tensor<T>>

Returns

IEnumerable<Vector<T>>

GetModelMetadata()

Gets the metadata for this neural network model.

public override ModelMetadata<T> GetModelMetadata()

Returns

ModelMetadata<T>

A ModelMetadata<T> object containing information about the model.

GetTextEmbedding(string)

public Vector<T> GetTextEmbedding(string text)

Parameters

text string

Returns

Vector<T>

GetTextEmbeddings(IEnumerable<string>)

public IEnumerable<Vector<T>> GetTextEmbeddings(IEnumerable<string> texts)

Parameters

texts IEnumerable<string>

Returns

IEnumerable<Vector<T>>

InitializeLayers()

Initializes the layers of the neural network based on the architecture.

protected override void InitializeLayers()

Remarks

For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.

Predict(Tensor<T>)

Makes a prediction using the neural network.

public override Tensor<T> Predict(Tensor<T> input)

Parameters

input Tensor<T>

The input data to process.

Returns

Tensor<T>

The network's prediction.

Remarks

For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).

RetrieveImages(string, IEnumerable<Tensor<T>>, int)

public IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Tensor<T>> images, int topK = 5)

Parameters

query string
images IEnumerable<Tensor<T>>
topK int

Returns

IEnumerable<(int Index, T Score)>

RetrieveTexts(Tensor<T>, IEnumerable<string>, int)

public IEnumerable<(int Index, T Score)> RetrieveTexts(Tensor<T> image, IEnumerable<string> texts, int topK = 5)

Parameters

image Tensor<T>
texts IEnumerable<string>
topK int

Returns

IEnumerable<(int Index, T Score)>
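
RetrieveImages and RetrieveTexts both return (Index, Score) pairs limited by topK. The following is a minimal sketch of that selection step, assuming each candidate has already been scored (for example by similarity to the query) and that results are ordered from highest to lowest score. The ordering is an assumption, since it is not stated above, and Retrieval is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Retrieval
{
    // Returns the topK highest-scoring candidates with their original indices.
    public static List<(int Index, double Score)> TopK(IReadOnlyList<double> scores, int topK)
    {
        if (topK < 0)
            throw new ArgumentOutOfRangeException(nameof(topK));

        return scores
            .Select((score, index) => (Index: index, Score: score))
            .OrderByDescending(pair => pair.Score) // stable: ties keep input order
            .Take(topK)
            .ToList();
    }
}
```

If topK exceeds the number of candidates, all candidates are returned.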

SafetyCheck(Tensor<T>)

Identifies potential safety concerns in an image.

public Dictionary<string, (bool IsFlagged, T Confidence)> SafetyCheck(Tensor<T> image)

Parameters

image Tensor<T>

The image to analyze.

Returns

Dictionary<string, (bool IsFlagged, T Confidence)>

Safety assessment with categories and confidence levels.

SerializeNetworkSpecificData(BinaryWriter)

Serializes network-specific data that is not covered by the general serialization process.

protected override void SerializeNetworkSpecificData(BinaryWriter writer)

Parameters

writer BinaryWriter

The BinaryWriter to write the data to.

Remarks

This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.

For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.

Train(Tensor<T>, Tensor<T>)

Trains the neural network on a single input-output pair.

public override void Train(Tensor<T> input, Tensor<T> expectedOutput)

Parameters

input Tensor<T>

The input data.

expectedOutput Tensor<T>

The expected output for the given input.

Remarks

This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.

For Beginners: This is how your neural network learns. You provide:

  • An input (what the network should process)
  • The expected output (what the correct answer should be)

The network then:

  1. Makes a prediction based on the input
  2. Compares its prediction to the expected output
  3. Calculates how wrong it was (the loss)
  4. Adjusts its internal values to do better next time

After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
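
The four steps above can be sketched on a toy model with a single weight. This is an illustration of one gradient-descent training step with squared-error loss, not the code of this class; TinyLinearModel, its LastLoss property, and the learning rate are hypothetical.

```csharp
using System;

public sealed class TinyLinearModel
{
    private readonly double _learningRate;

    public double Weight { get; private set; }
    public double LastLoss { get; private set; }

    public TinyLinearModel(double initialWeight, double learningRate)
    {
        Weight = initialWeight;
        _learningRate = learningRate;
    }

    public double Predict(double input) => Weight * input;

    public void Train(double input, double expectedOutput)
    {
        double prediction = Predict(input);          // 1. make a prediction
        double error = prediction - expectedOutput;  // 2. compare with the expected output
        LastLoss = error * error;                    // 3. squared-error loss
        double gradient = 2 * error * input;         // derivative of the loss w.r.t. Weight
        Weight -= _learningRate * gradient;          // 4. adjust to do better next time
    }
}
```

Calling Train repeatedly with the same pair makes LastLoss shrink, as long as the learning rate is small enough for the step to be stable.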

UpdateParameters(Vector<T>)

Updates the network's parameters with new values.

public override void UpdateParameters(Vector<T> gradients)

Parameters

gradients Vector<T>

Remarks

For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.

This is typically used by optimization algorithms that calculate better parameter values based on training data.

VisualReasoning(Tensor<T>, string, string)

Performs visual reasoning tasks.

public (string Answer, string Explanation) VisualReasoning(Tensor<T> image, string reasoningTask, string question)

Parameters

image Tensor<T>

The input image.

reasoningTask string

Task type: "count", "compare", "spatial", "temporal", "causal".

question string

Specific question for the reasoning task.

Returns

(string Answer, string Explanation)

Reasoning result with explanation.

ZeroShotClassify(Tensor<T>, IEnumerable<string>)

public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> labels)

Parameters

image Tensor<T>
labels IEnumerable<string>

Returns

Dictionary<string, T>

ZeroShotClassify(double[], IEnumerable<string>)

Performs zero-shot classification of an image against text labels.

public Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)

Parameters

imageData double[]

The preprocessed image data.

labels IEnumerable<string>

The candidate class labels.

Returns

Dictionary<string, T>

A dictionary mapping each label to its probability score.
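
One common way to turn per-label similarity scores into probabilities is a scaled softmax, as used by CLIP-style models. The following is a minimal sketch of that step only. Whether this class uses the same formula is an assumption, and the ZeroShot helper and the scale value of 100 are illustrative rather than taken from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ZeroShot
{
    // Softmax over scaled image-to-label similarities.
    // Duplicate labels overwrite earlier entries in the result.
    public static Dictionary<string, double> ToProbabilities(
        IReadOnlyList<string> labels,
        IReadOnlyList<double> similarities,
        double scale = 100.0)
    {
        if (labels.Count == 0 || labels.Count != similarities.Count)
            throw new ArgumentException("labels and similarities must be non-empty and the same length.");

        // Subtract the largest scaled score for numerical stability.
        double max = similarities.Max(s => s * scale);
        double[] exps = similarities.Select(s => Math.Exp(s * scale - max)).ToArray();
        double sum = exps.Sum();

        var result = new Dictionary<string, double>();
        for (int i = 0; i < labels.Count; i++)
        {
            result[labels[i]] = exps[i] / sum;
        }
        return result;
    }
}
```

The probabilities sum to 1 across the candidate labels, so each score is relative to the other labels supplied rather than an absolute confidence.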