Interface IGpt4VisionModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for GPT-4V-style models that combine vision understanding with large language model capabilities.

public interface IGpt4VisionModel<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

GPT-4V represents the integration of vision capabilities into large language models, enabling sophisticated visual reasoning, multi-turn conversations about images, and complex visual-linguistic tasks.

For Beginners: GPT-4V is like giving ChatGPT the ability to see!

Key capabilities:

  • Visual reasoning: Understanding relationships, counting, spatial awareness
  • Multi-turn dialogue: Context-aware conversations about images
  • Document understanding: Reading and analyzing documents, charts, diagrams
  • Code generation from screenshots: Understanding UI and generating code
  • Creative tasks: Describing images poetically, writing stories from images

Architecture concepts:

  1. Vision Encoder: Processes images into visual tokens
  2. Visual-Language Alignment: Maps visual features to LLM embedding space
  3. Large Language Model: Generates text responses conditioned on visual input
  4. Multi-modal Attention: Allows text to attend to relevant image regions
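
A minimal usage sketch, assuming a concrete IGpt4VisionModel<double> implementation bound to model and a preloaded Tensor<double> image in [channels, height, width] layout (neither is provided by this interface):

void DescribeAndAsk(IGpt4VisionModel<double> model, Tensor<double> image)
{
    // One-shot generation conditioned on the image.
    string caption = model.Generate(image, "Describe this image in one sentence.");

    // Follow-up visual question that also returns a confidence score.
    (string answer, double confidence) = model.AnswerVisualQuestion(image, "How many people are visible?");

    Console.WriteLine(caption);
    Console.WriteLine($"{answer} (confidence {confidence:P0})");
}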

Properties

ContextWindowSize

Gets the context window size in tokens.

int ContextWindowSize { get; }

Property Value

int

MaxImageResolution

Gets the maximum resolution supported for input images.

(int Width, int Height) MaxImageResolution { get; }

Property Value

(int Width, int Height)

MaxImagesPerRequest

Gets the maximum number of images that can be processed in a single request.

int MaxImagesPerRequest { get; }

Property Value

int

SupportedDetailLevels

Gets the supported image detail levels.

IReadOnlyList<string> SupportedDetailLevels { get; }

Property Value

IReadOnlyList<string>
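
These properties are intended as request guards. A minimal sketch, assuming an IGpt4VisionModel<double> instance and a using System.Linq directive for the Contains extension:

// Returns true when a planned request stays within the model's documented limits.
bool FitsLimits(IGpt4VisionModel<double> model, int imageCount, int width, int height, string detail)
{
    (int maxWidth, int maxHeight) = model.MaxImageResolution;
    return imageCount <= model.MaxImagesPerRequest
        && width <= maxWidth
        && height <= maxHeight
        && model.SupportedDetailLevels.Contains(detail);
}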

Methods

AnalyzeChart(Tensor<T>)

Analyzes a chart or graph and extracts data.

(string ChartType, Dictionary<string, object> Data, string Interpretation) AnalyzeChart(Tensor<T> chartImage)

Parameters

chartImage Tensor<T>

Image of a chart or graph.

Returns

(string ChartType, Dictionary<string, object> Data, string Interpretation)

Chart analysis including type, data points, and interpretation.
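
A minimal sketch, assuming model is an IGpt4VisionModel<double> implementation and chartImage is a preloaded Tensor<double>:

// Deconstruct the analysis tuple returned for a chart image.
(string chartType, Dictionary<string, object> data, string interpretation) = model.AnalyzeChart(chartImage);
Console.WriteLine($"Detected a {chartType} chart with {data.Count} extracted fields.");
Console.WriteLine(interpretation);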

AnalyzeDocument(Tensor<T>, string, string?)

Analyzes a document image (PDF page, screenshot, etc.).

string AnalyzeDocument(Tensor<T> documentImage, string analysisType = "summary", string? additionalPrompt = null)

Parameters

documentImage Tensor<T>

The document image.

analysisType string

Analysis type: "summary", "extract_text", "answer_questions", "analyze_structure".

additionalPrompt string?

Optional additional instructions.

Returns

string

Analysis result.

Remarks

Specialized for understanding structured documents such as:

  • PDF pages and scanned documents
  • Charts and graphs
  • Tables and spreadsheets
  • Forms and invoices
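
A minimal sketch, assuming model and a preloaded documentImage tensor:

// First pass: plain summary with the default analysis type.
string summary = model.AnalyzeDocument(documentImage);

// Second pass: structure-focused analysis with extra instructions.
string structure = model.AnalyzeDocument(documentImage,
    analysisType: "analyze_structure",
    additionalPrompt: "List each table and form field you find.");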

AnswerVisualQuestion(Tensor<T>, string)

Answers a visual question with confidence score.

(string Answer, T Confidence) AnswerVisualQuestion(Tensor<T> image, string question)

Parameters

image Tensor<T>

The input image.

question string

Question about the image.

Returns

(string Answer, T Confidence)

Answer and confidence score.

Chat(Tensor<T>, IEnumerable<(string Role, string Content)>, string, int)

Conducts a multi-turn conversation about an image.

string Chat(Tensor<T> image, IEnumerable<(string Role, string Content)> conversationHistory, string userMessage, int maxTokens = 1024)

Parameters

image Tensor<T>

The image being discussed.

conversationHistory IEnumerable<(string Role, string Content)>

Previous turns as (role, content) pairs.

userMessage string

The new user message.

maxTokens int

Maximum tokens to generate.

Returns

string

Generated assistant response.
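
The caller is responsible for maintaining the conversation history. A sketch, assuming model and image as above (requires System.Collections.Generic):

var history = new List<(string Role, string Content)>
{
    ("user", "What is in this photo?"),
    ("assistant", "A cluttered workbench with several hand tools on it."),
};

string reply = model.Chat(image, history, "Which tool is closest to the camera?");

// Append the new exchange so the next turn sees the full context.
history.Add(("user", "Which tool is closest to the camera?"));
history.Add(("assistant", reply));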

CompareImages(Tensor<T>, Tensor<T>, string)

Compares two images and describes their differences.

string CompareImages(Tensor<T> image1, Tensor<T> image2, string comparisonType = "detailed")

Parameters

image1 Tensor<T>

First image.

image2 Tensor<T>

Second image.

comparisonType string

Comparison type: "visual", "semantic", "detailed".

Returns

string

Comparison description.

DescribeImage(Tensor<T>, string, string)

Describes an image with specified style and detail level.

string DescribeImage(Tensor<T> image, string style = "factual", string detailLevel = "medium")

Parameters

image Tensor<T>

The input image.

style string

Description style: "factual", "poetic", "technical", "accessibility".

detailLevel string

Detail level: "low", "medium", "high".

Returns

string

Generated description.
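
A short sketch, assuming model and image as above:

// Defaults: factual style, medium detail.
string caption = model.DescribeImage(image);

// Accessibility-oriented alt text at high detail.
string altText = model.DescribeImage(image, style: "accessibility", detailLevel: "high");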

DetectObjects(Tensor<T>, string?)

Identifies and locates objects in an image with bounding boxes.

IEnumerable<(string Label, T Confidence, int X, int Y, int Width, int Height)> DetectObjects(Tensor<T> image, string? objectQuery = null)

Parameters

image Tensor<T>

The input image.

objectQuery string?

Optional specific objects to find, or null for all objects.

Returns

IEnumerable<(string Label, T Confidence, int X, int Y, int Width, int Height)>

List of detected objects with bounding boxes and confidence scores.
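
A sketch, assuming model and image as above; the query string is illustrative:

// Restrict detection to a query and print each bounding box.
foreach ((string label, double confidence, int x, int y, int w, int h)
         in model.DetectObjects(image, objectQuery: "cars and pedestrians"))
{
    Console.WriteLine($"{label} ({confidence:P0}) at ({x}, {y}), size {w}x{h}");
}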

EvaluateImageQuality(Tensor<T>)

Evaluates image quality and provides improvement suggestions.

(Dictionary<string, T> QualityScores, IEnumerable<string> Suggestions) EvaluateImageQuality(Tensor<T> image)

Parameters

image Tensor<T>

The image to evaluate.

Returns

(Dictionary<string, T> QualityScores, IEnumerable<string> Suggestions)

Quality assessment with scores and suggestions.
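
A sketch, assuming model and image as above; the metric names in QualityScores are implementation-defined:

(Dictionary<string, double> scores, IEnumerable<string> suggestions) = model.EvaluateImageQuality(image);

foreach (KeyValuePair<string, double> metric in scores)
    Console.WriteLine($"{metric.Key}: {metric.Value}");

foreach (string suggestion in suggestions)
    Console.WriteLine($"Suggestion: {suggestion}");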

ExtractStructuredData(Tensor<T>, string)

Extracts structured data from an image.

string ExtractStructuredData(Tensor<T> image, string schema)

Parameters

image Tensor<T>

The input image.

schema string

JSON schema describing expected output structure.

Returns

string

Extracted data as JSON string.

Remarks

For Beginners: Get structured data from images!

Example schema: {"name": "string", "price": "number", "in_stock": "boolean"}

From a product image, this extracts: {"name": "Widget", "price": 29.99, "in_stock": true}
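
A sketch mirroring the schema above, assuming model and a preloaded productImage tensor:

string schema = "{\"name\": \"string\", \"price\": \"number\", \"in_stock\": \"boolean\"}";
string json = model.ExtractStructuredData(productImage, schema);
Console.WriteLine(json); // e.g. {"name": "Widget", "price": 29.99, "in_stock": true}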

ExtractText(Tensor<T>, bool)

Performs OCR with layout understanding.

(string Text, Dictionary<string, object>? LayoutInfo) ExtractText(Tensor<T> image, bool preserveLayout = false)

Parameters

image Tensor<T>

Image containing text.

preserveLayout bool

Whether to preserve spatial layout in output.

Returns

(string Text, Dictionary<string, object>? LayoutInfo)

Extracted text with optional layout information.
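
A sketch, assuming model and a preloaded scanImage tensor:

// Request layout-aware OCR and inspect the optional layout metadata.
(string text, Dictionary<string, object>? layout) = model.ExtractText(scanImage, preserveLayout: true);
Console.WriteLine(text);

if (layout is not null)
    Console.WriteLine($"Layout keys: {string.Join(", ", layout.Keys)}");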

Generate(Tensor<T>, string, int, double)

Generates a response based on an image and text prompt.

string Generate(Tensor<T> image, string prompt, int maxTokens = 1024, double temperature = 0.7)

Parameters

image Tensor<T>

The input image tensor [channels, height, width].

prompt string

The text prompt or question about the image.

maxTokens int

Maximum tokens to generate.

temperature double

Sampling temperature (0-2).

Returns

string

Generated text response.
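
A sketch, assuming model and image as above:

// A low temperature and small token budget for a short, focused answer.
string response = model.Generate(image,
    "What brand of laptop is shown in this photo?",
    maxTokens: 128,
    temperature: 0.2);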

GenerateCodeFromUI(Tensor<T>, string, string?)

Generates code from a UI screenshot.

string GenerateCodeFromUI(Tensor<T> uiScreenshot, string targetFramework = "html_css", string? additionalInstructions = null)

Parameters

uiScreenshot Tensor<T>

Screenshot of a user interface.

targetFramework string

Target framework: "html_css", "react", "flutter", "swiftui".

additionalInstructions string?

Optional styling or functionality instructions.

Returns

string

Generated code.
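
A sketch, assuming model and a preloaded screenshot tensor:

// Target React and pass styling constraints via the optional instructions.
string code = model.GenerateCodeFromUI(screenshot,
    targetFramework: "react",
    additionalInstructions: "Use functional components and CSS modules.");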

GenerateEditInstructions(Tensor<T>, string)

Generates image editing instructions based on a modification request.

string GenerateEditInstructions(Tensor<T> image, string editRequest)

Parameters

image Tensor<T>

The original image.

editRequest string

Description of desired edit.

Returns

string

Structured editing instructions.

GenerateFromMultipleImages(IEnumerable<Tensor<T>>, string, int, double)

Generates a response based on multiple images and text prompt.

string GenerateFromMultipleImages(IEnumerable<Tensor<T>> images, string prompt, int maxTokens = 1024, double temperature = 0.7)

Parameters

images IEnumerable<Tensor<T>>

Multiple input images.

prompt string

The text prompt referencing the images.

maxTokens int

Maximum tokens to generate.

temperature double

Sampling temperature.

Returns

string

Generated text response.

Remarks

For Beginners: Compare and analyze multiple images!

Examples:

  • "What are the differences between these two images?"
  • "Which of these products looks more appealing?"
  • "Describe how these images are related."

GenerateStory(Tensor<T>, string, string)

Generates a creative story or narrative based on an image.

string GenerateStory(Tensor<T> image, string genre = "general", string length = "medium")

Parameters

image Tensor<T>

The inspiring image.

genre string

Story genre: "fantasy", "mystery", "romance", "scifi", "general".

length string

Approximate length: "short", "medium", "long".

Returns

string

Generated story.

GetAttentionMap(Tensor<T>, string)

Gets attention weights showing which image regions influenced the response.

Tensor<T> GetAttentionMap(Tensor<T> image, string prompt)

Parameters

image Tensor<T>

The input image.

prompt string

The prompt used.

Returns

Tensor<T>

Attention map tensor [height, width] showing importance weights.
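
A sketch, assuming model and image as above; how the attention values are read back out depends on the Tensor<T> API and is not shown:

// The result is a [height, width] tensor of importance weights that can be
// overlaid on the original image as a heatmap.
Tensor<double> attention = model.GetAttentionMap(image, "Where is the dog in this picture?");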

SafetyCheck(Tensor<T>)

Identifies potential safety concerns in an image.

Dictionary<string, (bool IsFlagged, T Confidence)> SafetyCheck(Tensor<T> image)

Parameters

image Tensor<T>

The image to analyze.

Returns

Dictionary<string, (bool IsFlagged, T Confidence)>

Safety assessment with categories and confidence levels.
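
A sketch, assuming model and image as above; the category names are implementation-defined:

// Act only on categories the model actually flags.
foreach (KeyValuePair<string, (bool IsFlagged, double Confidence)> category in model.SafetyCheck(image))
{
    if (category.Value.IsFlagged)
        Console.WriteLine($"Flagged: {category.Key} (confidence {category.Value.Confidence:P0})");
}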

VisualReasoning(Tensor<T>, string, string)

Performs visual reasoning tasks.

(string Answer, string Explanation) VisualReasoning(Tensor<T> image, string reasoningTask, string question)

Parameters

image Tensor<T>

The input image.

reasoningTask string

Task type: "count", "compare", "spatial", "temporal", "causal".

question string

Specific question for the reasoning task.

Returns

(string Answer, string Explanation)

Reasoning result with explanation.
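
A sketch, assuming model and image as above:

// A counting task, keeping the explanation for auditability.
(string answer, string explanation) = model.VisualReasoning(image,
    reasoningTask: "count",
    question: "How many chairs are around the table?");
Console.WriteLine($"{answer} (reasoning: {explanation})");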