Interface IGpt4VisionModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Defines the contract for GPT-4V-style models that combine vision understanding with large language model capabilities.

public interface IGpt4VisionModel<T> : IMultimodalEmbedding<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

GPT-4V represents the integration of vision capabilities into large language models, enabling sophisticated visual reasoning, multi-turn conversations about images, and complex visual-linguistic tasks.

For Beginners: GPT-4V is like giving ChatGPT the ability to see!

Key capabilities:

  • Visual reasoning: Understanding relationships, counting, spatial awareness
  • Multi-turn dialogue: Context-aware conversations about images
  • Document understanding: Reading and analyzing documents, charts, diagrams
  • Code generation from screenshots: Understanding UI and generating code
  • Creative tasks: Describing images poetically, writing stories from images

Architecture concepts:

  1. Vision Encoder: Processes images into visual tokens
  2. Visual-Language Alignment: Maps visual features to LLM embedding space
  3. Large Language Model: Generates text responses conditioned on visual input
  4. Multi-modal Attention: Allows text to attend to relevant image regions
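
A minimal usage sketch, assuming a concrete IGpt4VisionModel<double> implementation bound to model and a preloaded Tensor<double> image in [channels, height, width] layout (neither is provided by this interface):

void DescribeAndAsk(IGpt4VisionModel<double> model, Tensor<double> image)
{
    // One-shot generation conditioned on the image.
    string caption = model.Generate(image, "Describe this image in one sentence.");

    // Follow-up visual question that also returns a confidence score.
    (string answer, double confidence) = model.AnswerVisualQuestion(image, "How many people are visible?");

    Console.WriteLine(caption);
    Console.WriteLine($"{answer} (confidence {confidence:P0})");
}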

Properties

ContextWindowSize

Gets the context window size in tokens.

int ContextWindowSize { get; }

Property Value

int

MaxImageResolution

Gets the maximum resolution supported for input images.

(int Width, int Height) MaxImageResolution { get; }

Property Value

(int Width, int Height)

MaxImagesPerRequest

Gets the maximum number of images that can be processed in a single request.

int MaxImagesPerRequest { get; }

Property Value

int

SupportedDetailLevels

Gets the supported image detail levels.

IReadOnlyList<string> SupportedDetailLevels { get; }

Property Value

IReadOnlyList<string>
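
These properties are intended as request guards. A minimal sketch, assuming an IGpt4VisionModel<double> instance and a using System.Linq directive for the Contains extension:

// Returns true when a planned request stays within the model's documented limits.
bool FitsLimits(IGpt4VisionModel<double> model, int imageCount, int width, int height, string detail)
{
    (int maxWidth, int maxHeight) = model.MaxImageResolution;
    return imageCount <= model.MaxImagesPerRequest
        && width <= maxWidth
        && height <= maxHeight
        && model.SupportedDetailLevels.Contains(detail);
}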

Methods

AnalyzeChart(Tensor<T>)

Analyzes a chart or graph and extracts data.

(string ChartType, Dictionary<string, object> Data, string Interpretation) AnalyzeChart(Tensor<T> chartImage)

Parameters

chartImage Tensor<T>

Image of a chart or graph.

Returns

(string ChartType, Dictionary<string, object> Data, string Interpretation)

Chart analysis including type, data points, and interpretation.
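
A minimal sketch, assuming model is an IGpt4VisionModel<double> implementation and chartImage is a preloaded Tensor<double>:

// Deconstruct the analysis tuple returned for a chart image.
(string chartType, Dictionary<string, object> data, string interpretation) = model.AnalyzeChart(chartImage);
Console.WriteLine($"Detected a {chartType} chart with {data.Count} extracted fields.");
Console.WriteLine(interpretation);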

AnalyzeDocument(Tensor<T>, string, string?)

Analyzes a document image (PDF page, screenshot, etc.).

string AnalyzeDocument(Tensor<T> documentImage, string analysisType = "summary", string? additionalPrompt = null)

Parameters

documentImage Tensor<T>

The document image.

analysisType string

Analysis type: "summary", "extract_text", "answer_questions", "analyze_structure".

additionalPrompt string?

Optional additional instructions.

Returns

string

Analysis result.

Remarks

Specialized for understanding structured documents such as:

  • PDF pages and scanned documents
  • Charts and graphs
  • Tables and spreadsheets
  • Forms and invoices
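
A minimal sketch, assuming model and a preloaded documentImage tensor:

// First pass: plain summary with the default analysis type.
string summary = model.AnalyzeDocument(documentImage);

// Second pass: structure-focused analysis with extra instructions.
string structure = model.AnalyzeDocument(documentImage,
    analysisType: "analyze_structure",
    additionalPrompt: "List each table and form field you find.");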

AnswerVisualQuestion(Tensor<T>, string)

Answers a visual question with confidence score.

(string Answer, T Confidence) AnswerVisualQuestion(Tensor<T> image, string question)

Parameters

image Tensor<T>

The input image.

question string

Question about the image.

Returns

(string Answer, T Confidence)

Answer and confidence score.

Chat(Tensor<T>, IEnumerable<(string Role, string Content)>, string, int)

Conducts a multi-turn conversation about an image.

string Chat(Tensor<T> image, IEnumerable<(string Role, string Content)> conversationHistory, string userMessage, int maxTokens = 1024)

Parameters

image Tensor<T>

The image being discussed.

conversationHistory IEnumerable<(string Role, string Content)>

Previous turns as (role, content) pairs.

userMessage string

The new user message.

maxTokens int

Maximum tokens to generate.

Returns

string

Generated assistant response.
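
The caller is responsible for maintaining the conversation history. A sketch, assuming model and image as above (requires System.Collections.Generic):

var history = new List<(string Role, string Content)>
{
    ("user", "What is in this photo?"),
    ("assistant", "A cluttered workbench with several hand tools on it."),
};

string reply = model.Chat(image, history, "Which tool is closest to the camera?");

// Append the new exchange so the next turn sees the full context.
history.Add(("user", "Which tool is closest to the camera?"));
history.Add(("assistant", reply));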

CompareImages(Tensor<T>, Tensor<T>, string)

Compares two images and describes their differences.

string CompareImages(Tensor<T> image1, Tensor<T> image2, string comparisonType = "detailed")

Parameters

image1 Tensor<T>

First image.

image2 Tensor<T>

Second image.

comparisonType string

Comparison type: "visual", "semantic", "detailed".

Returns

string

Comparison description.

DescribeImage(Tensor<T>, string, string)

Describes an image with specified style and detail level.

string DescribeImage(Tensor<T> image, string style = "factual", string detailLevel = "medium")

Parameters

image Tensor<T>

The input image.

style string

Description style: "factual", "poetic", "technical", "accessibility".

detailLevel string

Detail level: "low", "medium", "high".

Returns

string

Generated description.
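
A short sketch, assuming model and image as above:

// Defaults: factual style, medium detail.
string caption = model.DescribeImage(image);

// Accessibility-oriented alt text at high detail.
string altText = model.DescribeImage(image, style: "accessibility", detailLevel: "high");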

DetectObjects(Tensor<T>, string?)

Identifies and locates objects in an image with bounding boxes.

IEnumerable<(string Label, T Confidence, int X, int Y, int Width, int Height)> DetectObjects(Tensor<T> image, string? objectQuery = null)

Parameters

image Tensor<T>

The input image.

objectQuery string?

Optional specific objects to find, or null for all objects.

Returns

IEnumerable<(string Label, T Confidence, int X, int Y, int Width, int Height)>

List of detected objects with bounding boxes and confidence scores.
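
A sketch, assuming model and image as above; the query string is illustrative:

// Restrict detection to a query and print each bounding box.
foreach ((string label, double confidence, int x, int y, int w, int h)
         in model.DetectObjects(image, objectQuery: "cars and pedestrians"))
{
    Console.WriteLine($"{label} ({confidence:P0}) at ({x}, {y}), size {w}x{h}");
}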

EvaluateImageQuality(Tensor<T>)

Evaluates image quality and provides improvement suggestions.

(Dictionary<string, T> QualityScores, IEnumerable<string> Suggestions) EvaluateImageQuality(Tensor<T> image)

Parameters

image Tensor<T>

The image to evaluate.

Returns

(Dictionary<string, T> QualityScores, IEnumerable<string> Suggestions)

Quality assessment with scores and suggestions.
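
A sketch, assuming model and image as above; the metric names in QualityScores are implementation-defined:

(Dictionary<string, double> scores, IEnumerable<string> suggestions) = model.EvaluateImageQuality(image);

foreach (KeyValuePair<string, double> metric in scores)
    Console.WriteLine($"{metric.Key}: {metric.Value}");

foreach (string suggestion in suggestions)
    Console.WriteLine($"Suggestion: {suggestion}");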

ExtractStructuredData(Tensor<T>, string)

Extracts structured data from an image.

string ExtractStructuredData(Tensor<T> image, string schema)

Parameters

image Tensor<T>

The input image.

schema string

JSON schema describing expected output structure.

Returns

string

Extracted data as JSON string.

Remarks

For Beginners: Get structured data from images!

Example schema: {"name": "string", "price": "number", "in_stock": "boolean"}

From a product image, this extracts: {"name": "Widget", "price": 29.99, "in_stock": true}
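
A sketch mirroring the schema above, assuming model and a preloaded productImage tensor:

string schema = "{\"name\": \"string\", \"price\": \"number\", \"in_stock\": \"boolean\"}";
string json = model.ExtractStructuredData(productImage, schema);
Console.WriteLine(json); // e.g. {"name": "Widget", "price": 29.99, "in_stock": true}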

ExtractText(Tensor<T>, bool)

Performs OCR with layout understanding.

(string Text, Dictionary<string, object>? LayoutInfo) ExtractText(Tensor<T> image, bool preserveLayout = false)

Parameters

image Tensor<T>

Image containing text.

preserveLayout bool

Whether to preserve spatial layout in output.

Returns

(string Text, Dictionary<string, object>? LayoutInfo)

Extracted text with optional layout information.
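
A sketch, assuming model and a preloaded scanImage tensor:

// Request layout-aware OCR and inspect the optional layout metadata.
(string text, Dictionary<string, object>? layout) = model.ExtractText(scanImage, preserveLayout: true);
Console.WriteLine(text);

if (layout is not null)
    Console.WriteLine($"Layout keys: {string.Join(", ", layout.Keys)}");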

Generate(Tensor<T>, string, int, double)

Generates a response based on an image and text prompt.

string Generate(Tensor<T> image, string prompt, int maxTokens = 1024, double temperature = 0.7)

Parameters

image Tensor<T>

The input image tensor [channels, height, width].

prompt string

The text prompt or question about the image.

maxTokens int

Maximum tokens to generate.

temperature double

Sampling temperature (0-2).

Returns

string

Generated text response.
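
A sketch, assuming model and image as above:

// A low temperature and small token budget for a short, focused answer.
string response = model.Generate(image,
    "What brand of laptop is shown in this photo?",
    maxTokens: 128,
    temperature: 0.2);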

GenerateCodeFromUI(Tensor<T>, string, string?)

Generates code from a UI screenshot.

string GenerateCodeFromUI(Tensor<T> uiScreenshot, string targetFramework = "html_css", string? additionalInstructions = null)

Parameters

uiScreenshot Tensor<T>

Screenshot of a user interface.

targetFramework string

Target framework: "html_css", "react", "flutter", "swiftui".

additionalInstructions string?

Optional styling or functionality instructions.

Returns

string

Generated code.
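
A sketch, assuming model and a preloaded screenshot tensor:

// Target React and pass styling constraints via the optional instructions.
string code = model.GenerateCodeFromUI(screenshot,
    targetFramework: "react",
    additionalInstructions: "Use functional components and CSS modules.");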

GenerateEditInstructions(Tensor<T>, string)

Generates image editing instructions based on a modification request.

string GenerateEditInstructions(Tensor<T> image, string editRequest)

Parameters

image Tensor<T>

The original image.

editRequest string

Description of desired edit.

Returns

string

Structured editing instructions.

GenerateFromMultipleImages(IEnumerable<Tensor<T>>, string, int, double)

Generates a response based on multiple images and text prompt.

string GenerateFromMultipleImages(IEnumerable<Tensor<T>> images, string prompt, int maxTokens = 1024, double temperature = 0.7)

Parameters

images IEnumerable<Tensor<T>>

Multiple input images.

prompt string

The text prompt referencing the images.

maxTokens int

Maximum tokens to generate.

temperature double

Sampling temperature.

Returns

string

Generated text response.

Remarks

For Beginners: Compare and analyze multiple images!

Examples:

  • "What are the differences between these two images?"
  • "Which of these products looks more appealing?"
  • "Describe how these images are related."

GenerateStory(Tensor<T>, string, string)

Generates a creative story or narrative based on an image.

string GenerateStory(Tensor<T> image, string genre = "general", string length = "medium")

Parameters

image Tensor<T>

The inspiring image.

genre string

Story genre: "fantasy", "mystery", "romance", "scifi", "general".

length string

Approximate length: "short", "medium", "long".

Returns

string

Generated story.

GetAttentionMap(Tensor<T>, string)

Gets attention weights showing which image regions influenced the response.

Tensor<T> GetAttentionMap(Tensor<T> image, string prompt)

Parameters

image Tensor<T>

The input image.

prompt string

The prompt used.

Returns

Tensor<T>

Attention map tensor [height, width] showing importance weights.
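
A sketch, assuming model and image as above; how the attention values are read back out depends on the Tensor<T> API and is not shown:

// The result is a [height, width] tensor of importance weights that can be
// overlaid on the original image as a heatmap.
Tensor<double> attention = model.GetAttentionMap(image, "Where is the dog in this picture?");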

SafetyCheck(Tensor<T>)

Identifies potential safety concerns in an image.

Dictionary<string, (bool IsFlagged, T Confidence)> SafetyCheck(Tensor<T> image)

Parameters

image Tensor<T>

The image to analyze.

Returns

Dictionary<string, (bool IsFlagged, T Confidence)>

Safety assessment with categories and confidence levels.
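
A sketch, assuming model and image as above; the category names are implementation-defined:

// Act only on categories the model actually flags.
foreach (KeyValuePair<string, (bool IsFlagged, double Confidence)> category in model.SafetyCheck(image))
{
    if (category.Value.IsFlagged)
        Console.WriteLine($"Flagged: {category.Key} (confidence {category.Value.Confidence:P0})");
}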

VisualReasoning(Tensor<T>, string, string)

Performs visual reasoning tasks.

(string Answer, string Explanation) VisualReasoning(Tensor<T> image, string reasoningTask, string question)

Parameters

image Tensor<T>

The input image.

reasoningTask string

Task type: "count", "compare", "spatial", "temporal", "causal".

question string

Specific question for the reasoning task.

Returns

(string Answer, string Explanation)

Reasoning result with explanation.
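
A sketch, assuming model and image as above:

// A counting task, keeping the explanation for auditability.
(string answer, string explanation) = model.VisualReasoning(image,
    reasoningTask: "count",
    question: "How many chairs are around the table?");
Console.WriteLine($"{answer} (reasoning: {explanation})");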