Interface IGpt4VisionModel<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Defines the contract for GPT-4V-style models that combine vision understanding with large language model capabilities.
public interface IGpt4VisionModel<T> : IMultimodalEmbedding<T>
Type Parameters
- T: The numeric type used for calculations.
Remarks
GPT-4V represents the integration of vision capabilities into large language models, enabling sophisticated visual reasoning, multi-turn conversations about images, and complex visual-linguistic tasks.
For Beginners: GPT-4V is like giving ChatGPT the ability to see!
Key capabilities:
- Visual reasoning: Understanding relationships, counting, spatial awareness
- Multi-turn dialogue: Context-aware conversations about images
- Document understanding: Reading and analyzing documents, charts, diagrams
- Code generation from screenshots: Understanding UI and generating code
- Creative tasks: Describing images poetically, writing stories from images
Architecture concepts:
- Vision Encoder: Processes images into visual tokens
- Visual-Language Alignment: Maps visual features to LLM embedding space
- Large Language Model: Generates text responses conditioned on visual input
- Multi-modal Attention: Allows text to attend to relevant image regions
Properties
ContextWindowSize
Gets the context window size in tokens.
int ContextWindowSize { get; }
Property Value
- int
MaxImageResolution
Gets the maximum resolution supported for input images.
(int Width, int Height) MaxImageResolution { get; }
Property Value
- (int Width, int Height)
MaxImagesPerRequest
Gets the maximum number of images that can be processed in a single request.
int MaxImagesPerRequest { get; }
Property Value
- int
SupportedDetailLevels
Gets the supported image detail levels.
IReadOnlyList<string> SupportedDetailLevels { get; }
Property Value
- IReadOnlyList<string>
Methods
AnalyzeChart(Tensor<T>)
Analyzes a chart or graph and extracts data.
(string ChartType, Dictionary<string, object> Data, string Interpretation) AnalyzeChart(Tensor<T> chartImage)
Parameters
- chartImage (Tensor<T>): Image of a chart or graph.
Returns
- (string ChartType, Dictionary<string, object> Data, string Interpretation)
Chart analysis including type, data points, and interpretation.
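A call might look like the following sketch, where `model` is a hypothetical `IGpt4VisionModel<double>` implementation and `chartImage` is a tensor loaded elsewhere:

```csharp
// Hypothetical usage sketch; 'model' and 'chartImage' are assumed to exist.
var (chartType, data, interpretation) = model.AnalyzeChart(chartImage);
Console.WriteLine($"Detected a {chartType} chart: {interpretation}");
foreach (var (label, value) in data)
    Console.WriteLine($"  {label}: {value}");
```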
AnalyzeDocument(Tensor<T>, string, string?)
Analyzes a document image (PDF page, screenshot, etc.).
string AnalyzeDocument(Tensor<T> documentImage, string analysisType = "summary", string? additionalPrompt = null)
Parameters
- documentImage (Tensor<T>): The document image.
- analysisType (string): Analysis type: "summary", "extract_text", "answer_questions", or "analyze_structure".
- additionalPrompt (string?): Optional additional instructions.
Returns
- string
Analysis result.
Remarks
Specialized for understanding structured documents such as:
- PDF pages and scanned documents
- Charts and graphs
- Tables and spreadsheets
- Forms and invoices
AnswerVisualQuestion(Tensor<T>, string)
Answers a visual question with confidence score.
(string Answer, T Confidence) AnswerVisualQuestion(Tensor<T> image, string question)
Parameters
- image (Tensor<T>): The input image.
- question (string): Question about the image.
Returns
- (string Answer, T Confidence)
Answer and confidence score.
Chat(Tensor<T>, IEnumerable<(string Role, string Content)>, string, int)
Conducts a multi-turn conversation about an image.
string Chat(Tensor<T> image, IEnumerable<(string Role, string Content)> conversationHistory, string userMessage, int maxTokens = 1024)
Parameters
- image (Tensor<T>): The image being discussed.
- conversationHistory (IEnumerable<(string Role, string Content)>): Previous turns as (role, content) pairs.
- userMessage (string): The new user message.
- maxTokens (int): Maximum tokens to generate.
Returns
- string
Generated assistant response.
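A multi-turn exchange can be sketched as follows, assuming a hypothetical `model` instance and `image` tensor. Note that the caller is responsible for appending each turn to the history:

```csharp
// Hypothetical sketch: carrying conversation history across turns.
var history = new List<(string Role, string Content)>
{
    ("user", "What is in this picture?"),
    ("assistant", "A golden retriever playing in a park.")
};
string reply = model.Chat(image, history, "What breed is the dog?", maxTokens: 256);
history.Add(("user", "What breed is the dog?"));
history.Add(("assistant", reply));
```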
CompareImages(Tensor<T>, Tensor<T>, string)
Compares two images and describes their differences.
string CompareImages(Tensor<T> image1, Tensor<T> image2, string comparisonType = "detailed")
Parameters
- image1 (Tensor<T>): First image.
- image2 (Tensor<T>): Second image.
- comparisonType (string): Comparison type: "visual", "semantic", or "detailed".
Returns
- string
Comparison description.
DescribeImage(Tensor<T>, string, string)
Describes an image with specified style and detail level.
string DescribeImage(Tensor<T> image, string style = "factual", string detailLevel = "medium")
Parameters
- image (Tensor<T>): The input image.
- style (string): Description style: "factual", "poetic", "technical", or "accessibility".
- detailLevel (string): Detail level: "low", "medium", or "high".
Returns
- string
Generated description.
DetectObjects(Tensor<T>, string?)
Identifies and locates objects in an image with bounding boxes.
IEnumerable<(string Label, T Confidence, int X, int Y, int Width, int Height)> DetectObjects(Tensor<T> image, string? objectQuery = null)
Parameters
- image (Tensor<T>): The input image.
- objectQuery (string?): Optional specific objects to find, or null for all objects.
Returns
- IEnumerable<(string Label, T Confidence, int X, int Y, int Width, int Height)>
List of detected objects with bounding boxes and confidence scores.
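Since the return value is a sequence of named tuples, results can be deconstructed directly. A sketch, assuming a hypothetical `model` instance and `image` tensor:

```csharp
// Hypothetical sketch: listing every detected object with its bounding box.
foreach (var (label, confidence, x, y, width, height) in model.DetectObjects(image))
    Console.WriteLine($"{label} ({confidence}) at ({x},{y}) size {width}x{height}");

// Or restrict detection to specific objects:
var dogs = model.DetectObjects(image, objectQuery: "dog");
```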
EvaluateImageQuality(Tensor<T>)
Evaluates image quality and provides improvement suggestions.
(Dictionary<string, T> QualityScores, IEnumerable<string> Suggestions) EvaluateImageQuality(Tensor<T> image)
Parameters
- image (Tensor<T>): The image to evaluate.
Returns
- (Dictionary<string, T> QualityScores, IEnumerable<string> Suggestions)
Quality assessment with scores and suggestions.
ExtractStructuredData(Tensor<T>, string)
Extracts structured data from an image.
string ExtractStructuredData(Tensor<T> image, string schema)
Parameters
- image (Tensor<T>): The input image.
- schema (string): JSON schema describing the expected output structure.
Returns
- string
Extracted data as JSON string.
Remarks
For Beginners: Get structured data from images!
Example schema: {"name": "string", "price": "number", "in_stock": "boolean"}
From a product image, this might extract: {"name": "Widget", "price": 29.99, "in_stock": true}
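The schema-driven extraction described above can be sketched as follows, assuming a hypothetical `model` instance and `productImage` tensor:

```csharp
// Hypothetical sketch: extracting product fields as JSON using a schema string.
string schema = "{\"name\": \"string\", \"price\": \"number\", \"in_stock\": \"boolean\"}";
string json = model.ExtractStructuredData(productImage, schema);

// The returned JSON string can then be parsed, e.g. with System.Text.Json:
using var doc = System.Text.Json.JsonDocument.Parse(json);
var name = doc.RootElement.GetProperty("name").GetString();
```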
ExtractText(Tensor<T>, bool)
Performs OCR with layout understanding.
(string Text, Dictionary<string, object>? LayoutInfo) ExtractText(Tensor<T> image, bool preserveLayout = false)
Parameters
- image (Tensor<T>): Image containing text.
- preserveLayout (bool): Whether to preserve spatial layout in the output.
Returns
- (string Text, Dictionary<string, object>? LayoutInfo)
Extracted text with optional layout information.
Generate(Tensor<T>, string, int, double)
Generates a response based on an image and text prompt.
string Generate(Tensor<T> image, string prompt, int maxTokens = 1024, double temperature = 0.7)
Parameters
- image (Tensor<T>): The input image tensor [channels, height, width].
- prompt (string): The text prompt or question about the image.
- maxTokens (int): Maximum tokens to generate.
- temperature (double): Sampling temperature (0-2).
Returns
- string
Generated text response.
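Basic visual question answering can be sketched as follows, assuming a hypothetical `model` instance and `image` tensor; a lower temperature generally produces more deterministic answers:

```csharp
// Hypothetical sketch: single-image question answering.
string answer = model.Generate(
    image, "What color is the car?", maxTokens: 128, temperature: 0.2);
```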
GenerateCodeFromUI(Tensor<T>, string, string?)
Generates code from a UI screenshot.
string GenerateCodeFromUI(Tensor<T> uiScreenshot, string targetFramework = "html_css", string? additionalInstructions = null)
Parameters
- uiScreenshot (Tensor<T>): Screenshot of a user interface.
- targetFramework (string): Target framework: "html_css", "react", "flutter", or "swiftui".
- additionalInstructions (string?): Optional styling or functionality instructions.
Returns
- string
Generated code.
GenerateEditInstructions(Tensor<T>, string)
Generates image editing instructions based on a modification request.
string GenerateEditInstructions(Tensor<T> image, string editRequest)
Parameters
- image (Tensor<T>): The original image.
- editRequest (string): Description of the desired edit.
Returns
- string
Structured editing instructions.
GenerateFromMultipleImages(IEnumerable<Tensor<T>>, string, int, double)
Generates a response based on multiple images and text prompt.
string GenerateFromMultipleImages(IEnumerable<Tensor<T>> images, string prompt, int maxTokens = 1024, double temperature = 0.7)
Parameters
- images (IEnumerable<Tensor<T>>): Multiple input images.
- prompt (string): The text prompt referencing the images.
- maxTokens (int): Maximum tokens to generate.
- temperature (double): Sampling temperature.
Returns
- string
Generated text response.
Remarks
For Beginners: Compare and analyze multiple images!
Examples:
- "What are the differences between these two images?"
- "Which of these products looks more appealing?"
- "Describe how these images are related."
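The comparison use cases above can be sketched as follows, assuming a hypothetical `model` instance and two image tensors `photoA` and `photoB`:

```csharp
// Hypothetical sketch: comparing two product photos in a single request.
var images = new[] { photoA, photoB };
string comparison = model.GenerateFromMultipleImages(
    images, "Which of these products looks more appealing, and why?");
```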
GenerateStory(Tensor<T>, string, string)
Generates a creative story or narrative based on an image.
string GenerateStory(Tensor<T> image, string genre = "general", string length = "medium")
Parameters
- image (Tensor<T>): The inspiring image.
- genre (string): Story genre: "fantasy", "mystery", "romance", "scifi", or "general".
- length (string): Approximate length: "short", "medium", or "long".
Returns
- string
Generated story.
GetAttentionMap(Tensor<T>, string)
Gets attention weights showing which image regions influenced the response.
Tensor<T> GetAttentionMap(Tensor<T> image, string prompt)
Parameters
- image (Tensor<T>): The input image.
- prompt (string): The prompt used.
Returns
- Tensor<T>
Attention map tensor [height, width] showing importance weights.
SafetyCheck(Tensor<T>)
Identifies potential safety concerns in an image.
Dictionary<string, (bool IsFlagged, T Confidence)> SafetyCheck(Tensor<T> image)
Parameters
- image (Tensor<T>): The image to analyze.
Returns
- Dictionary<string, (bool IsFlagged, T Confidence)>
Safety assessment with categories and confidence levels.
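Since the result maps category names to flag/confidence tuples, unsafe categories can be filtered directly. A sketch, assuming a hypothetical `model` instance and `image` tensor:

```csharp
// Hypothetical sketch: reporting only categories the model flagged.
foreach (var (category, result) in model.SafetyCheck(image))
    if (result.IsFlagged)
        Console.WriteLine($"Flagged: {category} (confidence {result.Confidence})");
```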
VisualReasoning(Tensor<T>, string, string)
Performs visual reasoning tasks.
(string Answer, string Explanation) VisualReasoning(Tensor<T> image, string reasoningTask, string question)
Parameters
- image (Tensor<T>): The input image.
- reasoningTask (string): Task type: "count", "compare", "spatial", "temporal", or "causal".
- question (string): Specific question for the reasoning task.
Returns
- (string Answer, string Explanation)
Answer and an explanation of the reasoning.
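A counting task can be sketched as follows, assuming a hypothetical `model` instance and `image` tensor:

```csharp
// Hypothetical sketch: counting with an explanation of the reasoning.
var (answer, explanation) = model.VisualReasoning(
    image, "count", "How many people are in the image?");
Console.WriteLine($"{answer}: {explanation}");
```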