Interface IUnifiedMultimodalModel<T>
- Namespace: AiDotNet.Interfaces
- Assembly: AiDotNet.dll
Defines the contract for unified multimodal models that handle multiple modalities in a single architecture, similar to GPT-4o, Gemini, or Meta's CM3Leon.
public interface IUnifiedMultimodalModel<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Unified multimodal models process and generate content across multiple modalities (text, image, audio, video) within a single architecture instead of routing each modality through a separate model.
For Beginners: One model that can see, hear, read, and create!
Key capabilities:
- Any-to-any generation: Text → Image, Image → Text, Audio → Text, etc.
- Interleaved understanding: Process mixed sequences of text, images, audio
- Cross-modal reasoning: Answer questions using information from multiple sources
- Unified embeddings: All modalities share a common representation space
Architecture concepts:
- Modality Encoders: Specialized encoders for each input type
- Unified Transformer: Core model that processes all modalities
- Modality Decoders: Generate outputs in any modality
- Cross-Attention: Allows modalities to attend to each other
Properties
EmbeddingDimension
Gets the unified embedding dimension.
int EmbeddingDimension { get; }
Property Value
- int
MaxSequenceLength
Gets the maximum sequence length for interleaved inputs.
int MaxSequenceLength { get; }
Property Value
- int
SupportedInputModalities
Gets the supported input modalities.
IReadOnlyList<ModalityType> SupportedInputModalities { get; }
Property Value
- IReadOnlyList<ModalityType>
SupportedOutputModalities
Gets the supported output modalities.
IReadOnlyList<ModalityType> SupportedOutputModalities { get; }
Property Value
- IReadOnlyList<ModalityType>
SupportsStreaming
Gets whether the model supports streaming generation.
bool SupportsStreaming { get; }
Property Value
- bool
Methods
AlignTemporally(IEnumerable<MultimodalInput<T>>)
Aligns content across modalities temporally.
Matrix<T> AlignTemporally(IEnumerable<MultimodalInput<T>> inputs)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Multimodal inputs with temporal content.
Returns
- Matrix<T>
Alignment matrix showing correspondences.
AnswerQuestion(IEnumerable<MultimodalInput<T>>, string)
Answers a question using multimodal context.
(string Answer, T Confidence) AnswerQuestion(IEnumerable<MultimodalInput<T>> context, string question)
Parameters
context (IEnumerable<MultimodalInput<T>>): Multimodal context (images, documents, audio, etc.).
question (string): The question to answer.
Returns
- (string Answer, T Confidence)
Answer and confidence score.
Chat(IEnumerable<(string Role, IEnumerable<MultimodalInput<T>> Content)>, IEnumerable<MultimodalInput<T>>, int)
Conducts a multi-turn conversation with multimodal inputs.
string Chat(IEnumerable<(string Role, IEnumerable<MultimodalInput<T>> Content)> conversationHistory, IEnumerable<MultimodalInput<T>> newInputs, int maxTokens = 1024)
Parameters
conversationHistory (IEnumerable<(string Role, IEnumerable<MultimodalInput<T>> Content)>): Previous turns with multimodal content.
newInputs (IEnumerable<MultimodalInput<T>>): New multimodal inputs for this turn.
maxTokens (int): Maximum tokens to generate.
Returns
- string
Generated response.
Compare(IEnumerable<MultimodalInput<T>>, IEnumerable<string>)
Compares multiple multimodal inputs and provides analysis.
(string Analysis, Dictionary<string, IEnumerable<T>> Scores) Compare(IEnumerable<MultimodalInput<T>> inputs, IEnumerable<string> comparisonCriteria)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Items to compare.
comparisonCriteria (IEnumerable<string>): What aspects to compare.
Returns
- (string Analysis, Dictionary<string, IEnumerable<T>> Scores)
Comparison analysis.
ComputeSimilarity(MultimodalInput<T>, MultimodalInput<T>)
Computes cross-modal similarity between inputs.
T ComputeSimilarity(MultimodalInput<T> input1, MultimodalInput<T> input2)
Parameters
input1 (MultimodalInput<T>): First multimodal input.
input2 (MultimodalInput<T>): Second multimodal input.
Returns
- T
Similarity score (0-1).
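The interface does not state how the score is computed. One common choice for embeddings in a shared space is cosine similarity rescaled from [-1, 1] to [0, 1]. The sketch below illustrates that choice under the assumption that both inputs have already been encoded to equal-length vectors; `Similarity01` is a hypothetical helper (not part of AiDotNet), and `double` stands in for `T`.

```csharp
using System;

// Hypothetical helper: cosine similarity rescaled to the range [0, 1].
double Similarity01(double[] a, double[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Embeddings must have the same dimension.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // Cosine is undefined for a zero vector; treat that case as neutral.
    if (normA == 0 || normB == 0) return 0.5;

    double cosine = dot / Math.Sqrt(normA * normB); // in [-1, 1]
    return Math.Clamp((cosine + 1.0) / 2.0, 0.0, 1.0);
}

double[] imageEmbedding = { 1.0, 0.0, 1.0 };
double[] textEmbedding = { 1.0, 0.0, 1.0 };
double[] audioEmbedding = { -1.0, 0.0, -1.0 };

Console.WriteLine(Similarity01(imageEmbedding, textEmbedding));  // same direction: highest score
Console.WriteLine(Similarity01(imageEmbedding, audioEmbedding)); // opposite direction: lowest score
```

An actual implementation may use a different metric or a learned scoring head; only the 0 to 1 range is documented.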
Detect(IEnumerable<MultimodalInput<T>>, string)
Detects and localizes objects/events across modalities.
IEnumerable<(string Label, T Confidence, ModalityType Modality, object Location)> Detect(IEnumerable<MultimodalInput<T>> inputs, string targetDescription)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Multimodal inputs to analyze.
targetDescription (string): What to look for.
Returns
- IEnumerable<(string Label, T Confidence, ModalityType Modality, object Location)>
Detections with locations and modalities.
Edit(MultimodalInput<T>, string)
Edits multimodal content based on instructions.
MultimodalOutput<T> Edit(MultimodalInput<T> original, string editInstructions)
Parameters
original (MultimodalInput<T>): Original content.
editInstructions (string): Instructions for editing.
Returns
- MultimodalOutput<T>
Edited content.
Encode(MultimodalInput<T>)
Encodes any supported modality into the unified embedding space.
Vector<T> Encode(MultimodalInput<T> input)
Parameters
input (MultimodalInput<T>): The multimodal input to encode.
Returns
- Vector<T>
Unified embedding vector.
EncodeSequence(IEnumerable<MultimodalInput<T>>)
Encodes multiple interleaved inputs into a sequence of embeddings.
Matrix<T> EncodeSequence(IEnumerable<MultimodalInput<T>> inputs)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Sequence of multimodal inputs in order.
Returns
- Matrix<T>
Matrix of embeddings [numInputs, embeddingDim].
Remarks
For Beginners: Process a conversation with mixed content!
Example input sequence:
- Text: "Look at this image and describe what you see"
- Image: [photo of a cat]
- Text: "Now listen to this sound"
- Audio: [meowing sound]
- Text: "Are they related?"
FewShotLearn(IEnumerable<(IEnumerable<MultimodalInput<T>> Inputs, MultimodalOutput<T> Output)>, IEnumerable<MultimodalInput<T>>)
Performs in-context learning from multimodal examples.
MultimodalOutput<T> FewShotLearn(IEnumerable<(IEnumerable<MultimodalInput<T>> Inputs, MultimodalOutput<T> Output)> examples, IEnumerable<MultimodalInput<T>> query)
Parameters
examples (IEnumerable<(IEnumerable<MultimodalInput<T>> Inputs, MultimodalOutput<T> Output)>): Few-shot examples with inputs and outputs.
query (IEnumerable<MultimodalInput<T>>): Query to process.
Returns
- MultimodalOutput<T>
Predicted output based on examples.
Fuse(IEnumerable<MultimodalInput<T>>, string)
Fuses multiple modality inputs into a unified representation.
Vector<T> Fuse(IEnumerable<MultimodalInput<T>> inputs, string fusionStrategy = "attention")
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Inputs to fuse.
fusionStrategy (string): Strategy: "early", "late", "attention", "hybrid".
Returns
- Vector<T>
Fused embedding.
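How each strategy name is interpreted is left to the implementation. As a rough illustration of what the default "attention" strategy could mean, the sketch below pools per-modality embeddings with softmax weights derived from dot products against a query vector. `AttentionFuse`, the `query` vector, and the use of `double` in place of `T` are all assumptions made for this example, not AiDotNet APIs.

```csharp
using System;
using System.Linq;

// Hypothetical helper: softmax-weighted sum of embeddings (attention pooling).
double[] AttentionFuse(double[][] embeddings, double[] query)
{
    if (embeddings.Length == 0)
        throw new ArgumentException("At least one embedding is required.");
    int dim = query.Length;
    if (embeddings.Any(e => e.Length != dim))
        throw new ArgumentException("All embeddings must match the query dimension.");

    // Score each embedding by its dot product with the query.
    double[] scores = embeddings
        .Select(e => e.Zip(query, (x, q) => x * q).Sum())
        .ToArray();

    // Numerically stable softmax over the scores.
    double max = scores.Max();
    double[] weights = scores.Select(s => Math.Exp(s - max)).ToArray();
    double total = weights.Sum();

    // Weighted sum of the embeddings.
    var fused = new double[dim];
    for (int k = 0; k < embeddings.Length; k++)
        for (int i = 0; i < dim; i++)
            fused[i] += weights[k] / total * embeddings[k][i];
    return fused;
}

double[][] encoded = { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 } };

// A zero query scores both embeddings equally, so the result is their mean.
Console.WriteLine(string.Join(" ", AttentionFuse(encoded, new[] { 0.0, 0.0 })));

// A query aligned with the first embedding pulls the result toward it.
Console.WriteLine(string.Join(" ", AttentionFuse(encoded, new[] { 10.0, 0.0 })));
```

The fused vector keeps the same dimension as the inputs, which matches the single `Vector<T>` return type; a concatenation-based strategy would instead grow the dimension.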
Generate(IEnumerable<MultimodalInput<T>>, ModalityType, int)
Generates output in the specified modality given multimodal inputs.
MultimodalOutput<T> Generate(IEnumerable<MultimodalInput<T>> inputs, ModalityType outputModality, int maxLength = 1024)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Input sequence (can be multiple modalities).
outputModality (ModalityType): Desired output modality.
maxLength (int): Maximum output length (tokens for text, frames for video, etc.).
Returns
- MultimodalOutput<T>
Generated output in the specified modality.
GenerateAudio(IEnumerable<MultimodalInput<T>>, double, int)
Generates audio from multimodal inputs.
Tensor<T> GenerateAudio(IEnumerable<MultimodalInput<T>> inputs, double durationSeconds = 5, int sampleRate = 44100)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Multimodal input sequence.
durationSeconds (double): Target audio duration.
sampleRate (int): Output sample rate.
Returns
- Tensor<T>
Generated audio waveform.
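The two parameters together determine how many samples a waveform of the requested length contains. The sketch below works through that arithmetic for the default values; `SampleCount` is a hypothetical helper, and it assumes a single (mono) channel, since the tensor's channel layout is not documented here.

```csharp
using System;

// Hypothetical helper: number of samples per channel for a given duration and rate.
int SampleCount(double durationSeconds, int sampleRate)
{
    if (durationSeconds < 0)
        throw new ArgumentOutOfRangeException(nameof(durationSeconds));
    if (sampleRate <= 0)
        throw new ArgumentOutOfRangeException(nameof(sampleRate));
    return (int)Math.Round(durationSeconds * sampleRate);
}

Console.WriteLine(SampleCount(5, 44100));   // defaults: 5 s at 44.1 kHz gives 220500 samples
Console.WriteLine(SampleCount(2.5, 16000)); // 2.5 s at 16 kHz gives 40000 samples
```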
GenerateImage(IEnumerable<MultimodalInput<T>>, int, int)
Generates an image from multimodal inputs.
Tensor<T> GenerateImage(IEnumerable<MultimodalInput<T>> inputs, int width = 512, int height = 512)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Multimodal input sequence (text prompts, reference images, etc.).
width (int): Output image width.
height (int): Output image height.
Returns
- Tensor<T>
Generated image tensor [channels, height, width].
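A [channels, height, width] layout places all rows of one channel before the next channel begins. The sketch below shows how a (channel, row, column) position maps to an offset in a flat buffer, assuming contiguous row-major storage and three channels. `ChwIndex` is a hypothetical helper, and the internal storage of `Tensor<T>` may differ.

```csharp
using System;

// Hypothetical helper: flat offset of element (c, y, x) in a row-major [channels, height, width] buffer.
int ChwIndex(int c, int y, int x, int height, int width) =>
    (c * height + y) * width + x;

int channels = 3, height = 512, width = 512;
var pixels = new double[channels * height * width];

// Write one value into channel 1, row 10, column 20.
pixels[ChwIndex(1, 10, 20, height, width)] = 0.5;

Console.WriteLine(pixels.Length);                                          // 786432
Console.WriteLine(ChwIndex(channels - 1, height - 1, width - 1, height, width)); // 786431 (last element)
```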
GenerateInterleaved(IEnumerable<MultimodalInput<T>>, IEnumerable<(ModalityType Modality, int MaxLength)>)
Generates an interleaved sequence of multiple modalities.
IEnumerable<MultimodalOutput<T>> GenerateInterleaved(IEnumerable<MultimodalInput<T>> inputs, IEnumerable<(ModalityType Modality, int MaxLength)> outputSpec)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Input sequence.
outputSpec (IEnumerable<(ModalityType Modality, int MaxLength)>): Specification of desired outputs (modality, length pairs).
Returns
- IEnumerable<MultimodalOutput<T>>
Interleaved output sequence.
Remarks
This enables generation of content like illustrated stories, narrated videos, or multimedia presentations.
GenerateText(IEnumerable<MultimodalInput<T>>, string, int, double)
Generates text response from multimodal inputs.
string GenerateText(IEnumerable<MultimodalInput<T>> inputs, string prompt, int maxTokens = 1024, double temperature = 0.7)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Multimodal input sequence.
prompt (string): Text prompt/instruction.
maxTokens (int): Maximum tokens to generate.
temperature (double): Sampling temperature.
Returns
- string
Generated text response.
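In most language models, the sampling temperature divides the logits before the softmax: values below 1 sharpen the distribution toward the most likely token, and values above 1 flatten it. The sketch below illustrates that general technique; `SoftmaxWithTemperature` is a hypothetical helper, and whether a given implementation of this interface applies temperature in exactly this way is an assumption.

```csharp
using System;
using System.Linq;

// Hypothetical helper: softmax over logits scaled by 1 / temperature.
double[] SoftmaxWithTemperature(double[] logits, double temperature)
{
    if (temperature <= 0)
        throw new ArgumentOutOfRangeException(nameof(temperature), "Temperature must be positive.");

    // Subtract the maximum logit for numerical stability.
    double max = logits.Max();
    double[] exp = logits.Select(l => Math.Exp((l - max) / temperature)).ToArray();
    double sum = exp.Sum();
    return exp.Select(e => e / sum).ToArray();
}

double[] logits = { 2.0, 1.0, 0.0 };

// Lower temperature concentrates probability on the top token.
Console.WriteLine(string.Join(" ", SoftmaxWithTemperature(logits, 0.7).Select(p => p.ToString("F3"))));
Console.WriteLine(string.Join(" ", SoftmaxWithTemperature(logits, 1.5).Select(p => p.ToString("F3"))));
```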
GetCrossModalAttention(IEnumerable<MultimodalInput<T>>)
Gets attention weights showing cross-modal relationships.
Tensor<T> GetCrossModalAttention(IEnumerable<MultimodalInput<T>> inputs)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Multimodal inputs.
Returns
- Tensor<T>
Attention weights between all input pairs.
Reason(IEnumerable<MultimodalInput<T>>, string)
Performs reasoning across multiple modalities.
(string Result, IEnumerable<string> ReasoningSteps) Reason(IEnumerable<MultimodalInput<T>> inputs, string task)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Multimodal inputs to reason about.
task (string): Reasoning task description.
Returns
- (string Result, IEnumerable<string> ReasoningSteps)
Reasoning result with step-by-step explanation.
Remarks
For Beginners: Multi-step thinking across different inputs!
Example: Given an image of a recipe and audio of someone cooking, reason about whether they're following the recipe correctly.
Retrieve(MultimodalInput<T>, IEnumerable<MultimodalInput<T>>, int)
Retrieves the most similar items from a database given a query.
IEnumerable<(int Index, T Score, ModalityType Modality)> Retrieve(MultimodalInput<T> query, IEnumerable<MultimodalInput<T>> database, int topK = 10)
Parameters
query (MultimodalInput<T>): Query input (any modality).
database (IEnumerable<MultimodalInput<T>>): Database of items (any modalities).
topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score, ModalityType Modality)>
Indices, scores, and modalities of matching items.
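Retrieval of this kind is often implemented by scoring every database item against the query in the shared embedding space and keeping the best `topK`. The sketch below shows that pattern on plain arrays, assuming the items are already encoded; `Cosine` and `TopK` are hypothetical helpers, the modality field of the result is omitted, and the actual ranking method used by an implementation is not specified by the interface.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: cosine similarity, returning 0 when either vector is all zeros.
double Cosine(double[] a, double[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return normA == 0 || normB == 0 ? 0 : dot / Math.Sqrt(normA * normB);
}

// Hypothetical helper: indices and scores of the topK most similar items, best first.
List<(int Index, double Score)> TopK(double[] query, IReadOnlyList<double[]> database, int topK) =>
    database
        .Select((item, index) => (Index: index, Score: Cosine(query, item)))
        .OrderByDescending(r => r.Score)
        .Take(topK)
        .ToList();

var items = new List<double[]>
{
    new[] { 1.0, 0.0 }, // index 0: same direction as the query
    new[] { 0.0, 1.0 }, // index 1: orthogonal to the query
    new[] { 1.0, 1.0 }, // index 2: partially aligned
};

foreach (var (index, score) in TopK(new[] { 1.0, 0.0 }, items, 2))
    Console.WriteLine($"{index}: {score:F3}");
```

For large databases, a full sort is usually replaced by an approximate nearest-neighbor index; the linear scan here only illustrates the shape of the result.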
SafetyCheck(IEnumerable<MultimodalInput<T>>)
Checks content for safety across all modalities.
Dictionary<ModalityType, (bool IsSafe, IEnumerable<string> Flags)> SafetyCheck(IEnumerable<MultimodalInput<T>> inputs)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Content to check.
Returns
- Dictionary<ModalityType, (bool IsSafe, IEnumerable<string> Flags)>
Safety assessment per modality.
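Because the result is keyed by modality, callers typically need to reduce it to a single decision. The sketch below shows one way to do that, treating content as safe only if every modality is safe. `IsAllSafe` and `CollectFlags` are hypothetical helpers, the flag strings are invented for illustration, and string keys stand in for `ModalityType` so the example runs without the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: true only when every modality passed the check.
bool IsAllSafe<TKey>(Dictionary<TKey, (bool IsSafe, IEnumerable<string> Flags)> report)
    where TKey : notnull =>
    report.Values.All(v => v.IsSafe);

// Hypothetical helper: flags from unsafe modalities, prefixed with the modality key.
List<string> CollectFlags<TKey>(Dictionary<TKey, (bool IsSafe, IEnumerable<string> Flags)> report)
    where TKey : notnull =>
    report
        .Where(kv => !kv.Value.IsSafe)
        .SelectMany(kv => kv.Value.Flags.Select(f => $"{kv.Key}: {f}"))
        .ToList();

// Stand-in for a SafetyCheck result; keys would be ModalityType values in practice.
var report = new Dictionary<string, (bool IsSafe, IEnumerable<string> Flags)>
{
    ["Text"] = (true, Array.Empty<string>()),
    ["Image"] = (false, new[] { "example-flag" }),
};

Console.WriteLine(IsAllSafe(report)); // False, because the image entry is unsafe
foreach (string flag in CollectFlags(report))
    Console.WriteLine(flag);          // Image: example-flag
```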
Summarize(IEnumerable<MultimodalInput<T>>, ModalityType, int)
Summarizes multimodal content.
MultimodalOutput<T> Summarize(IEnumerable<MultimodalInput<T>> inputs, ModalityType outputModality = ModalityType.Text, int maxLength = 256)
Parameters
inputs (IEnumerable<MultimodalInput<T>>): Multimodal content to summarize.
outputModality (ModalityType): Modality for the summary.
maxLength (int): Maximum summary length.
Returns
- MultimodalOutput<T>
Summary in the specified modality.
Translate(MultimodalInput<T>, ModalityType)
Translates content between modalities.
MultimodalOutput<T> Translate(MultimodalInput<T> input, ModalityType targetModality)
Parameters
input (MultimodalInput<T>): Source input.
targetModality (ModalityType): Target modality.
Returns
- MultimodalOutput<T>
Translated content.