Class UnifiedMultimodalNetwork<T>
- Namespace
- AiDotNet.NeuralNetworks
- Assembly
- AiDotNet.dll
Unified multimodal network that handles text, images, audio, and video in a single architecture with cross-modal attention and any-to-any generation.
public class UnifiedMultimodalNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IUnifiedMultimodalModel<T>
Type Parameters
- T
The numeric type for calculations.
- Inheritance
UnifiedMultimodalNetwork<T>
Constructors
UnifiedMultimodalNetwork(NeuralNetworkArchitecture<T>, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?, int?)
Initializes a new instance of the UnifiedMultimodalNetwork<T> class.
public UnifiedMultimodalNetwork(NeuralNetworkArchitecture<T> architecture, int embeddingDimension = 768, int maxSequenceLength = 2048, int numTransformerLayers = 12, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null, int? seed = null)
Parameters
- architecture NeuralNetworkArchitecture<T>
- embeddingDimension int
- maxSequenceLength int
- numTransformerLayers int
- optimizer IOptimizer<T, Tensor<T>, Tensor<T>>
- lossFunction ILossFunction<T>
- seed int?
Properties
EmbeddingDimension
Gets the unified embedding dimension.
public int EmbeddingDimension { get; }
Property Value
- int
MaxSequenceLength
Gets the maximum sequence length for interleaved inputs.
public int MaxSequenceLength { get; }
Property Value
- int
ParameterCount
Gets the total number of parameters in the model.
public override int ParameterCount { get; }
Property Value
- int
Remarks
For Beginners: This tells you how many adjustable values (weights and biases) your neural network has. More complex networks typically have more parameters and can learn more complex patterns, but also require more data to train effectively. This is part of the IFullModel interface for consistency with other model types.
Performance: This property uses caching to avoid recomputing the sum on every access. The cache is invalidated when layers are modified.
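The caching behaviour described above can be illustrated with a small, self-contained sketch. It is not the library's implementation: `layerSizes`, `GetParameterCount`, and `AddLayer` are hypothetical stand-ins used only to show a cached sum that is invalidated when the layers change.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a network: each entry is one layer's parameter count.
var layerSizes = new List<int> { 768 * 768, 768, 768 * 4 };
int? cachedCount = null;   // null means "cache is stale"
int recomputations = 0;    // counts how often the sum is actually recomputed

int GetParameterCount()
{
    if (cachedCount is null)
    {
        cachedCount = layerSizes.Sum();
        recomputations++;
    }
    return cachedCount.Value;
}

void AddLayer(int parameterCount)
{
    layerSizes.Add(parameterCount);
    cachedCount = null;    // modifying the layers invalidates the cache
}

int first = GetParameterCount();   // computes and caches the sum
int second = GetParameterCount();  // served from the cache
AddLayer(10);
int third = GetParameterCount();   // recomputed after invalidation

Console.WriteLine($"{first} {second} {third} (recomputed {recomputations} times)");
```

Three reads trigger only two recomputations here, which is the saving the cache is meant to provide.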
SupportedInputModalities
Gets the supported input modalities.
public IReadOnlyList<ModalityType> SupportedInputModalities { get; }
Property Value
- IReadOnlyList<ModalityType>
SupportedOutputModalities
Gets the supported output modalities.
public IReadOnlyList<ModalityType> SupportedOutputModalities { get; }
Property Value
- IReadOnlyList<ModalityType>
SupportsStreaming
Gets whether the model supports streaming generation.
public bool SupportsStreaming { get; }
Property Value
- bool
Methods
AlignTemporally(IEnumerable<MultimodalInput<T>>)
Aligns content across modalities temporally.
public Matrix<T> AlignTemporally(IEnumerable<MultimodalInput<T>> inputs)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Multimodal inputs with temporal content.
Returns
- Matrix<T>
Alignment matrix showing correspondences.
AnswerQuestion(IEnumerable<MultimodalInput<T>>, string)
Answers a question using multimodal context.
public (string Answer, T Confidence) AnswerQuestion(IEnumerable<MultimodalInput<T>> context, string question)
Parameters
- context IEnumerable<MultimodalInput<T>>
Multimodal context (images, documents, audio, etc.).
- question string
The question to answer.
Returns
- (string Answer, T Confidence)
Answer and confidence score.
Chat(IEnumerable<(string Role, IEnumerable<MultimodalInput<T>> Content)>, IEnumerable<MultimodalInput<T>>, int)
Conducts a multi-turn conversation with multimodal inputs.
public string Chat(IEnumerable<(string Role, IEnumerable<MultimodalInput<T>> Content)> conversationHistory, IEnumerable<MultimodalInput<T>> newInputs, int maxTokens = 1024)
Parameters
- conversationHistory IEnumerable<(string Role, IEnumerable<MultimodalInput<T>> Content)>
Previous turns with multimodal content.
- newInputs IEnumerable<MultimodalInput<T>>
New multimodal inputs for this turn.
- maxTokens int
Maximum tokens to generate.
Returns
- string
Generated response.
Compare(IEnumerable<MultimodalInput<T>>, IEnumerable<string>)
Compares multiple multimodal inputs and provides analysis.
public (string Analysis, Dictionary<string, IEnumerable<T>> Scores) Compare(IEnumerable<MultimodalInput<T>> inputs, IEnumerable<string> comparisonCriteria)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Items to compare.
- comparisonCriteria IEnumerable<string>
What aspects to compare.
Returns
- (string Analysis, Dictionary<string, IEnumerable<T>> Scores)
Comparison analysis.
ComputeSimilarity(MultimodalInput<T>, MultimodalInput<T>)
Computes cross-modal similarity between inputs.
public T ComputeSimilarity(MultimodalInput<T> input1, MultimodalInput<T> input2)
Parameters
- input1 MultimodalInput<T>
First multimodal input.
- input2 MultimodalInput<T>
Second multimodal input.
Returns
- T
Similarity score (0-1).
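This page does not say which similarity measure is used. As one plausible illustration, and purely as an assumption, the sketch below computes cosine similarity between two embedding vectors and rescales it from [-1, 1] to [0, 1]. `Similarity01` is a hypothetical helper, not part of the library.

```csharp
using System;

// Cosine similarity rescaled from [-1, 1] to [0, 1] (assumed measure, for illustration only).
double Similarity01(double[] a, double[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Embeddings must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // A zero vector carries no direction; treat it as cosine 0, which maps to 0.5.
    if (normA == 0 || normB == 0) return 0.5;

    double cosine = dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    return (cosine + 1.0) / 2.0;
}

double same = Similarity01(new[] { 1.0, 2.0 }, new[] { 2.0, 4.0 });
double opposite = Similarity01(new[] { 1.0, 0.0 }, new[] { -1.0, 0.0 });
double orthogonal = Similarity01(new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 });
Console.WriteLine($"{same:F3} {opposite:F3} {orthogonal:F3}");
```

Parallel vectors score near 1, opposite vectors near 0, and orthogonal vectors land at the midpoint.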
CreateNewInstance()
Creates a new instance of the same type as this neural network.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
A new instance of the same neural network type.
Remarks
For Beginners: This creates a blank version of the same type of neural network.
It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.
DeepCopy()
Creates a deep copy of the neural network.
public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
A new instance that is a deep copy of this neural network.
Remarks
This method creates a complete independent copy of the network, including all layers and their parameters. It uses serialization and deserialization to ensure a true deep copy.
For Beginners: This creates a completely independent duplicate of your neural network.
Think of it like creating an exact clone of your network where:
- The copy has the same structure (layers, connections)
- The copy has the same learned parameters (weights, biases)
- Changes to one network don't affect the other
This is useful when you want to:
- Experiment with modifications without risking your original network
- Create multiple variations of a model
- Save a snapshot of your model at a particular point in training
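The serialize-then-deserialize approach to deep copying can be sketched on a plain array of weights. This is a simplified illustration under the assumption that the state is just a list of doubles; `Serialize` and `Deserialize` are hypothetical helpers, and the real network writes far more than this.

```csharp
using System;
using System.IO;

// Hypothetical helper: write the weights to an in-memory byte buffer.
byte[] Serialize(double[] weights)
{
    using var stream = new MemoryStream();
    using (var writer = new BinaryWriter(stream))
    {
        writer.Write(weights.Length);
        foreach (double w in weights) writer.Write(w);
    }
    return stream.ToArray();
}

// Hypothetical helper: rebuild a fresh array from the byte buffer.
double[] Deserialize(byte[] data)
{
    using var reader = new BinaryReader(new MemoryStream(data));
    int count = reader.ReadInt32();
    var weights = new double[count];
    for (int i = 0; i < count; i++) weights[i] = reader.ReadDouble();
    return weights;
}

double[] original = { 0.1, -0.5, 2.0 };
double[] copy = Deserialize(Serialize(original));
copy[0] = 99.0; // changing the copy leaves the original untouched

Console.WriteLine($"original[0] = {original[0]}, copy[0] = {copy[0]}");
```

Because the copy is rebuilt from bytes, it shares no references with the original, which is what makes it a deep copy.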
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data that was not covered by the general deserialization process.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
- reader BinaryReader
The BinaryReader to read the data from.
Remarks
This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.
For Beginners: Using the suitcase analogy from SerializeNetworkSpecificData, this is like unpacking the special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method lets each specific type of neural network unpack the unique items it stored during serialization.
Detect(IEnumerable<MultimodalInput<T>>, string)
Detects and localizes objects/events across modalities.
public IEnumerable<(string Label, T Confidence, ModalityType Modality, object Location)> Detect(IEnumerable<MultimodalInput<T>> inputs, string targetDescription)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Multimodal inputs to analyze.
- targetDescription string
What to look for.
Returns
- IEnumerable<(string Label, T Confidence, ModalityType Modality, object Location)>
Detections with locations and modalities.
Edit(MultimodalInput<T>, string)
Edits multimodal content based on instructions.
public MultimodalOutput<T> Edit(MultimodalInput<T> original, string editInstructions)
Parameters
- original MultimodalInput<T>
Original content.
- editInstructions string
Instructions for editing.
Returns
- MultimodalOutput<T>
Edited content.
Encode(MultimodalInput<T>)
Encodes any supported modality into the unified embedding space.
public Vector<T> Encode(MultimodalInput<T> input)
Parameters
- input MultimodalInput<T>
The multimodal input to encode.
Returns
- Vector<T>
Unified embedding vector.
EncodeSequence(IEnumerable<MultimodalInput<T>>)
Encodes multiple interleaved inputs into a sequence of embeddings.
public Matrix<T> EncodeSequence(IEnumerable<MultimodalInput<T>> inputs)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Sequence of multimodal inputs in order.
Returns
- Matrix<T>
Matrix of embeddings [numInputs, embeddingDim].
Remarks
For Beginners: Process a conversation with mixed content!
Example input sequence:
- Text: "Look at this image and describe what you see"
- Image: [photo of a cat]
- Text: "Now listen to this sound"
- Audio: [meowing sound]
- Text: "Are they related?"
FewShotLearn(IEnumerable<(IEnumerable<MultimodalInput<T>> Inputs, MultimodalOutput<T> Output)>, IEnumerable<MultimodalInput<T>>)
Performs in-context learning from multimodal examples.
public MultimodalOutput<T> FewShotLearn(IEnumerable<(IEnumerable<MultimodalInput<T>> Inputs, MultimodalOutput<T> Output)> examples, IEnumerable<MultimodalInput<T>> query)
Parameters
- examples IEnumerable<(IEnumerable<MultimodalInput<T>> Inputs, MultimodalOutput<T> Output)>
Few-shot examples with inputs and outputs.
- query IEnumerable<MultimodalInput<T>>
Query to process.
Returns
- MultimodalOutput<T>
Predicted output based on examples.
Fuse(IEnumerable<MultimodalInput<T>>, string)
Fuses multiple modality inputs into a unified representation.
public Vector<T> Fuse(IEnumerable<MultimodalInput<T>> inputs, string fusionStrategy = "attention")
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Inputs to fuse.
- fusionStrategy string
Strategy: "early", "late", "attention", "hybrid".
Returns
- Vector<T>
Fused embedding.
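The strategy names are listed here without definitions, so the following is only a guess at what two of them could mean. As an assumption, the sketch treats "late" fusion as an element-wise mean of per-input embeddings and "attention" fusion as a softmax-weighted sum, with each weight derived from the dot product of an embedding with the mean embedding. `MeanFuse` and `AttentionFuse` are hypothetical helpers, not library methods.

```csharp
using System;
using System.Linq;

// Assumed "late"-style fusion: element-wise mean of the embeddings.
double[] MeanFuse(double[][] embeddings)
{
    int dim = embeddings[0].Length;
    var fused = new double[dim];
    foreach (var e in embeddings)
        for (int i = 0; i < dim; i++) fused[i] += e[i] / embeddings.Length;
    return fused;
}

// Assumed "attention"-style fusion: weight each embedding by a softmax over
// its dot product with the mean embedding, then sum the weighted embeddings.
double[] AttentionFuse(double[][] embeddings)
{
    double[] query = MeanFuse(embeddings);
    double[] scores = embeddings.Select(e => e.Zip(query, (x, q) => x * q).Sum()).ToArray();
    double max = scores.Max();   // subtract the max for numerical stability
    double[] exp = scores.Select(s => Math.Exp(s - max)).ToArray();
    double total = exp.Sum();

    var fused = new double[query.Length];
    for (int k = 0; k < embeddings.Length; k++)
        for (int i = 0; i < fused.Length; i++)
            fused[i] += exp[k] / total * embeddings[k][i];
    return fused;
}

double[][] balanced = { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 } };
double[][] skewed = { new[] { 2.0, 0.0 }, new[] { 0.0, 1.0 } };

double[] meanBalanced = MeanFuse(balanced);
double[] attnBalanced = AttentionFuse(balanced);
double[] meanSkewed = MeanFuse(skewed);
double[] attnSkewed = AttentionFuse(skewed);

Console.WriteLine($"mean: {meanSkewed[0]:F3}, attention: {attnSkewed[0]:F3}");
```

With symmetric inputs both strategies agree. With a dominant input, the attention-style weights pull the result towards it, while the mean treats every input equally.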
Generate(IEnumerable<MultimodalInput<T>>, ModalityType, int)
Generates output in the specified modality given multimodal inputs.
public MultimodalOutput<T> Generate(IEnumerable<MultimodalInput<T>> inputs, ModalityType outputModality, int maxLength = 1024)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Input sequence (can be multiple modalities).
- outputModality ModalityType
Desired output modality.
- maxLength int
Maximum output length (tokens for text, frames for video, etc.).
Returns
- MultimodalOutput<T>
Generated output in the specified modality.
GenerateAudio(IEnumerable<MultimodalInput<T>>, double, int)
Generates audio from multimodal inputs.
public Tensor<T> GenerateAudio(IEnumerable<MultimodalInput<T>> inputs, double durationSeconds = 5, int sampleRate = 44100)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Multimodal input sequence.
- durationSeconds double
Target audio duration.
- sampleRate int
Output sample rate.
Returns
- Tensor<T>
Generated audio waveform.
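The two parameters together determine how many samples a waveform contains. The sketch below shows that arithmetic for the default arguments, assuming a single-channel waveform; the actual shape of the returned tensor is not documented here, and `SampleCount` is a hypothetical helper.

```csharp
using System;

// Number of samples in a single-channel waveform of the given duration.
int SampleCount(double durationSeconds, int sampleRate)
{
    if (durationSeconds < 0 || sampleRate <= 0)
        throw new ArgumentOutOfRangeException(nameof(durationSeconds),
            "Duration must be non-negative and the sample rate positive.");
    return (int)Math.Round(durationSeconds * sampleRate);
}

int defaultSamples = SampleCount(5, 44100);
Console.WriteLine(defaultSamples); // 220500
```

Halving the sample rate or the duration halves the number of values the model has to produce.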
GenerateImage(IEnumerable<MultimodalInput<T>>, int, int)
Generates an image from multimodal inputs.
public Tensor<T> GenerateImage(IEnumerable<MultimodalInput<T>> inputs, int width = 512, int height = 512)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Multimodal input sequence (text prompts, reference images, etc.).
- width int
Output image width.
- height int
Output image height.
Returns
- Tensor<T>
Generated image tensor [channels, height, width].
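To make the [channels, height, width] layout concrete, the sketch below indexes a flat buffer in that order. It assumes a contiguous, row-major layout and three channels, neither of which this page confirms; `Offset` is a hypothetical helper and a plain `float[]` stands in for `Tensor<T>`.

```csharp
using System;

// Flat offset of element (c, y, x) in a contiguous, row-major [channels, height, width] buffer.
int Offset(int c, int y, int x, int h, int w) => (c * h + y) * w + x;

const int channels = 3, height = 512, width = 512;
var image = new float[channels * height * width];

int position = Offset(1, 10, 20, height, width);
image[position] = 0.5f; // channel 1, row 10, column 20

Console.WriteLine($"{image.Length} values, wrote at offset {position}");
```

In this layout all values of one channel are stored together, so moving to the next channel skips `height * width` elements.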
GenerateInterleaved(IEnumerable<MultimodalInput<T>>, IEnumerable<(ModalityType Modality, int MaxLength)>)
Generates an interleaved sequence of multiple modalities.
public IEnumerable<MultimodalOutput<T>> GenerateInterleaved(IEnumerable<MultimodalInput<T>> inputs, IEnumerable<(ModalityType Modality, int MaxLength)> outputSpec)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Input sequence.
- outputSpec IEnumerable<(ModalityType Modality, int MaxLength)>
Specification of desired outputs (modality, length pairs).
Returns
- IEnumerable<MultimodalOutput<T>>
Interleaved output sequence.
Remarks
This enables generation of content like illustrated stories, narrated videos, or multimedia presentations.
GenerateText(IEnumerable<MultimodalInput<T>>, string, int, double)
Generates text response from multimodal inputs.
public string GenerateText(IEnumerable<MultimodalInput<T>> inputs, string prompt, int maxTokens = 1024, double temperature = 0.7)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Multimodal input sequence.
- prompt string
Text prompt/instruction.
- maxTokens int
Maximum tokens to generate.
- temperature double
Sampling temperature.
Returns
- string
Generated text response.
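Sampling temperature is conventionally applied by dividing the model's logits by the temperature before a softmax. The sketch below shows that standard technique on made-up logits. Whether this class applies temperature in exactly this way is an assumption, and `SoftmaxWithTemperature` is a hypothetical helper.

```csharp
using System;
using System.Linq;

// Softmax over logits divided by the temperature.
double[] SoftmaxWithTemperature(double[] logits, double temperature)
{
    if (temperature <= 0) throw new ArgumentOutOfRangeException(nameof(temperature));

    double[] scaled = logits.Select(l => l / temperature).ToArray();
    double max = scaled.Max();   // subtract the max for numerical stability
    double[] exp = scaled.Select(s => Math.Exp(s - max)).ToArray();
    double sum = exp.Sum();
    return exp.Select(e => e / sum).ToArray();
}

double[] tokenLogits = { 2.0, 1.0, 0.0 };   // made-up scores for three candidate tokens
double[] sharp = SoftmaxWithTemperature(tokenLogits, 0.7);
double[] flat = SoftmaxWithTemperature(tokenLogits, 2.0);

Console.WriteLine($"T=0.7: {sharp[0]:F3}  T=2.0: {flat[0]:F3}");
```

A lower temperature concentrates probability on the highest-scoring token, giving more predictable text; a higher temperature spreads probability out and gives more varied text.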
GetCrossModalAttention(IEnumerable<MultimodalInput<T>>)
Gets attention weights showing cross-modal relationships.
public Tensor<T> GetCrossModalAttention(IEnumerable<MultimodalInput<T>> inputs)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Multimodal inputs.
Returns
- Tensor<T>
Attention weights between all input pairs.
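As a rough picture of what pairwise attention weights look like, the sketch below computes scaled dot-product attention over a handful of embeddings: each row is a softmax over one input's scaled dot products with every input. This is the textbook formulation, assumed here for illustration; the layout and exact computation of the returned tensor are not documented on this page, and `AttentionWeights` is a hypothetical helper.

```csharp
using System;
using System.Linq;

// Row-wise softmax over scaled pairwise dot products of the embeddings.
double[][] AttentionWeights(double[][] embeddings)
{
    double scale = Math.Sqrt(embeddings[0].Length);
    return embeddings.Select(q =>
    {
        double[] scores = embeddings
            .Select(k => q.Zip(k, (a, b) => a * b).Sum() / scale)
            .ToArray();
        double max = scores.Max();   // subtract the max for numerical stability
        double[] exp = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = exp.Sum();
        return exp.Select(e => e / sum).ToArray();
    }).ToArray();
}

// Two made-up, orthogonal embeddings standing in for two encoded inputs.
double[][] encoded = { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 } };
double[][] weights = AttentionWeights(encoded);

Console.WriteLine($"{weights[0][0]:F3} {weights[0][1]:F3}");
```

Row i reads as "how much input i attends to each input", and every row sums to 1.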
GetModelMetadata()
Gets the metadata for this neural network model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
A ModelMetadata<T> object containing information about the model.
GetParameters()
Gets all trainable parameters of the network as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all parameters of the network.
Remarks
For Beginners: Neural networks learn by adjusting their "parameters" (also called weights and biases). This method collects all those adjustable values into a single list so they can be updated during training.
InitializeLayers()
Initializes the layers of the neural network based on the architecture.
protected override void InitializeLayers()
Remarks
For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.
Predict(Tensor<T>)
Makes a prediction using the neural network.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
- input Tensor<T>
The input data to process.
Returns
- Tensor<T>
The network's prediction.
Remarks
For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).
Reason(IEnumerable<MultimodalInput<T>>, string)
Performs reasoning across multiple modalities.
public (string Result, IEnumerable<string> ReasoningSteps) Reason(IEnumerable<MultimodalInput<T>> inputs, string task)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Multimodal inputs to reason about.
- task string
Reasoning task description.
Returns
- (string Result, IEnumerable<string> ReasoningSteps)
Reasoning result with step-by-step explanation.
Remarks
For Beginners: Multi-step thinking across different inputs!
Example: Given an image of a recipe and audio of someone cooking, reason about whether they're following the recipe correctly.
Retrieve(MultimodalInput<T>, IEnumerable<MultimodalInput<T>>, int)
Retrieves the most similar items from a database given a query.
public IEnumerable<(int Index, T Score, ModalityType Modality)> Retrieve(MultimodalInput<T> query, IEnumerable<MultimodalInput<T>> database, int topK = 10)
Parameters
- query MultimodalInput<T>
Query input (any modality).
- database IEnumerable<MultimodalInput<T>>
Database of items (any modalities).
- topK int
Number of results to return.
Returns
- IEnumerable<(int Index, T Score, ModalityType Modality)>
Indices, scores, and modalities of matching items.
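Top-K retrieval over embeddings can be sketched as: score every database item against the query, sort by score, and keep the best `topK`. The example below does this with cosine similarity on made-up vectors. The scoring function is an assumption, modality tracking is omitted, and `Cosine` and `TopK` are hypothetical helpers rather than library methods.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

double Cosine(double[] a, double[] b)
{
    double dot = a.Zip(b, (x, y) => x * y).Sum();
    double norm = Math.Sqrt(a.Sum(x => x * x)) * Math.Sqrt(b.Sum(y => y * y));
    return norm == 0 ? 0 : dot / norm;
}

// Rank every database entry against the query and keep the best topK.
List<(int Index, double Score)> TopK(double[] query, IReadOnlyList<double[]> database, int topK) =>
    database
        .Select((item, index) => (Index: index, Score: Cosine(query, item)))
        .OrderByDescending(r => r.Score)
        .Take(topK)
        .ToList();

// Made-up embeddings standing in for encoded database items.
var db = new List<double[]>
{
    new[] { 1.0, 0.0 },
    new[] { 0.0, 1.0 },
    new[] { 0.9, 0.1 },
};

var results = TopK(new[] { 1.0, 0.0 }, db, 2);
foreach (var hit in results)
    Console.WriteLine($"{hit.Index}: {hit.Score:F3}");
```

Because all modalities share one embedding space, the same ranking works whether the query and the database items are text, images, or audio.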
SafetyCheck(IEnumerable<MultimodalInput<T>>)
Checks content for safety across all modalities.
public Dictionary<ModalityType, (bool IsSafe, IEnumerable<string> Flags)> SafetyCheck(IEnumerable<MultimodalInput<T>> inputs)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Content to check.
Returns
- Dictionary<ModalityType, (bool IsSafe, IEnumerable<string> Flags)>
Safety assessment per modality.
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data that is not covered by the general serialization process.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
- writer BinaryWriter
The BinaryWriter to write the data to.
Remarks
This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.
For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.
SetParameters(Vector<T>)
Sets the parameters of the neural network.
public override void SetParameters(Vector<T> parameters)
Parameters
- parameters Vector<T>
The parameters to set.
Remarks
This method distributes the parameters to all layers in the network. The parameters should be in the same format as returned by GetParameters.
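The relationship between GetParameters and SetParameters can be pictured as flattening per-layer values into one vector and later copying that vector back in the same order. The sketch below illustrates the round trip on plain arrays; `Flatten` and `Unflatten` are hypothetical helpers, and the library's actual ordering of parameters is not specified on this page.

```csharp
using System;
using System.Linq;

// Concatenate every layer's parameters into one flat vector.
double[] Flatten(double[][] layers) => layers.SelectMany(l => l).ToArray();

// Copy a flat vector back into the layers, in the same order Flatten used.
void Unflatten(double[] flat, double[][] layers)
{
    if (flat.Length != layers.Sum(l => l.Length))
        throw new ArgumentException("Parameter vector length does not match the network.");

    int offset = 0;
    foreach (double[] layer in layers)
    {
        Array.Copy(flat, offset, layer, 0, layer.Length);
        offset += layer.Length;
    }
}

double[][] network = { new[] { 1.0, 2.0 }, new[] { 3.0 } };
double[] parameters = Flatten(network);   // [1, 2, 3]
parameters[2] = 30.0;                     // edit the flat copy
Unflatten(parameters, network);           // push the edit back into the layers

Console.WriteLine(string.Join(", ", network.Select(l => string.Join(" ", l))));
```

The length check matters: a vector of the wrong size cannot be distributed across the layers unambiguously.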
Summarize(IEnumerable<MultimodalInput<T>>, ModalityType, int)
Summarizes multimodal content.
public MultimodalOutput<T> Summarize(IEnumerable<MultimodalInput<T>> inputs, ModalityType outputModality = ModalityType.Text, int maxLength = 256)
Parameters
- inputs IEnumerable<MultimodalInput<T>>
Multimodal content to summarize.
- outputModality ModalityType
Modality for the summary.
- maxLength int
Maximum summary length.
Returns
- MultimodalOutput<T>
Summary in the specified modality.
Train(Tensor<T>, Tensor<T>)
Trains the neural network on a single input-output pair.
public override void Train(Tensor<T> input, Tensor<T> expectedOutput)
Parameters
- input Tensor<T>
The input data.
- expectedOutput Tensor<T>
The expected output for the given input.
Remarks
This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.
For Beginners: This is how your neural network learns. You provide:
- An input (what the network should process)
- The expected output (what the correct answer should be)
The network then:
- Makes a prediction based on the input
- Compares its prediction to the expected output
- Calculates how wrong it was (the loss)
- Adjusts its internal values to do better next time
After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
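The predict, compare, measure, adjust cycle described above can be seen in miniature with a one-parameter model. The sketch below runs plain gradient descent on y = weight * x with a squared-error loss. It is a toy illustration, not this network's training code: `TrainStep`, the learning rate, and the data are all made up, and the real method relies on the configured optimizer and loss function.

```csharp
using System;

double weight = 0.0;               // the single adjustable parameter
const double learningRate = 0.1;   // made-up step size

// One training step for the model y = weight * x with squared-error loss.
double TrainStep(double input, double expected)
{
    double prediction = weight * input;     // 1. make a prediction
    double error = prediction - expected;   // 2. compare with the expected output
    double loss = error * error;            // 3. measure how wrong it was
    double gradient = 2 * error * input;    //    derivative of the loss w.r.t. weight
    weight -= learningRate * gradient;      // 4. adjust the parameter
    return loss;
}

double firstLoss = TrainStep(2.0, 4.0);
double lastLoss = firstLoss;
for (int step = 0; step < 20; step++) lastLoss = TrainStep(2.0, 4.0);

Console.WriteLine($"loss {firstLoss} -> {lastLoss:E2}, weight = {weight:F4}");
```

Repeating the step drives the loss towards zero as the weight approaches 2, the value that maps the input 2 to the expected output 4.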
Translate(MultimodalInput<T>, ModalityType)
Translates content between modalities.
public MultimodalOutput<T> Translate(MultimodalInput<T> input, ModalityType targetModality)
Parameters
- input MultimodalInput<T>
Source input.
- targetModality ModalityType
Target modality.
Returns
- MultimodalOutput<T>
Translated content.
UpdateParameters(Vector<T>)
Updates the network's parameters with new values.
public override void UpdateParameters(Vector<T> gradients)
Parameters
- gradients Vector<T>
Remarks
For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.
This is typically used by optimization algorithms that calculate better parameter values based on training data.