Class VideoUNetPredictor<T>

Namespace
AiDotNet.Diffusion.NoisePredictors
Assembly
AiDotNet.dll

3D U-Net architecture for video noise prediction in diffusion models.

public class VideoUNetPredictor<T> : NoisePredictorBase<T>, INoisePredictor<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
NoisePredictorBase<T>
VideoUNetPredictor<T>
Implements
INoisePredictor<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

The VideoUNetPredictor extends the standard U-Net architecture to handle video data by incorporating 3D convolutions and temporal attention. This is the core noise prediction network used in video diffusion models like Stable Video Diffusion.

For Beginners: While a regular U-Net processes single images, VideoUNet processes sequences of frames as a 3D volume:

Regular U-Net:

  • Input: [batch, channels, height, width]
  • 2D convolutions across spatial dimensions only
  • Each image processed independently

Video U-Net:

  • Input: [batch, channels, frames, height, width]
  • 3D convolutions across space AND time
  • Frames are processed together, understanding motion

Key features:

  • Temporal convolutions capture motion patterns
  • Temporal attention for long-range frame relationships
  • Skip connections across both space and time
  • Image conditioning for image-to-video generation

Used in: Stable Video Diffusion, ModelScope, VideoCrafter

Architecture details:

  • Encoder: 3D ResBlocks with temporal + spatial attention
  • Middle: Multiple 3D attention blocks
  • Decoder: 3D ResBlocks with skip connections
  • Temporal convolutions with kernel size 3 across frames
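
A minimal usage sketch is shown below. The shape-based Tensor<float> constructor is an assumption made for illustration; create tensors however your codebase normally does.

// Video noise predictor with default settings (4 latent channels, 320 base channels).
var predictor = new VideoUNetPredictor<float>();

// A noisy video latent: [batch, channels, frames, height, width].
// NOTE: this shape-based Tensor<float> constructor is assumed for illustration.
var noisyLatent = new Tensor<float>(new[] { 1, 4, 14, 64, 64 });

// Predict the noise added at timestep 500; the output has the same shape as the input.
Tensor<float> predictedNoise = predictor.PredictNoise(noisyLatent, timestep: 500);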

Constructors

VideoUNetPredictor(int, int?, int, int[]?, int, int[]?, int, int, int, bool, ILossFunction<T>?, int?)

Initializes a new instance of the VideoUNetPredictor class.

public VideoUNetPredictor(int inputChannels = 4, int? outputChannels = null, int baseChannels = 320, int[]? channelMultipliers = null, int numResBlocks = 2, int[]? attentionResolutions = null, int numTemporalLayers = 1, int contextDim = 1024, int numHeads = 8, bool supportsImageConditioning = true, ILossFunction<T>? lossFunction = null, int? seed = null)

Parameters

inputChannels int

Number of input channels (default: 4 for latent diffusion).

outputChannels int?

Number of output channels (default: same as input).

baseChannels int

Base channel count (default: 320).

channelMultipliers int[]

Channel multipliers per level (default: [1, 2, 4, 4]).

numResBlocks int

Number of residual blocks per level (default: 2).

attentionResolutions int[]

Resolution indices for attention (default: [1, 2, 3]).

numTemporalLayers int

Number of temporal transformer layers (default: 1).

contextDim int

Context dimension for cross-attention (default: 1024).

numHeads int

Number of attention heads (default: 8).

supportsImageConditioning bool

Whether to support image conditioning (default: true).

lossFunction ILossFunction<T>

Optional loss function (default: MSE).

seed int?

Optional random seed for reproducibility.
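
For example, a custom configuration might look like the following sketch; the values are illustrative only, not recommended settings.

// Illustrative custom configuration: smaller base width, three resolution levels,
// and an extra temporal transformer layer.
var predictor = new VideoUNetPredictor<float>(
    inputChannels: 4,
    baseChannels: 256,
    channelMultipliers: new[] { 1, 2, 4 },
    numResBlocks: 2,
    numTemporalLayers: 2,
    contextDim: 1024,
    numHeads: 8,
    supportsImageConditioning: true,
    seed: 42);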

Properties

BaseChannels

Gets the base channel count used in the network architecture.

public override int BaseChannels { get; }

Property Value

int

Remarks

This determines the model capacity. Common values:

  • 320 for Stable Diffusion 1.x and 2.x
  • 384 for Stable Diffusion XL (base)
  • 1024 for large DiT models

ContextDimension

Gets the expected context dimension for cross-attention conditioning.

public override int ContextDimension { get; }

Property Value

int

Remarks

For CLIP-conditioned models, this is typically 768 or 1024. For T5-conditioned models (like SD3), this is typically 2048. Returns 0 if cross-attention is not supported.

InputChannels

Gets the number of input channels the predictor expects.

public override int InputChannels { get; }

Property Value

int

Remarks

For image models, this is typically:

  • 4 for latent diffusion models (VAE latent channels)
  • 3 for pixel-space RGB models
  • Higher for models with additional conditioning channels

NumTemporalLayers

Gets the number of temporal transformer layers.

public int NumTemporalLayers { get; }

Property Value

int

OutputChannels

Gets the number of output channels the predictor produces.

public override int OutputChannels { get; }

Property Value

int

Remarks

Usually matches InputChannels since we predict noise of the same shape as input. Some architectures may predict additional outputs like variance.

ParameterCount

Gets the number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

Remarks

This property returns the total count of trainable parameters in the model. It's useful for understanding model complexity and memory requirements.

SupportsCFG

Gets whether this noise predictor supports classifier-free guidance.

public override bool SupportsCFG { get; }

Property Value

bool

Remarks

Classifier-free guidance allows steering generation toward the conditioning (e.g., text prompt) without a separate classifier. Most modern models support this.
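
As a sketch, guidance runs the predictor twice and pushes the conditional prediction away from the unconditional one. The element-wise TensorMath helpers below are hypothetical placeholders, not part of this API; substitute the tensor arithmetic available in your code.

// Classifier-free guidance: guided = uncond + scale * (cond - uncond).
// noisySample, timestep and textEmbeddings are assumed to exist from earlier steps.
Tensor<float> condNoise = predictor.PredictNoise(noisySample, timestep, textEmbeddings);
Tensor<float> uncondNoise = predictor.PredictNoise(noisySample, timestep, conditioning: null);

float guidanceScale = 7.5f;
// Hypothetical element-wise helpers; replace with real tensor operations.
Tensor<float> guidedNoise = TensorMath.Add(
    uncondNoise,
    TensorMath.Scale(TensorMath.Subtract(condNoise, uncondNoise), guidanceScale));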

SupportsCrossAttention

Gets whether this noise predictor supports cross-attention conditioning.

public override bool SupportsCrossAttention { get; }

Property Value

bool

Remarks

Cross-attention allows the model to attend to conditioning tokens (like text embeddings). This is how text-to-image models incorporate the prompt.

SupportsImageConditioning

Gets whether this predictor supports image conditioning for image-to-video.

public bool SupportsImageConditioning { get; }

Property Value

bool

TimeEmbeddingDim

Gets the dimension of the time/timestep embedding.

public override int TimeEmbeddingDim { get; }

Property Value

int

Remarks

The timestep is embedded into a high-dimensional vector before being injected into the network. Typical values: 256, 512, 1024.

Methods

Clone()

Creates a deep copy of the noise predictor.

public override INoisePredictor<T> Clone()

Returns

INoisePredictor<T>

A new instance with the same parameters.

DeepCopy()

Creates a deep copy of this object.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

GetParameters()

Gets the parameters that can be optimized.

public override Vector<T> GetParameters()

Returns

Vector<T>

PredictNoise(Tensor<T>, int, Tensor<T>?)

Predicts the noise in a noisy sample at a given timestep.

public override Tensor<T> PredictNoise(Tensor<T> noisySample, int timestep, Tensor<T>? conditioning = null)

Parameters

noisySample Tensor<T>

The noisy input sample. For video data the expected shape is [batch, channels, frames, height, width].

timestep int

The current timestep in the diffusion process.

conditioning Tensor<T>

Optional conditioning tensor (e.g., text embeddings).

Returns

Tensor<T>

The predicted noise tensor with the same shape as noisySample.

Remarks

This is the main forward pass of the noise predictor. Given a noisy sample at timestep t, it predicts what noise was added.

For Beginners: This is where the actual denoising happens:

  1. The network looks at the noisy image
  2. It considers how noisy it should be at this timestep
  3. It predicts the noise pattern
  4. This prediction is subtracted to get a cleaner image
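
The sketch below shows where PredictNoise sits in a sampling loop. The scheduler object and its Step method are placeholders for whatever sampler you use (DDPM, DDIM, etc.); they are not part of this class.

// Reverse diffusion: start from pure noise and repeatedly predict and remove noise.
Tensor<float> sample = initialNoise;                         // pure Gaussian noise latent
foreach (int t in new[] { 999, 979, 959 /* ... down to 0 */ })
{
    Tensor<float> predictedNoise = predictor.PredictNoise(sample, t, textEmbeddings);
    sample = scheduler.Step(predictedNoise, t, sample);      // hypothetical sampler API
}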

PredictNoiseWithEmbedding(Tensor<T>, Tensor<T>, Tensor<T>?)

Predicts noise with explicit timestep embedding (for batched different timesteps).

public override Tensor<T> PredictNoiseWithEmbedding(Tensor<T> noisySample, Tensor<T> timeEmbedding, Tensor<T>? conditioning = null)

Parameters

noisySample Tensor<T>

The noisy input sample. For video data the expected shape is [batch, channels, frames, height, width].

timeEmbedding Tensor<T>

Pre-computed timestep embeddings [batch, timeEmbeddingDim].

conditioning Tensor<T>

Optional conditioning tensor (e.g., text embeddings).

Returns

Tensor<T>

The predicted noise tensor with the same shape as noisySample.

Remarks

This overload is useful when you want to use different timesteps per sample in a batch, or when you have pre-computed timestep embeddings for efficiency.
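
A sketch of the batched-timestep path follows, assuming a shape-based Tensor<float> constructor; in practice each row would hold the embedding of that sample's timestep.

// Pre-computed timestep embeddings, one row per batch item: [batch, TimeEmbeddingDim].
// The shape-based constructor is assumed for illustration; fill rows with real embeddings.
var timeEmbedding = new Tensor<float>(new[] { 2, predictor.TimeEmbeddingDim });

// noisyBatch holds two samples that may sit at different timesteps.
Tensor<float> predictedNoise = predictor.PredictNoiseWithEmbedding(noisyBatch, timeEmbedding, textEmbeddings);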

PredictNoiseWithImageCondition(Tensor<T>, int, Tensor<T>, Tensor<T>?)

Predicts noise for image-to-video generation with image conditioning.

public Tensor<T> PredictNoiseWithImageCondition(Tensor<T> noisySample, int timestep, Tensor<T> imageCondition, Tensor<T>? textConditioning = null)

Parameters

noisySample Tensor<T>

The noisy video latent.

timestep int

The current timestep.

imageCondition Tensor<T>

The conditioning image (first frame).

textConditioning Tensor<T>

Optional text conditioning.

Returns

Tensor<T>

The predicted noise.
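
An image-to-video sketch; the shapes and the shape-based Tensor<float> constructor are illustrative assumptions.

// Condition the video prediction on a single first-frame latent.
var noisyVideo = new Tensor<float>(new[] { 1, 4, 14, 64, 64 });  // [batch, channels, frames, h, w]
var firstFrame = new Tensor<float>(new[] { 1, 4, 64, 64 });      // conditioning image latent

if (predictor.SupportsImageConditioning)
{
    Tensor<float> predictedNoise =
        predictor.PredictNoiseWithImageCondition(noisyVideo, 500, firstFrame);
}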

SetParameters(Vector<T>)

Sets the model parameters.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameter vector to set.

Remarks

This method allows direct modification of the model's internal parameters, which is useful for optimization algorithms that update parameters iteratively. If the length of parameters does not match ParameterCount, an ArgumentException is thrown.

Exceptions

ArgumentException

Thrown when the length of parameters does not match ParameterCount.
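
A short round-trip sketch; the optimizer step in the middle is a placeholder.

// Read, modify, and write back the trainable parameters.
Vector<float> parameters = predictor.GetParameters();
// ... an external optimizer updates the vector here (placeholder) ...
predictor.SetParameters(parameters);   // length must equal predictor.ParameterCount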