Interface INoisePredictor<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for noise prediction networks used in diffusion models.

public interface INoisePredictor<T> : IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Noise predictors are the core neural networks in diffusion models that learn to predict the noise added to samples at each timestep. They can be implemented as U-Nets, Diffusion Transformers (DiT), or other architectures.

For Beginners: A noise predictor is like a "noise detective" that looks at a noisy image and figures out exactly what noise was added to it.

How it works:

  1. The model receives a noisy image and a timestep
  2. The timestep tells the model how much noise should be in the image
  3. The model predicts what noise pattern was added
  4. This prediction is used to remove noise and recover the original image

Different architectures for noise prediction:

  • U-Net: The original and most common, uses an encoder-decoder with skip connections
  • DiT (Diffusion Transformer): Uses transformer blocks, powers state-of-the-art models like SD3 and Sora
  • U-ViT: Hybrid of U-Net and Vision Transformer

The architecture choice affects:

  • Quality of generated images
  • Speed of generation
  • Memory requirements
  • Ability to scale to larger models

This interface extends IFullModel<T, TInput, TOutput> to provide all standard model capabilities (training, saving, loading, gradients, checkpointing, etc.).
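
As a rough illustration, the sketch below shows how one call into a noise predictor fits into a reverse-diffusion loop. It is a minimal sketch, not the library's sampling code: the concrete predictor, the sample and text-embedding tensors, and the surrounding scheduler are all assumed to exist elsewhere, and only the PredictNoise method documented on this page is used.

// Minimal sketch: a single denoising step driven by an external scheduler.
// The predictor, tensors, and scheduler are assumptions; only PredictNoise
// comes from INoisePredictor<T>.
public static Tensor<float> DenoiseStep(
    INoisePredictor<float> predictor,
    Tensor<float> noisySample,     // [batch, channels, height, width]
    Tensor<float> textEmbeddings,  // [batch, tokens, ContextDimension]
    int timestep)
{
    // Ask the predictor what noise it thinks was added at this timestep.
    Tensor<float> predictedNoise = predictor.PredictNoise(noisySample, timestep, textEmbeddings);

    // A scheduler (DDPM, DDIM, ...) would now use predictedNoise to compute the
    // slightly less noisy sample for timestep - 1; that update rule lives outside
    // this interface, so the sketch simply returns the prediction.
    return predictedNoise;
}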

Properties

BaseChannels

Gets the base channel count used in the network architecture.

int BaseChannels { get; }

Property Value

int

Remarks

This determines the model capacity. Common values:

  • 320 for Stable Diffusion 1.x and 2.x
  • 384 for Stable Diffusion XL (base)
  • 1024 for large DiT models

ContextDimension

Gets the expected context dimension for cross-attention conditioning.

int ContextDimension { get; }

Property Value

int

Remarks

For CLIP-conditioned models, this is typically 768 or 1024. For T5-conditioned models (like SD3), this is typically 2048. Returns 0 if cross-attention is not supported.

InputChannels

Gets the number of input channels the predictor expects.

int InputChannels { get; }

Property Value

int

Remarks

For image models, this is typically:

  • 4 for latent diffusion models (VAE latent channels)
  • 3 for pixel-space RGB models
  • Higher for models with additional conditioning channels

OutputChannels

Gets the number of output channels the predictor produces.

int OutputChannels { get; }

Property Value

int

Remarks

Usually matches InputChannels, since the predicted noise has the same shape as the input. Some architectures may predict additional outputs, such as variance.

SupportsCFG

Gets whether this noise predictor supports classifier-free guidance.

bool SupportsCFG { get; }

Property Value

bool

Remarks

Classifier-free guidance allows steering generation toward the conditioning (e.g., text prompt) without a separate classifier. Most modern models support this.
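
The guidance computation itself typically lives in the sampling loop rather than in the predictor. A minimal sketch follows, assuming a hypothetical element-wise Tensor<T> API (Subtract, Multiply, Add are illustrative names, not members of this interface) and an empty or null unconditional pass:

// Hedged sketch of classifier-free guidance (CFG).
// Subtract/Multiply/Add are assumed tensor helpers, not part of INoisePredictor<T>.
public static Tensor<float> GuidedPrediction(
    INoisePredictor<float> predictor,
    Tensor<float> noisySample,
    int timestep,
    Tensor<float> textEmbeddings,
    float guidanceScale = 7.5f)
{
    // One pass with the prompt, one pass without conditioning
    // (implementations often substitute an empty-prompt embedding instead of null).
    Tensor<float> conditional = predictor.PredictNoise(noisySample, timestep, textEmbeddings);
    Tensor<float> unconditional = predictor.PredictNoise(noisySample, timestep, null);

    // Guided noise = uncond + scale * (cond - uncond).
    // Scales greater than 1 push the result further toward the prompt.
    Tensor<float> delta = conditional.Subtract(unconditional);
    return unconditional.Add(delta.Multiply(guidanceScale));
}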

SupportsCrossAttention

Gets whether this noise predictor supports cross-attention conditioning.

bool SupportsCrossAttention { get; }

Property Value

bool

Remarks

Cross-attention allows the model to attend to conditioning tokens (like text embeddings). This is how text-to-image models incorporate the prompt.

TimeEmbeddingDim

Gets the dimension of the time/timestep embedding.

int TimeEmbeddingDim { get; }

Property Value

int

Remarks

The timestep is embedded into a high-dimensional vector before being injected into the network. Typical values: 256, 512, 1024.

Methods

GetTimestepEmbedding(int)

Computes the timestep embedding for a given timestep.

Tensor<T> GetTimestepEmbedding(int timestep)

Parameters

timestep int

The timestep to embed.

Returns

Tensor<T>

The timestep embedding vector [timeEmbeddingDim].

Remarks

Timesteps are typically embedded using sinusoidal positional encodings (like in Transformers) followed by a small MLP.
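
For intuition, the sinusoidal part of such an embedding (before any MLP) is commonly computed as below. This is a generic Transformer-style formulation written against plain float arrays, not the library's exact implementation; GetTimestepEmbedding may differ in detail and typically applies a small MLP on top.

// Generic sinusoidal timestep embedding, for illustration only.
// dim is the embedding width (e.g., TimeEmbeddingDim).
public static float[] SinusoidalEmbedding(int timestep, int dim, float maxPeriod = 10000f)
{
    var embedding = new float[dim];
    int half = dim / 2;
    for (int i = 0; i < half; i++)
    {
        // Frequencies decay geometrically from 1 down to 1/maxPeriod.
        double freq = Math.Exp(-Math.Log(maxPeriod) * i / (double)half);
        double angle = timestep * freq;
        embedding[i] = (float)Math.Cos(angle);
        embedding[half + i] = (float)Math.Sin(angle);
    }
    return embedding;
}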

PredictNoise(Tensor<T>, int, Tensor<T>?)

Predicts the noise in a noisy sample at a given timestep.

Tensor<T> PredictNoise(Tensor<T> noisySample, int timestep, Tensor<T>? conditioning = null)

Parameters

noisySample Tensor<T>

The noisy input sample [batch, channels, height, width].

timestep int

The current timestep in the diffusion process.

conditioning Tensor<T>

Optional conditioning tensor (e.g., text embeddings).

Returns

Tensor<T>

The predicted noise tensor with the same shape as noisySample.

Remarks

This is the main forward pass of the noise predictor. Given a noisy sample at timestep t, it predicts what noise was added.

For Beginners: This is where the actual denoising happens:

  1. The network looks at the noisy image
  2. It considers how noisy it should be at this timestep
  3. It predicts the noise pattern
  4. This prediction is subtracted to get a cleaner image
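
Concretely, step 4 follows the scheduler's update rule. A DDPM-style estimate of the clean sample is roughly x0 ≈ (x_t - sqrt(1 - alphaBar) * epsilon) / sqrt(alphaBar), where alphaBar is the cumulative product of (1 - beta_t) from the noise schedule. The sketch below assumes element-wise Subtract/Multiply helpers on Tensor<T>, which are illustrative names and not part of this interface:

// Hedged sketch: recovering a clean-sample estimate from the predicted noise.
// alphaBar comes from the noise schedule; Subtract/Multiply are assumed helpers.
public static Tensor<float> EstimateCleanSample(
    INoisePredictor<float> predictor,
    Tensor<float> noisySample,
    int timestep,
    Tensor<float>? conditioning,
    float alphaBar)  // cumulative product of (1 - beta_t) up to this timestep
{
    Tensor<float> predictedNoise = predictor.PredictNoise(noisySample, timestep, conditioning);

    // x0 ≈ (x_t - sqrt(1 - alphaBar) * epsilon) / sqrt(alphaBar)
    float noiseScale = (float)Math.Sqrt(1.0 - alphaBar);
    float sampleScale = 1f / (float)Math.Sqrt(alphaBar);
    return noisySample
        .Subtract(predictedNoise.Multiply(noiseScale))
        .Multiply(sampleScale);
}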

PredictNoiseWithEmbedding(Tensor<T>, Tensor<T>, Tensor<T>?)

Predicts noise using an explicit, pre-computed timestep embedding (useful when samples in a batch are at different timesteps).

Tensor<T> PredictNoiseWithEmbedding(Tensor<T> noisySample, Tensor<T> timeEmbedding, Tensor<T>? conditioning = null)

Parameters

noisySample Tensor<T>

The noisy input sample [batch, channels, height, width].

timeEmbedding Tensor<T>

Pre-computed timestep embeddings [batch, timeEmbeddingDim].

conditioning Tensor<T>

Optional conditioning tensor (e.g., text embeddings).

Returns

Tensor<T>

The predicted noise tensor with the same shape as noisySample.

Remarks

This overload is useful when you want to use different timesteps per sample in a batch, or when you have pre-computed timestep embeddings for efficiency.
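
For example, to run a batch where each sample sits at a different timestep, one could embed each timestep separately and pack the results into a [batch, timeEmbeddingDim] tensor. In the sketch below, StackEmbeddings is a hypothetical helper standing in for whatever stacking operation the Tensor<T> API actually provides; only GetTimestepEmbedding and PredictNoiseWithEmbedding come from this interface.

// Hedged sketch: different timestep per sample in a batch.
// StackEmbeddings is a hypothetical helper that packs per-sample vectors
// [timeEmbeddingDim] into a [batch, timeEmbeddingDim] tensor.
public static Tensor<float> PredictWithPerSampleTimesteps(
    INoisePredictor<float> predictor,
    Tensor<float> noisySample,   // [batch, channels, height, width]
    int[] timesteps,             // one timestep per sample in the batch
    Tensor<float>? conditioning)
{
    var perSample = new Tensor<float>[timesteps.Length];
    for (int i = 0; i < timesteps.Length; i++)
    {
        perSample[i] = predictor.GetTimestepEmbedding(timesteps[i]);  // [timeEmbeddingDim]
    }

    Tensor<float> timeEmbedding = StackEmbeddings(perSample);  // [batch, timeEmbeddingDim]
    return predictor.PredictNoiseWithEmbedding(noisySample, timeEmbedding, conditioning);
}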