Class LatentDiffusionModelBase<T>

Namespace
AiDotNet.Diffusion
Assembly
AiDotNet.dll

Base class for latent diffusion models that operate in a compressed latent space.

public abstract class LatentDiffusionModelBase<T> : DiffusionModelBase<T>, ILatentDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object
DiffusionModelBase<T>
LatentDiffusionModelBase<T>

Implements
ILatentDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

This abstract base class provides common functionality for all latent diffusion models, including encoding/decoding, text-to-image generation, image-to-image transformation, and inpainting.

For Beginners: This is the foundation for latent diffusion models like Stable Diffusion. It combines a VAE (for compression), a noise predictor (for denoising), and optional conditioning (for guided generation from text or images).

Constructors

LatentDiffusionModelBase(DiffusionModelOptions<T>?, INoiseScheduler<T>?)

Initializes a new instance of the LatentDiffusionModelBase class.

protected LatentDiffusionModelBase(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler.
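
Example

Because the class is abstract, you work with a concrete subclass. Below is a minimal sketch of what a derived model might look like, assuming the noise predictor and VAE are supplied by the caller; any remaining abstract members inherited from the base classes would also need overrides.

// Hypothetical minimal subclass, for illustration only; not part of the library.
public sealed class MyLatentModel : LatentDiffusionModelBase<float>
{
    public MyLatentModel(
        INoisePredictor<float> predictor,
        IVAEModel<float> vae,
        DiffusionModelOptions<float>? options = null)
        : base(options) // the default scheduler is used when none is supplied
    {
        NoisePredictor = predictor;
        VAE = vae;
    }

    public override INoisePredictor<float> NoisePredictor { get; }
    public override IVAEModel<float> VAE { get; }
    public override IConditioningModule<float>? Conditioner => null; // unconditioned
    public override int LatentChannels => 4; // typical for Stable Diffusion VAEs
}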

Properties

Conditioner

Gets the conditioning module (optional, for conditioned generation).

public abstract IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

GuidanceScale

Gets the default guidance scale for classifier-free guidance.

public virtual double GuidanceScale { get; }

Property Value

double

Remarks

Higher values make generation more closely follow the conditioning. Typical values: 7.5 for Stable Diffusion, 5.0 for SDXL.

LatentChannels

Gets the number of latent channels.

public abstract int LatentChannels { get; }

Property Value

int

Remarks

Typically 4 for Stable Diffusion models.

NoisePredictor

Gets the noise predictor model (U-Net, DiT, etc.).

public abstract INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

SupportsInpainting

Gets whether this model supports inpainting.

public virtual bool SupportsInpainting { get; }

Property Value

bool

SupportsNegativePrompt

Gets whether this model supports negative prompts.

public virtual bool SupportsNegativePrompt { get; }

Property Value

bool

VAE

Gets the VAE model used for encoding and decoding.

public abstract IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Methods

ApplyGuidance(Tensor<T>, Tensor<T>, double)

Applies classifier-free guidance to combine conditional and unconditional predictions.

protected virtual Tensor<T> ApplyGuidance(Tensor<T> unconditional, Tensor<T> conditional, double scale)

Parameters

unconditional Tensor<T>

The unconditional noise prediction.

conditional Tensor<T>

The conditional noise prediction.

scale double

The guidance scale.

Returns

Tensor<T>

The guided noise prediction.
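
Example

The standard classifier-free guidance combination is a linear extrapolation from the unconditional prediction toward the conditional one. A float-array sketch of the per-element arithmetic (the real method operates on Tensor<T>):

// guided = unconditional + scale * (conditional - unconditional)
// scale = 1.0 reproduces the conditional prediction exactly;
// scale > 1.0 pushes the result further toward the conditioning.
static float[] ApplyCfg(float[] uncond, float[] cond, double scale)
{
    var guided = new float[uncond.Length];
    for (int i = 0; i < uncond.Length; i++)
        guided[i] = uncond[i] + (float)scale * (cond[i] - uncond[i]);
    return guided;
}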

BlendLatentsWithMask(Tensor<T>, Tensor<T>, Tensor<T>, int)

Blends generated latents with original latents based on mask for inpainting.

protected virtual Tensor<T> BlendLatentsWithMask(Tensor<T> generated, Tensor<T> original, Tensor<T> mask, int timestep)

Parameters

generated Tensor<T>

The generated latents.

original Tensor<T>

The original latents.

mask Tensor<T>

The mask (1 = inpaint, 0 = keep original).

timestep int

Current timestep for noise addition to original.

Returns

Tensor<T>

Blended latents.
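
Example

A float-array sketch of the usual per-element masked blend. Re-noising the original latents to the current timestep (so they match the noise level of the generated latents) is abstracted here as an already-noised input:

// mask = 1 keeps the generated value,
// mask = 0 keeps the (re-noised) original value.
static float[] Blend(float[] generated, float[] noisedOriginal, float[] mask)
{
    var blended = new float[generated.Length];
    for (int i = 0; i < generated.Length; i++)
        blended[i] = mask[i] * generated[i] + (1f - mask[i]) * noisedOriginal[i];
    return blended;
}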

DecodeFromLatent(Tensor<T>)

Decodes a latent representation back to an image.

public virtual Tensor<T> DecodeFromLatent(Tensor<T> latent)

Parameters

latent Tensor<T>

The latent tensor.

Returns

Tensor<T>

The decoded image tensor [batch, channels, height, width].

Remarks

For Beginners: This decompresses a latent back to an image:

  • Input: small latent (e.g., 64x64x4)
  • Output: full-size image (e.g., 512x512x3)

EncodeToLatent(Tensor<T>, bool)

Encodes an image into latent space.

public virtual Tensor<T> EncodeToLatent(Tensor<T> image, bool sampleMode = true)

Parameters

image Tensor<T>

The input image tensor [batch, channels, height, width].

sampleMode bool

Whether to sample from the VAE distribution.

Returns

Tensor<T>

The latent representation.

Remarks

For Beginners: This compresses an image for processing:

  • Input: full-size image (e.g., 512x512)
  • Output: small latent (e.g., 64x64x4)

Use sampleMode=true during training for VAE regularization, and sampleMode=false for deterministic encoding during editing.
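
Example

A usage sketch of an encode/decode round trip, exercising both EncodeToLatent and DecodeFromLatent; it assumes model is a concrete LatentDiffusionModelBase<float> and image is a [1, 3, 512, 512] tensor:

// Deterministic encode (useful for editing), then decode back to pixels.
Tensor<float> latent = model.EncodeToLatent(image, sampleMode: false);
// With a typical 8x-downsampling, 4-channel VAE the latent would be
// [1, 4, 64, 64]; the exact shape depends on the concrete VAE.
Tensor<float> reconstructed = model.DecodeFromLatent(latent);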

Generate(int[], int, int?)

Generates samples by iteratively denoising from random noise.

public override Tensor<T> Generate(int[] shape, int numInferenceSteps = 50, int? seed = null)

Parameters

shape int[]

The shape of samples to generate (e.g., [batchSize, channels, height, width]).

numInferenceSteps int

Number of denoising steps. More steps = higher quality, slower.

seed int?

Optional random seed for reproducibility. If null, uses system random.

Returns

Tensor<T>

Generated samples as a tensor.

Remarks

This is the main generation method. It starts with random noise and applies the reverse diffusion process to generate new samples.

For Beginners: This is how you create new images/data:

  1. Start with pure random noise (like TV static).
  2. Ask the model "what does this look like minus some noise?"
  3. Repeat many times, each time removing a bit more noise.
  4. End with a clean generated sample.

More inference steps = cleaner results but slower generation. Typical values: 20-50 for fast generation, 100-200 for high quality.
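
Example

A usage sketch, assuming model is a concrete LatentDiffusionModelBase<float>. The shape is given in latent dimensions on the assumption that this override denoises in latent space; check the concrete model for the expected shape:

// Generate one sample with 50 denoising steps and a fixed seed
// so the result is reproducible.
Tensor<float> sample = model.Generate(
    shape: new[] { 1, 4, 64, 64 },
    numInferenceSteps: 50,
    seed: 42);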

GenerateFromText(string, string?, int, int, int, double?, int?)

Generates images from text prompts using classifier-free guidance.

public virtual Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

prompt string

The text prompt describing the desired image.

negativePrompt string

Optional negative prompt (what to avoid).

width int

Image width in pixels (should be divisible by the VAE downsample factor).

height int

Image height in pixels (should be divisible by the VAE downsample factor).

numInferenceSteps int

Number of denoising steps.

guidanceScale double?

How closely to follow the prompt (higher = closer).

seed int?

Optional random seed for reproducibility.

Returns

Tensor<T>

The generated image tensor.

Remarks

This is the main text-to-image generation method. It performs:

  1. Encode the text prompts to conditioning embeddings.
  2. Generate random latent noise.
  3. Iteratively denoise with classifier-free guidance.
  4. Decode the latent to an image.

For Beginners: This is how you generate images from text:

  • prompt: what you want ("a cat in a spacesuit")
  • negativePrompt: what to avoid ("blurry, low quality")
  • guidanceScale: how strictly to follow the prompt (7.5 is typical)
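
Example

A usage sketch, assuming model is a concrete LatentDiffusionModelBase<float>:

// Text-to-image with a negative prompt and a typical guidance scale.
Tensor<float> image = model.GenerateFromText(
    prompt: "a cat in a spacesuit",
    negativePrompt: "blurry, low quality",
    width: 512,
    height: 512,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);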

ImageToImage(Tensor<T>, string, string?, double, int, double?, int?)

Performs image-to-image generation (style transfer, editing).

public virtual Tensor<T> ImageToImage(Tensor<T> inputImage, string prompt, string? negativePrompt = null, double strength = 0.8, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

inputImage Tensor<T>

The input image to transform.

prompt string

The text prompt describing the desired transformation.

negativePrompt string

Optional negative prompt.

strength double

How much to transform (0.0 = no change, 1.0 = full regeneration).

numInferenceSteps int

Number of denoising steps.

guidanceScale double?

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

The transformed image tensor.

Remarks

Image-to-image works by:

  1. Encode the input image to a latent.
  2. Add noise to the latent (controlled by strength).
  3. Denoise with text guidance.
  4. Decode back to an image.

For Beginners: This transforms an existing image based on a prompt:

  • strength=0.3: Minor changes, keeps most of the original
  • strength=0.7: Major changes, but composition remains
  • strength=1.0: Complete regeneration, original is just a starting point
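
Example

A usage sketch, assuming photo is a [1, 3, 512, 512] tensor holding the source image:

// Restyle an existing image while keeping its composition.
Tensor<float> stylized = model.ImageToImage(
    inputImage: photo,
    prompt: "oil painting, impressionist style",
    strength: 0.7, // major changes, composition preserved
    numInferenceSteps: 50,
    seed: 42);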

Inpaint(Tensor<T>, Tensor<T>, string, string?, int, double?, int?)

Performs inpainting (filling in masked regions).

public virtual Tensor<T> Inpaint(Tensor<T> inputImage, Tensor<T> mask, string prompt, string? negativePrompt = null, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

inputImage Tensor<T>

The input image with areas to inpaint.

mask Tensor<T>

Binary mask where 1 = inpaint, 0 = keep original.

prompt string

Text prompt describing what to generate in the masked area.

negativePrompt string

Optional negative prompt.

numInferenceSteps int

Number of denoising steps.

guidanceScale double?

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

The inpainted image tensor.

Remarks

Inpainting fills in masked regions while keeping unmasked areas intact. The mask should be the same spatial size as the image.

For Beginners: This is like a smart "fill" tool:

  • Draw a mask over what you want to replace.
  • Describe what should go there.
  • The model generates content that blends naturally.
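
Example

A usage sketch, assuming photo and skyMask are pre-built tensors; the mask is [1, 1, height, width] with 1 where content should be regenerated:

// Replace the masked region (here, the sky) with generated content.
Tensor<float> result = model.Inpaint(
    inputImage: photo,
    mask: skyMask,
    prompt: "dramatic sunset clouds",
    negativePrompt: "artifacts, seams",
    numInferenceSteps: 50,
    seed: 42);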

PredictNoise(Tensor<T>, int)

Predicts the noise in a noisy sample at a given timestep.

public override Tensor<T> PredictNoise(Tensor<T> noisySample, int timestep)

Parameters

noisySample Tensor<T>

The noisy input sample.

timestep int

The current timestep in the diffusion process.

Returns

Tensor<T>

The predicted noise tensor.

Remarks

This is the core prediction that the model learns. Given a noisy sample at timestep t, predict what noise was added to create it.

For Beginners: The model looks at a noisy image and guesses "what noise was added to make it look like this?" This prediction is then used to remove that noise and get a cleaner image.
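
Example

A sketch of how the prediction is consumed in the reverse loop. Note that scheduler.Step is a hypothetical stand-in for whatever update rule the configured INoiseScheduler<T> exposes; this page does not document its exact signature:

// Conceptual reverse-diffusion loop (simplified).
foreach (int t in timesteps) // e.g., 999, 979, ..., 0
{
    Tensor<float> predictedNoise = model.PredictNoise(latents, t);
    latents = scheduler.Step(predictedNoise, t, latents); // remove a little noise
}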

ResizeMaskToLatent(Tensor<T>, int[])

Resizes a mask tensor to match latent dimensions.

protected virtual Tensor<T> ResizeMaskToLatent(Tensor<T> mask, int[] latentShape)

Parameters

mask Tensor<T>

The original mask [batch, 1, height, width].

latentShape int[]

The target latent shape.

Returns

Tensor<T>

The resized mask matching latent dimensions.

SampleNoiseTensor(int[], Random)

Samples a noise tensor from standard normal distribution.

protected virtual Tensor<T> SampleNoiseTensor(int[] shape, Random rng)

Parameters

shape int[]

The shape of the tensor.

rng Random

Random number generator.

Returns

Tensor<T>

A tensor filled with Gaussian noise.
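
Example

One common way such a helper can be implemented is the Box-Muller transform, which turns two uniform samples into one standard normal sample. A float-array sketch (the real method fills a Tensor<T>; the actual implementation may differ):

static float[] SampleStandardNormal(int count, Random rng)
{
    var values = new float[count];
    for (int i = 0; i < count; i++)
    {
        double u1 = 1.0 - rng.NextDouble(); // avoid Log(0)
        double u2 = rng.NextDouble();
        values[i] = (float)(Math.Sqrt(-2.0 * Math.Log(u1)) *
                            Math.Cos(2.0 * Math.PI * u2));
    }
    return values;
}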

SetGuidanceScale(double)

Sets the guidance scale for classifier-free guidance.

public virtual void SetGuidanceScale(double scale)

Parameters

scale double

The guidance scale (typically 1.0-20.0).
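
Example

A usage sketch: raising the default scale makes later generations follow their prompts more strictly.

// Applies to subsequent calls that do not pass an explicit guidanceScale.
model.SetGuidanceScale(12.0);
Tensor<float> strict = model.GenerateFromText("a red bicycle");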