Interface ILatentDiffusionModel<T>

Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll

Interface for latent diffusion models that operate in a compressed latent space.

public interface ILatentDiffusionModel<T> : IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Latent diffusion models are a highly efficient variant of diffusion models that perform the denoising process in a compressed latent space rather than pixel space. This is the architecture behind Stable Diffusion and many other state-of-the-art generative models.

For Beginners: Latent diffusion combines the power of diffusion models with the efficiency of autoencoders.

How it works:

  1. A VAE compresses images (512x512) into small latents (64x64)
  2. Diffusion happens in this compressed space (much faster!)
  3. The VAE decompresses the result back to a full image

Benefits:

  • Training is ~50x faster than pixel-space diffusion
  • Generation is ~50x faster
  • Quality remains very high
  • Enables practical high-resolution generation

Key components:

  • VAE: Compresses and decompresses images
  • Noise Predictor (U-Net/DiT): Predicts noise in latent space
  • Scheduler: Controls the denoising process
  • Conditioner: Encodes text/images for guided generation

This interface extends IDiffusionModel<T> with latent-space specific operations.
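
Example

A minimal sketch of how these components fit together, assuming you already have an ILatentDiffusionModel<float> instance from a concrete implementation (the variable name model is illustrative, not part of the API):

// Assumes: using System; using AiDotNet.Interfaces;
// Inspect the key components described above.
IVAEModel<float> vae = model.VAE;                             // compresses and decompresses images
INoisePredictor<float> predictor = model.NoisePredictor;      // predicts noise in latent space
IConditioningModule<float>? conditioner = model.Conditioner;  // encodes prompts (may be null)

Console.WriteLine($"Latent channels: {model.LatentChannels}");        // typically 4
Console.WriteLine($"Default guidance scale: {model.GuidanceScale}");  // e.g., 7.5
Console.WriteLine($"Supports inpainting: {model.SupportsInpainting}");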

Properties

Conditioner

Gets the conditioning module (optional, for conditioned generation).

IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

GuidanceScale

Gets the default guidance scale for classifier-free guidance.

double GuidanceScale { get; }

Property Value

double

Remarks

Higher values make generation more closely follow the conditioning. Typical values: 7.5 for Stable Diffusion, 5.0 for SDXL.

LatentChannels

Gets the number of latent channels.

int LatentChannels { get; }

Property Value

int

Remarks

Typically 4 for Stable Diffusion models.

NoisePredictor

Gets the noise predictor model (U-Net, DiT, etc.).

INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

SupportsInpainting

Gets whether this model supports inpainting.

bool SupportsInpainting { get; }

Property Value

bool

SupportsNegativePrompt

Gets whether this model supports negative prompts.

bool SupportsNegativePrompt { get; }

Property Value

bool

VAE

Gets the VAE model used for encoding and decoding.

IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Methods

DecodeFromLatent(Tensor<T>)

Decodes a latent representation back to an image.

Tensor<T> DecodeFromLatent(Tensor<T> latent)

Parameters

latent Tensor<T>

The latent tensor.

Returns

Tensor<T>

The decoded image tensor [batch, channels, height, width].

Remarks

For Beginners: This decompresses a latent back to an image:

  • Input: Small latent (e.g., 64x64x4)
  • Output: Full-size image (e.g., 512x512x3)
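
Example

A short sketch, assuming model is an ILatentDiffusionModel<float> and latent is a Tensor<float> obtained earlier (for example from EncodeToLatent or from the denoising loop), with shape [1, 4, 64, 64]:

// Decompress the latent back to pixel space.
Tensor<float> image = model.DecodeFromLatent(latent);
// With a 4-channel 64x64 latent and an 8x VAE downsample factor,
// the decoded image is typically [1, 3, 512, 512] (batch, channels, height, width).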

EncodeToLatent(Tensor<T>, bool)

Encodes an image into latent space.

Tensor<T> EncodeToLatent(Tensor<T> image, bool sampleMode = true)

Parameters

image Tensor<T>

The input image tensor [batch, channels, height, width].

sampleMode bool

Whether to sample from the VAE distribution.

Returns

Tensor<T>

The latent representation.

Remarks

For Beginners: This compresses an image for processing:

  • Input: Full-size image (e.g., 512x512)
  • Output: Small latent (e.g., 64x64x4)

Use sampleMode=true during training for VAE regularization, and sampleMode=false for deterministic encoding during editing.
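
Example

A small sketch, assuming model is an ILatentDiffusionModel<float> and image is a Tensor<float> with shape [batch, channels, height, width], e.g. [1, 3, 512, 512]:

// Deterministic encoding (no sampling) - the usual choice for editing workflows.
Tensor<float> latent = model.EncodeToLatent(image, sampleMode: false);

// Stochastic encoding (sample from the VAE distribution) - typically used during training.
Tensor<float> sampledLatent = model.EncodeToLatent(image, sampleMode: true);

// Round trip: decoding the latent yields an approximation of the original image.
Tensor<float> reconstruction = model.DecodeFromLatent(latent);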

GenerateFromText(string, string?, int, int, int, double?, int?)

Generates images from text prompts using classifier-free guidance.

Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

prompt string

The text prompt describing the desired image.

negativePrompt string

Optional negative prompt (what to avoid).

width int

Image width in pixels (should be divisible by the VAE downsample factor).

height int

Image height in pixels (should be divisible by the VAE downsample factor).

numInferenceSteps int

Number of denoising steps.

guidanceScale double?

How closely to follow the prompt (higher = closer).

seed int?

Optional random seed for reproducibility.

Returns

Tensor<T>

The generated image tensor.

Remarks

This is the main text-to-image generation method. It performs:

  1. Encode the text prompt to conditioning embeddings
  2. Generate random latent noise
  3. Iteratively denoise with classifier-free guidance
  4. Decode the latent to an image

For Beginners: This is how you generate images from text:

  • prompt: What you want ("a cat in a spacesuit")
  • negativePrompt: What to avoid ("blurry, low quality")
  • guidanceScale: How strictly to follow the prompt (7.5 is typical)
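
Example

A usage sketch, assuming model is an ILatentDiffusionModel<float> instance:

// Text-to-image with a negative prompt and a fixed seed for reproducibility.
Tensor<float> image = model.GenerateFromText(
    prompt: "a cat in a spacesuit, highly detailed",
    negativePrompt: "blurry, low quality",
    width: 512,
    height: 512,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);
// The result is an image tensor [batch, channels, height, width], ready to post-process or save.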

ImageToImage(Tensor<T>, string, string?, double, int, double?, int?)

Performs image-to-image generation (style transfer, editing).

Tensor<T> ImageToImage(Tensor<T> inputImage, string prompt, string? negativePrompt = null, double strength = 0.8, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

inputImage Tensor<T>

The input image to transform.

prompt string

The text prompt describing the desired transformation.

negativePrompt string

Optional negative prompt.

strength double

How much to transform (0.0 = no change, 1.0 = full regeneration).

numInferenceSteps int

Number of denoising steps.

guidanceScale double?

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

The transformed image tensor.

Remarks

Image-to-image works by:

  1. Encode the input image to a latent
  2. Add noise to the latent (controlled by strength)
  3. Denoise with text guidance
  4. Decode back to an image

For Beginners: This transforms an existing image based on a prompt:

  • strength=0.3: Minor changes, keeps most of the original
  • strength=0.7: Major changes, but composition remains
  • strength=1.0: Complete regeneration, original is just a starting point
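
Example

A sketch of image-to-image editing, assuming model is an ILatentDiffusionModel<float> and photo is a Tensor<float> holding the input image (e.g., [1, 3, 512, 512]):

// Restyle an existing photo while keeping its composition (strength 0.7).
Tensor<float> stylized = model.ImageToImage(
    inputImage: photo,
    prompt: "oil painting, impressionist style",
    negativePrompt: "blurry, low quality",
    strength: 0.7,
    numInferenceSteps: 50,
    seed: 123);

// A lower strength (e.g., 0.3) would keep much more of the original image.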

Inpaint(Tensor<T>, Tensor<T>, string, string?, int, double?, int?)

Performs inpainting (filling in masked regions).

Tensor<T> Inpaint(Tensor<T> inputImage, Tensor<T> mask, string prompt, string? negativePrompt = null, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)

Parameters

inputImage Tensor<T>

The input image with areas to inpaint.

mask Tensor<T>

Binary mask where 1 = inpaint, 0 = keep original.

prompt string

Text prompt describing what to generate in the masked area.

negativePrompt string

Optional negative prompt.

numInferenceSteps int

Number of denoising steps.

guidanceScale double?

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

The inpainted image tensor.

Remarks

Inpainting fills in masked regions while keeping unmasked areas intact. The mask should be the same spatial size as the image.

For Beginners: This is like a smart "fill" tool:

  • Draw a mask over what you want to replace
  • Describe what should go there
  • The model generates content that blends naturally
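
Example

A sketch of inpainting, assuming model is an ILatentDiffusionModel<float>, photo is the input image tensor, and mask is a binary Tensor<float> of the same spatial size (1 = inpaint, 0 = keep):

if (model.SupportsInpainting)
{
    // Replace the masked region with generated content that blends with the rest of the image.
    Tensor<float> result = model.Inpaint(
        inputImage: photo,
        mask: mask,
        prompt: "a wooden park bench",
        negativePrompt: "blurry, distorted",
        numInferenceSteps: 50,
        seed: 7);
}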

SetGuidanceScale(double)

Sets the guidance scale for classifier-free guidance.

void SetGuidanceScale(double scale)

Parameters

scale double

The guidance scale (typically 1.0-20.0).
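
Example

A small sketch, assuming model is an ILatentDiffusionModel<float>:

// Raise the default guidance scale so generations follow the prompt more strictly.
model.SetGuidanceScale(9.0);
Console.WriteLine(model.GuidanceScale); // presumably reflects the new default (9.0)

// Later calls that leave guidanceScale null would be expected to fall back to this default.
Tensor<float> image = model.GenerateFromText("a lighthouse at sunset");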