Interface ILatentDiffusionModel<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for latent diffusion models that operate in a compressed latent space.
public interface ILatentDiffusionModel<T> : IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Latent diffusion models are a highly efficient variant of diffusion models that perform the denoising process in a compressed latent space rather than pixel space. This is the architecture behind Stable Diffusion and many other state-of-the-art generative models.
For Beginners: Latent diffusion combines the power of diffusion models with the efficiency of autoencoders.
How it works:
- A VAE compresses images (512x512) into small latents (64x64)
- Diffusion happens in this compressed space (much faster!)
- The VAE decompresses the result back to a full image
Benefits:
- Training is ~50x faster than pixel-space diffusion
- Generation is ~50x faster
- Quality remains very high
- Enables practical high-resolution generation
Key components:
- VAE: Compresses and decompresses images
- Noise Predictor (U-Net/DiT): Predicts noise in latent space
- Scheduler: Controls the denoising process
- Conditioner: Encodes text/images for guided generation
This interface extends IDiffusionModel<T> with latent-space specific operations.
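As a rough sketch of how these members relate, the snippet below inspects the components on a concrete ILatentDiffusionModel<float> instance. How such an instance is constructed depends on the implementing class and is not shown here; the usings beyond AiDotNet.Interfaces are assumptions about where the related types live.

using System;
using AiDotNet.Interfaces;
// Tensor<T> and the component interfaces may need additional usings depending on their namespaces.

static void DescribeModel(ILatentDiffusionModel<float> model)
{
    // The VAE compresses images into latents and decompresses them again.
    IVAEModel<float> vae = model.VAE;

    // The noise predictor (U-Net, DiT, ...) does the denoising work in latent space.
    INoisePredictor<float> predictor = model.NoisePredictor;

    // The conditioner encodes prompts for guided generation; it may be null.
    IConditioningModule<float>? conditioner = model.Conditioner;

    Console.WriteLine($"VAE: {vae.GetType().Name}, predictor: {predictor.GetType().Name}");
    Console.WriteLine($"Conditioned generation available: {conditioner != null}");
    Console.WriteLine($"Latent channels: {model.LatentChannels}");        // typically 4
    Console.WriteLine($"Default guidance scale: {model.GuidanceScale}");  // e.g., 7.5
    Console.WriteLine($"Supports inpainting: {model.SupportsInpainting}");
    Console.WriteLine($"Supports negative prompts: {model.SupportsNegativePrompt}");
}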
Properties
Conditioner
Gets the conditioning module (optional, for conditioned generation).
IConditioningModule<T>? Conditioner { get; }
Property Value
- IConditioningModule<T>
GuidanceScale
Gets the default guidance scale for classifier-free guidance.
double GuidanceScale { get; }
Property Value
- double
Remarks
Higher values make generation more closely follow the conditioning. Typical values: 7.5 for Stable Diffusion, 5.0 for SDXL.
LatentChannels
Gets the number of latent channels.
int LatentChannels { get; }
Property Value
- int
Remarks
Typically 4 for Stable Diffusion models.
NoisePredictor
Gets the noise predictor model (U-Net, DiT, etc.).
INoisePredictor<T> NoisePredictor { get; }
Property Value
- INoisePredictor<T>
SupportsInpainting
Gets whether this model supports inpainting.
bool SupportsInpainting { get; }
Property Value
- bool
SupportsNegativePrompt
Gets whether this model supports negative prompts.
bool SupportsNegativePrompt { get; }
Property Value
- bool
VAE
Gets the VAE model used for encoding and decoding.
IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Methods
DecodeFromLatent(Tensor<T>)
Decodes a latent representation back to an image.
Tensor<T> DecodeFromLatent(Tensor<T> latent)
Parameters
latent (Tensor<T>): The latent tensor.
Returns
- Tensor<T>
The decoded image tensor [batch, channels, height, width].
Remarks
For Beginners: This decompresses a latent back to an image:
- Input: Small latent (e.g., 64x64x4)
- Output: Full-size image (e.g., 512x512x3)
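A minimal usage sketch, assuming the caller already has a concrete model instance and a latent tensor (for example from EncodeToLatent or from a manual denoising loop). The 8x downsample factor in the comments is an assumption, not part of the interface.

using AiDotNet.Interfaces;
// Tensor<T> may need an additional using depending on its namespace.

static Tensor<float> LatentToImage(ILatentDiffusionModel<float> model, Tensor<float> latent)
{
    // latent: [batch, model.LatentChannels, 64, 64]
    // result: [batch, 3, 512, 512] for a VAE with an 8x spatial downsample factor (assumed)
    return model.DecodeFromLatent(latent);
}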
EncodeToLatent(Tensor<T>, bool)
Encodes an image into latent space.
Tensor<T> EncodeToLatent(Tensor<T> image, bool sampleMode = true)
Parameters
image (Tensor<T>): The input image tensor [batch, channels, height, width].
sampleMode (bool): Whether to sample from the VAE distribution.
Returns
- Tensor<T>
The latent representation.
Remarks
For Beginners: This compresses an image for processing:
- Input: Full-size image (e.g., 512x512)
- Output: Small latent (e.g., 64x64x4)
Use sampleMode=true during training for VAE regularization, and sampleMode=false for deterministic encoding during editing.
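A short sketch contrasting the two encoding modes. The shapes in the comments assume a 512x512 RGB input, 4 latent channels, and an 8x VAE downsample factor; none of this is mandated by the interface.

using AiDotNet.Interfaces;
// Tensor<T> may need an additional using depending on its namespace.

static void EncodeBothWays(ILatentDiffusionModel<float> model, Tensor<float> image)
{
    // image: [batch, 3, 512, 512]  ->  latent: [batch, 4, 64, 64] (assumed shapes)

    // Training: sample from the VAE's latent distribution (keeps the latent space regularized).
    Tensor<float> trainingLatent = model.EncodeToLatent(image, sampleMode: true);

    // Editing / image-to-image: use the deterministic encoding so results are reproducible.
    Tensor<float> editingLatent = model.EncodeToLatent(image, sampleMode: false);
}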
GenerateFromText(string, string?, int, int, int, double?, int?)
Generates images from text prompts using classifier-free guidance.
Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)
Parameters
prompt (string): The text prompt describing the desired image.
negativePrompt (string): Optional negative prompt (what to avoid).
width (int): Image width in pixels (should be divisible by the VAE downsample factor).
height (int): Image height in pixels (should be divisible by the VAE downsample factor).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double?): How closely to follow the prompt (higher = closer).
seed (int?): Optional random seed for reproducibility.
Returns
- Tensor<T>
The generated image tensor.
Remarks
This is the main text-to-image generation method. It performs:
1. Encode the text prompt to conditioning embeddings
2. Generate random latent noise
3. Iteratively denoise with classifier-free guidance
4. Decode the latent to an image
For Beginners: This is how you generate images from text:
- prompt: What you want ("a cat in a spacesuit")
- negativePrompt: What to avoid ("blurry, low quality")
- guidanceScale: How strictly to follow the prompt (7.5 is typical)
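A usage sketch; the prompt text, resolution, and guidance value below are illustrative choices, not requirements of the interface.

using AiDotNet.Interfaces;
// Tensor<T> may need an additional using depending on its namespace.

static Tensor<float> CatInSpace(ILatentDiffusionModel<float> model)
{
    // Generates one 512x512 image; the fixed seed makes the result reproducible.
    return model.GenerateFromText(
        prompt: "a cat in a spacesuit, detailed, studio lighting",
        negativePrompt: "blurry, low quality",
        width: 512,
        height: 512,
        numInferenceSteps: 50,
        guidanceScale: 7.5,
        seed: 42);
}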
ImageToImage(Tensor<T>, string, string?, double, int, double?, int?)
Performs image-to-image generation (style transfer, editing).
Tensor<T> ImageToImage(Tensor<T> inputImage, string prompt, string? negativePrompt = null, double strength = 0.8, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)
Parameters
inputImage (Tensor<T>): The input image to transform.
prompt (string): The text prompt describing the desired transformation.
negativePrompt (string): Optional negative prompt.
strength (double): How much to transform (0.0 = no change, 1.0 = full regeneration).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double?): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
The transformed image tensor.
Remarks
Image-to-image works by:
1. Encode the input image to a latent
2. Add noise to the latent (controlled by strength)
3. Denoise with text guidance
4. Decode back to an image
For Beginners: This transforms an existing image based on a prompt:
- strength=0.3: Minor changes, keeps most of the original
- strength=0.7: Major changes, but composition remains
- strength=1.0: Complete regeneration, original is just a starting point
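A usage sketch, assuming the caller supplies the model and an input photo; the strength of 0.6 is an illustrative middle ground between the values above.

using AiDotNet.Interfaces;
// Tensor<T> may need an additional using depending on its namespace.

static Tensor<float> ToWatercolor(ILatentDiffusionModel<float> model, Tensor<float> photo)
{
    // strength = 0.6 keeps the original composition while changing the rendering style.
    return model.ImageToImage(
        inputImage: photo,
        prompt: "watercolor painting, soft pastel colors",
        negativePrompt: "photo, photorealistic",
        strength: 0.6,
        numInferenceSteps: 50,
        guidanceScale: 7.5,
        seed: 123);
}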
Inpaint(Tensor<T>, Tensor<T>, string, string?, int, double?, int?)
Performs inpainting (filling in masked regions).
Tensor<T> Inpaint(Tensor<T> inputImage, Tensor<T> mask, string prompt, string? negativePrompt = null, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)
Parameters
inputImage (Tensor<T>): The input image with areas to inpaint.
mask (Tensor<T>): Binary mask where 1 = inpaint, 0 = keep original.
prompt (string): Text prompt describing what to generate in the masked area.
negativePrompt (string): Optional negative prompt.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double?): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
The inpainted image tensor.
Remarks
Inpainting fills in masked regions while keeping unmasked areas intact. The mask should be the same spatial size as the image.
For Beginners: This is like a smart "fill" tool:
- Draw a mask over what you want to replace
- Describe what should go there
- The model generates content that blends naturally
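A usage sketch; the mask is assumed to be prepared by the caller with the same spatial size as the photo (1 = regenerate, 0 = keep), and the prompt is illustrative.

using System;
using AiDotNet.Interfaces;
// Tensor<T> may need an additional using depending on its namespace.

static Tensor<float> ReplaceSky(ILatentDiffusionModel<float> model, Tensor<float> photo, Tensor<float> skyMask)
{
    // Not every latent diffusion model is trained for inpainting, so check first.
    if (!model.SupportsInpainting)
        throw new NotSupportedException("This model does not support inpainting.");

    return model.Inpaint(
        inputImage: photo,
        mask: skyMask,
        prompt: "dramatic sunset sky with orange clouds",
        negativePrompt: "people, buildings",
        numInferenceSteps: 50,
        guidanceScale: 7.5,
        seed: 7);
}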
SetGuidanceScale(double)
Sets the guidance scale for classifier-free guidance.
void SetGuidanceScale(double scale)
Parameters
scale (double): The guidance scale (typically 1.0-20.0).
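A small sketch of reading and changing the default. Whether generation calls that leave guidanceScale unset fall back to this default is up to the implementing class; the comment below is an assumption.

using System;
using AiDotNet.Interfaces;
// Tensor<T> may need an additional using depending on its namespace.

static void TuneGuidance(ILatentDiffusionModel<float> model)
{
    Console.WriteLine($"Default guidance scale: {model.GuidanceScale}");  // e.g., 7.5

    // Raise the default so prompts are followed more strictly from now on.
    model.SetGuidanceScale(10.0);

    // This call leaves guidanceScale unset, so it presumably uses the new default.
    Tensor<float> image = model.GenerateFromText("a lighthouse at dawn");
}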