Class LatentDiffusionModelBase<T>
Base class for latent diffusion models that operate in a compressed latent space.
public abstract class LatentDiffusionModelBase<T> : DiffusionModelBase<T>, ILatentDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
- T: The numeric type used for calculations.
- Inheritance: DiffusionModelBase<T> → LatentDiffusionModelBase<T>
Remarks
This abstract base class provides common functionality for all latent diffusion models, including encoding/decoding, text-to-image generation, image-to-image transformation, and inpainting.
For Beginners: This is the foundation for latent diffusion models like Stable Diffusion. It combines a VAE (for compression), a noise predictor (for denoising), and optional conditioning (for guided generation from text or images).
Constructors
LatentDiffusionModelBase(DiffusionModelOptions<T>?, INoiseScheduler<T>?)
Initializes a new instance of the LatentDiffusionModelBase class.
protected LatentDiffusionModelBase(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null)
Parameters
- options (DiffusionModelOptions<T>?): Configuration options for the diffusion model.
- scheduler (INoiseScheduler<T>?): Optional custom scheduler.
Properties
Conditioner
Gets the conditioning module (optional, for conditioned generation).
public abstract IConditioningModule<T>? Conditioner { get; }
Property Value
- IConditioningModule<T>?
GuidanceScale
Gets the default guidance scale for classifier-free guidance.
public virtual double GuidanceScale { get; }
Property Value
- double
Remarks
Higher values make generation more closely follow the conditioning. Typical values: 7.5 for Stable Diffusion, 5.0 for SDXL.
LatentChannels
Gets the number of latent channels.
public abstract int LatentChannels { get; }
Property Value
- int
Remarks
Typically 4 for Stable Diffusion models.
NoisePredictor
Gets the noise predictor model (U-Net, DiT, etc.).
public abstract INoisePredictor<T> NoisePredictor { get; }
Property Value
- INoisePredictor<T>
SupportsInpainting
Gets whether this model supports inpainting.
public virtual bool SupportsInpainting { get; }
Property Value
- bool
SupportsNegativePrompt
Gets whether this model supports negative prompts.
public virtual bool SupportsNegativePrompt { get; }
Property Value
- bool
VAE
Gets the VAE model used for encoding and decoding.
public abstract IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Methods
ApplyGuidance(Tensor<T>, Tensor<T>, double)
Applies classifier-free guidance to combine conditional and unconditional predictions.
protected virtual Tensor<T> ApplyGuidance(Tensor<T> unconditional, Tensor<T> conditional, double scale)
Parameters
- unconditional (Tensor<T>): The unconditional noise prediction.
- conditional (Tensor<T>): The conditional noise prediction.
- scale (double): The guidance scale.
Returns
- Tensor<T>
The guided noise prediction.
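The standard classifier-free guidance combination is guided = unconditional + scale * (conditional - unconditional). A minimal element-wise sketch on plain arrays (illustrative only; the actual method operates on Tensor<T>):

```csharp
// Classifier-free guidance on flat arrays (illustrative sketch):
// guided = uncond + scale * (cond - uncond)
static double[] ApplyGuidanceSketch(double[] uncond, double[] cond, double scale)
{
    var guided = new double[uncond.Length];
    for (int i = 0; i < uncond.Length; i++)
        guided[i] = uncond[i] + scale * (cond[i] - uncond[i]);
    return guided;
}
// scale = 0 ignores the conditioning, scale = 1 reproduces the conditional
// prediction, and scale > 1 extrapolates toward the conditioning.
```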
BlendLatentsWithMask(Tensor<T>, Tensor<T>, Tensor<T>, int)
Blends generated latents with original latents based on mask for inpainting.
protected virtual Tensor<T> BlendLatentsWithMask(Tensor<T> generated, Tensor<T> original, Tensor<T> mask, int timestep)
Parameters
- generated (Tensor<T>): The generated latents.
- original (Tensor<T>): The original latents.
- mask (Tensor<T>): The mask (1 = inpaint, 0 = keep original).
- timestep (int): Current timestep, used when adding noise to the original latents.
Returns
- Tensor<T>
Blended latents.
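Conceptually, the blend keeps the generated latents where the mask is 1 and the (re-noised) original where it is 0. A minimal per-element sketch (the real method also adds timestep-appropriate noise to the original latents before blending):

```csharp
// blended = mask * generated + (1 - mask) * noisedOriginal
static double[] BlendSketch(double[] generated, double[] noisedOriginal, double[] mask)
{
    var blended = new double[generated.Length];
    for (int i = 0; i < generated.Length; i++)
        blended[i] = mask[i] * generated[i] + (1.0 - mask[i]) * noisedOriginal[i];
    return blended;
}
```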
DecodeFromLatent(Tensor<T>)
Decodes a latent representation back to an image.
public virtual Tensor<T> DecodeFromLatent(Tensor<T> latent)
Parameters
- latent (Tensor<T>): The latent tensor.
Returns
- Tensor<T>
The decoded image tensor [batch, channels, height, width].
Remarks
For Beginners: This decompresses a latent back to an image:
- Input: small latent (e.g., 64x64x4)
- Output: full-size image (e.g., 512x512x3)
EncodeToLatent(Tensor<T>, bool)
Encodes an image into latent space.
public virtual Tensor<T> EncodeToLatent(Tensor<T> image, bool sampleMode = true)
Parameters
- image (Tensor<T>): The input image tensor [batch, channels, height, width].
- sampleMode (bool): Whether to sample from the VAE distribution.
Returns
- Tensor<T>
The latent representation.
Remarks
For Beginners: This compresses an image for processing:
- Input: full-size image (e.g., 512x512)
- Output: small latent (e.g., 64x64x4)
Use sampleMode=true during training for VAE regularization, and sampleMode=false for deterministic encoding during editing.
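A sketch of an encode/decode round trip, assuming a hypothetical concrete subclass (StableDiffusionModel here is a placeholder name) and an already-loaded image tensor:

```csharp
// 'image' is assumed to be a Tensor<float> of shape [1, 3, 512, 512].
LatentDiffusionModelBase<float> model = new StableDiffusionModel(); // hypothetical subclass

// Deterministic encoding for editing: repeated encodes give the same latent.
Tensor<float> latent = model.EncodeToLatent(image, sampleMode: false);
// latent is spatially compressed, e.g. [1, 4, 64, 64].

// Decode back to pixel space.
Tensor<float> reconstructed = model.DecodeFromLatent(latent);
```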
Generate(int[], int, int?)
Generates samples by iteratively denoising from random noise.
public override Tensor<T> Generate(int[] shape, int numInferenceSteps = 50, int? seed = null)
Parameters
- shape (int[]): The shape of samples to generate (e.g., [batchSize, channels, height, width]).
- numInferenceSteps (int): Number of denoising steps. More steps give higher quality but slower generation.
- seed (int?): Optional random seed for reproducibility. If null, uses system random.
Returns
- Tensor<T>
Generated samples as a tensor.
Remarks
This is the main generation method. It starts with random noise and applies the reverse diffusion process to generate new samples.
For Beginners: This is how you create new images/data:
1. Start with pure random noise (like TV static)
2. Ask the model "what does this look like minus some noise?"
3. Repeat many times, each time removing a bit more noise
4. End with a clean generated sample
More inference steps = cleaner results but slower generation. Typical values: 20-50 for fast generation, 100-200 for high quality.
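For example, sampling one batch with a fixed seed (the shape passed here is assumed to be the latent shape, built from the model's LatentChannels):

```csharp
// 20-50 steps for fast generation, 100-200 for higher quality.
Tensor<float> sample = model.Generate(
    shape: new[] { 1, model.LatentChannels, 64, 64 },
    numInferenceSteps: 50,
    seed: 42);  // fixed seed for a reproducible result
```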
GenerateFromText(string, string?, int, int, int, double?, int?)
Generates images from text prompts using classifier-free guidance.
public virtual Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)
Parameters
- prompt (string): The text prompt describing the desired image.
- negativePrompt (string?): Optional negative prompt (what to avoid).
- width (int): Image width in pixels (should be divisible by the VAE downsample factor).
- height (int): Image height in pixels (should be divisible by the VAE downsample factor).
- numInferenceSteps (int): Number of denoising steps.
- guidanceScale (double?): How closely to follow the prompt (higher = closer).
- seed (int?): Optional random seed for reproducibility.
Returns
- Tensor<T>
The generated image tensor.
Remarks
This is the main text-to-image generation method. It performs:
1. Encode text prompts to conditioning embeddings
2. Generate random latent noise
3. Iteratively denoise with classifier-free guidance
4. Decode the latent to an image
For Beginners: This is how you generate images from text:
- prompt: what you want ("a cat in a spacesuit")
- negativePrompt: what to avoid ("blurry, low quality")
- guidanceScale: how strictly to follow the prompt (7.5 is typical)
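A typical call, assuming a concrete model instance:

```csharp
Tensor<float> image = model.GenerateFromText(
    prompt: "a cat in a spacesuit",
    negativePrompt: "blurry, low quality",
    width: 512,
    height: 512,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);  // fixed seed for a reproducible result
```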
ImageToImage(Tensor<T>, string, string?, double, int, double?, int?)
Performs image-to-image generation (style transfer, editing).
public virtual Tensor<T> ImageToImage(Tensor<T> inputImage, string prompt, string? negativePrompt = null, double strength = 0.8, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)
Parameters
- inputImage (Tensor<T>): The input image to transform.
- prompt (string): The text prompt describing the desired transformation.
- negativePrompt (string?): Optional negative prompt.
- strength (double): How much to transform (0.0 = no change, 1.0 = full regeneration).
- numInferenceSteps (int): Number of denoising steps.
- guidanceScale (double?): Classifier-free guidance scale.
- seed (int?): Optional random seed.
Returns
- Tensor<T>
The transformed image tensor.
Remarks
Image-to-image works by:
1. Encode the input image to a latent
2. Add noise to the latent (controlled by strength)
3. Denoise with text guidance
4. Decode back to an image
For Beginners: This transforms an existing image based on a prompt:
- strength=0.3: Minor changes, keeps most of the original
- strength=0.7: Major changes, but composition remains
- strength=1.0: Complete regeneration, original is just a starting point
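For example, a moderate restyling that keeps the photo's composition (assuming a concrete model instance and an input tensor named photo):

```csharp
Tensor<float> stylized = model.ImageToImage(
    inputImage: photo,  // assumed Tensor<float> of shape [1, 3, H, W]
    prompt: "watercolor painting, soft colors",
    strength: 0.6,      // major changes, but composition remains
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);
```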
Inpaint(Tensor<T>, Tensor<T>, string, string?, int, double?, int?)
Performs inpainting (filling in masked regions).
public virtual Tensor<T> Inpaint(Tensor<T> inputImage, Tensor<T> mask, string prompt, string? negativePrompt = null, int numInferenceSteps = 50, double? guidanceScale = null, int? seed = null)
Parameters
- inputImage (Tensor<T>): The input image with areas to inpaint.
- mask (Tensor<T>): Binary mask where 1 = inpaint, 0 = keep original.
- prompt (string): Text prompt describing what to generate in the masked area.
- negativePrompt (string?): Optional negative prompt.
- numInferenceSteps (int): Number of denoising steps.
- guidanceScale (double?): Classifier-free guidance scale.
- seed (int?): Optional random seed.
Returns
- Tensor<T>
The inpainted image tensor.
Remarks
Inpainting fills in masked regions while keeping unmasked areas intact. The mask should be the same spatial size as the image.
For Beginners: This is like a smart "fill" tool:
- Draw a mask over what you want to replace
- Describe what should go there
- The model generates content that blends naturally
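A typical call, assuming a concrete model instance plus photo and mask tensors prepared by the caller:

```csharp
// 'mask' has the same spatial size as the image, 1 where content is replaced.
Tensor<float> result = model.Inpaint(
    inputImage: photo,
    mask: mask,
    prompt: "a vase of sunflowers on the table",
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);
```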
PredictNoise(Tensor<T>, int)
Predicts the noise in a noisy sample at a given timestep.
public override Tensor<T> PredictNoise(Tensor<T> noisySample, int timestep)
Parameters
- noisySample (Tensor<T>): The noisy input sample.
- timestep (int): The current timestep in the diffusion process.
Returns
- Tensor<T>
The predicted noise tensor.
Remarks
This is the core prediction that the model learns. Given a noisy sample at timestep t, predict what noise was added to create it.
For Beginners: The model looks at a noisy image and guesses "what noise was added to make it look like this?" This prediction is then used to remove that noise and get a cleaner image.
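Conceptually, one reverse-diffusion iteration pairs PredictNoise with a scheduler update. A sketch (scheduler.Timesteps and scheduler.Step are assumed names; the actual step logic lives in the INoiseScheduler<T> implementation):

```csharp
// Walk the timesteps from most to least noisy (conceptual sketch).
foreach (int t in scheduler.Timesteps)               // e.g. 999, 979, ..., 0
{
    Tensor<float> noisePred = model.PredictNoise(latents, t);
    latents = scheduler.Step(noisePred, t, latents); // remove a bit of noise
}
```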
ResizeMaskToLatent(Tensor<T>, int[])
Resizes a mask tensor to match latent dimensions.
protected virtual Tensor<T> ResizeMaskToLatent(Tensor<T> mask, int[] latentShape)
Parameters
- mask (Tensor<T>): The original mask [batch, 1, height, width].
- latentShape (int[]): The target latent shape.
Returns
- Tensor<T>
The resized mask matching latent dimensions.
SampleNoiseTensor(int[], Random)
Samples a noise tensor from standard normal distribution.
protected virtual Tensor<T> SampleNoiseTensor(int[] shape, Random rng)
Parameters
Returns
- Tensor<T>
A tensor filled with Gaussian noise.
SetGuidanceScale(double)
Sets the guidance scale for classifier-free guidance.
public virtual void SetGuidanceScale(double scale)
Parameters
- scale (double): The guidance scale (typically 1.0-20.0).