
Class StableVideoDiffusion<T>

Namespace
AiDotNet.Diffusion.Models
Assembly
AiDotNet.dll

Stable Video Diffusion (SVD) model for image-to-video generation.

public class StableVideoDiffusion<T> : VideoDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
object
VideoDiffusionModelBase<T>
StableVideoDiffusion<T>
Implements
ILatentDiffusionModel<T>
IVideoDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Examples

// Create a Stable Video Diffusion model
var svd = new StableVideoDiffusion<float>();

// Load your image (batch=1, channels=3, height=576, width=1024)
var inputImage = LoadImage("landscape.jpg");

// Generate video with default settings
var video = svd.GenerateFromImage(inputImage);

// Generate with custom motion (more movement)
var dynamicVideo = svd.GenerateFromImage(
    inputImage,
    numFrames: 25,
    fps: 7,
    motionBucketId: 200,  // Higher = more motion
    numInferenceSteps: 25,
    seed: 42);

// Output shape: [1, 25, 3, 576, 1024] (batch, frames, channels, height, width)
SaveVideo(dynamicVideo, "output.mp4");

Remarks

Stable Video Diffusion generates short video clips from a single input image. It extends the Stable Diffusion architecture with temporal awareness, using a 3D U-Net for noise prediction and a temporal VAE for encoding/decoding.

For Beginners: Think of SVD as "making a picture come to life." You give it a single image, and it generates a short video showing how that scene might animate:

Example workflow:

  1. Input: Photo of a waterfall
  2. SVD analyzes the scene and understands what should move
  3. Output: ~3.5-second video showing water flowing, mist rising

Key features:

  • Image-to-video: Primary use case, animate still images
  • Motion control: Adjust how much motion to add (motion bucket)
  • Configurable length: Generate different numbers of frames
  • High quality: Based on Stable Diffusion's proven architecture

Compared to text-to-video:

  • More predictable results (scene is defined by input image)
  • Better quality (less ambiguity than text prompts)
  • Faster generation (can use fewer denoising steps)

Technical specifications:

  • Default resolution: 576x1024 or 1024x576
  • Default frames: 25 frames at 7 FPS (~3.5 seconds)
  • Motion bucket ID: 1-255 (127 = moderate motion)
  • Noise augmentation: 0.02 default for the conditioning image
  • Latent space: 4 channels, 8x spatial downsampling
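
As a quick sanity check, the sketch below derives the latent shape and clip duration from these specifications. It is a worked example, not library code:

int height = 576, width = 1024;
int latentHeight = height / 8;    // 8x spatial downsampling -> 72
int latentWidth = width / 8;      // -> 128
int latentChannels = 4;

// Latent video shape: [batch, frames, channels, height, width]
// => [1, 25, 4, 72, 128]
int numFrames = 25, fps = 7;
double durationSeconds = (double)numFrames / fps;  // 25 / 7 ≈ 3.57 seconds
Console.WriteLine($"Latents: [1, {numFrames}, {latentChannels}, {latentHeight}, {latentWidth}], ~{durationSeconds:F1}s clip");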

Constructors

StableVideoDiffusion()

Initializes a new instance of StableVideoDiffusion with default parameters.

public StableVideoDiffusion()

Remarks

Creates an SVD model with standard parameters:

  • 25 frames at 7 FPS
  • 320 base channels
  • DDPM scheduler with 1000 training steps
  • Image conditioning enabled

StableVideoDiffusion(DiffusionModelOptions<T>?, INoiseScheduler<T>?, VideoUNetPredictor<T>?, TemporalVAE<T>?, IConditioningModule<T>?, int, int, double)

Initializes a new instance of StableVideoDiffusion with custom parameters.

public StableVideoDiffusion(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, VideoUNetPredictor<T>? videoUNet = null, TemporalVAE<T>? temporalVAE = null, IConditioningModule<T>? conditioner = null, int defaultNumFrames = 25, int defaultFPS = 7, double noiseAugmentStrength = 0.02)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler. Defaults to DDPM with 1000 steps.

videoUNet VideoUNetPredictor<T>

Optional custom VideoUNet predictor.

temporalVAE TemporalVAE<T>

Optional custom temporal VAE.

conditioner IConditioningModule<T>

Optional conditioning module for text guidance.

defaultNumFrames int

Default number of frames to generate.

defaultFPS int

Default frames per second.

noiseAugmentStrength double

Strength of the noise augmentation applied to the conditioning image. Default: 0.02.
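
For illustration, a minimal sketch of calling this constructor with only the scalar settings overridden; the null component arguments fall back to the documented defaults (DDPM scheduler with 1000 steps, standard VideoUNet and temporal VAE):

// Components left null use the built-in defaults.
var svd = new StableVideoDiffusion<float>(
    options: null,
    scheduler: null,             // defaults to DDPM with 1000 training steps
    defaultNumFrames: 14,        // shorter clips than the 25-frame default
    defaultFPS: 7,
    noiseAugmentStrength: 0.05); // allow more deviation from the input image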

Fields

DefaultHeight

Default height for SVD generation.

public const int DefaultHeight = 576

Field Value

int

DefaultWidth

Default width for SVD generation.

public const int DefaultWidth = 1024

Field Value

int

Properties

Conditioner

Gets the conditioning module if available.

public override IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

Remarks

SVD primarily uses image conditioning rather than text conditioning. The conditioner is optional and typically used for additional guidance.

LatentChannels

Gets the number of latent channels (4 for SVD).

public override int LatentChannels { get; }

Property Value

int

NoisePredictor

Gets the noise predictor used by this model.

public override INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

ParameterCount

Gets the total number of parameters in the model.

public override int ParameterCount { get; }

Property Value

int

SupportsImageToVideo

Gets whether this model supports image-to-video generation.

public override bool SupportsImageToVideo { get; }

Property Value

bool

Remarks

Always true for SVD; image-to-video generation is its primary use case.

SupportsTextToVideo

Gets whether this model supports text-to-video generation.

public override bool SupportsTextToVideo { get; }

Property Value

bool

Remarks

Returns true only if a conditioning module is provided. SVD's primary mode is image-to-video, but text guidance can be added.

SupportsVideoToVideo

Gets whether this model supports video-to-video transformation.

public override bool SupportsVideoToVideo { get; }

Property Value

bool

Remarks

Partially supported through the VideoToVideo method inherited from the base class.
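
Because these capabilities depend on how the model was configured, callers can gate on the flags before dispatching. A small illustrative check using only the documented properties:

var svd = new StableVideoDiffusion<float>();

// Image-to-video is always available; text-to-video only when a
// conditioning module was supplied to the constructor.
Console.WriteLine($"image-to-video: {svd.SupportsImageToVideo}");
Console.WriteLine($"text-to-video:  {svd.SupportsTextToVideo}");
Console.WriteLine($"video-to-video: {svd.SupportsVideoToVideo}");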

TemporalVAE

Gets the temporal VAE specifically for video operations.

public override IVAEModel<T>? TemporalVAE { get; }

Property Value

IVAEModel<T>

VAE

Gets the VAE used by this model for image encoding.

public override IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Remarks

Returns the temporal VAE for both single image and video operations. The temporal VAE can handle both 4D (image) and 5D (video) tensors.

VideoUNet

Gets the video U-Net predictor with image conditioning support.

public VideoUNetPredictor<T> VideoUNet { get; }

Property Value

VideoUNetPredictor<T>

Methods

Clone()

Creates a clone of this StableVideoDiffusion model.

public override IDiffusionModel<T> Clone()

Returns

IDiffusionModel<T>

A new instance with the same configuration.

CreateMotionEmbedding(int, int)

Creates SVD-specific motion embedding.

protected override Tensor<T> CreateMotionEmbedding(int motionBucketId, int fps)

Parameters

motionBucketId int

Motion intensity (1-255).

fps int

Frames per second.

Returns

Tensor<T>

Motion embedding tensor.

DecodeVideoLatents(Tensor<T>)

Decodes video latents using the temporal VAE.

protected override Tensor<T> DecodeVideoLatents(Tensor<T> latents)

Parameters

latents Tensor<T>

Video latents [batch, frames, channels, height, width].

Returns

Tensor<T>

Decoded video [batch, frames, channels, height, width].

DeepCopy()

Creates a deep copy of this model.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>

A new instance with copied parameters.

EncodeConditioningImage(Tensor<T>, double, int?)

Encodes a conditioning image with SVD-specific processing.

protected override Tensor<T> EncodeConditioningImage(Tensor<T> image, double noiseAugStrength, int? seed)

Parameters

image Tensor<T>

The input image tensor.

noiseAugStrength double

Noise augmentation strength.

seed int?

Optional random seed.

Returns

Tensor<T>

Encoded image embedding for conditioning.

GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)

Generates a video from an input image using image-to-video diffusion.

public override Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)

Parameters

inputImage Tensor<T>

The conditioning image tensor [batch, channels, height, width]. Should be normalized to [-1, 1] range.

numFrames int?

Number of frames to generate. Default: 25.

fps int?

Frames per second. Default: 7.

numInferenceSteps int

Number of denoising steps. Default: 25.

motionBucketId int?

Motion intensity control (1-255). Higher values = more motion. Default: 127 (moderate motion).

noiseAugStrength double

Noise augmentation for conditioning image. Higher values encourage more deviation from input. Default: 0.02.

seed int?

Optional random seed for reproducibility.

Returns

Tensor<T>

Generated video tensor [batch, numFrames, channels, height, width].

Remarks

This method generates a video sequence from a single input image. The first frame will closely match the input image, while subsequent frames show natural motion based on the scene content.

Tips for best results:

  • Use high-quality, sharp input images
  • Adjust the motion bucket for the scene type (lower for static scenes, higher for action)
  • Use more inference steps for higher quality (25-50 steps)
  • Lower noise augmentation keeps the output closer to the input
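
The snippet below illustrates the normalization requirement and a low-motion configuration for a mostly static scene. LoadImage and NormalizeToSignedRange are hypothetical helpers: the first, as in the Examples above, returns a [1, 3, 576, 1024] image tensor; the second maps [0, 1] pixel values to [-1, 1] via x * 2 - 1.

var svd = new StableVideoDiffusion<float>();

// Hypothetical helpers: load the image, then rescale pixel values
// from [0, 1] to the required [-1, 1] range.
var image = NormalizeToSignedRange(LoadImage("portrait.jpg"));

// Low motion bucket for a near-static scene; extra steps for quality.
var video = svd.GenerateFromImage(
    image,
    motionBucketId: 60,
    numInferenceSteps: 40,
    seed: 123);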

GenerateWithEndImageGuidance(Tensor<T>, Tensor<T>, int, int, int?)

Generates video with motion guidance from a secondary image.

public Tensor<T> GenerateWithEndImageGuidance(Tensor<T> startImage, Tensor<T> endImage, int numFrames = 25, int numInferenceSteps = 25, int? seed = null)

Parameters

startImage Tensor<T>

The starting image for the video.

endImage Tensor<T>

Target image suggesting where motion should lead.

numFrames int

Number of frames to generate.

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video transitioning from start to end image.

Remarks

Uses latent-space interpolation to guide the video generation toward the target end image. This is not exact morphing; it provides directional guidance for the motion.
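
A short usage sketch; LoadImage is again a hypothetical loader returning normalized [1, 3, 576, 1024] tensors:

var svd = new StableVideoDiffusion<float>();
var start = LoadImage("valley_morning.jpg");
var end = LoadImage("valley_dusk.jpg");

// Latent-space interpolation nudges the motion toward the end image;
// expect directional guidance rather than an exact morph.
var video = svd.GenerateWithEndImageGuidance(
    start, end,
    numFrames: 25,
    numInferenceSteps: 30,
    seed: 7);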

GenerateWithFirstFrame(Tensor<T>, int, int, int, int?)

Generates video with explicit first frame control.

public Tensor<T> GenerateWithFirstFrame(Tensor<T> firstFrame, int numFrames = 25, int motionBucketId = 127, int numInferenceSteps = 25, int? seed = null)

Parameters

firstFrame Tensor<T>

The exact first frame to use.

numFrames int

Number of frames to generate.

motionBucketId int

Motion intensity (1-255).

numInferenceSteps int

Number of denoising steps.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video with specified first frame.

Remarks

This method ensures the first frame exactly matches the input, while subsequent frames are generated through diffusion. Useful when you want precise control over the starting frame.
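
For example, pinning the first frame to a known keyframe while diffusion fills in the rest (LoadImage remains a hypothetical helper):

var svd = new StableVideoDiffusion<float>();
var firstFrame = LoadImage("keyframe.jpg");

// Frame 0 matches the input exactly; the remaining frames are
// generated through diffusion at the requested motion level.
var video = svd.GenerateWithFirstFrame(
    firstFrame,
    numFrames: 25,
    motionBucketId: 150,
    seed: 42);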

GetParameters()

Gets the flattened parameters of all components.

public override Vector<T> GetParameters()

Returns

Vector<T>

A vector containing all model parameters.

GetRecommendedResolution(double)

Gets the recommended resolution for SVD generation.

public static (int width, int height) GetRecommendedResolution(double aspectRatio = 1.7777777777777777)

Parameters

aspectRatio double

Desired aspect ratio (width/height).

Returns

(int width, int height)

Tuple of (width, height) optimized for SVD.

Remarks

SVD works best at specific resolutions. This method returns the closest supported resolution for the given aspect ratio.
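
For instance, snapping to a supported resolution before preparing the input image:

// 16:9 is the default aspect ratio (1024 / 576 ≈ 1.778).
var (width, height) = StableVideoDiffusion<float>.GetRecommendedResolution(16.0 / 9.0);
Console.WriteLine($"landscape: {width}x{height}");  // expected: 1024x576

// Portrait content: pass the inverted ratio.
var (w, h) = StableVideoDiffusion<float>.GetRecommendedResolution(9.0 / 16.0);
Console.WriteLine($"portrait: {w}x{h}");            // expected: 576x1024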

PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)

Predicts noise for video frames conditioned on image and motion.

protected override Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)

Parameters

latents Tensor<T>

Current video latents [batch, channels, frames, height, width].

timestep int

Current diffusion timestep.

imageEmbedding Tensor<T>

Encoded conditioning image.

motionEmbedding Tensor<T>

Motion embedding for motion intensity control.

Returns

Tensor<T>

Predicted noise tensor with same shape as latents.

Remarks

This method uses the VideoUNet with image conditioning to predict noise for all frames simultaneously. The image embedding provides scene context while motion embedding controls animation intensity.

SetParameters(Vector<T>)

Sets the parameters for all components.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The parameter vector to distribute across components.
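
Together with GetParameters(), this enables a simple snapshot/restore round trip using only the two documented members:

var svd = new StableVideoDiffusion<float>();

// Snapshot every component's parameters as one flat vector...
var snapshot = svd.GetParameters();

// ...then restore the exact same state later.
svd.SetParameters(snapshot);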