Class StableVideoDiffusion<T>
Stable Video Diffusion (SVD) model for image-to-video generation.
public class StableVideoDiffusion<T> : VideoDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Inheritance: VideoDiffusionModelBase<T> → StableVideoDiffusion<T>
Examples
// Create a Stable Video Diffusion model
var svd = new StableVideoDiffusion<float>();
// Load your image (batch=1, channels=3, height=576, width=1024)
var inputImage = LoadImage("landscape.jpg");
// Generate video with default settings
var video = svd.GenerateFromImage(inputImage);
// Generate with custom motion (more movement)
var dynamicVideo = svd.GenerateFromImage(
inputImage,
numFrames: 25,
fps: 7,
motionBucketId: 200, // Higher = more motion
numInferenceSteps: 25,
seed: 42);
// Output shape: [1, 25, 3, 576, 1024] (batch, frames, channels, height, width)
SaveVideo(dynamicVideo, "output.mp4");
Remarks
Stable Video Diffusion generates short video clips from a single input image. It extends the Stable Diffusion architecture with temporal awareness, using a 3D U-Net for noise prediction and a temporal VAE for encoding/decoding.
For Beginners: Think of SVD as "making a picture come to life." You give it a single image, and it generates a short video showing how that scene might animate:
Example workflow:
- Input: Photo of a waterfall
- SVD analyzes the scene and understands what should move
- Output: ~3.5-second video showing water flowing, mist rising
Key features:
- Image-to-video: Primary use case, animate still images
- Motion control: Adjust how much motion to add (motion bucket)
- Configurable length: Generate different numbers of frames
- High quality: Based on Stable Diffusion's proven architecture
Compared to text-to-video:
- More predictable results (scene is defined by input image)
- Better quality (less ambiguity than text prompts)
- Faster generation (can use fewer denoising steps)
Technical specifications:
- Default resolution: 576x1024 or 1024x576
- Default frames: 25 frames at 7 FPS (~3.5 seconds)
- Motion bucket ID: 1-255 (127 = moderate motion)
- Noise augmentation: 0.02 default for conditioning image
- Latent space: 4 channels, 8x spatial downsampling
Constructors
StableVideoDiffusion()
Initializes a new instance of StableVideoDiffusion with default parameters.
public StableVideoDiffusion()
Remarks
Creates an SVD model with standard parameters:
- 25 frames at 7 FPS
- 320 base channels
- DDPM scheduler with 1000 training steps
- Image conditioning enabled
StableVideoDiffusion(DiffusionModelOptions<T>?, INoiseScheduler<T>?, VideoUNetPredictor<T>?, TemporalVAE<T>?, IConditioningModule<T>?, int, int, double)
Initializes a new instance of StableVideoDiffusion with custom parameters.
public StableVideoDiffusion(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, VideoUNetPredictor<T>? videoUNet = null, TemporalVAE<T>? temporalVAE = null, IConditioningModule<T>? conditioner = null, int defaultNumFrames = 25, int defaultFPS = 7, double noiseAugmentStrength = 0.02)
Parameters
options (DiffusionModelOptions<T>): Configuration options for the diffusion model.
scheduler (INoiseScheduler<T>): Optional custom scheduler. Defaults to DDPM with 1000 steps.
videoUNet (VideoUNetPredictor<T>): Optional custom VideoUNet predictor.
temporalVAE (TemporalVAE<T>): Optional custom temporal VAE.
conditioner (IConditioningModule<T>): Optional conditioning module for text guidance.
defaultNumFrames (int): Default number of frames to generate.
defaultFPS (int): Default frames per second.
noiseAugmentStrength (double): Strength of the noise augmentation applied to the conditioning image. Default: 0.02.
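For example, a minimal sketch of constructing a model with custom defaults (the values are illustrative; all other arguments keep their built-in defaults):
// Illustrative: shorter clips at 6 FPS with slightly stronger noise augmentation.
var shortClipModel = new StableVideoDiffusion<float>(
    defaultNumFrames: 14,
    defaultFPS: 6,
    noiseAugmentStrength: 0.05);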
Fields
DefaultHeight
Default height for SVD generation.
public const int DefaultHeight = 576
Field Value
- int
DefaultWidth
Default width for SVD generation.
public const int DefaultWidth = 1024
Field Value
- int
Properties
Conditioner
Gets the conditioning module if available.
public override IConditioningModule<T>? Conditioner { get; }
Property Value
- IConditioningModule<T>
Remarks
SVD primarily uses image conditioning rather than text conditioning. The conditioner is optional and typically used for additional guidance.
LatentChannels
Gets the number of latent channels (4 for SVD).
public override int LatentChannels { get; }
Property Value
- int
NoisePredictor
Gets the noise predictor used by this model.
public override INoisePredictor<T> NoisePredictor { get; }
Property Value
- INoisePredictor<T>
ParameterCount
Gets the total number of parameters in the model.
public override int ParameterCount { get; }
Property Value
- int
SupportsImageToVideo
Gets whether this model supports image-to-video generation.
public override bool SupportsImageToVideo { get; }
Property Value
- bool
Remarks
Always true for SVD - this is the primary use case.
SupportsTextToVideo
Gets whether this model supports text-to-video generation.
public override bool SupportsTextToVideo { get; }
Property Value
- bool
Remarks
Returns true only if a conditioning module is provided. SVD's primary mode is image-to-video, but text guidance can be added.
SupportsVideoToVideo
Gets whether this model supports video-to-video transformation.
public override bool SupportsVideoToVideo { get; }
Property Value
- bool
Remarks
Partially supported through the VideoToVideo method inherited from the base class.
TemporalVAE
Gets the temporal VAE specifically for video operations.
public override IVAEModel<T>? TemporalVAE { get; }
Property Value
- IVAEModel<T>
VAE
Gets the VAE used by this model for image encoding.
public override IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Remarks
Returns the temporal VAE for both single image and video operations. The temporal VAE can handle both 4D (image) and 5D (video) tensors.
VideoUNet
Gets the video U-Net predictor with image conditioning support.
public VideoUNetPredictor<T> VideoUNet { get; }
Property Value
- VideoUNetPredictor<T>
Methods
Clone()
Creates a clone of this StableVideoDiffusion model.
public override IDiffusionModel<T> Clone()
Returns
- IDiffusionModel<T>
A new instance with the same configuration.
CreateMotionEmbedding(int, int)
Creates SVD-specific motion embedding.
protected override Tensor<T> CreateMotionEmbedding(int motionBucketId, int fps)
Parameters
motionBucketId (int): Motion intensity control (1-255).
fps (int): Frames per second.
Returns
- Tensor<T>
Motion embedding tensor.
DecodeVideoLatents(Tensor<T>)
Decodes video latents using the temporal VAE.
protected override Tensor<T> DecodeVideoLatents(Tensor<T> latents)
Parameters
latents (Tensor<T>): Video latents [batch, frames, channels, height, width].
Returns
- Tensor<T>
Decoded video [batch, frames, channels, height, width].
DeepCopy()
Creates a deep copy of this model.
public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
A new instance with copied parameters.
EncodeConditioningImage(Tensor<T>, double, int?)
Encodes a conditioning image with SVD-specific processing.
protected override Tensor<T> EncodeConditioningImage(Tensor<T> image, double noiseAugStrength, int? seed)
Parameters
image (Tensor<T>): The input image tensor.
noiseAugStrength (double): Noise augmentation strength.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Encoded image embedding for conditioning.
GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)
Generates a video from an input image using image-to-video diffusion.
public override Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)
Parameters
inputImage (Tensor<T>): The conditioning image tensor [batch, channels, height, width]. Should be normalized to [-1, 1] range.
numFrames (int?): Number of frames to generate. Default: 25.
fps (int?): Frames per second. Default: 7.
numInferenceSteps (int): Number of denoising steps. Default: 25.
motionBucketId (int?): Motion intensity control (1-255). Higher values = more motion. Default: 127 (moderate motion).
noiseAugStrength (double): Noise augmentation for conditioning image. Higher values encourage more deviation from input. Default: 0.02.
seed (int?): Optional random seed for reproducibility.
Returns
- Tensor<T>
Generated video tensor [batch, numFrames, channels, height, width].
Remarks
This method generates a video sequence from a single input image. The first frame will closely match the input image, while subsequent frames show natural motion based on the scene content.
Tips for best results:
- Use high-quality, sharp input images
- Adjust motion bucket for scene type (lower for static scenes, higher for action)
- Use more inference steps for higher quality (25-50 steps)
- Lower noise augmentation keeps output closer to input
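A sketch of these tips applied to a mostly static scene (the file name and the LoadImage helper are placeholders, as in the class example):
var svd = new StableVideoDiffusion<float>();
var lakePhoto = LoadImage("calm_lake.jpg");   // placeholder loader
var subtleVideo = svd.GenerateFromImage(
    lakePhoto,
    numInferenceSteps: 40,    // more steps for higher quality
    motionBucketId: 60,       // low motion suits a static scene
    noiseAugStrength: 0.01,   // stay close to the input image
    seed: 123);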
GenerateWithEndImageGuidance(Tensor<T>, Tensor<T>, int, int, int?)
Generates video with motion guidance from a secondary image.
public Tensor<T> GenerateWithEndImageGuidance(Tensor<T> startImage, Tensor<T> endImage, int numFrames = 25, int numInferenceSteps = 25, int? seed = null)
Parameters
startImage (Tensor<T>): The starting image for the video.
endImage (Tensor<T>): Target image suggesting where motion should lead.
numFrames (int): Number of frames to generate.
numInferenceSteps (int): Number of denoising steps.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video transitioning from start to end image.
Remarks
Uses the latent space interpolation technique to guide the video generation toward the target end image. Not exact morphing, but provides directional guidance for the motion.
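A hedged usage sketch (image loading is a placeholder, as in the class example):
var svd = new StableVideoDiffusion<float>();
var startImage = LoadImage("sunrise.jpg");   // placeholder loader
var endImage = LoadImage("midday.jpg");      // placeholder loader
// The end image guides where the motion should head; it is not an exact morph.
var transition = svd.GenerateWithEndImageGuidance(
    startImage,
    endImage,
    numFrames: 25,
    numInferenceSteps: 30,
    seed: 7);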
GenerateWithFirstFrame(Tensor<T>, int, int, int, int?)
Generates video with explicit first frame control.
public Tensor<T> GenerateWithFirstFrame(Tensor<T> firstFrame, int numFrames = 25, int motionBucketId = 127, int numInferenceSteps = 25, int? seed = null)
Parameters
firstFrame (Tensor<T>): The exact first frame to use.
numFrames (int): Number of frames to generate.
motionBucketId (int): Motion intensity (1-255).
numInferenceSteps (int): Number of denoising steps.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video with specified first frame.
Remarks
This method ensures the first frame exactly matches the input, while subsequent frames are generated through diffusion. Useful when you want precise control over the starting frame.
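A brief sketch, assuming the same placeholder image loader as the class example:
var svd = new StableVideoDiffusion<float>();
var keyFrame = LoadImage("product_shot.jpg");   // placeholder loader
var clip = svd.GenerateWithFirstFrame(
    keyFrame,
    numFrames: 25,
    motionBucketId: 100,
    numInferenceSteps: 25,
    seed: 42);
// The first output frame matches keyFrame exactly; later frames are diffused.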
GetParameters()
Gets the flattened parameters of all components.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all model parameters.
GetRecommendedResolution(double)
Gets the recommended resolution for SVD generation.
public static (int width, int height) GetRecommendedResolution(double aspectRatio = 1.7777777777777777)
Parameters
aspectRatio (double): Desired aspect ratio (width/height).
Returns
- (int width, int height)
Remarks
SVD works best at specific resolutions. This method returns the closest supported resolution for the given aspect ratio.
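For example (a static call on the closed generic type; the 16:9 default corresponds to the class defaults of 1024x576):
// Ask for a portrait 9:16 clip; the method snaps to the nearest supported size.
var (width, height) = StableVideoDiffusion<float>.GetRecommendedResolution(9.0 / 16.0);
// With no argument the default 16:9 ratio is used.
var (defaultW, defaultH) = StableVideoDiffusion<float>.GetRecommendedResolution();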
PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)
Predicts noise for video frames conditioned on image and motion.
protected override Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)
Parameters
latents (Tensor<T>): Current video latents [batch, channels, frames, height, width].
timestep (int): Current diffusion timestep.
imageEmbedding (Tensor<T>): Encoded conditioning image.
motionEmbedding (Tensor<T>): Motion embedding for motion intensity control.
Returns
- Tensor<T>
Predicted noise tensor with same shape as latents.
Remarks
This method uses the VideoUNet with image conditioning to predict noise for all frames simultaneously. The image embedding provides scene context while motion embedding controls animation intensity.
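Because the member is protected, it is only reachable from a subclass. A minimal sketch (assuming float as the numeric type and that the member is not sealed) that instruments each prediction before delegating to the standard SVD behavior:
public class LoggingSvd : StableVideoDiffusion<float>
{
    protected override Tensor<float> PredictVideoNoise(
        Tensor<float> latents, int timestep,
        Tensor<float> imageEmbedding, Tensor<float> motionEmbedding)
    {
        // Log the current denoising step, then use the base prediction unchanged.
        Console.WriteLine($"Predicting noise at timestep {timestep}");
        return base.PredictVideoNoise(latents, timestep, imageEmbedding, motionEmbedding);
    }
}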
SetParameters(Vector<T>)
Sets the parameters for all components.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters (Vector<T>): The parameter vector to distribute across components.
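Together with GetParameters(), this allows copying weights between model instances, for example (a sketch):
// Copy all learned parameters from one SVD instance into another.
var source = new StableVideoDiffusion<float>();
var target = new StableVideoDiffusion<float>();
Vector<float> weights = source.GetParameters();
target.SetParameters(weights);   // target now mirrors source's parameters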