Class VideoDiffusionModelBase<T>
Base class for video diffusion models that generate temporal sequences.
public abstract class VideoDiffusionModelBase<T> : LatentDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
This abstract base class provides common functionality for all video diffusion models, including image-to-video generation, text-to-video generation, video-to-video transformation, and frame interpolation.
For Beginners: This is the foundation for video generation models like Stable Video Diffusion and AnimateDiff. It extends latent diffusion to handle the temporal dimension, generating coherent video sequences where frames are consistent over time.
Key capabilities:
- Image-to-Video: Animate a still image
- Text-to-Video: Generate video from a text description
- Video-to-Video: Transform existing video style/content
- Frame interpolation: Increase frame rate smoothly
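A minimal usage sketch of these capabilities. StableVideoDiffusionModel and LoadImage are assumed placeholders for illustration, not types or helpers defined by this library:

```csharp
// Hypothetical concrete subclass and image-loading helper.
var model = new StableVideoDiffusionModel<float>();

if (model.SupportsImageToVideo)
{
    Tensor<float> image = LoadImage("photo.png");         // hypothetical helper
    Tensor<float> video = model.GenerateFromImage(image); // animate the still image
}

if (model.SupportsTextToVideo)
{
    Tensor<float> clip = model.GenerateFromText("a dog running on a beach");
}
```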
Constructors
VideoDiffusionModelBase(DiffusionModelOptions<T>?, INoiseScheduler<T>?, int, int)
Initializes a new instance of the VideoDiffusionModelBase class.
protected VideoDiffusionModelBase(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, int defaultNumFrames = 25, int defaultFPS = 7)
Parameters
options (DiffusionModelOptions<T>): Configuration options for the diffusion model.
scheduler (INoiseScheduler<T>): Optional custom scheduler.
defaultNumFrames (int): Default number of frames to generate.
defaultFPS (int): Default frames per second.
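A sketch of how a derived model might forward its configuration to this constructor; the abstract members the base class requires (e.g. PredictVideoNoise and the Supports* properties) are omitted for brevity:

```csharp
// Hypothetical derived model forwarding defaults to the base constructor.
public class MyVideoModel<T> : VideoDiffusionModelBase<T>
{
    public MyVideoModel(DiffusionModelOptions<T>? options = null)
        : base(options, scheduler: null, defaultNumFrames: 25, defaultFPS: 7)
    {
    }

    // Required abstract members omitted for brevity.
}
```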
Properties
DefaultFPS
Gets the default frames per second for generated videos.
public virtual int DefaultFPS { get; }
Property Value
- int
Remarks
Typical values: 7 FPS for SVD, 8 FPS for AnimateDiff. For a fixed number of frames, a lower FPS yields a longer clip with slower apparent motion, at the cost of choppier playback.
DefaultNumFrames
Gets the default number of frames generated.
public virtual int DefaultNumFrames { get; }
Property Value
- int
Remarks
Typical values: 14, 16, 25 frames. Limited by GPU memory.
MotionBucketId
Gets the motion bucket ID for controlling motion intensity (SVD-specific).
public virtual int MotionBucketId { get; }
Property Value
- int
Remarks
Controls the amount of motion in the generated video. Lower values = less motion; higher values = more motion. Range: 1-255; default: 127.
NoiseAugStrength
Gets the noise augmentation strength for input images.
public virtual double NoiseAugStrength { get; protected set; }
Property Value
- double
Remarks
Adding slight noise to the conditioning image encourages the model to generate motion rather than static frames.
SupportsImageToVideo
Gets whether this model supports image-to-video generation.
public abstract bool SupportsImageToVideo { get; }
Property Value
- bool
SupportsTextToVideo
Gets whether this model supports text-to-video generation.
public abstract bool SupportsTextToVideo { get; }
Property Value
- bool
SupportsVideoToVideo
Gets whether this model supports video-to-video transformation.
public abstract bool SupportsVideoToVideo { get; }
Property Value
- bool
TemporalVAE
Gets the temporal VAE for video encoding/decoding.
public virtual IVAEModel<T>? TemporalVAE { get; }
Property Value
- IVAEModel<T>
Remarks
For Beginners: A temporal VAE processes video frames together, maintaining consistency across time. It's better than processing each frame independently because it avoids flickering.
Methods
AddNoiseToVideoLatents(Tensor<T>, int, Random)
Adds noise to video latents at a specific timestep.
protected virtual Tensor<T> AddNoiseToVideoLatents(Tensor<T> latents, int timestep, Random rng)
Parameters
latents (Tensor<T>): The original latents.
timestep (int): The timestep for noise level.
rng (Random): Random number generator.
Returns
- Tensor<T>
Noisy latents.
ApplyGuidanceVideo(Tensor<T>, Tensor<T>, double)
Applies classifier-free guidance to video noise predictions.
protected virtual Tensor<T> ApplyGuidanceVideo(Tensor<T> unconditional, Tensor<T> conditional, double scale)
Parameters
unconditional (Tensor<T>): The unconditional noise prediction.
conditional (Tensor<T>): The conditional noise prediction.
scale (double): The guidance scale.
Returns
- Tensor<T>
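In the standard classifier-free guidance formulation, the guided prediction is typically computed as guided = unconditional + scale * (conditional - unconditional); a scale of 1.0 reproduces the conditional prediction, while larger values push the output further toward the conditioning signal.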
CreateMotionEmbedding(int, int)
Creates a motion embedding from the motion bucket ID and FPS.
protected virtual Tensor<T> CreateMotionEmbedding(int motionBucketId, int fps)
Parameters
motionBucketId (int): The motion bucket ID controlling motion intensity.
fps (int): The target frames per second.
Returns
- Tensor<T>
Motion embedding tensor.
DecodeVideoLatents(Tensor<T>)
Decodes video latents to frames.
protected virtual Tensor<T> DecodeVideoLatents(Tensor<T> latents)
Parameters
latents (Tensor<T>): Video latents [batch, numFrames, latentChannels, height, width].
Returns
- Tensor<T>
Decoded video [batch, numFrames, channels, height, width].
EncodeConditioningImage(Tensor<T>, double, int?)
Encodes a conditioning image for image-to-video generation.
protected virtual Tensor<T> EncodeConditioningImage(Tensor<T> image, double noiseAugStrength, int? seed)
Parameters
image (Tensor<T>): The conditioning image.
noiseAugStrength (double): Noise augmentation strength.
seed (int?): Optional random seed.
Returns
- Tensor<T>
The encoded image embedding.
EncodeVideoToLatent(Tensor<T>)
Encodes a video to latent space.
protected virtual Tensor<T> EncodeVideoToLatent(Tensor<T> video)
Parameters
video (Tensor<T>): The video tensor [batch, numFrames, channels, height, width].
Returns
- Tensor<T>
Video latents.
ExtractFrame(Tensor<T>, int)
Extracts a frame from the video tensor.
public virtual Tensor<T> ExtractFrame(Tensor<T> video, int frameIndex)
Parameters
video (Tensor<T>): The video tensor [batch, numFrames, channels, height, width].
frameIndex (int): Index of the frame to extract.
Returns
- Tensor<T>
The frame as an image tensor [batch, channels, height, width].
ExtractFrameLatent(Tensor<T>, int)
Extracts a single frame's latent from video latents.
protected virtual Tensor<T> ExtractFrameLatent(Tensor<T> videoLatents, int frameIndex)
Parameters
videoLatents (Tensor<T>): The video latents.
frameIndex (int): Index of the frame latent to extract.
Returns
- Tensor<T>
FramesToVideo(Tensor<T>[])
Concatenates frames into a video tensor.
public virtual Tensor<T> FramesToVideo(Tensor<T>[] frames)
Parameters
frames (Tensor<T>[]): Array of frame tensors [batch, channels, height, width].
Returns
- Tensor<T>
Video tensor [batch, numFrames, channels, height, width].
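A round-trip sketch using ExtractFrame and FramesToVideo, assuming a model and video tensor from the earlier examples and that dimension 1 of the video tensor holds the frame count:

```csharp
// Split a generated video into frames, then reassemble it.
int numFrames = 25;
var frames = new Tensor<float>[numFrames];
for (int i = 0; i < numFrames; i++)
{
    frames[i] = model.ExtractFrame(video, i);        // [batch, channels, height, width]
}
Tensor<float> rebuilt = model.FramesToVideo(frames); // [batch, numFrames, channels, height, width]
```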
GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)
Generates a video from a conditioning image.
public virtual Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)
Parameters
inputImage (Tensor<T>): The conditioning image [batch, channels, height, width].
numFrames (int?): Number of frames to generate.
fps (int?): Target frames per second.
numInferenceSteps (int): Number of denoising steps.
motionBucketId (int?): Motion intensity (1-255).
noiseAugStrength (double): Noise augmentation for the input image.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video tensor [batch, numFrames, channels, height, width].
Remarks
For Beginners: This animates a still image:
- Input: A single image (photo, artwork, etc.)
- Output: A video where the scene comes to life
Tips:
- motionBucketId controls how much movement happens
- noiseAugStrength slightly varies the input to encourage motion
- Higher inference steps = smoother motion but slower
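A usage sketch under these tips; LoadImage is an assumed helper, not part of this API:

```csharp
// Animate a still image with moderate motion and a fixed seed.
Tensor<float> photo = LoadImage("portrait.png");      // hypothetical helper
Tensor<float> video = model.GenerateFromImage(
    photo,
    numFrames: 25,
    fps: 7,
    numInferenceSteps: 25,
    motionBucketId: 127,     // moderate motion
    noiseAugStrength: 0.02,  // slight input noise to encourage motion
    seed: 42);               // fixed seed for reproducibility
```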
GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)
Generates a video from a text prompt.
public virtual Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)
Parameters
prompt (string): Text description of the video to generate.
negativePrompt (string): What to avoid in the video.
width (int): Video width in pixels.
height (int): Video height in pixels.
numFrames (int?): Number of frames to generate.
fps (int?): Target frames per second.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video tensor [batch, numFrames, channels, height, width].
Remarks
For Beginners: This creates a video from a description:
- prompt: What you want ("a dog running on a beach")
- The model generates both the visual content and the motion
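A text-to-video sketch using the defaults documented above, assuming the model from the earlier examples:

```csharp
// Generate a short clip from a text prompt.
Tensor<float> video = model.GenerateFromText(
    prompt: "a dog running on a beach",
    negativePrompt: "blurry, low quality",
    width: 512,
    height: 512,
    numFrames: 16,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);
```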
InsertFrameLatent(Tensor<T>, Tensor<T>, int)
Inserts a frame latent into video latents at the specified index.
protected virtual void InsertFrameLatent(Tensor<T> videoLatents, Tensor<T> frameLatent, int frameIndex)
Parameters
videoLatents (Tensor<T>): The video latents to modify.
frameLatent (Tensor<T>): The frame latent to insert.
frameIndex (int): Index at which to insert the frame.
InterpolateFrames(Tensor<T>, int, FrameInterpolationMethod)
Interpolates between frames to increase frame rate.
public virtual Tensor<T> InterpolateFrames(Tensor<T> video, int targetFPS, FrameInterpolationMethod interpolationMethod = FrameInterpolationMethod.Diffusion)
Parameters
video (Tensor<T>): The input video [batch, numFrames, channels, height, width].
targetFPS (int): Target frame rate.
interpolationMethod (FrameInterpolationMethod): Method for frame interpolation.
Returns
- Tensor<T>
Interpolated video with more frames.
Remarks
For Beginners: This makes videos smoother by adding in-between frames:
- Input: 7 FPS video (a bit choppy)
- Output: 30 FPS video (smooth playback)
The AI figures out what the in-between frames should look like.
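A sketch of raising a choppy clip to 30 FPS with the default diffusion-based method; video is assumed from an earlier example:

```csharp
// Interpolate in-between frames to reach the target frame rate.
Tensor<float> smooth = model.InterpolateFrames(
    video,
    targetFPS: 30,
    interpolationMethod: FrameInterpolationMethod.Diffusion);
```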
InterpolateFramesBlend(Tensor<T>, int)
Interpolates frames using blend method.
protected virtual Tensor<T> InterpolateFramesBlend(Tensor<T> video, int targetFrames)
Parameters
video (Tensor<T>): The input video.
targetFrames (int): Target number of frames.
Returns
- Tensor<T>
InterpolateFramesDiffusion(Tensor<T>, int)
Interpolates frames using diffusion-based method.
protected virtual Tensor<T> InterpolateFramesDiffusion(Tensor<T> video, int targetFrames)
Parameters
video (Tensor<T>): The input video.
targetFrames (int): Target number of frames.
Returns
- Tensor<T>
InterpolateFramesLinear(Tensor<T>, int)
Interpolates frames using linear interpolation.
protected virtual Tensor<T> InterpolateFramesLinear(Tensor<T> video, int targetFrames)
Parameters
video (Tensor<T>): The input video.
targetFrames (int): Target number of frames.
Returns
- Tensor<T>
InterpolateFramesOpticalFlow(Tensor<T>, int)
Interpolates frames using optical flow (simplified).
protected virtual Tensor<T> InterpolateFramesOpticalFlow(Tensor<T> video, int targetFrames)
Parameters
video (Tensor<T>): The input video.
targetFrames (int): Target number of frames.
Returns
- Tensor<T>
LinearBlend(Tensor<T>, Tensor<T>, double)
Linearly blends two frames.
protected virtual Tensor<T> LinearBlend(Tensor<T> frame0, Tensor<T> frame1, double t)
Parameters
frame0 (Tensor<T>): The first frame.
frame1 (Tensor<T>): The second frame.
t (double): Blend factor in [0, 1].
Returns
- Tensor<T>
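By convention, a linear blend computes (1 - t) * frame0 + t * frame1 element-wise, so t = 0 returns frame0, t = 1 returns frame1, and t = 0.5 gives an even mix of the two.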
PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)
Predicts noise for video frames conditioned on image and motion.
protected abstract Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)
Parameters
latents (Tensor<T>): Current video latents.
timestep (int): Current timestep.
imageEmbedding (Tensor<T>): Conditioning image embedding.
motionEmbedding (Tensor<T>): Motion embedding.
Returns
- Tensor<T>
Predicted noise for all frames.
PredictVideoNoiseWithText(Tensor<T>, int, Tensor<T>)
Predicts noise for video frames conditioned on text.
protected virtual Tensor<T> PredictVideoNoiseWithText(Tensor<T> latents, int timestep, Tensor<T> textEmbedding)
Parameters
latents (Tensor<T>): Current video latents.
timestep (int): Current timestep.
textEmbedding (Tensor<T>): Text embedding.
Returns
- Tensor<T>
Predicted noise for all frames.
SchedulerStepVideo(Tensor<T>, Tensor<T>, int)
Performs a scheduler step for video latents.
protected virtual Tensor<T> SchedulerStepVideo(Tensor<T> latents, Tensor<T> noisePrediction, int timestep)
Parameters
latents (Tensor<T>): Current latents.
noisePrediction (Tensor<T>): Predicted noise.
timestep (int): Current timestep.
Returns
- Tensor<T>
Updated latents.
SetMotionBucketId(int)
Sets the motion intensity for generation.
public virtual void SetMotionBucketId(int bucketId)
Parameters
bucketId (int): Motion bucket ID (1-255).
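A sketch of adjusting motion intensity before generating, reusing the model and photo tensors from earlier examples:

```csharp
// Lower the motion intensity for a subtle animation, then restore the default.
model.SetMotionBucketId(40);                          // low motion
Tensor<float> calm = model.GenerateFromImage(photo);
model.SetMotionBucketId(127);                         // documented default
```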
VideoToVideo(Tensor<T>, string, string?, double, int, double, int?)
Transforms an existing video.
public virtual Tensor<T> VideoToVideo(Tensor<T> inputVideo, string prompt, string? negativePrompt = null, double strength = 0.7, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)
Parameters
inputVideo (Tensor<T>): The input video [batch, numFrames, channels, height, width].
prompt (string): Text prompt describing the transformation.
negativePrompt (string): What to avoid.
strength (double): Transformation strength (0.0-1.0).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Transformed video tensor.
Remarks
For Beginners: This changes an existing video's style or content:
- strength=0.3: Minor style changes, motion preserved
- strength=0.7: Major changes, but timing preserved
- strength=1.0: Complete regeneration guided by the original
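A sketch of a low-strength restyle that keeps the original motion; inputVideo is an assumed existing video tensor:

```csharp
// Restyle a clip while preserving its motion and timing.
Tensor<float> stylized = model.VideoToVideo(
    inputVideo,
    prompt: "watercolor painting style",
    strength: 0.3,             // minor style change, motion preserved
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);
```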