Class VideoDiffusionModelBase<T>

Namespace
AiDotNet.Diffusion
Assembly
AiDotNet.dll

Base class for video diffusion models that generate temporal sequences.

public abstract class VideoDiffusionModelBase<T> : LatentDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
LatentDiffusionModelBase<T>
VideoDiffusionModelBase<T>
Implements
ILatentDiffusionModel<T>
IVideoDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Remarks

This abstract base class provides common functionality for all video diffusion models, including image-to-video generation, text-to-video generation, video-to-video transformation, and frame interpolation.

For Beginners: This is the foundation for video generation models like Stable Video Diffusion and AnimateDiff. It extends latent diffusion to handle the temporal dimension, generating coherent video sequences where frames are consistent over time.

Key capabilities:

  • Image-to-Video: animate a still image
  • Text-to-Video: generate video from a text description
  • Video-to-Video: transform an existing video's style or content
  • Frame interpolation: increase frame rate smoothly
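For example, animating a still image takes only a few lines. The snippet below is a minimal sketch: StableVideoDiffusionModel<T> stands in for any concrete subclass, and LoadImage is a hypothetical helper that returns a [batch, channels, height, width] tensor.

// Any concrete subclass of VideoDiffusionModelBase<T> works here;
// StableVideoDiffusionModel<T> is assumed for illustration.
VideoDiffusionModelBase<float> model = new StableVideoDiffusionModel<float>();

// LoadImage is a placeholder for your own image-loading code.
Tensor<float> image = LoadImage("cat.png");

// Animate the image with the model's defaults (e.g. 25 frames at 7 FPS).
Tensor<float> video = model.GenerateFromImage(image);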

Constructors

VideoDiffusionModelBase(DiffusionModelOptions<T>?, INoiseScheduler<T>?, int, int)

Initializes a new instance of the VideoDiffusionModelBase class.

protected VideoDiffusionModelBase(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, int defaultNumFrames = 25, int defaultFPS = 7)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler.

defaultNumFrames int

Default number of frames to generate.

defaultFPS int

Default frames per second.
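Because the class is abstract, this constructor is only called from a derived type. A minimal subclass might look like the sketch below; it implements only the abstract members documented on this page and ignores any additional abstract members inherited from LatentDiffusionModelBase<T>.

public class MyVideoModel<T> : VideoDiffusionModelBase<T>
{
    public MyVideoModel(DiffusionModelOptions<T>? options = null)
        : base(options, scheduler: null, defaultNumFrames: 25, defaultFPS: 7)
    {
    }

    public override bool SupportsImageToVideo => true;
    public override bool SupportsTextToVideo => false;
    public override bool SupportsVideoToVideo => false;

    protected override Tensor<T> PredictVideoNoise(
        Tensor<T> latents, int timestep,
        Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)
    {
        // Run the denoising network here; the result must have
        // the same shape as latents.
        throw new NotImplementedException();
    }
}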

Properties

DefaultFPS

Gets the default frames per second for generated videos.

public virtual int DefaultFPS { get; }

Property Value

int

Remarks

Typical values: 7 FPS for SVD, 8 FPS for AnimateDiff. For a fixed number of frames, a lower FPS produces slower apparent motion.

DefaultNumFrames

Gets the default number of frames generated.

public virtual int DefaultNumFrames { get; }

Property Value

int

Remarks

Typical values: 14, 16, or 25 frames. The practical maximum is limited by GPU memory.

MotionBucketId

Gets the motion bucket ID for controlling motion intensity (SVD-specific).

public virtual int MotionBucketId { get; }

Property Value

int

Remarks

Controls the amount of motion in the generated video: lower values produce less motion, higher values produce more. Range: 1-255; default: 127.

NoiseAugStrength

Gets the noise augmentation strength for input images.

public virtual double NoiseAugStrength { get; protected set; }

Property Value

double

Remarks

Adding slight noise to the conditioning image encourages the model to generate motion rather than static frames.

SupportsImageToVideo

Gets whether this model supports image-to-video generation.

public abstract bool SupportsImageToVideo { get; }

Property Value

bool

SupportsTextToVideo

Gets whether this model supports text-to-video generation.

public abstract bool SupportsTextToVideo { get; }

Property Value

bool

SupportsVideoToVideo

Gets whether this model supports video-to-video transformation.

public abstract bool SupportsVideoToVideo { get; }

Property Value

bool

TemporalVAE

Gets the temporal VAE for video encoding/decoding.

public virtual IVAEModel<T>? TemporalVAE { get; }

Property Value

IVAEModel<T>

Remarks

For Beginners: A temporal VAE processes video frames together, maintaining consistency across time. It's better than processing each frame independently because it avoids flickering.

Methods

AddNoiseToVideoLatents(Tensor<T>, int, Random)

Adds noise to video latents at a specific timestep.

protected virtual Tensor<T> AddNoiseToVideoLatents(Tensor<T> latents, int timestep, Random rng)

Parameters

latents Tensor<T>

The original latents.

timestep int

The timestep for noise level.

rng Random

Random number generator.

Returns

Tensor<T>

Noisy latents.

ApplyGuidanceVideo(Tensor<T>, Tensor<T>, double)

Applies classifier-free guidance to video noise predictions.

protected virtual Tensor<T> ApplyGuidanceVideo(Tensor<T> unconditional, Tensor<T> conditional, double scale)

Parameters

unconditional Tensor<T>

The noise prediction without conditioning.

conditional Tensor<T>

The noise prediction with conditioning.

scale double

The guidance scale; higher values follow the conditioning more closely.

Returns

Tensor<T>

CreateMotionEmbedding(int, int)

Creates a motion embedding from the motion bucket ID and FPS.

protected virtual Tensor<T> CreateMotionEmbedding(int motionBucketId, int fps)

Parameters

motionBucketId int

The motion intensity.

fps int

Frames per second.

Returns

Tensor<T>

Motion embedding tensor.

DecodeVideoLatents(Tensor<T>)

Decodes video latents to frames.

protected virtual Tensor<T> DecodeVideoLatents(Tensor<T> latents)

Parameters

latents Tensor<T>

Video latents [batch, numFrames, latentChannels, height, width].

Returns

Tensor<T>

Decoded video [batch, numFrames, channels, height, width].

EncodeConditioningImage(Tensor<T>, double, int?)

Encodes a conditioning image for image-to-video generation.

protected virtual Tensor<T> EncodeConditioningImage(Tensor<T> image, double noiseAugStrength, int? seed)

Parameters

image Tensor<T>

The conditioning image.

noiseAugStrength double

Noise augmentation strength.

seed int?

Optional random seed.

Returns

Tensor<T>

The encoded image embedding.

EncodeVideoToLatent(Tensor<T>)

Encodes a video to latent space.

protected virtual Tensor<T> EncodeVideoToLatent(Tensor<T> video)

Parameters

video Tensor<T>

The video tensor [batch, numFrames, channels, height, width].

Returns

Tensor<T>

Video latents.

ExtractFrame(Tensor<T>, int)

Extracts a frame from the video tensor.

public virtual Tensor<T> ExtractFrame(Tensor<T> video, int frameIndex)

Parameters

video Tensor<T>

The video tensor [batch, numFrames, channels, height, width].

frameIndex int

Index of the frame to extract.

Returns

Tensor<T>

The frame as an image tensor [batch, channels, height, width].

ExtractFrameLatent(Tensor<T>, int)

Extracts a single frame's latent from video latents.

protected virtual Tensor<T> ExtractFrameLatent(Tensor<T> videoLatents, int frameIndex)

Parameters

videoLatents Tensor<T>
frameIndex int

Returns

Tensor<T>

FramesToVideo(Tensor<T>[])

Concatenates frames into a video tensor.

public virtual Tensor<T> FramesToVideo(Tensor<T>[] frames)

Parameters

frames Tensor<T>[]

Array of frame tensors [batch, channels, height, width].

Returns

Tensor<T>

Video tensor [batch, numFrames, channels, height, width].
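ExtractFrame and FramesToVideo are inverses, so together they support per-frame processing. The sketch below reverses a clip; it assumes a model and video tensor from one of the generation methods, and that Tensor<T> exposes its dimensions through a Shape property.

// Split a video into frames, reverse their order, and reassemble.
int numFrames = video.Shape[1]; // [batch, numFrames, channels, height, width]
var frames = new Tensor<float>[numFrames];
for (int i = 0; i < numFrames; i++)
{
    frames[i] = model.ExtractFrame(video, numFrames - 1 - i);
}
Tensor<float> reversed = model.FramesToVideo(frames);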

GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)

Generates a video from a conditioning image.

public virtual Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)

Parameters

inputImage Tensor<T>

The conditioning image [batch, channels, height, width].

numFrames int?

Number of frames to generate.

fps int?

Target frames per second.

numInferenceSteps int

Number of denoising steps.

motionBucketId int?

Motion intensity (1-255).

noiseAugStrength double

Noise augmentation for input image.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video tensor [batch, numFrames, channels, height, width].

Remarks

For Beginners: This animates a still image:

  • Input: a single image (photo, artwork, etc.)
  • Output: a video where the scene comes to life

Tips:

  • motionBucketId controls how much movement happens
  • noiseAugStrength slightly varies the input to encourage motion
  • Higher inference steps = smoother motion but slower
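The calls below illustrate both ends of the motion range; this is a sketch assuming a concrete model instance and an already-loaded image tensor.

// Subtle motion: low motion bucket, light noise augmentation.
Tensor<float> calm = model.GenerateFromImage(
    image, numFrames: 14, fps: 7,
    numInferenceSteps: 25, motionBucketId: 40,
    noiseAugStrength: 0.02, seed: 42);

// Dramatic motion: high motion bucket, stronger noise augmentation.
Tensor<float> lively = model.GenerateFromImage(
    image, numFrames: 25, fps: 7,
    numInferenceSteps: 50, motionBucketId: 200,
    noiseAugStrength: 0.1, seed: 42);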

GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)

Generates a video from a text prompt.

public virtual Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)

Parameters

prompt string

Text description of the video to generate.

negativePrompt string

What to avoid in the video.

width int

Video width in pixels.

height int

Video height in pixels.

numFrames int?

Number of frames to generate.

fps int?

Target frames per second.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video tensor [batch, numFrames, channels, height, width].

Remarks

For Beginners: This creates a video from a description:

  • prompt: what you want ("a dog running on a beach")
  • The model generates both the visual content and the motion
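A typical call looks like the sketch below; it assumes a concrete model whose SupportsTextToVideo is true.

Tensor<float> video = model.GenerateFromText(
    prompt: "a dog running on a beach",
    negativePrompt: "blurry, low quality",
    width: 512, height: 512,
    numFrames: 16, fps: 8,
    numInferenceSteps: 50, guidanceScale: 7.5,
    seed: 123);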

InsertFrameLatent(Tensor<T>, Tensor<T>, int)

Inserts a frame latent into video latents at the specified index.

protected virtual void InsertFrameLatent(Tensor<T> videoLatents, Tensor<T> frameLatent, int frameIndex)

Parameters

videoLatents Tensor<T>
frameLatent Tensor<T>
frameIndex int

InterpolateFrames(Tensor<T>, int, FrameInterpolationMethod)

Interpolates between frames to increase frame rate.

public virtual Tensor<T> InterpolateFrames(Tensor<T> video, int targetFPS, FrameInterpolationMethod interpolationMethod = FrameInterpolationMethod.Diffusion)

Parameters

video Tensor<T>

The input video [batch, numFrames, channels, height, width].

targetFPS int

Target frame rate.

interpolationMethod FrameInterpolationMethod

Method for frame interpolation.

Returns

Tensor<T>

Interpolated video with more frames.

Remarks

For Beginners: This makes videos smoother by adding in-between frames:

  • Input: 7 FPS video (a bit choppy)
  • Output: 30 FPS video (smooth playback)

The AI figures out what the in-between frames should look like.
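For example, generating at the model's native frame rate and then smoothing (a sketch assuming a model and image tensor as above):

// Generate at the native frame rate, then interpolate up to 30 FPS.
Tensor<float> raw = model.GenerateFromImage(image); // e.g. 7 FPS
Tensor<float> smooth = model.InterpolateFrames(raw, 30,
    FrameInterpolationMethod.Diffusion);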

InterpolateFramesBlend(Tensor<T>, int)

Interpolates frames using blend method.

protected virtual Tensor<T> InterpolateFramesBlend(Tensor<T> video, int targetFrames)

Parameters

video Tensor<T>
targetFrames int

Returns

Tensor<T>

InterpolateFramesDiffusion(Tensor<T>, int)

Interpolates frames using diffusion-based method.

protected virtual Tensor<T> InterpolateFramesDiffusion(Tensor<T> video, int targetFrames)

Parameters

video Tensor<T>
targetFrames int

Returns

Tensor<T>

InterpolateFramesLinear(Tensor<T>, int)

Interpolates frames using linear interpolation.

protected virtual Tensor<T> InterpolateFramesLinear(Tensor<T> video, int targetFrames)

Parameters

video Tensor<T>
targetFrames int

Returns

Tensor<T>

InterpolateFramesOpticalFlow(Tensor<T>, int)

Interpolates frames using optical flow (simplified).

protected virtual Tensor<T> InterpolateFramesOpticalFlow(Tensor<T> video, int targetFrames)

Parameters

video Tensor<T>
targetFrames int

Returns

Tensor<T>

LinearBlend(Tensor<T>, Tensor<T>, double)

Linearly blends two frames.

protected virtual Tensor<T> LinearBlend(Tensor<T> frame0, Tensor<T> frame1, double t)

Parameters

frame0 Tensor<T>

The first frame.

frame1 Tensor<T>

The second frame.

t double

Blend factor in [0, 1]: 0 returns frame0, 1 returns frame1.

Returns

Tensor<T>

PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)

Predicts noise for video frames conditioned on image and motion.

protected abstract Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)

Parameters

latents Tensor<T>

Current video latents.

timestep int

Current timestep.

imageEmbedding Tensor<T>

Conditioning image embedding.

motionEmbedding Tensor<T>

Motion embedding.

Returns

Tensor<T>

Predicted noise for all frames.

PredictVideoNoiseWithText(Tensor<T>, int, Tensor<T>)

Predicts noise for video frames conditioned on text.

protected virtual Tensor<T> PredictVideoNoiseWithText(Tensor<T> latents, int timestep, Tensor<T> textEmbedding)

Parameters

latents Tensor<T>

Current video latents.

timestep int

Current timestep.

textEmbedding Tensor<T>

Text embedding.

Returns

Tensor<T>

Predicted noise for all frames.

SchedulerStepVideo(Tensor<T>, Tensor<T>, int)

Performs a scheduler step for video latents.

protected virtual Tensor<T> SchedulerStepVideo(Tensor<T> latents, Tensor<T> noisePrediction, int timestep)

Parameters

latents Tensor<T>

Current latents.

noisePrediction Tensor<T>

Predicted noise.

timestep int

Current timestep.

Returns

Tensor<T>

Updated latents.

SetMotionBucketId(int)

Sets the motion intensity for generation.

public virtual void SetMotionBucketId(int bucketId)

Parameters

bucketId int

Motion bucket ID (1-255).
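For example, lowering the default motion intensity before generating (a sketch assuming a model and image tensor as above):

// Reduce motion for subsequent generations.
model.SetMotionBucketId(60);
Tensor<float> calm = model.GenerateFromImage(image);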

VideoToVideo(Tensor<T>, string, string?, double, int, double, int?)

Transforms an existing video.

public virtual Tensor<T> VideoToVideo(Tensor<T> inputVideo, string prompt, string? negativePrompt = null, double strength = 0.7, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)

Parameters

inputVideo Tensor<T>

The input video [batch, numFrames, channels, height, width].

prompt string

Text prompt describing the transformation.

negativePrompt string

What to avoid.

strength double

Transformation strength (0.0-1.0).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed video tensor.

Remarks

For Beginners: This changes an existing video's style or content:

  • strength = 0.3: minor style changes, motion preserved
  • strength = 0.7: major changes, but timing preserved
  • strength = 1.0: complete regeneration guided by the original
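For example, restyling a clip while keeping most of its motion (a sketch assuming inputVideo came from one of the generation methods or your own encoder):

// Restyle an existing clip while preserving most of its motion.
Tensor<float> stylized = model.VideoToVideo(
    inputVideo,
    prompt: "watercolor painting style",
    negativePrompt: "photorealistic",
    strength: 0.5,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 7);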