Class VideoDiffusionModelBase<T>
Base class for video diffusion models that generate temporal sequences.
public abstract class VideoDiffusionModelBase<T> : LatentDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
This abstract base class provides common functionality for all video diffusion models, including image-to-video generation, text-to-video generation, video-to-video transformation, and frame interpolation.
For Beginners: This is the foundation for video generation models like Stable Video Diffusion and AnimateDiff. It extends latent diffusion to handle the temporal dimension, generating coherent video sequences where frames are consistent over time.
Key capabilities:
- Image-to-Video: Animate a still image
- Text-to-Video: Generate video from a text description
- Video-to-Video: Transform existing video style/content
- Frame interpolation: Increase frame rate smoothly
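A minimal usage sketch of these capabilities. StableVideoDiffusionModel and LoadImage are assumed placeholders for illustration, not types or helpers defined by this library:

```csharp
// Hypothetical concrete subclass and image-loading helper.
var model = new StableVideoDiffusionModel<float>();

if (model.SupportsImageToVideo)
{
    Tensor<float> image = LoadImage("photo.png");         // hypothetical helper
    Tensor<float> video = model.GenerateFromImage(image); // animate the still image
}

if (model.SupportsTextToVideo)
{
    Tensor<float> clip = model.GenerateFromText("a dog running on a beach");
}
```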
Constructors
VideoDiffusionModelBase(DiffusionModelOptions<T>?, INoiseScheduler<T>?, int, int)
Initializes a new instance of the VideoDiffusionModelBase class.
protected VideoDiffusionModelBase(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, int defaultNumFrames = 25, int defaultFPS = 7)
Parameters
options (DiffusionModelOptions<T>): Configuration options for the diffusion model.
scheduler (INoiseScheduler<T>): Optional custom scheduler.
defaultNumFrames (int): Default number of frames to generate.
defaultFPS (int): Default frames per second.
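A sketch of how a derived model might forward its configuration to this constructor; the abstract members the base class requires (e.g. PredictVideoNoise and the Supports* properties) are omitted for brevity:

```csharp
// Hypothetical derived model forwarding defaults to the base constructor.
public class MyVideoModel<T> : VideoDiffusionModelBase<T>
{
    public MyVideoModel(DiffusionModelOptions<T>? options = null)
        : base(options, scheduler: null, defaultNumFrames: 25, defaultFPS: 7)
    {
    }

    // Required abstract members omitted for brevity.
}
```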
Properties
DefaultFPS
Gets the default frames per second for generated videos.
public virtual int DefaultFPS { get; }
Property Value
- int
Remarks
Typical values: 7 FPS for SVD, 8 FPS for AnimateDiff. For a fixed number of frames, a lower FPS yields a longer clip with slower apparent motion, at the cost of choppier playback.
DefaultNumFrames
Gets the default number of frames generated.
public virtual int DefaultNumFrames { get; }
Property Value
- int
Remarks
Typical values: 14, 16, 25 frames. Limited by GPU memory.
MotionBucketId
Gets the motion bucket ID for controlling motion intensity (SVD-specific).
public virtual int MotionBucketId { get; }
Property Value
- int
Remarks
Controls the amount of motion in the generated video. Lower values = less motion; higher values = more motion. Range: 1-255; default: 127.
NoiseAugStrength
Gets the noise augmentation strength for input images.
public virtual double NoiseAugStrength { get; protected set; }
Property Value
- double
Remarks
Adding slight noise to the conditioning image encourages the model to generate motion rather than static frames.
SupportsImageToVideo
Gets whether this model supports image-to-video generation.
public abstract bool SupportsImageToVideo { get; }
Property Value
- bool
SupportsTextToVideo
Gets whether this model supports text-to-video generation.
public abstract bool SupportsTextToVideo { get; }
Property Value
- bool
SupportsVideoToVideo
Gets whether this model supports video-to-video transformation.
public abstract bool SupportsVideoToVideo { get; }
Property Value
- bool
TemporalVAE
Gets the temporal VAE for video encoding/decoding.
public virtual IVAEModel<T>? TemporalVAE { get; }
Property Value
- IVAEModel<T>
Remarks
For Beginners: A temporal VAE processes video frames together, maintaining consistency across time. It's better than processing each frame independently because it avoids flickering.
Methods
AddNoiseToVideoLatents(Tensor<T>, int, Random)
Adds noise to video latents at a specific timestep.
protected virtual Tensor<T> AddNoiseToVideoLatents(Tensor<T> latents, int timestep, Random rng)
Parameters
latents (Tensor<T>): The original latents.
timestep (int): The timestep for noise level.
rng (Random): Random number generator.
Returns
- Tensor<T>
Noisy latents.
ApplyGuidanceVideo(Tensor<T>, Tensor<T>, double)
Applies classifier-free guidance to video noise predictions.
protected virtual Tensor<T> ApplyGuidanceVideo(Tensor<T> unconditional, Tensor<T> conditional, double scale)
Parameters
unconditional (Tensor<T>): The unconditional noise prediction.
conditional (Tensor<T>): The conditional noise prediction.
scale (double): The guidance scale.
Returns
- Tensor<T>
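In the standard classifier-free guidance formulation, the guided prediction is typically computed as guided = unconditional + scale * (conditional - unconditional); a scale of 1.0 reproduces the conditional prediction, while larger values push the output further toward the conditioning signal.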
CreateMotionEmbedding(int, int)
Creates a motion embedding from the motion bucket ID and FPS.
protected virtual Tensor<T> CreateMotionEmbedding(int motionBucketId, int fps)
Parameters
motionBucketId (int): The motion bucket ID controlling motion intensity.
fps (int): The target frames per second.
Returns
- Tensor<T>
Motion embedding tensor.
DecodeVideoLatents(Tensor<T>)
Decodes video latents to frames.
protected virtual Tensor<T> DecodeVideoLatents(Tensor<T> latents)
Parameters
latents (Tensor<T>): Video latents [batch, numFrames, latentChannels, height, width].
Returns
- Tensor<T>
Decoded video [batch, numFrames, channels, height, width].
EncodeConditioningImage(Tensor<T>, double, int?)
Encodes a conditioning image for image-to-video generation.
protected virtual Tensor<T> EncodeConditioningImage(Tensor<T> image, double noiseAugStrength, int? seed)
Parameters
image (Tensor<T>): The conditioning image.
noiseAugStrength (double): Noise augmentation strength.
seed (int?): Optional random seed.
Returns
- Tensor<T>
The encoded image embedding.
EncodeVideoToLatent(Tensor<T>)
Encodes a video to latent space.
protected virtual Tensor<T> EncodeVideoToLatent(Tensor<T> video)
Parameters
video (Tensor<T>): The video tensor [batch, numFrames, channels, height, width].
Returns
- Tensor<T>
Video latents.
ExtractFrame(Tensor<T>, int)
Extracts a frame from the video tensor.
public virtual Tensor<T> ExtractFrame(Tensor<T> video, int frameIndex)
Parameters
video (Tensor<T>): The video tensor [batch, numFrames, channels, height, width].
frameIndex (int): Index of the frame to extract.
Returns
- Tensor<T>
The frame as an image tensor [batch, channels, height, width].
ExtractFrameLatent(Tensor<T>, int)
Extracts a single frame's latent from video latents.
protected virtual Tensor<T> ExtractFrameLatent(Tensor<T> videoLatents, int frameIndex)
Parameters
videoLatents (Tensor<T>): The video latents.
frameIndex (int): Index of the frame latent to extract.
Returns
- Tensor<T>
FramesToVideo(Tensor<T>[])
Concatenates frames into a video tensor.
public virtual Tensor<T> FramesToVideo(Tensor<T>[] frames)
Parameters
frames (Tensor<T>[]): Array of frame tensors [batch, channels, height, width].
Returns
- Tensor<T>
Video tensor [batch, numFrames, channels, height, width].
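A round-trip sketch using ExtractFrame and FramesToVideo, assuming a model and video tensor from the earlier examples and that dimension 1 of the video tensor holds the frame count:

```csharp
// Split a generated video into frames, then reassemble it.
int numFrames = 25;
var frames = new Tensor<float>[numFrames];
for (int i = 0; i < numFrames; i++)
{
    frames[i] = model.ExtractFrame(video, i);        // [batch, channels, height, width]
}
Tensor<float> rebuilt = model.FramesToVideo(frames); // [batch, numFrames, channels, height, width]
```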
GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)
Generates a video from a conditioning image.
public virtual Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)
Parameters
inputImage (Tensor<T>): The conditioning image [batch, channels, height, width].
numFrames (int?): Number of frames to generate.
fps (int?): Target frames per second.
numInferenceSteps (int): Number of denoising steps.
motionBucketId (int?): Motion intensity (1-255).
noiseAugStrength (double): Noise augmentation for the input image.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video tensor [batch, numFrames, channels, height, width].
Remarks
For Beginners: This animates a still image:
- Input: A single image (photo, artwork, etc.)
- Output: A video where the scene comes to life
Tips:
- motionBucketId controls how much movement happens
- noiseAugStrength slightly varies the input to encourage motion
- Higher inference steps = smoother motion but slower
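A usage sketch under these tips; LoadImage is an assumed helper, not part of this API:

```csharp
// Animate a still image with moderate motion and a fixed seed.
Tensor<float> photo = LoadImage("portrait.png");      // hypothetical helper
Tensor<float> video = model.GenerateFromImage(
    photo,
    numFrames: 25,
    fps: 7,
    numInferenceSteps: 25,
    motionBucketId: 127,     // moderate motion
    noiseAugStrength: 0.02,  // slight input noise to encourage motion
    seed: 42);               // fixed seed for reproducibility
```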
GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)
Generates a video from a text prompt.
public virtual Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)
Parameters
prompt (string): Text description of the video to generate.
negativePrompt (string): What to avoid in the video.
width (int): Video width in pixels.
height (int): Video height in pixels.
numFrames (int?): Number of frames to generate.
fps (int?): Target frames per second.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video tensor [batch, numFrames, channels, height, width].
Remarks
For Beginners: This creates a video from a description:
- prompt: What you want ("a dog running on a beach")
- The model generates both the visual content and the motion
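A text-to-video sketch using the defaults documented above, assuming the model from the earlier examples:

```csharp
// Generate a short clip from a text prompt.
Tensor<float> video = model.GenerateFromText(
    prompt: "a dog running on a beach",
    negativePrompt: "blurry, low quality",
    width: 512,
    height: 512,
    numFrames: 16,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);
```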
InsertFrameLatent(Tensor<T>, Tensor<T>, int)
Inserts a frame latent into video latents at the specified index.
protected virtual void InsertFrameLatent(Tensor<T> videoLatents, Tensor<T> frameLatent, int frameIndex)
Parameters
videoLatents (Tensor<T>): The video latents to modify.
frameLatent (Tensor<T>): The frame latent to insert.
frameIndex (int): Index at which to insert the frame.
InterpolateFrames(Tensor<T>, int, FrameInterpolationMethod)
Interpolates between frames to increase frame rate.
public virtual Tensor<T> InterpolateFrames(Tensor<T> video, int targetFPS, FrameInterpolationMethod interpolationMethod = FrameInterpolationMethod.Diffusion)
Parameters
video (Tensor<T>): The input video [batch, numFrames, channels, height, width].
targetFPS (int): Target frame rate.
interpolationMethod (FrameInterpolationMethod): Method for frame interpolation.
Returns
- Tensor<T>
Interpolated video with more frames.
Remarks
For Beginners: This makes videos smoother by adding in-between frames:
- Input: 7 FPS video (a bit choppy)
- Output: 30 FPS video (smooth playback)
The AI figures out what the in-between frames should look like.
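A sketch of raising a choppy clip to 30 FPS with the default diffusion-based method; video is assumed from an earlier example:

```csharp
// Interpolate in-between frames to reach the target frame rate.
Tensor<float> smooth = model.InterpolateFrames(
    video,
    targetFPS: 30,
    interpolationMethod: FrameInterpolationMethod.Diffusion);
```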
InterpolateFramesBlend(Tensor<T>, int)
Interpolates frames using blend method.
protected virtual Tensor<T> InterpolateFramesBlend(Tensor<T> video, int targetFrames)
Parameters
video (Tensor<T>): The input video.
targetFrames (int): Target number of frames.
Returns
- Tensor<T>
InterpolateFramesDiffusion(Tensor<T>, int)
Interpolates frames using diffusion-based method.
protected virtual Tensor<T> InterpolateFramesDiffusion(Tensor<T> video, int targetFrames)
Parameters
video (Tensor<T>): The input video.
targetFrames (int): Target number of frames.
Returns
- Tensor<T>
InterpolateFramesLinear(Tensor<T>, int)
Interpolates frames using linear interpolation.
protected virtual Tensor<T> InterpolateFramesLinear(Tensor<T> video, int targetFrames)
Parameters
video (Tensor<T>): The input video.
targetFrames (int): Target number of frames.
Returns
- Tensor<T>
InterpolateFramesOpticalFlow(Tensor<T>, int)
Interpolates frames using optical flow (simplified).
protected virtual Tensor<T> InterpolateFramesOpticalFlow(Tensor<T> video, int targetFrames)
Parameters
video (Tensor<T>): The input video.
targetFrames (int): Target number of frames.
Returns
- Tensor<T>
LinearBlend(Tensor<T>, Tensor<T>, double)
Linearly blends two frames.
protected virtual Tensor<T> LinearBlend(Tensor<T> frame0, Tensor<T> frame1, double t)
Parameters
frame0 (Tensor<T>): The first frame.
frame1 (Tensor<T>): The second frame.
t (double): Blend factor in [0, 1].
Returns
- Tensor<T>
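By convention, a linear blend computes (1 - t) * frame0 + t * frame1 element-wise, so t = 0 returns frame0, t = 1 returns frame1, and t = 0.5 gives an even mix of the two.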
PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)
Predicts noise for video frames conditioned on image and motion.
protected abstract Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)
Parameters
latents (Tensor<T>): Current video latents.
timestep (int): Current timestep.
imageEmbedding (Tensor<T>): Conditioning image embedding.
motionEmbedding (Tensor<T>): Motion embedding.
Returns
- Tensor<T>
Predicted noise for all frames.
PredictVideoNoiseWithText(Tensor<T>, int, Tensor<T>)
Predicts noise for video frames conditioned on text.
protected virtual Tensor<T> PredictVideoNoiseWithText(Tensor<T> latents, int timestep, Tensor<T> textEmbedding)
Parameters
latents (Tensor<T>): Current video latents.
timestep (int): Current timestep.
textEmbedding (Tensor<T>): Text embedding.
Returns
- Tensor<T>
Predicted noise for all frames.
SchedulerStepVideo(Tensor<T>, Tensor<T>, int)
Performs a scheduler step for video latents.
protected virtual Tensor<T> SchedulerStepVideo(Tensor<T> latents, Tensor<T> noisePrediction, int timestep)
Parameters
latents (Tensor<T>): Current latents.
noisePrediction (Tensor<T>): Predicted noise.
timestep (int): Current timestep.
Returns
- Tensor<T>
Updated latents.
SetMotionBucketId(int)
Sets the motion intensity for generation.
public virtual void SetMotionBucketId(int bucketId)
Parameters
bucketId (int): Motion bucket ID (1-255).
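A sketch of adjusting motion intensity before generating, reusing the model and photo tensors from earlier examples:

```csharp
// Lower the motion intensity for a subtle animation, then restore the default.
model.SetMotionBucketId(40);                          // low motion
Tensor<float> calm = model.GenerateFromImage(photo);
model.SetMotionBucketId(127);                         // documented default
```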
VideoToVideo(Tensor<T>, string, string?, double, int, double, int?)
Transforms an existing video.
public virtual Tensor<T> VideoToVideo(Tensor<T> inputVideo, string prompt, string? negativePrompt = null, double strength = 0.7, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)
Parameters
inputVideo (Tensor<T>): The input video [batch, numFrames, channels, height, width].
prompt (string): Text prompt describing the transformation.
negativePrompt (string): What to avoid.
strength (double): Transformation strength (0.0-1.0).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Transformed video tensor.
Remarks
For Beginners: This changes an existing video's style or content:
- strength=0.3: Minor style changes, motion preserved
- strength=0.7: Major changes, but timing preserved
- strength=1.0: Complete regeneration guided by the original
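A sketch of a low-strength restyle that keeps the original motion; inputVideo is an assumed existing video tensor:

```csharp
// Restyle a clip while preserving its motion and timing.
Tensor<float> stylized = model.VideoToVideo(
    inputVideo,
    prompt: "watercolor painting style",
    strength: 0.3,             // minor style change, motion preserved
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);
```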