Interface IVideoDiffusionModel<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for video diffusion models that generate temporal sequences.

public interface IVideoDiffusionModel<T> : IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Video diffusion models extend image diffusion to handle the temporal dimension, generating coherent video sequences. They model both spatial (within-frame) and temporal (across-frame) dependencies.

For Beginners: Video diffusion is like image diffusion, but it creates videos instead of single images. The main challenge is making the frames look consistent over time (no flickering or teleporting objects).

How video diffusion works:

  1. The model generates multiple frames at once (typically 14-25 frames)
  2. Special "temporal attention" ensures frames are consistent
  3. The model can be conditioned on a starting image, text, or both

Common approaches:

  • Image-to-Video (SVD): Start from an image, generate motion
  • Text-to-Video (VideoCrafter): Generate video from text description
  • Video-to-Video: Transform existing video with new style/content

Key challenges solved by these models:

  • Temporal consistency (no flickering)
  • Motion coherence (objects move naturally)
  • Long-range dependencies (beginning and end are related)

This interface extends IDiffusionModel<T> with video-specific operations.
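
For example, a minimal dispatch sketch, assuming a concrete implementation of this interface is already available (the Generate helper and the float instantiation are illustrative, not part of this API; Tensor<T> is assumed to come from the AiDotNet assembly):

using System;
using AiDotNet.Interfaces;

// Illustrative helper (not part of this API): pick a generation mode
// based on the capability flags, failing clearly when neither applies.
static Tensor<float> Generate(IVideoDiffusionModel<float> model, Tensor<float>? image, string? prompt)
{
    if (image is not null && model.SupportsImageToVideo)
        return model.GenerateFromImage(image, numFrames: model.DefaultNumFrames, fps: model.DefaultFPS);

    if (prompt is not null && model.SupportsTextToVideo)
        return model.GenerateFromText(prompt);

    throw new NotSupportedException("This model supports neither image-to-video nor text-to-video.");
}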

Properties

DefaultFPS

Gets the default frames per second for generated videos.

int DefaultFPS { get; }

Property Value

int

Remarks

Typical values: 7 FPS for SVD, 8 FPS for AnimateDiff. With a fixed number of frames, lower FPS stretches playback over a longer duration, so motion appears slower but choppier; higher FPS gives smoother playback.

DefaultNumFrames

Gets the default number of frames generated.

int DefaultNumFrames { get; }

Property Value

int

Remarks

Typical values: 14, 16, 25 frames. Limited by GPU memory.

MotionBucketId

Gets the motion bucket ID for controlling motion intensity (SVD-specific).

int MotionBucketId { get; }

Property Value

int

Remarks

Controls the amount of motion in the generated video: lower values produce subtler motion, higher values produce stronger motion. Range: 1-255, default: 127.

SupportsImageToVideo

Gets whether this model supports image-to-video generation.

bool SupportsImageToVideo { get; }

Property Value

bool

SupportsTextToVideo

Gets whether this model supports text-to-video generation.

bool SupportsTextToVideo { get; }

Property Value

bool

SupportsVideoToVideo

Gets whether this model supports video-to-video transformation.

bool SupportsVideoToVideo { get; }

Property Value

bool

Methods

ExtractFrame(Tensor<T>, int)

Extracts a frame from the video tensor.

Tensor<T> ExtractFrame(Tensor<T> video, int frameIndex)

Parameters

video Tensor<T>

The video tensor [batch, numFrames, channels, height, width].

frameIndex int

Index of the frame to extract.

Returns

Tensor<T>

The frame as an image tensor [batch, channels, height, width].
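
A short usage sketch (model and video are assumed to exist from earlier context; the frame index is an arbitrary example):

// Grab the first frame of a generated clip, e.g. as a preview image.
// video shape: [batch, numFrames, channels, height, width]
Tensor<float> firstFrame = model.ExtractFrame(video, 0);
// firstFrame shape: [batch, channels, height, width]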

FramesToVideo(Tensor<T>[])

Concatenates frames into a video tensor.

Tensor<T> FramesToVideo(Tensor<T>[] frames)

Parameters

frames Tensor<T>[]

Array of frame tensors [batch, channels, height, width].

Returns

Tensor<T>

Video tensor [batch, numFrames, channels, height, width].
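
A sketch that combines ExtractFrame and FramesToVideo to reverse a clip (the frame count of 14 is an assumed example value; it must match the clip's actual frame count):

int numFrames = 14; // assumed to match the clip's actual frame count
var frames = new Tensor<float>[numFrames];
for (int i = 0; i < numFrames; i++)
{
    // Read frames back-to-front so the reassembled clip plays in reverse.
    frames[i] = model.ExtractFrame(video, numFrames - 1 - i);
}
Tensor<float> reversedVideo = model.FramesToVideo(frames);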

GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)

Generates a video from a conditioning image.

Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)

Parameters

inputImage Tensor<T>

The conditioning image [batch, channels, height, width].

numFrames int?

Number of frames to generate.

fps int?

Target frames per second.

numInferenceSteps int

Number of denoising steps.

motionBucketId int?

Motion intensity (1-255).

noiseAugStrength double

Noise augmentation for input image.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video tensor [batch, numFrames, channels, height, width].

Remarks

For Beginners: This animates a still image:

  • Input: A single image (photo, artwork, etc.)
  • Output: A video where the scene comes to life

Tips:

  • motionBucketId controls how much movement happens
  • noiseAugStrength slightly varies the input to encourage motion
  • Higher inference steps = smoother motion but slower
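
A usage sketch (inputImage is assumed to be a preloaded [batch, channels, height, width] tensor; loading it is outside this interface):

// Animate a still image; unspecified arguments fall back to the model defaults.
Tensor<float> video = model.GenerateFromImage(
    inputImage,
    numInferenceSteps: 25,
    motionBucketId: 127,     // mid-range motion (valid range 1-255)
    noiseAugStrength: 0.02,  // small input perturbation to encourage motion
    seed: 42);               // fix the seed for reproducible output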

GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)

Generates a video from a text prompt.

Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)

Parameters

prompt string

Text description of the video to generate.

negativePrompt string?

What to avoid in the video.

width int

Video width in pixels.

height int

Video height in pixels.

numFrames int?

Number of frames to generate.

fps int?

Target frames per second.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video tensor [batch, numFrames, channels, height, width].

Remarks

For Beginners: This creates a video from a description:

  • prompt: What you want ("a dog running on a beach")
  • The model generates both the visual content and the motion
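
A usage sketch with explicit arguments (the values shown are the documented defaults; the prompts and seed are arbitrary examples):

Tensor<float> video = model.GenerateFromText(
    prompt: "a dog running on a beach",
    negativePrompt: "blurry, low quality",
    width: 512,
    height: 512,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);
// video shape: [batch, numFrames, channels, height, width]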

InterpolateFrames(Tensor<T>, int, FrameInterpolationMethod)

Interpolates between frames to increase frame rate.

Tensor<T> InterpolateFrames(Tensor<T> video, int targetFPS, FrameInterpolationMethod interpolationMethod = FrameInterpolationMethod.Diffusion)

Parameters

video Tensor<T>

The input video [batch, numFrames, channels, height, width].

targetFPS int

Target frame rate.

interpolationMethod FrameInterpolationMethod

Method for frame interpolation.

Returns

Tensor<T>

Interpolated video with more frames.

Remarks

For Beginners: This makes videos smoother by adding in-between frames:

  • Input: 7 FPS video (a bit choppy)
  • Output: 30 FPS video (smooth playback)

The AI figures out what the in-between frames should look like.
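
A usage sketch (video is assumed to be a low-FPS clip from earlier context; the interpolation method argument is left at its Diffusion default):

// Upsample a choppy clip to 30 FPS; the model synthesizes the in-between frames.
Tensor<float> smooth = model.InterpolateFrames(video, targetFPS: 30);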

SetMotionBucketId(int)

Sets the motion intensity for generation.

void SetMotionBucketId(int bucketId)

Parameters

bucketId int

Motion bucket ID (1-255).
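
A usage sketch (the clamp mirrors the 1-255 range documented for MotionBucketId; the requested value is an arbitrary example):

int requested = 200; // stronger motion than the default of 127
model.SetMotionBucketId(Math.Clamp(requested, 1, 255));
// Subsequent generations use the new motion intensity.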

VideoToVideo(Tensor<T>, string, string?, double, int, double, int?)

Transforms an existing video.

Tensor<T> VideoToVideo(Tensor<T> inputVideo, string prompt, string? negativePrompt = null, double strength = 0.7, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)

Parameters

inputVideo Tensor<T>

The input video [batch, numFrames, channels, height, width].

prompt string

Text prompt describing the transformation.

negativePrompt string?

What to avoid.

strength double

Transformation strength (0.0-1.0).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Transformed video tensor.

Remarks

For Beginners: This changes an existing video's style or content:

  • strength=0.3: Minor style changes, motion preserved
  • strength=0.7: Major changes, but timing preserved
  • strength=1.0: Complete regeneration guided by the original
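
A usage sketch (inputVideo is assumed to be an existing [batch, numFrames, channels, height, width] tensor; the prompt and strength are example values):

// Restyle an existing clip while largely preserving its motion.
Tensor<float> stylized = model.VideoToVideo(
    inputVideo,
    prompt: "watercolor painting style",
    strength: 0.7,           // major stylistic change, timing preserved
    numInferenceSteps: 50,
    guidanceScale: 7.5);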