Interface IVideoDiffusionModel<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for video diffusion models that generate temporal sequences.
public interface IVideoDiffusionModel<T> : IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Video diffusion models extend image diffusion to handle the temporal dimension, generating coherent video sequences. They model both spatial (within-frame) and temporal (across-frame) dependencies.
For Beginners: Video diffusion is like image diffusion, but it creates videos instead of single images. The main challenge is making the frames look consistent over time (no flickering or teleporting objects).
How video diffusion works:
- The model generates multiple frames at once (typically 14-25 frames)
- Special "temporal attention" ensures frames are consistent
- The model can be conditioned on a starting image, text, or both
Common approaches:
- Image-to-Video (SVD): Start from an image, generate motion
- Text-to-Video (VideoCrafter): Generate video from text description
- Video-to-Video: Transform existing video with new style/content
Key challenges solved by these models:
- Temporal consistency (no flickering)
- Motion coherence (objects move naturally)
- Long-range dependencies (beginning and end are related)
This interface extends IDiffusionModel<T> with video-specific operations.
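For orientation, here is a minimal sketch of the simplest call through this interface. The model variable stands for any concrete implementation (none is prescribed by this page), and the prompt is illustrative.

// Minimal sketch: the simplest call, relying on the model's defaults.
// 'model' is any concrete implementation of this interface.
static Tensor<float> QuickClip(IVideoDiffusionModel<float> model) =>
    model.GenerateFromText("a dog running on a beach");
// Result shape: [batch, numFrames, channels, height, width]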
Properties
DefaultFPS
Gets the default frames per second for generated videos.
int DefaultFPS { get; }
Property Value
- int
Remarks
Typical values: 7 FPS for SVD, 8 FPS for AnimateDiff. At a fixed frame count, lower FPS means slower apparent motion but choppier playback.
DefaultNumFrames
Gets the default number of frames generated.
int DefaultNumFrames { get; }
Property Value
- int
Remarks
Typical values: 14, 16, 25 frames. Limited by GPU memory.
MotionBucketId
Gets the motion bucket ID for controlling motion intensity (SVD-specific).
int MotionBucketId { get; }
Property Value
- int
Remarks
Controls amount of motion in generated video. Lower values = less motion, higher values = more motion. Range: 1-255, default: 127.
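For instance, a small sketch mapping named presets onto bucket IDs; the preset names and values are illustrative, not part of the interface. It uses SetMotionBucketId(int), documented below.

// Sketch: translate coarse presets into motion bucket IDs before generating.
static void ApplyMotionPreset(IVideoDiffusionModel<float> model, string preset)
{
    int bucketId = preset switch
    {
        "subtle" => 30,    // little movement
        "strong" => 200,   // lots of movement
        _ => 127           // the documented default
    };
    model.SetMotionBucketId(bucketId); // valid range: 1-255
}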
SupportsImageToVideo
Gets whether this model supports image-to-video generation.
bool SupportsImageToVideo { get; }
Property Value
- bool
SupportsTextToVideo
Gets whether this model supports text-to-video generation.
bool SupportsTextToVideo { get; }
Property Value
- bool
SupportsVideoToVideo
Gets whether this model supports video-to-video transformation.
bool SupportsVideoToVideo { get; }
Property Value
- bool
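Because support varies by model, callers typically branch on these flags before choosing a generation path. A hedged sketch (the fallback order is illustrative):

// Sketch: choose a generation mode based on the model's capabilities.
static Tensor<float> Generate(
    IVideoDiffusionModel<float> model,
    Tensor<float> image,   // [batch, channels, height, width]
    string prompt)
{
    if (model.SupportsImageToVideo)
        return model.GenerateFromImage(image);   // animate the supplied still
    if (model.SupportsTextToVideo)
        return model.GenerateFromText(prompt);   // fall back to text conditioning
    throw new NotSupportedException("This model supports neither image-to-video nor text-to-video.");
}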
Methods
ExtractFrame(Tensor<T>, int)
Extracts a frame from the video tensor.
Tensor<T> ExtractFrame(Tensor<T> video, int frameIndex)
Parameters
video (Tensor<T>): The video tensor [batch, numFrames, channels, height, width].
frameIndex (int): Index of the frame to extract.
Returns
- Tensor<T>
The frame as an image tensor [batch, channels, height, width].
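One practical pattern is chaining clips by re-conditioning on the final frame. A sketch; the chaining idea is illustrative, not part of the contract:

// Sketch: extend a clip by animating onward from its last frame.
static Tensor<float> ExtendClip(IVideoDiffusionModel<float> model, Tensor<float> clip, int numFrames)
{
    // ExtractFrame returns [batch, channels, height, width].
    Tensor<float> lastFrame = model.ExtractFrame(clip, numFrames - 1);
    return model.GenerateFromImage(lastFrame); // continue motion from where the clip ended
}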
FramesToVideo(Tensor<T>[])
Concatenates frames into a video tensor.
Tensor<T> FramesToVideo(Tensor<T>[] frames)
Parameters
frames (Tensor<T>[]): Array of frame tensors, each [batch, channels, height, width].
Returns
- Tensor<T>
Video tensor [batch, numFrames, channels, height, width].
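Paired with ExtractFrame, this supports per-frame editing and reassembly. A sketch, where processFrame is a hypothetical caller-supplied transform:

// Sketch: split a video into frames, transform each, and reassemble.
static Tensor<float> MapFrames(
    IVideoDiffusionModel<float> model,
    Tensor<float> video,
    int numFrames,
    Func<Tensor<float>, Tensor<float>> processFrame) // hypothetical per-frame transform
{
    var frames = new Tensor<float>[numFrames];
    for (int i = 0; i < numFrames; i++)
        frames[i] = processFrame(model.ExtractFrame(video, i));
    return model.FramesToVideo(frames); // [batch, numFrames, channels, height, width]
}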
GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)
Generates a video from a conditioning image.
Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)
Parameters
inputImage (Tensor<T>): The conditioning image [batch, channels, height, width].
numFrames (int?): Number of frames to generate.
fps (int?): Target frames per second.
numInferenceSteps (int): Number of denoising steps.
motionBucketId (int?): Motion intensity (1-255).
noiseAugStrength (double): Noise augmentation strength applied to the input image.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video tensor [batch, numFrames, channels, height, width].
Remarks
For Beginners: This animates a still image:
- Input: A single image (photo, artwork, etc.)
- Output: A video where the scene comes to life
Tips:
- motionBucketId controls how much movement happens
- noiseAugStrength slightly varies the input to encourage motion
- Higher inference steps = smoother motion but slower
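A hedged sketch tying these parameters together; the values are the defaults and typical settings mentioned on this page, not requirements, and the caller is assumed to already have the conditioning image as a tensor.

// Sketch: animate a still image with explicit motion control.
static Tensor<float> Animate(IVideoDiffusionModel<float> model, Tensor<float> stillImage)
{
    return model.GenerateFromImage(
        inputImage: stillImage,       // [batch, channels, height, width]
        numFrames: 25,
        fps: 7,                       // SVD's typical rate
        numInferenceSteps: 25,
        motionBucketId: 127,          // moderate motion (range 1-255)
        noiseAugStrength: 0.02,       // slight input noise encourages motion
        seed: 42);                    // fix the seed for reproducibility
}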
GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)
Generates a video from a text prompt.
Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)
Parameters
prompt (string): Text description of the video to generate.
negativePrompt (string?): What to avoid in the video.
width (int): Video width in pixels.
height (int): Video height in pixels.
numFrames (int?): Number of frames to generate.
fps (int?): Target frames per second.
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video tensor [batch, numFrames, channels, height, width].
Remarks
For Beginners: This creates a video from a description:
- prompt: What you want ("a dog running on a beach")
- The model generates both the visual content and the motion
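A sketch with explicit settings; the values mirror the parameter defaults above and are illustrative:

// Sketch: text-to-video with an explicit negative prompt and fixed seed.
static Tensor<float> TextToVideo(IVideoDiffusionModel<float> model)
{
    return model.GenerateFromText(
        prompt: "a dog running on a beach at sunset",
        negativePrompt: "blurry, distorted, watermark",
        width: 512,
        height: 512,
        numFrames: 16,
        fps: 8,
        numInferenceSteps: 50,
        guidanceScale: 7.5,   // higher values follow the prompt more closely
        seed: 1234);
}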
InterpolateFrames(Tensor<T>, int, FrameInterpolationMethod)
Interpolates between frames to increase frame rate.
Tensor<T> InterpolateFrames(Tensor<T> video, int targetFPS, FrameInterpolationMethod interpolationMethod = FrameInterpolationMethod.Diffusion)
Parameters
video (Tensor<T>): The input video [batch, numFrames, channels, height, width].
targetFPS (int): Target frame rate.
interpolationMethod (FrameInterpolationMethod): Method for frame interpolation.
Returns
- Tensor<T>
Interpolated video with more frames.
Remarks
For Beginners: This makes videos smoother by adding in-between frames:
- Input: 7 FPS video (a bit choppy)
- Output: 30 FPS video (smooth playback)
The AI figures out what the in-between frames should look like.
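For example (a sketch; FrameInterpolationMethod.Diffusion is the documented default, and the enum's other members are not listed on this page):

// Sketch: generate at the model's native rate, then interpolate to 30 FPS.
static Tensor<float> SmoothClip(IVideoDiffusionModel<float> model, Tensor<float> image)
{
    Tensor<float> raw = model.GenerateFromImage(image, fps: model.DefaultFPS);
    return model.InterpolateFrames(raw, targetFPS: 30); // uses the Diffusion method by default
}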
SetMotionBucketId(int)
Sets the motion intensity for generation.
void SetMotionBucketId(int bucketId)
Parameters
bucketId (int): Motion bucket ID (1-255).
VideoToVideo(Tensor<T>, string, string?, double, int, double, int?)
Transforms an existing video.
Tensor<T> VideoToVideo(Tensor<T> inputVideo, string prompt, string? negativePrompt = null, double strength = 0.7, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)
Parameters
inputVideo (Tensor<T>): The input video [batch, numFrames, channels, height, width].
prompt (string): Text prompt describing the transformation.
negativePrompt (string?): What to avoid.
strength (double): Transformation strength (0.0-1.0).
numInferenceSteps (int): Number of denoising steps.
guidanceScale (double): Classifier-free guidance scale.
seed (int?): Optional random seed.
Returns
- Tensor<T>
Transformed video tensor.
Remarks
For Beginners: This changes an existing video's style or content:
- strength=0.3: Minor style changes, motion preserved
- strength=0.7: Major changes, but timing preserved
- strength=1.0: Complete regeneration guided by the original
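A closing sketch of a style-transfer pass; the 0.5 strength is an illustrative midpoint between the presets above, and the prompt is an example:

// Sketch: restyle an existing clip while keeping its motion.
static Tensor<float> Restyle(IVideoDiffusionModel<float> model, Tensor<float> clip)
{
    if (!model.SupportsVideoToVideo)
        throw new NotSupportedException("This model cannot transform video.");

    return model.VideoToVideo(
        inputVideo: clip,
        prompt: "watercolor painting style",
        strength: 0.5,            // keep motion, change appearance
        numInferenceSteps: 50,
        guidanceScale: 7.5);
}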