
Class AnimateDiffModel<T>

Namespace
AiDotNet.Diffusion.Models
Assembly
AiDotNet.dll

AnimateDiff model for text-to-video and image-to-video generation.

public class AnimateDiffModel<T> : VideoDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
VideoDiffusionModelBase<T>
AnimateDiffModel<T>
Implements
ILatentDiffusionModel<T>
IVideoDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Examples

// Create AnimateDiff with default motion modules
var animateDiff = new AnimateDiffModel<float>();

// Text-to-video generation
var video = animateDiff.GenerateFromText(
    prompt: "A beautiful sunset over the ocean, waves gently rolling",
    width: 512,
    height: 512,
    numFrames: 16,
    numInferenceSteps: 25);

// Image-to-video with text guidance
// (LoadImage stands in for your own image-loading code returning a Tensor<float>)
var inputImage = LoadImage("beach.jpg");
var animatedVideo = animateDiff.AnimateImage(
    inputImage,
    prompt: "gentle waves, moving clouds",
    numFrames: 16);

Remarks

AnimateDiff extends Stable Diffusion with motion modules that enable temporal consistency in video generation. Unlike Stable Video Diffusion (SVD), which is trained end-to-end for video, AnimateDiff adds motion modules to an existing text-to-image model, making it highly flexible.

For Beginners: Think of AnimateDiff as "teaching an image generator to make videos."

How it works:

  1. Start with a text-to-image model (like Stable Diffusion)
  2. Add special "motion modules" between the layers
  3. These modules learn how things move in videos
  4. The original image quality is preserved while adding motion

Key advantages:

  • Works with any Stable Diffusion model/checkpoint
  • Can use existing LoRAs, ControlNets, etc.
  • Flexible: text-to-video, image-to-video, or both
  • Lower training requirements than full video models

Example use cases:

  • Generate a short animation from a text prompt
  • Animate a still image with natural motion
  • Create consistent character animations
  • Style transfer for videos using SD checkpoints

Architecture overview:

  • Base: Standard Stable Diffusion U-Net
  • Motion Modules: Temporal attention layers inserted after spatial attention
  • VAE: Standard SD VAE (per-frame encoding/decoding)
  • Optional: LoRA adapters for style customization

Supported modes:

  • Text-to-Video: Generate video from text prompt
  • Image-to-Video: Animate an input image with text guidance
  • Video-to-Video: Style transfer or modify existing video
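Which of these modes a given instance supports can be checked at runtime through the capability properties documented below; a minimal sketch:

var model = new AnimateDiffModel<float>();

// Capability flags (see the Supports* properties below)
Console.WriteLine(model.SupportsTextToVideo);   // primary mode
Console.WriteLine(model.SupportsImageToVideo);  // requires a conditioner
Console.WriteLine(model.SupportsVideoToVideo);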

Constructors

AnimateDiffModel()

Initializes a new instance of AnimateDiffModel with default parameters.

public AnimateDiffModel()

AnimateDiffModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, StandardVAE<T>?, IConditioningModule<T>?, MotionModuleConfig?, int, int)

Initializes a new instance of AnimateDiffModel with custom parameters.

public AnimateDiffModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, StandardVAE<T>? vae = null, IConditioningModule<T>? conditioner = null, MotionModuleConfig? motionConfig = null, int defaultNumFrames = 16, int defaultFPS = 8)

Parameters

options DiffusionModelOptions<T>

Configuration options for the diffusion model.

scheduler INoiseScheduler<T>

Optional custom scheduler.

unet UNetNoisePredictor<T>

Optional custom U-Net noise predictor.

vae StandardVAE<T>

Optional custom VAE.

conditioner IConditioningModule<T>

Optional conditioning module for text guidance.

motionConfig MotionModuleConfig

Optional motion module configuration.

defaultNumFrames int

Default number of frames to generate.

defaultFPS int

Default frames per second.
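For example, the clip defaults can be changed while keeping the default components; the sketch below assumes MotionModuleConfig exposes a parameterless constructor:

// Default scheduler, U-Net, VAE, and conditioner; 24 frames at 12 FPS by default
var model = new AnimateDiffModel<float>(
    defaultNumFrames: 24,
    defaultFPS: 12);

// Supplying an explicit motion module configuration
// (assumes MotionModuleConfig has a parameterless constructor)
var custom = new AnimateDiffModel<float>(
    motionConfig: new MotionModuleConfig(),
    defaultNumFrames: 16,
    defaultFPS: 8);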

Fields

DefaultHeight

Default AnimateDiff height (SD compatible).

public const int DefaultHeight = 512

Field Value

int

DefaultWidth

Default AnimateDiff width (SD compatible).

public const int DefaultWidth = 512

Field Value

int

Properties

Conditioner

Gets the conditioning module.

public override IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

ContextLength

Gets or sets the context length for temporal attention.

public int ContextLength { get; set; }

Property Value

int

Remarks

Controls how many frames are processed together in the motion modules. Larger values provide better temporal consistency but require more memory.

ContextOverlap

Gets or sets the context overlap for sliding window generation.

public int ContextOverlap { get; set; }

Property Value

int

Remarks

When generating more frames than ContextLength, this controls the overlap between windows to maintain smooth transitions.
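For example, a request for more frames than the context window is processed as overlapping windows; the values below are illustrative, not tuned settings:

var model = new AnimateDiffModel<float>();

// Motion modules attend over 16 frames at a time...
model.ContextLength = 16;
// ...and consecutive windows share 4 frames for smooth transitions
model.ContextOverlap = 4;

// A 32-frame clip is then generated as overlapping 16-frame windows
var longClip = model.GenerateFromText(
    prompt: "clouds drifting over a mountain ridge",
    numFrames: 32);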

LatentChannels

Gets the number of latent channels.

public override int LatentChannels { get; }

Property Value

int

MotionConfig

Gets the motion module configuration.

public MotionModuleConfig MotionConfig { get; }

Property Value

MotionModuleConfig

NoisePredictor

Gets the noise predictor.

public override INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

ParameterCount

Gets the total parameter count.

public override int ParameterCount { get; }

Property Value

int

SupportsImageToVideo

Gets whether image-to-video is supported.

public override bool SupportsImageToVideo { get; }

Property Value

bool

Remarks

AnimateDiff supports animating still images when a conditioner is available.

SupportsTextToVideo

Gets whether text-to-video is supported.

public override bool SupportsTextToVideo { get; }

Property Value

bool

Remarks

AnimateDiff's primary mode is text-to-video.

SupportsVideoToVideo

Gets whether video-to-video is supported.

public override bool SupportsVideoToVideo { get; }

Property Value

bool

VAE

Gets the VAE for frame encoding/decoding.

public override IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Methods

Clone()

Clones this AnimateDiff model.

public override IDiffusionModel<T> Clone()

Returns

IDiffusionModel<T>

DecodeVideoLatents(Tensor<T>)

Decodes video latents to frames.

protected override Tensor<T> DecodeVideoLatents(Tensor<T> latents)

Parameters

latents Tensor<T>

Returns

Tensor<T>

DeepCopy()

Creates a deep copy.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>
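DeepCopy differs from Clone (above) mainly in its return type; a minimal sketch:

var model = new AnimateDiffModel<float>();

// Clone is typed for the diffusion API...
IDiffusionModel<float> clone = model.Clone();

// ...while DeepCopy preserves the full model surface
IFullModel<float, Tensor<float>, Tensor<float>> copy = model.DeepCopy();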

GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)

Generates video from an input image.

public override Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)

Parameters

inputImage Tensor<T>

The input image to animate.

numFrames int?

Number of frames to generate (defaults to the model's default frame count).

fps int?

Frames per second (defaults to the model's default FPS).

numInferenceSteps int

Number of denoising steps.

motionBucketId int?

Optional motion intensity conditioning value.

noiseAugStrength double

Strength of noise augmentation applied to the input image.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video tensor [batch, numFrames, channels, height, width].
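A sketch of a typical call; constructing the input tensor from a shape array is an assumption about the Tensor<T> API, and in practice the frame would hold real pixel data:

var model = new AnimateDiffModel<float>();

// A blank [1, 3, 512, 512] frame stands in for a real loaded image
// (assumes Tensor<float> can be constructed from a shape array)
var frame = new Tensor<float>(new[] { 1, 3, 512, 512 });

var video = model.GenerateFromImage(
    frame,
    numFrames: 16,
    fps: 8,
    numInferenceSteps: 25,
    noiseAugStrength: 0.02,
    seed: 123);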

GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)

Generates video from text using AnimateDiff.

public override Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)

Parameters

prompt string

The text prompt describing the video.

negativePrompt string

Optional negative prompt.

width int

Video width.

height int

Video height.

numFrames int?

Number of frames to generate.

fps int?

Frames per second (for motion module).

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Classifier-free guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video tensor [batch, numFrames, channels, height, width].
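A fully specified call looks like this (values illustrative):

var model = new AnimateDiffModel<float>();

// Returns a tensor shaped [batch, numFrames, channels, height, width]
var video = model.GenerateFromText(
    prompt: "a red fox running through snow, cinematic lighting",
    negativePrompt: "blurry, low quality, watermark",
    width: 512,
    height: 512,
    numFrames: 16,
    fps: 8,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 42);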

GetParameters()

Gets all parameters.

public override Vector<T> GetParameters()

Returns

Vector<T>

PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)

Predicts video noise for image-to-video generation.

protected override Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)

Parameters

latents Tensor<T>
timestep int
imageEmbedding Tensor<T>
motionEmbedding Tensor<T>

Returns

Tensor<T>

SetParameters(Vector<T>)

Sets all parameters.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>
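Combined with GetParameters, this allows weights to be exported from one instance and restored into another with the same architecture; a minimal sketch:

var source = new AnimateDiffModel<float>();
var target = new AnimateDiffModel<float>();

// Export the flattened parameter vector...
Vector<float> weights = source.GetParameters();

// ...and load it into a second, identically configured instance
target.SetParameters(weights);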