Class AnimateDiffModel<T>
AnimateDiff model for text-to-video and image-to-video generation.
public class AnimateDiffModel<T> : VideoDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T - The numeric type used for calculations.
- Inheritance: VideoDiffusionModelBase<T> → AnimateDiffModel<T>
- Implements: ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Examples
// Create AnimateDiff with default motion modules
var animateDiff = new AnimateDiffModel<float>();

// Text-to-video generation
var video = animateDiff.GenerateFromText(
    prompt: "A beautiful sunset over the ocean, waves gently rolling",
    width: 512,
    height: 512,
    numFrames: 16,
    numInferenceSteps: 25);

// Image-to-video with text guidance
var inputImage = LoadImage("beach.jpg");
var animatedVideo = animateDiff.AnimateImage(
    inputImage,
    prompt: "gentle waves, moving clouds",
    numFrames: 16);
Remarks
AnimateDiff extends Stable Diffusion with motion modules that enable temporal consistency in video generation. Unlike SVD, which is trained end-to-end for video, AnimateDiff adds motion modules to existing text-to-image models, making it highly flexible.
For Beginners: Think of AnimateDiff as "teaching an image generator to make videos."
How it works:
- Start with a text-to-image model (like Stable Diffusion)
- Add special "motion modules" between the layers
- These modules learn how things move in videos
- The original image quality is preserved while adding motion
Key advantages:
- Works with any Stable Diffusion model/checkpoint
- Can use existing LoRAs, ControlNets, etc.
- Flexible: text-to-video, image-to-video, or both
- Lower training requirements than full video models
Example use cases:
- Generate a short animation from a text prompt
- Animate a still image with natural motion
- Create consistent character animations
- Style transfer for videos using SD checkpoints
Architecture overview:
- Base: Standard Stable Diffusion U-Net
- Motion Modules: Temporal attention layers inserted after spatial attention
- VAE: Standard SD VAE (per-frame encoding/decoding)
- Optional: LoRA adapters for style customization
Supported modes:
- Text-to-Video: Generate video from text prompt
- Image-to-Video: Animate an input image with text guidance
- Video-to-Video: Style transfer or modification of an existing video
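To make the motion-module idea above concrete, the standalone sketch below (plain C#, no library types) shows the reshape convention typically used around temporal attention: spatial layers keep frames independent by treating the frame axis like a batch axis, while a motion module regroups the data so attention runs along the frame axis at each spatial position. The layouts and loop code are illustrative only and are not part of this library's API.
// Standalone sketch of the temporal-attention reshape (illustrative values only).
int frames = 4, channels = 2, h = 2, w = 2;

// Spatial layers see [frames, channels, h, w]: each frame is processed independently.
var spatialLayout = new float[frames, channels, h, w];
for (int f = 0; f < frames; f++)
    for (int c = 0; c < channels; c++)
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                spatialLayout[f, c, y, x] = f + c / 10f;   // tag each value with its frame index

// A motion module regroups the same values as [h * w, frames, channels]:
// every spatial position now holds a frame-ordered sequence to attend over.
var temporalLayout = new float[h * w, frames, channels];
for (int f = 0; f < frames; f++)
    for (int c = 0; c < channels; c++)
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                temporalLayout[y * w + x, f, c] = spatialLayout[f, c, y, x];

// Position (0, 0), channel 0 now reads 0 1 2 3: one value per frame, in order.
for (int f = 0; f < frames; f++)
    System.Console.Write($"{temporalLayout[0, f, 0]} ");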
Constructors
AnimateDiffModel()
Initializes a new instance of AnimateDiffModel with default parameters.
public AnimateDiffModel()
AnimateDiffModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, UNetNoisePredictor<T>?, StandardVAE<T>?, IConditioningModule<T>?, MotionModuleConfig?, int, int)
Initializes a new instance of AnimateDiffModel with custom parameters.
public AnimateDiffModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, UNetNoisePredictor<T>? unet = null, StandardVAE<T>? vae = null, IConditioningModule<T>? conditioner = null, MotionModuleConfig? motionConfig = null, int defaultNumFrames = 16, int defaultFPS = 8)
Parameters
options DiffusionModelOptions<T> - Configuration options for the diffusion model.
scheduler INoiseScheduler<T> - Optional custom scheduler.
unet UNetNoisePredictor<T> - Optional custom U-Net noise predictor.
vae StandardVAE<T> - Optional custom VAE.
conditioner IConditioningModule<T> - Optional conditioning module for text guidance.
motionConfig MotionModuleConfig - Optional motion module configuration.
defaultNumFrames int - Default number of frames to generate.
defaultFPS int - Default frames per second.
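For example, a minimal sketch of constructing the model with custom frame defaults while leaving every optional component at its null default (the model presumably supplies its own scheduler, U-Net, VAE, conditioner, and motion configuration in that case):
// Sketch: custom frame defaults; all optional components are left null and
// presumably fall back to the model's built-in defaults.
var animateDiff = new AnimateDiffModel<float>(
    defaultNumFrames: 24,
    defaultFPS: 12);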
Fields
DefaultHeight
Default AnimateDiff height (SD compatible).
public const int DefaultHeight = 512
Field Value
- int
DefaultWidth
Default AnimateDiff width (SD compatible).
public const int DefaultWidth = 512
Field Value
- int
Properties
Conditioner
Gets the conditioning module.
public override IConditioningModule<T>? Conditioner { get; }
Property Value
- IConditioningModule<T>
ContextLength
Gets or sets the context length for temporal attention.
public int ContextLength { get; set; }
Property Value
- int
Remarks
Controls how many frames are processed together in the motion modules. Larger values provide better temporal consistency but require more memory.
ContextOverlap
Gets or sets the context overlap for sliding window generation.
public int ContextOverlap { get; set; }
Property Value
- int
Remarks
When generating more frames than ContextLength, this controls the overlap between windows to maintain smooth transitions.
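A sketch of the sliding-window setup described above; the prompt and frame counts are illustrative:
// Sketch: with ContextLength = 16 and ContextOverlap = 4, consecutive windows of
// 16 frames share 4 frames, so each additional window contributes 12 new frames.
animateDiff.ContextLength = 16;
animateDiff.ContextOverlap = 4;

// Requesting more frames than ContextLength triggers the windowed generation.
var longClip = animateDiff.GenerateFromText(
    prompt: "a paper boat drifting down a stream",
    numFrames: 48,
    numInferenceSteps: 25);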
LatentChannels
Gets the number of latent channels.
public override int LatentChannels { get; }
Property Value
- int
MotionConfig
Gets the motion module configuration.
public MotionModuleConfig MotionConfig { get; }
Property Value
- MotionModuleConfig
NoisePredictor
Gets the noise predictor.
public override INoisePredictor<T> NoisePredictor { get; }
Property Value
- INoisePredictor<T>
ParameterCount
Gets the total parameter count.
public override int ParameterCount { get; }
Property Value
- int
SupportsImageToVideo
Gets whether image-to-video is supported.
public override bool SupportsImageToVideo { get; }
Property Value
- bool
Remarks
AnimateDiff supports animating still images when a conditioner is available.
SupportsTextToVideo
Gets whether text-to-video is supported.
public override bool SupportsTextToVideo { get; }
Property Value
- bool
Remarks
AnimateDiff's primary mode is text-to-video.
SupportsVideoToVideo
Gets whether video-to-video is supported.
public override bool SupportsVideoToVideo { get; }
Property Value
- bool
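The three capability flags can be used to choose a generation path at runtime, as in this minimal sketch (the inputImage and prompt variables are assumed to exist):
// Sketch: dispatch based on the model's reported capabilities.
Tensor<float> video;
if (inputImage != null && animateDiff.SupportsImageToVideo)
{
    video = animateDiff.GenerateFromImage(inputImage, numFrames: 16);
}
else if (animateDiff.SupportsTextToVideo)
{
    video = animateDiff.GenerateFromText(prompt, numFrames: 16);
}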
VAE
Gets the VAE for frame encoding/decoding.
public override IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Methods
Clone()
Clones this AnimateDiff model.
public override IDiffusionModel<T> Clone()
Returns
- IDiffusionModel<T>
DecodeVideoLatents(Tensor<T>)
Decodes video latents to frames.
protected override Tensor<T> DecodeVideoLatents(Tensor<T> latents)
Parameters
latents Tensor<T>
Returns
- Tensor<T>
DeepCopy()
Creates a deep copy.
public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)
Generates video from an input image.
public override Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)
Parameters
inputImage Tensor<T>
numFrames int?
fps int?
numInferenceSteps int
motionBucketId int?
noiseAugStrength double
seed int?
Returns
- Tensor<T>
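A usage sketch; the LoadImage helper is hypothetical (as in the Examples section), and the parameter values are illustrative:
// Sketch: animate a still image. Higher noiseAugStrength typically loosens how
// closely the output follows the input frame; the default of 0.02 keeps it close.
var stillImage = LoadImage("waterfall.jpg");   // hypothetical image-loading helper
var clip = animateDiff.GenerateFromImage(
    stillImage,
    numFrames: 16,
    fps: 8,
    numInferenceSteps: 25,
    noiseAugStrength: 0.02,
    seed: 42);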
GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)
Generates video from text using AnimateDiff.
public override Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 512, int height = 512, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)
Parameters
prompt string - The text prompt describing the video.
negativePrompt string - Optional negative prompt.
width int - Video width.
height int - Video height.
numFrames int? - Number of frames to generate.
fps int? - Frames per second (for motion module).
numInferenceSteps int - Number of denoising steps.
guidanceScale double - Classifier-free guidance scale.
seed int? - Optional random seed.
Returns
- Tensor<T>
Generated video tensor [batch, numFrames, channels, height, width].
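A fuller usage sketch with a negative prompt, guidance scale, and fixed seed for reproducibility:
// Sketch: text-to-video with a negative prompt and a fixed seed.
var video = animateDiff.GenerateFromText(
    prompt: "a hot air balloon drifting over mountains at dawn",
    negativePrompt: "blurry, low quality, watermark",
    width: 512,
    height: 512,
    numFrames: 16,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 1234);
// The result is laid out as [batch, numFrames, channels, height, width].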
GetParameters()
Gets all parameters.
public override Vector<T> GetParameters()
Returns
- Vector<T>
PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)
Predicts video noise for image-to-video generation.
protected override Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)
Parameters
latents Tensor<T>
timestep int
imageEmbedding Tensor<T>
motionEmbedding Tensor<T>
Returns
- Tensor<T>
SetParameters(Vector<T>)
Sets all parameters.
public override void SetParameters(Vector<T> parameters)
Parameters
parameters Vector<T>
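A sketch of copying weights between two identically configured instances via the flattened parameter vector:
// Sketch: transfer all weights from one model to another with the same configuration.
var source = new AnimateDiffModel<float>();
var target = new AnimateDiffModel<float>();

Vector<float> weights = source.GetParameters();
target.SetParameters(weights);   // target now holds the same parameter values as source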