
Class VideoCrafterModel<T>

Namespace: AiDotNet.Diffusion.Models
Assembly: AiDotNet.dll

VideoCrafter model for high-quality text-to-video and image-to-video generation.

public class VideoCrafterModel<T> : VideoDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
VideoDiffusionModelBase<T> → VideoCrafterModel<T>
Implements
ILatentDiffusionModel<T>
IVideoDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Examples

// Create VideoCrafter model
var videoCrafter = new VideoCrafterModel<float>();

// Text-to-video generation
var video = videoCrafter.GenerateFromText(
    prompt: "A beautiful sunset over the ocean, waves crashing",
    width: 1024,
    height: 576,
    numFrames: 16,
    numInferenceSteps: 50);

// Image-to-video with text guidance
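// (LoadImage is a user-supplied helper that returns a Tensor<float>)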
var inputImage = LoadImage("sunset.jpg");
var animatedVideo = videoCrafter.GenerateFromImageAndText(
    image: inputImage,
    prompt: "waves gently rolling, seagulls flying",
    numFrames: 16);

Remarks

VideoCrafter is a video generation model that combines the strengths of text-to-video and image-to-video generation. It uses a dual-conditioning approach that enables both modalities while maintaining high visual quality and temporal coherence.

For Beginners: VideoCrafter is like having two video generation modes in one:

Mode 1 - Text-to-Video:

  • Input: "A rocket launching into space"
  • Output: 5-second video of a rocket launch

Mode 2 - Image-to-Video:

  • Input: Photo of a rocket on launch pad
  • Output: 5-second video of the rocket launching

Key advantages:

  • High visual quality (up to 1024x576 resolution)
  • Long video generation (up to 16+ seconds)
  • Good temporal coherence (smooth motion)
  • Dual conditioning (text + image together)

Unlike AnimateDiff, which adds motion modules to pre-trained Stable Diffusion models, VideoCrafter is trained end-to-end specifically for video generation, resulting in higher quality.

Architecture:

  • 3D U-Net with factorized spatial-temporal attention
  • Dual cross-attention for text and image conditioning
  • Temporal VAE for consistent frame encoding
  • DDIM scheduler for fast inference

Constructors

VideoCrafterModel()

Initializes a new instance of VideoCrafterModel with default parameters.

public VideoCrafterModel()

VideoCrafterModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, VideoUNetPredictor<T>?, TemporalVAE<T>?, IConditioningModule<T>?, IConditioningModule<T>?, int, int)

Initializes a new instance of VideoCrafterModel with custom parameters.

public VideoCrafterModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, VideoUNetPredictor<T>? videoUNet = null, TemporalVAE<T>? temporalVAE = null, IConditioningModule<T>? textConditioner = null, IConditioningModule<T>? imageConditioner = null, int defaultNumFrames = 16, int defaultFPS = 8)

Parameters

options DiffusionModelOptions<T>

Configuration options.

scheduler INoiseScheduler<T>

Optional scheduler.

videoUNet VideoUNetPredictor<T>

Optional VideoUNet predictor.

temporalVAE TemporalVAE<T>

Optional temporal VAE.

textConditioner IConditioningModule<T>

Optional text conditioning module.

imageConditioner IConditioningModule<T>

Optional image conditioning module.

defaultNumFrames int

Default number of frames.

defaultFPS int

Default FPS.
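For example, a model with custom frame defaults can be constructed like this (a minimal sketch; the values shown are illustrative, and omitted components fall back to their built-in defaults):

// Construct with custom defaults: 24 frames at 12 FPS.
// Scheduler, U-Net, VAE, and conditioners are left null,
// so the model creates its default components.
var model = new VideoCrafterModel<float>(
    defaultNumFrames: 24,
    defaultFPS: 12);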

Fields

DefaultHeight

Default VideoCrafter height.

public const int DefaultHeight = 576

Field Value

int

DefaultWidth

Default VideoCrafter width.

public const int DefaultWidth = 1024

Field Value

int
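The constants can be passed straight into generation calls, as in this sketch (reusing the videoCrafter instance from the Examples section):

// Generate at the model's native resolution.
var video = videoCrafter.GenerateFromText(
    prompt: "A city skyline at night",
    width: VideoCrafterModel<float>.DefaultWidth,    // 1024
    height: VideoCrafterModel<float>.DefaultHeight); // 576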

Properties

Conditioner

Gets the primary conditioning module (text).

public override IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

ImageConditioner

Gets the image conditioning module.

public IConditioningModule<T>? ImageConditioner { get; }

Property Value

IConditioningModule<T>

ImageConditioningScale

Gets or sets the image conditioning scale.

public double ImageConditioningScale { get; set; }

Property Value

double

Remarks

Controls how strongly the input image influences the output video. Higher values keep the video closer to the input image.
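For example (the value 1.5 is illustrative, reusing the instance and image from the Examples section):

// Keep the output video close to the input image.
videoCrafter.ImageConditioningScale = 1.5;
var video = videoCrafter.GenerateFromImage(inputImage);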

LatentChannels

Gets the latent channels.

public override int LatentChannels { get; }

Property Value

int

NoisePredictor

Gets the noise predictor.

public override INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

ParameterCount

Gets the total parameter count.

public override int ParameterCount { get; }

Property Value

int

SupportsImageToVideo

Gets whether image-to-video is supported.

public override bool SupportsImageToVideo { get; }

Property Value

bool

SupportsTextToVideo

Gets whether text-to-video is supported.

public override bool SupportsTextToVideo { get; }

Property Value

bool

SupportsVideoToVideo

Gets whether video-to-video is supported.

public override bool SupportsVideoToVideo { get; }

Property Value

bool
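Since capability varies across implementations of the video diffusion interfaces, these flags can be checked before dispatching a request; a minimal sketch:

// Pick a generation path based on the model's capabilities.
if (videoCrafter.SupportsImageToVideo)
{
    var video = videoCrafter.GenerateFromImage(inputImage);
}
else if (videoCrafter.SupportsTextToVideo)
{
    var video = videoCrafter.GenerateFromText("A beautiful sunset over the ocean");
}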

TemporalVAE

Gets the temporal VAE.

public override IVAEModel<T>? TemporalVAE { get; }

Property Value

IVAEModel<T>

UseDualConditioning

Gets or sets whether to use dual conditioning (text + image together).

public bool UseDualConditioning { get; set; }

Property Value

bool
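For example (a sketch; the property's default value is not specified here):

// Combine text and image guidance in GenerateFromImageAndText.
videoCrafter.UseDualConditioning = true;
var video = videoCrafter.GenerateFromImageAndText(inputImage, "waves gently rolling");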

VAE

Gets the VAE.

public override IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Methods

Clone()

Clones this model.

public override IDiffusionModel<T> Clone()

Returns

IDiffusionModel<T>

DecodeVideoLatents(Tensor<T>)

Decodes video latents using the temporal VAE.

protected override Tensor<T> DecodeVideoLatents(Tensor<T> latents)

Parameters

latents Tensor<T>

The video latents to decode.

Returns

Tensor<T>

DeepCopy()

Creates a deep copy.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>
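A sketch contrasting the two copies (assuming DeepCopy duplicates parameter state so the copies evolve independently):

// Clone returns the model through its diffusion-model interface;
// DeepCopy returns an independent full model.
IDiffusionModel<float> clone = videoCrafter.Clone();
IFullModel<float, Tensor<float>, Tensor<float>> copy = videoCrafter.DeepCopy();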

GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)

Generates video from image with optional text guidance.

public override Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)

Parameters

inputImage Tensor<T>

The conditioning image.

numFrames int?

Number of frames to generate.

fps int?

Frames per second of the output video.

numInferenceSteps int

Number of denoising steps.

motionBucketId int?

Optional motion intensity control.

noiseAugStrength double

Noise augmentation strength applied to the conditioning image.

seed int?

Optional random seed.

Returns

Tensor<T>
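A call that overrides the motion defaults might look like this (the specific values are illustrative):

// Animate a still image with stronger motion and a fixed seed.
var video = videoCrafter.GenerateFromImage(
    inputImage,
    numFrames: 16,
    fps: 8,
    numInferenceSteps: 25,
    motionBucketId: 180,     // higher values typically request more motion
    noiseAugStrength: 0.05,
    seed: 42);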

GenerateFromImageAndText(Tensor<T>, string, string?, int?, int, double, double, int?)

Generates video with dual conditioning (image + text).

public Tensor<T> GenerateFromImageAndText(Tensor<T> image, string prompt, string? negativePrompt = null, int? numFrames = null, int numInferenceSteps = 50, double guidanceScale = 7.5, double imageScale = 1, int? seed = null)

Parameters

image Tensor<T>

The conditioning image.

prompt string

The text prompt for guidance.

negativePrompt string

Optional negative prompt.

numFrames int?

Number of frames to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Text guidance scale.

imageScale double

Image conditioning scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video tensor.

Remarks

For Beginners: This method combines the best of both worlds:

  • The image provides the visual style and starting point
  • The text describes what motion/action should happen

Example:

  • Image: Photo of a person standing
  • Prompt: "person starts dancing energetically"
  • Result: Video of that person dancing
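The example above translates into a call like this (prompt text and scales are illustrative):

// Animate a still photo according to a motion prompt.
var personImage = LoadImage("person.jpg"); // user-supplied image loader
var dancingVideo = videoCrafter.GenerateFromImageAndText(
    image: personImage,
    prompt: "person starts dancing energetically",
    negativePrompt: "blurry, distorted",
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    imageScale: 1.0,
    seed: 42);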

GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)

Generates video from text prompt.

public override Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 1024, int height = 576, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)

Parameters

prompt string

The text prompt describing the video.

negativePrompt string

Optional negative prompt.

width int

Output width in pixels.

height int

Output height in pixels.

numFrames int?

Number of frames to generate.

fps int?

Frames per second of the output video.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Text guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>
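A fuller call than the one in the Examples section, adding a negative prompt and a fixed seed (values illustrative):

// Text-to-video with a negative prompt to steer away from artifacts.
var video = videoCrafter.GenerateFromText(
    prompt: "A rocket launching into space, dramatic lighting",
    negativePrompt: "low quality, watermark",
    width: 1024,
    height: 576,
    numFrames: 16,
    fps: 8,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 1234);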

GetParameters()

Gets all parameters.

public override Vector<T> GetParameters()

Returns

Vector<T>

PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)

Predicts video noise for image-to-video.

protected override Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)

Parameters

latents Tensor<T>

The noisy video latents.

timestep int

The current diffusion timestep.

imageEmbedding Tensor<T>

The image conditioning embedding.

motionEmbedding Tensor<T>

The motion conditioning embedding.

Returns

Tensor<T>

PredictVideoNoiseWithText(Tensor<T>, int, Tensor<T>)

Predicts video noise with text conditioning.

protected override Tensor<T> PredictVideoNoiseWithText(Tensor<T> latents, int timestep, Tensor<T> textEmbedding)

Parameters

latents Tensor<T>

The noisy video latents.

timestep int

The current diffusion timestep.

textEmbedding Tensor<T>

The text conditioning embedding.

Returns

Tensor<T>

SetParameters(Vector<T>)

Sets all parameters.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The full parameter vector to apply to the model.
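Together with GetParameters(), this allows parameter snapshots to be captured and restored; a minimal sketch:

// Snapshot the current parameters, then restore them later.
Vector<float> snapshot = videoCrafter.GetParameters();
// ... experiment with the model ...
videoCrafter.SetParameters(snapshot);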