
Class VideoCrafterModel<T>

Namespace: AiDotNet.Diffusion.Models
Assembly: AiDotNet.dll

VideoCrafter model for high-quality text-to-video and image-to-video generation.

public class VideoCrafterModel<T> : VideoDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
VideoDiffusionModelBase<T> → VideoCrafterModel<T>
Implements
ILatentDiffusionModel<T>
IVideoDiffusionModel<T>
IDiffusionModel<T>
IFullModel<T, Tensor<T>, Tensor<T>>
IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>
IModelSerializer
ICheckpointableModel
IParameterizable<T, Tensor<T>, Tensor<T>>
IFeatureAware
IFeatureImportance<T>
ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>
IGradientComputable<T, Tensor<T>, Tensor<T>>
IJitCompilable<T>

Examples

// Create VideoCrafter model
var videoCrafter = new VideoCrafterModel<float>();

// Text-to-video generation
var video = videoCrafter.GenerateFromText(
    prompt: "A beautiful sunset over the ocean, waves crashing",
    width: 1024,
    height: 576,
    numFrames: 16,
    numInferenceSteps: 50);

// Image-to-video with text guidance
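// (LoadImage is a user-supplied helper that returns a Tensor<float>)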
var inputImage = LoadImage("sunset.jpg");
var animatedVideo = videoCrafter.GenerateFromImageAndText(
    image: inputImage,
    prompt: "waves gently rolling, seagulls flying",
    numFrames: 16);

Remarks

VideoCrafter is a video generation model that combines the strengths of text-to-video and image-to-video generation. It uses a dual-conditioning approach that enables both modalities while maintaining high visual quality and temporal coherence.

For Beginners: VideoCrafter is like having two video generation modes in one:

Mode 1 - Text-to-Video:

  • Input: "A rocket launching into space"
  • Output: 5-second video of a rocket launch

Mode 2 - Image-to-Video:

  • Input: Photo of a rocket on launch pad
  • Output: 5-second video of the rocket launching

Key advantages:

  • High visual quality (up to 1024x576 resolution)
  • Long video generation (up to 16+ seconds)
  • Good temporal coherence (smooth motion)
  • Dual conditioning (text + image together)

Unlike AnimateDiff, which adds motion modules to pre-trained Stable Diffusion models, VideoCrafter is trained end-to-end specifically for video generation, resulting in higher quality.

Architecture:

  • 3D U-Net with factorized spatial-temporal attention
  • Dual cross-attention for text and image conditioning
  • Temporal VAE for consistent frame encoding
  • DDIM scheduler for fast inference

Constructors

VideoCrafterModel()

Initializes a new instance of VideoCrafterModel with default parameters.

public VideoCrafterModel()

VideoCrafterModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, VideoUNetPredictor<T>?, TemporalVAE<T>?, IConditioningModule<T>?, IConditioningModule<T>?, int, int)

Initializes a new instance of VideoCrafterModel with custom parameters.

public VideoCrafterModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, VideoUNetPredictor<T>? videoUNet = null, TemporalVAE<T>? temporalVAE = null, IConditioningModule<T>? textConditioner = null, IConditioningModule<T>? imageConditioner = null, int defaultNumFrames = 16, int defaultFPS = 8)

Parameters

options DiffusionModelOptions<T>

Configuration options.

scheduler INoiseScheduler<T>

Optional scheduler.

videoUNet VideoUNetPredictor<T>

Optional VideoUNet predictor.

temporalVAE TemporalVAE<T>

Optional temporal VAE.

textConditioner IConditioningModule<T>

Optional text conditioning module.

imageConditioner IConditioningModule<T>

Optional image conditioning module.

defaultNumFrames int

Default number of frames.

defaultFPS int

Default FPS.
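For example, a model with custom frame defaults can be constructed like this (a minimal sketch; the values shown are illustrative, and omitted components fall back to their built-in defaults):

// Construct with custom defaults: 24 frames at 12 FPS.
// Scheduler, U-Net, VAE, and conditioners are left null,
// so the model creates its default components.
var model = new VideoCrafterModel<float>(
    defaultNumFrames: 24,
    defaultFPS: 12);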

Fields

DefaultHeight

Default VideoCrafter height.

public const int DefaultHeight = 576

Field Value

int

DefaultWidth

Default VideoCrafter width.

public const int DefaultWidth = 1024

Field Value

int
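The constants can be passed straight into generation calls, as in this sketch (reusing the videoCrafter instance from the Examples section):

// Generate at the model's native resolution.
var video = videoCrafter.GenerateFromText(
    prompt: "A city skyline at night",
    width: VideoCrafterModel<float>.DefaultWidth,    // 1024
    height: VideoCrafterModel<float>.DefaultHeight); // 576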

Properties

Conditioner

Gets the primary conditioning module (text).

public override IConditioningModule<T>? Conditioner { get; }

Property Value

IConditioningModule<T>

ImageConditioner

Gets the image conditioning module.

public IConditioningModule<T>? ImageConditioner { get; }

Property Value

IConditioningModule<T>

ImageConditioningScale

Gets or sets the image conditioning scale.

public double ImageConditioningScale { get; set; }

Property Value

double

Remarks

Controls how strongly the input image influences the output video. Higher values keep the video closer to the input image.
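For example (the value 1.5 is illustrative, reusing the instance and image from the Examples section):

// Keep the output video close to the input image.
videoCrafter.ImageConditioningScale = 1.5;
var video = videoCrafter.GenerateFromImage(inputImage);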

LatentChannels

Gets the latent channels.

public override int LatentChannels { get; }

Property Value

int

NoisePredictor

Gets the noise predictor.

public override INoisePredictor<T> NoisePredictor { get; }

Property Value

INoisePredictor<T>

ParameterCount

Gets the total parameter count.

public override int ParameterCount { get; }

Property Value

int

SupportsImageToVideo

Gets whether image-to-video is supported.

public override bool SupportsImageToVideo { get; }

Property Value

bool

SupportsTextToVideo

Gets whether text-to-video is supported.

public override bool SupportsTextToVideo { get; }

Property Value

bool

SupportsVideoToVideo

Gets whether video-to-video is supported.

public override bool SupportsVideoToVideo { get; }

Property Value

bool
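Since capability varies across implementations of the video diffusion interfaces, these flags can be checked before dispatching a request; a minimal sketch:

// Pick a generation path based on the model's capabilities.
if (videoCrafter.SupportsImageToVideo)
{
    var video = videoCrafter.GenerateFromImage(inputImage);
}
else if (videoCrafter.SupportsTextToVideo)
{
    var video = videoCrafter.GenerateFromText("A beautiful sunset over the ocean");
}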

TemporalVAE

Gets the temporal VAE.

public override IVAEModel<T>? TemporalVAE { get; }

Property Value

IVAEModel<T>

UseDualConditioning

Gets or sets whether to use dual conditioning (text + image together).

public bool UseDualConditioning { get; set; }

Property Value

bool
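For example (a sketch; the property's default value is not specified here):

// Combine text and image guidance in GenerateFromImageAndText.
videoCrafter.UseDualConditioning = true;
var video = videoCrafter.GenerateFromImageAndText(inputImage, "waves gently rolling");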

VAE

Gets the VAE.

public override IVAEModel<T> VAE { get; }

Property Value

IVAEModel<T>

Methods

Clone()

Clones this model.

public override IDiffusionModel<T> Clone()

Returns

IDiffusionModel<T>

DecodeVideoLatents(Tensor<T>)

Decodes video latents using the temporal VAE.

protected override Tensor<T> DecodeVideoLatents(Tensor<T> latents)

Parameters

latents Tensor<T>

The video latents to decode.

Returns

Tensor<T>

DeepCopy()

Creates a deep copy.

public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()

Returns

IFullModel<T, Tensor<T>, Tensor<T>>
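A sketch contrasting the two copies (assuming DeepCopy duplicates parameter state so the copies evolve independently):

// Clone returns the model through its diffusion-model interface;
// DeepCopy returns an independent full model.
IDiffusionModel<float> clone = videoCrafter.Clone();
IFullModel<float, Tensor<float>, Tensor<float>> copy = videoCrafter.DeepCopy();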

GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)

Generates video from image with optional text guidance.

public override Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)

Parameters

inputImage Tensor<T>

The conditioning image.

numFrames int?

Number of frames to generate.

fps int?

Frames per second of the output video.

numInferenceSteps int

Number of denoising steps.

motionBucketId int?

Optional motion intensity control.

noiseAugStrength double

Noise augmentation strength applied to the conditioning image.

seed int?

Optional random seed.

Returns

Tensor<T>
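A call that overrides the motion defaults might look like this (the specific values are illustrative):

// Animate a still image with stronger motion and a fixed seed.
var video = videoCrafter.GenerateFromImage(
    inputImage,
    numFrames: 16,
    fps: 8,
    numInferenceSteps: 25,
    motionBucketId: 180,     // higher values typically request more motion
    noiseAugStrength: 0.05,
    seed: 42);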

GenerateFromImageAndText(Tensor<T>, string, string?, int?, int, double, double, int?)

Generates video with dual conditioning (image + text).

public Tensor<T> GenerateFromImageAndText(Tensor<T> image, string prompt, string? negativePrompt = null, int? numFrames = null, int numInferenceSteps = 50, double guidanceScale = 7.5, double imageScale = 1, int? seed = null)

Parameters

image Tensor<T>

The conditioning image.

prompt string

The text prompt for guidance.

negativePrompt string

Optional negative prompt.

numFrames int?

Number of frames to generate.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Text guidance scale.

imageScale double

Image conditioning scale.

seed int?

Optional random seed.

Returns

Tensor<T>

Generated video tensor.

Remarks

For Beginners: This method combines the best of both worlds:

  • The image provides the visual style and starting point
  • The text describes what motion/action should happen

Example:

  • Image: Photo of a person standing
  • Prompt: "person starts dancing energetically"
  • Result: Video of that person dancing
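The example above translates into a call like this (prompt text and scales are illustrative):

// Animate a still photo according to a motion prompt.
var personImage = LoadImage("person.jpg"); // user-supplied image loader
var dancingVideo = videoCrafter.GenerateFromImageAndText(
    image: personImage,
    prompt: "person starts dancing energetically",
    negativePrompt: "blurry, distorted",
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    imageScale: 1.0,
    seed: 42);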

GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)

Generates video from text prompt.

public override Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 1024, int height = 576, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)

Parameters

prompt string

The text prompt describing the video.

negativePrompt string

Optional negative prompt.

width int

Output width in pixels.

height int

Output height in pixels.

numFrames int?

Number of frames to generate.

fps int?

Frames per second of the output video.

numInferenceSteps int

Number of denoising steps.

guidanceScale double

Text guidance scale.

seed int?

Optional random seed.

Returns

Tensor<T>
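A fuller call than the one in the Examples section, adding a negative prompt and a fixed seed (values illustrative):

// Text-to-video with a negative prompt to steer away from artifacts.
var video = videoCrafter.GenerateFromText(
    prompt: "A rocket launching into space, dramatic lighting",
    negativePrompt: "low quality, watermark",
    width: 1024,
    height: 576,
    numFrames: 16,
    fps: 8,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 1234);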

GetParameters()

Gets all parameters.

public override Vector<T> GetParameters()

Returns

Vector<T>

PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)

Predicts video noise for image-to-video.

protected override Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)

Parameters

latents Tensor<T>

The noisy video latents.

timestep int

The current diffusion timestep.

imageEmbedding Tensor<T>

The image conditioning embedding.

motionEmbedding Tensor<T>

The motion conditioning embedding.

Returns

Tensor<T>

PredictVideoNoiseWithText(Tensor<T>, int, Tensor<T>)

Predicts video noise with text conditioning.

protected override Tensor<T> PredictVideoNoiseWithText(Tensor<T> latents, int timestep, Tensor<T> textEmbedding)

Parameters

latents Tensor<T>

The noisy video latents.

timestep int

The current diffusion timestep.

textEmbedding Tensor<T>

The text conditioning embedding.

Returns

Tensor<T>

SetParameters(Vector<T>)

Sets all parameters.

public override void SetParameters(Vector<T> parameters)

Parameters

parameters Vector<T>

The full parameter vector to apply to the model.
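Together with GetParameters(), this allows parameter snapshots to be captured and restored; a minimal sketch:

// Snapshot the current parameters, then restore them later.
Vector<float> snapshot = videoCrafter.GetParameters();
// ... experiment with the model ...
videoCrafter.SetParameters(snapshot);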