Class VideoCrafterModel<T>
VideoCrafter model for high-quality text-to-video and image-to-video generation.
public class VideoCrafterModel<T> : VideoDiffusionModelBase<T>, ILatentDiffusionModel<T>, IVideoDiffusionModel<T>, IDiffusionModel<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>
Type Parameters
T: The numeric type used for calculations.
- Inheritance: VideoDiffusionModelBase<T> → VideoCrafterModel<T>
Examples
// Create VideoCrafter model
var videoCrafter = new VideoCrafterModel<float>();

// Text-to-video generation
var video = videoCrafter.GenerateFromText(
    prompt: "A beautiful sunset over the ocean, waves crashing",
    width: 1024,
    height: 576,
    numFrames: 16,
    numInferenceSteps: 50);

// Image-to-video with text guidance
var inputImage = LoadImage("sunset.jpg");
var animatedVideo = videoCrafter.GenerateFromImageAndText(
    image: inputImage,
    prompt: "waves gently rolling, seagulls flying",
    numFrames: 16);
Remarks
VideoCrafter is a video generation model that combines the strengths of text-to-video and image-to-video generation. It uses a dual-conditioning approach that enables both modalities while maintaining high visual quality and temporal coherence.
For Beginners: VideoCrafter is like having two video generation modes in one:
Mode 1 - Text-to-Video:
- Input: "A rocket launching into space"
- Output: 5-second video of a rocket launch
Mode 2 - Image-to-Video:
- Input: Photo of a rocket on launch pad
- Output: 5-second video of the rocket launching
Key advantages:
- High visual quality (up to 1024x576 resolution)
- Long video generation (16+ seconds)
- Good temporal coherence (smooth motion)
- Dual conditioning (text + image together)
Unlike AnimateDiff, which adds motion modules to existing Stable Diffusion models, VideoCrafter is trained end-to-end specifically for video generation, resulting in better quality.
Architecture:
- 3D U-Net with factorized spatial-temporal attention
- Dual cross-attention for text and image conditioning
- Temporal VAE for consistent frame encoding
- DDIM scheduler for fast inference
Constructors
VideoCrafterModel()
Initializes a new instance of VideoCrafterModel with default parameters.
public VideoCrafterModel()
VideoCrafterModel(DiffusionModelOptions<T>?, INoiseScheduler<T>?, VideoUNetPredictor<T>?, TemporalVAE<T>?, IConditioningModule<T>?, IConditioningModule<T>?, int, int)
Initializes a new instance of VideoCrafterModel with custom parameters.
public VideoCrafterModel(DiffusionModelOptions<T>? options = null, INoiseScheduler<T>? scheduler = null, VideoUNetPredictor<T>? videoUNet = null, TemporalVAE<T>? temporalVAE = null, IConditioningModule<T>? textConditioner = null, IConditioningModule<T>? imageConditioner = null, int defaultNumFrames = 16, int defaultFPS = 8)
Parameters
- options (DiffusionModelOptions<T>): Configuration options.
- scheduler (INoiseScheduler<T>): Optional scheduler.
- videoUNet (VideoUNetPredictor<T>): Optional VideoUNet predictor.
- temporalVAE (TemporalVAE<T>): Optional temporal VAE.
- textConditioner (IConditioningModule<T>): Optional text conditioning module.
- imageConditioner (IConditioningModule<T>): Optional image conditioning module.
- defaultNumFrames (int): Default number of frames.
- defaultFPS (int): Default FPS.
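A minimal construction sketch: each component argument is optional and, when left null, presumably falls back to a built-in default; only the frame count and FPS are overridden here, and the values 24 and 12 are illustrative:

var model = new VideoCrafterModel<float>(
    options: null,            // default configuration
    scheduler: null,          // default noise scheduler
    videoUNet: null,          // default VideoUNet predictor
    temporalVAE: null,        // default temporal VAE
    textConditioner: null,    // default text conditioner
    imageConditioner: null,   // default image conditioner
    defaultNumFrames: 24,     // illustrative override
    defaultFPS: 12);          // illustrative override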
Fields
DefaultHeight
Default VideoCrafter height.
public const int DefaultHeight = 576
Field Value
- int
DefaultWidth
Default VideoCrafter width.
public const int DefaultWidth = 1024
Field Value
- int
Properties
Conditioner
Gets the primary conditioning module (text).
public override IConditioningModule<T>? Conditioner { get; }
Property Value
- IConditioningModule<T>
ImageConditioner
Gets the image conditioning module.
public IConditioningModule<T>? ImageConditioner { get; }
Property Value
- IConditioningModule<T>
ImageConditioningScale
Gets or sets the image conditioning scale.
public double ImageConditioningScale { get; set; }
Property Value
- double
Remarks
Controls how strongly the input image influences the output video. Higher values keep the video closer to the input image.
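A short tuning sketch; the value 0.8 is an illustrative choice, not a documented default:

var model = new VideoCrafterModel<float>();

// Lower values let the text prompt drive more of the motion;
// higher values keep frames visually closer to the input image.
model.ImageConditioningScale = 0.8;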
LatentChannels
Gets the latent channels.
public override int LatentChannels { get; }
Property Value
- int
NoisePredictor
Gets the noise predictor.
public override INoisePredictor<T> NoisePredictor { get; }
Property Value
- INoisePredictor<T>
ParameterCount
Gets the total parameter count.
public override int ParameterCount { get; }
Property Value
- int
SupportsImageToVideo
Gets whether image-to-video is supported.
public override bool SupportsImageToVideo { get; }
Property Value
- bool
SupportsTextToVideo
Gets whether text-to-video is supported.
public override bool SupportsTextToVideo { get; }
Property Value
- bool
SupportsVideoToVideo
Gets whether video-to-video is supported.
public override bool SupportsVideoToVideo { get; }
Property Value
- bool
TemporalVAE
Gets the temporal VAE.
public override IVAEModel<T>? TemporalVAE { get; }
Property Value
- IVAEModel<T>
UseDualConditioning
Gets or sets whether to use dual conditioning (text + image together).
public bool UseDualConditioning { get; set; }
Property Value
- bool
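A brief sketch of enabling the flag before a combined image + text generation, assuming it gates whether both signals are blended; the file name and prompt are placeholders:

var model = new VideoCrafterModel<float>();
model.UseDualConditioning = true; // blend text and image conditioning

var image = LoadImage("rocket.jpg"); // placeholder helper, as in the class examples
var video = model.GenerateFromImageAndText(
    image: image,
    prompt: "rocket lifting off in slow motion");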
VAE
Gets the VAE.
public override IVAEModel<T> VAE { get; }
Property Value
- IVAEModel<T>
Methods
Clone()
Clones this model.
public override IDiffusionModel<T> Clone()
Returns
- IDiffusionModel<T>
DecodeVideoLatents(Tensor<T>)
Decodes video latents using temporal VAE.
protected override Tensor<T> DecodeVideoLatents(Tensor<T> latents)
Parameters
- latents (Tensor<T>)
Returns
- Tensor<T>
DeepCopy()
Creates a deep copy.
public override IFullModel<T, Tensor<T>, Tensor<T>> DeepCopy()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
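A quick illustration of the two copying APIs: Clone returns the model through the diffusion-model interface, while DeepCopy returns an independent copy through the full-model interface:

var model = new VideoCrafterModel<float>();

IDiffusionModel<float> clone = model.Clone();
IFullModel<float, Tensor<float>, Tensor<float>> copy = model.DeepCopy();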
GenerateFromImage(Tensor<T>, int?, int?, int, int?, double, int?)
Generates video from image with optional text guidance.
public override Tensor<T> GenerateFromImage(Tensor<T> inputImage, int? numFrames = null, int? fps = null, int numInferenceSteps = 25, int? motionBucketId = null, double noiseAugStrength = 0.02, int? seed = null)
Parameters
- inputImage (Tensor<T>)
- numFrames (int?)
- fps (int?)
- numInferenceSteps (int)
- motionBucketId (int?)
- noiseAugStrength (double)
- seed (int?)
Returns
- Tensor<T>
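A usage sketch for image-to-video; LoadImage is the same placeholder helper used in the class examples, and the argument values are illustrative:

var model = new VideoCrafterModel<float>();
var inputImage = LoadImage("launchpad.jpg"); // placeholder helper

// noiseAugStrength adds a small amount of noise to the conditioning
// image, which typically encourages more motion in the result.
var video = model.GenerateFromImage(
    inputImage,
    numFrames: 16,
    numInferenceSteps: 25,
    noiseAugStrength: 0.05,
    seed: 42);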
GenerateFromImageAndText(Tensor<T>, string, string?, int?, int, double, double, int?)
Generates video with dual conditioning (image + text).
public Tensor<T> GenerateFromImageAndText(Tensor<T> image, string prompt, string? negativePrompt = null, int? numFrames = null, int numInferenceSteps = 50, double guidanceScale = 7.5, double imageScale = 1, int? seed = null)
Parameters
- image (Tensor<T>): The conditioning image.
- prompt (string): The text prompt for guidance.
- negativePrompt (string): Optional negative prompt.
- numFrames (int?): Number of frames to generate.
- numInferenceSteps (int): Number of denoising steps.
- guidanceScale (double): Text guidance scale.
- imageScale (double): Image conditioning scale.
- seed (int?): Optional random seed.
Returns
- Tensor<T>
Generated video tensor.
Remarks
For Beginners: This method combines the best of both worlds:
- The image provides the visual style and starting point
- The text describes what motion/action should happen
Example (see the code sketch below):
- Image: Photo of a person standing
- Prompt: "person starts dancing energetically"
- Result: Video of that person dancing
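The dancing example above expressed in code; the file name and parameter values are placeholders:

var model = new VideoCrafterModel<float>();
var image = LoadImage("person.jpg"); // placeholder helper

var video = model.GenerateFromImageAndText(
    image: image,
    prompt: "person starts dancing energetically",
    negativePrompt: "blurry, distorted",
    numFrames: 16,
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    imageScale: 1.0,
    seed: 42);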
GenerateFromText(string, string?, int, int, int?, int?, int, double, int?)
Generates video from text prompt.
public override Tensor<T> GenerateFromText(string prompt, string? negativePrompt = null, int width = 1024, int height = 576, int? numFrames = null, int? fps = null, int numInferenceSteps = 50, double guidanceScale = 7.5, int? seed = null)
Parameters
- prompt (string)
- negativePrompt (string)
- width (int)
- height (int)
- numFrames (int?)
- fps (int?)
- numInferenceSteps (int)
- guidanceScale (double)
- seed (int?)
Returns
- Tensor<T>
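A usage sketch that pins the output to the model's default resolution via the public constants; the prompt, negative prompt, and seed are illustrative:

var model = new VideoCrafterModel<float>();

var video = model.GenerateFromText(
    prompt: "A rocket launching into space",
    negativePrompt: "low quality, watermark",
    width: VideoCrafterModel<float>.DefaultWidth,   // 1024
    height: VideoCrafterModel<float>.DefaultHeight, // 576
    numInferenceSteps: 50,
    guidanceScale: 7.5,
    seed: 123);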
GetParameters()
Gets all parameters.
public override Vector<T> GetParameters()
Returns
- Vector<T>
PredictVideoNoise(Tensor<T>, int, Tensor<T>, Tensor<T>)
Predicts video noise for image-to-video.
protected override Tensor<T> PredictVideoNoise(Tensor<T> latents, int timestep, Tensor<T> imageEmbedding, Tensor<T> motionEmbedding)
Parameters
- latents (Tensor<T>)
- timestep (int)
- imageEmbedding (Tensor<T>)
- motionEmbedding (Tensor<T>)
Returns
- Tensor<T>
PredictVideoNoiseWithText(Tensor<T>, int, Tensor<T>)
Predicts video noise with text conditioning.
protected override Tensor<T> PredictVideoNoiseWithText(Tensor<T> latents, int timestep, Tensor<T> textEmbedding)
Parameters
- latents (Tensor<T>)
- timestep (int)
- textEmbedding (Tensor<T>)
Returns
- Tensor<T>
SetParameters(Vector<T>)
Sets all parameters.
public override void SetParameters(Vector<T> parameters)
Parameters
- parameters (Vector<T>)
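A round-trip sketch: snapshot the flattened parameter vector with GetParameters, then restore it later, for example after an experiment that modified the weights:

var model = new VideoCrafterModel<float>();

Vector<float> snapshot = model.GetParameters();
// ... modify or fine-tune the model ...
model.SetParameters(snapshot);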