Interface IConditioningModule<T>

Namespace
AiDotNet.Interfaces
Assembly
AiDotNet.dll

Interface for conditioning modules that encode various inputs into embeddings for diffusion models.

public interface IConditioningModule<T>

Type Parameters

T

The numeric type used for calculations.

Remarks

Conditioning modules convert various types of input (text, images, audio, etc.) into embedding tensors that guide the diffusion process. They are essential for controlled generation like text-to-image, image-to-image, or style transfer.

For Beginners: A conditioning module is like a "translator" that converts your input (like a text prompt) into a format the diffusion model can understand.

Common types of conditioning:

  1. Text conditioning (CLIP, T5): "A cat sitting on a couch" → embedding vectors
  2. Image conditioning (IP-Adapter): An image → embedding vectors for style/content
  3. Control conditioning (ControlNet): Depth maps, edges, poses → spatial guidance

Why conditioning matters:

  • Without conditioning: Model generates random images
  • With text conditioning: Model generates images matching your description
  • With image conditioning: Model preserves style or content from reference images
  • With control conditioning: Model follows spatial structure (poses, edges, depth)

Different conditioning methods:

  • Cross-attention: Text embeddings attend to image features (most common for text)
  • Addition/Concatenation: Add or concatenate embeddings to the time embedding
  • Spatial: Add control signals directly to features at each resolution
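
For example, a minimal text-to-image flow might look like the sketch below. ClipTextConditioner is a hypothetical implementation of this interface; substitute whatever concrete conditioning module you use.

    // Hypothetical CLIP-based implementation of IConditioningModule<float>.
    IConditioningModule<float> conditioner = new ClipTextConditioner();

    // Convert the prompt to token IDs, then encode into embeddings.
    Tensor<float> tokens = conditioner.Tokenize("A cat sitting on a couch");
    Tensor<float> embeddings = conditioner.Encode(tokens);

    // The embeddings are passed to the diffusion model's cross-attention
    // layers to guide generation toward the prompt.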

Properties

ConditioningType

Gets the type of conditioning this module provides.

ConditioningType ConditioningType { get; }

Property Value

ConditioningType

EmbeddingDimension

Gets the dimension of the output embeddings.

int EmbeddingDimension { get; }

Property Value

int

Remarks

For CLIP text encoders, this is typically 768 or 1024. For T5, this is typically 1024 or 2048. For image encoders, it varies by architecture.

MaxSequenceLength

Gets the maximum sequence length for text input.

int MaxSequenceLength { get; }

Property Value

int

Remarks

For CLIP, this is typically 77 tokens. For T5, this can be much longer (512 or more). Returns 0 for non-text conditioning modules.

ProducesPooledOutput

Gets whether this module produces pooled (global) or sequence embeddings.

bool ProducesPooledOutput { get; }

Property Value

bool

Remarks

  • Pooled: Single vector representing the entire input (e.g., CLIP pooled output)
  • Sequence: Multiple vectors, one per token/patch (e.g., for cross-attention)
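
As a sketch, a caller can branch on this flag to route embeddings to the right place. The two helper calls are illustrative, not part of this library.

    Tensor<float> embeddings = conditioner.Encode(input);

    if (conditioner.ProducesPooledOutput)
    {
        // [batch, embeddingDim]: one global vector, typically added to
        // the time embedding.
        ApplyGlobalConditioning(embeddings);  // illustrative helper
    }
    else
    {
        // [batch, seqLength, embeddingDim]: one vector per token/patch,
        // consumed by cross-attention.
        ApplyCrossAttention(embeddings);      // illustrative helper
    }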

Methods

Encode(Tensor<T>)

Encodes the input into conditioning embeddings.

Tensor<T> Encode(Tensor<T> input)

Parameters

input Tensor<T>

The input tensor (format depends on conditioning type).

Returns

Tensor<T>

The conditioning embeddings.

Remarks

Input format by type:

  • Text: Tokenized text [batch, seqLength]
  • Image: Image tensor [batch, channels, height, width]
  • Audio: Audio tensor [batch, channels, samples]
  • Control: Control signal [batch, channels, height, width]

Output format:

  • Sequence: [batch, seqLength, embeddingDim]
  • Pooled: [batch, embeddingDim]
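
For example, a sketch of encoding a tokenized text prompt (the conditioner instance is assumed to be a text conditioning module):

    // Tokenize a prompt: shape [1, seqLength].
    Tensor<float> tokens = conditioner.Tokenize("A cat sitting on a couch");

    // Encode into conditioning embeddings. For a sequence module the
    // result has shape [1, seqLength, conditioner.EmbeddingDimension];
    // for a pooled module it is [1, conditioner.EmbeddingDimension].
    Tensor<float> embeddings = conditioner.Encode(tokens);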

EncodeText(Tensor<T>, Tensor<T>?)

Encodes text input (convenience method for text conditioning).

Tensor<T> EncodeText(Tensor<T> tokenIds, Tensor<T>? attentionMask = null)

Parameters

tokenIds Tensor<T>

Tokenized text [batch, seqLength].

attentionMask Tensor<T>?

Optional attention mask [batch, seqLength].

Returns

Tensor<T>

Text embeddings for cross-attention.

Remarks

This is the primary method for text conditioning. The attention mask indicates which tokens are real (1) vs padding (0).
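
For example, a sketch (how the mask tensor is built depends on your tokenizer; BuildAttentionMask is an illustrative helper, not part of this library):

    Tensor<float> tokenIds = conditioner.Tokenize("A cat sitting on a couch");

    // Without a mask, padding tokens are treated like real tokens.
    Tensor<float> unmasked = conditioner.EncodeText(tokenIds);

    // With a mask (1 = real token, 0 = padding), padding is ignored.
    Tensor<float>? mask = BuildAttentionMask(tokenIds);  // illustrative helper
    Tensor<float> masked = conditioner.EncodeText(tokenIds, mask);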

GetPooledEmbedding(Tensor<T>)

Gets the pooled (global) embedding from sequence embeddings.

Tensor<T> GetPooledEmbedding(Tensor<T> sequenceEmbeddings)

Parameters

sequenceEmbeddings Tensor<T>

The sequence embeddings [batch, seqLength, dim].

Returns

Tensor<T>

Pooled embeddings [batch, dim].

Remarks

For CLIP, this is typically the EOS token embedding. For other models, it might be mean pooling or a learned pooler.
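
For example, a sketch (reusing a text conditioner and token IDs from the examples above):

    // Sequence embeddings: [batch, seqLength, dim].
    Tensor<float> sequence = conditioner.EncodeText(tokenIds);

    // Collapse to one global vector per batch item: [batch, dim].
    Tensor<float> pooled = conditioner.GetPooledEmbedding(sequence);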

GetUnconditionalEmbedding(int)

Gets unconditional (null) embeddings for classifier-free guidance.

Tensor<T> GetUnconditionalEmbedding(int batchSize)

Parameters

batchSize int

The batch size.

Returns

Tensor<T>

Unconditional embeddings matching the output format.

Remarks

Classifier-free guidance requires running the model with both conditional and unconditional embeddings. This returns the "empty" or "null" conditioning used for the unconditional pass.

For Beginners: This creates a "blank" conditioning that says "generate anything". By comparing model outputs with and without the prompt, we can steer generation more strongly toward the prompt.
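
A sketch of one classifier-free guidance step is shown below. The model.PredictNoise call and the guidance arithmetic are illustrative and depend on your diffusion model and Tensor<T> operations; a guidance scale around 7.5 is a common default.

    Tensor<float> cond = conditioner.EncodeText(tokenIds);
    Tensor<float> uncond = conditioner.GetUnconditionalEmbedding(batchSize: 1);

    // Run the diffusion model twice: with and without the prompt.
    Tensor<float> noiseCond = model.PredictNoise(latents, timestep, cond);     // illustrative
    Tensor<float> noiseUncond = model.PredictNoise(latents, timestep, uncond); // illustrative

    // Guided prediction (conceptually):
    //   noise = noiseUncond + guidanceScale * (noiseCond - noiseUncond)
    float guidanceScale = 7.5f;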

Tokenize(string)

Tokenizes text input (for text conditioning modules).

Tensor<T> Tokenize(string text)

Parameters

text string

The text to tokenize.

Returns

Tensor<T>

Token IDs as a tensor [1, seqLength].

Remarks

Converts text to a sequence of integer token IDs that the encoder understands. Handles padding and truncation to MaxSequenceLength.
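
For example (a sketch; the resulting tensor is ready to pass to Encode or EncodeText):

    // Padded/truncated to MaxSequenceLength: shape [1, seqLength].
    Tensor<float> tokens = conditioner.Tokenize("A cat sitting on a couch");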

TokenizeBatch(string[])

Tokenizes a batch of text inputs.

Tensor<T> TokenizeBatch(string[] texts)

Parameters

texts string[]

The texts to tokenize.

Returns

Tensor<T>

Token IDs as a tensor [batch, seqLength].
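
For example, a sketch of encoding several prompts in one pass:

    string[] prompts =
    {
        "A cat sitting on a couch",
        "A dog playing in the snow"
    };

    // Token IDs: [batch, seqLength].
    Tensor<float> tokenBatch = conditioner.TokenizeBatch(prompts);

    // Embeddings: [batch, seqLength, embeddingDim] for a sequence module.
    Tensor<float> embeddings = conditioner.Encode(tokenBatch);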