Interface IConditioningModule<T>
Namespace: AiDotNet.Interfaces
Assembly: AiDotNet.dll
Interface for conditioning modules that encode various inputs into embeddings for diffusion models.
public interface IConditioningModule<T>
Type Parameters
T: The numeric type used for calculations.
Remarks
Conditioning modules convert various types of input (text, images, audio, etc.) into embedding tensors that guide the diffusion process. They are essential for controlled generation like text-to-image, image-to-image, or style transfer.
For Beginners: A conditioning module is like a "translator" that converts your input (like a text prompt) into a format the diffusion model can understand.
Common types of conditioning:
- Text conditioning (CLIP, T5): "A cat sitting on a couch" → embedding vectors
- Image conditioning (IP-Adapter): An image → embedding vectors for style/content
- Control conditioning (ControlNet): Depth maps, edges, poses → spatial guidance
Why conditioning matters:
- Without conditioning: Model generates random images
- With text conditioning: Model generates images matching your description
- With image conditioning: Model preserves style or content from reference images
- With control conditioning: Model follows spatial structure (poses, edges, depth)
Different conditioning methods:
- Cross-attention: Image features attend to text embeddings (most common for text)
- Addition/Concatenation: Add or concatenate embeddings to the time embedding
- Spatial: Add control signals directly to features at each resolution
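A minimal end-to-end sketch of the text-conditioning flow is shown below. It uses only members of this interface; the helper name EncodePrompt and the choice of float as the numeric type are illustrative, and the using directive for Tensor&lt;T&gt; is assumed to match your project.

```csharp
using AiDotNet.Interfaces;
// (plus the using for Tensor<T>, whose namespace is not shown on this page)

// 'module' can be any IConditioningModule<float> implementation,
// e.g., a CLIP-style text encoder; no concrete class is assumed here.
static Tensor<float> EncodePrompt(IConditioningModule<float> module, string prompt)
{
    // Tokenize: pads/truncates to module.MaxSequenceLength -> [1, seqLength].
    Tensor<float> tokenIds = module.Tokenize(prompt);

    // Sequence embeddings [1, seqLength, module.EmbeddingDimension], ready to be
    // fed to the diffusion model's cross-attention layers at each denoising step.
    return module.EncodeText(tokenIds);
}
```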
Properties
ConditioningType
Gets the type of conditioning this module provides.
ConditioningType ConditioningType { get; }
Property Value
- ConditioningType
EmbeddingDimension
Gets the dimension of the output embeddings.
int EmbeddingDimension { get; }
Property Value
- int
Remarks
For CLIP text encoders, this is typically 768 or 1024. For T5, this is typically 1024 or 2048. For image encoders, it varies by architecture.
MaxSequenceLength
Gets the maximum sequence length for text input.
int MaxSequenceLength { get; }
Property Value
- int
Remarks
For CLIP, this is typically 77 tokens. For T5, this can be much longer (512 or more). Returns 0 for non-text conditioning modules.
ProducesPooledOutput
Gets whether this module produces pooled (global) or sequence embeddings.
bool ProducesPooledOutput { get; }
Property Value
- bool
Remarks
- Pooled: A single vector representing the entire input (e.g., the CLIP pooled output)
- Sequence: Multiple vectors, one per token/patch (e.g., for cross-attention)
Methods
Encode(Tensor<T>)
Encodes the input into conditioning embeddings.
Tensor<T> Encode(Tensor<T> input)
Parameters
input (Tensor&lt;T&gt;): The input tensor (format depends on conditioning type).
Returns
- Tensor<T>
The conditioning embeddings.
Remarks
Input format by type:
- Text: tokenized text [batch, seqLength]
- Image: image tensor [batch, channels, height, width]
- Audio: audio tensor [batch, channels, samples]
- Control: control signal [batch, channels, height, width]
Output format:
- Sequence: [batch, seqLength, embeddingDim]
- Pooled: [batch, embeddingDim]
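A hedged sketch of how these shapes line up in practice (the Condition helper and the float instantiation are illustrative, not part of the API):

```csharp
// 'controlSignal' is assumed to already have the layout this module expects,
// e.g., [batch, channels, height, width] for a depth map or edge map.
static Tensor<float> Condition(IConditioningModule<float> module, Tensor<float> controlSignal)
{
    Tensor<float> embedding = module.Encode(controlSignal);

    // Resulting shape depends on the module:
    //   ProducesPooledOutput == true  -> [batch, EmbeddingDimension]
    //   ProducesPooledOutput == false -> [batch, seqLength, EmbeddingDimension]
    return embedding;
}
```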
EncodeText(Tensor<T>, Tensor<T>?)
Encodes text input (convenience method for text conditioning).
Tensor<T> EncodeText(Tensor<T> tokenIds, Tensor<T>? attentionMask = null)
Parameters
tokenIds (Tensor&lt;T&gt;): Tokenized text [batch, seqLength].
attentionMask (Tensor&lt;T&gt;?): Optional attention mask [batch, seqLength].
Returns
- Tensor<T>
Text embeddings for cross-attention.
Remarks
This is the primary method for text conditioning. The attention mask indicates which tokens are real (1) vs padding (0).
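A short illustrative wrapper showing the mask semantics (tokenIds and the mask are assumed to come from Tokenize or an external tokenizer; nothing here goes beyond the interface itself):

```csharp
static Tensor<float> EncodeWithMask(
    IConditioningModule<float> module, Tensor<float> tokenIds, Tensor<float>? mask)
{
    // Mask entries: 1 = real token, 0 = padding. Passing null (the default)
    // treats every position as a real token.
    return module.EncodeText(tokenIds, mask);
}
```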
GetPooledEmbedding(Tensor<T>)
Gets the pooled (global) embedding from sequence embeddings.
Tensor<T> GetPooledEmbedding(Tensor<T> sequenceEmbeddings)
Parameters
sequenceEmbeddings (Tensor&lt;T&gt;): The sequence embeddings [batch, seqLength, dim].
Returns
- Tensor<T>
Pooled embeddings [batch, dim].
Remarks
For CLIP, this is typically the EOS token embedding. For other models, it might be mean pooling or a learned pooler.
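A short sketch combining this with EncodeText (the PoolPrompt helper is illustrative):

```csharp
static Tensor<float> PoolPrompt(IConditioningModule<float> module, string prompt)
{
    // [1, seqLength, dim] per-token embeddings.
    Tensor<float> sequence = module.EncodeText(module.Tokenize(prompt));

    // [1, dim] global summary vector (EOS token for CLIP-style encoders;
    // mean pooling or a learned pooler for other architectures).
    return module.GetPooledEmbedding(sequence);
}
```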
GetUnconditionalEmbedding(int)
Gets unconditional (null) embeddings for classifier-free guidance.
Tensor<T> GetUnconditionalEmbedding(int batchSize)
Parameters
batchSize (int): The batch size.
Returns
- Tensor<T>
Unconditional embeddings matching the output format.
Remarks
Classifier-free guidance requires running the model with both conditional and unconditional embeddings. This returns the "empty" or "null" conditioning used for the unconditional pass.
For Beginners: This creates a "blank" conditioning that says "generate anything". By comparing model outputs with and without the prompt, we can steer generation more strongly toward the prompt.
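A sketch of the classifier-free guidance pattern follows; the denoiser call and the tensor blend are deliberately left as comments, since they live outside this interface, and guidanceScale is a hypothetical parameter of your pipeline.

```csharp
static (Tensor<float> Cond, Tensor<float> Uncond) PrepareGuidance(
    IConditioningModule<float> module, string prompt)
{
    Tensor<float> cond = module.EncodeText(module.Tokenize(prompt));
    Tensor<float> uncond = module.GetUnconditionalEmbedding(batchSize: 1);

    // At each denoising step the model runs once per embedding, and the
    // two predictions are blended (outside this interface) as:
    //   guided = uncondOut + guidanceScale * (condOut - uncondOut)
    return (cond, uncond);
}
```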
Tokenize(string)
Tokenizes text input (for text conditioning modules).
Tensor<T> Tokenize(string text)
Parameters
text (string): The text to tokenize.
Returns
- Tensor<T>
Token IDs as a tensor [1, seqLength].
Remarks
Converts text to a sequence of integer token IDs that the encoder understands. Handles padding and truncation to MaxSequenceLength.
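For example (module is an assumed IConditioningModule&lt;float&gt; instance):

```csharp
// 'module' is an assumed IConditioningModule<float> instance.
Tensor<float> ids = module.Tokenize("an astronaut riding a horse");
// ids has shape [1, module.MaxSequenceLength], e.g., [1, 77] for CLIP-style encoders.
```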
TokenizeBatch(string[])
Tokenizes a batch of text inputs.
Tensor<T> TokenizeBatch(string[] texts)
Parameters
texts (string[]): The texts to tokenize.
Returns
- Tensor<T>
Token IDs as a tensor [batch, seqLength].
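For example (again with an assumed module instance), batching tokenization and encoding together:

```csharp
string[] prompts = { "a cat sitting on a couch", "a dog in the park" };

// All prompts are padded/truncated to a shared seqLength.
Tensor<float> batchIds = module.TokenizeBatch(prompts);   // [2, seqLength]
Tensor<float> batchEmb = module.EncodeText(batchIds);     // [2, seqLength, dim]
```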