Class InferenceOptimizationConfig
- Namespace
- AiDotNet.Configuration
- Assembly
- AiDotNet.dll
Configuration for inference-time optimizations to maximize prediction throughput and efficiency.
public class InferenceOptimizationConfig
- Inheritance
- object → InferenceOptimizationConfig
Remarks
This configuration controls advanced inference optimizations including KV caching for transformers, request batching for throughput, and speculative decoding for faster autoregressive generation. These optimizations are automatically applied during prediction based on your configuration.
For Beginners: Inference optimization makes your model's predictions faster and more efficient.
Key features:
- KV Cache: Remembers previous computations in attention layers (2-10x faster for long sequences)
- Batching: Groups multiple predictions together (higher throughput)
- Speculative Decoding: Uses a small model to draft tokens, then verifies (1.5-3x faster generation)
Default settings are optimized for most use cases. Simply enable and let the library handle the rest.
Example:
var config = InferenceOptimizationConfig.Default;
var result = await new AiModelBuilder<double, ...>()
    .ConfigureModel(myModel)
    .ConfigureInferenceOptimizations(config)
    .BuildAsync();
Properties
AdaptiveBatchSize
Gets or sets whether adaptive batch sizing is enabled.
public bool AdaptiveBatchSize { get; set; }
Property Value
- bool
True to enable adaptive sizing (default: true).
Remarks
For Beginners: Automatically adjusts batch size based on system load.
When enabled:
- Low load: Smaller batches for lower latency
- High load: Larger batches for higher throughput
- Automatically balances latency vs throughput
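Example (a minimal sketch using only the batching properties documented on this page):
var config = InferenceOptimizationConfig.Default;
config.AdaptiveBatchSize = true; // let the library choose a size between MinBatchSize and MaxBatchSize
config.MinBatchSize = 1;         // don't hold single requests when load is low
config.MaxBatchSize = 32;        // upper bound when load is high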
AttentionMasking
Gets or sets how attention masking should be applied for optimized attention implementations.
public AttentionMaskingMode AttentionMasking { get; set; }
Property Value
- AttentionMaskingMode
Remarks
- Auto: Applies causal masking for known autoregressive models (e.g., text generation), otherwise no mask.
- Disabled: Never applies causal masking.
- Causal: Always applies causal masking (GPT-style).
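Example (a sketch assuming the AttentionMaskingMode members listed above):
var config = InferenceOptimizationConfig.Default;
// Force GPT-style causal masking instead of relying on Auto detection.
config.AttentionMasking = AttentionMaskingMode.Causal;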
BatchTimeoutMs
Gets or sets the maximum time, in milliseconds, to wait for a batch to fill.
public int BatchTimeoutMs { get; set; }
Property Value
- int
Batch timeout in milliseconds (default: 10ms).
Remarks
For Beginners: How long to wait before processing a partial batch.
Lower values = lower latency but smaller batches. Higher values = larger batches but more waiting.
Default
Gets a default configuration with sensible settings for most use cases.
public static InferenceOptimizationConfig Default { get; }
Property Value
- InferenceOptimizationConfig
Remarks
Default settings:
- KV Cache: Enabled for transformer models, 1GB max size
- Batching: Enabled with adaptive batch sizing
- Speculative Decoding: Disabled (requires explicit configuration)
DraftModelType
Gets or sets the type of draft model to use for speculative decoding.
public DraftModelType DraftModelType { get; set; }
Property Value
- DraftModelType
Draft model type (default: NGram).
Remarks
For Beginners: The draft model generates candidate tokens quickly.
Options:
- NGram: Simple statistical model (fast, no GPU needed)
- SmallNeural: Smaller companion model (more accurate drafts)
NGram is usually sufficient and has near-zero overhead.
Note: Small neural draft models require an external companion model. In the MVP, the library falls back to NGram when a companion draft model is not available.
EnableBatching
Gets or sets whether request batching is enabled.
public bool EnableBatching { get; set; }
Property Value
- bool
True to enable batching (default: true).
Remarks
For Beginners: Batching groups multiple predictions together for efficiency.
Benefits:
- Higher throughput (more predictions per second)
- Better GPU utilization
- Lower per-request latency under load
How it works:
- Incoming prediction requests are queued
- When the batch is full or the timeout is reached, the whole batch is processed together
- Results are returned to each caller
Trade-offs:
- Slight latency increase for single requests (waiting for batch)
- Significant throughput increase under load
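Example (a throughput-oriented sketch using only the batching properties documented on this page):
var config = InferenceOptimizationConfig.Default;
config.EnableBatching = true;
config.MaxBatchSize = 64;   // group up to 64 requests per pass for GPU throughput
config.BatchTimeoutMs = 5;  // process a partial batch after 5 ms to bound added latency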
EnableFlashAttention
Gets or sets whether Flash Attention is enabled (when applicable).
public bool EnableFlashAttention { get; set; }
Property Value
- bool
Remarks
Flash Attention computes exact attention without materializing the full N×N attention matrix, reducing memory bandwidth pressure and improving throughput for long sequences.
EnableKVCache
Gets or sets whether KV (Key-Value) caching is enabled for attention layers.
public bool EnableKVCache { get; set; }
Property Value
- bool
True to enable KV caching (default: true).
Remarks
For Beginners: KV caching speeds up transformer models by remembering previous computations.
How it works:
- Attention layers compute keys and values for each token
- Without caching: Recomputes all keys/values for every new token
- With caching: Stores previous keys/values, only computes for new tokens
Benefits:
- 2-10x faster for long sequences
- Essential for autoregressive generation (GPT-style)
- Minimal memory overhead for huge speedup
When to disable:
- Memory-constrained environments
- Very short sequences (overhead exceeds benefit)
- Non-transformer models (no effect)
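Example (a memory-constrained sketch; the CacheEvictionPolicy member name is assumed from the default noted under KVCacheEvictionPolicy):
var config = InferenceOptimizationConfig.Default;
config.EnableKVCache = true;
config.KVCacheMaxSizeMB = 512;                          // cap the cache at 512 MB
config.KVCacheEvictionPolicy = CacheEvictionPolicy.LRU; // evict least recently used entries first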
EnablePagedKVCache
Gets or sets whether to use a paged KV-cache backend (vLLM-style) for long-context / multi-sequence serving.
public bool EnablePagedKVCache { get; set; }
Property Value
- bool
Remarks
When enabled, the system may choose a paged cache implementation that allocates KV memory in fixed-size blocks. This is the industry-standard approach for high-throughput serving where many sequences are active concurrently. Users can disable this to force the traditional contiguous KV-cache.
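Example (a sketch for serving many concurrent sequences; block size values follow the guidance under PagedKVCacheBlockSize):
var config = InferenceOptimizationConfig.Default;
config.EnablePagedKVCache = true;
config.PagedKVCacheBlockSize = 16; // smaller blocks reduce internal fragmentation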
EnableSpeculativeDecoding
Gets or sets whether speculative decoding is enabled.
public bool EnableSpeculativeDecoding { get; set; }
Property Value
- bool
True to enable speculative decoding (default: false).
Remarks
For Beginners: Speculative decoding speeds up autoregressive generation (GPT-style).
How it works:
- A small "draft" model quickly generates candidate tokens
- The main model verifies all candidates in one pass
- Accepted tokens are kept, rejected ones are regenerated
Benefits:
- 1.5-3x faster generation for LLMs
- No quality loss (verification ensures correctness)
Requirements:
- Autoregressive model (generates tokens sequentially)
- Draft model must be available (NGram or smaller neural network)
When to disable:
- Non-autoregressive models
- Single-pass predictions
- When draft model overhead exceeds benefit
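Example (a sketch enabling speculative decoding with the NGram draft model described under DraftModelType):
var config = InferenceOptimizationConfig.Default;
config.EnableSpeculativeDecoding = true;
config.DraftModelType = DraftModelType.NGram; // statistical draft model, no GPU needed
config.SpeculationDepth = 4;                  // draft 4 tokens per verification pass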
EnableWeightOnlyQuantization
Gets or sets whether weight-only INT8 quantization is enabled for inference.
public bool EnableWeightOnlyQuantization { get; set; }
Property Value
- bool
True to enable weight-only quantization (default: false).
Remarks
Weight-only quantization reduces memory bandwidth and improves cache locality by storing weights in int8 with per-output scaling. Activations remain in FP32/FP16, and accumulation is performed in float.
For Beginners: This makes your model weights smaller so the CPU/GPU can read them faster.
This is disabled by default until validated across more layer types and kernels. When enabled, the optimizer will apply it opportunistically and fall back safely when unsupported.
HighPerformance
Gets a high-performance configuration optimized for maximum throughput.
public static InferenceOptimizationConfig HighPerformance { get; }
Property Value
- InferenceOptimizationConfig
Remarks
All optimizations enabled with aggressive settings:
- KV Cache: Enabled with 2GB max size
- Batching: Enabled with larger batch sizes
- Speculative Decoding: Enabled with NGram draft model
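Example (a sketch that starts from the preset and adjusts it; assumes the returned instance can be modified before use):
var config = InferenceOptimizationConfig.HighPerformance;
config.EnableSpeculativeDecoding = false; // keep aggressive caching and batching, skip speculation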
KVCacheEvictionPolicy
Gets or sets the KV cache eviction policy.
public CacheEvictionPolicy KVCacheEvictionPolicy { get; set; }
Property Value
- CacheEvictionPolicy
Cache eviction policy (default: LRU).
KVCacheMaxSizeMB
Gets or sets the maximum KV cache size in megabytes.
public int KVCacheMaxSizeMB { get; set; }
Property Value
- int
Maximum cache size in MB (default: 1024 = 1GB).
Remarks
For Beginners: This limits how much memory the KV cache can use.
Guidelines:
- 512MB: Good for small models or memory-constrained systems
- 1024MB (default): Balanced for most use cases
- 2048MB+: For large models or long sequences
When the cache fills up, the least recently used entries are evicted first (default LRU policy).
KVCachePrecision
Gets or sets the precision used for KV-cache storage.
public KVCachePrecisionMode KVCachePrecision { get; set; }
Property Value
- KVCachePrecisionMode
Precision mode (default: Auto).
Remarks
Industry-standard serving stores KV-cache in FP16 to halve memory usage and increase cache capacity. The default Auto selects FP16 when KV-cache is enabled and the numeric type supports it.
For Beginners: This setting controls how much memory your model uses during autoregressive inference.
- FP16: Uses about half the memory (recommended default)
- FP32: Uses more memory but can be slightly more numerically accurate
Most production systems prefer FP16 KV-cache for capacity and throughput.
KVCacheQuantization
Gets or sets the quantization mode used for KV-cache storage.
public KVCacheQuantizationMode KVCacheQuantization { get; set; }
Property Value
- KVCacheQuantizationMode
Quantization mode (default: None).
Remarks
KV-cache quantization can further reduce memory beyond FP16 by storing keys/values in int8 with scaling. This is an opt-in advanced feature because it can introduce small numerical error.
For Beginners:
- None (default): Store KV-cache in FP16/FP32 depending on KVCachePrecision.
- Int8: Store KV-cache in 8-bit integers to save memory (advanced).
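Example (a sketch combining the precision and quantization settings; enum member names follow the remarks above):
var config = InferenceOptimizationConfig.Default;
config.KVCachePrecision = KVCachePrecisionMode.FP16;       // roughly halves KV-cache memory vs FP32
config.KVCacheQuantization = KVCacheQuantizationMode.None; // no int8 quantization on top of FP16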
KVCacheWindowSize
Gets or sets the sliding window size in tokens when UseSlidingWindowKVCache is enabled.
public int KVCacheWindowSize { get; set; }
Property Value
- int
Window size in tokens (default: 1024).
MaxBatchSize
Gets or sets the maximum batch size for grouped predictions.
public int MaxBatchSize { get; set; }
Property Value
- int
Maximum batch size (default: 32).
Remarks
For Beginners: How many predictions to group together.
Guidelines:
- 8-16: Good for memory-constrained systems
- 32 (default): Balanced for most cases
- 64+: For high-throughput GPU inference
Larger batches = better throughput but more memory.
MinBatchSize
Gets or sets the minimum batch size before processing.
public int MinBatchSize { get; set; }
Property Value
- int
Minimum batch size (default: 1).
PagedKVCacheBlockSize
Gets or sets the block size (in tokens) for the paged KV-cache when enabled.
public int PagedKVCacheBlockSize { get; set; }
Property Value
- int
Remarks
Common values are 16 or 32. Smaller blocks reduce internal fragmentation; larger blocks reduce table overhead.
SpeculationDepth
Gets or sets the speculation depth (number of tokens to draft ahead).
public int SpeculationDepth { get; set; }
Property Value
- int
Speculation depth (default: 4).
Remarks
For Beginners: How many tokens the draft model predicts at once.
Guidelines:
- 3-4: Conservative, high acceptance rate (default: 4)
- 5-6: Balanced
- 7+: Aggressive, may have more rejections
Higher depth = more speedup potential but more wasted work on rejections.
SpeculationPolicy
Gets or sets the policy for when speculative decoding should run.
public SpeculationPolicy SpeculationPolicy { get; set; }
Property Value
- SpeculationPolicy
Remarks
Auto is recommended: it can back off speculative decoding under high load (e.g., large batches) to avoid throughput regressions, while still enabling it for latency-sensitive scenarios.
SpeculativeMethod
Gets or sets the speculative decoding method.
public SpeculativeMethod SpeculativeMethod { get; set; }
Property Value
- SpeculativeMethod
Remarks
The default Auto currently selects ClassicDraftModel.
For Beginners: This chooses the "style" of speculative decoding.
UseSlidingWindowKVCache
Gets or sets whether to use a sliding window KV-cache for long contexts.
public bool UseSlidingWindowKVCache { get; set; }
Property Value
- bool
Remarks
When enabled, only the most recent KVCacheWindowSize tokens are kept. This is a common industry approach for long-context serving to cap memory usage.
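Example (a sketch capping cache growth for long-context serving):
var config = InferenceOptimizationConfig.Default;
config.UseSlidingWindowKVCache = true;
config.KVCacheWindowSize = 1024; // keep only the most recent 1024 tokens in the cache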
UseTreeSpeculation
Gets or sets whether to use tree-structured speculation.
public bool UseTreeSpeculation { get; set; }
Property Value
- bool
True to enable tree speculation (default: false).
Remarks
For Beginners: Tree speculation generates multiple candidate sequences in parallel.
Instead of a single sequence of draft tokens, it generates a tree of possibilities. This can improve the acceptance rate but uses more memory.
Methods
Validate()
Validates the configuration and throws if any values are invalid.
public void Validate()
Remarks
For Beginners: Call this method to ensure your configuration is valid before use.
Validation rules:
- KVCacheMaxSizeMB must be positive
- MaxBatchSize must be positive
- MinBatchSize must be positive and not exceed MaxBatchSize
- BatchTimeoutMs must be non-negative
- SpeculationDepth must be non-negative
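Example (a sketch that validates a hand-built configuration before use):
var config = InferenceOptimizationConfig.Default;
config.MinBatchSize = 8;
config.MaxBatchSize = 4; // invalid: MinBatchSize exceeds MaxBatchSize
try
{
    config.Validate();
}
catch (InvalidOperationException ex)
{
    Console.WriteLine($"Invalid inference configuration: {ex.Message}");
}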
Exceptions
- InvalidOperationException
Thrown when configuration values are invalid.