Class InferenceOptimizationConfig

Namespace
AiDotNet.Configuration
Assembly
AiDotNet.dll

Configuration for inference-time optimizations to maximize prediction throughput and efficiency.

public class InferenceOptimizationConfig
Inheritance
object → InferenceOptimizationConfig

Remarks

This configuration controls advanced inference optimizations including KV caching for transformers, request batching for throughput, and speculative decoding for faster autoregressive generation. These optimizations are automatically applied during prediction based on your configuration.

For Beginners: Inference optimization makes your model's predictions faster and more efficient.

Key features:

  • KV Cache: Remembers previous computations in attention layers (2-10x faster for long sequences)
  • Batching: Groups multiple predictions together (higher throughput)
  • Speculative Decoding: Uses a small model to draft tokens, then verifies (1.5-3x faster generation)

Default settings are optimized for most use cases. Simply enable them and let the library handle the rest.

Example:

var config = InferenceOptimizationConfig.Default;

var result = await new AiModelBuilder<double, ...>()
    .ConfigureModel(myModel)
    .ConfigureInferenceOptimizations(config)
    .BuildAsync();

Properties

AdaptiveBatchSize

Gets or sets whether adaptive batch sizing is enabled.

public bool AdaptiveBatchSize { get; set; }

Property Value

bool

True to enable adaptive sizing (default: true).

Remarks

For Beginners: Automatically adjusts batch size based on system load.

When enabled:

  • Low load: Smaller batches for lower latency
  • High load: Larger batches for higher throughput
  • Automatically balances latency vs throughput

AttentionMasking

Gets or sets how attention masking should be applied for optimized attention implementations.

public AttentionMaskingMode AttentionMasking { get; set; }

Property Value

AttentionMaskingMode

Remarks

  • Auto: Applies causal masking for known autoregressive models (e.g., text generation), otherwise no mask.
  • Disabled: Never applies causal masking.
  • Causal: Always applies causal masking (GPT-style).
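
Example (a minimal sketch; the AttentionMaskingMode member names are assumed to match the remarks above): force causal masking for a GPT-style decoder.

var config = InferenceOptimizationConfig.Default;
// Always apply a causal (lower-triangular) mask, as autoregressive decoders require.
config.AttentionMasking = AttentionMaskingMode.Causal;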

BatchTimeoutMs

Gets or sets the maximum time to wait for a batch to fill, in milliseconds.

public int BatchTimeoutMs { get; set; }

Property Value

int

Batch timeout in milliseconds (default: 10ms).

Remarks

For Beginners: How long to wait before processing a partial batch.

Lower values = lower latency but smaller batches. Higher values = larger batches but more waiting.

Default

Gets a default configuration with sensible settings for most use cases.

public static InferenceOptimizationConfig Default { get; }

Property Value

InferenceOptimizationConfig

Remarks

Default settings:

  • KV Cache: Enabled for transformer models, 1GB max size
  • Batching: Enabled with adaptive batch sizing
  • Speculative Decoding: Disabled (requires explicit configuration)
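
Example (a minimal sketch using only properties documented on this page): start from Default and override individual settings.

var config = InferenceOptimizationConfig.Default;
// Keep KV caching and adaptive batching enabled, but shrink the cache for a
// memory-constrained host.
config.KVCacheMaxSizeMB = 512;
config.Validate();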

DraftModelType

Gets or sets the type of draft model to use for speculative decoding.

public DraftModelType DraftModelType { get; set; }

Property Value

DraftModelType

Draft model type (default: NGram).

Remarks

For Beginners: The draft model generates candidate tokens quickly.

Options:

  • NGram: Simple statistical model (fast, no GPU needed)
  • SmallNeural: Smaller companion model (more accurate drafts)

NGram is usually sufficient and has near-zero overhead.

Note: Small neural draft models require an external companion model. In the MVP, the library falls back to NGram when a companion draft model is not available.

EnableBatching

Gets or sets whether request batching is enabled.

public bool EnableBatching { get; set; }

Property Value

bool

True to enable batching (default: true).

Remarks

For Beginners: Batching groups multiple predictions together for efficiency.

Benefits:

  • Higher throughput (more predictions per second)
  • Better GPU utilization
  • Lower per-request latency under load

How it works:

  • Incoming prediction requests are queued
  • When the batch is full OR the timeout is reached, the batch is processed in one pass
  • Results are returned to each caller

Trade-offs:

  • Slight latency increase for single requests (waiting for batch)
  • Significant throughput increase under load
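
Example (a minimal sketch showing how the batching knobs on this page fit together; it assumes the class has a public parameterless constructor, which its settable properties and static presets suggest):

var config = new InferenceOptimizationConfig
{
    EnableBatching = true,
    MaxBatchSize = 64,        // cap on how many requests are grouped per pass
    MinBatchSize = 1,         // process even a lone request once the timeout fires
    BatchTimeoutMs = 5,       // wait at most 5 ms for more requests to arrive
    AdaptiveBatchSize = true  // let the library grow/shrink batches with load
};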

EnableFlashAttention

Gets or sets whether Flash Attention is enabled (when applicable).

public bool EnableFlashAttention { get; set; }

Property Value

bool

Remarks

Flash Attention computes exact attention without materializing the full N×N attention matrix, reducing memory bandwidth pressure and improving throughput for long sequences.

EnableKVCache

Gets or sets whether KV (Key-Value) caching is enabled for attention layers.

public bool EnableKVCache { get; set; }

Property Value

bool

True to enable KV caching (default: true).

Remarks

For Beginners: KV caching speeds up transformer models by remembering previous computations.

How it works:

  • Attention layers compute keys and values for each token
  • Without caching: Recomputes all keys/values for every new token
  • With caching: Stores previous keys/values, only computes for new tokens

Benefits:

  • 2-10x faster for long sequences
  • Essential for autoregressive generation (GPT-style)
  • Minimal memory overhead for huge speedup

When to disable:

  • Memory-constrained environments
  • Very short sequences (overhead exceeds benefit)
  • Non-transformer models (no effect)
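
Example (a minimal sketch combining the KV-cache settings documented on this page; enum member spellings such as KVCachePrecisionMode.FP16 and CacheEvictionPolicy.LRU are assumed from the remarks):

var config = InferenceOptimizationConfig.Default;
config.EnableKVCache = true;                             // reuse keys/values across decoding steps
config.KVCacheMaxSizeMB = 2048;                          // allow up to 2 GB for long sequences
config.KVCachePrecision = KVCachePrecisionMode.FP16;     // halve cache memory
config.KVCacheEvictionPolicy = CacheEvictionPolicy.LRU;  // evict least recently used entries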

EnablePagedKVCache

Gets or sets whether to use a paged KV-cache backend (vLLM-style) for long-context / multi-sequence serving.

public bool EnablePagedKVCache { get; set; }

Property Value

bool

Remarks

When enabled, the system may choose a paged cache implementation that allocates KV memory in fixed-size blocks. This is the industry-standard approach for high-throughput serving where many sequences are active concurrently. Users can disable this to force the traditional contiguous KV-cache.
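
Example (a minimal sketch): enable the paged backend and pick a block size (see PagedKVCacheBlockSize below).

var config = InferenceOptimizationConfig.Default;
config.EnablePagedKVCache = true;   // allocate KV memory in fixed-size blocks
config.PagedKVCacheBlockSize = 16;  // smaller blocks: less fragmentation, more table entries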

EnableSpeculativeDecoding

Gets or sets whether speculative decoding is enabled.

public bool EnableSpeculativeDecoding { get; set; }

Property Value

bool

True to enable speculative decoding (default: false).

Remarks

For Beginners: Speculative decoding speeds up autoregressive generation (GPT-style).

How it works:

  1. A small "draft" model quickly generates candidate tokens
  2. The main model verifies all candidates in one pass
  3. Accepted tokens are kept, rejected ones are regenerated

Benefits:

  • 1.5-3x faster generation for LLMs
  • No quality loss (verification ensures correctness)

Requirements:

  • Autoregressive model (generates tokens sequentially)
  • Draft model must be available (NGram or smaller neural network)

When to disable:

  • Non-autoregressive models
  • Single-pass predictions
  • When draft model overhead exceeds benefit
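
Example (a minimal sketch; the member name DraftModelType.NGram is assumed from the DraftModelType remarks above): enable speculative decoding with the low-overhead NGram draft model.

var config = InferenceOptimizationConfig.Default;
config.EnableSpeculativeDecoding = true;
config.DraftModelType = DraftModelType.NGram;  // cheap statistical draft model, no GPU needed
config.SpeculationDepth = 4;                   // draft 4 tokens per verification pass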

EnableWeightOnlyQuantization

Gets or sets whether weight-only INT8 quantization is enabled for inference.

public bool EnableWeightOnlyQuantization { get; set; }

Property Value

bool

Remarks

Weight-only quantization reduces memory bandwidth and improves cache locality by storing weights in int8 with per-output scaling. Activations remain in FP32/FP16, and accumulation is performed in float.

For Beginners: This makes your model weights smaller so the CPU/GPU can read them faster.

This is disabled by default until validated across more layer types and kernels. When enabled, the optimizer will apply it opportunistically and fall back safely when unsupported.
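
Example (a minimal sketch): opt in explicitly, relying on the per-layer fallback described above.

var config = InferenceOptimizationConfig.Default;
// Store weights in int8 where supported; unsupported layers keep full precision.
config.EnableWeightOnlyQuantization = true;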

HighPerformance

Gets a high-performance configuration optimized for maximum throughput.

public static InferenceOptimizationConfig HighPerformance { get; }

Property Value

InferenceOptimizationConfig

Remarks

All optimizations enabled with aggressive settings:

  • KV Cache: Enabled with 2GB max size
  • Batching: Enabled with larger batch sizes
  • Speculative Decoding: Enabled with NGram draft model
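
Example (a minimal sketch mirroring the class-level example above): pass the preset directly to the builder.

var result = await new AiModelBuilder<double, ...>()
    .ConfigureModel(myModel)
    .ConfigureInferenceOptimizations(InferenceOptimizationConfig.HighPerformance)
    .BuildAsync();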

KVCacheEvictionPolicy

Gets or sets the KV cache eviction policy.

public CacheEvictionPolicy KVCacheEvictionPolicy { get; set; }

Property Value

CacheEvictionPolicy

Cache eviction policy (default: LRU).

KVCacheMaxSizeMB

Gets or sets the maximum KV cache size in megabytes.

public int KVCacheMaxSizeMB { get; set; }

Property Value

int

Maximum cache size in MB (default: 1024 = 1GB).

Remarks

For Beginners: This limits how much memory the KV cache can use.

Guidelines:

  • 512MB: Good for small models or memory-constrained systems
  • 1024MB (default): Balanced for most use cases
  • 2048MB+: For large models or long sequences

When the cache fills up, the least recently used entries are evicted (LRU policy).

KVCachePrecision

Gets or sets the precision used for KV-cache storage.

public KVCachePrecisionMode KVCachePrecision { get; set; }

Property Value

KVCachePrecisionMode

Remarks

Industry-standard serving stores KV-cache in FP16 to halve memory usage and increase cache capacity. The default Auto selects FP16 when KV-cache is enabled and the numeric type supports it.

For Beginners: This setting controls how much memory your model uses during autoregressive inference.

  • FP16: Uses about half the memory (recommended default)
  • FP32: Uses more memory but can be slightly more numerically accurate

Most production systems prefer FP16 KV-cache for capacity and throughput.

KVCacheQuantization

Gets or sets the quantization mode used for KV-cache storage.

public KVCacheQuantizationMode KVCacheQuantization { get; set; }

Property Value

KVCacheQuantizationMode

Remarks

KV-cache quantization can further reduce memory beyond FP16 by storing keys/values in int8 with scaling. This is an opt-in advanced feature because it can introduce small numerical error.

For Beginners:

  • None (default): Store KV-cache in FP16/FP32 depending on KVCachePrecision.
  • Int8: Store KV-cache in 8-bit integers to save memory (advanced).
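
Example (a minimal sketch; the member name KVCacheQuantizationMode.Int8 is assumed from the remarks above):

var config = InferenceOptimizationConfig.Default;
// Advanced: trades a small amount of numerical error for extra cache capacity.
config.KVCacheQuantization = KVCacheQuantizationMode.Int8;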

KVCacheWindowSize

Gets or sets the sliding window size in tokens when UseSlidingWindowKVCache is enabled.

public int KVCacheWindowSize { get; set; }

Property Value

int

Window size in tokens (default: 1024).

MaxBatchSize

Gets or sets the maximum batch size for grouped predictions.

public int MaxBatchSize { get; set; }

Property Value

int

Maximum batch size (default: 32).

Remarks

For Beginners: How many predictions to group together.

Guidelines:

  • 8-16: Good for memory-constrained systems
  • 32 (default): Balanced for most cases
  • 64+: For high-throughput GPU inference

Larger batches = better throughput but more memory.

MinBatchSize

Gets or sets the minimum batch size before processing.

public int MinBatchSize { get; set; }

Property Value

int

Minimum batch size (default: 1).

PagedKVCacheBlockSize

Gets or sets the block size (in tokens) for the paged KV-cache when enabled.

public int PagedKVCacheBlockSize { get; set; }

Property Value

int

Remarks

Common values are 16 or 32. Smaller blocks reduce internal fragmentation; larger blocks reduce table overhead.

SpeculationDepth

Gets or sets the speculation depth (number of tokens to draft ahead).

public int SpeculationDepth { get; set; }

Property Value

int

Speculation depth (default: 4).

Remarks

For Beginners: How many tokens the draft model predicts at once.

Guidelines:

  • 3-4: Conservative, high acceptance rate (default: 4)
  • 5-6: Balanced
  • 7+: Aggressive, may have more rejections

Higher depth = more speedup potential but more wasted work on rejections.

SpeculationPolicy

Gets or sets the policy for when speculative decoding should run.

public SpeculationPolicy SpeculationPolicy { get; set; }

Property Value

SpeculationPolicy

Remarks

Auto is recommended: it can back off speculative decoding under high load (e.g., large batches) to avoid throughput regressions, while still enabling it for latency-sensitive scenarios.

SpeculativeMethod

Gets or sets the speculative decoding method.

public SpeculativeMethod SpeculativeMethod { get; set; }

Property Value

SpeculativeMethod

Remarks

The default Auto currently selects ClassicDraftModel.

For Beginners: This chooses the "style" of speculative decoding.

UseSlidingWindowKVCache

Gets or sets whether to use a sliding window KV-cache for long contexts.

public bool UseSlidingWindowKVCache { get; set; }

Property Value

bool

Remarks

When enabled, only the most recent KVCacheWindowSize tokens are kept. This is a common industry approach for long-context serving to cap memory usage.
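
Example (a minimal sketch): cap cache memory by keeping only a recent window of tokens.

var config = InferenceOptimizationConfig.Default;
config.UseSlidingWindowKVCache = true;  // keep only the most recent tokens in cache
config.KVCacheWindowSize = 2048;        // retain the last 2048 tokens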

UseTreeSpeculation

Gets or sets whether to use tree-structured speculation.

public bool UseTreeSpeculation { get; set; }

Property Value

bool

True to enable tree speculation (default: false).

Remarks

For Beginners: Tree speculation generates multiple candidate sequences in parallel.

Instead of one sequence of draft tokens, it generates a tree of possibilities. This can improve the acceptance rate but uses more memory.
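
Example (a minimal sketch): tree speculation builds on speculative decoding, so both flags are set together.

var config = InferenceOptimizationConfig.Default;
config.EnableSpeculativeDecoding = true;
config.UseTreeSpeculation = true;  // explore multiple draft branches per verification pass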

Methods

Validate()

Validates the configuration and throws if any values are invalid.

public void Validate()

Remarks

For Beginners: Call this method to ensure your configuration is valid before use.

Validation rules:

  • KVCacheMaxSizeMB must be positive
  • MaxBatchSize must be positive
  • MinBatchSize must be positive and not exceed MaxBatchSize
  • BatchTimeoutMs must be non-negative
  • SpeculationDepth must be non-negative

Exceptions

InvalidOperationException

Thrown when configuration values are invalid.
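
Example (a minimal sketch): validate a hand-tuned configuration before building the model.

var config = InferenceOptimizationConfig.Default;
config.MinBatchSize = 8;
config.MaxBatchSize = 4;  // invalid: MinBatchSize exceeds MaxBatchSize

try
{
    config.Validate();
}
catch (InvalidOperationException ex)
{
    Console.WriteLine($"Invalid inference configuration: {ex.Message}");
}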