Class SLoRAAdapter<T>
S-LoRA adapter for scalable serving of thousands of concurrent LoRA adapters.
public class SLoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
LayerBase<T> → LoRAAdapterBase<T> → SLoRAAdapter<T>
- Implements
ILoRAAdapter<T>, ILayer<T>
Remarks
S-LoRA (Scalable LoRA) is a system designed for efficient serving of many LoRA adapters simultaneously. Published in November 2023, it addresses the challenge of deploying thousands of task-specific LoRA adapters in production environments with limited GPU memory.
For Beginners: S-LoRA solves a real-world problem in production AI systems.
The problem:
- You have a large base model (like GPT or LLaMA)
- You want to serve thousands of different LoRA adapters (one per customer, task, or use case)
- Each adapter is small (few MB), but thousands of them won't fit in GPU memory
- Naive approaches either load one adapter at a time (slow) or reserve memory for all adapters up front (wasteful)
S-LoRA's solution:
- Unified memory pool: Dynamically manage adapter weights and cache together
- Batched computation: Process multiple adapters in parallel efficiently
- Adapter clustering: Group adapters by rank for optimized computation
- On-demand loading: Fetch adapters from CPU to GPU memory only when needed
Key features implemented:
- Unified Memory Pool: Single pool for adapter weights (no pre-allocation waste)
- Adapter Clustering: Group adapters by rank for batched computation
- Dynamic Loading: Load adapters on-demand, evict when not needed
- Batched Forward Pass: Process multiple requests with different adapters simultaneously
- Memory Efficiency: Serve 100x more adapters than naive approaches
Research Paper Reference: "S-LoRA: Serving Thousands of Concurrent LoRA Adapters" Ying Sheng, Shiyi Cao, et al. (November 2023) arXiv:2311.03285
Performance (from paper):
- Throughput: up to 4x improvement over vLLM, up to 30x over HuggingFace PEFT
- Adapter capacity: 2,000+ concurrent adapters on single server
- Memory efficiency: 75-90% GPU memory utilization
- Scalability: Superlinear throughput scaling with more GPUs
Example usage:
// Create S-LoRA serving system for base layer
var sloraAdapter = new SLoRAAdapter<double>(baseLayer, rank: 8);
// Register multiple adapters for different tasks
sloraAdapter.RegisterAdapter("customer_1", adapter1);
sloraAdapter.RegisterAdapter("customer_2", adapter2);
sloraAdapter.RegisterAdapter("task_classification", adapter3);
// Process batched requests efficiently
var outputs = sloraAdapter.BatchForward(inputs, adapterIds);
When to use S-LoRA:
- Serving multiple LoRA adapters in production
- Multi-tenant AI systems (one adapter per tenant)
- Task-specific fine-tuning at scale
- Limited GPU memory but many adapters
- Need high throughput with many concurrent users
Differences from standard LoRA:
- Standard LoRA: Single adapter, simple forward/backward pass
- S-LoRA: Multiple adapters, optimized for concurrent serving, memory pooling
Constructors
SLoRAAdapter(ILayer<T>, int, double, int, bool)
Initializes a new S-LoRA adapter for scalable multi-adapter serving.
public SLoRAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, int maxLoadedAdapters = 100, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The base layer to adapt with S-LoRA.
rank (int): The default rank for the primary LoRA decomposition.
alpha (double): The LoRA scaling factor (defaults to rank if negative).
maxLoadedAdapters (int): Maximum number of adapters to keep loaded simultaneously (default: 100).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
For Beginners: This creates an S-LoRA serving system for efficient multi-adapter deployment.
Parameters:
- baseLayer: The shared base model that all adapters modify
- rank: Default rank for new adapters (typical: 8-32)
- alpha: Scaling factor for LoRA contributions
- maxLoadedAdapters: How many adapters to cache in "GPU memory" (100 = good balance)
- freezeBaseLayer: Lock base weights (true for serving, false for continued training)
How S-LoRA works:
- One base model shared across all adapters (memory efficient)
- Thousands of small adapters registered in unified pool
- Only popular adapters kept loaded in fast memory
- Unpopular adapters evicted and loaded on-demand
- Batched computation for multiple adapters simultaneously
Example: Serving 10,000 customer-specific adapters:
- Base model: 7B parameters (14 GB)
- Each adapter: rank 16 (few MB)
- Total pool: 10,000 adapters (few GB in CPU memory)
- Loaded cache: 100 most-used adapters (hundreds of MB in GPU memory)
- Result: Serve 10,000 adapters with GPU memory for 1 base model + 100 adapters!
This is 100x more efficient than loading full fine-tuned models.
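The memory arithmetic behind that claim is easy to sanity-check. The snippet below is a back-of-the-envelope estimate using illustrative numbers (a single 4096x4096 projection, fp16 storage), not output from this class:
// Rough adapter size for one 4096x4096 projection at rank 16 (illustrative numbers only)
int inFeatures = 4096, outFeatures = 4096, rank = 16;
long adapterParams = (long)rank * (inFeatures + outFeatures); // A is rank x in, B is out x rank
double adapterMB = adapterParams * 2 / (1024.0 * 1024.0);     // fp16 = 2 bytes per parameter
Console.WriteLine($"{adapterParams:N0} params ≈ {adapterMB:F2} MB per adapted layer");
// ~131,072 params ≈ 0.25 MB per layer, so a few MB across dozens of adapted layers,
// versus ~14 GB for a full fp16 copy of a 7B-parameter model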
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when maxLoadedAdapters is less than 1.
Properties
LoadedAdapterCount
Gets the number of adapters currently loaded in memory.
public int LoadedAdapterCount { get; }
Property Value
- int
Remarks
This represents the "hot" adapters actively being used or cached. S-LoRA dynamically loads/evicts adapters based on request patterns.
MaxLoadedAdapters
Gets the maximum number of adapters that can be loaded simultaneously.
public int MaxLoadedAdapters { get; }
Property Value
- int
Remarks
This simulates GPU memory constraints. S-LoRA's unified paging mechanism efficiently manages this limited resource.
RankClusterCount
Gets the number of rank clusters for batched computation optimization.
public int RankClusterCount { get; }
Property Value
- int
Remarks
Adapters with the same rank are clustered together for efficient batched computation. This is a key optimization in S-LoRA for heterogeneous adapter serving.
TotalAdapterCount
Gets the total number of registered adapters in the pool.
public int TotalAdapterCount { get; }
Property Value
- int
Remarks
This represents all adapters in the system, including those not currently loaded. S-LoRA can serve thousands of adapters from a unified pool.
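These counters are read-only and cheap to query, so they work well as a quick health check. A minimal sketch (slora is assumed to be an existing SLoRAAdapter<double> instance):
// Poll the read-only counters to see how the pool and cache are doing
Console.WriteLine($"Registered adapters: {slora.TotalAdapterCount}");
Console.WriteLine($"Loaded adapters: {slora.LoadedAdapterCount} / {slora.MaxLoadedAdapters}");
Console.WriteLine($"Rank clusters: {slora.RankClusterCount}");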
Methods
BatchForward(Tensor<T>[], string[])
Performs batched forward pass with multiple adapters simultaneously.
public Tensor<T>[] BatchForward(Tensor<T>[] inputs, string[] adapterIds)
Parameters
inputs (Tensor<T>[]): Array of input tensors.
adapterIds (string[]): Array of adapter IDs corresponding to each input.
Returns
- Tensor<T>[]
Array of output tensors.
Remarks
This method demonstrates S-LoRA's key innovation: efficient batched computation across heterogeneous adapters. Adapters are clustered by rank for optimized computation.
For Beginners: This is S-LoRA's killer feature - processing many requests efficiently!
The problem with naive batching:
- Request 1: Use customer A's adapter (rank 8)
- Request 2: Use customer B's adapter (rank 16)
- Request 3: Use customer C's adapter (rank 8)
- Naive approach: Process one by one (slow) or merge adapters (memory expensive)
S-LoRA's solution:
- Group requests by adapter rank (rank-based clustering)
- Process same-rank adapters in optimized batches
- Use custom kernels for heterogeneous batching
- Minimize memory overhead and maximize throughput
Batching strategy:
- Cluster 1 (rank 8): [customer A, customer C] - batch process together
- Cluster 2 (rank 16): [customer B] - process separately
- Base model: Shared computation for all requests
Performance benefits (from paper):
- 4x throughput vs. non-batched serving
- 30x throughput vs. merging adapters per request
- Near-linear scaling with more concurrent requests
- 75-90% GPU utilization
Example: Multi-tenant API serving
// Batch of 100 requests from different customers
var inputs = new Tensor<double>[100];
var adapterIds = new string[100];
for (int i = 0; i < 100; i++)
{
    inputs[i] = GetCustomerRequest(i);
    adapterIds[i] = $"customer_{GetCustomerId(i)}";
}
// Process entire batch efficiently (S-LoRA magic!)
var outputs = slora.BatchForward(inputs, adapterIds);
This enables high-throughput multi-tenant AI serving!
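The grouping step itself can be pictured as a simple bucket-by-rank pass before any math runs. The sketch below is illustrative only (GetAdapterRank is a hypothetical lookup, not part of this class), and the real batched kernels are considerably more involved:
// Illustrative rank-based clustering: bucket request indices by their adapter's rank
var clusters = new Dictionary<int, List<int>>(); // rank -> indices of requests in the batch
for (int i = 0; i < adapterIds.Length; i++)
{
    int rank = GetAdapterRank(adapterIds[i]);    // hypothetical rank lookup, for illustration
    if (!clusters.TryGetValue(rank, out var bucket))
        clusters[rank] = bucket = new List<int>();
    bucket.Add(i);
}
// Each bucket now holds same-rank requests that can share one batched LoRA pass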
Exceptions
- ArgumentNullException
Thrown when inputs or adapterIds is null.
- ArgumentException
Thrown when array lengths don't match or adapter not found.
ClearAdapters()
Clears all adapters from the pool (useful for testing or reset).
public void ClearAdapters()
Remarks
This method removes all adapters from the unified pool except the primary adapter. Useful for resetting the system or clearing adapters during reconfiguration.
For Beginners: This wipes all registered adapters (except the default one).
Use cases:
- Testing: Reset between test runs
- Maintenance: Clear old adapters no longer in use
- Reconfiguration: Remove all adapters before registering new set
- Memory cleanup: Free memory from unused adapters
Example: Periodic cleanup
// Monthly cleanup of inactive customer adapters
slora.ClearAdapters();
// Re-register only active customers
foreach (var customer in GetActiveCustomers())
{
    var adapter = LoadCustomerAdapter(customer.Id);
    slora.RegisterAdapter(customer.Id, adapter, customer.Rank);
}
Note: Primary adapter is preserved to maintain base functionality.
Forward(Tensor<T>, string)
Performs a forward pass with a specific adapter.
public Tensor<T> Forward(Tensor<T> input, string adapterId = "primary")
Parameters
input (Tensor<T>): Input tensor.
adapterId (string): The ID of the adapter to use (default: "primary").
Returns
- Tensor<T>
Output tensor with adapter applied.
Remarks
This method performs S-LoRA's optimized forward pass with automatic adapter loading and reference tracking.
For Beginners: This runs inference with a specific adapter efficiently.
What happens during forward pass:
- Load adapter if not already cached (automatic on-demand loading)
- Increment reference count (prevent eviction during processing)
- Run base model forward pass
- Run adapter-specific LoRA computation
- Combine base output + adapter output
- Decrement reference count (allow eviction if needed)
Key S-LoRA optimizations simulated:
- Separated base and adapter computation (can batch differently)
- Automatic loading from unified pool
- Reference counting prevents eviction during processing
- LRU access tracking for cache management
Example: Multi-customer request handling
// Request from customer A
var outputA = slora.Forward(inputA, "customer_a");
// Request from customer B (different adapter)
var outputB = slora.Forward(inputB, "customer_b");
// Request from customer A again (adapter still cached)
var outputA2 = slora.Forward(inputA2, "customer_a");
Each customer gets their personalized model behavior efficiently!
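Conceptually, the per-request computation is the standard LoRA combine: output = base(x) + (alpha / rank) * B(A(x)). The following is a minimal sketch of just that combine step on plain arrays, for intuition only; the actual implementation operates on Tensor<T> and batches across adapters:
// Conceptual LoRA combine on plain arrays (illustrative only)
static double[] CombineLoRA(double[] baseOutput, double[] loraOutput, double alpha, int rank)
{
    double scaling = alpha / rank;               // standard LoRA scaling factor
    var combined = new double[baseOutput.Length];
    for (int i = 0; i < combined.Length; i++)
        combined[i] = baseOutput[i] + scaling * loraOutput[i]; // base output + scaled adapter delta
    return combined;
}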
Exceptions
- ArgumentException
Thrown when adapter ID is not found.
GetRankCluster(int)
Gets the list of adapter IDs in a specific rank cluster.
public List<string> GetRankCluster(int rank)
Parameters
rank (int): The rank to query.
Returns
- List<string>
A list of adapter IDs registered with the specified rank.
Remarks
This method provides access to S-LoRA's rank-based clustering information. Adapters with the same rank can be batched together more efficiently.
For Beginners: This shows which adapters can be batched together efficiently.
Why rank clustering matters:
- Adapters with same rank have same computational cost
- Can use same CUDA kernels / computation paths
- Better memory access patterns
- Higher GPU utilization
Example: Analyzing your adapter distribution
var slora = new SLoRAAdapter<double>(baseModel, rank: 8);
// Register many adapters with different ranks
// ...
// See how adapters are distributed
var rank8Adapters = slora.GetRankCluster(8); // Maybe 500 adapters
var rank16Adapters = slora.GetRankCluster(16); // Maybe 300 adapters
var rank32Adapters = slora.GetRankCluster(32); // Maybe 200 adapters
Console.WriteLine($"Rank 8: {rank8Adapters.Count} adapters");
Console.WriteLine($"Rank 16: {rank16Adapters.Count} adapters");
Console.WriteLine($"Rank 32: {rank32Adapters.Count} adapters");
This helps optimize batch sizes and resource allocation!
GetStatistics()
Gets statistics about the current state of the S-LoRA system.
public Dictionary<string, double> GetStatistics()
Returns
- Dictionary<string, double>
Dictionary containing system statistics.
Remarks
This method provides detailed statistics about S-LoRA's memory usage, cache efficiency, and adapter distribution.
For Beginners: This gives you insights into how well your S-LoRA system is performing.
Key metrics returned:
- TotalAdapters: How many adapters registered in pool
- LoadedAdapters: How many currently cached in "GPU memory"
- CacheUtilization: Percentage of cache capacity used
- RankClusters: Number of different rank groups
- AverageRank: Mean rank across all adapters
- ActiveReferences: Adapters currently processing requests
Example: Monitoring production system
var stats = slora.GetStatistics();
Console.WriteLine($"Total adapters: {stats["TotalAdapters"]}");
Console.WriteLine($"Loaded adapters: {stats["LoadedAdapters"]}");
Console.WriteLine($"Cache utilization: {stats["CacheUtilization"]}%");
// Alert if cache too small
if (stats["CacheUtilization"] > 95)
{
    Console.WriteLine("Warning: Cache nearly full, consider increasing maxLoadedAdapters");
}
Use this to tune your S-LoRA configuration for optimal performance!
LoadAdapter(string)
Loads an adapter from the pool into active memory (simulates GPU loading).
public void LoadAdapter(string adapterId)
Parameters
adapterId (string): The ID of the adapter to load.
Remarks
This method simulates S-LoRA's dynamic adapter loading from CPU to GPU memory. If the loaded adapter cache is full, it evicts the least recently used adapter.
For Beginners: This moves an adapter from slow storage to fast cache.
In S-LoRA's architecture:
- CPU memory: All adapters stored here (slow but large capacity)
- GPU memory: Hot adapters cached here (fast but limited capacity)
Loading process:
- Check if adapter already loaded (if yes, update access time and return)
- Check if cache is full (if yes, evict least recently used adapter)
- Load adapter into cache
- Mark as loaded and update access timestamp
LRU eviction policy:
- Adapters with oldest last access time evicted first
- Adapters with active references (in-flight requests) never evicted
- This keeps popular adapters hot in cache
Example: Customer request patterns
Time 0: Customer A requests (load adapter A)
Time 1: Customer B requests (load adapter B)
...
Time 99: Customer Z requests (load adapter Z, cache now full at 100)
Time 100: Customer AA requests (evict least-used, load adapter AA)
Time 101: Customer A requests again (adapter A was evicted, reload)
Popular customers stay cached, inactive ones evicted automatically!
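One practical use of calling LoadAdapter directly is pre-warming the cache before traffic arrives, so the expected hot adapters skip the on-demand loading cost. A minimal sketch (GetTopCustomerIds is a hypothetical helper):
// Pre-warm the cache with the adapters expected to be hottest
foreach (var customerId in GetTopCustomerIds(100)) // hypothetical helper returning up to 100 IDs
{
    slora.LoadAdapter($"customer_{customerId}");
}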
Exceptions
- ArgumentException
Thrown when adapter ID is not found in pool.
MergeToOriginalLayer()
Merges the primary adapter into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with primary LoRA weights merged into the base layer's weights.
Remarks
For S-LoRA, this merges the primary adapter (the one created during initialization). In production S-LoRA deployments, individual adapters typically remain separate for efficient multi-adapter serving rather than being merged.
For Beginners: This merges the default adapter for deployment.
When to merge adapters:
- Deploying a single-adapter model (no longer need multi-adapter serving)
- Want maximum inference speed for one specific adapter
- Converting S-LoRA deployment back to standard model
When NOT to merge:
- Serving multiple adapters (defeats purpose of S-LoRA)
- Need to swap adapters dynamically
- Want memory efficiency of shared base model
S-LoRA's strength is NOT merging:
- Keep base model frozen and shared
- Keep all adapters separate in pool
- Swap adapters per request efficiently
- Serve thousands of adapters from one base model
This method is mainly for compatibility or transitioning away from S-LoRA architecture.
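If you do decide to collapse the primary adapter into the base weights (for example, when shipping a single-adapter build), the call itself is a one-liner. A minimal sketch, assuming sloraAdapter is an existing SLoRAAdapter<double>:
// Merge the primary adapter into the base layer for single-adapter deployment
ILayer<double> mergedLayer = sloraAdapter.MergeToOriginalLayer();
// mergedLayer now carries base + primary adapter weights; no per-request adapter swapping remains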
Exceptions
- InvalidOperationException
Thrown when the base layer type is not supported.
RegisterAdapter(string, LoRALayer<T>, int)
Registers a new adapter in the unified memory pool.
public void RegisterAdapter(string adapterId, LoRALayer<T> loraLayer, int rank)
Parameters
adapterId (string): Unique identifier for this adapter.
loraLayer (LoRALayer<T>): The LoRA layer to register.
rank (int): The rank of this adapter.
Remarks
This method adds a new adapter to S-LoRA's unified memory pool. The adapter is not immediately loaded into GPU memory but is available for on-demand loading when needed.
For Beginners: This is like adding a new customer or task-specific adapter to your system.
What happens when you register an adapter:
- Adapter stored in CPU memory pool (cheap storage)
- Added to rank cluster for batched computation optimization
- Not loaded to GPU yet (only loaded when first used)
- Can register thousands of adapters this way
Example: Multi-tenant SaaS application
var slora = new SLoRAAdapter<double>(baseModel, rank: 8, maxLoadedAdapters: 100);
// Register 1000 customer adapters
for (int i = 0; i < 1000; i++)
{
    var adapter = LoadCustomerAdapter(i);
    slora.RegisterAdapter($"customer_{i}", adapter, rank: 8);
}
// All 1000 adapters registered, but only 100 will be loaded at once
// Popular customers get fast GPU-cached access
// Inactive customers loaded on-demand from CPU pool
This enables serving far more adapters than GPU memory allows!
Exceptions
- ArgumentNullException
Thrown when adapterId or loraLayer is null.
- ArgumentException
Thrown when an adapter with this ID already exists.