Class SLoRAAdapter<T>
S-LoRA adapter for scalable serving of thousands of concurrent LoRA adapters.
public class SLoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>
Type Parameters
T: The numeric type used for calculations, typically float or double.
- Inheritance
LayerBase<T> → LoRAAdapterBase<T> → SLoRAAdapter<T>
- Implements
ILoRAAdapter<T>, ILayer<T>
Remarks
S-LoRA (Scalable LoRA) is a system designed for efficient serving of many LoRA adapters simultaneously. Published in November 2023, it addresses the challenge of deploying thousands of task-specific LoRA adapters in production environments with limited GPU memory.
For Beginners: S-LoRA solves a real-world problem in production AI systems.
The problem:
- You have a large base model (like GPT or LLaMA)
- You want to serve thousands of different LoRA adapters (one per customer, task, or use case)
- Each adapter is small (few MB), but thousands of them won't fit in GPU memory
- Naive approaches either load one adapter at a time (slow) or reserve memory for all adapters up front (wasteful)
S-LoRA's solution:
- Unified memory pool: Dynamically manage adapter weights and cache together
- Batched computation: Process multiple adapters in parallel efficiently
- Adapter clustering: Group adapters by rank for optimized computation
- On-demand loading: Fetch adapters from CPU to GPU memory only when needed
Key features implemented:
- Unified Memory Pool: Single pool for adapter weights (no pre-allocation waste)
- Adapter Clustering: Group adapters by rank for batched computation
- Dynamic Loading: Load adapters on-demand, evict when not needed
- Batched Forward Pass: Process multiple requests with different adapters simultaneously
- Memory Efficiency: Serve 100x more adapters than naive approaches
Research Paper Reference: "S-LoRA: Serving Thousands of Concurrent LoRA Adapters" Ying Sheng, Shiyi Cao, et al. (November 2023) arXiv:2311.03285
Performance (from paper):
- Throughput: up to 4x improvement over vLLM, up to 30x over HuggingFace PEFT
- Adapter capacity: 2,000+ concurrent adapters on single server
- Memory efficiency: 75-90% GPU memory utilization
- Scalability: Superlinear throughput scaling with more GPUs
Example usage:
// Create S-LoRA serving system for base layer
var sloraAdapter = new SLoRAAdapter<double>(baseLayer, rank: 8);
// Register multiple adapters for different tasks
sloraAdapter.RegisterAdapter("customer_1", adapter1);
sloraAdapter.RegisterAdapter("customer_2", adapter2);
sloraAdapter.RegisterAdapter("task_classification", adapter3);
// Process batched requests efficiently
var outputs = sloraAdapter.BatchForward(inputs, adapterIds);
When to use S-LoRA:
- Serving multiple LoRA adapters in production
- Multi-tenant AI systems (one adapter per tenant)
- Task-specific fine-tuning at scale
- Limited GPU memory but many adapters
- Need high throughput with many concurrent users
Differences from standard LoRA:
- Standard LoRA: Single adapter, simple forward/backward pass
- S-LoRA: Multiple adapters, optimized for concurrent serving, memory pooling
Constructors
SLoRAAdapter(ILayer<T>, int, double, int, bool)
Initializes a new S-LoRA adapter for scalable multi-adapter serving.
public SLoRAAdapter(ILayer<T> baseLayer, int rank, double alpha = -1, int maxLoadedAdapters = 100, bool freezeBaseLayer = true)
Parameters
baseLayer (ILayer<T>): The base layer to adapt with S-LoRA.
rank (int): The default rank for the primary LoRA decomposition.
alpha (double): The LoRA scaling factor (defaults to rank if negative).
maxLoadedAdapters (int): Maximum number of adapters to keep loaded simultaneously (default: 100).
freezeBaseLayer (bool): Whether to freeze the base layer's parameters during training.
Remarks
For Beginners: This creates an S-LoRA serving system for efficient multi-adapter deployment.
Parameters:
- baseLayer: The shared base model that all adapters modify
- rank: Default rank for new adapters (typical: 8-32)
- alpha: Scaling factor for LoRA contributions
- maxLoadedAdapters: How many adapters to cache in "GPU memory" (100 = good balance)
- freezeBaseLayer: Lock base weights (true for serving, false for continued training)
How S-LoRA works:
- One base model shared across all adapters (memory efficient)
- Thousands of small adapters registered in unified pool
- Only popular adapters kept loaded in fast memory
- Unpopular adapters evicted and loaded on-demand
- Batched computation for multiple adapters simultaneously
Example: Serving 10,000 customer-specific adapters:
- Base model: 7B parameters (14 GB)
- Each adapter: rank 16 (few MB)
- Total pool: 10,000 adapters (few GB in CPU memory)
- Loaded cache: 100 most-used adapters (hundreds of MB in GPU memory)
- Result: Serve 10,000 adapters with GPU memory for 1 base model + 100 adapters!
This is 100x more efficient than loading full fine-tuned models.
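The memory arithmetic behind that claim is easy to sanity-check. The snippet below is a back-of-the-envelope estimate using illustrative numbers (a single 4096x4096 projection, fp16 storage), not output from this class:
// Rough adapter size for one 4096x4096 projection at rank 16 (illustrative numbers only)
int inFeatures = 4096, outFeatures = 4096, rank = 16;
long adapterParams = (long)rank * (inFeatures + outFeatures); // A is rank x in, B is out x rank
double adapterMB = adapterParams * 2 / (1024.0 * 1024.0);     // fp16 = 2 bytes per parameter
Console.WriteLine($"{adapterParams:N0} params ≈ {adapterMB:F2} MB per adapted layer");
// ~131,072 params ≈ 0.25 MB per layer, so a few MB across dozens of adapted layers,
// versus ~14 GB for a full fp16 copy of a 7B-parameter model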
Exceptions
- ArgumentNullException
Thrown when baseLayer is null.
- ArgumentException
Thrown when maxLoadedAdapters is less than 1.
Properties
LoadedAdapterCount
Gets the number of adapters currently loaded in memory.
public int LoadedAdapterCount { get; }
Property Value
- int
Remarks
This represents the "hot" adapters actively being used or cached. S-LoRA dynamically loads/evicts adapters based on request patterns.
MaxLoadedAdapters
Gets the maximum number of adapters that can be loaded simultaneously.
public int MaxLoadedAdapters { get; }
Property Value
- int
Remarks
This simulates GPU memory constraints. S-LoRA's unified paging mechanism efficiently manages this limited resource.
RankClusterCount
Gets the number of rank clusters for batched computation optimization.
public int RankClusterCount { get; }
Property Value
- int
Remarks
Adapters with the same rank are clustered together for efficient batched computation. This is a key optimization in S-LoRA for heterogeneous adapter serving.
TotalAdapterCount
Gets the total number of registered adapters in the pool.
public int TotalAdapterCount { get; }
Property Value
- int
Remarks
This represents all adapters in the system, including those not currently loaded. S-LoRA can serve thousands of adapters from a unified pool.
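These counters are read-only and cheap to query, so they work well as a quick health check. A minimal sketch (slora is assumed to be an existing SLoRAAdapter<double> instance):
// Poll the read-only counters to see how the pool and cache are doing
Console.WriteLine($"Registered adapters: {slora.TotalAdapterCount}");
Console.WriteLine($"Loaded adapters: {slora.LoadedAdapterCount} / {slora.MaxLoadedAdapters}");
Console.WriteLine($"Rank clusters: {slora.RankClusterCount}");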
Methods
BatchForward(Tensor<T>[], string[])
Performs batched forward pass with multiple adapters simultaneously.
public Tensor<T>[] BatchForward(Tensor<T>[] inputs, string[] adapterIds)
Parameters
inputs (Tensor<T>[]): Array of input tensors.
adapterIds (string[]): Array of adapter IDs corresponding to each input.
Returns
- Tensor<T>[]
Array of output tensors.
Remarks
This method demonstrates S-LoRA's key innovation: efficient batched computation across heterogeneous adapters. Adapters are clustered by rank for optimized computation.
For Beginners: This is S-LoRA's killer feature - processing many requests efficiently!
The problem with naive batching:
- Request 1: Use customer A's adapter (rank 8)
- Request 2: Use customer B's adapter (rank 16)
- Request 3: Use customer C's adapter (rank 8)
- Naive approach: Process one by one (slow) or merge adapters (memory expensive)
S-LoRA's solution:
- Group requests by adapter rank (rank-based clustering)
- Process same-rank adapters in optimized batches
- Use custom kernels for heterogeneous batching
- Minimize memory overhead and maximize throughput
Batching strategy:
- Cluster 1 (rank 8): [customer A, customer C] - batch process together
- Cluster 2 (rank 16): [customer B] - process separately
- Base model: Shared computation for all requests
Performance benefits (from paper):
- 4x throughput vs. non-batched serving
- 30x throughput vs. merging adapters per request
- Near-linear scaling with more concurrent requests
- 75-90% GPU utilization
Example: Multi-tenant API serving
// Batch of 100 requests from different customers
var inputs = new Tensor<double>[100];
var adapterIds = new string[100];
for (int i = 0; i < 100; i++)
{
    inputs[i] = GetCustomerRequest(i);
    adapterIds[i] = $"customer_{GetCustomerId(i)}";
}
// Process entire batch efficiently (S-LoRA magic!)
var outputs = slora.BatchForward(inputs, adapterIds);
This enables high-throughput multi-tenant AI serving!
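The grouping step itself can be pictured as a simple bucket-by-rank pass before any math runs. The sketch below is illustrative only (GetAdapterRank is a hypothetical lookup, not part of this class), and the real batched kernels are considerably more involved:
// Illustrative rank-based clustering: bucket request indices by their adapter's rank
var clusters = new Dictionary<int, List<int>>(); // rank -> indices of requests in the batch
for (int i = 0; i < adapterIds.Length; i++)
{
    int rank = GetAdapterRank(adapterIds[i]);    // hypothetical rank lookup, for illustration
    if (!clusters.TryGetValue(rank, out var bucket))
        clusters[rank] = bucket = new List<int>();
    bucket.Add(i);
}
// Each bucket now holds same-rank requests that can share one batched LoRA pass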
Exceptions
- ArgumentNullException
Thrown when inputs or adapterIds is null.
- ArgumentException
Thrown when array lengths don't match or adapter not found.
ClearAdapters()
Clears all adapters from the pool (useful for testing or reset).
public void ClearAdapters()
Remarks
This method removes all adapters from the unified pool except the primary adapter. Useful for resetting the system or clearing adapters during reconfiguration.
For Beginners: This wipes all registered adapters (except the default one).
Use cases:
- Testing: Reset between test runs
- Maintenance: Clear old adapters no longer in use
- Reconfiguration: Remove all adapters before registering new set
- Memory cleanup: Free memory from unused adapters
Example: Periodic cleanup
// Monthly cleanup of inactive customer adapters
slora.ClearAdapters();
// Re-register only active customers
foreach (var customer in GetActiveCustomers())
{
    var adapter = LoadCustomerAdapter(customer.Id);
    slora.RegisterAdapter(customer.Id, adapter, customer.Rank);
}
Note: Primary adapter is preserved to maintain base functionality.
Forward(Tensor<T>, string)
Performs a forward pass with a specific adapter.
public Tensor<T> Forward(Tensor<T> input, string adapterId = "primary")
Parameters
input (Tensor<T>): Input tensor.
adapterId (string): The ID of the adapter to use (default: "primary").
Returns
- Tensor<T>
Output tensor with adapter applied.
Remarks
This method performs S-LoRA's optimized forward pass with automatic adapter loading and reference tracking.
For Beginners: This runs inference with a specific adapter efficiently.
What happens during forward pass:
- Load adapter if not already cached (automatic on-demand loading)
- Increment reference count (prevent eviction during processing)
- Run base model forward pass
- Run adapter-specific LoRA computation
- Combine base output + adapter output
- Decrement reference count (allow eviction if needed)
Key S-LoRA optimizations simulated:
- Separated base and adapter computation (can batch differently)
- Automatic loading from unified pool
- Reference counting prevents eviction during processing
- LRU access tracking for cache management
Example: Multi-customer request handling
// Request from customer A
var outputA = slora.Forward(inputA, "customer_a");
// Request from customer B (different adapter)
var outputB = slora.Forward(inputB, "customer_b");
// Request from customer A again (adapter still cached)
var outputA2 = slora.Forward(inputA2, "customer_a");
Each customer gets their personalized model behavior efficiently!
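Conceptually, the per-request computation is the standard LoRA combine: output = base(x) + (alpha / rank) * B(A(x)). The following is a minimal sketch of just that combine step on plain arrays, for intuition only; the actual implementation operates on Tensor<T> and batches across adapters:
// Conceptual LoRA combine on plain arrays (illustrative only)
static double[] CombineLoRA(double[] baseOutput, double[] loraOutput, double alpha, int rank)
{
    double scaling = alpha / rank;               // standard LoRA scaling factor
    var combined = new double[baseOutput.Length];
    for (int i = 0; i < combined.Length; i++)
        combined[i] = baseOutput[i] + scaling * loraOutput[i]; // base output + scaled adapter delta
    return combined;
}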
Exceptions
- ArgumentException
Thrown when adapter ID is not found.
GetRankCluster(int)
Gets the list of adapter IDs in a specific rank cluster.
public List<string> GetRankCluster(int rank)
Parameters
rank (int): The rank to query.
Returns
- List<string>
A list of adapter IDs registered with the specified rank.
Remarks
This method provides access to S-LoRA's rank-based clustering information. Adapters with the same rank can be batched together more efficiently.
For Beginners: This shows which adapters can be batched together efficiently.
Why rank clustering matters:
- Adapters with same rank have same computational cost
- Can use same CUDA kernels / computation paths
- Better memory access patterns
- Higher GPU utilization
Example: Analyzing your adapter distribution
var slora = new SLoRAAdapter<double>(baseModel, rank: 8);
// Register many adapters with different ranks
// ...
// See how adapters are distributed
var rank8Adapters = slora.GetRankCluster(8); // Maybe 500 adapters
var rank16Adapters = slora.GetRankCluster(16); // Maybe 300 adapters
var rank32Adapters = slora.GetRankCluster(32); // Maybe 200 adapters
Console.WriteLine($"Rank 8: {rank8Adapters.Count} adapters");
Console.WriteLine($"Rank 16: {rank16Adapters.Count} adapters");
Console.WriteLine($"Rank 32: {rank32Adapters.Count} adapters");
This helps optimize batch sizes and resource allocation!
GetStatistics()
Gets statistics about the current state of the S-LoRA system.
public Dictionary<string, double> GetStatistics()
Returns
- Dictionary<string, double>
Dictionary containing system statistics.
Remarks
This method provides detailed statistics about S-LoRA's memory usage, cache efficiency, and adapter distribution.
For Beginners: This gives you insights into how well your S-LoRA system is performing.
Key metrics returned:
- TotalAdapters: How many adapters registered in pool
- LoadedAdapters: How many currently cached in "GPU memory"
- CacheUtilization: Percentage of cache capacity used
- RankClusters: Number of different rank groups
- AverageRank: Mean rank across all adapters
- ActiveReferences: Adapters currently processing requests
Example: Monitoring production system
var stats = slora.GetStatistics();
Console.WriteLine($"Total adapters: {stats["TotalAdapters"]}");
Console.WriteLine($"Loaded adapters: {stats["LoadedAdapters"]}");
Console.WriteLine($"Cache utilization: {stats["CacheUtilization"]}%");
// Alert if cache too small
if (stats["CacheUtilization"] > 95)
{
    Console.WriteLine("Warning: Cache nearly full, consider increasing maxLoadedAdapters");
}
Use this to tune your S-LoRA configuration for optimal performance!
LoadAdapter(string)
Loads an adapter from the pool into active memory (simulates GPU loading).
public void LoadAdapter(string adapterId)
Parameters
adapterId (string): The ID of the adapter to load.
Remarks
This method simulates S-LoRA's dynamic adapter loading from CPU to GPU memory. If the loaded adapter cache is full, it evicts the least recently used adapter.
For Beginners: This moves an adapter from slow storage to fast cache.
In S-LoRA's architecture:
- CPU memory: All adapters stored here (slow but large capacity)
- GPU memory: Hot adapters cached here (fast but limited capacity)
Loading process:
- Check if adapter already loaded (if yes, update access time and return)
- Check if cache is full (if yes, evict least recently used adapter)
- Load adapter into cache
- Mark as loaded and update access timestamp
LRU eviction policy:
- Adapters with oldest last access time evicted first
- Adapters with active references (in-flight requests) never evicted
- This keeps popular adapters hot in cache
Example: Customer request patterns
Time 0: Customer A requests (load adapter A)
Time 1: Customer B requests (load adapter B)
...
Time 99: Customer Z requests (load adapter Z, cache now full at 100)
Time 100: Customer AA requests (evict least-used, load adapter AA)
Time 101: Customer A requests again (adapter A was evicted, reload)
Popular customers stay cached, inactive ones evicted automatically!
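One practical use of calling LoadAdapter directly is pre-warming the cache before traffic arrives, so the expected hot adapters skip the on-demand loading cost. A minimal sketch (GetTopCustomerIds is a hypothetical helper):
// Pre-warm the cache with the adapters expected to be hottest
foreach (var customerId in GetTopCustomerIds(100)) // hypothetical helper returning up to 100 IDs
{
    slora.LoadAdapter($"customer_{customerId}");
}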
Exceptions
- ArgumentException
Thrown when adapter ID is not found in pool.
MergeToOriginalLayer()
Merges the primary adapter into the base layer and returns the merged layer.
public override ILayer<T> MergeToOriginalLayer()
Returns
- ILayer<T>
A new layer with primary LoRA weights merged into the base layer's weights.
Remarks
For S-LoRA, this merges the primary adapter (the one created during initialization). In production S-LoRA deployments, individual adapters typically remain separate for efficient multi-adapter serving rather than being merged.
For Beginners: This merges the default adapter for deployment.
When to merge adapters:
- Deploying a single-adapter model (no longer need multi-adapter serving)
- Want maximum inference speed for one specific adapter
- Converting S-LoRA deployment back to standard model
When NOT to merge:
- Serving multiple adapters (defeats purpose of S-LoRA)
- Need to swap adapters dynamically
- Want memory efficiency of shared base model
S-LoRA's strength is NOT merging:
- Keep base model frozen and shared
- Keep all adapters separate in pool
- Swap adapters per request efficiently
- Serve thousands of adapters from one base model
This method is mainly for compatibility or transitioning away from S-LoRA architecture.
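If you do decide to collapse the primary adapter into the base weights (for example, when shipping a single-adapter build), the call itself is a one-liner. A minimal sketch, assuming sloraAdapter is an existing SLoRAAdapter<double>:
// Merge the primary adapter into the base layer for single-adapter deployment
ILayer<double> mergedLayer = sloraAdapter.MergeToOriginalLayer();
// mergedLayer now carries base + primary adapter weights; no per-request adapter swapping remains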
Exceptions
- InvalidOperationException
Thrown when the base layer type is not supported.
RegisterAdapter(string, LoRALayer<T>, int)
Registers a new adapter in the unified memory pool.
public void RegisterAdapter(string adapterId, LoRALayer<T> loraLayer, int rank)
Parameters
adapterId (string): Unique identifier for this adapter.
loraLayer (LoRALayer<T>): The LoRA layer to register.
rank (int): The rank of this adapter.
Remarks
This method adds a new adapter to S-LoRA's unified memory pool. The adapter is not immediately loaded into GPU memory but is available for on-demand loading when needed.
For Beginners: This is like adding a new customer or task-specific adapter to your system.
What happens when you register an adapter:
- Adapter stored in CPU memory pool (cheap storage)
- Added to rank cluster for batched computation optimization
- Not loaded to GPU yet (only loaded when first used)
- Can register thousands of adapters this way
Example: Multi-tenant SaaS application
var slora = new SLoRAAdapter<double>(baseModel, rank: 8, maxLoadedAdapters: 100);
// Register 1000 customer adapters
for (int i = 0; i < 1000; i++)
{
    var adapter = LoadCustomerAdapter(i);
    slora.RegisterAdapter($"customer_{i}", adapter, rank: 8);
}
// All 1000 adapters registered, but only 100 will be loaded at once
// Popular customers get fast GPU-cached access
// Inactive customers loaded on-demand from CPU pool
This enables serving far more adapters than GPU memory allows!
Exceptions
- ArgumentNullException
Thrown when adapterId or loraLayer is null.
- ArgumentException
Thrown when an adapter with this ID already exists.