Class LongLoRAAdapter<T>

Namespace
AiDotNet.LoRA.Adapters
Assembly
AiDotNet.dll

LongLoRA adapter that efficiently extends LoRA to handle longer context lengths using shifted sparse attention.

public class LongLoRAAdapter<T> : LoRAAdapterBase<T>, IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Type Parameters

T

The numeric type used for calculations, typically float or double.

Inheritance
LoRAAdapterBase<T> → LongLoRAAdapter<T>

Implements
IDisposable, ILoRAAdapter<T>, ILayer<T>, IJitCompilable<T>, IDiagnosticsProvider, IWeightLoadable<T>

Remarks

LongLoRA (2023) addresses the challenge of adapting large language models to longer context windows in a parameter-efficient manner. While standard LoRA works well for same-length fine-tuning, extending context windows naively would require substantial computational resources.

LongLoRA introduces two key innovations:

  1. Shifted Sparse Attention (S²-Attn): During training only, uses shifted group attention patterns that are more efficient while maintaining effectiveness for long contexts.
  2. Dense Attention at Inference: At inference time, switches back to standard dense attention for full context utilization without the training overhead.

For Beginners: LongLoRA makes it affordable to train models on longer sequences.

The Problem:

  • Standard LoRA works great for adapting models, but extending context length is expensive
  • Full dense attention on long sequences requires O(n²) computation
  • Training on 32k tokens instead of 2k tokens would be 256x slower!

LongLoRA's Solution:

  • Uses a clever "shifted sparse attention" trick during training
  • Divides the sequence into groups and shifts them to maintain information flow
  • Much cheaper to train: O(n * k), where k is the group size (typically 2048); see the cost sketch after this list
  • At inference, uses full dense attention to maintain quality
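
To make the cost claim above concrete, here is a rough back-of-the-envelope comparison. It only illustrates the complexity argument; it is not a benchmark of this class.

// Attention work is roughly proportional to the number of query-key pairs scored.
long n = 32768;                          // extended sequence length
long k = 2048;                           // group size (OriginalContextLength)
long densePairs = n * n;                 // ~1.07e9 pairs for full dense attention
long groupedPairs = (n / k) * (k * k);   // = n * k, ~6.7e7 pairs with grouped attention
// densePairs / groupedPairs == n / k == 16, i.e. roughly 16x less attention work.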

Key Parameters:

  • OriginalContextLength: The base model's context window (e.g., 2048)
  • ExtendedContextLength: The target longer context (e.g., 8192 or 32768)
  • UseShiftedAttention: Enable shifted sparse attention (training only)
  • AttentionShiftSize: How many positions to shift attention groups (usually half the group size)

Example Use Case: You have a model trained on 2k-token contexts but need to process 16k-token documents. LongLoRA lets you extend the context efficiently (a usage sketch follows this list):

  • Training: Use shifted sparse attention (much faster)
  • Inference: Use full dense attention (full quality)
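
Putting this together, a usage sketch might look like the following. It assumes `attentionLayer` (an ILayer<float> you already have) and `inputBatch` (a Tensor<float> shaped [batchSize, sequenceLength, featureDim]) exist elsewhere in your code; neither is defined by this page.

// Extend a 2k-context model to handle 16k-token documents.
var adapter = new LongLoRAAdapter<float>(
    baseLayer: attentionLayer,
    rank: 8,
    originalContextLength: 2048,
    extendedContextLength: 16384);

// Training: enable the shifted sparse attention pattern to keep cost low.
adapter.IsTraining = true;
adapter.UseShiftedAttention = true;
var trainingOutput = adapter.Forward(inputBatch);
// ... compute the loss, call adapter.Backward(...), update parameters ...

// Inference: switch to full dense attention for the best quality.
adapter.IsTraining = false;
adapter.UseShiftedAttention = false;
var inferenceOutput = adapter.Forward(inputBatch);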

Comparison to Standard LoRA:

  • Standard LoRA: Efficient parameter adaptation, same context length
  • LongLoRA: Efficient parameter adaptation + context length extension
  • Adds minimal overhead (just the attention shift mechanism)

Research Background: LongLoRA has been successfully used to extend:

  • LLaMA 2 7B from 4k to 32k context (8x extension)
  • LLaMA 2 13B from 4k to 64k context (16x extension)
  • With only ~10% of the training cost compared to full fine-tuning

Reference: LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (2023) https://arxiv.org/abs/2309.12307

Constructors

LongLoRAAdapter(ILayer<T>, int, int, int, double, int, bool)

Initializes a new LongLoRA adapter for efficient context length extension.

public LongLoRAAdapter(ILayer<T> baseLayer, int rank, int originalContextLength, int extendedContextLength, double alpha = -1, int attentionShiftSize = -1, bool freezeBaseLayer = true)

Parameters

baseLayer ILayer<T>

The layer to adapt with LongLoRA.

rank int

The rank of the LoRA decomposition.

originalContextLength int

The original context length of the base model.

extendedContextLength int

The target extended context length.

alpha double

The LoRA scaling factor (defaults to rank if negative).

attentionShiftSize int

The shift size for shifted sparse attention (defaults to originalContextLength/2).

freezeBaseLayer bool

Whether to freeze the base layer's parameters during training.

Remarks

For Beginners: This creates a LongLoRA adapter to extend your model's context window.

Parameters:

  • baseLayer: The layer you want to adapt (typically attention layers)
  • rank: How much LoRA compression to use (8-16 is typical)
  • originalContextLength: How long sequences your base model handles (e.g., 2048)
  • extendedContextLength: How long you want to extend it to (e.g., 8192 or 16384)
  • alpha: LoRA strength (usually equals rank)
  • attentionShiftSize: How much to shift attention groups (auto-calculated if not specified)
  • freezeBaseLayer: Whether to freeze original weights (usually true for efficiency)

The adapter will use shifted sparse attention during training for efficiency, and you can switch to dense attention during inference for quality.
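
A minimal construction sketch with every parameter spelled out; `attentionLayer` is a placeholder for whichever ILayer<T> you are adapting and is not defined by this page.

var adapter = new LongLoRAAdapter<float>(
    baseLayer: attentionLayer,
    rank: 16,                      // LoRA rank; 8-16 is typical
    originalContextLength: 2048,   // what the base model was trained on
    extendedContextLength: 8192,   // the target context window
    alpha: 16,                     // scaling factor; a negative value defaults it to rank
    attentionShiftSize: 1024,      // a negative value defaults to originalContextLength / 2
    freezeBaseLayer: true);        // keep base weights fixed; only the LoRA weights train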

Exceptions

ArgumentNullException

Thrown when baseLayer is null.

ArgumentException

Thrown when context lengths or shift size are invalid.

Properties

AttentionShiftSize

Gets the attention shift size used in shifted sparse attention.

public int AttentionShiftSize { get; }

Property Value

int

Remarks

This determines how far groups are shifted to maintain information flow. Typically set to half the group size (e.g., 1024 for a group size of 2048).

For Beginners: This is the "sliding window" amount that ensures different parts of the sequence can communicate across groups. Too small and information doesn't flow well; too large and you lose the efficiency benefit.

ExtendedContextLength

Gets the extended context length this adapter targets.

public int ExtendedContextLength { get; }

Property Value

int

Remarks

This is the new, longer context window you want to support after adaptation. Should be larger than OriginalContextLength.

For Beginners: This is how long of a sequence your adapted model can handle. For example, extending from 2k to 16k tokens means you can process 8x longer documents!

IsTraining

Gets or sets whether the adapter is in training mode.

public bool IsTraining { get; set; }

Property Value

bool

Remarks

Training mode affects whether shifted attention is applied. Set to false during inference to use standard dense attention.

OriginalContextLength

Gets the original context length of the base model.

public int OriginalContextLength { get; }

Property Value

int

Remarks

This is the maximum sequence length the base model was originally trained to handle. Typical values: 512, 1024, 2048, 4096.

UseShiftedAttention

Gets or sets whether to use shifted sparse attention during forward/backward passes.

public bool UseShiftedAttention { get; set; }

Property Value

bool

Remarks

When enabled (training mode):

  • Uses a shifted group attention pattern for efficiency
  • Divides the sequence into groups and shifts them
  • Significantly reduces computational cost

When disabled (inference mode):

  • Uses standard dense attention
  • Full context utilization
  • Better quality but slower

For Beginners: Enable this during training to save compute, disable it during inference to get the best quality. The training trick doesn't hurt the final model's ability to use full attention at inference time!
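
For example, a typical toggle might look like the following sketch, where `adapter` is an already constructed LongLoRAAdapter<float>. Whether setting IsTraining changes the attention pattern on its own is not specified here, so both flags are set explicitly.

// Training phase: cheap shifted group attention.
adapter.IsTraining = true;
adapter.UseShiftedAttention = true;

// ... training loop ...

// Inference phase: full dense attention over the whole context.
adapter.IsTraining = false;
adapter.UseShiftedAttention = false;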

Methods

Backward(Tensor<T>)

Performs the backward pass with optional shifted sparse attention.

public override Tensor<T> Backward(Tensor<T> outputGradient)

Parameters

outputGradient Tensor<T>

Gradient flowing back from the next layer.

Returns

Tensor<T>

Gradient to pass to the previous layer.

Remarks

The backward pass mirrors the forward pass behavior:

  • Applies the same shifting pattern to gradients during training
  • Ensures gradient flow is consistent with the forward pass attention pattern

For Beginners: This propagates learning signals backward through the network, using the same shifted pattern as the forward pass so that the gradients match the attention pattern used there.
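
A single training step might look like this sketch; `inputBatch` and `lossGradient` are assumed Tensor<float> instances, and the loss computation and optimizer update are outside its scope.

adapter.UseShiftedAttention = true;                    // same pattern for forward and backward
Tensor<float> output = adapter.Forward(inputBatch);    // shifted forward pass
// ... compute the loss and its gradient with respect to `output` ...
Tensor<float> inputGradient = adapter.Backward(lossGradient);
// ... then apply your optimizer step to the adapter's trainable (LoRA) parameters ...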

Forward(Tensor<T>)

Performs the forward pass with optional shifted sparse attention.

public override Tensor<T> Forward(Tensor<T> input)

Parameters

input Tensor<T>

Input tensor of shape [batchSize, sequenceLength, featureDim].

Returns

Tensor<T>

Output tensor with LoRA adaptation applied.

Remarks

The forward pass behavior depends on the UseShiftedAttention flag:

  • When true (training): Applies shifted group attention for efficiency
  • When false (inference): Uses standard dense attention

Shifted Sparse Attention Process:

  1. Divide the sequence into groups of size OriginalContextLength.
  2. Shift alternate groups by AttentionShiftSize positions.
  3. Apply attention within each group.
  4. Shift back to restore the original positions.

For Beginners: This processes your input through the adapted layer.

During training (shifted attention enabled):

  • Breaks long sequence into manageable chunks
  • Shifts them to allow cross-chunk communication
  • Much faster than processing the full sequence at once

During inference (shifted attention disabled):

  • Processes the full sequence with complete attention
  • Slower but gives best quality

The magic is that training with the shifted trick still produces a model that works great with full attention at inference!
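
For intuition only, the shift can be pictured as a cyclic roll of token positions; the snippet below is a conceptual illustration, not the adapter's internal implementation.

// After the roll, group boundaries fall in different places, so tokens from
// neighboring groups can attend to each other within a group.
static int ShiftedPosition(int position, int sequenceLength, int shift)
    => (position + shift) % sequenceLength;

// Example: sequenceLength = 8192, group size = 2048, shift = 1024.
// ShiftedPosition(8000, 8192, 1024) == 832, so a token from the last group
// lands in the first group during the shifted pass.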

MergeToOriginalLayer()

Merges the LongLoRA adaptation into the base layer and returns the merged layer.

public override ILayer<T> MergeToOriginalLayer()

Returns

ILayer<T>

A new layer with LoRA weights merged into the base layer's weights.

Remarks

For LongLoRA, merging works like standard LoRA - the shifted attention pattern is only used during training and doesn't affect the final merged weights. The merged layer can use full dense attention at inference time.

For Beginners: After training with LongLoRA, you can merge the weights just like standard LoRA. The shifted attention trick was only for efficient training - it doesn't change the final model! The merged model will work great with full attention on long contexts because that's what it learned to handle (just using a training shortcut).
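
The merge itself is a single call; this sketch assumes `adapter` is a trained LongLoRAAdapter<float>.

// Fold the learned LoRA update into the base weights. The result is an ordinary
// layer with no adapter overhead, ready for dense-attention inference on long inputs.
ILayer<float> mergedLayer = adapter.MergeToOriginalLayer();
// Swap `mergedLayer` into your model in place of the original layer.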

ResetState()

Resets the internal state of the adapter.

public override void ResetState()

Remarks

For Beginners: This clears all internal memory and cached data. Call this when starting to process a new, unrelated sequence.