Enum DistillationStrategyType

Namespace: AiDotNet.Enums
Assembly: AiDotNet.dll

Specifies the type of knowledge distillation strategy to use for transferring knowledge from teacher to student models.

```csharp
public enum DistillationStrategyType
```

Fields

AttentionBased = 2
ContrastiveBased = 4

Contrastive Representation Distillation / CRD (Tian et al., 2020). Uses contrastive learning to match teacher and student representations.

Best for: Self-supervised learning, representation learning.

Key Parameters: Temperature, negative samples, contrast weight.

Pros: Strong theoretical foundation, works without labels.

Cons: Requires careful tuning, needs negative sampling.
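A minimal sketch of a contrastive matching term in the spirit of CRD (illustrative only, not the AiDotNet API or the exact CRD objective, which draws negatives from a memory bank; all names below are assumptions):

```csharp
using System;

// Sketch: InfoNCE-style loss over a batch of teacher/student embeddings.
// The student embedding of sample i is pulled toward the teacher embedding
// of the same sample and pushed away from the other samples in the batch.
public static class ContrastiveDistillationSketch
{
    static double Dot(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) s += a[i] * b[i];
        return s;
    }

    // Assumes embeddings are L2-normalized; temperature controls sharpness.
    public static double Loss(double[][] teacher, double[][] student, double temperature = 0.1)
    {
        int n = teacher.Length;
        double total = 0;
        for (int i = 0; i < n; i++)
        {
            // Positive pair: student i vs. teacher i; negatives: teacher j != i.
            double positive = Math.Exp(Dot(student[i], teacher[i]) / temperature);
            double denominator = 0;
            for (int j = 0; j < n; j++)
                denominator += Math.Exp(Dot(student[i], teacher[j]) / temperature);
            total += -Math.Log(positive / denominator);
        }
        return total / n;
    }
}
```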

FactorTransfer = 9

Factor Transfer (Kim et al., 2018). Transfers factors (paraphrased representations) from teacher to student.

Best for: Cross-architecture transfer, efficient distillation.

Key Parameters: Paraphraser network, factor layers.

Pros: Flexible, works across different architectures.

Cons: Requires additional paraphraser network.
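A minimal sketch of the factor-matching loss only (illustrative, not the AiDotNet API). It assumes the teacher factor has already been produced by a trained paraphraser network and the student factor by a translator network; here both are plain vectors:

```csharp
using System;

// Sketch: factors are L2-normalized and compared with an L1 loss,
// as in the Factor Transfer paper.
public static class FactorTransferSketch
{
    static double[] Normalize(double[] v)
    {
        double norm = 0;
        foreach (var x in v) norm += x * x;
        norm = Math.Sqrt(norm) + 1e-12;
        var result = new double[v.Length];
        for (int i = 0; i < v.Length; i++) result[i] = v[i] / norm;
        return result;
    }

    public static double Loss(double[] teacherFactor, double[] studentFactor)
    {
        var t = Normalize(teacherFactor);
        var s = Normalize(studentFactor);
        double loss = 0;
        for (int i = 0; i < t.Length; i++) loss += Math.Abs(t[i] - s[i]);
        return loss / t.Length;
    }
}
```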

FeatureBased = 1

Feature-based distillation / FitNets (Romero et al., 2014). Matches intermediate layer representations between teacher and student.

Best for: Different architectures (e.g., a large CNN teacher → a MobileNet student), transfer across domains.

Key Parameters: Layer pairs to match, feature weight.

Pros: Transfers deeper knowledge, works across architectures.

Cons: Requires layer mapping, may need projection layers.
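A minimal sketch of FitNets-style feature matching (illustrative, not the AiDotNet API). A learned linear projection maps the student feature to the teacher's dimensionality before an MSE "hint" loss; the projection weights here are placeholders for what would be a trained regressor:

```csharp
// Sketch: MSE between a teacher hint layer and the projected student layer.
public static class FeatureDistillationSketch
{
    public static double HintLoss(double[] teacherFeature, double[] studentFeature, double[,] projection)
    {
        int dTeacher = teacherFeature.Length;
        int dStudent = studentFeature.Length;

        // Project the student feature into the teacher's feature space.
        var projected = new double[dTeacher];
        for (int i = 0; i < dTeacher; i++)
            for (int j = 0; j < dStudent; j++)
                projected[i] += projection[i, j] * studentFeature[j];

        // Mean squared error between teacher and projected student features.
        double loss = 0;
        for (int i = 0; i < dTeacher; i++)
        {
            double diff = teacherFeature[i] - projected[i];
            loss += diff * diff;
        }
        return loss / dTeacher;
    }
}
```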

FlowBased = 6

Flow of Solution Procedure / FSP (Yim et al., 2017). Transfers the flow of information between layers.

Best for: Deep networks, capturing layer-to-layer flow.

Key Parameters: Layer pairs for flow matrices.

Pros: Captures information flow, good for deep networks.

Cons: Requires multiple layer pairs, complex to configure.
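A minimal sketch of the FSP idea (illustrative, not the AiDotNet API). The FSP matrix is the channel-by-channel inner product between two layers' spatially flattened feature maps; the loss compares the teacher's and student's FSP matrices for a chosen layer pair (which must have matching channel counts):

```csharp
// Sketch: FSP matrix computation and squared-error loss between FSP matrices.
public static class FspDistillationSketch
{
    // features: [channel][spatialPosition]; both layers share the spatial size.
    public static double[,] FspMatrix(double[][] layerA, double[][] layerB)
    {
        int cA = layerA.Length, cB = layerB.Length, hw = layerA[0].Length;
        var fsp = new double[cA, cB];
        for (int i = 0; i < cA; i++)
            for (int j = 0; j < cB; j++)
            {
                double dot = 0;
                for (int k = 0; k < hw; k++) dot += layerA[i][k] * layerB[j][k];
                fsp[i, j] = dot / hw;
            }
        return fsp;
    }

    public static double Loss(double[,] teacherFsp, double[,] studentFsp)
    {
        int rows = teacherFsp.GetLength(0), cols = teacherFsp.GetLength(1);
        double sum = 0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
            {
                double diff = teacherFsp[i, j] - studentFsp[i, j];
                sum += diff * diff;
            }
        return sum / (rows * cols);
    }
}
```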

Hybrid = 12

Combined/Hybrid distillation. Combines multiple strategies (e.g., Response + Feature + Attention).

Best for: Maximizing knowledge transfer, complex models.

Key Parameters: Weights for each strategy.

Pros: Transfers knowledge at multiple levels.

Cons: More hyperparameters to tune, computationally expensive.
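A minimal sketch of a hybrid objective (illustrative, not the AiDotNet API): the individual strategy losses are computed elsewhere and combined with per-strategy weights, which are the main hyperparameters to tune:

```csharp
// Sketch: weighted combination of individual distillation losses.
public static class HybridDistillationSketch
{
    public static double Combine(double responseLoss, double featureLoss, double attentionLoss,
                                 double wResponse = 1.0, double wFeature = 0.5, double wAttention = 0.5)
    {
        return wResponse * responseLoss + wFeature * featureLoss + wAttention * attentionLoss;
    }
}
```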

NeuronSelectivity = 10
ProbabilisticTransfer = 7
RelationBased = 3

Relational Knowledge Distillation / RKD (Park et al., 2019). Preserves relationships (distances and angles) between sample representations.

Best for: Metric learning, few-shot learning, embedding models.

Key Parameters: Distance weight, angle weight.

Pros: Preserves structural relationships, robust to architecture changes.

Cons: More computationally expensive (pairwise comparisons).
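A minimal sketch of RKD's distance-wise term (illustrative, not the AiDotNet API). Pairwise distances within a batch are normalized by their mean, and the student is penalized (smooth L1/Huber) for deviating from the teacher's distance structure; the angle-wise term is omitted for brevity:

```csharp
using System;

// Sketch: distance-wise relational loss over a batch of embeddings.
public static class RelationalDistillationSketch
{
    static double Distance(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.Sqrt(s);
    }

    static double Huber(double x) => Math.Abs(x) < 1 ? 0.5 * x * x : Math.Abs(x) - 0.5;

    // embeddings: [sampleIndex][dimension]; teacher and student dimensions may differ.
    public static double DistanceLoss(double[][] teacher, double[][] student)
    {
        int n = teacher.Length;
        var dT = new double[n, n]; var dS = new double[n, n];
        double meanT = 0, meanS = 0; int pairs = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
            {
                dT[i, j] = Distance(teacher[i], teacher[j]);
                dS[i, j] = Distance(student[i], student[j]);
                meanT += dT[i, j]; meanS += dS[i, j]; pairs++;
            }
        meanT /= pairs; meanS /= pairs;

        // Compare mean-normalized pairwise distances with a Huber penalty.
        double loss = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                loss += Huber(dT[i, j] / meanT - dS[i, j] / meanS);
        return loss / pairs;
    }
}
```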

ResponseBased = 0

Response-based distillation (Hinton et al., 2015). Matches the teacher's final output predictions using temperature-scaled softmax.

Best for: Standard classification tasks, general-purpose distillation.

Key Parameters: Temperature (2-10), Alpha (0.3-0.5).

Pros: Simple, effective, widely used.

Cons: Doesn't capture intermediate representations.
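A minimal sketch of the temperature-scaled loss described above (illustrative only, not the AiDotNet API; the blending convention for Alpha may differ from the library's):

```csharp
using System;

// Sketch: KL(teacher_soft || student_soft) at temperature T,
// blended with hard-label cross-entropy via alpha.
public static class ResponseDistillationSketch
{
    static double[] Softmax(double[] logits, double temperature)
    {
        var scaled = new double[logits.Length];
        double max = double.MinValue;
        for (int i = 0; i < logits.Length; i++) { scaled[i] = logits[i] / temperature; if (scaled[i] > max) max = scaled[i]; }
        double sum = 0;
        for (int i = 0; i < scaled.Length; i++) { scaled[i] = Math.Exp(scaled[i] - max); sum += scaled[i]; }
        for (int i = 0; i < scaled.Length; i++) scaled[i] /= sum;
        return scaled;
    }

    public static double Loss(double[] teacherLogits, double[] studentLogits, int trueLabel,
                              double temperature = 4.0, double alpha = 0.4)
    {
        double[] pTeacher = Softmax(teacherLogits, temperature);
        double[] pStudent = Softmax(studentLogits, temperature);

        // KL divergence, scaled by T^2 to keep gradient magnitudes comparable.
        double kl = 0;
        for (int i = 0; i < pTeacher.Length; i++) kl += pTeacher[i] * Math.Log(pTeacher[i] / pStudent[i]);
        kl *= temperature * temperature;

        // Standard cross-entropy against the hard label (temperature = 1).
        double ce = -Math.Log(Softmax(studentLogits, 1.0)[trueLabel]);

        return alpha * kl + (1 - alpha) * ce;
    }
}
```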

SelfDistillation = 11

Self-distillation (Zhang et al., 2019; Furlanello et al., 2018). Model learns from its own predictions to improve calibration and generalization.

Best for: Improving calibration, no separate teacher needed.

Key Parameters: Generations, temperature, EMA decay.

Pros: No separate teacher, improves calibration.

Cons: Requires multiple training runs.
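A minimal sketch of the EMA-teacher variant of self-distillation (illustrative, not the AiDotNet API): the "teacher" weights are an exponential moving average of the student's own weights, and the student is then distilled against this averaged copy of itself (e.g., with the response-based loss above):

```csharp
// Sketch: EMA update of the self-teacher's weights after each training step.
public static class SelfDistillationSketch
{
    // Typical EMA decay values are around 0.99-0.999.
    public static void UpdateEmaTeacher(double[] teacherWeights, double[] studentWeights, double emaDecay = 0.999)
    {
        for (int i = 0; i < teacherWeights.Length; i++)
            teacherWeights[i] = emaDecay * teacherWeights[i] + (1 - emaDecay) * studentWeights[i];
    }
}
```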

SimilarityPreserving = 5
VariationalInformation = 8

Variational Information Distillation / VID (Ahn et al., 2019). Uses variational bounds to maximize mutual information between teacher and student.

Best for: Information-theoretic distillation, maximizing information transfer.

Key Parameters: Variational parameters, MI estimator.

Pros: Theoretical guarantees, maximizes information.

Cons: Complex implementation, harder to tune.
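A minimal sketch of VID's Gaussian variational bound (illustrative, not the AiDotNet API). A small learned network, assumed here to have already produced `predictedMean`, maps student features to an estimate of the teacher feature; `logVariance` is a learned per-dimension parameter. Minimizing this term maximizes a lower bound on the mutual information between teacher and student features:

```csharp
using System;

// Sketch: negative Gaussian log-likelihood of the teacher feature
// under the student-conditioned variational distribution.
public static class VidDistillationSketch
{
    public static double Loss(double[] teacherFeature, double[] predictedMean, double[] logVariance)
    {
        double loss = 0;
        for (int d = 0; d < teacherFeature.Length; d++)
        {
            double variance = Math.Exp(logVariance[d]);
            double diff = teacherFeature[d] - predictedMean[d];
            loss += 0.5 * (logVariance[d] + diff * diff / variance);
        }
        return loss / teacherFeature.Length;
    }
}
```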

Remarks

For Beginners: Different distillation strategies focus on different aspects of the teacher's knowledge. Some match final outputs, others match intermediate features or relationships between samples.

Choosing a Strategy:

- Use **ResponseBased** for most cases (standard Hinton distillation)
- Use **FeatureBased** when the student architecture differs significantly from the teacher
- Use **AttentionBased** for transformer models (BERT, GPT)
- Use **RelationBased** to preserve relationships between samples
- Use **ContrastiveBased** for self-supervised learning scenarios

A short usage sketch follows the list.
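Illustrative usage only: the helper below is hypothetical and not part of the AiDotNet API; only DistillationStrategyType and its fields come from this documentation:

```csharp
using AiDotNet.Enums;

// Sketch: picking an enum value based on the guidance above.
public static class StrategyChoiceSketch
{
    public static DistillationStrategyType Choose(bool teacherIsTransformer, bool architecturesDiffer)
    {
        if (teacherIsTransformer) return DistillationStrategyType.AttentionBased;
        if (architecturesDiffer) return DistillationStrategyType.FeatureBased;
        return DistillationStrategyType.ResponseBased;
    }
}
```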