Enum DistillationStrategyType
Specifies the type of knowledge distillation strategy to use for transferring knowledge from teacher to student models.
public enum DistillationStrategyType
Fields
AttentionBased = 2
Attention-based distillation. Matches the teacher's attention maps so the student learns where the teacher focuses; particularly useful for transformer models (BERT, GPT).
ContrastiveBased = 4
Contrastive Representation Distillation / CRD (Tian et al., 2020). Uses contrastive learning to match teacher and student representations.
Best for: Self-supervised learning, representation learning.
Key Parameters: Temperature, negative samples, contrast weight.
Pros: Strong theoretical foundation, works without labels.
Cons: Requires careful tuning, needs negative sampling.
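A minimal sketch of the contrastive term, simplified to an InfoNCE-style loss over precomputed, L2-normalized teacher and student embeddings (the full CRD method additionally uses a memory bank of negatives and a learned critic); the class and method names below are illustrative, not this library's API:

```csharp
using System;

// Simplified contrastive distillation term. The student's embedding for a sample is
// pulled toward the teacher's embedding of the same sample (positive) and pushed away
// from teacher embeddings of other samples (negatives).
public static class CrdSketch
{
    static double Dot(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) s += a[i] * b[i];
        return s;
    }

    public static double ContrastiveLoss(
        double[] studentAnchor,
        double[] teacherPositive,
        double[][] teacherNegatives,
        double temperature = 0.07)
    {
        double pos = Math.Exp(Dot(studentAnchor, teacherPositive) / temperature);
        double denom = pos;
        foreach (var neg in teacherNegatives)
            denom += Math.Exp(Dot(studentAnchor, neg) / temperature);
        return -Math.Log(pos / denom); // lower when the student matches its own teacher embedding
    }
}
```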
FactorTransfer = 9
Factor Transfer (Kim et al., 2018). Transfers factors (paraphrased representations) from teacher to student.
Best for: Cross-architecture transfer, efficient distillation.
Key Parameters: Paraphraser network, factor layers.
Pros: Flexible, works across different architectures.
Cons: Requires additional paraphraser network.
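A minimal sketch of the factor-matching loss, assuming a paraphraser network has already encoded the teacher's features into a factor vector and a translator network has done the same for the student; the names are illustrative only:

```csharp
using System;
using System.Linq;

public static class FactorTransferSketch
{
    // L2-normalize a factor vector so teacher and student factors live on the same scale.
    static double[] Normalize(double[] v)
    {
        double norm = Math.Sqrt(v.Sum(x => x * x)) + 1e-12;
        return v.Select(x => x / norm).ToArray();
    }

    // L1 distance between normalized factors, as in Kim et al. (2018).
    public static double FactorLoss(double[] teacherFactor, double[] studentFactor)
    {
        var t = Normalize(teacherFactor);
        var s = Normalize(studentFactor);
        return t.Zip(s, (a, b) => Math.Abs(a - b)).Sum();
    }
}
```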
FeatureBased = 1
Feature-based distillation / FitNets (Romero et al., 2014). Matches intermediate layer representations between teacher and student.
Best for: Teacher and student with different architectures (e.g., a large CNN teacher → a MobileNet student), transfer across domains.
Key Parameters: Layer pairs to match, feature weight.
Pros: Transfers deeper knowledge, works across architectures.
Cons: Requires layer mapping, may need projection layers.
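A minimal sketch of a FitNets-style hint loss, assuming the student's guided-layer output has already been passed through a learned projection (regressor) so its dimensionality matches the teacher's hint layer; the names are illustrative only:

```csharp
using System;

public static class FitNetsSketch
{
    // Mean squared error between teacher hint features and projected student features.
    public static double HintLoss(double[] teacherHint, double[] projectedStudent)
    {
        if (teacherHint.Length != projectedStudent.Length)
            throw new ArgumentException("Projection must map student features to the teacher's dimensionality.");

        double sum = 0;
        for (int i = 0; i < teacherHint.Length; i++)
        {
            double d = teacherHint[i] - projectedStudent[i];
            sum += d * d;
        }
        return sum / teacherHint.Length;
    }
}
```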
FlowBased = 6
Flow of Solution Procedure / FSP (Yim et al., 2017). Transfers the flow of information between layers.
Best for: Deep networks, capturing layer-to-layer flow.
Key Parameters: Layer pairs for flow matrices.
Pros: Captures information flow, good for deep networks.
Cons: Requires multiple layer pairs, complex to configure.
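A minimal sketch of the FSP computation, assuming each feature map is supplied as features[channel][spatialPosition] and both layers share the same spatial size; the names are illustrative only:

```csharp
public static class FspSketch
{
    // FSP matrix: inner products between the channels of two layers, averaged over space.
    public static double[,] FspMatrix(double[][] layerA, double[][] layerB)
    {
        int ca = layerA.Length, cb = layerB.Length, hw = layerA[0].Length;
        var g = new double[ca, cb];
        for (int i = 0; i < ca; i++)
            for (int j = 0; j < cb; j++)
            {
                double s = 0;
                for (int p = 0; p < hw; p++) s += layerA[i][p] * layerB[j][p];
                g[i, j] = s / hw;
            }
        return g;
    }

    // Distillation term: squared distance between teacher and student FSP matrices.
    public static double FspLoss(double[,] teacherFsp, double[,] studentFsp)
    {
        double sum = 0;
        for (int i = 0; i < teacherFsp.GetLength(0); i++)
            for (int j = 0; j < teacherFsp.GetLength(1); j++)
            {
                double d = teacherFsp[i, j] - studentFsp[i, j];
                sum += d * d;
            }
        return sum;
    }
}
```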
Hybrid = 12
Combined/Hybrid distillation. Combines multiple strategies (e.g., Response + Feature + Attention).
Best for: Maximizing knowledge transfer, complex models.
Key Parameters: Weights for each strategy.
Pros: Transfers knowledge at multiple levels.
Cons: More hyperparameters to tune, computationally expensive.
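A minimal sketch of the hybrid objective, assuming the individual distillation terms have already been computed; the term names and default weights below are illustrative, not library defaults:

```csharp
public static class HybridSketch
{
    // Weighted sum of individual distillation terms; the per-strategy weights are the
    // main hyperparameters to tune.
    public static double HybridLoss(
        double responseLoss, double featureLoss, double attentionLoss,
        double responseWeight = 1.0, double featureWeight = 0.5, double attentionWeight = 0.5)
        => responseWeight * responseLoss
         + featureWeight * featureLoss
         + attentionWeight * attentionLoss;
}
```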
NeuronSelectivity = 10
Neuron Selectivity Transfer / NST (Huang & Wang, 2017). Matches the distributions of neuron activations between teacher and student.
ProbabilisticTransfer = 7
Probabilistic Knowledge Transfer / PKT (Passalis & Tefas, 2018). Matches the probability distributions induced by pairwise sample similarities in teacher and student feature spaces.
RelationBased = 3
Relational Knowledge Distillation / RKD (Park et al., 2019). Preserves relationships (distances and angles) between sample representations.
Best for: Metric learning, few-shot learning, embedding models.
Key Parameters: Distance weight, angle weight.
Pros: Preserves structural relationships, robust to architecture changes.
Cons: More computationally expensive (pairwise comparisons).
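A minimal sketch of the distance-wise RKD term over a batch of embeddings (the angle-wise term is omitted for brevity, and at least two samples per batch are assumed); the names are illustrative only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RkdSketch
{
    static double Distance(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.Sqrt(s);
    }

    // Smooth-L1 (Huber-style) penalty used by Park et al. (2019).
    static double SmoothL1(double x)
        => Math.Abs(x) < 1 ? 0.5 * x * x : Math.Abs(x) - 0.5;

    public static double DistanceLoss(double[][] teacher, double[][] student)
    {
        var tDists = new List<double>();
        var sDists = new List<double>();
        for (int i = 0; i < teacher.Length; i++)
            for (int j = i + 1; j < teacher.Length; j++)
            {
                tDists.Add(Distance(teacher[i], teacher[j]));
                sDists.Add(Distance(student[i], student[j]));
            }

        // Normalize by the mean pairwise distance so the two embedding scales are comparable.
        double tMean = tDists.Average() + 1e-12;
        double sMean = sDists.Average() + 1e-12;

        double loss = 0;
        for (int k = 0; k < tDists.Count; k++)
            loss += SmoothL1(sDists[k] / sMean - tDists[k] / tMean);
        return loss / tDists.Count;
    }
}
```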
ResponseBased = 0
Response-based distillation (Hinton et al., 2015). Matches the teacher's final output predictions using temperature-scaled softmax.
Best for: Standard classification tasks, general-purpose distillation.
Key Parameters: Temperature (2-10), Alpha (0.3-0.5).
Pros: Simple, effective, widely used.
Cons: Doesn't capture intermediate representations.
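A minimal sketch of the classic response-based objective, assuming raw logits from teacher and student plus the ground-truth class index; the T² scaling of the soft term follows Hinton et al. (2015), and the names are illustrative only:

```csharp
using System;
using System.Linq;

public static class ResponseKdSketch
{
    static double[] Softmax(double[] logits, double temperature)
    {
        var scaled = logits.Select(z => z / temperature).ToArray();
        double max = scaled.Max();                       // subtract max for numerical stability
        var exp = scaled.Select(z => Math.Exp(z - max)).ToArray();
        double sum = exp.Sum();
        return exp.Select(e => e / sum).ToArray();
    }

    public static double Loss(
        double[] studentLogits, double[] teacherLogits, int label,
        double temperature = 4.0, double alpha = 0.5)
    {
        var p = Softmax(teacherLogits, temperature);     // softened teacher targets
        var q = Softmax(studentLogits, temperature);     // softened student predictions
        var hard = Softmax(studentLogits, 1.0);          // ordinary student predictions

        // KL(teacher || student) on the softened distributions.
        double soft = 0;
        for (int i = 0; i < p.Length; i++)
            soft += p[i] * Math.Log((p[i] + 1e-12) / (q[i] + 1e-12));

        // Cross-entropy with the true label.
        double hardCe = -Math.Log(hard[label] + 1e-12);

        return alpha * temperature * temperature * soft + (1 - alpha) * hardCe;
    }
}
```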
SelfDistillation = 11
Self-distillation (Zhang et al., 2019; Furlanello et al., 2018). The model learns from its own predictions to improve calibration and generalization.
Best for: Improving calibration, no separate teacher needed.
Key Parameters: Generations, temperature, EMA decay.
Pros: No separate teacher, improves calibration.
Cons: Requires multiple training runs.
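A minimal sketch of the EMA-teacher flavour of self-distillation, where the "teacher" is an exponential moving average of the student's own weights updated after every optimization step; the flat weight layout and names are illustrative only:

```csharp
public static class SelfDistillationSketch
{
    // Update the EMA teacher in place: teacher = decay * teacher + (1 - decay) * student.
    // The EMA copy then provides the soft targets for the next training step.
    public static void UpdateEmaTeacher(double[] teacherWeights, double[] studentWeights, double decay = 0.999)
    {
        for (int i = 0; i < teacherWeights.Length; i++)
            teacherWeights[i] = decay * teacherWeights[i] + (1 - decay) * studentWeights[i];
    }
}
```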
SimilarityPreserving = 5
Similarity-Preserving Knowledge Distillation / SP (Tung & Mori, 2019). Encourages the student to reproduce the pairwise activation similarities that the teacher produces within a batch.
VariationalInformation = 8
Variational Information Distillation / VID (Ahn et al., 2019). Uses variational bounds to maximize mutual information between teacher and student.
Best for: Information-theoretic distillation, maximizing information transfer.
Key Parameters: Variational parameters, MI estimator.
Pros: Theoretical guarantees, maximizes information.
Cons: Complex implementation, harder to tune.
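A minimal sketch of the VID objective under its Gaussian assumption, where the student predicts a mean for each teacher feature and a per-dimension variance is learned; all inputs are assumed precomputed and the names are illustrative only:

```csharp
using System;

public static class VidSketch
{
    // Negative log of a Gaussian variational bound, summed over feature dimensions
    // (constant terms dropped). Minimizing this maximizes a lower bound on the mutual
    // information between teacher and student features.
    public static double VidLoss(double[] teacherFeatures, double[] predictedMean, double[] variance)
    {
        double loss = 0;
        for (int d = 0; d < teacherFeatures.Length; d++)
        {
            double diff = teacherFeatures[d] - predictedMean[d];
            loss += 0.5 * Math.Log(variance[d]) + diff * diff / (2 * variance[d]);
        }
        return loss;
    }
}
```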
Remarks
For Beginners: Different distillation strategies focus on different aspects of the teacher's knowledge. Some match final outputs, others match intermediate features or relationships between samples.
Choosing a Strategy:
- Use **ResponseBased** for most cases (standard Hinton distillation)
- Use **FeatureBased** when the student architecture differs significantly from the teacher
- Use **AttentionBased** for transformer models (BERT, GPT)
- Use **RelationBased** to preserve relationships between samples
- Use **ContrastiveBased** for self-supervised learning scenarios
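As a hedged illustration of how the enum might be consumed, the sketch below defines a placeholder `DistillationOptions` record; it is not this library's documented configuration API:

```csharp
// Hypothetical configuration type, defined here only for illustration.
public sealed record DistillationOptions(
    DistillationStrategyType Strategy,
    double Temperature = 4.0,  // typical range 2-10 for response-based distillation
    double Alpha = 0.5);       // weight on the soft (teacher) term

public static class StrategySelectionExample
{
    // Response-based distillation is the usual starting point for classification tasks.
    public static DistillationOptions Default()
        => new DistillationOptions(DistillationStrategyType.ResponseBased);
}
```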