Enum DistillationStrategyType
Specifies the type of knowledge distillation strategy to use for transferring knowledge from teacher to student models.
public enum DistillationStrategyType
Fields
AttentionBased = 2
Attention-based distillation. Matches the teacher's attention maps so the student learns where the teacher focuses; particularly useful for transformer models (BERT, GPT).
ContrastiveBased = 4
Contrastive Representation Distillation / CRD (Tian et al., 2020). Uses contrastive learning to match teacher and student representations.
Best for: Self-supervised learning, representation learning.
Key Parameters: Temperature, negative samples, contrast weight.
Pros: Strong theoretical foundation, works without labels.
Cons: Requires careful tuning, needs negative sampling.
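A minimal sketch of the contrastive term, simplified to an InfoNCE-style loss over precomputed, L2-normalized teacher and student embeddings (the full CRD method additionally uses a memory bank of negatives and a learned critic); the class and method names below are illustrative, not this library's API:

```csharp
using System;

// Simplified contrastive distillation term. The student's embedding for a sample is
// pulled toward the teacher's embedding of the same sample (positive) and pushed away
// from teacher embeddings of other samples (negatives).
public static class CrdSketch
{
    static double Dot(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) s += a[i] * b[i];
        return s;
    }

    public static double ContrastiveLoss(
        double[] studentAnchor,
        double[] teacherPositive,
        double[][] teacherNegatives,
        double temperature = 0.07)
    {
        double pos = Math.Exp(Dot(studentAnchor, teacherPositive) / temperature);
        double denom = pos;
        foreach (var neg in teacherNegatives)
            denom += Math.Exp(Dot(studentAnchor, neg) / temperature);
        return -Math.Log(pos / denom); // lower when the student matches its own teacher embedding
    }
}
```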
FactorTransfer = 9
Factor Transfer (Kim et al., 2018). Transfers factors (paraphrased representations) from teacher to student.
Best for: Cross-architecture transfer, efficient distillation.
Key Parameters: Paraphraser network, factor layers.
Pros: Flexible, works across different architectures.
Cons: Requires additional paraphraser network.
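A minimal sketch of the factor-matching loss, assuming a paraphraser network has already encoded the teacher's features into a factor vector and a translator network has done the same for the student; the names are illustrative only:

```csharp
using System;
using System.Linq;

public static class FactorTransferSketch
{
    // L2-normalize a factor vector so teacher and student factors live on the same scale.
    static double[] Normalize(double[] v)
    {
        double norm = Math.Sqrt(v.Sum(x => x * x)) + 1e-12;
        return v.Select(x => x / norm).ToArray();
    }

    // L1 distance between normalized factors, as in Kim et al. (2018).
    public static double FactorLoss(double[] teacherFactor, double[] studentFactor)
    {
        var t = Normalize(teacherFactor);
        var s = Normalize(studentFactor);
        return t.Zip(s, (a, b) => Math.Abs(a - b)).Sum();
    }
}
```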
FeatureBased = 1
Feature-based distillation / FitNets (Romero et al., 2014). Matches intermediate layer representations between teacher and student.
Best for: Teacher and student with different architectures (e.g., a large CNN teacher → a MobileNet student), transfer across domains.
Key Parameters: Layer pairs to match, feature weight.
Pros: Transfers deeper knowledge, works across architectures.
Cons: Requires layer mapping, may need projection layers.
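A minimal sketch of a FitNets-style hint loss, assuming the student's guided-layer output has already been passed through a learned projection (regressor) so its dimensionality matches the teacher's hint layer; the names are illustrative only:

```csharp
using System;

public static class FitNetsSketch
{
    // Mean squared error between teacher hint features and projected student features.
    public static double HintLoss(double[] teacherHint, double[] projectedStudent)
    {
        if (teacherHint.Length != projectedStudent.Length)
            throw new ArgumentException("Projection must map student features to the teacher's dimensionality.");

        double sum = 0;
        for (int i = 0; i < teacherHint.Length; i++)
        {
            double d = teacherHint[i] - projectedStudent[i];
            sum += d * d;
        }
        return sum / teacherHint.Length;
    }
}
```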
FlowBased = 6
Flow of Solution Procedure / FSP (Yim et al., 2017). Transfers the flow of information between layers.
Best for: Deep networks, capturing layer-to-layer flow.
Key Parameters: Layer pairs for flow matrices.
Pros: Captures information flow, good for deep networks.
Cons: Requires multiple layer pairs, complex to configure.
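A minimal sketch of the FSP computation, assuming each feature map is supplied as features[channel][spatialPosition] and both layers share the same spatial size; the names are illustrative only:

```csharp
public static class FspSketch
{
    // FSP matrix: inner products between the channels of two layers, averaged over space.
    public static double[,] FspMatrix(double[][] layerA, double[][] layerB)
    {
        int ca = layerA.Length, cb = layerB.Length, hw = layerA[0].Length;
        var g = new double[ca, cb];
        for (int i = 0; i < ca; i++)
            for (int j = 0; j < cb; j++)
            {
                double s = 0;
                for (int p = 0; p < hw; p++) s += layerA[i][p] * layerB[j][p];
                g[i, j] = s / hw;
            }
        return g;
    }

    // Distillation term: squared distance between teacher and student FSP matrices.
    public static double FspLoss(double[,] teacherFsp, double[,] studentFsp)
    {
        double sum = 0;
        for (int i = 0; i < teacherFsp.GetLength(0); i++)
            for (int j = 0; j < teacherFsp.GetLength(1); j++)
            {
                double d = teacherFsp[i, j] - studentFsp[i, j];
                sum += d * d;
            }
        return sum;
    }
}
```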
Hybrid = 12
Combined/Hybrid distillation. Combines multiple strategies (e.g., Response + Feature + Attention).
Best for: Maximizing knowledge transfer, complex models.
Key Parameters: Weights for each strategy.
Pros: Transfers knowledge at multiple levels.
Cons: More hyperparameters to tune, computationally expensive.
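A minimal sketch of the hybrid objective, assuming the individual distillation terms have already been computed; the term names and default weights below are illustrative, not library defaults:

```csharp
public static class HybridSketch
{
    // Weighted sum of individual distillation terms; the per-strategy weights are the
    // main hyperparameters to tune.
    public static double HybridLoss(
        double responseLoss, double featureLoss, double attentionLoss,
        double responseWeight = 1.0, double featureWeight = 0.5, double attentionWeight = 0.5)
        => responseWeight * responseLoss
         + featureWeight * featureLoss
         + attentionWeight * attentionLoss;
}
```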
NeuronSelectivity = 10
Neuron Selectivity Transfer / NST (Huang & Wang, 2017). Matches the distributions of neuron activations between teacher and student.
ProbabilisticTransfer = 7
Probabilistic Knowledge Transfer / PKT (Passalis & Tefas, 2018). Matches the probability distributions induced by pairwise sample similarities in teacher and student feature spaces.
RelationBased = 3
Relational Knowledge Distillation / RKD (Park et al., 2019). Preserves relationships (distances and angles) between sample representations.
Best for: Metric learning, few-shot learning, embedding models.
Key Parameters: Distance weight, angle weight.
Pros: Preserves structural relationships, robust to architecture changes.
Cons: More computationally expensive (pairwise comparisons).
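A minimal sketch of the distance-wise RKD term over a batch of embeddings (the angle-wise term is omitted for brevity, and at least two samples per batch are assumed); the names are illustrative only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RkdSketch
{
    static double Distance(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.Sqrt(s);
    }

    // Smooth-L1 (Huber-style) penalty used by Park et al. (2019).
    static double SmoothL1(double x)
        => Math.Abs(x) < 1 ? 0.5 * x * x : Math.Abs(x) - 0.5;

    public static double DistanceLoss(double[][] teacher, double[][] student)
    {
        var tDists = new List<double>();
        var sDists = new List<double>();
        for (int i = 0; i < teacher.Length; i++)
            for (int j = i + 1; j < teacher.Length; j++)
            {
                tDists.Add(Distance(teacher[i], teacher[j]));
                sDists.Add(Distance(student[i], student[j]));
            }

        // Normalize by the mean pairwise distance so the two embedding scales are comparable.
        double tMean = tDists.Average() + 1e-12;
        double sMean = sDists.Average() + 1e-12;

        double loss = 0;
        for (int k = 0; k < tDists.Count; k++)
            loss += SmoothL1(sDists[k] / sMean - tDists[k] / tMean);
        return loss / tDists.Count;
    }
}
```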
ResponseBased = 0
Response-based distillation (Hinton et al., 2015). Matches the teacher's final output predictions using temperature-scaled softmax.
Best for: Standard classification tasks, general-purpose distillation.
Key Parameters: Temperature (2-10), Alpha (0.3-0.5).
Pros: Simple, effective, widely used.
Cons: Doesn't capture intermediate representations.
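A minimal sketch of the classic response-based objective, assuming raw logits from teacher and student plus the ground-truth class index; the T² scaling of the soft term follows Hinton et al. (2015), and the names are illustrative only:

```csharp
using System;
using System.Linq;

public static class ResponseKdSketch
{
    static double[] Softmax(double[] logits, double temperature)
    {
        var scaled = logits.Select(z => z / temperature).ToArray();
        double max = scaled.Max();                       // subtract max for numerical stability
        var exp = scaled.Select(z => Math.Exp(z - max)).ToArray();
        double sum = exp.Sum();
        return exp.Select(e => e / sum).ToArray();
    }

    public static double Loss(
        double[] studentLogits, double[] teacherLogits, int label,
        double temperature = 4.0, double alpha = 0.5)
    {
        var p = Softmax(teacherLogits, temperature);     // softened teacher targets
        var q = Softmax(studentLogits, temperature);     // softened student predictions
        var hard = Softmax(studentLogits, 1.0);          // ordinary student predictions

        // KL(teacher || student) on the softened distributions.
        double soft = 0;
        for (int i = 0; i < p.Length; i++)
            soft += p[i] * Math.Log((p[i] + 1e-12) / (q[i] + 1e-12));

        // Cross-entropy with the true label.
        double hardCe = -Math.Log(hard[label] + 1e-12);

        return alpha * temperature * temperature * soft + (1 - alpha) * hardCe;
    }
}
```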
SelfDistillation = 11
Self-distillation (Zhang et al., 2019; Furlanello et al., 2018). The model learns from its own predictions to improve calibration and generalization.
Best for: Improving calibration, no separate teacher needed.
Key Parameters: Generations, temperature, EMA decay.
Pros: No separate teacher, improves calibration.
Cons: Requires multiple training runs.
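A minimal sketch of the EMA-teacher flavour of self-distillation, where the "teacher" is an exponential moving average of the student's own weights updated after every optimization step; the flat weight layout and names are illustrative only:

```csharp
public static class SelfDistillationSketch
{
    // Update the EMA teacher in place: teacher = decay * teacher + (1 - decay) * student.
    // The EMA copy then provides the soft targets for the next training step.
    public static void UpdateEmaTeacher(double[] teacherWeights, double[] studentWeights, double decay = 0.999)
    {
        for (int i = 0; i < teacherWeights.Length; i++)
            teacherWeights[i] = decay * teacherWeights[i] + (1 - decay) * studentWeights[i];
    }
}
```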
SimilarityPreserving = 5
Similarity-Preserving Knowledge Distillation / SP (Tung & Mori, 2019). Encourages the student to reproduce the pairwise activation similarities that the teacher produces within a batch.
VariationalInformation = 8
Variational Information Distillation / VID (Ahn et al., 2019). Uses variational bounds to maximize mutual information between teacher and student.
Best for: Information-theoretic distillation, maximizing information transfer.
Key Parameters: Variational parameters, MI estimator.
Pros: Theoretical guarantees, maximizes information.
Cons: Complex implementation, harder to tune.
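A minimal sketch of the VID objective under its Gaussian assumption, where the student predicts a mean for each teacher feature and a per-dimension variance is learned; all inputs are assumed precomputed and the names are illustrative only:

```csharp
using System;

public static class VidSketch
{
    // Negative log of a Gaussian variational bound, summed over feature dimensions
    // (constant terms dropped). Minimizing this maximizes a lower bound on the mutual
    // information between teacher and student features.
    public static double VidLoss(double[] teacherFeatures, double[] predictedMean, double[] variance)
    {
        double loss = 0;
        for (int d = 0; d < teacherFeatures.Length; d++)
        {
            double diff = teacherFeatures[d] - predictedMean[d];
            loss += 0.5 * Math.Log(variance[d]) + diff * diff / (2 * variance[d]);
        }
        return loss;
    }
}
```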
Remarks
For Beginners: Different distillation strategies focus on different aspects of the teacher's knowledge. Some match final outputs, others match intermediate features or relationships between samples.
Choosing a Strategy:
- Use **ResponseBased** for most cases (standard Hinton distillation)
- Use **FeatureBased** when the student architecture differs significantly from the teacher
- Use **AttentionBased** for transformer models (BERT, GPT)
- Use **RelationBased** to preserve relationships between samples
- Use **ContrastiveBased** for self-supervised learning scenarios
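As a hedged illustration of how the enum might be consumed, the sketch below defines a placeholder `DistillationOptions` record; it is not this library's documented configuration API:

```csharp
// Hypothetical configuration type, defined here only for illustration.
public sealed record DistillationOptions(
    DistillationStrategyType Strategy,
    double Temperature = 4.0,  // typical range 2-10 for response-based distillation
    double Alpha = 0.5);       // weight on the soft (teacher) term

public static class StrategySelectionExample
{
    // Response-based distillation is the usual starting point for classification tasks.
    public static DistillationOptions Default()
        => new DistillationOptions(DistillationStrategyType.ResponseBased);
}
```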