Enum TeacherModelType
Specifies the type of teacher model to use for knowledge distillation.
public enum TeacherModelType
Fields
Adaptive = 5
Adaptive teacher that adjusts its teaching to student performance, modulating difficulty or focus areas according to how well the student is learning (sketch below).
Best for: Curriculum learning, progressive training.
Requirements: Teacher model + adaptation logic.
Pros: Optimizes teaching strategy, faster convergence.
Cons: More complex, requires performance monitoring.
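For illustration, here is a minimal sketch of one possible adaptation rule: soften the teacher's targets (raise the distillation temperature) while the student struggles, and sharpen them as it improves. The rule, bounds, and class name are assumptions for this sketch, not this library's actual adaptation logic.

```csharp
using System;

// Hypothetical adaptation rule -- not the library's implementation.
public sealed class AdaptiveTemperature
{
    public double Temperature { get; private set; } = 4.0;
    private double _previousLoss = double.MaxValue;

    public void Update(double studentValidationLoss)
    {
        if (studentValidationLoss < _previousLoss)
            // Student improved: sharpen targets toward the hard-label limit (T = 1).
            Temperature = Math.Max(1.0, Temperature * 0.95);
        else
            // Student plateaued or regressed: soften targets to expose more of the
            // teacher's "dark knowledge" about inter-class similarity.
            Temperature = Math.Min(10.0, Temperature * 1.05);
        _previousLoss = studentValidationLoss;
    }
}
```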
Curriculum = 7
Curriculum teacher that presents samples in order of increasing difficulty, starting with easy examples and ramping up gradually (sketch below).
Best for: Complex tasks, improving convergence.
Requirements: Teacher model + curriculum strategy.
Pros: Better convergence, handles complex tasks.
Cons: Requires curriculum design, longer training.
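A minimal sketch of a linear pacing schedule, assuming a precomputed per-sample difficulty score (for example, the teacher's loss on each sample). The 20% starting fraction and linear ramp are illustrative choices, not this library's curriculum strategy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CurriculumSketch
{
    // Returns the training indices to use this epoch: a growing prefix of the
    // samples sorted from easiest to hardest.
    public static List<int> SampleIndicesForEpoch(double[] difficulty, int epoch, int totalEpochs)
    {
        int[] sorted = Enumerable.Range(0, difficulty.Length)
                                 .OrderBy(i => difficulty[i])
                                 .ToArray();
        // Linear pacing: start with the easiest 20%, reach 100% by the last epoch.
        double progress = epoch / (double)Math.Max(1, totalEpochs - 1);
        double fraction = Math.Min(1.0, 0.2 + 0.8 * progress);
        int count = Math.Max(1, (int)(fraction * sorted.Length));
        return sorted.Take(count).ToList();
    }
}
```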
Distributed = 10
Distributed teacher split across multiple devices or nodes. A large teacher model is partitioned for efficient inference (sketch below).
Best for: Very large teachers, distributed training.
Requirements: Multi-device setup, large teacher.
Pros: Handles very large models, parallel processing.
Cons: Complex setup, communication overhead.
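The sketch below shows only the partitioning idea behind a pipeline-style distributed teacher: layers are split into contiguous stages, one per device, so activations flow stage to stage during inference. Inter-device transport, which dominates real deployments, is omitted.

```csharp
public static class ShardingSketch
{
    // Near-even split of layerCount layers into deviceCount contiguous stages.
    // Returns (first layer index, number of layers) per device.
    public static (int Start, int Count)[] PartitionLayers(int layerCount, int deviceCount)
    {
        var stages = new (int Start, int Count)[deviceCount];
        int baseSize = layerCount / deviceCount;
        int remainder = layerCount % deviceCount;
        int next = 0;
        for (int d = 0; d < deviceCount; d++)
        {
            int size = baseSize + (d < remainder ? 1 : 0);
            stages[d] = (next, size);
            next += size;
        }
        return stages;
    }
}
```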
Ensemble = 1
Ensemble of multiple teacher models. Combines predictions from several teachers by averaging, voting, or a weighted combination (sketch below).
Best for: High-accuracy requirements, combining diverse models.
Requirements: Multiple pre-trained teacher models.
Pros: More robust, captures diverse knowledge.
Cons: Slower (multiple forward passes), requires more memory.
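A minimal sketch of the weighted-average combination strategy. It assumes each teacher has already produced a logit vector for the same input; the names are illustrative, not this library's API.

```csharp
public static class EnsembleSketch
{
    // Weighted average of per-teacher logit vectors; weights should sum to 1.
    public static double[] CombineLogits(double[][] teacherLogits, double[] weights)
    {
        int classes = teacherLogits[0].Length;
        var combined = new double[classes];
        for (int t = 0; t < teacherLogits.Length; t++)
            for (int c = 0; c < classes; c++)
                combined[c] += weights[t] * teacherLogits[t][c];
        return combined; // feed this to the usual softened-softmax distillation loss
    }
}
```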
MultiModal = 4
Multi-modal teacher (e.g., CLIP, vision-language models). Handles multiple input modalities such as text, images, and audio (sketch below).
Best for: Cross-modal learning, vision-language tasks.
Requirements: Multi-modal pre-trained model.
Pros: Handles multiple modalities, rich representations.
Cons: Complex, requires multi-modal data.
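As a rough illustration of the CLIP-style case, this sketch turns an image embedding and a set of text embeddings into a row of cosine similarities, which can serve as soft targets for the student. The teacher's encoders that produce the embeddings are assumed and not shown.

```csharp
using System;

public static class MultiModalSketch
{
    public static double CosineSimilarity(double[] a, double[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.Sqrt(na) * Math.Sqrt(nb));
    }

    // One row of the image-text similarity matrix the student learns to match.
    public static double[] SimilarityRow(double[] imageEmbedding, double[][] textEmbeddings)
    {
        var row = new double[textEmbeddings.Length];
        for (int t = 0; t < textEmbeddings.Length; t++)
            row[t] = CosineSimilarity(imageEmbedding, textEmbeddings[t]);
        return row;
    }
}
```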
NeuralNetwork = 0
Standard neural network teacher. Uses a single pre-trained neural network as the teacher model (sketch below).
Best for: Standard distillation scenarios, single teacher.
Requirements: Pre-trained teacher model.
Pros: Simple, straightforward, fast.
Cons: Limited to single model's knowledge.
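For reference, a minimal sketch of the classic softened-softmax distillation loss (Hinton et al.) that single-teacher setups typically compute. The helper names are illustrative, not this library's API.

```csharp
using System;
using System.Linq;

public static class DistillationLossSketch
{
    // Softmax with temperature T: higher T yields softer probabilities.
    public static double[] SoftmaxWithTemperature(double[] logits, double t)
    {
        double max = logits.Max(); // subtract max for numerical stability
        double[] exp = logits.Select(z => Math.Exp((z - max) / t)).ToArray();
        double sum = exp.Sum();
        return exp.Select(e => e / sum).ToArray();
    }

    // KL(teacher || student) over softened distributions: the core KD term.
    public static double Loss(double[] teacherLogits, double[] studentLogits, double t)
    {
        double[] p = SoftmaxWithTemperature(teacherLogits, t);
        double[] q = SoftmaxWithTemperature(studentLogits, t);
        double kl = 0.0;
        for (int i = 0; i < p.Length; i++)
            kl += p[i] * Math.Log(p[i] / q[i]);
        return kl * t * t; // T^2 scaling keeps gradients comparable across temperatures
    }
}
```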
Online = 6
Online teacher that updates during student training. The teacher's weights are updated alongside the student's (co-training; sketch below).
Best for: Continuous learning, evolving data distributions.
Requirements: Updateable teacher model.
Pros: Adapts to new data, maintains relevance.
Cons: Risk of teacher degradation, complex optimization.
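One widely used realization of an online teacher is an exponential-moving-average update toward the student after each training step (as in mean-teacher setups). The sketch below assumes that scheme; the library's co-training rule may differ.

```csharp
public static class OnlineTeacherSketch
{
    // After each student optimizer step, drift the teacher toward the student.
    // A decay close to 1 keeps the teacher a slow, stable average of past students.
    public static void EmaUpdate(double[] teacherWeights, double[] studentWeights, double decay = 0.999)
    {
        for (int i = 0; i < teacherWeights.Length; i++)
            teacherWeights[i] = decay * teacherWeights[i] + (1.0 - decay) * studentWeights[i];
    }
}
```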
Pretrained = 2
Pretrained model loaded from a checkpoint or ONNX file. Loads the teacher from a saved checkpoint, an ONNX model, or another serialized format (sketch below).
Best for: Using external models, cross-framework distillation.
Requirements: Model checkpoint or ONNX file.
Pros: Reuse existing models, framework-agnostic.
Cons: May require format conversions.
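As a concrete example of the ONNX route, this snippet loads a teacher with the Microsoft.ML.OnnxRuntime package and runs one forward pass. The file name, input name, and shape are model-specific placeholders; check session.InputMetadata for the real ones.

```csharp
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Load the serialized teacher ("teacher.onnx" is a placeholder path).
using var session = new InferenceSession("teacher.onnx");

// Placeholder input: a batch of one 3x224x224 image, all zeros.
var input = new DenseTensor<float>(new[] { 1, 3, 224, 224 });
using var results = session.Run(new[]
{
    NamedOnnxValue.CreateFromTensor("input", input) // "input" must match the model's input name
});
float[] teacherLogits = results.First().AsTensor<float>().ToArray();
```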
Quantized = 9
Quantized teacher with reduced precision (INT8, INT4). Uses a quantized version of the teacher for faster inference during distillation (sketch below).
Best for: Fast distillation, resource-constrained environments.
Requirements: Quantized teacher model.
Pros: Faster, less memory, still effective.
Cons: Slight accuracy loss, quantization overhead.
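A minimal sketch of symmetric per-tensor INT8 quantization, the basic scheme behind a quantized teacher: weights are stored as signed bytes plus a single scale and dequantized on the fly. Production INT8/INT4 pipelines add calibration and per-channel scales, omitted here.

```csharp
using System;
using System.Linq;

public static class QuantizationSketch
{
    // Map weights into [-127, 127] with one shared scale.
    public static (sbyte[] Quantized, double Scale) Quantize(double[] weights)
    {
        double scale = weights.Max(w => Math.Abs(w)) / 127.0;
        sbyte[] q = weights.Select(w => (sbyte)Math.Round(w / scale)).ToArray();
        return (q, scale);
    }

    public static double Dequantize(sbyte q, double scale) => q * scale;
}
```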
Self = 8
Self-teacher where the model teaches itself (Born-Again Networks). The model acts as its own teacher to improve calibration and generalization (sketch below).
Best for: Improving calibration, no separate teacher available.
Requirements: Initial trained model.
Pros: No separate teacher needed, improves calibration.
Cons: Requires multiple training generations.
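A minimal sketch of the Born-Again Networks generational loop: generation 0 trains on hard labels, and each later generation distills from the previous one using the same architecture. The training routines are passed as delegates because they stand in for whatever trainer you actually use.

```csharp
using System;

public static class BornAgainSketch
{
    public static TModel Run<TModel, TData>(
        TData data,
        int generations,
        Func<TData, TModel> trainOnLabels,            // placeholder for supervised training
        Func<TData, TModel, TModel> trainWithTeacher) // placeholder for distillation training
    {
        TModel teacher = trainOnLabels(data); // generation 0: ordinary supervised training
        for (int g = 1; g <= generations; g++)
            teacher = trainWithTeacher(data, teacher); // each student becomes the next teacher
        return teacher;
    }
}
```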
Transformer = 3
Transformer-based teacher (BERT, GPT, ViT, etc.). Specialized for transformer architectures with attention mechanisms (sketch below).
Best for: Language models, vision transformers, attention-based models.
Requirements: Transformer teacher model.
Pros: Supports attention distillation, handles sequences.
Cons: Specific to transformer architecture.
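As an illustration of attention distillation, this sketch computes the mean squared error between a teacher attention map and the corresponding student map for one head (both of shape [seqLen, seqLen]). How layers and heads are paired when the architectures differ is a design choice not shown here.

```csharp
public static class AttentionDistillationSketch
{
    public static double AttentionLoss(double[,] teacherAttn, double[,] studentAttn)
    {
        int rows = teacherAttn.GetLength(0);
        int cols = teacherAttn.GetLength(1);
        double sum = 0.0;
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
            {
                double diff = teacherAttn[i, j] - studentAttn[i, j];
                sum += diff * diff;
            }
        return sum / (rows * cols); // mean squared error over the map
    }
}
```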
Remarks
For Beginners: The teacher model is the "expert" that guides the student model's learning. Different teacher types are suited for different scenarios and distillation goals.
Choosing a Teacher:
- Use **NeuralNetwork** for standard NN-to-NN distillation
- Use **Ensemble** to combine knowledge from multiple models
- Use **Pretrained** to load from checkpoints or ONNX
- Use **Adaptive** for curriculum learning (progressive difficulty)
- Use **Online** when the teacher should update during training

A hypothetical usage sketch is shown below.
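TeacherModelType is the enum documented above; DistillationOptions and its properties are hypothetical placeholders, so consult the library's actual configuration type for the real names.

```csharp
// Hypothetical usage -- only TeacherModelType is real here.
var options = new DistillationOptions
{
    TeacherType = TeacherModelType.Ensemble, // combine several pre-trained teachers
    Temperature = 4.0                        // softening temperature for the KD loss
};

// Placeholder configuration type, not this library's API.
public sealed class DistillationOptions
{
    public TeacherModelType TeacherType { get; set; } = TeacherModelType.NeuralNetwork;
    public double Temperature { get; set; } = 4.0;
}
```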