Enum SSLMethodType
Specifies the type of self-supervised learning method to use for representation learning.
public enum SSLMethodType
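The full declaration, collecting the numeric values documented under Fields below:

```csharp
public enum SSLMethodType
{
    SimCLR = 0,
    MoCo = 1,
    MoCoV2 = 2,
    MoCoV3 = 3,
    BYOL = 4,
    SimSiam = 5,
    BarlowTwins = 6,
    DINO = 7,
    iBOT = 8,
    MAE = 9
}
```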
Fields
BYOL = 4
BYOL: Bootstrap Your Own Latent (Grill et al., 2020). Non-contrastive method using a momentum encoder without negative samples.
Best for: Avoiding negative sample mining, asymmetric networks.
Key Parameters: Momentum (0.99-0.999), predictor MLP.
Pros: No negative samples needed, robust to batch size.
Cons: Requires careful design to prevent collapse.
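The momentum update at the heart of BYOL is a plain exponential moving average of the online encoder's weights. A minimal sketch, assuming weights are exposed as flat arrays; MomentumUpdate is illustrative, not part of this library's API:

```csharp
// Illustrative only: BYOL-style EMA update of the target encoder,
// using a momentum in the 0.99-0.999 range noted above.
static void MomentumUpdate(float[] targetWeights, float[] onlineWeights, float momentum = 0.999f)
{
    for (int i = 0; i < targetWeights.Length; i++)
    {
        // theta_target <- m * theta_target + (1 - m) * theta_online
        targetWeights[i] = momentum * targetWeights[i]
                         + (1f - momentum) * onlineWeights[i];
    }
}
```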
BarlowTwins = 6
Barlow Twins: Self-Supervised Learning via Redundancy Reduction (Zbontar et al., 2021). Reduces redundancy in embeddings by driving the cross-correlation matrix of the two views toward the identity.
Best for: Interpretable approach, avoiding collapse naturally.
Key Parameters: Lambda (redundancy reduction weight), projection dimension.
Pros: Interpretable loss, naturally avoids collapse, no negative samples.
Cons: Requires careful scaling of loss terms.
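The loss is easy to state once the cross-correlation matrix of the two embedding views has been computed. A minimal sketch; BarlowTwinsLoss is illustrative, not this library's API, and the default lambda of 5e-3 is the value used in the paper:

```csharp
// Illustrative only: Barlow Twins loss over the d x d cross-correlation matrix C.
// Diagonal entries are pulled toward 1 (invariance); off-diagonal entries are
// pushed toward 0 (redundancy reduction), weighted by lambda.
static float BarlowTwinsLoss(float[,] c, float lambda = 0.005f)
{
    int d = c.GetLength(0);
    float onDiag = 0f, offDiag = 0f;
    for (int i = 0; i < d; i++)
        for (int j = 0; j < d; j++)
        {
            if (i == j) onDiag += (1f - c[i, j]) * (1f - c[i, j]);
            else offDiag += c[i, j] * c[i, j];
        }
    return onDiag + lambda * offDiag;
}
```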
DINO = 7
DINO: Emerging Properties in Self-Supervised Vision Transformers (Caron et al., 2021). Self-distillation with no labels using centering and sharpening.
Best for: Vision Transformers, emergent attention properties.
Key Parameters: Teacher temperature (0.04-0.07), centering momentum.
Pros: Emergent attention maps, strong ViT performance.
Cons: Primarily designed for Vision Transformers.
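Centering and sharpening are applied to the teacher's outputs before they are used as distillation targets; the center itself is maintained as an EMA of batch means using the centering momentum. A minimal sketch; TeacherTargets is illustrative, not this library's API:

```csharp
// Illustrative only: DINO teacher targets = softmax((logits - center) / teacherTemp).
// Subtracting the running center and using a low temperature (0.04-0.07) together
// prevent collapse to a uniform or one-hot distribution.
static float[] TeacherTargets(float[] logits, float[] center, float teacherTemp = 0.04f)
{
    var p = new float[logits.Length];
    float max = float.MinValue;
    for (int i = 0; i < logits.Length; i++)
    {
        p[i] = (logits[i] - center[i]) / teacherTemp;   // centering + sharpening
        if (p[i] > max) max = p[i];
    }
    float sum = 0f;
    for (int i = 0; i < p.Length; i++) { p[i] = MathF.Exp(p[i] - max); sum += p[i]; }
    for (int i = 0; i < p.Length; i++) p[i] /= sum;     // softmax
    return p;
}
```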
MAE = 9
MAE: Masked Autoencoders Are Scalable Vision Learners (He et al., 2022). Generative approach that reconstructs masked image patches.
Best for: Efficient pretraining, generative understanding.
Key Parameters: Mask ratio (0.75), decoder depth.
Pros: Efficient (only encode visible patches), scalable.
Cons: May require fine-tuning for best downstream performance.
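The efficiency comes from encoding only the visible patches. A minimal sketch of the random masking step; SampleVisiblePatches is illustrative, not this library's API:

```csharp
// Illustrative only: with a mask ratio of 0.75, keep a random 25% of patch
// indices; only these are passed through the encoder.
static int[] SampleVisiblePatches(int numPatches, double maskRatio = 0.75, int seed = 0)
{
    var rng = new Random(seed);
    var indices = new int[numPatches];
    for (int i = 0; i < numPatches; i++) indices[i] = i;
    for (int i = numPatches - 1; i > 0; i--)            // Fisher-Yates shuffle
    {
        int j = rng.Next(i + 1);
        (indices[i], indices[j]) = (indices[j], indices[i]);
    }
    int keep = (int)Math.Round(numPatches * (1.0 - maskRatio));
    var visible = new int[keep];
    Array.Copy(indices, visible, keep);
    return visible;
}
```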
MoCo = 1
MoCo: Momentum Contrast for Unsupervised Visual Representation Learning (He et al., 2020). Uses a momentum encoder and memory queue for efficient contrastive learning.
Best for: Limited GPU memory, consistent negative samples.
Key Parameters: Queue size (65536), momentum (0.999), temperature (0.07).
Pros: Memory efficient, consistent negative samples, good performance.
Cons: More complex than SimCLR, requires momentum encoder.
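The memory queue is what decouples the number of negatives from the batch size. A minimal sketch; MoCoQueue is illustrative, not this library's API:

```csharp
// Illustrative only: fixed-size FIFO of key embeddings produced by the
// momentum encoder. New keys overwrite the oldest entries, so the queue
// (65536 in the paper) always holds recent, consistent negatives.
sealed class MoCoQueue
{
    private readonly float[][] _keys;
    private int _ptr;

    public MoCoQueue(int queueSize, int dim)
    {
        _keys = new float[queueSize][];
        for (int i = 0; i < queueSize; i++) _keys[i] = new float[dim];
    }

    public void Enqueue(float[][] batchKeys)
    {
        foreach (var key in batchKeys)
        {
            key.CopyTo(_keys[_ptr], 0);
            _ptr = (_ptr + 1) % _keys.Length;   // overwrite the oldest slot
        }
    }

    public float[][] Negatives => _keys;
}
```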
MoCoV2 = 2
MoCo v2: Improved Baselines with Momentum Contrastive Learning (Chen et al., 2020). Adds an MLP projection head and stronger augmentations to MoCo.
Best for: Better performance than MoCo v1 with similar efficiency.
Key Parameters: Same as MoCo plus MLP projection head.
Pros: Combines MoCo efficiency with SimCLR improvements.
Cons: Slightly more complex than MoCo v1.
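The MLP projection head is a small two-layer network applied to the encoder output before the contrastive loss. A minimal sketch with biases omitted; ProjectionHead and MatVec are illustrative, not this library's API:

```csharp
// Illustrative only: the linear -> ReLU -> linear projection head that
// MoCo v2 adds on top of the backbone features (biases omitted for brevity).
static float[] ProjectionHead(float[] features, float[,] w1, float[,] w2)
{
    var hidden = MatVec(w1, features);
    for (int i = 0; i < hidden.Length; i++) hidden[i] = Math.Max(0f, hidden[i]); // ReLU
    return MatVec(w2, hidden);
}

static float[] MatVec(float[,] w, float[] x)
{
    var y = new float[w.GetLength(0)];
    for (int i = 0; i < y.Length; i++)
        for (int j = 0; j < x.Length; j++)
            y[i] += w[i, j] * x[j];
    return y;
}
```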
MoCoV3 = 3
MoCo v3: An Empirical Study of Training Self-Supervised Vision Transformers (Chen et al., 2021). Adapted for Vision Transformers without memory queue.
Best for: Vision Transformers (ViT), modern architectures.
Key Parameters: Momentum (0.99-0.999), symmetric loss.
Pros: Optimized for ViT, simpler than MoCo v1/v2.
Cons: Best suited for transformer architectures.
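With the queue removed, the loss is symmetrized over the two augmented views: each view acts once as query and once as key. A minimal sketch; SymmetrizedLoss is illustrative, not this library's API, and the contrastive term is passed in as a delegate:

```csharp
// Illustrative only: MoCo v3 symmetric loss, ctr(q1, k2) + ctr(q2, k1),
// where q* come from the online encoder and k* from the momentum encoder.
static float SymmetrizedLoss(
    float[][] q1, float[][] k1, float[][] q2, float[][] k2,
    Func<float[][], float[][], float> contrastiveLoss)
    => contrastiveLoss(q1, k2) + contrastiveLoss(q2, k1);
```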
SimCLR = 0
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations (Chen et al., 2020). Uses large-batch contrastive learning with strong augmentations.
Best for: Simple setup, strong performance, research baselines.
Key Parameters: Temperature (0.1-0.5), batch size (256-8192), projection dimension (128).
Pros: Simple architecture, strong performance, well-understood.
Cons: Requires large batch sizes for best performance.
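The core of SimCLR is the temperature-scaled contrastive (NT-Xent / InfoNCE) objective. A minimal per-anchor sketch, assuming L2-normalized embeddings so the dot product equals cosine similarity; InfoNceLoss and Dot are illustrative, not this library's API:

```csharp
// Illustrative only: InfoNCE for one anchor, its positive, and a set of
// negatives, scaled by the temperature (0.1-0.5 above).
static float InfoNceLoss(float[] anchor, float[] positive, float[][] negatives, float temperature = 0.1f)
{
    float pos = MathF.Exp(Dot(anchor, positive) / temperature);
    float denom = pos;
    foreach (var neg in negatives)
        denom += MathF.Exp(Dot(anchor, neg) / temperature);
    return -MathF.Log(pos / denom);
}

static float Dot(float[] a, float[] b)
{
    float s = 0f;
    for (int i = 0; i < a.Length; i++) s += a[i] * b[i];
    return s;
}
```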
SimSiam = 5
SimSiam: Exploring Simple Siamese Representation Learning (Chen & He, 2021). Non-contrastive Siamese method that prevents collapse with a stop-gradient and a predictor MLP, using neither negative samples nor a momentum encoder.
Best for: Simple setups without a momentum encoder, small batch sizes.
Key Parameters: Predictor MLP, stop-gradient on the target branch.
Pros: No negative samples, no momentum encoder, works with small batches.
Cons: Relies on the stop-gradient and predictor to prevent collapse.
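The objective is a symmetrized negative cosine similarity in which the target branch is detached (stop-gradient). A minimal sketch; SimSiamLoss and CosineSimilarity are illustrative, not this library's API:

```csharp
// Illustrative only: p1/p2 are predictor outputs, z1/z2 are the (detached)
// projections of the other view. Treating z as a constant is the stop-gradient.
static float SimSiamLoss(float[] p1, float[] z1, float[] p2, float[] z2)
    => -0.5f * (CosineSimilarity(p1, z2) + CosineSimilarity(p2, z1));

static float CosineSimilarity(float[] a, float[] b)
{
    float dot = 0f, na = 0f, nb = 0f;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (MathF.Sqrt(na) * MathF.Sqrt(nb) + 1e-12f);
}
```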
iBOT = 8
iBOT: Image BERT Pre-Training with Online Tokenizer (Zhou et al., 2022). Combines masked image modeling with self-distillation.
Best for: Combining generative and discriminative approaches.
Key Parameters: Mask ratio (0.4), patch tokenizer.
Pros: Best of both worlds (DINO + MAE-like objectives).
Cons: More complex than pure DINO or MAE.
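Alongside a DINO-style [CLS] distillation term, iBOT distills teacher patch-token distributions into the student at the masked positions only. A minimal sketch of that masked-patch term; MaskedPatchDistillation is illustrative, not this library's API:

```csharp
// Illustrative only: cross-entropy between teacher token distributions and
// student log-probabilities, averaged over masked patch positions.
static float MaskedPatchDistillation(float[][] teacherProbs, float[][] studentLogProbs, bool[] masked)
{
    float sum = 0f;
    int count = 0;
    for (int i = 0; i < masked.Length; i++)
    {
        if (!masked[i]) continue;                        // only masked patches contribute
        float ce = 0f;
        for (int k = 0; k < teacherProbs[i].Length; k++)
            ce -= teacherProbs[i][k] * studentLogProbs[i][k];
        sum += ce;
        count++;
    }
    return count > 0 ? sum / count : 0f;
}
```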
Remarks
For Beginners: Self-supervised learning (SSL) methods learn useful representations from unlabeled data by creating "pretext tasks" - artificial tasks that force the model to learn meaningful features. Different methods use different strategies to achieve this.
Choosing a Method:
- Use SimCLR for simplicity and good performance (no memory bank needed)
- Use MoCo variants when GPU memory is limited, since the queue supplies many negatives without large batches
- Use BYOL or SimSiam to avoid negative sample mining
- Use BarlowTwins for interpretable redundancy-reduction approach
- Use DINO for Vision Transformers with self-distillation
- Use MAE for generative masked autoencoding approach
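
A minimal selection sketch based on the guidance above. Only SSLMethodType is part of this library; the ChooseMethod helper and its flags are hypothetical:

```csharp
// Hypothetical helper: map two common constraints to a reasonable default method.
static SSLMethodType ChooseMethod(bool usesVisionTransformer, bool smallBatches)
{
    if (usesVisionTransformer) return SSLMethodType.DINO;  // or MoCoV3, MAE, iBOT
    if (smallBatches) return SSLMethodType.BYOL;            // robust to smaller batch sizes
    return SSLMethodType.SimCLR;                            // simple, strong baseline
}
```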