Enum LanguageModelBackbone
Defines the language model backbone types used in multimodal neural networks.
public enum LanguageModelBackbone
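For reference, this is the declaration implied by the field values documented below, ordered by numeric value (the comments paraphrase each field's summary):

public enum LanguageModelBackbone
{
    OPT = 0,        // Open Pre-trained Transformer (Meta AI)
    FlanT5 = 1,     // Instruction-tuned T5 (Google)
    LLaMA = 2,      // Large Language Model Meta AI
    Vicuna = 3,     // LLaMA fine-tuned on conversational data
    Mistral = 4,    // Efficient open-source model
    Chinchilla = 5, // Compute-optimal model (DeepMind)
    Phi = 6,        // Small but capable model (Microsoft)
    Qwen = 7,       // Multilingual model (Alibaba)
    RoBERTa = 8     // Robustly optimized BERT-style encoder
}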
Fields
Chinchilla = 5
Chinchilla by DeepMind - compute-optimal language model.
For Beginners: Chinchilla is DeepMind's language model, sized and trained to balance parameter count against training data for a given compute budget.
Key characteristics:
- 70B parameters trained on 1.4T tokens
- Optimized for training efficiency
- Used as backbone in Flamingo
- Excellent multimodal learning capabilities
FlanT5 = 1
Flan-T5 by Google - instruction-tuned T5 model.
For Beginners: Flan-T5 is Google's T5 model fine-tuned on a mixture of instruction-following tasks. It's an encoder-decoder model.
Key characteristics:
- Encoder-decoder architecture
- Excellent at following instructions
- Well suited to question-answering tasks
- Hidden dimension: 2048 for Flan-T5-XL
- Commonly used in BLIP-2 for instruction-following variants
LLaMA = 2
LLaMA (Large Language Model Meta AI) by Meta.
For Beginners: LLaMA is Meta's efficient open-source language model that achieves strong performance with fewer parameters.
Key characteristics:
- Decoder-only architecture
- Very efficient for its size
- Base model for many fine-tuned variants
- Commonly used in LLaVA
- Available in 7B, 13B, 33B, 65B sizes
Mistral = 4
Mistral - efficient open-source language model.
For Beginners: Mistral is a newer, highly efficient language model that outperforms LLaMA-2 on many benchmarks despite being smaller.
Key characteristics:
- Uses sliding window attention for efficiency
- Strong performance at 7B parameter scale
- Good for resource-constrained scenarios
- Increasingly used in newer LLaVA variants
OPT = 0
OPT (Open Pre-trained Transformer) by Meta AI.
For Beginners: OPT is a family of decoder-only language models from Meta AI that range from 125M to 175B parameters. It's commonly used in BLIP-2.
Key characteristics:
- Decoder-only architecture (like GPT)
- Good for general text generation
- Available in various sizes (OPT-2.7B is common for BLIP-2)
- Hidden dimension: 2560 for OPT-2.7B
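Hidden dimensions like these matter when wiring the backbone to a vision adapter, whose output projection must match the backbone's width. A minimal sketch, assuming a hypothetical helper name and covering only the two variants whose dimensions are documented on this page:

using System;

public static class BackboneDimensions
{
    // Hypothetical helper, not part of the library's API.
    public static int GetHiddenDimension(LanguageModelBackbone backbone) =>
        backbone switch
        {
            LanguageModelBackbone.FlanT5 => 2048, // Flan-T5-XL
            LanguageModelBackbone.OPT => 2560,    // OPT-2.7B
            _ => throw new ArgumentOutOfRangeException(nameof(backbone),
                "Hidden dimension depends on the specific model size.")
        };
}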
Phi = 6
Phi by Microsoft - small but capable language model.
For Beginners: Phi models are Microsoft's small language models that achieve impressive performance for their size.
Key characteristics:
- Very small (1.3B to 3B parameters)
- Trained on high-quality "textbook" data
- Fast inference on limited hardware
- Good for lightweight multimodal applications
Qwen = 7
Qwen by Alibaba - multilingual language model.
For Beginners: Qwen is Alibaba's multilingual language model with strong Chinese and English capabilities.
Key characteristics:
- Strong multilingual support
- Good for international applications
- Available in various sizes
- Used in Qwen-VL for vision-language tasks
RoBERTa = 8
RoBERTa - robustly optimized BERT-style encoder.
For Beginners: RoBERTa is an improved BERT model that uses the same encoder-only architecture but trains longer on more data for better results. It is commonly used in document understanding models like LayoutLMv3.
Vicuna = 3
Vicuna - LLaMA fine-tuned on conversational data.
For Beginners: Vicuna is LLaMA fine-tuned on user conversations, making it better at natural dialogue and instruction-following.
Key characteristics:
- Based on LLaMA architecture
- Fine-tuned for conversation
- Better at following complex instructions
- Popular choice for LLaVA-1.5+
Remarks
This enum specifies which language model architecture is used as the backbone for text generation and understanding in multimodal models like BLIP-2, LLaVA, and Flamingo. The backbone determines the model's capacity, vocabulary, and generation capabilities.
For Beginners: Think of the language model backbone as the "brain" that processes and generates text in vision-language models.
When a model like BLIP-2 needs to describe an image or answer a question about it:
- The vision encoder extracts features from the image
- The Q-Former/adapter bridges vision and language
- The language model backbone generates the actual text response
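A minimal sketch of that flow, with delegates standing in for the three components (all names here are illustrative, not part of this library's API):

using System;

public static class MultimodalFlow
{
    public static string Answer(
        byte[] image,
        string question,
        Func<byte[], float[]> visionEncoder,         // 1. extract image features
        Func<float[], float[]> adapter,              // 2. Q-Former/adapter bridges modalities
        Func<float[], string, string> languageModel) // 3. backbone generates the text response
    {
        float[] imageFeatures = visionEncoder(image);
        float[] bridgedFeatures = adapter(imageFeatures);
        return languageModel(bridgedFeatures, question);
    }
}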
Different backbones have different strengths:
- OPT: Good for general text generation, used in BLIP-2
- FlanT5: Better for instruction-following, used in BLIP-2
- LLaMA: Efficient and powerful, used in LLaVA
- Vicuna: LLaMA fine-tuned for conversations, used in LLaVA
- Mistral: Fast and efficient, newer alternative for LLaVA
- Chinchilla: Used in Flamingo, optimized for multimodal learning
The choice affects model size, speed, and quality of text generation.
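As a rough illustration, a hypothetical helper (not part of the API) that picks a backbone according to the strengths listed above:

public static class BackboneSelector
{
    // Hypothetical selection logic following the strengths listed above.
    public static LanguageModelBackbone Choose(
        bool resourceConstrained, bool conversational, bool instructionFollowing)
    {
        if (resourceConstrained) return LanguageModelBackbone.Phi;     // fast on limited hardware
        if (conversational) return LanguageModelBackbone.Vicuna;       // fine-tuned for dialogue
        if (instructionFollowing) return LanguageModelBackbone.FlanT5; // instruction-tuned
        return LanguageModelBackbone.OPT;                              // general text generation
    }
}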