Class BlipNeuralNetwork<T>
- Namespace
- AiDotNet.NeuralNetworks
- Assembly
- AiDotNet.dll
BLIP (Bootstrapped Language-Image Pre-training) neural network for vision-language tasks.
public class BlipNeuralNetwork<T> : NeuralNetworkBase<T>, INeuralNetworkModel<T>, INeuralNetwork<T>, IFullModel<T, Tensor<T>, Tensor<T>>, IModel<Tensor<T>, Tensor<T>, ModelMetadata<T>>, IModelSerializer, ICheckpointableModel, IParameterizable<T, Tensor<T>, Tensor<T>>, IFeatureAware, IFeatureImportance<T>, ICloneable<IFullModel<T, Tensor<T>, Tensor<T>>>, IGradientComputable<T, Tensor<T>, Tensor<T>>, IJitCompilable<T>, IInterpretableModel<T>, IInputGradientComputable<T>, IDisposable, IBlipModel<T>, IMultimodalEmbedding<T>
Type Parameters
- T: The numeric type used for calculations.
- Inheritance
- NeuralNetworkBase<T> → BlipNeuralNetwork<T>
- Implements
- IBlipModel<T>, IMultimodalEmbedding<T>
Remarks
BLIP extends CLIP's capabilities with image captioning, image-text matching, and visual question answering. It uses a unified framework with both understanding and generation tasks. This implementation supports both ONNX pretrained models and native library layers.
For Beginners: BLIP is a more powerful version of CLIP!
CLIP can:
- Match images with text descriptions
- Perform zero-shot classification
BLIP adds:
- Caption generation ("a dog playing in the park")
- Visual question answering ("What color is the car?" → "Red")
- More accurate image-text matching
Training innovation:
- BLIP was trained on noisy web data
- It learned to filter out bad captions automatically
- Then it generated better captions to train on!
- This "bootstrapping" creates a cleaner dataset
Use cases:
- Accessibility (auto-generate alt-text for images)
- Content moderation (answer "is there violence in this image?")
- Visual search (find images matching a description)
- Image organization (auto-tag photos)
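For orientation, here is a minimal usage sketch. The architecture setup and the image-loading helper are illustrative assumptions, not part of this API; any pipeline that produces a preprocessed [3, 384, 384] tensor will do.

```csharp
// Minimal sketch: construct a BLIP network with default hyperparameters,
// then caption an image and ask a question about it.
// The NeuralNetworkArchitecture setup and LoadPreprocessedImage are
// hypothetical placeholders for your own configuration and preprocessing.
var architecture = new NeuralNetworkArchitecture<float>();
var blip = new BlipNeuralNetwork<float>(architecture);

Tensor<float> image = LoadPreprocessedImage("photo.jpg");
string caption = blip.GenerateCaption(image);
string answer = blip.AnswerQuestion(image, "What color is the car?");
```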
Constructors
BlipNeuralNetwork(NeuralNetworkArchitecture<T>, int, int, int, int, int, int, int, int, int, int, int, ITokenizer?, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a BLIP network using native library layers.
public BlipNeuralNetwork(NeuralNetworkArchitecture<T> architecture, int imageSize = 384, int channels = 3, int patchSize = 16, int vocabularySize = 30522, int maxSequenceLength = 35, int embeddingDimension = 256, int hiddenDim = 768, int numEncoderLayers = 12, int numDecoderLayers = 12, int numHeads = 12, int mlpDim = 3072, ITokenizer? tokenizer = null, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
- architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
- imageSize (int): Expected image size (default 384 for BLIP).
- channels (int): Number of image channels (default 3 for RGB).
- patchSize (int): Patch size for the vision transformer.
- vocabularySize (int): Text vocabulary size (BERT: 30522).
- maxSequenceLength (int): Maximum text sequence length.
- embeddingDimension (int): Dimension of the shared embedding space.
- hiddenDim (int): Hidden dimension for the transformers.
- numEncoderLayers (int): Number of encoder transformer layers.
- numDecoderLayers (int): Number of decoder transformer layers.
- numHeads (int): Number of attention heads.
- mlpDim (int): MLP hidden dimension.
- tokenizer (ITokenizer): Optional tokenizer for text processing.
- optimizer (IOptimizer<T, Tensor<T>, Tensor<T>>): Optional optimizer for training.
- lossFunction (ILossFunction<T>): Optional loss function.
Remarks
This constructor creates a fully trainable BLIP network using the library's native layers. All operations use the Engine for CPU/GPU acceleration.
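As a sketch, a smaller configuration for quick experiments might look like the following; the architecture value is assumed to be configured elsewhere, and every named argument maps to a parameter documented above.

```csharp
// Hypothetical reduced configuration: fewer layers and smaller dimensions
// train faster at the cost of accuracy.
var blip = new BlipNeuralNetwork<float>(
    architecture,          // assumed to be configured elsewhere
    imageSize: 224,
    embeddingDimension: 128,
    hiddenDim: 384,
    numEncoderLayers: 6,
    numDecoderLayers: 6,
    numHeads: 6,
    mlpDim: 1536);
```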
BlipNeuralNetwork(NeuralNetworkArchitecture<T>, string, string, string, ITokenizer, int, int, int, IOptimizer<T, Tensor<T>, Tensor<T>>?, ILossFunction<T>?)
Creates a BLIP network using pretrained ONNX models.
public BlipNeuralNetwork(NeuralNetworkArchitecture<T> architecture, string visionEncoderPath, string textEncoderPath, string textDecoderPath, ITokenizer tokenizer, int embeddingDimension = 256, int maxSequenceLength = 35, int imageSize = 384, IOptimizer<T, Tensor<T>, Tensor<T>>? optimizer = null, ILossFunction<T>? lossFunction = null)
Parameters
- architecture (NeuralNetworkArchitecture<T>): The neural network architecture configuration.
- visionEncoderPath (string): Path to the vision encoder ONNX model.
- textEncoderPath (string): Path to the text encoder ONNX model.
- textDecoderPath (string): Path to the text decoder ONNX model.
- tokenizer (ITokenizer): The tokenizer for text processing.
- embeddingDimension (int): Dimension of the shared embedding space.
- maxSequenceLength (int): Maximum text sequence length.
- imageSize (int): Expected image size.
- optimizer (IOptimizer<T, Tensor<T>, Tensor<T>>): Optional optimizer for fine-tuning.
- lossFunction (ILossFunction<T>): Optional loss function.
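A sketch of the pretrained path; the ONNX file names and the tokenizer factory are assumptions, and the three models must agree with the embedding dimension and image size you pass in.

```csharp
// Hypothetical file names and tokenizer; substitute your own exported
// BLIP components and ITokenizer implementation.
ITokenizer tokenizer = CreateBertTokenizer("vocab.txt");
var blip = new BlipNeuralNetwork<float>(
    architecture,
    visionEncoderPath: "blip_vision_encoder.onnx",
    textEncoderPath: "blip_text_encoder.onnx",
    textDecoderPath: "blip_text_decoder.onnx",
    tokenizer: tokenizer);
```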
Properties
EmbeddingDimension
Gets the dimensionality of the embedding space.
public int EmbeddingDimension { get; }
Property Value
- int
ImageSize
Gets the expected image size (square images: ImageSize x ImageSize pixels).
public int ImageSize { get; }
Property Value
- int
MaxSequenceLength
Gets the maximum sequence length for text input.
public int MaxSequenceLength { get; }
Property Value
- int
ParameterCount
Gets the total number of trainable parameters.
public override int ParameterCount { get; }
Property Value
- int
SupportsTraining
Indicates whether this network supports training (learning from data).
public override bool SupportsTraining { get; }
Property Value
- bool
Remarks
For Beginners: Not all neural networks can learn. Some are designed only for making predictions with pre-set parameters. This property tells you if the network can learn from data.
Methods
AnswerQuestion(Tensor<T>, string, int)
Answers a question about an image's content.
public string AnswerQuestion(Tensor<T> image, string question, int maxLength = 20)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- question (string): The question to answer (e.g., "What color is the car?").
- maxLength (int): Maximum length of the answer.
Returns
- string
The generated answer.
Remarks
Visual Question Answering (VQA) generates natural language answers to questions about image content. The model uses cross-attention to focus on relevant image regions when generating the answer.
For Beginners: Ask questions about images and get answers!
Examples:
- Image: Photo of a kitchen
- "What appliances are visible?" → "refrigerator, microwave, and stove"
- "What color are the cabinets?" → "white"
- "Is there a window?" → "yes, above the sink"
This is useful for:
- Accessibility (describe images for visually impaired users)
- Content moderation (is there alcohol in this photo?)
- Data extraction (what brand is this product?)
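A short sketch, assuming a blip instance and a preprocessed image tensor as in the earlier examples:

```csharp
// Ask free-form questions about one image; maxLength caps the answer.
string appliances = blip.AnswerQuestion(image, "What appliances are visible?");
string hasWindow = blip.AnswerQuestion(image, "Is there a window?", maxLength: 10);
```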
ComputeImageTextMatch(Tensor<T>, string)
Determines whether a given text accurately describes an image.
public T ComputeImageTextMatch(Tensor<T> image, string text)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- text (string): The text description to evaluate.
Returns
- T
A probability score between 0 and 1 indicating match quality.
Remarks
Uses the Image-Text Matching (ITM) head with cross-attention between image patches and text tokens for fine-grained matching. This is more accurate than simple embedding similarity for detailed matching.
For Beginners: This checks if a caption accurately describes an image.
Unlike simple similarity (dot product), this uses "cross-attention" which:
- Looks at specific parts of the image
- Compares them to specific words in the text
- Gives a more accurate yes/no answer
Example:
- Image: A red car parked on a street
- "A red vehicle on pavement" → 0.92 (accurate!)
- "A blue car in a garage" → 0.15 (wrong color and location)
Use this when you need precise matching, not just "related content."
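A sketch of ITM scoring, assuming the blip and image values from the earlier examples; the 0.5 threshold is an arbitrary illustration, not a library default.

```csharp
// Scores near 1 mean the text matches the image; near 0 means it does not.
float match = blip.ComputeImageTextMatch(image, "a red vehicle on pavement");
float mismatch = blip.ComputeImageTextMatch(image, "a blue car in a garage");
bool accepted = match > 0.5f; // example threshold, tune for your data
```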
ComputeSimilarity(Vector<T>, Vector<T>)
Computes similarity between two embeddings.
public T ComputeSimilarity(Vector<T> textEmbedding, Vector<T> imageEmbedding)
Parameters
- textEmbedding (Vector<T>): The text embedding vector.
- imageEmbedding (Vector<T>): The image embedding vector.
Returns
- T
Similarity score (cosine similarity for normalized embeddings).
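A sketch combining this method with the embedding getters documented below, assuming the blip and image values from earlier examples:

```csharp
// Embed both modalities, then compare them in the shared space.
Vector<float> textVec = blip.GetTextEmbedding("a dog playing fetch");
Vector<float> imageVec = blip.GetImageEmbedding(image);
float similarity = blip.ComputeSimilarity(textVec, imageVec);
```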
CreateNewInstance()
Creates a new instance of the same type as this neural network.
protected override IFullModel<T, Tensor<T>, Tensor<T>> CreateNewInstance()
Returns
- IFullModel<T, Tensor<T>, Tensor<T>>
A new instance of the same neural network type.
Remarks
For Beginners: This creates a blank version of the same type of neural network.
It's used internally by methods like DeepCopy and Clone to create the right type of network before copying the data into it.
DeserializeNetworkSpecificData(BinaryReader)
Deserializes network-specific data that was not covered by the general deserialization process.
protected override void DeserializeNetworkSpecificData(BinaryReader reader)
Parameters
- reader (BinaryReader): The BinaryReader to read the data from.
Remarks
This method is called at the end of the general deserialization process to allow derived classes to read any additional data specific to their implementation.
For Beginners: Continuing the suitcase analogy, this is like unpacking that special compartment. After the main deserialization method has unpacked the common items (layers, parameters), this method allows each specific type of neural network to unpack its own unique items that were stored during serialization.
Dispose(bool)
Protected Dispose pattern implementation.
protected override void Dispose(bool disposing)
Parameters
- disposing (bool): True if called from Dispose(); false if called from the finalizer.
EmbedAsync(string)
Asynchronously encodes text into an embedding vector.
public Task<Vector<T>> EmbedAsync(string text)
Parameters
- text (string): The text to encode.
Returns
- Task<Vector<T>>
A task that produces the embedding vector for the text.
EmbedBatchAsync(IEnumerable<string>)
Asynchronously encodes multiple texts into embedding vectors in a batch.
public Task<Matrix<T>> EmbedBatchAsync(IEnumerable<string> texts)
Parameters
- texts (IEnumerable<string>): The texts to encode.
Returns
- Task<Matrix<T>>
A task that produces a matrix where each row is an embedding for the corresponding text.
EncodeImage(double[])
Encodes an image into an embedding vector.
public Vector<T> EncodeImage(double[] imageData)
Parameters
- imageData (double[]): The preprocessed image data as a flattened array in CHW format.
Returns
- Vector<T>
A normalized embedding vector.
EncodeImageBatch(IEnumerable<double[]>)
Encodes multiple images into embedding vectors in a batch.
public Matrix<T> EncodeImageBatch(IEnumerable<double[]> imageDataBatch)
Parameters
- imageDataBatch (IEnumerable<double[]>): The preprocessed images as flattened arrays.
Returns
- Matrix<T>
A matrix where each row is an embedding for the corresponding image.
EncodeText(string)
Encodes text into an embedding vector.
public Vector<T> EncodeText(string text)
Parameters
- text (string): The text to encode.
Returns
- Vector<T>
A normalized embedding vector.
EncodeTextBatch(IEnumerable<string>)
Encodes multiple texts into embedding vectors in a batch.
public Matrix<T> EncodeTextBatch(IEnumerable<string> texts)
Parameters
- texts (IEnumerable<string>): The texts to encode.
Returns
- Matrix<T>
A matrix where each row is an embedding for the corresponding text.
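A sketch of batch encoding, assuming the blip instance from earlier examples:

```csharp
// Each row of the result is one embedding of width EmbeddingDimension.
var texts = new[] { "a cat on a sofa", "a mountain at sunset", "a bowl of ramen" };
Matrix<float> embeddings = blip.EncodeTextBatch(texts);
```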
GenerateCaption(Tensor<T>, int, int)
Generates a caption describing the content of an image.
public string GenerateCaption(Tensor<T> image, int maxLength = 30, int numBeams = 3)
Parameters
- image (Tensor<T>): The preprocessed image tensor with shape [channels, height, width].
- maxLength (int): Maximum number of tokens to generate. Default is 30.
- numBeams (int): Number of beams for beam search. Default is 3 for a quality/speed balance.
Returns
- string
A generated caption describing the image.
Remarks
Uses the image-grounded text decoder to generate descriptive captions. The generation uses beam search by default for higher quality outputs.
For Beginners: This automatically describes what's in an image!
Example:
- Input: Photo of a dog playing fetch in a park
- Output: "a brown dog catching a frisbee on a grassy field"
Parameters:
- maxLength: How long the caption can be (30 = roughly 25 words)
- numBeams: More beams = better captions but slower (3 is a good balance)
Uses "beam search" - it explores multiple possible captions and picks the best one.
GenerateCaptions(Tensor<T>, int, int)
Generates multiple candidate captions for an image.
public IEnumerable<string> GenerateCaptions(Tensor<T> image, int numCaptions = 5, int maxLength = 30)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- numCaptions (int): Number of captions to generate.
- maxLength (int): Maximum length per caption.
Returns
- IEnumerable<string>
A collection of candidate captions.
Remarks
Uses nucleus (top-p) sampling to generate diverse captions. Useful for getting multiple perspectives on an image's content.
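A sketch, assuming the blip and image values from earlier examples:

```csharp
// Sample several diverse candidate captions for the same image.
foreach (string candidate in blip.GenerateCaptions(image, numCaptions: 5))
{
    Console.WriteLine(candidate);
}
```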
GetImageEmbedding(Tensor<T>)
Gets the embedding vector for a preprocessed image tensor.
public Vector<T> GetImageEmbedding(Tensor<T> image)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
Returns
- Vector<T>
The image embedding vector.
GetImageEmbeddings(IEnumerable<Tensor<T>>)
Gets embedding vectors for multiple preprocessed image tensors.
public IEnumerable<Vector<T>> GetImageEmbeddings(IEnumerable<Tensor<T>> images)
Parameters
- images (IEnumerable<Tensor<T>>): The preprocessed image tensors.
Returns
- IEnumerable<Vector<T>>
One embedding vector per input image.
GetModelMetadata()
Retrieves metadata about the BLIP neural network model.
public override ModelMetadata<T> GetModelMetadata()
Returns
- ModelMetadata<T>
A ModelMetadata<T> object containing information about the network.
GetParameters()
Gets all trainable parameters of the network as a single vector.
public override Vector<T> GetParameters()
Returns
- Vector<T>
A vector containing all parameters of the network.
Remarks
For Beginners: Neural networks learn by adjusting their "parameters" (also called weights and biases). This method collects all those adjustable values into a single list so they can be updated during training.
GetTextEmbedding(string)
Gets the embedding vector for a text string.
public Vector<T> GetTextEmbedding(string text)
Parameters
- text (string): The text to encode.
Returns
- Vector<T>
The text embedding vector.
GetTextEmbeddings(IEnumerable<string>)
Gets embedding vectors for multiple texts.
public IEnumerable<Vector<T>> GetTextEmbeddings(IEnumerable<string> texts)
Parameters
- texts (IEnumerable<string>): The texts to encode.
Returns
- IEnumerable<Vector<T>>
One embedding vector per input text.
InitializeLayers()
Initializes the layers of the neural network based on the architecture.
protected override void InitializeLayers()
Remarks
For Beginners: This method sets up all the layers in your neural network according to the architecture you've defined. It's like assembling the parts of your network before you can use it.
Predict(Tensor<T>)
Makes a prediction using the neural network.
public override Tensor<T> Predict(Tensor<T> input)
Parameters
- input (Tensor<T>): The input data to process.
Returns
- Tensor<T>
The network's prediction.
Remarks
For Beginners: This is the main method you'll use to get results from your trained neural network. You provide some input data (like an image or text), and the network processes it through all its layers to produce an output (like a classification or prediction).
RankCaptions(Tensor<T>, IEnumerable<string>)
Ranks a set of candidate captions by how well they match an image.
public IEnumerable<(string Caption, T Score)> RankCaptions(Tensor<T> image, IEnumerable<string> candidates)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- candidates (IEnumerable<string>): The candidate captions to rank.
Returns
- IEnumerable<(string Caption, T Score)>
Captions ranked by match score, from best to worst.
Remarks
Uses the ITM head to score each candidate, then returns them in descending order. Useful for caption reranking in retrieval applications.
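A sketch of reranking, assuming the blip and image values from earlier examples:

```csharp
// Tuples come back ordered from best to worst ITM score.
var candidates = new[] { "a red car on a street", "a blue car in a garage" };
foreach (var (caption, score) in blip.RankCaptions(image, candidates))
{
    Console.WriteLine($"{score}: {caption}");
}
```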
RetrieveImages(string, IEnumerable<Vector<T>>, int)
Retrieves the most relevant images for a text query from a collection.
public IEnumerable<(int Index, T Score)> RetrieveImages(string query, IEnumerable<Vector<T>> imageEmbeddings, int topK = 10)
Parameters
- query (string): The text query describing desired images.
- imageEmbeddings (IEnumerable<Vector<T>>): Pre-computed image embeddings.
- topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score)>
Indices of the top-K matching images with their scores.
Remarks
Performs efficient text-to-image retrieval using embedding similarity. For large collections, pre-compute and cache image embeddings.
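A sketch of a search index, assuming blip from earlier examples and a hypothetical images collection of preprocessed tensors (System.Linq is used for the projection):

```csharp
// Embed the collection once, then reuse the index for every query.
List<Vector<float>> index = images.Select(img => blip.GetImageEmbedding(img)).ToList();
foreach (var (idx, score) in blip.RetrieveImages("a dog on a beach", index, topK: 5))
{
    Console.WriteLine($"image #{idx} scored {score}");
}
```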
RetrieveTexts(Tensor<T>, IEnumerable<Vector<T>>, int)
Retrieves the most relevant texts for an image from a collection.
public IEnumerable<(int Index, T Score)> RetrieveTexts(Tensor<T> image, IEnumerable<Vector<T>> textEmbeddings, int topK = 10)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- textEmbeddings (IEnumerable<Vector<T>>): Pre-computed text embeddings.
- topK (int): Number of results to return.
Returns
- IEnumerable<(int Index, T Score)>
Indices of the top-K matching texts with their scores.
Remarks
Performs efficient image-to-text retrieval using embedding similarity. Useful for finding relevant captions or descriptions for images.
SerializeNetworkSpecificData(BinaryWriter)
Serializes network-specific data that is not covered by the general serialization process.
protected override void SerializeNetworkSpecificData(BinaryWriter writer)
Parameters
- writer (BinaryWriter): The BinaryWriter to write the data to.
Remarks
This method is called at the end of the general serialization process to allow derived classes to write any additional data specific to their implementation.
For Beginners: Think of this as packing a special compartment in your suitcase. While the main serialization method packs the common items (layers, parameters), this method allows each specific type of neural network to pack its own unique items that other networks might not have.
Train(Tensor<T>, Tensor<T>)
Trains the neural network on a single input-output pair.
public override void Train(Tensor<T> input, Tensor<T> target)
Parameters
- input (Tensor<T>): The input data.
- target (Tensor<T>): The expected output the network should learn to produce.
Remarks
This method performs one training step on the neural network using the provided input and expected output. It updates the network's parameters to reduce the error between the network's prediction and the expected output.
For Beginners: This is how your neural network learns. You provide:
- An input (what the network should process)
- The expected output (what the correct answer should be)
The network then:
- Makes a prediction based on the input
- Compares its prediction to the expected output
- Calculates how wrong it was (the loss)
- Adjusts its internal values to do better next time
After training, you can get the loss value using the GetLastLoss() method to see how well the network is learning.
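A sketch of a training loop, assuming blip from earlier examples and a hypothetical trainingPairs sequence of (input, target) tensors:

```csharp
// One pass over the data; GetLastLoss() (mentioned above) tracks progress.
foreach (var (input, target) in trainingPairs)
{
    blip.Train(input, target);
}
Console.WriteLine($"final loss: {blip.GetLastLoss()}");
```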
UpdateParameters(Vector<T>)
Updates the network's parameters with new values.
public override void UpdateParameters(Vector<T> parameters)
Parameters
- parameters (Vector<T>): The new parameter values to set.
Remarks
For Beginners: During training, a neural network's internal values (parameters) get adjusted to improve its performance. This method allows you to update all those values at once by providing a complete set of new parameters.
This is typically used by optimization algorithms that calculate better parameter values based on training data.
ZeroShotClassify(Tensor<T>, IEnumerable<string>)
Performs zero-shot classification of an image tensor against text labels.
public Dictionary<string, T> ZeroShotClassify(Tensor<T> image, IEnumerable<string> classLabels)
Parameters
- image (Tensor<T>): The preprocessed image tensor.
- classLabels (IEnumerable<string>): The candidate class labels.
Returns
- Dictionary<string, T>
A dictionary mapping each label to its probability score.
ZeroShotClassify(double[], IEnumerable<string>)
Performs zero-shot classification of an image against text labels.
public Dictionary<string, T> ZeroShotClassify(double[] imageData, IEnumerable<string> labels)
Parameters
- imageData (double[]): The preprocessed image data.
- labels (IEnumerable<string>): The candidate class labels.
Returns
- Dictionary<string, T>
A dictionary mapping each label to its probability score.
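A sketch of zero-shot classification, assuming the blip and image values from earlier examples (System.Linq is used to pick the top label):

```csharp
// Score the image against free-form labels and take the highest one.
var labels = new[] { "cat", "dog", "bird" };
Dictionary<string, float> scores = blip.ZeroShotClassify(image, labels);
string best = scores.OrderByDescending(kv => kv.Value).First().Key;
```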