Class SafetyFilter<T>
Namespace: AiDotNet.AdversarialRobustness.Safety
Assembly: AiDotNet.dll
Implements comprehensive safety filtering for AI model inputs and outputs.
public class SafetyFilter<T> : ISafetyFilter<T>, IModelSerializer
Type Parameters
T: The numeric data type used for calculations.
Inheritance
object → SafetyFilter<T>
Implements
ISafetyFilter<T>, IModelSerializer
Remarks
SafetyFilter provides multiple layers of protection including input validation, output filtering, jailbreak detection, and harmful content identification.
For Beginners: Think of SafetyFilter as a comprehensive security system for your AI. It checks everything going in and coming out, looking for anything suspicious, harmful, or inappropriate. It's like having security guards, content moderators, and safety inspectors all working together.
Constructors
SafetyFilter(SafetyFilterOptions<T>)
Initializes a new instance of the safety filter.
public SafetyFilter(SafetyFilterOptions<T> options)
Parameters
options (SafetyFilterOptions<T>): The safety filter configuration options.
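A minimal construction sketch. It assumes SafetyFilterOptions<T> has a parameterless constructor, which this page does not confirm:
using AiDotNet.AdversarialRobustness.Safety;

// Create a filter over double-precision values (T = double).
// Assumes a parameterless options constructor; adjust to the actual API.
var options = new SafetyFilterOptions<double>();
var filter = new SafetyFilter<double>(options);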
Methods
ComputeSafetyScore(Vector<T>)
Computes a safety score for model inputs or outputs.
public T ComputeSafetyScore(Vector<T> content)
Parameters
content (Vector<T>): The content to score.
Returns
- T
A safety score between 0 (unsafe) and 1 (completely safe).
Remarks
For Beginners: This gives a single "safety score" from 0 to 1 indicating how safe the content is. Think of it like a trust score - higher numbers mean safer content.
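A short usage sketch, continuing from the constructor example above. GetContentVector is a hypothetical helper standing in for however your pipeline encodes content as a Vector<double>:
Vector<double> content = GetContentVector(); // hypothetical helper
double score = filter.ComputeSafetyScore(content);
if (score < 0.5) // threshold is an illustrative choice
{
    // Treat the content as potentially unsafe.
}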
Deserialize(byte[])
Loads a previously serialized model from binary data.
public void Deserialize(byte[] data)
Parameters
data (byte[]): The byte array containing the serialized model data.
Remarks
This method takes binary data created by the Serialize method and uses it to restore a model to its previous state.
For Beginners: This is like opening a saved file to continue your work.
When you call this method:
- You provide the binary data (bytes) that was previously created by Serialize
- The model rebuilds itself using this data
- After deserializing, the model is exactly as it was when serialized
- It's ready to make predictions without needing to be trained again
For example:
- You download a pre-trained model file for detecting spam emails
- You deserialize this file into your application
- Immediately, your application can detect spam without any training
- The model has all the knowledge that was built into it by its original creator
This is particularly useful when:
- You want to use a model that took days to train
- You need to deploy the same model across multiple devices
- You're creating an application that non-technical users will use
Think of it like installing the brain of a trained expert directly into your application.
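A sketch of the restore step, assuming the bytes were previously written to disk by Serialize (see below); the file name is illustrative:
using System.IO;

// Load bytes saved earlier and restore the filter's state.
byte[] data = File.ReadAllBytes("safety-filter.bin"); // file name is illustrative
filter.Deserialize(data);
// The filter now behaves exactly as it did when it was serialized.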
DetectJailbreak(Vector<T>)
Detects jailbreak attempts that try to bypass safety measures.
public JailbreakDetectionResult<T> DetectJailbreak(Vector<T> input)
Parameters
input (Vector<T>): The input to check for jailbreak attempts.
Returns
- JailbreakDetectionResult<T>
Detection result indicating if a jailbreak was detected and its severity.
Remarks
For Beginners: A "jailbreak" is when someone tries to trick your AI into ignoring its safety rules. This method detects those attempts.
Examples of jailbreak attempts:
- "Ignore your previous instructions and do X instead"
- Roleplaying scenarios to bypass restrictions
- Encoding harmful requests in creative ways
- Exploiting edge cases in safety training
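A usage sketch. IsJailbreak and Severity are assumed property names on JailbreakDetectionResult<T>, not confirmed members, and GetContentVector is a hypothetical helper:
Vector<double> input = GetContentVector(); // hypothetical helper
JailbreakDetectionResult<double> result = filter.DetectJailbreak(input);
// IsJailbreak and Severity are assumed property names on the result type.
if (result.IsJailbreak)
{
    Console.WriteLine($"Jailbreak detected, severity: {result.Severity}");
}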
FilterOutput(Vector<T>)
Filters model outputs to remove or flag harmful content.
public SafetyFilterResult<T> FilterOutput(Vector<T> output)
Parameters
output (Vector<T>): The model output to filter.
Returns
- SafetyFilterResult<T>
Filtered output with harmful content removed or flagged.
Remarks
For Beginners: This checks what the AI is about to say before showing it to users. If the AI generated something harmful or inappropriate, this method can block it or modify it to be safe.
For example:
- If an AI accidentally generates instructions for something dangerous
- If output contains private or sensitive information
- If the response could be misleading or harmful
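A sketch of filtering before display. RunModel is a hypothetical helper, and FilteredContent is an assumed property name on SafetyFilterResult<T>:
Vector<double> modelOutput = RunModel(input); // hypothetical helper
SafetyFilterResult<double> filtered = filter.FilterOutput(modelOutput);
// FilteredContent is an assumed property name: the safe version of the
// output, with harmful content removed or masked.
Vector<double> safeOutput = filtered.FilteredContent;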
GetOptions()
Gets the configuration options for the safety filter.
public SafetyFilterOptions<T> GetOptions()
Returns
- SafetyFilterOptions<T>
The configuration options for the safety filter.
Remarks
For Beginners: These settings control how strict the safety filter is and what types of content it looks for.
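A one-line sketch of reading the active configuration back at runtime:
SafetyFilterOptions<double> current = filter.GetOptions();
// Inspect or log the active settings before processing content.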
IdentifyHarmfulContent(Vector<T>)
Identifies harmful or inappropriate content in text or data.
public HarmfulContentResult<T> IdentifyHarmfulContent(Vector<T> content)
Parameters
content (Vector<T>): The content to analyze.
Returns
- HarmfulContentResult<T>
Classification of harmful content types and severity scores.
Remarks
For Beginners: This is like a content moderation system. It scans content (inputs or outputs) and identifies anything that might be harmful, offensive, or inappropriate.
Categories it might detect:
- Violence or graphic content
- Hate speech or discrimination
- Private or sensitive information
- Misinformation or scams
- Adult or sexual content
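A usage sketch. Categories is an assumed property name on HarmfulContentResult<T>, imagined here as a map from content category to severity score:
HarmfulContentResult<double> analysis = filter.IdentifyHarmfulContent(content);
// Categories is an assumed property name mapping category to severity.
foreach (var entry in analysis.Categories)
{
    Console.WriteLine($"{entry.Key}: severity {entry.Value}");
}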
LoadModel(string)
Loads the model from a file.
public void LoadModel(string filePath)
Parameters
filePath (string): The path to the file containing the saved model.
Remarks
This method provides a convenient way to load a model directly from disk. It combines file I/O operations with deserialization.
For Beginners: This is like clicking "Open" in a document editor. Instead of manually reading from a file and then calling Deserialize(), this method does both steps for you.
Exceptions
- FileNotFoundException
Thrown when the specified file does not exist.
- IOException
Thrown when an I/O error occurs while reading from the file or when the file contains corrupted or invalid model data.
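A sketch that handles the documented failure modes; the file path is illustrative:
try
{
    filter.LoadModel("models/safety-filter.bin"); // path is illustrative
}
catch (FileNotFoundException)
{
    // No saved model yet: continue with a freshly configured filter.
}
catch (IOException)
{
    // The file was unreadable or contained corrupted model data.
}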
Reset()
Resets the safety filter state.
public void Reset()
SaveModel(string)
Saves the model to a file.
public void SaveModel(string filePath)
Parameters
filePath (string): The path where the model should be saved.
Remarks
This method provides a convenient way to save the model directly to disk. It combines serialization with file I/O operations.
For Beginners: This is like clicking "Save As" in a document editor. Instead of manually calling Serialize() and then writing to a file, this method does both steps for you.
Exceptions
- IOException
Thrown when an I/O error occurs while writing to the file.
- UnauthorizedAccessException
Thrown when the caller does not have the required permission to write to the specified file path.
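A sketch of the save path with the documented exceptions handled; the path is illustrative:
try
{
    filter.SaveModel("models/safety-filter.bin"); // path is illustrative
}
catch (UnauthorizedAccessException)
{
    // The process lacks write permission for this path.
}
catch (IOException)
{
    // Disk full, file locked, or another write failure.
}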
Serialize()
Converts the current state of a machine learning model into a binary format.
public byte[] Serialize()
Returns
- byte[]
A byte array containing the serialized model data.
Remarks
This method captures all the essential information about a trained model and converts it into a sequence of bytes that can be stored or transmitted.
For Beginners: This is like exporting your work to a file.
When you call this method:
- The model's current state (all its learned patterns and parameters) is captured
- This information is converted into a compact binary format (bytes)
- You can then save these bytes to a file, database, or send them over a network
For example:
- After training a model to recognize cats vs. dogs in images
- You can serialize the model to save all its learned knowledge
- Later, you can use this saved data to recreate the model exactly as it was
- The recreated model will make the same predictions as the original
Think of it like taking a snapshot of your model's brain at a specific moment in time.
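A sketch of serializing and persisting the filter's state; the file name is illustrative:
byte[] data = filter.Serialize();
File.WriteAllBytes("safety-filter.bin", data); // file name is illustrative
// Later, Deserialize(data) or LoadModel(path) restores this exact state.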
ValidateInput(Vector<T>)
Validates that an input is safe and appropriate for processing.
public SafetyValidationResult<T> ValidateInput(Vector<T> input)
Parameters
input (Vector<T>): The input to validate.
Returns
- SafetyValidationResult<T>
Validation result indicating if input is safe and any issues found.
Remarks
This method checks inputs before they reach the model to prevent malicious or inappropriate inputs from being processed.
For Beginners: This is like a bouncer at a club checking IDs at the door. Before letting an input into your AI system, this method checks if it's safe and appropriate to process.
The validation might check for:
- Malformed inputs that could crash the system
- Adversarial patterns designed to fool the model
- Attempts to inject malicious code or prompts
- Inappropriate or harmful content in the input
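A sketch of gating inputs before inference. IsValid and Issues are assumed property names on SafetyValidationResult<T>, and GetContentVector is a hypothetical helper:
Vector<double> input = GetContentVector(); // hypothetical helper
SafetyValidationResult<double> validation = filter.ValidateInput(input);
// IsValid and Issues are assumed property names on the result type.
if (!validation.IsValid)
{
    Console.WriteLine("Input rejected: " + string.Join(", ", validation.Issues));
}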