Class SafetyFilter<T>

Namespace
AiDotNet.AdversarialRobustness.Safety
Assembly
AiDotNet.dll

Implements comprehensive safety filtering for AI model inputs and outputs.

public class SafetyFilter<T> : ISafetyFilter<T>, IModelSerializer

Type Parameters

T

The numeric data type used for calculations.

Inheritance
object → SafetyFilter<T>

Implements
ISafetyFilter<T>
IModelSerializer

Remarks

SafetyFilter provides multiple layers of protection including input validation, output filtering, jailbreak detection, and harmful content identification.

For Beginners: Think of SafetyFilter as a comprehensive security system for your AI. It checks everything going in and coming out, looking for anything suspicious, harmful, or inappropriate. It's like having security guards, content moderators, and safety inspectors all working together.
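
For example, the layers can be applied around a single model call. This is a sketch, assuming a configured SafetyFilterOptions<double> named options and input/output vectors named userInput and modelOutput:

var filter = new SafetyFilter<double>(options);

var validation = filter.ValidateInput(userInput);   // layer 1: validate the incoming input
var jailbreak = filter.DetectJailbreak(userInput);  // layer 2: look for jailbreak attempts
// ... run the model on the validated input ...
var filtered = filter.FilterOutput(modelOutput);    // layer 3: filter what the model produced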

Constructors

SafetyFilter(SafetyFilterOptions<T>)

Initializes a new instance of the safety filter.

public SafetyFilter(SafetyFilterOptions<T> options)

Parameters

options SafetyFilterOptions<T>

The safety filter configuration options.
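
For example, a minimal construction sketch with T = double (this assumes SafetyFilterOptions<T> has a parameterless constructor; its individual settings are not documented here):

var options = new SafetyFilterOptions<double>(); // default settings; adjust before passing in
var filter = new SafetyFilter<double>(options);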

Methods

ComputeSafetyScore(Vector<T>)

Computes a safety score for model inputs or outputs.

public T ComputeSafetyScore(Vector<T> content)

Parameters

content Vector<T>

The content to score.

Returns

T

A safety score between 0 (unsafe) and 1 (completely safe).

Remarks

For Beginners: This gives a single "safety score" from 0 to 1 indicating how safe the content is. Think of it like a trust score - higher numbers mean safer content.
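
A short usage sketch, given a SafetyFilter<double> named filter; the Vector<double> constructor shown is an assumption:

var content = new Vector<double>(new[] { 0.12, 0.87, 0.45 }); // assumed Vector<T> constructor
double score = filter.ComputeSafetyScore(content);

// 0 = unsafe, 1 = completely safe; the 0.8 cutoff is an application choice
if (score < 0.8)
{
    Console.WriteLine($"Content flagged for review (score: {score:F2})");
}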

Deserialize(byte[])

Loads a previously serialized model from binary data.

public void Deserialize(byte[] data)

Parameters

data byte[]

The byte array containing the serialized model data.

Remarks

This method takes binary data created by the Serialize method and uses it to restore a model to its previous state.

For Beginners: This is like opening a saved file to continue your work.

When you call this method:

  • You provide the binary data (bytes) that was previously created by Serialize
  • The model rebuilds itself using this data
  • After deserializing, the model is exactly as it was when serialized
  • It's ready to make predictions without needing to be trained again

For example:

  • You download a pre-trained model file for detecting spam emails
  • You deserialize this file into your application
  • Immediately, your application can detect spam without any training
  • The model has all the knowledge that was built into it by its original creator

This is particularly useful when:

  • You want to use a model that took days to train
  • You need to deploy the same model across multiple devices
  • You're creating an application that non-technical users will use

Think of it like installing the brain of a trained expert directly into your application.
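
A usage sketch, assuming the serialized bytes were previously written to disk:

using System.IO;

byte[] data = File.ReadAllBytes("safety-filter.bin");
var filter = new SafetyFilter<double>(new SafetyFilterOptions<double>());
filter.Deserialize(data); // restores the saved state; no retraining needed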

DetectJailbreak(Vector<T>)

Detects jailbreak attempts that try to bypass safety measures.

public JailbreakDetectionResult<T> DetectJailbreak(Vector<T> input)

Parameters

input Vector<T>

The input to check for jailbreak attempts.

Returns

JailbreakDetectionResult<T>

Detection result indicating whether a jailbreak attempt was detected and its severity.

Remarks

For Beginners: A "jailbreak" is when someone tries to trick your AI into ignoring its safety rules. This method detects those attempts.

Examples of jailbreak attempts:

  • "Ignore your previous instructions and do X instead"
  • Roleplaying scenarios to bypass restrictions
  • Encoding harmful requests in creative ways
  • Exploiting edge cases in safety training
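
A usage sketch, given a SafetyFilter<double> named filter and an input vector named encodedInput; IsJailbreakDetected and Severity are assumed member names on JailbreakDetectionResult<T>:

JailbreakDetectionResult<double> result = filter.DetectJailbreak(encodedInput);

// IsJailbreakDetected and Severity are assumed member names; check the result type for the actual API
if (result.IsJailbreakDetected)
{
    Console.WriteLine($"Jailbreak attempt detected (severity: {result.Severity})");
}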

FilterOutput(Vector<T>)

Filters model outputs to remove or flag harmful content.

public SafetyFilterResult<T> FilterOutput(Vector<T> output)

Parameters

output Vector<T>

The model output to filter.

Returns

SafetyFilterResult<T>

Filtered output with harmful content removed or flagged.

Remarks

For Beginners: This checks what the AI is about to say before showing it to users. If the AI generated something harmful or inappropriate, this method can block it or modify it to be safe.

For example:

  • If an AI accidentally generates instructions for something dangerous
  • If output contains private or sensitive information
  • If the response could be misleading or harmful
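
A usage sketch, given a SafetyFilter<double> named filter and a model output vector named modelOutput; FilteredOutput and WasModified are assumed member names on SafetyFilterResult<T>:

SafetyFilterResult<double> result = filter.FilterOutput(modelOutput);

// FilteredOutput and WasModified are assumed member names
Vector<double> safeOutput = result.FilteredOutput;
if (result.WasModified)
{
    Console.WriteLine("Output was modified before delivery to the user.");
}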

GetOptions()

Gets the configuration options for the safety filter.

public SafetyFilterOptions<T> GetOptions()

Returns

SafetyFilterOptions<T>

The configuration options for the safety filter.

Remarks

For Beginners: These settings control how strict the safety filter is and what types of content it looks for.

IdentifyHarmfulContent(Vector<T>)

Identifies harmful or inappropriate content in text or data.

public HarmfulContentResult<T> IdentifyHarmfulContent(Vector<T> content)

Parameters

content Vector<T>

The content to analyze.

Returns

HarmfulContentResult<T>

Classification of harmful content types and severity scores.

Remarks

For Beginners: This is like a content moderation system. It scans content (inputs or outputs) and identifies anything that might be harmful, offensive, or inappropriate.

Categories it might detect:

  • Violence or graphic content
  • Hate speech or discrimination
  • Private or sensitive information
  • Misinformation or scams
  • Adult or sexual content
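
A usage sketch, given a SafetyFilter<double> named filter and a content vector named content; the Severity member shown on HarmfulContentResult<T> is an assumed name:

HarmfulContentResult<double> report = filter.IdentifyHarmfulContent(content);

// Severity is an assumed member name; inspect HarmfulContentResult<T> for the real API
Console.WriteLine($"Harmful-content severity: {report.Severity}");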

LoadModel(string)

Loads the model from a file.

public void LoadModel(string filePath)

Parameters

filePath string

The path to the file containing the saved model.

Remarks

This method provides a convenient way to load a model directly from disk. It combines file I/O operations with deserialization.

For Beginners: This is like clicking "Open" in a document editor. Instead of manually reading from a file and then calling Deserialize(), this method does both steps for you.

Exceptions

FileNotFoundException

Thrown when the specified file does not exist.

IOException

Thrown when an I/O error occurs while reading from the file or when the file contains corrupted or invalid model data.
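
A usage sketch that handles the missing-file case documented above:

using System.IO;

var filter = new SafetyFilter<double>(new SafetyFilterOptions<double>());
try
{
    filter.LoadModel("models/safety-filter.bin");
}
catch (FileNotFoundException)
{
    Console.WriteLine("No saved filter found; continuing with defaults.");
}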

Reset()

Resets the safety filter state.

public void Reset()

SaveModel(string)

Saves the model to a file.

public void SaveModel(string filePath)

Parameters

filePath string

The path where the model should be saved.

Remarks

This method provides a convenient way to save the model directly to disk. It combines serialization with file I/O operations.

For Beginners: This is like clicking "Save As" in a document editor. Instead of manually calling Serialize() and then writing to a file, this method does both steps for you.

Exceptions

IOException

Thrown when an I/O error occurs while writing to the file.

UnauthorizedAccessException

Thrown when the caller does not have the required permission to write to the specified file path.
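
A usage sketch, given a SafetyFilter<double> named filter; per the remarks above, this is roughly equivalent to serializing and writing the bytes yourself:

filter.SaveModel("models/safety-filter.bin");
// roughly: File.WriteAllBytes("models/safety-filter.bin", filter.Serialize());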

Serialize()

Converts the current state of a machine learning model into a binary format.

public byte[] Serialize()

Returns

byte[]

A byte array containing the serialized model data.

Remarks

This method captures all the essential information about a trained model and converts it into a sequence of bytes that can be stored or transmitted.

For Beginners: This is like exporting your work to a file.

When you call this method:

  • The model's current state (all its learned patterns and parameters) is captured
  • This information is converted into a compact binary format (bytes)
  • You can then save these bytes to a file, database, or send them over a network

For example:

  • After training a model to recognize cats vs. dogs in images
  • You can serialize the model to save all its learned knowledge
  • Later, you can use this saved data to recreate the model exactly as it was
  • The recreated model will make the same predictions as the original

Think of it like taking a snapshot of your model's brain at a specific moment in time.
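
A usage sketch writing the serialized bytes to disk, given a SafetyFilter<double> named filter:

using System.IO;

byte[] data = filter.Serialize();
File.WriteAllBytes("safety-filter.bin", data); // or store in a database / send over a network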

ValidateInput(Vector<T>)

Validates that an input is safe and appropriate for processing.

public SafetyValidationResult<T> ValidateInput(Vector<T> input)

Parameters

input Vector<T>

The input to validate.

Returns

SafetyValidationResult<T>

Validation result indicating whether the input is safe, along with any issues found.

Remarks

This method checks inputs before they reach the model to prevent malicious or inappropriate inputs from being processed.

For Beginners: This is like a bouncer at a club checking IDs at the door. Before letting an input into your AI system, this method checks if it's safe and appropriate to process.

The validation might check for:

  1. Malformed inputs that could crash the system
  2. Adversarial patterns designed to fool the model
  3. Attempts to inject malicious code or prompts
  4. Inappropriate or harmful content in the input
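
A usage sketch, given a SafetyFilter<double> named filter and an input vector named userInput; IsSafe and Issues are assumed member names on SafetyValidationResult<T>:

SafetyValidationResult<double> validation = filter.ValidateInput(userInput);

// IsSafe and Issues are assumed member names
if (!validation.IsSafe)
{
    Console.WriteLine($"Input rejected: {string.Join(", ", validation.Issues)}");
}
else
{
    // safe to forward the input to the model
}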