Class PPOOptions<T>
Configuration options for Proximal Policy Optimization (PPO) agents.
public class PPOOptions<T>
Type Parameters
T
The numeric type used for calculations.
- Inheritance
object → PPOOptions<T>
Remarks
PPO is a state-of-the-art policy gradient algorithm that achieves a balance between sample efficiency, simplicity, and reliability. It uses a clipped surrogate objective to prevent destructively large policy updates.
For Beginners: PPO learns a policy (strategy for choosing actions) by making careful, controlled updates. It's like learning to drive - you make small adjustments to your steering rather than jerking the wheel wildly. This makes learning stable and efficient.
Key features:
- Actor-Critic: Learns both a policy (actor) and value function (critic)
- Clipped Updates: Prevents too-large changes that could break learning
- GAE: Generalized Advantage Estimation for better gradient estimates
- Multi-Epoch: Reuses collected experience multiple times
Famous use: OpenAI used PPO for RLHF (Reinforcement Learning from Human Feedback) when training the models behind ChatGPT.
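The sketch below shows one plausible way to populate these options for a small discrete-action environment (for example, a CartPole-style task with 4 state dimensions and 2 actions). The numeric values are drawn from the typical ranges documented under each property, not from the library's defaults; the learning rates are common choices from the PPO literature rather than documented values. It assumes System.Collections.Generic is in scope for List<int>.
// Hypothetical configuration using values from the typical ranges documented below.
var options = new PPOOptions<double>
{
    StateSize = 4,                 // 4-dimensional state observation
    ActionSize = 2,                // 2 discrete actions
    IsContinuous = false,
    ClipEpsilon = 0.2,             // typical range 0.1-0.3
    DiscountFactor = 0.99,         // typical range 0.95-0.99
    GaeLambda = 0.95,              // typical range 0.95-0.99
    EntropyCoefficient = 0.01,     // typical range 0.01-0.1
    ValueLossCoefficient = 0.5,    // typical range 0.5-1.0
    MaxGradNorm = 0.5,             // typical range 0.5-5.0
    StepsPerUpdate = 2048,         // typical range 128-2048
    MiniBatchSize = 64,            // divides StepsPerUpdate evenly
    TrainingEpochs = 10,           // typical range 3-10
    PolicyLearningRate = 3e-4,     // common PPO choice, not a documented default
    ValueLearningRate = 1e-3,      // common PPO choice, not a documented default
    PolicyHiddenLayers = new List<int> { 64, 64 },
    ValueHiddenLayers = new List<int> { 64, 64 },
    Seed = 42                      // optional, for reproducibility
};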
Constructors
PPOOptions()
public PPOOptions()
Properties
ActionSize
Number of possible actions (discrete) or action dimensions (continuous).
public int ActionSize { get; set; }
Property Value
- int
ClipEpsilon
PPO clipping parameter (epsilon).
public T ClipEpsilon { get; set; }
Property Value
- T
Remarks
Typical values: 0.1-0.3. Limits how much the policy can change in one update. Smaller = more conservative updates, more stable.
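For reference, ε here is the clipping radius in the standard PPO clipped surrogate objective (the general formulation from the PPO paper, not code specific to this class):
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
Because the probability ratio r_t(θ) is clipped to [1 − ε, 1 + ε], a single update cannot move the new policy far from the policy that collected the data.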
DiscountFactor
Discount factor (gamma) for future rewards.
public T DiscountFactor { get; set; }
Property Value
- T
Remarks
Typical values: 0.95-0.99.
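For background, γ discounts future rewards in the return (standard reinforcement learning definition, not library-specific):
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
A useful rule of thumb is an effective horizon of roughly 1/(1 − γ) steps: γ = 0.99 weights approximately the next 100 steps, while γ = 0.95 weights approximately the next 20.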
EntropyCoefficient
Entropy coefficient for exploration.
public T EntropyCoefficient { get; set; }
Property Value
- T
Remarks
Typical values: 0.01-0.1. Encourages exploration by penalizing deterministic policies. Higher = more exploration.
GaeLambda
GAE (Generalized Advantage Estimation) lambda parameter.
public T GaeLambda { get; set; }
Property Value
- T
Remarks
Typical values: 0.95-0.99. Controls bias-variance tradeoff in advantage estimation. Higher values = lower bias, higher variance.
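For background, the standard GAE formulation (general definition, not library-specific) weights temporal-difference residuals geometrically by γλ:
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}
λ = 0 reduces to the one-step TD residual (lowest variance, most bias), while λ = 1 recovers the Monte Carlo advantage (unbiased, highest variance).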
IsContinuous
Whether the action space is continuous (true) or discrete (false).
public bool IsContinuous { get; set; }
Property Value
- bool
MaxGradNorm
Maximum gradient norm for gradient clipping.
public double MaxGradNorm { get; set; }
Property Value
- double
Remarks
Typical values: 0.5-5.0. Prevents exploding gradients.
MiniBatchSize
Mini-batch size for training.
public int MiniBatchSize { get; set; }
Property Value
- int
Remarks
Typical values: 32-256. Should divide StepsPerUpdate evenly.
PolicyHiddenLayers
Hidden layer sizes for the policy network.
public List<int> PolicyHiddenLayers { get; set; }
Property Value
- List<int>
PolicyLearningRate
Learning rate for the policy network.
public T PolicyLearningRate { get; set; }
Property Value
- T
Seed
Random seed for reproducibility (optional).
public int? Seed { get; set; }
Property Value
- int?
StateSize
Size of the state observation space.
public int StateSize { get; set; }
Property Value
- int
StepsPerUpdate
Number of steps to collect before each training update.
public int StepsPerUpdate { get; set; }
Property Value
- int
Remarks
Typical values: 128-2048. PPO collects trajectories, then trains on them.
TrainingEpochs
Number of epochs to train on collected data.
public int TrainingEpochs { get; set; }
Property Value
- int
Remarks
Typical values: 3-10. PPO reuses collected experiences multiple times.
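As a rough illustration of how StepsPerUpdate, MiniBatchSize, and TrainingEpochs interact (the exact batching loop is internal to the agent; this is just the standard PPO bookkeeping):
// Assuming StepsPerUpdate = 2048, MiniBatchSize = 64, TrainingEpochs = 10:
int miniBatchesPerEpoch = 2048 / 64;                     // 32 mini-batches per epoch
int gradientStepsPerUpdate = 10 * miniBatchesPerEpoch;   // 320 gradient steps per collected rollout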
ValueHiddenLayers
Hidden layer sizes for the value network.
public List<int> ValueHiddenLayers { get; set; }
Property Value
- List<int>
ValueLearningRate
Learning rate for the value network.
public T ValueLearningRate { get; set; }
Property Value
- T
ValueLossCoefficient
Value function loss coefficient.
public T ValueLossCoefficient { get; set; }
Property Value
- T
Remarks
Typical values: 0.5-1.0. Weight of value loss relative to policy loss.
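In the usual combined PPO objective (general formulation, not code specific to this class), this coefficient and EntropyCoefficient weight the loss terms as:
L(\theta) = L^{\mathrm{CLIP}}(\theta) - c_v\, L^{\mathrm{VF}}(\theta) + c_e\, S[\pi_\theta](s_t)
where c_v is ValueLossCoefficient, c_e is EntropyCoefficient, L^{VF} is the value loss computed with ValueLossFunction, and S is the policy entropy bonus.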
ValueLossFunction
Loss function for the value network (typically mean squared error).
public ILossFunction<T> ValueLossFunction { get; set; }
Property Value
- ILossFunction<T>