Class PPOOptions<T>

Namespace
AiDotNet.Models.Options
Assembly
AiDotNet.dll

Configuration options for Proximal Policy Optimization (PPO) agents.

public class PPOOptions<T>

Type Parameters

T

The numeric type used for calculations.

Inheritance
PPOOptions<T>

Remarks

PPO is a state-of-the-art policy gradient algorithm that achieves a balance between sample efficiency, simplicity, and reliability. It uses a clipped surrogate objective to prevent destructively large policy updates.

For Beginners: PPO learns a policy (a strategy for choosing actions) by making careful, controlled updates. It's like learning to drive: you make small adjustments to your steering rather than jerking the wheel wildly. This makes learning stable and efficient.

Key features:

  • Actor-Critic: Learns both a policy (actor) and value function (critic)
  • Clipped Updates: Prevents too-large changes that could break learning
  • GAE: Generalized Advantage Estimation for better gradient estimates
  • Multi-Epoch: Reuses collected experience multiple times

Famous for: OpenAI used PPO for RLHF (Reinforcement Learning from Human Feedback) when training ChatGPT.
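
Examples

The sketch below shows one way these options might be populated, assuming double is substituted for T. Every value is an illustrative placeholder drawn from the typical ranges noted in the property remarks below, not a library default; the environment sizes (StateSize, ActionSize) are likewise assumptions for a small discrete-action task.

using System.Collections.Generic;
using AiDotNet.Models.Options;

// Illustrative configuration for a small discrete-action task.
// All values are placeholders chosen from the typical ranges documented below.
var options = new PPOOptions<double>
{
    StateSize = 4,                 // size of the observation vector (placeholder)
    ActionSize = 2,                // number of discrete actions (placeholder)
    IsContinuous = false,

    PolicyHiddenLayers = new List<int> { 64, 64 },
    ValueHiddenLayers = new List<int> { 64, 64 },

    PolicyLearningRate = 3e-4,
    ValueLearningRate = 1e-3,

    ClipEpsilon = 0.2,
    DiscountFactor = 0.99,
    GaeLambda = 0.95,
    EntropyCoefficient = 0.01,
    ValueLossCoefficient = 0.5,
    MaxGradNorm = 0.5,

    StepsPerUpdate = 2048,
    MiniBatchSize = 64,            // divides StepsPerUpdate evenly
    TrainingEpochs = 10,

    Seed = 42                      // optional, for reproducibility
};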

Constructors

PPOOptions()

public PPOOptions()

Properties

ActionSize

Number of possible actions (discrete) or action dimensions (continuous).

public int ActionSize { get; set; }

Property Value

int

ClipEpsilon

PPO clipping parameter (epsilon).

public T ClipEpsilon { get; set; }

Property Value

T

Remarks

Typical values: 0.1-0.3. Limits how much the policy can change in one update. Smaller = more conservative updates, more stable.
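
The sketch below illustrates what the clipping parameter does to a single policy-gradient term. The helper name and signature are illustrative only, not part of this library.

using System;

// Illustrative sketch (not this library's internal code) of the clipped
// surrogate objective for one (state, action) sample.
static double ClippedSurrogateLoss(
    double newLogProb, double oldLogProb, double advantage, double clipEpsilon)
{
    // Probability ratio between the updated policy and the policy that
    // collected the data: r = pi_new(a|s) / pi_old(a|s).
    double ratio = Math.Exp(newLogProb - oldLogProb);

    // Clamp the ratio to [1 - epsilon, 1 + epsilon] so a single update
    // cannot move the policy too far from the data-collecting policy.
    double clipped = Math.Clamp(ratio, 1.0 - clipEpsilon, 1.0 + clipEpsilon);

    // PPO maximizes the minimum of the clipped and unclipped terms;
    // expressed as a loss to minimize, the sign is flipped.
    return -Math.Min(ratio * advantage, clipped * advantage);
}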

DiscountFactor

Discount factor (gamma) for future rewards.

public T DiscountFactor { get; set; }

Property Value

T

Remarks

Typical values: 0.95-0.99. Determines how much future rewards count relative to immediate rewards; values closer to 1 make the agent more far-sighted.

EntropyCoefficient

Entropy coefficient for exploration.

public T EntropyCoefficient { get; set; }

Property Value

T

Remarks

Typical values: 0.01-0.1. Encourages exploration by penalizing deterministic policies. Higher = more exploration.

GaeLambda

GAE (Generalized Advantage Estimation) lambda parameter.

public T GaeLambda { get; set; }

Property Value

T

Remarks

Typical values: 0.95-0.99. Controls bias-variance tradeoff in advantage estimation. Higher values = lower bias, higher variance.
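
The sketch below shows how lambda enters the advantage computation. The helper is illustrative (terminal-state handling is omitted) and is not part of this library.

// Illustrative GAE computation for one trajectory segment.
// 'values' holds one more entry than 'rewards': the bootstrap value of the
// state reached after the last step. Done/terminal flags are omitted.
static double[] ComputeGae(double[] rewards, double[] values, double gamma, double lambda)
{
    var advantages = new double[rewards.Length];
    double runningGae = 0.0;

    // Work backwards: each advantage accumulates discounted TD errors,
    // with lambda controlling how far ahead the estimate looks
    // (lambda = 0 gives one-step TD, lambda = 1 gives Monte Carlo returns).
    for (int t = rewards.Length - 1; t >= 0; t--)
    {
        double tdError = rewards[t] + gamma * values[t + 1] - values[t];
        runningGae = tdError + gamma * lambda * runningGae;
        advantages[t] = runningGae;
    }

    return advantages;
}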

IsContinuous

Whether the action space is continuous (true) or discrete (false).

public bool IsContinuous { get; set; }

Property Value

bool

MaxGradNorm

Maximum gradient norm for gradient clipping.

public double MaxGradNorm { get; set; }

Property Value

double

Remarks

Typical values: 0.5-5.0. Prevents exploding gradients by rescaling gradients whose overall norm exceeds this value.
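
The sketch below illustrates global gradient-norm clipping. The flat gradient array and helper name are simplifications for illustration, not this library's internals.

using System;

// Illustrative global-norm clipping over a flattened gradient vector.
// If the combined L2 norm exceeds maxGradNorm, all gradients are scaled
// down proportionally so the norm equals maxGradNorm.
static void ClipGradientNorm(double[] gradients, double maxGradNorm)
{
    double sumSquares = 0.0;
    foreach (double g in gradients)
        sumSquares += g * g;

    double norm = Math.Sqrt(sumSquares);
    if (norm > maxGradNorm)
    {
        double scale = maxGradNorm / norm;
        for (int i = 0; i < gradients.Length; i++)
            gradients[i] *= scale;
    }
}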

MiniBatchSize

Mini-batch size for training.

public int MiniBatchSize { get; set; }

Property Value

int

Remarks

Typical values: 32-256. Should divide StepsPerUpdate evenly.

PolicyHiddenLayers

Hidden layer sizes for the policy network.

public List<int> PolicyHiddenLayers { get; set; }

Property Value

List<int>

PolicyLearningRate

Learning rate for the policy network.

public T PolicyLearningRate { get; set; }

Property Value

T

Seed

Random seed for reproducibility (optional).

public int? Seed { get; set; }

Property Value

int?

StateSize

Size of the state observation space.

public int StateSize { get; set; }

Property Value

int

StepsPerUpdate

Number of steps to collect before each training update.

public int StepsPerUpdate { get; set; }

Property Value

int

Remarks

Typical values: 128-2048. PPO collects trajectories, then trains on them.

TrainingEpochs

Number of epochs to train on collected data.

public int TrainingEpochs { get; set; }

Property Value

int

Remarks

Typical values: 3-10. PPO reuses collected experiences multiple times.
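
The arithmetic below shows how StepsPerUpdate, MiniBatchSize, and TrainingEpochs combine into the number of gradient updates performed per collection cycle. The values are the same illustrative placeholders used in the example above, not defaults.

using System;

// Illustrative arithmetic only: one collection/training cycle with the
// placeholder values from the example above.
int stepsPerUpdate = 2048;
int miniBatchSize = 64;
int trainingEpochs = 10;

int miniBatchesPerEpoch = stepsPerUpdate / miniBatchSize;            // 32
int gradientUpdatesPerCycle = trainingEpochs * miniBatchesPerEpoch;  // 320

Console.WriteLine($"{miniBatchesPerEpoch} mini-batches per epoch, " +
                  $"{gradientUpdatesPerCycle} gradient updates per cycle");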

ValueHiddenLayers

Hidden layer sizes for the value network.

public List<int> ValueHiddenLayers { get; set; }

Property Value

List<int>

ValueLearningRate

Learning rate for the value network.

public T ValueLearningRate { get; set; }

Property Value

T

ValueLossCoefficient

Value function loss coefficient.

public T ValueLossCoefficient { get; set; }

Property Value

T

Remarks

Typical values: 0.5-1.0. Weight of value loss relative to policy loss.
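
The sketch below shows the usual way the value loss and entropy bonus are folded into a single training loss. The helper is illustrative and not this library's code.

// Illustrative combination of the three PPO loss terms. ValueLossCoefficient
// weights the critic loss; EntropyCoefficient scales the exploration bonus,
// which is subtracted because higher entropy should reduce the loss.
static double CombinedLoss(
    double policyLoss, double valueLoss, double entropy,
    double valueLossCoefficient, double entropyCoefficient)
{
    return policyLoss
         + valueLossCoefficient * valueLoss
         - entropyCoefficient * entropy;
}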

ValueLossFunction

Loss function for the value network (typically mean squared error, MSE).

public ILossFunction<T> ValueLossFunction { get; set; }

Property Value

ILossFunction<T>