Class PPOOptions<T>
Configuration options for Proximal Policy Optimization (PPO) agents.
public class PPOOptions<T>
Type Parameters
T
The numeric type used for calculations.
- Inheritance
object → PPOOptions<T>
Remarks
PPO is a state-of-the-art policy gradient algorithm that achieves a balance between sample efficiency, simplicity, and reliability. It uses a clipped surrogate objective to prevent destructively large policy updates.
For Beginners: PPO learns a policy (strategy for choosing actions) by making careful, controlled updates. It's like learning to drive - you make small adjustments to your steering rather than jerking the wheel wildly. This makes learning stable and efficient.
Key features:
- Actor-Critic: Learns both a policy (actor) and value function (critic)
- Clipped Updates: Prevents too-large changes that could break learning
- GAE: Generalized Advantage Estimation for better gradient estimates
- Multi-Epoch: Reuses collected experience multiple times
Famous use: OpenAI used PPO for RLHF (Reinforcement Learning from Human Feedback) when training the models behind ChatGPT.
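The sketch below shows one plausible way to populate these options for a small discrete-action environment (for example, a CartPole-style task with 4 state dimensions and 2 actions). The numeric values are drawn from the typical ranges documented under each property, not from the library's defaults; the learning rates are common choices from the PPO literature rather than documented values. It assumes System.Collections.Generic is in scope for List<int>.
// Hypothetical configuration using values from the typical ranges documented below.
var options = new PPOOptions<double>
{
    StateSize = 4,                 // 4-dimensional state observation
    ActionSize = 2,                // 2 discrete actions
    IsContinuous = false,
    ClipEpsilon = 0.2,             // typical range 0.1-0.3
    DiscountFactor = 0.99,         // typical range 0.95-0.99
    GaeLambda = 0.95,              // typical range 0.95-0.99
    EntropyCoefficient = 0.01,     // typical range 0.01-0.1
    ValueLossCoefficient = 0.5,    // typical range 0.5-1.0
    MaxGradNorm = 0.5,             // typical range 0.5-5.0
    StepsPerUpdate = 2048,         // typical range 128-2048
    MiniBatchSize = 64,            // divides StepsPerUpdate evenly
    TrainingEpochs = 10,           // typical range 3-10
    PolicyLearningRate = 3e-4,     // common PPO choice, not a documented default
    ValueLearningRate = 1e-3,      // common PPO choice, not a documented default
    PolicyHiddenLayers = new List<int> { 64, 64 },
    ValueHiddenLayers = new List<int> { 64, 64 },
    Seed = 42                      // optional, for reproducibility
};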
Constructors
PPOOptions()
public PPOOptions()
Properties
ActionSize
Number of possible actions (discrete) or action dimensions (continuous).
public int ActionSize { get; set; }
Property Value
- int
ClipEpsilon
PPO clipping parameter (epsilon).
public T ClipEpsilon { get; set; }
Property Value
- T
Remarks
Typical values: 0.1-0.3. Limits how much the policy can change in one update. Smaller = more conservative updates, more stable.
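For reference, ε here is the clipping radius in the standard PPO clipped surrogate objective (the general formulation from the PPO paper, not code specific to this class):
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
Because the probability ratio r_t(θ) is clipped to [1 − ε, 1 + ε], a single update cannot move the new policy far from the policy that collected the data.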
DiscountFactor
Discount factor (gamma) for future rewards.
public T DiscountFactor { get; set; }
Property Value
- T
Remarks
Typical values: 0.95-0.99.
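For background, γ discounts future rewards in the return (standard reinforcement learning definition, not library-specific):
G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}
A useful rule of thumb is an effective horizon of roughly 1/(1 − γ) steps: γ = 0.99 weights approximately the next 100 steps, while γ = 0.95 weights approximately the next 20.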
EntropyCoefficient
Entropy coefficient for exploration.
public T EntropyCoefficient { get; set; }
Property Value
- T
Remarks
Typical values: 0.01-0.1. Encourages exploration by penalizing deterministic policies. Higher = more exploration.
GaeLambda
GAE (Generalized Advantage Estimation) lambda parameter.
public T GaeLambda { get; set; }
Property Value
- T
Remarks
Typical values: 0.95-0.99. Controls bias-variance tradeoff in advantage estimation. Higher values = lower bias, higher variance.
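For background, the standard GAE formulation (general definition, not library-specific) weights temporal-difference residuals geometrically by γλ:
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}
λ = 0 reduces to the one-step TD residual (lowest variance, most bias), while λ = 1 recovers the Monte Carlo advantage (unbiased, highest variance).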
IsContinuous
Whether the action space is continuous (true) or discrete (false).
public bool IsContinuous { get; set; }
Property Value
- bool
MaxGradNorm
Maximum gradient norm for gradient clipping.
public double MaxGradNorm { get; set; }
Property Value
- double
Remarks
Typical values: 0.5-5.0. Prevents exploding gradients.
MiniBatchSize
Mini-batch size for training.
public int MiniBatchSize { get; set; }
Property Value
- int
Remarks
Typical values: 32-256. Should divide StepsPerUpdate evenly.
PolicyHiddenLayers
Hidden layer sizes for the policy network.
public List<int> PolicyHiddenLayers { get; set; }
Property Value
- List<int>
PolicyLearningRate
Learning rate for the policy network.
public T PolicyLearningRate { get; set; }
Property Value
- T
Seed
Random seed for reproducibility (optional).
public int? Seed { get; set; }
Property Value
- int?
StateSize
Size of the state observation space.
public int StateSize { get; set; }
Property Value
- int
StepsPerUpdate
Number of steps to collect before each training update.
public int StepsPerUpdate { get; set; }
Property Value
- int
Remarks
Typical values: 128-2048. PPO collects trajectories, then trains on them.
TrainingEpochs
Number of epochs to train on collected data.
public int TrainingEpochs { get; set; }
Property Value
- int
Remarks
Typical values: 3-10. PPO reuses collected experiences multiple times.
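As a rough illustration of how StepsPerUpdate, MiniBatchSize, and TrainingEpochs interact (the exact batching loop is internal to the agent; this is just the standard PPO bookkeeping):
// Assuming StepsPerUpdate = 2048, MiniBatchSize = 64, TrainingEpochs = 10:
int miniBatchesPerEpoch = 2048 / 64;                     // 32 mini-batches per epoch
int gradientStepsPerUpdate = 10 * miniBatchesPerEpoch;   // 320 gradient steps per collected rollout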
ValueHiddenLayers
Hidden layer sizes for the value network.
public List<int> ValueHiddenLayers { get; set; }
Property Value
- List<int>
ValueLearningRate
Learning rate for the value network.
public T ValueLearningRate { get; set; }
Property Value
- T
ValueLossCoefficient
Value function loss coefficient.
public T ValueLossCoefficient { get; set; }
Property Value
- T
Remarks
Typical values: 0.5-1.0. Weight of value loss relative to policy loss.
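In the usual combined PPO objective (general formulation, not code specific to this class), this coefficient and EntropyCoefficient weight the loss terms as:
L(\theta) = L^{\mathrm{CLIP}}(\theta) - c_v\, L^{\mathrm{VF}}(\theta) + c_e\, S[\pi_\theta](s_t)
where c_v is ValueLossCoefficient, c_e is EntropyCoefficient, L^{VF} is the value loss computed with ValueLossFunction, and S is the policy entropy bonus.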
ValueLossFunction
Loss function for the value network (typically mean squared error).
public ILossFunction<T> ValueLossFunction { get; set; }
Property Value
- ILossFunction<T>