Class DiscretePolicyOptions<T>
- Namespace
- AiDotNet.ReinforcementLearning.Policies
- Assembly
- AiDotNet.dll
Configuration options for discrete action space policies in reinforcement learning. Discrete policies select from a finite set of actions using categorical (softmax) distributions.
public class DiscretePolicyOptions<T>
Type Parameters
T
The numeric type used for calculations (float, double, etc.).
- Inheritance
- object → DiscretePolicyOptions<T>
Remarks
Discrete policies are fundamental to reinforcement learning in environments with finite action spaces, such as game playing (left/right/jump), robot arm control with discrete positions, or trading decisions (buy/sell/hold). The policy network outputs logits (unnormalized log probabilities) for each action, which are then converted to a probability distribution via softmax. Actions are sampled from this distribution during training to enable exploration, while the most probable action is typically selected during evaluation.
This configuration class provides sensible defaults aligned with modern deep reinforcement learning best practices from libraries like Stable Baselines3 and RLlib. The default epsilon-greedy exploration strategy balances exploration (trying random actions) with exploitation (using learned policy).
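The logits-to-probabilities step and the two selection modes described above can be sketched as follows. This is an illustrative standalone example, not AiDotNet's internal implementation.
using System;
using System.Linq;

static class CategoricalPolicySketch
{
    // Convert logits to probabilities (softmax) and pick an action.
    public static int SelectAction(double[] logits, Random rng, bool training)
    {
        // Softmax with max-subtraction for numerical stability.
        double max = logits.Max();
        double[] exp = logits.Select(l => Math.Exp(l - max)).ToArray();
        double sum = exp.Sum();
        double[] probs = exp.Select(e => e / sum).ToArray();

        // Evaluation: pick the most probable action.
        if (!training)
            return Array.IndexOf(probs, probs.Max());

        // Training: sample from the distribution so the agent keeps exploring.
        double u = rng.NextDouble();
        double cumulative = 0.0;
        for (int a = 0; a < probs.Length; a++)
        {
            cumulative += probs[a];
            if (u <= cumulative)
                return a;
        }
        return probs.Length - 1;
    }
}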
For Beginners: Discrete policies are for situations where your AI agent must choose between specific, separate options rather than continuous values.
Think of it like a video game character deciding between actions:
- Move Left
- Move Right
- Jump
- Duck
The policy learns which action is best in each situation by:
- Looking at the current state (what's on screen)
- Calculating probabilities for each action (40% jump, 35% left, 20% right, 5% duck)
- Choosing an action based on these probabilities
During training, it sometimes picks random actions (exploration) to discover new strategies. During evaluation/playing, it picks the best action it has learned.
This options class lets you configure (a configuration sketch follows this list):
- How many different actions are available (ActionSize)
- How complex the neural network should be (HiddenLayers)
- How much random exploration to use (ExplorationStrategy)
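A minimal configuration sketch using only the properties documented on this page; the values shown (a 3-action trading agent with 10 state features) are illustrative:
using AiDotNet.ReinforcementLearning.Policies;

var options = new DiscretePolicyOptions<double>
{
    StateSize = 10,                      // 10 features describe the market state
    ActionSize = 3,                      // buy, sell, hold
    HiddenLayers = new[] { 128, 128 },   // the default two-layer network, stated explicitly
    Seed = 42                            // fixed seed for reproducible debugging runs
};
// ExplorationStrategy and LossFunction keep their documented defaults
// (decaying epsilon-greedy and Mean Squared Error).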
Properties
ActionSize
Gets or sets the number of discrete actions available to the agent.
public int ActionSize { get; set; }
Property Value
- int
The number of distinct actions the agent can choose from. Must be greater than 0.
Remarks
This defines the output size of the policy network and the dimensionality of the action probability distribution. Common values range from 2 (binary decisions) to hundreds (complex action spaces like language models). The network outputs logits for each action, which are converted to probabilities via softmax.
For Beginners: How many different actions can your agent choose from?
Examples:
- Trading bot: 3 actions (buy, sell, hold)
- Pac-Man: 4 actions (up, down, left, right)
- Fighting game: 12 actions (punch, kick, block, move in 4 directions, etc.)
More actions make learning harder because the agent has more to explore. Start simple with fewer actions when possible.
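A common convention (illustrative, not something the library requires) is to name the action indices so that index 0 through ActionSize - 1 means the same thing to both the policy and the environment:
// Hypothetical action mapping for a trading bot with ActionSize = 3.
public enum TradeAction
{
    Buy = 0,
    Sell = 1,
    Hold = 2
}
// The sampled action index is then interpreted as: var action = (TradeAction)actionIndex;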
ExplorationStrategy
Gets or sets the exploration strategy for balancing exploration vs exploitation during training.
public IExplorationStrategy<T> ExplorationStrategy { get; set; }
Property Value
- IExplorationStrategy<T>
The exploration strategy. Defaults to epsilon-greedy with decaying epsilon from 1.0 to 0.01.
Remarks
Exploration is critical in reinforcement learning because the agent must try different actions to discover which ones lead to high rewards. Epsilon-greedy exploration selects a random action with probability ε (epsilon) and follows the learned policy with probability 1-ε. Epsilon typically starts high (e.g., 1.0 for 100% random actions) and gradually decays (to 0.01 for 1% random) as the agent gains experience. Alternative strategies include Boltzmann (softmax) exploration and no exploration for pure exploitation.
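The default schedule (epsilon decaying from 1.0 to 0.01) can be sketched as below; the linear decay and the method shape are illustrative assumptions, not the exact IExplorationStrategy<T> contract.
using System;

static class EpsilonGreedySketch
{
    // Pick an action: random with probability epsilon, otherwise the greedy action.
    public static int Select(int greedyAction, int actionSize, int step, int decaySteps, Random rng)
    {
        // Linearly anneal epsilon from 1.0 (fully random) down to 0.01 (1% random).
        double fraction = Math.Min(1.0, step / (double)decaySteps);
        double epsilon = 1.0 - fraction * (1.0 - 0.01);

        return rng.NextDouble() < epsilon ? rng.Next(actionSize) : greedyAction;
    }
}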
For Beginners: Exploration means trying new things instead of always doing what you think is best.
The default epsilon-greedy strategy works like this:
- Start of training: 100% random actions (explore everything!)
- Middle of training: Mix of random and learned actions
- End of training: 99% learned actions, 1% random (mostly exploit what you know)
Think of learning to play a new video game:
- First hour: Press random buttons to see what they do (high exploration)
- After some practice: Mostly use moves you know work, occasionally try something new
- Expert level: Almost always use best strategies, rarely experiment
You might want different exploration if:
- Your environment is very random → Keep higher exploration longer
- Your environment is very predictable → Reduce exploration faster
- You're fine-tuning a pre-trained model → Start with low exploration
Available strategies:
- EpsilonGreedyExploration (default): Simple, effective for discrete actions
- BoltzmannExploration: Temperature-based, good for multi-armed bandits
- NoExploration: For evaluation or when using off-policy algorithms
HiddenLayers
Gets or sets the architecture of hidden layers in the policy network.
public int[] HiddenLayers { get; set; }
Property Value
- int[]
An array where each element specifies the number of neurons in that hidden layer. Defaults to [128, 128] for a two-layer network with 128 neurons each.
Remarks
The hidden layer configuration determines the network's capacity to learn complex policies. Deeper networks (more layers) can learn more complex relationships but are harder to train and slower to execute. Wider networks (more neurons per layer) increase capacity without adding depth. The default [128, 128] works well for many problems including Atari games and robotic control tasks. For simple problems (like CartPole), [64] may suffice. For complex problems (like Go or high-dimensional robotics), consider [256, 256, 256] or larger.
For Beginners: This controls how "smart" your neural network can be.
The default [128, 128] means:
- Your network has 2 hidden layers
- Each layer has 128 artificial neurons
- This creates a network like: Input → [128 neurons] → [128 neurons] → Output
Think of layers like levels of thinking:
- First layer: Recognizes basic patterns ("is enemy close?")
- Second layer: Combines patterns into strategies ("enemy close + have weapon = attack")
You might want more layers/neurons [256, 256, 256] if:
- Your problem is very complex (chess, robot navigation)
- Simple networks aren't learning well
- You have lots of training data and computing power
You might want fewer [64] or [64, 64] if:
- Your problem is simple (tic-tac-toe, balancing a pole)
- Training is too slow
- You're just experimenting
Good rule of thumb: Start with the default and adjust based on results.
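To get a feel for network size, a rough fully connected parameter count for a given architecture can be computed as below. This assumes plain dense layers with biases; the layer types AiDotNet actually builds may differ.
static class NetworkSizeSketch
{
    // Rough fully connected parameter count: weights + biases per layer.
    public static long CountParameters(int stateSize, int[] hiddenLayers, int actionSize)
    {
        long total = 0;
        int previous = stateSize;
        foreach (int width in hiddenLayers)
        {
            total += (long)previous * width + width;        // weights + biases
            previous = width;
        }
        total += (long)previous * actionSize + actionSize;  // output (logit) layer
        return total;
    }
}

// Example: CountParameters(4, new[] { 128, 128 }, 2)
//   = (4*128 + 128) + (128*128 + 128) + (128*2 + 2) = 17,410 parameters.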
LossFunction
Gets or sets the loss function used to train the policy network.
public ILossFunction<T> LossFunction { get; set; }
Property Value
- ILossFunction<T>
The loss function for computing training error. Defaults to Mean Squared Error.
Remarks
The loss function quantifies how well the policy's predictions match the target values during training. For policy gradient methods (PPO, A2C), this is typically used for value function approximation or advantage estimation. Mean Squared Error is the standard choice as it provides stable gradients and works well with continuous value predictions. Some advanced algorithms may benefit from Huber loss for robustness to outliers.
For Beginners: The loss function measures "how wrong" the policy is during learning.
The default Mean Squared Error (MSE) works by:
- Taking the difference between predicted and actual values
- Squaring it (so negatives don't cancel positives)
- Averaging across all examples
You almost never need to change this from the default. MSE is the industry standard and works well for reinforcement learning. Only consider alternatives if you're implementing advanced research algorithms or experiencing specific training instabilities.
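For reference, the Mean Squared Error computation described above is just the following; this is a standalone sketch, not the ILossFunction<T> implementation.
using System.Linq;

double[] predicted = { 0.8, 0.2, 0.5 };
double[] actual    = { 1.0, 0.0, 0.5 };

// Mean Squared Error: average of the squared differences.
double mse = predicted.Zip(actual, (p, a) => (p - a) * (p - a)).Average();
// ((-0.2)^2 + (0.2)^2 + (0.0)^2) / 3 ≈ 0.0267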
Seed
Gets or sets the random seed for reproducible training runs.
public int? Seed { get; set; }
Property Value
- int?
Optional random seed. When null, a new seed is chosen for each run, so results vary. When set to a value, ensures deterministic, reproducible behavior.
Remarks
Setting a specific seed value ensures that training runs are reproducible, which is essential for debugging, comparing algorithms, and scientific research. However, in production or when seeking diverse solutions, using null (random seed) allows for variation across runs that might discover better policies. Note that reproducibility also requires deterministic environment implementations and consistent hardware/software configurations.
For Beginners: Random seed controls whether your training is the same every time.
- Set to a number (e.g., 42): Training will be identical each time you run it
- Set to null (default): Each training run will be different
Use a fixed seed when:
- Debugging (you want to see the exact same behavior)
- Comparing algorithms (fair comparison requires same randomness)
- Publishing research (others should be able to reproduce your results)
Use null (random) when:
- Training multiple models to pick the best one
- You want variation in learned behaviors
- Running in production where diversity is valuable
Common practice: Use seed=42 during development, null in production.
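A minimal sketch of the two modes, assuming the options are constructed directly:
// Development / debugging: fixed seed, every run is identical.
var debugOptions = new DiscretePolicyOptions<double>
{
    StateSize = 4,
    ActionSize = 2,
    Seed = 42
};

// Production: Seed left null, each run varies.
var productionOptions = new DiscretePolicyOptions<double>
{
    StateSize = 4,
    ActionSize = 2,
    Seed = null
};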
StateSize
Gets or sets the size of the observation/state space.
public int StateSize { get; set; }
Property Value
- int
The number of input features that describe the environment state. Must be greater than 0.
Remarks
The state size defines the dimensionality of observations from the environment. For example, in a CartPole environment this might be 4 (cart position, cart velocity, pole angle, pole angular velocity). In an Atari game using pixel inputs, this would be the flattened image size or the number of features extracted by preprocessing.
For Beginners: This is how many numbers describe "what's happening" in your environment.
Examples:
- Simple game: 4 numbers (player X, player Y, enemy X, enemy Y)
- Chess board: 64 squares × types of pieces = hundreds of features
- Robot arm: 6 numbers (one for each joint angle)
Set this to match your environment's observation space size.
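For example, a hypothetical 4-feature game state packed into the observation vector the policy consumes; the feature names and ordering are assumptions about the environment, not something this class dictates.
// Hypothetical environment observation with StateSize = 4.
double playerX = 1.5, playerY = 0.0, enemyX = 3.2, enemyY = -1.0;

// Features must always be packed in the same order the network was trained on.
double[] observation = { playerX, playerY, enemyX, enemyY };

var options = new DiscretePolicyOptions<double>
{
    StateSize = observation.Length,  // 4
    ActionSize = 4                   // up, down, left, right
};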