Exploration vs Exploitation
Every MCTS playout faces the same tension: go deeper on the best-looking branch (exploit) or try something less visited (explore). Get this balance wrong and the search either fixates on a local optimum or wastes visits on dead ends.
This library exposes five independent mechanisms that control the explore/exploit tradeoff. Each operates at a different stage of the search. Understanding how they interact is the difference between a search that converges in 1,000 playouts and one that needs 100,000.
The bandit problem
A gambler faces k slot machines with unknown payout distributions. Each pull yields a reward drawn from the machine's distribution. The gambler wants to maximize cumulative reward over n pulls.
The formal measure of failure is regret: the difference between the reward you accumulated and the reward you would have accumulated had you always pulled the best arm. A strategy with zero regret pulls the best arm every time, which is impossible since you do not know which arm is best.
Lai and Robbins (1985) proved a lower bound: no strategy can achieve expected regret that grows slower than logarithmically in n. Any strategy that explores less will eventually be trapped on a suboptimal arm. Any strategy that explores more wastes pulls unnecessarily.
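To make the definition concrete, here is a toy regret calculation (a sketch; the arm means are assumed known here only so that regret can be measured after the fact -- the gambler never sees them):

```rust
// Expected regret for a fixed allocation of pulls across arms.
// `arm_means` are the true (hidden) payout means; `pulls_per_arm` is how
// often each arm was pulled. Illustrative helper, not library API.
fn expected_regret(arm_means: &[f64], pulls_per_arm: &[u64]) -> f64 {
    let best = arm_means.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let n: u64 = pulls_per_arm.iter().sum();
    // Reward actually collected in expectation.
    let collected: f64 = arm_means
        .iter()
        .zip(pulls_per_arm)
        .map(|(m, &k)| m * k as f64)
        .sum();
    // Regret: what always pulling the best arm would have earned, minus that.
    best * n as f64 - collected
}
```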
UCB1 and optimism
UCB1 achieves the optimal O(ln n) regret bound with a simple principle: be optimistic in the face of uncertainty.
For each arm, compute an upper confidence bound on its true mean. Select the arm whose upper bound is highest. If an arm has been pulled many times, its confidence interval is tight and the bound is close to the empirical mean. If an arm has been pulled few times, the interval is wide and the bound is high -- optimistically assuming it might be good.
Why does optimism work? A pessimistic strategy would avoid uncertain arms, potentially never discovering a better option. A realistic strategy (estimating the true mean) would also avoid uncertain arms because the estimate is noisy and likely too low with few samples. Optimism breaks the deadlock: it over-estimates uncertain arms, ensuring they get tried. Once tried, the estimate improves and the arm is either confirmed as good (kept) or revealed as bad (abandoned). The worst case is a few wasted pulls on bad arms -- logarithmically many, as Auer et al. proved.
The exploration term C * sqrt(ln(N) / n(a)) is the width of the confidence interval. It grows logarithmically with total pulls (so rarely-pulled arms eventually get tried) and shrinks with the square root of the arm's pull count (so frequently-pulled arms need strong evidence to justify continued exploration).
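A minimal UCB1 selection sketch, assuming per-arm reward sums and pull counts are tracked externally (this is the formula in code, not this library's API):

```rust
// Select the arm with the highest upper confidence bound.
// `c` is the exploration constant; unpulled arms get infinite urgency,
// mirroring the classical FPU default discussed later.
fn ucb1_select(reward_sums: &[f64], pulls: &[u64], c: f64) -> usize {
    let total: u64 = pulls.iter().sum();
    let mut best = 0;
    let mut best_score = f64::NEG_INFINITY;
    for i in 0..pulls.len() {
        let score = if pulls[i] == 0 {
            f64::INFINITY
        } else {
            let mean = reward_sums[i] / pulls[i] as f64;
            // Empirical mean plus the confidence-interval width.
            mean + c * ((total as f64).ln() / pulls[i] as f64).sqrt()
        };
        if score > best_score {
            best_score = score;
            best = i;
        }
    }
    best
}
```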
The exploration constant C
C scales the entire exploration term. It is the single most important hyperparameter in MCTS.
Low C (0.1-0.5): The search exploits aggressively. It finds the best-looking move quickly and pours visits into it. Good when the evaluation signal is reliable and the branching factor is small. Risk: the search misses a strong move that initially looks mediocre because it never gets enough visits to reveal its true value.
High C (1.5-3.0): The search explores broadly. It distributes visits more evenly across children, giving every move a fair hearing. Good when the evaluation signal is noisy or the branching factor is large. Risk: the search spreads too thin, never investing enough visits in any single move to read out a deep tactical line.
Tuning guidance: Start with C = sqrt(2) for UCT (the theoretically motivated value for rewards in [0,1]). For PUCT, C = 1.5-2.5 is typical. If the search consistently misses obvious tactics, lower C. If it gets stuck in ruts, raise it. The right value depends on the game, the evaluator, and the playout budget.
C also interacts with the reward scale. The theoretically optimal C = sqrt(2) assumes rewards normalized to [0, 1]. If your evaluator returns rewards in [0, 100], the same C produces 100x less exploration (the exploitation term dominates). Either normalize rewards or scale C proportionally. The interpret_evaluation_for_player method is where this normalization typically happens.
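A sketch of that normalization, assuming your evaluator's output range is known in advance (names are illustrative, not library API):

```rust
// Map a raw evaluation in [min, max] onto [0, 1] so the default
// C = sqrt(2) keeps its intended exploration strength.
fn normalize_reward(raw: f64, min: f64, max: f64) -> f64 {
    // Clamp guards against evaluations that stray outside the stated range.
    ((raw - min) / (max - min)).clamp(0.0, 1.0)
}
```

The equivalent alternative is to leave rewards unscaled and multiply C by the range width instead; either way, the exploitation and exploration terms must live on the same scale.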
Virtual loss
Virtual loss is a parallel search mechanism that doubles as an exploration tool.
When a thread begins descending a path, it subtracts virtual_loss from each node's reward sum. This makes the path look worse to other threads selecting simultaneously. The result: threads naturally spread across different branches instead of all diving into the same line.
After the playout completes, the virtual loss is added back and replaced with the actual reward. The temporary pessimism is corrected.
The magnitude controls diversity pressure. With virtual_loss = 0, parallel threads behave identically and provide no diversification. With a large virtual loss, threads aggressively avoid each other's paths, approximating root parallelization. Typical values are 1-10.
Set virtual loss larger than any realistic evaluation your game can produce. If your evaluator returns values in [0, 100], a virtual loss of 500 works. If your evaluator returns [-1, 1], a virtual loss of 10 works. The exact value does not matter much -- it just needs to be "obviously bad" so threads avoid piling onto the same path. Virtual loss is only meaningful during parallel search; set it to 0 for single-threaded use.
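The apply-then-revert cycle can be sketched with atomic counters (a simplified illustration using fixed-point milli-units for the reward sum, not this library's internals):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// Minimal node statistics for a lock-free virtual-loss scheme.
struct Node {
    visits: AtomicI64,
    reward_milli: AtomicI64, // reward sum, scaled by 1000
}

impl Node {
    // On descent: count the visit immediately and subtract the virtual loss,
    // making this path look worse to concurrently selecting threads.
    fn begin_descent(&self, virtual_loss_milli: i64) {
        self.visits.fetch_add(1, Ordering::Relaxed);
        self.reward_milli.fetch_sub(virtual_loss_milli, Ordering::Relaxed);
    }

    // On backpropagation: revert the virtual loss and add the real reward.
    fn finish_descent(&self, virtual_loss_milli: i64, reward_milli: i64) {
        self.reward_milli
            .fetch_add(virtual_loss_milli + reward_milli, Ordering::Relaxed);
    }
}
```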
Virtual loss introduces a bias: paths explored by many threads simultaneously appear worse than they are. For short playouts this bias corrects quickly. For deep trees with slow evaluation (e.g., neural networks), large virtual losses can cause the search to systematically undervalue the principal variation. This is one reason batch evaluation matters -- it reduces the number of concurrent descents and limits how long the bias persists.
The virtual loss also interacts with the solver. A proven-Win child (bad for the parent) already receives f64::NEG_INFINITY during selection. Virtual loss on top of this is redundant. A proven-Loss child (good for the parent) receives f64::INFINITY, which overrides any virtual loss depression. In practice, virtual loss only matters for unresolved nodes, which is where diversification is most valuable.
First Play Urgency (FPU)
FPU answers: what score does an unvisited child get during selection?
The default is f64::INFINITY: every unvisited child has infinite urgency, so all children are tried once before any child is revisited. This is the classical approach and works well for UCT, where moves start with no prior and the search must discover their quality empirically.
With PUCT and neural network priors, infinity FPU defeats the purpose of having priors. If every child must be visited once regardless, the prior probabilities cannot guide early exploration. Setting FPU to a finite value (typically 0.0 or a pessimistic estimate like -1.0) lets the search trust the prior: low-prior moves may never be visited if high-prior moves perform well enough.
The interaction between FPU and tree policy is critical:
- UCT + infinity FPU. Classical behavior. Round-robin all children, then exploit. Correct for small branching factors.
- UCT + finite FPU. Unusual but viable when branching factor is too high for round-robin. The finite value acts as a pessimistic prior -- unvisited children are assumed to be this good until proven otherwise.
- PUCT + infinity FPU. Contradictory. You have priors but force round-robin anyway, wasting visits on moves the network assigns near-zero probability. This combination usually indicates a misconfiguration.
- PUCT + finite FPU. The intended combination. Priors control which children are explored. Moves with 0.001 prior may never be visited. Set FPU to 0.0 for a neutral assumption about unvisited children, or to a negative value to be pessimistic about them.
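A sketch of a PUCT-style child score with a configurable FPU, following the standard PUCT formula shape (illustrative, not necessarily this library's exact internals):

```rust
// Score one child during selection. `fpu` is the value assumed for an
// unvisited child; set it to f64::INFINITY for classical round-robin.
fn puct_score(
    child_visits: u64,
    child_reward_sum: f64,
    prior: f64,
    parent_visits: u64,
    c: f64,
    fpu: f64,
) -> f64 {
    let q = if child_visits == 0 {
        // Finite FPU: the prior now decides whether this child is ever tried.
        fpu
    } else {
        child_reward_sum / child_visits as f64
    };
    // Prior-weighted exploration bonus, shrinking as the child is visited.
    let u = c * prior * (parent_visits as f64).sqrt() / (1.0 + child_visits as f64);
    q + u
}
```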
Dirichlet noise
In self-play training, the search generates its own training data. Without noise, the search converges to a single deterministic policy. Every game from the same position plays the same moves. The training data lacks diversity, and the network overfits to its own patterns.
Dirichlet noise injects randomness into the root node's prior probabilities:
noisy_prior(a) = (1 - epsilon) * prior(a) + epsilon * noise(a),  where noise ~ Dir(alpha)
Alpha controls the noise shape. Small alpha (0.03, used for Go) produces spiky distributions -- most noise concentrates on one or two moves, forcing the search to occasionally investigate a single unusual option deeply. Large alpha (0.3, used for chess) produces flatter distributions -- noise spreads across many moves, creating broad perturbations.
Epsilon controls how much noise matters. At epsilon = 0.25 (standard), the prior still dominates. The search mostly follows the network but occasionally deviates.
Dirichlet noise only applies at the root. Deeper in the tree, the standard prior guides selection. This is intentional: you want diversity in which root move is explored, but you want the search within each subtree to remain sharp.
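The mixing step can be sketched as follows; the Dirichlet sample itself is taken as a given input, since drawing one requires a gamma sampler (e.g. the rand_distr crate):

```rust
// Blend root priors with a pre-sampled Dirichlet noise vector.
// Both inputs are probability distributions over the same moves,
// so the output still sums to 1.
fn mix_noise(priors: &[f64], noise: &[f64], epsilon: f64) -> Vec<f64> {
    priors
        .iter()
        .zip(noise)
        .map(|(p, n)| (1.0 - epsilon) * p + epsilon * n)
        .collect()
}
```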
Temperature
Temperature operates after search is complete, controlling how the final move is selected from the root's visit distribution.
Temperature = 0 (argmax). Select the most-visited move. This is the default and produces the strongest play.
Temperature = 1 (proportional). Select moves in proportion to their visit counts. A move with 60% of visits is chosen 60% of the time. Standard during self-play data generation for the opening portion of games.
Temperature between 0 and 1. Visit counts are raised to the power 1/T before normalizing. Lower temperature sharpens the distribution toward the most-visited move; higher temperature flattens it.
Temperature > 1. Flattens the distribution beyond proportional. Rarely useful in practice but available if you want highly random play (e.g., generating maximally diverse training data).
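The cases above can be sketched as one function converting visit counts into a selection distribution (illustrative names, with T = 0 handled explicitly as argmax):

```rust
// Turn root visit counts into move-selection probabilities at temperature t.
fn visit_distribution(visits: &[u64], t: f64) -> Vec<f64> {
    if t == 0.0 {
        // Argmax: all probability mass on the most-visited move.
        let best = visits
            .iter()
            .enumerate()
            .max_by_key(|&(_, v)| *v)
            .map(|(i, _)| i)
            .unwrap();
        return (0..visits.len())
            .map(|i| if i == best { 1.0 } else { 0.0 })
            .collect();
    }
    // Raise counts to 1/T, then normalize. T = 1 is proportional;
    // T < 1 sharpens toward argmax; T > 1 flattens.
    let powered: Vec<f64> = visits.iter().map(|&v| (v as f64).powf(1.0 / t)).collect();
    let sum: f64 = powered.iter().sum();
    powered.iter().map(|p| p / sum).collect()
}
```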
Temperature and Dirichlet noise solve related but distinct problems. Noise diversifies what the search explores. Temperature diversifies which move is ultimately selected. In AlphaZero-style training, both are active: noise ensures the search considers unusual moves, and temperature ensures the selected move varies across training games. A common pattern is temperature = 1.0 for the first 30 moves of a game (diverse openings) and temperature = 0.0 for the remainder (strongest play).
Interaction with the solver
When MCTS-Solver is enabled, the exploration/exploitation tradeoff changes qualitatively for proven nodes. A child proven as a Loss (from the child's perspective, meaning a Win for the parent) receives f64::INFINITY during selection -- it is always chosen because it is a guaranteed win. A child proven as a Win (opponent wins) receives f64::NEG_INFINITY -- it is never chosen because it is a guaranteed loss.
These infinite scores override C, FPU, virtual loss, and priors. Once a node is proven, exploration of that subtree stops. The search budget is automatically redirected to unresolved parts of the tree. This is one of the solver's key benefits: it does not just prove results, it reallocates search effort to where it matters.
Score-bounded pruning has a similar but softer effect. Children whose upper bound (from the parent's perspective) falls below the best guaranteed lower bound are assigned f64::NEG_INFINITY. They are skipped, but only because their ceiling is below another child's floor -- not because the game outcome is fully resolved.
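How the proven states described above dominate an ordinary selection score can be sketched as follows (the enum and function names are illustrative, not this library's types):

```rust
// Proven outcome of a child, from the child's own perspective.
enum ProvenState {
    Unproven,
    Win,  // child wins -> parent loses
    Loss, // child loses -> parent wins
}

// Proven states override whatever UCB/PUCT score selection computed.
fn selection_score(state: &ProvenState, ucb: f64) -> f64 {
    match state {
        ProvenState::Win => f64::NEG_INFINITY, // guaranteed loss: never chosen
        ProvenState::Loss => f64::INFINITY,    // guaranteed win: always chosen
        ProvenState::Unproven => ucb,          // normal explore/exploit tradeoff
    }
}
```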
How they interact
These five mechanisms form a pipeline:
- Dirichlet noise perturbs the priors at the root before search begins.
- C controls how much the tree policy weighs exploration vs exploitation during selection.
- FPU determines the threshold for visiting new children vs revisiting known ones.
- Virtual loss diversifies parallel threads during simultaneous descent.
- Temperature adds stochasticity to the final move selection after search ends.
Each mechanism is independent. You can use any combination. But the interactions matter:
- High C with infinite FPU compounds exploration pressure: every child is forced a first visit, and the large exploration bonus keeps spreading visits afterward. The search over-explores.
- Dirichlet noise with temperature 0 gives diverse search but deterministic selection. Training data still varies because the noise varies per game.
- Large virtual loss with low C can starve the principal variation. Combined pessimism and weak exploration may prevent the search from returning to a strong move after threads vacate it.
Start with the defaults (C = sqrt(2), FPU = infinity, no virtual loss, no noise, temperature 0) for basic game play. Add the other mechanisms as you need parallelism, neural network integration, or self-play training.
A useful diagnostic: if the search's best move changes rapidly as you add more playouts, the explore/exploit balance is probably wrong. A well-tuned search should settle on the best move relatively early and spend remaining playouts confirming and deepening the analysis. If the best move keeps flipping, either C is too high (the search never commits) or the evaluation signal is too noisy (the search cannot distinguish good from bad moves with the current sample size).
Conversely, if the search immediately commits to one move and never wavers, it may be exploiting too aggressively. Check whether the second-best move has received enough visits to establish a reliable estimate. If the top move has 9,000 visits and the second move has 50, the search did not explore enough to be confident in its choice.
The root_child_stats() method provides visit counts and average rewards for all root children, which makes this analysis straightforward. In a well-balanced search, the top 3-5 moves should have meaningful visit counts, with the best move having a clear plurality but not a monopoly.
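The simplest such diagnostic, computed from per-child visit counts, can be sketched as (illustrative helper, not a library function):

```rust
// Fraction of root visits spent on the single most-visited move.
// Values near 1.0 suggest over-exploitation; values near 1/k suggest
// the search never committed. The thresholds are rules of thumb.
fn top_visit_share(visits: &[u64]) -> f64 {
    let total: u64 = visits.iter().sum();
    let top = *visits.iter().max().unwrap_or(&0);
    top as f64 / total as f64
}
```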
For more on how C affects UCT and PUCT differently, and when to choose each policy, see Tree Policies. For how virtual loss interacts with the parallel search architecture, see Lock-Free Parallel Search.
The theoretical perspective
From an information-theoretic standpoint, every playout provides a fixed amount of information about the game tree. The explore/exploit tradeoff determines how that information is allocated. Exploration acquires information about poorly-known regions. Exploitation refines information about well-known regions.
UCB1 achieves the information-theoretically optimal allocation rate. No policy can gather information more efficiently in the worst case. But PUCT often does better in practice because prior knowledge provides "free" information that does not need to be acquired through sampling. The theoretical regret bound assumes no prior knowledge; with good priors, the constant factor shrinks dramatically.
This is the core reason AlphaZero-style systems are so effective: the neural network concentrates search on the 5-10 moves that matter, and the search refines the network's estimates through targeted sampling. The exploration constant C, FPU, and virtual loss are tuning knobs that control how quickly the search trusts vs verifies the network's recommendations.