Implement positive policy decay #14
Conversation
Add positive policy decay based on Naphthalin's work in PRs LeelaChessZero#1173, LeelaChessZero#1288, and LeelaChessZero#2093.

Formula:

    scaling = FastInvSqrt(1.0f + N_child / policy_decay_scale)
    odds = 1.0f / P - 1.0f
    P_eff = 1.0f / (1.0f + odds * scaling)

Changes:
- Add FastInvSqrt function to utils/fastmath.h for fast 1/sqrt(x) calculation
- Add policy_decay_scale parameter (float, default 500.0, range 0-1000000)
- When scale=0, policy decay is disabled (P_eff = P)
- Lower scale = faster convergence to 100% policy
- Modify GetU() in both classic and dag_classic node.h to use P_eff
- Update search.cc files to pass policy_decay_scale to GetU()

Behavior:
- As N_child increases, all policies gradually approach 100%
- Low-policy moves grow faster than high-policy moves
- Preserves PUCT invariants: U decreases after visiting a move, increases after visiting a sibling
- P_eff depends only on N_child, not N_parent
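For orientation, here is a minimal sketch of that formula as a standalone function. It is an illustration only: the name EffectivePolicy is invented, and std::sqrt stands in for the FastInvSqrt the PR actually adds to utils/fastmath.h.

```cpp
#include <cmath>

// Sketch of the decay formula above (not the PR's actual code).
inline float EffectivePolicy(float p, float n_child, float policy_decay_scale) {
  if (policy_decay_scale <= 0.0f || p <= 0.0f) return p;  // scale=0 disables decay
  const float scaling =
      1.0f / std::sqrt(1.0f + n_child / policy_decay_scale);  // shrinks as N_child grows
  const float odds = 1.0f / p - 1.0f;     // odds against this move
  return 1.0f / (1.0f + odds * scaling);  // approaches 1.0 as N_child grows
}
```

For example, with policy_decay_scale = 500 and P = 0.1, P_eff is 0.10 at N_child = 0 and rises to roughly 0.18 by N_child = 1500, continuing toward 1.0 as visits accumulate.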
💡 Codex Review
lc0/src/search/classic/search.cc, lines 1733 to 1738 at 00538ee:

    int nstarted = current_nstarted[idx];
    const float util = current_util[idx];
    if (idx > cache_filled_idx) {
      current_score[idx] =
          current_pol[idx] * puct_mult / (1 + nstarted) + util;
      cache_filled_idx++;
The new policy_decay_scale option is passed to EdgeAndNode::GetU, but the core PUCT selection loop still computes current_score[idx] using current_pol[idx] * puct_mult / (1 + nstarted) without any decay. Because this loop chooses which child to expand, positive policy decay never influences actual move selection—only verbose stats/prefetch use the decayed value. As a result the default value of 500 has no gameplay effect. Consider applying the same decay formula (or calling GetU with the scale) before computing current_score so the feature is effective.
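For illustration, the requested change might look roughly like the following. This is a sketch only: DecayedPolicy is a placeholder for whatever helper the PR exposes (for example, the decay used by EdgeAndNode::GetU), and using nstarted as the child visit count is an assumption.

```cpp
// Hypothetical adjustment to the quoted loop (placeholder names).
int nstarted = current_nstarted[idx];
const float util = current_util[idx];
if (idx > cache_filled_idx) {
  // Decay the raw policy before it enters the PUCT numerator so the
  // feature actually influences which child gets expanded.
  const float p_eff =
      DecayedPolicy(current_pol[idx], nstarted, policy_decay_scale);
  current_score[idx] = p_eff * puct_mult / (1 + nstarted) + util;
  cache_filled_idx++;
}
```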
lc0/src/search/dag_classic/search.cc, lines 1831 to 1836 at 00538ee:

    int nstarted = current_nstarted[idx];
    const float util = current_util[idx];
    if (idx > cache_filled_idx) {
      current_score[idx] =
          cur_iters[idx].GetP() * puct_mult / (1 + nstarted) + util;
      cache_filled_idx++;
The DAG search loop continues to compute each edge score as cur_iters[idx].GetP() * puct_mult / (1 + nstarted) + util and never applies the new policy-decay scaling. Consequently the added policy_decay_scale parameter does not change DAG selection behaviour, despite being enabled by default. To make the feature functional, apply the decay to the policy before computing current_score or reuse EdgeAndNode::GetU with the scale parameter.
Apply positive policy decay to the actual move selection in both classic and dag_classic search implementations. The previous commit only applied decay to verbose stats and edge sorting, not the critical PUCT scoring loop that determines which moves to explore.

Changes:
- Apply the policy decay formula in PickNodesToExtendTask() for both classic and dag_classic
- Policy decay now affects the current_score[idx] calculation in the hot path
- This makes the policy_decay_scale parameter (default 500.0) functional

Without this fix, the policy decay parameter had no effect on move selection, only on display and prefetch calculations. Fixes the issue identified by the Codex review bot.
Pull Request Overview
This PR implements a positive policy decay feature in the search algorithm using a fast inverse square root approximation. The decay modulates the effective policy values based on visit counts to influence exploration-exploitation balance during MCTS.
- Adds a FastInvSqrt function based on the classic Quake III algorithm for fast 1/sqrt(x) approximation (a generic sketch appears after the file table below)
- Introduces a PolicyDecayScale parameter (default 500.0) to control the policy decay behavior
- Modifies the GetU method to apply the policy decay transformation when the scale parameter is positive
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/utils/fastmath.h | Adds FastInvSqrt function using bit manipulation for fast inverse square root approximation |
| src/search/classic/params.h | Adds PolicyDecayScale parameter declaration and getter method |
| src/search/classic/params.cc | Defines PolicyDecayScale option with default value of 500.0 and initializes the parameter |
| src/search/classic/node.h | Updates GetU method to apply policy decay using FastInvSqrt when scale > 0 |
| src/search/classic/search.cc | Passes policy_decay_scale parameter to GetU calls in multiple locations |
| src/search/dag_classic/node.h | Updates GetU method with same policy decay logic as classic variant |
| src/search/dag_classic/search.cc | Passes policy_decay_scale parameter to GetU calls in multiple locations |
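For context, the Quake III-style fast inverse square root referenced in the overview is typically implemented as below. This is a generic sketch; the FastInvSqrt actually added to utils/fastmath.h may differ in details.

```cpp
#include <cstdint>
#include <cstring>

// Approximates 1/sqrt(x) for positive, finite x (undefined for zero,
// negative values, NaN, or infinity).
inline float FastInvSqrt(float x) {
  const float xhalf = 0.5f * x;
  std::uint32_t i;
  std::memcpy(&i, &x, sizeof(i));     // reinterpret the float's bits as an integer
  i = 0x5f3759df - (i >> 1);          // magic-constant initial guess
  std::memcpy(&x, &i, sizeof(x));
  return x * (1.5f - xhalf * x * x);  // one Newton-Raphson refinement step
}
```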
Address critical issues identified in code review:

1. Add zero-check guards to prevent division by zero when p==0.0f
   - Policy values can be zero, causing division by zero in the decay formula
   - Added a p > 0.0f check before applying decay in all locations
2. Document FastInvSqrt input requirements
   - Added a comment noting it expects positive values
   - Documents undefined behavior for zero, negative, NaN, infinity
3. Extract policy decay logic to a helper function
   - Created ApplyPolicyDecay() in utils/fastmath.h
   - Reduces code duplication across 4 locations
   - Centralizes the zero-checking logic
4. Simplify GetU() implementations
   - Both classic and dag_classic node.h now use ApplyPolicyDecay()
   - Search loops also use the helper function

This prevents crashes when encountering zero-policy edges and improves code maintainability by eliminating duplication.
Add a policy_decay_exponent parameter to control the strength of policy decay, making the feature more flexible and experimentally tunable.

New formula:

    power_term = (1 + N_child/scale)^(-exponent)
    P_eff = 1 / (1 + odds * power_term)
    where odds = 1/P - 1

Changes:
- Add PolicyDecayExponent parameter (range 0.01-10.0, default 1.0)
  - exponent=1.0: linear decay (new default, replaces sqrt)
  - exponent=0.5: sqrt decay (original behavior)
  - exponent<1.0: gentler decay
  - exponent>1.0: stronger decay
- Update ApplyPolicyDecay() to accept the exponent parameter
- Optimize for exponent=1.0: use 1/(1+N/scale) directly
- Optimize for exponent=0.5: use FastInvSqrt
- General case: use std::pow
- Update all call sites to pass the exponent:
  - GetU() in both classic and dag_classic node.h
  - PUCT selection loops in both search implementations
  - Verbose stats output
  - Prefetch calculations

Benefits:
- More flexible tuning for different use cases
- Default exponent=1.0 gives simpler, faster computation
- Maintains the fast path for the sqrt case (exponent=0.5)
- Allows exploring stronger or gentler decay behaviors
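A rough sketch of the three code paths this commit describes, assuming a FastInvSqrt like the one sketched earlier; the real ApplyPolicyDecay signature and structure in utils/fastmath.h may differ.

```cpp
#include <cmath>

// Sketch only: chooses the power_term computation based on the exponent.
inline float PowerTermSketch(float n_child, float scale, float exponent) {
  const float base = 1.0f + n_child / scale;
  if (exponent == 1.0f) return 1.0f / base;        // linear decay fast path (default)
  if (exponent == 0.5f) return FastInvSqrt(base);  // sqrt decay fast path
  return std::pow(base, -exponent);                // general case
}

// Sketch only: mirrors the "New formula" above.
inline float ApplyPolicyDecaySketch(float p, float n_child, float scale,
                                    float exponent) {
  if (scale <= 0.0f || p <= 0.0f) return p;  // decay disabled, or zero policy
  const float odds = 1.0f / p - 1.0f;
  return 1.0f / (1.0f + odds * PowerTermSketch(n_child, scale, exponent));
}
```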
Integrate Martin Ankerl's widely used fast power approximation functions to replace std::pow in policy decay calculations for a significant performance improvement.

New functions:
- FastPow(a, b): basic fast power approximation
  - ~4x faster than std::pow for fractional exponents
  - Accuracy within 5%, rare cases up to 12% error
  - Uses bit manipulation similar to FastInvSqrt
- FastPrecisePow(a, b): accurate fast power approximation
  - ~3x faster than std::pow
  - Separates the integer and fractional parts of the exponent
  - Integer part handled via exponentiation by squaring
  - Significantly more accurate when exponent > 1

Changes to ApplyPolicyDecay:
- General case now uses FastPrecisePow instead of std::pow
- Provides ~3x speedup for non-optimized exponent values
- Maintains fast paths for exponent=1.0 and exponent=0.5

Source: https://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/

Performance impact:
- Policy decay calculations run ~3x faster in the general case
- Particularly beneficial when exponent != 1.0 and != 0.5
- Critical for the PUCT selection loop (hot path)
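For reference, the basic approximation from Ankerl's article generally looks like the sketch below (FastPrecisePow builds on it by handling the integer part of the exponent via squaring); the exact code added by this commit may differ.

```cpp
#include <cstdint>
#include <cstring>

// Fast approximate pow(a, b) via bit manipulation of the IEEE-754 double
// representation. Accuracy is only within a few percent, per the notes above.
inline double FastPowSketch(double a, double b) {
  std::int64_t bits;
  std::memcpy(&bits, &a, sizeof(bits));
  auto hi = static_cast<std::int32_t>(bits >> 32);  // sign/exponent half
  hi = static_cast<std::int32_t>(b * (hi - 1072632447) + 1072632447);
  bits = static_cast<std::int64_t>(hi) << 32;       // low 32 bits cleared
  double result;
  std::memcpy(&result, &bits, sizeof(result));
  return result;
}
```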
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.
Update the policy decay implementation to multiply the scale parameter by the number of legal moves in each position. This makes the decay proportional to the position's complexity, so positions with more legal moves require proportionally more visits before decay takes effect.

Changes:
- Updated ApplyPolicyDecay() to accept a num_legal_moves parameter
- Modified the decay formula: base = 1 + N/(scale * num_legal_moves)
- Updated the GetU() signature in both classic and dag_classic node.h
- Updated all call sites to pass node->GetNumEdges()
Replace the simple per-move policy decay with a sum-normalized approach to ensure policies always sum to 1.0 throughout search.

PARAMETER CHANGES:
- Rename policy_decay_scale to policy_decay_scale_per_move
- Default changed from 500.0 to 20.0
- Range changed from 0-1000000 to 0-1000
- Default exponent changed from 1.0 to 0.5 (sqrt decay)
- Parameter interpretation: "visits per available move before significant decay"

NORMALIZATION METHOD:
- Calculate the sum of raw P_eff across all edges at each node
- Normalize each P_eff by dividing by the sum
- Ensures sum(P_eff) ≈ 1.0 at N=0 (equals the original policies)
- As N→∞, P_eff → 1/num_legal (uniform distribution)

IMPLEMENTATION:
- ApplyPolicyDecay now returns the raw (unnormalized) P_eff
- Added a policy_decay_sum parameter to GetU()
- Calculate the sum once per node evaluation before PUCT selection
- Apply the normalization in GetU() and the inline PUCT loops
- Updated verbose stats and prefetch to use normalized values

FORMULA:

    effective_scale = scale_per_move * num_legal_moves
    power_term = (1 + N_child / effective_scale)^(-exponent)
    odds = 1/P - 1
    raw_P_eff = 1 / (1 + odds * power_term)
    P_eff = raw_P_eff / sum(raw_P_eff for all edges)

Files modified:
- src/utils/fastmath.h: Updated ApplyPolicyDecay documentation
- src/search/classic/params.cc: Renamed parameter and updated defaults
- src/search/classic/node.h: Added policy_decay_sum to GetU()
- src/search/dag_classic/node.h: Added policy_decay_sum to GetU()
- src/search/classic/search.cc: Calculate sum and normalize in 3 locations
- src/search/dag_classic/search.cc: Calculate sum and normalize in 2 locations
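A minimal sketch of the normalization pass described above, using the sqrt decay that is the default at this point; all names are invented for illustration, and the real lc0 code operates on edge iterators rather than vectors.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch only: compute raw decayed policies for every edge of a node, then
// normalize them so the effective policies sum to 1.0 before they feed the
// PUCT numerator.
std::vector<float> NormalizedEffectivePolicies(const std::vector<float>& p,
                                               const std::vector<float>& n_child,
                                               float scale_per_move) {
  const float effective_scale = scale_per_move * p.size();  // scale_per_move * num_legal_moves
  std::vector<float> p_eff(p.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < p.size(); ++i) {
    if (p[i] <= 0.0f) continue;  // zero-policy edges stay at zero
    const float power_term =
        1.0f / std::sqrt(1.0f + n_child[i] / effective_scale);
    const float odds = 1.0f / p[i] - 1.0f;
    p_eff[i] = 1.0f / (1.0f + odds * power_term);  // raw (unnormalized) P_eff
    sum += p_eff[i];
  }
  if (sum > 0.0f) {
    for (float& v : p_eff) v /= sum;  // sum(P_eff) == 1.0; equals raw P at N = 0
  }
  return p_eff;
}
```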
1. Changed default policy_decay_exponent from 0.5 to 1.0 (linear decay)
   - Updated the default in params.cc
   - Updated documentation to list 1.0 first as the default
   - Updated GetU() default parameter values in both node.h files
2. Display P_eff in verbose move statistics instead of raw P
   - Label remains "P" for compatibility with tools like Nibbler
   - Calculates normalized P_eff for each edge in verbose stats
   - Users can now watch P_eff evolve during search
   - Updated both classic and dag_classic search.cc
Simplifies the implementation by removing the policy_decay_exponent parameter and hardcoding sqrt decay (exponent=0.5), which always uses FastInvSqrt for optimal performance.

CHANGES:
1. Updated ApplyPolicyDecay() in fastmath.h:
   - Removed the exponent parameter
   - Fixed to use sqrt decay: power_term = FastInvSqrt(base)
   - Simplified from 3 code paths (1.0, 0.5, general) to a single path
   - Updated documentation
2. Removed the policy_decay_exponent parameter:
   - Deleted kPolicyDecayExponentId from params.cc
   - Removed the GetPolicyDecayExponent() getter from params.h
   - Removed the kPolicyDecayExponent member variable from params.h
   - Removed the option registration in Populate()
   - Removed it from constructor initialization
3. Updated the GetU() signature in both node.h files:
   - Removed the policy_decay_exponent parameter
   - Updated documentation to reflect the fixed sqrt decay
4. Updated all call sites in classic/search.cc:
   - Removed the exponent parameter from 3 locations
   - Verbose stats (line ~571)
   - PUCT selection loop (lines ~1747, ~1779)
   - Prefetch (line ~2134)
5. Updated all call sites in dag_classic/search.cc:
   - Removed the exponent parameter from 2 locations
   - Verbose stats (line ~633)
   - PUCT selection loop (lines ~1845, ~1877)

This simplification reduces code complexity and ensures consistent performance by always using the optimized FastInvSqrt path.
1. REVERT DAG_CLASSIC:
   - Reverted all policy decay changes to dag_classic/node.h
   - Reverted all policy decay changes to dag_classic/search.cc
   - Policy decay is now only implemented for classic search
2. CLEAN UP FASTMATH:
   - Removed the FastPow() function (no longer needed)
   - Removed the FastPrecisePow() function (no longer needed)
   - Only FastInvSqrt remains for the sqrt decay implementation
   - Simplified the ApplyPolicyDecay() guard condition
3. UPDATE PARAMETER:
   - Changed the range from 0-1000 to 1-1000000
   - Changed the default from 20 to 100
   - Updated documentation to reflect the new range
   - Removed "set to 0 to disable" from the description

This focuses the PR on classic search only and simplifies the code by removing the unused fast math functions.