Skip to content

Conversation

@AlexisOlson
Copy link
Owner

Add positive policy decay based on Naphthalin's work in PRs LeelaChessZero#1173, LeelaChessZero#1288, and LeelaChessZero#2093.

Formula:
scaling = FastInvSqrt(1.0f + N_child / policy_decay_scale)
odds = 1.0f / P - 1.0f
P_eff = 1.0f / (1.0f + odds * scaling)

Changes:

  • Add FastInvSqrt function to utils/fastmath.h for fast 1/sqrt(x) calculation
  • Add policy_decay_scale parameter (float, default 500.0, range 0-1000000)
    • When scale=0, policy decay is disabled (P_eff = P)
    • Lower scale = faster convergence to 100% policy
  • Modify GetU() in both classic and dag_classic node.h to use P_eff
  • Update search.cc files to pass policy_decay_scale to GetU()

Behavior:

  • As N_child increases, all policies gradually approach 100%
  • Low-policy moves grow faster than high-policy moves
  • Preserves PUCT invariants: U decreases after visiting move, increases after visiting sibling
  • P_eff depends only on N_child, not N_parent

Add positive policy decay based on Naphthalin's work in PRs LeelaChessZero#1173, LeelaChessZero#1288, and LeelaChessZero#2093.

Formula:
  scaling = FastInvSqrt(1.0f + N_child / policy_decay_scale)
  odds = 1.0f / P - 1.0f
  P_eff = 1.0f / (1.0f + odds * scaling)

Changes:
- Add FastInvSqrt function to utils/fastmath.h for fast 1/sqrt(x) calculation
- Add policy_decay_scale parameter (float, default 500.0, range 0-1000000)
  - When scale=0, policy decay is disabled (P_eff = P)
  - Lower scale = faster convergence to 100% policy
- Modify GetU() in both classic and dag_classic node.h to use P_eff
- Update search.cc files to pass policy_decay_scale to GetU()

Behavior:
- As N_child increases, all policies gradually approach 100%
- Low-policy moves grow faster than high-policy moves
- Preserves PUCT invariants: U decreases after visiting move, increases after visiting sibling
- P_eff depends only on N_child, not N_parent
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

int nstarted = current_nstarted[idx];
const float util = current_util[idx];
if (idx > cache_filled_idx) {
current_score[idx] =
current_pol[idx] * puct_mult / (1 + nstarted) + util;
cache_filled_idx++;

P1 Badge Apply policy decay to PUCT score in classic search

The new policy_decay_scale option is passed to EdgeAndNode::GetU, but the core PUCT selection loop still computes current_score[idx] using current_pol[idx] * puct_mult / (1 + nstarted) without any decay. Because this loop chooses which child to expand, positive policy decay never influences actual move selection—only verbose stats/prefetch use the decayed value. As a result the default value of 500 has no gameplay effect. Consider applying the same decay formula (or calling GetU with the scale) before computing current_score so the feature is effective.


int nstarted = current_nstarted[idx];
const float util = current_util[idx];
if (idx > cache_filled_idx) {
current_score[idx] =
cur_iters[idx].GetP() * puct_mult / (1 + nstarted) + util;
cache_filled_idx++;

P1 Badge DAG search also ignores policy decay when scoring edges

The DAG search loop continues to compute each edge score as cur_iters[idx].GetP() * puct_mult / (1 + nstarted) + util and never applies the new policy-decay scaling. Consequently the added policy_decay_scale parameter does not change DAG selection behaviour, despite being enabled by default. To make the feature functional, apply the decay to the policy before computing current_score or reuse EdgeAndNode::GetU with the scale parameter.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Apply positive policy decay to the actual move selection in both classic
and dag_classic search implementations. The previous commit only applied
decay to verbose stats and edge sorting, not the critical PUCT scoring
loop that determines which moves to explore.

Changes:
- Apply policy decay formula in PickNodesToExtendTask() for both classic and dag_classic
- Policy decay now affects current_score[idx] calculation in the hot path
- This makes the policy_decay_scale parameter (default 500.0) functional

Without this fix, the policy decay parameter had no effect on move selection,
only on display and prefetch calculations.

Fixes the issue identified by codex review bot.
@AlexisOlson AlexisOlson requested a review from Copilot November 6, 2025 05:28
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a positive policy decay feature in the search algorithm using a fast inverse square root approximation. The decay modulates the effective policy values based on visit counts to influence exploration-exploitation balance during MCTS.

  • Adds FastInvSqrt function based on the classic Quake III algorithm for fast 1/sqrt(x) approximation
  • Introduces PolicyDecayScale parameter (default 500.0) to control the policy decay behavior
  • Modifies the GetU method to apply policy decay transformation when the scale parameter is positive

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/utils/fastmath.h Adds FastInvSqrt function using bit manipulation for fast inverse square root approximation
src/search/classic/params.h Adds PolicyDecayScale parameter declaration and getter method
src/search/classic/params.cc Defines PolicyDecayScale option with default value of 500.0 and initializes the parameter
src/search/classic/node.h Updates GetU method to apply policy decay using FastInvSqrt when scale > 0
src/search/classic/search.cc Passes policy_decay_scale parameter to GetU calls in multiple locations
src/search/dag_classic/node.h Updates GetU method with same policy decay logic as classic variant
src/search/dag_classic/search.cc Passes policy_decay_scale parameter to GetU calls in multiple locations

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Address critical issues identified in code review:

1. Add zero-check guards to prevent division by zero when p==0.0f
   - Policy values can be zero, causing division by zero in the decay formula
   - Added p > 0.0f check before applying decay in all locations

2. Document FastInvSqrt input requirements
   - Added comment noting it expects positive values
   - Documents undefined behavior for zero, negative, NaN, infinity

3. Extract policy decay logic to helper function
   - Created ApplyPolicyDecay() in utils/fastmath.h
   - Reduces code duplication across 4 locations
   - Centralizes zero-checking logic

4. Simplify GetU() implementations
   - Both classic and dag_classic node.h now use ApplyPolicyDecay()
   - Search loops also use the helper function

This prevents crashes when encountering zero-policy edges and improves
code maintainability by eliminating duplication.
Add policy_decay_exponent parameter to control the strength of policy decay,
making the feature more flexible and experimentally tunable.

New formula:
  power_term = (1 + N_child/scale)^(-exponent)
  P_eff = 1 / (1 + odds * power_term)
  where odds = 1/P - 1

Changes:
- Add PolicyDecayExponent parameter (range 0.01-10.0, default 1.0)
  - exponent=1.0: linear decay (new default, replaces sqrt)
  - exponent=0.5: sqrt decay (original behavior)
  - exponent<1.0: gentler decay
  - exponent>1.0: stronger decay

- Update ApplyPolicyDecay() to accept exponent parameter
  - Optimize for exponent=1.0: use 1/(1+N/scale) directly
  - Optimize for exponent=0.5: use FastInvSqrt
  - General case: use std::pow

- Update all call sites to pass exponent:
  - GetU() in both classic and dag_classic node.h
  - PUCT selection loops in both search implementations
  - Verbose stats output
  - Prefetch calculations

Benefits:
- More flexible tuning for different use cases
- Default exponent=1.0 gives simpler, faster computation
- Maintains fast path for sqrt case (exponent=0.5)
- Allows exploring stronger or gentler decay behaviors
Integrate Martin Ankerl's widely-used fast power approximation functions
to replace std::pow in policy decay calculations for significant performance
improvement.

New functions:
- FastPow(a, b): Basic fast power approximation
  - ~4x faster than std::pow for fractional exponents
  - Accuracy within 5%, rare cases up to 12% error
  - Uses bit manipulation similar to FastInvSqrt

- FastPrecisePow(a, b): Accurate fast power approximation
  - ~3x faster than std::pow
  - Separates integer and fractional exponent parts
  - Integer part handled via exponentiation by squaring
  - Significantly more accurate when exponent > 1

Changes to ApplyPolicyDecay:
- General case now uses FastPrecisePow instead of std::pow
- Provides ~3x speedup for non-optimized exponent values
- Maintains fast paths for exponent=1.0 and exponent=0.5

Source: https://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/

Performance impact:
- Policy decay calculations run ~3x faster in general case
- Particularly beneficial when exponent != 1.0 and != 0.5
- Critical for PUCT selection loop (hot path)
@AlexisOlson AlexisOlson requested a review from Copilot November 7, 2025 04:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Update policy decay implementation to multiply the scale parameter by
the number of legal moves in each position. This makes the decay
proportional to the position complexity, so positions with more
legal moves require proportionally more visits before decay takes effect.

Changes:
- Updated ApplyPolicyDecay() to accept num_legal_moves parameter
- Modified decay formula: base = 1 + N/(scale * num_legal_moves)
- Updated GetU() signature in both classic and dag_classic node.h
- Updated all call sites to pass node->GetNumEdges()
Replace simple per-move policy decay with sum-normalized approach
to ensure policies always sum to 1.0 throughout search.

PARAMETER CHANGES:
- Rename policy_decay_scale to policy_decay_scale_per_move
- Default changed from 500.0 to 20.0
- Range changed from 0-1000000 to 0-1000
- Default exponent changed from 1.0 to 0.5 (sqrt decay)
- Parameter interpretation: "visits per available move before significant decay"

NORMALIZATION METHOD:
- Calculate sum of raw P_eff across all edges at each node
- Normalize each P_eff by dividing by the sum
- Ensures sum(P_eff) ≈ 1.0 at N=0 (equals original policies)
- As N→∞, P_eff → 1/num_legal (uniform distribution)

IMPLEMENTATION:
- ApplyPolicyDecay now returns raw (unnormalized) P_eff
- Added policy_decay_sum parameter to GetU()
- Calculate sum once per node evaluation before PUCT selection
- Apply normalization in GetU() and inline PUCT loops
- Updated verbose stats and prefetch to use normalized values

FORMULA:
effective_scale = scale_per_move * num_legal_moves
power_term = (1 + N_child / effective_scale)^(-exponent)
odds = 1/P - 1
raw_P_eff = 1 / (1 + odds * power_term)
P_eff = raw_P_eff / sum(raw_P_eff for all edges)

Files modified:
- src/utils/fastmath.h: Updated ApplyPolicyDecay documentation
- src/search/classic/params.cc: Renamed parameter and updated defaults
- src/search/classic/node.h: Added policy_decay_sum to GetU()
- src/search/dag_classic/node.h: Added policy_decay_sum to GetU()
- src/search/classic/search.cc: Calculate sum and normalize in 3 locations
- src/search/dag_classic/search.cc: Calculate sum and normalize in 2 locations
1. Changed default policy_decay_exponent from 0.5 to 1.0 (linear decay)
   - Updated default in params.cc
   - Updated documentation to list 1.0 first as default
   - Updated GetU() default parameter values in both node.h files

2. Display P_eff in verbose move statistics instead of raw P
   - Label remains "P" for compatibility with tools like Nibbler
   - Calculates normalized P_eff for each edge in verbose stats
   - Users can now watch P_eff evolve during search
   - Updated both classic and dag_classic search.cc
Simplifies implementation by removing the policy_decay_exponent parameter
and hardcoding sqrt decay (exponent=0.5), which always uses FastInvSqrt
for optimal performance.

CHANGES:

1. Updated ApplyPolicyDecay() in fastmath.h:
   - Removed exponent parameter
   - Fixed to use sqrt decay: power_term = FastInvSqrt(base)
   - Simplified from 3 code paths (1.0, 0.5, general) to single path
   - Updated documentation

2. Removed policy_decay_exponent parameter:
   - Deleted kPolicyDecayExponentId from params.cc
   - Removed GetPolicyDecayExponent() getter from params.h
   - Removed kPolicyDecayExponent member variable from params.h
   - Removed option registration in Populate()
   - Removed from constructor initialization

3. Updated GetU() signature in both node.h files:
   - Removed policy_decay_exponent parameter
   - Updated documentation to reflect fixed sqrt decay

4. Updated all call sites in classic/search.cc:
   - Removed exponent parameter from 3 locations
   - Verbose stats (line ~571)
   - PUCT selection loop (line ~1747, ~1779)
   - Prefetch (line ~2134)

5. Updated all call sites in dag_classic/search.cc:
   - Removed exponent parameter from 2 locations
   - Verbose stats (line ~633)
   - PUCT selection loop (line ~1845, ~1877)

This simplification reduces code complexity and ensures consistent
performance by always using the optimized FastInvSqrt path.
1. REVERT DAG_CLASSIC:
   - Reverted all policy decay changes to dag_classic/node.h
   - Reverted all policy decay changes to dag_classic/search.cc
   - Policy decay is now only implemented for classic search

2. CLEAN UP FASTMATH:
   - Removed FastPow() function (no longer needed)
   - Removed FastPrecisePow() function (no longer needed)
   - Only FastInvSqrt remains for sqrt decay implementation
   - Simplified ApplyPolicyDecay() guard condition

3. UPDATE PARAMETER:
   - Changed range from 0-1000 to 1-1000000
   - Changed default from 20 to 100
   - Updated documentation to reflect new range
   - Removed "set to 0 to disable" from description

This focuses the PR on classic search only and simplifies the code
by removing unused fast math functions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants