Introduce skill rating and team balancing/scrambling systems

I noticed that some of your goals center around player statistics, rankings, team scrambling and balancing, and things of that sort.  I've recently built and deployed a skill rating and team balancing system that's currently being tested on one of my servers.  I've finally gotten around to documenting the generals of it, and what follows is a description of how it works.  The system and the documentation are both still a work in progress, and since testing so far has been limited to Arena, some parts may not fully generalize to other TF2 gamemodes.  Feedback, questions, and suggestions are welcome.

---

# Overview of SynElo: yazoo.tf's Arena Rating & Balancing System
## Session Flow and Protocol
SynElo has two components: the RBS (ratings, balance, scramble) webservice and a SourceMod plugin.  They communicate with each other over HTTP.  The plugin reports round outcomes and requests optimal team splits, while the webservice owns all the math.

On player connect, the plugin sends the player's SteamID64 to the webservice which returns that player's current rating, visibility status, and rounds played.  New (and unknown) players are assigned $R_0 = 1000$ and enter a shadow period of $T_\text{shadow} = 50$ rounds before their rating is made visible.  Shadow players are included in team balancing/scrambling and their ratings are fully trusted by the solver.  The shadow period is purely a display concern; revealing a player's rating too early (and while it's wildly fluctuating) could undermine a player's confidence in the rating system.

Round 1 is always a 1v1 duel.  At round 1's end, the plugin requests an optimal split from the webservice so that the cache is populated before round 2 begins.  Round 2 is always scrambled unconditionally.  From round 3 onward, a scramble triggers when one team wins 3 rounds in a row, with optimal splits being requested at the end of every round.  

At round end, the plugin sends a rating update payload to the webservice which includes the SteamID64 of each player on the team, the winning team, the map name, and a server identifier.  The service computes new ratings for all participants and returns them.  A participant is any player who was on a team when the spawn doors went up and who did not disconnect before dealing damage, receiving damage, or healing another player.  Optimal splits are requested by the plugin during this time, computed by the webservice, and sent to the plugin for caching.

The split cache holds optimal team assignments (splits) for every team size from 2v2 up to $\lfloor N/2 \rfloor$ v $\lfloor N/2 \rfloor$ where $N$ is the eligible pool of players that includes everyone who is not on teams Unassigned or Spectator.  When a scramble fires, the plugin selects the largest split that fits the current eligible count and this includes determining which player(s) should be Waiting to Play.

On service unavailability, failed rating update payloads are queued in an 8-slot ring buffer and retried on next round end before the new update is sent.  For scramble requests, the plugin continues to use the most recently cached splits if a fresh request fails or times out.  If the cache is empty and the service is unreachable, scrambling is deferred: the "scramble pending" flag stays set and the plugin retries at the next opportunity.

The webservice API has 3 endpoints:
 - **Update:** Submitting round outcomes and receiving updated ratings
 - **Scramble:** Submitting eligible player pool and receiving optimal splits for all team sizes
 - **Admin:** Getting/setting/resetting individual ratings and querying pool quality indicators

All requests carry a shared secret in the headers.

---

## Parameters and Formulas
### Parameters
| Symbol | Value |Description |
|--------|-------|-------------|
| $R_0$ | 1000 | Starting rating |
| $R_\min$ | 100 | Absolute rating floor |
| $R_{T_\text{shadow}}$ | 2200 | Maximum rating on shadow graduation |
| $T_\text{shadow}$ | 50 | Rounds before rating is made visible |
| $T_\text{stale}$ | 30 | Days of inactivity before stale flag |
| $K_\min$ | 2 | Minimum K-factor |
| $K_\max$ | 72 | Maximum K-factor |
| $\Theta$ | 2400 | Logistic scale parameter (tune after 500 rounds) |
| $W$ | 500 | Rolling window size (rounds) |
| $w_\text{vel}$ | 0.25 | Velocity weight in convergence score |
| $w_\text{wre}$ | 0.25 | WR-excess weight in convergence score |
| $w_\text{vol}$ | 0.50 | Volatility weight in convergence score |
| $\sigma$ | 400 | Gaussian compression width |
| $N_\text{max}$ | 32 | Maximum pool size for exact solver |
### Rating Algorithm
#### F1: Win Probability
$$P_A = \frac{1}{1 + e^{-(\Sigma_A - \Sigma_B)\/\Theta}}$$

$\Sigma_A$ and $\Sigma_B$ are the sums of ratings of players on teams A and B at round start.  $\Theta$ is a per-pool parameter derived as:

$$\Theta = 400 \times \left\lfloor \text{median}\lbrace 1, 2, \ldots, k_{\text{max}} \rbrace \right\rfloor$$

where $k_\max$ is the maximum team size on any server feeding into the rating pool.  This is consistent with standard Elo ($\Theta = 400$) for $k_\max = 1$, scaling linearly with expected team size.  For a pool where the largest teams are 12v12, $\Theta = 2400$.  For multi-server pools with varying team sizes, set $k_\max$ to the median of the maximum team sizes across all servers feeding into the pool.  Note that while this gives a reasonably sound starting prior, tuning may still be necessary after 500 rounds.
#### F2: Rating Update
$$R'_i = \max\left(R_\min, R_i + K_i \left( S_i - P_i \right)\right)$$

where $S_i \in \lbrace 0, 0.5, 1 \rbrace$ and $S_i = 0$ if the player's team lost, $0.5$ in the event the round ended in stalemate, and $1$ if the player's team won; $P_i = P_A$ if on team A, else $1 - P_A$.  All updates should use Banker's rounding.
#### F3: Convergence Score
Let $\mathbf{w}$ be an ordered array of the player's ratings after each of their last $W$ rounds, from oldest (index 0) to most recent (index $-1$). Let $\mathbf{e}$ be the corresponding expected win probabilities and $\mathbf{o}$ the actual outcomes ($0$, $0.5$, or $1$). Let $m = \lfloor |\mathbf{w}| / 2 \rfloor$.

$$v_\text{vel} = \frac{\left|\overline{\mathbf{w}}_\text{new} - \overline{\mathbf{w}}_\text{old}\right|}{K_\text{max}}$$

$$v_\text{wre} = \frac{\left|\overline{\mathbf{o}} - \overline{\mathbf{e}}\right|}{0.10}$$

$$v_\text{vol} = \frac{\text{std}(\mathbf{w})}{K_\text{max}}$$

$$C = \min\left(1,\max\left(0,w_\text{vel}v_\text{vel} + w_\text{wre}v_\text{wre} + w_\text{vol}v_\text{vol}\right)\right)$$

where $\mathbf{w_\text{old}}$ and $\mathbf{w_\text{new}}$ are the older and more recent halves of $\mathbf{w}$ respectively, each of length $m$. The three signals split the work as follows:
- $\mathbf{vel}$ captures directional drift: is the rating still moving in a consistent direction?
- $\mathbf{wre}$ captures predictive miscalibration: is the model's win probability estimate still off? Normalized by 0.10; a 10% WR deviation saturates the signal.
- $\mathbf{vol}$ captures dispersion, the normalized spread of rating values within the window regardless of direction.
#### F4: Gaussian K Compression
$$G = \exp\left( -\frac{(R - R_0)^2}{2 \sigma^2} \right)$$

Applied in both directions from $R_0$. Stabilizes ratings at both extremes: elite players slow down as they climb while struggling players are protected from large downward swings near the floor.
#### F5: Dynamic K-Factor
$$K_i = K_\min + (K_\max - K_\min) \cdot C \cdot G$$
#### F6: Brier Score
$$\mathrm{Brier} = \frac{1}{N} \sum_{i=1}^{N} \left( P_i - O_i \right)^2$$

where $P_i$ is the predicted win probability and $O_i \in \lbrace 0,0.5,1\rbrace$ is the actual outcome for round $i$.  Target = 0.25 exactly.   Lower is not necessarily always better.  A score of 0.25 means every match was called as a fair coin flip, which only happens when GPE is consistently producing balanced teams and the ratings driving it are accurate.  This is the primary health signal of the ratings-balance/scramble feedback loop.
### Gosper Partition Enumeration (GPE)
#### F7: Objective Function
$$\underset{A,B:|A|=|B|=k}{\mathrm{minimize}} \left|\Sigma(A) - \Sigma(B)\right| \quad \forall k \in \left[2, \left\lfloor N/2 \right\rfloor\right]$$

Player 0 (highest-rated) is anchored to team A, exploiting the symmetry that any partition and its mirror are functionally identical. This halves the search space without discarding any optimal solution.
#### F8: Gosper's Hack (next k-subset)
Given a bitmask $x$ representing a $k$-subset:

$$c = x \mathbin{\\&} (-x), \quad r = x + c, \quad x_\text{next} = \left(\frac{r \oplus x}{c} \gg 2\right) \mathbin{|} r$$

Enumerates all $k$-subsets of an $N$-bit set in $O(1)$ per step.
#### F9: Two Layer Enumeration
Implement what was shown in F8 as the function `gosper_next(...)` and then find the best split.  The pseudocode does not deliberately approximate any specific language, though there is a loose mix of python-style syntax and C-style bitwise operators.  It even syntax highlights mostly-correctly when python is specified as the language.  Wowee!
```python
indices_of_set_bits(x):
    result = []
    for i in 0..N-1:
        if (x >> i) & 1:           // is bit i of x a 1?
            result.append(i)
    return result

gosper_next(x):
    c = x & (-x)                     // isolate the lowest set bit
    r = x + c                        // carry that bit forward
    return ((r ^ x) / c) >> 2 | r    // fix the lower bits and combine

find_best_split(ratings, N):
    best_score = INF
    best_split = null

    for k in 2 .. floor(N/2):
        outer = (1 << (2k-1)) - 1                                   // lowest (2k-1)-bit mask
        while outer < (1 << (N-1)):                                 // stay within N-1 bits; player 0 excluded
            active = indices_of_set_bits(outer)
            inner = (1 << (k-1)) - 1                                // lowest (k-1)-bit mask
            while inner < (1 << (2k-1)):                            // stay within the active subset
                B = { active[i] : bit i set in inner }              // bit i is 1? take him to B
                A = { 0 } + { active[i] : bit i clear in inner }    // bit i is 0? take him to A
                score = |Sum(ratings[A]) - Sum(ratings[B])|         // bit i is 2? take him to Detroit
                if score == 0:
                    return A, B
                if score < best_score:
                    best_score = score
                    best_split = (A, B)
                inner = gosper_next(inner)
            outer = gosper_next(outer)

    return best_split
```
#### F10: Complexity Bounds
The time-complexity of GPE is $T(N)$ such that:

$$O\left(\frac{2^{N-1}}{\sqrt{N}}\right) \leq T(N) \leq O\left(\frac{3^N}{\sqrt{N}}\right)$$

The lower bound applies when the pool is balanced and early exit fires frequently.  The upper bound is a pathological worst case.  Observed latency for $N = 24$ is under 100ms.

The exact solver is used for $N \leq N_\text{max}$.  For larger pools, a bench-aware greedy fallback heuristic would need to be used: starting from the current team assignment, all possible single-player swaps should be evaluated and the best improving swap should be taken.  The scan should restart from the beginning after each swap and should repeat until no improving swap exists.

---

## Generalization to TF2 Outside of Arena
### Structural Differences From Arena:
- No queue. Mid-round joins go straight to their engine-assigned team.
- No player substitutions outside of Arena.
- Waiting for Players period.

### Rating Participation
Any player who has dealt damage, received damage, or healed another player during a round should be rated for that round.  Win probability uses round-start team sums regardless of when a participant joined: the probability is viewed as being a property of the match, not that of any of the individuals.

### Scramble Triggers By Gamemode:
- Attack/Defend, Payload: attackers run out of time without capturing any objectives or 2 full rounds completed
- KOTH, 5CP: one team never captures the/a control point before the other team wins
- CTF: one team never captures the flag before the other team wins
- All modes: unconditional scramble when the Waiting For Players period expires.  This is directly analogous to the round 2 Arena scramble that always fires.

Streak-based scrambles do not apply outside Arena.  Some rounds can take 20–30 minutes and some modes have no time cap.

### Autobalance on Mid-Round Joins:

Only trigger when $|k(A) - k(B)| \geq 2$. When $|k(A) - k(B)| = 1$, place the odd player on whichever team minimizes $|\Sigma_A - \Sigma_B|$.

#### F11: Target Transfer Rating

When $|k(A) - k(B)| \geq 2$, compute:

$$T = \frac{|\Sigma_A - \Sigma_B|}{2}$$

Move the player on the larger team whose rating is closest to $T$ to the smaller team. Moving a player rated $r$ changes $|\Sigma_A - \Sigma_B|$ by exactly $2r$, so the player closest to $T$ minimizes the residual imbalance after the move. For $|\Delta k| \gt 2$, solve all moves jointly rather than greedily. Apply within 10 seconds of the imbalance being detected.

Adversarial behavior such as intentionally throwing is self-defeating.  A suppressed rating causes GPE to place the player against weaker teams, who they then defeat, and with stronger teammates that help them win; both of which would push the rating back up.  The system self-corrects.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce skill rating and team balancing/scrambling systems #696

Overview of SynElo: yazoo.tf's Arena Rating & Balancing System

Session Flow and Protocol

Parameters and Formulas

Parameters

Rating Algorithm

F1: Win Probability

F2: Rating Update

F3: Convergence Score

F4: Gaussian K Compression

F5: Dynamic K-Factor

F6: Brier Score

Gosper Partition Enumeration (GPE)

F7: Objective Function

F8: Gosper's Hack (next k-subset)

F9: Two Layer Enumeration

F10: Complexity Bounds

Generalization to TF2 Outside of Arena

Structural Differences From Arena:

Rating Participation

Scramble Triggers By Gamemode:

Autobalance on Mid-Round Joins:

F11: Target Transfer Rating

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Symbol	Value	Description
$R_0$	1000	Starting rating
$R_\min$	100	Absolute rating floor
$R_{T_\text{shadow}}$	2200	Maximum rating on shadow graduation
$T_\text{shadow}$	50	Rounds before rating is made visible
$T_\text{stale}$	30	Days of inactivity before stale flag
$K_\min$	2	Minimum K-factor
$K_\max$	72	Maximum K-factor
$\Theta$	2400	Logistic scale parameter (tune after 500 rounds)
$W$	500	Rolling window size (rounds)
$w_\text{vel}$	0.25	Velocity weight in convergence score
$w_\text{wre}$	0.25	WR-excess weight in convergence score
$w_\text{vol}$	0.50	Volatility weight in convergence score
$\sigma$	400	Gaussian compression width
$N_\text{max}$	32	Maximum pool size for exact solver

Introduce skill rating and team balancing/scrambling systems #696

Description

Overview of SynElo: yazoo.tf's Arena Rating & Balancing System

Session Flow and Protocol

Parameters and Formulas

Parameters

Rating Algorithm

F1: Win Probability

F2: Rating Update

F3: Convergence Score

F4: Gaussian K Compression

F5: Dynamic K-Factor

F6: Brier Score

Gosper Partition Enumeration (GPE)

F7: Objective Function

F8: Gosper's Hack (next k-subset)

F9: Two Layer Enumeration

F10: Complexity Bounds

Generalization to TF2 Outside of Arena

Structural Differences From Arena:

Rating Participation

Scramble Triggers By Gamemode:

Autobalance on Mid-Round Joins:

F11: Target Transfer Rating

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions