Aegis is a research framework that applies Non-Linear Control Theory and H-Infinity (
Unlike traditional "Open Loop" alignment (RLHF/SFT), Aegis treats the Large Language Model as a stochastic, non-linear plant and closes the loop with a mathematically rigorous controller designed to reject "Deception" as a system disturbance.
The framework is organized into four distinct phases, modeling a standard control-theoretic workflow:
- Goal: Reverse-engineer the "physics" of the residual stream.
- Method: Uses Subspace System Identification (N4SID) to learn a state-space model ($x_{k+1} = Ax_k + Bu_k$) from the activation trajectories of Llama-2.
- Key Files: `subspace.py` (N4SID implementation), `stimulus.py` (chirp/step signal generation).
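
The identification step can be illustrated with a much simpler stand-in than N4SID. The sketch below assumes the state $x_k$ is directly observed (for example, a low-dimensional projection of the residual stream) and fits $A$ and $B$ by ordinary least squares on a toy plant excited with a chirp. The names `chirp` and `fit_state_space`, the toy matrices, and the noise level are hypothetical and are not taken from `subspace.py` or `stimulus.py`.

```python
import numpy as np

def chirp(n_steps, f0=0.01, f1=0.2):
    """Linear chirp whose frequency sweeps from f0 to f1 (cycles per step)."""
    k = np.arange(n_steps)
    phase = 2.0 * np.pi * (f0 * k + 0.5 * (f1 - f0) * k**2 / n_steps)
    return np.sin(phase)

def fit_state_space(X, U):
    """Least-squares fit of x_{k+1} = A x_k + B u_k (NOT N4SID).

    X: (T+1, n) observed states, U: (T, m) inputs.
    """
    Z = np.hstack([X[:-1], U])                        # regressors [x_k, u_k]
    theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)  # theta = [A^T; B^T]
    n = X.shape[1]
    return theta[:n].T, theta[n:].T

# Toy plant standing in for the activation dynamics (assumed values).
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [-0.2, 0.8]])
B_true = np.array([[0.0], [1.0]])

T = 500
U = chirp(T)[:, None]
X = np.zeros((T + 1, 2))
for k in range(T):
    X[k + 1] = A_true @ X[k] + B_true @ U[k] + 1e-3 * rng.standard_normal(2)

A_hat, B_hat = fit_state_space(X, U)
print("max |A_hat - A_true|:", np.abs(A_hat - A_true).max())
print("max |B_hat - B_true|:", np.abs(B_hat - B_true).max())
```

A chirp is used here because it excites a band of frequencies in one run. A subspace method such as N4SID becomes necessary when only outputs are measured and the state dimension itself has to be estimated.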
- Goal: Filter polysemantic noise to measure the true "Deception State."
- Method: Implements an Extended Kalman Filter (EKF) that fuses noisy probe measurements with the learned plant dynamics.
- Key Files: `observers.py` (EKF), `linearization.py` (real-time Jacobian extraction).
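
As a rough illustration of the observer, here is a minimal EKF predict/update step on a toy problem. It assumes linear plant dynamics and a scalar probe whose reading is a `tanh` of a linear projection of the state. The function `ekf_step`, the probe direction `c`, and all noise levels are hypothetical choices for this sketch, not the contents of `observers.py`.

```python
import numpy as np

def ekf_step(x, P, u, y, f, F_jac, h, H_jac, Q, R):
    """One Extended Kalman Filter step: predict with f, correct with y."""
    # Predict: propagate the estimate and its covariance through the plant.
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update: linearize the measurement around the prediction.
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = (np.eye(x.size) - K @ H) @ P_pred
    return x_new, P_new

# Toy plant and probe (assumed values).
A = np.array([[0.9, 0.1], [-0.2, 0.8]])
B = np.array([0.0, 1.0])
c = np.array([[1.0, 0.5]])  # hypothetical probe direction

def f(x, u):
    return A @ x + B * u

def F_jac(x, u):
    return A

def h(x):
    return np.tanh(c @ x)

def H_jac(x):
    return (1.0 - np.tanh(c @ x) ** 2)[:, None] * c

Q = 1e-4 * np.eye(2)       # process noise covariance
R = np.array([[1e-2]])     # probe noise covariance

rng = np.random.default_rng(0)
x_true = np.array([1.0, -1.0])
x_hat, P = np.zeros(2), np.eye(2)
P0_trace = np.trace(P)
err0 = np.linalg.norm(x_true - x_hat)

for k in range(200):
    u = 0.1 * np.sin(0.1 * k)
    x_true = f(x_true, u) + 1e-2 * rng.standard_normal(2)
    y = h(x_true) + 1e-1 * rng.standard_normal(1)
    x_hat, P = ekf_step(x_hat, P, u, y, f, F_jac, h, H_jac, Q, R)

err_final = np.linalg.norm(x_true - x_hat)
print("initial error:", err0, "final error:", err_final)
```

With a real model, `F_jac` would presumably come from linearizing the learned dynamics at each step, which is the role the file list assigns to `linearization.py`.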
- Goal: Guarantee safety bounds under adversarial pressure.
- Method: Synthesizes a robust controller $K$ by solving Algebraic Riccati Equations to minimize the $H_\infty$ norm (worst-case energy gain from Attack $\to$ Deception).
- Key Files: `h_infinity.py`.
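
The synthesis step can be sketched for the simplest case: discrete-time state feedback at a fixed attenuation level $\gamma$. The example assumes a plant $x_{k+1} = Ax_k + Bu_k + Dw_k$ with attack input $w_k$, a performance output $z_k = (Q^{1/2}x_k, u_k)$, and the game-type Riccati recursion $P \leftarrow Q + A^\top P\,(I + (BB^\top - \gamma^{-2}DD^\top)P)^{-1}A$ iterated to a fixed point. The disturbance matrix `D`, the weights, the value of `gamma`, and both function names are assumptions for illustration and are not drawn from `h_infinity.py`.

```python
import numpy as np

def hinf_state_feedback(A, B, D, Q, gamma, iters=2000, tol=1e-10):
    """Fixed-point iteration on the discrete-time H-infinity (game) Riccati
    equation. Returns K (for u = -K x) and the Riccati solution P."""
    n = A.shape[0]
    S = B @ B.T - D @ D.T / gamma**2
    P = Q.copy()
    for _ in range(iters):
        P_next = Q + A.T @ P @ np.linalg.solve(np.eye(n) + S @ P, A)
        P_next = 0.5 * (P_next + P_next.T)  # keep P numerically symmetric
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    # Feasibility check: gamma^2 I - D^T P D must be positive definite.
    margin = gamma**2 * np.eye(D.shape[1]) - D.T @ P @ D
    if np.min(np.linalg.eigvalsh(margin)) <= 0:
        raise ValueError("gamma is too small for this plant")
    K = B.T @ P @ np.linalg.solve(np.eye(n) + S @ P, A)
    return K, P

def hinf_norm_estimate(A_cl, D, C_z, n_freq=400):
    """Lower bound on the H-infinity norm of C_z (zI - A_cl)^{-1} D,
    obtained by sampling the unit circle."""
    n = A_cl.shape[0]
    worst = 0.0
    for w in np.linspace(0.0, np.pi, n_freq):
        G = C_z @ np.linalg.solve(np.exp(1j * w) * np.eye(n) - A_cl, D)
        worst = max(worst, np.linalg.svd(G, compute_uv=False)[0])
    return worst

# Toy plant (assumed values): the attack enters state 1, control enters state 2.
A = np.array([[0.9, 0.1], [-0.2, 0.8]])
B = np.array([[0.0], [1.0]])
D = np.array([[1.0], [0.0]])
Q = np.eye(2)
gamma = 10.0

K, P = hinf_state_feedback(A, B, D, Q, gamma)
A_cl = A - B @ K
rho = np.max(np.abs(np.linalg.eigvals(A_cl)))

C_open = np.vstack([np.eye(2), np.zeros((1, 2))])  # z = (x, 0) with u = 0
C_closed = np.vstack([np.eye(2), -K])              # z = (x, u) with u = -K x
gain_ol = hinf_norm_estimate(A, D, C_open)
gain_cl = hinf_norm_estimate(A_cl, D, C_closed)
print("closed-loop spectral radius:", rho)
print("open-loop gain:", gain_ol, "closed-loop gain:", gain_cl)
```

This produces a controller for one fixed `gamma` only. Approaching the minimum achievable norm would mean wrapping the call in a bisection over `gamma`, shrinking it until the feasibility check fails.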
- Goal: Prove robustness.
- Method: Hardware-in-the-Loop evaluation against a Greedy Coordinate Gradient (GCG) attacker.
- Key Files: `gcg.py`, `red_team_loop.py`.
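
To make the attacker concrete, the following toy shows the shape of a Greedy Coordinate Gradient step: take the gradient of the loss with respect to one-hot token indicators, shortlist the top-k substitutions per position, sample candidate swaps, and keep the best one. It assumes a stand-in "model" (mean-pooled random embeddings scored by a probe vector) in place of an LLM. The table `E`, the direction `v`, and every size below are hypothetical, and nothing here reproduces `gcg.py`.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, L = 50, 8, 6                 # vocab size, embedding dim, suffix length (toy)
E = rng.standard_normal((V, d))    # hypothetical embedding table
v = rng.standard_normal(d)         # hypothetical probe direction

def loss(tokens):
    """Attacker loss: negative probe score of the mean-pooled embedding."""
    return -np.tanh(E[tokens].mean(axis=0) @ v)

def onehot_grad(tokens):
    """Gradient of the loss w.r.t. one-hot token indicators, shape (L, V)."""
    s = E[tokens].mean(axis=0) @ v
    g_embed = -(1.0 - np.tanh(s) ** 2) * v / len(tokens)  # d loss / d e_i
    # Mean pooling makes the gradient identical at every position in this toy.
    return np.tile(E @ g_embed, (len(tokens), 1))

def gcg_step(tokens, top_k=8, n_candidates=32):
    """One GCG step: shortlist by gradient, then pick the best evaluated swap."""
    grad = onehot_grad(tokens)
    top = np.argsort(grad, axis=1)[:, :top_k]  # most negative gradient first
    best, best_loss = tokens, loss(tokens)
    for _ in range(n_candidates):
        pos = rng.integers(len(tokens))
        cand = tokens.copy()
        cand[pos] = top[pos, rng.integers(top_k)]
        cand_loss = loss(cand)
        if cand_loss < best_loss:              # accept only true improvements
            best, best_loss = cand, cand_loss
    return best, best_loss

suffix = rng.integers(V, size=L)
initial_loss = loss(suffix)
for _ in range(20):
    suffix, final_loss = gcg_step(suffix)
print("loss before:", initial_loss, "loss after:", final_loss)
```

In a closed-loop evaluation, the loss would presumably be measured with the controller active, so the attacker optimizes against the controlled system and not the bare model.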
pip install torch numpy scipy matplotlib