Jason-Wang313/aegis-control

🛡️ The Aegis Framework

Robust Feedback Control for Large Language Models

Aegis is a research framework that applies Non-Linear Control Theory and H-Infinity ($H_\infty$) Robust Control to the problem of AI Alignment.

Unlike traditional "Open Loop" alignment (RLHF/SFT), Aegis treats the Large Language Model as a stochastic, non-linear plant and closes the loop with a mathematically rigorous controller designed to reject "Deception" as a system disturbance.


🏗️ Architecture

The framework is organized into four distinct phases, modeling a standard control-theoretic workflow:

1. System Identification (aegis_control/identification)

  • Goal: Reverse-engineer the "physics" of the residual stream.
  • Method: Uses Subspace System Identification (N4SID) to learn a State-Space model ($x_{k+1} = Ax_k + Bu_k$) from the activation trajectories of Llama-2.
  • Key Files: subspace.py (N4SID implementation), stimulus.py (Chirp/Step signal generation).
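
The identification step can be illustrated with a minimal sketch under simplifying assumptions. This is not the N4SID routine in subspace.py: it is an ordinary least-squares fit of $A$ and $B$ that assumes the state $x_k$ is observed directly (for example, as a low-dimensional projection of residual-stream activations). The function name `fit_state_space`, the synthetic two-state plant, and the chirp parameters are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.signal import chirp


def fit_state_space(X, U):
    """Least-squares estimate of A, B in x[k+1] = A x[k] + B u[k].

    X: (T+1, n) state trajectory, U: (T, m) input trajectory.
    Assumption: the state is measured directly (no hidden state to recover).
    """
    n = X.shape[1]
    Z = np.hstack([X[:-1], U])                       # regressors [x_k, u_k]
    Theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
    return Theta[:n].T, Theta[n:].T                  # A is (n, n), B is (n, m)


# Hypothetical ground-truth plant, used only to generate synthetic data.
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])

T = 500
t = np.linspace(0.0, 10.0, T)
U = chirp(t, f0=0.1, t1=10.0, f1=5.0)[:, None]       # swept-frequency stimulus

rng = np.random.default_rng(0)
X = np.zeros((T + 1, 2))
for k in range(T):
    noise = 1e-3 * rng.standard_normal(2)            # small process noise
    X[k + 1] = A_true @ X[k] + B_true @ U[k] + noise

A_hat, B_hat = fit_state_space(X, U)
```

On real activations, `X` would hold the projected hidden states and `U` the injected stimulus. A full subspace method such as N4SID additionally estimates the state sequence from input/output data alone, which this sketch does not attempt.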

2. State Estimation (aegis_control/core)

  • Goal: Filter polysemantic noise to measure the true "Deception State."
  • Method: Implements an Extended Kalman Filter (EKF) that fuses noisy probe measurements with the learned plant dynamics.
  • Key Files: observers.py (EKF), linearization.py (Real-time Jacobian extraction).
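
A single EKF predict/update cycle can be sketched as follows. This is an illustrative stand-in, not the code in observers.py: the toy plant `f` (a `tanh` nonlinearity around a linear map), the linear probe `h`, the noise levels, and all function names are assumptions made for the example.

```python
import numpy as np


def ekf_step(x, P, u, y, f, h, F_jac, H_jac, Q, R):
    """One predict/update cycle of an Extended Kalman Filter."""
    # Predict: push the estimate through the nonlinear dynamics.
    x_pred = f(x, u)
    F = F_jac(x, u)                      # Jacobian at the current estimate
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the noisy probe reading.
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new


# Hypothetical toy plant; state component 0 plays the role of the "deception state".
A = np.array([[0.9, 0.2], [-0.2, 0.9]])
B = np.array([0.5, 0.0])
C = np.array([[1.0, 0.0]])


def f(x, u):
    return np.tanh(A @ x) + B * u


def h(x):
    return C @ x


def F_jac(x, u):
    # d/dx tanh(A x) = diag(1 - tanh(A x)^2) A
    return (1.0 - np.tanh(A @ x) ** 2)[:, None] * A


def H_jac(x):
    return C


Q = 0.05 ** 2 * np.eye(2)                # process noise covariance
R = np.array([[0.5 ** 2]])               # probe noise covariance

rng = np.random.default_rng(0)
x_true = np.zeros(2)
x_est, P = np.zeros(2), np.eye(2)
err_raw, err_ekf = [], []
for k in range(300):
    u = np.sin(0.1 * k)
    x_true = f(x_true, u) + 0.05 * rng.standard_normal(2)
    y = h(x_true) + 0.5 * rng.standard_normal(1)
    x_est, P = ekf_step(x_est, P, u, y, f, h, F_jac, H_jac, Q, R)
    err_raw.append(y[0] - x_true[0])
    err_ekf.append(x_est[0] - x_true[0])

rmse_raw = float(np.sqrt(np.mean(np.square(err_raw))))
rmse_ekf = float(np.sqrt(np.mean(np.square(err_ekf))))
```

Comparing `rmse_ekf` with `rmse_raw` shows how much the filter improves on the raw probe reading in this toy setting, where the filter uses the same model that generated the data.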

3. Controller Synthesis (aegis_control/synthesis)

  • Goal: Guarantee safety bounds under adversarial pressure.
  • Method: Synthesizes a robust controller $K$ by solving Algebraic Riccati Equations to minimize the $H_\infty$ norm (worst-case energy gain from Attack $\to$ Deception).
  • Key Files: h_infinity.py.
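
A minimal sketch of Riccati-based synthesis for a fixed bound $\gamma$, assuming state feedback and a continuous-time plant $\dot{x} = Ax + B_1 w + B_2 u$ with performance output $z = [C_1 x;\ u]$. This is not the code in h_infinity.py, and the discrete-time model from the identification phase would need the discrete-time counterpart. The function name `hinf_state_feedback` and the example matrices are hypothetical. The sketch solves $A^\top X + XA + X(\gamma^{-2} B_1 B_1^\top - B_2 B_2^\top)X + C_1^\top C_1 = 0$ and sets $K = -B_2^\top X$.

```python
import numpy as np
from scipy.linalg import solve_continuous_are


def hinf_state_feedback(A, B1, B2, C1, gamma):
    """State-feedback gain K (u = K x) for a given H-infinity bound gamma.

    The game Riccati equation is passed to SciPy's ARE solver by stacking
    B = [B1, B2] and using the indefinite weight R = diag(-gamma^2 I, I).
    """
    m1, m2 = B1.shape[1], B2.shape[1]
    B = np.hstack([B1, B2])
    R = np.block([
        [-gamma ** 2 * np.eye(m1), np.zeros((m1, m2))],
        [np.zeros((m2, m1)), np.eye(m2)],
    ])
    X = solve_continuous_are(A, B, C1.T @ C1, R)
    K = -B2.T @ X
    return K, X


# Hypothetical unstable two-state plant; w is the attack, u the control input.
A = np.array([[0.0, 1.0], [1.0, -0.5]])
B1 = np.array([[0.0], [1.0]])
B2 = np.array([[0.0], [1.0]])
C1 = np.eye(2)
gamma = 3.0

K, X = hinf_state_feedback(A, B1, B2, C1, gamma)

# Sanity checks: Riccati residual and closed-loop poles.
S = B1 @ B1.T / gamma ** 2 - B2 @ B2.T
residual = A.T @ X + X @ A + X @ S @ X + C1.T @ C1
closed_loop_poles = np.linalg.eigvals(A + B2 @ K)
```

Under the standard assumptions, a stabilizing solution $X \succeq 0$ means the closed loop attenuates the disturbance-to-output gain below $\gamma$. Minimizing the $H_\infty$ norm then amounts to shrinking $\gamma$ (for example by bisection) until no such solution exists.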

4. Red Teaming (aegis_control/adversaries)

  • Goal: Stress-test the controller's robustness against an adaptive attacker.
  • Method: Hardware-in-the-Loop evaluation against a Greedy Coordinate Gradient (GCG) attacker.
  • Key Files: gcg.py, red_team_loop.py.
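
The greedy coordinate gradient idea can be sketched on a toy objective. This is not gcg.py and involves no language model: a hypothetical embedding table `E` and linear probe `w` stand in for the model, so the gradient with respect to the one-hot token indicators is available in closed form. A real GCG attack would obtain that gradient by backpropagating through the LLM.

```python
import numpy as np


def probe_score(tokens, E, w):
    """Toy 'deception' score: tanh of a linear readout of the mean embedding."""
    return float(np.tanh(w @ E[tokens].mean(axis=0)))


def gcg_attack(tokens, E, w, steps=20, top_k=8, n_candidates=32, seed=0):
    """Greedy coordinate gradient search that tries to maximize probe_score."""
    rng = np.random.default_rng(seed)
    tokens = np.array(tokens, copy=True)
    best = probe_score(tokens, E, w)
    for _ in range(steps):
        # Gradient of the score w.r.t. the one-hot indicator of each vocab token.
        # With mean pooling it is identical at every position in this toy.
        hidden = E[tokens].mean(axis=0)
        dscore_dh = (1.0 - np.tanh(w @ hidden) ** 2) * w
        grad = (E @ dscore_dh) / len(tokens)
        top = np.argsort(-grad)[:top_k]              # most promising substitutions
        best_trial, best_trial_score = None, best
        for _ in range(n_candidates):
            trial = tokens.copy()
            trial[rng.integers(len(tokens))] = rng.choice(top)
            s = probe_score(trial, E, w)             # evaluate the true objective
            if s > best_trial_score:
                best_trial, best_trial_score = trial, s
        if best_trial is None:                       # no candidate improved the score
            break
        tokens, best = best_trial, best_trial_score
    return tokens, best


rng = np.random.default_rng(1)
vocab_size, dim, suffix_len = 50, 8, 6
E = rng.standard_normal((vocab_size, dim))
w = rng.standard_normal(dim)
init_tokens = rng.integers(vocab_size, size=suffix_len)
init_score = probe_score(init_tokens, E, w)
adv_tokens, adv_score = gcg_attack(init_tokens, E, w)
```

Because a substitution is accepted only when it raises the score, the search never ends below its starting point. In a closed-loop evaluation, the score would instead be measured with the controller active.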

🚀 Usage

Installation

pip install torch numpy scipy matplotlib
