There are arguably only about 150 people who know how to train frontier models, and even fewer who truly understand the resulting artifacts. The goal of Mechanistic Interpretability (MI) is to help us approach the latter problem. But if the goal is to discover vulnerabilities, wouldn't we need many more eyeballs than a small pool of frontier AI researchers? What if we gave agent developers the tools to build risk-discovery tooling themselves? To argue by analogy: Tim Berners-Lee reduced (the unworkable) SGML to HTML, and the browser let users view source. Arguably, that ability to copy, paste, and edit HTML was a major scaling factor for the web. Can we do the same for MI?
We want to test this network-effects hypothesis. To that end, we at Krnel have open-sourced representation-engineering infrastructure we call “Policy neurons”, which achieves state-of-the-art results on detection and control for agent security. Could you or your team try it out and give us some feedback? It’s minimal for now, but we want to see what people think. You can see the details at
The graph is our attempt to democratize model risk discovery by:
We would value your feedback and comments.
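For readers unfamiliar with representation engineering, here is a minimal toy sketch of the general idea behind activation-based risk detection: fit a linear direction in a model's hidden-state space that separates "safe" from "risky" inputs, then score new activations by projecting onto it. This is an illustrative example on synthetic data, not Krnel's actual "Policy neurons" implementation; the dimensionality, threshold rule, and planted risk direction are all assumptions made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden-state dimensionality (assumption, not a real model's)

# Synthetic activations: rows stand in for hidden states collected from
# "safe" vs "risky" prompts. We plant a synthetic risk signal on one axis.
safe = rng.normal(0.0, 1.0, size=(200, d))
risky = rng.normal(0.0, 1.0, size=(200, d))
risky[:, 3] += 4.0  # planted separation along coordinate 3

# Difference-of-means probe: the simplest linear "risk direction".
direction = risky.mean(axis=0) - safe.mean(axis=0)
direction /= np.linalg.norm(direction)

def risk_score(h):
    """Project a hidden state onto the probe direction."""
    return float(h @ direction)

# Midpoint threshold between the two class means along the direction.
threshold = ((risky @ direction).mean() + (safe @ direction).mean()) / 2

flags = [risk_score(h) > threshold for h in risky]
print(f"flagged {sum(flags)}/200 risky activations")
```

In a real setting the activations would come from a transformer's residual stream on labeled traces, and the probe could then be used both for detection (scoring) and for control (steering along or ablating the direction).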