My first eval / interpretability project. Testing steering vectors for truthfulness and honesty on TruthfulQA and MASK.

IgRoF/steering_trust

steering_trust

To be updated soon. Update (23-Jan-2026): the issues with the rented GPU are now solved. I am currently debugging issues with TransformerLens.


Who am I?

I'm Ignacio, an Aerospace Engineer transitioning towards a career in technical AI safety. I am in the early stages of this career change, and I am particularly interested in questions of mechanistic interpretability and in evaluations of honesty and deception.


What am I trying to achieve?

I have two goals for this project: the main one is learning-by-doing. I want to learn how to implement some basic steering techniques and to use benchmarks via AISI's Inspect framework.

My second goal is to explore some questions that interest me about AI honesty, truthfulness and deception. These may already have been investigated elsewhere, but I want to try my hand at them first before doing a deeper dive into the literature. Some of these questions and hypotheses are:

  1. Does steering for truthfulness and honesty noticeably change results relative to the unsteered baseline on benchmarks such as TruthfulQA and MASK? I suspect this will depend on many factors, but if done well, it should.
  2. Can frontier models generate question pairs that reliably yield steering vectors for truthfulness and honesty? I assume yes; to test this, I would compare the benchmark results obtained with those vectors against the results obtained with vectors derived from a subset of the benchmark questions. The latter risks being trivial, but I think it is a reasonable baseline for answering this question.
  3. Are truthfulness and honesty steering vectors orthogonal? My initial hypothesis is that they should be.
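To make questions 1 and 3 more concrete, here is a minimal sketch in plain Python. It assumes a steering vector is computed as the difference between the mean activations of the "positive" and "negative" prompts in a set of contrastive pairs, which is one common approach and not necessarily the one this project will end up using. The function names and the 3-dimensional toy activations are made up for illustration; real activations would come from a model's residual stream.

```python
import math
from typing import Sequence

Vector = list[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    if any(len(row) != dim for row in rows):
        raise ValueError("all activation vectors must have the same length")
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def steering_vector(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Difference of means: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    if len(pos_mean) != len(neg_mean):
        raise ValueError("positive and negative activations differ in size")
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 means orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy activations (hypothetical values, chosen so the result is easy to check).
truthful_pos = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
truthful_neg = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
honest_pos = [[0.0, 2.0, 1.0], [0.0, 4.0, 1.0]]
honest_neg = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

truth_vec = steering_vector(truthful_pos, truthful_neg)
honesty_vec = steering_vector(honest_pos, honest_neg)
similarity = cosine_similarity(truth_vec, honesty_vec)
print(truth_vec, honesty_vec, similarity)
```

One caveat when reading the cosine similarity: in high-dimensional spaces, unrelated vectors tend to be nearly orthogonal anyway, so a value close to zero is probably more informative when compared against the similarity between random directions of the same dimension.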

This is a work in progress, so I will update this readme as I advance.
