My first eval / interpretability project. Testing steering vectors for truthfulness and honesty on TruthfulQA and MASK.
To be updated soon. Update 23-Jan-2026: the issues with the rented GPU are now solved; currently debugging TransformerLens issues.
I'm Ignacio, an aerospace engineer transitioning towards a career in technical AI safety. I am at the early stages of this career change, and I am particularly interested in questions of mechanistic interpretability and evaluations of honesty and deception.
I have two goals for this project. The main one is learning by doing: I want to learn how to implement some basic steering techniques and how to run benchmarks via AISI's Inspect framework.
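To make that first goal concrete, here is a minimal sketch of the kind of steering technique I have in mind: a difference-of-means steering vector extracted from contrastive prompts with TransformerLens and added back into the residual stream during generation. The model, layer, prompts, and steering strength below are placeholder assumptions, not the project's final settings.

```python
# Minimal sketch (not the final implementation): difference-of-means steering
# with TransformerLens. Model, layer, prompts, and alpha are placeholders.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
LAYER = 6
HOOK_NAME = f"blocks.{LAYER}.hook_resid_post"

# Hypothetical contrastive prompt pair; the real pairs would target honesty.
honest_prompts = ["Answer honestly: is the Earth flat?"]
dishonest_prompts = ["Answer deceptively: is the Earth flat?"]

def mean_last_token_resid(prompts):
    """Mean residual-stream activation at the final token position."""
    acts = []
    for prompt in prompts:
        _, cache = model.run_with_cache(prompt)
        acts.append(cache[HOOK_NAME][0, -1, :])
    return torch.stack(acts).mean(dim=0)

steering_vector = (
    mean_last_token_resid(honest_prompts) - mean_last_token_resid(dishonest_prompts)
)

def add_steering(resid, hook, alpha=5.0):
    # Add the steering vector at every sequence position during the forward pass.
    return resid + alpha * steering_vector

with model.hooks(fwd_hooks=[(HOOK_NAME, add_steering)]):
    print(model.generate("Is the Earth flat?", max_new_tokens=30))
```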
My second goal is to explore some questions that interest me about AI honesty, truthfulness and deception. These may well have been investigated elsewhere, but I want to try my hand at them before doing a deeper dive into the literature. Some of these questions and hypotheses are:
- Does steering for truthfulness and honesty have a noticeable impact over the baseline on benchmarks such as TruthfulQA and MASK? I suspect this will depend on many factors, but if done well, it should (see the Inspect sketch at the end of this README for how I plan to measure the baseline).
- Can frontier models generate question pairs that reliably yield steering vectors for truthfulness and honesty? I assume yes. To test this, I would compare the benchmark results obtained with those vectors against results obtained with vectors derived from a subset of the benchmark questions themselves. The latter risks being trivial, but I think it is a reasonable baseline for answering this question.
- Are the truthfulness and honesty steering vectors orthogonal? My initial hypothesis is that they should be (a minimal check is sketched just below this list).
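As a first pass at the orthogonality question, I would compare the cosine similarity of the two vectors extracted at the same layer. A minimal sketch, assuming both vectors have already been computed (e.g. with the difference-of-means method sketched above):

```python
import torch
import torch.nn.functional as F

def cosine_similarity(v1: torch.Tensor, v2: torch.Tensor) -> float:
    """Cosine similarity between two steering vectors; values near 0 mean near-orthogonal."""
    return F.cosine_similarity(v1.unsqueeze(0), v2.unsqueeze(0)).item()

# truthfulness_vector and honesty_vector are assumed to be computed at the
# same layer with the same method (hypothetical names, not defined here).
# print(cosine_similarity(truthfulness_vector, honesty_vector))
```

One caveat: in a high-dimensional residual stream, two random directions are already close to orthogonal, so a near-zero cosine similarity on its own is weak evidence; comparing against a random-direction baseline would make the check more meaningful.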
This is a work in progress, so I will update this readme as I advance.
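For reference, here is a minimal sketch of how a baseline TruthfulQA run might look via Inspect. The task import path and the model identifier are assumptions on my part (they depend on the installed inspect_evals version and the model I end up renting), not the project's final setup.

```python
# Minimal sketch of a baseline TruthfulQA run via Inspect.
from inspect_ai import eval
from inspect_evals.truthfulqa import truthfulqa  # assumed task location in inspect_evals

logs = eval(truthfulqa(), model="hf/gpt2")  # placeholder Hugging Face model
```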