steering_trust

This is my first evaluation and interpretability project. It tests steering vectors for truthfulness and honesty on the TruthfulQA and MASK benchmarks.

To be updated soon. Update 23-Jan-2026: the issues with the rented GPU are now solved. I am currently debugging issues with TransformerLens.

Who am I?

I'm Ignacio, an Aerospace Engineer transitioning towards a career in technical AI safety. I am in the early stages of this career change, and I am particularly interested in questions of mechanistic interpretability and in evaluations of honesty and deception.

What am I trying to achieve?

I have two goals for this project: the main one is learning-by-doing. I want to learn how to implement some basic steering techniques and to use benchmarks via AISI's Inspect framework.

My second goal is to explore some questions that interest me about AI honesty, truthfulness and deception. These may already have been investigated elsewhere, but I want to try my hand at them before doing a deeper dive into the literature. Some of these questions and hypotheses are:

Does steering for truthfulness and honesty noticeably change results relative to baseline on benchmarks such as TruthfulQA and MASK? I suspect this will depend on many factors, but if done well, it should.
Can frontier models generate question pairs that reliably yield steering vectors for truthfulness and honesty? I assume yes; to test this, I would compare the benchmark results obtained with those vectors against the results obtained with vectors derived from a subset of the benchmark questions. The latter risks being trivial, but I think it is a reasonable baseline for answering this question.
Are truthfulness and honesty steering vectors orthogonal? My initial hypothesis is that they should be.
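
To make the orthogonality question concrete, here is a minimal sketch in plain Python. It assumes a steering vector is built as the difference of mean activations between "positive" and "negative" prompts, which is one common approach and not necessarily what make_vectors_v2.py does. All function names and the toy 3-dimensional activations are hypothetical; real activations would come from a model's residual stream.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(pos: Sequence[Sequence[float]],
                    neg: Sequence[Sequence[float]]) -> Vector:
    """Difference of means: mean(pos) - mean(neg). An assumed construction."""
    return [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def add_steering(hidden: Sequence[float], vec: Sequence[float],
                 alpha: float) -> Vector:
    """Steer one activation: hidden + alpha * vec."""
    return [h + alpha * s for h, s in zip(hidden, vec)]


# Toy activations (hypothetical, hand-picked so the two vectors are orthogonal).
truthful_pos = [[1.0, 0.0, 0.5], [3.0, 0.0, 0.5]]
truthful_neg = [[0.0, 0.0, 0.5], [0.0, 0.0, 0.5]]
honest_pos = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
honest_neg = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

v_truth = steering_vector(truthful_pos, truthful_neg)
v_honest = steering_vector(honest_pos, honest_neg)

print("cosine(truth, honesty) =", cosine_similarity(v_truth, v_honest))
```

One caveat on interpreting the result: in high-dimensional spaces, two unrelated random directions already have a cosine close to zero, so a near-zero value by itself is weak evidence. Comparing against the cosine between random vectors of the same dimension would give a more meaningful reference point.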

This is a work in progress, so I will update this README as I advance.
