Releases: dtch1997/steering-bench
v0.0.1 (2024-12-13)
Fix
Unknown
- add ci (#3)
  - add ci, pre-commit
  - fix ci
  - update ci
  - update ruff config
  Co-authored-by: Daniel Tan <dtch1997@users.noreply.github.com> (33c8a2b)
- Update README.md (7ed9e3b)
- add arxiv badge (f302b40)
- update (15f3930)
- rename persona_generalization to steering_generalization (0ddd1ad)
- Update README.md (a17bfba)
- Update README.md (b566646)
- Update README.md (95295e6)
- Create README.md (fae2efb)
- Create README.md (9151b2a)
- Update README.md (0f25019)
- Refusal (#2)
  - add refusal dataset
  - wip refusal experiment
  Co-authored-by: Daniel CH Tan <dtch1997@users.noreply.github.com> (893d67a)
- remove redundant script (3d4870f)
- refactor (ffe4666)
- defer formatting to apply_chat_template (53021b6)
- add persona generalization experiment (3d70293)
- delete results (8dac290)
- steering experiments (#1)
  - wip steering experiment
  - minimal working version done
  - refactor
  - add working layer sweep script
  - working layer sweep
  Co-authored-by: Daniel Tan <dtch1997@users.noreply.github.com> (b571672)