We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.
Topics: machine-learning, research, ai, deep-learning, pytorch, artificial-intelligence, safety, llama, steering, neurips, llm, mechanistic-interpretability, llm-safety, refusal, llama3, transformer-lens, llm-refusal
Updated Dec 31, 2025 · Jupyter Notebook