We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.
Topics: machine-learning, research, ai, deep-learning, pytorch, artificial-intelligence, safety, llama, steering, neurips, llm, mechanistic-interpretability, llm-safety, refusal, llama3, transformer-lens, llm-refusal
Updated Dec 31, 2025 · Jupyter Notebook