LLMs Encode Harmfulness and Refusal Separately — Reproduction

This repository reproduces experiments from the paper "LLMs Encode Harmfulness and Refusal Separately". Configs live in conf/, experiment functions in src/harm_refuse/experiments/, and the entry point is run.py.
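
The layout above suggests a dispatch pattern: the name passed with -cn selects a config, and the config determines which experiment function runs. The sketch below illustrates that idea with only the standard library. It is not the repository's code: the registry, the register decorator, the experiment name "demo", and the config contents are all hypothetical, and the real run.py may rely on a config framework instead (Hydra, for example, exposes a -cn / --config-name flag).

```python
import argparse
from typing import Callable

# Hypothetical registry mapping a config name to an experiment function.
EXPERIMENTS: dict[str, Callable[[dict], str]] = {}


def register(name: str):
    """Decorator that records an experiment function under a config name."""
    def wrap(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        EXPERIMENTS[name] = fn
        return fn
    return wrap


@register("demo")  # hypothetical experiment name, not taken from the repository
def demo(cfg: dict) -> str:
    return f"ran demo with model={cfg['model']}"


def main(argv: list[str]) -> str:
    """Parse -cn, look up the experiment, and run it with a stand-in config."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-cn", "--config-name", required=True)
    args = parser.parse_args(argv)
    if args.config_name not in EXPERIMENTS:
        raise SystemExit(f"unknown experiment: {args.config_name}")
    # Stand-in for a config that would normally be loaded from conf/.
    cfg = {"model": "placeholder-model"}
    return EXPERIMENTS[args.config_name](cfg)
```

Calling main(["-cn", "demo"]) runs the registered function with the stand-in config and returns its message.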

Quick Start (uv)

Set up the environment: uv sync
Log in to Hugging Face for gated models/datasets: huggingface-cli login
Run an experiment: uv run run.py -cn [EXPERIMENT]
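
When several experiments need to run back to back, it can help to build the commands programmatically. The sketch below only constructs the argument lists shown in this README; the helper names experiment_command and render are hypothetical, and treating "smol" as a valid value for -cn is an assumption based on the Notes section.

```python
import shlex


def experiment_command(name: str, use_uv: bool = True) -> list[str]:
    """Build the argv for one experiment run, mirroring the README commands."""
    launcher = ["uv", "run"] if use_uv else ["python"]
    return [*launcher, "run.py", "-cn", name]


def render(names: list[str], use_uv: bool = True) -> list[str]:
    """Return shell-quoted command strings, one per experiment name."""
    return [shlex.join(experiment_command(n, use_uv)) for n in names]
```

Each list returned by experiment_command can be passed to subprocess.run(cmd, check=True) from the repository root; render is useful for printing the commands before running them.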

Alternative (pip)

Create and activate a Python 3.12 virtual environment.
Install this project and its dependencies: pip install -e .
(Optional) Log in to Hugging Face: huggingface-cli login
Run an experiment: python run.py -cn [EXPERIMENT]
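
Before installing with pip, a quick check can confirm that the interpreter matches the steps above. This is a small sketch using only the standard library; the function names are hypothetical, and requiring exactly Python 3.12 is an assumption drawn from the first step (other versions may or may not work).

```python
import sys


def in_virtualenv() -> bool:
    """True when running inside a venv (sys.prefix differs from the base prefix)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def python_ok(required: tuple[int, int] = (3, 12), version=sys.version_info) -> bool:
    """True when the major.minor version equals the required one (assumed 3.12)."""
    return tuple(version[:2]) == required
```

Running python -c "import sys; print(sys.version_info[:2], sys.prefix != sys.base_prefix)" in the activated environment performs the same check in one line.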

Notes

Large models can be memory-heavy; use smol first to validate your setup.
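
As a rough guide to what "memory-heavy" means, the memory needed for model weights alone is approximately the parameter count times the bytes per parameter. The sketch below does that arithmetic; the function name is hypothetical, and the 7-billion-parameter figure in the usage note is purely illustrative, not a statement about which models this repository uses.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for weights only: parameters x bytes per parameter, in GiB.

    bytes_per_param is 2 for 16-bit formats (fp16/bf16) and 4 for fp32.
    Activations and any cached hidden states come on top of this figure.
    """
    return n_params * bytes_per_param / 2**30
```

For example, weight_memory_gib(7e9) is about 13 GiB in 16-bit precision and weight_memory_gib(7e9, 4) is about 26 GiB in 32-bit precision, before counting activations.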

Repository: jamie-stephenson/harm-refuse
