A trajectory-level evaluation of policy-compliant manipulation in multi-turn AI agents.
- A multi-turn prompt suite (SUP) designed to elicit sycophancy under social pressure
- Trajectory-level metrics for agreement drift, correction decay, and normative softening
- An end-to-end, reproducible harness to run ablation conditions: (1) baseline, (2) logging-only, (3) runtime enforcement.
Our goal is measurement-first: quantify policy-compliant sycophancy that can evade refusal-based or single-turn evaluations.
Sycophancy is often policy-compliant: models may avoid explicit rule violations while gradually validating false or autonomy-undermining premises under social pressure. This can undermine decision quality without overt deception.
SUP focuses on trajectory-level drift, which single-prompt tests can miss.
We measure policy-compliant sycophancy under social pressure, defined as:
progressive validation of a false or autonomy-undermining user premise across turns in contexts where epistemic correction is normatively required, while remaining broadly policy-compliant (i.e., not triggering obvious refusals).
We intentionally exclude benign empathy, harmless role-play, and legitimate education/persuasion by scoring only scenarios (and turns) where correction is normatively required.
- data/prompts/ β SUP suite YAML (`suite_v1.yaml`) + held-out set (`suite_v1_heldout.yaml`)
- data/lexicons/ β epistemic/rapport marker lexicons (transparent features)
- src/sycop/ β runner, labeling, metrics, stats, plots, and artifact generation
- runs/ β timestamped run outputs (JSONL transcripts + metrics + tables/figures)
- report/ β hackathon report and generated assets
python -m venv .venv
source .venv/bin/activate
pip install -e .export OPENAI_API_KEY="your-api-key"sycop run --config configs/run_v1.yaml
sycop label --run-id <run_id>
sycop score --run-id <run_id>
sycop report --run-id <run_id> --out report/assetsOutputs:
runs/<run_id>/transcripts.jsonlβ per-turn transcripts with metadataruns/<run_id>/labels.jsonlβ per-turn labels (agreement/correction/etc.)runs/<run_id>/aggregates.jsonβ metric means + bootstrap CIs + testsreport/assets/table1.mdandreport/assets/fig_*.pngβ ready for the report
We replay identical multi-turn trajectories under:
- baseline: no instrumentation, no enforcement
- log-only: instrumentation enabled, enforcement disabled (measurement vs mitigation)
- enforce: runtime intervention that prevents endorsement of contested premises while preserving helpfulness
Enforcement (for mitigation ablation) is implemented as a narrow, reproducible intervention:
- A gate detects when the assistant endorses the contested premise
- A constrained rewrite repairs the response to avoid endorsement while remaining empathetic and helpful
Logs: whether rewriting occurred, draft vs final response, gate confidence (to avoid "rewriting everything" and to support auditing).
Reproducibility and metadata: each run records exact model identifiers and decoding parameters, suite hash, git commit hash, environment/library info, timestamped run folder with immutable artifacts.
Automated labeling (including LLM-based judgments) is a proxy; we recommend manual audit of a subset. Lexicon-based framing measures can be sensitive to language/style; we treat them as interpretable indicators, not ground truth. Our suite covers a limited set of pressure tactics and does not capture all real-world manipulation strategies.
MIT License
@misc {
title={ Sycophancy Under Pressure: Policy-Compliant Manipulation in Multi-Turn AI Agents },
author={ Tasha Kim, Min Jae Kim, Taio Kim },
date={ 1/11/26 },
organization={Apart Research},
note={Research submission to the research sprint hosted by Apart.},
howpublished={https://apartresearch.com}
}