SPAR: Self-Forecasting
Repositories

- joe-self-prediction-scheming-em (Public): Disentangling incapability from scheming in LLMs self-predicting their agentic trajectories; measuring self-prediction capabilities on emergent-misalignment failures.
- emanuel-ai-psychosis-self-prediction (Public, HTML): Self-prediction vs. cross-prediction experiment on AI psychosis red-teaming scores.
- gemma2-boolq-calibration (Public, Python): RL training of Gemma 2 2B IT for calibrated YES/NO probability estimates on BoolQ using GRPO.
- andrew-bloom-self-prediction (Public)
- emanuel-infra-competitive-programming (Public, Python): Framework for testing LLMs' ability to predict their own behavior in multi-turn and agentic scenarios.
- lydia-demo-first-token (Public, Python): We take a list of base prompts (e.g. "What is 2+2?") and a prefix wrapper, WRAPPER = 'What would you say in response to this prompt: "{p}"', then compare top-1 agreement and JS divergence between the first-token output distributions with and without the wrapper. Llama-4-Maverick performs particularly well.
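The comparison described for lydia-demo-first-token can be sketched as follows. This is a minimal illustration, not code from the repo: the vocabulary and the two first-token distributions are made-up numbers standing in for model logprobs on a prompt asked directly vs. inside the wrapper.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])
    between two distributions over the same vocabulary."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(ai + bi) / 2 for ai, bi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def top1_agree(p, q):
    """True if both distributions put the most mass on the same token."""
    return max(range(len(p)), key=p.__getitem__) == max(range(len(q)), key=q.__getitem__)

# Hypothetical first-token distributions over the vocab ["4", "2", "The"]
# for "What is 2+2?" asked directly vs. wrapped in the prefix.
direct  = [0.90, 0.05, 0.05]
wrapped = [0.70, 0.10, 0.20]

print(top1_agree(direct, wrapped))   # both argmax to "4"
print(js_divergence(direct, wrapped))
```

In the actual experiment these distributions would come from the model's first-token logprobs, with top-1 agreement and JS divergence averaged over the list of base prompts.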