# docs: improve FINAL/FINAL_VAR format requirements in system prompt #41
## Conversation
**alexzhang13:** @LIHUA919 I like this general idea, but I want to make sure the prompts actually work well, because this adds a lot of extra tokens to the system prompt (i.e., I'm not sure all the in-context examples are necessary). Can you run a small test and see whether this nukes general performance? I just want to confirm that before I make any prompt changes. Thanks! For example, for the task where you observed it go "I will provide...", can you show a diff of whether the output changes?
**LIHUA919:** Thanks for the feedback @alexzhang13! I completely understand the concern about token usage. Let me run some tests comparing performance before and after the prompt changes.

I'll also test the specific case mentioned in #37, where the model was including `FINAL_VAR()` inline in its prose instead of on its own line.

Would you like me to also test a simplified version with fewer examples, in case the full version adds too many tokens? For example, we could keep just the critical format requirements without all the visual examples. I'll share the test results shortly so we can make an informed decision about the trade-off between token cost and reliability. A sketch of the comparison harness follows below.
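To make the comparison concrete, here is a minimal sketch of the harness I have in mind; `run_task` stands in for whatever entry point actually executes a task end-to-end, and both its name and signature are hypothetical:

```python
import statistics
from typing import Callable

def compare_prompts(run_task: Callable[[str], int], task: str, runs: int = 5) -> list[int]:
    """Run `task` several times and summarize total token usage.

    `run_task` is a hypothetical helper assumed to execute one full task
    (all iterations) and return the total tokens consumed, prompt plus
    completion. Call this once with the old system prompt checked out and
    once with the new one, then compare the summaries.
    """
    totals = [run_task(task) for _ in range(runs)]
    print(f"mean={statistics.mean(totals):.0f} "
          f"min={min(totals)} max={max(totals)} runs={runs}")
    return totals
```

The same script would also record iteration counts, since the hypothesis is that the prompt change pays for itself by eliminating wasted iterations.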
**LIHUA919:** Thanks @alexzhang13! I've completed the testing. Here are the detailed results.

### Short answer: no, it doesn't nuke performance ✅

### Test Setup
### 📊 Performance Data

#### System Prompt Overhead
This ~140-token cost is added once per iteration. The question is whether we save more than 140 tokens by reducing iterations.

#### Overall Results (All Tests Combined)
Net change: −5,084 tokens overall (a saving), despite the ~140 × 8.4 ≈ 1,176 extra prompt tokens.

### 🎯 Issue #37 Case: Print 100 Powers of Two

This is the specific case you mentioned, where the model said "I will return FINAL_VAR(output)".
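For the diff you asked about, the change in the model's final message looks roughly like this (a paraphrased shape, not a verbatim transcript from the runs):

```diff
- Here are all 100 powers of two. I will return FINAL_VAR(output) now.
+ Here are all 100 powers of two.
+ FINAL_VAR(output)
```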
Raw data (5 runs):
The model now correctly places FINAL_VAR on its own line, fixing the detection bug.

### 📋 All Test Cases (Raw Data)

#### Test 1: Print 100 powers of two
#### Test 2: Simple math (15 * 23 + 7)

#### Test 3: Count 1-10
### 💡 Cost-Benefit Analysis

#### Per-Iteration Breakdown (Test 1 example)

Old version:
New version:
Net savings: 29,029 − 13,039 = 15,990 tokens (a −55.1% change)

#### Why does this work?
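The prompt overhead is small next to the cost of a wasted iteration, because every failed FINAL_VAR detection forces another full round trip. A quick sanity check using only the figures quoted above:

```python
# Figures reported above; nothing here is re-measured.
overhead_per_iteration = 140     # extra system-prompt tokens per iteration
avg_iterations = 8.4             # average iterations across all tests
extra_prompt_tokens = overhead_per_iteration * avg_iterations   # ~1,176

net_change = -5_084              # overall token change (negative = saved)

# Savings from eliminated iterations must cover both the net saving and
# the added prompt cost:
iteration_savings = -net_change + extra_prompt_tokens            # ~6,260
print(f"prompt cost: {extra_prompt_tokens:.0f}, "
      f"iteration savings: {iteration_savings:.0f}")
```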
### 🎯 Recommendation

I recommend merging this PR because:

- ✅ Fixes Issue #37: the model now correctly formats FINAL_VAR (Test 1: −36.2% tokens)

#### Optional: Simplified Version

If you're concerned about prompt length, I can create a version with:
Let me know if you'd like me to test a simplified version or if this looks good to merge!
## Summary
This PR addresses Issue #37 by clarifying the format requirements for `FINAL()` and `FINAL_VAR()` statements in the system prompt.

### Problem
As reported in #37, models were naturally including `FINAL_VAR()` in conversational context (e.g., "I will return FINAL_VAR(output) now") instead of placing it on a separate line. This caused the detection regex (`^\s*FINAL_VAR`) to fail, leading to unnecessary iterations and wasted tokens.
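A minimal demonstration of the failure mode, assuming the pattern is applied with line anchoring (e.g., `re.MULTILINE` or per-line matching; the exact call site in the repo may differ):

```python
import re

# The detection pattern described above: FINAL_VAR must begin a line
# (optionally indented). With re.MULTILINE, ^ anchors at every line start.
FINAL_VAR_RE = re.compile(r"^\s*FINAL_VAR", re.MULTILINE)

# Fails: FINAL_VAR is buried mid-sentence, so no line starts with it.
inline = "I will return FINAL_VAR(output) now"
print(bool(FINAL_VAR_RE.search(inline)))    # False -> triggers another iteration

# Matches: FINAL_VAR is on its own line, as the updated prompt requires.
own_line = "Here is the result:\nFINAL_VAR(output)"
print(bool(FINAL_VAR_RE.search(own_line)))  # True -> answer detected
```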
### Solution

Improve the system prompt to explicitly state the format requirements:
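Roughly the kind of requirement this adds (illustrative wording only, not the literal text in `prompts.py`):

```text
FORMAT REQUIREMENT: FINAL() / FINAL_VAR() must appear alone on its own line.

Correct:
FINAL_VAR(output)

Incorrect (will NOT be detected):
I will return FINAL_VAR(output) now.
```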
### Changes
File: `rlm/utils/prompts.py`

### Testing
### Expected Impact
Fixes #37