Implement v3 fully iterative agent with user approval workflow and enhanced schema detection #1
Co-authored-by: Varun-Patkar <50433487+Varun-Patkar@users.noreply.github.com>
@copilot I tested it. The reasoning model thinks too long, especially when the previous iteration fails. It second-guesses itself a lot. Also, before executing, always ask the user for permission, and let them stop it at any point.
Fixed both issues in commit 7f61eaf:
1. Faster decision-making: Added prompt rules to reduce overthinking
2. User approval required: Each action now shows Execute/Skip/Stop buttons before execution
Button behavior:
Users now have full control at every step.
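The approval gate described in this reply can be sketched as a small loop. This is only an illustration: the names `ApprovalChoice`, `StepLog`, and `runWithApproval` are hypothetical, and the button semantics (Execute runs the action, Skip moves on without running it, Stop ends the run) are an assumption, since the details are not shown in this thread. The callbacks are synchronous here to keep the sketch short; a real UI would most likely await a button click.

```typescript
// All names here are hypothetical; the PR does not show its real types.
type ApprovalChoice = "execute" | "skip" | "stop";

interface StepLog {
  action: string;
  outcome: "executed" | "skipped";
}

// Walks through proposed actions, asking for approval before each one.
// Assumed semantics: Execute runs the action, Skip moves on without
// running it, Stop ends the run immediately.
function runWithApproval(
  proposed: string[],
  askUser: (action: string) => ApprovalChoice,
  execute: (action: string) => void,
): StepLog[] {
  const log: StepLog[] = [];
  for (const action of proposed) {
    const choice = askUser(action);
    if (choice === "stop") {
      break; // nothing after this point is executed
    }
    if (choice === "skip") {
      log.push({ action, outcome: "skipped" });
      continue;
    }
    execute(action);
    log.push({ action, outcome: "executed" });
  }
  return log;
}
```

The point of the design is that nothing reaches `execute` without an explicit "execute" answer, and a "stop" answer prevents every later action from running.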
…constant for delay Co-authored-by: Varun-Patkar <50433487+Varun-Patkar@users.noreply.github.com>
@copilot Now it thinks too little and picks the wrong tool. The approval flow is good, no need to change that. Also, we need to feed the schema tool's output in as input whenever the page changes (manageTabs, etc.). The page doesn't change on click, but after navigating to another page you need to call it by default before deciding the next tool. That way it is informed about what it has to work with on the page.
Fixed in commits 23ee9a2 and 795fce5:
1. Balanced reasoning (23ee9a2):
2. Auto-schema after navigation (23ee9a2):
3. Full schema in context (795fce5):
The agent now has proper context to make informed decisions about which elements to interact with.
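A minimal sketch of the auto-schema rule requested above, assuming that only `navigate` and `manageTabs` change the page (as stated in the review comment) and that the follow-up order is `waitFor` then `getSchema`. The tool names come from the agent's tool list quoted later in this thread; `PAGE_CHANGING_TOOLS` and `followUpTools` are hypothetical names, not the PR's code.

```typescript
// Tool names as listed in the agent's own prompt.
type ToolName =
  | "click" | "type" | "select" | "pressKey" | "scroll"
  | "navigate" | "manageTabs" | "waitFor" | "getSchema" | "getHTML";

// Assumption: only these tools replace the page the agent is looking at.
const PAGE_CHANGING_TOOLS: ToolName[] = ["navigate", "manageTabs"];

// Returns the tools to run automatically after `lastTool`, before the
// model is asked to pick its next action.
function followUpTools(lastTool: ToolName): ToolName[] {
  if (PAGE_CHANGING_TOOLS.indexOf(lastTool) !== -1) {
    return ["waitFor", "getSchema"];
  }
  return []; // click, type, etc. leave the page (and its schema) as is
}
```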
@copilot Here is the first input. See how big the reasoning is. Also, it still messed up: the URL is not in the executor call for YouTube. Earlier this was working.
First, I'm on the Google homepage. The task is to open YouTube in a new tab. The known URLs include youtube.com, so I should navigate to that. But wait, the user said "in another tab", so I need to open a new tab first. Wait, the available tools include manageTabs. So the first step is to open a new tab. Then navigate to youtube.com. But the user also wants to search for the specific query. However, the instructions say to do it in another tab, so the next action is to create a new tab, then go to YouTube. Wait, the user's instruction says "Open yoiutube in another tab and search for...". So first, open a new tab with YouTube. Then, once on YouTube, perform the search. But according to the rules, after navigating to a new tab (using manageTabs), I need to call getSchema before interacting. Wait, the schema is for the page structure. So first, manageTabs to open a new tab, then navigate to youtube.com, then wait for the page to load, then getSchema to check if the search box is present. Wait, the user's previous conversation: the user said "Open yoiutube in another tab and search for 'never gonna give you up'". So current page is Google. The first step is to open a new tab. The available tools are manageTabs, navigate, etc. Wait, the correct sequence here is: Open a new tab (manageTabs with action to open a new tab) Wait, the user's instruction is to open YouTube in another tab. So the immediate next action is to open a new tab. Then navigate to YouTube. Wait, but the current state is that the user is on Google. The user wants to open YouTube in another tab. So first, manageTabs to open a new tab. Then, navigate to youtube.com. Wait, but the user says "Open yoiutube in another tab". So first, create a new tab, then go to YouTube. But in the current iteration, the user is on Google. 
So the first action is to open a new tab. Wait, the available tools include manageTabs. So the action would be to open a new tab. Wait, the manageTabs tool can open a new tab. So the next action is manageTabs with action 'open new tab'. But the user's instruction is to open YouTube in another tab, so after opening the new tab, navigate to YouTube. Wait, but the current step is the first action. So the first action is to open a new tab. Wait, but the user's instruction is to open YouTube in another tab and search. So first, open the tab, then go to YouTube. But according to the rules, after navigating (which includes manageTabs), the next action is waitFor. Wait, no. The manageTabs is a separate tool. Let me check the tools again. Available Tools: click, type, select, pressKey, scroll, navigate, manageTabs, waitFor, getSchema, getHTML So manageTabs is a tool. The description says: "manageTabs: open, close, switch, or manage browser tabs." So to open a new tab, the action would be manageTabs with action 'open new tab'. But in the context of the user's request, the first step is to open a new tab. Then, navigate to YouTube. But the user's instruction is to "Open yoiutube in another tab and search for...". So the immediate next action is to open a new tab. Wait, but the user's current page is Google. So the next action is to open a new tab. Wait, but the user is asking to open YouTube in another tab. So the first step is to create a new tab. Then, navigate to YouTube. But the current step is the first action. So the next action is to open a new tab. Wait, but the user's previous conversation shows that the user is on Google. So the agent should first open a new tab. Wait, the rules state that after any page navigation (manageTabs, navigate), you must call getSchema before attempting interactions. Wait, no. The rule says: "After opening tabs or navigating, next action should be waitFor page load." 
Wait, the rule says: "After opening tabs or navigating, next action should be waitFor page load." Wait, the exact rule 9: "After opening tabs or navigating, next action should be waitFor page load." Wait, the user's instruction is to open a new tab (manageTabs), then navigate to YouTube. So after manageTabs, the next action is to navigate to YouTube. Wait, no. Wait, the manageTabs action is opening a new tab, so the next step is to navigate to the URL in that new tab. Wait, the manageTabs tool's parameters would specify the action (open new tab), but how to specify which tab? Maybe the manageTabs action is to open a new tab, and then the next step is to navigate to the URL in that new tab. Wait, perhaps the correct sequence is: manageTabs to open a new tab. navigate to youtube.com in that new tab. But according to the tools, navigate is a separate tool. So after opening the new tab, the next action is navigate to youtube.com. But the rule says that after navigating (using navigate), the next action should be waitFor. Wait, no. Rule 9 says: "After opening tabs or navigating, next action should be waitFor page load." Wait, the rule is a bit confusing. Let me read it again. Rule 9: "After opening tabs or navigating, next action should be waitFor page load." Wait, so if you open a new tab (manageTabs), then the next action should be waitFor to load the page. But no, because manageTabs opens a new tab, but the page isn't loaded yet. Wait, no. If you open a new tab, you have to navigate to a URL first. So the correct sequence is: manageTabs to open a new tab. Then, navigate to the URL (youtube.com). Then, waitFor the page to load. Wait, but the rule says after navigating (which is the navigate tool), you must waitFor. So: After navigate (which is the action of going to a URL), the next action is waitFor. So the steps would be: manageTabs (open new tab) navigate to youtube.com waitFor page load getSchema interact with the search box. 
But according to the current situation, the user is on Google. So the first action is to open a new tab. Wait, but the user's instruction is to open YouTube in another tab and search for the query. So the immediate next action is to open a new tab. Wait, but the user's previous conversation shows that the user is on Google. So the agent's first action should be to open a new tab. Wait, but the problem here is that the agent needs to execute the next action. So the first step is to open a new tab. Wait, but the user's instruction is to open YouTube in another tab. So the agent should open a new tab, then navigate to YouTube. But according to the rules, after opening a tab (manageTabs), the next action should be waitFor. Wait, no. The manageTabs action is to open a new tab. Then, the next action is to navigate to the URL in that tab. Wait, maybe the manageTabs tool is used to open a new tab, and then navigate to the URL. So the next action after manageTabs is navigate to youtube.com. Wait, but the rule says: "After opening tabs or navigating, next action should be waitFor page load." Wait, opening tabs (manageTabs) is a separate action from navigating. So when you open a new tab (manageTabs), the page isn't loaded yet. You need to navigate to the URL first. So the sequence is: manageTabs (open new tab) navigate to youtube.com waitFor page load getSchema type search query pressKey to submit But according to the problem statement, the agent is supposed to output the single next action. So in the current state, the user is on Google. The next action should be to open a new tab. Wait, but the user's instruction is to open YouTube in another tab. So the first step is to open a new tab. Therefore, the next action is to manageTabs to open a new tab. Wait, but the agent has to output the action. So the next action is to open a new tab. Wait, but the user might be confused because the current page is Google. So the agent needs to open a new tab, then go to YouTube. 
But the immediate next action is to open a new tab. Wait, but the agent must not plan multiple steps. So the immediate next action is to open a new tab. Wait, but the user's instruction is to open YouTube in another tab. So the agent needs to first open a new tab, then navigate to YouTube. So the first action is to open a new tab. Wait, but the available tools include manageTabs. So the action is manageTabs with action 'open new tab'. Then, after that, navigate to youtube.com. But in the current iteration, the agent's next action is to open a new tab. Wait, but the user's instruction is to open YouTube in another tab and search. So the first action is to open a new tab. Therefore, the next action is to open a new tab. Wait, but the user is on Google. So the agent should open a new tab. Wait, but the correct action here is to open a new tab. So the action would be manageTabs with action 'open new tab'. Wait, but the manageTabs tool's parameters might require more details. However, the problem statement says to be specific. The tool is manageTabs, so the action description should be clear. Wait, the example shows that manageTabs can be used to open a new tab. So the action is "open a new tab". Wait, the output JSON should have the action as "manageTabs: open new tab". But according to the schema, the action is a string. So the action would be "manageTabs: open new tab". But the rules say: "BE SPECIFIC: Include exact URLs, text, and details in action descriptions." So the action should be precise. So the action is to open a new tab. Wait, but the user's instruction is to open YouTube in another tab. So the next action is to open a new tab, then navigate to YouTube. But the agent must output only the next immediate action. So in this case, the next action is to open a new tab. Wait, but the user's previous conversation shows that the user is on Google. So the agent should first open a new tab. Yes. So the next action is to open a new tab. 
Wait, but the problem is that the agent needs to output the action. So the action is manageTabs to open a new tab. But according to the rules, after opening tabs (manageTabs), the next action should be waitFor. Wait, no. The rule says: "After opening tabs or navigating, next action should be waitFor page load." Wait, but manageTabs is opening a new tab, not navigating. So after opening the tab, you need to navigate to the URL. So the next action after manageTabs is navigate to youtube.com. But the agent's current step is to decide the next action. So if the agent is on the Google page, the immediate next action is to open a new tab. Wait, no. Because the agent is supposed to open YouTube in another tab. So the first action is to open a new tab, then navigate to YouTube. So the first action is to open a new tab. Therefore, the next action is to open a new tab. Wait, but the user is asking to open YouTube in another tab. So the agent's first action is to open a new tab. Therefore, the action is to manageTabs with open new tab. But the problem is that the user's instruction is to open YouTube in another tab, so the agent needs to open a new tab, then go to YouTube. But according to the rules, the agent can only do one action at a time. So the immediate next action is to open a new tab. Wait, but the agent must not create multi-step plans. So the agent should only do the next immediate action. Therefore, the next action is to open a new tab. Wait, but the user's instruction is to open YouTube in another tab. So the agent needs to open a new tab, then navigate to YouTube. So the first step is to open a new tab. Therefore, the correct action is to manageTabs to open a new tab. But the output should be a JSON with action as "manageTabs: open new tab". Wait, but the example shows that when navigating, the action is "navigate to URL". Wait, the available tools include navigate. 
So to open YouTube in another tab, the agent would first open a new tab (manageTabs), then navigate to YouTube. But the immediate next action is to open the new tab. So the answer would be: needs_action: true action: "manageTabs: open new tab" reasoning: "To open YouTube in another tab, first create a new tab." message: "Opening a new tab to access YouTube..." But wait, the user's instruction is to open YouTube in another tab and search. So the agent is supposed to first open a new tab, then navigate to YouTube. But according to the rules, after opening a new tab (manageTabs), the next action should be waitFor. Wait, no. The rule says: "After opening tabs or navigating, next action should be waitFor page load." Wait, the rule says after opening tabs (manageTabs) or navigating (navigate), the next action should be waitFor. Wait, but manageTabs is opening a new tab, which doesn't load a page yet. You need to navigate to a URL first. So after manageTabs (open new tab), the next action is navigate to youtube.com. But according to the rule 9: "After opening tabs or navigating, next action should be waitFor page load." Wait, the rule is a bit confusing. If you open a new tab (manageTabs), then navigate to a URL (navigate), then you need to waitFor. Wait, the rule says: "After opening tabs or navigating, next action should be waitFor page load." So if you open a new tab (manageTabs), then navigate (which is a separate action), then after navigating, the next action is waitFor. But in this case, the agent is on the Google page. To open YouTube in another tab, the steps are: manageTabs: open new tab navigate to youtube.com waitFor page load getSchema type search query pressKey But the agent can only do one action at a time. So the immediate next action after being on Google is to open a new tab. Wait, but the user is asking to open YouTube in another tab. So the first step is to open a new tab. Then navigate to YouTube. Therefore, the next action is to open a new tab. 
But the rule 9 says that after opening tabs (manageTabs), the next action should be waitFor. Wait, but manageTabs opens a new tab, which is empty. To load a page, you need to navigate. So the waitFor would be after navigating. Wait, perhaps the rule is that after any navigation (navigate), you need to waitFor. But manageTabs is not navigation. So after manageTabs, you need to navigate, then waitFor. But in the current situation, the agent is on Google. To open YouTube in another tab, the agent must first open a new tab (manageTabs), then navigate to YouTube (navigate), then waitFor. So the immediate next action is manageTabs: open new tab. Then, after that, navigate to YouTube. But according to the current iteration, the agent's next action is to open a new tab. Wait, the user's previous conversation shows that the agent is on Google. So the next action is to open a new tab. Therefore, the correct action is to open a new tab. Wait, but the user's instruction is to open YouTube in another tab. So the agent needs to open a new tab, then go to YouTube. Thus, the first step is to open a new tab. Therefore, the action is manageTabs: open new tab. But the problem is that the agent must output this as the action. Wait, but the user's instruction is to open YouTube in another tab and search for the query. So the agent needs to open a new tab, then navigate to YouTube. The immediate next action is to open a new tab. So the answer should be: { "needs_action": true, "action": "manageTabs: open new tab", "reasoning": "To fulfill the user's request to open YouTube in another tab, first create a new tab.", "message": "Opening a new tab to access YouTube...", "ask_user": null } But wait, the user said "Open yoiutube in another tab and search for...". The correct URL is youtube.com, but the user misspelled it as "yoiutube". However, the agent should proceed with the correct URL. Wait, the known URLs include youtube.com, so the agent would navigate to that. 
But the next action is to open a new tab, then navigate to YouTube. So the immediate next action is to open a new tab. Yes. Another check: the previous conversation shows that the user is on Google. So the agent needs to open a new tab. Yes. Therefore, the correct action is to open a new tab.
Opening a new tab to access YouTube...
💭 To fulfill the user's request to open YouTube in another tab, the immediate next step is to create a new browser tab. This allows navigating to YouTube without disrupting the current Google tab.
…iptions Co-authored-by: Varun-Patkar <50433487+Varun-Patkar@users.noreply.github.com>
Fixed both issues in commit c1e675d:
1. Constrained verbose reasoning:
2. Fixed missing URL in executor:
Result:
@copilot The getSchema output is too restrictive. This is the object: we have 13 elements, but the search input element is not shown, so it tries to click the search button directly, which is wrong.
Fixed in commit 7382c9b:
Problem: The schema filter was too restrictive and excluded search inputs without explicit labels, so the schema listed only 13 elements and omitted the search input field.
Solution:
Result:
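The filter change described in this reply can be sketched on plain objects instead of DOM nodes. The function names `computeAccessibleName` and `isImportantInputType` appear in the PR description, but their bodies are not shown there, so everything below is an assumption: the `ElementInfo` shape, the fallback order, the list of input types, and the `keepInSchema` helper are all hypothetical.

```typescript
// Hypothetical, DOM-free stand-in for an element on the page.
interface ElementInfo {
  tag: string;
  type?: string;
  ariaLabel?: string;
  labelText?: string;
  placeholder?: string;
  title?: string;
  name?: string;
}

// Assumed fallback order; the PR only says title and name are now checked.
function computeAccessibleName(el: ElementInfo): string {
  const candidates = [el.ariaLabel, el.labelText, el.placeholder, el.title, el.name];
  for (const candidate of candidates) {
    if (candidate !== undefined && candidate.trim() !== "") {
      return candidate.trim();
    }
  }
  return "";
}

// Assumed list of input types worth keeping even without a label.
function isImportantInputType(el: ElementInfo): boolean {
  if (el.tag === "textarea") return true;
  if (el.tag !== "input") return false;
  const type = (el.type || "text").toLowerCase();
  const important = ["text", "search", "email", "url", "tel", "password", "number"];
  return important.indexOf(type) !== -1;
}

// An element stays in the schema if it has a name or is an important input.
function keepInSchema(el: ElementInfo): boolean {
  return computeAccessibleName(el) !== "" || isImportantInputType(el);
}
```

Under these assumptions, an unlabeled `<input type="search">` is kept, which is the case the original filter missed.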
V3 Iterative System Implementation
Status: Complete with concise reasoning, proper parameter passing, and improved schema extraction
Changes Made
Core Features
1. Constrained Reasoning Length:
"reasoning": "1-2 sentence explanation only"
2. Required Parameters in Actions:
"Open a new tab with URL https://www.youtube.com"
3. Enhanced Schema Extraction:
Example Output Format
Before (problematic):
{ "action": "manageTabs: open new tab", "reasoning": "Okay, let's break this down... [300+ words of circular thinking]..." }
Error: "URL is required for open action"
Schema Issue: Only 13 elements, missing search input field
After (correct):
{ "action": "Open a new tab with URL https://www.youtube.com", "reasoning": "User wants to access YouTube in a new tab.", "message": "Opening YouTube..." }
Result: Executor successfully extracts URL and executes
Schema Fixed: Includes search input field, agent can type queries before clicking search button
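To illustrate why the "after" action works where the "before" one failed, here is a minimal sketch of pulling a URL out of a free-text action description. The regex approach, the `extractUrl` and `buildOpenTabCall` names, and the shape of the returned call are assumptions made for illustration; only the error message is taken from the output quoted above.

```typescript
// Finds the first http(s) URL in an action description, or null if none.
function extractUrl(action: string): string | null {
  const match = action.match(/https?:\/\/[^\s"'<>]+/i);
  if (match === null) {
    return null;
  }
  // Drop sentence punctuation that may trail the URL.
  return match[0].replace(/[.,;:!?)]+$/, "");
}

// Hypothetical executor step: turns a description into a manageTabs call.
function buildOpenTabCall(
  action: string,
): { tool: "manageTabs"; action: "open"; url: string } {
  const url = extractUrl(action);
  if (url === null) {
    throw new Error("URL is required for open action");
  }
  return { tool: "manageTabs", action: "open", url };
}
```

With this sketch, "Open a new tab with URL https://www.youtube.com" yields a call with the URL filled in, while "manageTabs: open new tab" raises the same error shown in the "before" case.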
Technical Changes
Prompt Updates:
Schema Extraction Improvements:
computeAccessibleName() to check title and name attributes
isImportantInputType() helper function
Benefits:
Testing
Test scenario: "Open YouTube in another tab and search for 'never gonna give you up'"
Expected behavior: