This document records the current design direction for turning a recorded user trace into an executable OpenBrowser workflow.
Current product status:
- Recording infrastructure exists in server/api/routes/recordings.py and server/core/recording_manager.py.
- Browser-side recording exists in extension/src/recording/recorder.ts and extension/src/content/index.ts.
- Recording review UI exists in frontend/index.html, and it now focuses on reviewing captured events and keyframes only.
- A first-pass rule-based compiler still exists in server/core/workflow_compiler.py, but it is no longer the intended product path for user-facing workflow draft generation.
Recent progress on the recording layer:
- pre-action keyframes now exist for `pointerdown -> click` and `keydown Enter`, so keyframes can capture the pre-navigation or pre-submit state instead of the post-action page.
- noisy root-level clicks are filtered out.
- text input is merged into a final meaningful input result instead of one event per character.
- `page_view` is now restricted to the top-level frame only, avoiding iframe noise.
- browser history navigation is now distinguished as `tab_back` and `tab_forward`, instead of collapsing everything into `tab_navigated`.
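As an illustration of the input-merging and top-level-frame rules above, here is a minimal sketch in Python. It is not the recorder's actual code (that lives in the browser extension), and the event fields `type`, `target`, `value`, and `is_top_frame` are hypothetical names chosen for the example.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def merge_text_input(events: List[Event]) -> List[Event]:
    """Collapse consecutive input events on the same target into one event."""
    merged: List[Event] = []
    for event in events:
        previous = merged[-1] if merged else None
        if (
            previous is not None
            and event["type"] == "input"
            and previous["type"] == "input"
            and previous["target"] == event["target"]
        ):
            # Keep only the latest value: the final meaningful input result.
            merged[-1] = {**previous, "value": event["value"]}
        else:
            merged.append(dict(event))
    return merged


def drop_iframe_page_views(events: List[Event]) -> List[Event]:
    """Keep page_view events only when they come from the top-level frame."""
    return [
        e for e in events
        if not (e["type"] == "page_view" and not e.get("is_top_frame", True))
    ]


# Hypothetical sample trace used only for this sketch.
sample = [
    {"type": "input", "target": "#q", "value": "a"},
    {"type": "input", "target": "#q", "value": "ai"},
    {"type": "click", "target": "#go"},
    {"type": "page_view", "url": "https://example.com/ad", "is_top_frame": False},
    {"type": "page_view", "url": "https://example.com/", "is_top_frame": True},
]
cleaned = drop_iframe_page_views(merge_text_input(sample))
```

In this sketch the two keystroke-level input events become a single event carrying the final value, and the iframe `page_view` is dropped while the top-level one is kept.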
The current system should be treated as:
recording trace capture and review
The missing piece is:
recording trace -> compiler-agent-authored workflow draft -> approved workflow -> execution
OpenBrowser should not rely on rule code to directly turn low-level events into a user-facing workflow draft.
Reason:
- low-level events capture what physically happened,
- but workflow compilation must express what the user meant.
Examples:
- `click` is a physical browser event,
- `select market cap = largeover` is the user intent,
- `change` is a DOM event,
- `search for housing in Xixi Wetland` is the workflow meaning.
That semantic jump is too brittle to encode as a fixed event-to-step mapping.
Therefore:
- raw recording trace remains the source of truth,
- lightweight normalization may still exist as an internal cleanup layer,
- but the user-facing workflow draft should be generated by a Compiler Agent, not by deterministic trace rules.
The intended product flow now has four explicit stages.
The user records the browser workflow manually.
Requirements for this stage:
- capture the key browser events,
- capture meaningful keyframes,
- keep the trace reviewable,
- keep the trace faithful to what actually happened.
This stage is the current implementation focus and should be considered the main completed foundation.
Status update:
- mostly complete as a product foundation,
- still open for continued event-quality improvements where trace semantics are obviously under-specified or noisy.
After stopping recording, the user adds a short note that explains the overall intent.
Examples:
- "Search Zhihu for posts about AI rent discussions."
- "Filter Finviz to large-cap stocks and inspect the results."
- "Collect useful findings and prepare a summary post."
This note is required because the trace alone usually does not encode the business rule behind the actions.
This stage is part of the intended design, but it is not yet integrated into the current product flow.
This is now the immediate next product task.
A dedicated Compiler Agent reads:
- the raw recording trace,
- the keyframes,
- the user intent note,
- and optionally an internal normalized trace.
It then produces a workflow draft.
This draft should describe:
- the intended steps,
- the reasoning behind the steps,
- the unresolved ambiguities,
- and the questions that must be answered before execution.
Important constraint:
The Compiler Agent is not executing the browser at this stage. It is only interpreting and compiling the demonstrated workflow.
This is now the next major implementation milestone after the intent-note step.
The Compiler Agent and the user iterate on the workflow draft.
The user should be able to:
- correct the draft,
- answer clarification questions,
- refine ambiguous rules,
- and approve the final version.
The result of this stage is the final workflow that later execution will use.
The current recording system does not need to solve workflow semantics.
It only needs to be strong on the following:
- accurate event capture,
- accurate keyframe capture,
- noise reduction where clearly justified,
- stable review of the trace after recording.
That means the recording layer is responsible for facts, not interpretation.
Examples of facts:
- which page was open,
- which element was clicked,
- what text was entered,
- what keyframe was visible before or during the interaction,
- what scroll or form change happened.
Interpretation belongs to the Compiler Agent, not the recorder.
Normalization may still exist, but only as an internal helper.
Its purpose is to make the trace easier for the Compiler Agent to consume.
Examples:
- remove obvious iframe or ad noise,
- group closely related low-level events,
- dedupe focus and click when they represent the same interaction,
- collapse scroll bursts,
- keep supporting events attached to a primary interaction.
Normalization should not pretend to be the final workflow.
That is the key design boundary.
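A minimal sketch of two of the cleanup rules listed above (collapsing scroll bursts, and folding a focus event into the click on the same element) might look like the following. The event fields `type`, `target`, `ts`, and `y`, the `supporting` key, and the 500 ms burst threshold are all assumptions made for illustration, not the project's actual normalizer.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]

# Hypothetical threshold: scrolls closer together than this form one burst.
SCROLL_BURST_MS = 500


def normalize(events: List[Event]) -> List[Event]:
    """Group closely related low-level events without interpreting intent."""
    out: List[Event] = []
    for event in events:
        prev = out[-1] if out else None
        if prev is not None:
            gap = event["ts"] - prev["ts"]
            if (
                event["type"] == "scroll"
                and prev["type"] == "scroll"
                and gap <= SCROLL_BURST_MS
            ):
                # Collapse the burst: keep one scroll with the final position.
                out[-1] = {**prev, "ts": event["ts"], "y": event["y"]}
                continue
            if (
                event["type"] == "click"
                and prev["type"] == "focus"
                and event.get("target") == prev.get("target")
            ):
                # The click is the primary interaction; the focus is attached
                # to it as a supporting event instead of being dropped.
                out[-1] = {**event, "supporting": [prev]}
                continue
        out.append(dict(event))
    return out


# Hypothetical sample trace used only for this sketch.
raw = [
    {"type": "focus", "target": "#q", "ts": 0},
    {"type": "click", "target": "#q", "ts": 10},
    {"type": "scroll", "ts": 1000, "y": 100},
    {"type": "scroll", "ts": 1200, "y": 300},
    {"type": "scroll", "ts": 1400, "y": 600},
    {"type": "scroll", "ts": 5000, "y": 0},
]
normalized = normalize(raw)
```

The output is still a list of events, not workflow steps: the helper only reduces noise and keeps the raw facts reachable through `supporting`, which is consistent with the boundary described above.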
The Compiler Agent should:
- inspect the recorded trace,
- inspect keyframes,
- read the user intent note,
- infer the likely workflow,
- decide where the trace is ambiguous,
- ask clarification questions,
- produce a reviewable draft,
- update that draft after user feedback.
The Compiler Agent should not:
- replay the trace mechanically,
- assume every click is a workflow step,
- assume every DOM event directly maps to intent,
- skip clarification when the trace is ambiguous.
The exact final schema is still open, but the draft should eventually include:
- workflow goal,
- ordered steps,
- reasoning per step,
- evidence references back to trace events,
- clarification questions,
- approved user answers,
- final execution-ready version.
The important point is not the exact JSON format yet. The important point is that the draft is agent-authored and user-reviewable.
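To make the listed fields concrete, here is one possible shape for the draft, written as Python dataclasses. Since the schema is still open, every class and field name here (`WorkflowDraft`, `Step`, `Clarification`, `evidence_event_ids`, and so on) is a placeholder assumption rather than a decided format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    description: str
    reasoning: str
    # References back to the trace events that support this step.
    evidence_event_ids: List[str] = field(default_factory=list)


@dataclass
class Clarification:
    question: str
    answer: Optional[str] = None  # None until the user answers


@dataclass
class WorkflowDraft:
    goal: str
    steps: List[Step] = field(default_factory=list)
    clarifications: List[Clarification] = field(default_factory=list)
    approved: bool = False

    def open_questions(self) -> List[str]:
        """Questions the user has not answered yet."""
        return [c.question for c in self.clarifications if c.answer is None]

    def is_execution_ready(self) -> bool:
        """Ready only when approved, non-empty, and fully clarified."""
        return self.approved and bool(self.steps) and not self.open_questions()


# Hypothetical draft used only for this sketch.
draft = WorkflowDraft(
    goal="Filter Finviz to large-cap stocks and inspect the results.",
    steps=[Step("Open the screener", "The trace starts on this page", ["evt-1"])],
    clarifications=[Clarification("Which market cap filter should be applied?")],
)
```

In this sketch a draft with an unanswered question is never execution-ready, even if someone marks it approved, which mirrors the requirement that clarification is not skipped when the trace is ambiguous.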
The current recording UI should focus on trace review only.
That means:
- captured events,
- keyframes,
- history of saved recordings,
- raw event detail.
It should not present the current rule-generated workflow draft as if it were a reliable semantic interpretation.
Workflow generation should appear later as a dedicated Compiler Agent step.
Next product work should proceed in this order:
- Keep improving recording quality until trace and keyframes are trustworthy.
- Add a post-recording user intent note step.
- Implement a Compiler Agent that consumes trace plus intent note.
- Build a review loop where the Compiler Agent and the user refine the draft.
- Produce the final workflow artifact for later execution.
Recent progress on the compilation layer:
- post-recording intent note is now implemented: the user can add a short text note after stopping a recording, saved to recording session metadata via `POST /recordings/{id}/intent-note`.
- the Compiler Agent is implemented in `server/core/compiler_agent.py` using the openhands-sdk `Agent` + `Conversation` pattern with three tools:
  - `trace_viewer`: lets the agent navigate events incrementally (summary, paginated event list, single event detail, keyframe screenshots, normalized steps) instead of receiving the entire trace in one message.
  - `file` (FileEditorTool): lets the agent write the Routine file.
  - `submit_workflow`: validates the Routine file structure and ends the conversation.
- the compiler agent system prompt teaches OpenBrowser's tool vocabulary (highlight, click, keyboard_input, scroll, etc.) and the Browser Routine format, so the output is an executable Browser Routine.
- the Routine is pure text (no embedded images). Keyframes are only used by the compiler agent to understand the recorded trace.
- the compile endpoint is `POST /recordings/{id}/compile`. The previous iteration endpoint has been removed; clarification happens during compilation as part of the agent conversation loop.
- the frontend has a "Compile Routine" button and displays the resulting Routine markdown.
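The incremental navigation idea behind `trace_viewer` (a summary first, then pages of events, then single-event detail) can be sketched as plain functions. This is not the tool's real implementation or its openhands-sdk interface; the function names, the return shapes, and the `type` event field are assumptions for illustration.

```python
from collections import Counter
from typing import Any, Dict, List

Event = Dict[str, Any]


def summarize(events: List[Event]) -> Dict[str, Any]:
    """Cheap overview the agent can request before reading any event."""
    return {
        "total": len(events),
        "by_type": dict(Counter(e["type"] for e in events)),
    }


def list_events(events: List[Event], page: int = 1, page_size: int = 20) -> Dict[str, Any]:
    """Return one page of compact rows instead of the whole trace."""
    start = (page - 1) * page_size
    window = events[start:start + page_size]
    rows = [
        {"index": index, "type": event["type"]}
        for index, event in enumerate(window, start=start)
    ]
    return {
        "page": page,
        "total": len(events),
        "has_more": start + page_size < len(events),
        "events": rows,
    }


def event_detail(events: List[Event], index: int) -> Event:
    """Full payload for a single event, fetched only when needed."""
    return events[index]


# Hypothetical trace used only for this sketch.
trace = [{"type": "click"}] * 3 + [{"type": "input", "value": "ai rent"}] * 2
overview = summarize(trace)
second_page = list_events(trace, page=2, page_size=2)
last_page = list_events(trace, page=3, page_size=2)
```

The design choice this illustrates is that the agent pulls detail on demand, so a long trace does not have to fit into a single message.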
The next concrete work items are now:
- test the full end-to-end flow with a real recording,
- tune the compiler agent system prompt and tool descriptions based on real-world trace quality,
- integrate the approved Routine with the execution layer.
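For the end-to-end test, the two endpoints named earlier could be exercised with a small script along these lines. Only the paths `POST /recordings/{id}/intent-note` and `POST /recordings/{id}/compile` come from this document; the base URL, the `note` field in the JSON body, the empty compile body, and the absence of any route prefix are assumptions that should be checked against `server/api/routes/recordings.py`.

```python
import json
from urllib.request import Request

# Assumption: local development server address; adjust to the real deployment.
BASE_URL = "http://localhost:8000"


def intent_note_request(recording_id: str, note: str) -> Request:
    """Build the request that saves the user's intent note."""
    # Assumption: the endpoint accepts a JSON body with a "note" field.
    body = json.dumps({"note": note}).encode("utf-8")
    return Request(
        f"{BASE_URL}/recordings/{recording_id}/intent-note",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def compile_request(recording_id: str) -> Request:
    """Build the request that starts Routine compilation."""
    # Assumption: no request body is needed to trigger compilation.
    return Request(
        f"{BASE_URL}/recordings/{recording_id}/compile",
        data=b"",
        method="POST",
    )


# "rec-1" is a placeholder recording id used only for this sketch.
note_req = intent_note_request("rec-1", "Filter Finviz to large-cap stocks.")
compile_req = compile_request("rec-1")
```

The requests are only constructed here; passing each one to `urllib.request.urlopen` against a running server would send them, intent note first and compile second.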
OpenBrowser should separate facts from interpretation.
- Recording captures facts.
- The user provides intent.
- The Compiler Agent produces the executable Browser Routine.
- The Routine instructs OpenBrowser what to do step by step.
That is the intended foundation for workflow execution.