Recording To Workflow Compilation

Status

This document records the current design direction for turning a recorded user trace into an executable OpenBrowser workflow.

Current product status:

Recent progress on the recording layer:

  • pre-action keyframes now exist for pointerdown -> click and keydown Enter, so keyframes can capture the pre-navigation or pre-submit state instead of the post-action page.
  • noisy root-level clicks are filtered out.
  • text input is merged into a final meaningful input result instead of one event per character.
  • page_view is now restricted to the top-level frame only, avoiding iframe noise.
  • browser history navigation is now distinguished as tab_back and tab_forward, instead of collapsing everything into tab_navigated.
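
The text-input merging described above can be sketched as follows. This is only an illustration, not the recorder's actual code: the `InputEvent` type, its fields, and `merge_text_input` are hypothetical names, and the sketch assumes each input event carries the full field value after the keystroke.

```python
from dataclasses import dataclass


@dataclass
class InputEvent:
    target: str   # hypothetical element identifier, e.g. a CSS selector
    value: str    # assumed: full field value after this keystroke
    ts: float     # timestamp in seconds


def merge_text_input(events):
    """Collapse each run of consecutive input events on one target into its last event."""
    merged = []
    for ev in events:
        if merged and merged[-1].target == ev.target:
            merged[-1] = ev  # same field as the previous event: keep only the latest value
        else:
            merged.append(ev)
    return merged


events = [
    InputEvent("#q", "h", 0.0),
    InputEvent("#q", "ho", 0.1),
    InputEvent("#q", "hou", 0.2),
    InputEvent("#name", "a", 1.0),
]
merged = merge_text_input(events)
```

Here `merged` holds two events, one per field, each carrying the final value typed into it.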

The current system should be treated as:

recording trace capture and review

The missing gap is:

recording trace -> compiler-agent-authored workflow draft -> approved workflow -> execution

Core Decision

OpenBrowser should not rely on rule code to directly turn low-level events into a user-facing workflow draft.

Reason:

  • low-level events capture what physically happened,
  • but workflow compilation must express what the user meant.

Examples:

  • click is a physical browser event,
  • select market cap = largeover is the user intent,
  • change is a DOM event,
  • search for housing in Xixi Wetland is the workflow meaning.

That semantic jump is too brittle to encode as a fixed event-to-step mapping.

Therefore:

  • raw recording trace remains the source of truth,
  • lightweight normalization may still exist as an internal cleanup layer,
  • but the user-facing workflow draft should be generated by a Compiler Agent, not by deterministic trace rules.

Product Flow

The intended product flow now has four explicit stages.

Stage 1: Recording

The user records the browser workflow manually.

Requirements for this stage:

  • capture the key browser events,
  • capture meaningful keyframes,
  • keep the trace reviewable,
  • keep the trace faithful to what actually happened.

This stage has been the implementation focus so far and should be considered the main completed foundation.

Status update:

  • mostly complete as a product foundation,
  • still open for continued event-quality improvements where trace semantics are obviously under-specified or noisy.

Stage 2: User Intent Note

After stopping recording, the user adds a short note that explains the overall intent.

Examples:

  • "Search Zhihu for posts about AI rent discussions."
  • "Filter Finviz to large-cap stocks and inspect the results."
  • "Collect useful findings and prepare a summary post."

This note is required because the trace alone usually does not encode the business rule behind the actions.

This stage was originally the immediate next product task. It has since been implemented; see the recent progress on the compilation layer under Near-Term Implementation Plan.

Stage 3: Compiler Agent Draft

A dedicated Compiler Agent reads:

  • the raw recording trace,
  • the keyframes,
  • the user intent note,
  • and optionally an internal normalized trace.

It then produces a workflow draft.

This draft should describe:

  • the intended steps,
  • the reasoning behind the steps,
  • the unresolved ambiguities,
  • and the questions that must be answered before execution.

Important constraint:

The Compiler Agent is not executing the browser at this stage. It is only interpreting and compiling the demonstrated workflow.
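
As a rough sketch of the inputs listed above, the bundle handed to the Compiler Agent could look like this. All names here (`CompilerInput`, `build_compiler_input`, the field names) are hypothetical and not taken from the OpenBrowser codebase; the only rule the sketch enforces is the one stated in this document, that the intent note is required.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompilerInput:
    raw_trace: list                          # recorded events, the source of truth
    keyframes: list                          # references to captured screenshots
    intent_note: str                         # short user-written description of the goal
    normalized_trace: Optional[list] = None  # optional internal cleanup layer


def build_compiler_input(raw_trace, keyframes, intent_note, normalized_trace=None):
    """Assemble the compiler inputs, refusing to proceed without an intent note."""
    if not intent_note or not intent_note.strip():
        raise ValueError("intent note is required before compilation")
    return CompilerInput(raw_trace, keyframes, intent_note.strip(), normalized_trace)


bundle = build_compiler_input(
    raw_trace=[{"id": 1, "type": "click"}],
    keyframes=["keyframe-1.png"],
    intent_note="  Filter Finviz to large-cap stocks and inspect the results.  ",
)
```

Nothing in this bundle drives the browser, which matches the constraint that compilation is interpretation only.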

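This was the next major implementation milestone after the intent-note step. A first implementation now exists; see the recent progress on the compilation layer under Near-Term Implementation Plan.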

Stage 4: Draft Iteration And Finalization

The Compiler Agent and the user iterate on the workflow draft.

The user should be able to:

  • correct the draft,
  • answer clarification questions,
  • refine ambiguous rules,
  • and approve the final version.

The result of this stage is the final workflow that later execution will use.

What Recording Must Guarantee

The current recording system does not need to solve workflow semantics.

It only needs to be strong on the following:

  • accurate event capture,
  • accurate keyframe capture,
  • noise reduction where clearly justified,
  • stable review of the trace after recording.

That means the recording layer is responsible for facts, not interpretation.

Examples of facts:

  • which page was open,
  • which element was clicked,
  • what text was entered,
  • what keyframe was visible before or during the interaction,
  • what scroll or form change happened.

Interpretation belongs to the Compiler Agent, not the recorder.

Role Of Normalization

Normalization may still exist, but only as an internal helper.

Its purpose is to make the trace easier for the Compiler Agent to consume.

Examples:

  • remove obvious iframe or ad noise,
  • group closely related low-level events,
  • dedupe focus and click when they represent the same interaction,
  • collapse scroll bursts,
  • keep supporting events attached to a primary interaction.
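
Two of these cleanups, collapsing scroll bursts and deduping focus plus click, can be sketched as a single pass over the trace. This is a minimal illustration under assumptions: events are plain dicts with `id`, `type`, `ts`, and an optional `target`, and the `burst_gap` threshold is an invented parameter. Each normalized event keeps a `raw_ids` list so the raw trace stays the source of truth.

```python
def normalize(events, burst_gap=0.5):
    """Group closely related low-level events while keeping links to the raw ones."""
    out = []
    for ev in events:
        prev = out[-1] if out else None

        # Collapse a scroll burst: consecutive scrolls at most burst_gap seconds apart.
        if (prev and ev["type"] == "scroll" and prev["type"] == "scroll"
                and ev["ts"] - prev["ts"] <= burst_gap):
            out[-1] = {**ev, "raw_ids": prev["raw_ids"] + [ev["id"]]}
            continue

        # Dedupe focus followed by click on the same target: keep the click.
        if (prev and ev["type"] == "click" and prev["type"] == "focus"
                and prev.get("target") == ev.get("target")):
            out[-1] = {**ev, "raw_ids": prev["raw_ids"] + [ev["id"]]}
            continue

        out.append({**ev, "raw_ids": [ev["id"]]})
    return out


trace = [
    {"id": 1, "type": "focus", "target": "#q", "ts": 0.0},
    {"id": 2, "type": "click", "target": "#q", "ts": 0.05},
    {"id": 3, "type": "scroll", "ts": 1.0},
    {"id": 4, "type": "scroll", "ts": 1.2},
    {"id": 5, "type": "scroll", "ts": 1.4},
    {"id": 6, "type": "scroll", "ts": 5.0},
]
steps = normalize(trace)
```

The output is still a list of events, not workflow steps, so the sketch stays on the normalization side of the boundary.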

Normalization should not pretend to be the final workflow.

That is the key design boundary.

Compiler Agent Responsibilities

The Compiler Agent should:

  • inspect the recorded trace,
  • inspect keyframes,
  • read the user intent note,
  • infer the likely workflow,
  • decide where the trace is ambiguous,
  • ask clarification questions,
  • produce a reviewable draft,
  • update that draft after user feedback.

The Compiler Agent should not:

  • replay the trace mechanically,
  • assume every click is a workflow step,
  • assume every DOM event directly maps to intent,
  • skip clarification when the trace is ambiguous.

Draft Shape

The exact final schema is still open, but the draft should eventually include:

  • workflow goal,
  • ordered steps,
  • reasoning per step,
  • evidence references back to trace events,
  • clarification questions,
  • approved user answers,
  • final execution-ready version.

The important point is not the exact JSON format yet. The important point is that the draft is agent-authored and user-reviewable.
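
To make the list above concrete, one possible in-memory shape is sketched below. Since the schema is still open, every class and field name here is hypothetical; the sketch only shows how steps, evidence references, clarification questions, and approval could fit together.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DraftStep:
    description: str                  # what the step does, in user-facing terms
    reasoning: str                    # why the agent believes this step is intended
    evidence_event_ids: List[int] = field(default_factory=list)  # links back to trace events


@dataclass
class Clarification:
    question: str
    answer: Optional[str] = None      # filled in once the user responds


@dataclass
class WorkflowDraft:
    goal: str
    steps: List[DraftStep] = field(default_factory=list)
    clarifications: List[Clarification] = field(default_factory=list)
    approved: bool = False

    def open_questions(self):
        """Return the clarification questions the user has not answered yet."""
        return [c for c in self.clarifications if c.answer is None]

    def approve(self):
        """Mark the draft final, but only once every question has an answer."""
        if self.open_questions():
            raise ValueError("cannot approve a draft with unanswered questions")
        self.approved = True
```

Blocking approval while questions remain open is one way to reflect the requirement that ambiguities are resolved before execution.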

UI Direction

The current recording UI should focus on trace review only.

That means:

  • captured events,
  • keyframes,
  • history of saved recordings,
  • raw event detail.

It should not present the current rule-generated workflow draft as if it were a reliable semantic interpretation.

Workflow generation should appear later as a dedicated Compiler Agent step.

Near-Term Implementation Plan

Next product work should proceed in this order:

  1. Keep improving recording quality until trace and keyframes are trustworthy.
  2. Add a post-recording user intent note step.
  3. Implement a Compiler Agent that consumes trace plus intent note.
  4. Build a review loop where the Compiler Agent and the user refine the draft.
  5. Produce the final workflow artifact for later execution.

Recent progress on the compilation layer:

  • post-recording intent note is now implemented: the user can add a short text note after stopping a recording, saved to recording session metadata via POST /recordings/{id}/intent-note.
  • the Compiler Agent is implemented in server/core/compiler_agent.py using the openhands-sdk Agent + Conversation pattern with three tools:
    • trace_viewer — lets the agent navigate events incrementally (summary, paginated event list, single event detail, keyframe screenshots, normalized steps) instead of receiving the entire trace in one message.
    • file (FileEditorTool) — lets the agent write the Routine file.
    • submit_workflow — validates the Routine file structure and ends the conversation.
  • the compiler agent system prompt teaches OpenBrowser's tool vocabulary (highlight, click, keyboard_input, scroll, etc.) and the Browser Routine format, so the output is an executable Browser Routine.
  • the Routine is pure text (no embedded images). Keyframes are only used by the compiler agent to understand the recorded trace.
  • the compile endpoint is POST /recordings/{id}/compile. The previous iteration endpoint has been removed — clarification happens during compilation as part of the agent conversation loop.
  • the frontend has a "Compile Routine" button and displays the resulting Routine markdown.
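
The two endpoints named above could be called as in the following sketch. The endpoint paths come from this document; the base URL, the JSON field name `note`, and the empty compile body are assumptions, so check the server code for the real request shapes. The sketch only builds the requests and does not send them.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumption: address of a local OpenBrowser server


def intent_note_request(recording_id, note):
    """Build the POST that saves the user's intent note for a recording."""
    body = json.dumps({"note": note}).encode("utf-8")  # field name "note" is an assumption
    return Request(
        f"{BASE_URL}/recordings/{recording_id}/intent-note",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def compile_request(recording_id):
    """Build the POST that asks the Compiler Agent to compile a recording."""
    return Request(
        f"{BASE_URL}/recordings/{recording_id}/compile",
        data=b"",  # assumption: no request body is needed
        method="POST",
    )


note_req = intent_note_request("abc123", "Filter Finviz to large-cap stocks.")
compile_req = compile_request("abc123")
```

Passing either request to `urllib.request.urlopen` would send it to a running server.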

The next concrete work items are:

  1. test the full end-to-end flow with a real recording,
  2. tune the compiler agent system prompt and tool descriptions based on real-world trace quality,
  3. integrate the approved Routine with the execution layer.

Summary

OpenBrowser should separate facts from interpretation.

  • Recording captures facts.
  • The user provides intent.
  • The Compiler Agent produces the executable Browser Routine.
  • The Routine tells OpenBrowser what to do, step by step.

That is the intended foundation for workflow execution.