
Fixes kdd#128

Merged
hubert-marek merged 3 commits into main from fixes-kdd on Feb 1, 2026

Conversation

@hubert-marek (Owner) commented Feb 1, 2026

Summary by CodeRabbit

  • New Features

    • Added GraphQL fields for querying team members and labels with pagination and filtering support.
    • Introduced comprehensive benchmarking framework for agent testing with checkpoint management and result analysis.
  • Bug Fixes

    • Enhanced validation for assignee operations in issue creation and updates to prevent data consistency issues.


@coderabbitai bot (Contributor) commented Feb 1, 2026

Walkthrough

This PR adds a logging filter for noisy HTTP warnings, two new GraphQL field resolvers for team members and labels along with validation enhancements for issue mutations, a reformatted calendar API route-definitions table, a broader Git LFS path pattern that matches JSON files recursively, and a comprehensive benchmarking notebook for agent-diff experiments with a ReAct-based workflow and checkpoint management.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration Updates**<br>`.gitattributes` | Updated the path pattern from `*.json` to `**/*.json` to enable recursive matching for JSON files in the evaluation outputs directory. |
| **Logging Infrastructure**<br>`backend/src/platform/logging_config.py` | Added a new `ContentLengthErrorFilter` class to filter noisy h11 Content-Length log messages out of the `uvicorn.error` logger (see the sketch after this table). |
| **Calendar API**<br>`backend/src/services/calendar/api/batch.py` | Restructured the route definitions table from single-line tuples to a multi-line indented format for improved readability; no functional or behavioral changes. |
| **Linear API Resolvers**<br>`backend/src/services/linear/api/resolvers.py` | Added two new `Team` field resolvers (`resolve_team_members`, `resolve_team_labels`) with pagination and filtering; integrated pre-creation validation for assignee existence in issue mutations; includes duplicate resolver declarations within the same file. |
| **Experiment Benchmarking**<br>`experiments/kdd 2026/agent-diff bench.ipynb` | Added a comprehensive benchmarking framework with ReAct-style agent execution, API docs ingestion, system prompt generation, OpenRouter integration, test orchestration, checkpoint management, and result analysis utilities. |
| **Experiment Checkpoints**<br>`experiments/kdd 2026/evaluation_outputs/checkpoints/*`, `experiments/kdd 2026/evaluation_outputs/old/*` | Added Git LFS pointer files for benchmark checkpoint data. |
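
As a rough illustration of the filter described in the Logging Infrastructure row, a minimal sketch follows. The class name, the `uvicorn.error` target, and the `exc_info` unpacking come from this PR; the exact matching condition is an assumption for illustration.

```python
import logging

class ContentLengthErrorFilter(logging.Filter):
    """Suppress noisy h11 Content-Length errors on the uvicorn.error logger.

    The matching condition below is illustrative; the real filter in
    logging_config.py may inspect the record differently.
    """

    def filter(self, record: logging.LogRecord) -> bool:
        if record.exc_info:
            _exc_type, exc_value, _ = record.exc_info
            if exc_value is not None and "Content-Length" in str(exc_value):
                return False  # returning False drops the record
        return True

# Attach to the uvicorn error logger so the noise never reaches handlers.
logging.getLogger("uvicorn.error").addFilter(ContentLengthErrorFilter())
```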

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Benchmark Suite
    participant ReAct as ReAct Agent
    participant API as OpenRouter API
    participant Executor as Bash Executor
    participant CP as Checkpoint

    Client->>ReAct: Initialize with system prompt & task
    loop Max Iterations
        ReAct->>API: Send ReAct state to model
        API-->>ReAct: Receive model response
        ReAct->>ReAct: Parse Thought/Action/Observation
        alt Action is Tool Use
            ReAct->>Executor: Execute bash command
            Executor-->>ReAct: Return output
        else Action is Final
            ReAct-->>Client: Return result
        end
    end
    Client->>CP: Mark test completed
    CP->>CP: Persist checkpoint state
    Client->>Client: Aggregate results & analyze
```
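
The two checkpoint steps at the bottom of the diagram (mark completed, persist state) could be backed by something as small as the following sketch. The class and method names here are hypothetical, not taken from the notebook.

```python
import json
from pathlib import Path

class Checkpoint:
    """Hypothetical checkpoint store: remembers finished test IDs so an
    interrupted benchmark run can resume where it left off."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.completed: set[str] = set()
        if path.exists():
            self.completed = set(json.loads(path.read_text()))

    def mark_completed(self, test_id: str) -> None:
        self.completed.add(test_id)
        # Persist after every test so a crash loses at most one result.
        self.path.write_text(json.dumps(sorted(self.completed)))

    def is_completed(self, test_id: str) -> bool:
        return test_id in self.completed
```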

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #121: Modifies the same .gitattributes entry for experiments/kdd evaluation JSON files—PR #121 initially added the single-level pattern while this PR extends it to recursive matching.
  • PR #92: Alters backend/src/services/linear/api/resolvers.py with resolver implementation and issue creation/update flows, overlapping directly with this PR's resolver enhancements and validation additions.
  • PR #99: Expands Team-related GraphQL field resolvers in backend/src/services/linear/api/resolvers.py—this PR adds resolve_team_members and resolve_team_labels which are related to the broader team resolver enhancements in PR #99.

Suggested reviewers

  • artzhuravel


@claude bot commented Feb 1, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@backend/src/services/linear/api/resolvers.py`:
- Around line 1245-1278: The resolver accepts a filter param but never applies
it; update the labels query to apply the provided filter (similar to
resolve_team_members). Import or use the existing apply_issue_label_filter (or
implement it if missing) and call it with the current base_query and the
incoming filter param (e.g., base_query = apply_issue_label_filter(base_query,
filter)) before pagination/ordering, ensuring you reference IssueLabel,
base_query, and order_field when composing the filtered query.
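
The shape of the fix this prompt asks for might look like the sketch below. `apply_issue_label_filter`, `IssueLabel`, `base_query`, and `order_field` are named in the prompt itself, while the resolver signature, `session`, and `paginate` are placeholders for whatever the file actually uses.

```python
def resolve_team_labels(self, info, filter=None, first=50, after=None):
    # Placeholder query setup; the real resolver builds base_query itself.
    base_query = session.query(IssueLabel).filter(IssueLabel.team_id == self.id)
    if filter is not None:
        # The missing step: apply the incoming filter before ordering and
        # pagination, mirroring resolve_team_members.
        base_query = apply_issue_label_filter(base_query, filter)
    order_field = IssueLabel.created_at
    return paginate(base_query.order_by(order_field), first=first, after=after)
```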

In `@experiments/kdd 2026/agent-diff bench.ipynb`:
- Around line 718-719: Remove the unused local variable assignment to score:
instead of assigning score = client.evaluate_run(runId=run.runId), invoke
client.evaluate_run(runId=run.runId) for its side effects and drop the score
variable; keep the subsequent run_result =
client.get_results_for_run(runId=run.runId) unchanged so the evaluation call
still runs but no unused variable remains.
- Around line 48-53: The load_api_docs function currently calls
json.load(open(filepath)) which leaves the file handle open; change it to open
the file using a context manager and explicit encoding (e.g. use with
open(filepath, "r", encoding="utf-8") as f: return json.load(f)) so the file is
closed reliably; keep the existing filepath.exists() check and the fallback
print/return {} behavior in load_api_docs.
- Around line 359-435: The call_openrouter function can end up raising None when
max_retries <= 0 because the retry loop never runs; add an upfront guard in
call_openrouter to validate max_retries (e.g., if max_retries < 1: raise
ValueError("max_retries must be >= 1") or coerce to 1) before entering the for
loop, so last_error is never None and behavior is deterministic; update any
callers/tests if you choose to change the parameter semantics.
- Around line 675-744: The current await of asyncio.to_thread(run_react_agent,
...) is cancelled on timeout but the worker thread keeps running; change to
create a persistent task handle (e.g., worker_task =
asyncio.create_task(asyncio.to_thread(run_react_agent, ...))) instead of
directly awaiting asyncio.to_thread, await worker_task with asyncio.wait_for,
and on asyncio.TimeoutError set stop_event and then await worker_task again with
a short grace period (e.g., grace_seconds) to allow the thread to exit before
proceeding to cleanup; if that second wait_for also times out, proceed but
record that the worker did not stop. Also replace the silent except around
client.delete_env with logging of the exception (use an existing logger or
logging.exception) so deletion failures are visible. Ensure references:
run_react_agent, stop_event, worker_task (new), and client.delete_env are
updated accordingly.
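
For the `call_openrouter` guard described above, a minimal sketch under the prompt's assumptions follows; the retry loop and the `send_request` helper are reconstructed for illustration, not copied from the notebook.

```python
def call_openrouter(payload: dict, max_retries: int = 3) -> dict:
    if max_retries < 1:
        raise ValueError("max_retries must be >= 1")
    last_error: Exception | None = None
    for attempt in range(max_retries):
        try:
            return send_request(payload)  # hypothetical transport call
        except Exception as e:
            last_error = e
    # The guard above guarantees the loop body ran at least once,
    # so last_error can never be None here.
    raise last_error
```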
🧹 Nitpick comments (3)
backend/src/platform/logging_config.py (1)

21-24: Rename unused exc_type binding.

Line 22 binds exc_type but never uses it; rename to _exc_type to avoid unused-variable noise.

Suggested change
```diff
-            exc_type, exc_value, _ = record.exc_info
+            _exc_type, exc_value, _ = record.exc_info
```
backend/src/services/calendar/api/batch.py (1)

301-306: Mark unused param_names as intentionally unused.

Line 301 binds param_names but never reads it; rename to _param_names to make intent explicit.

Suggested change
```diff
-    for route_method, pattern, handler, param_names in ROUTE_PATTERNS:
+    for route_method, pattern, handler, _param_names in ROUTE_PATTERNS:
```
backend/src/services/linear/api/resolvers.py (1)

1280-1327: Consider extracting duplicated cursor pagination logic.

The pagination logic in this resolver (lines 1280-1327) is nearly identical to resolve_team_members (lines 1189-1236). The only differences are the model class and its attributes. This could be extracted into a generic helper function that accepts the model class and query as parameters.
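
One way the extraction could look, as a sketch: the model-specific column becomes a parameter so both resolvers share one code path. `decode_cursor`/`encode_cursor` are hypothetical helpers and the edge shape is assumed, since the resolver internals are not shown here.

```python
def paginate_by_cursor(query, order_field, first=50, after=None):
    """Generic cursor pagination shared by resolve_team_members and
    resolve_team_labels; callers pass their own query and ordering column."""
    if after is not None:
        query = query.filter(order_field > decode_cursor(after))  # hypothetical
    rows = query.order_by(order_field).limit(first + 1).all()
    has_next = len(rows) > first
    rows = rows[:first]
    edges = [
        {"node": row, "cursor": encode_cursor(getattr(row, order_field.key))}
        for row in rows
    ]
    return {"edges": edges, "pageInfo": {"hasNextPage": has_next}}
```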

Comment on lines +48 to +53:

```python
def load_api_docs(filepath: Path) -> dict:
    """Load API docs JSON, return empty dict if not found."""
    if filepath.exists():
        return json.load(open(filepath))
    print(f"Docs not found: {filepath}")
    return {}
```

⚠️ Potential issue | 🟡 Minor

Close the API docs file handle after reading.

Line 51 uses json.load(open(...)) without closing the file, which can leak descriptors during long runs. Use a context manager (and set encoding).

🔧 Suggested fix
```diff
-    if filepath.exists():
-        return json.load(open(filepath))
+    if filepath.exists():
+        with open(filepath, "r", encoding="utf-8") as f:
+            return json.load(f)
```

Comment on lines +675 to +744:

```python
        start = time.perf_counter()
        try:
            response = await asyncio.wait_for(
                asyncio.to_thread(
                    run_react_agent,
                    model_name=model_name,
                    task_prompt=prompt,
                    bash_executor=bash_executor,
                    system_prompt=system_prompt,
                    max_iterations=max_iterations,
                    trace_accumulator=trace_accumulator,
                    stop_event=stop_event,
                ),
                timeout=test_timeout_seconds
            )
        except asyncio.TimeoutError:
            timed_out = True
            # Signal thread to stop at next iteration
            stop_event.set()
            # Give thread a moment to finish current operation and exit
            await asyncio.sleep(2)
            # Use accumulated trace (partial) instead of losing it
            response = {
                "steps": trace_accumulator.get("steps", []),
                "final": trace_accumulator.get("final"),
                "iterations": len(trace_accumulator.get("steps", [])),
                "completed": False,
                "usage": trace_accumulator.get("usage", {}),
                "timeout_error": f"Test timed out after {test_timeout_seconds} seconds",
            }
        except Exception as e:
            response = {
                "steps": trace_accumulator.get("steps", []),
                "final": trace_accumulator.get("final"),
                "iterations": len(trace_accumulator.get("steps", [])),
                "completed": False,
                "usage": trace_accumulator.get("usage", {}),
                "error": str(e),
            }
        finally:
            execution_time = time.perf_counter() - start

        # Evaluation
        score = client.evaluate_run(runId=run.runId)
        run_result = client.get_results_for_run(runId=run.runId)

        result = {
            "prompt": prompt,
            "status": "timeout" if timed_out else run_result.status,
            "passed": False if timed_out else run_result.passed,
            "score": 0 if timed_out else run_result.score.get("percent", 0),
            "time": round(execution_time, 2),
            "failures": ["Test timed out"] if timed_out else run_result.failures,
            "runId": run.runId,
            "trace": response,
            "diff": getattr(run_result, "diff", None),
        }

        # Cleanup
        client.delete_env(envId=env.environmentId)
        return test_id, result

    except Exception as e:
        # Cleanup on error if environment was created
        if env:
            try:
                client.delete_env(envId=env.environmentId)
            except:
                pass
        return test_id, {"passed": False, "score": 0, "status": "error", "error": str(e)}
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

asyncio.wait_for cancel task behavior asyncio.to_thread timeout

💡 Result:

  • asyncio.wait_for(aw, timeout) works by cancelling the awaited Task/awaitable when the timeout expires, then raising TimeoutError (Python 3.11+) to the caller. It also waits for the cancellation to actually finish, so the total time spent can exceed timeout. If the caller is cancelled, wait_for also cancels aw. To prevent cancelling aw, wrap it in asyncio.shield(). [1]
  • asyncio.to_thread(func, ...) runs func in a separate OS thread and returns an awaitable for its result. The docs describe execution-in-a-thread and result propagation, but there is no general mechanism in Python to forcibly stop an arbitrary running thread once started. [2]
  • Therefore, await asyncio.wait_for(asyncio.to_thread(...), timeout=...) will typically:
    1. time out,
    2. cancel the await (the asyncio Task waiting for the thread result),
    3. but the underlying thread keeps running in the background until func returns (or never returns). This is the same behavior people observe with run_in_executor, which to_thread is built upon. [3][4]
  • Practical implication: if you need “real” cancellation on timeout for work done in to_thread, you must build it into func (e.g., periodically check a threading.Event / cancellation flag, or use a cancellable API in the blocking call). [4]

Sources:
[1] Python docs: asyncio.wait_for timeout/cancellation semantics https://docs.python.org/3.12/library/asyncio-task.html
[2] Python docs: asyncio.to_thread (“Running in Threads”) https://docs.python.org/3.12/library/asyncio-task.html
[3] CPython issue: cancelling run_in_executor doesn’t stop the thread python/cpython#107505
[4] Stack Overflow: cancelling a to_thread task won’t stop the thread https://stackoverflow.com/questions/71416383/python-asyncio-cancelling-a-to-thread-task-wont-stop-the-thread


🌐 Web query:

asyncio.to_thread documentation timeout handling thread cleanup

💡 Result:

asyncio.to_thread: timeouts, cancellation, and “thread cleanup”

What to_thread() does

asyncio.to_thread(func, *args, **kwargs) runs func in a separate thread (via asyncio’s thread-pool machinery) and returns an awaitable for its result; it also propagates contextvars into that thread. [1]

Timeout handling (recommended patterns)

Use an asyncio timeout around the await:

```python
import asyncio

async def main():
    try:
        return await asyncio.wait_for(asyncio.to_thread(blocking_fn), timeout=2.0)
    except TimeoutError:
        ...
```

Or (Python 3.11+) structured timeout:

```python
async with asyncio.timeout(2.0):
    await asyncio.to_thread(blocking_fn)
```

Important: timeout/cancellation does not stop the underlying thread

If the timeout triggers, asyncio cancels the awaiting task, but a running thread-pool job generally keeps running; it is not forcibly terminated. This is the same behavior people observe with run_in_executor() (which to_thread() is built on top of). [2]

“Thread cleanup”: what you can and can’t do

  • You typically cannot kill the worker thread safely from asyncio.
  • The thread-pool threads are reused, so there’s usually no per-call “thread cleanup” to perform—your concern is stopping/cleaning up the work function you ran.

Practical options:

  1. Cooperative cancellation: pass a threading.Event (or similar) that the worker checks periodically and exits cleanly.
  2. Design func with timeouts (e.g., socket timeouts, request timeouts) so it returns on its own.
  3. If you manage your own ThreadPoolExecutor, you can shut it down; shutdown(cancel_futures=True) only cancels tasks that haven’t started yet—running ones still run to completion. [3]
  4. If you rely on the loop’s default executor, note that asyncio uses a lazily-created ThreadPoolExecutor for run_in_executor(None, ...)-style calls. [4]

Sources

  1. Python docs: asyncio.to_thread() (asyncio tasks / “Running in Threads”). [1]
  2. CPython issue discussing that cancelling run_in_executor() doesn’t stop the running thread. [2]
  3. Python docs: ThreadPoolExecutor.shutdown(cancel_futures=...) semantics. [3]
  4. Python docs: default executor behavior in loop.run_in_executor() (ThreadPoolExecutor is lazy-initialized if needed). [4]
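
A small runnable demonstration of the pattern both query results converge on: the timeout cannot kill the thread, so the worker polls a `threading.Event` and the caller shields the task from cancellation, then waits for a cooperative exit. This is a generic sketch, not code from the notebook.

```python
import asyncio
import threading
import time

def worker(stop_event: threading.Event) -> str:
    # Blocking loop that periodically checks a cancellation flag.
    while not stop_event.is_set():
        time.sleep(0.1)  # stand-in for one unit of blocking work
    return "worker exited cooperatively"

async def main() -> None:
    stop_event = threading.Event()
    task = asyncio.create_task(asyncio.to_thread(worker, stop_event))
    try:
        # shield() keeps wait_for's timeout from cancelling the task itself.
        print(await asyncio.wait_for(asyncio.shield(task), timeout=0.5))
    except asyncio.TimeoutError:
        stop_event.set()   # the thread cannot be killed, only asked to stop
        print(await task)  # wait for it to observe the flag and return

asyncio.run(main())
```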

🏁 Script executed:

```shell
# First, find the notebook file
find . -name "*agent-diff*" -type f 2>/dev/null | head -20
```

Repository: hubert-marek/agent-diff

Length of output: 233


🏁 Script executed:

```shell
# Check if the file exists in the expected path
ls -la "experiments/kdd 2026/agent-diff bench.ipynb" 2>/dev/null || echo "File not found at expected location"
```

Repository: hubert-marek/agent-diff

Length of output: 160


🏁 Script executed:

```shell
# Extract the notebook content as JSON and examine the code cell around lines 675-744
python3 << 'EOF'
import json
import sys

try:
    with open('./experiments/kdd 2026/agent-diff bench.ipynb', 'r') as f:
        nb = json.load(f)

    # Find the cell containing the code around lines 675-744
    # Note: line numbers in the comment refer to line numbers within the cell
    line_count = 0
    for idx, cell in enumerate(nb.get('cells', [])):
        if cell['cell_type'] == 'code':
            source_text = ''.join(cell['source'])
            cell_lines = source_text.split('\n')
            cell_line_count = len(cell_lines)

            # Print info to help locate the right cell
            if 'asyncio.wait_for' in source_text or 'to_thread' in source_text:
                print(f"\n=== Cell {idx} (Type: {cell['cell_type']}) ===")
                print(f"Total lines in this cell: {cell_line_count}")
                print("=" * 60)
                # Print the entire cell
                for i, line in enumerate(cell_lines, 1):
                    print(f"{i:4d}: {line}")
                print("=" * 60)
except Exception as e:
    print(f"Error: {e}", file=sys.stderr)
    sys.exit(1)
EOF
```

Repository: hubert-marek/agent-diff

Length of output: 37275


🏁 Script executed:

```shell
# Also search for the run_react_agent function to understand if stop_event is actually used
python3 << 'EOF'
import json
import sys

try:
    with open('./experiments/kdd 2026/agent-diff bench.ipynb', 'r') as f:
        nb = json.load(f)

    # Find cells containing run_react_agent definition or calls
    for idx, cell in enumerate(nb.get('cells', [])):
        if cell['cell_type'] == 'code':
            source_text = ''.join(cell['source'])
            if 'def run_react_agent' in source_text:
                print(f"\n=== Found run_react_agent definition in Cell {idx} ===")
                print("=" * 60)
                cell_lines = source_text.split('\n')
                for i, line in enumerate(cell_lines, 1):
                    print(f"{i:4d}: {line}")
                print("=" * 60)
except Exception as e:
    print(f"Error: {e}", file=sys.stderr)
    sys.exit(1)
EOF
```

Repository: hubert-marek/agent-diff

Length of output: 37268


Ensure the background ReAct thread stops before cleanup.

When asyncio.wait_for times out, asyncio.to_thread does not terminate the underlying thread—it only cancels the awaiting task. The worker thread continues running, and with a fixed 2-second sleep, there is no guarantee it finishes before cleanup deletes the environment. This creates a race condition where in-flight API calls (lines 194–198) or bash executions (line 227) can race with environment deletion (line 439).

Keep a handle to the worker task, signal stop via stop_event, then await task completion with a grace period. Only after confirming completion (or timing out again) should cleanup proceed. Additionally, log deletion failures instead of silently swallowing them (lines 446–448).

🔧 Suggested fix
```diff
-        try:
-            response = await asyncio.wait_for(
-                asyncio.to_thread(
-                    run_react_agent,
-                    model_name=model_name,
-                    task_prompt=prompt,
-                    bash_executor=bash_executor,
-                    system_prompt=system_prompt,
-                    max_iterations=max_iterations,
-                    trace_accumulator=trace_accumulator,
-                    stop_event=stop_event,
-                ),
-                timeout=test_timeout_seconds
-            )
+        grace_period_seconds = 5
+        worker_task = asyncio.create_task(
+            asyncio.to_thread(
+                run_react_agent,
+                model_name=model_name,
+                task_prompt=prompt,
+                bash_executor=bash_executor,
+                system_prompt=system_prompt,
+                max_iterations=max_iterations,
+                trace_accumulator=trace_accumulator,
+                stop_event=stop_event,
+            )
+        )
+        try:
+            response = await asyncio.wait_for(worker_task, timeout=test_timeout_seconds)
         except asyncio.TimeoutError:
             timed_out = True
             # Signal thread to stop at next iteration
             stop_event.set()
-            # Give thread a moment to finish current operation and exit
-            await asyncio.sleep(2)
-            # Use accumulated trace (partial) instead of losing it
-            response = {
+            try:
+                response = await asyncio.wait_for(worker_task, timeout=grace_period_seconds)
+            except asyncio.TimeoutError:
+                # Use accumulated trace (partial) instead of losing it
+                response = {
                     "steps": trace_accumulator.get("steps", []),
                     "final": trace_accumulator.get("final"),
                     "iterations": len(trace_accumulator.get("steps", [])),
                     "completed": False,
                     "usage": trace_accumulator.get("usage", {}),
                     "timeout_error": f"Test timed out after {test_timeout_seconds} seconds",
-            }
+                }
         except Exception as e:
             response = {
                 "steps": trace_accumulator.get("steps", []),
                 "final": trace_accumulator.get("final"),
                 "iterations": len(trace_accumulator.get("steps", [])),
                 "completed": False,
                 "usage": trace_accumulator.get("usage", {}),
                 "error": str(e),
             }
         finally:
             execution_time = time.perf_counter() - start
 ...
-            except:
-                pass
+            except Exception as e:
+                print(f"[WARN] Failed to delete env {env.environmentId}: {e}")
```
🧰 Tools
🪛 Ruff (0.14.14)

[error] 680-680: Local variable score is assigned to but never used

Remove assignment to unused variable score

(F841)


[warning] 697-697: Consider moving this statement to an else block

(TRY300)


[warning] 699-699: Do not catch blind exception: Exception

(BLE001)


[error] 704-704: Do not use bare except

(E722)


[error] 704-705: try-except-pass detected, consider logging the exception

(S110)


[warning] 713-713: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


@hubert-marek merged commit 446cd92 into main on Feb 1, 2026
8 checks passed