diff --git a/docs/devnotes/posts/search-agent.md b/docs/devnotes/posts/search-agent.md new file mode 100644 index 000000000..404d77440 --- /dev/null +++ b/docs/devnotes/posts/search-agent.md @@ -0,0 +1,618 @@ +--- +date: 2026-02-18 +authors: + - dnathawani +--- + +# **Search Agent SFT Data: Teaching LLMs to Browse the Web** + +Training search agents requires trajectory data --- the full multi-turn interaction showing how a model searches, reads, reasons, and answers. We built a four-stage pipeline that generates synthetic search trajectories from Wikidata knowledge graph paths, converts them into BrowseComp-style riddles using NeMo Data Designer, generates multi-step search rollouts with live web search via Tavily, and post-processes the results into SFT-ready training data. Starting from 50,000 seeds, the pipeline yielded ~7,000 high-quality tool-use trajectories for training Nemotron models. + + + +--- + +## **Why This Matters: The Agentic Shift** + +The industry is moving from **models that answer questions** to **agents that take actions**. Real-world AI applications orchestrate multiple steps --- searching the web, querying databases, reading documents, calling APIs --- with the LLM as the reasoning engine deciding *what to do next*. + +Training these agents requires a fundamentally different kind of data. A standard QA pair ("Q: Who is the CEO of NVIDIA? A: Jensen Huang") teaches the model to recall facts. But a search agent needs to learn a *process*: formulate a search query, evaluate the results, decide whether to refine the query or search again, and eventually synthesize an answer grounded in what it found. This is the **thought-action-observation loop**, and the training data must capture the complete loop --- every decision point, every tool call, every result. + +**Benchmark: [BrowseComp](https://openai.com/index/browsecomp/)** measures exactly this capability. It contains 1,266 challenging problems that require an AI agent to locate hard-to-find, entangled information by browsing the web. These aren't simple factoid questions --- they're multi-hop riddles: + +> *Between 1990 and 1994 inclusive, what teams played in a soccer match with a Brazilian referee that had four yellow cards, two for each team where three of the total four were not issued during the first half, and four substitutions, one of which was for an injury in the first 25 minutes of the match.* +> +> **Answer:** Ireland v Romania + +To train Nemotron Super for this kind of task, we needed thousands of tool-use trajectories. Human annotation is prohibitively expensive --- each trajectory involves 5-15 search queries with careful evaluation at each step, taking a skilled annotator 15-30 minutes per example. At the volume we needed, that would take months. + +--- + +## **End-to-End Pipeline Architecture** + +``` + SEARCH AGENT SFT PIPELINE + ========================= + + ┌─────────────────────────────────────────────────────────────────────────────────────┐ + │ STAGE 1: SEED DATA (Wikidata KG Walks) │ + │ │ + │ Random walks on the Wikidata knowledge graph │ + │ ├─ Anti-meta filters (no category/template/list-y nodes) │ + │ ├─ Hop range: 4 minimum, 8 maximum │ + │ └─ SPARQL queries to fetch neighbors │ + │ │ + │ Output: seed JSONL with hops[], seed_entity, final_answer_entity, path_length │ + │ 50,000 seeds generated │ + └─────────────────────────────────────────┬───────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────────────────────────────┐ + │ STAGE 2: SEARCH RIDDLE GENERATION (Data Designer) │ + │ │ + │ user_query_draft ──────────► user_query_obfuscated │ + │ (chain clues from path, (BrowseComp-style rewrite: │ + │ hide intermediate nodes, concise, natural, no breadcrumbs, │ + │ don't name the answer) 1-2 sentences max) │ + │ │ + │ + Heuristic filters: answer leakage, intermediate node leakage, INVALID_PATH │ + │ 50,000 → 37,000 valid seeds → 24,000 valid questions │ + └─────────────────────────────────────────┬───────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────────────────────────────┐ + │ STAGE 3: SEARCH TRAJECTORY ROLLOUTS │ + │ │ + │ Thought-Action-Observation loop with live web search (Tavily API) │ + │ ├─ Rollout model: MiniMax-M2 (strong BrowseComp + tool-calling scores) │ + │ ├─ Average ~12 tool calls per sample │ + │ ├─ Multiple rollouts per question for rejection sampling │ + │ └─ 6,974 completed (stop) / 177 truncated (length) │ + │ │ + │ 24,000 questions → 7,000 valid trajectories │ + └─────────────────────────────────────────┬───────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────────────────────────────┐ + │ STAGE 4: POST-PROCESSING → SFT DATASET │ + │ │ + │ ├─ Normalize tool outputs to consistent JSON "tool response" shape │ + │ ├─ Drop broken/truncated interactions │ + │ ├─ Select best rollout per question (min tool calls among correct) │ + │ ├─ Write OpenAI-messages style: messages[], tools[], metadata{} │ + │ └─ Manual review + LLM spot-checking (Gemini) │ + │ │ + │ ~7,000 SFT records for Nemotron Super │ + └─────────────────────────────────────────────────────────────────────────────────────┘ +``` + +--- + +## **Step 1: Seed Data from Wikidata Knowledge Graph Walks** + +The core idea: start at a random entity in the [Wikidata](https://www.wikidata.org/) knowledge graph and perform a random walk through its relations, producing a chain of hops that becomes a multi-hop search problem. Each chain provides a `seed_entity` (start), a `final_answer_entity` (destination), and a `readable_path` describing the edges traversed. + +We used **Wikidata SPARQL queries** to fetch neighbors at each hop. The number of hops is directly proportional to the number of tool calls the model would need to solve questions derived from that path --- more hops means harder riddles. + +**Example paths:** + +``` +START ENTITY: NVIDIA (Q182477) + ⬇ [chief executive officer (P169)] + NODE: Jensen Huang (Q332838) + ⬇ [educated at (P69)] + NODE: Oregon State University (Q861888) + ⬇ [located in the administrative territorial entity (P131)] + NODE: Benton County (Q115372) + ⬇ [named after (P138)] + NODE: Thomas Hart Benton (Q178712) +``` + +``` +START ENTITY: toothache (Q143925) + ⬇ [risk factor (P564)] + NODE: smoking (Q662860) + ⬇ [has effect (P1542)] + NODE: Crohn's disease (Q1472) + ⬇ [drug or therapy used for treatment (P2176)] + NODE: TNF inhibitor (Q1536078) + ⬇ [(is possible treatment of) (P2175)] reverse relation + NODE: Behçet's disease (Q911427) + ⬇ [symptoms and signs (P780)] + NODE: inflammation (Q101991) + ⬇ [drug or therapy used for treatment (P2176)] + NODE: (±)-flurbiprofen (Q419890) + ⬇ [significant drug interaction (P769)] + NODE: parecoxib (Q347941) + ⬇ [significant drug interaction (P769)] + NODE: ibuprofen (Q186969) +``` + +### Heuristics to Keep Walks Coherent + +Unrestricted random walks go off the rails quickly --- you'd get paths like `CEO → Human Being → Civilization → Indus Valley`. We applied several filters: + +- **Anti-meta filters:** Avoid category nodes, template pages, list-y entities, and other degenerate hops that exist for Wikidata bookkeeping rather than representing real-world relationships. +- **Hop range: 4 minimum, 8 maximum.** Below 4 hops, the questions aren't difficult enough to require multi-step search. Above 8, the path wanders off-topic and produces unsolvable riddles. +- **Generic entity filtering:** Remove seeds where the `final_answer_entity` is too abstract ("technology", "people", "field", "concept"). These produce questions where any answer could be correct. + +The resulting seed dataset: **50,000 JSONL records**, each containing `hops[]`, `seed_entity`, `final_answer_entity`, `readable_path`, and `path_length`. + +--- + +## **Step 2: Creating Search Riddles with Data Designer** + +Each seed path needs to be converted into two things: a search question (obfuscated so it doesn't leak the answer) and a ground truth target entity (the final node in the path). We use two chained LLM columns in Data Designer for this. + +**Stage 1 --- Draft question:** Chain clues from the knowledge path into a multi-hop riddle. Critical rules: don't name intermediate nodes, don't name the final answer, skip weak or illogical hops, and output `INVALID_PATH` if the path is unsalvageable. + +**Stage 2 --- Obfuscated question:** Rewrite the draft in BrowseComp style --- concise, natural, 1-2 sentences max. The solver must figure out *what* to search rather than following explicit breadcrumbs. No relational breadcrumbing like "X is member of Y; Y is based in Z...". + +```python +import data_designer.config as dd +from data_designer.interface import DataDesigner + +config = dd.DataDesignerConfigBuilder(model_configs=[ + dd.ModelConfig( + alias="riddle-gen", + model="qwen/qwen3-235b-a22b", + provider="nvidia", + ), +]) + +config.with_seed_dataset( + dd.LocalFileSeedSource(path="search_agent_seeds.parquet"), + sampling_strategy=dd.SamplingStrategy.SHUFFLE, +) + +# Stage 1: Draft question from knowledge path +config.add_column(dd.LLMTextColumnConfig( + name="user_query_draft", + model_alias="riddle-gen", + prompt=( + "You are an expert Search Evaluator designing Grandmaster-Level search tests.\n" + "Create a complex, multi-step search riddle based on this knowledge path:\n\n" + "{{ readable_path }}\n\n" + "Start Entity: {{ seed_entity }}\n" + "Final Answer Entity: {{ final_answer_entity }}\n\n" + "RULES:\n" + "1. DO NOT name the intermediate nodes. Hide them behind descriptions.\n" + "2. DO NOT name the Final Answer.\n" + "3. Chain the clues logically.\n" + "4. If a step is weak or nonsensical, IGNORE IT.\n" + "5. Output INVALID_PATH if the path is unsalvageable.\n\n" + "Return ONLY the question string (or INVALID_PATH)." + ), +)) + +# Stage 2: BrowseComp-style obfuscation +config.add_column(dd.LLMTextColumnConfig( + name="user_query_obfuscated", + model_alias="riddle-gen", + prompt=( + "Rewrite this search riddle to be MORE obfuscated and natural.\n\n" + "Original: {{ user_query_draft }}\n" + "Secret path: {{ readable_path }}\n\n" + "REQUIREMENTS:\n" + "1. DO NOT reveal the step-by-step plan. No breadcrumb chains.\n" + "2. DO NOT name intermediate or final entities.\n" + "3. 1-2 sentences max. Sound like a real user question.\n" + "4. If original == INVALID_PATH, output INVALID_PATH.\n\n" + "Return ONLY the rewritten question." + ), +)) +``` + +**Example transformation (NVIDIA path):** + +``` +Draft: "Starting from NVIDIA, identify the current CEO, then find + where they received their bachelor's degree, determine which + county houses that university's main campus, and finally + identify the nickname of the 19th-century U.S. Senator + the county is named after." + +Obfuscated: "Identify the nickname ('Old ____') of the 19th-century U.S. + Senator who is the namesake of the specific county that houses + the main campus of the university where the current CEO of + NVIDIA received his bachelor's degree." + +Answer: "Old Bullion" +``` + +The obfuscated version requires the solver to: + +1. Identify Jensen Huang as NVIDIA's CEO +2. Find where he got his bachelor's degree (Oregon State, not Stanford) +3. Identify the county (Benton County, OR) +4. Find who the county is named after (Thomas Hart Benton) +5. Find his nickname --- forcing one final hop to verify it's the Senator, not the painter + +--- + +## **Step 3: Search Trajectory Rollouts** + +For each riddle, we generate multi-step tool-using conversation rollouts where the model makes several web-search tool calls, reads tool outputs, reasons about results, and ends with a final answer. The agent operates through a cyclical **thought-action-observation loop**: + +1. **Thought:** Analyze the current state, identify the next missing link, formulate a search query +2. **Action (Tool Call):** Execute `web_search(query="...")`, retrieving real data from the web +3. **Observation & Synthesis:** Read the results, extract the relevant fact, loop back to step 1 + +We used [NeMo Skills](https://github.com/NVIDIA/NeMo-Skills) for trajectory generation, which has the Tavily API built in. The rollout model was [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2), chosen for its strong performance on BrowseComp and tool-calling benchmarks. + +### Rollout Statistics + +```bash +$ jq -r '.finish_reason[-1] // "MISSING"' shard_*/output-rs0.jsonl-async \ + | sort | uniq -c | sort -nr + 6974 stop + 177 length +``` + +- **6,974 completed** (`stop`) --- the model reached a final answer +- **177 truncated** (`length`) --- hit the context limit before answering +- **Average ~12 tool calls** per sample + +### What a Trajectory Looks Like + +Each trajectory captures the full search process. Here's a condensed example: + +```json +{ + "messages": [ + { + "role": "system", + "content": "You are an expert search agent. Your goal is to answer the user's question efficiently using verifiable online sources.\n\nExecution Rules:\n1. Search Results...\n2. Planning...\n3. Tool Use...\n4. MAXIMUM 25 Calls...\n5. Verification...\n6. Output Format - The last line must be EXACTLY: \"Final Answer: \"...\n" + }, + { + "role": "user", + "content": "Problem: Which equatorial coordinate system, using an angular measurement instead of right ascension, can be reached by starting with lawrencium, then the physicist it honors, the university they attended...?" + }, + { + "role": "assistant", + "content": "\nWe need to parse the clues step by step:\n1. Start with lawrencium... named after Ernest Lawrence...\n2. The university they attended... need to verify...\n", + "tool_calls": [{"function": {"name": "web-search", "arguments": "{\"query\": \"Ernest Lawrence attended university\"}"}}] + }, + { + "role": "tool", + "name": "web-search", + "content": "{\"results\": [{\"title\": \"Physics - Ernest Lawrence\", \"content\": \"He attended Manchester High School prior to entering Manchester University...\"}, {\"title\": \"2025 Hall of Fame...\", \"content\": \"...Ernest Lawrence graduated from USD in 1922...\"}]}" + }, + { + "role": "assistant", + "content": "\nErnest Lawrence attended University of South Dakota (BS) and Yale (PhD)... The clue mentions 'six-month academic schedule' — likely 'semester'...\n", + "tool_calls": [{"function": {"name": "web-search", "arguments": "{\"query\": \"semester six month academic schedule\"}"}}] + }, + { + "role": "tool", + "name": "web-search", + "content": "{\"results\": [{\"title\": \"Academic Calendar\", \"content\": \"Many colleges follow a semester plan...\"}]}" + }, + { + "role": "assistant", + "content": "Final Answer: hour-angle coordinate system" + } + ], + "metadata": { + "ground_truth": "first equatorial coordinate system", + "num_hops_in_graph": 8, + "seed_entity": "lawrencium", + "final_answer_entity": "first equatorial coordinate system", + "num_tool_calls": 6, + "finish_reason": ["tool_calls", "tool_calls", "stop"] + }, + "tools": [ + { + "type": "function", + "function": { + "name": "web-search", + "description": "Search the web for a query.", + "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]} + } + } + ] +} +``` + +--- + +## **Step 4: Post-Processing to SFT Dataset** + +Raw rollouts need cleanup before they become trainable SFT records: + +1. **Normalize tool outputs** into a consistent JSON "tool response" shape +2. **Drop broken/truncated interactions** (the 177 `length` records) +3. **Select the best rollout per question** (minimum tool calls among correct ones) +4. **Write OpenAI-messages style** JSONL with `messages[]`, `tools[]`, and `metadata{}` +5. **Manual review + LLM spot-checking** --- we reviewed as much SFT data as we could manually and used Gemini to spot-check chunks + +--- + +## **Production Yield Analysis** + +``` + PIPELINE YIELD + ============== + + ┌─────────────────────────────────────────────────────────────────────────────────────┐ + │ 50,000 ───74%──► 37,000 ───65%──► 24,000 ───29%──► 7,000 │ + │ Seeds Valid Seeds Valid Questions Valid Trajectories │ + │ │ + │ Total Yield: 14% │ + └─────────────────────────────────────────────────────────────────────────────────────┘ +``` + +| Stage | Input | Output | Yield | +|-------|-------|--------|-------| +| Seed Creation (Wikidata walks) | 50,000 | 37,000 | 74% | +| Riddle / Question Generation (DD) | 37,000 | 24,000 | 65% | +| Search Trajectory Rollouts (NeMo Skills + Tavily) | 24,000 | 7,000 | 29% | +| **End-to-End** | **50,000** | **~7,000** | **~14%** | + +The 14% yield might seem low, but each surviving record is a *verified, multi-turn search trajectory* showing a model successfully navigating web search. The alternative --- human annotation at 15-30 minutes per trajectory --- would take months for the same volume. + +--- + +## **The Correctness Challenge** + +Measuring correctness in post-processing was one of the hardest parts of this project, for reasons that go beyond typical evaluation: + +**1. Questions can have multiple valid answers.** A question about "which country contains X" might have a valid answer at multiple levels of granularity, or the entity might have multiple correct associations. + +**2. Wikidata has stale ground truth.** The knowledge graph reflects a snapshot in time. The model's parametric knowledge or live search results may be more current. For example: + +> **Question:** "...city that contains the headquarters of the owner of U.S. Steel?" +> +> **Ground truth (from Wikidata):** United States +> +> **Model answer:** Japan +> +> **Reality:** U.S. Steel was acquired by Nippon Steel (Japan) in Dec 2023. The model's answer is *factually correct today* but wrong according to the outdated KG path. + +### Accuracy Results + +We evaluated 657 trajectories against ground truth using fuzzy matching: + +``` +------------------------------------------------------------------------------------------------------------ +# | GROUND TRUTH | #TC | MODEL ANSWER | STATUS +------------------------------------------------------------------------------------------------------------ +1 | Ramsar Convention | 3 | Ramsar Convention (the Convention. | ✅ MATCH +2 | United States | 4 | United States | ✅ MATCH +3 | South Korea | 3 | Uzbekistan | ❌ MISS +4 | France | 3 | Germany | ❌ MISS +5 | Joseph Poelaert | 4 | Joseph Poelaert | ✅ MATCH +... +653 | Bangladesh | 11 | Bangladesh | ✅ MATCH +654 | Portal:Arithmetic | 10 | Portal:Arithmetic | ✅ MATCH +655 | Monumento 6 Gran Vía.. | 11 | Monumento V (the Monumento a los.. | ❌ MISS +656 | Tehran | 11 | Constantinople | ❌ MISS +657 | United Kingdom | 11 | Germany | ❌ MISS +------------------------------------------------------------------------------------------------------------ + +📊 RESULTS: 181/657 (27.5%) Correct +``` + +The 27.5% accuracy on this sample is for *raw, unfiltered* trajectories. After the full pipeline (rejection sampling, best-rollout selection, manual review), the final SFT dataset has much higher quality. The low raw accuracy underscores why multi-stage filtering is essential. + +--- + +## **Implementing with Data Designer's MCP Integration** + +While production runs used NeMo Skills for scale, the same pipeline is reproducible with Data Designer's MCP integration. Three components make this work: + +**`LocalStdioMCPProvider`** launches a Tavily MCP server as a subprocess: + +```python +from data_designer.config.mcp import LocalStdioMCPProvider, ToolConfig + +tavily_provider = LocalStdioMCPProvider( + name="tavily", + command=sys.executable, + args=[str(tavily_server_path)], + env={"TAVILY_API_KEY": os.environ["TAVILY_API_KEY"]}, +) +``` + +**`ToolConfig`** controls safety and limits: + +```python +tool_config = ToolConfig( + tool_alias="tavily", + providers=["tavily"], + allow_tools=["tavily_search"], + max_tool_call_turns=15, + timeout_sec=300.0, +) +``` + +**`tool_alias` + `with_trace`** on the LLM column enables tool calling and captures the full conversation: + +```python +config.add_column(dd.LLMTextColumnConfig( + name="agent_solution_raw", + prompt=AGENT_SYSTEM_PROMPT + "\n\n" + AGENT_USER_PROMPT, + model_alias="search-agent", + tool_alias="tavily", + with_trace=dd.TraceType.ALL_MESSAGES, +)) +``` + +The resulting `agent_solution_raw__trace` column contains the complete ChatML conversation --- every user message, every tool call with arguments, every tool response with search results, and the final assistant response. This trace *is* the SFT training data. + +--- + +## **Key Takeaways** + +1. **Wikidata provides infinite seed diversity.** Random walks on a knowledge graph with 100M+ entities produce an inexhaustible supply of multi-hop problems. The hop count directly controls difficulty --- 4 hops for warm-up, 8 for expert-level riddles. + +2. **Two-stage obfuscation prevents leakage.** Draft questions tend to follow the path structure too closely (breadcrumbing). The obfuscation rewrite produces concise, natural questions that force the solver to figure out *what* to search. + +3. **Low yield is expected and acceptable.** 14% end-to-end yield from 50k seeds still produces ~7,000 high-quality trajectories --- enough for meaningful SFT impact. Multi-hop search is genuinely hard, and most generated paths or questions are legitimately unsolvable. + +4. **Stale knowledge graphs are a real problem.** Wikidata doesn't update in real-time. Models with current parametric knowledge or live search results will disagree with ground truth on entities that have changed (mergers, leadership changes, geopolitical shifts). Correctness evaluation needs to account for this. + +5. **Iterate on seeds, not just prompts.** Seed filtering (removing generic answers, constraining hop counts, anti-meta filters) has as much impact on quality as prompt engineering. Filter early, save compute. + +6. **Traces are the training data.** The full thought-action-observation loop --- every search query formulation, every result evaluation, every reasoning step --- is what teaches tool-use capability. Final answers alone are worthless without the process. + +7. **Safety controls are essential.** `allow_tools` prevents the model from calling unexpected tools. `max_tool_call_turns=15` prevents infinite search loops. `timeout_sec=300` prevents hung connections. Without these, a fraction of records would consume unbounded resources. + +--- + +## **Next Steps** + +- **Scale question generation.** Generate closer to ~25,000 filtered questions using Data Designer, up from the current 7k trajectories. +- **Push difficulty higher.** Target questions where `num_tool_calls` consistently exceeds 15+, requiring deeper reasoning chains. +- **Explore fresher knowledge bases.** Wikidata staleness is a real limitation. Investigate more recently updated, freely available knowledge bases for seed generation. +- **Search RL environment.** Use the filtered questions as an RL environment where the model gets reward for correct final answers, complementing the SFT data. + +--- + +## **Try For Yourself** + +
+Full source: search agent trajectory pipeline + +```python +import os +import sys +from pathlib import Path + +import data_designer.config as dd +from data_designer.config.seed import SamplingStrategy +from data_designer.config import LocalFileSeedSource +from data_designer.config.mcp import LocalStdioMCPProvider, ToolConfig +from data_designer.interface import DataDesigner + +# --- Tavily MCP server (written to disk, launched as subprocess) --- +TAVILY_SERVER_CODE = '''\ +import os, time, httpx +from mcp.server.fastmcp import FastMCP + +mcp = FastMCP("tavily") + +@mcp.tool() +async def tavily_search(query: str, max_results: int = 10, search_depth: str = "basic"): + """Search the web via Tavily.""" + payload = { + "api_key": os.environ["TAVILY_API_KEY"], + "query": query, "max_results": max_results, "search_depth": search_depth, + } + async with httpx.AsyncClient(timeout=60.0) as client: + r = await client.post("https://api.tavily.com/search", json=payload) + r.raise_for_status() + data = r.json() + data["response_time"] = round(time.monotonic(), 3) + return data + +if __name__ == "__main__": + mcp.run() +''' + +tavily_server_path = Path("tavily_stdio_mcp_server.py").resolve() +tavily_server_path.write_text(TAVILY_SERVER_CODE) + +# --- MCP provider + tool config --- +tavily_provider = LocalStdioMCPProvider( + name="tavily", + command=sys.executable, + args=[str(tavily_server_path)], + env={"TAVILY_API_KEY": os.environ["TAVILY_API_KEY"]}, +) + +tool_config = ToolConfig( + tool_alias="tavily", + providers=["tavily"], + allow_tools=["tavily_search"], + max_tool_call_turns=15, + timeout_sec=300.0, +) + +# --- Model --- +config = dd.DataDesignerConfigBuilder(model_configs=[ + dd.ModelConfig(alias="gen", model="qwen/qwen3-235b-a22b", provider="nvidia"), +]) +config.add_tool_config(tool_config) + +# --- Seed data (Wikidata KG walks) --- +config.with_seed_dataset( + LocalFileSeedSource(path="search_agent_seeds.parquet"), + sampling_strategy=SamplingStrategy.SHUFFLE, +) + +# --- Column 1: Draft question from knowledge path --- +config.add_column(dd.LLMTextColumnConfig( + name="user_query_draft", + model_alias="gen", + prompt=( + "You are an expert Search Evaluator designing Grandmaster-Level search tests.\n" + "Create a complex, multi-step search riddle based on this knowledge path:\n\n" + "{{ readable_path }}\n\n" + "Start Entity: {{ seed_entity }}\n" + "Final Answer Entity: {{ final_answer_entity }}\n\n" + "RULES:\n" + "1. DO NOT name the intermediate nodes. Hide them behind descriptions.\n" + "2. DO NOT name the Final Answer.\n" + "3. Chain the clues logically.\n" + "4. If a step is weak or nonsensical, IGNORE IT.\n" + "5. Output INVALID_PATH if the path is unsalvageable.\n\n" + "Return ONLY the question string (or INVALID_PATH)." + ), +)) + +# --- Column 2: BrowseComp-style obfuscation --- +config.add_column(dd.LLMTextColumnConfig( + name="user_query_obfuscated", + model_alias="gen", + prompt=( + "Rewrite this search riddle to be MORE obfuscated and natural.\n\n" + "Original: {{ user_query_draft }}\n" + "Secret path: {{ readable_path }}\n\n" + "REQUIREMENTS:\n" + "1. DO NOT reveal the step-by-step plan. No breadcrumb chains.\n" + "2. DO NOT name intermediate or final entities.\n" + "3. 1-2 sentences max. Sound like a real user question.\n" + "4. If original == INVALID_PATH, output INVALID_PATH.\n\n" + "Return ONLY the rewritten question." + ), +)) + +# --- Column 3: Agent trajectory with MCP tool calling --- +AGENT_SYSTEM_PROMPT = """You are an expert search agent that uses web search to answer questions. +Output ONLY valid JSON: {"final_answer": "...", "supporting_urls": [...], "short_justification": "..."} +Use "tavily_search" with {"query": "..."} for web searches. Maximum 15 tool calls. +Start broad, refine to specific entities, cross-verify across sources.""" + +config.add_column(dd.LLMTextColumnConfig( + name="agent_solution_raw", + model_alias="gen", + prompt=AGENT_SYSTEM_PROMPT + "\n\nProblem: {{ user_query_obfuscated }}", + tool_alias="tavily", + with_trace=dd.TraceType.ALL_MESSAGES, +)) + +# --- Run --- +data_designer = DataDesigner(mcp_providers=[tavily_provider]) +results = data_designer.create( + config_builder=config, + num_records=1000, + dataset_name="search-agent-trajectories", +) +``` + +
+ +--- + +Key Resources: + +1. [NeMo Data Designer on GitHub](https://github.com/NVIDIA-NeMo/DataDesigner) +2. [BrowseComp Benchmark (OpenAI)](https://openai.com/index/browsecomp/) +3. [Wikidata Knowledge Graph](https://www.wikidata.org/) +4. [Tavily Search API](https://tavily.com/) +5. [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) + +--- + +*Want to learn more about NeMo Data Designer? Check out our [documentation](https://github.com/NVIDIA-NeMo/DataDesigner) and start building your own synthetic data pipelines today.* diff --git a/mkdocs.yml b/mkdocs.yml index 614fe1eef..4b5f9966e 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -69,6 +69,7 @@ nav: - Design Principles: devnotes/posts/design-principles.md - RQA Dataset: devnotes/posts/rqa.md - Deep Research Trajectories: devnotes/posts/deep-research-trajectories.md + - Search Agent: devnotes/posts/search-agent.md theme: name: material