caal-ministral: fine-tuning web_search tool calls #77
Hey @cmac86 In your latest video you mentioned that you included hundreds (or thousands) of examples of web search, hass, n8n tool calls, and general conversation. I wanted to ask how you're handling the web search examples in your dataset, or if you'd be willing to share one example. I assume it's something like:

```json
{
  "messages": [
    {"role": "system", "content": "You are CAAL..."},
    {"role": "user", "content": "Have the NVIDIA 60-series GPUs been announced yet?"},
    {"role": "assistant", "content": null, "tool_calls": [
      {"id": "call_1", "function": {"name": "web_search", "arguments": "{\"query\": \"NVIDIA 60-series GPU announcement\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": [IDK WHAT GOES HERE]},
    {"role": "assistant", "content": "No, NVIDIA has not announced the RTX 60-series cards yet."}
  ]
}
```

But I have no idea what the tool content is supposed to be, much less how to get hundreds of examples for it XD

Cheers,
Abdul
Replies: 2 comments
Hey Abdul,

Good question, and after looking at my training data, I actually had the wrong format for web_search. The web_search tool just returns a plain string to the LLM. The search tool searches with DuckDuckGo, sends the response to a separate LLM call to summarize, then returns the summary as a string to the main agent. The example should look like this:

I ran another training run yesterday with a bit of a different approach. I used only real tools from the registry and made sure that the tool responses in the examples follow the real schema of the tools. I also mixed in about 15% purely conversational examples, because I found the model was biasing too hard towards using a tool for every call. The last training run had 1,434 examples, down from ~4,000. I'm training and testing in these categories:
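Since the actual example didn't survive in this thread, here's a sketch of what a web_search example with a plain-string tool response could look like. The field names follow the OpenAI-style chat format from the question above, and the summary text is invented for illustration; this is not taken from Corey's real training data:

```json
{
  "messages": [
    {"role": "system", "content": "You are CAAL..."},
    {"role": "user", "content": "Have the NVIDIA 60-series GPUs been announced yet?"},
    {"role": "assistant", "content": null, "tool_calls": [
      {"id": "call_1", "function": {"name": "web_search", "arguments": "{\"query\": \"NVIDIA 60-series GPU announcement\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "No announcement of an RTX 60-series was found. The most recent NVIDIA lineup in the search results is the RTX 50-series."},
    {"role": "assistant", "content": "No, NVIDIA has not announced the RTX 60-series cards yet."}
  ]
}
```

The key point is that the tool message's `content` is just a string (the summarized search result), not a structured object.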
Here are the results from the basic test:

Results: 37/46 passed (80%)

Now it's a bit unfair, because the chaining test doesn't really work: it only looks at the first tool call of the chain. I need to be able to get response data and test the full chain. But this is showing me I need more training for implicit memory_short use and more training for implicit tool chaining. These are the trickiest for the model to perform because it has to infer the order of the tool calls; it's not laid out for it in the request.

That is my next step: add more implicit training examples and build out the test script to test training more effectively. Really curious too how this would perform on a 14B model.

Corey
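A full-chain check along those lines could be sketched like this. This is a hypothetical harness, not the real test script; the model interface, the stubbed tool results, and the tool names are all assumptions:

```python
# Hypothetical sketch: score the whole tool-call chain instead of only the
# first call. `model` is any callable that takes a message list and returns
# a reply dict; here we use a scripted stub instead of a real LLM.

def run_chain(model, messages, max_steps=5):
    """Drive the model until it stops requesting tools; collect call names."""
    calls = []
    for _ in range(max_steps):
        reply = model(messages)
        if not reply.get("tool_calls"):
            break  # model produced a final answer, chain is done
        for call in reply["tool_calls"]:
            calls.append(call["function"]["name"])
            # Feed a stubbed tool result back so the next step can be scored
            # without running real tools.
            messages = messages + [{"role": "tool", "content": "stub result"}]
    return calls

def chain_passes(model, messages, expected):
    """Pass only if the entire call sequence matches, not just the first call."""
    return run_chain(model, messages) == expected

# Stub model that asks for web_search, then memory_short, then answers.
script = [
    {"tool_calls": [{"function": {"name": "web_search"}}]},
    {"tool_calls": [{"function": {"name": "memory_short"}}]},
    {"content": "done"},
]
model = lambda msgs, it=iter(script): next(it)
```

With real response data plugged in instead of `"stub result"`, the same loop would let the test distinguish "picked the right first tool" from "completed the whole chain in order".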
Hey Corey,

Thanks for the detailed breakdown and clarification; that makes a lot of sense. The test results are interesting as well. I'm not sure if this is directly related, but I've observed something similar with my own reasoning model (thinking mode is automatically toggled). For example:

Example 1: Explain why increasing the LoRA rank in a fine-tuning setup can improve task-specific performance but also increase the risk of overfitting. Use reasoning.

Example 2: Use reasoning: explain why increasing the LoRA rank in a fine-tuning setup can improve task-specific performance but also increase the risk of overfitting.

In my testing, placing the "Use reasoning" trigger at the beginning of the prompt (as in Example 2) almost always activates reasoning mode. When it appears at the end (especially in longer prompts), the model sometimes skips structured reasoning and responds immediately. Not sure if that maps to what you're seeing with implicit chaining and memory, but I thought it was interesting.

Abdul