caal-ministral: fine-tuning web_search tool calls #77
Hey @cmac86 In your latest video you mentioned that you included hundreds (or thousands) of examples of web search, hass, n8n tool calls, and general conversation. I wanted to ask how you're handling the web search examples in your dataset, or if you'd be willing to share one example. I assume it's something like:

```json
{
  "messages": [
    {"role": "system", "content": "You are CAAL..."},
    {"role": "user", "content": "Have the NVIDIA 60-series GPUs been announced yet?"},
    {"role": "assistant", "content": null, "tool_calls": [
      {"id": "call_1", "function": {"name": "web_search", "arguments": "{\"query\": \"NVIDIA 60-series GPU announcement\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": [IDK WHAT GOES HERE]},
    {"role": "assistant", "content": "No, NVIDIA has not announced the RTX 60-series cards yet."}
  ]
}
```

But I have no idea what the tool content is supposed to be, much less how to get hundreds of examples for it XD

Cheers,
Abdul
Replies: 2 comments
Hey Abdul,

Good question, and after looking at my training data, I actually had the wrong format for web_search. The web_search tool just returns a plain string to the LLM. The search tool searches with DuckDuckGo, sends the response to a separate LLM call to summarize, then returns the summary as a string to the main agent. The example should look like this:

I ran another training run yesterday with a bit of a different approach. I used only real tools from the registry and made sure that the tool responses in the examples follow the real schema of the tools. I also mixed in about 15% purely conversational examples, because I found the model was biasing too hard towards using a tool for every call. The last training run had 1,434 examples, down from ~4,000. I'm training and testing in these categories:
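Since the actual example didn't survive in this thread, here's a sketch of what a web_search example with a plain-string tool response could look like. The field names follow the OpenAI-style chat format from the question above, and the summary text is invented for illustration; this is not taken from Corey's real training data:

```json
{
  "messages": [
    {"role": "system", "content": "You are CAAL..."},
    {"role": "user", "content": "Have the NVIDIA 60-series GPUs been announced yet?"},
    {"role": "assistant", "content": null, "tool_calls": [
      {"id": "call_1", "function": {"name": "web_search", "arguments": "{\"query\": \"NVIDIA 60-series GPU announcement\"}"}}
    ]},
    {"role": "tool", "tool_call_id": "call_1", "content": "No announcement of an RTX 60-series was found. The most recent NVIDIA lineup in the search results is the RTX 50-series."},
    {"role": "assistant", "content": "No, NVIDIA has not announced the RTX 60-series cards yet."}
  ]
}
```

The key point is that the tool message's `content` is just a string (the summarized search result), not a structured object.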
Here are the results from the basic test:

Results: 37/46 passed (80%)

Now it's a bit unfair, because the chaining test doesn't really work: it only looks at the first tool call of the chain. I need to be able to get response data and test the full chain. But this is showing me I need more training for implicit memory_short use and more training for implicit tool chaining. These are the trickiest for the model to perform because it has to infer the order of the tool calls; it's not laid out for it in the request.

That is my next step: add more implicit training examples and build out the test script to test training more effectively. Really curious too how this would perform on a 14B model.

Corey
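A full-chain check along those lines could be sketched like this. This is a hypothetical harness, not the real test script; the model interface, the stubbed tool results, and the tool names are all assumptions:

```python
# Hypothetical sketch: score the whole tool-call chain instead of only the
# first call. `model` is any callable that takes a message list and returns
# a reply dict; here we use a scripted stub instead of a real LLM.

def run_chain(model, messages, max_steps=5):
    """Drive the model until it stops requesting tools; collect call names."""
    calls = []
    for _ in range(max_steps):
        reply = model(messages)
        if not reply.get("tool_calls"):
            break  # model produced a final answer, chain is done
        for call in reply["tool_calls"]:
            calls.append(call["function"]["name"])
            # Feed a stubbed tool result back so the next step can be scored
            # without running real tools.
            messages = messages + [{"role": "tool", "content": "stub result"}]
    return calls

def chain_passes(model, messages, expected):
    """Pass only if the entire call sequence matches, not just the first call."""
    return run_chain(model, messages) == expected

# Stub model that asks for web_search, then memory_short, then answers.
script = [
    {"tool_calls": [{"function": {"name": "web_search"}}]},
    {"tool_calls": [{"function": {"name": "memory_short"}}]},
    {"content": "done"},
]
model = lambda msgs, it=iter(script): next(it)
```

With real response data plugged in instead of `"stub result"`, the same loop would let the test distinguish "picked the right first tool" from "completed the whole chain in order".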
Hey Corey,

Thanks for the detailed breakdown and clarification; that makes a lot of sense. The test results are interesting as well. I'm not sure if this is directly related, but I've observed something similar with my own reasoning model (thinking mode is automatically toggled). For example:

Example 1: Explain why increasing the LoRA rank in a fine-tuning setup can improve task-specific performance but also increase the risk of overfitting. Use reasoning.

Example 2: Use reasoning: explain why increasing the LoRA rank in a fine-tuning setup can improve task-specific performance but also increase the risk of overfitting.

In my testing, placing the "Use reasoning" trigger at the beginning of the prompt (as in Example 2) almost always activates reasoning mode. When it appears at the end (especially in longer prompts), the model sometimes skips structured reasoning and responds immediately. Not sure if that maps to what you're seeing with implicit chaining and memory, but I thought it was interesting.

Abdul