Specialization in LLM-based agents concerns the activation of domain-relevant reasoning behaviors through textual conditioning rather than external knowledge acquisition. This experiment measures whether modifying an agent's analytical context through role descriptions, explicit planning opportunities, or expert methodological guidance can activate or better structure the domain-specific reasoning already encoded in the model.
We use concrete machine learning tasks to evaluate how LLM behavior changes under different role conditions (e.g., assigning the agent a data scientist role). These tasks are drawn from public datasets that the model has likely encountered during pretraining, ensuring that performance differences reflect changes in reasoning behavior rather than access to external knowledge.
The benchmark uses five public machine learning datasets covering regression, binary, and multiclass classification tasks, drawn from the CatDB repository, the Yelp Open Dataset, and OpenML:
- Utility - Regression task (target: CSRI) - CatDB repository
- Wifi - Binary classification (target: TechCenter) - CatDB repository
- EU-IT - Multiclass classification (target: Position) - CatDB repository
- Yelp - Multiclass classification (target: stars) - Yelp Open Dataset
- Volkert - Multiclass classification (target: class) - OpenML
Dataset Availability: Some datasets are included in this repository:
- `Utility/Utility.csv` ✓ included
- `Wifi/WiFi.csv` ✓ included (note: the filename is `WiFi.csv`, with capital letters)
- `EU-IT/EU-IT_cleaned.csv` ✓ included
- `Yelp/Yelp_Merged.csv` ✗ not included; must be downloaded
- `Volkert/volkert.csv` ✗ not included; must be downloaded
Manual Download Required: Datasets not included in this repository must be obtained as follows:
- Utility, Wifi, EU-IT: These CatDB datasets are already included in this repository. If you need to re-download them, obtain them from the CatDB source (specific download links may vary).
- Yelp: Download from the Yelp Open Dataset. The benchmark uses a merged CSV file (`Yelp_Merged.csv`) created from multiple Yelp dataset files: `Business.csv`, `Checkins.csv`, `Reviews.csv`, and `Users.csv`. Place the merged file as `Yelp_Merged.csv` in the `Yelp/` directory.
- Volkert: Download from OpenML (dataset ID: 41166). Save it as `volkert.csv` in the `Volkert/` directory.
Important: The generated code expects CSV files to be in the current working directory when executed. When running the generated scripts, either:
- Navigate to the dataset's directory (e.g., `cd Utility/`) before executing, OR
- Modify the file paths in the generated scripts to point to the correct dataset location
Each dataset is evaluated under three conditioning strategies with identical tasks, data, and models:
- Role-based prompting - Assigns a professional identity (e.g., data scientist) alongside the task description
- Planning-based conditioning - Adds an explicit intermediate step where the LLM generates a high-level solution plan before producing executable code
- Expert-guided conditioning - Injects methodological instructions reflecting standard data-science workflows
These variants isolate the effects of identity framing, added reasoning structure, and explicit procedural guidance on agent behavior.
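The three strategies differ only in the text wrapped around a fixed task. The sketch below illustrates this as plain prompt construction; the task text and wording are hypothetical, not the repository's actual prompts:

```python
# Illustrative sketch (not the repository's actual code) of how the three
# conditioning strategies vary the text around an identical task.
TASK = "Load Utility.csv, preprocess it, train a regression model for CSRI, and report MAE."

def role_prompt(role):
    # Role-based: prepend only a professional identity.
    return f"You are a {role}.\n{TASK}"

def planning_prompt(role):
    # Planning-based: request an explicit high-level plan before the code.
    return (f"You are a {role}.\n"
            f"First write a high-level solution plan, then produce executable code.\n{TASK}")

def expert_prompt(role, workflow):
    # Expert-guided: inject a numbered methodological workflow.
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(workflow))
    return f"You are a {role}.\nFollow this workflow:\n{steps}\n{TASK}"
```

Because the task string is shared, any performance difference across the three prompts is attributable to the conditioning text alone.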
Agent Roles: The benchmark uses five agent roles defined in agents.yaml:
- Data Scientist: Focuses on cleaning, preprocessing, and building reproducible ML pipelines
- Researcher: Emphasizes pattern discovery, experimentation, and result interpretation
- Engineer: Prioritizes reliable, efficient implementations with performance optimization
- Data Analyst: Focuses on trend analysis, relationship identification, and clear interpretation
- No Role: Baseline condition without role assignment
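Agent definitions of this kind typically pair each role with a goal and a backstory. A hypothetical sketch of what `agents.yaml` might look like (field names and wording are assumptions, not taken from the repository):

```yaml
# Hypothetical sketch of agents.yaml; actual field names may differ.
data_scientist:
  role: "Data Scientist"
  goal: "Clean and preprocess data and build reproducible ML pipelines."
  backstory: "A practitioner focused on data quality and reliable evaluation."
researcher:
  role: "Researcher"
  goal: "Discover patterns, run experiments, and interpret results."
  backstory: "An investigator who values careful experimentation."
# Engineer, Data Analyst, and No Role would follow the same shape.
```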
Conditioning Strategies:
- Role-based: Implemented via CrewAI agent configurations with role-specific goals and backstories (`crew.py`)
- Planning-based: Enabled by setting `planning=True` in the Crew configuration (`crewplanning.py`), which adds explicit step-by-step planning before code generation
- Expert-guided: Implemented through detailed task descriptions in `task.yaml` that include explicit methodological workflows (e.g., data profiling, cleaning, feature engineering, modeling, evaluation)
Task Structure: Each dataset requires generating executable Python code that:
- Loads the dataset from CSV
- Performs appropriate preprocessing (encoding, scaling, imputation)
- Trains a model (regression or classification depending on task)
- Evaluates using standard metrics (MAE for regression; Accuracy, Precision, Recall, F1-score for classification)
- Outputs results including predicted vs actual values
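The five steps above can be sketched as a single script skeleton. The version below uses pandas and scikit-learn for the classification case; the column handling, model choice, and synthetic input are illustrative, not the benchmark's actual generated code:

```python
# Minimal sketch of the shape of a generated classification script.
# Real scripts read a benchmark CSV (e.g., WiFi.csv) from the working directory.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def run(df, target):
    X, y = df.drop(columns=[target]), df[target]
    num = X.select_dtypes(include="number").columns.tolist()
    cat = [c for c in X.columns if c not in num]

    # Preprocessing: impute + scale numeric, impute + one-hot encode categorical.
    pre = ColumnTransformer([
        ("num", Pipeline([("imp", SimpleImputer()), ("sc", StandardScaler())]), num),
        ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                          ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat),
    ])
    model = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)

    # Standard classification metrics plus predicted-vs-actual values.
    acc = accuracy_score(y_te, pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_te, pred, average="weighted", zero_division=0)
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f1,
            "predicted": pred.tolist(), "actual": y_te.tolist()}
```

A regression variant would swap the model for a regressor and report MAE instead of the classification metrics.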
Code Generation: Generated code files follow these naming conventions:
- `{Dataset}_{role}.py` - Role-based conditioning
- `{Dataset}_{role}planning.py` - Planning-based conditioning
- `{Dataset}_{role}exp.py` - Expert-guided conditioning
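The convention maps a (dataset, role, strategy) triple to a filename. A hypothetical helper (the exact spelling of role names in real filenames is an assumption):

```python
# Hypothetical helper mapping (dataset, role, strategy) to the naming
# convention above; real filenames may spell roles differently.
def script_name(dataset: str, role: str, strategy: str) -> str:
    suffix = {"role": "", "planning": "planning", "expert": "exp"}[strategy]
    return f"{dataset}_{role}{suffix}.py"
```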
Our contribution is a controlled specialization evaluation framework that isolates the effect of textual conditioning on agent behavior. By holding the model, data, and task specifications fixed and varying only the conditioning strategy, the benchmark enables direct comparison of role-based prompting, planning opportunities, and expert guidance. This design supports principled analysis of when domain-relevant reasoning can be activated through lightweight prompt interventions alone and when stronger procedural structure is required.