A Python utility for running multiple parameter sweep jobs across multiple GPUs using tmux sessions. This tool is particularly useful for machine learning experiments where you need to run the same script with different hyperparameter combinations distributed across available GPUs.
- Multi-GPU support: Automatically distributes jobs across available GPUs
- Parameter sweeps: Generate all combinations of hyperparameters
- YAML configuration: Define jobs and parameters in YAML files
- Config concatenation: Combine multiple YAML configs into a single config file
- Tmux session management: Each job runs in its own tmux session
- Load balancing: Automatically assigns jobs to the GPU with the fewest running jobs
- Logging: Saves output from each job to separate log files
- Job monitoring: Waits for jobs to complete before launching new ones when GPU capacity is reached
- Programmatic API: Use `load_yaml_config_and_generate_commands()` to generate commands programmatically
Install directly from GitHub using pip:

```
pip install git+https://github.com/omi-n/job_batcher.git
```

Or install from source:

- Clone this repository:

  ```
  git clone https://github.com/omi-n/job_batcher.git
  cd job_batcher
  ```

- Install in development mode:

  ```
  pip install -e .
  ```

Requirements:

- Python 3.6+
- `tyro` - For the command-line interface
- `PyYAML` - For YAML configuration file parsing
- `tmux` - For session management (install via your system package manager)
- `nvidia-smi` - For GPU detection (comes with NVIDIA drivers)
After installation, you can use the `job-batcher` command from anywhere.
You can use the job batcher in two ways:
Inline, with a command template and JSON template args:

```
job-batcher \
  --command_template "python train.py --lr {{learning_rate}} --batch_size {{batch_size}}" \
  --template_args '{"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}'
```

Or with a YAML configuration file:

```
job-batcher --config_file configs/humanoid_train.yaml
```

Combine multiple YAML configuration files into a single config:

```
job-batcher \
  --concatenate configs/humanoid_train.yaml configs/walker2d_train.yaml \
  --output_path configs/combined.yaml
```

You can also point to a folder, and it will automatically find and parse all YAML files in that folder (recursively):

```
job-batcher \
  --concatenate configs/experiments/ \
  --output_path configs/all_experiments.yaml
```

This will:
- Load each config file (or all YAML files in specified folders) and generate all command combinations
- Create a new YAML file at `output_path` with all commands
- The new config uses `{{command}}` as the template with all commands as values (see the sketch below)
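For intuition, here is a minimal sketch of what this concatenation step produces. It is illustrative, not the package's implementation: it assumes the `load_yaml_config_and_generate_commands` function described later in this README, and the `collect_configs` helper is hypothetical:

```python
from pathlib import Path

import yaml

from job_batcher import load_yaml_config_and_generate_commands


def collect_configs(paths):
    # Hypothetical helper: expand folders recursively into their .yaml files,
    # and pass plain files through unchanged.
    for p in map(Path, paths):
        yield from (sorted(p.rglob("*.yaml")) if p.is_dir() else [p])


# Render every command from every input config (files and folders can be mixed)...
all_commands = []
for cfg in collect_configs(["configs/humanoid_train.yaml", "configs/walker2d_train.yaml"]):
    commands, _ = load_yaml_config_and_generate_commands(str(cfg))
    all_commands.extend(commands)

# ...and wrap them in a trivial {{command}} template, one value per command.
combined = {
    "command_template": "{{command}}",
    "template_args": {"command": all_commands},
}
Path("configs/combined.yaml").write_text(yaml.safe_dump(combined))
```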
You can then run the combined config:
```
job-batcher --config_file configs/combined.yaml
```

The following configuration options are available:

- `command_template`: Template for the command to run each job. Use `{{parameter_name}}` for placeholders
- `template_args`: Dictionary of parameters with their possible values (as lists)
- `config_file`: Path to a YAML configuration file
- `concatenate`: List of YAML config files or folders to concatenate (requires `output_path`). Folders are searched recursively for all `.yaml` files
- `output_path`: Path to save the concatenated config file (used with `concatenate`)
- `job_prefix`: Prefix for tmux session names (default: "job")
- `setup_str`: Setup commands to run before each job (e.g., environment variables)
- `workers_per_gpu`: Number of concurrent jobs per GPU (default: 1)
- `log_dir`: Directory to store log files (default: "logs")
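The placeholder mechanics are plain string substitution over the Cartesian product of `template_args`. A minimal sketch of the idea (illustrative only; the package's actual rendering code may differ):

```python
from itertools import product

template = "python train.py --lr {{learning_rate}} --batch_size {{batch_size}}"
args = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}

commands = []
for values in product(*args.values()):
    cmd = template
    # Dicts preserve insertion order, so keys and values stay aligned.
    for name, value in zip(args, values):
        cmd = cmd.replace("{{" + name + "}}", str(value))
    commands.append(cmd)

print(len(commands))   # 9 combinations (3 x 3)
print(commands[0])     # python train.py --lr 0.001 --batch_size 32
```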
An example configuration for the Humanoid environment:

```yaml
command_template: >-
  uv run main.py
  --data.num_workers 2
  --trainer.eval_num_episodes 50
  --data.max_epochs {{num_epochs}}
  --data.reward_percentiles 0.0
  --use_wandb
  --wandb_project {{wandb_project}}
  --experiment_name lr{{learning_rate}}-ema{{ema_decay}}-{{loss_function}}-wd{{weight_decay}}-e{{num_epochs}}
  --data.ds_name mujoco/{{env}}/{{level}}
  --trainer.loss_function {{loss_function}}
  --trainer.optimizer_config.learning_rate {{learning_rate}}
  --trainer.optimizer_config.weight_decay {{weight_decay}}
  --trainer.optimizer_config.use_scheduler
  --trainer.use_ema
  --trainer.ema_start_epoch 3
  --trainer.ema_update_interval 1
  --trainer.ema_decay {{ema_decay}}
  agent:dnn-agent-config --agent.n_future 1 --agent.n_history 1
template_args:
  learning_rate: [1e-5, 5e-5, 1e-4, 5e-4]
  ema_decay: [0.0, 0.99, 0.995, 0.999]
  loss_function: "mse_loss"
  num_epochs: [80, 160]
  env: "humanoid"
  level: "expert-v0"
  weight_decay: 1e-2
  wandb_project: "minari-humanoid-2"
job_prefix: "minari_job"
setup_str: "export MINARI_DATASETS_PATH=\"/path/to/data\""
```

This configuration generates 32 job combinations: 4 `learning_rate` values × 4 `ema_decay` values × 2 `num_epochs` values, with the remaining parameters fixed to a single value.
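As a quick arithmetic check of that count (plain Python, independent of the tool):

```python
# Sizes of the swept parameters in the config above; scalars (loss_function,
# env, level, weight_decay, wandb_project) each contribute a factor of 1.
sweep_sizes = {"learning_rate": 4, "ema_decay": 4, "num_epochs": 2}

total = 1
for size in sweep_sizes.values():
    total *= size
print(total)  # 4 * 4 * 2 = 32
```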
A similar configuration can be used for the Walker2D environment, changing only the wandb project and environment settings.
You can use the job batcher programmatically in your Python scripts:
```python
from job_batcher import load_yaml_config_and_generate_commands

# Load a config file and generate all command combinations
commands, config = load_yaml_config_and_generate_commands("configs/humanoid_train.yaml")

print(f"Generated {len(commands)} commands")
for i, cmd in enumerate(commands[:3]):  # Print the first 3 commands
    print(f"Command {i}: {cmd}")

# Access configuration
print(f"Job prefix: {config.job_prefix}")
print(f"Workers per GPU: {config.workers_per_gpu}")
```

This is useful for:
- Previewing commands before running them
- Integrating job generation into larger workflows
- Custom job scheduling logic
- Debugging configuration issues
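For example, a hedged sketch of previewing and filtering commands before launching anything; the filter string is illustrative and depends on how values are rendered into the command string:

```python
from job_batcher import load_yaml_config_and_generate_commands

commands, config = load_yaml_config_and_generate_commands("configs/humanoid_train.yaml")

# Dry run: inspect a subset instead of launching jobs. The exact substring
# to match depends on how the tool renders values into each command.
selected = [cmd for cmd in commands if "--trainer.ema_decay 0.999" in cmd]

print(f"{len(selected)} of {len(commands)} commands selected")
for cmd in selected[:2]:
    print(cmd)
```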
- Parameter Expansion: The tool takes the `template_args` and generates all possible combinations using a Cartesian product
- GPU Detection: Automatically detects available GPUs using `nvidia-smi`
- Load Balancing: Assigns each job to the GPU with the fewest currently running jobs (see the sketch below)
- Tmux Sessions: Each job runs in its own detached tmux session for isolation
- Job Queue: When all GPUs are at capacity, new jobs wait until existing jobs complete
- Logging: Each job's output is redirected to a separate log file in the specified directory
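A simplified sketch of the detection and load-balancing ideas (an illustration of the approach, not the tool's actual code); it assumes `nvidia-smi -L` prints one line per GPU:

```python
import subprocess


def detect_gpus():
    # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: NVIDIA A100 (...)".
    out = subprocess.check_output(["nvidia-smi", "-L"]).decode()
    return list(range(len(out.strip().splitlines())))


def assign(commands, gpus, workers_per_gpu=1):
    running = {gpu: 0 for gpu in gpus}
    placements = []
    for cmd in commands:
        gpu = min(running, key=running.get)  # GPU with the fewest jobs wins
        # The real tool waits here whenever running[gpu] == workers_per_gpu,
        # i.e. every GPU is at capacity, and resumes as jobs finish.
        running[gpu] += 1
        placements.append((gpu, cmd))
    return placements
```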
To monitor and manage running jobs:

List running job sessions:

```
tmux list-sessions | grep <job_prefix>
```

Attach to a specific session:

```
tmux attach-session -t <session_name>
```

Follow a job's log file:

```
tail -f logs/<job_prefix>_<job_id>.log
```

Kill all sessions for a given prefix:

```
tmux list-sessions | grep <job_prefix> | cut -d: -f1 | xargs -I {} tmux kill-session -t {}
```

A typical workflow:

- Create a YAML configuration file with your experiment parameters
- Run the job batcher:

  ```
  job-batcher --config_file configs/my_experiment.yaml
  ```

- Monitor progress:

  ```
  # List running jobs
  tmux list-sessions | grep minari_job

  # View logs
  tail -f logs/minari_job_gpu0_0.log
  ```

- Results will be logged to individual files in the `logs/` directory
If you have multiple experiment configurations and want to run them all together:
- Create individual config files for each experiment:

  - `configs/humanoid_train.yaml`
  - `configs/walker2d_train.yaml`
  - `configs/hopper_train.yaml`

- Concatenate them into a single config (you can mix files and folders):

  ```
  job-batcher \
    --concatenate configs/robotics/ configs/extra_experiment.yaml \
    --output_path configs/all_experiments.yaml \
    --job_prefix "combined_exp" \
    --workers_per_gpu 2
  ```

  Or concatenate an entire folder:

  ```
  job-batcher \
    --concatenate configs/ \
    --output_path configs/all_experiments.yaml
  ```

- Run the combined configuration:

  ```
  job-batcher --config_file configs/all_experiments.yaml
  ```
This approach is useful when:
- You want to run diverse experiments with different hyperparameters
- You need to fairly distribute GPU resources across multiple projects
- You want a single unified config for reproducibility
- You have many config files organized in folders
- Use `workers_per_gpu > 1` if your jobs don't fully utilize the GPU
- Set up proper environment variables in `setup_str` for reproducible experiments
- Use descriptive `job_prefix` names to easily identify different experiment runs
- Monitor GPU memory usage with `nvidia-smi` to optimize `workers_per_gpu` (see the sketch below)
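To help pick `workers_per_gpu`, a small sketch (a hypothetical helper, not part of this package) that reads per-GPU memory usage through `nvidia-smi`'s query interface:

```python
import subprocess


def gpu_memory():
    # CSV query: one "used, total" line per GPU, in MiB ("nounits" strips the unit).
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]).decode()
    return [tuple(int(v) for v in line.split(",")) for line in out.strip().splitlines()]


for gpu, (used, total) in enumerate(gpu_memory()):
    # If one job leaves most of the memory free, workers_per_gpu can likely be raised.
    print(f"GPU {gpu}: {used}/{total} MiB used")
```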