Job Batcher

A Python utility for running multiple parameter sweep jobs across multiple GPUs using tmux sessions. This tool is particularly useful for machine learning experiments where you need to run the same script with different hyperparameter combinations distributed across available GPUs.

Features

  • Multi-GPU support: Automatically distributes jobs across available GPUs
  • Parameter sweeps: Generate all combinations of hyperparameters
  • YAML configuration: Define jobs and parameters in YAML files
  • Config concatenation: Combine multiple YAML configs into a single config file
  • Tmux session management: Each job runs in its own tmux session
  • Load balancing: Automatically assigns jobs to the GPU with the fewest running jobs
  • Logging: Saves output from each job to separate log files
  • Job monitoring: Waits for jobs to complete before launching new ones when GPU capacity is reached
  • Programmatic API: Use load_yaml_config_and_generate_commands() to generate commands programmatically

Installation

From GitHub (Recommended)

Install directly from GitHub using pip:

pip install git+https://github.com/omi-n/job_batcher.git

From Source

  1. Clone this repository:
git clone https://github.com/omi-n/job_batcher.git
cd job_batcher
  2. Install in development mode:
pip install -e .

Requirements

  • Python 3.6+
  • tyro - For command-line interface
  • PyYAML - For YAML configuration file parsing
  • tmux - For session management (install via your system package manager)
  • nvidia-smi - For GPU detection (comes with NVIDIA drivers)

Usage

After installation, you can use the job-batcher command from anywhere:

Basic Usage

You can use the job batcher in three ways:

1. Command Line Arguments

job-batcher \
  --command_template "python train.py --lr {{learning_rate}} --batch_size {{batch_size}}" \
  --template_args '{"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}'
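
With three values for each parameter, this single invocation expands into nine commands (3 learning rates × 3 batch sizes), each launched in its own tmux session on whichever GPU currently has the fewest running jobs.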

2. YAML Configuration File

job-batcher --config_file configs/humanoid_train.yaml

3. Concatenate Multiple Configs

Combine multiple YAML configuration files into a single config:

job-batcher \
  --concatenate configs/humanoid_train.yaml configs/walker2d_train.yaml \
  --output_path configs/combined.yaml

You can also point to a folder, and it will automatically find and parse all YAML files in that folder (recursively):

job-batcher \
  --concatenate configs/experiments/ \
  --output_path configs/all_experiments.yaml

This will:

  • Load each config file (or all YAML files in specified folders) and generate all command combinations
  • Create a new YAML file at output_path with all commands
  • Use {{command}} as the command_template in the new config, with all generated commands as its values

You can then run the combined config:

job-batcher --config_file configs/combined.yaml
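
The combined file is just another job-batcher config whose only template parameter is the full command. As a rough sketch (using the documented programmatic API, not the tool's internal implementation), concatenation is equivalent to something like:

import yaml
from job_batcher import load_yaml_config_and_generate_commands

# Expand each source config into its full list of commands.
all_commands = []
for path in ["configs/humanoid_train.yaml", "configs/walker2d_train.yaml"]:
    commands, _config = load_yaml_config_and_generate_commands(path)
    all_commands.extend(commands)

# The combined config uses {{command}} as the template, with every
# generated command as one of its values.
combined = {
    "command_template": "{{command}}",
    "template_args": {"command": all_commands},
}
with open("configs/combined.yaml", "w") as f:
    yaml.safe_dump(combined, f)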

Configuration Options

  • command_template: Template for the command to run each job. Use {{parameter_name}} for placeholders
  • template_args: Dictionary of parameters with their possible values (as lists)
  • config_file: Path to YAML configuration file
  • concatenate: List of YAML config files or folders to concatenate (requires output_path). Folders are searched recursively for all .yaml files
  • output_path: Path to save concatenated config file (used with concatenate)
  • job_prefix: Prefix for tmux session names (default: "job")
  • setup_str: Setup commands to run before each job (e.g., environment variables)
  • workers_per_gpu: Number of concurrent jobs per GPU (default: 1)
  • log_dir: Directory to store log files (default: "logs")
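
Putting these together, the sketch below writes a config that sets most of the options above. It assumes workers_per_gpu and log_dir can also be set from the YAML file (as the config object's attributes suggest); all values are purely illustrative.

import yaml

# Illustrative sweep config exercising the documented options.
config = {
    "command_template": "python train.py --lr {{learning_rate}} --seed {{seed}}",
    "template_args": {"learning_rate": [1e-4, 1e-3], "seed": [0, 1, 2]},
    "job_prefix": "demo",                                # tmux session name prefix
    "setup_str": "export CUDA_DEVICE_ORDER=PCI_BUS_ID",  # runs before each job
    "workers_per_gpu": 2,                                # assumed YAML key; default 1
    "log_dir": "logs",                                   # assumed YAML key; default "logs"
}

with open("configs/demo_sweep.yaml", "w") as f:
    yaml.safe_dump(config, f)

Running job-batcher --config_file configs/demo_sweep.yaml on this file would launch 2 × 3 = 6 jobs.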

Example Configuration Files

Humanoid Training (configs/humanoid_train.yaml)

command_template: >-
  uv run main.py 
  --data.num_workers 2 
  --trainer.eval_num_episodes 50 
  --data.max_epochs {{num_epochs}} 
  --data.reward_percentiles 0.0 
  --use_wandb 
  --wandb_project {{wandb_project}} 
  --experiment_name lr{{learning_rate}}-ema{{ema_decay}}-{{loss_function}}-wd{{weight_decay}}-e{{num_epochs}} 
  --data.ds_name mujoco/{{env}}/{{level}} 
  --trainer.loss_function {{loss_function}} 
  --trainer.optimizer_config.learning_rate {{learning_rate}} 
  --trainer.optimizer_config.weight_decay {{weight_decay}}
  --trainer.optimizer_config.use_scheduler 
  --trainer.use_ema 
  --trainer.ema_start_epoch 3 
  --trainer.ema_update_interval 1 
  --trainer.ema_decay {{ema_decay}} 
  agent:dnn-agent-config --agent.n_future 1 --agent.n_history 1

template_args:
  learning_rate: [1e-5, 5e-5, 1e-4, 5e-4]
  ema_decay: [0.0, 0.99, 0.995, 0.999]
  loss_function: "mse_loss"
  num_epochs: [80, 160]
  env: "humanoid"
  level: "expert-v0"
  weight_decay: 1e-2
  wandb_project: "minari-humanoid-2"

job_prefix: "minari_job"
setup_str: "export MINARI_DATASETS_PATH=\"/path/to/data\""

This configuration generates 32 job combinations: 4 learning rates × 4 ema_decay values × 1 loss_function × 2 num_epochs values.
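
You can sanity-check the sweep size with the programmatic API described further below:

from job_batcher import load_yaml_config_and_generate_commands

commands, _ = load_yaml_config_and_generate_commands("configs/humanoid_train.yaml")
print(len(commands))  # 32: 4 learning rates x 4 ema_decay values x 2 epoch counts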

Walker2D Training (configs/walker2d_train.yaml)

Similar to the humanoid configuration, but targeting the Walker2D environment with a different wandb project and environment settings.

Programmatic Usage

You can use the job batcher programmatically in your Python scripts:

from job_batcher import load_yaml_config_and_generate_commands

# Load a config file and generate all command combinations
commands, config = load_yaml_config_and_generate_commands("configs/humanoid_train.yaml")

print(f"Generated {len(commands)} commands")
for i, cmd in enumerate(commands[:3]):  # Print first 3 commands
    print(f"Command {i}: {cmd}")

# Access configuration
print(f"Job prefix: {config.job_prefix}")
print(f"Workers per GPU: {config.workers_per_gpu}")

This is useful for:

  • Previewing commands before running them
  • Integrating job generation into larger workflows
  • Custom job scheduling logic
  • Debugging configuration issues
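
For example, here is a minimal sketch of custom scheduling logic that simply filters the generated commands on a hypothetical criterion before deciding what to launch:

from job_batcher import load_yaml_config_and_generate_commands

commands, config = load_yaml_config_and_generate_commands("configs/humanoid_train.yaml")

# Hypothetical filter: keep only the shorter 80-epoch runs.
selected = [cmd for cmd in commands if "--data.max_epochs 80" in cmd]
print(f"Selected {len(selected)} of {len(commands)} commands")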

How It Works

  1. Parameter Expansion: The tool takes the template_args and generates all possible combinations using a Cartesian product (see the sketch after this list)
  2. GPU Detection: Automatically detects available GPUs using nvidia-smi
  3. Load Balancing: Assigns each job to the GPU with the fewest currently running jobs
  4. Tmux Sessions: Each job runs in its own detached tmux session for isolation
  5. Job Queue: When all GPUs are at capacity, new jobs wait until existing jobs complete
  6. Logging: Each job's output is redirected to a separate log file in the specified directory
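
A rough sketch of steps 1 and 2 in plain Python (illustrative only, not the tool's actual code):

import itertools
import subprocess

command_template = "python train.py --lr {{learning_rate}} --batch_size {{batch_size}}"
template_args = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64]}

# Step 1: Cartesian product of all parameter value lists, substituted into the template.
keys = list(template_args)
commands = []
for values in itertools.product(*(template_args[k] for k in keys)):
    cmd = command_template
    for key, value in zip(keys, values):
        cmd = cmd.replace("{{" + key + "}}", str(value))
    commands.append(cmd)  # 3 x 2 = 6 commands in this example

# Step 2: GPU detection via nvidia-smi (one GPU index per output line).
gpu_ids = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"], text=True
).split()
print(f"{len(commands)} commands to schedule across {len(gpu_ids)} GPUs")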

Monitoring Jobs

View Running Jobs

tmux list-sessions | grep <job_prefix>

Attach to a Specific Job

tmux attach-session -t <session_name>

View Job Logs

tail -f logs/<job_prefix>_<job_id>.log

Kill All Jobs with Prefix

tmux list-sessions | grep <job_prefix> | cut -d: -f1 | xargs -I {} tmux kill-session -t {}

Example Workflow

Basic Workflow

  1. Create a YAML configuration file with your experiment parameters
  2. Run the job batcher:
    job-batcher --config_file configs/my_experiment.yaml
  3. Monitor progress:
    # List running jobs
    tmux list-sessions | grep minari_job
    
    # View logs
    tail -f logs/minari_job_gpu0_0.log
  4. Results will be logged to individual files in the logs/ directory

Advanced Workflow: Combining Multiple Experiments

If you have multiple experiment configurations and want to run them all together:

  1. Create individual config files for each experiment:

    • configs/humanoid_train.yaml
    • configs/walker2d_train.yaml
    • configs/hopper_train.yaml
  2. Concatenate them into a single config (you can mix files and folders):

    job-batcher \
      --concatenate configs/robotics/ configs/extra_experiment.yaml \
      --output_path configs/all_experiments.yaml \
      --job_prefix "combined_exp" \
      --workers_per_gpu 2

    Or concatenate an entire folder:

    job-batcher \
      --concatenate configs/ \
      --output_path configs/all_experiments.yaml
  3. Run the combined configuration:

    job-batcher --config_file configs/all_experiments.yaml

This approach is useful when:

  • You want to run diverse experiments with different hyperparameters
  • You need to fairly distribute GPU resources across multiple projects
  • You want a single unified config for reproducibility
  • You have many config files organized in folders

Tips

  • Use workers_per_gpu > 1 if your jobs don't fully utilize the GPU
  • Set up proper environment variables in setup_str for reproducible experiments
  • Use descriptive job_prefix names to easily identify different experiment runs
  • Monitor GPU memory usage with nvidia-smi to optimize workers_per_gpu
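
For the last tip, the same memory check can also be scripted from Python if that fits your workflow better:

import subprocess

# Print per-GPU memory usage via nvidia-smi's CSV query interface.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv"],
    text=True,
))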
