diff --git a/README.md b/README.md
index b1c0e55e..aca609a7 100644
--- a/README.md
+++ b/README.md
@@ -86,7 +86,7 @@ Before you start training, however, please follow the installation instructions
 Then use the same command as before, but provide the CARL environment, in this example CARLCartPoleEnv, and information about the context distribution as keywords:
 ```bash
-python mighty/run_mighty.py 'algorithm=dqn' 'env=CARLCartPole' 'num_envs=10' '+env_kwargs.num_contexts=10' '+env_kwargs.context_feature_args.gravity=[normal, 9.8, 1.0, -100.0, 100.0]' 'env_wrappers=[mighty.mighty_utils.wrappers.FlattenVecObs]'
+python mighty/run_mighty.py 'algorithm=ppo' 'env=CARLCartPole' '+env_kwargs.num_contexts=10' '+env_kwargs.context_feature_args.gravity=[normal, 9.8, 1.0, -100.0, 100.0]' 'env_wrappers=[mighty.mighty_utils.wrappers.FlattenVecObs]' 'algorithm_kwargs.rollout_buffer_kwargs.buffer_size=2048'
 ```
 For more complex configurations like this, we recommend making an environment configuration file. Check out our [CARL Ant](mighty/configs/environment/carl_walkers/ant_goals.yaml) file to see how this simplifies the process of working with configurable environments.
diff --git a/examples/README.md b/examples/README.md
index d2795b20..1836f9aa 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -73,7 +73,7 @@ python mighty/run_mighty.py 'env=CartPole-v1'
 We can also be more specific, e.g. by adding our desired number of interaction steps and the number of parallel environments we want to run:
 ```bash
-python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=10'
+python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=16'
 ```
 For some environments, including CartPole-v1, these details are pre-configured in the Mighty configs, meaning we can use the environment keyword to set them all at once:
@@ -98,7 +98,7 @@ python mighty/run_mighty.py 'environment=gymnasium/cartpole' 'algorithm=dqn' 'al
 Or to use e.g. an ez-greedy exploration policy for DQN:
 ```bash
-python mighty/run_mighty.py 'environment=gymnasium/cartpole' 'algorithm=dqn' '+algorithm_kwargs.policy_class=mighty.mighty_exploration.EZGreedy'
+python mighty/run_mighty.py 'environment=gymnasium/cartpole' 'algorithm=dqn' 'algorithm_kwargs.policy_class=mighty.mighty_exploration.EZGreedy' 'algorithm_kwargs.policy_kwargs=null'
 ```
 You can see that in this case, the value we pass to the script is a class name string which can take the value of any function you want, including custom ones as we'll see further down.
@@ -109,7 +109,7 @@ You can see that in this case, the value we pass to the script is a class name s
 The meta components are a bit more complex, since they are a list of class names and optional keyword arguments:
 ```bash
-python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=10' '+algorithm_kwargs.meta_methods=[mighty.mighty_meta.RND]'
+python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=16' '+algorithm_kwargs.meta_methods=[mighty.mighty_meta.RND]'
 ```
 As this can become complex, we recommend configuring these in Hydra config files.
@@ -121,7 +121,7 @@ Hydra has a multirun functionality with which you can specify a grid of argument
 Its best use is probably for easily running multiple seeds at once like this:
 ```bash
-python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=10' 'seed=0,1,2,3,4' 'output_dir=examples/multiple_runs' -m
+python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=16' 'seed=0,1,2,3,4' 'output_dir=examples/multiple_runs' -m
 ```
@@ -196,7 +196,7 @@ Compare their structure: the custom policy has a fixed set of methods inherited
 If you want to run these custom modules, you can do so by adding them by their import path:
 ```bash
-python mighty/run_mighty.py 'algorithm=dqn' '+algorithm_kwargs.policy_class=examples.custom_policy.QValueUCB' '+algorithm_kwargs.policy_kwargs={}'
+python mighty/run_mighty.py 'algorithm=dqn' 'algorithm_kwargs.policy_class=examples.custom_policy.QValueUCB' 'algorithm_kwargs.policy_kwargs=null'
 ```
 For the meta-module, it works exactly the same way:
 ```bash
diff --git a/examples/hypersweeper_smac_example_config.yaml b/examples/hypersweeper_smac_example_config.yaml
index e340bc05..42649a95 100644
--- a/examples/hypersweeper_smac_example_config.yaml
+++ b/examples/hypersweeper_smac_example_config.yaml
@@ -17,49 +17,46 @@ env_kwargs: {}
 env_wrappers: []
 num_envs: 64
-# @package _global_
 algorithm: PPO
 algorithm_kwargs:
-  # Hyperparameters
-  n_policy_units: 128
-  n_critic_units: 128
-  soft_update_weight: 0.01
+  rescale_action: False
+  tanh_squash: False
   rollout_buffer_class:
-    _target_: mighty.mighty_replay.MightyRolloutBuffer # Using rollout buffer
+    _target_: mighty.mighty_replay.MightyRolloutBuffer
+
   rollout_buffer_kwargs:
-    buffer_size: 4096 # Size of the rollout buffer.
-    gamma: 0.99 # Discount factor for future rewards.
-    gae_lambda: 0.95 # GAE lambda.
-    obs_shape: ??? # Placeholder for observation shape
-    act_dim: ??? # Placeholder for action dimension
+    buffer_size: 128 # per-env rollout length
+    gamma: 0.99
+    gae_lambda: 0.95
+    obs_shape: ???
+    act_dim: ???
     n_envs: ???
-
+    discrete_action: ???
-  # Training
-  learning_rate: 3e-4
-  batch_size: 1024 # Batch size for training.
-  gamma: 0.99 # The amount by which to discount future rewards.
-  n_gradient_steps: 3 # Number of epochs for updating policy.
-  ppo_clip: 0.2 # Clipping parameter for PPO.
-  value_loss_coef: 0.5 # Coefficient for value loss.
-  entropy_coef: 0.01 # Coefficient for entropy loss.
-  max_grad_norm: 0.5 # Maximum value for gradient clipping.
-
+  # Optimiser and update settings
+  learning_rate: 3e-4
+  batch_size: 2048 # samples per update
+  gamma: 0.99
+  ppo_clip: 0.2
+  value_loss_coef: 0.5
+  entropy_coef: 0.01
+  max_grad_norm: 0.5 # gradient clipping
-  hidden_sizes: [64, 64]
-  activation: 'tanh'
+  hidden_sizes: [256, 256]
+  activation: "tanh"
-  n_epochs: 10
-  minibatch_size: 64
-  kl_target: 0.01
-  use_value_clip: True
-  value_clip_eps: 0.2
+  n_gradient_steps: 1 # one gradient step per rollout
+  n_epochs: 10 # ten update epochs per rollout
+  minibatch_size: 128 # 2048 / 128 = 16 minibatches
+  kl_target: null # disable KL-based early stopping
+  use_value_clip: true
-  policy_class: mighty.mighty_exploration.StochasticPolicy # Policy class for exploration
+  policy_class: mighty.mighty_exploration.StochasticPolicy
   policy_kwargs:
-    entropy_coefficient: 0.0 # Coefficient for entropy-based exploration.
+    entropy_coefficient: 0.0
+  # Training
 eval_every_n_steps: 1e4 # After how many steps to evaluate.
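The rollout arithmetic in these PPO configs is easy to get wrong: `buffer_size` is the per-environment rollout length, so the total sample count per update scales with `num_envs`, and `batch_size` must split evenly into minibatches. A quick sanity check, using a hypothetical helper rather than anything from Mighty itself:

```python
def check_rollout_config(num_envs: int, buffer_size: int,
                         batch_size: int, minibatch_size: int) -> tuple[int, int]:
    """Return (total samples per rollout, minibatches per epoch)."""
    total = num_envs * buffer_size  # buffer_size is per environment
    assert batch_size <= total, "batch cannot exceed collected samples"
    assert batch_size % minibatch_size == 0, "batch must split evenly into minibatches"
    return total, batch_size // minibatch_size

# E.g. a 16-env run with a 128-step rollout and the batch/minibatch sizes above:
print(check_rollout_config(16, 128, 2048, 128))  # (2048, 16)
```

With the `num_envs: 64` set in this config, the same rollout length yields 8192 samples per update, of which `batch_size: 2048` is a quarter.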
diff --git a/examples/optuna_example_config.yaml b/examples/optuna_example_config.yaml
index acb42570..7c8ed1a8 100644
--- a/examples/optuna_example_config.yaml
+++ b/examples/optuna_example_config.yaml
@@ -18,49 +18,46 @@ env_kwargs: {}
 env_wrappers: []
 num_envs: 64
-# @package _global_
 algorithm: PPO
 algorithm_kwargs:
-  # Hyperparameters
-  n_policy_units: 128
-  n_critic_units: 128
-  soft_update_weight: 0.01
+  rescale_action: False
+  tanh_squash: False
   rollout_buffer_class:
-    _target_: mighty.mighty_replay.MightyRolloutBuffer # Using rollout buffer
+    _target_: mighty.mighty_replay.MightyRolloutBuffer
+
   rollout_buffer_kwargs:
-    buffer_size: 4096 # Size of the rollout buffer.
-    gamma: 0.99 # Discount factor for future rewards.
-    gae_lambda: 0.95 # GAE lambda.
-    obs_shape: ??? # Placeholder for observation shape
-    act_dim: ??? # Placeholder for action dimension
+    buffer_size: 128 # per-env rollout length
+    gamma: 0.99
+    gae_lambda: 0.95
+    obs_shape: ???
+    act_dim: ???
     n_envs: ???
-
+    discrete_action: ???
-  # Training
-  learning_rate: 3e-4
-  batch_size: 1024 # Batch size for training.
-  gamma: 0.99 # The amount by which to discount future rewards.
-  n_gradient_steps: 3 # Number of epochs for updating policy.
-  ppo_clip: 0.2 # Clipping parameter for PPO.
-  value_loss_coef: 0.5 # Coefficient for value loss.
-  entropy_coef: 0.01 # Coefficient for entropy loss.
-  max_grad_norm: 0.5 # Maximum value for gradient clipping.
-
+  # Optimiser and update settings
+  learning_rate: 3e-4
+  batch_size: 2048 # samples per update
+  gamma: 0.99
+  ppo_clip: 0.2
+  value_loss_coef: 0.5
+  entropy_coef: 0.01
+  max_grad_norm: 0.5 # gradient clipping
-  hidden_sizes: [64, 64]
-  activation: 'tanh'
+  hidden_sizes: [256, 256]
+  activation: "tanh"
-  n_epochs: 10
-  minibatch_size: 64
-  kl_target: 0.01
-  use_value_clip: True
-  value_clip_eps: 0.2
+  n_gradient_steps: 1 # one gradient step per rollout
+  n_epochs: 10 # ten update epochs per rollout
+  minibatch_size: 128 # 2048 / 128 = 16 minibatches
+  kl_target: null # disable KL-based early stopping
+  use_value_clip: true
-  policy_class: mighty.mighty_exploration.StochasticPolicy # Policy class for exploration
+  policy_class: mighty.mighty_exploration.StochasticPolicy
   policy_kwargs:
-    entropy_coefficient: 0.0 # Coefficient for entropy-based exploration.
+    entropy_coefficient: 0.0
+  # Training
 eval_every_n_steps: 1e4 # After how many steps to evaluate.
diff --git a/mighty/mighty_agents/dqn.py b/mighty/mighty_agents/dqn.py
index 0ced9ce4..711f5714 100644
--- a/mighty/mighty_agents/dqn.py
+++ b/mighty/mighty_agents/dqn.py
@@ -121,8 +121,10 @@ def __init__(
         # Policy Class
         policy_class = retrieve_class(cls=policy_class, default_cls=EpsilonGreedy)  # type: ignore
-        if policy_kwargs is None:
+        if policy_kwargs is None and issubclass(policy_class, EpsilonGreedy):
             policy_kwargs = {"epsilon": 0.1}  # type: ignore
+        elif policy_kwargs is None:
+            policy_kwargs = {}
         self.policy_class = policy_class
         self.policy_kwargs = policy_kwargs