Merged
2 changes: 1 addition & 1 deletion README.md
@@ -86,7 +86,7 @@ Before you start training, however, please follow the installation instructions
Then use the same command as before, but provide the CARL environment, in this example CARLCartPoleEnv,
and information about the context distribution as keywords:
```bash
-python mighty/run_mighty.py 'algorithm=dqn' 'env=CARLCartPole' 'num_envs=10' '+env_kwargs.num_contexts=10' '+env_kwargs.context_feature_args.gravity=[normal, 9.8, 1.0, -100.0, 100.0]' 'env_wrappers=[mighty.mighty_utils.wrappers.FlattenVecObs]'
+python mighty/run_mighty.py 'algorithm=ppo' 'env=CARLCartPole' '+env_kwargs.num_contexts=10' '+env_kwargs.context_feature_args.gravity=[normal, 9.8, 1.0, -100.0, 100.0]' 'env_wrappers=[mighty.mighty_utils.wrappers.FlattenVecObs]' 'algorithm_kwargs.rollout_buffer_kwargs.buffer_size=2048'
```

For more complex configurations like this, we recommend making an environment configuration file. Check out our [CARL Ant](mighty/configs/environment/carl_walkers/ant_goals.yaml) file to see how this simplifies the process of working with configurable environments.
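As a rough sketch of what such an environment configuration file might contain (the keys mirror the CLI overrides above, but the exact file layout is an assumption, not copied from the linked `ant_goals.yaml`):

```yaml
# Hypothetical environment config sketch; key names mirror the CLI
# overrides above, the exact schema is an assumption.
# @package _global_
env: CARLCartPole
env_kwargs:
  num_contexts: 10
  context_feature_args:
    gravity: [normal, 9.8, 1.0, -100.0, 100.0]
env_wrappers: [mighty.mighty_utils.wrappers.FlattenVecObs]
```

A file like this would then be selected with a single `environment=...` override instead of repeating every keyword on the command line.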
10 changes: 5 additions & 5 deletions examples/README.md
@@ -73,7 +73,7 @@ python mighty/run_mighty.py 'env=CartPole-v1'
We can also be more specific, e.g. by adding our desired number of interaction steps and the number of parallel environments we want to run:

```bash
-python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=10'
+python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=16'
```
For some environments, including CartPole-v1, these details are pre-configured in the Mighty configs, meaning we can use the environment keyword to set them all at once:

@@ -98,7 +98,7 @@ python mighty/run_mighty.py 'environment=gymnasium/cartpole' 'algorithm=dqn' 'al
Or to use e.g. an ez-greedy exploration policy for DQN:

```bash
-python mighty/run_mighty.py 'environment=gymnasium/cartpole' 'algorithm=dqn' '+algorithm_kwargs.policy_class=mighty.mighty_exploration.EZGreedy'
+python mighty/run_mighty.py 'environment=gymnasium/cartpole' 'algorithm=dqn' 'algorithm_kwargs.policy_class=mighty.mighty_exploration.EZGreedy' 'algorithm_kwargs.policy_kwargs=null'
```
You can see that in this case, the value we pass to the script is a class name string which can take the value of any function you want, including custom ones as we'll see further down.
</details>
@@ -109,7 +109,7 @@ You can see that in this case, the value we pass to the script is a class name s
The meta components are a bit more complex, since they are a list of class names and optional keyword arguments:

```bash
-python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=10' '+algorithm_kwargs.meta_methods=[mighty.mighty_meta.RND]'
+python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=16' '+algorithm_kwargs.meta_methods=[mighty.mighty_meta.RND]'
```
As this can become complex, we recommend configuring these in Hydra config files.
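A hedged sketch of what such a config file entry could look like (the nested list form is inferred from the CLI override above; the `meta_kwargs` key is an assumption, not taken from the Mighty docs):

```yaml
# Hypothetical algorithm_kwargs fragment; mirrors the
# '+algorithm_kwargs.meta_methods=[...]' override above.
algorithm_kwargs:
  meta_methods:
    - mighty.mighty_meta.RND
  meta_kwargs: []  # optional per-method keyword arguments (key name assumed)
```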
</details>
@@ -121,7 +121,7 @@ Hydra has a multirun functionality with which you can specify a grid of argument
Its best use is probably for easily running multiple seeds at once like this:

```bash
-python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=10' 'seed=0,1,2,3,4' 'output_dir=examples/multiple_runs' -m
+python mighty/run_mighty.py 'env=CartPole-v1' 'num_steps=50_000' 'num_envs=16' 'seed=0,1,2,3,4' 'output_dir=examples/multiple_runs' -m
```
</details>

@@ -196,7 +196,7 @@ Compare their structure: the custom policy has a fixed set of methods inherited

If you want to run these custom modules, you can do so by adding them by their import path:
```bash
-python mighty/run_mighty.py 'algorithm=dqn' '+algorithm_kwargs.policy_class=examples.custom_policy.QValueUCB' '+algorithm_kwargs.policy_kwargs={}'
+python mighty/run_mighty.py 'algorithm=dqn' 'algorithm_kwargs.policy_class=examples.custom_policy.QValueUCB' 'algorithm_kwargs.policy_kwargs=null'
```
For the meta-module, it works exactly the same way.
59 changes: 28 additions & 31 deletions examples/hypersweeper_smac_example_config.yaml
@@ -17,49 +17,46 @@ env_kwargs: {}
env_wrappers: []
num_envs: 64

# @package _global_
algorithm: PPO

algorithm_kwargs:
# Hyperparameters
n_policy_units: 128
n_critic_units: 128
soft_update_weight: 0.01
rescale_action: False
tanh_squash: False

rollout_buffer_class:
-  _target_: mighty.mighty_replay.MightyRolloutBuffer # Using rollout buffer
+  _target_: mighty.mighty_replay.MightyRolloutBuffer

rollout_buffer_kwargs:
-  buffer_size: 4096 # Size of the rollout buffer.
-  gamma: 0.99 # Discount factor for future rewards.
-  gae_lambda: 0.95 # GAE lambda.
-  obs_shape: ??? # Placeholder for observation shape
-  act_dim: ??? # Placeholder for action dimension
+  buffer_size: 128 # size of the rollout buffer
+  gamma: 0.99
+  gae_lambda: 0.95
+  obs_shape: ???
+  act_dim: ???
+  n_envs: ???

discrete_action: ???

-  # Training
-  learning_rate: 3e-4
-  batch_size: 1024 # Batch size for training.
-  gamma: 0.99 # The amount by which to discount future rewards.
-  n_gradient_steps: 3 # Number of epochs for updating policy.
-  ppo_clip: 0.2 # Clipping parameter for PPO.
-  value_loss_coef: 0.5 # Coefficient for value loss.
-  entropy_coef: 0.01 # Coefficient for entropy loss.
-  max_grad_norm: 0.5 # Maximum value for gradient clipping.
+  # Optimiser and update settings
+  learning_rate: 3e-4
+  batch_size: 2048 # samples per update
+  gamma: 0.99
+  ppo_clip: 0.2
+  value_loss_coef: 0.5
+  entropy_coef: 0.01
+  max_grad_norm: 0.5 # gradient clipping

-  hidden_sizes: [64, 64]
-  activation: 'tanh'
+  hidden_sizes: [256, 256]
+  activation: "tanh"

-  n_epochs: 10
-  minibatch_size: 64
-  kl_target: 0.01
-  use_value_clip: True
-  value_clip_eps: 0.2
+  n_gradient_steps: 1 # one gradient step per rollout
+  n_epochs: 10 # ten update epochs per rollout
+  minibatch_size: 128 # 2048 / 128 = 16 minibatches
+  kl_target: null # disable KL-based early stopping
+  use_value_clip: true

-  policy_class: mighty.mighty_exploration.StochasticPolicy # Policy class for exploration
+  policy_class: mighty.mighty_exploration.StochasticPolicy
policy_kwargs:
-    entropy_coefficient: 0.0 # Coefficient for entropy-based exploration.
+    entropy_coefficient: 0.0


# Training
eval_every_n_steps: 1e4 # After how many steps to evaluate.
59 changes: 28 additions & 31 deletions examples/optuna_example_config.yaml
@@ -18,49 +18,46 @@ env_kwargs: {}
env_wrappers: []
num_envs: 64

# @package _global_
algorithm: PPO

algorithm_kwargs:
# Hyperparameters
n_policy_units: 128
n_critic_units: 128
soft_update_weight: 0.01
rescale_action: False
tanh_squash: False

rollout_buffer_class:
-  _target_: mighty.mighty_replay.MightyRolloutBuffer # Using rollout buffer
+  _target_: mighty.mighty_replay.MightyRolloutBuffer

rollout_buffer_kwargs:
-  buffer_size: 4096 # Size of the rollout buffer.
-  gamma: 0.99 # Discount factor for future rewards.
-  gae_lambda: 0.95 # GAE lambda.
-  obs_shape: ??? # Placeholder for observation shape
-  act_dim: ??? # Placeholder for action dimension
+  buffer_size: 128 # size of the rollout buffer
+  gamma: 0.99
+  gae_lambda: 0.95
+  obs_shape: ???
+  act_dim: ???
+  n_envs: ???

discrete_action: ???

-  # Training
-  learning_rate: 3e-4
-  batch_size: 1024 # Batch size for training.
-  gamma: 0.99 # The amount by which to discount future rewards.
-  n_gradient_steps: 3 # Number of epochs for updating policy.
-  ppo_clip: 0.2 # Clipping parameter for PPO.
-  value_loss_coef: 0.5 # Coefficient for value loss.
-  entropy_coef: 0.01 # Coefficient for entropy loss.
-  max_grad_norm: 0.5 # Maximum value for gradient clipping.
+  # Optimiser and update settings
+  learning_rate: 3e-4
+  batch_size: 2048 # samples per update
+  gamma: 0.99
+  ppo_clip: 0.2
+  value_loss_coef: 0.5
+  entropy_coef: 0.01
+  max_grad_norm: 0.5 # gradient clipping

-  hidden_sizes: [64, 64]
-  activation: 'tanh'
+  hidden_sizes: [256, 256]
+  activation: "tanh"

-  n_epochs: 10
-  minibatch_size: 64
-  kl_target: 0.01
-  use_value_clip: True
-  value_clip_eps: 0.2
+  n_gradient_steps: 1 # one gradient step per rollout
+  n_epochs: 10 # ten update epochs per rollout
+  minibatch_size: 128 # 2048 / 128 = 16 minibatches
+  kl_target: null # disable KL-based early stopping
+  use_value_clip: true

-  policy_class: mighty.mighty_exploration.StochasticPolicy # Policy class for exploration
+  policy_class: mighty.mighty_exploration.StochasticPolicy
policy_kwargs:
-    entropy_coefficient: 0.0 # Coefficient for entropy-based exploration.
+    entropy_coefficient: 0.0


# Training
eval_every_n_steps: 1e4 # After how many steps to evaluate.
4 changes: 3 additions & 1 deletion mighty/mighty_agents/dqn.py
@@ -121,8 +121,10 @@ def __init__(

# Policy Class
         policy_class = retrieve_class(cls=policy_class, default_cls=EpsilonGreedy)  # type: ignore
-        if policy_kwargs is None:
+        if policy_kwargs is None and issubclass(policy_class, EpsilonGreedy):
             policy_kwargs = {"epsilon": 0.1}  # type: ignore
+        elif policy_kwargs is None:
+            policy_kwargs = {}
self.policy_class = policy_class
self.policy_kwargs = policy_kwargs
