To account for the effects of stochasticity from initialization and the training process, each option [theta x, theta y, ...] could be repeated up to N times (I'm calling these 'runs'), and the decision about the best hyperparameter option could be made using some aggregation (mean, median, etc.) across runs. Maybe extreme performance values (high or low) could be rejected if they look like outliers. A rough sketch of what I mean is below.
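Here is a minimal sketch (Python), assuming a hypothetical `train_and_evaluate(theta)` that trains a model with hyperparameters `theta` from a fresh initialization and returns a validation score; the 2-sigma outlier cutoff is just one arbitrary choice:

```python
import numpy as np

def score_option(theta, train_and_evaluate, n_runs=5):
    """Repeat one hyperparameter option several times and aggregate the scores.

    `train_and_evaluate` is a placeholder for whatever trains a model with
    hyperparameters `theta` (new random seed each call) and returns a
    validation score.
    """
    scores = np.array([train_and_evaluate(theta) for _ in range(n_runs)])

    # Optionally reject outliers, e.g. anything more than 2 standard
    # deviations from the mean, before aggregating.
    keep = np.abs(scores - scores.mean()) <= 2 * scores.std()
    trimmed = scores[keep] if keep.any() else scores

    # Aggregate across runs; median is more robust, mean is also an option.
    return np.median(trimmed)
```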
Maybe for each hyperparameter option, running statistics (mean, variance) could be tracked, and the runs could be stopped before reaching N if the stats stop changing by much. Something like the sketch below is what I have in mind.
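A rough sketch, again assuming the hypothetical `train_and_evaluate(theta)`; the running mean/variance are updated with Welford's online algorithm, and the stopping tolerance `tol` is an arbitrary example value:

```python
def score_option_early_stop(theta, train_and_evaluate,
                            max_runs=10, min_runs=3, tol=1e-3):
    """Track running mean/variance across repeated runs and stop early
    once the running mean stabilizes."""
    mean = 0.0
    m2 = 0.0  # sum of squared deviations (Welford's online update)
    for n in range(1, max_runs + 1):
        score = train_and_evaluate(theta)
        delta = score - mean
        mean += delta / n
        m2 += delta * (score - mean)

        # Stop once a few runs are in and the last run barely moved the mean.
        if n >= min_runs and abs(delta) / n < tol:
            break

    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance, n
```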
Another question is how to integrate cross-validation (to account for bad validation splits) within each run. Should each run be fully cross-validated, or should each run get only one (or more) of the folds?
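To make the two alternatives concrete, here is a sketch of both, assuming a hypothetical `fit_and_score(theta, X_tr, y_tr, X_val, y_val)` that trains on one fold and scores on its held-out part (uses scikit-learn's `KFold` for the splits):

```python
import numpy as np
from sklearn.model_selection import KFold

def run_with_full_cv(theta, fit_and_score, X, y, n_splits=5, seed=None):
    """Option A: every run does a full k-fold CV and returns the mean
    fold score, so N runs cost N * k trainings."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_scores = [
        fit_and_score(theta, X[tr], y[tr], X[val], y[val])
        for tr, val in kf.split(X)
    ]
    return np.mean(fold_scores)

def run_with_one_fold(theta, fit_and_score, X, y, run_idx, n_splits=5):
    """Option B: each run is paired with a single fold, so N runs together
    cover both random seeds and splits at roughly 1/k of the cost."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    folds = list(kf.split(X))
    tr, val = folds[run_idx % n_splits]
    return fit_and_score(theta, X[tr], y[tr], X[val], y[val])
```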