Reinforcement-Learning-Multi-Arm-Bandit

Simulations and exercises on the multi-armed bandit problem:

- Simulating the multi-armed bandit problem
- Practicing the Upper Confidence Bound (UCB) algorithm
- Practicing the KL-divergence algorithm
- Practicing constrained stochastic optimization with a power control system
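
As an illustration of the Upper Confidence Bound item, here is a minimal sketch of the standard UCB1 rule on simulated Bernoulli arms. It is not taken from this repository: the function name `ucb1`, the arm means, the horizon, and the seed are all assumptions chosen for the example.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms (hypothetical helper).

    Returns (pull counts per arm, total reward collected).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Pull every arm once so each empirical mean is defined.
            arm = t - 1
        else:
            # Index = empirical mean + exploration bonus sqrt(2 ln t / n_i).
            arm = max(
                range(n_arms),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        # Simulated Bernoulli reward with the (assumed) true mean of the arm.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total


if __name__ == "__main__":
    counts, total = ucb1([0.2, 0.5, 0.8], horizon=5000)
    print("pulls per arm:", counts)
    print("total reward:", total)
```

Over a long enough horizon the arm with the highest mean should receive most of the pulls, while the exploration bonus keeps the other arms sampled occasionally.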