This project involved implementing agents that solve Markov Decision Processes (MDPs) using Value Iteration and Q-Learning techniques. We modified three main files:
- valueIterationAgents.py: for implementing Value Iteration
- qlearningAgents.py: for implementing Q-Learning and Approximate Q-Learning
- analysis.py: for parameter tuning of the Gridworld environment
We completed the implementation of the ValueIterationAgent class:
- runValueIteration: Runs batch value iteration for the given number of iterations using the Bellman update equation. Values are updated using a separate copy (newValues) to avoid in-place updates.
- computeQValueFromValues(state, action): Computes the Q-value for a (state, action) pair using the current value function.
- computeActionFromValues(state): Returns the best action based on the current value estimates by selecting the action that yields the highest Q-value.
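The three methods above can be sketched outside the project framework as follows. This is a minimal illustration, not the project's code: the function names, the dictionary-based value table, and the three-state chain MDP are all assumptions made for the example.

```python
def q_value(values, state, action, transitions, reward, discount):
    """Expected return of taking `action` in `state`, then following `values`."""
    return sum(
        prob * (reward(state, action, nxt) + discount * values[nxt])
        for nxt, prob in transitions(state, action)
    )


def value_iteration(states, actions, transitions, reward, discount, iterations):
    """Batch value iteration: each sweep reads only the previous sweep's values."""
    values = {s: 0.0 for s in states}
    for _ in range(iterations):
        new_values = {}  # separate copy: reads use `values`, writes go here
        for s in states:
            new_values[s] = max(
                (q_value(values, s, a, transitions, reward, discount)
                 for a in actions(s)),
                default=0.0,  # terminal states have no legal actions
            )
        values = new_values
    return values


def best_action(values, state, actions, transitions, reward, discount):
    """Greedy action with respect to `values`, or None in a terminal state."""
    legal = actions(state)
    if not legal:
        return None
    return max(
        legal,
        key=lambda a: q_value(values, state, a, transitions, reward, discount),
    )


# Hypothetical deterministic chain: A --go--> B --exit--> done (terminal).
STATES = ["A", "B", "done"]
ACTIONS = {"A": ["go", "stay"], "B": ["exit"], "done": []}
NEXT = {("A", "go"): "B", ("A", "stay"): "A", ("B", "exit"): "done"}


def actions(state):
    return ACTIONS[state]


def transitions(state, action):
    return [(NEXT[(state, action)], 1.0)]  # list of (next_state, probability)


def reward(state, action, nxt):
    return 10.0 if nxt == "done" else 0.0


print(value_iteration(STATES, actions, transitions, reward, 0.5, 2))
```

With a discount of 0.5, one sweep leaves the value of A at 0 because it still reads the old value of B; the second sweep propagates the exit reward back and gives A a value of 5.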
We implemented two agents: QLearningAgent and ApproximateQAgent.
For QLearningAgent:
- getQValue(state, action): Returns Q-values stored in a util.Counter, defaulting to 0.0 for unseen (state, action) pairs.
- computeValueFromQValues(state): Computes the maximum Q-value over all legal actions in a given state.
- computeActionFromQValues(state): Returns the best action based on current Q-values, breaking ties randomly.
- getAction(state): Implements epsilon-greedy action selection: with probability epsilon, chooses a random action; otherwise, chooses the best known action.
- update(state, action, nextState, reward): Performs the Q-learning update rule to adjust the Q-value of the current (state, action) pair.
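A compact, framework-free sketch of a tabular Q-learner with these five methods might look like the following. The class name `TinyQLearner`, the `legal_actions` callback, and the use of `collections.defaultdict` in place of `util.Counter` are assumptions for illustration only.

```python
import random
from collections import defaultdict


class TinyQLearner:
    """Tabular Q-learning sketch (hypothetical class, not the project's agent)."""

    def __init__(self, legal_actions, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.legal_actions = legal_actions  # callable: state -> list of actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(float)  # unseen (state, action) pairs read as 0.0

    def get_q_value(self, state, action):
        return self.q[(state, action)]

    def compute_value(self, state):
        actions = self.legal_actions(state)
        if not actions:
            return 0.0  # terminal state
        return max(self.get_q_value(state, a) for a in actions)

    def compute_action(self, state):
        actions = self.legal_actions(state)
        if not actions:
            return None
        best = self.compute_value(state)
        # Break ties randomly among all actions that reach the best Q-value.
        return random.choice(
            [a for a in actions if self.get_q_value(state, a) == best]
        )

    def get_action(self, state):
        actions = self.legal_actions(state)
        if not actions:
            return None
        if random.random() < self.epsilon:
            return random.choice(actions)  # explore
        return self.compute_action(state)  # exploit

    def update(self, state, action, next_state, reward):
        # Blend the old estimate with the one-step sample of the return.
        sample = reward + self.gamma * self.compute_value(next_state)
        old = self.get_q_value(state, action)
        self.q[(state, action)] = (1 - self.alpha) * old + self.alpha * sample


# Hypothetical two-state world: "s" has two actions, "t" is terminal.
learner = TinyQLearner(
    lambda s: ["left", "right"] if s == "s" else [], epsilon=0.0
)
learner.update("s", "right", "t", 1.0)
print(learner.get_q_value("s", "right"), learner.get_action("s"))
```

Setting epsilon to 0 in the usage lines makes action selection purely greedy, which keeps the example deterministic; a nonzero epsilon would occasionally pick "left" at random.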
For ApproximateQAgent:
- __init__: Initializes the feature extractor and weight counter.
- getQValue(state, action): Calculates Q-values as a dot product between features and learned weights.
- update(state, action, nextState, reward): Updates feature weights using the difference between the predicted and target Q-values.
- final(state) (optional): Displays final weights after training is complete.
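The weight update described above can be sketched in isolation. The class name `TinyApproximateQLearner`, the `feature_fn` callback returning a dict of feature values, and the toy features are hypothetical and stand in for the project's feature extractor.

```python
from collections import defaultdict


class TinyApproximateQLearner:
    """Linear Q-function sketch: Q(s, a) = sum_i w_i * f_i(s, a)."""

    def __init__(self, feature_fn, legal_actions, alpha=0.5, gamma=0.9):
        self.feature_fn = feature_fn        # callable: (state, action) -> {name: value}
        self.legal_actions = legal_actions  # callable: state -> list of actions
        self.alpha = alpha
        self.gamma = gamma
        self.weights = defaultdict(float)   # all weights start at 0.0

    def get_q_value(self, state, action):
        # Dot product between the feature vector and the weight vector.
        features = self.feature_fn(state, action)
        return sum(self.weights[name] * value for name, value in features.items())

    def compute_value(self, state):
        actions = self.legal_actions(state)
        if not actions:
            return 0.0  # terminal state
        return max(self.get_q_value(state, a) for a in actions)

    def update(self, state, action, next_state, reward):
        # difference = target Q-value minus current prediction
        target = reward + self.gamma * self.compute_value(next_state)
        difference = target - self.get_q_value(state, action)
        for name, value in self.feature_fn(state, action).items():
            self.weights[name] += self.alpha * difference * value


# Hypothetical features: a constant bias plus an indicator for moving right.
def toy_features(state, action):
    return {"bias": 1.0, "is_right": 1.0 if action == "right" else 0.0}


approx = TinyApproximateQLearner(
    toy_features, lambda s: ["left", "right"] if s == "s" else []
)
approx.update("s", "right", "t", 1.0)
print(dict(approx.weights))
```

Because the bias feature is shared by both actions, a single update for "right" also raises the estimate for "left", which is the kind of generalization across (state, action) pairs that a tabular learner does not provide.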
We implemented functions question3a() through question3e() to return specific parameter settings for the DiscountGrid environment.
Each function returns a tuple of:
- Discount factor
- Noise
- Living reward
These were tuned to encourage different policies (e.g., risk-taking, avoiding the cliff, preferring distant rewards, or avoiding all exits).
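For reference, the shape of such a function is sketched below. The function name and the numbers are placeholders chosen for illustration; they are not the tuned settings from analysis.py.

```python
def example_question():
    """Return (discount, noise, living_reward) for a Gridworld run.

    Hypothetical function with placeholder values, shown only to
    illustrate the return shape; not the project's tuned parameters.
    """
    answer_discount = 0.9        # weight on future rewards, between 0 and 1
    answer_noise = 0.2           # probability that a move goes astray
    answer_living_reward = 0.0   # reward received on each non-terminal step
    return answer_discount, answer_noise, answer_living_reward


discount, noise, living_reward = example_question()
print(discount, noise, living_reward)
```

As a general rule of thumb, a lower discount favors nearby exits, lower noise makes risky paths safer, and a more negative living reward pushes the agent to exit sooner.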
Each question was tested using the autograder:
python autograder.py -q q1
python autograder.py -q q2
...
python autograder.py -q q6