2.
Assume a learning rate of α = 0.5, a discount factor of γ = 0.5, an initial Q-table of all zeros, and the following experience traces (given as (s, a, s', r) tuples):
(a) (s2, Up, s1, -0.04)
(b) (s1, Right, s4, 1.0)
(c) (s2, Right, s3, -0.04)
(d) (s3, Up, s2, -0.04)
(e) (s2, Up, s1, -0.04)
(f) (s1, Right, s4, 1.0)
A. Assuming that the world resets after the agent visits state s4, and the agent starts in state s2, do these experience traces suggest that this is a greedy agent? Why or why not?
B. If a greedy agent were being used to generate experience traces for Q-Learning in this environment, would we be guaranteed to visit every state (in the limit)? What single aspect of the environment could be changed to flip your answer (yes to no, no to yes)?
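As a worked illustration of the setup, the sketch below replays the six traces through the standard tabular Q-learning update, Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]. The four-action set (Up, Down, Left, Right) is an assumption; the problem only shows Up and Right, and any omitted action simply keeps its initial value of 0.

```python
from collections import defaultdict

ALPHA = 0.5   # learning rate from the problem statement
GAMMA = 0.5   # discount factor from the problem statement

# Experience traces (s, a, s', r) copied from the problem.
traces = [
    ("s2", "Up",    "s1", -0.04),
    ("s1", "Right", "s4",  1.0),
    ("s2", "Right", "s3", -0.04),
    ("s3", "Up",    "s2", -0.04),
    ("s2", "Up",    "s1", -0.04),
    ("s1", "Right", "s4",  1.0),
]

# Assumed action set; actions never taken stay at their initial value of 0.
ACTIONS = ("Up", "Down", "Left", "Right")

# Q-table initialized to all zeros, as the problem specifies.
Q = defaultdict(float)  # keyed by (state, action)

for s, a, s_next, r in traces:
    # Tabular Q-learning update:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

for (s, a), q in sorted(Q.items()):
    if q != 0.0:
        print(f"Q({s}, {a}) = {q:.4f}")
```

Replaying the traces in order, the second visit to (s1, Right) raises Q(s1, Right) from 0.5 to 0.75, since s4's Q-values are still zero when the bootstrap term is computed.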