Vitamin 10

School: University of California, Berkeley
Course: C102
Subject: Computer Science
Date: Apr 29, 2024

Q1 (1 Point)
Calculate the Hoeffding bound for a variable $Z$ which is the sample mean of 10 random variables which are each bounded between 3 and 8. In other words, compute a bound for the following probability, where $\mu$ is the expected value of $Z$:

$P(Z - \mu \geq t)$

Submitted answer: with $Z = \frac{1}{10}\sum_{i=1}^{10} X_i$ and each $X_i \in [3, 8]$, Hoeffding's inequality gives $P(Z - \mu \geq t) \leq \exp\!\left(-\frac{2 \cdot 10 \cdot t^2}{(8-3)^2}\right) = \exp\!\left(-\frac{4t^2}{5}\right)$.
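A minimal sketch of this calculation (the uniform distribution below is assumed only for the Monte Carlo sanity check; the bound holds for any distribution supported on [3, 8]):

```python
import numpy as np

# Hoeffding bound for the sample mean Z of n i.i.d. variables in [a, b]:
#   P(Z - mu >= t) <= exp(-2 * n * t**2 / (b - a)**2)
# With n = 10 and [a, b] = [3, 8] this is exp(-4 * t**2 / 5).
def hoeffding_bound(t, n=10, a=3.0, b=8.0):
    return np.exp(-2 * n * t**2 / (b - a) ** 2)

# Monte Carlo sanity check with uniform draws on [3, 8] (an assumed choice of
# bounded distribution); the empirical tail should sit below the bound.
rng = np.random.default_rng(0)
t, mu = 1.0, (3 + 8) / 2
sample_means = rng.uniform(3, 8, size=(200_000, 10)).mean(axis=1)
empirical_tail = (sample_means - mu >= t).mean()
print(f"bound: {hoeffding_bound(t):.4f}  empirical tail (uniform case): {empirical_tail:.4f}")
```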
Q2 Bandits (2 Points)
Let $A$ be the set of arms in the setting of the bandits problem. For each arm $a \in A$, let the random variable $X_i^a$ be the random reward received for pulling arm $a$ at time $i$, and let $\mu_a = \mathbb{E}[X_i^a]$ be the expected reward from pulling arm $a$. Finally, let the random variable $A_i$ be the arm pulled at time $i$ (so $X_i^{A_i}$ is the reward received at time $i$). All expectations (e.g., $\mathbb{E}[X_i^a]$, $\mathbb{E}[X_i^{A_i}]$) are taken over the randomness in the payouts as well as the arm pulled at each timestep.

Q2.1 Regret (1 Point)
Select all of the definitions which are true.
- The regret is $R(t) = \sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - X_i^{A_i} \right)$.
- The regret is $R(t) = \sum_{i=1}^{t} \left( \max_{a \in A} \mu_a - X_i^{A_i} \right)$.
- The regret is $R(t) = \sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - \mathbb{E}[X_i^{A_i}] \right)$.
- The regret is $R(t) = \sum_{i=1}^{t} \left( \max_{a \in A} \mu_a - \mathbb{E}[X_i^{A_i}] \right)$.
- The regret is $R(t) = t \max_{a \in A} \mu_a - \sum_{i=1}^{t} X_i^{A_i}$.
- The regret is $R(t) = t \max_{a \in A} \mu_a - \sum_{i=1}^{t} \mathbb{E}[X_i^{A_i}]$.
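As a concrete illustration of these quantities (an addition, not part of the quiz), here is a minimal sketch that simulates a Bernoulli bandit under a uniformly random arm-pulling policy and evaluates the reward-based form $t \max_a \mu_a - \sum_i X_i^{A_i}$ from the options above; the arm means are assumed for illustration:

```python
import numpy as np

# Minimal sketch (assumed Bernoulli arm means, uniformly random policy):
# simulate T pulls and evaluate t * max_a mu_a - sum_i X_i^{A_i}.
rng = np.random.default_rng(1)
mu = np.array([0.2, 0.5, 0.8])             # true expected reward per arm (assumed)
T = 1000

pulled = rng.integers(0, len(mu), size=T)  # A_i: arm chosen at each timestep
rewards = rng.binomial(1, mu[pulled])      # X_i^{A_i}: realized payouts

regret = T * mu.max() - rewards.sum()
print(f"regret after {T} pulls: {regret:.1f}")
```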
Q2.2 Expected Regret and Pseudo-Regret (1 Point)
Select all of the definitions which are true.
- The expected regret is the expectation of the regret over the randomness in the payout, conditioned on the sequence of choices made.
- The expected regret is the expectation of the regret over the randomness in the payout and the sequence of choices made.
- The expected regret is $\sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - X_i^{A_i} \right)$.
- The expected regret is $\sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - \mathbb{E}[X_i^{A_i}] \right)$.
- The expected regret is $\sum_{i=1}^{t} \left( \max_{a \in A} \mu_a - \mathbb{E}[X_i^{A_i}] \right)$.
- The pseudo-regret is the expectation of the regret over the randomness in the payout, conditioned on the sequence of choices made.
- The pseudo-regret is the expectation of the regret over the randomness in the payout and the sequence of choices made.
- The pseudo-regret is $\sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - X_i^{A_i} \right)$.
- The pseudo-regret is $\sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - \mathbb{E}[X_i^{A_i}] \right)$.
- The pseudo-regret is $\sum_{i=1}^{t} \left( \max_{a \in A} \mu_a - \mathbb{E}[X_i^{A_i}] \right)$.
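For intuition on the distinction these options draw (again, an illustrative addition), a small sketch that evaluates two of the candidate expressions on the same simulated bandit and random policy, averaged over independent runs: replacing realized payouts $X_i^{A_i}$ with their means makes the per-run quantity deterministic given the choice sequence, while the reward-based quantity fluctuates around it.

```python
import numpy as np

# Sketch (assumed Bernoulli arms, uniformly random policy): compare
# sum(mu* - X_i^{A_i}) with sum(mu* - mu_{A_i}), averaged over runs.
rng = np.random.default_rng(2)
mu = np.array([0.2, 0.5, 0.8])
T, runs = 500, 200

def one_run():
    pulled = rng.integers(0, len(mu), size=T)    # choice sequence A_1..A_T
    rewards = rng.binomial(1, mu[pulled])        # realized payouts X_i^{A_i}
    reward_based = (mu.max() - rewards).sum()    # fluctuates run to run
    mean_based = (mu.max() - mu[pulled]).sum()   # fixed given the choices
    return reward_based, mean_based

reward_based, mean_based = np.mean([one_run() for _ in range(runs)], axis=0)
print(f"avg reward-based: {reward_based:.1f}  avg mean-based: {mean_based:.1f}")
```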
Q3 Dynamic Programming (1 Point)
Which of the following are true statements about dynamic programming?
- It is a technique for improving the accuracy of certain algorithms
- Backpropagation is an example of dynamic programming
- Memoization is an example of dynamic programming
- If we can use dynamic programming for a particular problem, we will end up with a more efficient algorithm

Q4 Markov Decision Processes (1 Point)
When transitioning from $s_t$ to $s_{t+1}$ in a Markov Decision Process, we don't need to know anything about previous states.
- True
- False
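As background for the memoization option in Q3 (an addition, not part of the quiz), a minimal memoized-Fibonacci sketch, the textbook example of caching overlapping subproblems:

```python
from functools import lru_cache

# Memoization, a standard dynamic-programming pattern: cache the answers to
# overlapping subproblems so each is solved once. Naive Fibonacci recursion is
# exponential; with the cache it runs in linear time.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025, computed from only ~50 cached subproblems
```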
Q5 (2 Points)

Q5.1 (1 Point)
Consider the following gridworld environment (figure not reproduced in this extraction): X represents an inaccessible state, and 1 and 10 are terminal states with values of 1 and 10, respectively. Transitions between all states have a reward of 0. Assume state transitions are deterministic. In other words, if our action is to move in a particular direction, we always move in that direction, unless there is a wall, in which case we stay in that same state. If the discount factor ($\gamma$) is 0.7, compute the optimal value $V(A)$ for the state marked $A$.

Submitted answer: 7.2
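Since the gridworld figure is not recoverable here, the following is only a sketch of how $V(A)$ would be computed by value iteration under these rules ($\gamma = 0.7$, zero rewards, terminal values 1 and 10). The 3x4 layout below, and therefore the printed value, is an assumption, not the quiz's grid:

```python
# Value-iteration sketch for a deterministic gridworld with gamma = 0.7.
# The 3x4 layout below is an illustrative assumption (the quiz's figure is
# not reproduced): "X" is inaccessible, "1" and "10" are terminal states,
# every transition has reward 0, so V(s) = gamma * max over moves of V(s').
grid = [[" ", " ", " ", "10"],
        [" ", "X", " ", " "],
        ["A", " ", " ", "1"]]
gamma = 0.7
terminal = {"1": 1.0, "10": 10.0}

V = {(r, c): terminal.get(cell, 0.0)
     for r, row in enumerate(grid) for c, cell in enumerate(row) if cell != "X"}

for _ in range(100):                       # enough sweeps to converge here
    for (r, c) in V:
        if grid[r][c] in terminal:
            continue                       # terminal values stay fixed
        moves = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        # hitting a wall or the inaccessible cell leaves the agent in place
        values = [V.get(nxt, V[(r, c)]) for nxt in moves]
        V[(r, c)] = gamma * max(values)

a_pos = next((r, c) for r, row in enumerate(grid)
             for c, cell in enumerate(row) if cell == "A")
print(f"V(A) on this assumed layout: {V[a_pos]:.3f}")
```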
Q5.2 Gridworld (1 Point)
Suppose we change movement in Gridworld so that when the robot takes an action to move in a particular direction, it has a 70% chance to move in that direction, a 10% chance to move sideways from that direction, and a 10% chance to continue in whatever direction it moved before (e.g., the motor is stuck). If possible, what change(s) is/are necessary to model these changes in the Markov Decision Process (MDP)?
- Impossible: since the potential next states now depend on the previous action, we can no longer use a Markov Decision Process.
- Augment the state with the previous direction moved, so that each state is now a (location, previous action) pair
- Augment the action to include two actions at a time, so that each action is now a (current action, previous action) pair
- Augment the action to include two actions at a time, so that each action is now a (current action, next action) pair
- Change the transition probabilities to include a chance of moving in the previous direction
- Change the reward function by increasing the reward for moving in the same direction twice
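A sketch of the state-augmentation option from the list above: making each state a (location, previous direction) pair keeps the sticky-motor dynamics Markov. The grid size, helper names, and the reading of "10% sideways" as 10% per side (so the probabilities sum to 1) are all assumptions for illustration:

```python
import random

# Assumed encoding: a state is (location, prev_direction); the transition
# samples the realized move as 70% intended, 10% each sideways, 10% previous.
DIRS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIDEWAYS = {"up": ("left", "right"), "down": ("left", "right"),
            "left": ("up", "down"), "right": ("up", "down")}

def step(loc, direction, rows=3, cols=4):
    """Move one cell in `direction`, staying put when a wall blocks the move."""
    r, c = loc
    dr, dc = DIRS[direction]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 0 <= nr < rows and 0 <= nc < cols else (r, c)

def sample_next_state(state, action, rng=random):
    """Sample the next augmented state from the current (location, prev) state."""
    loc, prev = state
    u = rng.random()
    if u < 0.7:
        moved = action                 # intended direction
    elif u < 0.8:
        moved = SIDEWAYS[action][0]    # sideways slip, one side
    elif u < 0.9:
        moved = SIDEWAYS[action][1]    # sideways slip, other side
    else:
        moved = prev                   # motor stuck: repeat the previous move
    return (step(loc, moved), moved)

print(sample_next_state(((2, 0), "up"), "right"))
```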
Q6 True or False (3 Points)
Identify whether the following are true or false.

Q6.1 True or False (1 Point)
Regret from the Explore-then-Commit algorithm grows logarithmically in the number of timesteps, which is better than the Upper Confidence Bound algorithm with a linear order of growth.
- True
- False

Q6.2 (1 Point)
Explore-then-Commit with a very large number of exploration timesteps (e.g., $m = 1000$) is a better choice than UCB.
- True
- False
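For context on Q6.1 and Q6.2 (an illustrative addition, not course code), a minimal UCB1 sketch on assumed Bernoulli arms: UCB keeps pulling whichever arm has the highest empirical mean plus a $\sqrt{2 \ln t / n_a}$ exploration bonus, and its regret grows logarithmically with the horizon.

```python
import numpy as np

# Minimal UCB1 sketch on assumed Bernoulli arms (illustrative means below).
rng = np.random.default_rng(3)
mu = np.array([0.2, 0.5, 0.8])   # true arm means (assumed)
T = 5000
counts = np.zeros(len(mu))       # n_a: number of pulls of each arm
sums = np.zeros(len(mu))         # total reward collected from each arm

for t in range(1, T + 1):
    if t <= len(mu):
        arm = t - 1                                   # pull each arm once first
    else:
        bonus = np.sqrt(2 * np.log(t) / counts)       # exploration bonus
        arm = int(np.argmax(sums / counts + bonus))
    reward = rng.binomial(1, mu[arm])
    counts[arm] += 1
    sums[arm] += reward

print("pulls per arm:", counts.astype(int))
print(f"pseudo-regret: {T * mu.max() - (counts * mu).sum():.1f}")
```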
Q6.3 (1 Point)
One of the advantages of Thompson sampling compared to the UCB algorithm is that, regardless of the prior we choose, Thompson sampling will always yield better results than the UCB algorithm at every time step.
- True
- False

Graded Vitamin 10
Select each question to review feedback and grading details.
Student: Jesus Reynosa
Total Points: 8 / 10 pts

Question 1 (no title): 1 / 1 pt
Question 2 Bandits: 1 / 2 pts
  - 2.1 Regret: 1 / 1 pt
  - 2.2 Expected Regret and Pseudo-Regret: 0 / 1 pt
Question 3 Dynamic Programming: 1 / 1 pt
Question 4 Markov Decision Processes: 1 / 1 pt
Question 5 (no title): 1 / 2 pts
  - 5.1 (no title): 0 / 1 pt
  - 5.2 Gridworld: 1 / 1 pt
Question 6 True or False: 3 / 3 pts
  - 6.1 True or False: 1 / 1 pt
  - 6.2 (no title): 1 / 1 pt
  - 6.3 (no title): 1 / 1 pt