Vitamin 10
University of California, Berkeley
Course: C102 (Computer Science)
Q1
1 Point
Calculate the Hoeffding bound for a variable $Z$ which is the sample mean of 10 random variables, each bounded between 3 and 8. In other words, compute a bound for the probability $P(Z - \mu \ge t)$ for $t \ge 0$, where $Z = \frac{1}{10}\sum_{i=1}^{10} X_i$ and $\mu = \mathbb{E}[X_i]$ is the expected value of $Z$. (Recall the Chernoff bound $P(Z - \mu \ge t) \le \min_{\lambda \ge 0} e^{-\lambda t}\, \mathbb{E}\!\left[e^{\lambda (Z - \mu)}\right]$.)
Submitted answer: $P(Z - \mu \ge t) \le \exp\!\left(-\frac{2 \cdot 10\, t^2}{(8 - 3)^2}\right) = \exp\!\left(-\frac{4}{5}\, t^2\right)$
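To sanity-check a bound like this numerically, here is a minimal sketch; the Uniform(3, 8) distribution is an assumption purely for illustration, since Hoeffding only requires boundedness.

```python
import numpy as np

rng = np.random.default_rng(0)

def hoeffding_bound(n: int, a: float, b: float, t: float) -> float:
    """Hoeffding: P(sample mean - mu >= t) <= exp(-2 n t^2 / (b - a)^2)."""
    return float(np.exp(-2 * n * t**2 / (b - a) ** 2))

# Empirical tail frequency vs. the bound, for illustrative Uniform(3, 8) draws.
n, a, b, t = 10, 3.0, 8.0, 1.0
mu = (a + b) / 2  # mean of Uniform(3, 8)
Z = rng.uniform(a, b, size=(100_000, n)).mean(axis=1)
print("empirical P(Z - mu >= t):", (Z - mu >= t).mean())
print("Hoeffding bound:         ", hoeffding_bound(n, a, b, t))  # exp(-4/5) ~ 0.449
```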
Q2 Bandits
2 Points
Let $A$ be the set of arms in the setting of the bandits problem. For each arm $a \in A$, let the random variable $X_i^a$ be the random reward received for pulling arm $a$ at time $i$, and let $\mu_a = \mathbb{E}[X_i^a]$ be the expected reward from pulling arm $a$. Finally, let the random variable $A_i$ be the arm pulled at time $i$ (so $X_i^{A_i}$ is the reward received at time $i$).
All expectations (e.g., $\mathbb{E}[X_i^a]$, $\mathbb{E}[X_i^{A_i}]$) are taken over the randomness in the payouts as well as the arm pulled at each timestep.
Q2.1 Regret
1 Point
Select all of the definitions which are true.
The regret is $R(t) = \sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - X_i^{A_i} \right)$.
The regret is $R(t) = \sum_{i=1}^{t} \left( \max_{a \in A} \mu_a - X_i^{A_i} \right)$.
The regret is $R(t) = \sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - \mathbb{E}[X_i^{A_i}] \right)$.
The regret is $R(t) = \sum_{i=1}^{t} \left( \max_{a \in A} \mu_a - \mathbb{E}[X_i^{A_i}] \right)$.
The regret is $R(t) = t \max_{a \in A} \mu_a - \sum_{i=1}^{t} X_i^{A_i}$.
The regret is $R(t) = t \max_{a \in A} \mu_a - \sum_{i=1}^{t} \mathbb{E}[X_i^{A_i}]$.
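For intuition, here is a minimal simulation sketch of the form $R(t) = t \max_{a \in A} \mu_a - \sum_{i=1}^{t} X_i^{A_i}$; the Bernoulli arm means and the uniformly random policy are illustrative assumptions, not part of the quiz.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 3-armed Bernoulli bandit; the means are made up.
mus = np.array([0.3, 0.5, 0.7])
T = 1000

# Pull a uniformly random arm each step and track the regret
# R(t) = t * max_a mu_a - sum_{i<=t} X_i^{A_i}.
arms = rng.integers(0, len(mus), size=T)
rewards = rng.binomial(1, mus[arms])
regret = np.arange(1, T + 1) * mus.max() - np.cumsum(rewards)
print("R(T) for a uniformly random policy:", regret[-1])
```

A uniformly random policy's regret grows linearly in $t$; a good bandit algorithm aims for sublinear growth.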
Q2.2 Expected Regret and Pseudo-Regret
1 Point
Select all of the definitions which are true.
The expected regret is the expectation of the regret over the randomness in the payout, conditioned on the sequence of choices made.
The expected regret is the expectation of the regret over the randomness in the payout and the sequence of choices made.
The expected regret is $\sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - X_i^{A_i} \right)$.
The expected regret is $\sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - \mathbb{E}[X_i^{A_i}] \right)$.
The expected regret is $\sum_{i=1}^{t} \left( \max_{a \in A} \mu_a - \mathbb{E}[X_i^{A_i}] \right)$.
The pseudo-regret is the expectation of the regret over the randomness in the payout, conditioned on the sequence of choices made.
The pseudo-regret is the expectation of the regret over the randomness in the payout and the sequence of choices made.
The pseudo-regret is $\sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - X_i^{A_i} \right)$.
The pseudo-regret is $\sum_{i=1}^{t} \left( \max_{a \in A} X_i^a - \mathbb{E}[X_i^{A_i}] \right)$.
The pseudo-regret is $\sum_{i=1}^{t} \left( \max_{a \in A} \mu_a - \mathbb{E}[X_i^{A_i}] \right)$.
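For reference, one standard convention for these quantities (e.g., the one in Bubeck and Cesa-Bianchi's bandits survey; the quiz's own answer key is not shown in this extract) is:

\begin{align*}
  R(t) &= \max_{a \in A} \sum_{i=1}^{t} X_i^{a} - \sum_{i=1}^{t} X_i^{A_i}
    && \text{(regret: a random variable)} \\
  \mathbb{E}[R(t)] &= \mathbb{E}\!\left[ \max_{a \in A} \sum_{i=1}^{t} X_i^{a} - \sum_{i=1}^{t} X_i^{A_i} \right]
    && \text{(expected regret: over payouts and choices)} \\
  \bar{R}(t) &= t \max_{a \in A} \mu_a - \sum_{i=1}^{t} \mathbb{E}\!\left[ \mu_{A_i} \right]
    && \text{(pseudo-regret)}
\end{align*}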
Q3 Dynamic Programming
1 Point
Which of the following are true statements about dynamic programming?
It is a technique for improving the accuracy of certain algorithms
Backpropagation is an example of dynamic programming
Memoization is an example of dynamic programming
If we can use dynamic programming for a particular problem, we will end up with a more efficient algorithm
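Memoization, one of the options above, is the top-down form of dynamic programming: cache the answer to each overlapping subproblem the first time it is computed. A minimal Python sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # memoize: each fib(n) is computed only once
def fib(n: int) -> int:
    """Fibonacci; naively exponential recursion, O(n) once memoized."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(90))  # returns instantly despite the exponential-looking recursion
```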
Q4 Markov Decision Processes
1 Point
When transitioning from $s_t$ to $s_{t+1}$ in a Markov Decision Process, we don't need to know anything about previous states.
True
False
Q5
2 Points
Q5.1
1 Point
Consider the following gridworld environment:
[Gridworld figure not reproduced in this extract.]
X represents an inaccessible state, and 1 and 10 are terminal states with values of 1 and 10, respectively. Transitions between all states have a reward of 0.
Assume state transitions are deterministic. In other words, if our action is to move in a particular direction, we always move in that direction, unless there is a wall, in which case we stay in the same state. If the discount factor ($\gamma$) is 0.7, compute the optimal value $V^*(A)$ for the state marked $A$.
Submitted answer: 7.2
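With deterministic moves, zero step rewards, and terminals fixed at their values, the optimal value satisfies $V^*(s) = \max_{s'} \gamma\, V^*(s')$, i.e., $\gamma^{d}$ times the best terminal value reachable in $d$ steps. Since the figure is missing here, the grid below is a purely hypothetical stand-in used only to show the computation:

```python
import numpy as np

# Hypothetical layout standing in for the missing figure:
# "X" is inaccessible, "1" and "10" are terminals, "A" is the query state.
grid = [["10", ".", "1"],
        [".",  "X", "."],
        ["A",  ".", "."]]
gamma = 0.7
rows, cols = len(grid), len(grid[0])
terminal = {"1": 1.0, "10": 10.0}

V = np.zeros((rows, cols))
for _ in range(50):  # synchronous value iteration
    new_V = np.zeros_like(V)
    for r in range(rows):
        for c in range(cols):
            cell = grid[r][c]
            if cell in terminal:
                new_V[r, c] = terminal[cell]  # terminal values are fixed
            elif cell == "X":
                continue  # inaccessible state
            else:
                best = 0.0
                for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                    nr, nc = r + dr, c + dc
                    # Walls and "X" bounce us back to the same state.
                    if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == "X":
                        nr, nc = r, c
                    best = max(best, gamma * V[nr, nc])
                new_V[r, c] = best
    V = new_V
print(V)  # in this made-up grid, V*(A) = 0.7**2 * 10 = 4.9
```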
Q5.2 Gridworld
1 Point
Suppose we change movement in Gridworld so that when the robot takes an action to move in a particular direction, it has a 70% chance to move in that direction, a 10% chance to move sideways from that direction, and a 10% chance to continue in whatever direction it moved before (e.g., the motor is stuck). If possible, what change(s) is/are necessary to make these changes in the Markov Decision Process (MDP)?
Impossible: since the potential next states now depend on the previous action, we can no longer use a Markov Decision Process.
Augment the state with the previous direction moved, so that each state is now a (location, prev action) pair
Augment the action to include two actions at a time, so that each action is now a (current action, previous action) pair
Augment the action to include two actions at a time, so that each action is now a (current action, next action) pair
Change the transition probabilities to include a chance of moving in the previous direction
Change the reward function by increasing the reward for moving in the same direction twice
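To illustrate the state-augmentation idea (a hypothetical encoding, not the official solution): once each state is a (location, prev action) pair, the distribution over the next direction of motion depends only on the current augmented state and the chosen action, so the process is Markov again.

```python
# Hypothetical helper names (ACTIONS, sideways, motion_probs) for illustration.
ACTIONS = ["up", "down", "left", "right"]

def sideways(a: str) -> list[str]:
    """The two directions perpendicular to action a."""
    return ["left", "right"] if a in ("up", "down") else ["up", "down"]

def motion_probs(state: tuple, action: str) -> dict[str, float]:
    """P(direction actually moved | augmented state, action).

    state = (location, prev_action). Assumed split of the 30% failure
    mass: 10% for each sideways direction, 10% for the previous direction.
    """
    _, prev = state
    probs = {d: 0.0 for d in ACTIONS}
    probs[action] += 0.7
    for s in sideways(action):
        probs[s] += 0.1
    probs[prev] += 0.1
    return probs

print(motion_probs(((0, 0), "up"), "right"))
# {'up': 0.2, 'down': 0.1, 'left': 0.0, 'right': 0.7}
```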
Q6 True or False
3 Points
Identify whether the following are true or false.
Q6.1 True or False
1 Point
Regret from the Explore-then-Commit algorithm grows logarithmically in the number of timesteps, which is better than the Upper Confidence Bound algorithm with a linear order of growth.
True
False
Q6.2
1 Point
Explore-then-Commit with a very large number of exploration timesteps (e.g., $m = 1000$) is a better choice than UCB.
True
False
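A toy experiment contrasting the two algorithms; the Bernoulli arm means, horizon, and choice of $m$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
mus = np.array([0.4, 0.5, 0.6])  # hypothetical Bernoulli arms
K, T = len(mus), 20_000

def etc_regret(m: int) -> float:
    """Explore-then-Commit: pull each arm m times, then commit to the best."""
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        arm = t % K if t < m * K else int(np.argmax(sums / counts))
        r = rng.binomial(1, mus[arm])
        counts[arm] += 1
        sums[arm] += r
        total += r
    return T * mus.max() - total

def ucb_regret() -> float:
    """UCB1: pull the arm maximizing mean + sqrt(2 ln t / n_a)."""
    counts, sums, total = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        if t < K:
            arm = t  # initialize by pulling each arm once
        else:
            arm = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
        r = rng.binomial(1, mus[arm])
        counts[arm] += 1
        sums[arm] += r
        total += r
    return T * mus.max() - total

print("ETC (m = 1000) regret:", etc_regret(1000))
print("UCB1 regret:          ", ucb_regret())
```

Exploring each arm 1000 times pays roughly $m \sum_a (\mu^* - \mu_a)$ regret up front (here about $1000 \times 0.3 = 300$), which UCB avoids by adapting its exploration as evidence accumulates.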
Q6.3
1 Point
One of the advantages of Thompson sampling compared to the UCB algorithm is that, regardless of the prior we choose, Thompson sampling will always yield better results than the UCB algorithm at every time step.
True
False
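For context, a minimal Beta-Bernoulli Thompson sampling sketch; the Beta(1, 1) prior and the arm means are assumptions, and since behavior depends on the prior, a blanket guarantee of beating UCB at every time step does not hold in general.

```python
import numpy as np

rng = np.random.default_rng(3)
mus = np.array([0.4, 0.5, 0.6])  # hypothetical Bernoulli arms
K, T = len(mus), 10_000

alpha, beta = np.ones(K), np.ones(K)  # Beta(1, 1) prior per arm
total = 0.0
for _ in range(T):
    theta = rng.beta(alpha, beta)  # one posterior sample per arm
    arm = int(np.argmax(theta))    # pull the most promising sample
    r = rng.binomial(1, mus[arm])
    alpha[arm] += r                # conjugate posterior update
    beta[arm] += 1 - r
    total += r
print("Thompson sampling regret:", T * mus.max() - total)
```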
Graded
Vitamin 10
Select each question to review feedback and grading details.
Student: Jesus Reynosa
Total Points: 8 / 10 pts

Question 1 (no title): 1 / 1 pt
Question 2 (Bandits): 1 / 2 pts
  2.1 Regret: 1 / 1 pt
  2.2 Expected Regret and Pseudo-Regret: 0 / 1 pt
Question 3 (Dynamic Programming): 1 / 1 pt
Question 4 (Markov Decision Processes): 1 / 1 pt
Question 5 (no title): 1 / 2 pts
  5.1 (no title): 0 / 1 pt
  5.2 Gridworld: 1 / 1 pt
Question 6 (True or False): 3 / 3 pts
  6.1 True or False: 1 / 1 pt
  6.2 (no title): 1 / 1 pt
  6.3 (no title): 1 / 1 pt