Reinforcement learning
Pick your preferences
Consider a MAB problem with two possible actions, $a \in \{1, 2\}$. Let the distribution of rewards for action 1 be given by . Let the distribution of rewards for action 2 be given by .
Assuming the policy π is computed as the softmax of a preference vector $h = (h_1, h_2)$, which preference vector would lead to the highest expected reward for an actor following policy π?
Solutions
The option that puts the most probability on the better action is best, though it will be pretty close to the runner-up. Basically the only thing that matters is the difference $h_2 - h_1$, which gives the log odds of picking action 2.
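As a quick sanity check on this reasoning (not part of the original exercise), here is a minimal Python sketch of a softmax policy over two actions; the preference vectors below are made-up examples, not the exercise's answer options.

```python
import numpy as np

def softmax_policy(h):
    """Action probabilities from a preference vector h (numerically stable softmax)."""
    z = np.exp(h - np.max(h))
    return z / z.sum()

# Made-up preference vectors for illustration only.
# Note that (0, 2) and (5, 7) give identical policies: only h_2 - h_1 matters.
for h in [np.array([0.0, 0.0]), np.array([0.0, 2.0]), np.array([5.0, 7.0])]:
    pi = softmax_policy(h)
    print(h, pi, "log odds of action 2:", np.log(pi[1] / pi[0]))
```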
Stochastic gradient descent
Stochastic gradient descent (SGD) is an iterative method for minimizing a loss function $L(\beta)$. At each step, we have parameters β, we produce an estimate $\hat{g}$ of the gradient $\nabla L(\beta)$, and we update β according to $\beta \leftarrow \beta - \varepsilon \hat{g}$, where $\varepsilon > 0$ is a step size.
If the estimate is awful, this won’t work. Which of the following is a typical condition that we seek to satisfy in designing an SGD algorithm?
Solutions
The third answer is correct: unbiasedness, i.e., $\mathbb{E}[\hat{g}] = \nabla L(\beta)$.
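To make the update rule and the unbiasedness condition concrete, here is a minimal SGD sketch (a hypothetical example, not part of the exercise) for least-squares regression, where the minibatch gradient is an unbiased estimate of the full gradient.

```python
import numpy as np

# Minimal SGD sketch (assumed example): minimize the least-squares loss
# L(beta) = mean((X @ beta - y)**2) using unbiased minibatch gradient estimates.

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta = np.zeros(d)
step_size = 0.05       # epsilon in the update beta <- beta - epsilon * g_hat
batch_size = 32

for _ in range(2000):
    idx = rng.integers(0, n, size=batch_size)       # sample a random minibatch
    resid = X[idx] @ beta - y[idx]
    g_hat = 2 * X[idx].T @ resid / batch_size       # unbiased estimate of grad L(beta)
    beta = beta - step_size * g_hat                 # SGD update

print("estimation error:", np.linalg.norm(beta - beta_true))
```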
Expected reward
Consider a MAB problem with two possible actions, $a \in \{1, 2\}$. Let the distribution of rewards for action 1 be given by . Let the distribution of rewards for action 2 be given by . What is the expected reward for a policy that selects action 1 with probability 0.2 and action 2 with probability 0.8?
Solutions
The expected reward is the mixture $\mathbb{E}[R] = 0.2\,\mathbb{E}[R \mid A = 1] + 0.8\,\mathbb{E}[R \mid A = 2]$, where the two conditional expectations are the means of the given reward distributions.
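For concreteness, here is a small Monte Carlo check of the mixture formula; the reward distributions below ($\mathcal{N}(1, 1)$ for action 1 and $\mathcal{N}(3, 1)$ for action 2) are assumptions for illustration, not the distributions from the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reward distributions (assumed for illustration, not the
# exercise's actual ones): action 1 ~ N(1, 1), action 2 ~ N(3, 1).
mean_reward = {1: 1.0, 2: 3.0}

n = 200_000
actions = rng.choice([1, 2], size=n, p=[0.2, 0.8])            # sample actions from the policy
mu = np.where(actions == 1, mean_reward[1], mean_reward[2])
rewards = rng.normal(loc=mu, scale=1.0)                       # sample rewards given the actions

# Monte Carlo estimate vs. the closed-form mixture 0.2*E[R|1] + 0.8*E[R|2]
print(rewards.mean(), 0.2 * mean_reward[1] + 0.8 * mean_reward[2])
```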
What kind of RL problem is this?
Consider the following three problems.
a. You are training a robot to balance a carrot on its nose. In each episode you place a carrot on the robot’s nose. Every 10 milliseconds, you estimate how balanced the carrot is and give the robot a reward for being balanced. The robot then calculates how it is going to move over the next 10 milliseconds. You continue the episode for 3 minutes. The goal is for the robot to keep the carrot well-balanced for all 3 minutes. The episode can also end early if the robot drops the carrot. You train the robot over many episodes.
b. You have 10 bouncy icosahedrons and you want to know which one bounces highest on average. How high the balls bounce is somewhat random (it depends on the angle they land at). You can perform experiments where you drop a ball from a one-story balcony and then observe how high it bounces. Unfortunately, you only have time to perform 50 experiments.
c. You have a machine that can throw balls, and you want to learn how to make the machine throw different kinds of balls very far. For any given ball, you can find out how heavy it is and what its diameter is, program the settings of the machine, and ask the machine to perform the throw. You want to find out how to pick the best settings based on the properties of the ball.
For each problem, indicate what kind of problem it is: MAB, Contextual MAB, or POMDP.
Solutions
(a) POMDP: the carrot's state evolves over time, the robot acts sequentially within each episode, and it only has access to estimates of how balanced the carrot is rather than the full state. (b) MAB: each drop is an independent pull of one of 10 arms with a random reward (the bounce height) and no context. (c) Contextual MAB: each throw comes with observed context (the ball's weight and diameter), the machine settings are the action, and there are no sequential dynamics linking throws.
Expectations
Consider a MAB problem with four possible actions (“arms”). Let $\bar{r}(a)$ denote the average reward for taking action $a$. Consider a policy $\pi$ on the four possible actions. Assume . Assuming and , compute
Solutions
The answer is zero, because .
More expectations
Consider a MAB problem with four possible actions (“arms”). Assume that the reward $r(a)$ is a deterministic function of the action,
Consider a policy $\pi$ on the four possible actions. Assuming , calculate the value of
Solutions
The expectation can be expanded as a sum over the four actions, with each term weighted by $\pi(a)$.
The indicator function is zero for all but one action, so only a single term of the sum is nonzero.
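To illustrate the expansion technique concretely, suppose (purely as an assumed example, not necessarily the exercise's actual quantity) that the target were $\mathbb{E}_{A \sim \pi}\!\left[\mathbb{1}(A = 2)\, r(A)\right]$. Expanding the expectation as a sum over the four actions gives

$$\mathbb{E}_{A \sim \pi}\!\left[\mathbb{1}(A = 2)\, r(A)\right] \;=\; \sum_{a=1}^{4} \pi(a)\, \mathbb{1}(a = 2)\, r(a) \;=\; \pi(2)\, r(2),$$

since the indicator kills every term except $a = 2$.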