
Reinforcement learning

Pick your preferences

Consider a MAB problem with two possible actions, $a \in \{1,2\}$. Let the distribution of rewards for action $a=1$ be given by $\rho_1 = \mathcal{N}(5,1)$. Let the distribution of rewards for action $a=2$ be given by $\rho_2 = \mathcal{N}(10,1)$.

Assuming the policy $\pi$ is computed as the softmax of a preference vector $H \in \mathbb{R}^2$, which preference vector would lead to the highest expected reward for an actor following policy $\pi$?

  1. $H = (0,1)$

  2. $H = (-5,1)$

  3. $H = (0,0)$

  4. $H = (-5,0)$
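
To make the mechanics concrete, here is a minimal sketch (not an answer key) of how a preference vector induces a softmax policy and how the expected reward under that policy follows; the means `mu = [5, 10]` come from the reward distributions above, and the helper names are our own.

```python
import numpy as np

def softmax(h):
    """Convert a preference vector into a probability distribution."""
    z = np.exp(h - np.max(h))  # subtract max for numerical stability
    return z / z.sum()

mu = np.array([5.0, 10.0])  # means of rho_1 and rho_2

for H in [(0, 1), (-5, 1), (0, 0), (-5, 0)]:
    pi = softmax(np.array(H, dtype=float))
    expected_reward = pi @ mu  # E[R] = sum_a pi(a) * mu_a
    print(H, pi.round(3), float(expected_reward))
```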

Stochastic gradient descent

Stochastic gradient descent (SGD) is an iterative method for minimizing a loss function $L(\beta)$. At each step, we have parameters $\beta$, we produce an estimate $g \approx \nabla L(\beta)$, and we update $\beta$ according to $g$.

If the estimate $g$ is awful, this won’t work. Which of the following is a typical condition that we seek to satisfy in designing an SGD algorithm?

  1. $\mathbb{E}[\Vert g\Vert] / \mathrm{var}(\Vert g\Vert) > \Vert \nabla L(\beta) \Vert / \mathrm{var}(\Vert g\Vert)$

  2. $\mathbb{P}(\Vert g - \nabla L(\beta) \Vert > \Vert \nabla L(\beta) \Vert) < \frac{1}{2}$

  3. $\mathbb{E}[g] = \nabla L(\beta)$

  4. $\Vert \mathbb{E}[g]\Vert = \Vert \nabla L(\beta)\Vert$
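
As a sketch of the setup described above (not of the answer), here is a minimal SGD loop on a toy quadratic loss of our own choosing, where the gradient estimate is noisy but unbiased; the loss, target, learning rate, and noise scale are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy loss L(beta) = ||beta - target||^2, with gradient 2 * (beta - target).
target = np.array([3.0, -1.0])
grad = lambda beta: 2.0 * (beta - target)

beta = np.zeros(2)
lr = 0.05
for _ in range(1000):
    # Noisy gradient estimate g with E[g] = grad L(beta).
    g = grad(beta) + rng.normal(scale=1.0, size=2)
    beta -= lr * g  # SGD update

print(beta)  # should land close to target despite the noise
```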

Expected reward

Consider a MAB problem with two possible actions, $a \in \{1,2\}$. Let the distribution of rewards for action $a=1$ be given by $\rho_1 = \mathcal{N}(5,1)$. Let the distribution of rewards for action $a=2$ be given by $\rho_2 = \mathcal{N}(10,1)$. What is the expected reward for a policy $\pi = (0.2, 0.8)$ that selects $a=1$ with probability 0.2 and selects $a=2$ with probability 0.8?
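
One way to sanity-check an answer here is a quick Monte Carlo estimate; this is our own sketch under the distributions stated above, not part of the original problem.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.8])
means = np.array([5.0, 10.0])  # rho_1 = N(5,1), rho_2 = N(10,1)

n = 100_000
actions = rng.choice(2, size=n, p=pi)       # A ~ pi
rewards = rng.normal(means[actions], 1.0)   # R | A ~ rho_A
print(rewards.mean())                       # Monte Carlo estimate of E[R]
```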

What kind of RL problem is this?

Consider the following three problems.

a. You are training a robot to balance a carrot on its nose. In each episode you place a carrot on the robot’s nose. Every 10 milliseconds, you estimate how balanced the carrot is and give the robot a reward for being balanced. The robot then calculates how it is going to move over the next 10 milliseconds. You continue the episode for 3 minutes. The goal is for the robot to keep the carrot well-balanced for all 3 minutes. The episode can also end early if the robot drops the carrot. You train the robot over many episodes.

b. You have 10 bouncy icosahedrons and you want to know which one bounces highest on average. How high the balls bounce is somewhat random (it depends on the angle they land at). You can perform experiments where you drop a ball from a one-story balcony and then observe how high it bounces. Unfortunately, you only have time to perform 50 experiments.

c. You have a machine that can throw balls, and you want to learn how to make the machine throw different kinds of balls very far. For any given ball, you can find out how heavy it is and what diameter it is, program the settings of the machine, and ask the machine to perform the throw. You want to find out how to pick the best settings based on the properties of the ball.

For each problem, indicate what kind of problem it is: MAB, Contextual MAB, or POMDP.

Expectations

Consider a MAB problem with $K=4$ possible actions (“arms”). Let $\mu_k = \mathbb{E}_{\rho_k}[R]$ denote the average reward for taking action $k$. Consider a policy $\pi = (.2, .3, .1, .4)$ on the four possible actions. Assume $\mu = (5, 10, 20, 30)$. Assuming $A \sim \pi$ and $[R|A] \sim \rho_A$, compute

$$\mathbb{E}[\mu_A - R]$$
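
If helpful, this quantity can be estimated by simulation. A minimal sketch, assuming for concreteness that each $\rho_k$ is normal with unit variance around $\mu_k$ (the variance is our assumption and does not affect the expectation):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.3, 0.1, 0.4])
mu = np.array([5.0, 10.0, 20.0, 30.0])

n = 100_000
A = rng.choice(4, size=n, p=pi)   # A ~ pi
R = rng.normal(mu[A], 1.0)        # R | A ~ rho_A (assumed N(mu_A, 1))
print((mu[A] - R).mean())         # estimate of E[mu_A - R]
```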

More expectations

Consider a MAB problem with $K=4$ possible actions (“arms”). Assume that the reward is a deterministic function of the action,

$$R(A) = \begin{cases} 2 & \text{if } A \in \{1,2,3\} \\ 10 & \text{if } A = 4 \end{cases}$$

Consider a policy $\pi = (.2, .3, .1, .4)$ on the four possible actions. Assuming $A \sim \pi$, calculate the value of

$$\mathbb{E}[R(A) \times \mathbb{I}(A=3)]$$
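
A short simulation sketch of this expectation (our own check, not part of the problem statement); the deterministic reward function is copied from the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.3, 0.1, 0.4])

def R(a):
    """Deterministic reward: 2 for actions 1-3, 10 for action 4."""
    return np.where(a == 4, 10.0, 2.0)

n = 100_000
A = rng.choice([1, 2, 3, 4], size=n, p=pi)   # A ~ pi
print((R(A) * (A == 3)).mean())              # estimate of E[R(A) * I(A=3)]
```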