Contact

Eitan Farchi, Verification & Quality Technologies, IBM Research - Haifa

Reinforcement Learning

Explaining the exploration/exploitation trade-off:

The multi-armed bandit problem: UCB heuristics - be an optimist:
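To make the optimism principle concrete, here is a minimal UCB1 sketch for a Bernoulli bandit; the arm probabilities and horizon are invented for the example.

import math
import random

def ucb1(arm_probs, horizon=10_000):
    # Pull the arm with the highest optimistic estimate:
    # empirical mean + sqrt(2 * ln t / n_pulls).
    n_arms = len(arm_probs)
    counts = [0] * n_arms   # pulls per arm
    sums = [0.0] * n_arms   # total reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # pull each arm once to initialize
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if random.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

random.seed(0)
print(ucb1([0.3, 0.5, 0.7]))  # most pulls should go to the 0.7 arm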

Markov chain introduction:

These ideas are at the heart of reinforcement learning.
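As a small concrete illustration of a Markov chain, a minimal sketch simulating a two-state chain; the transition matrix is invented for the example.

import random

# Hypothetical transition matrix: P[i][j] is the probability of moving from i to j.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def simulate(P, start=0, steps=10):
    # The next state depends only on the current state - the Markov property.
    state, path = start, [start]
    for _ in range(steps):
        state = 0 if random.random() < P[state][0] else 1
        path.append(state)
    return path

random.seed(1)
print(simulate(P))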

Markov chain convergence (stable state):
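A minimal sketch of convergence to the stable (stationary) state, reusing the invented two-state chain from the sketch above: repeatedly pushing a start distribution through the transition matrix converges to the stationary distribution regardless of where we start.

P = [[0.9, 0.1],
     [0.5, 0.5]]

dist = [1.0, 0.0]  # start fully in state 0
for _ in range(100):
    # One step of the chain: dist_{t+1}[j] = sum_i dist_t[i] * P[i][j]
    dist = [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]
print(dist)  # approaches the stationary distribution, here (5/6, 1/6)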

Markov chain video (English):

Banach fixed-point theorem:
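A tiny numeric illustration of the theorem: iterating a contraction on the reals converges to its unique fixed point from any starting point (the map below, with contraction factor 1/2, is an arbitrary example).

f = lambda x: 0.5 * x + 1.0  # contraction with factor 1/2; fixed point x = 2
x = 10.0
for _ in range(50):
    x = f(x)  # each step halves the distance to the fixed point
print(x)      # ~2.0, whatever the starting point was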

Bellman equation is a contraction operator:
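Since the Bellman optimality operator is a gamma-contraction in the max norm, value iteration converges from any initial guess. A minimal sketch on a made-up two-state, two-action MDP (all rewards, transitions, and the discount factor are assumptions for the example):

gamma = 0.9
# Hypothetical deterministic MDP: R[s][a] is the reward, T[s][a] the next state.
R = [[1.0, 0.0],
     [0.0, 2.0]]
T = [[0, 1],
     [0, 1]]

V = [0.0, 0.0]
for _ in range(200):
    # One application of the Bellman optimality operator;
    # each application shrinks the max-norm error by a factor of gamma.
    V = [max(R[s][a] + gamma * V[T[s][a]] for a in (0, 1)) for s in (0, 1)]
print(V)  # approaches the unique fixed point, here (18, 20)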

The difference between MC and TD

MC/TD strategies in reinforcement learning:

The simplest Monte Carlo evaluation of a policy value
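To show the contrast, a minimal sketch that evaluates a fixed policy on a tiny made-up chain twice: Monte Carlo waits for the complete return of an episode, while TD(0) bootstraps from the current estimate after every single step. The chain, rewards, and constants are all assumptions for the example.

GAMMA = 1.0

def episode():
    # Under the fixed policy the agent walks 0 -> 1 -> 2 -> terminal,
    # earning reward 1 per step (deterministic to keep the sketch short).
    return [(0, 1.0), (1, 1.0), (2, 1.0)]

def mc_evaluate(n_episodes=100):
    # Monte Carlo: V(s) is the average of complete returns observed from s.
    returns = {s: [] for s in range(3)}
    for _ in range(n_episodes):
        g = 0.0
        for s, r in reversed(episode()):
            g = r + GAMMA * g        # full return - known only at episode end
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

def td0_evaluate(n_episodes=1000, alpha=0.1):
    # TD(0): move V(s) toward r + gamma * V(next) after every step.
    V = {s: 0.0 for s in range(4)}   # state 3 is terminal, so V[3] stays 0
    for _ in range(n_episodes):
        for s, r in episode():
            V[s] += alpha * (r + GAMMA * V[s + 1] - V[s])  # bootstrap
    return V

print(mc_evaluate())   # {0: 3.0, 1: 2.0, 2: 1.0} - the true values
print(td0_evaluate())  # approaches the same values step by step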

How to learn the value of one policy by simulating another


Sampling one distribution using another distribution
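A minimal importance-sampling sketch: we estimate an expectation under a target distribution p while drawing samples from a different proposal q, weighting each sample by p(x)/q(x). The same reweighting idea is what lets us learn the value of one policy from trajectories of another, with the weight becoming a ratio of policy probabilities. The distributions below are invented for the example.

import random

def importance_sampling_mean(f, p, q, sample_q, n=100_000):
    # Estimate E_p[f(X)] using samples from q, weighted by p(x)/q(x).
    total = 0.0
    for _ in range(n):
        x = sample_q()
        total += f(x) * p(x) / q(x)
    return total / n

# Example: p is uniform on {0, 1}, while q is biased toward 1.
p = lambda x: 0.5
q = lambda x: 0.8 if x == 1 else 0.2
sample_q = lambda: 1 if random.random() < 0.8 else 0

random.seed(0)
print(importance_sampling_mean(lambda x: x, p, q, sample_q))  # ~0.5 = E_p[X]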

Covering the use of Monte Carlo methods in reinforcement learning

Reinforcement learning - the repeated chicken game

  • The chicken game
  • Nash equilibrium: (x, y) is a Nash equilibrium if f_1(x, y) >= f_1(z, y) and f_2(x, y) >= f_2(x, z) for every deviation z
  • (2, 7) and (7, 2) - the pure Nash equilibria of the chicken game
  • The minmax/maxmin value is 2: each player can guarantee a payment of at least 2 regardless of what the other player does.
  • If the row player plays dare, the worst that can happen is 0; if she plays chicken, the worst that can happen is 2, so by playing chicken she guarantees a payment of 2. The same holds for the column player.
  • We repeat the game an infinite number of times. The payment of the row player at stage i is r_i. We define the overall payment to be the liminf of the running averages Sum_{i=1 to k}(r_i)/k as k goes to infinity. The column player's payment is defined the same way.
  • The folk theorem in game theory - https://en.wikipedia.org/wiki/Folk_theorem_(game_theory)
  • Objective - obtain an equilibrium payment of (1/2)(0, 0) + (1/2)(6, 6) = (3, 3). In fact, any convex combination of the possible game payments that dominates the minmax point (2, 2) can be obtained.
  • Strategy: both players play dare in odd stages (payoff (0, 0)) and chicken in even stages (payoff (6, 6)). If the column player deviates from the prescribed behavior, the row player plays dare forever after, ensuring the column player gets no more than 2. The symmetric trigger handles deviations by the row player.
Analysis
Payments when nobody deviates are the same for both players. For the row player, the stage payments and their running averages are:

Payments:        0  6  0  6     0  6 ....
Running average: 0  3  2  3  12/5  3 ....

The liminf of the sequence of running averages is 3:
https://en.wikipedia.org/wiki/Limit_superior_and_limit_inferior

Assume the column player deviates. The deviation stage itself yields at most a bounded one-time payment (at most 7), and from every stage after it the row player plays dare, so the column player gets at most 2 per stage. For a deviation at an odd stage, the column player's stage payments are bounded by

0 6 0 6 2 2 2 2 2 2 ....

Assume the cooperative prefix before the deviation has length 2k. The column player's long-run average payment is then bounded by

lim_{n -> infinity} [6k + 2(n - 2k)]/n = 2,

which is less than 3. (The single bounded deviation stage does not affect the limit.)

Therefore, the strategy above is a Nash equilibrium of the repeated game.
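To double-check the two numbers in the analysis, a minimal simulation of the trigger strategy is sketched below. It assumes the payoff matrix dare/dare = (0, 0), dare/chicken = (7, 2), chicken/dare = (2, 7), chicken/chicken = (6, 6), matching the values quoted above, and assumes the deviating column player best-responds to the punishment by playing chicken; the horizon is finite only because a simulation has to stop.

# Payoffs as (row, column); 'D' = dare, 'C' = chicken.
PAY = {('D', 'D'): (0, 0), ('D', 'C'): (7, 2),
       ('C', 'D'): (2, 7), ('C', 'C'): (6, 6)}

def average_payoffs(deviate_at=None, horizon=100_000):
    # Prescribed play: both dare on odd stages, both chicken on even stages.
    # After a column deviation, the row player plays dare forever.
    punished = False
    row_total = col_total = 0.0
    for t in range(1, horizon + 1):
        prescribed = 'D' if t % 2 == 1 else 'C'
        row = 'D' if punished else prescribed        # retaliation starts the next stage
        if deviate_at is None or t < deviate_at:
            col = prescribed
        elif t == deviate_at:
            col = 'C' if prescribed == 'D' else 'D'  # the one-shot deviation
            punished = True
        else:
            col = 'C'                                # assumed best response to eternal dare
        r, c = PAY[(row, col)]
        row_total += r
        col_total += c
    return row_total / horizon, col_total / horizon

print(average_payoffs())              # ~(3.0, 3.0): the cooperative average
print(average_payoffs(deviate_at=9))  # the column player's average drops to ~2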

Can we marry repeated games with reinforcement learning? See the additional interesting reading.

In reinforcement learning, we deal with infinite sequences of actions, so we need to learn how to think about probabilities over such infinite sequences.

In the context of reinforcement learning, we show that a specific scheme of Monte Carlo control is monotonic if Q(a, pi) is well estimated by the exploration stage.