Reinforcement Learning 2021 Guest Lecture on Pure Exploration
Wouter M. Koolen

Download these slides from https://wouterkoolen.info/Talks/2021-10-19.pdf!

▶ Pure Exploration:
  ▶ PAC Learning
  ▶ Best Arm Identification
▶ Minimax Strategies in Noisy Games (Zero-Sum, Extensive Form)
Grand Goal: Interactive Machine Learning

Query: most effective drug dose? most appealing website layout? safest next robot action?

[Diagram: the learner repeatedly sends an experiment to the environment and receives an outcome (e.g. "42").]

Main scientific questions
▶ Efficient systems
▶ Sample complexity as a function of query and environment
Pure Exploration and Reinforcement Learning

Both are about learning in uncertain environments.

Pure Exploration focuses on the statistical problem (learn the truth), while Reinforcement Learning focuses on behaviour (maximise reward).

Pure Exploration occurs as a sub-module in some RL algorithms (e.g. Phased Q-Learning by Even-Dar, Mannor, and Mansour, 2002).

Some problems approached with RL are in fact better modelled as pure exploration problems, most notably MCTS for playing games.
Best Arm Identification
Formal model

Environment (multi-armed bandit model): K distributions parameterised by their means µ = (µ1, . . . , µK). The best arm is

$$i^* = \operatorname*{argmax}_{i \in [K]} \mu_i.$$

Strategy
▶ Stopping rule τ ∈ ℕ
▶ In round t ≤ τ the sampling rule picks It ∈ [K]; see Xt ∼ µIt.
▶ Recommendation rule Î ∈ [K]

Realisation of the interaction: (I1, X1), . . . , (Iτ, Xτ), Î.

Two objectives: sample efficiency τ and correctness Î = i*.
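To make the protocol concrete, here is a minimal sketch of the interaction loop, assuming unit-variance Gaussian arms and treating the three rules as callbacks. All names here (`run`, `sampling_rule`, ...) are illustrative scaffolding, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(mu, sampling_rule, stopping_rule, recommend):
    """Generic pure-exploration loop: sample until the stopping rule
    fires, then recommend an arm. Arms are (assumed) N(mu_i, 1)."""
    K = len(mu)
    counts = np.zeros(K, dtype=int)   # N_i(t)
    sums = np.zeros(K)                # running reward sums
    t = 0
    while not stopping_rule(counts, sums, t):
        i = sampling_rule(counts, sums, t)   # pick I_t in [K]
        x = rng.normal(mu[i], 1.0)           # observe X_t ~ mu_{I_t}
        counts[i] += 1
        sums[i] += x
        t += 1
    return recommend(counts, sums), t        # (I_hat, tau)
```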
Objective

On bandit µ, a strategy (τ, (It)t, Î) has
▶ error probability Pµ(Î ≠ i*(µ)), and
▶ sample complexity Eµ[τ].

Idea: constrain one, optimise the other.

Definition: Fix a small confidence δ ∈ (0, 1). A strategy is δ-correct (aka δ-PAC) if

$$\mathbb{P}_\mu\big(\hat I \neq i^*(\mu)\big) \le \delta \quad \text{for every bandit model } \mu.$$

(Generalisation: output an ϵ-best arm.)

Goal: minimise Eµ[τ] over all δ-correct strategies.
Algorithms

▶ Sampling rule It?
▶ Stopping rule τ?
▶ Recommendation rule Î?

$$\hat I = \operatorname*{argmax}_{i \in [K]} \hat\mu_i(\tau),$$

where µ̂(t) is the empirical mean.

Approach: start by investigating lower bounds.
Instance-Dependent Sample Complexity Lower Bound

Define the alternatives to µ by Alt(µ) = {λ | i*(λ) ≠ i*(µ)}.

Theorem (Castro 2014; Garivier and Kaufmann 2016). Fix a δ-correct strategy. Then for every bandit model µ,

$$\mathbb{E}_\mu[\tau] \ge T^*(\mu) \ln\frac{1}{\delta},$$

where the characteristic time T*(µ) is given by

$$\frac{1}{T^*(\mu)} = \max_{w \in \triangle_K} \; \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i).$$

Intuition (going back to Lai and Robbins, 1985): if the observations are likely under both µ and λ, yet i*(µ) ≠ i*(λ), then the learner cannot stop and be correct on both.
(Proof given on the blackboard.)
Example

K = 5 arms, Bernoulli, µ = (0.0, 0.1, 0.2, 0.3, 0.4). [Bar chart of the means omitted.]

T*(µ) = 200.4
w*(µ) = (0.01, 0.02, 0.06, 0.46, 0.45)

At δ = 0.05, the characteristic time gets multiplied by ln(1/δ) = 3.0.
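Numbers like these can be reproduced numerically. The sketch below is my own illustration, using unit-variance Gaussian arms instead of Bernoulli (so the resulting T* differs from the slide's 200.4): Gaussian KL(x‖y) = (x−y)²/2 gives the inner minimisation a closed form, since for each challenger j the closest alternative merges arms i* and j at their weighted mean.

```python
import numpy as np
from scipy.optimize import minimize

def inner_value(w, mu):
    """min over Alt(mu) of sum_i w_i KL(mu_i || lambda_i), for
    unit-variance Gaussian arms: closed form per challenger j."""
    best = int(np.argmax(mu))
    vals = [w[best] * w[j] / (w[best] + w[j]) * (mu[best] - mu[j])**2 / 2
            for j in range(len(mu)) if j != best]
    return min(vals)

def characteristic_time(mu):
    """Maximise the concave inner value over the simplex. SLSQP is a
    pragmatic choice here; the objective is only piecewise smooth."""
    K = len(mu)
    res = minimize(lambda w: -inner_value(w, mu), np.full(K, 1.0 / K),
                   bounds=[(1e-9, 1.0)] * K,
                   constraints=[{'type': 'eq',
                                 'fun': lambda w: w.sum() - 1.0}],
                   method='SLSQP')
    w = res.x / res.x.sum()
    return 1.0 / inner_value(w, mu), w   # (T*, w*)

T_star, w_star = characteristic_time(np.array([0.0, 0.1, 0.2, 0.3, 0.4]))
```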
Sampling Rule

Look at the lower bound again. Any good algorithm must sample with the optimal (oracle) proportions

$$w^*(\mu) = \operatorname*{argmax}_{w \in \triangle_K} \; \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i).$$

Track-and-Stop idea: draw It ∼ w*(µ̂(t − 1)).
▶ Ensure µ̂(t) → µ, and hence Ni(t)/t → wi*, by "forced exploration"
▶ Draw the arm with Ni(t)/t below wi* (tracking); see the sketch after this list
▶ Computation of w* (reduction to a 1-d line search)
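A sketch of one round of such a rule, in the spirit of the "D-tracking" rule of Garivier and Kaufmann (2016) — my paraphrase, not their exact pseudocode: force exploration of undersampled arms, otherwise track the plug-in weights.

```python
import numpy as np

def tracking_step(t, counts, w_hat):
    """Pick the next arm. counts[i] = N_i(t-1); w_hat are the plug-in
    oracle weights w*(mu_hat(t-1))."""
    K = len(counts)
    # Forced exploration: any arm sampled fewer than ~sqrt(t) times is
    # drawn first, which guarantees mu_hat(t) -> mu.
    under = np.where(counts < np.sqrt(t) - K / 2)[0]
    if under.size > 0:
        return int(under[np.argmin(counts[under])])
    # Tracking: draw the arm whose count lags its target t * w_i^* most.
    return int(np.argmax(t * w_hat - counts))
```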
Stopping

When can we stop and give answer ı̂? When there is no plausible bandit model λ on which ı̂ is wrong.

Definition (Generalized Likelihood Ratio (GLR) measure of evidence):

$$\mathrm{GLR}_n(\hat\imath) := \ln \frac{\sup_{\mu : \hat\imath \in i^*(\mu)} P(X^n \mid A^n, \mu)}{\sup_{\lambda : \hat\imath \notin i^*(\lambda)} P(X^n \mid A^n, \lambda)}.$$

Idea: stop when GLRn(ı̂) is big for some answer ı̂.
GLR Stopping

For any plausible answer ı̂ ∈ i*(µ̂(n)), the GLR simplifies to

$$\mathrm{GLR}_n(\hat\imath) = \inf_{\lambda : \hat\imath \notin i^*(\lambda)} \; \sum_{a=1}^{K} N_a(n) \, \mathrm{KL}(\hat\mu_a(n), \lambda_a),$$

where KL(x, y) is the Kullback–Leibler divergence in the exponential family.
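For the best-arm question with unit-variance Gaussian arms (my simplifying assumption, not the slides' general exponential-family setting), the infimum decomposes over challenger arms j, each with the same two-arm closed form as before, now weighted by counts:

```python
import numpy as np

def glr_gaussian(mu_hat, counts):
    """GLR_n for the answer 'empirical best arm' with N(mu, 1) arms:
    min over challengers j of the cost of making j tie with the best."""
    best = int(np.argmax(mu_hat))
    glr = np.inf
    for j in range(len(mu_hat)):
        if j == best:
            continue
        # Closest lambda with lambda_j >= lambda_best merges both means
        # at their count-weighted average.
        m = ((counts[best] * mu_hat[best] + counts[j] * mu_hat[j])
             / (counts[best] + counts[j]))
        cost = (counts[best] * (mu_hat[best] - m)**2 / 2
                + counts[j] * (mu_hat[j] - m)**2 / 2)
        glr = min(glr, cost)
    return best, glr
```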
GLR Stopping, Threshold

What is a suitable threshold for GLRn so that we do not make mistakes?

A mistake is made when GLRn(ı̂) is big while ı̂ ∉ i*(µ). But then the true µ itself lies in the alternative set {λ : ı̂ ∉ i*(λ)}, so

$$\mathrm{GLR}_n(\hat\imath) = \inf_{\lambda : \hat\imath \notin i^*(\lambda)} \sum_{a=1}^{K} N_a(n) \, \mathrm{KL}(\hat\mu_a(n), \lambda_a) \;\le\; \sum_{a=1}^{K} N_a(n) \, \mathrm{KL}(\hat\mu_a(n), \mu_a).$$

Good anytime deviation inequalities exist for that upper bound.

Theorem (Kaufmann and Koolen, 2018).

$$\mathbb{P}\left(\exists n : \sum_{a=1}^{K} \Big( N_a(n) \, \mathrm{KL}(\hat\mu_a(n), \mu_a) - \ln\ln N_a(n) \Big) \ge C(K, \delta)\right) \le \delta$$

for C(K, δ) ≈ ln(1/δ) + K ln ln(1/δ).
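In code the stopping rule is then a one-line comparison. The exact C(K, δ) is involved; experiments in this literature often use a lighter heuristic threshold such as ln((1 + ln n)/δ), which is what this sketch does (a simplification, not the theorem's threshold):

```python
import numpy as np

def should_stop(glr, n, delta):
    """Stop once the GLR evidence clears a (heuristic) anytime
    threshold; the theorem's C(K, delta) would be more conservative."""
    return glr > np.log((1 + np.log(max(n, 2))) / delta)
```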
All in all

Final result: the lower and upper bounds meet on every problem instance.

Theorem (Garivier and Kaufmann 2016). For the Track-and-Stop algorithm, for any bandit µ,

$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau]}{\ln\frac{1}{\delta}} = T^*(\mu).$$

A very similar optimality result holds for Top-Two Thompson Sampling by Russo (2016); there Ni(t)/t → wi* arises from posterior sampling.
Problem Variations and Algorithms
Variations

▶ Prior knowledge about µ
  ▶ Shape constraints: linear, convex, unimodal, etc. bandits
  ▶ Non-parametric (and heavy-tailed) reward distributions (Agrawal, Koolen, and Juneja, 2021)
  ▶ ...
▶ Questions beyond Best Arm
  ▶ A/B/n testing (Russac et al., 2021)
  ▶ Robust best arm (part 2 today)
  ▶ Thresholding
  ▶ Best VaR, CVaR and other tail risk measures (Agrawal, Koolen, and Juneja, 2021)
  ▶ ...
▶ Multiple correct answers
  ▶ ϵ-best arm
  ▶ In general (Degenne and Koolen, 2019)
  (Requires a change in both the lower bound and the upper bound)
Lazy Iterative Optimisation of w*

Instead of computing the plug-in oracle weights at every round,

$$w^*(\hat\mu) = \operatorname*{argmax}_{w \in \triangle_K} \; \underbrace{\min_{\lambda \in \mathrm{Alt}(\hat\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\hat\mu_i \| \lambda_i)}_{\text{concave in } w},$$

we may work as follows:
▶ The inner problem is concave in w.
▶ It can be maximised iteratively, e.g. with gradient ascent.
▶ We may interleave sampling and gradient steps; see the sketch after this list.
▶ A single gradient step per sample is enough (Degenne, Koolen, and Ménard, 2019).
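A loose sketch of the interleaving — my simplification of Degenne, Koolen, and Ménard (2019), which uses more careful optimistic gradient estimates: one exponentiated-gradient step on w per sample, with a supergradient supplied by a `gradient(w, mu_hat)` callback such as the Danskin-style one sketched later.

```python
import numpy as np

def lazy_loop(draw, K, gradient, T, lr=0.1, seed=0):
    """Interleave sampling with single mirror-ascent steps on w, instead
    of re-solving the oracle problem from scratch every round.
    `draw(i)` samples arm i; `gradient(w, mu_hat)` returns a
    supergradient of the concave inner value at w."""
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)
    counts, sums = np.zeros(K), np.zeros(K)
    for i in range(K):                      # initialise every arm once
        counts[i] += 1; sums[i] += draw(i)
    for t in range(K, T):
        mu_hat = sums / counts
        w = w * np.exp(lr * gradient(w, mu_hat))   # exponentiated grad
        w /= w.sum()
        p = 0.9 * w + 0.1 / K               # keep a little exploration
        i = int(rng.choice(K, p=p))
        counts[i] += 1; sums[i] += draw(i)
    return sums / counts, counts
```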
Minimax Action Identification
Model (Teraoka, Hatano, and Takimoto, 2014)

[Figure: a game tree alternating max and min levels.]

Maximin Action Identification Problem: find the best move at the root from samples of the leaves. What guarantee can we give?
My Brief History

▶ Best Arm Identification (Garivier and Kaufmann, 2016): solved; continuous.
▶ Depth-2 Game (Garivier, Kaufmann, and Koolen, 2016): open; continuous?
▶ Depth-1.5 Game (Kaufmann, Koolen, and Garivier, 2018): solved; discontinuous.

[Figures: example trees with sampled leaf values omitted.]
What we are able to solve today

Noisy games of any depth.

[Figure: a MAX/MIN game tree with leaf means 5 4 4 2 1 7 8 5 3 4 2 6 8 1 9 8; a leaf with mean µ gives samples from N(µ, 1).]
Example Backward Induction Computation

[Figure: the same tree with min/max values propagated from the leaves up to the root by backward induction.]
Model

Definition. A game tree is a min-max tree with leaves L. A bandit model µ assigns a distribution µℓ to each leaf ℓ ∈ L. The maximin action (best action at the root) is

$$i^*(\mu) := \operatorname*{argmax}_{a_1} \; \min_{a_2} \; \max_{a_3} \; \min_{a_4} \cdots \; \mu_{a_1 a_2 a_3 a_4 \ldots}.$$

Protocol. For t = 1, 2, . . . , τ:
▶ Learner picks a leaf Lt ∈ L.
▶ Learner sees Xt ∼ µLt.
Learner recommends action Î. The learner is δ-PAC if

$$\forall \mu : \; \mathbb{P}_\mu\big(\tau < \infty \wedge \hat I \neq i^*(\mu)\big) \le \delta.$$
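Backward induction on such a tree is straightforward; here is a small sketch with trees as nested lists (leaves are means), mirroring the example slide. The concrete tree at the bottom is a toy of my own, not the slides' tree.

```python
def value(node, is_max=True):
    """Backward induction: a leaf is a number, an internal node is a
    list of children, and levels alternate max/min."""
    if not isinstance(node, list):
        return node
    vals = [value(child, not is_max) for child in node]
    return max(vals) if is_max else min(vals)

def maximin_action(root):
    """Best action at the (max) root = argmax of the children's
    min-level values."""
    vals = [value(child, is_max=False) for child in root]
    return max(range(len(vals)), key=vals.__getitem__)

# Depth-2 toy: action 0 is worth min(0.3, 0.6) = 0.3,
# action 1 is worth min(0.5, 0.4) = 0.4, so the maximin action is 1.
assert maximin_action([[0.3, 0.6], [0.5, 0.4]]) == 1
```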
Main Theorem I: Lower Bound

Define the alternatives to µ by Alt(µ) = {λ | i*(λ) ≠ i*(µ)}. NB: here i* is the best action at the root.

Theorem (Castro 2014; Garivier and Kaufmann 2016). Fix a δ-correct strategy. Then for every bandit model µ,

$$\mathbb{E}_\mu[\tau] \ge T^*(\mu) \ln\frac{1}{\delta},$$

where the characteristic time T*(µ) is given by

$$\frac{1}{T^*(\mu)} = \max_{w \in \triangle_K} \; \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i).$$
Main Theorem II: Algorithm

The idea is still to consider the oracle weight map

$$w^*(\mu) := \operatorname*{argmax}_{w \in \triangle_K} \; \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i)$$

and track the plug-in estimate: Lt ∼ w*(µ̂(t − 1)).

But what about continuity? Does µ̂(t) → µ imply w*(µ̂(t)) → w*(µ)? No: w* is not continuous, even at depth "1.5" with 2 arms.

Theorem (Degenne and Koolen, 2019). Take the set-valued interpretation of the argmax defining w*. Then µ ↦ w*(µ) is upper-hemicontinuous and convex-valued. Suitable tracking ensures that as µ̂(t) → µ, any wt ∈ w*(µ̂(t − 1)) satisfy

$$\min_{w \in w^*(\mu)} \|w_t - w\|_\infty \to 0,$$

and Track-and-Stop is asymptotically optimal.
Example

[Figure: the example game tree revisited, leaf means 5 4 4 2 1 7 8 5 3 4 2 6 8 1 9 8.]
On Computation

To compute a gradient (in w) we need to differentiate

$$w \mapsto \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i).$$

An optimal λ ∈ Alt(µ) can be found by binary search for the common value plus tree reasoning in O(|L|) (blackboard).
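Once the inner minimiser λ* is in hand, Danskin's theorem gives the gradient componentwise as KL(µi‖λ*i). The sketch below (my own) does this for the plain best-arm problem with unit-variance Gaussian arms, where λ* has the pairwise closed form; for game trees one would swap in the binary-search-plus-tree-reasoning step mentioned above.

```python
import numpy as np

def danskin_gradient(w, mu):
    """Supergradient at w of w -> min over Alt(mu) of sum_i w_i KL_i,
    for the best-arm question with N(mu_i, 1) arms: find the minimising
    lambda*, then return the vector (KL(mu_i || lambda*_i))_i."""
    best = int(np.argmax(mu))
    best_val, lam_star = np.inf, None
    for j in range(len(mu)):
        if j == best:
            continue
        # Closest alternative where challenger j ties with the best.
        m = (w[best] * mu[best] + w[j] * mu[j]) / (w[best] + w[j])
        val = (w[best] * (mu[best] - m)**2 / 2
               + w[j] * (mu[j] - m)**2 / 2)
        if val < best_val:
            lam = np.array(mu, dtype=float)
            lam[best] = lam[j] = m
            best_val, lam_star = val, lam
    return (np.asarray(mu) - lam_star)**2 / 2   # Gaussian KL per arm
```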
Conclusion
This concludes the guest lecture.
▶ It has been a pleasure
▶ Good luck with the exam
▶ If you have an idea that you want to work on . . .
References

Agrawal, S., W. M. Koolen, and S. Juneja (Dec. 2021). "Optimal Best-Arm Identification Methods for Tail-Risk Measures". In: Advances in Neural Information Processing Systems (NeurIPS) 34. Accepted.

Castro, R. M. (Nov. 2014). "Adaptive sensing performance lower bounds for sparse signal detection and support estimation". In: Bernoulli 20.4, pp. 2217–2246.

Degenne, R. and W. M. Koolen (Dec. 2019). "Pure Exploration with Multiple Correct Answers". In: Advances in Neural Information Processing Systems (NeurIPS) 32, pp. 14591–14600.

Degenne, R., W. M. Koolen, and P. Ménard (Dec. 2019). "Non-Asymptotic Pure Exploration by Solving Games". In: Advances in Neural Information Processing Systems (NeurIPS) 32, pp. 14492–14501.

Even-Dar, E., S. Mannor, and Y. Mansour (2002). "PAC Bounds for Multi-armed Bandit and Markov Decision Processes". In: Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia. Vol. 2375. Lecture Notes in Computer Science, pp. 255–270.

Garivier, A. and E. Kaufmann (2016). "Optimal Best Arm Identification with Fixed Confidence". In: Proceedings of the 29th Conference on Learning Theory (COLT).

Garivier, A., E. Kaufmann, and W. M. Koolen (June 2016). "Maximin Action Identification: A New Bandit Framework for Games". In: Proceedings of the 29th Annual Conference on Learning Theory (COLT).

Kaufmann, E. and W. M. Koolen (Oct. 2018). "Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals". Preprint.

Kaufmann, E., W. M. Koolen, and A. Garivier (Dec. 2018). "Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling". In: Advances in Neural Information Processing Systems (NeurIPS) 31, pp. 6333–6343.

Russac, Y., C. Katsimerou, D. Bohle, O. Cappé, A. Garivier, and W. M. Koolen (Dec. 2021). "A/B/n Testing with Control in the Presence of Subpopulations". In: Advances in Neural Information Processing Systems (NeurIPS) 34. Accepted.

Russo, D. (2016). "Simple Bayesian Algorithms for Best Arm Identification". In: CoRR abs/1602.08448.

Teraoka, K., K. Hatano, and E. Takimoto (2014). "Efficient Sampling Method for Monte Carlo Tree Search Problem". In: IEICE Transactions on Information and Systems, pp. 392–398.