Reinforcement Learning 2021 Guest Lecture on Pure Exploration
Wouter M. Koolen

Download these slides from https://wouterkoolen.info/Talks/2021-10-19.pdf!

▶ Pure Exploration:
  ▶ PAC Learning
  ▶ Best Arm Identification
▶ Minimax Strategies in Noisy Games (Zero-Sum, Extensive Form)
Grand Goal: Interactive Machine Learning

Query: most effective drug dose? most appealing website layout? safest next robot action?

[Diagram: the learner repeatedly sends an experiment to the environment and receives an outcome (e.g. "42").]

Main scientific questions
▶ Efficient systems
▶ Sample complexity as a function of query and environment
Pure Exploration and Reinforcement Learning

Both are about learning in uncertain environments.

Pure Exploration focuses on the statistical problem (learn the truth), while Reinforcement Learning focuses on behaviour (maximise reward).

Pure Exploration occurs as a sub-module in some RL algorithms (e.g. Phased Q-Learning by Even-Dar, Mannor, and Mansour, 2002).

Some problems approached with RL are in fact better modelled as pure exploration problems, most notably MCTS for playing games.
Best Arm Identification
Formal model

Environment (multi-armed bandit model): K distributions parameterised by their means µ = (µ1, . . . , µK). The best arm is

$$i^* = \operatorname*{argmax}_{i \in [K]} \mu_i.$$

Strategy
▶ Stopping rule τ ∈ ℕ
▶ In round t ≤ τ the sampling rule picks It ∈ [K]; see Xt ∼ µIt.
▶ Recommendation rule Î ∈ [K]

Realisation of the interaction: (I1, X1), . . . , (Iτ, Xτ), Î.

Two objectives: sample efficiency τ and correctness Î = i*.
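To make the protocol concrete, here is a minimal sketch of the interaction loop, assuming unit-variance Gaussian arms and treating the three rules as callbacks. All names here (`run`, `sampling_rule`, ...) are illustrative scaffolding, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(mu, sampling_rule, stopping_rule, recommend):
    """Generic pure-exploration loop: sample until the stopping rule
    fires, then recommend an arm. Arms are (assumed) N(mu_i, 1)."""
    K = len(mu)
    counts = np.zeros(K, dtype=int)   # N_i(t)
    sums = np.zeros(K)                # running reward sums
    t = 0
    while not stopping_rule(counts, sums, t):
        i = sampling_rule(counts, sums, t)   # pick I_t in [K]
        x = rng.normal(mu[i], 1.0)           # observe X_t ~ mu_{I_t}
        counts[i] += 1
        sums[i] += x
        t += 1
    return recommend(counts, sums), t        # (I_hat, tau)
```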
Objective

On bandit µ, a strategy (τ, (It)t, Î) has
▶ error probability Pµ(Î ≠ i*(µ)), and
▶ sample complexity Eµ[τ].

Idea: constrain one, optimise the other.

Definition: Fix a small confidence δ ∈ (0, 1). A strategy is δ-correct (aka δ-PAC) if

$$\mathbb{P}_\mu\big(\hat I \neq i^*(\mu)\big) \le \delta \quad \text{for every bandit model } \mu.$$

(Generalisation: output an ϵ-best arm.)

Goal: minimise Eµ[τ] over all δ-correct strategies.
Algorithms

▶ Sampling rule It?
▶ Stopping rule τ?
▶ Recommendation rule Î?

$$\hat I = \operatorname*{argmax}_{i \in [K]} \hat\mu_i(\tau),$$

where µ̂(t) is the empirical mean.

Approach: start by investigating lower bounds.
Instance-Dependent Sample Complexity Lower Bound

Define the alternatives to µ by Alt(µ) = {λ | i*(λ) ≠ i*(µ)}.

Theorem (Castro 2014; Garivier and Kaufmann 2016). Fix a δ-correct strategy. Then for every bandit model µ,

$$\mathbb{E}_\mu[\tau] \ge T^*(\mu) \ln\frac{1}{\delta},$$

where the characteristic time T*(µ) is given by

$$\frac{1}{T^*(\mu)} = \max_{w \in \triangle_K} \; \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i).$$

Intuition (going back to Lai and Robbins, 1985): if the observations are likely under both µ and λ, yet i*(µ) ≠ i*(λ), then the learner cannot stop and be correct on both.
(Proof given on the blackboard.)
Example

K = 5 arms, Bernoulli, µ = (0.0, 0.1, 0.2, 0.3, 0.4). [Bar chart of the means omitted.]

T*(µ) = 200.4
w*(µ) = (0.01, 0.02, 0.06, 0.46, 0.45)

At δ = 0.05, the characteristic time gets multiplied by ln(1/δ) = 3.0.
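Numbers like these can be reproduced numerically. The sketch below is my own illustration, using unit-variance Gaussian arms instead of Bernoulli (so the resulting T* differs from the slide's 200.4): Gaussian KL(x‖y) = (x−y)²/2 gives the inner minimisation a closed form, since for each challenger j the closest alternative merges arms i* and j at their weighted mean.

```python
import numpy as np
from scipy.optimize import minimize

def inner_value(w, mu):
    """min over Alt(mu) of sum_i w_i KL(mu_i || lambda_i), for
    unit-variance Gaussian arms: closed form per challenger j."""
    best = int(np.argmax(mu))
    vals = [w[best] * w[j] / (w[best] + w[j]) * (mu[best] - mu[j])**2 / 2
            for j in range(len(mu)) if j != best]
    return min(vals)

def characteristic_time(mu):
    """Maximise the concave inner value over the simplex. SLSQP is a
    pragmatic choice here; the objective is only piecewise smooth."""
    K = len(mu)
    res = minimize(lambda w: -inner_value(w, mu), np.full(K, 1.0 / K),
                   bounds=[(1e-9, 1.0)] * K,
                   constraints=[{'type': 'eq',
                                 'fun': lambda w: w.sum() - 1.0}],
                   method='SLSQP')
    w = res.x / res.x.sum()
    return 1.0 / inner_value(w, mu), w   # (T*, w*)

T_star, w_star = characteristic_time(np.array([0.0, 0.1, 0.2, 0.3, 0.4]))
```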
Sampling Rule

Look at the lower bound again. Any good algorithm must sample with the optimal (oracle) proportions

$$w^*(\mu) = \operatorname*{argmax}_{w \in \triangle_K} \; \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i).$$

Track-and-Stop idea: draw It ∼ w*(µ̂(t − 1)).
▶ Ensure µ̂(t) → µ, and hence Ni(t)/t → wi*, by "forced exploration"
▶ Draw the arm with Ni(t)/t below wi* (tracking); see the sketch after this list
▶ Computation of w* (reduction to a 1-d line search)
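A sketch of one round of such a rule, in the spirit of the "D-tracking" rule of Garivier and Kaufmann (2016) — my paraphrase, not their exact pseudocode: force exploration of undersampled arms, otherwise track the plug-in weights.

```python
import numpy as np

def tracking_step(t, counts, w_hat):
    """Pick the next arm. counts[i] = N_i(t-1); w_hat are the plug-in
    oracle weights w*(mu_hat(t-1))."""
    K = len(counts)
    # Forced exploration: any arm sampled fewer than ~sqrt(t) times is
    # drawn first, which guarantees mu_hat(t) -> mu.
    under = np.where(counts < np.sqrt(t) - K / 2)[0]
    if under.size > 0:
        return int(under[np.argmin(counts[under])])
    # Tracking: draw the arm whose count lags its target t * w_i^* most.
    return int(np.argmax(t * w_hat - counts))
```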
Stopping

When can we stop and give answer ı̂? When there is no plausible bandit model λ on which ı̂ is wrong.

Definition (Generalized Likelihood Ratio (GLR) measure of evidence):

$$\mathrm{GLR}_n(\hat\imath) := \ln \frac{\sup_{\mu : \hat\imath \in i^*(\mu)} P(X^n \mid A^n, \mu)}{\sup_{\lambda : \hat\imath \notin i^*(\lambda)} P(X^n \mid A^n, \lambda)}.$$

Idea: stop when GLRn(ı̂) is big for some answer ı̂.
GLR Stopping

For any plausible answer ı̂ ∈ i*(µ̂(n)), the GLR simplifies to

$$\mathrm{GLR}_n(\hat\imath) = \inf_{\lambda : \hat\imath \notin i^*(\lambda)} \; \sum_{a=1}^{K} N_a(n) \, \mathrm{KL}(\hat\mu_a(n), \lambda_a),$$

where KL(x, y) is the Kullback–Leibler divergence in the exponential family.
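For the best-arm question with unit-variance Gaussian arms (my simplifying assumption, not the slides' general exponential-family setting), the infimum decomposes over challenger arms j, each with the same two-arm closed form as before, now weighted by counts:

```python
import numpy as np

def glr_gaussian(mu_hat, counts):
    """GLR_n for the answer 'empirical best arm' with N(mu, 1) arms:
    min over challengers j of the cost of making j tie with the best."""
    best = int(np.argmax(mu_hat))
    glr = np.inf
    for j in range(len(mu_hat)):
        if j == best:
            continue
        # Closest lambda with lambda_j >= lambda_best merges both means
        # at their count-weighted average.
        m = ((counts[best] * mu_hat[best] + counts[j] * mu_hat[j])
             / (counts[best] + counts[j]))
        cost = (counts[best] * (mu_hat[best] - m)**2 / 2
                + counts[j] * (mu_hat[j] - m)**2 / 2)
        glr = min(glr, cost)
    return best, glr
```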
GLR Stopping, Threshold

What is a suitable threshold for GLRn so that we do not make mistakes?

A mistake is made when GLRn(ı̂) is big while ı̂ ∉ i*(µ). But then the true µ itself lies in the alternative set {λ : ı̂ ∉ i*(λ)}, so

$$\mathrm{GLR}_n(\hat\imath) = \inf_{\lambda : \hat\imath \notin i^*(\lambda)} \sum_{a=1}^{K} N_a(n) \, \mathrm{KL}(\hat\mu_a(n), \lambda_a) \;\le\; \sum_{a=1}^{K} N_a(n) \, \mathrm{KL}(\hat\mu_a(n), \mu_a).$$

Good anytime deviation inequalities exist for that upper bound.

Theorem (Kaufmann and Koolen, 2018).

$$\mathbb{P}\left(\exists n : \sum_{a=1}^{K} \Big( N_a(n) \, \mathrm{KL}(\hat\mu_a(n), \mu_a) - \ln\ln N_a(n) \Big) \ge C(K, \delta)\right) \le \delta$$

for C(K, δ) ≈ ln(1/δ) + K ln ln(1/δ).
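In code the stopping rule is then a one-line comparison. The exact C(K, δ) is involved; experiments in this literature often use a lighter heuristic threshold such as ln((1 + ln n)/δ), which is what this sketch does (a simplification, not the theorem's threshold):

```python
import numpy as np

def should_stop(glr, n, delta):
    """Stop once the GLR evidence clears a (heuristic) anytime
    threshold; the theorem's C(K, delta) would be more conservative."""
    return glr > np.log((1 + np.log(max(n, 2))) / delta)
```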
All in all

Final result: the lower and upper bounds meet on every problem instance.

Theorem (Garivier and Kaufmann 2016). For the Track-and-Stop algorithm, for any bandit µ,

$$\limsup_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau]}{\ln\frac{1}{\delta}} = T^*(\mu).$$

A very similar optimality result holds for Top-Two Thompson Sampling by Russo (2016); there Ni(t)/t → wi* arises from posterior sampling.
Problem Variations and Algorithms
Variations

▶ Prior knowledge about µ
  ▶ Shape constraints: linear, convex, unimodal, etc. bandits
  ▶ Non-parametric (and heavy-tailed) reward distributions (Agrawal, Koolen, and Juneja, 2021)
  ▶ ...
▶ Questions beyond Best Arm
  ▶ A/B/n testing (Russac et al., 2021)
  ▶ Robust best arm (part 2 today)
  ▶ Thresholding
  ▶ Best VaR, CVaR and other tail risk measures (Agrawal, Koolen, and Juneja, 2021)
  ▶ ...
▶ Multiple correct answers
  ▶ ϵ-best arm
  ▶ In general (Degenne and Koolen, 2019)
  (Requires a change in both the lower bound and the upper bound)
Lazy Iterative Optimisation of w*

Instead of computing the plug-in oracle weights at every round,

$$w^*(\hat\mu) = \operatorname*{argmax}_{w \in \triangle_K} \; \underbrace{\min_{\lambda \in \mathrm{Alt}(\hat\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\hat\mu_i \| \lambda_i)}_{\text{concave in } w},$$

we may work as follows:
▶ The inner problem is concave in w.
▶ It can be maximised iteratively, e.g. with gradient ascent.
▶ We may interleave sampling and gradient steps; see the sketch after this list.
▶ A single gradient step per sample is enough (Degenne, Koolen, and Ménard, 2019).
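A loose sketch of the interleaving — my simplification of Degenne, Koolen, and Ménard (2019), which uses more careful optimistic gradient estimates: one exponentiated-gradient step on w per sample, with a supergradient supplied by a `gradient(w, mu_hat)` callback such as the Danskin-style one sketched later.

```python
import numpy as np

def lazy_loop(draw, K, gradient, T, lr=0.1, seed=0):
    """Interleave sampling with single mirror-ascent steps on w, instead
    of re-solving the oracle problem from scratch every round.
    `draw(i)` samples arm i; `gradient(w, mu_hat)` returns a
    supergradient of the concave inner value at w."""
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)
    counts, sums = np.zeros(K), np.zeros(K)
    for i in range(K):                      # initialise every arm once
        counts[i] += 1; sums[i] += draw(i)
    for t in range(K, T):
        mu_hat = sums / counts
        w = w * np.exp(lr * gradient(w, mu_hat))   # exponentiated grad
        w /= w.sum()
        p = 0.9 * w + 0.1 / K               # keep a little exploration
        i = int(rng.choice(K, p=p))
        counts[i] += 1; sums[i] += draw(i)
    return sums / counts, counts
```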
Minimax Action Identification
Model (Teraoka, Hatano, and Takimoto, 2014)

[Figure: a game tree alternating max and min levels.]

Maximin Action Identification Problem: find the best move at the root from samples of the leaves. What guarantee can we give?
My Brief History

▶ Best Arm Identification (Garivier and Kaufmann, 2016): solved; continuous.
▶ Depth-2 Game (Garivier, Kaufmann, and Koolen, 2016): open; continuous?
▶ Depth-1.5 Game (Kaufmann, Koolen, and Garivier, 2018): solved; discontinuous.

[Figures: example trees with sampled leaf values omitted.]
What we are able to solve today

Noisy games of any depth.

[Figure: a MAX/MIN game tree with leaf means 5 4 4 2 1 7 8 5 3 4 2 6 8 1 9 8; a leaf with mean µ gives samples from N(µ, 1).]
Example Backward Induction Computation

[Figure: the same tree with min/max values propagated from the leaves up to the root by backward induction.]
Model

Definition. A game tree is a min-max tree with leaves L. A bandit model µ assigns a distribution µℓ to each leaf ℓ ∈ L. The maximin action (best action at the root) is

$$i^*(\mu) := \operatorname*{argmax}_{a_1} \; \min_{a_2} \; \max_{a_3} \; \min_{a_4} \cdots \; \mu_{a_1 a_2 a_3 a_4 \ldots}.$$

Protocol. For t = 1, 2, . . . , τ:
▶ Learner picks a leaf Lt ∈ L.
▶ Learner sees Xt ∼ µLt.
Learner recommends action Î. The learner is δ-PAC if

$$\forall \mu : \; \mathbb{P}_\mu\big(\tau < \infty \wedge \hat I \neq i^*(\mu)\big) \le \delta.$$
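Backward induction on such a tree is straightforward; here is a small sketch with trees as nested lists (leaves are means), mirroring the example slide. The concrete tree at the bottom is a toy of my own, not the slides' tree.

```python
def value(node, is_max=True):
    """Backward induction: a leaf is a number, an internal node is a
    list of children, and levels alternate max/min."""
    if not isinstance(node, list):
        return node
    vals = [value(child, not is_max) for child in node]
    return max(vals) if is_max else min(vals)

def maximin_action(root):
    """Best action at the (max) root = argmax of the children's
    min-level values."""
    vals = [value(child, is_max=False) for child in root]
    return max(range(len(vals)), key=vals.__getitem__)

# Depth-2 toy: action 0 is worth min(0.3, 0.6) = 0.3,
# action 1 is worth min(0.5, 0.4) = 0.4, so the maximin action is 1.
assert maximin_action([[0.3, 0.6], [0.5, 0.4]]) == 1
```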
Main Theorem I: Lower Bound

Define the alternatives to µ by Alt(µ) = {λ | i*(λ) ≠ i*(µ)}. NB: here i* is the best action at the root.

Theorem (Castro 2014; Garivier and Kaufmann 2016). Fix a δ-correct strategy. Then for every bandit model µ,

$$\mathbb{E}_\mu[\tau] \ge T^*(\mu) \ln\frac{1}{\delta},$$

where the characteristic time T*(µ) is given by

$$\frac{1}{T^*(\mu)} = \max_{w \in \triangle_K} \; \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i).$$
Main Theorem II: Algorithm

The idea is still to consider the oracle weight map

$$w^*(\mu) := \operatorname*{argmax}_{w \in \triangle_K} \; \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i)$$

and track the plug-in estimate: Lt ∼ w*(µ̂(t − 1)).

But what about continuity? Does µ̂(t) → µ imply w*(µ̂(t)) → w*(µ)? No: w* is not continuous, even at depth "1.5" with 2 arms.

Theorem (Degenne and Koolen, 2019). Take the set-valued interpretation of the argmax defining w*. Then µ ↦ w*(µ) is upper-hemicontinuous and convex-valued. Suitable tracking ensures that as µ̂(t) → µ, any wt ∈ w*(µ̂(t − 1)) satisfy

$$\min_{w \in w^*(\mu)} \|w_t - w\|_\infty \to 0,$$

and Track-and-Stop is asymptotically optimal.
Example

[Figure: the example game tree revisited, leaf means 5 4 4 2 1 7 8 5 3 4 2 6 8 1 9 8.]
On Computation

To compute a gradient (in w) we need to differentiate

$$w \mapsto \min_{\lambda \in \mathrm{Alt}(\mu)} \; \sum_{i=1}^{K} w_i \, \mathrm{KL}(\mu_i \| \lambda_i).$$

An optimal λ ∈ Alt(µ) can be found by binary search for the common value plus tree reasoning in O(|L|) (blackboard).
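Once the inner minimiser λ* is in hand, Danskin's theorem gives the gradient componentwise as KL(µi‖λ*i). The sketch below (my own) does this for the plain best-arm problem with unit-variance Gaussian arms, where λ* has the pairwise closed form; for game trees one would swap in the binary-search-plus-tree-reasoning step mentioned above.

```python
import numpy as np

def danskin_gradient(w, mu):
    """Supergradient at w of w -> min over Alt(mu) of sum_i w_i KL_i,
    for the best-arm question with N(mu_i, 1) arms: find the minimising
    lambda*, then return the vector (KL(mu_i || lambda*_i))_i."""
    best = int(np.argmax(mu))
    best_val, lam_star = np.inf, None
    for j in range(len(mu)):
        if j == best:
            continue
        # Closest alternative where challenger j ties with the best.
        m = (w[best] * mu[best] + w[j] * mu[j]) / (w[best] + w[j])
        val = (w[best] * (mu[best] - m)**2 / 2
               + w[j] * (mu[j] - m)**2 / 2)
        if val < best_val:
            lam = np.array(mu, dtype=float)
            lam[best] = lam[j] = m
            best_val, lam_star = val, lam
    return (np.asarray(mu) - lam_star)**2 / 2   # Gaussian KL per arm
```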
Conclusion
This concludes the guest lecture.
▶ It has been a pleasure
▶ Good luck with the exam
▶ If you have an idea that you want to work on . . .
References

Agrawal, S., W. M. Koolen, and S. Juneja (Dec. 2021). "Optimal Best-Arm Identification Methods for Tail-Risk Measures". In: Advances in Neural Information Processing Systems (NeurIPS) 34. Accepted.

Castro, R. M. (Nov. 2014). "Adaptive sensing performance lower bounds for sparse signal detection and support estimation". In: Bernoulli 20.4, pp. 2217–2246.

Degenne, R. and W. M. Koolen (Dec. 2019). "Pure Exploration with Multiple Correct Answers". In: Advances in Neural Information Processing Systems (NeurIPS) 32, pp. 14591–14600.

Degenne, R., W. M. Koolen, and P. Ménard (Dec. 2019). "Non-Asymptotic Pure Exploration by Solving Games". In: Advances in Neural Information Processing Systems (NeurIPS) 32, pp. 14492–14501.

Even-Dar, E., S. Mannor, and Y. Mansour (2002). "PAC Bounds for Multi-armed Bandit and Markov Decision Processes". In: Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia. Vol. 2375. Lecture Notes in Computer Science, pp. 255–270.

Garivier, A. and E. Kaufmann (2016). "Optimal Best Arm Identification with Fixed Confidence". In: Proceedings of the 29th Conference on Learning Theory (COLT).

Garivier, A., E. Kaufmann, and W. M. Koolen (June 2016). "Maximin Action Identification: A New Bandit Framework for Games". In: Proceedings of the 29th Annual Conference on Learning Theory (COLT).

Kaufmann, E. and W. M. Koolen (Oct. 2018). "Mixture Martingales Revisited with Applications to Sequential Tests and Confidence Intervals". Preprint.

Kaufmann, E., W. M. Koolen, and A. Garivier (Dec. 2018). "Sequential Test for the Lowest Mean: From Thompson to Murphy Sampling". In: Advances in Neural Information Processing Systems (NeurIPS) 31, pp. 6333–6343.

Russac, Y., C. Katsimerou, D. Bohle, O. Cappé, A. Garivier, and W. M. Koolen (Dec. 2021). "A/B/n Testing with Control in the Presence of Subpopulations". In: Advances in Neural Information Processing Systems (NeurIPS) 34. Accepted.

Russo, D. (2016). "Simple Bayesian Algorithms for Best Arm Identification". In: CoRR abs/1602.08448.

Teraoka, K., K. Hatano, and E. Takimoto (2014). "Efficient Sampling Method for Monte Carlo Tree Search Problem". In: IEICE Transactions on Information and Systems, pp. 392–398.