VIPER: PROVABLY EFFICIENT ALGORITHM FOR OFFLINE RL WITH NEURAL FUNCTION APPROXIMATION - THANH NGUYEN-TANG & RAMAN ARORA
VIPeR: Provably Efficient Algorithm for Offline RL with Neural Function Approximation
Thanh Nguyen-Tang & Raman Arora
Department of Computer Science, Johns Hopkins University
*Top 25% notable papers, ICLR'23
Episodic MDP
• Episodic time-inhomogeneous Markov decision process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, H, P, r, d_1)$
• State space $\mathcal{S}$: finite but exponentially large
• Action space $\mathcal{A}$: finite but exponentially large
• Episode length $H$: finite
• The agent interacts with the MDP for $H$ steps and then restarts the episode
• Unknown transition kernels $P = (P_1, \ldots, P_H)$, where $P_h : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$
• Unknown reward functions $r = (r_1, \ldots, r_H)$, where $r_h : \mathcal{S} \times \mathcal{A} \to [0, 1]$
• Initial state distribution $d_1 \in \Delta(\mathcal{S})$
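As a point of reference, a tabular version of this setup could be represented as follows (a minimal sketch; the class and field names are illustrative, not from the paper, and tabular arrays are used only to make the object concrete):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EpisodicMDP:
    """Time-inhomogeneous episodic MDP M = (S, A, H, P, r, d1)."""
    n_states: int        # |S| (finite, possibly very large)
    n_actions: int       # |A| (finite, possibly very large)
    H: int               # episode length
    P: np.ndarray        # shape (H, |S|, |A|, |S|); P[h, s, a] is a distribution over next states
    r: np.ndarray        # shape (H, |S|, |A|); rewards in [0, 1]
    d1: np.ndarray       # shape (|S|,); initial state distribution

    def step(self, h, s, a, rng=None):
        """Sample the next state and return (next_state, reward) at step h."""
        rng = rng or np.random.default_rng()
        s_next = rng.choice(self.n_states, p=self.P[h, s, a])
        return s_next, self.r[h, s, a]
```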
Episodic MDP
• A policy $\pi = (\pi_h)_{h \in [H]}$, where $\pi_h : \mathcal{S} \to \Delta(\mathcal{A})$
• Action-value functions: $Q_h^\pi(s, a) = \mathbb{E}_\pi\big[\sum_{t=h}^{H} r_t \mid (s_h, a_h) = (s, a)\big]$
• Value functions: $V_h^\pi(s) = \mathbb{E}_\pi\big[\sum_{t=h}^{H} r_t \mid s_h = s\big]$
• The optimal policy $\pi^*$ maximizes $V_1^\pi$
Offline RL
• Offline dataset, collected a priori: $\mathcal{D} = \{(s_h^k, a_h^k, r_h^k)\}_{h \in [H], k \in [K]}$
• $a_h^k \sim \mu_h(\cdot \mid s_h^k)$, $s_{h+1}^k \sim P_h(\cdot \mid s_h^k, a_h^k)$
• $\mu$ is the behavior policy
• $K$: number of episodes
• No further interactions with the MDP
• Learning objective: $\mathrm{SubOpt}(\hat{\pi}; s_1) = V_1^*(s_1) - V_1^{\hat{\pi}}(s_1)$, where $\hat{\pi} = \mathrm{OfflineRLAlgo}(\mathcal{D}, \mathcal{F})$ and $\mathcal{F}$ is some function class (e.g., neural networks)
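A minimal sketch of how such a dataset could be generated by rolling out a behavior policy $\mu$ for $K$ episodes (the `EpisodicMDP` interface is the illustrative one sketched above, and `mu[h][s]` is assumed to be a probability vector over actions):

```python
import numpy as np

def collect_offline_data(mdp, mu, K, seed=0):
    """Return D[h] = [(s_h^k, a_h^k, r_h^k, s_{h+1}^k) for k in range(K)], h = 0..H-1.
    mdp is assumed to expose n_actions, H, d1 and step(h, s, a, rng);
    mu[h][s] is the behavior policy's action distribution at step h in state s."""
    rng = np.random.default_rng(seed)
    D = [[] for _ in range(mdp.H)]
    for _ in range(K):
        s = rng.choice(len(mdp.d1), p=mdp.d1)       # s_1^k ~ d_1
        for h in range(mdp.H):
            a = rng.choice(mdp.n_actions, p=mu[h][s])
            s_next, reward = mdp.step(h, s, a, rng)
            D[h].append((s, a, reward, s_next))
            s = s_next
    return D
```

The suboptimality $\mathrm{SubOpt}(\hat{\pi}; s_1)$ is then measured for a policy learned from this fixed dataset: the learner never interacts with the MDP again.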
PEVI algorithm (Jin et al., ICML'21)
• Low-rank MDPs: $r_h(s, a) = \phi_h(s, a)^\top w_h$, $P_h(s' \mid s, a) = \phi_h(s, a)^\top \nu_h(s')$
• Pessimism design: compute an LCB (lower confidence bound) $\hat{Q}_h(s, a) = \phi_h(s, a)^\top \hat{w}_h - \beta \|\phi_h(s, a)\|_{\Lambda_h^{-1}}$
• $\Lambda_h = \lambda I + \sum_{k=1}^{K} \phi_h(s_h^k, a_h^k)\, \phi_h(s_h^k, a_h^k)^\top$
• $\hat{w}_h$: ERM (regularized least squares)
• Then extract $\hat{\pi}_h$ greedy w.r.t. $\hat{Q}_h$
• PEVI is both statistically efficient and computationally efficient in low-rank MDPs
• PEVI requires data coverage only over an optimal policy
[Jin et al., ICML'21] "Is pessimism provably efficient for offline RL?"
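For concreteness, here is a minimal numpy sketch of this LCB computation at a single step $h$ (the function name and the default values of `beta` and `lam` are illustrative, not taken from the paper):

```python
import numpy as np

def pevi_lcb(phi_data, targets, phi_query, beta=1.0, lam=1.0):
    """phi_data: (K, d) features phi_h(s_h^k, a_h^k); targets: (K,) regression targets
    (e.g., r_h^k + V_{h+1}(s_{h+1}^k)); phi_query: (d,) feature of the (s, a) to evaluate.
    Returns the pessimistic estimate  phi^T w_hat - beta * ||phi||_{Lambda^{-1}}."""
    K, d = phi_data.shape
    Lambda = lam * np.eye(d) + phi_data.T @ phi_data          # empirical covariance
    w_hat = np.linalg.solve(Lambda, phi_data.T @ targets)     # regularized least-squares (ERM)
    bonus = np.sqrt(phi_query @ np.linalg.solve(Lambda, phi_query))
    return float(phi_query @ w_hat - beta * bonus)
```

The bonus shrinks for state-action pairs whose features are well covered by the offline data and blows up otherwise, which is how the estimate stays pessimistic on poorly covered actions.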
Offline RL with Neural Function Approximation
However, in many practical settings, MDPs do not admit any linear structure → it is desirable to use a differentiable function approximator such as a neural network.
Background: Neural networks
• Action-value functions are approximated by a two-layer neural network: $f(x; W) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \cdot \mathrm{ReLU}(w_i^\top x)$
• $x = (s, a) \in \mathcal{S} \times \mathcal{A} \subset \mathbb{R}^d$
• $m$: network width
• $b_i \sim \mathrm{Unif}(\{-1, 1\})$: fixed during training
• $W = (w_1, w_2, \ldots, w_m) \in \mathbb{R}^{md}$: trainable
• $W_0$: initialization of $W$
• $w_i^0 \sim \mathcal{N}(0, I_d / d)$
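A minimal numpy sketch of this architecture (the helper names are illustrative; the initialization scale follows the $w_i \sim \mathcal{N}(0, I_d/d)$ convention above):

```python
import numpy as np

def init_two_layer(d, m, rng=None):
    """Initialize f(x; W) = (1/sqrt(m)) * sum_i b_i * ReLU(w_i^T x)."""
    rng = rng or np.random.default_rng(0)
    b = rng.choice([-1.0, 1.0], size=m)                   # fixed during training
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))    # trainable, w_i ~ N(0, I_d/d)
    return b, W

def two_layer(x, b, W):
    """Evaluate f(x; W) for a single input x in R^d."""
    m = W.shape[0]
    return float(b @ np.maximum(W @ x, 0.0)) / np.sqrt(m)
```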
NeuraLCB (Nguyen-Tang et al., ICLR'22)
LCB + neural function approximation
• Use a neural network $f(s, a; W)$ to approximate the action-value function
• Compute the LCB: $\hat{Q}_h(s, a) = f(s, a; \hat{W}) - \beta \|\nabla_W f(s, a; \hat{W})\|_{\Lambda_h^{-1}}$
• $\hat{W}$: found by gradient descent
• $\nabla_W f(s, a; \hat{W})$: the gradient of the neural network w.r.t. its parameters
• $\Lambda_h = \lambda I + \sum_{k=1}^{K} \nabla_W f(s_h^k, a_h^k; \hat{W})\, \nabla_W f(s_h^k, a_h^k; \hat{W})^\top$ is the empirical covariance matrix
• Complexity of computing the LCB grows with the number of network parameters $md$, since $\Lambda_h$ is an $md \times md$ matrix
[Nguyen-Tang et al., ICLR'22] "Offline Neural Contextual Bandits: Pessimism, Optimization, and Generalization"
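A minimal numpy sketch of this gradient-based LCB for the two-layer network from the previous slide, where the per-parameter gradient has the closed form $\partial f / \partial w_i = (b_i/\sqrt{m})\,\mathbb{1}\{w_i^\top x > 0\}\,x$ (function names and the default `beta`, `lam` are illustrative):

```python
import numpy as np

def grad_features(X, b, W):
    """Per-example gradient of f(x; W) = (1/sqrt(m)) * sum_i b_i * ReLU(w_i^T x)
    with respect to W, flattened to shape (n, m*d)."""
    m, d = W.shape
    active = (X @ W.T > 0).astype(float)                       # (n, m) ReLU indicators
    G = (active * b / np.sqrt(m))[:, :, None] * X[:, None, :]  # (n, m, d)
    return G.reshape(X.shape[0], m * d)

def neural_lcb(f_hat, g_data, g_query, beta=1.0, lam=1.0):
    """LCB = f(s,a; W_hat) - beta * ||grad f(s,a; W_hat)||_{Lambda_h^{-1}},
    where Lambda_h = lam*I + sum_k g_k g_k^T is built from the offline gradients."""
    p = g_data.shape[1]                                  # p = m*d network parameters
    Lambda = lam * np.eye(p) + g_data.T @ g_data         # (p, p) empirical covariance
    bonus = np.sqrt(g_query @ np.linalg.solve(Lambda, g_query))
    return float(f_hat - beta * bonus)
```

Because $\Lambda_h$ is $md \times md$, forming it and solving against it is exactly the per-parameter cost noted above; VIPeR is designed to avoid this step.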
VIPeR: Implicit pessimism via reward perturbing
• End of episode: $\tilde{Q}_{H+1} \leftarrow 0$
• Bootstrapping (backward induction): for $h = H, H-1, \ldots, 1$
  • Perturb the rewards with independent Gaussian noise $\xi_h^{k,i} \sim \mathcal{N}(0, \sigma^2)$ to create $M$ perturbed datasets $\tilde{\mathcal{D}}_h^1, \ldots, \tilde{\mathcal{D}}_h^M$, where $\tilde{\mathcal{D}}_h^i = \{(s_h^k, a_h^k, r_h^k + \tilde{V}_{h+1}(s_{h+1}^k) + \xi_h^{k,i})\}_{k \in [K]}$
  • Train a separate neural network on each $\tilde{\mathcal{D}}_h^i$ to get $f(s, a; \tilde{W}_h^i)$
  • Pessimistically estimate $\tilde{Q}_h(s, a) = \min_{i \in [M]} f(s, a; \tilde{W}_h^i)$
  • Optimize $\tilde{\pi}_h(s) \leftarrow \mathrm{argmax}_{a \in \mathcal{A}} \tilde{Q}_h(s, a)$ and $\tilde{V}_h(s) = \max_{a} \tilde{Q}_h(s, a)$
  • Move to step $h - 1$ and repeat
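A compact sketch of one backward-induction step, using scikit-learn's `MLPRegressor` as a stand-in for the two-layer network trained by gradient descent (the regressor choice, feature interface, and hyperparameter values are assumptions of this sketch, not the paper's exact setup):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def viper_step(X, r, V_next, M=10, sigma=0.1, seed=0):
    """One backward-induction step of the reward-perturbing scheme.
    X: (K, d) features of (s_h^k, a_h^k);  r: (K,) rewards r_h^k;
    V_next: (K,) estimated values V~_{h+1}(s_{h+1}^k)  (all zeros at step H).
    Returns Q~_h(.) = elementwise min over M networks trained on perturbed targets."""
    rng = np.random.default_rng(seed)
    models = []
    for i in range(M):
        # Perturb the regression targets with i.i.d. Gaussian noise -> dataset D~_h^i
        y_i = r + V_next + rng.normal(0.0, sigma, size=len(r))
        net = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500,
                           random_state=i)          # stand-in for the two-layer net
        net.fit(X, y_i)
        models.append(net)
    # Pessimistic estimate: minimum over the ensemble, no explicit LCB needed
    return lambda X_query: np.min(
        np.stack([net.predict(X_query) for net in models]), axis=0)
```

Given features for all actions at a state, $\tilde{\pi}_h$ picks the argmax of the returned $\tilde{Q}_h$, and its max over actions gives $\tilde{V}_h$, which supplies the `V_next` targets when the loop moves to step $h-1$.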
Value suboptimality of VIPeR
• Conditions:
  • Adaptively collected data with coverage of only an optimal policy
  • Completeness assumption
  • The ensemble size $M$ is polylogarithmically large
  • The noise level $\sigma = \tilde{\mathcal{O}}(\sqrt{d_{\mathrm{eff}}})$
  • Small learning rate and sufficiently many training iterations
  • The network width $m$ is polynomially large
• Then,
  $\mathrm{SubOpt}(\tilde{\pi}; s_1) \lesssim \tilde{\mathcal{O}}\Big(\sum_{h=1}^{H} \mathbb{E}_{\pi^*}\big[\|\nabla_W f(s_h, a_h; W_0)\|_{\Lambda_h^{-1}}\big]\Big)$
• Same guarantee as NeuraLCB, but with no need to compute an LCB
Comparison
Experiment: Neural Contextual Bandits
• LinLCB: LCB + linear
• Lin-VIPeR: reward perturbing + linear
• NeuralGreedy: FQI + neural
• NeuraLCB: LCB + neural
• Neural-VIPeR: reward perturbing + neural
Runtime efficiency
• NeuraLCB spends $\mathcal{O}(K)$ time in action selection
• Neural-VIPeR spends $\mathcal{O}(1)$ time in action selection
Controlling pessimism
The ensemble size $M$ controls the level of pessimism and correlates nicely with the decrease in suboptimality in practice.
Results on D4RL
• BEAR (Kumar et al., 2019): uses an MMD distance to constrain the policy to the offline data
• UWAC (Wu et al., 2021): improves BEAR using dropout uncertainty
• CQL (Kumar et al., 2020): minimizes Q-values of OOD actions
• MOPO (Yu et al., 2020): uses model-based uncertainty via ensemble dynamics
• TD3-BC (Fujimoto & Gu, 2021): uses adaptive behavior cloning
• PBRL (Bai et al., 2022): uses uncertainty quantification via disagreement of bootstrapped Q-functions
Summary
We now have a provably efficient and computationally efficient algorithm for offline RL with neural function approximation, with polynomial sample complexity and runtime under mild data coverage.
Algorithm: perturbed rewards + (stochastic) gradient descent
Sample complexity: $\mathrm{SubOpt}(\tilde{\pi}) = \tilde{\mathcal{O}}\big(\kappa \cdot H^{5/2} \cdot d_{\mathrm{eff}} / \sqrt{K}\big)$
• $K$: # samples
• $H$: horizon
• $\kappa$: single-policy concentration coefficient
• $d_{\mathrm{eff}}$: effective dimension
Future Work
• Match the lower bound in terms of ... ?
• Avoid the need to train an ensemble of models?
• General-purpose offline RL with any regression oracle?
• Can randomized value iteration obtain a horizon-free rate?
• Extension to POMDPs?
• Model-free posterior sampling for offline RL with optimal frequentist rates?
Thank you!
More details in our paper: https://openreview.net/forum?id=WOquZTLCBO1