Trust Region Policy Optimization
John Schulman JOSCHU@EECS.BERKELEY.EDU
Sergey Levine SLEVINE@EECS.BERKELEY.EDU
Philipp Moritz PCMORITZ@EECS.BERKELEY.EDU
Michael Jordan JORDAN@CS.BERKELEY.EDU
Pieter Abbeel PABBEEL@CS.BERKELEY.EDU
University of California, Berkeley, Department of Electrical Engineering and Computer Sciences
Abstract
In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
1 Introduction
Most algorithms for policy optimization can be classified into three broad categories: policy iteration methods, which alternate between estimating the value function under the current policy and improving the policy (Bertsekas, 2005); policy gradient methods, which use an estimator of the gradient of the expected cost obtained from sample trajectories (Peters & Schaal, 2008a) (and which, as we later discuss, have a close connection to policy iteration); and derivative-free optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation (CMA), which treat the cost as a black box function to be optimized in terms of the policy parameters (Fu et al., 2005; Szita & Lőrincz, 2006).
General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems, because they achieve good results while being simple to understand and implement. For example, while Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task (Gabillon et al., 2013). For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations (Wampler & Popović, 2009). The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods (Nemirovski, 2005). Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.
In this article, we first prove that minimizing a certain surrogate loss function guarantees policy improvement with non-trivial step sizes. Then we make a series of approximations to the theoretically-justified algorithm, yielding a practical algorithm, which we call trust region policy optimization (TRPO). We describe two variants of this algorithm: first, the single-path method, which can be applied in the model-free setting; second, the vine method, which requires the system to be restored to particular states, which is typically only possible in simulation. These algorithms are scalable and can optimize nonlinear policies with tens of thousands of parameters, which have previously posed a major challenge for model-free policy search (Deisenroth et al., 2013). In our experiments, we show that the same TRPO methods can learn complex policies for swimming, hopping, and walking, as well as playing Atari games directly from raw images.
2 Preliminaries
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, c, \rho_0, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $c : \mathcal{S} \to \mathbb{R}$ is the cost function, $\rho_0 : \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.
Let $\pi$ denote a stochastic policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, and let $\eta(\pi)$ denote its expected discounted cost:
$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots}\!\left[ \sum_{t=0}^{\infty} \gamma^t c(s_t) \right], \quad \text{where } s_0 \sim \rho_0(s_0),\ a_t \sim \pi(a_t \mid s_t),\ s_{t+1} \sim P(s_{t+1} \mid s_t, a_t).$$
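As an illustrative reading of this definition (not part of the original derivation), $\eta(\pi)$ can be estimated by Monte-Carlo averaging of discounted costs over sampled trajectories. The sampler and cost callable below are hypothetical stand-ins for whatever simulator and cost function are at hand.

```python
import numpy as np

def estimate_eta(sample_states, cost, gamma, num_trajectories=100):
    """Monte-Carlo sketch of eta(pi): average the discounted cost sum over
    trajectories generated by following pi in the MDP.
    `sample_states` is an assumed sampler returning the visited states
    s_0, s_1, ... of one (truncated) trajectory."""
    returns = []
    for _ in range(num_trajectories):
        states = sample_states()
        returns.append(sum(gamma**t * cost(s) for t, s in enumerate(states)))
    return float(np.mean(returns))
```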
We will use the following standard definitions of the state-action value function $Q_\pi$, the value function $V_\pi$, and the advantage function $A_\pi$:
$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\!\left[ \sum_{l=0}^{\infty} \gamma^l c(s_{t+l}) \right],$$
$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \dots}\!\left[ \sum_{l=0}^{\infty} \gamma^l c(s_{t+l}) \right],$$
$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s), \quad \text{where } a_t \sim \pi(a_t \mid s_t),\ s_{t+1} \sim P(s_{t+1} \mid s_t, a_t) \text{ for } t \geq 0.$$
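For intuition, a sample-based sketch of these quantities (an assumption-laden illustration, not the paper's estimator): from one rollout under $\pi$, the discounted cost-to-go at each timestep is a Monte-Carlo estimate of $Q_\pi(s_t, a_t)$, and subtracting an approximate value function $V$ (assumed given) yields advantage estimates.

```python
import numpy as np

def rollout_advantages(costs, states, V, gamma):
    """From a single rollout under pi: a backward pass computes the discounted
    cost-to-go q[t] ~ Q_pi(s_t, a_t); subtracting an approximate value
    function V gives advantage estimates adv[t] ~ A_pi(s_t, a_t)."""
    T = len(costs)
    q = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):      # q[t] = c(s_t) + gamma * q[t+1]
        running = costs[t] + gamma * running
        q[t] = running
    adv = q - np.array([V(s) for s in states])
    return q, adv
```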
The following useful identity expresses the expected cost of another policy $\tilde{\pi}$ in terms of the advantage over $\pi$, accumulated over timesteps (see Kakade & Langford (2002) for the proof, which we also reprise in Appendix A using the notation in this paper):
$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, s_1, a_1, \dots}\!\left[ \sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t) \right], \quad \text{where } s_0 \sim \rho_0(s_0),\ a_t \sim \tilde{\pi}(a_t \mid s_t),\ s_{t+1} \sim P(s_{t+1} \mid s_t, a_t). \qquad (1)$$
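A minimal sketch of how one might check Equation (1) numerically: sample trajectories under $\tilde{\pi}$, accumulate the discounted advantages of $\pi$ along them, and add $\eta(\pi)$. The trajectory sampler and advantage callable are hypothetical names assumed given.

```python
import numpy as np

def eta_new_via_identity(eta_pi, sample_traj_under_pi_tilde, A_pi, gamma, n=1000):
    """Sketch of Equation (1): eta(pi_tilde) equals eta(pi) plus the expected
    discounted sum of A_pi(s_t, a_t) over trajectories generated by pi_tilde.
    `sample_traj_under_pi_tilde` is an assumed sampler returning (s, a) pairs."""
    totals = []
    for _ in range(n):
        traj = sample_traj_under_pi_tilde()
        totals.append(sum(gamma**t * A_pi(s, a) for t, (s, a) in enumerate(traj)))
    return eta_pi + float(np.mean(totals))
```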
Let $\rho_\pi$ be the (unnormalized) discounted visitation frequencies
$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \dots,$$
where $s_0 \sim \rho_0$ and the actions are chosen according to $\pi$. Rearranging Equation (1) to sum over states instead of timesteps, we obtain
$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a). \qquad (2)$$
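In a small tabular MDP where the visitation frequencies and advantages can be enumerated, Equation (2) reduces to a weighted sum over state and action tables. A sketch, assuming hypothetical arrays `rho` of shape `[S]` and `pi_tilde`, `A_pi` of shape `[S, A]`:

```python
import numpy as np

def weighted_expected_advantage(rho, pi_tilde, A_pi):
    """sum_s rho(s) * sum_a pi_tilde(a|s) * A_pi(s, a).
    With rho = rho_{pi_tilde}, this is the right-hand side of Equation (2)
    minus eta(pi)."""
    expected_adv = (pi_tilde * A_pi).sum(axis=1)   # inner sum over actions
    return float((rho * expected_adv).sum())       # outer sum over states
```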
Equation (2) implies that any policy update $\pi \to \tilde{\pi}$ that has a non-positive expected advantage at every state $s$, i.e., $\sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \leq 0$, is guaranteed to reduce $\eta$, or leave it constant in the case that the expected advantage is zero everywhere. This implies the classic result that the update performed by exact policy iteration, which uses the deterministic policy $\tilde{\pi}(s) = \arg\min_a A_\pi(s, a)$, improves the policy if there is at least one state-action pair with a negative advantage value and nonzero state visitation probability (otherwise it has converged). However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states $s$ for which the expected advantage is positive (i.e., bad), that is, $\sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) > 0$. The complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes Equation (2) difficult to optimize directly. Instead, we introduce the following local approximation to $\eta$:
$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a). \qquad (3)$$
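Because the outer sum in Equation (3) is over $\rho_\pi$, $L_\pi$ can be estimated from data gathered with the current policy alone, with an importance ratio handling the inner expectation over $\tilde{\pi}$. The following is only a sketch of that idea with hypothetical array names, not the estimator developed later in the paper.

```python
import numpy as np

def surrogate_gap_estimate(adv, logp_old, logp_new):
    """Sample-based sketch of L_pi(pi_tilde) - eta(pi) from Equation (3).
    Each index i corresponds to a state-action pair with s_i drawn (up to
    normalization) from rho_pi and a_i ~ pi(.|s_i):
      adv[i]      ~ A_pi(s_i, a_i)
      logp_old[i] = log pi(a_i | s_i)
      logp_new[i] = log pi_tilde(a_i | s_i)
    The ratio pi_tilde/pi converts the action expectation under pi into one
    under pi_tilde."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(ratio * np.asarray(adv)))
```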
Note that $L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde{\pi}}$, ignoring changes in state visitation density due to changes in the policy. However, if we have a parameterized policy $\pi_\theta$, where $\pi_\theta(a \mid s)$ is a differentiable function of the parameter vector $\theta$, then $L_\pi$ matches $\eta$ to first order (see Kakade & Langford (2002)). That is, for any parameter value $\theta_0$,
$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}),$$
$$\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}. \qquad (4)$$
Equation (4) implies that a sufficiently small step $\pi_{\theta_0} \to \tilde{\pi}$ that improves $L_{\pi_{\theta_{\mathrm{old}}}}$ will also improve $\eta$, but does not give us any guidance on how big of a step to take. To address this issue, Kakade & Langford (2002) proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of $\eta$.
To define the conservative policy iteration update, let $\pi_{\mathrm{old}}$ denote the current policy, and assume that we can solve $\pi' = \arg\min_{\pi'} L_{\pi_{\mathrm{old}}}(\pi')$. The new policy $\pi_{\mathrm{new}}$ is taken to be the following mixture policy:
$$\pi_{\mathrm{new}}(a \mid s) = (1 - \alpha)\,\pi_{\mathrm{old}}(a \mid s) + \alpha\,\pi'(a \mid s). \qquad (5)$$
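A direct reading of the mixture in Equation (5), as a minimal sketch: to act with $\pi_{\mathrm{new}}$, flip an $\alpha$-weighted coin and then sample from either $\pi_{\mathrm{old}}$ or $\pi'$; the two per-state samplers are assumed given.

```python
def sample_mixture_action(rng, s, sample_old, sample_prime, alpha):
    """Draws a ~ pi_new(.|s) for the mixture in Equation (5):
    with probability alpha act according to pi', otherwise according to pi_old.
    `sample_old` and `sample_prime` are assumed samplers a ~ pi(.|s)."""
    if rng.random() < alpha:
        return sample_prime(s)
    return sample_old(s)
```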
Kakade and Langford proved the following result for this mixture update:
$$\eta(\pi_{\mathrm{new}}) \leq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) + \frac{2\epsilon\gamma}{(1 - \gamma(1 - \alpha))(1 - \gamma)}\,\alpha^2, \qquad (6)$$
where $\epsilon$ is the maximum advantage (positive or negative) of $\pi'$ relative to $\pi$:
$$\epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a \mid s)}\!\left[ A_\pi(s, a) \right] \right|. \qquad (7)$$
Since $\alpha, \gamma \in [0, 1]$, Equation (6) implies the following simpler bound, which we refer to in the next section:
$$\eta(\pi_{\mathrm{new}}) \leq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) + \frac{2\epsilon\gamma}{(1 - \gamma)^2}\,\alpha^2. \qquad (8)$$
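For a sense of scale, the additive penalty terms in Equations (6) and (8) can be computed directly from $\epsilon$, $\gamma$, and $\alpha$; a minimal sketch:

```python
def cpi_penalty(eps, gamma, alpha, simplified=False):
    """Additive penalty on L_{pi_old}(pi_new) in the bounds above.
    simplified=False: Equation (6), 2*eps*gamma*alpha^2 / ((1 - gamma*(1-alpha)) * (1 - gamma))
    simplified=True:  Equation (8), 2*eps*gamma*alpha^2 / (1 - gamma)^2
    The Equation (8) penalty is always at least as large, so that bound is weaker."""
    if simplified:
        return 2.0 * eps * gamma * alpha**2 / (1.0 - gamma)**2
    return 2.0 * eps * gamma * alpha**2 / ((1.0 - gamma * (1.0 - alpha)) * (1.0 - gamma))
```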
The bound in Equation (8) is only slightly weaker when $\alpha \ll 1$, which is typically the case in the conservative policy iteration method of Kakade & Langford (2002). Note, however, that so far this bound only applies to mixture policies generated by Equation (5). This policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.