Trust Region Policy Optimization
John Schulman JOSCHU@EECS.BERKELEY.EDU
Sergey Levine SLEVINE@EECS.BERKELEY.EDU
Philipp Moritz PCMORITZ@EECS.BERKELEY.EDU
Michael Jordan JORDAN@CS.BERKELEY.EDU
Pieter Abbeel PABBEEL@CS.BERKELEY.EDU
University of California, Berkeley, Department of Electrical Engineering and Computer Sciences
Abstract
In this article, we describe a method for optimizing control policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
1 Introduction
Most algorithms for policy optimization can be classified into three broad categories: policy iteration methods, which alternate between estimating the value function under the current policy and improving the policy (Bertsekas, 2005); policy gradient methods, which use an estimator of the gradient of the expected cost obtained from sample trajectories (Peters & Schaal, 2008a) (and which, as we later discuss, have a close connection to policy iteration); and derivative-free optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation (CMA), which treat the cost as a black box function to be optimized in terms of the policy parameters (Fu et al., 2005; Szita & Lőrincz, 2006).
General derivative-free stochastic optimization methods such as CEM and CMA are preferred on many problems, because they achieve good results while being simple to understand and implement. For example, while Tetris is a classic benchmark problem for approximate dynamic programming (ADP) methods, stochastic optimization methods are difficult to beat on this task (Gabillon et al., 2013). For continuous control problems, methods like CMA have been successful at learning control policies for challenging tasks like locomotion when provided with hand-engineered policy classes with low-dimensional parameterizations (Wampler & Popović, 2009). The inability of ADP and gradient-based methods to consistently beat gradient-free random search is unsatisfying, since gradient-based optimization algorithms enjoy much better sample complexity guarantees than gradient-free methods (Nemirovski, 2005). Continuous gradient-based optimization has been very successful at learning function approximators for supervised learning tasks with huge numbers of parameters, and extending their success to reinforcement learning would allow for efficient training of complex and powerful policies.
In this article, we first prove that minimizing a certain surrogate loss function guarantees policy improvement with non-trivial step sizes. Then we make a series of approximations to the theoretically-justified algorithm, yielding a practical algorithm, which we call trust region policy optimization (TRPO). We describe two variants of this algorithm: first, the single-path method, which can be applied in the model-free setting; second, the vine method, which requires the system to be restored to particular states, which is typically only possible in simulation. These algorithms are scalable and can optimize nonlinear policies with tens of thousands of parameters, which have previously posed a major challenge for model-free policy search (Deisenroth et al., 2013). In our experiments, we show that the same TRPO methods can learn complex policies for swimming, hopping, and walking, as well as playing Atari games directly from raw images.
2 Preliminaries
Consider an infinite-horizon discounted Markov decision process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, c, \rho_0, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is the transition probability distribution, $c : \mathcal{S} \to \mathbb{R}$ is the cost function, $\rho_0 : \mathcal{S} \to \mathbb{R}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor.
Let $\pi$ denote a stochastic policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, and let $\eta(\pi)$ denote its expected discounted cost:
$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \dots}\!\left[ \sum_{t=0}^{\infty} \gamma^t c(s_t) \right], \quad \text{where } s_0 \sim \rho_0(s_0),\ a_t \sim \pi(a_t \mid s_t),\ s_{t+1} \sim P(s_{t+1} \mid s_t, a_t).$$
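As an illustrative reading of this definition (not part of the original derivation), $\eta(\pi)$ can be estimated by Monte-Carlo averaging of discounted costs over sampled trajectories. The sampler and cost callable below are hypothetical stand-ins for whatever simulator and cost function are at hand.

```python
import numpy as np

def estimate_eta(sample_states, cost, gamma, num_trajectories=100):
    """Monte-Carlo sketch of eta(pi): average the discounted cost sum over
    trajectories generated by following pi in the MDP.
    `sample_states` is an assumed sampler returning the visited states
    s_0, s_1, ... of one (truncated) trajectory."""
    returns = []
    for _ in range(num_trajectories):
        states = sample_states()
        returns.append(sum(gamma**t * cost(s) for t, s in enumerate(states)))
    return float(np.mean(returns))
```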
We will use the following standard definitions of the state-action value function $Q_\pi$, the value function $V_\pi$, and the advantage function $A_\pi$:
$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \dots}\!\left[ \sum_{l=0}^{\infty} \gamma^l c(s_{t+l}) \right],$$
$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \dots}\!\left[ \sum_{l=0}^{\infty} \gamma^l c(s_{t+l}) \right],$$
$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s), \quad \text{where } a_t \sim \pi(a_t \mid s_t),\ s_{t+1} \sim P(s_{t+1} \mid s_t, a_t) \text{ for } t \geq 0.$$
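For intuition, a sample-based sketch of these quantities (an assumption-laden illustration, not the paper's estimator): from one rollout under $\pi$, the discounted cost-to-go at each timestep is a Monte-Carlo estimate of $Q_\pi(s_t, a_t)$, and subtracting an approximate value function $V$ (assumed given) yields advantage estimates.

```python
import numpy as np

def rollout_advantages(costs, states, V, gamma):
    """From a single rollout under pi: a backward pass computes the discounted
    cost-to-go q[t] ~ Q_pi(s_t, a_t); subtracting an approximate value
    function V gives advantage estimates adv[t] ~ A_pi(s_t, a_t)."""
    T = len(costs)
    q = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):      # q[t] = c(s_t) + gamma * q[t+1]
        running = costs[t] + gamma * running
        q[t] = running
    adv = q - np.array([V(s) for s in states])
    return q, adv
```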
The following useful identity expresses the expected cost of another policy $\tilde{\pi}$ in terms of the advantage over $\pi$, accumulated over timesteps (see Kakade & Langford (2002) for the proof, which we also reprise in Appendix A using the notation in this paper):
$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, s_1, a_1, \dots}\!\left[ \sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t) \right], \quad \text{where } s_0 \sim \rho_0(s_0),\ a_t \sim \tilde{\pi}(a_t \mid s_t),\ s_{t+1} \sim P(s_{t+1} \mid s_t, a_t). \qquad (1)$$
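A minimal sketch of how one might check Equation (1) numerically: sample trajectories under $\tilde{\pi}$, accumulate the discounted advantages of $\pi$ along them, and add $\eta(\pi)$. The trajectory sampler and advantage callable are hypothetical names assumed given.

```python
import numpy as np

def eta_new_via_identity(eta_pi, sample_traj_under_pi_tilde, A_pi, gamma, n=1000):
    """Sketch of Equation (1): eta(pi_tilde) equals eta(pi) plus the expected
    discounted sum of A_pi(s_t, a_t) over trajectories generated by pi_tilde.
    `sample_traj_under_pi_tilde` is an assumed sampler returning (s, a) pairs."""
    totals = []
    for _ in range(n):
        traj = sample_traj_under_pi_tilde()
        totals.append(sum(gamma**t * A_pi(s, a) for t, (s, a) in enumerate(traj)))
    return eta_pi + float(np.mean(totals))
```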
Let $\rho_\pi$ be the (unnormalized) discounted visitation frequencies
$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \dots,$$
where $s_0 \sim \rho_0$ and the actions are chosen according to $\pi$. Rearranging Equation (1) to sum over states instead of timesteps, we obtain
$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a). \qquad (2)$$
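In a small tabular MDP where the visitation frequencies and advantages can be enumerated, Equation (2) reduces to a weighted sum over state and action tables. A sketch, assuming hypothetical arrays `rho` of shape `[S]` and `pi_tilde`, `A_pi` of shape `[S, A]`:

```python
import numpy as np

def weighted_expected_advantage(rho, pi_tilde, A_pi):
    """sum_s rho(s) * sum_a pi_tilde(a|s) * A_pi(s, a).
    With rho = rho_{pi_tilde}, this is the right-hand side of Equation (2)
    minus eta(pi)."""
    expected_adv = (pi_tilde * A_pi).sum(axis=1)   # inner sum over actions
    return float((rho * expected_adv).sum())       # outer sum over states
```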
Equation (2) implies that any policy update $\pi \to \tilde{\pi}$ that has a non-positive expected advantage at every state $s$, i.e., $\sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \leq 0$, is guaranteed to reduce $\eta$, or leave it constant in the case that the expected advantage is zero everywhere. This implies the classic result that the update performed by exact policy iteration, which uses the deterministic policy $\tilde{\pi}(s) = \arg\min_a A_\pi(s, a)$, improves the policy if there is at least one state-action pair with a negative advantage value and nonzero state visitation probability (otherwise it has converged). However, in the approximate setting, it will typically be unavoidable, due to estimation and approximation error, that there will be some states $s$ for which the expected advantage is positive (i.e., bad), that is, $\sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) > 0$. The complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes Equation (2) difficult to optimize directly. Instead, we introduce the following local approximation to $\eta$:
$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s) A_\pi(s, a). \qquad (3)$$
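Because the outer sum in Equation (3) is over $\rho_\pi$, $L_\pi$ can be estimated from data gathered with the current policy alone, with an importance ratio handling the inner expectation over $\tilde{\pi}$. The following is only a sketch of that idea with hypothetical array names, not the estimator developed later in the paper.

```python
import numpy as np

def surrogate_gap_estimate(adv, logp_old, logp_new):
    """Sample-based sketch of L_pi(pi_tilde) - eta(pi) from Equation (3).
    Each index i corresponds to a state-action pair with s_i drawn (up to
    normalization) from rho_pi and a_i ~ pi(.|s_i):
      adv[i]      ~ A_pi(s_i, a_i)
      logp_old[i] = log pi(a_i | s_i)
      logp_new[i] = log pi_tilde(a_i | s_i)
    The ratio pi_tilde/pi converts the action expectation under pi into one
    under pi_tilde."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    return float(np.mean(ratio * np.asarray(adv)))
```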
Note that $L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde{\pi}}$, ignoring changes in state visitation density due to changes in the policy. However, if we have a parameterized policy $\pi_\theta$, where $\pi_\theta(a \mid s)$ is a differentiable function of the parameter vector $\theta$, then $L_\pi$ matches $\eta$ to first order (see Kakade & Langford (2002)). That is, for any parameter value $\theta_0$,
$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}),$$
$$\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}. \qquad (4)$$
Equation (4) implies that a sufficiently small step $\pi_{\theta_0} \to \tilde{\pi}$ that improves $L_{\pi_{\theta_{\mathrm{old}}}}$ will also improve $\eta$, but does not give us any guidance on how big of a step to take. To address this issue, Kakade & Langford (2002) proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of $\eta$.
To define the conservative policy iteration update, let $\pi_{\mathrm{old}}$ denote the current policy, and assume that we can solve $\pi' = \arg\min_{\pi'} L_{\pi_{\mathrm{old}}}(\pi')$. The new policy $\pi_{\mathrm{new}}$ is taken to be the following mixture policy:
$$\pi_{\mathrm{new}}(a \mid s) = (1 - \alpha)\,\pi_{\mathrm{old}}(a \mid s) + \alpha\,\pi'(a \mid s). \qquad (5)$$
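A direct reading of the mixture in Equation (5), as a minimal sketch: to act with $\pi_{\mathrm{new}}$, flip an $\alpha$-weighted coin and then sample from either $\pi_{\mathrm{old}}$ or $\pi'$; the two per-state samplers are assumed given.

```python
def sample_mixture_action(rng, s, sample_old, sample_prime, alpha):
    """Draws a ~ pi_new(.|s) for the mixture in Equation (5):
    with probability alpha act according to pi', otherwise according to pi_old.
    `sample_old` and `sample_prime` are assumed samplers a ~ pi(.|s)."""
    if rng.random() < alpha:
        return sample_prime(s)
    return sample_old(s)
```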
Kakade and Langford proved the following result for this mixture update:
$$\eta(\pi_{\mathrm{new}}) \leq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) + \frac{2\epsilon\gamma}{(1 - \gamma(1 - \alpha))(1 - \gamma)}\,\alpha^2, \qquad (6)$$
where $\epsilon$ is the maximum advantage (positive or negative) of $\pi'$ relative to $\pi$:
$$\epsilon = \max_s \left| \mathbb{E}_{a \sim \pi'(a \mid s)}\!\left[ A_\pi(s, a) \right] \right|. \qquad (7)$$
Since $\alpha, \gamma \in [0, 1]$, Equation (6) implies the following simpler bound, which we refer to in the next section:
$$\eta(\pi_{\mathrm{new}}) \leq L_{\pi_{\mathrm{old}}}(\pi_{\mathrm{new}}) + \frac{2\epsilon\gamma}{(1 - \gamma)^2}\,\alpha^2. \qquad (8)$$
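For a sense of scale, the additive penalty terms in Equations (6) and (8) can be computed directly from $\epsilon$, $\gamma$, and $\alpha$; a minimal sketch:

```python
def cpi_penalty(eps, gamma, alpha, simplified=False):
    """Additive penalty on L_{pi_old}(pi_new) in the bounds above.
    simplified=False: Equation (6), 2*eps*gamma*alpha^2 / ((1 - gamma*(1-alpha)) * (1 - gamma))
    simplified=True:  Equation (8), 2*eps*gamma*alpha^2 / (1 - gamma)^2
    The Equation (8) penalty is always at least as large, so that bound is weaker."""
    if simplified:
        return 2.0 * eps * gamma * alpha**2 / (1.0 - gamma)**2
    return 2.0 * eps * gamma * alpha**2 / ((1.0 - gamma * (1.0 - alpha)) * (1.0 - gamma))
```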
The bound in Equation (8) is only slightly weaker when $\alpha \ll 1$, which is typically the case in the conservative policy iteration method of Kakade & Langford (2002). Note, however, that so far this bound only applies to mixture policies generated by Equation (5). This policy class is unwieldy and restrictive in practice, and it is desirable for a practical policy update scheme to be applicable to all general stochastic policy classes.