Algorithms for Reinforcement Learning
Draft of the lecture published in the
Synthesis Lectures on Artificial Intelligence and Machine Learning
series
by
Morgan & Claypool Publishers
Csaba Szepesvári
June 9, 2009
Last update: July 8, 2017
Contents
1 Overview 3
2 Markov decision processes 7
2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Value functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Dynamic programming algorithms for solving MDPs . . . . . . . . . . . . . . 16
3 Value prediction problems 17
3.1 Temporal difference learning in finite state spaces . . . . . . . . . . . . . . . 18
3.1.1 Tabular TD(0) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.2 Every-visit Monte-Carlo . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.3 TD(λ): Unifying Monte-Carlo and TD(0) . . . . . . . . . . . . . . . . 23
3.2 Algorithms for large state spaces . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 TD(λ) with function approximation . . . . . . . . . . . . . . . . . . . 29
3.2.2 Gradient temporal difference learning . . . . . . . . . . . . . . . . . . 33
3.2.3 Least-squares methods . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4 The choice of the function space . . . . . . . . . . . . . . . . . . . . . 42
4 Control 45
4.1 A catalog of learning problems . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Closed-loop interactive learning . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Online learning in bandits . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 Active learning in bandits . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.3 Active learning in Markov Decision Processes . . . . . . . . . . . . . 50
4.2.4 Online learning in Markov Decision Processes . . . . . . . . . . . . . 51
4.3 Direct methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Q-learning in finite MDPs . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.2 Q-learning with function approximation . . . . . . . . . . . . . . . . 59
4.4 Actor-critic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.1 Implementing a critic . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.2 Implementing an actor . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 For further exploration 72
5.1 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A The theory of discounted Markovian decision processes 74
A.1 Contractions and Banach’s fixed-point theorem . . . . . . . . . . . . . . . . 74
A.2 Application to MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Abstract
Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long-term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research and control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state-of-the-art algorithms, and discuss their theoretical properties and applications.