3.2.4 The choice of the function space . . . . . . . . . . . . . . . . . . . . . 42
4 Control 45
4.1 A catalog of learning problems . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Closed-loop interactive learning . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Online learning in bandits . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.2 Active learning in bandits . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.3 Active learning in Markov Decision Processes . . . . . . . . . . . . . 50
4.2.4 Online learning in Markov Decision Processes . . . . . . . . . . . . . 51
4.3 Direct methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Q-learning in finite MDPs . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.2 Q-learning with function approximation . . . . . . . . . . . . . . . . 59
4.4 Actor-critic methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.1 Implementing a critic . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4.2 Implementing an actor . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 For further exploration 72
5.1 Further reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
A The theory of discounted Markovian decision processes 74
A.1 Contractions and Banach’s fixed-point theorem . . . . . . . . . . . . . . . . 74
A.2 Application to MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Abstract
Reinforcement learning is a learning paradigm concerned with learning to control a
system so as to maximize a numerical performance measure that expresses a long-term
objective. What distinguishes reinforcement learning from supervised learning is that
only partial feedback is given to the learner about the learner’s predictions. Further,
the predictions may have long-term effects through influencing the future state of the
controlled system. Thus, time plays a special role. The goal in reinforcement learning
is to develop efficient learning algorithms, as well as to understand the algorithms’
merits and limitations. Reinforcement learning is of great interest because of the large
number of practical applications that it can be used to address, ranging from problems
in artificial intelligence to operations research and control engineering. In this book, we
focus on those algorithms of reinforcement learning that build on the powerful theory of
dynamic programming. We give a fairly comprehensive catalog of learning problems,