Deep Reinforcement Learning

Name: Deep Reinforcement Learning
Brand: Springer Verlag, Singapore
Price: 223.98 PLN
Availability: InStock

ISBN-13: 9789811906374 / English / Paperback / 2022

Aske Plaat

Price: 223.98 PLN
(net: 213.31 PLN; VAT: 5%)

Lowest price in the last 30 days: 192.74 PLN

Order fulfillment time:
approx. 16-18 business days.

Free delivery!

Deep reinforcement learning has attracted considerable attention recently. Impressive results have been achieved in such diverse fields as autonomous driving, game playing, molecular recombination, and robotics. In all these fields, computer programs have taught themselves to understand problems that were previously considered to be very difficult. In the game of Go, the program AlphaGo has even learned to outmatch three of the world’s leading players.

Deep reinforcement learning takes its inspiration from the fields of biology and psychology. Biology has inspired the creation of artificial neural networks and deep learning, while psychology studies how animals and humans learn, and how subjects’ desired behavior can be reinforced with positive and negative stimuli. When we see how reinforcement learning teaches a simulated robot to walk, we are reminded of how children learn through playful exploration. Techniques inspired by biology and psychology work amazingly well in computers: animal behavior and the structure of the brain serve as new blueprints for science and engineering. In fact, computers truly seem to possess aspects of human behavior; as such, this field goes to the heart of the dream of artificial intelligence.

These research advances have not gone unnoticed by educators. Many universities have begun offering courses on deep reinforcement learning. The aim of this book is to provide an overview of the field, at the proper level of detail for a graduate course in artificial intelligence. It covers the complete field, from the basic algorithms of deep Q-learning to advanced topics such as multi-agent reinforcement learning and meta learning.

Categories:

Computer Science, Databases

BISAC categories:

Computers > Artificial Intelligence - General
Computers > Computer Science
Mathematics > Probability & Statistics

Publisher:

Springer Verlag, Singapore

Language:

English

ISBN-13:

9789811906374

Year of publication:

2022

Weight:

0.65 kg

Dimensions:

23.5 x 15.5 cm

Binding:

Paperback

Additional information:

Illustrated edition

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 What is Deep Reinforcement Learning? . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Three Machine Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Tabular Value-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.1 Sequential Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2 Tabular Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3 Classic Gym Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

2.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3 Approximating the Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.1 Large, High-Dimensional, Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.2 Deep Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.3 Atari 2600 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

3.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4 Policy-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.1 Continuous Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.2 Policy-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.3 Locomotion and Visuo-Motor Environments . . . . . . . . . . . . . . . . . . . . 111

4.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5 Model-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.1 Dynamics Models of High-Dimensional Problems . . . . . . . . . . . . . . . 122

5.2 Learning and Planning Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

5.3 High-dimensional Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

5.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

6 Two-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

6.1 Two-Agent Zero-Sum Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

6.2 Tabula Rasa Self-Play Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.3 Self-Play Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

6.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

7 Multi-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

7.1 Multi-Agent Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

7.2 Multi-Agent Reinforcement Learning Agents . . . . . . . . . . . . . . . . . . . . 202

7.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

7.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

8 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

8.1 Granularity of the Structure of Problems . . . . . . . . . . . . . . . . . . . . . . . 227

8.2 Divide and Conquer for Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

8.3 Hierarchical Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

8.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

9 Meta Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

9.1 Learning to Learn Related Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

9.2 Transfer Learning and Meta Learning Agents . . . . . . . . . . . . . . . . . . . 247

9.3 Meta-Learning Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

9.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

10 Further Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

10.1 Developments in Deep Reinforcement Learning . . . . . . . . . . . . . . . . . 271

10.2 Main Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

10.3 The Future of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

A Deep Reinforcement Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

A.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284

A.2 Agent Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

A.3 Deep Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286

B Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

B.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287

B.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294

B.3 Datasets and Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311

C Mathematical Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323

C.1 Sets and Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323

C.2 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326

C.3 Derivative of an Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334

C.4 Bellman Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337

Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381

Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 What is Deep Reinforcement Learning? . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1.1 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.3 Deep Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.1.5 Four Related Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.1.5.1 Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.1.5.2 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.1.5.3 Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

1.1.5.4 Biology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.2 Three Machine Learning Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.2.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

1.2.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

1.2.3 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.3 Overview of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.3.1 Prerequisite Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.3.2 Structure of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2 Tabular Value-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.1 Sequential Decision Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.2 Tabular Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2.1 Agent and Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2.2 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.2.2.1 State S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.2.2.2 Action A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.2.2.3 Transition T_a . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.2.2.4 Reward R_a . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.2.2.5 Discount Factor γ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.2.2.6 Policy Function π . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.2.3 MDP Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.2.3.1 Trace τ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

2.2.3.2 State Value V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.2.3.3 State-Action Value Q . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.2.3.4 Reinforcement Learning Objective . . . . . . . . . . . . . . 38

2.2.3.5 Bellman Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.2.4 MDP Solution Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.2.4.1 Hands On: Value Iteration in Gym . . . . . . . . . . . . . . . 41

2.2.4.2 Model-Free Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 44

2.2.4.3 Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

2.2.4.4 Off-Policy Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.2.4.5 Hands On: Q-learning on Taxi . . . . . . . . . . . . . . . . . . 52

2.3 Classic Gym Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

2.3.1 Mountain Car and Cartpole . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

2.3.2 Path Planning and Board Games . . . . . . . . . . . . . . . . . . . . . . . . 56

2.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3 Approximating the Value Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.1 Large, High-Dimensional, Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.1.1 Atari Arcade Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.1.2 Real-Time Strategy and Video Games . . . . . . . . . . . . . . . . . . . . 68

3.2 Deep Value-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.2.1 Generalization of Large Problem with Deep Learning . . . . . 69

3.2.1.1 Minimizing Supervised Target Loss . . . . . . . . . . . . . 69

3.2.1.2 Bootstrapping Q-Values . . . . . . . . . . . . . . . . . . . . . . . 70

3.2.1.3 Deep Reinforcement Learning Target-Error . . . . . 71

3.2.2 Three Problems: Coverage, Correlation, Convergence . . . . . 72

3.2.2.1 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.2.2.2 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.2.2.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.2.3 Stable Deep Value-Based Learning . . . . . . . . . . . . . . . . . . . . . . 74

3.2.3.1 Decorrelating States . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

3.2.3.2 Infrequent Updates of Target Weights . . . . . . . . . . . 76

3.2.3.3 Hands On: DQN and Breakout Gym Example . . . . . 76

3.2.4 Improving Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3.2.4.1 Overestimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

3.2.4.2 Distributional Methods . . . . . . . . . . . . . . . . . . . . . . . . 83

3.3 Atari 2600 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

3.3.1 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3.3.2 Benchmarking Atari . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

3.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

3.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

4 Policy-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4.1 Continuous Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.1.1 Continuous Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.1.2 Stochastic Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

4.1.3 Environments: Gym and MuJoCo . . . . . . . . . . . . . . . . . . . . . . . 92

4.1.3.1 Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4.1.3.2 Physics Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4.1.3.3 Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.2 Policy-Based Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.2.1 Policy-Based Algorithm: REINFORCE . . . . . . . . . . . . . . . . . . . 95

4.2.2 Bias-Variance trade-off in Policy-Based Methods . . . . . . . . . 98

4.2.3 Actor Critic Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.2.4 Baseline Subtraction with Advantage Function . . . . . . . . . . . 101

4.2.5 Trust Region Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.2.6 Entropy and Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

4.2.7 Deterministic Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.2.8 Hands On: PPO and DDPG MuJoCo Examples . . . . . . . . . . . . . 110

4.3 Locomotion and Visuo-Motor Environments . . . . . . . . . . . . . . . . . . . . 111

4.3.1 Locomotion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.3.2 Visuo-Motor Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

4.3.3 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

4.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5 Model-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.1 Dynamics Models of High-Dimensional Problems . . . . . . . . . . . . . . . 122

5.2 Learning and Planning Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

5.2.1 Learning the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.2.1.1 Modeling Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . 128

5.2.1.2 Latent Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

5.2.2 Planning with the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

5.2.2.1 Trajectory Rollouts and Model-Predictive Control 132

5.2.2.2 End-to-end Learning and Planning-by-Network . 133

5.3 High-dimensional Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

5.3.1 Overview of Model-Based Experiments . . . . . . . . . . . . . . . . . . 137

5.3.2 Small Navigation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

5.3.3 Robotic Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

5.3.4 Games Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

5.3.5 Hands On: PlaNet Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141

5.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

6 Two-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

6.1 Two-Agent Zero-Sum Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

6.1.1 The Difficulty of Playing Go . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

6.1.2 AlphaGo Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6.2 Tabula Rasa Self-Play Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.2.1 Move-Level Self Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

6.2.1.1 Minimax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

6.2.1.2 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . 164

6.2.2 Example-Level Self Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

6.2.2.1 Policy and Value Network . . . . . . . . . . . . . . . . . . . . . 172

6.2.2.2 Stability and Exploration . . . . . . . . . . . . . . . . . . . . . . 172

6.2.3 Tournament-Level Self Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174

6.2.3.1 Self-Play Curriculum Learning . . . . . . . . . . . . . . . . . 175

6.2.3.2 Supervised Curriculum Learning . . . . . . . . . . . . . . . 175

6.3 Self-Play Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

6.3.1 How to Design a World Class Go Program? . . . . . . . . . . . . . . 178

6.3.2 AlphaGo Zero Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

6.3.3 AlphaZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

6.3.4 Open Self-Play Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

6.3.5 Hands On: Hex in Polygames Example . . . . . . . . . . . . . . . . . . . . 184

6.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186

6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

7 Multi-Agent Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

7.1 Multi-Agent Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

7.1.1 Competitive Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

7.1.2 Cooperative Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197

7.1.3 Mixed Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

7.1.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200

7.1.4.1 Partial Observability . . . . . . . . . . . . . . . . . . . . . . . . . . 201

7.1.4.2 Nonstationary Environments . . . . . . . . . . . . . . . . . . 201

7.1.4.3 Large State Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

7.2 Multi-Agent Reinforcement Learning Agents . . . . . . . . . . . . . . . . . . . . 202

7.2.1 Competitive Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

7.2.1.1 Counterfactual Regret Minimization . . . . . . . . . . . . 203

7.2.1.2 Deep Counterfactual Regret Minimization . . . . . . . 204

7.2.2 Cooperative Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

7.2.2.1 Centralized Training/Decentralized Execution . . . 206

7.2.2.2 Opponent Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

7.2.2.3 Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

7.2.2.4 Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208

7.2.3 Mixed Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

7.2.3.1 Evolutionary Algorithms . . . . . . . . . . . . . . . . . . . . . . 209

7.2.3.2 Swarm Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

7.2.3.3 Population-Based Training . . . . . . . . . . . . . . . . . . . . . 212

7.2.3.4 Self-Play Leagues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

7.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

7.3.1 Competitive Behavior: Poker . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

7.3.2 Cooperative Behavior: Hide and Seek . . . . . . . . . . . . . . . . . . . . 216

7.3.3 Mixed Behavior: Capture the Flag and StarCraft . . . . . . . . . . 218

7.3.4 Hands On: Hide and Seek in the Gym Example . . . . . . . . . . . . 220

7.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

8 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 225

8.1 Granularity of the Structure of Problems . . . . . . . . . . . . . . . . . . . . . . . 227

8.1.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

8.1.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228

8.2 Divide and Conquer for Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

8.2.1 The Options Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229

8.2.2 Finding Subgoals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

8.2.3 Overview of Hierarchical Algorithms . . . . . . . . . . . . . . . . . . . . 231

8.2.3.1 Tabular . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

8.2.3.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232

8.3 Hierarchical Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

8.3.1 Four Rooms and Robot Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

8.3.2 Montezuma’s Revenge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236

8.3.3 Multi-Agent Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 238

8.3.4 Hands On: Hierarchical Actor Critic Example . . . . . . . . . . . . . . 238

8.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240

8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241

9 Meta Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

9.1 Learning to Learn Related Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

9.2 Transfer Learning and Meta Learning Agents . . . . . . . . . . . . . . . . . . . 247

9.2.1 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

9.2.1.1 Task Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

9.2.1.2 Pretraining and Finetuning . . . . . . . . . . . . . . . . . . . . 249

9.2.1.3 Hands-on: Pretraining Example . . . . . . . . . . . . . . . . . 249

9.2.1.4 Multi-task learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250

9.2.1.5 Domain Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

9.2.2 Meta Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253

9.2.2.1 Evaluating Few-Shot Learning Problems . . . . . . . . 253

9.2.2.2 Deep Meta Learning Algorithms . . . . . . . . . . . . . . . 254

9.2.2.3 Recurrent Meta Learning . . . . . . . . . . . . . . . . . . . . . . 256

9.2.2.4 Model-Agnostic Meta Learning . . . . . . . . . . . . . . . . . 257

9.2.2.5 Hyperparameter Optimization . . . . . . . . . . . . . . . . . 259

9.2.2.6 Meta Learning and Curriculum Learning . . . . . . . . 260

9.2.2.7 From Few-Shot to Zero-Shot Learning . . . . . . . . . . 260

9.3 Meta-Learning Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

9.3.1 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262

9.3.2 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

9.3.3 Meta Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

9.3.4 Meta World . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264

9.3.5 Alchemy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

9.3.6 Hands-on: Meta World Example . . . . . . . . . . . . . . . . . . . . . . . . . 266

9.4 Summary and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267

9.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

10 Further Developments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271

10.1 Developments in Deep Reinforcement Learning . . . . . . . . . . . . . . . . . 271

10.1.1 Tabular and Single-Agent Methods . . . . . . . . . . . . . . . . . . . . . . 272

10.1.2 Deep Learning Model-Free Methods . . . . . . . . . . . . . . . . . . . . . 272

10.1.3 Multi-Agent and Imperfect Information . . . . . . . . . . . . . . . . . . 272

10.1.4 A Framework for Learning by Doing . . . . . . . . . . . . . . . . . . . . 273

10.2 Main Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274

10.2.1 Latent Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

10.2.2 Self Play . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

10.2.3 Hierarchical Reinforcement Learning . . . . . . . . . . . . . . . . . . . . 275

10.2.4 Transfer Learning and Meta Learning . . . . . . . . . . . . . . . . . . . 276

10.2.5 Population-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

10.2.6 Exploration and Intrinsic Motivation . . . . . . . . . . . . . . . . . . . . 277

10.2.7 Explainable AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278

10.2.8 Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278

10.3 The Future of Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

A Deep Reinforcement Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283

A.1 Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284

A.2 Agent Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285

A.3 Deep Learning Suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286

B Deep Learning

B.1 Machine Learning

B.1.1 Training Set and Test Set

B.1.2 Curse of Dimensionality

B.1.3 Overfitting and the Bias-Variance Trade-Off

B.2 Deep Learning

B.2.1 Weights, Neurons

B.2.2 Backpropagation

B.2.3 End-to-end Feature Learning

B.2.4 Convolutional Networks

B.2.5 Recurrent Networks

B.2.6 More Network Architectures

B.2.7 Overfitting

B.3 Datasets and Software

B.3.1 Keras, TensorFlow, PyTorch

B.3.2 MNIST and ImageNet

B.3.3 GPU Implementations

B.3.4 Hands On: Classification Example

B.3.5 Exercises

C Mathematical Background

C.1 Sets and Functions

C.1.1 Sets

C.1.2 Functions

C.2 Probability Distributions

C.2.1 Discrete Probability Distributions

C.2.2 Continuous Probability Distributions

C.2.3 Conditional Distributions

C.2.4 Expectation

C.2.4.1 Expectation of a Random Variable

C.2.4.2 Expectation of a Function of a Random Variable

C.2.5 Information Theory

C.2.5.1 Information

C.2.5.2 Entropy

C.2.5.3 Cross-entropy

C.2.5.4 Kullback-Leibler Divergence

C.3 Derivative of an Expectation

C.4 Bellman Equations

References

Tables

Index

Aske Plaat is a Professor of Data Science at Leiden University and scientific director of the Leiden Institute of Advanced Computer Science (LIACS). He is co-founder of the Leiden Centre of Data Science (LCDS) and initiated SAILS, a multidisciplinary program on artificial intelligence. His research interests include reinforcement learning, combinatorial games and self-learning systems. He is the author of Learning to Play (published by Springer in 2020), which specifically covers reinforcement learning and games.
