This book offers a comprehensive introduction to the central ideas that underpin deep learning. It is intended both for newcomers to machine learning and for those already experienced in the field. Covering key concepts relating to contemporary architectures and techniques, this essential book equips readers with a robust foundation for potential future specialization. The field of deep learning is undergoing rapid evolution, and therefore this book focusses on ideas that are likely to endure the test of time.
The book is organized into numerous bite-sized chapters, each exploring a distinct topic, and the narrative follows a linear progression, with each chapter building upon content from its predecessors. This structure is well-suited to teaching a two-semester undergraduate or postgraduate machine learning course, while remaining equally relevant to those engaged in active research or in self-study.
A full understanding of machine learning requires some mathematical background and so the book includes a self-contained introduction to probability theory. However, the focus of the book is on conveying a clear understanding of ideas, with emphasis on the real-world practical value of techniques rather than on abstract theory. Complex concepts are therefore presented from multiple complementary perspectives including textual descriptions, diagrams, mathematical formulae, and pseudo-code.
Chris Bishop is a Technical Fellow at Microsoft and is the Director of Microsoft Research AI4Science. He is a Fellow of Darwin College, Cambridge, a Fellow of the Royal Academy of Engineering, and a Fellow of the Royal Society.
Hugh Bishop is an Applied Scientist at Wayve, a deep learning autonomous driving company in London, where he designs and trains deep neural networks. He completed his MPhil in Machine Learning and Machine Intelligence at Cambridge University.
“Chris Bishop wrote a terrific textbook on neural networks in 1995 and has a deep knowledge of the field and its core ideas. His many years of experience in explaining neural networks have made him extremely skillful at presenting complicated ideas in the simplest possible way and it is a delight to see these skills applied to the revolutionary new developments in the field.” -- Geoffrey Hinton
“With the recent explosion of deep learning and AI as a research topic, and the quickly growing importance of AI applications, a modern textbook on the topic was badly needed. The ‘New Bishop’ masterfully fills the gap, covering algorithms for supervised and unsupervised learning, modern deep learning architecture families, as well as how to apply all of this to various application areas.” -- Yann LeCun
“This excellent and very educational book will bring the reader up to date with the main concepts and advances in deep learning with a solid anchoring in probability. These concepts are powering current industrial AI systems and are likely to form the basis of further advances towards artificial general intelligence.” -- Yoshua Bengio
Preface 3
1 The Deep Learning Revolution 19
1.1 The Impact of Deep Learning . . . . . . . . . . . . . . . . . . . . 20
1.1.1 Medical diagnosis . . . . . . . . . . . . . . . . . . . . . . 20
1.1.2 Protein structure . . . . . . . . . . . . . . . . . . . . . . . 21
1.1.3 Image synthesis . . . . . . . . . . . . . . . . . . . . . . . . 22
1.1.4 Large language models . . . . . . . . . . . . . . . . . . . . 23
1.2 A Tutorial Example . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2.2 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.2.3 Error function . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.2.4 Model complexity . . . . . . . . . . . . . . . . . . . . . . 27
1.2.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 30
1.2.6 Model selection . . . . . . . . . . . . . . . . . . . . . . . . 32
1.3 A Brief History of Machine Learning . . . . . . . . . . . . . . . . 34
1.3.1 Single-layer networks . . . . . . . . . . . . . . . . . . . . 35
1.3.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . 36
1.3.3 Deep networks . . . . . . . . . . . . . . . . . . . . . . . . 38
2 Probabilities 41
2.1 The Rules of Probability . . . . . . . . . . . . . . . . . . . . . . . 43
2.1.1 A medical screening example . . . . . . . . . . . . . . . . 43
2.1.2 The sum and product rules . . . . . . . . . . . . . . . . . . 44
2.1.3 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.4 Medical screening revisited . . . . . . . . . . . . . . . . . 48
2.1.5 Prior and posterior probabilities . . . . . . . . . . . . . . . 49
2.1.6 Independent variables . . . . . . . . . . . . . . . . . . . . 49
2.2 Probability Densities . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.2.1 Example distributions . . . . . . . . . . . . . . . . . . . . 51
2.2.2 Expectations and covariances . . . . . . . . . . . . . . . . 52
2.3 The Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . 54
2.3.1 Mean and variance . . . . . . . . . . . . . . . . . . . . . . 55
2.3.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 55
2.3.3 Bias of maximum likelihood . . . . . . . . . . . . . . . . . 57
2.3.4 Linear regression . . . . . . . . . . . . . . . . . . . . . . . 58
2.4 Transformation of Densities . . . . . . . . . . . . . . . . . . . . . 60
2.4.1 Multivariate distributions . . . . . . . . . . . . . . . . . . . 62
2.5 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.5.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.5.2 Physics perspective . . . . . . . . . . . . . . . . . . . . . . 65
2.5.3 Differential entropy . . . . . . . . . . . . . . . . . . . . . . 67
2.5.4 Maximum entropy . . . . . . . . . . . . . . . . . . . . . . 68
2.5.5 Kullback–Leibler divergence . . . . . . . . . . . . . . . . . 69
2.5.6 Conditional entropy . . . . . . . . . . . . . . . . . . . . . 71
2.5.7 Mutual information . . . . . . . . . . . . . . . . . . . . . . 72
2.6 Bayesian Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 72
2.6.1 Model parameters . . . . . . . . . . . . . . . . . . . . . . . 73
2.6.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 74
2.6.3 Bayesian machine learning . . . . . . . . . . . . . . . . . . 75
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3 Standard Distributions 83
3.1 Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.1.1 Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . 84
3.1.2 Binomial distribution . . . . . . . . . . . . . . . . . . . . . 85
3.1.3 Multinomial distribution . . . . . . . . . . . . . . . . . . . 86
3.2 The Multivariate Gaussian . . . . . . . . . . . . . . . . . . . . . . 88
3.2.1 Geometry of the Gaussian . . . . . . . . . . . . . . . . . . 89
3.2.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.2.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.2.4 Conditional distribution . . . . . . . . . . . . . . . . . . . 94
3.2.5 Marginal distribution . . . . . . . . . . . . . . . . . . . . . 97
3.2.6 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 99
3.2.7 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 102
3.2.8 Sequential estimation . . . . . . . . . . . . . . . . . . . . . 103
3.2.9 Mixtures of Gaussians . . . . . . . . . . . . . . . . . . . . 104
3.3 Periodic Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.3.1 Von Mises distribution . . . . . . . . . . . . . . . . . . . . 107
3.4 The Exponential Family . . . . . . . . . . . . . . . . . . . . . . . 112
3.4.1 Sufficient statistics . . . . . . . . . . . . . . . . . . . . . . 115
3.5 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . 116
3.5.1 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.5.2 Kernel densities . . . . . . . . . . . . . . . . . . . . . . . . 118
3.5.3 Nearest-neighbours . . . . . . . . . . . . . . . . . . . . . . 121
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4 Single-layer Networks: Regression 129
4.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.1.1 Basis functions . . . . . . . . . . . . . . . . . . . . . . . . 130
4.1.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 132
4.1.3 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 133
4.1.4 Geometry of least squares . . . . . . . . . . . . . . . . . . 135
4.1.5 Sequential learning . . . . . . . . . . . . . . . . . . . . . . 135
4.1.6 Regularized least squares . . . . . . . . . . . . . . . . . . . 136
4.1.7 Multiple outputs . . . . . . . . . . . . . . . . . . . . . . . 137
4.2 Decision theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.3 The Bias–Variance Trade-off . . . . . . . . . . . . . . . . . . . . . 141
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5 Single-layer Networks: Classification 149
5.1 Discriminant Functions . . . . . . . . . . . . . . . . . . . . . . . . 150
5.1.1 Two classes . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.1.2 Multiple classes . . . . . . . . . . . . . . . . . . . . . . . . 152
5.1.3 1-of-K coding . . . . . . . . . . . . . . . . . . . . . . . . 153
5.1.4 Least squares for classification . . . . . . . . . . . . . . . . 154
5.2 Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.2.1 Misclassification rate . . . . . . . . . . . . . . . . . . . . . 157
5.2.2 Expected loss . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.2.3 The reject option . . . . . . . . . . . . . . . . . . . . . . . 160
5.2.4 Inference and decision . . . . . . . . . . . . . . . . . . . . 161
5.2.5 Classifier accuracy . . . . . . . . . . . . . . . . . . . . . . 165
5.2.6 ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.3 Generative Classifiers . . . . . . . . . . . . . . . . . . . . . . . . 168
5.3.1 Continuous inputs . . . . . . . . . . . . . . . . . . . . . . 170
5.3.2 Maximum likelihood solution . . . . . . . . . . . . . . . . 171
5.3.3 Discrete features . . . . . . . . . . . . . . . . . . . . . . . 174
5.3.4 Exponential family . . . . . . . . . . . . . . . . . . . . . . 174
5.4 Discriminative Classifiers . . . . . . . . . . . . . . . . . . . . . . 175
5.4.1 Activation functions . . . . . . . . . . . . . . . . . . . . . 176
5.4.2 Fixed basis functions . . . . . . . . . . . . . . . . . . . . . 176
5.4.3 Logistic regression . . . . . . . . . . . . . . . . . . . . . . 177
5.4.4 Multi-class logistic regression . . . . . . . . . . . . . . . . 179
5.4.5 Probit regression . . . . . . . . . . . . . . . . . . . . . . . 181
5.4.6 Canonical link functions . . . . . . . . . . . . . . . . . . . 182
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6 Deep Neural Networks 189
6.1 Limitations of Fixed Basis Functions . . . . . . . . . . . . . . . . 190
6.1.1 The curse of dimensionality . . . . . . . . . . . . . . . . . 190
6.1.2 High-dimensional spaces . . . . . . . . . . . . . . . . . . . 193
6.1.3 Data manifolds . . . . . . . . . . . . . . . . . . . . . . . . 194
6.1.4 Data-dependent basis functions . . . . . . . . . . . . . . . 196
6.2 Multilayer Networks . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.2.1 Parameter matrices . . . . . . . . . . . . . . . . . . . . . . 199
6.2.2 Universal approximation . . . . . . . . . . . . . . . . . . . 199
6.2.3 Hidden unit activation functions . . . . . . . . . . . . . . . 200
6.2.4 Weight-space symmetries . . . . . . . . . . . . . . . . . . 203
6.3 Deep Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.3.1 Hierarchical representations . . . . . . . . . . . . . . . . . 205
6.3.2 Distributed representations . . . . . . . . . . . . . . . . . . 205
6.3.3 Representation learning . . . . . . . . . . . . . . . . . . . 206
6.3.4 Transfer learning . . . . . . . . . . . . . . . . . . . . . . . 207
6.3.5 Contrastive learning . . . . . . . . . . . . . . . . . . . . . 209
6.3.6 General network architectures . . . . . . . . . . . . . . . . 211
6.3.7 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4 Error Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4.2 Binary classification . . . . . . . . . . . . . . . . . . . . . 214
6.4.3 Multiclass classification . . . . . . . . . . . . . . . . . . . 215
6.5 Mixture Density Networks . . . . . . . . . . . . . . . . . . . . . . 216
6.5.1 Robot kinematics example . . . . . . . . . . . . . . . . . . 216
6.5.2 Conditional mixture distributions . . . . . . . . . . . . . . 217
6.5.3 Gradient optimization . . . . . . . . . . . . . . . . . . . . 219
6.5.4 Predictive distribution . . . . . . . . . . . . . . . . . . . . 220
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7 Gradient Descent 227
7.1 Error Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
7.1.1 Local quadratic approximation . . . . . . . . . . . . . . . . 229
7.2 Gradient Descent Optimization . . . . . . . . . . . . . . . . . . . 231
7.2.1 Use of gradient information . . . . . . . . . . . . . . . . . 232
7.2.2 Batch gradient descent . . . . . . . . . . . . . . . . . . . . 232
7.2.3 Stochastic gradient descent . . . . . . . . . . . . . . . . . . 232
7.2.4 Mini-batches . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.2.5 Parameter initialization . . . . . . . . . . . . . . . . . . . . 234
7.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.3.1 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.3.2 Learning rate schedule . . . . . . . . . . . . . . . . . . . . 240
7.3.3 RMSProp and Adam . . . . . . . . . . . . . . . . . . . . . 241
7.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.4.1 Data normalization . . . . . . . . . . . . . . . . . . . . . . 244
7.4.2 Batch normalization . . . . . . . . . . . . . . . . . . . . . 245
7.4.3 Layer normalization . . . . . . . . . . . . . . . . . . . . . 247
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
8 Backpropagation 251
8.1 Evaluation of Gradients . . . . . . . . . . . . . . . . . . . . . . . 252
8.1.1 Single-layer networks . . . . . . . . . . . . . . . . . . . . 252
8.1.2 General feed-forward networks . . . . . . . . . . . . . . . 253
8.1.3 A simple example . . . . . . . . . . . . . . . . . . . . . . 256
8.1.4 Numerical differentiation . . . . . . . . . . . . . . . . . . . 257
8.1.5 The Jacobian matrix . . . . . . . . . . . . . . . . . . . . . 258
8.1.6 The Hessian matrix . . . . . . . . . . . . . . . . . . . . . . 260
8.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . 262
8.2.1 Forward-mode automatic differentiation . . . . . . . . . . . 264
8.2.2 Reverse-mode automatic differentiation . . . . . . . . . . . 267
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
9 Regularization 271
9.1 Inductive Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
9.1.1 Inverse problems . . . . . . . . . . . . . . . . . . . . . . . 272
9.1.2 No free lunch theorem . . . . . . . . . . . . . . . . . . . . 273
9.1.3 Symmetry and invariance . . . . . . . . . . . . . . . . . . . 274
9.1.4 Equivariance . . . . . . . . . . . . . . . . . . . . . . . . . 277
9.2 Weight Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
9.2.1 Consistent regularizers . . . . . . . . . . . . . . . . . . . . 280
9.2.2 Generalized weight decay . . . . . . . . . . . . . . . . . . 282
9.3 Learning Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
9.3.1 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . 284
9.3.2 Double descent . . . . . . . . . . . . . . . . . . . . . . . . 286
9.4 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . 288
9.4.1 Soft weight sharing . . . . . . . . . . . . . . . . . . . . . . 289
9.5 Residual Connections . . . . . . . . . . . . . . . . . . . . . . . . 292
9.6 Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
9.6.1 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
10 Convolutional Networks 305
10.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
10.1.1 Image data . . . . . . . . . . . . . . . . . . . . . . . . . . 307
10.2 Convolutional Filters . . . . . . . . . . . . . . . . . . . . . . . . . 308
10.2.1 Feature detectors . . . . . . . . . . . . . . . . . . . . . . . 308
10.2.2 Translation equivariance . . . . . . . . . . . . . . . . . . . 309
10.2.3 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
10.2.4 Strided convolutions . . . . . . . . . . . . . . . . . . . . . 312
10.2.5 Multi-dimensional convolutions . . . . . . . . . . . . . . . 313
10.2.6 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
10.2.7 Multilayer convolutions . . . . . . . . . . . . . . . . . . . 316
10.2.8 Example network architectures . . . . . . . . . . . . . . . . 317
10.3 Visualizing Trained CNNs . . . . . . . . . . . . . . . . . . . . . . 320
10.3.1 Visual cortex . . . . . . . . . . . . . . . . . . . . . . . . . 320
10.3.2 Visualizing trained filters . . . . . . . . . . . . . . . . . . . 321
10.3.3 Saliency maps . . . . . . . . . . . . . . . . . . . . . . . . 323
10.3.4 Adversarial attacks . . . . . . . . . . . . . . . . . . . . . . 324
10.3.5 Synthetic images . . . . . . . . . . . . . . . . . . . . . . . 326
10.4 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
10.4.1 Bounding boxes . . . . . . . . . . . . . . . . . . . . . . . 327
10.4.2 Intersection-over-union . . . . . . . . . . . . . . . . . . . . 328
10.4.3 Sliding windows . . . . . . . . . . . . . . . . . . . . . . . 329
10.4.4 Detection across scales . . . . . . . . . . . . . . . . . . . . 331
10.4.5 Non-max suppression . . . . . . . . . . . . . . . . . . . . . 332
10.4.6 Fast region CNNs . . . . . . . . . . . . . . . . . . . . . . . 332
10.5 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 333
10.5.1 Convolutional segmentation . . . . . . . . . . . . . . . . . 333
10.5.2 Up-sampling . . . . . . . . . . . . . . . . . . . . . . . . . 334
10.5.3 Fully convolutional networks . . . . . . . . . . . . . . . . . 336
10.5.4 The U-net architecture . . . . . . . . . . . . . . . . . . . . 337
10.6 Style Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
11 Structured Distributions 343
11.1 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
11.1.1 Directed graphs . . . . . . . . . . . . . . . . . . . . . . . . 344
11.1.2 Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 345
11.1.3 Discrete variables . . . . . . . . . . . . . . . . . . . . . . . 347
11.1.4 Gaussian variables . . . . . . . . . . . . . . . . . . . . . . 350
11.1.5 Binary classifier . . . . . . . . . . . . . . . . . . . . . . . 352
11.1.6 Parameters and observations . . . . . . . . . . . . . . . . . 352
11.1.7 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 354
11.2 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . 355
11.2.1 Three example graphs . . . . . . . . . . . . . . . . . . . . 356
11.2.2 Explaining away . . . . . . . . . . . . . . . . . . . . . . . 359
11.2.3 D-separation . . . . . . . . . . . . . . . . . . . . . . . . . 361
11.2.4 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . 362
11.2.5 Generative models . . . . . . . . . . . . . . . . . . . . . . 364
11.2.6 Markov blanket . . . . . . . . . . . . . . . . . . . . . . . . 365
11.2.7 Graphs as filters . . . . . . . . . . . . . . . . . . . . . . . . 366
11.3 Sequence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
11.3.1 Hidden variables . . . . . . . . . . . . . . . . . . . . . . . 370
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
12 Transformers 375
12.1 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
12.1.1 Transformer processing . . . . . . . . . . . . . . . . . . . . 378
12.1.2 Attention coefficients . . . . . . . . . . . . . . . . . . . . . 379
12.1.3 Self-attention . . . . . . . . . . . . . . . . . . . . . . . . . 380
12.1.4 Network parameters . . . . . . . . . . . . . . . . . . . . . 381
12.1.5 Scaled self-attention . . . . . . . . . . . . . . . . . . . . . 384
12.1.6 Multi-head attention . . . . . . . . . . . . . . . . . . . . . 384
12.1.7 Transformer layers . . . . . . . . . . . . . . . . . . . . . . 386
12.1.8 Computational complexity . . . . . . . . . . . . . . . . . . 388
12.1.9 Positional encoding . . . . . . . . . . . . . . . . . . . . . . 389
12.2 Natural Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
12.2.1 Word embedding . . . . . . . . . . . . . . . . . . . . . . . 393
12.2.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . 395
12.2.3 Bag of words . . . . . . . . . . . . . . . . . . . . . . . . . 396
12.2.4 Autoregressive models . . . . . . . . . . . . . . . . . . . . 397
12.2.5 Recurrent neural networks . . . . . . . . . . . . . . . . . . 398
12.2.6 Backpropagation through time . . . . . . . . . . . . . . . . 399
12.3 Transformer Language Models . . . . . . . . . . . . . . . . . . . . 400
12.3.1 Decoder transformers . . . . . . . . . . . . . . . . . . . . . 401
12.3.2 Sampling strategies . . . . . . . . . . . . . . . . . . . . . . 404
12.3.3 Encoder transformers . . . . . . . . . . . . . . . . . . . . . 406
12.3.4 Sequence-to-sequence transformers . . . . . . . . . . . . . 408
12.3.5 Large language models . . . . . . . . . . . . . . . . . . . . 408
12.4 Multimodal Transformers . . . . . . . . . . . . . . . . . . . . . . 412
12.4.1 Vision transformers . . . . . . . . . . . . . . . . . . . . . . 413
12.4.2 Generative image transformers . . . . . . . . . . . . . . . . 414
12.4.3 Audio data . . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.4.4 Text-to-speech . . . . . . . . . . . . . . . . . . . . . . . . 418
12.4.5 Vision and language transformers . . . . . . . . . . . . . . 420
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
13 Graph Neural Networks 425
13.1 Machine Learning on Graphs . . . . . . . . . . . . . . . . . . . . 427
13.1.1 Graph properties . . . . . . . . . . . . . . . . . . . . . . . 428
13.1.2 Adjacency matrix . . . . . . . . . . . . . . . . . . . . . . . 428
13.1.3 Permutation equivariance . . . . . . . . . . . . . . . . . . . 429
13.2 Neural Message-Passing . . . . . . . . . . . . . . . . . . . . . . . 430
13.2.1 Convolutional filters . . . . . . . . . . . . . . . . . . . . . 431
13.2.2 Graph convolutional networks . . . . . . . . . . . . . . . . 432
13.2.3 Aggregation operators . . . . . . . . . . . . . . . . . . . . 434
13.2.4 Update operators . . . . . . . . . . . . . . . . . . . . . . . 436
13.2.5 Node classification . . . . . . . . . . . . . . . . . . . . . . 437
13.2.6 Edge classification . . . . . . . . . . . . . . . . . . . . . . 438
13.2.7 Graph classification . . . . . . . . . . . . . . . . . . . . . . 438
13.3 General Graph Networks . . . . . . . . . . . . . . . . . . . . . . . 438
13.3.1 Graph attention networks . . . . . . . . . . . . . . . . . . . 439
13.3.2 Edge embeddings . . . . . . . . . . . . . . . . . . . . . . . 439
13.3.3 Graph embeddings . . . . . . . . . . . . . . . . . . . . . . 440
13.3.4 Over-smoothing . . . . . . . . . . . . . . . . . . . . . . . 440
13.3.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 441
13.3.6 Geometric deep learning . . . . . . . . . . . . . . . . . . . 442
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
14 Sampling 447
14.1 Basic Sampling Algorithms . . . . . . . . . . . . . . . . . . . . . 448
14.1.1 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . 448
14.1.2 Standard distributions . . . . . . . . . . . . . . . . . . . . 449
14.1.3 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . 451
14.1.4 Adaptive rejection sampling . . . . . . . . . . . . . . . . . 453
14.1.5 Importance sampling . . . . . . . . . . . . . . . . . . . . . 455
14.1.6 Sampling-importance-resampling . . . . . . . . . . . . . . 457
14.2 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . 458
14.2.1 The Metropolis algorithm . . . . . . . . . . . . . . . . . . 459
14.2.2 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . 460
14.2.3 The Metropolis–Hastings algorithm . . . . . . . . . . . . . 463
14.2.4 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . 464
14.2.5 Ancestral sampling . . . . . . . . . . . . . . . . . . . . . . 468
14.3 Langevin Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 469
14.3.1 Energy-based models . . . . . . . . . . . . . . . . . . . . . 470
14.3.2 Maximizing the likelihood . . . . . . . . . . . . . . . . . . 471
14.3.3 Langevin dynamics . . . . . . . . . . . . . . . . . . . . . . 472
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
15 Discrete Latent Variables 477
15.1 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 478
15.1.1 Image segmentation . . . . . . . . . . . . . . . . . . . . . 482
15.2 Mixtures of Gaussians . . . . . . . . . . . . . . . . . . . . . . . . 484
15.2.1 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 486
15.2.2 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 488
15.3 Expectation–Maximization Algorithm . . . . . . . . . . . . . . . . 492
15.3.1 Gaussian mixtures . . . . . . . . . . . . . . . . . . . . . . 496
15.3.2 Relation to K-means . . . . . . . . . . . . . . . . . . . . . 498
15.3.3 Mixtures of Bernoulli distributions . . . . . . . . . . . . . . 499
15.4 Evidence Lower Bound . . . . . . . . . . . . . . . . . . . . . . . 503
15.4.1 EM revisited . . . . . . . . . . . . . . . . . . . . . . . . . 504
15.4.2 Independent and identically distributed data . . . . . . . . . 506
15.4.3 Parameter priors . . . . . . . . . . . . . . . . . . . . . . . 507
15.4.4 Generalized EM . . . . . . . . . . . . . . . . . . . . . . . 507
15.4.5 Sequential EM . . . . . . . . . . . . . . . . . . . . . . . . 508
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
16 Continuous Latent Variables 513
16.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . 515
16.1.1 Maximum variance formulation . . . . . . . . . . . . . . . 515
16.1.2 Minimum-error formulation . . . . . . . . . . . . . . . . . 517
16.1.3 Data compression . . . . . . . . . . . . . . . . . . . . . . . 519
16.1.4 Data whitening . . . . . . . . . . . . . . . . . . . . . . . . 520
16.1.5 High-dimensional data . . . . . . . . . . . . . . . . . . . . 522
16.2 Probabilistic Latent Variables . . . . . . . . . . . . . . . . . . . . 524
16.2.1 Generative model . . . . . . . . . . . . . . . . . . . . . . . 524
16.2.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 525
16.2.3 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 527
16.2.4 Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . 531
16.2.5 Independent component analysis . . . . . . . . . . . . . . . 532
16.2.6 Kalman filters . . . . . . . . . . . . . . . . . . . . . . . . . 533
16.3 Evidence Lower Bound . . . . . . . . . . . . . . . . . . . . . . . 534
16.3.1 Expectation maximization . . . . . . . . . . . . . . . . . . 536
16.3.2 EM for PCA . . . . . . . . . . . . . . . . . . . . . . . . . 537
16.3.3 EM for factor analysis . . . . . . . . . . . . . . . . . . . . 538
16.4 Nonlinear Latent Variable Models . . . . . . . . . . . . . . . . . . 540
16.4.1 Nonlinear manifolds . . . . . . . . . . . . . . . . . . . . . 540
16.4.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 542
16.4.3 Discrete data . . . . . . . . . . . . . . . . . . . . . . . . . 544
16.4.4 Four approaches to generative modelling . . . . . . . . . . 545
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
17 Generative Adversarial Networks 551
17.1 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 552
17.1.1 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . 553
17.1.2 GAN training in practice . . . . . . . . . . . . . . . . . . . 554
17.2 Image GANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
17.2.1 CycleGAN . . . . . . . . . . . . . . . . . . . . . . . . . . 557
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
18 Normalizing Flows 565
18.1 Coupling Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
18.2 Autoregressive Flows . . . . . . . . . . . . . . . . . . . . . . . . . 570
18.3 Continuous Flows . . . . . . . . . . . . . . . . . . . . . . . . . . 572
18.3.1 Neural differential equations . . . . . . . . . . . . . . . . . 572
18.3.2 Neural ODE backpropagation . . . . . . . . . . . . . . . . 573
18.3.3 Neural ODE flows . . . . . . . . . . . . . . . . . . . . . . 575
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
19 Autoencoders 581
19.1 Deterministic Autoencoders . . . . . . . . . . . . . . . . . . . . . 582
19.1.1 Linear autoencoders . . . . . . . . . . . . . . . . . . . . . 582
19.1.2 Deep autoencoders . . . . . . . . . . . . . . . . . . . . . . 583
19.1.3 Sparse autoencoders . . . . . . . . . . . . . . . . . . . . . 584
19.1.4 Denoising autoencoders . . . . . . . . . . . . . . . . . . . 585
19.1.5 Masked autoencoders . . . . . . . . . . . . . . . . . . . . . 585
19.2 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 587
19.2.1 Amortized inference . . . . . . . . . . . . . . . . . . . . . 590
19.2.2 The reparameterization trick . . . . . . . . . . . . . . . . . 592
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
20 Diffusion Models 599
20.1 Forward Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
20.1.1 Diffusion kernel . . . . . . . . . . . . . . . . . . . . . . . 601
20.1.2 Conditional distribution . . . . . . . . . . . . . . . . . . . 602
20.2 Reverse Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
20.2.1 Training the decoder . . . . . . . . . . . . . . . . . . . . . 605
20.2.2 Evidence lower bound . . . . . . . . . . . . . . . . . . . . 606
20.2.3 Rewriting the ELBO . . . . . . . . . . . . . . . . . . . . . 607
20.2.4 Predicting the noise . . . . . . . . . . . . . . . . . . . . . . 609
20.2.5 Generating new samples . . . . . . . . . . . . . . . . . . . 610
20.3 Score Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
20.3.1 Score loss function . . . . . . . . . . . . . . . . . . . . . . 613
20.3.2 Modified score loss . . . . . . . . . . . . . . . . . . . . . . 614
20.3.3 Noise variance . . . . . . . . . . . . . . . . . . . . . . . . 615
20.3.4 Stochastic differential equations . . . . . . . . . . . . . . . 616
20.4 Guided Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
20.4.1 Classifier guidance . . . . . . . . . . . . . . . . . . . . . . 618
20.4.2 Classifier-free guidance . . . . . . . . . . . . . . . . . . . 618
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Appendix A Linear Algebra 627
A.1 Matrix Identities . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
A.2 Traces and Determinants . . . . . . . . . . . . . . . . . . . . . . . 628
A.3 Matrix Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . 629
A.4 Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
Appendix B Calculus of Variations 635
Appendix C Lagrange Multipliers 639
Bibliography 643
Index 659
Chris Bishop is a Technical Fellow at Microsoft and is the Director of Microsoft Research AI4Science. He is a Fellow of Darwin College, Cambridge, a Fellow of the Royal Academy of Engineering, a Fellow of the Royal Society of Edinburgh, and a Fellow of the Royal Society of London. He is a keen advocate of public engagement in science, and in 2008 he delivered the prestigious Royal Institution Christmas Lectures, established in 1825 by Michael Faraday, and broadcast on prime-time national television. Chris was a founding member of the UK AI Council and was also appointed to the Prime Minister’s Council for Science and Technology.