This book offers a comprehensive introduction to the central ideas that underpin deep learning. It is intended both for newcomers to machine learning and for those already experienced in the field. Covering key concepts relating to contemporary architectures and techniques, this essential book equips readers with a robust foundation for potential future specialization. The field of deep learning is undergoing rapid evolution, and therefore this book focusses on ideas that are likely to endure the test of time.
The book is organized into numerous bite-sized chapters, each exploring a distinct topic, and the narrative follows a linear progression, with each chapter building upon content from its predecessors. This structure is well-suited to teaching a two-semester undergraduate or postgraduate machine learning course, while remaining equally relevant to those engaged in active research or in self-study.
A full understanding of machine learning requires some mathematical background and so the book includes a self-contained introduction to probability theory. However, the focus of the book is on conveying a clear understanding of ideas, with emphasis on the real-world practical value of techniques rather than on abstract theory. Complex concepts are therefore presented from multiple complementary perspectives including textual descriptions, diagrams, mathematical formulae, and pseudo-code.
Chris Bishop is a Technical Fellow at Microsoft and is the Director of Microsoft Research AI4Science. He is a Fellow of Darwin College, Cambridge, a Fellow of the Royal Academy of Engineering, and a Fellow of the Royal Society.
Hugh Bishop is an Applied Scientist at Wayve, a deep learning autonomous driving company in London, where he designs and trains deep neural networks. He completed his MPhil in Machine Learning and Machine Intelligence at Cambridge University.
“Chris Bishop wrote a terrific textbook on neural networks in 1995 and has a deep knowledge of the field and its core ideas. His many years of experience in explaining neural networks have made him extremely skillful at presenting complicated ideas in the simplest possible way and it is a delight to see these skills applied to the revolutionary new developments in the field.” -- Geoffrey Hinton
“With the recent explosion of deep learning and AI as a research topic, and the quickly growing importance of AI applications, a modern textbook on the topic was badly needed. The ‘New Bishop’ masterfully fills the gap, covering algorithms for supervised and unsupervised learning, modern deep learning architecture families, as well as how to apply all of this to various application areas.” -- Yann LeCun
“This excellent and very educational book will bring the reader up to date with the main concepts and advances in deep learning with a solid anchoring in probability. These concepts are powering current industrial AI systems and are likely to form the basis of further advances towards artificial general intelligence.” -- Yoshua Bengio
Preface 3
1 The Deep Learning Revolution 19
1.1 The Impact of Deep Learning . . . . . . . . . . . . . . . . . . . . 20
1.1.1 Medical diagnosis . . . . . . . . . . . . . . . . . . . . . . 20
1.1.2 Protein structure . . . . . . . . . . . . . . . . . . . . . . . 21
1.1.3 Image synthesis . . . . . . . . . . . . . . . . . . . . . . . . 22
1.1.4 Large language models . . . . . . . . . . . . . . . . . . . . 23
1.2 A Tutorial Example . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2.1 Synthetic data . . . . . . . . . . . . . . . . . . . . . . . . . 24
1.2.2 Linear models . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.2.3 Error function . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.2.4 Model complexity . . . . . . . . . . . . . . . . . . . . . . 27
1.2.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 30
1.2.6 Model selection . . . . . . . . . . . . . . . . . . . . . . . . 32
1.3 A Brief History of Machine Learning . . . . . . . . . . . . . . . . 34
1.3.1 Single-layer networks . . . . . . . . . . . . . . . . . . . . 35
1.3.2 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . 36
1.3.3 Deep networks . . . . . . . . . . . . . . . . . . . . . . . . 38
2 Probabilities 41
2.1 The Rules of Probability . . . . . . . . . . . . . . . . . . . . . . . 43
2.1.1 A medical screening example . . . . . . . . . . . . . . . . 43
2.1.2 The sum and product rules . . . . . . . . . . . . . . . . . . 44
2.1.3 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.4 Medical screening revisited . . . . . . . . . . . . . . . . . 48
2.1.5 Prior and posterior probabilities . . . . . . . . . . . . . . . 49
2.1.6 Independent variables . . . . . . . . . . . . . . . . . . . . 49
2.2 Probability Densities . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.2.1 Example distributions . . . . . . . . . . . . . . . . . . . . 51
2.2.2 Expectations and covariances . . . . . . . . . . . . . . . . 52
2.3 The Gaussian Distribution . . . . . . . . . . . . . . . . . . . . . . 54
2.3.1 Mean and variance . . . . . . . . . . . . . . . . . . . . . . 55
2.3.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 55
2.3.3 Bias of maximum likelihood . . . . . . . . . . . . . . . . . 57
2.3.4 Linear regression . . . . . . . . . . . . . . . . . . . . . . . 58
2.4 Transformation of Densities . . . . . . . . . . . . . . . . . . . . . 60
2.4.1 Multivariate distributions . . . . . . . . . . . . . . . . . . . 62
2.5 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.5.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
2.5.2 Physics perspective . . . . . . . . . . . . . . . . . . . . . . 65
2.5.3 Differential entropy . . . . . . . . . . . . . . . . . . . . . . 67
2.5.4 Maximum entropy . . . . . . . . . . . . . . . . . . . . . . 68
2.5.5 Kullback–Leibler divergence . . . . . . . . . . . . . . . . . 69
2.5.6 Conditional entropy . . . . . . . . . . . . . . . . . . . . . 71
2.5.7 Mutual information . . . . . . . . . . . . . . . . . . . . . . 72
2.6 Bayesian Probabilities . . . . . . . . . . . . . . . . . . . . . . . . 72
2.6.1 Model parameters . . . . . . . . . . . . . . . . . . . . . . . 73
2.6.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 74
2.6.3 Bayesian machine learning . . . . . . . . . . . . . . . . . . 75
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3 Standard Distributions 83
3.1 Discrete Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.1.1 Bernoulli distribution . . . . . . . . . . . . . . . . . . . . . 84
3.1.2 Binomial distribution . . . . . . . . . . . . . . . . . . . . . 85
3.1.3 Multinomial distribution . . . . . . . . . . . . . . . . . . . 86
3.2 The Multivariate Gaussian . . . . . . . . . . . . . . . . . . . . . . 88
3.2.1 Geometry of the Gaussian . . . . . . . . . . . . . . . . . . 89
3.2.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.2.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.2.4 Conditional distribution . . . . . . . . . . . . . . . . . . . 94
3.2.5 Marginal distribution . . . . . . . . . . . . . . . . . . . . . 97
3.2.6 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 99
3.2.7 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 102
3.2.8 Sequential estimation . . . . . . . . . . . . . . . . . . . . . 103
3.2.9 Mixtures of Gaussians . . . . . . . . . . . . . . . . . . . . 104
3.3 Periodic Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.3.1 Von Mises distribution . . . . . . . . . . . . . . . . . . . . 107
3.4 The Exponential Family . . . . . . . . . . . . . . . . . . . . . . . 112
3.4.1 Sufficient statistics . . . . . . . . . . . . . . . . . . . . . . 115
3.5 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . . . 116
3.5.1 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3.5.2 Kernel densities . . . . . . . . . . . . . . . . . . . . . . . . 118
3.5.3 Nearest-neighbours . . . . . . . . . . . . . . . . . . . . . . 121
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4 Single-layer Networks: Regression 129
4.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.1.1 Basis functions . . . . . . . . . . . . . . . . . . . . . . . . 130
4.1.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 132
4.1.3 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 133
4.1.4 Geometry of least squares . . . . . . . . . . . . . . . . . . 135
4.1.5 Sequential learning . . . . . . . . . . . . . . . . . . . . . . 135
4.1.6 Regularized least squares . . . . . . . . . . . . . . . . . . . 136
4.1.7 Multiple outputs . . . . . . . . . . . . . . . . . . . . . . . 137
4.2 Decision theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.3 The Bias–Variance Trade-off . . . . . . . . . . . . . . . . . . . . . 141
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5 Single-layer Networks: Classification 149
5.1 Discriminant Functions . . . . . . . . . . . . . . . . . . . . . . . . 150
5.1.1 Two classes . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.1.2 Multiple classes . . . . . . . . . . . . . . . . . . . . . . . . 152
5.1.3 1-of-K coding . . . . . . . . . . . . . . . . . . . . . . . . 153
5.1.4 Least squares for classification . . . . . . . . . . . . . . . . 154
5.2 Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.2.1 Misclassification rate . . . . . . . . . . . . . . . . . . . . . 157
5.2.2 Expected loss . . . . . . . . . . . . . . . . . . . . . . . . . 158
5.2.3 The reject option . . . . . . . . . . . . . . . . . . . . . . . 160
5.2.4 Inference and decision . . . . . . . . . . . . . . . . . . . . 161
5.2.5 Classifier accuracy . . . . . . . . . . . . . . . . . . . . . . 165
5.2.6 ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . 166
5.3 Generative Classifiers . . . . . . . . . . . . . . . . . . . . . . . . 168
5.3.1 Continuous inputs . . . . . . . . . . . . . . . . . . . . . . 170
5.3.2 Maximum likelihood solution . . . . . . . . . . . . . . . . 171
5.3.3 Discrete features . . . . . . . . . . . . . . . . . . . . . . . 174
5.3.4 Exponential family . . . . . . . . . . . . . . . . . . . . . . 174
5.4 Discriminative Classifiers . . . . . . . . . . . . . . . . . . . . . . 175
5.4.1 Activation functions . . . . . . . . . . . . . . . . . . . . . 176
5.4.2 Fixed basis functions . . . . . . . . . . . . . . . . . . . . . 176
5.4.3 Logistic regression . . . . . . . . . . . . . . . . . . . . . . 177
5.4.4 Multi-class logistic regression . . . . . . . . . . . . . . . . 179
5.4.5 Probit regression . . . . . . . . . . . . . . . . . . . . . . . 181
5.4.6 Canonical link functions . . . . . . . . . . . . . . . . . . . 182
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6 Deep Neural Networks 189
6.1 Limitations of Fixed Basis Functions . . . . . . . . . . . . . . . . 190
6.1.1 The curse of dimensionality . . . . . . . . . . . . . . . . . 190
6.1.2 High-dimensional spaces . . . . . . . . . . . . . . . . . . . 193
6.1.3 Data manifolds . . . . . . . . . . . . . . . . . . . . . . . . 194
6.1.4 Data-dependent basis functions . . . . . . . . . . . . . . . 196
6.2 Multilayer Networks . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.2.1 Parameter matrices . . . . . . . . . . . . . . . . . . . . . . 199
6.2.2 Universal approximation . . . . . . . . . . . . . . . . . . . 199
6.2.3 Hidden unit activation functions . . . . . . . . . . . . . . . 200
6.2.4 Weight-space symmetries . . . . . . . . . . . . . . . . . . 203
6.3 Deep Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
6.3.1 Hierarchical representations . . . . . . . . . . . . . . . . . 205
6.3.2 Distributed representations . . . . . . . . . . . . . . . . . . 205
6.3.3 Representation learning . . . . . . . . . . . . . . . . . . . 206
6.3.4 Transfer learning . . . . . . . . . . . . . . . . . . . . . . . 207
6.3.5 Contrastive learning . . . . . . . . . . . . . . . . . . . . . 209
6.3.6 General network architectures . . . . . . . . . . . . . . . . 211
6.3.7 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4 Error Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4.1 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 212
6.4.2 Binary classification . . . . . . . . . . . . . . . . . . . . . 214
6.4.3 Multiclass classification . . . . . . . . . . . . . . . . . . . 215
6.5 Mixture Density Networks . . . . . . . . . . . . . . . . . . . . . . 216
6.5.1 Robot kinematics example . . . . . . . . . . . . . . . . . . 216
6.5.2 Conditional mixture distributions . . . . . . . . . . . . . . 217
6.5.3 Gradient optimization . . . . . . . . . . . . . . . . . . . . 219
6.5.4 Predictive distribution . . . . . . . . . . . . . . . . . . . . 220
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7 Gradient Descent 227
7.1 Error Surfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
7.1.1 Local quadratic approximation . . . . . . . . . . . . . . . . 229
7.2 Gradient Descent Optimization . . . . . . . . . . . . . . . . . . . 231
7.2.1 Use of gradient information . . . . . . . . . . . . . . . . . 232
7.2.2 Batch gradient descent . . . . . . . . . . . . . . . . . . . . 232
7.2.3 Stochastic gradient descent . . . . . . . . . . . . . . . . . . 232
7.2.4 Mini-batches . . . . . . . . . . . . . . . . . . . . . . . . . 234
7.2.5 Parameter initialization . . . . . . . . . . . . . . . . . . . . 234
7.3 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
7.3.1 Momentum . . . . . . . . . . . . . . . . . . . . . . . . . . 238
7.3.2 Learning rate schedule . . . . . . . . . . . . . . . . . . . . 240
7.3.3 RMSProp and Adam . . . . . . . . . . . . . . . . . . . . . 241
7.4 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
7.4.1 Data normalization . . . . . . . . . . . . . . . . . . . . . . 244
7.4.2 Batch normalization . . . . . . . . . . . . . . . . . . . . . 245
7.4.3 Layer normalization . . . . . . . . . . . . . . . . . . . . . 247
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
8 Backpropagation 251
8.1 Evaluation of Gradients . . . . . . . . . . . . . . . . . . . . . . . 252
8.1.1 Single-layer networks . . . . . . . . . . . . . . . . . . . . 252
8.1.2 General feed-forward networks . . . . . . . . . . . . . . . 253
8.1.3 A simple example . . . . . . . . . . . . . . . . . . . . . . 256
8.1.4 Numerical differentiation . . . . . . . . . . . . . . . . . . . 257
8.1.5 The Jacobian matrix . . . . . . . . . . . . . . . . . . . . . 258
8.1.6 The Hessian matrix . . . . . . . . . . . . . . . . . . . . . . 260
8.2 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . 262
8.2.1 Forward-mode automatic differentiation . . . . . . . . . . . 264
8.2.2 Reverse-mode automatic differentiation . . . . . . . . . . . 267
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
9 Regularization 271
9.1 Inductive Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
9.1.1 Inverse problems . . . . . . . . . . . . . . . . . . . . . . . 272
9.1.2 No free lunch theorem . . . . . . . . . . . . . . . . . . . . 273
9.1.3 Symmetry and invariance . . . . . . . . . . . . . . . . . . . 274
9.1.4 Equivariance . . . . . . . . . . . . . . . . . . . . . . . . . 277
9.2 Weight Decay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
9.2.1 Consistent regularizers . . . . . . . . . . . . . . . . . . . . 280
9.2.2 Generalized weight decay . . . . . . . . . . . . . . . . . . 282
9.3 Learning Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
9.3.1 Early stopping . . . . . . . . . . . . . . . . . . . . . . . . 284
9.3.2 Double descent . . . . . . . . . . . . . . . . . . . . . . . . 286
9.4 Parameter Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . 288
9.4.1 Soft weight sharing . . . . . . . . . . . . . . . . . . . . . . 289
9.5 Residual Connections . . . . . . . . . . . . . . . . . . . . . . . . 292
9.6 Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
9.6.1 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
10 Convolutional Networks 305
10.1 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . 306
10.1.1 Image data . . . . . . . . . . . . . . . . . . . . . . . . . . 307
10.2 Convolutional Filters . . . . . . . . . . . . . . . . . . . . . . . . . 308
10.2.1 Feature detectors . . . . . . . . . . . . . . . . . . . . . . . 308
10.2.2 Translation equivariance . . . . . . . . . . . . . . . . . . . 309
10.2.3 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
10.2.4 Strided convolutions . . . . . . . . . . . . . . . . . . . . . 312
10.2.5 Multi-dimensional convolutions . . . . . . . . . . . . . . . 313
10.2.6 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
10.2.7 Multilayer convolutions . . . . . . . . . . . . . . . . . . . 316
10.2.8 Example network architectures . . . . . . . . . . . . . . . . 317
10.3 Visualizing Trained CNNs . . . . . . . . . . . . . . . . . . . . . . 320
10.3.1 Visual cortex . . . . . . . . . . . . . . . . . . . . . . . . . 320
10.3.2 Visualizing trained filters . . . . . . . . . . . . . . . . . . . 321
10.3.3 Saliency maps . . . . . . . . . . . . . . . . . . . . . . . . 323
10.3.4 Adversarial attacks . . . . . . . . . . . . . . . . . . . . . . 324
10.3.5 Synthetic images . . . . . . . . . . . . . . . . . . . . . . . 326
10.4 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
10.4.1 Bounding boxes . . . . . . . . . . . . . . . . . . . . . . . 327
10.4.2 Intersection-over-union . . . . . . . . . . . . . . . . . . . . 328
10.4.3 Sliding windows . . . . . . . . . . . . . . . . . . . . . . . 329
10.4.4 Detection across scales . . . . . . . . . . . . . . . . . . . . 331
10.4.5 Non-max suppression . . . . . . . . . . . . . . . . . . . . . 332
10.4.6 Fast region CNNs . . . . . . . . . . . . . . . . . . . . . . . 332
10.5 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 333
10.5.1 Convolutional segmentation . . . . . . . . . . . . . . . . . 333
10.5.2 Up-sampling . . . . . . . . . . . . . . . . . . . . . . . . . 334
10.5.3 Fully convolutional networks . . . . . . . . . . . . . . . . . 336
10.5.4 The U-net architecture . . . . . . . . . . . . . . . . . . . . 337
10.6 Style Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
11 Structured Distributions 343
11.1 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
11.1.1 Directed graphs . . . . . . . . . . . . . . . . . . . . . . . . 344
11.1.2 Factorization . . . . . . . . . . . . . . . . . . . . . . . . . 345
11.1.3 Discrete variables . . . . . . . . . . . . . . . . . . . . . . . 347
11.1.4 Gaussian variables . . . . . . . . . . . . . . . . . . . . . . 350
11.1.5 Binary classifier . . . . . . . . . . . . . . . . . . . . . . . 352
11.1.6 Parameters and observations . . . . . . . . . . . . . . . . . 352
11.1.7 Bayes’ theorem . . . . . . . . . . . . . . . . . . . . . . . . 354
11.2 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . 355
11.2.1 Three example graphs . . . . . . . . . . . . . . . . . . . . 356
11.2.2 Explaining away . . . . . . . . . . . . . . . . . . . . . . . 359
11.2.3 D-separation . . . . . . . . . . . . . . . . . . . . . . . . . 361
11.2.4 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . 362
11.2.5 Generative models . . . . . . . . . . . . . . . . . . . . . . 364
11.2.6 Markov blanket . . . . . . . . . . . . . . . . . . . . . . . . 365
11.2.7 Graphs as filters . . . . . . . . . . . . . . . . . . . . . . . . 366
11.3 Sequence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
11.3.1 Hidden variables . . . . . . . . . . . . . . . . . . . . . . . 370
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
12 Transformers 375
12.1 Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
12.1.1 Transformer processing . . . . . . . . . . . . . . . . . . . . 378
12.1.2 Attention coefficients . . . . . . . . . . . . . . . . . . . . . 379
12.1.3 Self-attention . . . . . . . . . . . . . . . . . . . . . . . . . 380
12.1.4 Network parameters . . . . . . . . . . . . . . . . . . . . . 381
12.1.5 Scaled self-attention . . . . . . . . . . . . . . . . . . . . . 384
12.1.6 Multi-head attention . . . . . . . . . . . . . . . . . . . . . 384
12.1.7 Transformer layers . . . . . . . . . . . . . . . . . . . . . . 386
12.1.8 Computational complexity . . . . . . . . . . . . . . . . . . 388
12.1.9 Positional encoding . . . . . . . . . . . . . . . . . . . . . . 389
12.2 Natural Language . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
12.2.1 Word embedding . . . . . . . . . . . . . . . . . . . . . . . 393
12.2.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . 395
12.2.3 Bag of words . . . . . . . . . . . . . . . . . . . . . . . . . 396
12.2.4 Autoregressive models . . . . . . . . . . . . . . . . . . . . 397
12.2.5 Recurrent neural networks . . . . . . . . . . . . . . . . . . 398
12.2.6 Backpropagation through time . . . . . . . . . . . . . . . . 399
12.3 Transformer Language Models . . . . . . . . . . . . . . . . . . . . 400
12.3.1 Decoder transformers . . . . . . . . . . . . . . . . . . . . . 401
12.3.2 Sampling strategies . . . . . . . . . . . . . . . . . . . . . . 404
12.3.3 Encoder transformers . . . . . . . . . . . . . . . . . . . . . 406
12.3.4 Sequence-to-sequence transformers . . . . . . . . . . . . . 408
12.3.5 Large language models . . . . . . . . . . . . . . . . . . . . 408
12.4 Multimodal Transformers . . . . . . . . . . . . . . . . . . . . . . 412
12.4.1 Vision transformers . . . . . . . . . . . . . . . . . . . . . . 413
12.4.2 Generative image transformers . . . . . . . . . . . . . . . . 414
12.4.3 Audio data . . . . . . . . . . . . . . . . . . . . . . . . . . 417
12.4.4 Text-to-speech . . . . . . . . . . . . . . . . . . . . . . . . 418
12.4.5 Vision and language transformers . . . . . . . . . . . . . . 420
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
13 Graph Neural Networks 425
13.1 Machine Learning on Graphs . . . . . . . . . . . . . . . . . . . . 427
13.1.1 Graph properties . . . . . . . . . . . . . . . . . . . . . . . 428
13.1.2 Adjacency matrix . . . . . . . . . . . . . . . . . . . . . . . 428
13.1.3 Permutation equivariance . . . . . . . . . . . . . . . . . . . 429
13.2 Neural Message-Passing . . . . . . . . . . . . . . . . . . . . . . . 430
13.2.1 Convolutional filters . . . . . . . . . . . . . . . . . . . . . 431
13.2.2 Graph convolutional networks . . . . . . . . . . . . . . . . 432
13.2.3 Aggregation operators . . . . . . . . . . . . . . . . . . . . 434
13.2.4 Update operators . . . . . . . . . . . . . . . . . . . . . . . 436
13.2.5 Node classification . . . . . . . . . . . . . . . . . . . . . . 437
13.2.6 Edge classification . . . . . . . . . . . . . . . . . . . . . . 438
13.2.7 Graph classification . . . . . . . . . . . . . . . . . . . . . . 438
13.3 General Graph Networks . . . . . . . . . . . . . . . . . . . . . . . 438
13.3.1 Graph attention networks . . . . . . . . . . . . . . . . . . . 439
13.3.2 Edge embeddings . . . . . . . . . . . . . . . . . . . . . . . 439
13.3.3 Graph embeddings . . . . . . . . . . . . . . . . . . . . . . 440
13.3.4 Over-smoothing . . . . . . . . . . . . . . . . . . . . . . . 440
13.3.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . 441
13.3.6 Geometric deep learning . . . . . . . . . . . . . . . . . . . 442
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
14 Sampling 447
14.1 Basic Sampling Algorithms . . . . . . . . . . . . . . . . . . . . . 448
14.1.1 Expectations . . . . . . . . . . . . . . . . . . . . . . . . . 448
14.1.2 Standard distributions . . . . . . . . . . . . . . . . . . . . 449
14.1.3 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . 451
14.1.4 Adaptive rejection sampling . . . . . . . . . . . . . . . . . 453
14.1.5 Importance sampling . . . . . . . . . . . . . . . . . . . . . 455
14.1.6 Sampling-importance-resampling . . . . . . . . . . . . . . 457
14.2 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . 458
14.2.1 The Metropolis algorithm . . . . . . . . . . . . . . . . . . 459
14.2.2 Markov chains . . . . . . . . . . . . . . . . . . . . . . . . 460
14.2.3 The Metropolis–Hastings algorithm . . . . . . . . . . . . . 463
14.2.4 Gibbs sampling . . . . . . . . . . . . . . . . . . . . . . . . 464
14.2.5 Ancestral sampling . . . . . . . . . . . . . . . . . . . . . . 468
14.3 Langevin Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . 469
14.3.1 Energy-based models . . . . . . . . . . . . . . . . . . . . . 470
14.3.2 Maximizing the likelihood . . . . . . . . . . . . . . . . . . 471
14.3.3 Langevin dynamics . . . . . . . . . . . . . . . . . . . . . . 472
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
15 Discrete Latent Variables 477
15.1 K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . 478
15.1.1 Image segmentation . . . . . . . . . . . . . . . . . . . . . 482
15.2 Mixtures of Gaussians . . . . . . . . . . . . . . . . . . . . . . . . 484
15.2.1 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 486
15.2.2 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 488
15.3 Expectation–Maximization Algorithm . . . . . . . . . . . . . . . . 492
15.3.1 Gaussian mixtures . . . . . . . . . . . . . . . . . . . . . . 496
15.3.2 Relation to K-means . . . . . . . . . . . . . . . . . . . . . 498
15.3.3 Mixtures of Bernoulli distributions . . . . . . . . . . . . . . 499
15.4 Evidence Lower Bound . . . . . . . . . . . . . . . . . . . . . . . 503
15.4.1 EM revisited . . . . . . . . . . . . . . . . . . . . . . . . . 504
15.4.2 Independent and identically distributed data . . . . . . . . . 506
15.4.3 Parameter priors . . . . . . . . . . . . . . . . . . . . . . . 507
15.4.4 Generalized EM . . . . . . . . . . . . . . . . . . . . . . . 507
15.4.5 Sequential EM . . . . . . . . . . . . . . . . . . . . . . . . 508
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
16 Continuous Latent Variables 513
16.1 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . 515
16.1.1 Maximum variance formulation . . . . . . . . . . . . . . . 515
16.1.2 Minimum-error formulation . . . . . . . . . . . . . . . . . 517
16.1.3 Data compression . . . . . . . . . . . . . . . . . . . . . . . 519
16.1.4 Data whitening . . . . . . . . . . . . . . . . . . . . . . . . 520
16.1.5 High-dimensional data . . . . . . . . . . . . . . . . . . . . 522
16.2 Probabilistic Latent Variables . . . . . . . . . . . . . . . . . . . . 524
16.2.1 Generative model . . . . . . . . . . . . . . . . . . . . . . . 524
16.2.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 525
16.2.3 Maximum likelihood . . . . . . . . . . . . . . . . . . . . . 527
16.2.4 Factor analysis . . . . . . . . . . . . . . . . . . . . . . . . 531
16.2.5 Independent component analysis . . . . . . . . . . . . . . . 532
16.2.6 Kalman filters . . . . . . . . . . . . . . . . . . . . . . . . . 533
16.3 Evidence Lower Bound . . . . . . . . . . . . . . . . . . . . . . . 534
16.3.1 Expectation maximization . . . . . . . . . . . . . . . . . . 536
16.3.2 EM for PCA . . . . . . . . . . . . . . . . . . . . . . . . . 537
16.3.3 EM for factor analysis . . . . . . . . . . . . . . . . . . . . 538
16.4 Nonlinear Latent Variable Models . . . . . . . . . . . . . . . . . . 540
16.4.1 Nonlinear manifolds . . . . . . . . . . . . . . . . . . . . . 540
16.4.2 Likelihood function . . . . . . . . . . . . . . . . . . . . . . 542
16.4.3 Discrete data . . . . . . . . . . . . . . . . . . . . . . . . . 544
16.4.4 Four approaches to generative modelling . . . . . . . . . . 545
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
17 Generative Adversarial Networks 551
17.1 Adversarial Training . . . . . . . . . . . . . . . . . . . . . . . . . 552
17.1.1 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . 553
17.1.2 GAN training in practice . . . . . . . . . . . . . . . . . . . 554
17.2 Image GANs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557
17.2.1 CycleGAN . . . . . . . . . . . . . . . . . . . . . . . . . . 557
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
18 Normalizing Flows 565
18.1 Coupling Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
18.2 Autoregressive Flows . . . . . . . . . . . . . . . . . . . . . . . . . 570
18.3 Continuous Flows . . . . . . . . . . . . . . . . . . . . . . . . . . 572
18.3.1 Neural differential equations . . . . . . . . . . . . . . . . . 572
18.3.2 Neural ODE backpropagation . . . . . . . . . . . . . . . . 573
18.3.3 Neural ODE flows . . . . . . . . . . . . . . . . . . . . . . 575
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
19 Autoencoders 581
19.1 Deterministic Autoencoders . . . . . . . . . . . . . . . . . . . . . 582
19.1.1 Linear autoencoders . . . . . . . . . . . . . . . . . . . . . 582
19.1.2 Deep autoencoders . . . . . . . . . . . . . . . . . . . . . . 583
19.1.3 Sparse autoencoders . . . . . . . . . . . . . . . . . . . . . 584
19.1.4 Denoising autoencoders . . . . . . . . . . . . . . . . . . . 585
19.1.5 Masked autoencoders . . . . . . . . . . . . . . . . . . . . . 585
19.2 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 587
19.2.1 Amortized inference . . . . . . . . . . . . . . . . . . . . . 590
19.2.2 The reparameterization trick . . . . . . . . . . . . . . . . . 592
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
20 Diffusion Models 599
20.1 Forward Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
20.1.1 Diffusion kernel . . . . . . . . . . . . . . . . . . . . . . . 601
20.1.2 Conditional distribution . . . . . . . . . . . . . . . . . . . 602
20.2 Reverse Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
20.2.1 Training the decoder . . . . . . . . . . . . . . . . . . . . . 605
20.2.2 Evidence lower bound . . . . . . . . . . . . . . . . . . . . 606
20.2.3 Rewriting the ELBO . . . . . . . . . . . . . . . . . . . . . 607
20.2.4 Predicting the noise . . . . . . . . . . . . . . . . . . . . . . 609
20.2.5 Generating new samples . . . . . . . . . . . . . . . . . . . 610
20.3 Score Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
20.3.1 Score loss function . . . . . . . . . . . . . . . . . . . . . . 613
20.3.2 Modified score loss . . . . . . . . . . . . . . . . . . . . . . 614
20.3.3 Noise variance . . . . . . . . . . . . . . . . . . . . . . . . 615
20.3.4 Stochastic differential equations . . . . . . . . . . . . . . . 616
20.4 Guided Diffusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
20.4.1 Classifier guidance . . . . . . . . . . . . . . . . . . . . . . 618
20.4.2 Classifier-free guidance . . . . . . . . . . . . . . . . . . . 618
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Appendix A Linear Algebra 627
A.1 Matrix Identities . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
A.2 Traces and Determinants . . . . . . . . . . . . . . . . . . . . . . . 628
A.3 Matrix Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . 629
A.4 Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
Appendix B Calculus of Variations 635
Appendix C Lagrange Multipliers 639
Bibliography 643
Index 659
Chris Bishop is a Technical Fellow at Microsoft and is the Director of Microsoft Research AI4Science. He is a Fellow of Darwin College, Cambridge, a Fellow of the Royal Academy of Engineering, a Fellow of the Royal Society of Edinburgh, and a Fellow of the Royal Society of London. He is a keen advocate of public engagement in science, and in 2008 he delivered the prestigious Royal Institution Christmas Lectures, established in 1825 by Michael Faraday, and broadcast on prime-time national television. Chris was a founding member of the UK AI Council and was also appointed to the Prime Minister’s Council for Science and Technology.