A
- accuracy, 34, 38, 44–46
- activation function, 272–273
- active learning, 19
- ad click prediction, 6
- adaptive learning rate, 290–291
- additive smoothing
  - see Laplace smoothing, 226
- aleatoric uncertainty
  - see inherent uncertainty, 109
- anomaly, 13, 97, 112, 142–143
- anomaly detection, 13, 97, 142–143, 152–153, 171–172, 178–180
- array, 193, 277
- artificial intelligence, 5
- artificial neural network
  - see neural network, 271
- attention layer, 351, 358
- attention mechanism, 349–354
- attribute
  - see variable, 5
- audio data
  - see audio processing, 4
- audio processing, 4, 11, 323–324
- autoencoder, 154–158
- automated machine learning, 108
- automatic differentiation
  - see backpropagation, 301
- autoregressive model, 336
  - see sequence modeling, 4
- auxiliary task
  - see pretext task, 22
B
- backpropagation, 297–302
- backpropagation through time, 326
- bagging, 232–236
- bag-of-words assumption, 199, 260
- balanced data, 31
- basin of attraction, 289
- batch, 291
- batch normalization, 293, 316
- batch size, 292–293
- Bayes classifier, 111
- Bayes error, 111
- Bayesian evidence, 383
- Bayesian inference, 53, 379–397
  - nonparametric, 386
- Bayesian observation
  - see Bayesian evidence, 383
- Bayesian optimization, 108
- Bayes’s theorem, 383–384, 389
- beam search, 347
- BERT, 207–208, 334, 356–358
- biases, 214, 217, 272
- bias-variance tradeoff, 102
- bidirectional recurrent network, 326
- bidirectional transformer, 354–358
- binning
  - see discretization, 187
- boosting, 236–240
- byte pair encoding
  - see subword tokenization, 198
C
- calibration curve
  - see reliability curve, 50
- categorical distribution, 166
- categorical variable, 30, 68, 188–191
- causality, 71, 97, 112
- chain network, 278–279
- chain rule, 296–299
- class, 27
- class priors, 52–53
- class probabilities, 28–29, 52–55, 93–95
- classic learning methods, 6, 116, 211–265
- classification, 1–2, 10–11, 20–21, 27–55, 84–86, 93–95, 310–321, 331–335
- classification tree
  - see decision tree, 227
- clustering, 12, 123–136
- clustering criterion, 127
- clustering purity, 126
- clustering tree, 135
- coin flip, 379–385
- collaborative filtering, 158–161
- computer vision, 1–3, 11, 14, 20–22, 40–45, 84–86, 116–117, 127–130, 144–148, 151–158, 178–180, 191–196, 306–307, 309–323, 353
- concept drift, 112
- confidence interval, 63–64
- confusion matrix, 44, 47
- content filtering
  - see recommendation, 158
- content selection
  - see recommendation, 158
- contextual word embeddings, 205–208
- continuity assumption, 97
- control engineering, 16–17
- convexity, 215
- convolution
- convolution layer, 305
- convolutional neural network (CNN), 302–324
- cost function, 43, 91–92, 94, 103
- count vectors, 199
- covariance
- covariance function, 249–250, 253–254, 256–258
- covariance matrix, 252
- cross-entropy loss, 94, 284
- cross-validation, 106–107
- customer segmentation, 6
D
- data augmentation, 44, 114, 192
- data collection, 35, 40–41, 130, 133–134, 147–148
- data curation, 44
- data shift, 111–113
- data size, 113–114
- dataset, 2
  - Boston Homes, 68–78, 148–149, 212–213
  - Brain Weights, 65–67, 186–187
  - Car Stopping Distances, 61–64, 89–93, 386–394
  - Car vs. Truck, 27–29, 93–95
  - CIFAR, 178–180
  - DNA Sequences, 133–136
  - Fisher’s Irises, 98–103, 123–127, 167–175, 211–212
  - ImageNet, 116–117, 302–303
  - MNIST, 50–51, 84–86, 146, 151–158, 292, 312–314
  - MovieLens, 159–161
  - Mushroom Identification, 2, 20–21, 40–45, 147–148
  - Symbolic Integration, 341–348, 360–367
  - Titanic Survival, 30–35, 109–110, 175–177, 183–184
  - Wikipedia Topics, 35–40, 130–132, 331–341
- decision boundary, 29
- decision forest
  - see random forest, 231
- decision theory, 53–55
- decision tree, 227–231
- decision under uncertainty
  - see decision theory, 53
- decoder, 155, 344, 363
- deep learning, 5
  - see neural network, 271
- deep neural network, 275, 280
- dendrogram, 136
- denoising
  - see error correction, 143
- denoising autoencoder, 158
- dense layer
  - see fully connected layer, 274
- density estimation, 165, 169–171
- derivative, 286, 294
- development set
  - see validation set, 98
- dimensionality reduction, 12–13, 131, 139–161
- discretization, 64, 187–188
- distribution learning, 165–180
- DNA sequences, 133–136
- downstream task, 21
- dropout layer, 315–316
E
- early stopping, 105–106
- EfficientNet, 41, 314–321
- embedding layer, 190
- empirical risk
  - see cost function, 91
- encoder, 155, 343–344, 362
- end-to-end learning, 278
- energy-based model, 395–396
- ensemble of models, 232–240
- environment, 15
- epoch, 292
- error, 45–46
- error correction, 143, 153, 158
- example mining, 114
- expectation-maximization (EM), 159, 175–177
- explainable AI, 32, 36–37, 70
- exploration vs. exploitation tradeoff, 108
- exploratory data analysis, 30–31
F
- F1 score, 46
- facial recognition, 127–129, 192
- fairness, 113
- feature, 10
- feature engineering
  - see preprocessing, 115
- feature extraction, 139
- feature importance, 32, 36–37, 70
- feature learning
  - see feature extraction, 139
- feature space, 145
- feature type, 34, 70
- feature vector, 10, 85
- feature visualization, 318–321
- feature-space plot, 13, 146–149
- feed-forward neural network
  - see multilayer perceptron, 274
- finite difference
  - see numerical differentiation, 294
- fraud detection, 5
- fully connected layer, 274–275
- fully connected network
  - see multilayer perceptron, 274
G
- game playing, 4, 16
- gated recurrent unit (GRU), 331, 337
- gating mechanism, 317, 330
- Gaussian mixture model, 177
- Gaussian process, 87–88, 247–258, 386
- generalization, 63, 95–106
- generative modeling, 14, 165, 168–169, 178–179
- GloVe
  - see word embeddings, 203
- GPT, 283, 360
- GPU, 283, 355–356, 365, 367
- gradient, 286, 293
- gradient boosted trees, 236–242
- gradient descent, 285–287, 290–291
- graphical network, 279–280
- grid search, 107–108
H
- Hadamard matrix, 190
- heavy-tailed distribution, 66, 186–187
- heteroscedastic noise, 64
- hidden layer, 275
- hidden Markov model, 265
- hinge loss, 243
- homoscedastic noise, 64
- hyperparameter, 103, 109
- hyperparameter optimization, 104–108
I
- iid assumption, 52, 96–97, 111
- image captioning, 353
- image colorization, 22
- image conformation, 191
- image data
  - see computer vision, 3
- image identification
  - see classification, 1
- image representation, 84–86, 306–307
- image segmentation, 322–323
- imbalanced data, 31
- imputation, 14, 143–144, 153–154, 159–161, 172–177
- in-distribution example, 96
- in-distribution generalization, 96
- inductive bias, 97
- infinite-width neural network, 258
- information retrieval, 149–150
- inherent uncertainty, 34, 109–111
- instance-based models, 83, 86–88
- integer encoding, 188
- intercept
  - see biases, 214
- intra-attention
  - see self-attention, 353
- irreducible error
  - see Bayes error, 111
- irreducible uncertainty
  - see inherent uncertainty, 109
- irrelevant features, 97, 115
- Isomap, 140
J
- Jacobian matrix, 298
K
- kernel, 244
- k-fold cross-validation, 107
- k-means method, 125–127, 177
- k-nearest neighbors
- Kneser–Ney smoothing, 263
- knowledge distillation, 231
L
- L1 regularization, 215–216, 219
- L2 regularization, 103–104, 215–216, 219
- label noise
  - see inherent uncertainty, 109
- language identification, 261–263
- language modeling, 22
  - see sequence modeling, 4
- Laplace smoothing, 226
- large margin classifier
  - see support-vector machine, 244
- latent Dirichlet allocation (LDA), 397
- latent features, 21
- latent semantic analysis, 201
- latent space
  - see feature space, 145
- latent variables, 139–140
- lazy learning, 227
- learning bias, 102
  - see inductive bias, 97
- learning curve, 34–35, 43, 106, 108, 110, 114, 178, 278
- learning paradigms, 9–22
- learning rate, 42, 287, 290–291
- LeNet, 310–314
- likelihood, 48–49, 77–78, 94–95, 170–171, 383
  - log-, 48
- linear classifier, 220–221
- linear layer
  - see fully connected layer, 274
- linear regression, 90–93, 213–216
  - Bayesian, 386–394
- local minimum, 288–289
- log loss
  - see cross-entropy loss, 94
- log transformation, 66–67, 186–187
- logistic regression, 93–95, 217–221
- logit transformation, 160
- logits, 219
- LogSumExp
  - see softmax function, 218
- long short-term memory network (LSTM), 330–331, 333
- loss function, 91, 94
M
- machine learning method, 1–3, 7
- manifold, 139–145, 152–154
- marginal likelihood, 258
- Markov chain Monte Carlo (MCMC), 389–392
- Markov model, 259–265
- masked self-attention, 359–360
- matrix multiplication, 275–276, 283
- maximum a posteriori (MAP), 393
- maximum likelihood estimation, 216
- mean cross-entropy
  - see negative log-likelihood (NLL), 49
- mean squared error (MSE), 75, 91
- measure uncertainty, 45, 106, 113–114
- measurements, 21, 34, 37–38, 43–51, 72–78, 98, 106, 314
- measures
  - see measurements, 45
- medical diagnosis, 5
- memory-based models
  - see instance-based models, 83
- metrics
  - see measurements, 45
- Metropolis algorithm, 389–391
- mini-batch
  - see batch, 291
- missing completely at random, 173
- missing data synthesis
  - see imputation, 14
- missing values, 30–31
- MNIST dataset, 151
- model, 2, 81–83
- model capacity, 100–103, 115–116
- model deployment, 39–40
- model evaluation
  - see model inference, 11
- model export
  - see serialization, 39
- model family, 83, 90
- model inference, 11
- model variance, 102
- model-based reinforcement learning, 16
- multilayer perceptron, 274–276
- multimodal distribution, 64, 173–175
- multinomial logistic regression
  - see logistic regression, 217
N
- natural language processing, 3–4, 11, 22, 35–40, 130–132, 149–150, 196–208, 331–341
- nearest neighbors, 86–88, 222–227
- negative log-likelihood (NLL), 49, 78, 94, 171
- neighbor-based models
  - see instance-based models, 83
- neural architecture, 274
- neural architecture search, 117, 274
- neural layer, 274–275, 277
- neural network, 20–21, 41–45, 67, 82, 88, 105–106, 116–117, 154–158, 190, 195–196, 202, 205–208, 219, 271–367
- neuron
- news aggregation, 130–132
- n-gram, 260–263
- n-gram method
  - see Markov model, 259
- nominal variable
  - see categorical variable, 30
- nonlinear method, 227
- nonparametric methods, 83–88
- nonparametric models
  - see nonparametric methods, 83
- numeric variable, 30
- numerical differentiation, 294
O
- object detection, 3, 11, 321–322
- objective function
  - see cost function, 91
- one-hot encoding, 189
- online learning, 18–19
- optimization, 83, 92, 94
- ordinal regression, 64–65
- outlier
- out-of-core learning, 18
- out-of-distribution example, 96, 112, 172
- out-of-distribution generalization, 96–97, 348
- overconfident model, 51
- overfitting, 63, 97–103
P
- parallel computing, 276, 283, 355
- parameters
  - see parametric methods, 89
- parametric curve, 140
- parametric methods, 89–95
- partition function, 218
- perplexity, 338
- pixel-wise distance, 85–86
- policy, 15
- polynomial fit, 101–105, 214
- pooling layer, 311
- posterior distribution, 383, 397
- posterior mean, 393
- precision, 46
- prediction scatter plot, 72, 76–77
- predictive distribution, 63–65, 70, 216
- predictive model, 9, 81–82
- preprocessing, 66–67, 115, 131, 134–136, 144, 160, 183–208
- preprocessing pipeline, 183–184
- pretext task, 22
- pre-trained model, 20, 41, 45, 195, 280, 334
- prior belief
  - see prior distribution, 382
- prior distribution, 52–53, 382–383, 387–389
- prior hypotheses
  - see prior knowledge, 97
- prior knowledge, 97, 115–116
- probabilistic graphical model, 264–265, 397
- probabilistic model, 29, 63–64, 216
- probabilistic programming, 395–397
- probability calibration, 50–51
- probability conditioning, 173–175, 386, 396–397
- probability density function, 166–167
- probability distribution, 165, 395–397
- protein folding, 281
- pseudo-residuals, 239–240
Q
R
- R² (coefficient of determination), 76
- random embedding, 189–190
- random field
  - see random function, 251
- random forest, 231–236
- random function, 251–253
- random noise
  - see inherent uncertainty, 109
- random process
  - see random function, 251
- random search, 107–108
- rarer probability, 172, 179–180
- recall, 46
- recommendation, 6, 158–161
- reconstruction error, 141–142, 152–154
- rectified linear unit (ReLU), 273
- recurrent neural network (RNN), 324–348, 352–354
- regression, 10, 61–78, 87–93, 278
- regression tree
  - see decision tree, 227
- regularization, 103–106, 109, 221, 225
- reinforcement learning, 15–17
- rejection threshold, 53
- reliability
  - see probability calibration, 50
- reliability curve, 50
- representation learning
  - see feature extraction, 139
- resampling with replacement, 232
- residual block, 317
- residual neural network (ResNet), 309, 317
- residuals, 74
- responsible AI, 113, 161
- reward, 15
- robotics, 15–16
- robustness, 96–97, 112, 348
- root mean squared error (RMSE), 70, 72, 75–76
S
- scaled exponential linear unit (SELU), 276
- search engine
  - see information retrieval, 149
- self-attention, 353–360
- self-normalizing network, 276
- self-supervised learning, 22
- self-training, 18, 159, 175
- semantic distance, 129
- semantic features, 21, 97, 196
- semantic hashing, 150
- semi-supervised learning, 17–18
- sentiment analysis, 4
- sequence generation
  - see sequence modeling, 4
- sequence modeling, 4, 259–265, 324–348, 352–367
- sequence-to-sequence (seq2seq), 341–348, 360–367
- serialization, 39
- Shapley additive explanations, 32, 70
- signal processing, 194–195
- similarity-based models
  - see instance-based models, 83
- simulation, 15–16
- skip connection, 317
- smoothing
  - see regularization, 103
- social bias, 113
- softmax function, 218
- speech recognition, 11, 323–324
- speech to text, 4
- spurious features, 97, 115
- squared error loss, 91
- squeeze-and-excitation block, 317
- standardization, 124–125, 184–185
- steepest descent
  - see gradient descent, 285
- stochastic gradient descent, 291–293
- stochastic process
  - see random function, 251
- stop words, 132
- structured data, 5, 9–10, 14, 30, 61, 65, 68, 89, 98, 123, 148–149, 167, 175, 211
- subword tokenization, 198
- super-resolution imaging, 3
- supervised learning, 9–11
- support-vector machine, 242–247
- symbolic differentiation, 295
T
- tabular data
  - see structured data, 30
- teacher forcing, 336
- tensor
  - see array, 193
- term frequency–inverse document frequency (tf–idf), 199–201
- term-document matrix, 199
- test set, 38, 98, 112
- text data
  - see natural language processing, 3
- text generation
  - see sequence modeling, 4
- text normalization, 196
- text search, 149–150
- text tokenization, 197–198
- text translation, 3, 11, 360
- training, 11
- training round
  - see epoch, 292
- training set, 38, 98
- transfer function
  - see activation function, 272
- transfer learning, 20–21, 41–45, 128–129, 195–196, 280, 334–335
- transformer, 348–367
- t-SNE, 131
U
- underconfident model, 51
- underfitting, 97–103
- U-Net, 323
- unidirectional transformer, 359–360
- unstructured data, 5
- unsupervised learning, 12–14
- utility, 53–55
V
- validation set, 41, 98, 106, 112
- vanilla recurrent network, 327–329
- vanishing and exploding gradient, 329
- variable, 5
- vector embedding, 189–191
- vector representation
  - see latent features, 21
- vision as inverse graphics, 397
- vision transformer (ViT), 358
- vocabulary, 197