A
- accuracy, 34, 38, 44–46
- activation function, 272–273
- active learning, 19
- ad click prediction, 6
- adaptive learning rate, 290–291
- additive smoothing
  - see Laplace smoothing, 226
- aleatoric uncertainty
  - see inherent uncertainty, 109
- anomaly, 13, 97, 112, 142–143
- anomaly detection, 13, 97, 142–143, 152–153, 171–172, 178–180
- array, 193, 277
- artificial intelligence, 5
- artificial neural network
  - see neural network, 271
- attention layer, 351, 358
- attention mechanism, 349–354
- attribute
  - see variable, 5
- audio data
  - see audio processing, 4
- audio processing, 4, 11, 323–324
- autoencoder, 154–158
- automated machine learning, 108
- automatic differentiation
  - see backpropagation, 301
- autoregressive model, 336
  - see sequence modeling, 4
- auxiliary task
  - see pretext task, 22
B
- backpropagation, 297–302
- backpropagation through time, 326
- bagging, 232–236
- bag-of-words assumption, 199, 260
- balanced data, 31
- basin of attraction, 289
- batch, 291
- batch normalization, 293, 316
- batch size, 292–293
- Bayes classifier, 111
- Bayes error, 111
- Bayesian evidence, 383
- Bayesian inference, 53, 379–397
  - nonparametric, 386
- Bayesian observation
  - see Bayesian evidence, 383
- Bayesian optimization, 108
- Bayes’s theorem, 383–384, 389
- beam search, 347
- BERT, 207–208, 334, 356–358
- biases, 214, 217, 272
- bias-variance tradeoff, 102
- bidirectional recurrent network, 326
- bidirectional transformer, 354–358
- binning
  - see discretization, 187
- boosting, 236–240
- byte pair encoding
  - see subword tokenization, 198
C
- calibration curve
  - see reliability curve, 50
- categorical distribution, 166
- categorical variable, 30, 68, 188–191
- causality, 71, 97, 112
- chain network, 278–279
- chain rule, 296–299
- class, 27
- class priors, 52–53
- class probabilities, 28–29, 52–55, 93–95
- classic learning methods, 6, 116, 211–265
- classification, 1–2, 10–11, 20–21, 27–55, 84–86, 93–95, 310–321, 331–335
- classification tree
  - see decision tree, 227
- clustering, 12, 123–136
- clustering criterion, 127
- clustering purity, 126
- clustering tree, 135
- coin flip, 379–385
- collaborative filtering, 158–161
- computer vision, 1–3, 11, 14, 20–22, 40–45, 84–86, 116–117, 127–130, 144–148, 151–158, 178–180, 191–196, 306–307, 309–323, 353
- concept drift, 112
- confidence interval, 63–64
- confusion matrix, 44, 47
- content filtering
  - see recommendation, 158
- content selection
  - see recommendation, 158
- contextual word embeddings, 205–208
- continuity assumption, 97
- control engineering, 16–17
- convexity, 215
- convolution
- convolution layer, 305
- convolutional neural network (CNN), 302–324
- cost function, 43, 91–92, 94, 103
- count vectors, 199
- covariance
- covariance function, 249–250, 253–254, 256–258
- covariance matrix, 252
- cross-entropy loss, 94, 284
- cross-validation, 106–107
- customer segmentation, 6
D
- data augmentation, 44, 114, 192
- data collection, 35, 40–41, 130, 133–134, 147–148
- data curation, 44
- data shift, 111–113
- data size, 113–114
- dataset, 2
  - Boston Homes, 68–78, 148–149, 212–213
  - Brain Weights, 65–67, 186–187
  - Car Stopping Distances, 61–64, 89–93, 386–394
  - Car vs. Truck, 27–29, 93–95
  - CIFAR, 178–180
  - DNA Sequences, 133–136
  - Fisher’s Irises, 98–103, 123–127, 167–175, 211–212
  - ImageNet, 116–117, 302–303
  - MNIST, 50–51, 84–86, 146, 151–158, 292, 312–314
  - MovieLens, 159–161
  - Mushroom Identification, 2, 20–21, 40–45, 147–148
  - Symbolic Integration, 341–348, 360–367
  - Titanic Survival, 30–35, 109–110, 175–177, 183–184
  - Wikipedia Topics, 35–40, 130–132, 331–341
- decision boundary, 29
- decision forest
  - see random forest, 231
- decision theory, 53–55
- decision tree, 227–231
- decision under uncertainty
  - see decision theory, 53
- decoder, 155, 344, 363
- deep learning, 5
  - see neural network, 271
- deep neural network, 275, 280
- dendrogram, 136
- denoising
  - see error correction, 143
- denoising autoencoder, 158
- dense layer
  - see fully connected layer, 274
- density estimation, 165, 169–171
- derivative, 286, 294
- development set
  - see validation set, 98
- dimensionality reduction, 12–13, 131, 139–161
- discretization, 64, 187–188
- distribution learning, 165–180
- DNA sequences, 133–136
- downstream task, 21
- dropout layer, 315–316
E
- early stopping, 105–106
- EfficientNet, 41, 314–321
- embedding layer, 190
- empirical risk
  - see cost function, 91
- encoder, 155, 343–344, 362
- end-to-end learning, 278
- energy-based model, 395–396
- ensemble of models, 232–240
- environment, 15
- epoch, 292
- error, 45–46
- error correction, 143, 153, 158
- example mining, 114
- expectation-maximization (EM), 159, 175–177
- explainable AI, 32, 36–37, 70
- exploration vs. exploitation tradeoff, 108
- exploratory data analysis, 30–31
F
- F1 score, 46
- facial recognition, 127–129, 192
- fairness, 113
- feature, 10
- feature engineering
  - see preprocessing, 115
- feature extraction, 139
- feature importance, 32, 36–37, 70
- feature learning
  - see feature extraction, 139
- feature space, 145
- feature type, 34, 70
- feature vector, 10, 85
- feature visualization, 318–321
- feature-space plot, 13, 146–149
- feed-forward neural network
  - see multilayer perceptron, 274
- finite difference
  - see numerical differentiation, 294
- fraud detection, 5
- fully connected layer, 274–275
- fully connected network
  - see multilayer perceptron, 274
G
- game playing, 4, 16
- gated recurrent unit (GRU), 331, 337
- gating mechanism, 317, 330
- Gaussian mixture model, 177
- Gaussian process, 87–88, 247–258, 386
- generalization, 63, 95–106
- generative modeling, 14, 165, 168–169, 178–179
- GloVe
  - see word embeddings, 203
- GPT, 283, 360
- GPU, 283, 355–356, 365, 367
- gradient, 286, 293
- gradient boosted trees, 236–242
- gradient descent, 285–287, 290–291
- graphical network, 279–280
- grid search, 107–108
H
- Hadamard matrix, 190
- heavy-tailed distribution, 66, 186–187
- heteroscedastic noise, 64
- hidden layer, 275
- hidden Markov model, 265
- hinge loss, 243
- homoscedastic noise, 64
- hyperparameter, 103, 109
- hyperparameter optimization, 104–108
I
- iid assumption, 52, 96–97, 111
- image captioning, 353
- image colorization, 22
- image conformation, 191
- image data
  - see computer vision, 3
- image identification
  - see classification, 1
- image representation, 84–86, 306–307
- image segmentation, 322–323
- imbalanced data, 31
- imputation, 14, 143–144, 153–154, 159–161, 172–177
- in-distribution example, 96
- in-distribution generalization, 96
- inductive bias, 97
- infinite-width neural network, 258
- information retrieval, 149–150
- inherent uncertainty, 34, 109–111
- instance-based models, 83, 86–88
- integer encoding, 188
- intercept
  - see biases, 214
- intra-attention
  - see self-attention, 353
- irreducible error
  - see Bayes error, 111
- irreducible uncertainty
  - see inherent uncertainty, 109
- irrelevant features, 97, 115
- Isomap, 140
J
- Jacobian matrix, 298
K
- kernel, 244
- k-fold cross-validation, 107
- k-means method, 125–127, 177
- k-nearest neighbors
- Kneser–Ney smoothing, 263
- knowledge distillation, 231
L
- L1 regularization, 215–216, 219
- L2 regularization, 103–104, 215–216, 219
- label noise
  - see inherent uncertainty, 109
- language identification, 261–263
- language modeling, 22
  - see sequence modeling, 4
- Laplace smoothing, 226
- large margin classifier
  - see support-vector machine, 244
- latent Dirichlet allocation (LDA), 397
- latent features, 21
- latent semantic analysis, 201
- latent space
  - see feature space, 145
- latent variables, 139–140
- lazy learning, 227
- learning bias, 102
  - see inductive bias, 97
- learning curve, 34–35, 43, 106, 108, 110, 114, 178, 278
- learning paradigms, 9–22
- learning rate, 42, 287, 290–291
- LeNet, 310–314
- likelihood, 48–49, 77–78, 94–95, 170–171, 383
  - log-, 48
- linear classifier, 220–221
- linear layer
  - see fully connected layer, 274
- linear regression, 90–93, 213–216
  - Bayesian, 386–394
- local minimum, 288–289
- log loss
  - see cross-entropy loss, 94
- log transformation, 66–67, 186–187
- logistic regression, 93–95, 217–221
- logit transformation, 160
- logits, 219
- LogSumExp
  - see softmax function, 218
- long short-term memory network (LSTM), 330–331, 333
- loss function, 91, 94
M
- machine learning method, 1–3, 7
- manifold, 139–145, 152–154
- marginal likelihood, 258
- Markov chain Monte Carlo (MCMC), 389–392
- Markov model, 259–265
- masked self-attention, 359–360
- matrix multiplication, 275–276, 283
- maximum a posteriori (MAP), 393
- maximum likelihood estimation, 216
- mean cross-entropy
  - see negative log-likelihood (NLL), 49
- mean squared error (MSE), 75, 91
- measure uncertainty, 45, 106, 113–114
- measurements, 21, 34, 37–38, 43–51, 72–78, 98, 106, 314
- measures
  - see measurements, 45
- medical diagnosis, 5
- memory-based models
  - see instance-based models, 83
- metrics
  - see measurements, 45
- Metropolis algorithm, 389–391
- mini-batch
  - see batch, 291
- missing completely at random, 173
- missing data synthesis
  - see imputation, 14
- missing values, 30–31
- MNIST dataset, 151
- model, 2, 81–83
- model capacity, 100–103, 115–116
- model deployment, 39–40
- model evaluation
  - see model inference, 11
- model export
  - see serialization, 39
- model family, 83, 90
- model inference, 11
- model variance, 102
- model-based reinforcement learning, 16
- multilayer perceptron, 274–276
- multimodal distribution, 64, 173–175
- multinomial logistic regression
  - see logistic regression, 217
N
- natural language processing, 3–4, 11, 22, 35–40, 130–132, 149–150, 196–208, 331–341
- nearest neighbors, 86–88, 222–227
- negative log-likelihood (NLL), 49, 78, 94, 171
- neighbor-based models
  - see instance-based models, 83
- neural architecture, 274
- neural architecture search, 117, 274
- neural layer, 274–275, 277
- neural network, 20–21, 41–45, 67, 82, 88, 105–106, 116–117, 154–158, 190, 195–196, 202, 205–208, 219, 271–367
- neuron
- news aggregation, 130–132
- n-gram, 260–263
- n-gram method
  - see Markov model, 259
- nominal variable
  - see categorical variable, 30
- nonlinear method, 227
- nonparametric methods, 83–88
- nonparametric models
  - see nonparametric methods, 83
- numeric variable, 30
- numerical differentiation, 294
O
- object detection, 3, 11, 321–322
- objective function
  - see cost function, 91
- one-hot encoding, 189
- online learning, 18–19
- optimization, 83, 92, 94
- ordinal regression, 64–65
- outlier
- out-of-core learning, 18
- out-of-distribution example, 96, 112, 172
- out-of-distribution generalization, 96–97, 348
- overconfident model, 51
- overfitting, 63, 97–103
P
- parallel computing, 276, 283, 355
- parameters
  - see parametric methods, 89
- parametric curve, 140
- parametric methods, 89–95
- partition function, 218
- perplexity, 338
- pixel-wise distance, 85–86
- policy, 15
- polynomial fit, 101–105, 214
- pooling layer, 311
- posterior distribution, 383, 397
- posterior mean, 393
- precision, 46
- prediction scatter plot, 72, 76–77
- predictive distribution, 63–65, 70, 216
- predictive model, 9, 81–82
- preprocessing, 66–67, 115, 131, 134–136, 144, 160, 183–208
- preprocessing pipeline, 183–184
- pretext task, 22
- pre-trained model, 20, 41, 45, 195, 280, 334
- prior belief
  - see prior distribution, 382
- prior distribution, 52–53, 382–383, 387–389
- prior hypotheses
  - see prior knowledge, 97
- prior knowledge, 97, 115–116
- probabilistic graphical model, 264–265, 397
- probabilistic model, 29, 63–64, 216
- probabilistic programming, 395–397
- probability calibration, 50–51
- probability conditioning, 173–175, 386, 396–397
- probability density function, 166–167
- probability distribution, 165, 395–397
- protein folding, 281
- pseudo-residuals, 239–240
Q
R
- R² (coefficient of determination), 76
- random embedding, 189–190
- random field
  - see random function, 251
- random forest, 231–236
- random function, 251–253
- random noise
  - see inherent uncertainty, 109
- random process
  - see random function, 251
- random search, 107–108
- rarer probability, 172, 179–180
- recall, 46
- recommendation, 6, 158–161
- reconstruction error, 141–142, 152–154
- rectified linear unit (ReLU), 273
- recurrent neural network (RNN), 324–348, 352–354
- regression, 10, 61–78, 87–93, 278
- regression tree
  - see decision tree, 227
- regularization, 103–106, 109, 221, 225
- reinforcement learning, 15–17
- rejection threshold, 53
- reliability
  - see probability calibration, 50
- reliability curve, 50
- representation learning
  - see feature extraction, 139
- resampling with replacement, 232
- residual block, 317
- residual neural network (ResNet), 309, 317
- residuals, 74
- responsible AI, 113, 161
- reward, 15
- robotics, 15–16
- robustness, 96–97, 112, 348
- root mean squared error (RMSE), 70, 72, 75–76
S
- scaled exponential linear unit (SELU), 276
- search engine
  - see information retrieval, 149
- self-attention, 353–360
- self-normalizing network, 276
- self-supervised learning, 22
- self-training, 18, 159, 175
- semantic distance, 129
- semantic features, 21, 97, 196
- semantic hashing, 150
- semi-supervised learning, 17–18
- sentiment analysis, 4
- sequence generation
  - see sequence modeling, 4
- sequence modeling, 4, 259–265, 324–348, 352–367
- sequence-to-sequence (seq2seq), 341–348, 360–367
- serialization, 39
- Shapley additive explanations, 32, 70
- signal processing, 194–195
- similarity-based models
  - see instance-based models, 83
- simulation, 15–16
- skip connection, 317
- smoothing
  - see regularization, 103
- social bias, 113
- softmax function, 218
- speech recognition, 11, 323–324
- speech to text, 4
- spurious features, 97, 115
- squared error loss, 91
- squeeze-and-excitation block, 317
- standardization, 124–125, 184–185
- steepest descent
  - see gradient descent, 285
- stochastic gradient descent, 291–293
- stochastic process
  - see random function, 251
- stop words, 132
- structured data, 5, 9–10, 14, 30, 61, 65, 68, 89, 98, 123, 148–149, 167, 175, 211
- subword tokenization, 198
- super-resolution imaging, 3
- supervised learning, 9–11
- support-vector machine, 242–247
- symbolic differentiation, 295
T
- tabular data
  - see structured data, 30
- teacher forcing, 336
- tensor
  - see array, 193
- term frequency–inverse document frequency (tf–idf), 199–201
- term-document matrix, 199
- test set, 38, 98, 112
- text data
  - see natural language processing, 3
- text generation
  - see sequence modeling, 4
- text normalization, 196
- text search, 149–150
- text tokenization, 197–198
- text translation, 3, 11, 360
- training, 11
- training round
  - see epoch, 292
- training set, 38, 98
- transfer function
  - see activation function, 272
- transfer learning, 20–21, 41–45, 128–129, 195–196, 280, 334–335
- transformer, 348–367
- t-SNE, 131
U
- underconfident model, 51
- underfitting, 97–103
- U-Net, 323
- unidirectional transformer, 359–360
- unstructured data, 5
- unsupervised learning, 12–14
- utility, 53–55
V
- validation set, 41, 98, 106, 112
- vanilla recurrent network, 327–329
- vanishing and exploding gradient, 329
- variable, 5
- vector embedding, 189–191
- vector representation
  - see latent features, 21
- vision as inverse graphics, 397
- vision transformer (ViT), 358
- vocabulary, 197