Overview of Machine Learning

References:

https://d2l.ai/chapter_linear-regression/generalization.html#sec-generalization-basics
https://d2l.ai/chapter_multilayer-perceptrons/generalization-deep.html

1 Elements of Supervised Machine Learning

1.1 The model

We start by defining a parameterized function:

\[ f|\theta: \mathcal{X} \to \mathcal{Y} \]

where

\(\mathcal{X}\) is the input space. We almost always consider vector spaces \(\mathbb{R}^m\) or \(\mathbb{R}^{m_1\times m_2}\) or \(\mathbb{R}^{m_1\times m_2\times m_3\dots}\).
\(\mathcal{Y}\) is the output space. Again, we almost always consider vector spaces.
\(\theta\) denotes the parameters of \(f\); it consists of one or more tensors.

We must have \(f\) differentiable with respect to \(\theta\).

1.2 The training data

We have a training dataset \(D = \{(x_i, y_i): i\in\mathrm{Train}\}\) where

\(x_i\in \mathcal{X}\)
\(y_i\in \mathcal{Y}_\mathrm{true}\)

In most cases \(\mathcal{Y}_\mathrm{true} = \mathcal{Y}\).

The training data allows us to perform model identification by selecting a good parameter setting \(\theta^*\) such that:

\[ \forall i\in\mathrm{Train},\ f|{\theta^*}(x_i) \simeq y_i \]

1.3 The loss function

To assess how good a model is, we need a loss function:

\[ \ell : \mathcal{Y}\times\mathcal{Y}_\mathrm{true}\to\mathbb{R} \]

It compares two outputs: the predicted \(y_\mathrm{pred} = f|\theta(x_i)\) and the expected \(y_i\).

We must have \(\ell\) differentiable with respect to \(y_\mathrm{pred}\).
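To make the three ingredients concrete, here is a minimal sketch in plain Python. The linear model, the squared loss, and the sample values are illustrative assumptions, not choices made in these notes.

```python
# Hypothetical model: f|theta(x) = w . x + b, with theta = (w, b).
def f(theta, x):
    w, b = theta
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Hypothetical loss: squared error, differentiable in y_pred.
def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

theta = ([2.0, -1.0], 0.5)   # one parameter setting
x, y = [1.0, 3.0], 0.0       # one training sample (x_i, y_i)

y_pred = f(theta, x)             # 2*1 + (-1)*3 + 0.5
loss = squared_loss(y_pred, y)   # how far the prediction is from y_i
print(y_pred, loss)
```

Both functions are differentiable in the sense required above: `f` with respect to `theta`, and `squared_loss` with respect to `y_pred`.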

2 Learning From Data

2.1 Incremental improvements per sample

Suppose that we have a parameter setting \(\theta_0\), and some sample in the training data \((x_i, y_i)\).

Q: How do we adjust \(\theta \leftarrow \theta_0 + \Delta\theta\) such that the loss function is slightly improved?

A: Use gradient descent. The gradient points in the direction in which the field function increases fastest, so we step in the opposite direction.

The field function is:

\[ Q(\theta) = \ell(f|\theta(x_i), y_i) \]
The direction of steepest decrease of \(Q\) at \(\theta_0\) is given by the negative gradient:

\[ -\nabla Q(\theta_0) \]
But this direction is only reliable in a small neighbourhood of \(\theta_0\), so we take only a small step along it:

\[ \theta\leftarrow \theta_0 - \mathrm{lr}\cdot \nabla Q(\theta_0) \] where \(\mathrm{lr}\) is a small positive number, called the learning rate.
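A single update can be worked through by hand. The sketch below assumes a one-parameter model \(f|w(x) = w x\) with squared loss (both are illustrative assumptions), so that \(Q(w) = (wx - y)^2\) and \(\nabla Q(w) = 2(wx - y)x\).

```python
def Q(w, x, y):
    # Field function for one sample: squared error of the prediction w*x.
    return (w * x - y) ** 2

def grad_Q(w, x, y):
    # Derivative of Q with respect to w.
    return 2.0 * (w * x - y) * x

w0, x, y = 0.0, 2.0, 4.0

# A small step against the gradient reduces the loss.
w_small = w0 - 0.1 * grad_Q(w0, x, y)

# A large step leaves the neighbourhood of w0 and makes the loss worse.
w_large = w0 - 1.0 * grad_Q(w0, x, y)

print(Q(w0, x, y), Q(w_small, x, y), Q(w_large, x, y))
```

The second update illustrates why the learning rate has to be small: the gradient only describes \(Q\) near \(\theta_0\).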

2.2 Incremental improvements for the whole training dataset

Now, we can consider a set of training samples: \(\{(x_i, y_i)\}_\mathrm{Train}\).

The field function is the sum of the losses of all the samples in \(\mathrm{Train}\):

\[ Q(\theta) = \sum_{i\in \mathrm{Train}}\ell(f|\theta(x_i), y_i) \]

This allows us to perform incremental improvements over the entire training data.

Gradient Descent

This is called the gradient descent (GD) algorithm.
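As a rough illustration, the following sketch runs GD on a tiny, made-up dataset generated by \(y = 3x\), again assuming the one-parameter model \(f|w(x) = wx\) and the squared loss. The step count and learning rate are arbitrary choices for this toy problem.

```python
train = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]   # samples of y = 3x

def grad_Q(w, data):
    # Q(w) = sum_i (w*x_i - y_i)^2, so dQ/dw = sum_i 2*(w*x_i - y_i)*x_i.
    return sum(2.0 * (w * x - y) * x for x, y in data)

w, lr = 0.0, 0.01
for step in range(50):
    # Every update uses the gradient over the whole training set.
    w = w - lr * grad_Q(w, train)

print(w)   # close to the true slope 3
```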

2.3 Incremental improvements for a batch

The problem with the GD algorithm is that the training data can be too large to fit into memory when \(|\mathrm{Train}|\) is large (sometimes too large even for a single machine).

We will have to estimate the true gradient \(\nabla Q\) using a sample of the training data.

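Let \(I = \mathrm{batch}(\mathrm{Train})\) be a batch, that is, a small subset of the training samples. Then

\[ \nabla Q(\theta)\simeq \nabla_\theta\sum_{i\in I}\ell(f|\theta(x_i), y_i) \]

up to a constant scale factor \(|\mathrm{Train}|/|I|\), which can be absorbed into the learning rate.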
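The sketch below checks the batch estimate numerically on a made-up noisy dataset, assuming the model \(f|w(x) = wx\) with squared loss. Each batch gradient is rescaled by \(|\mathrm{Train}|/|I|\) so that it is comparable with the full gradient; individual estimates differ from the true gradient, but over a partition into equal batches they average out to it.

```python
import random

def grad_sum(w, samples):
    # Gradient of the summed squared loss over the given samples.
    return sum(2.0 * (w * x - y) * x for x, y in samples)

rng = random.Random(0)
train = [(x, 3.0 * x + rng.gauss(0.0, 0.1)) for x in [0.5 * k for k in range(1, 9)]]

w = 1.0
full_grad = grad_sum(w, train)

batch_size = 2
batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]
scale = len(train) / batch_size

# One rescaled estimate of the full gradient per batch.
estimates = [scale * grad_sum(w, I) for I in batches]
mean_estimate = sum(estimates) / len(estimates)

print(full_grad, estimates, mean_estimate)
```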

2.4 The Stochastic Gradient Descent algorithm

Stochastic Gradient Descent (SGD) performs a model parameter update for each batch of the training data. From each batch, SGD estimates the gradient and uses that estimate to update the parameters.

SGD scans through the entire dataset one batch at a time. Each scan of the entire training data is called an epoch.

for I in batches(Train):
    grad = ∇Q using I             # gradient estimated from the batch
    𝜃 = 𝜃 - learning_rate * grad

To further improve the model parameters, we typically perform multiple epochs.

def train(dataset, f, 𝜃):
  for epoch in range(epochs):
    for I in batches(dataset):
        grad = ∇Q using I
        𝜃 = 𝜃 - learning_rate * grad
  return 𝜃
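The pseudocode above can be turned into a small runnable sketch. Everything specific here is an assumption for illustration: the model \(f(x) = wx + b\), the squared loss, the noise-free data from \(y = 2x + 1\), and the helper names `batches` and `grad_Q`. The gradient is averaged over the batch rather than summed, which only rescales the learning rate.

```python
import random

def batches(dataset, batch_size):
    # Yield consecutive slices of a shuffled copy of the dataset.
    data = dataset[:]
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def grad_Q(theta, batch):
    # Gradient of the mean squared loss of f(x) = w*x + b over the batch.
    w, b = theta
    n = len(batch)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def train(dataset, theta, epochs=200, batch_size=4, learning_rate=0.05):
    for _ in range(epochs):                      # one epoch = one full scan
        for I in batches(dataset, batch_size):   # one update per batch
            gw, gb = grad_Q(theta, I)
            theta = (theta[0] - learning_rate * gw,
                     theta[1] - learning_rate * gb)
    return theta

random.seed(0)
dataset = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(20)]
w, b = train(dataset, (0.0, 0.0))
print(w, b)   # approaches w = 2, b = 1
```

Reshuffling at the start of each epoch is a common design choice, so that the batches differ from one epoch to the next.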

3 Evaluation of Models

3.1 Evaluation metrics

Let’s fix the model \(f|\theta\), and a loss function.

Given a dataset \(\{(x_i, y_i)\}_D\), we can evaluate the quality of the model \(f|\theta\) with respect to \(D\).

3.2 Loss

The loss is defined as the average of the per-sample losses over \(D\):

\[L = \frac{1}{n} \sum_{i\in D}\ell(f|\theta(x_i), y_i) \] where \(n = |D|\).
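A minimal sketch of this metric, assuming a fixed toy model \(f(x) = 2x\), the squared loss, and a made-up dataset of three samples:

```python
def mean_loss(f, loss, D):
    # Average of the per-sample losses over the dataset D.
    return sum(loss(f(x), y) for x, y in D) / len(D)

def f(x):
    return 2.0 * x

def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

D = [(1.0, 2.0), (2.0, 5.0), (3.0, 6.0)]
L = mean_loss(f, squared_loss, D)   # per-sample losses are 0, 1, 0
print(L)
```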

3.3 Accuracy

Accuracy is defined only for classification models.

Suppose that \(f|\theta\) is performing classification over \(c\) classes. This means that the output of \(f|\theta\) is a vector \(p\in[0, 1]^c\) of probabilities over the \(c\) classes.

For each \((x_i, y_i)\), let \(p = f|\theta(x_i)\). The predicted class is given by:

\[ y_i^\mathrm{pred} = \operatorname{argmax}_{1\leq j\leq c}\, p_j \]

The accuracy w.r.t. \(D\) is the fraction of samples for which \(f|\theta\) predicts the correct class.

\[ \mathrm{acc} = \frac{|\{i\in D: y_i^\mathrm{pred} = y_i\}|}{n} \]
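The computation can be sketched as follows. The probability vectors and labels are made up, and classes are indexed from 0 here (a Python convention, whereas the formula above counts from 1).

```python
def predict_class(p):
    # Index of the largest probability (argmax over the classes).
    return max(range(len(p)), key=lambda j: p[j])

def accuracy(probs, labels):
    # Fraction of samples whose predicted class equals the true class.
    correct = sum(1 for p, y in zip(probs, labels) if predict_class(p) == y)
    return correct / len(labels)

# Model outputs for four samples over c = 3 classes, and the true classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.4, 0.5, 0.1],
         [0.2, 0.2, 0.6]]
labels = [0, 2, 0, 2]

acc = accuracy(probs, labels)   # the third sample is misclassified
print(acc)
```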

4 Training vs Testing

Training is absolutely essential to obtain an optimized model. But how do we know if the model optimized w.r.t. the training data is actually going to work in an application scenario?

This is the problem of generalization.

How general is the model when used for data not seen in the training data?

4.1 Test data

To estimate how well \(f|\theta\) generalizes, we split our available dataset into two disjoint datasets:

\[\{(x_i, y_i)\}_\mathrm{Train} \cup \{(x_i, y_i)\}_\mathrm{Test} \]

After training, we can simulate how well the model works in application by evaluating with:

Test loss
Test accuracy
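One possible way to produce the two disjoint datasets is a random split. The helper name `train_test_split`, the 20% test ratio, and the fixed seed below are assumptions for illustration only.

```python
import random

def train_test_split(dataset, test_ratio=0.2, seed=0):
    # Shuffle the indices once, then cut them into two disjoint parts.
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    n_test = int(len(dataset) * test_ratio)
    test = [dataset[i] for i in indices[:n_test]]
    train = [dataset[i] for i in indices[n_test:]]
    return train, test

dataset = [(x, 2 * x) for x in range(10)]
train, test = train_test_split(dataset)
print(len(train), len(test))
```

The test part is set aside before training starts and is only used to compute the test metrics afterwards.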

Important

As a golden rule in machine learning, you are not allowed to incorporate test metrics as part of your training strategy.

4.2 Cross Validation

Q: What if the test metrics are terrible?

We cannot retrain, because that would violate the golden rule.
It’s basically a failed design: we have to redesign and train from scratch.

A (partial) solution is to use cross-validation.

Cross validation:

For each epoch, split the training data into two segments:

\[ \mathrm{Train} = \underbrace{\mathrm{Used}(\mathrm{epoch})}_{90\%}\cup\underbrace{\mathrm{Unused}(\mathrm{epoch})}_{10\%} \]

The training with validation is:

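def train_with_validation(dataset, f, 𝜃, validation_ratio=0.1):
  for epoch in range(epochs):
    # hold out a fresh random subset for this epoch
    unused = sample(dataset, validation_ratio)
    used = dataset.remove(unused)

    # update 𝜃 on the used data only, and keep the result
    𝜃 = train(used, f, 𝜃)
    eval(unused, f, 𝜃)
  return 𝜃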
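A runnable sketch of this per-epoch hold-out scheme is given below. The one-parameter model \(f|w(x) = wx\), the squared loss, the batch size of 1, and the helper names `train_one_epoch` and `evaluate` are all assumptions made to keep the example short.

```python
import random

def train_one_epoch(data, w, lr=0.05):
    # One scan over the data with batch size 1 (for brevity).
    for x, y in data:
        w = w - lr * 2.0 * (w * x - y) * x
    return w

def evaluate(data, w):
    # Mean squared loss of the model w*x on the given data.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def train_with_validation(dataset, w, epochs=20, validation_ratio=0.1, seed=0):
    rng = random.Random(seed)
    n_val = max(1, int(len(dataset) * validation_ratio))
    history = []
    for _ in range(epochs):
        # Draw a fresh hold-out subset for this epoch.
        shuffled = dataset[:]
        rng.shuffle(shuffled)
        unused, used = shuffled[:n_val], shuffled[n_val:]

        w = train_one_epoch(used, w)           # update on the used part only
        history.append(evaluate(unused, w))    # validation loss for this epoch
    return w, history

dataset = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]   # y = 3x
w, history = train_with_validation(dataset, 0.0)
print(w, history[0], history[-1])
```

The validation losses in `history` are computed from training data only, so monitoring them does not touch the test set.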