Logistic Regression

1 A dataset

We will work on the iris dataset.

  • Three species of iris: Setosa, Versicolour, and Virginica.
  • Four measurements of the sepal and petal dimensions of 150 samples.
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
dataset = pd.DataFrame(
  iris['data'],
  columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
)
dataset['species'] = iris['target']
dataset = dataset.sample(frac=1)  # shuffle the rows
dataset.head()
sepal_length sepal_width petal_length petal_width species
103 6.3 2.9 5.6 1.8 2
32 5.2 4.1 1.5 0.1 0
144 6.7 3.3 5.7 2.5 2
24 4.8 3.4 1.9 0.2 0
71 6.1 2.8 4.0 1.3 1

2 A simplified classification problem

For the species 0 and 1, can we tell them apart based on the sepal dimensions?

This is a binary classification problem based on two attributes.

df = dataset[(dataset.species == 0) | (dataset.species == 1)]
x_df = df[['sepal_length', 'sepal_width']]
y_df = df['species']
import matplotlib.pyplot as pl
c = ['blue' if s == 0 else 'red' for s in y_df]
pl.scatter(x_df['sepal_length'], x_df['sepal_width'], c=c);

3 The model of binary classification using logistic regression

The training data is given as:

  • Input: \(\{x_i\}\) where each \(x_i\in\mathbb{R}^n\).
  • Output: \(\{y_i\}\) where each \(y_i\in\{0, 1\}\).

The model is given as:

\[ p_i = f(x_i|\theta) \]

where \(p_i\) is the probability that \(y_i=1\).

The logistic model

Let \(w\in\mathbb{R}^n\) be a vector and \(b\in\mathbb{R}\) a scalar that together define the hyperplane \(P(w, b)\) in \(\mathbb{R}^n\), where:

  • \(w\) is the normal vector.
  • \(b\) is the offset.

\[ P(w,b) = \{x: w^Tx+b = 0\} \]

The displacement of \(x_i\) from the hyperplane \(P(w,b)\) is given by:

\[ \delta_i = w^Tx_i + b \]

Furthermore,

\[ \begin{eqnarray} \mathrm{if}\ \delta_i < 0 &\implies& y_i = 0 \\ \mathrm{if}\ \delta_i > 0 &\implies& y_i = 1 \end{eqnarray} \]

The displacement \(\delta_i\) is called the logit of \(x_i\). The logit \(\delta_i\) can take any value in \((-\infty, \infty)\).
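
For example, with a hypothetical \(w = [1, -1]^T\) and \(b = 0\), the input \(x_i = [3, 1]^T\) has displacement \(\delta_i = (1)(3) + (-1)(1) + 0 = 2 > 0\), so the model predicts \(y_i = 1\).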

So, to convert the logit into a probability, we use the sigmoid function:

\[ \mathrm{sigmoid}(u) = \frac{1}{1 + e^{-u}} \]

Finally, putting everything together, we have:

\[ f(x_i) = \mathrm{sigmoid}(w^Tx_i + b) \]

import torch
import torch.nn as nn

class LogisticRegression(nn.Module):
    def __init__(self, features=2):
        super().__init__()
        # w is the normal vector and b is the offset of the hyperplane
        self.w = nn.Parameter(torch.randn(size=(features,)))
        self.b = nn.Parameter(torch.randn(size=()))
        self.activation = nn.Sigmoid()

    def forward(self, x):
        # logit w^T x + b, converted to a probability by the sigmoid
        return self.activation(x @ self.w + self.b)
model = LogisticRegression()
x_input = torch.tensor(x_df.values, dtype=torch.float32)
y_true = torch.tensor(y_df.values, dtype=torch.float32)
model(x_input[:5])
tensor([3.1510e-05, 1.2083e-04, 4.1820e-05, 3.2806e-04, 9.5421e-05],
       grad_fn=<SigmoidBackward0>)

4 Evaluating model performance with binary cross-entropy

We have two sets of outputs:

  • True output: \(y_i\in\{0, 1\}\)
  • Predicted probability: \(p_i\in[0, 1]\).
y_i p_i
0 0.0 0.000032
1 0.0 0.000121
2 1.0 0.000042
3 0.0 0.000328
4 1.0 0.000095

4.1 The binary cross-entropy

\[ L(p, y) = -y\log(p)-(1-y)\log(1-p) \]

from torch.nn.functional import binary_cross_entropy


with torch.no_grad():
    p = model(x_input)
    L = binary_cross_entropy(p, y_true, reduction='none').numpy()
    df = pd.DataFrame({
        'y_i': y_true,
        'p_i': p,
        'L_i': L,
    })
df.head()
y_i p_i L_i
0 0.0 0.000032 0.000032
1 0.0 0.000121 0.000121
2 1.0 0.000042 10.082129
3 0.0 0.000328 0.000328
4 1.0 0.000095 9.257207
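
The table shows how severely the loss penalizes confident mistakes. For row 2, the true label is \(y_2 = 1\) but the model assigns it probability \(p_2 \approx 0.000042\), so

\[ L_2 = -\log(0.000042) \approx 10.08, \]

whereas for row 0, where \(y_0 = 0\) and \(p_0 \approx 0.000032\),

\[ L_0 = -\log(1 - 0.000032) \approx 0.000032. \]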

5 Train the model

model = LogisticRegression()
loss = binary_cross_entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epochs = 100
for epoch in range(epochs):
    optimizer.zero_grad()
    L = loss(model(x_input), y_true)
    L.backward()
    optimizer.step()
    if epoch % (epochs // 10) == 0:
        print(L.detach().numpy())
6.0816207
0.78494406
0.7099538
0.64621294
0.59198713
0.5457063
0.5060138
0.47177264
0.44204697
0.4160743
model.w, model.b
(Parameter containing:
 tensor([ 0.6915, -1.3958], requires_grad=True),
 Parameter containing:
 tensor(0.6079, requires_grad=True))
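
As a quick sanity check (not part of the original training loop), we can threshold the predicted probabilities at 0.5 and measure the accuracy on the training data; the exact number depends on the random initialization and the shuffling above.

with torch.no_grad():
    p = model(x_input)                        # predicted probabilities
    y_pred = (p > 0.5).float()                # threshold at 0.5
    accuracy = (y_pred == y_true).float().mean()

print(accuracy)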

6 Plot the classification boundary

7 A bit of linear algebra on plotting the surface

Let’s go through the linear algebra that computes the classification boundary. Since the classification boundary is in \(\mathbb{R}^2\), the boundary is a line.

Logistic regression models the boundary line by the normal vector \(w\) and offset \(b\), and the line is given by: \[ L = \{x\in\mathbb{R}^2: w^Tx + b = 0\} \]

The nice thing is that this equation for \(L\) generalizes to higher dimensions.

To plot \(L\), we need to transform it into another form:

\[ L = \{x_0 + vt: t\in\mathbb{R}\} \] where \(x_0, v\in\mathbb{R}^2\).

So, we need to find:

  1. A point already on \(L\): \(x_0\in\mathbb{R}^2\),
  2. the direction vector of \(L\): \(v\in\mathbb{R}^2\).

7.1 Finding a point on \(L\)

Note

Claim:

\(x_0 = -\frac{w}{\|w\|^2}b\) is on \(L\)

Proof:

Just check:

\[ \begin{eqnarray} w^T x_0 + b &=& -w^T\left(\frac{w}{\|w\|^2}b\right) + b \\ &=& -\frac{w^T w}{\|w\|^2}b + b \\ &=& -b + b \\ &=& 0 \end{eqnarray} \]

7.2 Finding the direction vector of \(L\)

Note

Claim: \[ \left[\begin{array}{c} u \\ v \end{array} \right]^T \left[\begin{array}{c} -v \\ u \end{array} \right] = 0 \]

Therefore, given \(w = [w_0, w_1]\), the direction vector is simply \(v = [-w_1, w_0]\).

7.3 Plotting \(L\)

Now that we have \(L = \{x_0 + vt\}\), we just need two points:

  • \(p_1 = x_0 + vt_1\)
  • \(p_2 = x_0 + vt_2\)

We pick \(t_1\) and \(t_2\) to be two arbitrary values.

pl.plot(
    [p1[0], p2[0]],
    [p1[1], p2[1]], ...)
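
Here is a minimal sketch of the full computation, using the trained model.w and model.b from Section 5; the names x0, v, t1, t2, p1, p2 are introduced here just for illustration.

import numpy as np

w = model.w.detach().numpy()
b = model.b.detach().numpy()

x0 = -w * b / np.dot(w, w)           # a point on the boundary: -wb/||w||^2
v = np.array([-w[1], w[0]])          # direction vector orthogonal to w

t1, t2 = -5.0, 5.0                   # two arbitrary parameter values
p1 = x0 + v * t1
p2 = x0 + v * t2

c = ['blue' if s == 0 else 'red' for s in y_df]
pl.scatter(x_df['sepal_length'], x_df['sepal_width'], c=c)
pl.plot([p1[0], p2[0]], [p1[1], p2[1]], color='black')

# keep the axes focused on the data; the boundary line itself extends forever
pl.xlim(x_df['sepal_length'].min() - 0.5, x_df['sepal_length'].max() + 0.5)
pl.ylim(x_df['sepal_width'].min() - 0.5, x_df['sepal_width'].max() + 0.5);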

8 Generalizing to multiple classes

We will now perform classification on all three species based on the sepal length and sepal width.

With multiple classes, the model needs to compute multiple probabilities.

\[ f(x|\theta) = \left[\begin{array}{c} p_1 \\ p_2 \\ p_3 \\ \end{array}\right] \]

where each \(p_k\) represents the probability that \(x\) is the sepal measurement of a sample from species \(k\).

Thus: \(p_1+p_2+p_3 = 1\).

8.1 Logits of classification

The model consists of the parameters \((W, b)\), where:

  • \(W\in\mathbb{R}^{2\times 3}\)
  • \(b\in\mathbb{R}^3\)

Given an input \(x\in\mathbb{R}^2\), we define

\[ v = xW + b \]

Treating \(x\) as a row vector, the dimensions tell us that \(v\in\mathbb{R}^3\). Here \(v\) is the vector of logits. To convert the logits into probabilities, we use the softmax function.

8.2 Softmax function

\[ \mathrm{softmax}:\mathbb{R}^n\to[0,1]^n \]

For \(p = \mathrm{softmax}(v)\), we have:

\[p_i = \frac{e^{v_i}}{\sum_k e^{v_k}}\]
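
As a quick illustration, here is the softmax applied to the logits of a single input; the values of \(W\), \(b\), and \(x\) below are made up for the example.

W = torch.randn(2, 3)              # hypothetical model parameters
b = torch.randn(3)
x = torch.tensor([5.1, 3.5])       # one input (sepal length, sepal width)

v = x @ W + b                      # logits, a vector in R^3
p = torch.softmax(v, dim=-1)       # probabilities in [0, 1]

print(p, p.sum())                  # the probabilities sum to 1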

8.3 A general model

\[ f(x|W,b) = \mathrm{softmax}(xW + b) \]

Note, \(W\) and \(b\) can be designed to accommodate any input dimension and number of classes.

9 Linear Layer With Activation Function

The model:

\[f(x|W, b) = \mathrm{softmax}(xW + b)\]

combines two pieces: the map \(x\mapsto xW + b\) is called a linear layer, and softmax is the activation function applied to its output.

Both are supported by PyTorch.

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 3)

    def forward(self, x):
        # softmax over the last dimension converts the logits to probabilities
        return nn.functional.softmax(self.linear(x), dim=-1)
x_input = torch.tensor(
    dataset[["sepal_length", "sepal_width"]].values,
    dtype=torch.float32)

y_true = torch.tensor(dataset['species'].values, dtype=torch.float32)
#
# Use the model to do some classification WITHOUT training.
#

model = Classifier()
p_pred = model(x_input)

p_pred[:5]
tensor([[0.7420, 0.0797, 0.1784],
        [0.6849, 0.0461, 0.2690],
        [0.7546, 0.0682, 0.1772],
        [0.6799, 0.0615, 0.2586],
        [0.7352, 0.0826, 0.1822]], grad_fn=<SliceBackward0>)
#
# The true labels
#

y_true[:5]
tensor([2., 0., 2., 0., 1.])
#
# The one-hot encoding of the true labels
#

one_hots = nn.functional.one_hot(
    y_true.to(torch.int64),
    3
)

one_hots[:5]
tensor([[0, 0, 1],
        [1, 0, 0],
        [0, 0, 1],
        [1, 0, 0],
        [0, 1, 0]])
  • y_true is a tensor of float32.
  • y_true.to(torch.int64) converts each element to int64, so it can be used as an index tensor
  • functional.one_hot(index_tensor, num_classes) converts the index tensor elements to their one-hot encodings.

10 Cross Entropy

Let \(p_i = [p_{i1}, p_{i2}, p_{i3}]\) be the predicted probability vector, and \(y_i\) the true label.

We need a loss function \(L(p_i, y_i)\) to assess the quality of the prediction.

10.1 Cross entropy loss of one-hot encodings

Let \(\mathbf{b}\) be a one-hot encoding vector over \(k\) classes, and \(\mathbf{p} = [p_1, p_2, \dots p_k]\) be the classification probability.

The cross entropy loss is given by:

\[ L_i = \mathrm{crossentropy}(\mathbf{p},\mathbf{b}) = - \sum_{j=1}^k b_j\cdot \log(p_j) \]

crossentropies = -torch.sum(torch.log(p_pred) * one_hots, axis=-1)
df = pd.DataFrame(p_pred.detach())
df['y_true'] = y_true.int()
df['crossentropy'] = crossentropies.detach()
df.head().round(2)
0 1 2 y_true crossentropy
0 0.74 0.08 0.18 2 1.72
1 0.68 0.05 0.27 0 0.38
2 0.75 0.07 0.18 2 1.73
3 0.68 0.06 0.26 0 0.39
4 0.74 0.08 0.18 1 2.49
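
For instance, in row 0 the true species is 2, so the one-hot vector keeps only the third term of the sum; using the unrounded probability from the earlier output, \(L_0 = -\log(0.1784) \approx 1.72\), matching the crossentropy column.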

11 Training the linear layer with activation function

  • PyTorch CrossEntropyLoss() constructs a loss function \(f(\mathrm{logits}, \mathrm{indexes})\)

  • It computes the softmax inside the loss function.

  • It computes the one-hot vectors inside the loss function.
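
A small sketch (with made-up logits, not the iris data) showing that CrossEntropyLoss applied to raw logits agrees with computing the softmax and the cross entropy by hand:

logits = torch.tensor([[2.0, 0.5, -1.0]])    # raw logits for one sample
target = torch.tensor([0])                   # the true class index

# by hand: softmax, then -log of the probability of the true class
p = torch.softmax(logits, dim=-1)
manual = -torch.log(p[0, target[0]])

# built-in: operates directly on the logits
auto = nn.CrossEntropyLoss()(logits, target)

print(manual, auto)                          # the two values agree

With that in mind, the training loop below passes the raw logits from nn.Linear directly to the loss.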

model = nn.Linear(2, 3)
loss = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x_input = torch.tensor(
    dataset[['sepal_length', 'sepal_width']].values,
    dtype=torch.float32
)

y_true = torch.tensor(
    dataset['species'].values,
    dtype=torch.int64
)
epochs = 1000
for epoch in range(epochs):
    optimizer.zero_grad()
    pred = model(x_input)
    l = loss(pred, y_true)
    l.backward()
    optimizer.step()
    
    if epoch % (epochs//10) == 0:
        print(epoch, l.detach().numpy().round(3))
print(epoch, l.detach().numpy().round(3))
0 3.751
100 0.726
200 0.624
300 0.58
400 0.554
500 0.536
600 0.523
700 0.512
800 0.504
900 0.497
999 0.491

12 Visualizing the 3-way classification

model.eval()
p_pred = model(x_input)  # raw logits (no softmax); argmax still picks the most likely class
y_pred = p_pred.argmax(axis=1)
print("Predicted classes", y_pred[:5])
print("True classes", y_true[:5])
Predicted classes tensor([2, 0, 2, 0, 2])
True classes tensor([2, 0, 2, 0, 1])
torch.sum(y_pred == y_true) / y_true.shape[0]
tensor(0.7867)
colormap = {
    0: 'red',
    1: 'blue',
    2: 'orange',
}
c = [colormap[y] for y in y_true.numpy()]
pl.scatter(x_input[:,0], x_input[:, 1], c=c)
pl.title('True species classes');

import numpy as np

x_min, y_min = x_input.min(axis=0).values.numpy()
x_max, y_max = x_input.max(axis=0).values.numpy()

x_range = np.linspace(x_min, x_max, 100)
y_range = np.linspace(y_min, y_max, 100)
xx, yy = np.meshgrid(x_range, y_range)

X = np.concatenate([
    xx.reshape(100, 100, 1),
    yy.reshape(100, 100, 1)
], axis=-1)

X.shape
(100, 100, 2)
input = torch.tensor(X.reshape(-1, 2), dtype=torch.float32)
logits = model(input)
output = logits.argmax(axis=-1)
output.shape
torch.Size([10000])
coordinates = X.reshape(-1, 2)

for c in [0, 1, 2]:
    pl.scatter(
        coordinates[output==c, 0],
        coordinates[output==c, 1],
        c=colormap[c],
        alpha=0.5,
        edgecolor='none',
        s=10,
    )
    
c = [colormap[y] for y in y_true.numpy()]
pl.scatter(x_input[:,0], x_input[:, 1], c=c)
pl.title('True species classes');