import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, random_split
from torchsummaryX import summary
import numpy as np
import time
from tqdm.notebook import trange, tqdm
import my
Residual Networks ✅
1 Theoretical background
Let’s look at how to improve the performance of a neural network by adding more layers.
Consider a function that we are learning: $f^*$.
Traditional wisdom is that we should build networks with an ever increasing number of layers to learn $f^*$.
We can see the learned network as an approximation $f_N \approx f^*$, where $N$ is the number of layers.
Suppose we wish to improve the approximation with an additional layer $L$: $f_{N+1} = L \circ f_N$.
How can we guarantee that $f_{N+1}$ approximates $f^*$ at least as well as $f_N$ does?
If we design $L$ so that it can easily learn the identity function, then $f_{N+1}$ can always fall back to $f_N$, and the extra layer cannot hurt.
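In other words, if $\mathcal{F}_N$ denotes the set of functions the $N$-layer network can represent and $\mathcal{L}$ the set of functions the added layer can realize (notation introduced here only for illustration), the guarantee can be stated as:

$\mathrm{id} \in \mathcal{L} \;\Longrightarrow\; \mathcal{F}_N = \{\mathrm{id} \circ f : f \in \mathcal{F}_N\} \subseteq \{L \circ f : L \in \mathcal{L},\ f \in \mathcal{F}_N\} = \mathcal{F}_{N+1}.$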
2 What is a residual layer?
A residual layer $L$ is built around an inner block $g$ (for example an MLP or a Conv2D block).
The residual layer adds a skip connection from the input to the output: $L(x) = g(x) + x$.
Its schematic is as follows: the input $x$ is fed through $g$, and a skip connection carries $x$ around the block; the two paths are summed to produce the output $g(x) + x$.
We note that $g$ must map its input back to a vector of the same dimension, so that the sum $g(x) + x$ is well defined.
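As a sketch of this idea in PyTorch (the generic wrapper below is illustrative only; the concrete MLP version is built in Section 4):

class Residual(nn.Module):
    """Wraps an arbitrary block g with a skip connection: forward(x) = g(x) + x."""
    def __init__(self, g):
        super().__init__()
        self.g = g

    def forward(self, x):
        # g must preserve the input dimension so the sum is well defined.
        return self.g(x) + x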
3 Why do residual layers help?
Claim: $g$ can learn the zero function $g(x) = 0$ by default. We just need to initialize the weights to 0; then the MLP or Conv2D block would just produce zero vectors as its output.
Claim: $L$ can learn the identity function by default.
- By definition, $L(x) = g(x) + x$.
- By default, $g(x) = 0$; thus, by default, $L(x) = x$.
Claim: By default, $f_{N+1} = f_N$.
- By default, $L$ is the identity function.
- Thus, by default, $f_{N+1} = L \circ f_N = f_N$.
Therefore, we can see that by adding the residual layer, we cannot do worse than the previous network.
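As a quick numerical check of these claims, the sketch below zero-initializes a small MLP block $g$ and verifies that it computes the zero function, so the corresponding residual layer computes the identity (the dimensions 10 and 100 are chosen arbitrarily for illustration):

# Zero-initialize a small MLP block g.
g = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 10))
for p in g.parameters():
    nn.init.zeros_(p)

x = torch.randn(4, 10)
print(torch.allclose(g(x), torch.zeros_like(x)))  # True: g(x) = 0
print(torch.allclose(g(x) + x, x))                # True: L(x) = g(x) + x = x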
4 Building the MLP residual layer.
Let’s build an MLP residual layer.
device = torch.device('cuda:0' \
                      if torch.cuda.is_available() \
                      else 'cpu')
device
device(type='cuda', index=0)
class Residual_MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )

    def forward(self, x):
        # Skip connection: the output has the same dimension as the input.
        return self.mlp(x) + x
Let’s try out the MLP residual layer in action.
sample = torch.randn(1, 10)
L = Residual_MLP(10, 100)
print("x.shape", sample.shape)
print("L(x).shape", L(sample).shape)
x.shape torch.Size([1, 10])
L(x).shape torch.Size([1, 10])
5 Residual layer in action
Let’s start with a simple classifier.
N1 = nn.Sequential(
    nn.Flatten(),
    nn.LazyLinear(10),
)
def train(model, dataset, epochs):
    optimizer = torch.optim.Adam(model.parameters())
    loss = nn.CrossEntropyLoss()
    dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
    model = model.to(device)
    for epoch in trange(epochs):
        start = time.time()
        for (xs, targets) in tqdm(dataloader):
            xs, targets = xs.to(device), targets.to(device)
            ys = model(xs)
            optimizer.zero_grad()
            l = loss(ys, targets)
            l.backward()
            optimizer.step()
        with torch.no_grad():
            # Reported accuracy is computed on the last mini-batch of the epoch.
            acc = (ys.argmax(axis=1) == targets).sum() / xs.shape[0]
        duration = time.time() - start
        print("[%d] acc = %.2f loss = %.4f in %.2f seconds." % (epoch, acc.item(), l.item(), duration))
train(N1, mnist, 1)
[0] acc = 0.83 loss = 0.4740 in 6.88 seconds.
Now, we can add a residual layer.
N2 = nn.Sequential(
    N1,
    Residual_MLP(10, 5),
)
train(N2, mnist, 1)
[0] acc = 0.94 loss = 0.3469 in 7.61 seconds.
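Note that N2 contains N1 as a submodule, so training N2 also continues to update N1’s weights (part of the accuracy gain reflects that extra training). A quick check, for illustration, confirms the parameter sharing:

# N2 wraps N1 by reference, so they share the same parameter tensors.
shared = set(id(p) for p in N1.parameters()) <= set(id(p) for p in N2.parameters())
print(shared)  # True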
Let’s add one more residual layer.
N3 = nn.Sequential(
    N2,
    Residual_MLP(10, 5),
)
train(N3, mnist, 1)
[0] acc = 0.95 loss = 0.2087 in 7.37 seconds.
N4 = nn.Sequential(
    N3,
    Residual_MLP(10, 5),
)
train(N4, mnist, 1)
[0] acc = 0.96 loss = 0.1728 in 7.72 seconds.
So, we can see that additional residual layers incrementally improve the network’s performance, at the expense of the memory required by the larger models.
The added residual layers bring the learned function $f_N$ closer to the true function $f^*$.
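To quantify the memory cost, one can compare the parameter counts of the models (a sketch; the LazyLinear in N1 has already been materialized by the training runs above, so the counts are well defined):

# Compare the number of parameters of the successively larger models.
for name, model in [("N1", N1), ("N2", N2), ("N3", N3), ("N4", N4)]:
    n_params = sum(p.numel() for p in model.parameters())
    print(name, n_params)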