DLHW2_Q4

Transformers in Computer Vision

Transformer architectures owe their origins to natural language processing (NLP), and they form the core of the current state-of-the-art models for most NLP applications. We will now see how to develop transformers for processing image data (and in fact, this line of deep learning research has been gaining a lot of attention in 2021). The Vision Transformer (ViT) introduced in this paper shows how standard transformer architectures can perform very well on image classification. The high-level idea is to extract patches from images, treat them as tokens, and pass them through a sequence of transformer blocks before adding a couple of dense classification layers at the very end.

Some caveats to keep in mind:

• ViT models are very cumbersome to train (since they involve a ton of parameters), so budget accordingly.
• ViT models are a bit hard to interpret (even more so than regular convnets).
• Finally, while in this notebook we will train a transformer from scratch, ViT models in practice are almost always pre-trained on some large dataset (such as ImageNet) before being fine-tuned on specific training datasets.

▾ Setup

As usual, we start with basic data loading and preprocessing.

!pip install einops

Requirement already satisfied: einops in /opt/conda/lib/python3.10/site-packages (0.7.0)

import torch
from torch import nn, einsum
import torch.nn.functional as F
from torch import optim

from einops import rearrange, repeat
from einops.layers.torch import Rearrange

import numpy as np
import torchvision
import time

torch.manual_seed(42)

DOWNLOAD_PATH = '/data/fashionmnist'
BATCH_SIZE_TRAIN = 100
BATCH_SIZE_TEST = 1000

transform_fashionmnist = torchvision.transforms.Compose([torchvision.transforms.ToTensor(),
                                                          torchvision.transforms.Normalize((0.5,), (0.5,))])

train_set = torchvision.datasets.FashionMNIST(DOWNLOAD_PATH, train=True, download=True,
                                              transform=transform_fashionmnist)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=BATCH_SIZE_TRAIN, shuffle=True)

test_set = torchvision.datasets.FashionMNIST(DOWNLOAD_PATH, train=False, download=True,
                                             transform=transform_fashionmnist)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=BATCH_SIZE_TEST, shuffle=True)

▾ The ViT Model

We will now set up the ViT model. There will be 3 parts to this model:

• A "patch embedding" layer that takes an image and tokenizes it. There is some amount of tensor algebra involved here (since we have to slice and dice the input appropriately), and the einops package is helpful; see the sketch below for how the patch extraction works. We will also add learnable positional encodings as parameters.
• A sequence of transformer blocks. This will be a smaller-scale replica of the originally proposed ViT, except that we will only use 6 blocks in our model (instead of 32 in the actual ViT).
• A (dense) classification layer at the end.

Further, each transformer block consists of the following components:

• A self-attention layer with H heads.
• A one-hidden-layer (dense) network to collapse the various heads. For the hidden neurons, the original ViT used something called a GELU activation function, which is a smooth approximation to the ReLU. For our example, regular ReLUs seem to work just fine. The original ViT also used Dropout, but we won't need it here.
• Layer normalization preceding each of the above operations.

Some care needs to be taken in making sure the various dimensions of the tensors are matched.
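To make the tokenization step concrete, here is a small sketch (not part of the original notebook) showing how an einops Rearrange turns a batch of 1x28x28 FashionMNIST images into 49 patch tokens of dimension 16 when using 4x4 patches; these are exactly the dimensions the patch embedding layer below will see.

# Sketch: patch extraction with einops (dummy data, same dimensions as the model below).
import torch
from einops.layers.torch import Rearrange

imgs = torch.randn(8, 1, 28, 28)   # a dummy batch: (batch, channels, height, width)
to_patches = Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=4, p2=4)
tokens = to_patches(imgs)
print(tokens.shape)                # torch.Size([8, 49, 16]): 49 patches, each flattened to 16 values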
def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout = 0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads = 4, dim_head = 64, dropout = 0.1):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.attend = nn.Softmax(dim = -1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias = False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        b, n, _, h = *x.shape, self.heads
        # project the input to queries, keys, and values for all heads at once
        qkv = self.to_qkv(x).chunk(3, dim = -1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h = h), qkv)

        # scaled dot-product attention scores between all pairs of tokens
        dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = self.attend(dots)

        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout = 0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads = heads, dim_head = dim_head, dropout = dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout = dropout))
            ]))

    def forward(self, x):
        # pre-norm residual blocks: self-attention followed by feed-forward
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x
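As a quick shape check (a sketch, not part of the original notebook), a dummy token sequence can be pushed through an Attention block; with the dimensions used below (dim=64, heads=4, dim_head=64), the block maps (batch, tokens, dim) back to (batch, tokens, dim).

# Sketch: self-attention preserves the (batch, tokens, dim) shape (uses the Attention class above).
x = torch.randn(2, 50, 64)                      # 2 sequences of 50 tokens, embedding dim 64
attn_block = Attention(dim=64, heads=4, dim_head=64)
print(attn_block(x).shape)                      # torch.Size([2, 50, 64])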
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim,
                 pool = 'cls', channels = 3, dim_head = 64, dropout = 0.1, emb_dropout = 0.1):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, \
            'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )

    def forward(self, img):
        # tokenize the image into patch embeddings, prepend the cls token, add positions
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape

        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)

        x = self.transformer(x)

        x = x.mean(dim = 1) if self.pool == 'mean' else x[:, 0]

        x = self.to_latent(x)
        return self.mlp_head(x)

model = ViT(image_size=28, patch_size=4, num_classes=10, channels=1,
            dim=64, depth=6, heads=4, mlp_dim=256)
optimizer = optim.Adam(model.parameters(), lr=0.002)

Let's see what the model looks like.

model

ViT(
  (to_patch_embedding): Sequential(
    (0): Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=4, p2=4)
    (1): Linear(in_features=16, out_features=64, bias=True)
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (transformer): Transformer(
    (layers): ModuleList(
      (0-5): 6 x ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (to_qkv): Linear(in_features=64, out_features=768, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=256, out_features=64, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=64, out_features=256, bias=True)
              (1): GELU(approximate='none')
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=256, out_features=64, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
    )
  )
  (to_latent): Identity()
  (mlp_head): Sequential(
    (0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (1): Linear(in_features=64, out_features=10, bias=True)
  )
)

This is it - 6 transformer blocks, followed by a linear classification layer. Let us quickly see how many trainable parameters are present in this model.

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(model))

598794

About half a million. Not too bad; the bigger NLP-type models have several tens of millions of parameters. But since we are training on Fashion-MNIST, this should be more than sufficient.
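Before training, it is worth confirming that the tensor dimensions line up end to end. The following quick check (a sketch, not part of the original notebook) pushes one batch from the test loader through the untrained model and prints the input and output shapes.

# Sketch: each 1x28x28 image becomes 49 patch tokens plus one cls token,
# and the head returns one logit per class.
data, target = next(iter(test_loader))   # one batch of 1000 test images
with torch.no_grad():
    logits = model(data)
print(data.shape, logits.shape)          # torch.Size([1000, 1, 28, 28]) torch.Size([1000, 10])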
▾ Training and testing

All done! We can now train the ViT model. The following again is boilerplate code.

def train_epoch(model, optimizer, data_loader, loss_history):
    total_samples = len(data_loader.dataset)
    model.train()

    for i, (data, target) in enumerate(data_loader):
        optimizer.zero_grad()
        output = F.log_softmax(model(data), dim=1)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        if i % 100 == 0:
            print('[' + '{:5}'.format(i * len(data)) + '/' + '{:5}'.format(total_samples) +
                  ' (' + '{:3.0f}'.format(100 * i / len(data_loader)) + '%)]  Loss: ' +
                  '{:6.4f}'.format(loss.item()))
            loss_history.append(loss.item())

def evaluate(model, data_loader, loss_history):
    model.eval()

    total_samples = len(data_loader.dataset)
    correct_samples = 0
    total_loss = 0

    with torch.no_grad():
        for data, target in data_loader:
            output = F.log_softmax(model(data), dim=1)
            loss = F.nll_loss(output, target, reduction='sum')
            _, pred = torch.max(output, dim=1)

            total_loss += loss.item()
            correct_samples += pred.eq(target).sum()

    avg_loss = total_loss / total_samples
    loss_history.append(avg_loss)
    print('\nAverage test loss: ' + '{:.4f}'.format(avg_loss) +
          '  Accuracy:' + '{:5}'.format(correct_samples) + '/' +
          '{:5}'.format(total_samples) + ' (' +
          '{:4.2f}'.format(100.0 * correct_samples / total_samples) + '%)\n')
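Note that the combination of F.log_softmax followed by F.nll_loss used in the boilerplate above is equivalent to applying F.cross_entropy directly to the raw logits; a quick check (a sketch, not part of the original notebook):

# Sketch: log_softmax + nll_loss on logits equals cross_entropy on logits.
logits = torch.randn(5, 10)             # 5 fake samples, 10 classes
targets = torch.randint(0, 10, (5,))    # fake integer labels
loss_a = F.nll_loss(F.log_softmax(logits, dim=1), targets)
loss_b = F.cross_entropy(logits, targets)
print(torch.allclose(loss_a, loss_b))   # True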
The following will take a bit of time (on CPU). Each epoch should take about 2 to 3 minutes. At the end of training, we should see a test accuracy of around 85%.

N_EPOCHS = 5

start_time = time.time()

train_loss_history, test_loss_history = [], []
for epoch in range(1, N_EPOCHS + 1):
    print('Epoch:', epoch)
    train_epoch(model, optimizer, train_loader, train_loss_history)
    evaluate(model, test_loader, test_loss_history)

print('Execution time:', '{:5.2f}'.format(time.time() - start_time), 'seconds')

Epoch: 1
[    0/60000 (  0%)]  Loss: 2.4418
[10000/60000 ( 17%)]  Loss: 0.8022
[20000/60000 ( 33%)]  Loss: 0.7919
[30000/60000 ( 50%)]  Loss: 0.6731
[40000/60000 ( 67%)]  Loss: 0.5388
[50000/60000 ( 83%)]  Loss: 0.8123

Average test loss: 0.5273  Accuracy: 8013/10000 (80.13%)

Epoch: 2
[    0/60000 (  0%)]  Loss: 0.4823
[10000/60000 ( 17%)]  Loss: 0.5416
[20000/60000 ( 33%)]  Loss: 0.5099
[30000/60000 ( 50%)]  Loss: 0.5960
[40000/60000 ( 67%)]  Loss: 0.3532
[50000/60000 ( 83%)]  Loss: 0.4261

Average test loss: 0.4427  Accuracy: 8358/10000 (83.58%)

Epoch: 3
[    0/60000 (  0%)]  Loss: 0.5236
[10000/60000 ( 17%)]  Loss: 0.4962
[20000/60000 ( 33%)]  Loss: 0.5010
[30000/60000 ( 50%)]  Loss: 0.5399
[40000/60000 ( 67%)]  Loss: 0.5829
[50000/60000 ( 83%)]  Loss: 0.5767

Average test loss: 0.4175  Accuracy: 8441/10000 (84.41%)

Epoch: 4
[    0/60000 (  0%)]  Loss: 0.3551
[10000/60000 ( 17%)]  Loss: 0.4487
[20000/60000 ( 33%)]  Loss: 0.5227
[30000/60000 ( 50%)]  Loss: 0.5057
[40000/60000 ( 67%)]  Loss: 0.2990
[50000/60000 ( 83%)]  Loss: 0.5170

Average test loss: 0.4178  Accuracy: 8473/10000 (84.73%)

Epoch: 5
[    0/60000 (  0%)]  Loss: 0.2978
[10000/60000 ( 17%)]  Loss: 0.3932
[20000/60000 ( 33%)]  Loss: 0.3207
[30000/60000 ( 50%)]  Loss: 0.3457
[40000/60000 ( 67%)]  Loss: 0.4030
[50000/60000 ( 83%)]  Loss: 0.2998

Average test loss: 0.4049  Accuracy: 8497/10000 (84.97%)

Execution time: 1051.84 seconds

evaluate(model, test_loader, test_loss_history)

Average test loss: 0.4049  Accuracy: 8497/10000 (84.97%)

import matplotlib.pyplot as plt
import numpy as np

# Load a few test images and labels
test_data, test_labels = next(iter(test_loader))
test_images = test_data[:3]    # Take the first 3 images
test_labels = test_labels[:3]  # Take the corresponding labels

with torch.no_grad():
    output = model(test_images)
    probs = F.softmax(output, dim=1)

# Define a colormap for different classes
colors = plt.cm.tab10(np.linspace(0, 1, 10))

fig, axes = plt.subplots(2, 3, figsize=(12, 9))

for i, (ax1, ax2) in enumerate(zip(axes[0], axes[1])):
    ax1.imshow(test_images[i][0], cmap='gray')
    ax1.set_title(f'Truth: {test_labels[i].item()}')
    ax1.axis('off')

    ax2.bar(range(10), probs[i].detach().numpy(), color=colors)
    ax2.set_title('Predicted Probabilities')
    ax2.set_xticks(range(10))

plt.tight_layout()
plt.show()
[Figure: three test images (Truth: 1, Truth: 7, Truth: 0) shown above bar charts of the model's predicted class probabilities.]
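The bar charts above index the classes by their integer labels. If class names are preferred, torchvision's FashionMNIST dataset exposes them through its classes attribute; a small sketch (not part of the original notebook), reusing the train_set and test_labels defined above:

# Sketch: map integer labels to FashionMNIST class names.
class_names = train_set.classes          # e.g. 'T-shirt/top', 'Trouser', ...
for lbl in test_labels:
    print(lbl.item(), '->', class_names[lbl.item()])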