# DLHW2_Q4: FashionMnist ViT.ipynb
## Transformers in Computer Vision

Transformer architectures trace their origins to natural language processing (NLP), and they form the core of the current state-of-the-art models for most NLP applications. We will now see how to develop transformers for processing image data (a line of deep learning research that has been gaining a lot of attention in 2021). The Vision Transformer (ViT), introduced in [this paper](https://arxiv.org/abs/2010.11929), shows that standard transformer architectures can perform very well on image classification. The high-level idea is to extract patches from images, treat them as tokens, and pass them through a sequence of transformer blocks before adding a couple of dense classification layers at the very end.

Some caveats to keep in mind:

- ViT models are very cumbersome to train (since they involve a ton of parameters), so budget accordingly.
- ViT models are a bit hard to interpret (even more so than regular convnets).
- Finally, while in this notebook we will train a transformer from scratch, ViT models in practice are almost always pre-trained on some large dataset (such as ImageNet) before being transferred onto specific training datasets.

## Setup

As usual, we start with basic data loading and preprocessing.

```python
!pip install einops
```

```
Requirement already satisfied: einops in /opt/conda/lib/python3.10/site-packages (0.7.0)
```

```python
import torch
from torch import nn, einsum
import torch.nn.functional as F
from torch import optim
from einops import rearrange, repeat
from einops.layers.torch import Rearrange
import numpy as np
import torchvision
import time

torch.manual_seed(42)

DOWNLOAD_PATH = '/data/fashionmnist'
BATCH_SIZE_TRAIN = 100
BATCH_SIZE_TEST = 1000

# Normalize pixel values from [0, 1] to [-1, 1].
transform_fashionmnist = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5,), (0.5,)),
])

train_set = torchvision.datasets.FashionMNIST(
    DOWNLOAD_PATH, train=True, download=True, transform=transform_fashionmnist)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=BATCH_SIZE_TRAIN, shuffle=True)

test_set = torchvision.datasets.FashionMNIST(
    DOWNLOAD_PATH, train=False, download=True, transform=transform_fashionmnist)
test_loader = torch.utils.data.DataLoader(
    test_set, batch_size=BATCH_SIZE_TEST, shuffle=True)
```

## The ViT Model

We will now set up the ViT model. There are three parts to this model:

- A "patch embedding" layer that takes an image and tokenizes it. There is some amount of tensor algebra involved here (since we have to slice and dice the input appropriately), and the einops package is helpful. We will also add learnable positional encodings as parameters. (A short shape-check sketch of this step appears right after this list.)
- A sequence of transformer blocks. This will be a smaller-scale replica of the originally proposed ViT, except that we use only 6 blocks in our model (instead of 32 in the actual ViT).
- A (dense) classification layer at the end.

Further, each transformer block consists of the following components:

- A self-attention layer with H heads.
- A one-hidden-layer (dense) network to collapse the various heads. For the hidden neurons, the original ViT used a GELU activation function, which is a smooth approximation to the ReLU. For our example, regular ReLUs would work just fine too. The original ViT also used dropout, but we won't strictly need it here.
- Layer normalization preceding each of the above operations.

Some care needs to be taken to make sure the various tensor dimensions are matched.
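Before building the full model, it can help to see what the patch-extraction step does to tensor shapes. The sketch below is an illustrative addition (the dummy tensor and the `to_patches` name are not from the notebook); it applies the same einops `Rearrange` pattern the model will use to one FashionMNIST-sized image: a 1×28×28 image cut into 4×4 patches yields 49 tokens of dimension 16.

```python
import torch
from einops.layers.torch import Rearrange

# A dummy batch: one grayscale 28x28 image, the same shape FashionMNIST produces.
dummy = torch.randn(1, 1, 28, 28)

# Split the image into non-overlapping 4x4 patches and flatten each patch:
# (batch, channels, 28, 28) -> (batch, 7*7 = 49 patches, 4*4*1 = 16 values each).
to_patches = Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=4, p2=4)

tokens = to_patches(dummy)
print(tokens.shape)  # torch.Size([1, 49, 16])
```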
```python
def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# classes

class PreNorm(nn.Module):
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class FeedForward(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class Attention(nn.Module):
    def __init__(self, dim, heads=4, dim_head=64, dropout=0.1):
        super().__init__()
        inner_dim = dim_head * heads
        project_out = not (heads == 1 and dim_head == dim)

        self.heads = heads
        self.scale = dim_head ** -0.5

        self.attend = nn.Softmax(dim=-1)
        self.to_qkv = nn.Linear(dim, inner_dim * 3, bias=False)

        self.to_out = nn.Sequential(
            nn.Linear(inner_dim, dim),
            nn.Dropout(dropout)
        ) if project_out else nn.Identity()

    def forward(self, x):
        b, n, _, h = *x.shape, self.heads
        qkv = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> b h n d', h=h), qkv)

        # Scaled dot-product attention over the patch tokens.
        dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
        attn = self.attend(dots)

        out = einsum('b h i j, b h j d -> b h i d', attn, v)
        out = rearrange(out, 'b h n d -> b n (h d)')
        return self.to_out(out)

class Transformer(nn.Module):
    def __init__(self, dim, depth, heads, dim_head, mlp_dim, dropout=0.):
        super().__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(nn.ModuleList([
                PreNorm(dim, Attention(dim, heads=heads, dim_head=dim_head, dropout=dropout)),
                PreNorm(dim, FeedForward(dim, mlp_dim, dropout=dropout))
            ]))

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x

class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim,
                 pool='cls', channels=3, dim_head=64, dropout=0.1, emb_dropout=0.1):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, \
            'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_height, p2=patch_width),
            nn.Linear(patch_dim, dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, num_classes)
        )

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape

        cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)

        x = self.transformer(x)

        x = x.mean(dim=1) if self.pool == 'mean' else x[:, 0]
        x = self.to_latent(x)
        return self.mlp_head(x)

model = ViT(image_size=28, patch_size=4, num_classes=10, channels=1,
            dim=64, depth=6, heads=4, mlp_dim=256)
optimizer = optim.Adam(model.parameters(), lr=0.002)
```

Let's see what the model looks like.

```python
model
```

```
ViT(
  (to_patch_embedding): Sequential(
    (0): Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=4, p2=4)
    (1): Linear(in_features=16, out_features=64, bias=True)
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (transformer): Transformer(
    (layers): ModuleList(
      (0-5): 6 x ModuleList(
        (0): PreNorm(
          (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (fn): Attention(
            (attend): Softmax(dim=-1)
            (to_qkv): Linear(in_features=64, out_features=768, bias=False)
            (to_out): Sequential(
              (0): Linear(in_features=256, out_features=64, bias=True)
              (1): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (1): PreNorm(
          (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
          (fn): FeedForward(
            (net): Sequential(
              (0): Linear(in_features=64, out_features=256, bias=True)
              (1): GELU(approximate='none')
              (2): Dropout(p=0.1, inplace=False)
              (3): Linear(in_features=256, out_features=64, bias=True)
              (4): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
    )
  )
  (to_latent): Identity()
  (mlp_head): Sequential(
    (0): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
    (1): Linear(in_features=64, out_features=10, bias=True)
  )
)
```

This is it: 6 transformer blocks, followed by a linear classification layer. Let us quickly see how many trainable parameters are present in this model.

```python
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_parameters(model))
```
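As a quick sanity check (this cell is an addition, not part of the original notebook), we can also push a small dummy batch through the model and confirm the output shape: one row of 10 class logits per image.

```python
# Two random 1x28x28 "images", matching the FashionMNIST input shape.
with torch.no_grad():
    logits = model(torch.randn(2, 1, 28, 28))
print(logits.shape)  # expected: torch.Size([2, 10])
```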
```
598794
```

About half a million parameters. Not too bad; the bigger NLP-type models have several tens of millions of parameters. But since we are training on FashionMNIST, this should be more than sufficient.

## Training and testing

All done! We can now train the ViT model. The following is again boilerplate code.

```python
def train_epoch(model, optimizer, data_loader, loss_history):
    total_samples = len(data_loader.dataset)
    model.train()

    for i, (data, target) in enumerate(data_loader):
        optimizer.zero_grad()
        output = F.log_softmax(model(data), dim=1)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        if i % 100 == 0:
            print('[' + '{:5}'.format(i * len(data)) + '/' + '{:5}'.format(total_samples) +
                  ' (' + '{:3.0f}'.format(100 * i / len(data_loader)) + '%)]  Loss: ' +
                  '{:6.4f}'.format(loss.item()))
            loss_history.append(loss.item())

def evaluate(model, data_loader, loss_history):
    model.eval()

    total_samples = len(data_loader.dataset)
    correct_samples = 0
    total_loss = 0

    with torch.no_grad():
        for data, target in data_loader:
            output = F.log_softmax(model(data), dim=1)
            loss = F.nll_loss(output, target, reduction='sum')
            _, pred = torch.max(output, dim=1)

            total_loss += loss.item()
            correct_samples += pred.eq(target).sum()

    avg_loss = total_loss / total_samples
    loss_history.append(avg_loss)
    print('\nAverage test loss: ' + '{:.4f}'.format(avg_loss) +
          '  Accuracy:' + '{:5}'.format(correct_samples) + '/' +
          '{:5}'.format(total_samples) + ' (' +
          '{:4.2f}'.format(100.0 * correct_samples / total_samples) + '%)\n')
```

The following will take a bit of time (on CPU). Each epoch should take about 2 to 3 minutes. At the end of training, we should see test accuracy of around 85%.
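The notebook trains on the CPU. If a GPU happens to be available, the standard PyTorch device pattern would speed things up considerably; the snippet below is only an optional sketch (not part of the original notebook), and the per-batch `.to(device)` calls would need to be added inside `train_epoch` and `evaluate`.

```python
# Optional: run on a GPU if one is available (the notebook above runs on CPU).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

# Inside train_epoch and evaluate, each batch would then also need:
#   data, target = data.to(device), target.to(device)
```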
```python
N_EPOCHS = 5

start_time = time.time()
train_loss_history, test_loss_history = [], []
for epoch in range(1, N_EPOCHS + 1):
    print('Epoch:', epoch)
    train_epoch(model, optimizer, train_loader, train_loss_history)
    evaluate(model, test_loader, test_loss_history)

print('Execution time:', '{:5.2f}'.format(time.time() - start_time), 'seconds')
```

```
Epoch: 1
[    0/60000 (  0%)]  Loss: 2.4418
[10000/60000 ( 17%)]  Loss: 0.8022
[20000/60000 ( 33%)]  Loss: 0.7919
[30000/60000 ( 50%)]  Loss: 0.6731
[40000/60000 ( 67%)]  Loss: 0.5388
[50000/60000 ( 83%)]  Loss: 0.8123

Average test loss: 0.5273  Accuracy: 8013/10000 (80.13%)

Epoch: 2
[    0/60000 (  0%)]  Loss: 0.4823
[10000/60000 ( 17%)]  Loss: 0.5416
[20000/60000 ( 33%)]  Loss: 0.5099
[30000/60000 ( 50%)]  Loss: 0.5960
[40000/60000 ( 67%)]  Loss: 0.3532
[50000/60000 ( 83%)]  Loss: 0.4261

Average test loss: 0.4427  Accuracy: 8358/10000 (83.58%)

Epoch: 3
[    0/60000 (  0%)]  Loss: 0.5236
[10000/60000 ( 17%)]  Loss: 0.4962
[20000/60000 ( 33%)]  Loss: 0.5010
[30000/60000 ( 50%)]  Loss: 0.5399
[40000/60000 ( 67%)]  Loss: 0.5829
[50000/60000 ( 83%)]  Loss: 0.5767

Average test loss: 0.4175  Accuracy: 8441/10000 (84.41%)

Epoch: 4
[    0/60000 (  0%)]  Loss: 0.3551
[10000/60000 ( 17%)]  Loss: 0.4487
[20000/60000 ( 33%)]  Loss: 0.5227
[30000/60000 ( 50%)]  Loss: 0.5057
[40000/60000 ( 67%)]  Loss: 0.2990
[50000/60000 ( 83%)]  Loss: 0.5170

Average test loss: 0.4178  Accuracy: 8473/10000 (84.73%)

Epoch: 5
[    0/60000 (  0%)]  Loss: 0.2978
[10000/60000 ( 17%)]  Loss: 0.3932
[20000/60000 ( 33%)]  Loss: 0.3207
[30000/60000 ( 50%)]  Loss: 0.3457
[40000/60000 ( 67%)]  Loss: 0.4030
[50000/60000 ( 83%)]  Loss: 0.2998

Average test loss: 0.4049  Accuracy: 8497/10000 (84.97%)

Execution time: 1051.84 seconds
```

```python
evaluate(model, test_loader, test_loss_history)
```

```
Average test loss: 0.4049  Accuracy: 8497/10000 (84.97%)
```

```python
import matplotlib.pyplot as plt
import numpy as np

# Load a few test images and labels
test_data, test_labels = next(iter(test_loader))
test_images = test_data[:3]    # Take the first 3 images
test_labels = test_labels[:3]  # Take the corresponding labels

with torch.no_grad():
    output = model(test_images)
    probs = F.softmax(output, dim=1)

# Define a colormap for different classes
colors = plt.cm.tab10(np.linspace(0, 1, 10))

fig, axes = plt.subplots(2, 3, figsize=(12, 9))
for i, (ax1, ax2) in enumerate(zip(axes[0], axes[1])):
    ax1.imshow(test_images[i][0], cmap='gray')
    ax1.set_title(f'Truth: {test_labels[i].item()}')
    ax1.axis('off')

    ax2.bar(range(10), probs[i].detach().numpy(), color=colors)
    ax2.set_title('Predicted Probabilities')
    ax2.set_xticks(range(10))

plt.tight_layout()
plt.show()
```
[Figure: three FashionMNIST test images, titled "Truth: 1", "Truth: 7", and "Truth: 0", each shown above a bar chart of the model's predicted class probabilities.]
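The figure titles above show the numeric FashionMNIST labels. For human-readable titles, torchvision's FashionMNIST dataset exposes the class names via its `classes` attribute; the snippet below is an illustrative addition showing the mapping (label 1 is 'Trouser', for instance).

```python
# The FashionMNIST class names, indexed by label 0..9.
print(test_set.classes)
# ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
#  'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']

# For example, the image titles in the plot could become:
# ax1.set_title(f'Truth: {test_set.classes[test_labels[i].item()]}')
```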