Quiz 5 (6 April)

Published: April 6, 2023

  1. Assume that we want to design a two-layer neural network by composing two linear layers. That is, the output of the first layer becomes the input of the second layer. Why would such a naive composition not work? In other words, show that without using a non-linear activation we get a linear model. You can show this for a single datapoint \((1 \times D)\) instead of showing it for \(N\) datapoints. (1 mark)
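A minimal PyTorch sketch of the setup in Q1 (illustrative only; the layer sizes D, H, K and the seed are arbitrary placeholders): it composes two linear layers with no activation in between and checks numerically that this equals a single linear map, which is exactly what the question asks you to show algebraically.

import torch
import torch.nn as nn

torch.manual_seed(0)
D, H, K = 4, 3, 2                  # arbitrary placeholder sizes
x = torch.randn(1, D)              # a single (1 x D) datapoint

fc1 = nn.Linear(D, H)
fc2 = nn.Linear(H, K)

# naive composition: no non-linear activation between the two linear layers
y_composed = fc2(fc1(x))

# the same map as one linear layer: W = W2 W1, b = W2 b1 + b2
W = fc2.weight @ fc1.weight
b = fc2.weight @ fc1.bias + fc2.bias
y_single = x @ W.T + b

print(torch.allclose(y_composed, y_single, atol=1e-6))  # True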

  2. For this task, we are trying to use a neural network to predict a non-negative real number. However, in general, the output of a neural network is not constrained to be non-negative. We therefore try to solve this problem by learning \(f(\hat{y})\) as the output of the neural network, and we then apply \(f^{-1}(f(\hat{y}))\) to finally get a \(\hat{y}\) that is guaranteed to be non-negative. Can you give an example of what \(f\), and thus its inverse, can be? Please note: you are not allowed to use the ReLU activation. Further, we want the function \(f\) and its inverse to be continuous and differentiable everywhere. (1 mark)

  3. Draw a simple perceptron with two inputs and one output for the binary AND problem. Remember that the simple perceptron works on binary inputs and produces binary outputs. Show that your model works for all four inputs (1, 1), (1, 0), (0, 0), and (0, 1). (1 mark)

  4. Show that the following network using ReLU activations works for the XOR problem. (1 mark; 0.25 marks for each of the four possible inputs (0, 0), (0, 1), (1, 0), (1, 1))

  5. Assume we are solving a 3-class classification problem. We have two models (M1 and M2) from which we get two different output probability vectors: \(\hat{y}_1 = (0.2, 0.7, 0.1)\) and \(\hat{y}_2 = (0.2, 0.5, 0.3)\). The true class is \(y = 0\). Which of M1 or M2 has the higher cross-entropy loss? (1 mark)
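As a reminder for Q5, the cross-entropy loss for a single datapoint with true class \(y\) and predicted probability vector \(\hat{y}\) is \(\mathcal{L} = -\log \hat{y}_y\), i.e. the negative log of the probability the model assigns to the true class; plug the two probability vectors above into this formula to compare the models.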

  6. We have the following code:

import torch.nn as nn
import torch.nn.functional as F

class M_A(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 10)

    def forward(self, x):
        z1 = self.fc1(x)
        a1 = F.relu(z1)
        z2 = self.fc2(a1)
        a2 = F.relu(z2)
        z3 = self.fc3(a2)  # logits
        return z3

class M_B(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 10)

    def forward(self, x):
        # no non-linear activations between the linear layers
        z1 = self.fc1(x)
        z2 = self.fc2(z1)
        z3 = self.fc3(z2)
        return z3

Both models have the same number of parameters despite some differences between them. Explain why. (1 mark)
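If you want to verify the premise of Q6 empirically, here is a quick sketch (assuming the two classes above are defined and the imports are in place) that counts the learnable parameters of each model:

def num_params(model):
    # total number of learnable parameters across all layers
    return sum(p.numel() for p in model.parameters())

print(num_params(M_A()), num_params(M_B()))  # the two totals are identical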

  7. Let us learn a model to predict the next character given a context of the \(k\) previous characters. We have a total of \(v\) characters in our vocabulary (e.g., a, b, …, z, _). Let us use an embedding size of \(e\), i.e., each character in our vocabulary is represented as a vector of length \(e\). Now, let us use a simple MLP with one hidden layer of \(h\) hidden units. What is the total number of parameters in this model? (1 mark)
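For concreteness, here is one possible way to instantiate the model in Q7; this is only a sketch under the assumption that the \(k\) character embeddings are concatenated before the hidden layer, that the output layer produces \(v\) logits, and that all linear layers have biases. The values of k, v, e, h below are placeholders; the question asks for the count in terms of these symbols.

import torch.nn as nn
import torch.nn.functional as F

k, v, e, h = 5, 27, 10, 64            # placeholder values

class NextCharMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(v, e)     # v x e embedding table
        self.fc1 = nn.Linear(k * e, h)    # hidden layer over the concatenated context
        self.fc2 = nn.Linear(h, v)        # logits over the vocabulary

    def forward(self, idx):               # idx: (N, k) integer character indices
        x = self.emb(idx).view(idx.shape[0], -1)
        return self.fc2(F.relu(self.fc1(x)))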

  8. Assume the setup of Q7. We observe that our embedding has “overfitted”, i.e., the embedding vectors are very far from the origin. Can you suggest an additional loss term that can help us bring the embeddings closer to the origin and thus act as a regularisation? (1 mark)

  9. As discussed in the lecture, ChatGPT-like systems are similar to the above model. But we know that ChatGPT can produce many different outputs for the same input. Let us use a simplification and consider the next-character prediction problem. Our model outputs a probability distribution over the next character. However, if we just pick the most probable class/character, we should always get the same output given the same input. So how do ChatGPT-like systems generate different outputs? (1 mark)

  10. Assume an MLP with a 4-dimensional input, 2 real-valued outputs, a first hidden layer of 3 units, a second hidden layer of 2 units, and a final output layer of 2 units. What is the total number of parameters in this model? (1 mark)
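As with Q7, one way to check your count for Q10 is to build the network and let PyTorch count the parameters; this sketch assumes every layer is a standard nn.Linear with a bias (the activations contribute no parameters, so their choice does not affect the count):

import torch.nn as nn

# 4-dim input -> 3 hidden units -> 2 hidden units -> 2 real-valued outputs
mlp = nn.Sequential(
    nn.Linear(4, 3), nn.ReLU(),
    nn.Linear(3, 2), nn.ReLU(),
    nn.Linear(2, 2),
)
print(sum(p.numel() for p in mlp.parameters()))  # total parameter count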