Quiz 3 (27 Feb)

Published

February 27, 2023

  1. Many evaluation metrics decompose over the training examples. For example, the loss function for linear regression (proportional to mean squared error) is given as: \[L(\theta) = \frac{1}{2N}\sum_{i=1}^N (y_i - \sum_{d=1}^D \theta_d x_i^d)^2\] where \(N\) is the number of training examples, \(x_i\) is the \(i^{th}\) training example and \(y_i\) is the corresponding label. Mention any evaluation metric/loss function in machine learning that does not decompose over the training examples. [1 mark]
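For concreteness, the loss above can be written in code directly as an average of per-example terms. The short NumPy sketch below is purely illustrative: the function name, array shapes, and toy data are assumptions, not part of the question.

```python
import numpy as np

# Minimal sketch of the linear regression loss above, assuming X has shape
# (N, D), y has shape (N,), and theta has shape (D,).
def linear_regression_loss(X, y, theta):
    residuals = y - X @ theta             # y_i - sum_d theta_d * x_i^d, for every i
    per_example = 0.5 * residuals ** 2    # one loss term per training example
    return per_example.mean()             # averaging gives (1/2N) * sum of squared residuals

# Toy usage with arbitrary data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)
print(linear_regression_loss(X, y, np.zeros(3)))
```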

  2. We saw the figure showing SGD convergence.

  3. In an above question, we proved that the SGD estimator is an unbiased estimator. We have also previously discussed that we typically face a bias-variance tradeoff in our models, and in the recent assignment question we plotted the bias and variance for trees of different complexity. In this question, you have to express the mean squared error in terms of three quantities: the bias, the variance, and the irreducible noise.

Let us assume our data is generated from a 'true' function \(f(x)\) plus some additive zero-mean, normally distributed noise \(\epsilon \sim \mathcal{N}(0, \sigma^2)\):

\[y = f(x) + \epsilon\]

We can use some model such as a decision tree or linear regression to approximate \(f(x)\). We now consider a single training example \((x_0, y_0)\). We can define the mean squared error as:

\[MSE = \mathbb{E}[(y_0 - \hat{f}(x_0))^2]\]

where \(y_0\) is the true (noisy) label and \(\hat{f}(x_0)\) is the predicted label. The expectation is over all possible training sets that could have been generated, as well as over the noise in \(y_0\).
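As an aside, to make "over all possible training sets" concrete: if we could sample \(T\) independent training sets and refit our model on each one, obtaining predictions \(\hat{f}^{(1)}(x_0), \dots, \hat{f}^{(T)}(x_0)\), then such expectations could be approximated by empirical averages, for example \[\mathbb{E}[\hat{f}(x_0)] \approx \frac{1}{T}\sum_{t=1}^{T} \hat{f}^{(t)}(x_0)\] This is how quantities of this kind are usually estimated in practice (e.g., by resampling the training data); the symbols \(T\) and \(\hat{f}^{(t)}\) are introduced here only for illustration.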

To keep the notation simple, we write \(f(x_0)\) as \(f\), \(\hat{f}(x_0)\) as \(\hat{f}\), and \(y_0\) as \(y\). Thus, we can write \[MSE = \mathbb{E}[(y - \hat{f})^2]\] We also define the bias as the difference between the expected prediction and the true function, evaluated at \(x_0\): \[bias = \mathbb{E}[\hat{f}] - f\] or, \[bias = \overline{f} - f\] where \(\overline{f}\) is the average/expectation of the predicted function over all possible training sets.

We define the variance as: \[variance = VAR(\hat{f})\] or,

\[variance = \mathbb{E}[(\hat{f} - \overline{f})^2]\]

We define irreducible noise as the variance of the noise term \(\epsilon\):

\[irreducible = VAR(\epsilon)\] or,

\[irreducible = \sigma^2\]

Using the above definitions, show that the mean squared error can be written as:

\[MSE = bias^2 + variance + irreducible\]

[2 marks]
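Although the derivation itself is algebraic, the identity can also be sanity-checked numerically. The sketch below is only an illustration under assumed choices (a sinusoidal 'true' function, \(\sigma = 0.3\), a depth-3 decision tree from scikit-learn, and arbitrary sample sizes): it refits the model on many freshly sampled training sets and compares a Monte Carlo estimate of the MSE at a single point \(x_0\) with \(bias^2 + variance + \sigma^2\).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Monte Carlo sanity check of MSE = bias^2 + variance + sigma^2 at one point x0.
# The true function, noise level, model, and sample sizes are arbitrary choices.
rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)          # assumed 'true' function

sigma, n_train, n_sets, x0 = 0.3, 50, 2000, 0.35

preds = np.empty(n_sets)                  # f_hat(x0), one value per training set
for t in range(n_sets):
    x = rng.uniform(0.0, 1.0, size=n_train)
    y = f(x) + rng.normal(0.0, sigma, size=n_train)   # y = f(x) + eps
    tree = DecisionTreeRegressor(max_depth=3).fit(x.reshape(-1, 1), y)
    preds[t] = tree.predict(np.array([[x0]]))[0]

y0 = f(x0) + rng.normal(0.0, sigma, size=n_sets)      # fresh noisy labels at x0
mse = np.mean((y0 - preds) ** 2)          # estimate of E[(y - f_hat)^2]
bias2 = (preds.mean() - f(x0)) ** 2       # (f_bar - f)^2
variance = preds.var()                    # E[(f_hat - f_bar)^2]
print(mse, bias2 + variance + sigma ** 2)
```

The two printed numbers should agree up to Monte Carlo error, and the gap shrinks as the number of resampled training sets grows.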