Nipun Batra, IIT Gandhinagar
$z = f(x,y) = x^{2} + y^{2}$
The gradient points in the direction of steepest ascent, i.e., the direction of maximum increase of $f(x, y)$
$$\nabla f(x, y) = \begin{bmatrix} \frac{\partial f(x, y)}{\partial x} \\ \frac{\partial f(x, y)}{\partial y} \end{bmatrix} = \begin{bmatrix} 2x \\ 2y \end{bmatrix}$$
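A quick numerical check of this gradient (a minimal NumPy sketch; `f` and `grad_f` are names chosen here for illustration):

```python
import numpy as np

def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    # analytical gradient of f(x, y) = x^2 + y^2
    return np.array([2 * x, 2 * y])

# central finite-difference check at an arbitrary point
x0, y0, h = 1.5, -0.5, 1e-6
numerical = np.array([
    (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h),
    (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h),
])
print(grad_f(x0, y0))   # [ 3. -1.]
print(numerical)        # approximately [ 3. -1.]
```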
The Taylor series approximates a function $f(x)$ around a point $x_0$ using a polynomial:
$$f(x) = f(x_0) + \frac{f'(x_0)}{1!}(x-x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2 + \ldots$$
$$f(\vec{x}) = f(\vec{x_0}) + \nabla f(\vec{x_0})^T(\vec{x}-\vec{x_0}) + \frac{1}{2}(\vec{x}-\vec{x_0})^T\nabla^2 f(\vec{x_0})(\vec{x}-\vec{x_0}) + \ldots$$
where $\nabla^2 f(\vec{x_0})$ is the Hessian matrix and $\nabla f(\vec{x_0})$ is the gradient vector
For small $\Delta x$, ignoring higher order terms:
$$f(x_0 + \Delta x) \approx f(x_0) + f'(x_0)\Delta x$$
In vector form: $$f(\vec{x_0} + \Delta \vec{x}) \approx f(\vec{x_0}) + \nabla f(\vec{x_0})^T\Delta \vec{x}$$
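A minimal sketch (NumPy; the point $\vec{x_0}$ and step $\Delta\vec{x}$ are arbitrary choices) checking this first-order approximation for $f(x, y) = x^2 + y^2$:

```python
import numpy as np

def f(v):
    return v[0]**2 + v[1]**2

def grad_f(v):
    return np.array([2 * v[0], 2 * v[1]])

x0 = np.array([1.0, 2.0])
dx = np.array([0.01, -0.02])              # small step

exact = f(x0 + dx)
first_order = f(x0) + grad_f(x0) @ dx     # f(x0) + grad^T dx
print(exact, first_order)                 # 4.9405 vs 4.94 -- close for small dx
```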
$$\vec{x_1} = \vec{x_0} - \alpha \nabla f(\vec{x_0})$$
Effect of the step-size $\alpha$ (see the sketch below):
- Too small: converges slowly
- Large: converges quickly, but might overshoot
- Too large: diverges
- Just right: converges efficiently
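A minimal sketch (NumPy; the step sizes and starting point are illustrative) of the update $\vec{x_1} = \vec{x_0} - \alpha \nabla f(\vec{x_0})$ on $f(x, y) = x^2 + y^2$, showing how $\alpha$ produces each behaviour:

```python
import numpy as np

def grad_f(v):
    return 2 * v                          # gradient of f(x, y) = x^2 + y^2

def gradient_descent(x0, alpha, steps=20):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - alpha * grad_f(x)         # x_{k+1} = x_k - alpha * grad f(x_k)
    return x

start = [4.0, 3.0]
for alpha in [0.01, 0.6, 1.1, 0.1]:       # slow, overshoots, diverges, just right
    print(alpha, gradient_descent(start, alpha))
```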
Learn $y = \theta_0 + \theta_1 x$ using gradient descent:
- Initial: $(\theta_0, \theta_1) = (4,0)$
- Step-size: $\alpha = 0.1$
- Dataset:
| x | y |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
MSE = $\frac{\epsilon_1^2 + \epsilon_2^2 + \epsilon_3^2}{3}$, where $\epsilon_i = y_i - \theta_0 - \theta_1 x_i$
$$\frac{\partial MSE}{\partial \theta_0} = \frac{2\sum_i (y_i - \theta_0 - \theta_1x_i)(-1)}{N} = \frac{2\sum_i \epsilon_i(-1)}{N}$$
$$\frac{\partial MSE}{\partial \theta_1} = \frac{2\sum_i (y_i - \theta_0 - \theta_1x_i)(-x_i)}{N} = \frac{2\sum_i \epsilon_i(-x_i)}{N}$$
$$\theta_0 = \theta_0 - \alpha\frac{\partial MSE}{\partial \theta_0}$$ $$\theta_1 = \theta_1 - \alpha\frac{\partial MSE}{\partial \theta_1}$$
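A minimal sketch (NumPy) of these updates with the stated initialisation, step size, and dataset:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
theta0, theta1 = 4.0, 0.0                    # initial (theta_0, theta_1)
alpha, N = 0.1, len(x)

for it in range(2):                          # first couple of iterations
    eps = y - theta0 - theta1 * x            # residuals epsilon_i
    d_theta0 = (2 / N) * np.sum(eps * -1)    # dMSE/dtheta_0
    d_theta1 = (2 / N) * np.sum(eps * -x)    # dMSE/dtheta_1
    theta0 -= alpha * d_theta0
    theta1 -= alpha * d_theta1
    print(it, round(theta0, 3), round(theta1, 3))
# iteration 0: theta0 = 3.6, theta1 ~ -0.667
```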
For dataset $\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$:
$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\operatorname{loss}(f(x_i, \theta), y_i)$$
True gradient: $$\nabla L = \frac{1}{N} \sum_{i=1}^N \nabla \operatorname{loss}(f(x_i), y_i)$$
For single sample $(x, y)$: $$\nabla \tilde{L} = \nabla \operatorname{loss}(f(x), y)$$
$$\mathbb{E}[\nabla \tilde{L}] = \sum_{i=1}^N \frac{1}{N} \nabla \operatorname{loss}(f(x_i), y_i) = \nabla L$$
Therefore: SGD gradient is an unbiased estimator of the true gradient!
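A quick numerical illustration (NumPy; the linear model, data, and squared loss are just assumptions for this sketch): averaging per-sample gradients over many uniform random draws approaches the full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
theta = np.zeros(3)

def sample_grad(i):
    # gradient of the squared loss for a single sample i
    return 2 * (X[i] @ theta - y[i]) * X[i]

full_grad = 2 * X.T @ (X @ theta - y) / len(y)            # true gradient
avg_sgd = np.mean([sample_grad(rng.integers(len(y)))      # Monte Carlo estimate of
                   for _ in range(20000)], axis=0)        # E[single-sample gradient]
print(full_grad)
print(avg_sgd)    # close to full_grad: the SGD gradient is unbiased
```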
For $X \in \mathbb{R}^{N \times D}$:
- $X^TX$: $\mathcal{O}(D^2N)$
- Matrix inversion: $\mathcal{O}(D^3)$
- $X^Ty$: $\mathcal{O}(DN)$
- Final multiplication: $\mathcal{O}(D^2)$
Total complexity: $\mathcal{O}(D^2N + D^3)$
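For comparison, a sketch (NumPy, synthetic data) of the closed-form normal-equation solve whose cost breaks down as above:

```python
import numpy as np

N, D = 10_000, 50
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(scale=0.1, size=N)

# theta = (X^T X)^{-1} X^T y
#   X^T X          : O(D^2 N)
#   solve / invert : O(D^3)
#   X^T y          : O(D N)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta[:5])
```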
Vectorized update: $\theta = \theta - \alpha X^T(X\theta - y)$ (the constant factor $\frac{2}{N}$ from the MSE gradient is absorbed into $\alpha$)
Efficient form: $\theta = \theta - \alpha X^TX\theta + \alpha X^Ty$
Complexity: $\mathcal{O}(ND)$ per iteration, i.e. $\mathcal{O}(NDt)$ for $t$ iterations
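A sketch of the vectorised update on synthetic data (NumPy; the sizes and step size are illustrative), with $\mathcal{O}(ND)$ work per iteration:

```python
import numpy as np

N, D, t = 10_000, 50, 200
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D)
theta = np.zeros(D)
alpha = 1e-4                              # small enough to converge here

for _ in range(t):                        # each iteration: O(N D)
    theta = theta - alpha * X.T @ (X @ theta - y)
print(theta[:5])                          # approaches the true weights
```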
Gradient descent: following the steepest path downhill!