A quadratic form in two variables is a polynomial in two variables in which every term has total degree 2: $f(x,y) = ax^2 + bxy + cy^2$. We can associate such a form with the symmetric matrix $A = [[a, \frac12 b], [\frac12 b, c]]$, since $f(x,y) = v^\top A v$ where $v = (x,y)$ is treated as a column vector. This generalises naturally: an $n$-variable quadratic form corresponds to a unique symmetric $n\times n$ matrix.
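A quick numerical sanity check of this correspondence, assuming numpy is available (the coefficients below are arbitrary sample values):

```python
import numpy as np

# Check that the symmetric matrix reproduces the quadratic form
# f(x, y) = a*x^2 + b*x*y + c*y^2 at a few random points.
a, b, c = 2.0, -3.0, 5.0
A = np.array([[a, b / 2],
              [b / 2, c]])

rng = np.random.default_rng(0)
for x, y in rng.normal(size=(5, 2)):
    v = np.array([x, y])
    direct = a * x**2 + b * x * y + c * y**2
    via_matrix = v @ A @ v          # v^T A v
    assert np.isclose(direct, via_matrix)
```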
$n$-dimensional quadratic forms are useful because they act as quadratic approximation terms in $n$ dimensions. To understand what this means, consider the Taylor expansion of $f$ around $x$: $f(x+c) = f(x) + f'(x)c + \frac12 f''(x)c^2 + \dots$ Each term here captures a property of the local behaviour of $f$. The $f(x)$ term is a constant. The $f'(x)c$ term is a linear approximation without a constant term, since the constant has already been included. The $\frac12 f''(x)c^2$ term is a quadratic approximation without a constant or linear term, since those have already been captured. And so on. The constant term captures only the constant behaviour; the linear term captures only the linear behaviour, i.e. the behaviour of the slope; the quadratic term captures only the quadratic behaviour, i.e. the concavity.
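A minimal numerical sketch of the one-variable expansion, using $f = \exp$ as an arbitrary smooth example (not anything special from the discussion above):

```python
import numpy as np

# Compare f(x + c) against its second-order Taylor expansion at x.
# For f = exp, all derivatives are exp as well.
f, df, d2f = np.exp, np.exp, np.exp
x = 1.0
for c in (0.1, 0.01, 0.001):
    taylor2 = f(x) + df(x) * c + 0.5 * d2f(x) * c**2
    print(c, abs(f(x + c) - taylor2))   # error shrinks roughly like c^3
```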
How to visualise 2-variable quadratic forms? They are paraboloids, which can be elliptic (imagine a parabola rotated about its vertical axis, though not necessarily circularly symmetric), hyperbolic (a saddle), or degenerate into a parabolic cylinder (think of a half-pipe).
We are given a function of two variables $f(x,y)$ and wish to approximate its value at $(x+a, y+b)$. The approximation is
$$f(x+a,y+b) = f(x,y) + f_x(x,y)a + f_y(x,y)b + \tfrac12 f_{xx}(x,y)a^2 + f_{xy}(x,y)ab + \tfrac12 f_{yy}(x,y)b^2 + \dots$$
This is intuitive. We capture the constant behaviour first. Then for the linear behaviour, we capture the x and y behaviours separately and combine them linearly to describe the approximating plane. For the quadratic behaviour, we can't just capture x and y and combine them linearly: a quadratic surface is more complex than a plane and can't be described by two independent directions alone. Instead we capture the behaviour in the x and y directions separately, plus a cross term for how x and y interact.
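Here is a small numerical check of the two-variable expansion, assuming numpy; $f(x,y) = \sin(x)e^y$ is an arbitrary smooth example and its partial derivatives are written out by hand:

```python
import numpy as np

# Second-order Taylor approximation of a two-variable function.
def f(x, y):    return np.sin(x) * np.exp(y)
def fx(x, y):   return np.cos(x) * np.exp(y)
def fy(x, y):   return np.sin(x) * np.exp(y)
def fxx(x, y):  return -np.sin(x) * np.exp(y)
def fxy(x, y):  return np.cos(x) * np.exp(y)
def fyy(x, y):  return np.sin(x) * np.exp(y)

x, y = 0.5, -0.3
for a, b in [(0.1, 0.1), (0.01, -0.02)]:
    approx = (f(x, y) + fx(x, y) * a + fy(x, y) * b
              + 0.5 * fxx(x, y) * a**2 + fxy(x, y) * a * b
              + 0.5 * fyy(x, y) * b**2)
    print(abs(f(x + a, y + b) - approx))   # small: third order in (a, b)
```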
Rewriting in vector form, we approximate $f(v+c)$ around $v$ by
$$f(v+c) = f(v) + \nabla f(v)\cdot c + \tfrac12 c^\top H_f(v)c + \dots$$
where the Hessian is $H_f = [[f_{xx}, f_{xy}], [f_{yx}, f_{yy}]]$, evaluated at $v$. The Hessian therefore represents the component of our approximation corresponding to "concavity information", in the same way that the gradient captures "slope information". Neat.
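To see that $\tfrac12 c^\top H_f c$ really does package the three quadratic terms (including the cross term) from the component-wise expansion, here is a quick check with arbitrary sample values standing in for the evaluated partials:

```python
import numpy as np

# 1/2 c^T H c should equal 1/2 fxx a^2 + fxy a b + 1/2 fyy b^2.
fxx, fxy, fyy = 1.7, -0.4, 2.2      # arbitrary sample values
H = np.array([[fxx, fxy],
              [fxy, fyy]])
a, b = 0.3, -0.5
c = np.array([a, b])
component_form = 0.5 * fxx * a**2 + fxy * a * b + 0.5 * fyy * b**2
matrix_form = 0.5 * c @ H @ c
assert np.isclose(component_form, matrix_form)
```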
Since mixed partials commute for $f$ satisfying basic differentiability conditions (Clairaut's theorem), the Hessian is symmetric. Suppose $v$ is a critical point (so $\nabla f(v) = 0$) and the evaluated Hessian matrix is $[[1,-2], [-2,1]]$. What information about the concavity can we read off this matrix? $f_{xx}(v)$ and $f_{yy}(v)$ are both 1, so the double derivatives in the x and y directions are both positive. Is this enough to conclude that we are at a local minimum, and that whatever direction we head in will take us "upwards" (i.e. bending away from the tangent plane)?
Real symmetric matrices have real eigenvalues and an orthogonal basis of eigenvectors. For our example Hessian, $[1,1]$ is an eigenvector with eigenvalue $-1$ and $[-1,1]$ is an eigenvector with eigenvalue $3$, so we can diagonalise the Hessian using them. What this tells us is that if we head in the direction of the first eigenvector, the double derivative in that direction is $-1$, and if we head in the direction of the second eigenvector it is $3$. So although the x and y directions both have positive double derivatives, we have found a direction, $[1,1]$, in which the concavity is negative. We are at a saddle point, not a minimum.
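The same calculation, done numerically (assuming numpy; eigenvectors come back normalised and possibly with flipped signs):

```python
import numpy as np

# Eigen-decomposition of the example Hessian [[1, -2], [-2, 1]].
H = np.array([[1.0, -2.0],
              [-2.0, 1.0]])
eigvals, eigvecs = np.linalg.eigh(H)    # symmetric matrix, so use eigh
print(eigvals)                          # [-1.  3.]

for lam, u in zip(eigvals, eigvecs.T):
    # The second derivative in the unit direction u is u^T H u,
    # which equals the eigenvalue; u is ~[1,1] or ~[-1,1] up to
    # sign and normalisation.
    print(u, u @ H @ u)
```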
Is there any other information about the concavity that's not present in the eigenvectors and eigenvalues? No! They fully describe the quadratic approximation at the point: the eigenvectors give the principal axes and the eigenvalues give the scalings along them. If we take level sets of this surface, we get hyperbolas with orthogonal axes, whose directions and scalings come from the eigenvectors and eigenvalues.
In this example the Hessian at $v$ had one positive and one negative eigenvalue, so we concluded that $v$ is a saddle. Generalising: if the Hessian has two positive eigenvalues, then we are at a local minimum (i.e. the quadratic term in the Taylor series is an elliptic paraboloid) and any direction in which we move will take us upward; if it has two negative eigenvalues, we are at a local maximum. (If an eigenvalue is zero, the quadratic term is a parabolic cylinder and is inconclusive on its own.)
And if we are working in more than two variables? Then we are at a local minimum if all the Hessian's eigenvalues are positive, and a local maximum if all the Hessian's eigenvalues are negative. But these are precisely the conditions for positive/negative definiteness!
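This classification is easy to mechanise. A minimal sketch, assuming numpy and a symmetric Hessian evaluated at a critical point:

```python
import numpy as np

# Classify a critical point from the eigenvalues of its symmetric Hessian:
# all positive -> local minimum (positive definite),
# all negative -> local maximum (negative definite),
# mixed signs  -> saddle (indefinite).
def classify(H):
    eigvals = np.linalg.eigvalsh(H)
    if np.all(eigvals > 0):
        return "local minimum (positive definite)"
    if np.all(eigvals < 0):
        return "local maximum (negative definite)"
    if np.any(eigvals > 0) and np.any(eigvals < 0):
        return "saddle point (indefinite)"
    return "inconclusive (some zero eigenvalues)"

print(classify(np.array([[1.0, -2.0], [-2.0, 1.0]])))   # saddle point
print(classify(np.array([[2.0,  0.5], [ 0.5, 1.0]])))   # local minimum
```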
So: positive definiteness is a condition on a symmetric matrix which means that its corresponding quadratic form is convex (an upward-opening elliptic paraboloid). Since the Hessian of a function at a point captures all of its second-order behaviour there, the Hessian being positive definite at a point means the function is locally convex at that point. And if the Hessian is neither positive nor negative definite, we can read off its eigenvectors and eigenvalues to see the directions in which the function bends up or down.
And how does positive definiteness relate to quadratic optimisation? Take the function $f(x) = \tfrac 12 x^\top A x + x^\top b + c$ with $A$ symmetric. The second derivative (Hessian) of this function is $A$ everywhere. If $A$ is positive definite, then the second derivative is "positive" everywhere (i.e. from any starting point, whatever direction we go in, the surface bends away from its tangent plane), and we know this from the eigenvalues all being positive. But this precisely means that $f$ is convex! (Positive *semi*definiteness, which allows zero eigenvalues, gives convexity without strictness.) This is the relationship between convex optimisation and positive (semi)definiteness.
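A small sketch of what this buys us in practice, assuming numpy and an arbitrarily chosen positive definite $A$: setting the gradient $Ax + b$ to zero gives the unique minimiser, and every perturbation away from it increases $f$.

```python
import numpy as np

# f(x) = 1/2 x^T A x + b^T x + c with A symmetric positive definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])            # positive definite (both eigenvalues > 0)
b = np.array([1.0, -1.0])
c = 4.0

def f(x):
    return 0.5 * x @ A @ x + b @ x + c

x_star = np.linalg.solve(A, -b)       # gradient A x + b = 0  =>  A x = -b
rng = np.random.default_rng(1)
for d in rng.normal(size=(5, 2)):
    # Any step away from x_star increases f, consistent with convexity.
    assert f(x_star + d) >= f(x_star)
```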