02 Jan 2022 - tsp
Last update 20 Mar 2022
23 mins
First of all, this is neither a purely mathematical article nor a tutorial. It’s just a summary of my understanding of Krylov subspaces and their role in solving linear equations. Since I had to implement such a solver for one of my FTDM simulations and as usual didn’t want to use external dependencies (and always wondered how those methods really work, since usually one only uses some magically working and highly performant library), I thought it would be a good idea to also summarize my view on this method - which is unsurprisingly again based primarily on the idea of projections and of taking a shortcut by solving a high dimensional problem inside a projective subspace. It’s a good idea to first familiarize oneself with power iterations, the QR decomposition using Givens rotations and how to apply this decomposition to solve the linear least squares problem.
Side note: Every time I denote the norm of a vector $\mid \vec{v} \mid$ I implicitly mean the Euclidean norm $\mid \vec{v} \mid_2$
So what’s the problem that I wanted to solve and what are Krylov subspace iterations used for? They’re mainly used to solve Eigenvector problems and to approximate solutions to large scale linear equation systems (say a million variables in a million equations). The latter is a pretty common problem when doing finite element and finite difference calculations in mechanics, physics, aerodynamics, electrodynamics, statics, social network analysis, statistics and many more fields - basically these are methods used to solve for or approximate the state of complex systems that are described by partial differential equations. On the other hand Krylov subspace methods are also a powerful way to calculate Eigenvectors - in fact their basic idea is similar to power iterations.
The basic problem for $N$ variables in $N$ linear equations can be written as a matrix-vector equation:
[ A * \vec{x} = \vec{b} ]In this case the coefficients of the equations are contained in the matrix $A \in \mathfrak{R}^{N \times N}$, the vector $\vec{x} \in \mathfrak{R}^N$ is formed by the unknowns and the vector $\vec{b} \in \mathfrak{R}^N$ contains the constants on the right hand side of the equations. For example the system of equations
[ 1 a + 2 b + 3 c = 4 \\ 5 a + 6c = 7 \\ 8 a - 9 b + 10 c = 11 ]would be expressed as
[ \begin{pmatrix} 1 & 2 & 3 \\ 5 & 0 & 6 \\ 8 & -9 & 10 \end{pmatrix} * \begin{pmatrix} a \\ b \\ c \end{pmatrix} = \begin{pmatrix} 4 \\ 7 \\ 11 \end{pmatrix} ]Of course such a small system could be solved by hand or by automating the Gaussian elimination method - which works pretty well for small scale systems but gets unusable when one approaches thousands of equations in thousands of variables. Then iterative methods get more interesting. The basic idea behind current software implementations is to use algorithms that are based only on a few simple operations such as matrix-vector products, vector additions and scalings and inner products - see the short sketch below.
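As a small illustration - a hedged numpy sketch where all names are of course arbitrary - here is the example system from above together with the simple kernel operations an iterative solver is built from:

```python
import numpy as np

# The small example system from above
A = np.array([[1.0,  2.0,  3.0],
              [5.0,  0.0,  6.0],
              [8.0, -9.0, 10.0]])
b = np.array([4.0, 7.0, 11.0])

# At this scale a direct solver is of course the obvious choice
x = np.linalg.solve(A, b)
print(np.allclose(A @ x, b))    # True

# The simple kernels an iterative Krylov solver is built from:
x0 = np.zeros(3)                # some initial guess
r0 = b - A @ x0                 # matrix-vector product and vector addition
rho = r0 @ r0                   # inner product
x1 = x0 + 0.1 * r0              # scaled vector update (the step length 0.1 is arbitrary)
```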
This is usually done by choosing an initial guess $\vec{x}_0$ and then computing a sequence of $\vec{x}_n$ that should minimize the residual $\vec{r_n} = \vec{b} - A * \vec{x_n}$. For Krylov subspace methods this is done by moving into an affine subspace $\vec{x_0} + \mathfrak{K} \subset \mathfrak{R}^N$. Krylov subspace methods are as of today the most used methods to solve general and symmetric positive definite systems of equations - other methods such as successive overrelaxation do exist but are rarely used except for some special cases.
So the first question that arises is - what is a Krylov subspace anyways and why does one want to use this subspace?
Formally one could solve the linear equation system by calculating a matrix inverse:
[ A * \vec{x} = \vec{b} \\ A^{-1} * A * \vec{x} = A^{-1} * \vec{b} \\ \vec{x} = A^{-1} * \vec{b} ]This is of course not suited for any real world implementation - but there is a theorem - the Cayley-Hamilton theorem - that tells one that every square matrix over a commutative ring satisfies its own characteristic equation. This also implies that for every invertible $n \times n$ matrix (i.e. one with nonzero determinant) one can write its inverse $A^{-1}$ as a polynomial expression of degree $n-1$ in $A$. This polynomial expression is built from the powers of $A$ - so the inverse can be written as a linear combination of $I, A, A^2, \ldots, A^{n-1}$. In contrast to a direct approach Krylov subspace iterations never calculate the powers of the matrix themselves - but they do repeatedly perform a matrix-vector multiplication and thus basically multiply vectors with powers of $A$, which leads to a huge performance gain (see power iterations).
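To make the Cayley-Hamilton argument a bit more tangible, here is a hedged numerical sketch (plain numpy, reusing the small example matrix from above) that reconstructs the inverse as a polynomial in the powers of $A$ - of course nobody would compute an inverse this way for a large system:

```python
import numpy as np

A = np.array([[1.0,  2.0,  3.0],
              [5.0,  0.0,  6.0],
              [8.0, -9.0, 10.0]])
n = A.shape[0]

# Characteristic polynomial p(t) = t^n + c[1] t^(n-1) + ... + c[n]  (c[0] == 1)
c = np.poly(A)

# Cayley-Hamilton: A^n + c[1] A^(n-1) + ... + c[n] I = 0, therefore
# A^(-1) = -(A^(n-1) + c[1] A^(n-2) + ... + c[n-1] I) / c[n]
powers = [np.linalg.matrix_power(A, k) for k in range(n)]     # I, A, A^2, ...
A_inv = -sum(c[j] * powers[n - 1 - j] for j in range(n)) / c[n]

print(np.allclose(A_inv, np.linalg.inv(A)))                   # True
```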
Given any vector $\vec{v} \in \mathfrak{R}^N$ and $n \leq N$ a Krylov subspace is just
[ K_n = K_n(A, \vec{v}) = span(\vec{v}, A \vec{v}, A^2 \vec{v}, \ldots, A^{n-1} \vec{v}) ]As one can see of course this is a nested set of subspaces:
[ K_1 \subseteq K_2 \subseteq K_3 \subseteq \ldots \subseteq K_{N} ]The dimension of these subspaces grows by at most one with every iteration and it can never exceed $N$ - in practice it’s usually much smaller. This is also the foundation for solving linear equations in Krylov subspaces - searching for a representation of a solution in a low dimensional space is usually much easier than searching inside a high dimensional space.
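A minimal sketch of what building such a (raw, not yet orthogonalized) Krylov basis looks like - the function name is just illustrative:

```python
import numpy as np

def krylov_vectors(A, v, n):
    """Return the raw Krylov vectors v, A v, A^2 v, ..., A^(n-1) v as columns.
    Only repeated matrix-vector products are needed, never explicit matrix powers."""
    vecs = [v]
    for _ in range(n - 1):
        vecs.append(A @ vecs[-1])       # one mat-vec per additional basis vector
    return np.column_stack(vecs)

# Note: in this raw form the vectors quickly align with the dominant Eigenvector
# (power iteration!), so practical methods orthogonalize them (Lanczos / Arnoldi).
```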
A compact description of the basic inner workings of a standard Krylov subspace solver that I’ve found in a paper is:
[ \vec{r}_n = \vec{b} - A \vec{x}_n \\ \vec{x}_n - \vec{x}_0 = q_{n-1}(A) \vec{r}_0 \in K_n(A,\vec{r}_0) \\ \vec{r}_n - \vec{r}_0 = \zeta_n(A) \vec{r}_0 \in A K_n(A, \vec{r}_0) \subseteq K_{n+1}(A, \vec{r}_0) ]So how does one build the basis of a Krylov subspace in practice? One could for example use the Lanczos iteration (for symmetric matrices) or the Arnoldi iteration (for general matrices), both described in the following.
The Lanczos algorithm is one of the most basic Krylov subspace algorithms and is only applicable to symmetric positive definite matrices. The basic idea is to start at a random unit vector $\vec{v}$ and iteratively construct an orthonormal basis of $K_n(A,\vec{v})$ with $n \leq N$.
First one initializes the algorithm:
[ \vec{w}_0 = \vec{0} \\ \vec{w}_1 = \vec{v} \\ \delta_1 = 0 ]Then an iterative sequence follows for at most $n$ iterations:
[ \vec{h} = A * \vec{w}_j - \delta_j * \vec{w}_{j-1} \\ \gamma_j = \vec{h}^T * \vec{w}_j \\ \vec{k} = \vec{h} - \gamma_j \vec{w}_j \\ \delta_{j+1} = \mid \mid \vec{k} \mid \mid_2 \\ \vec{w}_{j+1} = \frac{1}{\delta_{j+1}} \vec{k} ]As one can see the first step is as usual to push the current vector $\vec{w}_j$ through the projection matrix $A$. One then subtracts the component along the previous iteration’s basis vector (more on that later) - this weight factor $\delta_j$ is calculated later on in the loop and of course is zero for the first iteration.
The factor $\gamma_j$ basically describes the projection of the newly projected vector onto the current basis vector. The calculation of $\vec{k}$ thus removes any components of $\vec{w}_j$ from $\vec{h}$ by simple subtraction. This is a simple orthogonalization step - note that it is usually numerically unstable because of cancellation errors.
Now one calculates the new guess simply by determining the length of the remaining orthogonal components and re-normalizing the guess.
As mentioned above this algorithm would work in theory but is usually numerically unstable - one could use something like the Householder algorithm. The resulting orthonormal basis is:
[ W_n = \left( \vec{w_1}, \vec{w_2}, \ldots, \vec{w_n} \right) \in \mathfrak{R}^{N \times n} ]Note that the Lanczos algorithm might terminate early - in case $\delta_{j+1} = 0$ for any $j < n$ one can assume that the degree of $\vec{v}$ is $j$ and $K_j$ is invariant.
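A hedged sketch of the Lanczos iteration described above (plain numpy, no re-orthogonalization, so it inherits the numerical instability mentioned; names are illustrative):

```python
import numpy as np

def lanczos(A, v, n):
    """Build an orthonormal basis W of K_n(A, v) for a symmetric (positive
    definite) matrix A following the recurrence above."""
    N = A.shape[0]
    W = np.zeros((N, n))
    w_prev = np.zeros(N)                 # w_0 = 0
    w = v / np.linalg.norm(v)            # w_1 = v (normalized)
    delta = 0.0                          # delta_1 = 0
    for j in range(n):
        W[:, j] = w
        h = A @ w - delta * w_prev       # push through A, remove the previous direction
        gamma = h @ w                    # projection onto the current vector
        k = h - gamma * w                # orthogonalize against it
        delta = np.linalg.norm(k)
        if delta == 0.0:                 # early termination: the subspace is invariant
            return W[:, :j + 1]
        w_prev, w = w, k / delta         # normalized next basis vector
    return W
```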
The Arnoldi algorithm works somewhat like the Lanczos method above - in fact when applied to Hermitian matrices the Arnoldi algorithm reduces to the Lanczos algorithm. The idea is similar too: perform a Gram-Schmidt orthogonalization after each iteration step to build the Krylov subspace. This of course requires storing all basis vectors in memory, which might be a problem under some circumstances - in this case one can for example implement restarts when the subspace grows too large.
Again as one can see the algorithm forms an iterative process that starts with a unit start vector $\vec{w}_0$. This vector is then pushed through the projection matrix (and thus slowly moves towards the Eigenvector of the matrix with the largest Eigenvalue, as with any power iteration).
Then one performs the Gram-Schmidt orthogonalization. This basically projects the newly generated vector $\vec{w}_j$ onto all previously generated vectors $\vec{w}_1 \ldots \vec{w}_{j-1}$, one after another. As one should already know the inner product of two vectors yields the length of the projection of one vector onto the other - in our case one vector is of unit length so the projected length is only scaled by the length of the second vector. Then one simply rescales the previous vector until its components in the direction of $\vec{w}_j$ match those of $\vec{w}_{j}$ and subtracts the scaled vector ($\vec{w}_j = \vec{w}_j - h_{l, j-1} * \vec{w}_l$). This removes all components pointing in the direction of $\vec{w}_l$ from $\vec{w}_j$ - this is the typical Gram-Schmidt orthogonalization process.
The last step - since we’re interested in normalized basis vectors - is the normalization of the vector $\vec{w}_j$ by dividing it by its own length $h_{j, j-1}$.
Note that as for the Lanczos algorithm this algorithm may abort early with $\vec{w}_j = \vec{0}$ in case the characteristic polynomial is only of degree $j$.
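A hedged sketch of the Arnoldi iteration (plain classical Gram-Schmidt as described above; the upper Hessenberg matrix $H$ formed by the coefficients $h_{l,j}$ will be reused in the GMRES section below):

```python
import numpy as np

def arnoldi(A, v, n):
    """Build an orthonormal basis Q of K_n(A, v) for a general square matrix A.
    Returns Q (N x (n+1)) and H ((n+1) x n) with A @ Q[:, :n] == Q @ H (up to rounding)."""
    N = A.shape[0]
    Q = np.zeros((N, n + 1))
    H = np.zeros((n + 1, n))
    Q[:, 0] = v / np.linalg.norm(v)
    for j in range(n):
        w = A @ Q[:, j]                       # push through the projection matrix
        for l in range(j + 1):                # Gram-Schmidt against all previous vectors
            H[l, j] = Q[:, l] @ w
            w = w - H[l, j] * Q[:, l]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:                # early breakdown: invariant subspace found
            return Q[:, :j + 1], H[:j + 2, :j + 1]
        Q[:, j + 1] = w / H[j + 1, j]         # normalize the new basis vector
    return Q, H
```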
So what has all of this to do with linear equations? First let’s take a look at how to solve linear equations numerically. One could of course always apply the Gaussian elimination procedure, which is the way to go for small scale systems, but unfortunately this scales with $\mathfrak{O}(n^3)$ which makes it pretty prohibitive for large scale applications. And when looking into big data directions it’s not really flexible enough to be updated and interpreted in an online fashion without knowing the whole dataset.
So let’s take a different approach (that’s usually also not executed exactly this way due to its numerical complexity, but it’s guaranteed to converge in case a solution exists). One can show that the solution of the system of linear equations
[ A \vec{x} = \vec{b} ]is - for symmetric positive definite $A$ - equal to the solution of the minimization problem:
[ min \phi(\vec{x}) = \frac{1}{2} \vec{x}^T A \vec{x} - \vec{x}^T \vec{b} ]The idea behind steepest descent thus is to just take some arbitrary start guess $\vec{x}_0$, find the direction of steepest descent, calculate the step and take it to obtain $\vec{x}_{1}$. Now iterate as long as one hasn’t reached the correct solution. The first thing to look at is the direction of the step, so one can do a Taylor expansion of the function $\phi(\vec{x} + \alpha \vec{d})$ when taking a step into direction $\vec{d}$:
[ \phi(\vec{x}_k + \alpha \vec{d}) = \phi(\vec{x}_k) + \alpha \nabla \phi(\vec{x}_k)^T \vec{d} + \mathfrak{O}(\alpha^2 \mid \vec{d} \mid^2) ]As one can see the (first order) approximation for the direction of steepest descent is simply $- \nabla \phi(\vec{x}_k)$. Let’s just define this as the residual $\vec{r}_k$:
[ \vec{r_k} = - \nabla \phi(\vec{x}_k) \\ \vec{r_k} = \vec{b} - A \vec{x}_k ]Now one is just moving into the direction of $\vec{r}_k$ till one reaches the optimum solution (exact line search). When one looks at $\phi(\vec{x_k} + \alpha \vec{r}_k)$ one can see this is just a second order polynomial in $\alpha$ and thus one can solve exactly:
[ \phi(\vec{x_k} + \alpha_k \vec{r}_k) = \frac{1}{2} (\vec{r}_k^T A \vec{r}_k) \alpha_k^2 + (\vec{r}_k^T A \vec{x}_k - \vec{r}_k^T \vec{b}) \alpha_k + \phi(\vec{x}_k) \\ \to \alpha_k = \frac{\vec{r}_k^T \vec{r}_k}{\vec{r}_k^T A \vec{r}_k} ]This would already be enough to build an idealized algorithm with pretty slow but guaranteed convergence:
while $\vec{r}_k \neq \vec{0}$:
[ \alpha_k = \frac{\vec{r}_k^T \vec{r}_k}{\vec{r}_k^T A \vec{r}_k} \\ \vec{x}_{k+1} = \vec{x}_k + \alpha_k \vec{r}_k \\ \vec{r}_{k+1} = \vec{b} - A \vec{x}_{k+1} ]
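A minimal sketch of this idealized steepest descent loop, assuming a symmetric positive definite numpy array A (the tolerance based stopping criterion replaces the idealized $\vec{r}_k \neq \vec{0}$ check):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10000):
    """Idealized steepest descent with exact line search along the residual."""
    x = x0.astype(float).copy()
    r = b - A @ x
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        alpha = (r @ r) / (r @ (A @ r))     # exact line search step length
        x = x + alpha * r                   # take the step
        r = b - A @ x                       # recompute the residual (two mat-vecs per step)
    return x
```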
As one can easily see one can also cache one intermediate result to reduce the number of matrix multiplications:
[ \vec{r}_{k+1} = \vec{b} - A \vec{x}_{k+1} \\ \vec{r}_{k+1} = \vec{b} - A (\vec{x}_k + \alpha_k \vec{r}_k) \\ \vec{r}_{k+1} = \vec{r}_k - \alpha_k \underbrace{A \vec{r}_k}_{\text{already needed for } \alpha_k} ]Since this method is - even when guaranteed to converge - pretty slow, one can extend the idea to use a projective method. The idea is similar to the steepest descent step - but one performs this minimization in a projective subspace. This space is spanned by some vectors $span(\vec{v_1}, \vec{v_2}, \ldots, \vec{v_k})$. This idea arises when one looks at the result updating step inside the steepest descent algorithm:
[ \vec{x}_{k+1} = \vec{x}_k + \alpha_k \vec{r}_k \\ \vec{x}_{k+1} = \vec{x}_0 + \alpha_0 \vec{r}_0 + \alpha_1 \vec{r}_1 + \ldots + \alpha_k \vec{r}_k \\ \to \vec{x}_{k+1} \in \vec{x_0} + span(\vec{r}_0, \vec{r}_1, \vec{r}_2, \ldots, \vec{r}_k) ]This will allow one to reduce the complexity of this exact line search step, for example when using the conjugate gradient method or the generalized minimal residual method.
The conjugate gradient (CG) method is the most used method for symmetric positive definite matrices. The basic idea is, as mentioned above, that the exact line search is faster in a subspace $V_k$ than in the overall solution space. In case one chooses $V_k$ wisely one can even expect the exact solution to lie inside a subspace of the whole problem / expect fast convergence. Thus the idea is to combine the low computational cost of steepest descent in contrast to generalized methods (which limits the conjugate gradient method to symmetric positive definite matrices) with the convergence speed of projective methods.
The basic building blocks are the exact line search along the current search direction, the update of the solution and the residual, and the construction of a new search direction from the current residuum that is orthogonal to the previously searched subspace.
The third step can simply be realized symbolically by first projecting the residuum vector $\vec{r}_k$ onto the space $V_{k-1}$:
[ \vec{z}_k = \mathfrak{P}(AK_{k-1}) * \vec{r}_k ]Thus $\vec{z}_k$ only contains the elements of $\vec{r}_k$ that are contained in the previous projective space. Subtracting this from $\vec{r}_k$ yields all components that lie outside this given subspace:
[ \vec{p}_k = \frac{1}{\gamma_k} (\vec{r}_k - \vec{z}_k) ]As one can see the subspace one’s spanning that way is $\mathfrak{K}_k = span(\vec{r}_0, A \vec{r}_0, \ldots, A^k \vec{r}_0)$. The dimension of this space is increasing at most by one each iteration and thus $dim \mathfrak{K}_k \leq k+1$.
Explicitly inserting and compactifying yields a simple expression without explicit projections into the subspace:
[ \vec{p}_k = \vec{r}_k - \frac{\vec{p}_{k-1}^T A \vec{r}_k}{\vec{p}_{k-1}^T A \vec{p}_{k-1}} \vec{p}_{k-1} ]This is the conjugate gradient algorithm - one can imagine that one always takes the steepest possible step with the optimal length into the direction of each residuum inside a space orthogonal to the previous subspace. When looking at the iteration $\mathfrak{K}_k = span(\vec{r}_0, A \vec{r}_0, \ldots, A^k \vec{r}_0)$ that spans the subspace one can again see a power iteration as when calculating Eigenvectors - which also intuitively explains the high speed of projective methods. First they minimize along the approximate direction of the first principal vector, then along the second one, etc. - it’s not exact though, since they don’t calculate the first Eigenvector but only use the residuum, which is strongly pushed into the direction of the first Eigenvector - but the process is similar, especially when the Eigenvalues are well separated.
while $\vec{r}_k \neq \vec{0}$:
[ \vec{p}_k = \begin{cases} \vec{r}_0 & k = 0 \\ \vec{r}_k - \frac{\vec{p}_{k-1}^T A \vec{r}_k}{\vec{p}_{k-1}^T A \vec{p}_{k-1}} \vec{p}_{k-1} & k > 0 \end{cases} \\ \alpha_k = \frac{\vec{p}_k^T \vec{r}_k}{\vec{p}_k^T A \vec{p}_k} \\ \vec{x}_{k+1} = \vec{x}_k + \alpha_k \vec{p}_k \\ \vec{r}_{k+1} = \vec{r}_k - \alpha_k A \vec{p}_k ]
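A hedged sketch of this conjugate gradient loop in plain numpy (the direction update uses the formula derived above; since $A$ is symmetric the already computed $A \vec{p}_{k-1}$ can be reused, so only one matrix-vector product is needed per iteration):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    """Conjugate gradient iteration for a symmetric positive definite A."""
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()                              # first search direction: the residual
    if max_iter is None:
        max_iter = len(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Ap = A @ p                            # the only mat-vec per iteration
        alpha = (p @ r) / (p @ Ap)            # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap                    # residual update
        beta = (Ap @ r) / (p @ Ap)            # = p^T A r / p^T A p (A symmetric)
        p = r - beta * p                      # new search direction
    return x
```

In exact arithmetic this terminates after at most $N$ iterations; in floating point one simply stops once the residual norm is small enough.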
The generalized minimal residual method (GMRES) is one of the methods one can apply to general, indefinite, non-symmetric systems of linear equations. It approximates a solution vector in a Krylov subspace by trying to minimize the Euclidean norm of the residual (by choosing the Krylov subspace representation of the solution with the smallest deviation) at each step:
[ \vec{r}_n = \vec{b} - A * \vec{x_n} \\ \mid \vec{r}_n \mid \to min. ]The vector $\vec{x}_n$ is assumed to be an element of the $n$’th Krylov subspace $K_n$.
[ \vec{x}_n \in K_n ]Note that this is particularly efficient when solving the least squares problem at each iteration using a QR decomposition implemented with Givens rotations. This is the case since the Arnoldi iteration performs a full Gram-Schmidt orthogonalization for each new vector of the iteratively created basis and thus the least squares update will only cost $\mathfrak{O}(k)$ at the $k$’th step.
The first step is to build the Krylov subspace using not a random vector but the initial residual for any initial guess $x_0$:
[ \vec{r}_0 = \vec{b} - A * \vec{x}_0 ]The initial vector for the Arnoldi iteration thus is the unit vector pointing into the direction of the initial residual:
[ \vec{v}_1 = \frac{1}{\mid \vec{r}_0 \mid} \vec{r}_0 ]The approximated vector $\vec{x}_n$ after the n’th iteration can be written simply as:
[ Q_n = \begin{pmatrix} \vec{v}_1 & \vec{v}_2 & \ldots & \vec{v}_n \end{pmatrix} \in \mathfrak{R}^{N \times n} \\ \vec{x}_n = \vec{x}_0 + Q_n * \vec{y}_n ]If one looks closely one sees that one is basically building a linear combination of the first $n$ vectors $\vec{v}_n$ that - as for the power iteration / Arnoldi algorithm - approximate the Eigenvectors of the matrix that correspond to the largest Eigenvalues - so one basically tries to isolate the components with the biggest influence on the projective transformation and builds a linear combination along the principal axes. This is the same as during the subspace iterations of the conjugate gradient method.
The main difference is that one now does not do an exact line search in the direction of the maximum gradient orthogonal to the previous subspace (as for the CG method above) but tries to minimize the residuum in a general way in the current subspace.
In this case one has to choose $\vec{y}_n$ such that:
[ \mid \vec{r}_n \mid \to min \\ \mid \vec{b} - A * \vec{x}_n \mid \to min \\ \mid \vec{b} - A * \left( \vec{x}_0 + Q_n * \vec{y}_n \right) \mid \to min \\ \mid \underbrace{\vec{b} - A * \vec{x}_0}_{\beta * \vec{v}_1} - A * Q_n * \vec{y}_n \mid \to min \\ \beta = \mid \vec{r}_0 \mid ]Thus on each iteration one tries to minimize the residual along one more basis vector of the Krylov subspace / is approximating the action of the inverse matrix by a component along one more of the approximated Eigenvector directions. This next basis vector is orthogonal to the previous vectors because it has been orthogonalized during the Arnoldi iteration.
The algorithm is finished whenever the minimization yields a result that reduces the residual as far as desired or when an exact solution is found (i.e. the residual is zero - which will generally not happen even for exactly solvable systems due to rounding errors, as usual for floating point algorithms) - of course this heavily depends on the problem that one tries to solve and on the matrix (preconditioning, etc.). It’s only guaranteed to converge in the last step - which then is equivalent to directly solving the system of equations using least squares.
The main problem when implementing GMRES is the least squares minimization step to find an appropriate $\vec{y}_n$. One can easily show from the above equations that the minimization of
[ \mid \vec{b} - A * \vec{x}_0 - A * Q_n * \vec{y}_n \mid \to min ]is equivalent to the minimization
[ J(\vec{y}) = \mid \mid \vec{r}_0 \mid \vec{v}_1 - A * Q_n * \vec{y}_n \mid ]This can then be rewritten a little bit further using the upper Hessenberg matrix $H_k$. Recall that an upper Hessenberg matrix is just a matrix with zero elements below the first subdiagonal. This works since the Arnoldi iteration yields
[ A \vec{q}_k = \sum_{i = 1}^{k+1} h_{i,k} \vec{q}_i \\ A Q_{k} = Q_{k+1} H_k \\ H_k \in \mathfrak{R}^{(k+1) \times k} \\ H_k = \begin{pmatrix} h_{1,1} & h_{1,2} & \ldots & h_{1,k} \\ h_{2,1} & h_{2,2} & \ldots & h_{2,k} \\ 0 & h_{3,2} & \ldots & h_{3,k} \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & h_{k+1, k} \end{pmatrix} ] [ J(\vec{y}) = \mid \mid \vec{r}_0 \mid \vec{v}_1 - Q_{n+1} * H_{n} * \vec{y}_n \mid ]The next step is just to factor out $Q_{n+1}$.
[ J(\vec{y}) = \mid Q_{n+1} \left( \beta \vec{e}_1 - H_k * \vec{y} \right) \mid \\ J(\vec{y}) = \mid \beta \vec{e}_1 - H_k \vec{y} \mid ]The term $Q_{n+1}$ can be dropped since it’s an orthogonal matrix - its application only rotates the vector $\beta \vec{e}_1 - H_k \vec{y}$ and thus does not change its magnitude, which is then minimized using least squares by QR factorization with Givens rotations.
Since QR factorization produces an upper triangular matrix and the matrix $H_k$ only has at most $k$ non-zero elements below the diagonal (namely its first subdiagonal), at most $O(k)$ Givens rotations are required - in this case solving the least squares problem using Givens rotations and back substitution results in similar or even better performance than Householder reflections when applying the Arnoldi iteration as described above.
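A rough sketch of this step, assuming a $(k+1) \times k$ upper Hessenberg matrix as produced by the Arnoldi iteration - one Givens rotation per column, then back substitution (function name and structure are only illustrative):

```python
import numpy as np

def solve_hessenberg_lstsq(H, beta):
    """Minimize || beta*e1 - H y || for a (k+1) x k upper Hessenberg H."""
    m, k = H.shape                                   # m == k + 1
    R = H.astype(float).copy()
    g = np.zeros(m)
    g[0] = beta                                      # right hand side beta * e_1
    for j in range(k):
        a, b = R[j, j], R[j + 1, j]
        r = np.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r
        G = np.array([[c, s], [-s, c]])              # rotation zeroing R[j+1, j]
        R[j:j + 2, j:] = G @ R[j:j + 2, j:]
        g[j:j + 2] = G @ g[j:j + 2]
    y = np.linalg.solve(np.triu(R[:k, :k]), g[:k])   # back substitution (plain solve here)
    return y, abs(g[k])                              # solution and minimal residual norm
```

The absolute value of the last entry of the rotated right hand side directly gives the achievable residual norm, which is handy as a stopping criterion.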
Summarized the algorithm is pretty simple: compute the initial residual $\vec{r}_0$ and the start vector $\vec{v}_1$, extend the Krylov basis by one Arnoldi step per iteration, solve the small least squares problem $\mid \beta \vec{e}_1 - H_k \vec{y} \mid \to min$ (for example using Givens rotations as described above), check the achieved residual norm and either iterate further or assemble the solution $\vec{x}_n = \vec{x}_0 + Q_n \vec{y}_n$. A naive sketch of the whole procedure is shown below.
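The following is a hedged, naive end-to-end sketch of such a GMRES iteration in plain numpy; for brevity the small least squares problem is solved with numpy’s lstsq instead of the incremental Givens/QR update described above (same result, just less efficient):

```python
import numpy as np

def gmres_naive(A, b, x0, n_iter=50, tol=1e-10):
    """Naive GMRES sketch: Arnoldi builds Q and the upper Hessenberg H, then y
    minimizes || beta*e1 - H y || and the approximation is x0 + Q y."""
    N = len(b)
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    if beta == 0.0:                                  # initial guess is already exact
        return x0.astype(float)
    Q = np.zeros((N, n_iter + 1))
    H = np.zeros((n_iter + 1, n_iter))
    Q[:, 0] = r0 / beta
    y = np.zeros(0)
    for j in range(n_iter):
        w = A @ Q[:, j]
        for l in range(j + 1):                       # Arnoldi / Gram-Schmidt step
            H[l, j] = Q[:, l] @ w
            w = w - H[l, j] * Q[:, l]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] != 0.0:
            Q[:, j + 1] = w / H[j + 1, j]
        g = np.zeros(j + 2)
        g[0] = beta                                  # beta * e_1
        Hj = H[:j + 2, :j + 1]
        y = np.linalg.lstsq(Hj, g, rcond=None)[0]    # min || beta*e1 - H y ||
        if np.linalg.norm(g - Hj @ y) <= tol or H[j + 1, j] == 0.0:
            break
    return x0 + Q[:, :len(y)] @ y
```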
The following GitHub GIST contains a simple and naive implementation of the described algorithms, built on the QR decomposition using Givens rotations and the least squares solver based on that decomposition. The notebook is also available as a PDF
This article is tagged: Math, Programming, Physics, How stuff works, Machine learning, Computational linear algebra
Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)
This webpage is also available via TOR at http://rh6v563nt2dnxd5h2vhhqkudmyvjaevgiv77c62xflas52d5omtkxuid.onion/