
Krylov Subspace Methods

The University of Texas at Austin

Preamble: Run the cells below to import the necessary Python packages

This notebook was created by William Gilpin. Consult the course website for all content, and the GitHub repository for raw files and runnable online code.

import numpy as np
from IPython.display import Image, display

import matplotlib.pyplot as plt
%matplotlib inline

Motivation: eigenvector alignment and dimensionality

The code below builds a small random symmetric matrix and compares the direction of the first iterate $A\mathbf{b}$ with the direction of the leading eigenvector of $A$: even a single matrix-vector product begins to rotate $\mathbf{b}$ toward the dominant eigenvector.

np.random.seed(0) # ensures same every time
A = np.random.random((10, 10))
A = A.T + A # Hermitian matrix
b = np.random.random((10,))

n = 2
A, b = A[:n, :n], b[:n]

eig_vals, eig_vecs = np.linalg.eigh(A)

eig_vec_leading = eig_vecs[:, np.argmax(np.abs(eig_vals))] # eigenvectors are the columns of eig_vecs


plt.plot([0, eig_vec_leading[0]], [0, eig_vec_leading[1]], 'r-', label='leading eigenvector')


y = A @ b
y /= np.linalg.norm(y)
plt.plot([0, y[0]], [0, y[1]], 'b-', label='A @ b')

plt.legend()
plt.title('Leading eigenvector and first iterate A b')
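
To connect this picture to power iteration, the short sketch below (an illustration reusing A, b, and eig_vec_leading from the cell above) repeatedly applies A to b and tracks how quickly the normalized iterate aligns with the leading eigenvector.

# Repeatedly apply A to b: the normalized iterate rotates toward the
# largest-magnitude eigenvector (this is just power iteration)
v = b / np.linalg.norm(b)
for k in range(5):
    v = A @ v
    v /= np.linalg.norm(v)
    # |cosine| of the angle between the current iterate and the leading eigenvector
    print(k + 1, np.abs(v @ eig_vec_leading))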

Krylov subspace methods

  • We’ve seen several iterative algorithms that involve repeatedly multiplying a matrix by a vector (QR algorithm, power method, etc)
  • Our choice of target vector $\mathbf{b}$ and our operator $A$ jointly define the Krylov subspace of the pair:
$$\mathcal{K}_r(A, b) = \operatorname{span}\{b,\; Ab,\; A^2 b,\; \ldots,\; A^{r-1} b\}.$$

It’s called a subspace because it is often a subspace of the column span of $A$ that depends on $\mathbf{b}$. We can choose any value of $r$ up to a maximum $r_{\max} \leq \operatorname{rank}(A) + 1$ (the full column space).

Properties of the Krylov subspace

  • The matrix $A$ performs a shift on the Krylov subspace: $\mathcal{K}_r(A,b),\; A\,\mathcal{K}_r(A,b) \subset \mathcal{K}_{r+1}(A,b)$
  • If $r > r_{\max}$, then the Krylov matrix is rank-deficient (its columns are linearly dependent)
  • If $A$ is a full-rank $N \times N$ matrix, then the Krylov matrix spanning $\mathcal{K}_{r_{\max}}(A,b)$ is also a full-rank $N \times N$ matrix for a generic choice of $\mathbf{b}$ (see the numerical check below)
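
As a concrete illustration of these properties, here is a minimal numerical sketch (the names A_demo, b_demo, and the helper krylov_matrix are ad hoc, defined only for this check):

A_demo = np.random.random((6, 6))
b_demo = np.random.random(6)

def krylov_matrix(A, b, r):
    """Matrix with columns b, Ab, ..., A^(r-1) b, which span K_r(A, b)."""
    cols = [b]
    for _ in range(r - 1):
        cols.append(A @ cols[-1])
    return np.column_stack(cols)

K3 = krylov_matrix(A_demo, b_demo, 3)
K4 = krylov_matrix(A_demo, b_demo, 4)

# Shift property: the columns of A K_3 already lie inside K_4,
# so appending them does not increase the rank
print(np.linalg.matrix_rank(np.hstack([K4, A_demo @ K3])) == np.linalg.matrix_rank(K4))

# Rank saturation: once r exceeds r_max, the new columns are linearly dependent
print([np.linalg.matrix_rank(krylov_matrix(A_demo, b_demo, r)) for r in range(1, 9)])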

Why use Krylov subspace methods?

The main reason: we can work only with matrix-vector products $A\mathbf{v}$, and never need to form or factor $A$ directly, which could be a very large matrix.

Suppose a linear equation has a residual vector:

$$\mathbf{r} = A \mathbf{x} - \mathbf{b}$$

For example, if we are performing linear regression via least squares, then the residual corresponds to the error terms of the fit

Different Krylov subspace methods boil down to choosing a vector $\mathbf{x}$, and thus a subspace $\mathcal{K}_j(A,b)$, for which the residual satisfies certain desiderata:

  1. Minimize $\|\mathbf{r}\|$: iterative least-squares via GMRES, MINRES
  2. Minimize $\|\mathcal{K}_j \cdot \mathbf{r}\|$ (i.e., find a residual orthogonal to the subspace): the conjugate gradient method

where $\mathcal{K}_j$ is shorthand for $\mathcal{K}_j(A,b)$. In these iterative algorithms, we seek to minimize the desired cost function at each step $j$, where each step corresponds to a single power iteration of the matrix.
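
As a concrete illustration, scipy ships Krylov solvers built on exactly this idea, and they only ever request matrix-vector products. The sketch below assumes scipy is available (it is not imported in the preamble): it wraps a matvec function in a LinearOperator and hands it to the conjugate gradient and GMRES routines, so the solver never sees an explicit matrix.

from scipy.sparse.linalg import LinearOperator, cg, gmres

rng = np.random.default_rng(0)
n = 200
d = rng.random(n) + 1.0 # diagonal of a simple SPD test operator
b_rhs = rng.random(n)

# The solvers only ever call this function; the matrix A is never formed explicitly
A_op = LinearOperator((n, n), matvec=lambda v: d * np.ravel(v), dtype=float)

x_cg, info_cg = cg(A_op, b_rhs)     # residual made orthogonal to the Krylov subspace
x_gm, info_gm = gmres(A_op, b_rhs)  # residual norm minimized over the Krylov subspace

print(info_cg, info_gm) # 0 indicates successful convergence
print(np.linalg.norm(d * x_cg - b_rhs), np.linalg.norm(d * x_gm - b_rhs))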

Arnoldi iteration

  • An eigenvalue calculation algorithm
  • Similar to QR factorization, but allows us to stop after the first $m \ll N$ terms of the factorization of the full matrix $A \in \mathbb{R}^{N \times N}$
  • The full factorization of this matrix would have the form $A = Q H Q^{-1}$
  • But if we want to just compute the first $m$ eigenvalues of a matrix, we want a lower-rank approximation of $Q$
  • The partial matrix $Q_m \in \mathbb{R}^{N \times m}$ has the property
$$Q_{m+1}^{*} A Q_m = \tilde{H}_m$$

where $\tilde{H}_m \in \mathbb{R}^{(m+1) \times m}$ consists of the first $m+1$ rows and first $m$ columns of the full Hessenberg matrix $H$.

$$A \begin{bmatrix} \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_m \end{bmatrix} = \begin{bmatrix} \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_{m+1} \end{bmatrix} \begin{bmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,m} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,m} \\ 0 & h_{3,2} & \cdots & h_{3,m} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & h_{m+1,m} \end{bmatrix}$$

Arnoldi iteration algorithm

  1. Choose a random starting vector $\mathbf{q}_1$ of length $N$ and normalize it to have unit length
  2. For $k = 1, 2, \ldots, m$:
    1. Compute $\mathbf{v}_k = A \mathbf{q}_k$
    2. For $j = 1, 2, \ldots, k$:
      1. Compute $h_{j,k} = \langle \mathbf{q}_j, \mathbf{v}_k \rangle$
      2. Compute $\mathbf{v}_k = \mathbf{v}_k - h_{j,k} \mathbf{q}_j$
    3. Compute $h_{k+1,k} = \|\mathbf{v}_k\|$
    4. Compute $\mathbf{q}_{k+1} = \mathbf{v}_k / h_{k+1,k}$
  3. Return $Q_m$ and $\tilde{H}_m$

What is going on here?

  • This is almost the same algorithm as modified Gram-Schmidt, which we saw before

  • The difference is that we are applying the matrix $A$ as an operator to $\mathbf{q}$ in each iteration, and thus switching to a new column vector in our Krylov subspace

  • Our output is the reduced matrix $Q_m$

  • Each iteration finds a new basis vector for the corresponding rank-$k$ Krylov subspace

  • After stopping at our $m$ of interest, we can find the reduced Hessenberg matrix $H_m = Q_m^* A Q_m$, where $H_m \in \mathbb{R}^{m \times m}$. We can then solve for the eigenvalues of this reduced matrix

class ArnoldiIteration:
    """
    Perform the Arnoldi iteration on a matrix A and a vector b.
    """
    def __init__(self, k):
        self.k = k # number of Krylov basis vectors (and Ritz values) to compute

    def arnoldi_iteration(self, A, b, k):
        n = A.shape[0]
        Q, H = np.zeros((n, k + 1)), np.zeros((k + 1, k))
        Q[:, 0] = b / np.linalg.norm(b)
        for j in range(k):
            v = A @ Q[:, j] # the main difference between the QR and Arnoldi iterations
            for i in range(j + 1): # orthogonalize against all previous basis vectors
                H[i, j] = Q[:, i] @ v
                v -= H[i, j] * Q[:, i]
            H[j + 1, j] = np.linalg.norm(v)
            Q[:, j + 1] = v / H[j + 1, j]
        return Q, H

    def eig_k(self, A, k=None):
        """
        Approximate the k largest eigenvalues of A using the Ritz values, i.e. the
        eigenvalues of the reduced Hessenberg matrix H_k.
        """
        if k is None:
            k = self.k
        Q, H = self.arnoldi_iteration(A, np.random.rand(A.shape[0]), k)
        return np.linalg.eigvals(H[:-1])
a_big = np.random.random((3000, 3000))
a_big = a_big + a_big.T # symmetrize so the spectrum is real
eig_topk_np = np.sort(np.linalg.eigvalsh(a_big))[::-1][:10] # eigvalsh exploits symmetry

model = ArnoldiIteration(10)
eig_topk_arnoldi = np.sort(model.eig_k(a_big).real)[::-1] # sort Ritz values for comparison
plt.plot(eig_topk_np, label='numpy (dense)')
plt.plot(eig_topk_arnoldi, label='Arnoldi (Ritz values)')
plt.legend()
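
As a sanity check (a quick sketch using the class just defined), we can verify the Arnoldi relation $A Q_m = Q_{m+1} \tilde{H}_m$ and the orthonormality of the computed Krylov basis on a smaller random matrix:

a_small = np.random.random((200, 200))
Q, H = model.arnoldi_iteration(a_small, np.random.rand(200), 10)

# Arnoldi relation: A Q_m = Q_{m+1} H_tilde_m   (Q is 200 x 11, H is 11 x 10)
print(np.allclose(a_small @ Q[:, :-1], Q @ H))

# The columns of Q form an orthonormal basis of the Krylov subspace
print(np.allclose(Q.T @ Q, np.eye(Q.shape[1])))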

Conjugate gradient methods

  • An iterative method for solving the linear problem: find $\mathbf{x}$ such that $A \mathbf{x} = \mathbf{b}$

  • The residual is $(A \mathbf{x} - \mathbf{b}) = \mathbf{r}$

  • We saw that iterative methods are useful for eigenvalue problems, and so it’s useful to have a similar approach for linear solving.

  • The basic version only holds for symmetric, positive-definite matrices (like the Hessians of convex potential functions)

  • Dynamical systems view: the solution is the fixed point of the dynamics of an iterative algorithm, and we want to approach it as fast as possible.

  • The CG iterate $\mathbf{x}_k$ is the unique member of $\mathcal{K}_k(A,b)$ which minimizes the distance (in the $A$-norm) to the “true” solution.

  • Why not use Gauss-Jordan elimination or LU factorization, which give us exact answers after finitely many steps?

    • If an iterative algorithm converges faster, then we can save time

Conjugate gradient algorithm

  1. Choose a random starting vector $\mathbf{x}_0$ of length $N$ and set the initial residual $\mathbf{r}_0 = \mathbf{b} - A \mathbf{x}_0$
  2. For $k = 1, 2, \ldots, m$:
    1. Compute $\alpha_k = \langle \mathbf{r}_{k-1}, \mathbf{r}_{k-1} \rangle / \langle \mathbf{r}_{k-1}, A \mathbf{r}_{k-1} \rangle$
    2. Compute $\mathbf{x}_k = \mathbf{x}_{k-1} + \alpha_k \mathbf{r}_{k-1}$
    3. Compute $\mathbf{r}_k = \mathbf{r}_{k-1} - \alpha_k A \mathbf{r}_{k-1}$
  3. Return $\mathbf{x}_m$ (see the sketch below)
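
Below is a minimal sketch of the basic iteration above (essentially steepest descent with an exact line search), assuming a small symmetric positive-definite test matrix built just for this illustration (the names A_spd and b_spd are ad hoc):

np.random.seed(0)
n = 10
A_spd = np.random.rand(n, n)
A_spd = A_spd.T @ A_spd + n * np.eye(n) # symmetric positive definite by construction
b_spd = np.random.rand(n)

x = np.zeros(n)
r = b_spd - A_spd @ x
for k in range(200):
    Ar = A_spd @ r
    alpha = (r @ r) / (r @ Ar) # step size alpha_k from the algorithm above
    x = x + alpha * r          # step along the residual direction
    r = r - alpha * Ar         # update the residual without re-forming A x - b

print(np.linalg.norm(A_spd @ x - b_spd)) # should be near zero after enough iterations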

We can stabilize and accelerate the algorithm by including a momentum-like term: a coefficient $\beta_k$ mixes the previous search direction into the update for the new direction $\mathbf{p}_k$. This choice of $\beta_k$ is the Fletcher-Reeves formula.

Conjugate gradient algorithm (modified)

  1. Choose a random starting vector $\mathbf{x}_0$ of length $N$, set the initial residual $\mathbf{r}_0 = \mathbf{b} - A \mathbf{x}_0$, and set the initial search direction $\mathbf{p}_0 = \mathbf{r}_0$
  2. For $k = 1, 2, \ldots, m$:
    1. Compute $\alpha_k = \langle \mathbf{r}_{k-1}, \mathbf{r}_{k-1} \rangle / \langle \mathbf{p}_{k-1}, A \mathbf{p}_{k-1} \rangle$
    2. Compute $\mathbf{x}_k = \mathbf{x}_{k-1} + \alpha_k \mathbf{p}_{k-1}$
    3. Compute $\mathbf{r}_k = \mathbf{r}_{k-1} - \alpha_k A \mathbf{p}_{k-1}$
    4. Compute $\beta_k = \langle \mathbf{r}_k, \mathbf{r}_k \rangle / \langle \mathbf{r}_{k-1}, \mathbf{r}_{k-1} \rangle$
    5. Compute $\mathbf{p}_k = \mathbf{r}_k + \beta_k \mathbf{p}_{k-1}$
  3. Return $\mathbf{x}_m$
What’s going on here?
  • $\mathbf{r}$ is the vector residual between $A$ times our current guess $\mathbf{x}$ and the target $\mathbf{b}$ at the $k^{th}$ step
  • $\mathbf{p}$ is the search direction: it tells us which direction to step in $\mathbf{x}$. The last step looks like a Gram-Schmidt update, but with respect to the inner product defined by $A$: successive directions are made conjugate ($A$-orthogonal) rather than orthogonal



  • $\alpha$ is a step size for updating $\mathbf{x}$ that takes larger steps when the residual is large, but smaller steps when the curvature along the search direction $\mathbf{p}$ (measured by $\langle \mathbf{p}, A \mathbf{p} \rangle$) is large
  • $\beta$ is a weight on the previous search direction that carries momentum from step to step; it stays large when the residual is shrinking slowly, and this mixing is what avoids the oscillations of plain steepest descent



  • At each step, the new residual $\mathbf{r}_k$ is orthogonal not only to $\mathbf{r}_{k-1}$, but to all previous residuals $\mathbf{r}_j$, $j < k$; likewise, $\mathbf{p}_k$ is $A$-conjugate to all previous search directions
class ConjugateGradientMethod:
    """
    Solve a linear system Ax = b using the conjugate gradient method. A must be a 
    symmetric, positive definite matrix.

    Parameters
        max_iter (int): Maximum number of iterations to perform.
        store_history (bool): If True, store the history of the solution at each 
            iteration.
    """

    def __init__(self, max_iter=1000, store_history=False):
        self.max_iter = max_iter
        self.store_history = store_history

    def solve(self, A, b):

        if self.store_history:
            self.history = list()

        x = np.zeros_like(b)
        r = b - A @ x # initial residual (equals b, since x starts at zero)
        p = r # initial search direction
        rs_old = r.T @ r

        for i in range(self.max_iter):
            Ap = A @ p
            alpha = rs_old / (p.T @ Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            rs_new = r.T @ r

            # if np.sqrt(rs_new) < 1e-12:
            #     break

            beta = rs_new / rs_old # ratio of successive squared residual norms
            p = r + beta * p
            rs_old = rs_new

            if self.store_history:
                self.history.append(x.copy())

        return x

    def preconditioner(self, A):
        """
        Return a Jacobi preconditioner for the matrix A.
        """
        return np.diag(1 / np.diag(A))

    def solve_with_preconditioner(self, A, b):

        if self.store_history:
            self.history = list()

        x = np.zeros_like(b)
        r = b - A @ x
        M = self.preconditioner(A)
        z = M @ r # preconditioned residual
        p = z # initial search direction is the preconditioned residual
        rz_old = r.T @ z

        for i in range(self.max_iter):
            Ap = A @ p
            alpha = rz_old / (p.T @ Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            z = M @ r
            rz_new = r.T @ z
            beta = rz_new / rz_old
            p = z + beta * p
            rz_old = rz_new

            if self.store_history:
                self.history.append(x.copy())

        return x


np.random.seed(0)
a = np.random.rand(10, 10)
a = a.T @ a # the Gram matrix A^T A is symmetric and (almost surely) positive definite
b = np.random.rand(10, 1)

model = ConjugateGradientMethod(max_iter=2000, store_history=True)
sol_cg = model.solve(a, b).squeeze()
sol_np = np.linalg.solve(a, b).squeeze()

plt.figure(figsize=(6, 6))
plt.plot(sol_np, sol_cg, '.k', markersize=10)
plt.xlabel("Numpy solution")
plt.ylabel("Conjugate gradient solution")


plt.figure(figsize=(6, 6))
all_norms = np.linalg.norm(np.array(model.history).squeeze() - sol_np, axis=1)
plt.semilogy(all_norms, 'k', markersize=10)
plt.xlabel("Iteration")
plt.ylabel("Error norm (distance to numpy solution)")

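
The class above also defines solve_with_preconditioner, which is not exercised elsewhere in the notebook. As a quick check (a sketch using a fresh instance, so that model.history is left untouched for the animation below), we can confirm that the Jacobi-preconditioned solve agrees with numpy on the same system:

model_pre = ConjugateGradientMethod(max_iter=2000)
sol_pcg = model_pre.solve_with_preconditioner(a, b).squeeze()
print(np.allclose(sol_pcg, sol_np))
print(np.linalg.norm(a @ sol_pcg[:, None] - b)) # residual of the preconditioned solve
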
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt
import numpy as np
from IPython.display import HTML

# Animate the conjugate gradient iterates: each frame plots the current iterate
# against the reference numpy solution. Subsample the stored history to keep it short.
data = model.history[::3][:200]

fig, ax = plt.subplots()

# Initialize the plot with the first frame's data
plt.plot(sol_np, model.history[0], '.k', markersize=20);
plt.xlabel("Numpy solution")
plt.ylabel("Conjugate gradient solution")

def update(frame):
    ax.clear()
    ax.plot(sol_np, frame, '.k', markersize=20)
    ax.set_xlabel("Numpy solution")
    ax.set_ylabel("Conjugate gradient solution")

ani = FuncAnimation(fig, update, frames=data, interval=40)
HTML(ani.to_jshtml())

Appendix and other functions

## Make an interactive video
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets

def plotter(i):
    plt.figure(figsize=(6, 6))
    plt.plot(sol_np, model.history[i], '.k', markersize=20);
    plt.xlabel("Numpy solution")
    plt.ylabel("Conjugate gradient solution")
    plt.show()

interact(
    plotter, 
    i=widgets.IntSlider(0, 0, len(model.history) - 1, 1, layout=Layout(width='800px'))
)


class Householder:
    """
    QR-based linear solver. Note: despite the name, this implementation orthogonalizes
    the columns of A with classical Gram-Schmidt rather than Householder reflections.
    """
    def __init__(self, A):
        self.A = A
        self.Q = np.zeros_like(A, dtype=float)
        self.R = np.zeros_like(A, dtype=float)
        self._decompose()

    def _decompose(self):
        for i in range(self.A.shape[1]):
            self.Q[:, i] = self.A[:, i]
            for j in range(i):
                self.R[j, i] = np.dot(self.Q[:, j], self.A[:, i])
                self.Q[:, i] -= self.R[j, i] * self.Q[:, j]
            self.R[i, i] = np.linalg.norm(self.Q[:, i])
            self.Q[:, i] /= self.R[i, i]

    def solve(self, b):
        y = np.dot(self.Q.T, b)
        return np.linalg.solve(self.R, y)

    def inverse(self):
        # A = Q R, so A^{-1} = R^{-1} Q^T
        return np.linalg.solve(self.R, self.Q.T)

    def __str__(self):
        return "Q:\n{}\nR:\n{}".format(self.Q, self.R)

    def __repr__(self):
        return self.__str__()

class Givens:
    """
    QR-based linear solver. Note: despite the name, this implementation orthogonalizes
    the columns of A with classical Gram-Schmidt rather than Givens rotations.
    """
    def __init__(self, A):
        self.A = A
        self.Q = np.zeros_like(A, dtype=float)
        self.R = np.zeros_like(A, dtype=float)
        self._decompose()

    def _decompose(self):
        for i in range(self.A.shape[1]):
            self.Q[:, i] = self.A[:, i]
            for j in range(i):
                self.R[j, i] = np.dot(self.Q[:, j], self.A[:, i])
                self.Q[:, i] -= self.R[j, i] * self.Q[:, j]
            self.R[i, i] = np.linalg.norm(self.Q[:, i])
            self.Q[:, i] /= self.R[i, i]

    def solve(self, b):
        y = np.dot(self.Q.T, b)
        return np.linalg.solve(self.R, y)

    def inverse(self):
        # A = Q R, so A^{-1} = R^{-1} Q^T
        return np.linalg.solve(self.R, self.Q.T)

    def __str__(self):
        return "Q:\n{}\nR:\n{}".format(self.Q, self.R)

    def __repr__(self):
        return self.__str__()

class QR:

    def __init__(self, A):
        self.A = A
        self.Q = np.zeros_like(A, dtype=float)
        self.R = np.zeros_like(A, dtype=float)
        self._decompose()

    def _decompose(self):
        for i in range(self.A.shape[1]):
            self.Q[:, i] = self.A[:, i]
            for j in range(i):
                self.R[j, i] = np.dot(self.Q[:, j], self.A[:, i])
                self.Q[:, i] -= self.R[j, i] * self.Q[:, j]
            self.R[i, i] = np.linalg.norm(self.Q[:, i])
            self.Q[:, i] /= self.R[i, i]

    def solve(self, b):
        y = np.dot(self.Q.T, b)
        return np.linalg.solve(self.R, y)

    def inverse(self):
        # A = Q R, so A^{-1} = R^{-1} Q^T
        return np.linalg.solve(self.R, self.Q.T)


class Cholesky:
        
    def __init__(self, A):
        self.A = A
        self.L = np.zeros_like(A, dtype=float)
        self._decompose()

    def _decompose(self):
        for i in range(self.A.shape[0]):
            for j in range(i):
                self.L[i, j] = (self.A[i, j] - np.dot(self.L[i, :j], self.L[j, :j])) / self.L[j, j]
            self.L[i, i] = np.sqrt(self.A[i, i] - np.dot(self.L[i, :i], self.L[i, :i]))

    def solve(self, b):
        y = np.linalg.solve(self.L, b)
        return np.linalg.solve(self.L.T, y)

    def inverse(self):
        return np.linalg.solve(self.L.T, np.linalg.solve(self.L, np.eye(self.A.shape[0])))

    def __str__(self):
        return "L:\n{}\nL.T:\n{}".format(self.L, self.L.T)

    def __repr__(self):
        return self.__str__()


def test():
    # Use a float, symmetric positive-definite test matrix so the Cholesky factorization exists
    M = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 10.]])
    A = M @ M.T
    b = np.array([3., 3., 4.])
    print("A:\n{}\nb:\n{}".format(A, b))
    print("QR decomposition")

    qr = QR(A)
    print(qr)
    print("QR solution")
    print(qr.solve(b))
    print("QR inverse")
    print(qr.inverse())
    print("Cholesky decomposition")
    cholesky = Cholesky(A)
    print(cholesky)
    print("Cholesky solution")
    print(cholesky.solve(b))
    print("Cholesky inverse")
    print(cholesky.inverse())
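
As a quick sanity check, the short sketch below compares the appendix solvers against numpy on a small symmetric positive-definite system (the names M, A_test, and b_test are defined only for this check):

M = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 10.]])
A_test = M @ M.T # symmetric positive definite, since M is nonsingular
b_test = np.array([3., 3., 4.])
x_ref = np.linalg.solve(A_test, b_test)

print(np.allclose(QR(A_test).solve(b_test), x_ref))
print(np.allclose(Cholesky(A_test).solve(b_test), x_ref))
print(np.allclose(QR(A_test).inverse() @ b_test, x_ref))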