
Subspace Iteration can dramatically improve on the performance of the Randomized SVD when $\vec{A}$ has a heavy tail. The key observation is that the singular value tail of $(\vec{A}\vec{A}^\T)^q\vec{A}$ is much lighter than that of $\vec{A}$. This is because the polynomial $x^{2q+1}$ pushes the tail singular values of $\vec{A}$ to zero.

From a computational perspective, there isn’t much special about $x^{2q+1}$. In fact, we can replace this monomial with any polynomial of the same degree containing only odd-degree terms, without significantly changing the cost of the algorithm. Thus, we might consider an approximation of the form:

$$\widehat{\vec{A}} = \vec{Q}\vec{Q}^\T \vec{A}, \quad \vec{Q} = \Call{orth}(p(\vec{A}\vec{A}^\T)\vec{A}\vec{\Omega}), \quad \deg(p) \leq q. \tag{5.7}$$

It’s conceivable that, if we choose a good polynomial, this method could be more accurate than subspace iteration. For instance, Chebyshev polynomials grow substantially faster than monomials outside the interval $[-1,1]$.
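To make this concrete, here is a minimal sketch of how a Chebyshev filter $T_q(\vec{A}\vec{A}^\T/\rho^2)\vec{A}\vec{\Omega}$ could be applied using only products with $\vec{A}$ and $\vec{A}^\T$, via the three-term recurrence $T_{j+1}(x) = 2x\,T_j(x) - T_{j-1}(x)$. The function name and the cutoff parameter rho are illustrative choices, not part of a fixed algorithm.

import numpy as np

def chebyshev_filter(A, Om, q, rho):
    # Apply T_q(A A^T / rho^2) to (A @ Om) using only products with A and A.T.
    # T_q is the degree-q Chebyshev polynomial of the first kind; rho acts as a
    # rough cutoff: singular values below rho are damped, those above amplified.
    # (Illustrative sketch; not a routine from these notes.)
    AAt = lambda X: A @ (A.T @ X)          # one product each with A.T and A

    V0 = A @ Om                            # T_0(x) = 1
    if q == 0:
        return V0
    V1 = AAt(V0) / rho**2                  # T_1(x) = x
    for _ in range(q - 1):                 # T_{j+1}(x) = 2 x T_j(x) - T_{j-1}(x)
        V0, V1 = V1, 2 * AAt(V1) / rho**2 - V0
    return V1

Each pass through the loop costs one product with $\vec{A}^\T$ and one with $\vec{A}$, the same as one step of subspace iteration.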

Block Krylov Iteration

The difficulty with the approach in (5.7) is that we don’t know how to choose a good polynomial in advance, since we don’t know the singular values of $\vec{A}$.

Observe that for any polynomial $p(x)$ with $\deg(p) < t$,

$$p(\vec{A}\vec{A}^\T)\vec{A}\vec{\Omega} \in \mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega}),$$

where

$$\mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega}) := \operatorname{span}\left\{ \vec{A}\vec{\Omega},\; (\vec{A}\vec{A}^\T)\vec{A}\vec{\Omega},\; \ldots,\; (\vec{A}\vec{A}^\T)^{t-1}\vec{A}\vec{\Omega} \right\}.$$

In particular, note that, using the same number of matrix-vector products with $\vec{A}$ and $\vec{A}^\T$ as required to compute $p(\vec{A}\vec{A}^\T)\vec{A}\vec{\Omega}$, we can actually compute an orthonormal basis for the entire block Krylov subspace $\mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega})$.
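As a sanity check on this claim, the following sketch builds such a basis block by block; producing each new block costs one product with $\vec{A}^\T$ and one with $\vec{A}$, exactly as in evaluating a degree-$(t-1)$ polynomial in $\vec{A}\vec{A}^\T$. The function name is ours, and the single Gram–Schmidt pass is a simplification of what a numerically robust implementation would do.

import numpy as np

def block_krylov_basis(A, Om, t):
    # Orthonormal basis for K_t(A A^T, A Om), built one block at a time.
    # Simplified sketch: a single Gram-Schmidt pass per block; a practical code
    # (like the block Lanczos routine used below) reorthogonalizes more carefully.
    V = np.linalg.qr(A @ Om)[0]            # first block: orth(A Om)
    blocks = [V]
    for _ in range(t - 1):
        W = A @ (A.T @ V)                  # next block: (A A^T) times the previous one
        Q = np.hstack(blocks)
        W = W - Q @ (Q.T @ W)              # orthogonalize against earlier blocks
        V = np.linalg.qr(W)[0]
        blocks.append(V)
    return np.hstack(blocks)               # up to t*b orthonormal columns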

This suggests the Randomized Block Krylov Iteration (RBKI) approximation:

$$\widehat{\vec{A}} = \vec{Q}\vec{Q}^\T \vec{A}, \quad \vec{Q} = \Call{orth}(\mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega})).$$

Intuitively, by projecting onto a larger space, we can get a better approximation.
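Since $\vec{Q}$ has orthonormal columns, $\llbracket \vec{Q}\vec{Q}^\T\vec{A} \rrbracket_k = \vec{Q}\llbracket \vec{Q}^\T\vec{A} \rrbracket_k$, so the best rank-$k$ approximation to $\vec{A}$ within $\operatorname{range}(\vec{Q})$ can be extracted from an SVD of the small matrix $\vec{Q}^\T\vec{A}$. A minimal sketch (the function name is ours; the experiment below follows the same pattern):

import numpy as np

def rbki_low_rank(A, Q, k):
    # Best rank-k approximation to A whose column space lies in range(Q),
    # via an SVD of the small projected matrix Q^T A. (Illustrative sketch.)
    X = Q.T @ A
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Q @ (U[:, :k] * s[:k]) @ Vt[:k]     # Q [[ Q^T A ]]_k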

Block-size versus depth

The orthonormal basis for $\mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega})$ has up to $tb$ columns, where $b$ is the number of columns in $\vec{\Omega}$, so we may hope to obtain a good rank-$k$ approximation even if $b < k$. However, this raises the practical question of how the block size should be set. We tend to observe two competing factors:

  • Smaller is better: While the worst-case matrix-vector product complexities for block sizes $b=1$ and $b=k$ match, the convergence of Krylov methods on a given instance $\vec{A}$ is highly dependent on $\vec{A}$’s spectrum, so “better-than-worst-case” performance is often observed. Theoretical evidence suggests that, in the absence of small singular value gaps, choosing $b=1$ typically requires fewer matrix-vector products to reach a desired level of accuracy on non-worst-case instances; see Meyer et al., 2024.

  • Bigger is better: Due to parallelism, caching, and dedicated hardware accelerators in modern computing systems, it is typically much faster to compute multiple matrix-vector products all at once (i.e., grouped into a matrix-matrix product) than one after the other; the timing sketch after this list illustrates the effect. This was observed in our experiments on the cost of NLA. RBKI groups matrix-vector products into blocks of size $b$, so for a fixed number of total matrix-vector products, we expect the method to run faster if $b$ is larger. The benefits of blocking into matrix-matrix products can be even more pronounced when $\vec{A}$ is too large to store in fast memory, in which case the cost of reading $\vec{A}$ from slow memory is a bottleneck, and using a larger block size $b \gg 1$ can be essentially “free”.

These two effects are in competition with one another, and one observes (e.g., in the experiment below) that the best block size $b$ often satisfies $1 \ll b \ll k$.
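To see the blocking effect in isolation, one can time the same set of matrix-vector products executed one at a time versus grouped into a single matrix-matrix product. A rough sketch (the sizes are arbitrary, and the measured gap depends heavily on the machine and BLAS library):

import numpy as np
from time import perf_counter

rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 4000))
X = rng.standard_normal((4000, 200))

start = perf_counter()
for i in range(X.shape[1]):
    y = A @ X[:, i]                    # 200 separate matrix-vector products
t_seq = perf_counter() - start

start = perf_counter()
Y = A @ X                              # the same products as one matrix-matrix product
t_blk = perf_counter() - start

print(f'sequential: {t_seq:.3f}s, blocked: {t_blk:.3f}s')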

Convergence Guarantees

Theoretical work answering this question is provided in Chen et al., 2025, which builds on Meyer et al., 2024. In particular, Chen et al., 2025 shows that, up to logarithmic factors in $n$ and certain spectral properties of $\vec{A}$, the number of matrix-vector products required to obtain a near-optimal approximation can be bounded independently of the block size.

This means that users are free to choose the block size based on their computational environment.

Numerical Experiment

# load experiment
import numpy as np
import matplotlib.pyplot as plt

# block_lanczos (defined elsewhere in these notes) runs block Lanczos on the given
# operator, returning an orthonormal Krylov basis, the block-tridiagonal data, and
# cumulative wall-clock times per iteration.

n = 4000                              # matrix dimension
name = 'intro'
Λ = np.geomspace(1,100,n)             # singular values, geometrically spaced
blocksizes = [1,5,20,100,200]         # block sizes b to compare
qs = [650,200,60,11,7]                # number of iterations for each block size
skips = [20,4,1,1,1]                  # evaluate the error every `skip` iterations
k = 100                               # target rank

A = np.diag(Λ)
err_opt = np.linalg.norm(np.sort(Λ)[:-k])   # optimal rank-k error: norm of the tail

class AAt:
    def __matmul__(self,X):           # so that AAt @ X computes A (A^T X)
        return A@(A.T@X)

AAt = AAt()

# run experiment
all_times = {}
all_errs = {}
for b,q,skip in zip(blocksizes,qs,skips):

    B = np.random.randn(n,b)          # random starting block

    Q,_,times = block_lanczos(AAt,B,q,reorth=True)

    all_times[b] = times
    all_errs[b] = np.zeros(q+1)

    for j in range(0,q+1,skip):

        Zj = Q[:,:b*j]                # Krylov basis after j iterations

        # best rank-k approximation within range(Zj), via an SVD of Zj^T A
        X = Zj.T@A
        Qtilde,s,Vt = np.linalg.svd(X,full_matrices=False)

        A_hat = Zj@(Qtilde[:,:k]@(np.diag(s[:k])@Vt[:k]))
        err = np.linalg.norm(A - A_hat)

        all_errs[b][j] = err

# plot the error
fig, axs = plt.subplots(1,2,figsize=(8, 4),sharey=True)
plt.subplots_adjust(wspace=0.1)

for b,q,skip in zip(blocksizes,qs,skips):

    errs = all_errs[b]
    times = all_times[b]

    # left panel: accuracy versus number of matrix-vector products
    ax = axs[0]
    ax.set_ylabel(r'accuracy $\varepsilon$')
    ax.plot(np.arange(0,q+1)[::skip]*b,errs[::skip]/err_opt-1,ms=5,label=f'$b={b}$')
    ax.set_yscale('log')
    ax.set_ylim(1e-6,1e0)
    ax.set_xlabel('matrix-vector products')

    # right panel: accuracy versus wall-clock time
    ax = axs[1]
    ax.plot(times[::skip],errs[::skip]/err_opt-1,ms=5,label=f'$b={b}$')
    ax.set_xlabel('wall-clock time (s)')
    ax.set_xlim(-.1,3)
    ax.legend()

plt.savefig('RBKI.svg')
plt.close()

We plot the smallest value of $\varepsilon$ for which $\|\vec{A} - \llbracket \widehat{\vec{A}} \rrbracket_k \|_\F \leq (1+\varepsilon) \|\vec{A} - \llbracket \vec{A} \rrbracket_k\|_\F$.

References
  1. Meyer, R. A., Musco, C., & Musco, C. (2024). On the Unreasonable Effectiveness of Single Vector Krylov Methods for Low-Rank Approximation. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (pp. 811–845). Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611977912.32
  2. Chen, T., Epperly, E. N., Meyer, R. A., Musco, C., & Rao, A. (2025). Does block size matter in randomized block Krylov low-rank approximation?