
Subspace Iteration can dramatically improve on the performance of the Randomized SVD when $\vec{A}$ has a heavy tail. The key observation is that the singular value tail of $(\vec{A}\vec{A}^\T)^q\vec{A}$ is much lighter than that of $\vec{A}$. This is because the polynomial $x^{2q+1}$ pushes the tail singular values of $\vec{A}$ to zero.

From a computational perspective, there isn’t much special about $x^{2q+1}$. In fact, we can replace this monomial with any polynomial of the same degree containing only odd-degree terms, without significantly changing the cost of the algorithm. Thus, we might consider an approximation of the form:

$$\widehat{\vec{A}} = \vec{Q}\vec{Q}^\T \vec{A}, \quad \vec{Q} = \Call{orth}(p(\vec{A}\vec{A}^\T)\vec{A}\vec{\Omega}), \quad \deg(p) \leq q. \tag{5.7}$$

It’s conceivable that, if we choose a good polynomial, this method could be more accurate than subspace iteration. For instance, Chebyshev polynomials grow substantially faster than monomials outside the interval $[-1,1]$.
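To make this concrete, here is a minimal sketch of how a Chebyshev filter $T_q(\vec{A}\vec{A}^\T/\rho^2)\vec{A}\vec{\Omega}$ could be applied using only products with $\vec{A}$ and $\vec{A}^\T$, via the three-term recurrence $T_{j+1}(x) = 2x\,T_j(x) - T_{j-1}(x)$. The function name and the cutoff parameter rho are illustrative choices, not part of a fixed algorithm.

import numpy as np

def chebyshev_filter(A, Om, q, rho):
    # Apply T_q(A A^T / rho^2) to (A @ Om) using only products with A and A.T.
    # T_q is the degree-q Chebyshev polynomial of the first kind; rho acts as a
    # rough cutoff: singular values below rho are damped, those above amplified.
    # (Illustrative sketch; not a routine from these notes.)
    AAt = lambda X: A @ (A.T @ X)          # one product each with A.T and A

    V0 = A @ Om                            # T_0(x) = 1
    if q == 0:
        return V0
    V1 = AAt(V0) / rho**2                  # T_1(x) = x
    for _ in range(q - 1):                 # T_{j+1}(x) = 2 x T_j(x) - T_{j-1}(x)
        V0, V1 = V1, 2 * AAt(V1) / rho**2 - V0
    return V1

Each pass through the loop costs one product with $\vec{A}^\T$ and one with $\vec{A}$, the same as one step of subspace iteration.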

Block Krylov Iteration

The difficulty with the approach in (5.7) is that we don’t know how to choose a good polynomial in advance, since we don’t know the singular values of $\vec{A}$.

Observe that for any polynomial $p(x)$ with $\deg(p) < t$,

$$p(\vec{A}\vec{A}^\T)\vec{A}\vec{\Omega} \in \mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega}),$$

where

$$\mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega}) := \operatorname{span}\left\{ \vec{A}\vec{\Omega},\; (\vec{A}\vec{A}^\T)\vec{A}\vec{\Omega},\; \ldots,\; (\vec{A}\vec{A}^\T)^{t-1}\vec{A}\vec{\Omega} \right\}.$$

In particular, note that, using the same number of matrix-vector products with $\vec{A}$ and $\vec{A}^\T$ as required to compute $p(\vec{A}\vec{A}^\T)\vec{A}\vec{\Omega}$, we can actually compute an orthonormal basis for the entire block Krylov subspace $\mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega})$.
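As a sanity check on this claim, the following sketch builds such a basis block by block; producing each new block costs one product with $\vec{A}^\T$ and one with $\vec{A}$, exactly as in evaluating a degree-$(t-1)$ polynomial in $\vec{A}\vec{A}^\T$. The function name is ours, and the single Gram–Schmidt pass is a simplification of what a numerically robust implementation would do.

import numpy as np

def block_krylov_basis(A, Om, t):
    # Orthonormal basis for K_t(A A^T, A Om), built one block at a time.
    # Simplified sketch: a single Gram-Schmidt pass per block; a practical code
    # (like the block Lanczos routine used below) reorthogonalizes more carefully.
    V = np.linalg.qr(A @ Om)[0]            # first block: orth(A Om)
    blocks = [V]
    for _ in range(t - 1):
        W = A @ (A.T @ V)                  # next block: (A A^T) times the previous one
        Q = np.hstack(blocks)
        W = W - Q @ (Q.T @ W)              # orthogonalize against earlier blocks
        V = np.linalg.qr(W)[0]
        blocks.append(V)
    return np.hstack(blocks)               # up to t*b orthonormal columns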

This suggests the Randomized Block Krylov Iteration (RBKI) approximation:

$$\widehat{\vec{A}} = \vec{Q}\vec{Q}^\T \vec{A}, \quad \vec{Q} = \Call{orth}(\mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega})).$$

Intuitively, by projecting onto a larger space, we can get a better approximation.
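Since $\vec{Q}$ has orthonormal columns, $\llbracket \vec{Q}\vec{Q}^\T\vec{A} \rrbracket_k = \vec{Q}\llbracket \vec{Q}^\T\vec{A} \rrbracket_k$, so the best rank-$k$ approximation to $\vec{A}$ within $\operatorname{range}(\vec{Q})$ can be extracted from an SVD of the small matrix $\vec{Q}^\T\vec{A}$. A minimal sketch (the function name is ours; the experiment below follows the same pattern):

import numpy as np

def rbki_low_rank(A, Q, k):
    # Best rank-k approximation to A whose column space lies in range(Q),
    # via an SVD of the small projected matrix Q^T A. (Illustrative sketch.)
    X = Q.T @ A
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Q @ (U[:, :k] * s[:k]) @ Vt[:k]     # Q [[ Q^T A ]]_k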

Block-size versus depth

The orthonormal basis for $\mathcal{K}_t(\vec{A}\vec{A}^\T,\vec{A}\vec{\Omega})$ has up to $tb$ columns, where $b$ is the number of columns in $\vec{\Omega}$, so we may hope to obtain a good rank-$k$ approximation even if $b < k$. However, this raises the practical question of how the block size should be set. We tend to observe two competing factors:

  • Smaller is better: While the worst-case matrix-vector product complexities for block sizes $b=1$ and $b=k$ match, the convergence of Krylov methods on a given instance $\vec{A}$ is highly dependent on $\vec{A}$’s spectrum, so “better-than-worst-case” performance is often observed. Theoretical evidence suggests that, in the absence of small singular value gaps, choosing $b=1$ typically requires fewer matrix-vector products to reach a desired level of accuracy on non-worst-case instances; see Meyer et al., 2024.

  • Bigger is better: Due to parallelism, caching, and dedicated hardware accelerators in modern computing systems, it is typically much faster to compute multiple matrix-vector products all at once (i.e., grouped into a matrix-matrix product) than one after the other; the timing sketch after this list illustrates the effect. This was observed in our experiments on the cost of NLA. RBKI groups matrix-vector products into blocks of size $b$, so for a fixed number of total matrix-vector products, we expect the method to run faster if $b$ is larger. The benefits of blocking into matrix-matrix products can be even more pronounced when $\vec{A}$ is too large to store in fast memory, in which case the cost of reading $\vec{A}$ from slow memory is a bottleneck, and using a larger block size $b \gg 1$ can be essentially “free”.

These two effects are in competition with one another, and one observes (e.g., in the experiment below) that the best block size $b$ often satisfies $1 \ll b \ll k$.
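To see the blocking effect in isolation, one can time the same set of matrix-vector products executed one at a time versus grouped into a single matrix-matrix product. A rough sketch (the sizes are arbitrary, and the measured gap depends heavily on the machine and BLAS library):

import numpy as np
from time import perf_counter

rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 4000))
X = rng.standard_normal((4000, 200))

start = perf_counter()
for i in range(X.shape[1]):
    y = A @ X[:, i]                    # 200 separate matrix-vector products
t_seq = perf_counter() - start

start = perf_counter()
Y = A @ X                              # the same products as one matrix-matrix product
t_blk = perf_counter() - start

print(f'sequential: {t_seq:.3f}s, blocked: {t_blk:.3f}s')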

Convergence Guarantees

Theoretical work answering this question is provided in Chen et al., 2025, which builds on Meyer et al., 2024. In particular, Chen et al., 2025 shows that, up to logarithmic factors in $n$ and certain spectral properties of $\vec{A}$, the number of matrix-vector products required to obtain a near-optimal approximation can be bounded independently of the block size.

This means that users are free to choose the block size based on their computational environment.

Numerical Experiment

# load experiment
import numpy as np
import matplotlib.pyplot as plt

# block_lanczos (defined elsewhere in these notes) runs block Lanczos on the given
# operator, returning an orthonormal Krylov basis, the block-tridiagonal data, and
# cumulative wall-clock times per iteration.

n = 4000                              # matrix dimension
name = 'intro'
Λ = np.geomspace(1,100,n)             # singular values, geometrically spaced
blocksizes = [1,5,20,100,200]         # block sizes b to compare
qs = [650,200,60,11,7]                # number of iterations for each block size
skips = [20,4,1,1,1]                  # evaluate the error every `skip` iterations
k = 100                               # target rank

A = np.diag(Λ)
err_opt = np.linalg.norm(np.sort(Λ)[:-k])   # optimal rank-k error: norm of the tail

class AAt:
    def __matmul__(self,X):           # so that AAt @ X computes A (A^T X)
        return A@(A.T@X)

AAt = AAt()

# run experiment
all_times = {}
all_errs = {}
for b,q,skip in zip(blocksizes,qs,skips):

    B = np.random.randn(n,b)          # random starting block

    Q,_,times = block_lanczos(AAt,B,q,reorth=True)

    all_times[b] = times
    all_errs[b] = np.zeros(q+1)

    for j in range(0,q+1,skip):

        Zj = Q[:,:b*j]                # Krylov basis after j iterations

        # best rank-k approximation within range(Zj), via an SVD of Zj^T A
        X = Zj.T@A
        Qtilde,s,Vt = np.linalg.svd(X,full_matrices=False)

        A_hat = Zj@(Qtilde[:,:k]@(np.diag(s[:k])@Vt[:k]))
        err = np.linalg.norm(A - A_hat)

        all_errs[b][j] = err

# plot the error
fig, axs = plt.subplots(1,2,figsize=(8, 4),sharey=True)
plt.subplots_adjust(wspace=0.1)

for b,q,skip in zip(blocksizes,qs,skips):

    errs = all_errs[b]
    times = all_times[b]

    # left panel: accuracy versus number of matrix-vector products
    ax = axs[0]
    ax.set_ylabel(r'accuracy $\varepsilon$')
    ax.plot(np.arange(0,q+1)[::skip]*b,errs[::skip]/err_opt-1,ms=5,label=f'$b={b}$')
    ax.set_yscale('log')
    ax.set_ylim(1e-6,1e0)
    ax.set_xlabel('matrix-vector products')

    # right panel: accuracy versus wall-clock time
    ax = axs[1]
    ax.plot(times[::skip],errs[::skip]/err_opt-1,ms=5,label=f'$b={b}$')
    ax.set_xlabel('wall-clock time (s)')
    ax.set_xlim(-.1,3)
    ax.legend()

plt.savefig('RBKI.svg')
plt.close()

We plot the smallest value of $\varepsilon$ for which $\|\vec{A} - \llbracket \widehat{\vec{A}} \rrbracket_k \|_\F \leq (1+\varepsilon) \|\vec{A} - \llbracket \vec{A} \rrbracket_k\|_\F$.

References
  1. Meyer, R. A., Musco, C., & Musco, C. (2024). On the Unreasonable Effectiveness of Single Vector Krylov Methods for Low-Rank Approximation. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (pp. 811–845). Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611977912.32
  2. Chen, T., Epperly, E. N., Meyer, R. A., Musco, C., & Rao, A. (2025). Does block size matter in randomized block Krylov low-rank approximation?