
For simplicity, we consider the case when $\vec{A}$ is symmetric positive definite. In this case, it is reasonable to ask for a relative approximation to the trace; that is, we want an estimate of $\tr(\vec{A})$ that lives in the interval $[(1-\varepsilon)\tr(\vec{A}), (1+\varepsilon)\tr(\vec{A})]$ for some $\varepsilon > 0$.

The variance of the Girard-Hutchinson estimator $\widehat{\tr}_m(\cdot)$ scales as $O(1/m)$, where $m$ is the number of samples used. For instance, if we use $m$ Gaussian test vectors, then

$$\EE\left[\widehat{\tr}_m(\vec{A})\right] = \tr(\vec{A}), \qquad \VV\left[\widehat{\tr}_m(\vec{A})\right] = \frac{2\Vert\vec{A}\Vert_\F^2}{m} \leq \frac{2\tr(\vec{A})^2}{m}.$$
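For concreteness, here is a minimal NumPy implementation of this estimator with Gaussian test vectors. The name girard_hutchinson matches the function referenced in the numerical example later in this section.

import numpy as np

def girard_hutchinson(A, m):
    n, _ = A.shape
    X = np.random.randn(n, m)                    # Gaussian test vectors
    return np.mean(np.sum(X * (A @ X), axis=0))  # average of x_i^T A x_i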

To improve the accuracy of the estimator, we can use variance reduction techniques.

The general idea of modern variance reduction techniques for implicit trace estimation (Meyer et al., 2021; Persson et al., 2022; Epperly et al., 2024) is to use a control variate. The basic idea is simple.

Fix any matrix $\vec{B}$ and consider the estimator

$$\widehat{\tr}_m(\vec{A},\vec{B}) := \tr(\vec{B}) + \widehat{\tr}_m(\vec{A} - \vec{B}).$$

This estimator can be implemented so long as we know $\tr(\vec{B})$ and can evaluate the maps $\vec{x} \mapsto \vec{x}^\T\vec{A}\vec{x}$ and $\vec{x} \mapsto \vec{x}^\T\vec{B}\vec{x}$ for any vector $\vec{x}$. The number of matrix-vector products needed to compute $\widehat{\tr}_m(\vec{A},\vec{B})$ is the same as for $\widehat{\tr}_m(\vec{A})$, namely $m$. The advantage of the new estimator is that if $\vec{B}$ is a good approximation of $\vec{A}$, then

$$\VV[\widehat{\tr}_m(\vec{A},\vec{B})] = \VV[\widehat{\tr}_m(\vec{A} - \vec{B})]$$

can be much smaller than $\VV[\widehat{\tr}_m(\vec{A})]$.
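To make the recipe concrete, here is a minimal sketch of a control-variate estimator; the helper name trace_with_control_variate and the argument matvec_diff (a function applying $\vec{A} - \vec{B}$ to a block of vectors) are illustrative choices, not from the references.

import numpy as np

def trace_with_control_variate(matvec_diff, trB, n, m):
    # trB is the known trace of B; matvec_diff(X) returns (A - B) @ X
    X = np.random.randn(n, m)                  # Gaussian test vectors
    quad = np.sum(X * matvec_diff(X), axis=0)  # x_i^T (A - B) x_i for each i
    return trB + np.mean(quad)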

Hutch++

Hutch++ is perhaps the best-known variance reduction technique for implicit trace estimation. The idea behind the Hutch++ algorithm is to use a low-rank approximation to $\vec{A}$, computed using the Randomized SVD, as the control variate.

The intuition is simple:

  • if $\vec{A}$ has a flat spectrum, then $\Vert\vec{A}\Vert_\F \ll \tr(\vec{A})$, and the regular Girard-Hutchinson estimator works well;

  • if $\vec{A}$ has a decaying spectrum, then it is nearly low-rank, and using the control variate is helpful.

Below, we present a simple variant of Hutch++ that admits a straightforward analysis. The exact variants studied in the original paper differ slightly in the details, but the overall idea is the same. Readers can also look at Raphael’s Blog Post on Hutch++ for more details.

Observe that Hutch++ requires $m$ matrix-vector products with $\vec{A}$. Indeed, $(m+2)/4$ are used to compute the sketch $\vec{A}\vec{\Omega}$, $(m+2)/4$ are used to compute $\vec{Q}^\T\vec{A}$ (where $\vec{Q}$ is an orthonormal basis for the range of $\vec{A}\vec{\Omega}$, so the control variate is $\vec{B} = \vec{Q}\vec{Q}^\T\vec{A}$), and $(m-2)/2$ are used to compute the products $\vec{A}\vec{x}_i$ with the Gaussian test vectors $\vec{x}_i$; in total, $(m+2)/4 + (m+2)/4 + (m-2)/2 = m$.

Below we implement the Hutch++ estimator. Note that, by the cyclic property of the trace, $\tr(\vec{B}) = \tr(\vec{Q}^\T\vec{A}\vec{Q})$. The latter is cheaper to compute, since $\vec{Q}^\T\vec{A}\vec{Q}$ is a small matrix while $\vec{Q}\vec{Q}^\T\vec{A}$ is a large matrix.

import numpy as np

def hutchpp(A, m):

    m1 = (m+2)//4   # sketch size for the low-rank approximation
    m2 = m - 2*m1   # number of test vectors for the residual
    n, _ = A.shape

    # low-rank control variate B = QQ^T A via the Randomized SVD
    Ω = np.random.randn(n, m1)
    Q, _ = np.linalg.qr(A@Ω, mode='reduced')
    Z = A.T@Q
    trB = np.trace(Z.T@Q)  # tr(B) = tr(Q^T A Q) by cyclicity

    # Girard-Hutchinson estimate of tr(A - B)
    X = np.random.randn(n, m2)
    trm = np.mean(np.diag(X.T@(A@X) - (X.T@Q)@(Z.T@X)))

    return trB + trm

The convergence of Hutch++ compares favorably to (6.3): the variance decays at the rate $O(1/m^2)$ rather than $O(1/m)$.

Proof

Let $k+p$ be the number of samples used to sketch $\vec{A}$ in order to build the low-rank approximation, and $\ell$ be the number of samples used to estimate the trace of $\vec{A} - \vec{B}$.

As noted above, $\EE[\widehat{\tr}_m^{++}(\vec{A})] = \EE[\EE[\widehat{\tr}_m^{++}(\vec{A}) \mid \vec{B}]] = \tr(\vec{A})$ and $\VV[\widehat{\tr}_m^{++}(\vec{A}) \mid \vec{B}] = \VV[\widehat{\tr}_m(\vec{A} - \vec{B}) \mid \vec{B}] = 2\Vert\vec{A} - \vec{B}\Vert_\F^2/\ell$. Since the conditional mean is the constant $\tr(\vec{A})$, the law of total variance gives $\VV[\widehat{\tr}_m^{++}(\vec{A})] = \EE[\VV[\widehat{\tr}_m^{++}(\vec{A}) \mid \vec{B}]] = 2\,\EE[\Vert\vec{A} - \vec{B}\Vert_\F^2]/\ell$. Now, by Theorem 5.1, a standard bound for the Randomized SVD, we have

$$\EE\left[\Vert\vec{A} - \vec{B}\Vert_\F^2\right] = \EE\left[\Vert\vec{A} - \vec{Q}\vec{Q}^\T\vec{A}\Vert_\F^2\right] \leq \left(1 + \frac{k}{p-1}\right)\Vert\vec{A} - \llbracket\vec{A}\rrbracket_k\Vert_\F^2.$$

As proved in Lemma 7 in Gilbert et al., 2007,

$$\Vert\vec{A} - \llbracket\vec{A}\rrbracket_k\Vert_\F \leq \frac{1}{2\sqrt{k}}\tr(\vec{A}).$$

Thus, setting $p = k+1$ (so that $k/(p-1) = 1$),

$$\VV[\widehat{\tr}_m^{++}(\vec{A})] \leq \frac{2}{\ell}\left(1 + \frac{k}{p-1}\right)\frac{1}{4k}\tr(\vec{A})^2 = \frac{1}{\ell k}\tr(\vec{A})^2.$$

Finally, the choices $k = (m-2)/4$ and $\ell = (m-2)/2$ yield the stated bound, $\VV[\widehat{\tr}_m^{++}(\vec{A})] \leq \frac{8}{(m-2)^2}\tr(\vec{A})^2$.

Numerical Example

Let’s set up a numerical experiment to compare Hutch++ to the Girard-Hutchinson estimator. We’ll compare the performance on a problem with fast spectral decay and one with slow spectral decay.

Since we only use Gaussian queries, which are rotationally invariant, we can use a diagonal test problem without loss of generality.

import numpy as np
import scipy as sp
import scipy.sparse  # used below as sp.sparse.diags
import matplotlib.pyplot as plt

# Generate test problems with different spectral decay rates
n = 3000

test_problems = [
    {'name': 'fast decay', 'Λ': 1/np.arange(1,n+1)**3},
    {'name': 'slow decay', 'Λ': 1/np.arange(1,n+1)**1}
]

# Define trace estimation methods
trace_methods = [
    {'name': 'Girard-Hutchinson',
     'func': girard_hutchinson},
    {'name': 'Hutch++',
     'func': hutchpp}
]
n_ms = 5
ms = np.geomspace(10,1000,n_ms,dtype='int')

# Run experiments
n_repeat = 100
errors = {}

for test_problem in test_problems:
    problem_name = test_problem['name']
    Λ = test_problem['Λ']
    A = sp.sparse.diags(Λ)
    tr_true = np.sum(Λ)

    for trace_method in trace_methods:
        alg_name = trace_method['name']
        alg = trace_method['func']

        err = np.zeros((n_ms,n_repeat))
        for i,m in enumerate(ms):
            for j in range(n_repeat):
                err[i,j] = np.abs(alg(A,m) - tr_true) / tr_true

        errors[problem_name,alg_name] = err

# Plot the results
σ = .1  # quantile level: shaded bands span the σ to 1-σ quantiles
fig, axs = plt.subplots(1,2,figsize=(10, 4),sharey=True)

for i,test_problem in enumerate(test_problems):
    problem_name = test_problem['name']
    ax = axs[i]
    ax.set_title(problem_name)
    
    for trace_method in trace_methods:
        alg_name = trace_method['name']
        err = errors[problem_name,alg_name]

        bot,mid,top = np.quantile(err, [σ, .5, 1-σ], axis=1)

        ax.plot(ms,mid,label=alg_name)
        ax.fill_between(ms,bot,top,alpha=.2)

    ax.set_xlabel('matvecs $m$')
    ax.set_xscale('log')

axs[0].set_ylabel(r'error: $|\operatorname{tr}(A) - \mathrm{alg}|/\operatorname{tr}(A)$')
plt.yscale('log')
plt.legend()

plt.savefig('hutchpp.svg')
plt.close()

The results are as expected. When the eigenvalue decay is fast, Hutch++ performs substantially better than the Girard-Hutchinson estimator, since it can effectively remove much of the trace mass with the low-rank approximation. When the decay is slow, the two methods perform comparably.

Figure: relative error versus number of matrix-vector products $m$ for Hutch++ and Girard-Hutchinson on the fast decay (left) and slow decay (right) test problems.

While Hutch++ admits a simple analysis, there is clearly plenty of room for modifications and improvements. Below we highlight some of the main variants and related works.

Non-adaptive methods

It’s possible to do the variance reduction step in a non-adaptive way, i.e., without multiple passes over $\vec{A}$. Specifically, the Randomized SVD can be replaced with Generalized Nyström.
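As a rough illustration (not the specific estimators from the literature), here is a sketch of a single-pass variant that uses the generalized Nyström approximation $\vec{B} = (\vec{A}\vec{\Omega})(\vec{\Psi}^\T\vec{A}\vec{\Omega})^\dagger(\vec{\Psi}^\T\vec{A})$ as the control variate; the function name, the sketch sizes, and the twofold oversampling of $\vec{\Psi}$ are our own choices.

import numpy as np

def single_pass_hutchpp(A, m):
    # assumes A symmetric; all m matvecs happen in a single product with A
    n, _ = A.shape
    k = m // 4
    Ω = np.random.randn(n, k)        # range sketch
    Ψ = np.random.randn(n, 2*k)      # co-range sketch (oversampled)
    X = np.random.randn(n, m - 3*k)  # test vectors for the residual

    Y = A @ np.hstack([Ω, Ψ, X])     # one pass over A
    AΩ, AΨ, AX = Y[:, :k], Y[:, k:3*k], Y[:, 3*k:]

    core = np.linalg.pinv(Ψ.T @ AΩ)     # (Ψ^T A Ω)^+
    trB = np.trace(core @ (AΨ.T @ AΩ))  # tr(B) by cyclicity; Ψ^T A A Ω = (AΨ)^T (AΩ)

    BX = AΩ @ (core @ (AΨ.T @ X))       # B @ X, since Ψ^T A X = (AΨ)^T X
    trR = np.mean(np.sum(X * (AX - BX), axis=0))  # estimate of tr(A - B)
    return trB + trR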

Deflation

Several methods have been proposed that compute the top eigencomponents of $\vec{A}$ explicitly and apply a randomized estimator to the residual (Weiße et al., 2006; Gambhir et al., 2017; Morita & Tohyama, 2020).
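For intuition, here is a sketch of this deflation approach, assuming SciPy's sparse eigensolver; the function name and the split between k eigenpairs and m residual queries are illustrative choices.

import numpy as np
import scipy.sparse.linalg as spla

def deflated_trace(A, k, m):
    n, _ = A.shape
    # top-k eigenpairs via Lanczos; their contribution to the trace is exact
    λ, V = spla.eigsh(A, k=k, which='LA')
    # Girard-Hutchinson on the deflated matrix (I - VV^T) A (I - VV^T)
    X = np.random.randn(n, m)
    Y = X - V @ (V.T @ X)  # project the test vectors off the top eigenspace
    resid = np.mean(np.sum(Y * (A @ Y), axis=0))
    return np.sum(λ) + resid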

Adaptive Hutch++

In Hutch++, the numbers of queries used for the low-rank approximation and for estimating the trace of the residual are roughly balanced. However, there may be better tradeoffs between the two, depending on the spectral decay of $\vec{A}$. In Persson et al., 2022, the authors propose an adaptive variant of Hutch++ that attempts to use more queries for the low-rank approximation when the spectrum is decaying, and more queries for estimating the trace of the residual when the spectrum is flat.

XTrace

The XTrace method (Epperly et al., 2024) is based on the exchangeability principle: use all $m$ queries in the same way. In particular, the method uses all but one query for low-rank approximation and the remaining query to estimate the trace of the residual, and then averages this estimate over all possible choices of the held-out query. By using a clever low-rank update formula, the algorithm can also be made computationally efficient.
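To convey the structure (though not the efficiency) of the method, here is a deliberately naive sketch that recomputes a QR factorization for each held-out query instead of using the paper's low-rank update formulas; the name xtrace_naive is ours.

import numpy as np

def xtrace_naive(A, m):
    n, _ = A.shape
    k = m // 2
    Ω = np.random.randn(n, k)
    Y = A @ Ω   # k matvecs
    W = A @ Y   # k more matvecs; lets us form A @ Q_i without touching A again
    ests = np.zeros(k)
    for i in range(k):
        idx = np.arange(k) != i
        Q, R = np.linalg.qr(Y[:, idx])         # basis from all queries but i
        AQ = W[:, idx] @ np.linalg.inv(R)      # A @ Q_i, since Q_i = Y_{-i} R^{-1}
        ω, y = Ω[:, i], Y[:, i]                # held-out query and A @ ω
        trB = np.trace(Q.T @ AQ)               # tr(Q_i Q_i^T A) by cyclicity
        resid = ω @ y - (Q.T @ ω) @ (Q.T @ y)  # ω^T (A - Q_i Q_i^T A) ω
        ests[i] = trB + resid                  # unbiased given the held-out ω
    return np.mean(ests)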

References
  1. Meyer, R. A., Musco, C., Musco, C., & Woodruff, D. P. (2021). Hutch++: Optimal Stochastic Trace Estimation. In Symposium on Simplicity in Algorithms (SOSA) (pp. 142–155). Society for Industrial and Applied Mathematics. 10.1137/1.9781611976496.16
  2. Persson, D., Cortinovis, A., & Kressner, D. (2022). Improved Variants of the Hutch++ Algorithm for Trace Estimation. SIAM Journal on Matrix Analysis and Applications, 43(3), 1162–1185. 10.1137/21m1447623
  3. Epperly, E. N., Tropp, J. A., & Webber, R. J. (2024). XTrace: Making the Most of Every Sample in Stochastic Trace Estimation. SIAM Journal on Matrix Analysis and Applications, 45(1), 1–23. 10.1137/23m1548323
  4. Girard, D. (1987). Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille. https://membres-ljk.imag.fr/Didier.Girard/TR-665-M-IMAG.pdf
  5. Gilbert, A. C., Strauss, M. J., Tropp, J. A., & Vershynin, R. (2007). One sketch for all: fast algorithms for compressed sensing. Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 237–246. 10.1145/1250790.1250824
  6. Weiße, A., Wellein, G., Alvermann, A., & Fehske, H. (2006). The kernel polynomial method. Reviews of Modern Physics, 78(1), 275–306. 10.1103/revmodphys.78.275
  7. Gambhir, A. S., Stathopoulos, A., & Orginos, K. (2017). Deflation as a Method of Variance Reduction for Estimating the Trace of a Matrix Inverse. SIAM Journal on Scientific Computing, 39(2), A532–A558. 10.1137/16m1066361
  8. Morita, K., & Tohyama, T. (2020). Finite-temperature properties of the Kitaev-Heisenberg models on kagome and triangular lattices studied by improved finite-temperature Lanczos methods. Physical Review Research, 2(1). 10.1103/physrevresearch.2.013205