
For simplicity, we consider the case when $\vec{A}$ is symmetric positive definite. In this case, it is reasonable to ask for a relative approximation to the trace; that is, we want an estimate of $\tr(\vec{A})$ that lives in the interval $[(1-\varepsilon)\tr(\vec{A}), (1+\varepsilon)\tr(\vec{A})]$ for some $\varepsilon > 0$.

The variance of the Girard-Hutchinson estimator $\widehat{\tr}_m(\cdot)$ scales as $O(1/m)$, where $m$ is the number of samples used. For instance, if we use $m$ Gaussian test vectors, then

$$\EE\left[\widehat{\tr}_m(\vec{A})\right] = \tr(\vec{A}), \qquad \VV\left[\widehat{\tr}_m(\vec{A})\right] = \frac{2\Vert\vec{A}\Vert_\F^2}{m} \leq \frac{2\tr(\vec{A})^2}{m}.$$
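For concreteness, here is a minimal NumPy implementation of this estimator with Gaussian test vectors. The name girard_hutchinson matches the function referenced in the numerical example later in this section.

import numpy as np

def girard_hutchinson(A, m):
    n, _ = A.shape
    X = np.random.randn(n, m)                    # Gaussian test vectors
    return np.mean(np.sum(X * (A @ X), axis=0))  # average of x_i^T A x_i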

To improve the accuracy of the estimator, we can use variance reduction techniques.

The general idea of modern variance reduction techniques for implicit trace estimation (Meyer et al., 2021; Persson et al., 2022; Epperly et al., 2024) is to use a control variate. The basic idea is simple.

Fix any matrix $\vec{B}$ and consider the estimator

$$\widehat{\tr}_m(\vec{A},\vec{B}) := \tr(\vec{B}) + \widehat{\tr}_m(\vec{A} - \vec{B}).$$

This estimator can be implemented so long as we know $\tr(\vec{B})$ and can evaluate the maps $\vec{x} \mapsto \vec{x}^\T\vec{A}\vec{x}$ and $\vec{x} \mapsto \vec{x}^\T\vec{B}\vec{x}$ for any vector $\vec{x}$. The number of matrix-vector products needed to compute $\widehat{\tr}_m(\vec{A},\vec{B})$ is the same as for $\widehat{\tr}_m(\vec{A})$, namely $m$. The advantage of the new estimator is that if $\vec{B}$ is a good approximation of $\vec{A}$, then

$$\VV[\widehat{\tr}_m(\vec{A},\vec{B})] = \VV[\widehat{\tr}_m(\vec{A} - \vec{B})]$$

can be much smaller than $\VV[\widehat{\tr}_m(\vec{A})]$.
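To make the recipe concrete, here is a minimal sketch of a control-variate estimator; the helper name trace_with_control_variate and the argument matvec_diff (a function applying $\vec{A} - \vec{B}$ to a block of vectors) are illustrative choices, not from the references.

import numpy as np

def trace_with_control_variate(matvec_diff, trB, n, m):
    # trB is the known trace of B; matvec_diff(X) returns (A - B) @ X
    X = np.random.randn(n, m)                  # Gaussian test vectors
    quad = np.sum(X * matvec_diff(X), axis=0)  # x_i^T (A - B) x_i for each i
    return trB + np.mean(quad)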

Hutch++

Hutch++ is perhaps the best-known variance reduction technique for implicit trace estimation. The idea behind the Hutch++ algorithm is to use a low-rank approximation to $\vec{A}$, computed using the Randomized SVD, as the control variate.

The intuition is simple:

  • if $\vec{A}$ has a flat spectrum, then $\Vert\vec{A}\Vert_\F \ll \tr(\vec{A})$, and the regular Girard-Hutchinson estimator works well;

  • if $\vec{A}$ has a decaying spectrum, then it is nearly low-rank, and using the control variate is helpful.

Below, we present a simple variant of Hutch++ that admits a straightforward analysis. The exact variants studied in the original paper differ slightly in the details, but the overall idea is the same. Readers can also look at Raphael’s Blog Post on Hutch++ for more details.

Observe that Hutch++ requires $m$ matrix-vector products with $\vec{A}$. Indeed, $(m+2)/4$ are used to compute the sketch $\vec{A}\vec{\Omega}$, $(m+2)/4$ are used to compute $\vec{Q}^\T\vec{A}$ (where $\vec{Q}$ is an orthonormal basis for the range of $\vec{A}\vec{\Omega}$, so the control variate is $\vec{B} = \vec{Q}\vec{Q}^\T\vec{A}$), and $(m-2)/2$ are used to compute the products $\vec{A}\vec{x}_i$ with the Gaussian test vectors $\vec{x}_i$; in total, $(m+2)/4 + (m+2)/4 + (m-2)/2 = m$.

Below we implement the Hutch++ estimator. Note that, by the cyclic property of the trace, $\tr(\vec{B}) = \tr(\vec{Q}^\T\vec{A}\vec{Q})$. The latter is cheaper to compute, since $\vec{Q}^\T\vec{A}\vec{Q}$ is a small matrix while $\vec{Q}\vec{Q}^\T\vec{A}$ is a large matrix.

import numpy as np

def hutchpp(A, m):

    m1 = (m+2)//4   # sketch size for the low-rank approximation
    m2 = m - 2*m1   # number of test vectors for the residual
    n, _ = A.shape

    # low-rank control variate B = QQ^T A via the Randomized SVD
    Ω = np.random.randn(n, m1)
    Q, _ = np.linalg.qr(A@Ω, mode='reduced')
    Z = A.T@Q
    trB = np.trace(Z.T@Q)  # tr(B) = tr(Q^T A Q) by cyclicity

    # Girard-Hutchinson estimate of tr(A - B)
    X = np.random.randn(n, m2)
    trm = np.mean(np.diag(X.T@(A@X) - (X.T@Q)@(Z.T@X)))

    return trB + trm

The convergence of Hutch++ compares favorably to (6.3): the variance decays at the rate $O(1/m^2)$ rather than $O(1/m)$.

Proof

Let $k+p$ be the number of samples used to sketch $\vec{A}$ in order to build the low-rank approximation, and $\ell$ be the number of samples used to estimate the trace of $\vec{A} - \vec{B}$.

As noted above, $\EE[\widehat{\tr}_m^{++}(\vec{A})] = \EE[\EE[\widehat{\tr}_m^{++}(\vec{A}) \mid \vec{B}]] = \tr(\vec{A})$ and $\VV[\widehat{\tr}_m^{++}(\vec{A}) \mid \vec{B}] = \VV[\widehat{\tr}_m(\vec{A} - \vec{B}) \mid \vec{B}] = 2\Vert\vec{A} - \vec{B}\Vert_\F^2/\ell$. Since the conditional mean is the constant $\tr(\vec{A})$, the law of total variance gives $\VV[\widehat{\tr}_m^{++}(\vec{A})] = \EE[\VV[\widehat{\tr}_m^{++}(\vec{A}) \mid \vec{B}]] = 2\,\EE[\Vert\vec{A} - \vec{B}\Vert_\F^2]/\ell$. Now, by Theorem 5.1, a standard bound for the Randomized SVD, we have

$$\EE\left[\Vert\vec{A} - \vec{B}\Vert_\F^2\right] = \EE\left[\Vert\vec{A} - \vec{Q}\vec{Q}^\T\vec{A}\Vert_\F^2\right] \leq \left(1 + \frac{k}{p-1}\right)\Vert\vec{A} - \llbracket\vec{A}\rrbracket_k\Vert_\F^2.$$

As proved in Lemma 7 in Gilbert et al., 2007,

$$\Vert\vec{A} - \llbracket\vec{A}\rrbracket_k\Vert_\F \leq \frac{1}{2\sqrt{k}}\tr(\vec{A}).$$

Thus, setting $p = k+1$ (so that $k/(p-1) = 1$),

$$\VV[\widehat{\tr}_m^{++}(\vec{A})] \leq \frac{2}{\ell}\left(1 + \frac{k}{p-1}\right)\frac{1}{4k}\tr(\vec{A})^2 = \frac{1}{\ell k}\tr(\vec{A})^2.$$

Finally, the choices $k = (m-2)/4$ and $\ell = (m-2)/2$ yield the stated bound, $\VV[\widehat{\tr}_m^{++}(\vec{A})] \leq \frac{8}{(m-2)^2}\tr(\vec{A})^2$.

Numerical Example

Let’s set up a numerical experiment to compare Hutch++ to the Girard-Hutchinson estimator. We’ll compare the performance on a problem with fast spectral decay and one with slow spectral decay.

Since we only use Gaussian queries, which are rotationally invariant, we can use a diagonal test problem without loss of generality.

import numpy as np
import scipy as sp
import scipy.sparse  # used below as sp.sparse.diags
import matplotlib.pyplot as plt

# Generate test problems with different spectral decay rates
n = 3000

test_problems = [
    {'name': 'fast decay', 'Λ': 1/np.arange(1,n+1)**3},
    {'name': 'slow decay', 'Λ': 1/np.arange(1,n+1)**1}
]

# Define trace estimation methods
trace_methods = [
    {'name': 'Girard-Hutchinson',
     'func': girard_hutchinson},
    {'name': 'Hutch++',
     'func': hutchpp}
]
n_ms = 5
ms = np.geomspace(10,1000,n_ms,dtype='int')

# Run experiments
n_repeat = 100
errors = {}

for test_problem in test_problems:
    problem_name = test_problem['name']
    Λ = test_problem['Λ']
    A = sp.sparse.diags(Λ)
    tr_true = np.sum(Λ)

    for trace_method in trace_methods:
        alg_name = trace_method['name']
        alg = trace_method['func']

        err = np.zeros((n_ms,n_repeat))
        for i,m in enumerate(ms):
            for j in range(n_repeat):
                err[i,j] = np.abs(alg(A,m) - tr_true) / tr_true

        errors[problem_name,alg_name] = err

# Plot the results
σ = .1  # quantile level: shaded bands span the σ to 1-σ quantiles
fig, axs = plt.subplots(1,2,figsize=(10, 4),sharey=True)

for i,test_problem in enumerate(test_problems):
    problem_name = test_problem['name']
    ax = axs[i]
    ax.set_title(problem_name)
    
    for trace_method in trace_methods:
        alg_name = trace_method['name']
        err = errors[problem_name,alg_name]

        bot,mid,top = np.quantile(err, [σ, .5, 1-σ], axis=1)

        ax.plot(ms,mid,label=alg_name)
        ax.fill_between(ms,bot,top,alpha=.2)

    ax.set_xlabel('matvecs $m$')
    ax.set_xscale('log')

axs[0].set_ylabel(r'error: $|\operatorname{tr}(A) - \mathrm{alg}|/\operatorname{tr}(A)$')
plt.yscale('log')
plt.legend()

plt.savefig('hutchpp.svg')
plt.close()

The results are as expected. When the eigenvalue decay is fast, Hutch++ performs substantially better than the Girard-Hutchinson estimator, since it can effectively remove much of the trace mass with the low-rank approximation. When the decay is slow, the two methods perform comparably.

Figure: relative error versus number of matrix-vector products $m$ for Hutch++ and Girard-Hutchinson on the fast decay (left) and slow decay (right) test problems.

While Hutch++ admits a simple analysis, there is clearly plenty of room for modifications and improvements. Below we highlight some of the main variants and related works.

Non-adaptive methods

It’s possible to do the variance reduction step in a non-adaptive way, i.e., without multiple passes over $\vec{A}$. Specifically, the Randomized SVD can be replaced with Generalized Nyström.
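As a rough illustration (not the specific estimators from the literature), here is a sketch of a single-pass variant that uses the generalized Nyström approximation $\vec{B} = (\vec{A}\vec{\Omega})(\vec{\Psi}^\T\vec{A}\vec{\Omega})^\dagger(\vec{\Psi}^\T\vec{A})$ as the control variate; the function name, the sketch sizes, and the twofold oversampling of $\vec{\Psi}$ are our own choices.

import numpy as np

def single_pass_hutchpp(A, m):
    # assumes A symmetric; all m matvecs happen in a single product with A
    n, _ = A.shape
    k = m // 4
    Ω = np.random.randn(n, k)        # range sketch
    Ψ = np.random.randn(n, 2*k)      # co-range sketch (oversampled)
    X = np.random.randn(n, m - 3*k)  # test vectors for the residual

    Y = A @ np.hstack([Ω, Ψ, X])     # one pass over A
    AΩ, AΨ, AX = Y[:, :k], Y[:, k:3*k], Y[:, 3*k:]

    core = np.linalg.pinv(Ψ.T @ AΩ)     # (Ψ^T A Ω)^+
    trB = np.trace(core @ (AΨ.T @ AΩ))  # tr(B) by cyclicity; Ψ^T A A Ω = (AΨ)^T (AΩ)

    BX = AΩ @ (core @ (AΨ.T @ X))       # B @ X, since Ψ^T A X = (AΨ)^T X
    trR = np.mean(np.sum(X * (AX - BX), axis=0))  # estimate of tr(A - B)
    return trB + trR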

Deflation

Several methods have been proposed that compute the top eigencomponents of $\vec{A}$ explicitly and apply a randomized estimator to the residual (Weiße et al., 2006; Gambhir et al., 2017; Morita & Tohyama, 2020).
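For intuition, here is a sketch of this deflation approach, assuming SciPy's sparse eigensolver; the function name and the split between k eigenpairs and m residual queries are illustrative choices.

import numpy as np
import scipy.sparse.linalg as spla

def deflated_trace(A, k, m):
    n, _ = A.shape
    # top-k eigenpairs via Lanczos; their contribution to the trace is exact
    λ, V = spla.eigsh(A, k=k, which='LA')
    # Girard-Hutchinson on the deflated matrix (I - VV^T) A (I - VV^T)
    X = np.random.randn(n, m)
    Y = X - V @ (V.T @ X)  # project the test vectors off the top eigenspace
    resid = np.mean(np.sum(Y * (A @ Y), axis=0))
    return np.sum(λ) + resid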

Adaptive Hutch++

In Hutch++, the numbers of queries used for the low-rank approximation and for estimating the trace of the residual are roughly balanced. However, there may be better tradeoffs between the two, depending on the spectral decay of $\vec{A}$. In Persson et al., 2022, the authors propose an adaptive variant of Hutch++ that attempts to use more queries for the low-rank approximation when the spectrum is decaying, and more queries for estimating the trace of the residual when the spectrum is flat.

XTrace

The XTrace method (Epperly et al., 2024) is based on the exchangeability principle: use all $m$ queries in the same way. In particular, the method uses all but one query for low-rank approximation and the remaining query to estimate the trace of the residual, and then averages this estimate over all possible choices of the held-out query. By using a clever low-rank update formula, the algorithm can also be made computationally efficient.
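To convey the structure (though not the efficiency) of the method, here is a deliberately naive sketch that recomputes a QR factorization for each held-out query instead of using the paper's low-rank update formulas; the name xtrace_naive is ours.

import numpy as np

def xtrace_naive(A, m):
    n, _ = A.shape
    k = m // 2
    Ω = np.random.randn(n, k)
    Y = A @ Ω   # k matvecs
    W = A @ Y   # k more matvecs; lets us form A @ Q_i without touching A again
    ests = np.zeros(k)
    for i in range(k):
        idx = np.arange(k) != i
        Q, R = np.linalg.qr(Y[:, idx])         # basis from all queries but i
        AQ = W[:, idx] @ np.linalg.inv(R)      # A @ Q_i, since Q_i = Y_{-i} R^{-1}
        ω, y = Ω[:, i], Y[:, i]                # held-out query and A @ ω
        trB = np.trace(Q.T @ AQ)               # tr(Q_i Q_i^T A) by cyclicity
        resid = ω @ y - (Q.T @ ω) @ (Q.T @ y)  # ω^T (A - Q_i Q_i^T A) ω
        ests[i] = trB + resid                  # unbiased given the held-out ω
    return np.mean(ests)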

References
  1. Meyer, R. A., Musco, C., Musco, C., & Woodruff, D. P. (2021). Hutch++: Optimal Stochastic Trace Estimation. In Symposium on Simplicity in Algorithms (SOSA) (pp. 142–155). Society for Industrial and Applied Mathematics. 10.1137/1.9781611976496.16
  2. Persson, D., Cortinovis, A., & Kressner, D. (2022). Improved Variants of the Hutch++ Algorithm for Trace Estimation. SIAM Journal on Matrix Analysis and Applications, 43(3), 1162–1185. 10.1137/21m1447623
  3. Epperly, E. N., Tropp, J. A., & Webber, R. J. (2024). XTrace: Making the Most of Every Sample in Stochastic Trace Estimation. SIAM Journal on Matrix Analysis and Applications, 45(1), 1–23. 10.1137/23m1548323
  4. Girard, D. (1987). Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille. https://membres-ljk.imag.fr/Didier.Girard/TR-665-M-IMAG.pdf
  5. Gilbert, A. C., Strauss, M. J., Tropp, J. A., & Vershynin, R. (2007). One sketch for all: fast algorithms for compressed sensing. Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, 237–246. 10.1145/1250790.1250824
  6. Weiße, A., Wellein, G., Alvermann, A., & Fehske, H. (2006). The kernel polynomial method. Reviews of Modern Physics, 78(1), 275–306. 10.1103/revmodphys.78.275
  7. Gambhir, A. S., Stathopoulos, A., & Orginos, K. (2017). Deflation as a Method of Variance Reduction for Estimating the Trace of a Matrix Inverse. SIAM Journal on Scientific Computing, 39(2), A532–A558. 10.1137/16m1066361
  8. Morita, K., & Tohyama, T. (2020). Finite-temperature properties of the Kitaev-Heisenberg models on kagome and triangular lattices studied by improved finite-temperature Lanczos methods. Physical Review Research, 2(1). 10.1103/physrevresearch.2.013205