
Let $\vec{x}$ be a random vector satisfying $\EE[\vec{x}] = \vec{0}$ and $\EE[\vec{x}\vec{x}^\T] = \vec{I}$. Then a direct computation reveals that $\vec{x}^\T\vec{A}\vec{x}$ is an unbiased estimator of the trace of $\vec{A}$. In particular, by the cyclic property of the trace and linearity of expectation,

$$\EE[ \vec{x}^\T \vec{A}\vec{x}] = \EE[ \tr(\vec{x}^\T \vec{A}\vec{x}) ] = \EE[ \tr(\vec{A}\vec{x}\vec{x}^\T ) ] = \tr(\vec{A}\,\EE[ \vec{x}\vec{x}^\T ] ) = \tr(\vec{A}).$$

This suggests a simple randomized estimator: draw iid copies $\vec{x}_1, \ldots, \vec{x}_m$ of $\vec{x}$ and average the resulting quadratic forms,
$$\widehat{\tr}_m(\vec{A}) = \frac{1}{m} \sum_{i=1}^{m} \vec{x}_i^\T \vec{A} \vec{x}_i.$$

We can implement the Girard-Hutchinson estimator as follows.

import numpy as np

def girard_hutchinson(A, m):
    """Estimate tr(A) by averaging m Gaussian quadratic forms x^T A x."""
    n, _ = A.shape

    # columns of X are m iid standard Gaussian test vectors
    X = np.random.randn(n, m)

    # average the diagonal entries of X^T A X, i.e. the quadratic forms x_i^T A x_i
    trm = np.mean(np.diag(X.T @ (A @ X)))

    return trm
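
As a quick sanity check, here is a possible usage sketch (not from the original text): compare the randomized estimate against the exact trace of a random symmetric matrix.

import numpy as np

np.random.seed(0)
n = 1000
B = np.random.randn(n, n)
A = (B + B.T) / 2                   # symmetric test matrix

print(np.trace(A))                  # exact trace
print(girard_hutchinson(A, 50))     # randomized estimate with m = 50 test vectors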

In the above implementation we use that if $\vec{X} = [\vec{x}_1, \ldots, \vec{x}_m] \in \mathbb{R}^{n \times m}$, then

$$\vec{X}^\T \vec{A} \vec{X} = \begin{bmatrix} \vec{x}_1^\T \vec{A} \vec{x}_1 & \vec{x}_1^\T \vec{A} \vec{x}_2 & \cdots & \vec{x}_1^\T \vec{A} \vec{x}_m \\ \vec{x}_2^\T \vec{A} \vec{x}_1 & \vec{x}_2^\T \vec{A} \vec{x}_2 & \cdots & \vec{x}_2^\T \vec{A} \vec{x}_m \\ \vdots & \vdots & \ddots & \vdots \\ \vec{x}_m^\T \vec{A} \vec{x}_1 & \vec{x}_m^\T \vec{A} \vec{x}_2 & \cdots & \vec{x}_m^\T \vec{A} \vec{x}_m \end{bmatrix},$$

and hence $\widehat{\tr}_m(\vec{A}) = \frac{1}{m} \tr(\vec{X}^\T \vec{A} \vec{X})$.

This approach unnecessarily computes and stores the off-diagonal entries of $\vec{X}^\T \vec{A} \vec{X}$. However, in trace estimation we typically view products with $\vec{A}$ as expensive, so the additional cost of forming $\vec{X}^\T \vec{A} \vec{X}$ is often negligible.
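
If the extra work does matter, one could accumulate only the diagonal entries directly. The following is a minimal sketch of that variant (the function name girard_hutchinson_diag_only is illustrative, not from the original):

import numpy as np

def girard_hutchinson_diag_only(A, m):
    """Same estimator, without forming the full m x m matrix X^T A X."""
    n, _ = A.shape
    X = np.random.randn(n, m)

    # (X * (A @ X)).sum(axis=0) is exactly the vector of quadratic forms x_i^T A x_i
    return np.mean(np.sum(X * (A @ X), axis=0))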

Variance Bounds

We can easily characterize the expectation and variance of the Girard-Hutchinson estimator when $\vec{x}$ is a standard Gaussian vector: the estimator is unbiased, $\EE[\widehat{\tr}_m(\vec{A})] = \tr(\vec{A})$, and its variance is $\VV[\widehat{\tr}_m(\vec{A})] = \frac{2}{m}\|\vec{A}\|_\F^2$.

Proof

A standard computation reveals that

$$\EE[ \widehat{\tr}_m(\vec{A}) ] = \tr(\vec{A}), \qquad \VV[ \widehat{\tr}_m(\vec{A}) ] = \frac{1}{m} \VV[\vec{x}^\T\vec{A}\vec{x}].$$

Since $\vec{A}$ is symmetric, it has an eigendecomposition $\vec{A} = \vec{U}\vec{\Lambda}\vec{U}^\T$, where $\vec{U}\in\mathbb{R}^{n\times n}$ is an orthogonal matrix and $\vec{\Lambda} = \text{diag}(\lambda_1,\ldots,\lambda_n)$ is a diagonal matrix. Since the standard Gaussian distribution is orthogonally invariant, $\vec{z} = \vec{U}^\T\vec{x}$ is also a standard Gaussian vector. Observe that

$$\vec{x}^\T\vec{A}\vec{x} = \vec{z}^\T\vec{\Lambda}\vec{z} = \sum_{i=1}^{n} \lambda_i z_i^2.$$

Since the $z_i^2$ are independent $\chi^2_1$ random variables, we have

$$\VV[\vec{x}^\T\vec{A}\vec{x}] = \sum_{i=1}^{n} \lambda_i^2 \VV[z_i^2] = \sum_{i=1}^{n} 2\lambda_i^2 = 2\|\vec{A}\|_\F^2.$$

Combining this with the previous display gives $\VV[\widehat{\tr}_m(\vec{A})] = \frac{2}{m}\|\vec{A}\|_\F^2$.

If $\vec{A}$ is not symmetric, we can use the fact that $\widehat{\tr}_m(\vec{A})=\widehat{\tr}_m((\vec{A} + \vec{A}^\T)/2)$, since $\vec{x}^\T\vec{A}\vec{x} = \vec{x}^\T((\vec{A} + \vec{A}^\T)/2)\vec{x}$ for every $\vec{x}$.

One can also derive probability bounds; see e.g. Cortinovis & Kressner, 2021.
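
As an informal numerical check (a sketch that reuses the girard_hutchinson function above; it is not part of the original text), the empirical variance over many repetitions should be close to $\frac{2}{m}\|\vec{A}\|_\F^2$:

import numpy as np

np.random.seed(1)
n, m, trials = 200, 10, 2000
B = np.random.randn(n, n)
A = (B + B.T) / 2                              # symmetric test matrix

estimates = [girard_hutchinson(A, m) for _ in range(trials)]

print(np.var(estimates))                       # empirical variance of the estimator
print(2 * np.linalg.norm(A, 'fro')**2 / m)     # predicted variance 2*||A||_F^2 / m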

Other Distributions

Besides the Gaussian distribution, two commonly used choices are listed below (a sampling sketch follows the list):

  • Rademacher: Each component of $\vec{x}$ is iid $\{-1,+1\}$ with equal probability.

  • Random unit vector (real): Each component of $\vec{x}$ is drawn iid from $\mathcal{N}(0,1)$, and then $\vec{x}$ is normalized to have norm $\sqrt{n}$ (so that $\EE[\vec{x}\vec{x}^\T]=\vec{I}$ still holds); this is the distribution labeled "Uniform (real)" in the table below.
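
A minimal sketch of how these test vectors might be sampled in NumPy (the helper names are illustrative, not from the original text):

import numpy as np

def rademacher_vectors(n, m):
    # entries are +1 or -1 with equal probability
    return np.random.choice([-1.0, 1.0], size=(n, m))

def sphere_vectors(n, m):
    # Gaussian columns rescaled to norm sqrt(n), i.e. uniform on the sphere of radius sqrt(n)
    X = np.random.randn(n, m)
    return np.sqrt(n) * X / np.linalg.norm(X, axis=0)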

When $\vec{A}$ is symmetric, the variance of the estimator can be computed explicitly:

| Distribution of $\vec{x}$ | Variance |
| --- | --- |
| Gaussian | $\frac{1}{m} \cdot 2 \Vert\vec{A}\Vert_\F^2$ |
| Rademacher | $\frac{1}{m} \cdot 2 \left( \Vert\vec{A}\Vert_\F^2 - \sum_i A_{ii}^2 \right)$ |
| Uniform (real) | $\frac{1}{m} \cdot \frac{2n}{n+2} \left( \Vert \vec{A} \Vert_\F^2 - \frac{1}{n}\tr(\vec{A})^2 \right)$ |

Deriving the variance for Rademacher vectors is a straightforward (but tedious) exercise in basic probability. The variance for the uniform distribution can also be derived from elementary techniques, but requires some care to handle the normalization. Readers can refer to Ethan’s blog post for a derivation of the variances listed above, as well as some additional discussion on interesting optimality properties of certain distributions.

Diagonal Approximation

A task closely related to trace estimation is approximating the diagonal of a matrix $\vec{A}\in\mathbb{R}^{n\times n}$. The analog of Definition 6.2 is the Girard-Hutchinson estimator for diagonal estimation.
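
Although the definition is not reproduced here, a natural version of this estimator (under the same assumption $\EE[\vec{x}\vec{x}^\T] = \vec{I}$) averages the elementwise products $\vec{x}_i \odot (\vec{A}\vec{x}_i)$, whose expectation is the diagonal of $\vec{A}$. A minimal sketch, with an illustrative function name:

import numpy as np

def girard_hutchinson_diagonal(A, m):
    """Sketch of a Girard-Hutchinson-style estimator for diag(A)."""
    n, _ = A.shape
    X = np.random.randn(n, m)

    # each column x satisfies E[x * (A @ x)] = diag(A), so average over the m columns
    return np.mean(X * (A @ X), axis=1)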

It’s not too hard to analyze the mean and variance to prove a result analogous to Theorem 6.1. See also Dharangutte & Musco, 2023 for a probability bound.

References
  1. Hutchinson, M. F. (1989). A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines. Communications in Statistics - Simulation and Computation, 18(3), 1059–1076. 10.1080/03610918908812806
  2. Girard, D. (1987). Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille [A simple and fast algorithm for generalized cross-validation on large-scale problems]. https://membres-ljk.imag.fr/Didier.Girard/TR-665-M-IMAG.pdf
  3. Chen, T. (2022). Lanczos-Based Methods for Matrix Functions (p. 188) [PhD thesis, University of Washington]. https://www.proquest.com/dissertations-theses/lanczos-based-methods-matrix-functions/docview/2719149525/se-2
  4. Cortinovis, A., & Kressner, D. (2021). On Randomized Trace Estimates for Indefinite Matrices with an Application to Determinants. Foundations of Computational Mathematics, 22(3), 875–903. 10.1007/s10208-021-09525-9
  5. Dharangutte, P., & Musco, C. (2023). A Tight Analysis of Hutchinson’s Diagonal Estimator. In Symposium on Simplicity in Algorithms (SOSA) (pp. 353–364). Society for Industrial and Applied Mathematics. 10.1137/1.9781611977585.ch32