Fixed-sparsity matrix approximation from matrix-vector products

Tyler Chen

This is a companion piece to https://arxiv.org/abs/2402.09379.

Code to replicate the figures in the corresponding paper can be found on Github.

Introduction

In this work, we study the task of approximating a $d\times d$ matrix $\mathbf{A}$ with a matrix $\widetilde{\mathbf{A}}$ of a given sparsity pattern (e.g. diagonal, tridiagonal, block diagonal, etc.). There are many existing linear algebra algorithms for matrices of a given sparsity pattern, so this task yields a proxy for $\mathbf{A}$ which is compatible with such algorithms. For example, a banded approximation to $\mathbf{A} = \mathbf{M}^{-1}$ could be used as a preconditioner for systems involving $\mathbf{M}$.

The catch is that we assume $\mathbf{A}$ can only be accessed using matrix-vector products (called matvecs). That is, we restrict ourselves to algorithms which choose vectors $\mathbf{x}_1, \ldots, \mathbf{x}_m$ and subsequently receive the responses $\mathbf{A}\mathbf{x}_1, \ldots, \mathbf{A}\mathbf{x}_m$ from some black box, but cannot look at $\mathbf{A}$ in any other way. We will measure the cost of an algorithm by $m$, the total number of matrix-vector queries performed, which in many settings is a realistic proxy for the real-world cost. This is often called the matrix-vector query model.
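To make the query model concrete, here is a minimal sketch in Python/NumPy (the class and names are mine, not from the paper's code) of a black-box oracle that only exposes matvecs and counts how many have been used:

```python
import numpy as np

class MatvecOracle:
    """Black-box access to A: the only allowed operation is x -> A @ x."""

    def __init__(self, A):
        self._A = A           # hidden from the algorithm
        self.num_queries = 0  # m, the cost we charge the algorithm

    def matvec(self, x):
        """Return A @ x and charge one query."""
        self.num_queries += 1
        return self._A @ x

# example: an algorithm may only interact with `oracle.matvec`
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
oracle = MatvecOracle(A)
y = oracle.matvec(rng.standard_normal(100))
print(oracle.num_queries)  # 1
```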

It’s worth noting that we can recover $\mathbf{A}$ in this access model by simply reading off the columns of $\mathbf{A}$ one by one. This means we can effectively run any classical algorithm for the cost of $d$ matvecs. If $d$ is very big this is not practical, but it is worth keeping in mind as a trivial upper bound for the cost of any algorithm in this model. In general, our goal will be to design algorithms which use $m\ll d$ matvecs.

One of the most powerful features of the matvec query model is that it is often possible to prove lower bounds for the hardness of a given task; that is, to show that it is impossible for any algorithm to solve the task unless it uses at least a certain number of queries. This is exciting because it offers a way of understanding the "difficulty" of linear algebra tasks! In contrast, proving lower bounds on the number of floating point operations (which has classically been the measure of cost for linear algebra algorithms) seems far beyond current techniques.

One thing I think is really cool about this paper is that almost everything is self-contained! In particular, besides a few well-known results on properties of Gaussian matrices and a quantitative central limit theorem called the Berry-Esseen theorem, everything needed is proved within the paper using relatively elementary techniques which would be understandable to anyone with an undergraduate-level knowledge of linear algebra and probability. In fact, the general technique is so simple that we even prove a lower bound for diagonal estimation on this page! Overall, I hope that this paper can serve as a case study on the power and beauty of the matrix-vector query model.

Two theoretical problems

For convenience, we will measure the error in the Frobenius norm:
$$ \|\mathbf{A} - \widetilde{\mathbf{A}} \|_{\mathsf{F}} = \sqrt{\sum_{i,j} \big([\mathbf{A}]_{i,j} - [\widetilde{\mathbf{A}}]_{i,j} \big)^2}. $$
Let's also encode the desired sparsity pattern in a binary matrix $\mathbf{S}$. The nonzero entries of $\mathbf{S}$ tell us where $\widetilde{\mathbf{A}}$ is allowed to have nonzero entries. It is pretty clear then that the best sparse approximation to $\mathbf{A}$ is simply $\mathbf{S}\circ \mathbf{A}$, where "$\circ$" is the Hadamard (entrywise) product. That is, the best approximation is obtained by zeroing out the entries of $\mathbf{A}$ which are required to be zero.
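As a quick illustration (a toy example of mine, not from the paper's code), here is the Hadamard-masking computation of $\mathbf{S}\circ\mathbf{A}$ for a tridiagonal pattern; its error is exactly the norm of the entries that were zeroed out:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
A = rng.standard_normal((d, d))

# tridiagonal sparsity pattern S (binary mask)
S = (np.abs(np.subtract.outer(np.arange(d), np.arange(d))) <= 1).astype(float)

best = S * A  # Hadamard product: zero out the off-pattern entries
err = np.linalg.norm(A - best, "fro")
print(err, np.linalg.norm((1 - S) * A, "fro"))  # the two agree
```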

We will study the following problems:

Problem 1 (Best approximation by a matrix of fixed sparsity). Given a matrix $\mathbf{A} \in\mathbb{R}^{n\times d}$ and a binary matrix $\mathbf{S} \in \{0,1\}^{n\times d}$, find a matrix $\widetilde{\mathbf{A}}$ so that $\widetilde{\mathbf{A}} = \mathbf{S}\circ\widetilde{\mathbf{A}}$ and
$$ \| \mathbf{A} - \widetilde{\mathbf{A}} \|_\mathsf{F} \leq (1+\varepsilon) \, \| \mathbf{A} - \mathbf{S}\circ \mathbf{A} \|_\mathsf{F}. $$

Problem 2 (Best approximation to on-sparsity-pattern entries). Given a matrix $\mathbf{A} \in\mathbb{R}^{n\times d}$ and a binary matrix $\mathbf{S} \in \{0,1\}^{n\times d}$, find a matrix $\widetilde{\mathbf{A}}$ so that $\widetilde{\mathbf{A}} = \mathbf{S}\circ\widetilde{\mathbf{A}}$ and
$$ \| \mathbf{S}\circ \mathbf{A} - \widetilde{\mathbf{A}} \|_\mathsf{F}^2 \leq \varepsilon \,\| \mathbf{A} - \mathbf{S}\circ \mathbf{A} \|_\mathsf{F}^2. $$

Note that since $\widetilde{\mathbf{A}}$ has the same sparsity as $\mathbf{S}$ (i.e. $\widetilde{\mathbf{A}} = \mathbf{S}\circ\widetilde{\mathbf{A}}$), the off-pattern and on-pattern errors live on disjoint sets of entries, so
$$ \| \mathbf{A} - \widetilde{\mathbf{A}} \|_\mathsf{F}^2 = \| \mathbf{A} - \mathbf{S}\circ \mathbf{A} \|_\mathsf{F}^2 + \|\mathbf{S}\circ\mathbf{A} - \widetilde{\mathbf{A}}\|_\mathsf{F}^2. $$
This allows us to relate Problems 1 and 2: for $\varepsilon \in (0,1)$, the two are equivalent up to (small) constants.
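Here is a small numerical check of this Pythagorean identity (again a toy sketch of mine; any $\widetilde{\mathbf{A}}$ with the sparsity of $\mathbf{S}$ will do):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.standard_normal((d, d))
S = (np.abs(np.subtract.outer(np.arange(d), np.arange(d))) <= 1).astype(float)

A_tilde = S * rng.standard_normal((d, d))  # any matrix with the sparsity of S

lhs = np.linalg.norm(A - A_tilde, "fro") ** 2
rhs = (np.linalg.norm(A - S * A, "fro") ** 2
       + np.linalg.norm(S * A - A_tilde, "fro") ** 2)
print(np.isclose(lhs, rhs))  # True: the two error terms live on disjoint entries
```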

A simple algorithm

When the sparsity pattern $\mathbf{S}$ has at most $s$ nonzero entries per row, this algorithm uses $m = O(s/\varepsilon)$ non-adaptive matrix-vector product queries. Specifically, the algorithm computes $\mathbf{Z} = \mathbf{A}\mathbf{G}$, where $\mathbf{G}$ is a $d\times m$ matrix with independent standard normal entries, and then outputs the matrix
$$ \widetilde{\mathbf{A}} = \operatornamewithlimits{argmin}_{\mathbf{X} = \mathbf{S}\circ\mathbf{X}} \| \mathbf{Z} - \mathbf{X} \mathbf{G}\|_\mathsf{F}. $$
It's not hard to see how to implement this algorithm. First, observe that each row of the approximation can be computed independently. Then, note that the solution for each row is obtained by solving a standard least squares problem involving the relevant rows of $\mathbf{G}$.
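Here is a minimal sketch of this algorithm in Python/NumPy (my own implementation of the description above, not code from the paper; `matvec` stands in for black-box products with $\mathbf{A}$):

```python
import numpy as np

def fixed_sparsity_approx(matvec, S, m, rng=None):
    """Approximate A by a matrix with sparsity pattern S from m matvecs.

    matvec : function mapping a (d, m) block of vectors to A times that block
    S      : (n, d) binary mask; row i may have at most s nonzeros
    m      : number of Gaussian queries (should exceed the max row sparsity)
    """
    rng = np.random.default_rng(rng)
    n, d = S.shape
    G = rng.standard_normal((d, m))   # random query block
    Z = matvec(G)                     # n x m responses; m matvec queries total

    A_tilde = np.zeros((n, d))
    for i in range(n):
        T = np.flatnonzero(S[i])      # allowed nonzero positions in row i
        if T.size == 0:
            continue
        # row i of the output solves  min_x || Z[i,:] - x @ G[T,:] ||_2
        x, *_ = np.linalg.lstsq(G[T, :].T, Z[i, :], rcond=None)
        A_tilde[i, T] = x
    return A_tilde
```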

Using standard tools from high dimensional probability, which characterize the expected Frobenius norm of the pseudoinverse of a Gaussian matrix, it is relatively simple to establish the following convergence guarantee:

Theorem 1. Consider any $\mathbf{A} \in\mathbb{R}^{n\times d}$ and any $\mathbf{S}\in\{0,1\}^{n\times d}$ with at most $s$ nonzero entries per row. Then, for any $m\geq s+2$, using $m$ randomized matrix-vector queries, the matrix $\widetilde{\mathbf{A}}$ is equal to $\mathbf{S}\circ\mathbf{A}$ in expectation and satisfies
$$ \mathbb{E}\Bigl[\|\mathbf{S}\circ\mathbf{A} - \widetilde{\mathbf{A}}\|_\mathsf{F}^2 \Bigr] \leq \frac{s}{m-s-1} \| \mathbf{A} - \mathbf{S}\circ \mathbf{A} \|_\mathsf{F}^2. $$
The above inequality holds with equality if each row of $\mathbf{S}$ has exactly $s$ nonzero entries.

If we set $m = O(s/\varepsilon)$, then we solve Problem 2 (and hence Problem 1), at least in expectation. One can then use Markov's inequality to get a probability bound. This matches the guarantees of a number of existing algorithms for special cases of Problems 1 and 2 [1], [2], [3], [4], etc., and we comment on the connections in more detail in the paper.
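As a rough empirical sanity check of the $s/(m-s-1)$ factor (a toy experiment of mine, reusing the `fixed_sparsity_approx` sketch from above with a dense random $\mathbf{A}$ and a tridiagonal pattern):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s, m = 200, 3, 30
A = rng.standard_normal((d, d))

# tridiagonal pattern: at most s = 3 nonzeros per row
S = (np.abs(np.subtract.outer(np.arange(d), np.arange(d))) <= 1).astype(float)

trials = [fixed_sparsity_approx(lambda V: A @ V, S, m, rng) for _ in range(50)]
mean_err = np.mean([np.linalg.norm(S * A - At, "fro") ** 2 for At in trials])
off_pattern = np.linalg.norm(A - S * A, "fro") ** 2

# Theorem 1 predicts this ratio is at most s / (m - s - 1)
print(mean_err / off_pattern, s / (m - s - 1))
```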

A lower bound

For any fixed $\mathbf{A}$, there is an algorithm which solves Problem 1 exactly using zero queries: simply hard-code the algorithm to output $\mathbf{S}\circ\mathbf{A}$. This is not a practically useful algorithm, because it only works for that one particular $\mathbf{A}$, and we would be unable to "find" such an algorithm for a given matrix in practice. However, it means it's impossible to prove a statement like: "For any matrix $\mathbf{A}$, no algorithm using (some number of queries depending on $\varepsilon$) can solve Problem 1." Instead, it is standard to consider a distribution over matrices $\mathbf{A}$ (or equivalently, a random matrix). While an algorithm might be hard-coded for the particular distribution we consider, the hope is that the distribution remains very random, even conditioned on a small number of arbitrary matvec queries.

In particular, in the paper we prove a precise version of the following:

Theorem 2. There are constants $c,C>0$ so that, for any $\varepsilon\in(0,c)$ and $s\geq 1$, there is a distribution on matrices $\mathbf{A}$ such that, for any sparsity pattern $\mathbf{S}$ in a broad class of sparsity patterns, and for any algorithm that uses $m\leq Cs/\varepsilon$ matrix-vector queries to $\mathbf{A}$ to output a matrix $\widetilde{\mathbf{A}}$,
$$ \mathbb{P}\Bigl[ \| \mathbf{A} - \widetilde{\mathbf{A}} \|_\mathsf{F} \leq (1+\varepsilon) \,\| \mathbf{A} - \mathbf{S}\circ \mathbf{A} \|_\mathsf{F}\Bigr] \leq \frac{1}{25}. $$

This shows that there are problems for which no algorithm can succeed using more than a constant factor (e.g. $100$) fewer matvecs than the algorithm analyzed in the previous section; in other words, that algorithm is optimal up to constant factors.

At a high level, the lower-bound technique is pretty simple. The basic idea is that if $\mathbf{A} = \mathbf{G}^\mathsf{T}\mathbf{G}$, where $\mathbf{G}$ is a $d\times d$ random Gaussian matrix, then after a small number of (possibly adaptive) matvec queries to $\mathbf{A}$, there is still too much randomness in $\mathbf{A}$ (conditioned on the information learned by the queries) to learn what we need [5].

A proof for diagonal estimation

To give a flavor of how the above lower bound is proved, we will now prove a simpler result for diagonal estimation; i.e. for when $\mathbf{S} = \mathbf{I}$. We will be a bit cavalier 🤠, providing a TCS-style proof (you shouldn't just trust things on the internet anyway).

Theorem 3. There is a constant $C>0$ such that, for any $\varepsilon>0$, there is a distribution on matrices $\mathbf{A}$ such that, for any algorithm that uses $m\leq C/\varepsilon$ matrix-vector queries to $\mathbf{A}$ to output a diagonal matrix $\widetilde{\mathbf{A}}$,
$$ \mathbb{P}\Bigl[ \| \mathbf{I}\circ \mathbf{A} - \widetilde{\mathbf{A}} \|_\mathsf{F}^2 \leq \varepsilon \,\| \mathbf{A} - \mathbf{I}\circ \mathbf{A} \|_\mathsf{F}^2 \Bigr] \leq \frac{1}{25}. $$

The key technical tool will be the following Lemma:

Lemma 1. Suppose $\mathbf{A}\sim\operatorname{Gaussian}(d,d)$ and let $\mathbf{x}_1, \ldots, \mathbf{x}_m$ and $\mathbf{y}_1 = \mathbf{A}\mathbf{x}_1, \ldots, \mathbf{y}_m = \mathbf{A}\mathbf{x}_m$ be such that, for each $j=1,\ldots, m$, $\mathbf{x}_j$ was chosen based only on the query vectors $\mathbf{x}_1, \ldots, \mathbf{x}_{j-1}$ and the outputs $\mathbf{y}_1, \ldots, \mathbf{y}_{j-1}$, and (w.l.o.g.) is unit-length and orthogonal to $\mathbf{x}_1, \ldots, \mathbf{x}_{j-1}$.

Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_m]$ and $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_m]$. Then, $\mathbf{A}$ can be factored as
$$ \mathbf{A} = \begin{bmatrix}\mathbf{Y} & \mathbf{G} \end{bmatrix} \begin{bmatrix}\mathbf{X}^\mathsf{T}\\ \mathbf{Z}^\mathsf{T}\end{bmatrix}, $$
where $\mathbf{Z}^\mathsf{T}\mathbf{X} = \mathbf{0}$, $\mathbf{Z}^\mathsf{T}\mathbf{Z} = \mathbf{I}$, and $\mathbf{G}\sim\operatorname{Gaussian}(d,d-m)$, independently of $\mathbf{X}$ and $\mathbf{Y}$.

Proof. Clearly $\mathbf{A}$ can be decomposed as above for some matrix $\mathbf{G}$ (because the decomposition just encodes what $\mathbf{A}$ does to the vectors in $\mathbf{X}$ and to their orthogonal complement). It therefore suffices to argue that $\mathbf{G}$ is a Gaussian matrix independent of $\mathbf{X}$ and $\mathbf{Y}$.

We proceed by induction, noting that the base case is trivial.

Suppose the lemma holds for $m$ queries. Choose $\mathbf{x}_{m+1}$ as a unit vector orthogonal to $\mathbf{x}_1, \ldots, \mathbf{x}_m$ (i.e. so that $\mathbf{X}^\mathsf{T}\mathbf{x}_{m+1} = \mathbf{0}$). Note that $\mathbf{x}_{m+1} = \mathbf{Z} \mathbf{c}$ for some length-$(d-m)$ unit vector $\mathbf{c}$; indeed, if $\mathbf{x}_{m+1}$ is orthogonal to $\mathbf{X}$ then it is in the span of $\mathbf{Z}$, and it is unit-length if and only if $\mathbf{c}$ is unit-length.

Deterministically complete $\mathbf{c}$ to an orthogonal matrix
$$ \mathbf{C} = \begin{bmatrix} \mathbf{c} & \hat{\mathbf{C}} \end{bmatrix}. $$
Since $\mathbf{y}_{m+1} = \mathbf{A}\mathbf{x}_{m+1} = \mathbf{G}\mathbf{c}$, by the invariance of Gaussian matrices under orthogonal transforms,
$$ \begin{bmatrix} \mathbf{y}_{m+1} & \mathbf{G} \hat{\mathbf{C}} \end{bmatrix} = \mathbf{G}\mathbf{C} \sim \operatorname{Gaussian}(d,d-m). $$
Moreover, since $\mathbf{G}$ was independent of $\mathbf{X}$ and $\mathbf{Y}$, so is $\mathbf{G}\mathbf{C}$. Thus, $\mathbf{G}\hat{\mathbf{C}}$ is independent of $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{x}_{m+1}$, and $\mathbf{y}_{m+1}$.

Now, using the orthogonality of $\mathbf{C}$,
$$ \mathbf{A} = \begin{bmatrix}\mathbf{Y} & \mathbf{G}\mathbf{C} \end{bmatrix} \begin{bmatrix}\mathbf{X}^\mathsf{T}\\ \mathbf{C}^\mathsf{T}\mathbf{Z}^\mathsf{T}\end{bmatrix} = \begin{bmatrix} \mathbf{Y} & \mathbf{y}_{m+1} & \mathbf{G}\hat{\mathbf{C}} \end{bmatrix} \begin{bmatrix} \mathbf{X}^\mathsf{T}\\ \mathbf{x}_{m+1}^\mathsf{T}\\ (\mathbf{Z}\hat{\mathbf{C}})^\mathsf{T} \end{bmatrix}. $$
Relabeling quantities gives the result. $~~~~\square$
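Here is a small numerical illustration of the factorization in Lemma 1 (my own sketch; it checks the algebraic identity $\mathbf{A} = \mathbf{Y}\mathbf{X}^\mathsf{T} + \mathbf{G}\mathbf{Z}^\mathsf{T}$, not the distributional claim about $\mathbf{G}$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 3
A = rng.standard_normal((d, d))

# orthonormal (possibly adaptively chosen) queries X and responses Y = A X
X, _ = np.linalg.qr(rng.standard_normal((d, m)))
Y = A @ X

# Z: an orthonormal basis for the orthogonal complement of range(X)
Z = np.linalg.svd(np.eye(d) - X @ X.T)[0][:, : d - m]
G = A @ Z  # the part of A the algorithm has not seen

# Lemma 1's factorization: A = [Y G] [X Z]^T = Y X^T + G Z^T
print(np.allclose(A, Y @ X.T + G @ Z.T))  # True
```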

We will also use the following fact:

Lemma 2. Suppose $a_1 + \cdots + a_d \geq d/2$ and $0 \leq a_i \leq 1$ for each $i$. Then at least $d/4$ of the $a_i$ are at least $1/4$.

Proof. Let $L = \{ i : a_i \geq 1/4 \}$. Suppose, for the sake of contradiction, that $|L| < d/4$. Then
$$ \frac{d}{2} \leq \sum_{i\in L} a_i + \sum_{i\in L^{\textsf{c}}} a_i \leq |L| \cdot 1 + (d-|L|) \cdot \frac{1}{4} < \frac{d}{4} \cdot 1 + d \cdot \frac{1}{4} = \frac{d}{2}. $$
This is a contradiction, so the result holds. $~~~~\square$

Proof of Theorem 3

Set
$$ d = \frac{1}{\varepsilon}, \qquad m = \frac{d}{4}, \qquad k = d-m, $$
and let $\mathbf{A} \sim\operatorname{Gaussian}(d,d)$.

First, note that $\|\mathbf{A}\|_\mathsf{F}^2$ is a chi-squared random variable with $d^2$ degrees of freedom, and therefore has mean $d^2$. Therefore, by Markov's inequality,
$$ \mathbb{P}\Big[ \|\mathbf{A}\|_\mathsf{F}^2 \leq 50 d^2 \Big] \geq \frac{49}{50}. $$

The algorithm wants to approximate the diagonal of $\mathbf{A} = \mathbf{Y}\mathbf{X}^\mathsf{T}+ \mathbf{G}\mathbf{Z}^\mathsf{T}$, but after $m$ queries it only knows $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$. We will show that the diagonal of $\mathbf{G}\mathbf{Z}^\mathsf{T}$ is too random to be approximated well: no guess based on the queries can achieve squared error better than $\Omega(d)$.

Write the rows of $\mathbf{G}$ as $\mathbf{g}_i^\mathsf{T}$ and the columns of $\mathbf{Z}^\mathsf{T}$ as $\mathbf{z}_i$. Then,
$$ \mathbf{d} = \operatorname{diag}(\mathbf{G}\mathbf{Z}^\mathsf{T}) = \big[ \mathbf{g}_1^\mathsf{T}\mathbf{z}_1 , \mathbf{g}_2^\mathsf{T}\mathbf{z}_2, \cdots ,\mathbf{g}_d^\mathsf{T}\mathbf{z}_d \big]. $$
Note that since the entries of $\mathbf{G}$ are all independent Gaussians, the entries of $\mathbf{d}$ are also independent Gaussians! In particular,
$$ [\mathbf{d}]_{i} = \mathbf{g}_i^\mathsf{T}\mathbf{z}_i = \sum_{j=1}^{k} [\mathbf{g}_i]_{j} \cdot [\mathbf{z}_i]_{j} \sim \sum_{j=1}^{k} \mathcal{N}(0,[\mathbf{z}_i]_{j}^2) \sim \mathcal{N}(0,\|\mathbf{z}_i\|_2^2). $$
The algorithm knows this, so its best guess for $\mathbf{d}$ is $\mathbf{0}$; indeed, each entry is a mean-zero Gaussian. Thus, the best the algorithm can do is output the guess $\widetilde{\mathbf{d}} = \operatorname{diag}(\mathbf{Y}\mathbf{X}^\mathsf{T})$ for $\operatorname{diag}(\mathbf{A})$. Then
$$ \|\operatorname{diag}(\mathbf{A}) - \widetilde{\mathbf{d}} \|_2^2 = \|\mathbf{d}\|_2^2. $$
We will show that $\|\mathbf{d}\|_2^2$ is typically large, so that the output has high error with high probability. Intuitively this makes sense: since the entries of $\mathbf{d}$ are independent Gaussians with variances $\|\mathbf{z}_i\|_2^2$,
$$ \mathbb{E}[\|\mathbf{d}\|_2^2] = \sum_{i=1}^{d} \mathbb{E}\big[ [\mathbf{d}]_{i}^2 \big] = \sum_{i=1}^{d} \|\mathbf{z}_i\|_2^2 = \| \mathbf{Z} \|_\mathsf{F}^2 = k = \frac{3}{4}d. $$
This is on the right scale, since $\|\mathbf{A} - \mathbf{I}\circ\mathbf{A}\|_\mathsf{F}^2 \leq \|\mathbf{A}\|_\mathsf{F}^2 = O(d^2)$, and hence $\varepsilon \,\|\mathbf{A} - \mathbf{I}\circ\mathbf{A}\|_\mathsf{F}^2 = O(\varepsilon d^2) = O(d)$. However, an expectation bound doesn't prove a probability bound; it's possible to have a large expectation because of a very large and very unlikely value.
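The expectation computation above is easy to check numerically (a toy simulation of mine; it estimates $\mathbb{E}[\|\mathbf{d}\|_2^2]$ for a random orthonormal $\mathbf{Z}$ and compares it to $k$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 400, 100
k = d - m

# Z: orthonormal columns spanning a (d - m)-dimensional subspace
Z = np.linalg.qr(rng.standard_normal((d, k)))[0]

# average ||diag(G Z^T)||^2 over fresh Gaussian G, compare to k = 3d/4
norms = []
for _ in range(200):
    G = rng.standard_normal((d, k))
    diag = np.einsum("ij,ij->i", G, Z)   # [d]_i = g_i^T z_i
    norms.append(np.sum(diag ** 2))
print(np.mean(norms), k)  # these should be close
```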

In this particular case, we could simply use a standard anti-concentration result for Gaussians. However, for the sake of mirroring the proof we used in the paper, we will use an alternate approach.

Note that the $\|\mathbf{z}_i\|_2^2$ can't all be super small. Indeed, let
$$ L = \{ i : \|\mathbf{z}_i\|_2^2 \geq 1/4 \}. $$
Since $[\mathbf{X}~\mathbf{Z}]$ is an orthogonal matrix, $\|\mathbf{z}_i\|_2^2 \leq 1$ and $\|\mathbf{Z}\|_\mathsf{F}^2 = k \geq d/2$. Thus, by Lemma 2, $|L| \geq d/4$.

Now, define
$$ P = \{ i \in L : | [\mathbf{d}]_{i} | > 10^{-3} \}. $$
If $i\in L$, then $[\mathbf{d}]_{i}$ is a mean-zero Gaussian with standard deviation at least $1/2$, so by standard properties of Gaussians,
$$ \mathbb{P}\big[ | [\mathbf{d}]_{i} | > 10^{-3} \big] \geq \frac{99}{100}. $$
This implies that $P$ is not much smaller than $L$: since $\mathbb{E}[|L\setminus P|] \leq |L|/100$, Markov's inequality gives
$$ \mathbb{P}\big[ |P| > |L|/2 \big] \geq \frac{49}{50}. $$

We have now established that:

- with probability at least $49/50$, $\|\mathbf{A}\|_\mathsf{F}^2 \leq 50 d^2$, and
- with probability at least $49/50$, $|P| > |L|/2 \geq d/8$.

Thus, by a union bound, both events happen simultaneously with probability at least $24/25$.

In this case, since at least $|P| \geq d/8$ entries of $\mathbf{d}$ exceed $10^{-3}$ in magnitude, for some constant $D$,
$$ \| \mathbf{I}\circ \mathbf{A} - \widetilde{\mathbf{A}} \|_\mathsf{F}^2 \geq \| \mathbf{d} \|_2^2 = \sum_{i=1}^{d} [\mathbf{d}]_{i}^2 \geq \frac{d}{8} \cdot (10^{-3})^2 \geq D \varepsilon \, \| \mathbf{A} - \mathbf{I}\circ\mathbf{A} \|_\mathsf{F}^2, $$
where the last inequality uses $\varepsilon\,\| \mathbf{A} - \mathbf{I}\circ\mathbf{A} \|_\mathsf{F}^2 \leq \varepsilon\, \|\mathbf{A}\|_\mathsf{F}^2 \leq 50 \varepsilon d^2 = 50 d$. Replacing $\varepsilon$ with $\varepsilon/D$ throughout the proof gives the result with $C = D/4$. $~~~~\square$

What changes in the general case?

The above proof captures pretty much all of the important ideas used for the lower bound provided in the paper, though a number of details differ; the most important of these are discussed in the paper itself.

References

1. Curtis, A.R.; Powell, M.J.D.; Reid, J.K. On the Estimation of Sparse Jacobian Matrices. IMA Journal of Applied Mathematics 1974, 13, 117–119, doi:10.1093/imamat/13.1.117.

2. Bekas, C.; Kokiopoulou, E.; Saad, Y. An Estimator for the Diagonal of a Matrix. Applied Numerical Mathematics 2007, 57, 1214–1229, doi:10.1016/j.apnum.2007.01.003.

3. Halikias, D.; Townsend, A. Structured Matrix Recovery from Matrix-vector Products. Numerical Linear Algebra with Applications 2023, 31, doi:10.1002/nla.2531.

4. Dharangutte, P.; Musco, C. A Tight Analysis of Hutchinson's Diagonal Estimator. In Symposium on Simplicity in Algorithms (SOSA); Society for Industrial and Applied Mathematics, 2023; pp. 353–364.

5. Braverman, M.; Hazan, E.; Simchowitz, M.; Woodworth, B. The Gradient Complexity of Linear Regression. In Proceedings of the Thirty Third Conference on Learning Theory; Abernethy, J., Agarwal, S., Eds.; PMLR, 2020; Vol. 125, pp. 627–647.