Code to replicate the figures in the corresponding paper can be found on Github.
Introduction
In this work, we study the task of approximating a d×d matrix A with a matrix $\tilde{A}$ of a given sparsity pattern (e.g. diagonal, tridiagonal, block diagonal, etc.).
There are many existing linear algebra algorithms for matrices of a given sparsity pattern, so this task yields a proxy for A which is compatible with such algorithms.
For example, a banded approximation to $A = M^{-1}$ could be used as a preconditioner for systems involving M.
The catch is that we assume A can only be accessed using matrix-vector products (called matvecs).
That is, we restrict ourselves to algorithms which choose vectors x1,…,xm and subsequently receive responses Ax1,…,Axm from some black-box, but cannot look at A in any other way.
We will measure the cost of algorithms by m, the total number of matrix-vector queries done, which in many settings may be a realistic proxy for the real-world costs.
This is often called the matrix-vector query model.
It’s worth noting that we can recover A in this access model by simply reading off the columns of A one-by-one.
This means we can effectively run any classical algorithm for the cost of d matvecs.
If d is very big this is not practical, but it is worth keeping in mind as a trivial upper bound for the cost of any algorithm in this model.
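For concreteness, here is a minimal sketch of this trivial approach (my own illustration; the function name and the `matvec` black box returning Ax are assumed for illustration):

```python
import numpy as np

def recover_dense(matvec, d):
    """Recover a d x d matrix A exactly using d matvecs with the standard basis vectors."""
    A = np.zeros((d, d))
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0
        A[:, i] = matvec(e_i)   # A @ e_i is exactly the i-th column of A
    return A
```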
In general, our goal will be to design algorithms which use m≪d matvecs.
One of the most powerful features of the matvec query model is that it is often possible to prove lower-bounds for the hardness of a given task.
That is, to show that it is impossible (for any algorithm) to solve the task unless the algorithm uses at least a certain number of queries.
This is exciting, because it offers a way of understanding the “difficulty” of linear algebra tasks!
In contrast, proving lower-bounds on the number of floating point operations (which has classically been the measure of cost for linear algebra algorithms) is seemingly far beyond current techniques.
One thing I think is really cool about this paper is that almost everything is self-contained!
In particular, besides a few well-known results on properties of Gaussian matrices and a quantitative central limit theorem called the Berry-Esseen theorem, everything needed is proved within the paper using relatively elementary techniques which would be understandable to anyone with an undergraduate-level knowledge of linear algebra and probability.
In fact, the general technique is so simple that we even prove a lower-bound for diagonal estimation on this page!
Overall, I hope that this paper can serve as a case-study on the power and beauty of the matrix-vector query model.
Two theoretical problems
For convenience, we will measure the error in the Frobenius norm:
$$\|A - \tilde{A}\|_F = \sqrt{\sum_{i,j} \big([A]_{i,j} - [\tilde{A}]_{i,j}\big)^2}.$$
Let’s also encode the desired sparsity pattern in a binary matrix S.
The nonzero entries of S will tell us where $\tilde{A}$ is allowed to have nonzero entries.
It is pretty clear then that the best sparse approximation to A is simply S∘A, where “∘” is the Hadamard (entrywise) product.
That is, the best approximation is obtained by zeroing out the entries of A which are required to be zero.
We will study the following problems:
Problem 1 (Best approximation by a matrix of fixed sparsity).
Given a matrix $A \in \mathbb{R}^{n\times d}$ and a binary matrix $S \in \{0,1\}^{n\times d}$, find a matrix $\tilde{A}$ so that $\tilde{A} = S\circ\tilde{A}$ and
$$\|A - \tilde{A}\|_F \le (1+\varepsilon)\,\|A - S\circ A\|_F.$$
Problem 2 (Best approximation to on-sparsity-pattern entries).
Given a matrix $A \in \mathbb{R}^{n\times d}$ and a binary matrix $S \in \{0,1\}^{n\times d}$, find a matrix $\tilde{A}$ so that $\tilde{A} = S\circ\tilde{A}$ and
$$\|S\circ A - \tilde{A}\|_F^2 \le \varepsilon\,\|A - S\circ A\|_F^2.$$
Note that since $\tilde{A}$ has the same sparsity pattern as S (i.e. $\tilde{A} = S\circ\tilde{A}$),
$$\|A - \tilde{A}\|_F^2 = \|A - S\circ A\|_F^2 + \|S\circ A - \tilde{A}\|_F^2.$$
This allows us to relate Problems 1 and 2: for ε∈(0,1), the two are equivalent up to (small) constants.
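To see why this identity holds (the two error terms simply live on disjoint sets of entries), split the squared Frobenius norm over the entries off and on the sparsity pattern, using that $[\tilde{A}]_{i,j} = 0$ whenever $[S]_{i,j} = 0$:
$$\|A - \tilde{A}\|_F^2 = \sum_{(i,j)\,:\,[S]_{i,j}=0} [A]_{i,j}^2 \;+\; \sum_{(i,j)\,:\,[S]_{i,j}=1} \big([A]_{i,j} - [\tilde{A}]_{i,j}\big)^2 = \|A - S\circ A\|_F^2 + \|S\circ A - \tilde{A}\|_F^2.$$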
A simple algorithm
When the sparsity pattern S has at most s non-zero entries per row, there is a simple algorithm which uses m=O(s/ε) non-adaptive matrix-vector product queries.
Specifically, the algorithm computes Z=AG, where G is a d×m matrix with independent standard normal entries, and then outputs the matrix
$$\tilde{A} = \underset{X \,=\, S\circ X}{\operatorname{argmin}}\; \|Z - XG\|_F.$$
It’s not hard to see how to implement this algorithm.
First, observe that each row of the approximation can be computed independently.
Then, note that each row is obtained by solving a standard least squares problem involving only the rows of G indexed by that row's allowed nonzero positions.
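Here is a minimal numpy sketch of this procedure (my own illustration, not necessarily the implementation in the paper's Github repository; the function name and the `matvec` black box returning Ax are assumptions for illustration):

```python
import numpy as np

def fixed_sparsity_approx(matvec, S, m, rng=np.random.default_rng()):
    """Approximate A on the sparsity pattern S using m matvecs with A.

    matvec : callable mapping a length-d vector x to A @ x (A is n x d)
    S      : (n, d) binary array; 1s mark where the approximation may be nonzero
    m      : number of matrix-vector queries (should exceed the max row sparsity)
    """
    n, d = S.shape
    G = rng.standard_normal((d, m))                               # Gaussian query vectors
    Z = np.column_stack([matvec(G[:, j]) for j in range(m)])      # Z = A @ G

    A_tilde = np.zeros((n, d))
    for i in range(n):
        idx = np.flatnonzero(S[i])        # allowed nonzero positions in row i
        if idx.size == 0:
            continue
        # Row i solves the least squares problem min_x || Z[i, :] - x @ G[idx, :] ||_2
        # over the |idx| unknown entries of that row.
        x, *_ = np.linalg.lstsq(G[idx, :].T, Z[i, :], rcond=None)
        A_tilde[i, idx] = x
    return A_tilde
```

For a dense test matrix A, calling `fixed_sparsity_approx(lambda x: A @ x, S, m)` and comparing against `S * A` gives a quick empirical check of the guarantee stated below.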
Using standard tools from high-dimensional probability, which characterize the expected squared Frobenius norm of the pseudoinverse of a Gaussian matrix, it is relatively simple to establish the following convergence guarantee:
Theorem 1.
Consider any $A \in \mathbb{R}^{n\times d}$ and any $S \in \{0,1\}^{n\times d}$ with at most s nonzero entries per row.
Then, for any m ≥ s+2, using m randomized matrix-vector queries, the matrix $\tilde{A}$ produced by the algorithm above is equal to S∘A in expectation and satisfies
$$\mathbb{E}\big[\|S\circ A - \tilde{A}\|_F^2\big] \le \frac{s}{m-s-1}\,\|A - S\circ A\|_F^2.$$
The above inequality holds with equality if each row of S has exactly s non-zero entries.
If we set m=O(s/ε), then we solve Problem 2 (and hence Problem 1), at least in expectation.
One can then use Markov’s inequality to get a probability bound.
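Concretely (a standard step, sketched here for completeness): for any failure probability δ ∈ (0,1), Markov's inequality gives
$$\mathbb{P}\Big[\|S\circ A - \tilde{A}\|_F^2 \ge \tfrac{1}{\delta}\cdot\tfrac{s}{m-s-1}\,\|A - S\circ A\|_F^2\Big] \le \delta,$$
so taking $m \ge s + 1 + s/(\delta\varepsilon)$ ensures $\|S\circ A - \tilde{A}\|_F^2 \le \varepsilon\,\|A - S\circ A\|_F^2$ with probability at least $1-\delta$.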
This matches the guarantees for a number of existing algorithms for special cases of Problems 1 and 2 [1], [2], [3], [4], etc., and we comment on the connections in more detail in the paper.
A lower bound
For any fixed A, there is an algorithm which solves Problem 1 exactly using zero queries; simply hard-code the algorithm to output S∘A.
This is not a practically useful algorithm, because it only works for the one particular A, and we would be unable to practically "find" such an algorithm for a given matrix.
However, it means it’s impossible to prove a statement like: “For any matrix A, no algorithm using fewer than (some number of queries depending on ε) queries can solve Problem 1”.
Instead, it is standard to consider a distribution of matrices A (or equivalently, a random matrix).
While an algorithm might be hard-coded for the particular distribution we consider, there is hope that a matrix drawn from this distribution remains very random, even conditioned on a small number of arbitrary matvec queries.
In particular, in the paper we prove a precise version of the following:
Theorem 2.
There are constants c,C>0 so that, for any ε∈(0,c) and s≥1, there is a distribution on matrices A such that, for any sparsity pattern S in a broad class of sparsity patterns, and for any algorithm that uses m≤Cs/ε matrix-vector queries to A to output a matrix $\tilde{A}$,
$$\mathbb{P}\big[\|A - \tilde{A}\|_F \le (1+\varepsilon)\,\|A - S\circ A\|_F\big] \le \frac{1}{25}.$$
This shows that there are problems for which no algorithm can succeed using more than a constant factor (e.g. 100 times) fewer matvecs than the algorithm analyzed in the previous section.
At a high-level, the lower-bound technique is pretty simple.
The basic idea is that if A=GTG, where G is a d×d random Gaussian matrix, then after a small number of (possibly adaptive) matvec queries to A, there is still too much randomness in A (conditioned on the information learned by the queries) to learn what we need [5].
A proof for diagonal estimation
To give a flavor of how the above lower-bound is proved, we will now prove a simpler result for diagonal estimation; i.e. for when S=I.
We will be a bit cavalier 🤠, providing a TCS-style proof (you shouldn’t just trust things on the internet anyway).
Theorem 3.
There are constants c,C>0 such that, for any ε∈(0,c), there is a distribution on matrices A such that, for any algorithm that uses m≤C/ε matrix-vector queries to A to output a diagonal matrix $\tilde{A}$,
$$\mathbb{P}\big[\|I\circ A - \tilde{A}\|_F^2 \le \varepsilon\,\|A - I\circ A\|_F^2\big] \le \frac{1}{25}.$$
The key technical tool will be the following Lemma:
Lemma 1. Suppose $A \sim \text{Gaussian}(d,d)$ (a d×d matrix with independent standard normal entries), and let $x_1,\ldots,x_m$ and $y_1 = Ax_1,\ldots,y_m = Ax_m$ be such that, for each $j = 1,\ldots,m$, $x_j$ is chosen based only on the query vectors $x_1,\ldots,x_{j-1}$ and the outputs $y_1,\ldots,y_{j-1}$, and is (w.l.o.g.) unit-length and orthogonal to $x_1,\ldots,x_{j-1}$.
Let X=[x1,…,xm] and Y=[y1,…,ym].
Then, A can be factored as
$$A = \begin{bmatrix} Y & G \end{bmatrix}\begin{bmatrix} X^T \\ Z^T \end{bmatrix},$$
where $Z^T X = 0$, $Z^T Z = I$, and $G \sim \text{Gaussian}(d, d-m)$, independently of X and Y.
Proof.
Clearly A can be written in the above form for some matrix G (this is just encoding what A does to the vectors in X and to their orthogonal complement).
It therefore suffices to argue that G is a Gaussian matrix independent of X and Y.
We proceed by induction, noting that the base case is trivial.
Suppose the lemma holds for m queries.
Choose $x_{m+1}$ as a unit vector orthogonal to $x_1,\ldots,x_m$ (i.e. so that $X^T x_{m+1} = 0$).
Note that xm+1=Zc, for some length d−m unit vector c; indeed, if xm+1 is orthogonal to X then it is in the span of Z, and it is unit-length if and only if c is unit-length.
Deterministically complete c to an orthogonal matrix
$$C = \begin{bmatrix} c & \hat{C} \end{bmatrix}.$$
Since $y_{m+1} = Ax_{m+1} = YX^TZc + GZ^TZc = Gc$, by the invariance of Gaussian matrices under orthogonal transforms,
$$GC = \begin{bmatrix} y_{m+1} & G\hat{C} \end{bmatrix} \sim \text{Gaussian}(d, d-m).$$
Moreover, since G is independent of X and Y, and C is determined by X and Y (through the choice of $x_{m+1}$), GC is also independent of X and Y.
In particular, $G\hat{C}$ is independent of X, Y, $x_{m+1}$, and $y_{m+1} = Gc$ (distinct columns of the Gaussian matrix GC are independent).
Now, using the orthogonality of C,
$$A = \begin{bmatrix} Y & GC \end{bmatrix}\begin{bmatrix} X^T \\ C^T Z^T \end{bmatrix} = \begin{bmatrix} Y & y_{m+1} & G\hat{C} \end{bmatrix}\begin{bmatrix} X^T \\ x_{m+1}^T \\ (Z\hat{C})^T \end{bmatrix}.$$
Relabeling quantities gives the result.
□
We will also use the following fact:
Lemma 2.
Suppose $a_1 + \cdots + a_d \ge d/2$ and $a_i \le 1$ for each i.
Then at least d/4 of the ai are larger than 1/4.
Proof.
Let $L = \{\, i : a_i > 1/4 \,\}$.
Suppose, for the sake of contradiction, that ∣L∣<d/4.
Then
$$\frac{d}{2} \le \sum_{i\in L} a_i + \sum_{i\in L^c} a_i \le |L|\cdot 1 + (d - |L|)\cdot\frac{1}{4} < \frac{d}{4}\cdot 1 + d\cdot\frac{1}{4} = \frac{d}{2}.$$
This is a contradiction, so the result holds. □
Proof of Theorem 3
Set
$$d = \frac{1}{\varepsilon}, \qquad m = \frac{d}{4}, \qquad k = d - m,$$
and let $A \sim \text{Gaussian}(d,d)$.
First, note that ∥A∥F2 is a Chi-squared random variable with d2 degrees of freedom, and therefore has mean d2.
Therefore, by Markov’s inequality,
$$\mathbb{P}\big[\|A\|_F^2 \le 50\,d^2\big] \ge \frac{49}{50}.$$
The algorithm wants to approximate the diagonal of $A = YX^T + GZ^T$, but after m queries it knows only X, Y, and Z.
We will show that the diagonal of $GZ^T$ is still so random that it cannot be approximated to squared error smaller than a (small) constant times d.
Write the rows of G as $g_i^T$ and the columns of $Z^T$ as $z_i$.
Then,
$$\mathbf{d} := \operatorname{diag}(GZ^T) = \big[\, g_1^T z_1,\; g_2^T z_2,\; \ldots,\; g_d^T z_d \,\big].$$
Note that since the entries of G are all independent Gaussians, the entries of $\mathbf{d}$ are also independent Gaussians!
In particular,
$$[\mathbf{d}]_i = g_i^T z_i = \sum_{j=1}^{k} [g_i]_j \, [z_i]_j \sim \sum_{j=1}^{k} \mathcal{N}\big(0, [z_i]_j^2\big) \sim \mathcal{N}\big(0, \|z_i\|_2^2\big).$$
The algorithm knows this, so the best guess for $\mathbf{d}$ is the zero vector; indeed, each entry of $\mathbf{d}$ is a mean-zero Gaussian, independent of everything the algorithm has seen.
Thus, the algorithm may as well output the diagonal matrix $\tilde{A}$ whose diagonal is $\operatorname{diag}(YX^T)$.
Then
$$\|I\circ A - \tilde{A}\|_F^2 = \|\operatorname{diag}(A) - \operatorname{diag}(YX^T)\|_2^2 = \|\mathbf{d}\|_2^2.$$
We will show that $\|\mathbf{d}\|_2^2$ is large with good probability, so that the output has large error with good probability.
Intuitively this makes sense: the entries of $\mathbf{d}$ are Gaussian, and the variance of $[\mathbf{d}]_i$ is $\|z_i\|_2^2$, so
$$\mathbb{E}\big[\|\mathbf{d}\|_2^2\big] = \sum_{i=1}^{d} \mathbb{E}\big[[\mathbf{d}]_i^2\big] = \sum_{i=1}^{d} \|z_i\|_2^2 = \|Z\|_F^2 = k = \frac{3d}{4}.$$
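As a quick sanity check of this expectation, here is a small numpy simulation (my own illustration, not from the paper; it uses non-adaptive orthonormal queries, which are covered by Lemma 1):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 200, 50          # so k = d - m = 150
errs = []
for _ in range(200):
    A = rng.standard_normal((d, d))
    X, _ = np.linalg.qr(rng.standard_normal((d, m)))   # m orthonormal query vectors
    Y = A @ X                                          # query responses
    # After the queries, the algorithm knows Y @ X.T; the unknown part of the
    # diagonal is diag(A) - diag(Y @ X.T) = diag(G @ Z.T) in the notation above.
    errs.append(np.sum((np.diag(A) - np.diag(Y @ X.T)) ** 2))

print(np.mean(errs), d - m)   # the two numbers should be close
```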
This is on the right scale, since $\|A - I\circ A\|_F^2 \le \|A\|_F^2 = O(d^2)$ with good probability, and hence $\varepsilon\,\|A - I\circ A\|_F^2 = O(\varepsilon d^2) = O(d)$.
However, an expectation bound doesn’t prove a probability bound; it’s possible to have a large expectation because of a very large and very unlikely value.
In this particular case, we could simply use a standard anti-concentration result for Gaussians.
However, for the sake of mirroring the proof we used in the paper, we will use an alternate approach.
Note that the ∥zi∥22 can’t all be super small.
Indeed, let
$$L = \{\, i : \|z_i\|_2^2 \ge 1/4 \,\}.$$
Since $[X \; Z]$ is an orthogonal matrix, $\|z_i\|_2^2 \le 1$ and $\|Z\|_F^2 = k \ge d/2$.
Thus, by Lemma 2 (applied with $a_i = \|z_i\|_2^2$), $|L| \ge d/4$.
Now, define
$$P = \{\, i \in L : |[\mathbf{d}]_i| > 10^{-3} \,\}.$$
If $i \in L$, then just by standard properties of Gaussians,
$$\mathbb{P}\big[\,|[\mathbf{d}]_i| > 10^{-3}\,\big] \ge \frac{99}{100}.$$
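To see where this number comes from (a back-of-the-envelope check, using only that the standard normal density is at most $1/\sqrt{2\pi}$): if $i \in L$ then $[\mathbf{d}]_i \sim \mathcal{N}(0,\sigma^2)$ with $\sigma \ge 1/2$, so
$$\mathbb{P}\big[\,|[\mathbf{d}]_i| \le 10^{-3}\,\big] = \mathbb{P}\big[\,|\mathcal{N}(0,1)| \le 10^{-3}/\sigma\,\big] \le \frac{2\cdot(2\times 10^{-3})}{\sqrt{2\pi}} < \frac{1}{100}.$$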
This implies that P is not much smaller than L.
In particular, by Markov's inequality applied to the number of indices $i \in L$ with $|[\mathbf{d}]_i| \le 10^{-3}$,
$$\mathbb{P}\big[\,|P| > |L|/2\,\big] \ge \frac{49}{50}.$$
We have now established that
$|P| > |L|/2 \ge d/8$ with probability at least 49/50, and
$\|A - I\circ A\|_F^2 \le \|A\|_F^2 \le 50\,d^2$ with probability at least 49/50.
Thus, by a union bound, both events happen simultaneously with probability at least 24/25.
In this case, for some constant D,
$$\|I\circ A - \tilde{A}\|_F^2 = \|\mathbf{d}\|_2^2 = \sum_{i=1}^{d} [\mathbf{d}]_i^2 \ge |P|\cdot(10^{-3})^2 \ge \frac{d}{8}\cdot(10^{-3})^2 \ge D\,\varepsilon\,\|A - I\circ A\|_F^2.$$
The last inequality holds (for, say, $D = 10^{-6}/400$) because $\varepsilon\,\|A - I\circ A\|_F^2 \le \varepsilon\cdot 50\,d^2 = 50\,d$.
Replacing ε with ε/D throughout the proof gives the result with C = D/4.
□
What changes in the general case?
The above proof actually captures pretty much all of the important ideas we use for the lower-bound provided in the paper.
There are a number of details which differ.
The most important are perhaps the following:
In the above proof, the entries of $\mathbf{d}$ were independent. For other sparsity patterns this won’t be the case.
While we used independence in our computation of the variance (which was just for intuition), we did not use independence in the actual proof.
In the paper, we use the distribution $G^T G$.
This somewhat complicates the analog of Lemma 1, as well as computations (since things aren’t all Gaussian anymore).
However, it gives a lower-bound against the larger class of algorithms which can perform matvecs with the transpose.
In the paper we are more careful about statements like “let d = 1/ε” and “$\mathbf{d} = 0$ is the best choice”, and we handle such steps explicitly and precisely.
References
1.
Curtis, A.R.; Powell, M.J.D.; Reid, J.K. On the Estimation of Sparse Jacobian Matrices. IMA Journal of Applied Mathematics 1974, 13, 117–119, doi:10.1093/imamat/13.1.117.
2.
Bekas, C.; Kokiopoulou, E.; Saad, Y. An Estimator for the Diagonal of a Matrix. Applied Numerical Mathematics 2007, 57, 1214–1229, doi:10.1016/j.apnum.2007.01.003.
3.
Halikias, D.; Townsend, A. Structured Matrix Recovery from Matrix-Vector Products. Numerical Linear Algebra with Applications 2023, 31, doi:10.1002/nla.2531.
4.
Braverman, M.; Hazan, E.; Simchowitz, M.; Woodworth, B. The Gradient Complexity of Linear Regression. In Proceedings of the Thirty Third Conference on Learning Theory; Abernethy, J., Agarwal, S., Eds.; PMLR, 2020; Vol. 125, pp. 627–647.