In this work, we study the task of approximating a d×d matrix A with a matrix Ã of a pre-specified sparsity pattern (e.g. diagonal, tridiagonal, block diagonal, etc.).
There are many existing linear algebra algorithms for matrices of a given sparsity pattern, so this task yields a proxy for A which is compatible with such algorithms.
For example, a banded approximation to A=M−1 could be used as a preconditioner for systems involving M.
The catch is that we assume A can only be accessed using matrix-vector products (called matvecs).
That is, we restrict ourselves to algorithms which choose vectors x1,…,xm and subsequently receive responses Ax1,…,Axm from some black-box, but cannot look at A in any other way.
We will measure the cost of an algorithm by m, the total number of matrix-vector queries it makes, which in many settings is a realistic proxy for real-world cost.
This is often called the matrix-vector query model.
It’s worth noting that we can recover A in this access model by simply reading off the columns of A one-by-one (i.e. by querying with the standard basis vectors e1,…,ed).
This means we can effectively run any classical algorithm for the cost of d matvecs.
If d is very big this is not practical, but it is worth keeping in mind as a trivial upper bound for the cost of any algorithm in this model.
In general, our goal will be to design algorithms which use m≪d matvecs.
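As a quick illustration (my own toy example, not from the paper), here is what the trivial column-by-column recovery looks like in numpy, with `matvec` playing the role of the black box:

```python
import numpy as np

# Toy illustration: recover a hidden d x d matrix A exactly using d matvecs,
# by querying with the standard basis vectors e_1, ..., e_d.
d = 6
rng = np.random.default_rng(0)
A_hidden = rng.standard_normal((d, d))   # pretend this lives behind the black box

def matvec(x):
    # The only way we allow ourselves to touch A.
    return A_hidden @ x

I = np.eye(d)
A_recovered = np.column_stack([matvec(I[:, i]) for i in range(d)])  # column i = A e_i
assert np.allclose(A_recovered, A_hidden)  # d queries recover A exactly
```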
One of the most powerful features of the matvec query model is that it is often possible to prove lower-bounds for the hardness of a given task.
That is, to show that it is impossible (for any algorithm) to solve the task unless the algorithm uses at least a certain number of queries.
This is exciting, because it offers a way of understanding the “difficulty” of linear algebra tasks!
In contrast, proving lower-bounds on the number of floating point operations (which has classically been the measure of cost for linear algebra algorithms) is seemingly far beyond current techniques.
One thing I think is really cool about this paper is that almost everything is self-contained!
In particular, besides a few well-known results on properties of Gaussian matrices and a quantitative central limit theorem called the Berry-Esseen theorem, everything needed is proved within the paper using relatively elementary techniques which would be understandable to anyone with an undergraduate-level knowledge of linear algebra and probability.
In fact, the general technique is so simple that we even prove a lower-bound for diagonal estimation on this page!
Overall, I hope that this paper can serve as a case-study on the power and beauty of the matrix-vector query model.
Let’s also encode the desired sparsity pattern in a binary matrix S.
The nonzero entries of S tell us where the approximation Ã is allowed to have nonzero entries.
It is pretty clear then that the best sparse approximation to A (in Frobenius norm) is simply S∘A, where “∘” denotes the Hadamard (entrywise) product.
That is, the best approximation is obtained by zeroing out the entries of A which are required to be zero.
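To make this concrete, here is a small numpy illustration (my own example, not from the paper) of zeroing out the disallowed entries with a Hadamard product, using a tridiagonal pattern. The final assertion checks the optimality claim on one perturbation: the Frobenius error decouples entrywise, so moving any allowed entry away from A’s value can only hurt.

```python
import numpy as np

d = 5
rng = np.random.default_rng(1)
A = rng.standard_normal((d, d))

# Tridiagonal sparsity pattern encoded as a binary mask S.
idx = np.arange(d)
S = (np.abs(idx[:, None] - idx[None, :]) <= 1).astype(float)

best = S * A   # Hadamard product: zero out every entry outside the pattern

# The Frobenius error decouples entrywise:
#   ||A - B||_F^2 = ||A - S*A||_F^2 + ||S*A - B||_F^2
# for any B with the same pattern, so S*A is optimal.
B = best.copy()
B[0, 1] += 0.5   # perturb one allowed entry
assert np.linalg.norm(A - best, "fro") <= np.linalg.norm(A - B, "fro")
```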
We will study the following problems:
Note that the approximation Ã is required to have the same sparsity pattern as S; i.e. Ã=S∘Ã.
When the sparsity pattern S has at most s non-zero entries per row, this algorithm uses m=O(s/ε) non-adaptive matrix-vector product queries.
Specifically, the algorithm computes Z=AG, where G is a d×m matrix with independent standard normal entries, and then outputs the matrix Ã with sparsity pattern S which minimizes the Frobenius norm of ÃG−Z.
It’s not hard to see how to implement this algorithm.
First, observe that each row of the approximation can be computed independently.
Then, note that the solution to each row is obtained by solving a standard least squares problem involving the relevant rows of G.
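Here is a minimal numpy sketch of that implementation, as I understand it from the description above (the function name and interface are mine, not from the paper):

```python
import numpy as np

def sparse_approx(matvec, d, S, m, seed=None):
    """Row-wise least-squares sketch of the algorithm described above.

    matvec : function x -> A @ x (our only access to A)
    S      : (d, d) binary mask encoding the allowed sparsity pattern
    m      : number of Gaussian matvec queries
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((d, m))                            # Gaussian query vectors
    Z = np.column_stack([matvec(G[:, j]) for j in range(m)])   # Z = A G, m matvecs total

    A_tilde = np.zeros((d, d))
    for i in range(d):
        J = np.flatnonzero(S[i])          # allowed nonzero positions in row i
        if J.size == 0:
            continue
        # Row i solves  min_x || x @ G[J, :] - Z[i, :] ||_2  over x of length |J|.
        x, *_ = np.linalg.lstsq(G[J, :].T, Z[i, :], rcond=None)
        A_tilde[i, J] = x
    return A_tilde
```

Each row only involves the rows of G indexed by that row's allowed positions, which is why the number of queries needed scales with s, the maximum number of nonzeros per row, rather than with d.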
Using standard tools from high dimensional probability, which characterize the expected Frobenius norm of the pseudoinverse of a Gaussian matrix, it is relatively simple to establish the following convergence guarantee:
If we set m=O(s/ε), then we solve the two problems above, at least in expectation.
One can then use Markov’s inequality to get a probability bound.
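Concretely, here is a sketch of that step, assuming the expectation guarantee takes the form E[‖A−Ã‖F²] ≤ (1+ε)‖A−S∘A‖F² (the precise statement is in the paper). The excess error X = ‖A−Ã‖F² − ‖A−S∘A‖F² is nonnegative, since S∘A is the best approximation with pattern S, and it satisfies E[X] ≤ ε‖A−S∘A‖F². Markov’s inequality then gives

Pr[X ≥ (ε/δ)‖A−S∘A‖F²] ≤ E[X] / ((ε/δ)‖A−S∘A‖F²) ≤ δ,

so running the algorithm with εδ in place of ε (i.e. with m=O(s/(εδ))) gives a (1+ε)-type guarantee with probability at least 1−δ.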
This matches the guarantees of a number of existing algorithms for special cases of Problems 1 and 2 (Curtis et al., 1974; Bekas et al., 2007; Halikias & Townsend, 2023; Dharangutte & Musco, 2023), and we comment on these connections in more detail in the paper.
For any fixed A, there is an algorithm which solves Problem 1 exactly using zero queries; simply hard-code the algorithm to output S∘A.
This is not a practically useful algorithm, because it only works for that one particular A, and we would have no practical way to “find” such an algorithm for a given matrix.
However, it means it’s impossible to prove a statement like: “For any matrix A, no algorithm using (some amount of queries depending on ε) can solve Problem 1”.
Instead, it is standard to consider a distribution of matrices A (or equivalently, a random matrix).
While an algorithm might be hard-coded for the particular distribution we consider, there is hope that a matrix drawn from this distribution remains very random, even conditioned on the responses to a small number of arbitrary matvec queries.
In particular, in the paper we prove a precise version of the following:
This shows that the algorithm analyzed in the previous section is optimal up to constant factors: no algorithm can solve the problem using more than a constant factor (e.g. 100 times) fewer matvecs.
At a high-level, the lower-bound technique is pretty simple.
The basic idea is that if A=GTG, where G is a d×d random Gaussian matrix, then after a small number of (possibly adaptive) matvec queries to A, there is still too much randomness in A (conditioned on the information learned by the queries) to learn what we need (Braverman et al., 2020).
To give a flavor of how the above lower-bound is proved, we will now prove a simpler result for diagonal estimation; i.e. for when S=I.
We will be a bit cavalier 🤠, providing a TCS-style proof (you shouldn’t just be trusting things on the internet anyway).
The key technical tool will be the following Lemma:
Proof
Clearly A must be decomposed as above for some matrix G (because this is just encoding what A does to the vectors in X).
It therefore suffices to argue that G is a Gaussian matrix independent of X and Y.
We proceed by induction, noting that the base case is trivial.
Suppose the lemma holds for m queries.
Choose xm+1 as a unit vector orthogonal to x1,…,xm (i.e. so that XTxm+1=0).
Note that xm+1=Zc, for some length d−m unit vector c; indeed, if xm+1 is orthogonal to X then it is in the span of Z, and it is unit-length if and only if c is unit-length.
Deterministically complete c to an orthogonal matrix (for instance, append the identity, delete the first linearly dependent column, and apply Gram-Schmidt).
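The fact doing the real work in this completion step is the rotational invariance of the Gaussian distribution: if G has i.i.d. standard normal entries and C is a fixed orthogonal matrix, then GC has exactly the same distribution as G. Here is a quick numerical sanity check of this fact (my own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 4, 200_000

# A fixed orthogonal matrix C (here obtained from a QR factorization).
C, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Sample many Gaussian matrices G and rotate them; rotational invariance says
# G @ C again has i.i.d. standard normal entries.
G = rng.standard_normal((trials, n, n))
rotated = G @ C

print(np.abs(rotated.mean(axis=0)).max())   # entrywise means: close to 0
row = rotated[:, 0, :]                       # first row of each rotated sample
print(np.round(row.T @ row / trials, 2))     # empirical covariance: close to the identity
```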
The above proof actually captures pretty much all of the important ideas we use for the lower-bound provided in the paper.
There are a number of details which differ.
The most important are perhaps the following:
In the above proof, the entries of the diagonal d were independent. For other sparsity patterns this won’t be the case.
While we used independence in our computation of the variance (which was just for intuition), we did not use independence in the actual proof.
In the paper, we use the distribution GTG.
This somewhat complicates the analog of Lemma 1, as well as computations (since things aren’t all Gaussian anymore).
However, it gives a lower-bound against the larger class of algorithms which can perform matvecs with the transpose.
In the paper we are more careful about statements like “let d=1/ε” and “d=0 is the best choice”, and we handle these cases explicitly and precisely.
Amsel, N., Chen, T., Keles, F. D., Halikias, D., Musco, C., & Musco, C. (2026). Fixed-sparsity matrix approximation from matrix-vector products.
Curtis, A. R., Powell, M. J. D., & Reid, J. K. (1974). On the Estimation of Sparse Jacobian Matrices. IMA Journal of Applied Mathematics, 13(1), 117–119. 10.1093/imamat/13.1.117
Bekas, C., Kokiopoulou, E., & Saad, Y. (2007). An estimator for the diagonal of a matrix. Applied Numerical Mathematics, 57(11–12), 1214–1229. 10.1016/j.apnum.2007.01.003
Halikias, D., & Townsend, A. (2023). Structured matrix recovery from matrix‐vector products. Numerical Linear Algebra with Applications, 31(1). 10.1002/nla.2531
Dharangutte, P., & Musco, C. (2023). A Tight Analysis of Hutchinson’s Diagonal Estimator. In Symposium on Simplicity in Algorithms (SOSA) (pp. 353–364). Society for Industrial and Applied Mathematics. 10.1137/1.9781611977585.ch32
Braverman, M., Hazan, E., Simchowitz, M., & Woodworth, B. (2020). The Gradient Complexity of Linear Regression. In J. Abernethy & S. Agarwal (Eds.), Proceedings of Thirty Third Conference on Learning Theory (Vol. 125, pp. 627–647). PMLR. https://proceedings.mlr.press/v125/braverman20a.html