This is a companion piece to https://arxiv.org/abs/2402.09379.
Code to replicate the figures in the corresponding paper can be found on Github.
In this work, we study the task of approximating a matrix $A$ with a matrix of a given sparsity pattern (e.g. diagonal, tridiagonal, block diagonal, etc.). There are many existing linear algebra algorithms for matrices of a given sparsity pattern, so this task yields a proxy for $A$ which is compatible with such algorithms. For example, a banded approximation to $A$ could be used as a preconditioner for linear systems involving $A$.
The catch is that we assume $A$ can only be accessed using matrix-vector products (called matvecs). That is, we restrict ourselves to algorithms which choose vectors $x_1, x_2, \ldots$ and subsequently receive the responses $Ax_1, Ax_2, \ldots$ from some black-box, but cannot look at $A$ in any other way. We will measure the cost of an algorithm by $m$, the total number of matrix-vector queries it makes, which in many settings may be a realistic proxy for the real-world costs. This is often called the matrix-vector query model.
It’s worth noting that we can recover $A$ exactly in this access model by simply reading off the columns of $A$ one-by-one; i.e. by querying with the standard basis vectors $e_1, \ldots, e_n$. This means we can effectively run any classical algorithm for the cost of $n$ matvecs. If $n$ is very big this is not practical, but it is worth keeping in mind as a trivial upper bound for the cost of any algorithm in this model. In general, our goal will be to design algorithms which use $m \ll n$ matvecs.
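For concreteness, here is a minimal NumPy sketch of this trivial baseline. The `matvec` function stands in for whatever black-box routine actually applies $A$; all names here are my own, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
A_hidden = rng.standard_normal((n, n))  # pretend we cannot look at this directly

def matvec(x):
    """Black-box access: we may only observe A through products A @ x."""
    return A_hidden @ x

# Trivial baseline: recover A exactly using n matvecs with the standard basis vectors.
I = np.eye(n)
A_recovered = np.column_stack([matvec(I[:, i]) for i in range(n)])
assert np.allclose(A_recovered, A_hidden)
```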
One of the most powerful features of the matvec query model is that it is often possible to prove lower-bounds for the hardness of a given task. That is, to show that it is impossible (for any algorithm) to solve the task unless the algorithm uses at least a certain number of queries. This is exciting, because it offers a way of understanding the “difficulty” of linear algebra tasks! In contrast, proving lower-bounds on the number of floating point operations (which has classically been the measure of cost for linear algebra algorithms) is seemingly far beyond current techniques.
One thing I think is really cool about this paper is that almost everything is self-contained! In particular, besides a few well-known results on properties of Gaussian matrices and a quantitative central limit theorem called the Berry-Esseen theorem, everything needed is proved within the paper using relatively elementary techniques which would be understandable to anyone with an undergraduate-level knowledge of linear algebra and probability. In fact, the general technique is so simple that we even prove a lower-bound for diagonal estimation on this page! Overall, I hope that this paper can serve as a case-study on the power and beauty of the matrix-vector query model.
For convenience, we will measure the error in the Frobenius norm:
$$ \|M\|_F^2 = \sum_{i,j} (M_{i,j})^2. $$
Let’s also encode the desired sparsity pattern in a binary matrix $S$. The nonzero entries of $S$ tell us where our approximation is allowed to have nonzero entries. It is pretty clear then that the best sparse approximation to $A$ is simply $A \circ S$, where “$\circ$” is the Hadamard (entrywise) product. That is, the best approximation is obtained by zeroing out the entries of $A$ which are required to be zero.
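In code, this benchmark is just an entrywise product. A small NumPy sketch (the tridiagonal pattern below is only an example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))

# Tridiagonal sparsity pattern: ones on the main diagonal and first off-diagonals.
S = (np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1).astype(float)

best = A * S  # Hadamard product: zero out all off-pattern entries
err = np.linalg.norm(A - best, "fro")  # the norm of the off-pattern entries of A
print(err)
```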
We will study the following problems:
Problem 1 (Best approximation by a matrix of fixed sparsity). Given a matrix $A$, a binary matrix $S$, and an accuracy $\varepsilon > 0$, find a matrix $\widetilde{A}$ so that $\widetilde{A} = \widetilde{A} \circ S$ and
$$ \|A - \widetilde{A}\|_F \le (1+\varepsilon)\, \|A - A\circ S\|_F. $$
Problem 2 (Best approximation to on-sparsity-pattern entries). Given a matrix $A$, a binary matrix $S$, and an accuracy $\varepsilon > 0$, find a matrix $\widetilde{A}$ so that $\widetilde{A} = \widetilde{A} \circ S$ and
$$ \|A\circ S - \widetilde{A}\|_F \le \varepsilon\, \|A - A\circ S\|_F. $$
Note that since $\widetilde{A}$ has the same sparsity as $A \circ S$ (i.e. $\widetilde{A} = \widetilde{A}\circ S$), the error splits orthogonally into off-pattern and on-pattern parts:
$$ \|A - \widetilde{A}\|_F^2 = \|A - A\circ S\|_F^2 + \|A\circ S - \widetilde{A}\|_F^2. $$
This allows us to relate Problems 1 and 2, and for $\varepsilon \le 1$, the two are equivalent up to (small) constants.
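This orthogonality is easy to check numerically; a quick sketch (the particular pattern and perturbation below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
A = rng.standard_normal((n, n))
S = (rng.random((n, n)) < 0.3).astype(float)            # an arbitrary 0/1 pattern
A_tilde = (A + 0.1 * rng.standard_normal((n, n))) * S   # any matrix with A_tilde = A_tilde ∘ S

lhs = np.linalg.norm(A - A_tilde, "fro") ** 2
rhs = np.linalg.norm(A - A * S, "fro") ** 2 + np.linalg.norm(A * S - A_tilde, "fro") ** 2
assert np.isclose(lhs, rhs)  # off-pattern and on-pattern errors add up
```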
When the sparsity pattern $S$ has at most $s$ non-zero entries per row, the algorithm we study uses $m$ non-adaptive matrix-vector product queries. Specifically, the algorithm computes $AG$, where $G$ is an $n\times m$ matrix with independent standard normal entries, and then outputs the matrix
$$ \widetilde{A} = \operatorname*{argmin}_{\widetilde{A} = \widetilde{A}\circ S}\ \|\widetilde{A} G - AG\|_F. $$
It’s not hard to see how to implement this algorithm. First, observe that each row of the approximation can be computed independently. Then, note that the solution for each row is obtained by solving a standard least squares problem involving the relevant rows of $G$.
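Here is a minimal NumPy sketch of the algorithm as just described. The function name, interface, and defaults are mine (for the authors' implementation, see the Github code linked above):

```python
import numpy as np

def sparse_approx_from_matvecs(matvec, S, m, rng=None):
    """Approximate A on the sparsity pattern S using m non-adaptive matvecs.

    matvec : callable returning A @ X for an (n, m) block of query vectors
    S      : (n, n) 0/1 mask with at most s nonzeros per row
    m      : number of matrix-vector queries (should exceed s)
    """
    rng = np.random.default_rng(rng)
    n = S.shape[0]
    G = rng.standard_normal((n, m))   # Gaussian query vectors
    AG = matvec(G)                    # the only access to A: m matvecs

    A_tilde = np.zeros((n, n))
    for i in range(n):
        idx = np.flatnonzero(S[i])    # allowed nonzero positions in row i
        if idx.size == 0:
            continue
        # Row i of the output solves  min_z || G[idx, :]^T z - (A G)[i, :] ||_2,
        # a least squares problem involving only the relevant rows of G.
        z, *_ = np.linalg.lstsq(G[idx, :].T, AG[i, :], rcond=None)
        A_tilde[i, idx] = z
    return A_tilde
```

For a concrete black-box, one can pass `lambda X: A @ X` for an explicit test matrix `A`.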
Using standard tools from high dimensional probability, which characterize the expected Frobenius norm of the pseudoinverse of a Gaussian matrix, it is relatively simple to establish the following convergence guarantee:
Theorem 1. Consider any $A \in \mathbb{R}^{n\times n}$ and any sparsity pattern $S$ with at most $s$ nonzero entries per row. Then, for any $m > s + 1$, using $m$ randomized matrix-vector queries, the matrix $\widetilde{A}$ defined above is equal to $A \circ S$ in expectation and satisfies
$$ \mathbb{E}\Big[ \|A\circ S - \widetilde{A}\|_F^2 \Big] \le \frac{s}{m - s - 1}\, \|A - A\circ S\|_F^2. $$
The above inequality holds with equality if each row of $S$ has exactly $s$ non-zero entries.
If we set $m = s + \lceil s/\varepsilon^2 \rceil + 1 = O(s/\varepsilon^2)$, then the bound above becomes $\varepsilon^2 \|A - A\circ S\|_F^2$, so we solve Problem 2 (and hence Problem 1), at least in expectation. One can then use Markov’s inequality to get a probability bound. This matches the guarantees of a number of existing algorithms for special cases of Problems 1 and 2 [1], [2], [3], [4], etc., and we comment on the connections in more detail in the paper.
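As an illustrative sanity check (not a substitute for the analysis), one can compare the average error of the `sparse_approx_from_matvecs` sketch above to the bound of Theorem 1. The dimensions, pattern, and trial count below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, m, trials = 100, 3, 20, 200

# Banded (tridiagonal) pattern with at most s = 3 nonzeros per row.
S = (np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= (s - 1) // 2).astype(float)
A = rng.standard_normal((n, n))
opt = np.linalg.norm(A - A * S, "fro") ** 2    # ||A - A∘S||_F^2

errs = []
for _ in range(trials):
    A_tilde = sparse_approx_from_matvecs(lambda X: A @ X, S, m, rng=rng)
    errs.append(np.linalg.norm(A * S - A_tilde, "fro") ** 2)

print("average error / opt :", np.mean(errs) / opt)
print("Theorem 1 bound     :", s / (m - s - 1))
```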
For any fixed $A$, there is an algorithm which solves Problem 1 exactly using zero queries; simply hard-code the algorithm to output $A \circ S$. This is not a practically useful algorithm, because it only works for this one particular $A$, and we would be unable to practically “find” such an algorithm for a given matrix. However, it means it’s impossible to prove a statement like: “For any matrix $A$, no algorithm using fewer than (some number of queries depending on $n$) can solve Problem 1”. Instead, it is standard to consider a distribution over matrices (or equivalently, a random matrix). While an algorithm might be hard-coded for the particular distribution we consider, there is hope that a matrix drawn from the distribution remains very random, even conditioned on a small number of arbitrary matvec queries.
In particular, in the paper we prove a precise version of the following:
Theorem 2. There are constants $c, C > 0$ so that, for any $s$ and any $\varepsilon \in (0,1)$, there is a distribution on matrices $A$ such that, for any sparsity pattern $S$ in a broad class of sparsity patterns with at most $s$ nonzeros per row, and for any algorithm that uses fewer than $c\, s/\varepsilon^2$ matrix-vector queries to $A$ to output a matrix $\widetilde{A} = \widetilde{A}\circ S$,
$$ \mathbb{P}\Big[\, \|A - \widetilde{A}\|_F > (1+\varepsilon)\, \|A - A\circ S\|_F \,\Big] \ge C. $$
This shows that there are problems for which no algorithm can use more than a constant factor fewer matvecs than the algorithm analyzed in the previous section.
At a high-level, the lower-bound technique is pretty simple. The basic idea is that if $A$ is built from a random Gaussian matrix $G$, then after a small number of (possibly adaptive) matvec queries to $A$, there is still too much randomness in $A$ (conditioned on the information learned by the queries) to learn what we need [5].
To give a flavor of how the above lower-bound is proved, we will now prove a simpler result for diagonal estimation; i.e. for the case $S = I$. We will be a bit cavalier 🤠, providing a TCS-style proof (you shouldn’t be just trusting things on the internet anyway).
Theorem 3. There is a constant $c > 0$ such that, for any sufficiently large $n$, there are random matrices $A \in \mathbb{R}^{n\times n}$ such that, for any algorithm that uses at most $n/2$ matrix-vector queries to $A$ to output a diagonal matrix $\widetilde{A}$,
$$ \mathbb{P}\bigg[\, \|A\circ I - \widetilde{A}\|_F^2 \ge \frac{c}{n}\, \|A - A\circ I\|_F^2 \,\bigg] \ge \frac{1}{2}. $$
The key technical tool will be the following Lemma:
Lemma 1. Suppose $A$ is an $n\times n$ matrix with i.i.d. standard normal entries, and let $x_1, \ldots, x_t$ and $y_1 = Ax_1, \ldots, y_t = Ax_t$ be such that, for each $i$, $x_i$ was chosen based only on the query vectors $x_1, \ldots, x_{i-1}$ and the outputs $y_1, \ldots, y_{i-1}$, and is (w.l.o.g.) unit-length and orthogonal to $x_1, \ldots, x_{i-1}$.
Let $X = [x_1\ \cdots\ x_t]$ and $Y = [y_1\ \cdots\ y_t]$. Then, $A$ can be factored as
$$ A = YX^\mathsf{T} + G'V^\mathsf{T}, $$
where $V \in \mathbb{R}^{n\times(n-t)}$ has orthonormal columns orthogonal to $x_1, \ldots, x_t$ (so $[X\ \ V]$ is an orthogonal matrix), and $G' \in \mathbb{R}^{n\times(n-t)}$ has i.i.d. standard normal entries, independently of $X$ and $Y$.
Proof. Clearly $A$ must be decomposable as above for some matrix $G'$ (because the decomposition is just encoding what $A$ does to the vectors in $X$ and $V$). It therefore suffices to argue that $G' = AV$ is a Gaussian matrix independent of $X$ and $Y$.
We proceed by induction, noting that the base case is trivial.
Suppose the lemma holds for $t-1$ queries, and write $A = YX^\mathsf{T} + G'V^\mathsf{T}$ as in the lemma. Choose $x_t$ as a unit vector orthogonal to $x_1, \ldots, x_{t-1}$ (i.e. so that $X^\mathsf{T} x_t = \mathbf{0}$). Note that $x_t = Vc$ for some length-$(n-t+1)$ unit vector $c$; indeed, if $x_t$ is orthogonal to $x_1, \ldots, x_{t-1}$ then it is in the span of the columns of $V$, and it is unit-length if and only if $c$ is unit-length.
Deterministically complete $c$ to an orthogonal matrix $[c\ \ W] \in \mathbb{R}^{(n-t+1)\times(n-t+1)}$. By the invariance of Gaussian matrices under orthogonal transforms,
$$ G'[c\ \ W] = [\,G'c\ \ \ G'W\,] $$
is also a matrix of i.i.d. standard normal entries; in particular, $G'W$ is Gaussian and independent of $G'c$. Moreover, since $G'$ was independent of $X$ and $Y$, so is $G'[c\ \ W]$. Thus, $G'W$ is independent of $X$, $Y$, $x_t$, and $y_t = Ax_t = G'c$.
Now, using the orthogonality of $[c\ \ W]$,
$$ A = YX^\mathsf{T} + G'[c\ \ W][c\ \ W]^\mathsf{T} V^\mathsf{T} = YX^\mathsf{T} + y_t x_t^\mathsf{T} + (G'W)(VW)^\mathsf{T}. $$
Relabeling $[X\ \ x_t] \to X$, $[Y\ \ y_t] \to Y$, $VW \to V$, and $G'W \to G'$ gives the result.
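A quick numerical sanity check of the algebraic part of Lemma 1 (the factorization identity only; the distributional claim about $G'$ is what the induction argument is really for). The setup below uses non-adaptive queries purely for simplicity:

```python
import numpy as np

rng = np.random.default_rng(3)
n, t = 8, 3
A = rng.standard_normal((n, n))

# t orthonormal query vectors and their responses.
X, _ = np.linalg.qr(rng.standard_normal((n, t)))
Y = A @ X

# Complete X to an orthogonal basis; V spans the orthogonal complement of the queries.
Q, _ = np.linalg.qr(np.hstack([X, rng.standard_normal((n, n - t))]))
V = Q[:, t:]
G_prime = A @ V

# The factorization of Lemma 1: A = Y X^T + G' V^T.
assert np.allclose(A, Y @ X.T + G_prime @ V.T)
```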
We will also use the following fact:
Lemma 2. Suppose $a_1, \ldots, a_n \in [0,1]$ and $a_1 + \cdots + a_n \ge n/2$. Then at least $n/4$ of the $a_i$ are larger than $1/4$.
Proof. Let $k = \#\{ i : a_i > 1/4 \}$. Suppose, for the sake of contradiction, that $k < n/4$. Then
$$ a_1 + \cdots + a_n \le k \cdot 1 + (n-k)\cdot \tfrac{1}{4} < \tfrac{n}{4} + \tfrac{n}{4} = \tfrac{n}{2}. $$
This is a contradiction, so the result holds.
Set $t = n/2$ (assume for simplicity that $n$ is even) and let $A$ be an $n\times n$ matrix with i.i.d. standard normal entries; this is our hard distribution. Since additional queries only provide more information, we may as well assume the algorithm uses exactly $t$ queries.
First, note that $\|A - A\circ I\|_F^2$ is a Chi-squared random variable with $n^2 - n$ degrees of freedom, and therefore has mean $n^2 - n \le n^2$. Therefore, by Markov’s inequality,
$$ \mathbb{P}\big[\, \|A - A\circ I\|_F^2 > 4n^2 \,\big] \le \tfrac{1}{4}. $$
The algorithm wants to approximate the diagonal of $A$, but after $t$ queries it only knows $X$, $Y$, and things computable from them (e.g. $YX^\mathsf{T}$ and $V$). We will show the diagonal of $G'V^\mathsf{T}$ is too random, so that the diagonal of $A$ can’t be approximated well (in squared norm relative to $\|A - A\circ I\|_F^2$).
Write the rows of $G'$ and $V$ as $g_1^\mathsf{T}, \ldots, g_n^\mathsf{T}$ and $v_1^\mathsf{T}, \ldots, v_n^\mathsf{T}$, respectively. Then, by Lemma 1,
$$ A_{i,i} = (YX^\mathsf{T})_{i,i} + (G'V^\mathsf{T})_{i,i} = (YX^\mathsf{T})_{i,i} + g_i^\mathsf{T} v_i. $$
Note that since the entries of $G'$ are all independent Gaussians, the entries $g_1^\mathsf{T}v_1, \ldots, g_n^\mathsf{T}v_n$ are also independent Gaussians! In particular,
$$ g_i^\mathsf{T} v_i \sim \mathcal{N}(0, \|v_i\|^2). $$
The algorithm knows this, so the best choice of a guess for $A_{i,i}$ is $(YX^\mathsf{T})_{i,i}$; indeed, each entry $g_i^\mathsf{T} v_i$ is a mean zero Gaussian. Thus, the algorithm can output $\widetilde{A} = \operatorname{diag}\big((YX^\mathsf{T})_{1,1}, \ldots, (YX^\mathsf{T})_{n,n}\big)$ as a guess for $A\circ I$. Then
$$ \|A\circ I - \widetilde{A}\|_F^2 = \sum_{i=1}^{n} (g_i^\mathsf{T} v_i)^2. $$
We will show that this quantity is typically large, so that the output has large error with decent probability. Intuitively this makes sense; since the entries of $G'$ are Gaussian with variance 1,
$$ \mathbb{E}\bigg[ \sum_{i=1}^{n} (g_i^\mathsf{T} v_i)^2 \bigg] = \sum_{i=1}^{n} \|v_i\|^2 = \|V\|_F^2 = n - t = \frac{n}{2}. $$
This is on the right scale, since $\|A - A\circ I\|_F^2 \approx n^2$. However, an expectation bound doesn’t prove a probability bound; it’s possible to have a large expectation because of a very large and very unlikely value.
In this particular case, we could simply use a standard anti-concentration result for Gaussians. However, for the sake of mirroring the proof we used in the paper, we will use an alternate approach.
Note that the $\|v_i\|^2$ can’t all be super small. Indeed, let $a_i = \|v_i\|^2$. Since $[X\ \ V]$ is an orthogonal matrix, $a_i \le 1$ and
$$ a_1 + \cdots + a_n = \|V\|_F^2 = n - t = \frac{n}{2}. $$
Thus, by Lemma 2, at least $n/4$ of the $a_i$ are larger than $1/4$.
Now, define
$$ \mathcal{I} = \{ i : a_i > 1/4 \}, \qquad D = \#\big\{ i \in \mathcal{I} : (g_i^\mathsf{T} v_i)^2 \ge 1/16 \big\}. $$
If $i \in \mathcal{I}$, just by standard properties of Gaussians, $\mathbb{P}[(g_i^\mathsf{T} v_i)^2 \ge 1/16] \ge 1/2$, since $g_i^\mathsf{T} v_i$ is a mean zero Gaussian with variance larger than $1/4$. Since the rows of $G'$ are independent, these events are independent, and a Chernoff bound implies that $D$ is typically not much smaller than $|\mathcal{I}| \ge n/4$. In particular,
$$ \mathbb{P}\big[\, D \ge n/16 \,\big] \ge 1 - e^{-n/64}. $$
We have now established that
$$ \mathbb{P}\big[\, \|A - A\circ I\|_F^2 \le 4n^2 \,\big] \ge \frac{3}{4} \qquad\text{and}\qquad \mathbb{P}\big[\, D \ge n/16 \,\big] \ge 1 - e^{-n/64}. $$
Thus, by a union bound, both events happen simultaneously with probability at least $3/4 - e^{-n/64} \ge 1/2$ (for $n$ larger than some absolute constant).
In this case, for some constant $c$,
$$ \|A\circ I - \widetilde{A}\|_F^2 = \sum_{i=1}^{n} (g_i^\mathsf{T} v_i)^2 \ge \frac{D}{16} \ge \frac{n}{256} \ge \frac{c}{n}\, \|A - A\circ I\|_F^2, $$
where the last step uses $\|A - A\circ I\|_F^2 \le 4n^2$ (so $c = 2^{-10}$ works). Replacing the particular guess $\operatorname{diag}(YX^\mathsf{T})$ with an arbitrary guess depending only on $X$ and $Y$ throughout the proof gives the result with the same constant; the only change is that the bound $\mathbb{P}[(g_i^\mathsf{T} v_i - c_i)^2 \ge 1/16] \ge 1/2$ now uses the fact that a mean zero Gaussian is at least as likely to be far from any fixed point $c_i$ as it is to be far from zero.
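To see the phenomenon numerically, one can simulate the setup of the proof with random (non-adaptive, for simplicity) orthonormal queries and the conditional-mean guess $\operatorname{diag}(YX^\mathsf{T})$; the dimensions below are illustrative, not the constants from the proof:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
t = n // 2

A = rng.standard_normal((n, n))
X, _ = np.linalg.qr(rng.standard_normal((n, t)))  # t orthonormal queries
Y = A @ X                                         # the observed responses

guess = np.diag(Y @ X.T)                 # conditional-mean guess for diag(A)
err = np.sum((np.diag(A) - guess) ** 2)  # squared error of the guess
off = np.linalg.norm(A - np.diag(np.diag(A)), "fro") ** 2

print("error / ||A - A∘I||_F^2 :", err / off)   # roughly (n - t) / n^2 = 1 / (2n)
print("1 / (2n)                :", 1 / (2 * n))
```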
The above proof actually captures pretty much all of the important ideas we use for the lower-bound provided in the paper. There are a number of details which differ. The most important are perhaps the following:
1. Curtis, A.R.; Powell, M.J.D.; Reid, J.K. On the Estimation of Sparse Jacobian Matrices. IMA Journal of Applied Mathematics 1974, 13, 117–119, doi:10.1093/imamat/13.1.117.
2. Bekas, C.; Kokiopoulou, E.; Saad, Y. An Estimator for the Diagonal of a Matrix. Applied Numerical Mathematics 2007, 57, 1214–1229, doi:10.1016/j.apnum.2007.01.003.
3. Halikias, D.; Townsend, A. Structured Matrix Recovery from Matrix‐vector Products. Numerical Linear Algebra with Applications 2023, 31, doi:10.1002/nla.2531.
4. Dharangutte, P.; Musco, C. A Tight Analysis of Hutchinson’s Diagonal Estimator. In Symposium on Simplicity in Algorithms (SOSA); Society for Industrial and Applied Mathematics, 2023; pp. 353–364.
5. Braverman, M.; Hazan, E.; Simchowitz, M.; Woodworth, B. The Gradient Complexity of Linear Regression. In Proceedings of the Thirty Third Conference on Learning Theory; Abernethy, J., Agarwal, S., Eds.; PMLR, 2020; Vol. 125, pp. 627–647.