By and large, RandNLA algorithms achieve speedups over classical NLA algorithms through one or both of the following strategies:
reducing the theoretical complexity of algorithms, and/or
reorganizing computation to take advantage of modern hardware.
In this section, we provide a brief overview of some common ways of measuring costs in NLA. This is important for understanding the improvements that RandNLA can provide. Of course, the “right” way of measuring costs is highly context-dependent, and we encourage readers to take a holistic view.
Floating point operations
Perhaps the most common way of measuring the cost of numerical algorithms is by counting the number of floating point operations (flops) required to complete the algorithm. This provides a mathematical (system-independent) measure of algorithmic complexity, which is useful for comparing algorithms and understanding how they scale with the input size.
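As a concrete example, multiplying $A \in \mathbb{R}^{n \times d}$ by a vector $x \in \mathbb{R}^d$ requires $d$ multiplications and $d-1$ additions for each of the $n$ entries of $Ax$, for a total of $n(2d-1) \approx 2nd$ flops; the counts used in the experiments below are of this kind.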
Flops ≠ runtime
Flops alone don’t tell us everything, since the flop rate (number of flops per unit time) can vary significantly depending on the specific operation being performed and the hardware being used. In modern computing environments, other factors such as memory access patterns, communication costs, and parallelism can have a significant impact on the actual runtime of an algorithm. In fact, some modern hardware is designed specifically to perform certain types of linear-algebra operations (e.g., matrix-matrix multiplication) very efficiently (see, e.g., NVIDIA’s Tensor Cores).
To illustrate this point, let’s consider the following basic linear-algebra operations:
matrix-vector multiplication: $Ax$
matrix-matrix multiplication: $A^T A$
QR factorization of $A$
We’ll time each of these operations on a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $x \in \mathbb{R}^d$. By dividing the number of flops by the runtime, we obtain the flop rate, which gives a sense of the relative efficiency of these primitives.
import numpy as np
import timeit
import pandas as pd

# Generate a random matrix and vector
n = 10000
d = 200
A = np.random.randn(n,d)
x = np.random.randn(d)
# Define the operations and their flop counts
operations = [
    {'name': 'A@x',
     'func': lambda: A@x,
     'flops': 2 * n * d},
    {'name': 'A.T@A',
     'func': lambda: A.T@A,
     'flops': 2 * n * d**2},
    {'name': 'QR(A)',
     'func': lambda: np.linalg.qr(A),
     'flops': 2 * n * d**2 - (2/3) * d**3},
]
We’ll now time each of these operations and compare the flop rates.
# Time the operations
n_repeat = 100 # Number of repetitions for averaging
results = [] # Initialize results list
# Time each operation
for operation in operations:
    method_name = operation['name']
    runtime = timeit.timeit(operation['func'], number=n_repeat)
    avg_time = runtime / n_repeat
    results.append({
        'method': method_name,
        'flops/s': operation['flops']/avg_time
    })
# Create DataFrame
results_df = pd.DataFrame(results)
results_df['flops/s (relative)'] = 100*results_df['flops/s'] / results_df['flops/s'].max()
results_df.style.format({'flops/s':'{:1.1e}','flops/s (relative)':'{:1.0f}%'})
The results of our experiment are striking. Even these basic linear algebra operations differ in efficiency by orders of magnitude.
Precision impacts cost
While NLA has traditionally been performed in double precision, many modern hardware devices (e.g., GPUs) are optimized for single precision or even lower precision arithmetic. Using lower precision can substantially increase the flop rate of numerical linear algebra.
To illustrate this, we compare the flop rates of matrix-matrix multiplication in single and double precision.
# Create matrices in different precisions
A_f64 = np.random.randn(n, d).astype(np.float64)
A_f32 = A_f64.astype(np.float32)
As above, we’ll time each operation and compare the resulting flop rates.
# Precision experiment: compare float64 vs float32 matrix multiplication
precision_ops = [
    {'name': 'A.T@A (float64)',
     'func': lambda: A_f64.T @ A_f64,
     'flops': 2 * n * d**2},
    {'name': 'A.T@A (float32)',
     'func': lambda: A_f32.T @ A_f32,
     'flops': 2 * n * d**2}
]
precision_results = []
# Time each precision
for precision_op in precision_ops:
    method_name = precision_op['name']
    runtime = timeit.timeit(precision_op['func'], number=n_repeat)
    avg_time = runtime / n_repeat
    precision_results.append({
        'method': method_name,
        'time (s)': avg_time,
        'flops/s': precision_op['flops']/avg_time
    })
# Create DataFrame
precision_df = pd.DataFrame(precision_results)
precision_df['speedup'] = precision_df['flops/s'] / precision_df['flops/s'].min()
precision_df.style.format({'time (s)': '{:.4f}', 'flops/s':'{:.1e}', 'speedup': '{:.1f}x'})
Here we observe roughly a 2x speedup from switching from 64-bit double precision to 32-bit single precision. On GPUs, the speedup from using lower precision can be even more pronounced, as the hardware is often specifically designed to take advantage of low-precision arithmetic.
Of course, lower-precision formats have lower precision 🤔, so we must be careful to ensure that the results of our computations are still accurate enough for our purposes.
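As a quick illustrative check (not part of the benchmark above), we can compare the single-precision product against the double-precision result; the snippet below reuses A_f64 and A_f32 from the previous cell, and the names gram_f64 and gram_f32 are just local placeholders.
# Illustrative check: how much accuracy does the float32 computation of
# A.T @ A give up relative to the float64 computation?
gram_f64 = A_f64.T @ A_f64
gram_f32 = A_f32.T @ A_f32

# Relative error in the Frobenius norm; expect several orders of magnitude
# more error than the float64 computation incurs, but often still small
# enough for downstream use.
rel_err = np.linalg.norm(gram_f32.astype(np.float64) - gram_f64) / np.linalg.norm(gram_f64)
print(f'relative error of float32 result: {rel_err:.1e}')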
Matrix-vector queries
An increasingly popular way of measuring the cost of numerical algorithms is by counting the number of matrix-vector queries required to complete the algorithm. Here, a matrix-vector query to $A$ is an evaluation of the linear map $x \mapsto Ax$ for some input vector $x$. In some cases we may also have access to transpose queries $x \mapsto A^T x$.
Some examples where this access model is natural include:
The arithmetic and runtime costs of Krylov subspace methods like power iteration, conjugate gradient, and Lanczos are often dominated by the cost of matrix-vector products.
A legacy PDE solver may give us access to the action of the solution operator $A$ of a PDE, even though we do not have access to the entries of $A$ directly.
Computing matrix-vector products with a Hessian (e.g., of a neural network) can be much cheaper than computing the Hessian itself [Pearlmutter, 1994]; a short sketch of this trick follows this list.
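Here is a minimal sketch of a Hessian-vector product computed without ever forming the Hessian. It uses JAX (which is not used elsewhere in these notes) purely for automatic differentiation, and the loss f, the point w, and the direction v are placeholders.
# Minimal sketch: Hessian-vector product via forward-over-reverse autodiff.
# f is a stand-in for, e.g., a neural-network loss; w is the point at which
# the Hessian is taken; v is the direction of the product.
import jax
import jax.numpy as jnp

def f(w):
    return jnp.sum(jnp.tanh(w) ** 2)   # placeholder scalar-valued loss

w = jnp.arange(5.0)   # parameters
v = jnp.ones(5)       # direction for the Hessian-vector product

# Differentiate grad(f) along the direction v: this costs a small constant
# multiple of one gradient evaluation, versus dim(w) gradient evaluations
# (or worse) to form the full Hessian.
hvp = jax.jvp(jax.grad(f), (w,), (v,))[1]
print(hvp)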
There are more theoretical motivations as well. Perhaps the most exciting is the potential to prove lower bounds on the number of matrix-vector queries required to solve a problem, which offers a way to measure the complexity of various linear-algebra tasks.
In SciPy, the sparse.linalg.LinearOperator class provides the canonical way to represent matrices that are only accessible via matrix-vector products.
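As a rough sketch of this access model, the snippet below wraps the map $x \mapsto A^T(Ax)$ in a LinearOperator and hands it to SciPy’s conjugate gradient solver; the Gram matrix $A^T A$ is never formed, and only matrix-vector products with $A$ are used. The names AtA, rng, and b are illustrative, and the sizes simply match the matrix used above.
# Sketch: solving the normal equations (A.T @ A) x = b using only
# matrix-vector products with A, via scipy.sparse.linalg.LinearOperator.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, d = 10000, 200
A = rng.standard_normal((n, d))
b = A.T @ rng.standard_normal(n)

# Each application of AtA costs two products with A (about 4nd flops),
# and A.T @ A itself is never materialized.
AtA = LinearOperator(
    shape=(d, d),
    matvec=lambda v: A.T @ (A @ v),
    dtype=A.dtype,
)

x, info = cg(AtA, b)   # conjugate gradient only ever calls AtA.matvec
print(info, np.linalg.norm(AtA.matvec(x) - b) / np.linalg.norm(b))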
Entry evaluations
In some settings, evaluating a single entry of a matrix may be the dominant cost. For instance, if $A$ is a kernel matrix, then the $(i,j)$ entry of $A$ takes the form $A_{ij} = k(x_i, x_j)$, where $k$ is a function acting on data of dimension $d$, and $k$ may be costly to evaluate.
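For concreteness, here is an illustrative sketch (the data X, the bandwidth, and the function name are placeholders, not from the text above): each entry of a Gaussian (RBF) kernel matrix costs $O(d)$ work, so algorithms that read only a few entries can be far cheaper than forming the full matrix.
# Illustrative sketch: entrywise access to a Gaussian (RBF) kernel matrix.
# Evaluating one entry costs O(d); forming all m^2 entries costs O(m^2 d).
import numpy as np

rng = np.random.default_rng(0)
m, d = 5000, 1000
X = rng.standard_normal((m, d))    # data points x_1, ..., x_m in R^d

def kernel_entry(i, j, bandwidth=1.0):
    """Return A[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth**2))."""
    diff = X[i] - X[j]
    return np.exp(-(diff @ diff) / (2 * bandwidth**2))

# An algorithm that queries only a handful of entries (e.g., to build a
# low-rank approximation from a few columns) pays O(d) per query rather
# than the O(m^2 d) cost of the full kernel matrix.
print(kernel_entry(0, 1))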
- Pearlmutter, B. A. (1994). Fast Exact Multiplication by the Hessian. Neural Computation, 6(1), 147–160. 10.1162/neco.1994.6.1.147