
While Cholesky QR has a high flop efficiency, it is unstable if $\vec{A}$ is not well-conditioned, since it works with the Gram matrix $\vec{A}^\T\vec{A}$. Fortunately, we can fix this by preconditioning the problem so that it is better conditioned.

Subspace Embedding for approximate QR

We begin with an extremely simple sketching-based algorithm.

This algorithm is easily implemented in NumPy. Here we use a SparseStack sketch.

def sketched_qr(A, k, zeta, rng):

    n, d = A.shape
    S = sparse_stack_sketch(n, k, zeta, rng)  # k x n sketching matrix
    Y = S @ A                                 # sketched matrix (k x d)
    R = np.linalg.qr(Y, mode='r')             # triangular factor of the sketch
    Q = sp.linalg.solve_triangular(R.T, A.T, lower=True).T  # Q = A R^{-1}

    return Q, R
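The helper `sparse_stack_sketch` is assumed from earlier in these notes. As a rough stand-in, a sparse sign sketch in which each column of $\vec{S}$ carries $\zeta$ random $\pm 1/\sqrt{\zeta}$ entries could be built as follows (the name and signature match the call above, but this particular construction is an assumption):

```python
import numpy as np
import scipy.sparse as sps

def sparse_stack_sketch(n, k, zeta, rng):
    """k x n sparse sign sketch: each column has zeta entries +-1/sqrt(zeta)."""
    # zeta distinct row positions per column
    rows = np.concatenate([rng.choice(k, size=zeta, replace=False) for _ in range(n)])
    cols = np.repeat(np.arange(n), zeta)
    vals = rng.choice([-1.0, 1.0], size=n * zeta) / np.sqrt(zeta)
    return sps.csr_matrix((vals, (rows, cols)), shape=(k, n))
```

Storing the sketch as a sparse matrix means applying it to $\vec{A}$ costs $O(\zeta n d)$ rather than $O(knd)$.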

As long as the sketching matrix $\vec{S}$ is a subspace embedding for $\range(\vec{A})$, the output of Algorithm 3.2 is an approximate QR factorization.

Proof

By definition,

$$\forall \vec{x}\in\range(\vec{A}): (1-\varepsilon)\|\vec{x}\|_2 \leq \|\vec{S}\vec{x}\|_2 \leq (1+\varepsilon)\|\vec{x}\|_2.$$

We can reparameterize $\vec{x} = \vec{A}\vec{R}^{-1}\vec{c} = \vec{Q}\vec{c}$ for some $\vec{c}\in\R^d$. Since $\vec{S}\vec{A}\vec{R}^{-1} = \vec{Q}'$ has orthonormal columns, we have $\|\vec{S}\vec{x}\|_2 = \|\vec{S}\vec{A}\vec{R}^{-1}\vec{c}\|_2 = \|\vec{c}\|_2$. Thus,

$$(1-\varepsilon) \|\vec{Q}\vec{c}\|_2 \leq \|\vec{c}\|_2 \leq (1+\varepsilon)\|\vec{Q}\vec{c}\|_2.$$

Hence,

$$\smin(\vec{Q}) = \min_{\vec{c}\neq 0} \frac{\|\vec{Q}\vec{c}\|_2}{\|\vec{c}\|_2} \geq \frac{1}{1+\varepsilon}, \qquad \smax(\vec{Q}) = \max_{\vec{c}\neq 0} \frac{\|\vec{Q}\vec{c}\|_2}{\|\vec{c}\|_2} \leq \frac{1}{1-\varepsilon}.$$

Unfortunately, oblivious sketching methods require $k = \Omega(d/\varepsilon^2)$. This polynomial dependence on $\varepsilon$ means that obtaining a very well-conditioned basis requires a very large sketching dimension $k$, which is impractical for $\varepsilon$ close to machine precision.
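Even with a modest sketching dimension, though, the resulting basis is far better conditioned than $\vec{A}$ itself. Here is a small self-contained check, with a Gaussian sketch standing in for the subspace embedding (the dimensions and the $10^{10}$ condition number are illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 20, 60

# Build an ill-conditioned A with singular values from 1 down to 1e-10
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
A = U * np.logspace(0, -10, d)

# Sketch-and-precondition: R from the QR of SA, then Q = A R^{-1}
S = rng.standard_normal((k, n)) / np.sqrt(k)
R = np.linalg.qr(S @ A, mode='r')
Q = np.linalg.solve(R.T, A.T).T  # solve R^T Q^T = A^T

print(np.linalg.cond(A))  # ~1e10
print(np.linalg.cond(Q))  # small: roughly (1+eps)/(1-eps), far below cond(A)
```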

Preconditioning for QR

While Algorithm 3.2 doesn’t produce a highly orthogonal basis, it does produce a well-conditioned one. Therefore, we can use the Cholesky QR method to orthonormalize the output of Sketched QR. This gives an algorithm with a computational profile similar to Cholesky QR, but without the stability issues.
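Concretely, if Sketched QR returns $\vec{Q}_1 = \vec{A}\vec{R}_1^{-1}$ and Cholesky QR then factors $\vec{Q}_1 = \vec{Q}\vec{R}_2$ with $\vec{Q}$ orthonormal, the factors combine as

$$\vec{A} = \vec{Q}_1\vec{R}_1 = \vec{Q}\vec{R}_2\vec{R}_1 = \vec{Q}(\vec{R}_2\vec{R}_1),$$

so $\vec{R} = \vec{R}_2\vec{R}_1$ is upper triangular as a product of upper-triangular factors.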

We can now use our implementation of Algorithm 3.2 to implement Algorithm 3.3 in NumPy.

def randomized_cholesky_QR(A, k, zeta, rng):

    Q1, R1 = sketched_qr(A, k, zeta, rng)  # well-conditioned basis Q1 = A R1^{-1}
    Q, R2 = cholesky_QR(Q1)               # orthonormalize Q1 via Cholesky QR
    R = R2 @ R1                           # combine the triangular factors

    return Q, R
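Here `cholesky_QR` is the implementation from the previous notebook; for reference, a minimal version matching that interface (an assumption on our part) looks like:

```python
import numpy as np
import scipy.linalg

def cholesky_QR(A):
    # Cholesky factor of the Gram matrix gives R; then Q = A R^{-1}
    L = np.linalg.cholesky(A.T @ A)  # A^T A = L L^T
    R = L.T                          # upper-triangular factor
    Q = scipy.linalg.solve_triangular(L, A.T, lower=True).T  # A R^{-1}
    return Q, R
```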

Numerical Experiment

Let’s compare the performance and accuracy of different QR factorization methods, including our new randomized Cholesky QR approach. We’ll use the same numerical experiment as in the previous notebook, but now we include the randomized Cholesky QR method.

k = int(1.5*d)
zeta = 4

# Define QR factorization methods
qr_methods = [
    {'name': 'Numpy',
     'func': lambda: np.linalg.qr(A, mode='reduced')},
    {'name': 'Cholesky QR',
     'func': lambda: cholesky_QR(A)},
    {'name': 'Rand. Cholesky QR',
     'func': lambda: randomized_cholesky_QR(A,k,zeta,rng)}
]
rng = np.random.default_rng(0) 

# Time the QR factorization methods
n_repeat = 10  # Number of repetitions for averaging

results = []

for qr_method in qr_methods:
    method_name = qr_method['name']
    
    # Time the method
    start = time.time()
    for _ in range(n_repeat):
        Q, R = qr_method['func']()
    end = time.time()
    
    avg_time = (end - start) / n_repeat
    
    # Compute accuracy metrics
    results.append({
        'method': method_name,
        'time (s)': avg_time,
        'orthogonality': np.linalg.norm(Q.T @ Q - np.eye(d)),
        'reconstruction': np.linalg.norm(A - Q @ R)
    })

# Create DataFrame and compute relative performance
results_df = pd.DataFrame(results)
results_df['speedup'] = results_df['time (s)'].max() / results_df['time (s)']

# Display results with formatting
results_df.reindex(columns=['method','time (s)','speedup','orthogonality','reconstruction']).style.format({
    'time (s)': '{:.4f}',
    'orthogonality': '{:1.1e}',
    'reconstruction': '{:1.1e}',
    'speedup': '{:.1f}x',
})

We observe that randomized Cholesky QR achieves most of the speedup of Cholesky QR while still producing an accurate QR factorization!

In the wild

Randomized Cholesky QR is used as a subroutine within subspace iteration in the DION optimizer for LLM training Ahn et al., 2025.

References
  1. Ahn, K., Xu, B., Abreu, N., Fan, Y., Magakyan, G., Sharma, P., Zhan, Z., & Langford, J. (2025). Dion: Distributed Orthonormalized Updates. https://arxiv.org/abs/2504.05295