Recently, a new optimizer called Muon has been proposed for training deep learning models whose parameters are weight matrices (Jordan et al. (2024)).
This optimizer seems to be particularly effective for training transformer-based LLMs, consistently outperforming the previous state-of-the-art AdamW optimizer on a variety of benchmarks.
A key step in Muon (and really the big change from past optimizers) is to replace the gradient of the loss with respect to each weight matrix with its polar decomposition (see e.g. the PyTorch implementation).[1]
Mathematically, if $A = U \Sigma V^\mathsf{T}$ is the (thin) singular value decomposition of a matrix $A$, then the polar decomposition of $A$ is

$$\operatorname{polar}(A) = U V^\mathsf{T}.$$
Of course, this definition suggests that we can simply use something like torch.svd, and this does indeed produce good training behavior in terms of iteration count.
However, the SVD is horrifyingly slow on GPU 🐌, so an SVD-based implementation of Muon would be intractable.
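For reference, here is a minimal NumPy sketch (not Muon's actual implementation) of the exact SVD route: if $A = U \Sigma V^\mathsf{T}$, the polar factor is $U V^\mathsf{T}$.

```python
import numpy as np

# A minimal sketch (not Muon's implementation): compute the polar factor
# of A exactly via the SVD. If A = U @ diag(s) @ Vt, the factor is U @ Vt.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
polar = U @ Vt

# The polar factor has orthonormal columns: polar.T @ polar = I.
ortho_err = np.linalg.norm(polar.T @ polar - np.eye(32))
```

This is a useful ground truth for checking the polynomial approximations below, even though it is too slow to use inside an optimizer.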
Instead, implementations of Muon use approximations to the polar decomposition that are computed using only matrix-matrix multiplications and additions.
Recently, Amsel et al. (2026) introduced the PolarExpress method, a principled approach to constructing an approximation to the polar decomposition within Muon.
This approach has gained attention in part because the performance of Muon with PolarExpress appears to be uniformly better than with previous methods such as NS-5.
More generally, PolarExpress is interesting because it shows how tools from fields like numerical analysis and approximation theory can be used to improve algorithms used in machine learning.
Polynomial approximation
Matrix-matrix multiplication and addition are very fast ⚡ on GPUs (these operations are exactly what GPUs are designed for)! As such, we’d like to compute the polar decomposition using only matrix-matrix multiplication and addition.
Note that the matrices

$$A, \quad A(A^\mathsf{T}A), \quad A(A^\mathsf{T}A)^2, \quad \ldots$$

all have the same shape as $A$ and can be computed using matrix-matrix multiplications with $A$. This suggests that we could try to build an approximation to the polar decomposition of the form

$$p(A) = c_0\, A + c_1\, A(A^\mathsf{T}A) + c_2\, A(A^\mathsf{T}A)^2 + \cdots + c_q\, A(A^\mathsf{T}A)^q.$$
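As a concrete sketch, such an odd matrix polynomial can be evaluated with only matrix-matrix products; `odd_matrix_poly` below is an illustrative helper of my own naming, not code from any Muon implementation.

```python
import numpy as np

def odd_matrix_poly(A, coeffs):
    """Evaluate coeffs[0]*A + coeffs[1]*A(A^T A) + coeffs[2]*A(A^T A)^2 + ...
    using only matrix-matrix multiplications and additions (no SVD)."""
    G = A.T @ A            # Gram matrix of A
    term = A               # A (A^T A)^0
    out = coeffs[0] * term
    for c in coeffs[1:]:
        term = term @ G    # next odd term: A (A^T A)^k
        out = out + c * term
    return out

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
pA = odd_matrix_poly(A, [1.5, -0.5])  # same shape as A
```

Each additional coefficient costs one extra multiplication by the Gram matrix, so the whole evaluation is a handful of GPU-friendly matmuls.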
Connection to scalar approximation
The definition of $p(A)$ above (and the title of the section) suggests a connection to polynomials. Indeed, with a bit of linear algebra, one verifies that

$$p(A) = U\, p(\Sigma)\, V^\mathsf{T},$$

where

$$p(x) = c_0 x + c_1 x^3 + c_2 x^5 + \cdots + c_q x^{2q+1}$$

and $p(\Sigma)$ is the diagonal matrix whose diagonal entries are $p(\sigma_i)$ (notice that we are only allowed to use odd powers, since even powers wouldn't be the right shape).
By the unitary invariance of the spectral norm, we see that the error of a polynomial approximation to the polar decomposition is exactly the error of the corresponding scalar polynomial approximation:

$$\| \operatorname{polar}(A) - p(A) \|_2 = \max_i |1 - p(\sigma_i)|.$$
Thus, we have reduced the problem of approximating the polar decomposition on GPUs to a scalar-valued problem in approximation theory: approximating the sign function (which equals 1 on the positive singular values) on the singular values of $A$ using an odd polynomial.
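We can check this identity numerically. The sketch below assumes, for concreteness, that $A$ has been normalized so its largest singular value is 1, and uses the degree-3 polynomial $p(x) = 1.5x - 0.5x^3$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 10))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

X = A / s[0]            # normalize so singular values lie in (0, 1]
sigma = s / s[0]

# apply the odd polynomial p(x) = 1.5 x - 0.5 x^3 via matmuls
pX = 1.5 * X - 0.5 * (X @ X.T @ X)

# matrix error equals the worst-case scalar error over the singular values
matrix_err = np.linalg.norm(U @ Vt - pX, ord=2)
scalar_err = np.max(np.abs(1 - (1.5 * sigma - 0.5 * sigma**3)))
```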
Of course, since we don't know the singular values of $A$, we might instead try to approximate the sign function on an interval $[\ell, u]$, where $\ell$ and $u$ are chosen so that $\ell \le \sigma_i \le u$ for all singular values $\sigma_i$ of $A$. (In practice, one normalizes $A$ so that $u = 1$.) Since every $\sigma_i$ lies in $[\ell, u]$,

$$\| \operatorname{polar}(A) - p(A) \|_2 \le \max_{x \in [\ell, u]} |1 - p(x)|.$$
The above inequality is sharp. For any $\ell$, $u$, and odd polynomial $p$, there exists a matrix $A$ with singular values in $[\ell, u]$ such that the error of the approximation is exactly $\max_{x \in [\ell, u]} |1 - p(x)|$.
First approach: Newton-Schulz
How should we choose the coefficients of our polynomial approximation to the sign function?
A simple approach is to choose the coefficients so that the approximation matches properties of the sign function. For instance, we might choose an approximation of the form $p(x) = ax + bx^3$, so that $p(A) = aA + bA(A^\mathsf{T}A)$. Since there are two free parameters, we can enforce the conditions $p(1) = 1$ and $p'(1) = 0$. This gives us the system of equations

$$a + b = 1, \qquad a + 3b = 0,$$

which has solution $a = 3/2$ and $b = -1/2$.
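The two conditions form a tiny linear system that we can solve directly (a sketch, assuming NumPy):

```python
import numpy as np

# p(x) = a x + b x^3 with p(1) = 1 and p'(1) = 0:
#   a +  b = 1   (value at 1)
#   a + 3b = 0   (derivative at 1)
M = np.array([[1.0, 1.0],
              [1.0, 3.0]])
a, b = np.linalg.solve(M, np.array([1.0, 0.0]))
# a = 1.5 and b = -0.5, i.e. p(x) = (3x - x^3)/2
```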
If we allow a higher degree polynomial, we can enforce that higher derivatives are also zero. For instance, if we choose $p(x) = ax + bx^3 + cx^5$, corresponding to $p(A) = aA + bA(A^\mathsf{T}A) + cA(A^\mathsf{T}A)^2$, then we can enforce that $p(1) = 1$, $p'(1) = 0$, and $p''(1) = 0$. This gives the polynomial:

$$p(x) = \frac{15x - 10x^3 + 3x^5}{8}.$$
We can construct a polynomial in this way for any degree $2q+1$ by enforcing that $p(1) = 1$ and $p^{(j)}(1) = 0$ for $j = 1, \ldots, q$. These polynomials are commonly called the Newton-Schulz polynomials of degree $2q+1$.
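These conditions again form a linear system. The helper below (`newton_schulz_coeffs`, a name of my own choosing) solves it for any $q$, using the fact that the $j$-th derivative of $x^m$ at $x = 1$ is the falling factorial $m(m-1)\cdots(m-j+1)$.

```python
import numpy as np
from math import perm   # perm(m, j) = m!/(m-j)!, and 0 when j > m

def newton_schulz_coeffs(q):
    """Coefficients c_0, ..., c_q of the degree-(2q+1) Newton-Schulz
    polynomial p(x) = sum_k c_k x^(2k+1), defined by p(1) = 1 and
    p^(j)(1) = 0 for j = 1, ..., q."""
    M = np.zeros((q + 1, q + 1))
    for j in range(q + 1):               # j-th derivative condition at x = 1
        for k in range(q + 1):
            M[j, k] = perm(2 * k + 1, j)
    rhs = np.zeros(q + 1)
    rhs[0] = 1.0                         # p(1) = 1
    return np.linalg.solve(M, rhs)
```

For $q = 1$ this recovers $(3/2, -1/2)$, and for $q = 2$ it recovers $(15/8, -10/8, 3/8)$.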
The following figure shows how increasing the degree of the approximation affects the error of the approximation on the interval $[\ell, 1]$. As the degree increases, the error of the approximation decreases. However, the convergence is not very good, particularly when $x$ is small.
Composition
To efficiently compute a high-degree polynomial, we can use an iterative procedure

$$X_{t+1} = p(X_t), \qquad X_0 = A,$$

where $p$ is a low degree polynomial (such as the ones we derived above). This iteration corresponds to an approximation

$$\operatorname{polar}(A) \approx X_T = (\underbrace{p \circ p \circ \cdots \circ p}_{T \text{ times}})(A).$$

Owing to its compositional nature, $X_T$ corresponds to a polynomial of degree $d^T$, where $d$ is the degree of $p$. However, using the iteration above, we can compute $X_T$ using only $O(T)$ matrix multiplications. In other words, we can efficiently construct certain polynomials of exponentially high degree!
Critically, since the degree-5 Newton-Schulz polynomial satisfies $1 - p(x) = O((1-x)^3)$ near $x = 1$, we expect the fixed point iteration to converge cubically!
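Here is a sketch of the composed iteration with the degree-5 Newton-Schulz polynomial, assuming the input has been normalized so its singular values are at most 1 (here we normalize by the exact spectral norm for simplicity):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((50, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# normalize so all singular values lie in (0, 1]
X = A / s[0]

# fixed point iteration X <- p(X) with p(x) = (15x - 10x^3 + 3x^5)/8
for _ in range(10):
    G = X.T @ X
    X = (15 * X - 10 * (X @ G) + 3 * (X @ G @ G)) / 8

# X should now be very close to the polar factor U @ Vt
err = np.linalg.norm(X - U @ Vt, ord=2)
```

Each iteration costs three matrix-matrix products, yet the composition is a polynomial of degree $5^{10}$.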
Let's see how the error of the approximation changes as we increase the number of iterations $T$.
This time, we see convergence is very fast (the convergence is faster than exponential, since we get something decreasing faster than a line on a log-plot)!
The NS-5 method used in some variants of Muon uses this iterative procedure with the degree-5 Newton-Schulz polynomial to compute the approximation.
Polar Express: Optimal approximation
It’s natural to ask whether we can we do better than Newton-Schulz.
Towards this end, several modifications have been proposed for use within Muon (Jordan et al. (2024), Cesista et al. (2025)).
These changes eke out a bit of performance, but ultimately are based on heuristics and trial-and-error.
PolarExpress offers a principled approach.
With what we’ve seen so far, it’s actually pretty simple to describe PolarExpress.
Note that the error bound above suggests that we should find the best possible polynomial approximation to 1 on $[\ell, 1]$. In particular, we can try to solve

$$\min_{p \in \mathcal{P}_d^{\mathrm{odd}}} \ \max_{x \in [\ell, 1]} |1 - p(x)|,$$

where $\mathcal{P}_d^{\mathrm{odd}}$ is the set of odd polynomials of degree at most $d$.
Approximation theorists love best approximation problems, and have developed lots of tools for solving problems like the one above. This particular problem can be solved using a variant of the Remez algorithm.
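In place of a full Remez implementation, here is a sketch that finds the minimax-optimal odd degree-5 polynomial on a fine grid via linear programming (assuming SciPy is available; `optimal_odd_quintic` is an illustrative name, not the PolarExpress authors' code):

```python
import numpy as np
from scipy.optimize import linprog

def optimal_odd_quintic(lo, hi=1.0, grid=2000):
    """Find p(x) = a x + b x^3 + c x^5 minimizing max |1 - p(x)| over a
    grid on [lo, hi]: minimize t subject to -t <= 1 - p(x_i) <= t."""
    x = np.linspace(lo, hi, grid)
    V = np.stack([x, x**3, x**5], axis=1)
    ones = np.ones((grid, 1))
    # variables are [a, b, c, t]; constraints in A_ub @ v <= b_ub form
    A_ub = np.block([[V, -ones], [-V, -ones]])
    b_ub = np.concatenate([np.ones(grid), -np.ones(grid)])
    res = linprog(c=[0, 0, 0, 1], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 4)
    return res.x[:3], res.x[3]   # coefficients, minimax error

coeffs, t = optimal_odd_quintic(0.1)
```

By construction, the resulting error is no worse than that of any other odd quintic on the grid, including the degree-5 Newton-Schulz polynomial.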
For simplicity, let's consider the case $d = 5$ and see what the optimal approximation on $[\ell, 1]$ looks like.
Notice that when $\ell$ is close to 1, the optimal polynomial and the degree-5 Newton-Schulz polynomial are nearly identical.
However, as $\ell$ decreases, the optimal polynomial starts to deviate from the Newton-Schulz polynomial.
In particular, the polynomial goes above 1 on some parts of the interval and below 1 on other parts of the interval.
Intuitively this makes sense: if we want the maximum deviation of $p$ from 1 to be as small as possible, we should allow $p$ to deviate from 1 in both directions, rather than forcing it to always be below 1 like the NS-5 polynomial does.
In fact, the optimal polynomial satisfies an equioscillation property: the maximum deviation of $p$ from 1 is attained, with alternating sign, at $\ell$, at 1, and at the two interior local extrema of $p$. This type of behavior is characteristic of best approximations (and is key to how the Remez algorithm works).
Composition
Just like with the Newton-Schulz polynomial, we would like to use a composition of the optimal polynomial to get a better approximation. Notice that the optimal polynomial on $[\ell, 1]$ maps the interval $[\ell, 1]$ to a new interval around 1, which is often much smaller than $[\ell, 1]$. This means that in the next step of the iteration, we can use the optimal polynomial for the new interval to get an even better approximation.
In particular, let's use an iteration of the form

$$X_t = p_t(X_{t-1}), \qquad X_0 = A.$$

We'll take $p_1$ to be the optimal polynomial on the original interval $[\ell_1, u_1] = [\ell, 1]$, as described above. Now, define

$$\ell_2 = \min_{x \in [\ell_1, u_1]} p_1(x), \qquad u_2 = \max_{x \in [\ell_1, u_1]} p_1(x).$$

This gives us a new interval $[\ell_2, u_2]$ that contains the singular values of $X_1$, so we can take $p_2$ to be the optimal polynomial on $[\ell_2, u_2]$. This gives a new interval $[\ell_3, u_3]$, which we can use to construct $p_3$, and so on.
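To illustrate the interval-tracking idea numerically without re-deriving the optimal polynomials, the sketch below uses the degree-5 Newton-Schulz polynomial as a stand-in for every $p_t$ (PolarExpress would instead use the minimax polynomial for each interval); the interval update rule is the same.

```python
import numpy as np

def p(x):
    # degree-5 Newton-Schulz polynomial, standing in for each p_t
    return (15 * x - 10 * x**3 + 3 * x**5) / 8

lo, hi = 0.05, 1.0        # initial interval containing the singular values
intervals = [(lo, hi)]
for _ in range(8):
    grid = np.linspace(lo, hi, 4000)
    y = p(grid)
    lo, hi = y.min(), y.max()   # image of [lo, hi] under p
    intervals.append((lo, hi))
# the intervals contract rapidly toward {1}
```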
More formally, the Polar Express polynomial after $T$ steps is the composition $p_T \circ p_{T-1} \circ \cdots \circ p_1$.
The PolarExpress algorithm simply uses the iteration above to apply the Polar Express polynomial using only matrix-matrix products.
Now, let's look at what the Polar Express polynomial looks like as we change the number of iterations $T$!
Observe that the Polar Express polynomial is a much better approximation to 1 than the Newton-Schulz polynomial, particularly when $\ell$ is small.
We also note that, as before, the Polar Express polynomial satisfies an equioscillation property.
Greedy is optimal
Remarkably, this iterative greedy procedure for constructing the polynomials is actually optimal among all procedures that use a degree-$d$ polynomial at each step 🤯.
This means that no other choice of composition of degree-$d$ polynomials outperforms Polar Express.
We now visualize the error

$$\max_{x \in [\ell, 1]} |1 - (p_T \circ \cdots \circ p_1)(x)|$$

for the Polar Express and Newton-Schulz polynomials. When $\ell$ is small, Polar Express converges in roughly half the iterations of Newton-Schulz.
Conclusion
The fact that the Polar Express polynomial approximation is optimal among all approximations of its form is a nice theoretical guarantee.
However, it's not super clear why Muon works in the first place (to me at least!), so it's also not clear that a better approximation to the polar decomposition actually leads to better performance of Muon.
Nevertheless, the Polar Express algorithm is a beautiful approach to approximating the polar decomposition that seems to be a good fit for the needs of Muon and serves as a good starting point for engineering-based improvements to Muon and related algorithms.
How to Cite
@misc{chen2026polarexpressviualized,
  author = {Tyler Chen},
  title = {The Polar Express for Muon, Visualized},
  year = {2026},
  url = {https://research.chen.pw/PolarExpress}
}

[1] Technically speaking, what we are calling the polar decomposition is actually the unitary factor of the polar decomposition. We call it the polar decomposition for simplicity.
- Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/
- Amsel, N., Persson, D., Musco, C., & Gower, R. M. (2026). The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm. In The Fourteenth International Conference on Learning Representations. https://arxiv.org/abs/2505.16932
- Cesista, F. L., Jiacheng, Y., & Jordan, K. (2025). Squeezing 1-2% Efficiency Gains Out of Muon by Optimizing the Newton-Schulz Coefficients. https://leloykun.github.io/ponder/muon-opt-coeffs/