The Polar Express for Muon, Visualized

An interactive introduction

Recently, a new optimizer called Muon has been proposed for training deep learning models with matrix-valued weights (Jordan et al. (2024)). This optimizer seems to be particularly effective for training transformer-based LLMs, consistently outperforming the previous state-of-the-art AdamW optimizer on a variety of benchmarks.

A key step in Muon (and really the big change from past optimizers) is to replace the gradient of the loss with respect to each weight matrix with its polar decomposition (see e.g. the PyTorch implementation).[1] Mathematically, the polar decomposition of a matrix $\vec{G}$ is

$$\polar(\vec{G}) := \vec{U} \vec{V}^\T, \qquad \text{where } \vec{G} = \vec{U} \vec{\Sigma} \vec{V}^\T \text{ is the SVD of } \vec{G}.$$

Of course, this definition suggests that we can simply use something like torch.svd, and this does produce good training behavior in terms of iterations. However, the SVD is horrifyingly slow on GPU 🐌, so an SVD-based implementation of Muon would be intractable. Instead, implementations of Muon use approximations to the polar decomposition that are computed using only matrix-matrix multiplications and additions.
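For reference, here is what an SVD-based implementation looks like (a NumPy sketch for illustration only; actual Muon implementations operate on PyTorch tensors and, as discussed next, avoid the SVD entirely):

```python
import numpy as np

def polar_via_svd(G):
    """Reference implementation: polar(G) = U V^T from the (slow) SVD."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)  # thin SVD: G = U diag(s) V^T
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((8, 6))
P = polar_via_svd(G)
# For a full-rank tall matrix, the polar factor has orthonormal columns: P^T P = I.
```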

Recently, Amsel et al. (2026) introduced the PolarExpress method, a principled approach to constructing an approximation to the polar decomposition within Muon. This approach has gained attention in part because the performance of Muon with PolarExpress seems to be uniformly better than with previous methods such as NS-5. More generally, PolarExpress is interesting because it shows how tools from fields like numerical analysis and approximation theory can be used to improve algorithms used in machine learning.

Polynomial approximation

Matrix-matrix multiplication and addition are very fast ⚡ on GPUs (these operations are exactly what GPUs are designed for)! As such, we’d like to compute the polar decomposition using only matrix-matrix multiplication and addition.

Note that the matrices

$$\vec{G}(\vec{G}^\T \vec{G}),\ \vec{G}(\vec{G}^\T \vec{G})^2,\ \ldots$$

all have the same shape as $\vec{G}$ and can be computed using matrix-matrix multiplications with $\vec{G}$. This suggests that we could try to build an approximation to the polar decomposition of the form

$$\polar(\vec{G}) \approx a_0 \vec{G} + a_1 \vec{G}(\vec{G}^\T\vec{G}) + a_2 \vec{G}(\vec{G}^\T\vec{G})^2 + \cdots + a_q \vec{G}(\vec{G}^\T\vec{G})^q =: p(\vec{G}).$$

Connection to scalar approximation

The definition of $p(\vec{G})$ above (and the title of the section) suggests a connection to polynomials. Indeed, with a bit of linear algebra, one verifies that

$$p(\vec{G}) = \vec{U} p(\vec{\Sigma}) \vec{V}^\T,$$

where

p(x)=a0x+a1x3++aqx2q+1p(x) = a_0 x + a_1 x^3 + \cdots + a_q x^{2q+1}

and $p(\vec{\Sigma})$ is the diagonal matrix whose diagonal entries are $p(\sigma_i)$ (notice that we are only allowed to use odd powers, since even powers wouldn't be the right shape).

By the unitary invariance of the spectral norm, we see that the error of a polynomial approximation to the polar decomposition is exactly the error of the scalar polynomial approximation:

$$\begin{align} \| \polar(\vec{G}) - p(\vec{G}) \|_{\mathrm{op}} &= \| \vec{U} \vec{I} \vec{V}^\T - \vec{U} p(\vec{\Sigma}) \vec{V}^\T \|_{\mathrm{op}} \\ &= \| \vec{I} - p(\vec{\Sigma}) \|_{\mathrm{op}} \\ &= \max_i | 1 - p(\sigma_i) |. \end{align}$$

Thus, we have reduced the problem of approximating the polar decomposition on GPUs to a scalar-valued problem in approximation theory: approximating the function $f(x) = 1$ on the singular values of $\vec{G}$ using an odd polynomial.
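We can check both the matrix identity and the error equality numerically (a NumPy sketch; the coefficients below are arbitrary, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((7, 5))
U, s, Vt = np.linalg.svd(G, full_matrices=False)

a0, a1, a2 = 0.9, 0.3, -0.1          # arbitrary coefficients for illustration
GtG = G.T @ G
pG = a0 * G + a1 * G @ GtG + a2 * G @ GtG @ GtG   # p(G)
ps = a0 * s + a1 * s**3 + a2 * s**5               # p applied to the singular values
# p(G) = U p(Sigma) V^T, and the spectral-norm error is max_i |1 - p(sigma_i)|
err = np.linalg.norm(U @ Vt - pG, 2)              # ord=2 gives the spectral norm
```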

Of course, since we don't know the singular values of $\vec{G}$, we might instead try to approximate the function $f(x) = 1$ on an interval $[\ell,u]$, where $\ell$ and $u$ are chosen so that $\sigma_i \in [\ell,u]$ for all singular values $\sigma_i$ of $\vec{G}$. Since $\cup_i \{\sigma_i\} \subset [\ell,u]$,

$$\| \polar(\vec{G}) - p(\vec{G}) \|_{\mathrm{op}} \leq \max_{x\in[\ell,u]} | 1 - p(x) |.$$

The above inequality is sharp: for any $\ell$, $u$, and $p(x)$, there exists a matrix $\vec{G}$ with singular values in $[\ell,u]$ such that the error of the approximation is exactly $\max_{x\in[\ell,u]} | 1 - p(x) |$.
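For instance, a diagonal matrix whose singular values all sit at the maximizer of $|1 - p(x)|$ attains the bound. A small NumPy check (the odd polynomial here is arbitrary):

```python
import numpy as np

p = lambda x: 1.2 * x - 0.4 * x**3        # any odd polynomial works here
l, u = 0.1, 1.0
grid = np.linspace(l, u, 10001)
worst = grid[np.argmax(np.abs(1 - p(grid)))]   # where |1 - p(x)| is largest
G = np.diag([worst, worst])   # singular values all at the worst point
# polar(G) = I, and for diagonal G the elementwise powers in p agree with
# matrix powers, so the approximation error equals the scalar bound exactly.
err = np.linalg.norm(np.eye(2) - p(G), 2)
bound = np.abs(1 - p(grid)).max()
```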

First approach: Newton-Schulz

How should we choose the coefficients $a_0, a_1, \ldots, a_q$ of our polynomial approximation $p(x)$ to the sign function?

A simple approach is to choose the coefficients so that the approximation matches properties of the sign function. For instance, we might specify that we will choose an approximation with $q=1$, so that $p(x) = a_0 x + a_1 x^3$. Since there are two free parameters, we can enforce the conditions $p(u) = 1$ and $p'(u) = 0$. When $u=1$, this gives us the system of equations

$$\left\{ \begin{aligned} a_0 + a_1 &= 1 \\ a_0 + 3a_1 &= 0 \end{aligned} \right.$$

which has solution $a_0 = 3/2$ and $a_1 = -1/2$.

If we allow a higher degree polynomial, we can enforce that higher derivatives are also zero. For instance, if we choose $p(x) = a_0 x + a_1 x^3 + a_2 x^5$, corresponding to $q=2$, then we can enforce that $p(u) = 1$, $p'(u) = 0$, and $p''(u) = 0$. When $u=1$ this gives the polynomial

$$x \mapsto \frac{15}{8} x - \frac{10}{8} x^3 + \frac{3}{8} x^5.$$

We can construct a polynomial in this way for any $q$ by enforcing that $p(u) = 1$ and $p^{(i)}(u) = 0$ for $i=1,\ldots,q$. These polynomials are commonly called the Newton-Schulz polynomials of degree $d = 2q+1$.
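These coefficients can be computed for any $q$ by solving the linear system encoding the conditions $p(1) = 1$ and $p^{(i)}(1) = 0$ (a sketch; `ns_coeffs` is a hypothetical helper name, not from any Muon codebase):

```python
import numpy as np
from math import perm  # perm(n, i) = n (n-1) ... (n-i+1), the falling factorial

def ns_coeffs(q):
    """Coefficients a_0..a_q of the degree-(2q+1) Newton-Schulz polynomial."""
    # Row i encodes p^(i)(1): the i-th derivative of x^(2k+1) at x = 1 is perm(2k+1, i).
    M = np.array([[perm(2 * k + 1, i) for k in range(q + 1)]
                  for i in range(q + 1)], dtype=float)
    rhs = np.zeros(q + 1)
    rhs[0] = 1.0   # p(1) = 1; the q higher derivatives vanish
    return np.linalg.solve(M, rhs)
```

For `q=1` this recovers $(3/2, -1/2)$ and for `q=2` it recovers $(15/8, -10/8, 3/8)$, matching the polynomials derived above.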

The following figure shows how increasing the degree $d = 2q+1$ of the approximation affects the error of the approximation on the interval $[\ell,1]$. As the degree increases, the error of the approximation decreases. However, the convergence is not very good, particularly when $\ell$ is small.

*(Interactive figure: error of the degree-$d$ Newton-Schulz approximation on $[\ell,1]$.)*

Composition

To efficiently compute a high-degree polynomial, we can use an iterative procedure

$$\vec{X}_{t} = p(\vec{X}_{t-1}), \qquad \vec{X}_0 = \vec{G},$$

where p(x)p(x) is a low degree polynomial (such as the ones we derived above). This iteration corresponds to an approximation

$$\vec{X}_T = p_T(\vec{G}), \quad \text{where} \quad p_T = \underbrace{p \circ p \circ \cdots \circ p}_{T \text{ times}}.$$

Owing to its compositional nature, $p_T(x)$ is of degree $d^T$, where $d$ is the degree of $p(x)$. However, using the iteration above, we can compute $p_T(\vec{G})$ using only $O(dT)$ matrix multiplications. In other words, we can efficiently construct certain polynomials of exponentially high degree!
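To see this concretely with the degree-5 Newton-Schulz polynomial derived above, six compositions already evaluate a polynomial of degree $5^6 = 15625$ at the cost of six evaluations of $p$:

```python
# Composing the degree-5 Newton-Schulz polynomial: T steps evaluate p_T,
# a polynomial of degree 5**T, using only T evaluations of p.
p = lambda x: (15 * x - 10 * x**3 + 3 * x**5) / 8

x = 0.2             # stands in for a singular value of G
for _ in range(6):  # p_6 has degree 5**6 = 15625
    x = p(x)
# x is now extremely close to 1 = sign(0.2)
```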

Critically, since the degree $d=5$ Newton-Schulz polynomial $p(x)$ satisfies $p'(u) = p''(u) = 0$ by construction, we expect the fixed-point iteration to converge cubically!

Let’s see how the error of the approximation changes as we increase the number of iterations $T$.

*(Interactive figure: error of the composed Newton-Schulz approximation as $T$ increases.)*

This time, we see convergence is very fast (faster than exponential, since the error decreases faster than a line on a log-plot)! The NS-5 method used in some variants of Muon uses this iterative procedure to compute the approximation.
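The matrix version of this iteration is a few lines of NumPy (a sketch assuming the input is first scaled so its singular values lie in $(0,1]$; note that computing the exact spectral norm for this scaling uses an SVD, so practical implementations substitute a cheap upper bound like the Frobenius norm):

```python
import numpy as np

def ns5_polar(G, T=20):
    """Approximate polar(G) by iterating the degree-5 Newton-Schulz polynomial."""
    X = G / np.linalg.norm(G, 2)   # scale so all singular values lie in (0, 1]
    for _ in range(T):
        A = X.T @ X
        X = (15 * X - 10 * X @ A + 3 * X @ A @ A) / 8
    return X

rng = np.random.default_rng(2)
G = rng.standard_normal((8, 6))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
approx = ns5_polar(G)
# approx should be close to the exact polar factor U V^T
```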

Polar Express: Optimal approximation

It’s natural to ask whether we can do better than Newton-Schulz. Towards this end, several modifications have been proposed for use within Muon (Jordan et al. (2024), Cesista et al. (2025)). These changes eke out a bit of performance, but ultimately are based on heuristics and trial-and-error.

PolarExpress offers a principled approach.

With what we’ve seen so far, it’s actually pretty simple to describe PolarExpress.

Note that the bound above suggests that we can try to find the best possible polynomial approximation to 1 on $[\ell,u]$. In particular, we can try to solve

$$\min_{p \in \mathbb{P}_d^{\mathrm{odd}}} \max_{x \in [\ell,u]} |1 - p(x)|,$$

where

$$\mathbb{P}_d^{\mathrm{odd}} := \{ \text{odd polynomials of degree at most } d \}.$$

Approximation theorists love best approximation, and have developed lots of tools for solving problems like the one above. This particular problem can be solved using a variant of the Remez algorithm.
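One simple way to approximate the solution of the discretized minimax problem with plain NumPy is Lawson's iteratively reweighted least-squares algorithm, a classical alternative to the Remez exchange (a sketch for intuition, not the algorithm used in the PolarExpress paper; here $d = 2q+1$):

```python
import numpy as np

def minimax_odd_poly(l, u, q, n_grid=2000, iters=300):
    """Approximate coefficients a_0..a_q of the odd polynomial
    p(x) = sum_k a_k x^(2k+1) minimizing max_{x in [l,u]} |1 - p(x)|,
    via Lawson's iteratively reweighted least squares on a grid."""
    x = np.linspace(l, u, n_grid)
    A = np.stack([x ** (2 * k + 1) for k in range(q + 1)], axis=1)
    w = np.full(n_grid, 1.0 / n_grid)
    for _ in range(iters):
        sw = np.sqrt(w)
        # weighted least-squares fit of p(x) to the constant 1
        a, *_ = np.linalg.lstsq(A * sw[:, None], sw, rcond=None)
        w = w * np.abs(1.0 - A @ a)   # Lawson update: reweight by current error
        w = w / w.sum()
    return a
```

On $[\ell, u] = [0.5, 1]$ with $d = 5$, the resulting polynomial has a noticeably smaller worst-case error than the degree-5 Newton-Schulz polynomial.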

For simplicity, let’s consider the case $d=5$ and see what the optimal approximation on $[\ell,u]$ looks like.

*(Interactive figure: the optimal degree-5 polynomial on $[\ell,u]$ versus Newton-Schulz.)*

Notice that when $\ell=u$, the optimal polynomial and the degree 5 Newton-Schulz polynomial are identical. However, as $\ell$ decreases, the optimal polynomial starts to deviate from the Newton-Schulz polynomial. In particular, the polynomial goes above 1 on some parts of the interval and below 1 on other parts of the interval. Intuitively this makes sense: if we want the maximum deviation of $p(x)$ from 1 to be as small as possible, we should allow $p(x)$ to deviate from 1 in both directions, rather than forcing it to always be below 1 like the NS-5 polynomial does.

In fact, the optimal polynomial satisfies an equioscillation property: the maximum deviation of $p(x)$ from 1 is attained, with alternating signs, at $\ell$, $u$, and the two interior local extrema of $p$. This type of behavior is characteristic of best approximations (and is key to how the Remez algorithm works).

Composition

Just like with the Newton-Schulz polynomial, we would like to use a composition of the optimal polynomial to get a better approximation. Notice that the optimal polynomial on $[\ell,u]$ maps the interval $[\ell,u]$ to a new interval centered around 1, which is often much smaller than $[\ell,u]$. This means that in the next step of the iteration, we can use the optimal polynomial for the new interval to get an even better approximation.

In particular, let’s use an iteration of the form

$$\vec{X}_{t} = p_t(\vec{X}_{t-1}), \qquad \vec{X}_0 = \vec{G}.$$

We’ll take $p_1(x)$ to be the optimal polynomial on the original interval $[\ell,u]$ as described above. Now, define

$$\ell_2 = \min_{x\in[\ell,u]} p_1(x), \qquad u_2 = \max_{x\in[\ell,u]} p_1(x).$$

This gives us a new interval $[\ell_2,u_2]$ that contains the singular values of $p_1(\vec{G})$, so we can take $p_2(x)$ to be the optimal polynomial on $[\ell_2,u_2]$. This gives a new interval $[\ell_3,u_3]$, which we can use to construct $p_3$, and so on.

*(Interactive figure: how each $p_t$ maps $[\ell_t,u_t]$ to a smaller interval around 1.)*
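The interval-tracking mechanics can be sketched with a stand-in polynomial (here the fixed degree-5 Newton-Schulz polynomial rather than the per-step optimal ones, so the shrinkage is slower than in Polar Express):

```python
import numpy as np

def p(x):  # stand-in for the per-step polynomial (degree-5 Newton-Schulz)
    return (15 * x - 10 * x**3 + 3 * x**5) / 8

l, u = 1e-3, 1.0
intervals = [(l, u)]
for t in range(14):
    grid = np.linspace(l, u, 10001)
    vals = p(grid)
    l, u = vals.min(), vals.max()   # [l_{t+1}, u_{t+1}] contains p([l_t, u_t])
    intervals.append((l, u))
# The intervals shrink toward the point {1}.
```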

More formally, the Polar Express polynomial is defined as the composition $p_T \circ \cdots \circ p_1$, where each $p_t$ is the optimal degree-$d$ polynomial on $[\ell_t,u_t]$ and the intervals are updated greedily as above.

The PolarExpress algorithm simply uses the iteration $\vec{X}_t = p_t(\vec{X}_{t-1})$ to apply the Polar Express polynomial.

Now, let’s look at what the Polar Express polynomial looks like as we change the number of iterations $T$!

*(Interactive figure: the Polar Express polynomial for varying $T$.)*

Observe that the Polar Express polynomial is a much better approximation to 1 than the Newton-Schulz polynomial, particularly when $\ell$ is small.

We also note that, as before, the Polar Express polynomial satisfies an equioscillation property.

Greedy is optimal

Remarkably, this iterative greedy procedure for constructing the polynomials is actually optimal among all procedures that use a degree $d$ polynomial at each step 🤯.

This means that no other choice of composition of $T$ polynomials of degree $d$ outperforms Polar Express.

We now visualize the error

$$\max_{x\in[\ell,u]} |1-p(x)|$$

for the Polar Express and Newton-Schulz polynomials. When $\ell$ is small, Polar Express converges in roughly half the iterations of Newton-Schulz.

*(Interactive figure: approximation error of Polar Express versus Newton-Schulz.)*

Conclusion

The fact that the Polar Express polynomial approximation is optimal among all approximations of its form is a nice theoretical guarantee. However, since it’s not super clear why Muon works in the first place (to me at least!), it’s also not clear that a better approximation to the polar decomposition actually leads to better performance of Muon. Nevertheless, the Polar Express algorithm is a beautiful approach to approximating the polar decomposition that seems to be a good fit for the needs of Muon, and it serves as a good starting point for engineering-based improvements to Muon and related algorithms.

How to Cite

@misc{chen2026polarexpressviualized,
    author       = {Tyler Chen},
    title        = {The Polar Express for Muon, Visualized},
    year         = {2026},
    url          = {https://research.chen.pw/PolarExpress}
}
Footnotes
  1. Technically speaking, what we are calling the polar decomposition is actually the unitary factor of the polar decomposition. We call it the polar decomposition for simplicity.

References
  1. Jordan, K., Jin, Y., Boza, V., You, J., Cesista, F., Newhouse, L., & Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks. https://kellerjordan.github.io/posts/muon/
  2. Amsel, N., Persson, D., Musco, C., & Gower, R. M. (2026). The Polar Express: Optimal Matrix Sign Methods and their Application to the Muon Algorithm. In The Fourteenth International Conference on Learning Representations. https://arxiv.org/abs/2505.16932
  3. Cesista, F. L., Jiacheng, Y., & Jordan, K. (2025). Squeezing 1-2% Efficiency Gains Out of Muon by Optimizing the Newton-Schulz Coefficients. https://leloykun.github.io/ponder/muon-opt-coeffs/