Reproducing and Validating Distributed Muon: A Practical Verification of Communication Efficiency Claims
A deep dive into verifying MoonShot AI's distributed optimizer claims, complete with bug fixes, profiling analysis, and practical insights for GPU clusters.
By Jen Wei
Code: Muon Distributed Reproducibility | Previous tutorials: Muon Intro, Distributed Muon Basics
TL;DR
I spent weeks reproducing MoonShot AI's Distributed Muon optimizer and validated their paper's claims on a 4-GPU cluster:
- ✅ Communication: 0.57× of AdamW (better than the paper's 1.25× upper bound!)
- ✅ Optimizer overhead: 1.1% (within the paper's 1–3% claim)
- ✅ Memory: 50% of AdamW (1 buffer vs 2)
- 🐛 Found and fixed 2 critical bugs in the reference implementation
- Muon is 15% slower per step but saves 43% on communication
The verdict: MoonShot's implementation is production-quality. If Muon converges in 20–30% fewer steps (as the paper claims), you'll save wall-clock time on multi-node training.
1. Introduction
Distributed training is hard. Really hard. When MoonShot AI released their paper "Muon is Scalable for LLM Training", they made bold claims about communication efficiency and optimizer overhead. As someone who's been learning distributed systems the hard way, I wanted to verify these claims independently.
Why this matters: If you're training large models across multiple nodes, communication becomes your bottleneck. AdamW requires expensive all-reduce operations to synchronize gradients. Muon promises better communication efficiency through bf16 compression and a clever combination of data parallelism (DP) and tensor parallelism (TP).
What I set out to verify:
- Is communication really just 1–1.25× of AdamW?
- Is optimizer overhead truly negligible (1–3%)?
- Does the implementation actually work correctly?
Spoiler: The answers are yes, yes, and mostly yes (after fixing 2 bugs).
For background on Muon's optimizer mechanics, see my introductory tutorial. For distributed Muon basics, see my distributed Muon guide. This report focuses on performance validation and practical insights.
2. What Does the Paper Actually Claim?
Let me translate the paper's claims into plain English, because some of this is written in "if you know, you know" style.
Claim 1: Communication Overhead
Paper says:
The communication workload of Distributed Muon is (1, 1.25] of that of Distributed AdamW. The upper-bound is calculated as that the communication of Distributed Muon is 4 (fp32 G reduce-scatter) + 2 (bf16 Muon gather) + 4 (fp32 P all-gather), while Distributed AdamW is 4 + 4.
What this means in practice:
AdamW's communication pattern (per training step):
- Step 1: Reduce-scatter gradients (fp32, 4 bytes/value) → each GPU gets a slice
- Step 2: All-gather updated parameters (fp32, 4 bytes/value) → broadcast to all GPUs
- Total: 8 communication units
Muon's communication pattern:
- Step 1: Reduce-scatter gradients (fp32, 4 bytes/value) → same as AdamW
- Step 2: All-gather for Newton-Schulz (bf16, 2 bytes/value) → reconstruct full gradient matrix
- Step 3: All-gather updated parameters (fp32, 4 bytes/value) → same as AdamW
- Total: 10 communication units
The math: 10/8 = 1.25× more communication… in theory.
But note: that total of 10 already counts the Newton-Schulz gather at bf16's 2 bytes/value, so 1.25× is a worst-case bound. In practice, with a smart topology (DP=2, TP=2), a few large all-gather operations can use the links more efficiently than many small all-reduces, which is how the measured ratio can land below the bound.
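To make the bound concrete, here is a tiny back-of-the-envelope sketch (my own illustration, not code from the reference implementation) that just counts bytes per parameter element:

```python
# Bytes moved per parameter element, following the paper's accounting.
BYTES_FP32, BYTES_BF16 = 4, 2

adamw = BYTES_FP32 + BYTES_FP32               # fp32 reduce-scatter + fp32 all-gather
muon = BYTES_FP32 + BYTES_BF16 + BYTES_FP32   # + bf16 gather for Newton-Schulz

print(muon / adamw)  # 1.25 -> the paper's worst-case upper bound
```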
Why it matters: Communication is often the slowest part of distributed training. Network bandwidth is expensive. If Muon's communication overhead is too high, faster convergence won't help.
Claim 2: Optimizer Overhead is Negligible
Paper says:
The end-to-end latency caused by the optimizer is negligible compared to the model's forward-backward pass time (e.g. usually 1% to 3%).
What this means: In a typical training step:
Total time = Forward pass + Backward pass + Optimizer step + Communication
|<------------ 95-99% ------------>|   |<---- 1-3% ---->|
Muon's Newton-Schulz iteration (5 steps of matrix multiplication) is more expensive than AdamW's simple element-wise operations. But if it's only 1–3% of total time, who cares?
Why it matters: If the optimizer takes 20% of your training time, you have a problem. If it's 1%, optimize something else.
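For reference, here is a minimal sketch of the quintic Newton-Schulz iteration Muon uses to approximately orthogonalize each gradient matrix. The coefficients are the widely circulated reference values; the function name and defaults are illustrative rather than the exact code from the repo.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix G via a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    a, b, c = 3.4445, -4.7750, 2.0315      # reference quintic coefficients
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)              # scale so the spectral norm is <= 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X
```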
Claim 3: Memory is Half of AdamW
Paper says:
Muon uses only one momentum buffer, while AdamW uses two momentum buffers. Therefore, the additional memory used by the Muon optimizer is half of Distributed AdamW.
What this means: AdamW state per parameter:
state['exp_avg'] # First moment (momentum)
state['exp_avg_sq'] # Second moment (RMSprop)
# Total: 2× parameter memory
Muon state per parameter:
state['muon_buffer'] # Just momentum
# Total: 1× parameter memory
Simple math: 1/2 = 0.5 = 50% ✅
This one is straightforward: just counting dictionary entries. I'll verify it, but it's not the interesting part.
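If you want to verify this on your own optimizers, a small helper like the one below does the counting. This is my own sketch; `adamw_optimizer` and `muon_optimizer` in the usage comment are placeholders for whatever instances you have built.

```python
import torch

def optimizer_state_bytes(optimizer: torch.optim.Optimizer) -> int:
    """Sum the bytes held in an optimizer's per-parameter state tensors."""
    total = 0
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                total += value.numel() * value.element_size()
    return total

# After at least one .step() has populated the state:
# print(optimizer_state_bytes(muon_optimizer) / optimizer_state_bytes(adamw_optimizer))
# -> roughly 0.5
```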
3. The Journey: Bugs, Crashes, and Jupyter Nightmares
Before I could validate anything, I had to make the code actually work. This turned into an adventure.
Bug #1: The Missing Parameter List
In the second Muon update loop, there was a critical bug:
for group in self.param_groups:
    if not group.get('use_muon', False):
        continue
    lr = group["lr"]
    # ... other hyperparameters ...

    # BUG: `params` was never defined!
    for p in params:  # <-- NameError!
        ns_input = ns_inputs[p]
        # ...
The fix:
params = group["params"]  # <-- Add this line
This was causing silent failures in certain configurations. How did this slip through? My guess: the reference implementation was tested primarily with single parameter groups, where the variable might have been in scope from elsewhere.
Bug #2: The Shape Mismatch
After distributed communication, the code was reshaping gradients incorrectly:
# After all_gather, we need to reshape back to original parameter shape
unpacked_data = ns_input_global_buffer[global_range[0]:global_range[1]]
# BUG: This creates a 1D tensor!
ns_inputs[p] = unpacked_data # Shape: [N]
# Later, zeropower_via_newtonschulz5 expects 2D:
update = zeropower_via_newtonschulz5(ns_input, steps=ns_steps)
# -> AssertionError: requires 2D tensor, got 1D
The fix:
# Reshape to original 2D shape
ns_inputs[p] = unpacked_data.view(dist_meta.shape) # Shape: [H, W]
This bug only manifested with certain DP/TP configurations where tensor splitting created non-contiguous memory layouts.
The Colab/Jupyter Multiprocessing Nightmare
The biggest challenge wasn't the bugs: it was getting torch.multiprocessing.spawn() to work in Jupyter notebooks.
The problem: Spawned processes would crash silently with no error messages:
W1106 03:17:50.173000 torch/multiprocessing/spawn.py:169]
Terminating process 16574 via signal SIGTERM
ProcessExitedException: process 2 terminated with exit code 1
No stack trace. No logs. Just death.
Why? Jupyter notebooks have their own complex multiprocessing setup that conflicts with PyTorch's spawning mechanism. The processes die before they can even write to stderr.
The solution: Write code to a .py file and run it as a subprocess:
# In notebook: write code to file
with open('test_muon_dist.py', 'w') as f:
    f.write(test_code)
# Run it
!python test_muon_dist.py
This gives the code a "clean" Python environment without Jupyter's kernel state. It's a well-known workaround in the PyTorch distributed community, but it took me days to figure out.
Key lesson: If you're doing distributed training development, use .py files, not notebooks. Notebooks are great for exploration, terrible for multiprocessing.
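For completeness, here is the kind of standalone skeleton I ended up writing to the `.py` file. It's a hedged sketch: the worker body, world size, and port are illustrative, but the `if __name__ == "__main__"` guard around `mp.spawn` is the part Jupyter quietly breaks.

```python
# test_muon_dist.py -- illustrative skeleton, not the exact benchmark script
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    if backend == "nccl":
        torch.cuda.set_device(rank)
    # ... build params/optimizers and run the benchmark steps here ...
    dist.destroy_process_group()

if __name__ == "__main__":          # spawn needs this guard outside notebooks
    world_size = 4
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```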
For more details on environment setup (CPU/GPU, gloo/nccl, etc.), see the Quick Start section of my distributed tutorial.
4. Experimental Setup
Hardware:
- 4Ă NVIDIA RTX A4000 (16GB each)
- Connectivity: PCIe Gen 4 x4 via Oculink (~8 GB/s)
Note: This bandwidth-constrained setup acts as a rigorous stress test for communication efficiency, simulating the bottlenecks often seen in multi-node training.
- Single machine, no NVLink
- Cluster generously provided by [Mahdi Chaker](https://github.com/mchaker)
Model Configuration:
- 5 parameter tensors totaling ~18M parameters
- Shapes: (4096, 4096), (1024, 324), (456, 1024), (676, 876), (128, 128)
- Representative of LLM layer sizes
Parallelism Strategies Tested:
- DP=2, TP=2 (hybrid: 2 model replicas, each split across 2 GPUs)
- DP=1, TP=4 (pure tensor parallelism)
- DP=4, TP=1 (pure data parallelism)
Profiling Methodology:
- PyTorch Profiler with Perfetto trace viewer
- Separate benchmarks for optimizer-only and full training step
- 5 optimization steps per benchmark (sufficient for stable measurements)
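To give a flavor of the methodology, here is a hedged sketch of how communication totals can be pulled out of a profiler run and exported for Perfetto. The placeholder workload and the event-name filter are my own assumptions (collective kernel names vary across PyTorch/NCCL versions); swap in the actual Muon or AdamW step being benchmarked.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload: substitute the real distributed optimizer step here.
params = [torch.randn(1024, 1024, device="cuda", requires_grad=True)]
optimizer = torch.optim.AdamW(params, lr=1e-3)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(5):                       # 5 optimization steps per benchmark
        params[0].grad = torch.randn_like(params[0])
        optimizer.step()
        torch.cuda.synchronize()

prof.export_chrome_trace("step_trace.json")  # open this file in Perfetto

# Rough communication total: sum device time of collective-looking kernels.
comm_keywords = ("nccl", "all_gather", "reduce_scatter", "all_reduce")
comm_us = sum(
    evt.cuda_time_total
    for evt in prof.key_averages()
    if any(k in evt.key.lower() for k in comm_keywords)
)
print(f"communication time: {comm_us / 1000:.2f} ms")
```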
Forward-Backward Simulation:
Since we're benchmarking optimizers, not training a real model, I simulated forward-backward passes:
def simulate_fwd_bwd():
    """Simulate model compute with matrix multiplications"""
    dummy = torch.randn(2048, 2048, device='cuda')
    for _ in range(20):  # 20 iterations chosen empirically
        dummy = torch.matmul(dummy, dummy)
    torch.cuda.synchronize()
Why 20 iterations? This produces ~50ms of GPU compute per step, roughly matching the optimizer time we're measuring. The specific number is arbitrary, but critically: both AdamW and Muon use the same simulation, so relative comparisons are fair.
Different iteration counts would change the absolute percentages (e.g., 1.1% optimizer overhead) but not the relative insights. Think of it like calibrating a scale: the zero point changes, but weight differences remain accurate.
5. Results: The Numbers Don't Lie
Here's what I found after running the benchmarks. All measurements are from rank 0 on the 4-GPU cluster.
5.1 Communication Efficiency: Better Than Advertised
First, let's look at the raw communication times:
| Metric | Muon | AdamW | Ratio (Muon/AdamW) |
|---|---|---|---|
| Total comm time | 139.53ms | 245.84ms | 0.57× ✅ |
| Number of calls | 40 | 110 | 0.36× |
| Avg per call | 3.49ms | 2.23ms | 1.57× |
Wait, 0.57×? That's BETTER than the paper's upper bound of 1.25×!
What's happening here?
- bf16 compression works: Muon's Newton-Schulz all-gather uses 16-bit instead of 32-bit, halving bandwidth
- Fewer synchronization points: Muon does 40 communication calls vs AdamW's 110
- Efficient all-gather topology: With DP=2/TP=2, NCCL can optimize the all-gather pattern better than many small all-reduces
- Single-machine advantage: our 4 GPUs sit on one host's PCIe fabric (no network hop), which favors large batched operations
Here are the Perfetto trace visualizations showing the communication patterns:
Muon's communication: Fewer, larger all-gather operations (pink blocks)
AdamW's communication: Many small all-reduce operations (teal blocks)
The single-machine caveat:
Our test uses GPUs in one machine with a fast interconnect. On multi-node setups over a network (InfiniBand, RoCE), communication would be 3–5× slower. But here's the key insight: the ratio stays the same.
If we inflate both by 3× for network overhead:
- AdamW: 245.84ms × 3 = 737.52ms
- Muon: 139.53ms × 3 = 418.59ms
- Ratio: still 0.57×
In fact, Muon's advantage becomes MORE valuable when network is slow, because communication dominates total time.
5.2 Optimizer Overhead: Negligible as Promised
Full training step breakdown:
| Phase | Muon | AdamW | Ratio |
|---|---|---|---|
| Forward-backward | 271ms | 281ms | 0.96× |
| Communication | 139.53ms | 245.84ms | 0.57× |
| Optimizer Compute | 4.29ms | 1.46ms | 2.94× |
| Total per step | 388.86ms | 338.51ms | 1.15× |
Optimizer as % of total:
- Muon: 4.29ms / 388.86ms = 1.1% ✅
- AdamW: 1.46ms / 338.51ms = 0.4% ✅
Both are well within the paper's claimed 1–3% range. Muon's Newton-Schulz iterations are 2.9× slower than AdamW's element-wise operations, but it simply doesn't matter: it's only 1.1% of training time.
Visual comparison of full training steps:
Muon full step: Optimizer (red) is tiny compared to FWD_BWD (green) and communication (pink)
AdamW full step: Optimizer (purple) is even tinier, but communication (teal) dominates
5.3 Memory Usage: Exactly Half
As expected, this one is straightforward:
# AdamW state
>>> list(adamw_optimizer.state[params[0]].keys())
['exp_avg', 'exp_avg_sq', 'step']
# Muon state
>>> list(muon_optimizer.state[params[0]].keys())
['muon_buffer']
Memory for optimizer state:
- AdamW: 2× parameter memory (two buffers)
- Muon: 1× parameter memory (one buffer)
- Reduction: 50% ✅
For a model with 7B parameters in fp32 (28GB), AdamW's optimizer state adds 56GB. Muon adds only 28GB. That's an extra 28GB you can use for larger batch sizes or longer sequences.
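The arithmetic behind those numbers, as a quick sanity check (pure back-of-the-envelope, fp32 optimizer states assumed):

```python
params = 7e9            # 7B parameters
bytes_fp32 = 4

weights   = params * bytes_fp32 / 1e9        # ~28 GB of fp32 weights
adamw_opt = 2 * params * bytes_fp32 / 1e9    # exp_avg + exp_avg_sq -> ~56 GB
muon_opt  = 1 * params * bytes_fp32 / 1e9    # momentum only        -> ~28 GB

print(weights, adamw_opt, muon_opt)          # 28.0 56.0 28.0
```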
5.4 The Bottom Line: Per-Step Performance
Muon is 15% slower per step (388.86ms vs 338.51ms). But context matters:
If Muon converges in fewer steps (paper claims a 20–30% reduction):
- AdamW: 1000 steps × 338.51ms = 338.5 seconds
- Muon: 750 steps × 388.86ms = 291.6 seconds
- Muon wins by 14%
The trade-off:
- ✅ 43% less communication (scales better to multi-node)
- ✅ 50% less memory (enables larger models/batches)
- ✅ Better convergence (per paper's experiments)
- ❌ 15% slower per step (acceptable if fewer steps needed)
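One quick sanity check on that trade-off, using the measured per-step times (my arithmetic, not a claim from the paper):

```python
muon_ms, adamw_ms = 388.86, 338.51

# Muon wins on wall-clock once it needs fewer than this fraction of AdamW's steps:
break_even = adamw_ms / muon_ms
print(f"break-even at {break_even:.0%} of AdamW's steps")   # ~87%
```

In other words, Muon only has to save a bit more than 13% of the steps to come out ahead, comfortably inside the paper's claimed 20–30% reduction.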
6. What About Async Overlapping?
The paper mentions:
Several engineering techniques, such as overlapping gather and computation, and overlapping optimizer reduce-scatter with parameter gather, can further reduce latency.
I tried a naive experiment: adding async_op=True to all communication calls.
Spoiler: It made everything slower.
| Optimizer | Original Comm | Async Comm | Verdict |
|---|---|---|---|
| Muon | 139.53ms | 294.09ms | 2.1× slower! 🔴 |
| AdamW | 245.84ms | 377.44ms | 1.5× slower! 🔴 |
Why?
Naive async doesn't help because:
# What I did (doesn't work):
handle = dist.all_gather(..., async_op=True)
# ... no independent work here ...
handle.wait()  # <-- Blocks immediately!
Real overlapping requires:
- Multiple parameter groups to pipeline
- Independent work that doesn't depend on communication results
- Careful stream management
- Reordering operations to maximize overlap
The paper's overlapping claims are about future optimization potential, not what the current implementation does automatically. This is deep systems engineering work.
Lesson learned: Don't add async_op=True unless you have actual compute to overlap. NCCL already optimizes synchronous operations efficiently.
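For intuition, here is a hedged sketch of the kind of pipelining that would create real overlap: launch the gather for the next parameter bucket before running Newton-Schulz on the current one. This is my own illustration of the pattern (the `bucket` fields and the `run_newton_schulz` callback are hypothetical), not what the reference implementation does today.

```python
import torch.distributed as dist

def step_with_overlap(buckets, run_newton_schulz):
    """Pipeline gathers against compute. Each bucket carries a `local_shard`
    tensor and a pre-allocated `gather_buffer` (hypothetical names)."""
    handles = [None] * len(buckets)

    # Prefetch the first gather.
    handles[0] = dist.all_gather_into_tensor(
        buckets[0].gather_buffer, buckets[0].local_shard, async_op=True)

    for i, bucket in enumerate(buckets):
        # Kick off the next bucket's gather before computing on this one,
        # so the collective runs while Newton-Schulz keeps the GPU busy.
        if i + 1 < len(buckets):
            handles[i + 1] = dist.all_gather_into_tensor(
                buckets[i + 1].gather_buffer, buckets[i + 1].local_shard,
                async_op=True)
        handles[i].wait()
        run_newton_schulz(bucket.gather_buffer)
```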
7. Single-Machine vs Multi-Node: Scaling Analysis
Our 4 GPUs in one machine are an optimistic setting. Let's project to realistic multi-node setups.
Assumptions:
- Single-machine bandwidth: ~500 GB/s (NVLink/PCIe)
- Multi-node bandwidth: ~100 GB/s (InfiniBand)
- Network overhead factor: ~3–5× slower
Conservative projection assuming a 3× network slowdown:
| Phase | Muon (Multi-Node) | AdamW (Multi-Node) |
|---|---|---|
| Forward-backward | 271ms | 281ms |
| Communication | 418.59ms | 737.52ms |
| Optimizer Compute | 4.29ms | 1.46ms |
| Total Time | 694ms | 1020ms |
Muon would be 694/1020 = 0.68× (32% faster) per step on multi-node!
Why? Because when communication dominates (multi-node), Muon's 57% communication savings becomes the deciding factor.
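The projection itself is simple arithmetic over the measured numbers: scale only the communication term by the assumed 3× network penalty.

```python
fwd_bwd = {"muon": 271.0, "adamw": 281.0}    # ms, measured
comm    = {"muon": 139.53, "adamw": 245.84}  # ms, measured
opt     = {"muon": 4.29,  "adamw": 1.46}     # ms, measured
network_penalty = 3                          # assumed multi-node slowdown

for name in ("muon", "adamw"):
    total = fwd_bwd[name] + network_penalty * comm[name] + opt[name]
    print(name, round(total), "ms")          # muon ~694 ms, adamw ~1020 ms
```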
Scaling to 32 GPUs across 4 nodes:
The communication advantage compounds:
- More GPUs → more all-reduce hops for AdamW
- Muon's all-gather with bf16 scales better
- 50% memory savings enables larger batch sizes per GPU
This is where Muon's design really shines. The MoonShot team optimized for large-scale training, not single-machine benchmarks.
8. Practical Takeaways
After weeks of debugging, profiling, and analysis, here's what I learned:
When to Use Muon (The "Hybrid" Reality)
Since Muon's update is defined on 2D weight matrices, you rarely use only Muon. In practice you use a hybrid (Muon + AdamW) approach, as sketched after the lists below.
✅ The Sweet Spot (Use Distributed Muon):
- Target: Large-scale Pre-training from scratch.
- Architecture: Transformers or ConvNets (models dominated by 2D/4D Matrix weights).
- Setup: Multi-node clusters where inter-node communication (gradient reduction) is the bottleneck. (Our traces showed Muon cutting communication time by ~40%!)
- Batch Size: Large global batch sizes (Muon thrives on stable statistics).
❌ Stick with (or fall back to) AdamW:
- 1D and embedding parameters: Embeddings, LayerNorm gains, and Biases. (These stay on AdamW, since they aren't treated as matrices to orthogonalize.)
- Fine-Tuning: If you are doing PEFT (LoRA), the trainable parameters are already small, so Muon's overhead isn't worth it.
- Small Scale: If your model fits on one GPU, the complexity of distributed orchestration outweighs the benefits.
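As a concrete illustration of the hybrid split, here is a sketch of how parameters might be routed. The `use_muon` flag mirrors the group key seen in the bug-fix snippet earlier; the learning rates, the exclusion of embeddings and the output head, and the function name are my own assumptions, not settings from the paper.

```python
import torch.nn as nn

def build_param_groups(model: nn.Module):
    """Route 2D matrix weights to Muon; 1D params (and embeddings) to AdamW."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_matrix = p.ndim >= 2 and "embed" not in name and "lm_head" not in name
        (muon_params if is_matrix else adamw_params).append(p)
    return [
        {"params": muon_params, "use_muon": True, "lr": 2e-2, "momentum": 0.95},
        {"params": adamw_params, "use_muon": False, "lr": 3e-4,
         "betas": (0.9, 0.95), "weight_decay": 0.1},
    ]
```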
The Real-World Implications: Memory & Speed
The Memory Trade-off: Static vs. Peak
For a 7B model on a 32-GPU cluster (ZeRO-1), let's look at a single GPU:
Static Memory (Storage): Muon wins.
- AdamW: Needs 2 states (`exp_avg`, `exp_avg_sq`).
- Muon: Needs 1 state (`momentum`).
Result: You save 50% of the optimizer-state memory (about 28GB for a 7B fp32 model, or under 1GB per GPU once sharded across 32 ZeRO-1 ranks). This frees up VRAM for larger local batch sizes or longer context windows.
Peak Memory (During Step): AdamW wins.
- AdamW: Updates are element-wise. No need to gather the full parameter.
- Muon: Must `all_gather` the full matrix to run Newton-Schulz.
Result: Muon has a higher transient memory spike during `step()`.
The Throughput Win (The "Mic Drop" 🎤)
The real killer feature isn't just memory; it's Communication Efficiency.
As our profiling showed, Distributed Muon spent roughly half the communication time of AdamW in our runs (0.57×), helped by bfloat16 gathers versus float32 reductions.
- AdamW: Bottlenecked by heavy `all_reduce` traffic across the network.
- Muon: Spends more time on compute (Newton-Schulz) and less on waiting for data.
Verdict: You pay a little optimizer latency per step (1–3%) and a higher peak memory cost to gain massive static memory savings and significantly lower network congestion. For LLM pre-training, that is a winning trade.
What MoonShot Got Right
- bf16 compression: Simple but effective; it halves the bandwidth of the extra Newton-Schulz gather
- Hybrid DP/TP topology: Enables efficient all-gather patterns
- Minimalist state: One buffer vs two is a huge memory win
- Production-quality code: After fixing 2 bugs, it's rock-solid
The implementation is ready for real use. The bugs I found were edge cases that probably didn't show up in their testing configurations.
9. Limitations and Future Work
Limitations of this study:
- Single-machine 4-GPU setup (optimistic communication)
- Simulated forward-backward (not real model training)
- No convergence analysis (only per-step performance)
- Limited to DP/TP parallelism (no pipeline parallelism)
🔮 Where We Go From Here (The "Wishlist")
I've validated the math and the communication efficiency on 4 GPUs. But the "Distributed Nightmare" gets even more interesting at scale.
Here is the roadmap for Phase 3, should the compute gods smile upon me:
- Scale to 32+ GPUs: Validating if that beautiful `DP=2, TP=2` sweet spot holds up across multiple nodes.
- The "Real" Training Run: Moving beyond synthetic benchmarks to full convergence comparisons on a real LLM pre-training run.
- Gather and Compute Overlap (Nightmare 2.0): Implementing the gather-and-compute overlap logic to hide that last 1% of latency.
- Pipeline Parallelism: Integrating Muon into a 3D (DP + TP + PP) setup.
Call to Action: I have the optimized code and the experimental plan. I just need the runway. If you have a cluster that needs a stress test (and potentially a 2x faster optimizer), let's talk.
10. Conclusion
MoonShot AI's claims hold up. The Distributed Muon implementation is production-ready (after bug fixes), and their communication efficiency claims are validated, even exceeded, in our tests.
The numbers:
- ✅ Communication: 0.57× of AdamW (better than the 1.25× upper bound)
- ✅ Optimizer overhead: 1.1% (within the 1–3% claim)
- ✅ Memory: 50% of AdamW
- ❌ Per-step cost: 1.15× slower (acceptable trade-off)
My verdict: If you're training large models on multi-node clusters, Muon is worth trying. The 15% per-step overhead is manageable, and the communication savings will shine at scale.
For small-scale training (<8 GPUs), stick with AdamW. The complexity isn't worth it yet.
Acknowledgments
- Mahdi Chaker for generously providing GPU cluster access
- MoonShot AI team for open-sourcing their PoC implementation
Code and Reproducibility
GitHub/HuggingFace repo: muon-distributed-reproducibility
Previous tutorials:
- Muon Optimizer Introduction: Core mechanics and single-GPU usage
- Distributed Muon Basics: DP/TP setup and CPU testing
Related Readings:
- Part 1: The "Turtle Speed" Breakthrough: Decoding Distributed Optimizers
- Part 2: My Map of the Distributed Nightmare (The Blueprint)
- Part 3: The Final Bugs and "Aha!" Moments
Questions? Found more bugs? Open an issue on the repo or reach out on X/LinkedIn.
I build systems like this for a living. I'm currently looking for my next role as a Research Engineer focused on the full model lifecycle: from Architecture and Distributed Training to Post-Training Optimization.
I thrive at the intersection of Systems and Math. If your team needs someone who can bridge the gap between reading a paper and shipping it to production, let's chat.
Written with way too much caffeine and a borrowed GPU cluster.