
This code is written for modern PyTorch (version 2.7 or newer) using DTensor-based parallelism. This includes FSDP2 with fully_shard and tensor parallelism (TP) with parallelize_module. Support for other distributed training APIs is not guaranteed.
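
For orientation, here is a minimal sketch (not taken from the repository; the mesh sizes, toy model, and layer names are placeholders) of how fully_shard and parallelize_module compose on a 2D device mesh when launched with torchrun:

```python
import os
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Run under torchrun with 8 processes; pin each rank to its local GPU.
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Hypothetical 2D mesh: 4-way FSDP sharding x 2-way tensor parallelism.
mesh = init_device_mesh("cuda", (4, 2), mesh_dim_names=("dp_shard", "tp"))

# Toy two-layer MLP standing in for a transformer block.
model = nn.Sequential(
    nn.Linear(1024, 4096, bias=False),
    nn.Linear(4096, 1024, bias=False),
).cuda()

# Tensor parallelism: shard the first linear column-wise, the second row-wise.
parallelize_module(
    model,
    mesh["tp"],
    {"0": ColwiseParallel(), "1": RowwiseParallel()},
)

# FSDP2: shard the (now DTensor) parameters across the data-parallel dimension.
fully_shard(model, mesh=mesh["dp_shard"])
```

Applying the TP plan first and then fully_shard is the usual composition order, since FSDP2 can shard parameters that are already DTensors.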

To enable more advanced distributed strategies such as Fully Sharded Data Parallel (FSDP) and Tensor Parallelism (TP), you can specify the configuration in the dion_160m.yaml file.
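
The snippet below is a hypothetical sketch rather than a verbatim excerpt from dion_160m.yaml; the section and key names (parallelism, dp_replicate, dp_shard, tp) are assumptions chosen to illustrate the three parallelism degrees:

```yaml
# Hypothetical sketch; actual key names in dion_160m.yaml may differ.
parallelism:
  dp_replicate: 1   # replicated data parallelism (1 disables this dimension)
  dp_shard: 4       # FSDP sharding degree
  tp: 2             # tensor parallelism degree
```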

Alternatively, you can override these values directly from the command line. All three values must be explicitly given, but a size may be set to 1 to omit a parallelism dimension.
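
A hypothetical launch command under the same assumptions (the script name train.py and the flag names mirror the sketched config keys above and are not taken from the repository):

```bash
# Hypothetical: 8 GPUs split as 1-way replication x 4-way FSDP sharding x 2-way TP.
torchrun --nproc_per_node=8 train.py --config dion_160m.yaml \
    --parallelism.dp_replicate 1 --parallelism.dp_shard 4 --parallelism.tp 2
```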

Optimization algorithms are essential to training neural networks: they convert gradients into model weight updates that minimize the loss. For many years, the state-of-the-art method has been Adam/AdamW. However, recent work has shown that orthonormalized matrix updates can significantly accelerate model convergence. See Bernstein and Newhouse (2025) for a theoretical justification.
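
Concretely, an orthonormalized update replaces a gradient matrix G = U Σ Vᵀ with the semi-orthogonal factor U Vᵀ, flattening every singular value to 1 so that all update directions receive an equally sized step. A minimal sketch of this idealized map (the function name is illustrative):

```python
import torch

def orthonormalize_exact(G: torch.Tensor) -> torch.Tensor:
    """Map a gradient matrix G = U S V^T onto the semi-orthogonal matrix U V^T."""
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ Vh
```

Computing an exact SVD at every step is too expensive at scale, which is why Muon approximates this map with Newton-Schulz iterations, as described next.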

The practical effectiveness of orthonormal updates was first demonstrated by Muon in the NanoGPT speedrun, and has since been validated at scale by models such as Kimi K2 and GLM-4.5. Muon implements orthonormalization via Newton-Schulz iterations, which rely on repeated matrix-matrix multiplications. However, large-scale training relies on model sharding, where weight matrices and optimizer states are distributed across multiple processes. As discussed by Essential AI, orthonormalizing a sharded matrix with Newton-Schulz iterations involves the communication-intensive procedure of reconstructing the full matrices from their individual shards.
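
For reference, here is a minimal single-device sketch of Newton-Schulz orthonormalization in the Muon style; the quintic coefficients and step count follow commonly published values and should be read as assumptions, not as a reproduction of any particular codebase:

```python
import torch

def newton_schulz_orthonormalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to a semi-orthogonal matrix using only matmuls."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients (assumed)
    X = G.to(torch.bfloat16)
    X = X / (X.norm() + 1e-7)           # scale so the spectral norm is <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # work with the wide orientation
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)
```

Each iteration is nothing but matrix-matrix multiplications, which is cheap on one device but, as noted above, becomes communication-heavy once X is sharded, since forming X Xᵀ requires reconstructing the full matrix from its shards.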
