
Global-batch load balance: an almost free lunch to improve your MoE LLM training


The Mixture-of-Experts (MoE) architecture has become a popular technique for scaling up model parameters. Typically, one MoE layer consists of a router (often parameterized as a single linear layer) and a group of experts (for transformer-based models, each expert is a feedforward layer). Given an input, only a subset of the experts is activated, and their outputs are aggregated according to the scores the router assigns; a standard formulation is sketched below.
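As a sketch in standard notation (not necessarily the exact formulation used here): let $x$ be an input token, $g(x) = \operatorname{softmax}(W_r x)$ the router scores, and $E_1, \dots, E_N$ the experts. With top-$k$ routing, the layer output is

$$
y = \sum_{i \in \mathcal{T}} g_i(x)\, E_i(x), \qquad \mathcal{T} = \operatorname{TopK}\big(g(x), k\big),
$$

where some implementations additionally renormalize the selected scores $g_i(x)$ so that they sum to one.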

Load-balancing loss is an essential regularization technique for training MoE-based networks; its high-level intuition is to encourage balanced activation of all experts. In its standard form, it can be calculated as

$$
L_{\text{balance}} = N \sum_{i=1}^{N} f_i \, p_i,
$$

where $N$ is the number of experts, $f_i$ is the activation frequency of expert $E_i$ (the fraction of tokens routed to it), and $p_i$ is the average gating score assigned to expert $E_i$.

However, most existing MoE training frameworks (e.g., Megatron-core) implement micro-batch-level balance, which means $L_{\text{balance}}$ is calculated within every micro-batch and then averaged over the global batch.
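To make the distinction concrete, here is a minimal PyTorch sketch (not Megatron-core's actual code; the expert count, batch shapes, and helper names are made up for illustration) that computes the loss both ways from the same set of micro-batches:

```python
# Minimal sketch: micro-batch-level vs. global-batch-level load-balancing loss.
# Assumptions: top-k routing with softmax gating; all names here are illustrative.
import torch
import torch.nn.functional as F

def lbl_from_stats(freq, prob, num_experts):
    # L_balance = N * sum_i f_i * p_i
    return num_experts * torch.sum(freq * prob)

def micro_batch_stats(router_logits, top_k):
    # router_logits: [num_tokens, num_experts]
    num_tokens, num_experts = router_logits.shape
    gates = F.softmax(router_logits, dim=-1)                      # gating scores
    _, top_idx = torch.topk(gates, top_k, dim=-1)                 # activated experts per token
    one_hot = F.one_hot(top_idx, num_classes=num_experts).sum(dim=1)  # [num_tokens, num_experts]
    freq = one_hot.float().mean(dim=0)   # f_i: fraction of tokens routed to expert i
    prob = gates.mean(dim=0)             # p_i: average gating score of expert i
    return freq, prob

torch.manual_seed(0)
num_experts, top_k = 8, 2
micro_batches = [torch.randn(16, num_experts) for _ in range(4)]  # router logits of 4 micro-batches

# Micro-batch-level LBL: compute the loss inside each micro-batch, then average.
micro_losses = []
for logits in micro_batches:
    f, p = micro_batch_stats(logits, top_k)
    micro_losses.append(lbl_from_stats(f, p, num_experts))
lbl_micro = torch.stack(micro_losses).mean()

# Global-batch-level LBL: aggregate f_i and p_i over all micro-batches
# (in a distributed run this would be an all-reduce across data-parallel ranks),
# then compute a single loss from the global statistics.
stats = [micro_batch_stats(logits, top_k) for logits in micro_batches]
f_global = torch.stack([f for f, _ in stats]).mean(dim=0)
p_global = torch.stack([p for _, p in stats]).mean(dim=0)
lbl_global = lbl_from_stats(f_global, p_global, num_experts)

print(f"micro-batch LBL:  {lbl_micro.item():.4f}")
print(f"global-batch LBL: {lbl_global.item():.4f}")
```

The two quantities differ whenever the per-micro-batch routing statistics differ from the global ones, which is exactly the situation this post is concerned with.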
