Scaling laws of optimization


Scaling laws have been one of the key achievements of theoretical analysis in various fields of applied mathematics and computer science, answering the following key question: how fast does my method or algorithm converge as a function of (potentially only partially) observable problem parameters?

For supervised machine learning and statistics, probably the simplest and oldest of all such scaling laws is the classical rate \(\sigma^2 d / n\) for ordinary least-squares (where \(d\) is the number of features, \(n\) the number of observations, and \(\sigma^2\) the noise variance). This has been extended over the years in the statistics community to various non-linear function estimation problems, with excess risks for regression essentially proportional to \(n^{-s/d}\), where \(s\) is now the number of derivatives of the optimal prediction function (see, e.g., these two nice books [1, 2]). Such scaling laws also have a long history in the part of statistical physics that looks into high-dimensional statistical estimation problems (see, e.g., [3]), as well as in random matrix theory (see, e.g., [4, 28]). Note that a particular feature of a theoretical scaling law is that it should be an asymptotic equivalent rather than an upper bound, which is the traditional setup in learning theory (and for good reasons, see, e.g., my soon-to-be-published book [5]).
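To make the first of these rates concrete, here is a minimal simulation sketch (my own illustration, not taken from the references): for ordinary least-squares with an isotropic Gaussian design, the measured excess risk tracks \(\sigma^2 d / n\) as \(n\) grows. The dimension, noise level, and sample sizes below are arbitrary illustrative choices.

```python
# Minimal sketch (illustrative assumptions): check the sigma^2 d / n scaling
# for ordinary least-squares on synthetic data with isotropic Gaussian design.
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 20, 1.0
theta_star = rng.standard_normal(d)          # ground-truth parameter

for n in [200, 800, 3200, 12800]:
    X = rng.standard_normal((n, d))          # isotropic Gaussian design
    y = X @ theta_star + sigma * rng.standard_normal(n)
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # With an isotropic design, the population excess risk equals
    # the squared distance between the estimator and the truth.
    excess_risk = np.sum((theta_hat - theta_star) ** 2)
    print(f"n={n:6d}  excess risk={excess_risk:.4f}  sigma^2 d/n={sigma**2 * d / n:.4f}")
```

Running this, the two printed columns shrink together at roughly the same rate, which is the asymptotic-equivalent behavior described above.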

While there is a long history of scaling laws in theoretical circles, empirical variants have recently emerged for large-scale neural networks [26, 27], where new quantities (such as the amount of computing power available) have appeared. Here, the algorithm is run with several values of the problem parameters, and a parameterized scaling law is fit to the empirical performance, with the hope of predicting performance outcomes in future experiments. See a comprehensive summary here.
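As a hedged sketch of this empirical recipe (with made-up numbers, not data from the cited papers), one can fit a parameterized power law of the form \(a\, n^{-b} + c\) to a handful of measured losses and use it to extrapolate; the measurements, the functional form, and the initial guess below are all illustrative assumptions.

```python
# Sketch of fitting a parameterized scaling law to hypothetical measurements:
# L(n) = a * n^{-b} + c, fitted by non-linear least squares, then extrapolated.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    return a * n ** (-b) + c

# Hypothetical measurements: test losses observed at increasing problem sizes.
n_obs = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
loss_obs = np.array([0.91, 0.62, 0.43, 0.31, 0.24])

params, _ = curve_fit(scaling_law, n_obs, loss_obs, p0=[10.0, 0.5, 0.1])
a, b, c = params
print(f"fitted law: L(n) = {a:.2f} * n^(-{b:.2f}) + {c:.2f}")
print(f"predicted loss at n=1e6: {scaling_law(1e6, a, b, c):.3f}")
```

The predictive value of such a fit rests entirely on the chosen functional form remaining valid beyond the range of the observed runs, which is precisely what the empirical scaling-law literature tries to assess.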
