Adaptive Optimizers#

Before the arrival of faster methods, combining Stochastic Gradient Descent with a learning rate schedule was considered close to state-of-the-art. Adaptive optimizers changed this by embedding the learning rate adjustment directly into the optimization process. Instead of relying on a separate scheduler, they use feedback from the model to adjust updates based on past gradients. The key advantage: they require little to no manual tuning.

Definition 75

An Adaptive Learning Rate is a technique that varies the learning rate using feedback from the model itself.

Since the optimizer itself handles the variation of the learning rate, the terms adaptive learning rate and adaptive optimizer are often used interchangeably.

Below are brief descriptions of the most popular adaptive optimizers.

AdaGrad#

This first adaptive algorithm was published in 2011. Compared to classical Gradient Descent, AdaGrad points more directly toward the global optimum by decaying the learning rate faster along the steepest dimensions.

../_images/optim_4_adagrad.png

Fig. 53. Gradient Descent (blue) vs AdaGrad (orange). Image: Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, Second Edition.

In the figure above, AdaGrad takes a more direct, thus shorter, path than Gradient Descent, because its learning rate is reduced in the direction of steepest descent. This is done by accumulating the square of each gradient component into a vector \(\boldsymbol{s}\). The steeper a gradient component, the larger the corresponding squared term \(s_j\). The weights are updated in almost the same way as with Gradient Descent, except that each component is divided by \(\sqrt{s_j + \epsilon}\). The added term \(\epsilon\) prevents a division by zero (typically 10\(^{-10}\)). As a result, the algorithm dynamically shrinks the learning rate on the large gradient components, adapting to their steep slope.

Mathematically:

\[\begin{split}\begin{align*} \boldsymbol{s} \; &\leftarrow \; \boldsymbol{s} \;+\; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \;\otimes \; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \\[2ex] \boldsymbol{W} \; &\leftarrow \; \boldsymbol{W} - \alpha \; \; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \oslash \sqrt{\boldsymbol{s} + \epsilon} \end{align*}\end{split}\]
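As a minimal sketch of this update rule (the function name `adagrad_step` and the toy quadratic cost are illustrative, not taken from a particular library):

```python
import numpy as np

def adagrad_step(W, grad, s, alpha=0.1, eps=1e-10):
    """One AdaGrad update. s accumulates the squared gradient components."""
    s = s + grad * grad                      # s <- s + grad (x) grad
    W = W - alpha * grad / np.sqrt(s + eps)  # per-component scaled step
    return W, s

# Toy usage on a quadratic cost J(W) = 0.5 * W^T A W with an elongated bowl
A = np.diag([10.0, 1.0])          # much steeper along the first dimension
W = np.array([1.0, 1.0])
s = np.zeros_like(W)
for _ in range(100):
    grad = A @ W                  # gradient of the quadratic cost
    W, s = adagrad_step(W, grad, s)
print(W)                          # moves toward the minimum at the origin
```

Note how the steep first dimension accumulates a large \(s_j\) and therefore gets its step scaled down the most.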

The con: AdaGrad performs well on simple problems, like linear regression. With neural networks, it tends to stop too soon, before reaching the global minimum, because the accumulated squared gradients keep shrinking the effective learning rate. Luckily, other adaptive algorithms fix this.

RMSProp#

RMSProp stands for Root Mean Square Propagation. It works the same way as AdaGrad, except that it keeps track of an exponentially decaying average of past squared gradients:

(138)#\[\begin{split}\begin{align*} \boldsymbol{s} \; &\leftarrow \; \beta\boldsymbol{s} \;+\; (1 - \beta) \; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \;\otimes \; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \\[2ex] \boldsymbol{W} \; &\leftarrow \; \boldsymbol{W} - \alpha \; \; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \oslash \sqrt{\boldsymbol{s} + \epsilon} \end{align*}\end{split}\]

with the decay rate \(\beta\) between 0 and 1, typically set to 0.9. One might ask: what is the point of introducing an extra hyperparameter? It turns out the default value works well in most cases and does not require tuning. The update in expression (138) is a recursive way of computing a so-called Exponential Moving Average (EMA). In other words, \(\beta\) acts as a constant smoothing factor: a lower \(\beta\) discounts older observations faster, while a higher \(\beta\) gives more weight to the previous gradient, a bit less to the one before it, and so on.
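A minimal sketch of the RMSProp update, again with an illustrative function name (`rmsprop_step`) rather than a library API:

```python
import numpy as np

def rmsprop_step(W, grad, s, alpha=0.001, beta=0.9, eps=1e-10):
    """One RMSProp update. s is an exponential moving average of squared gradients."""
    s = beta * s + (1 - beta) * grad * grad  # EMA of grad (x) grad
    W = W - alpha * grad / np.sqrt(s + eps)  # per-component scaled step
    return W, s
```

Unlike AdaGrad, old squared gradients are gradually forgotten, so the effective learning rate does not keep shrinking toward zero.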

For tasks more elaborate than the simple linear case, RMSProp is robust. It reigned until dethroned by a newcomer called Adam. Before introducing Adam, let’s first cover the notion of momentum optimization.

Momentum Optimization#

In physics, the momentum \(\boldsymbol{p}\) is a vector obtained by taking the product of the mass and velocity of an object. It quantifies motion. In computer science, momentum refers to the direction and speed at which the parameters move - via iterative updates - through the parameter space. With momentum optimization, inertia is added to the system by updating the weights using the momenta from past iterations, which keeps the updates going in the same direction. The common analogy is a ball rolling down a curved surface: it starts slowly but soon “gains momentum”, so it can cross flat regions of the gradient surface much faster than classic Gradient Descent. Adding momentum considerably speeds up the Gradient Descent process. It also helps roll past local minima.

The momentum optimization algorithm is as follows:

\[\begin{split}\begin{align*} \boldsymbol{m} \; &\leftarrow \; \beta \boldsymbol{m} \; - \; \alpha \; \; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \\[2ex] \boldsymbol{W} \; &\leftarrow \; \boldsymbol{W} + \boldsymbol{m} \end{align*}\end{split}\]

Here the \(\beta\) parameter prevents the momentum from becoming too large. It is analogous to introducing friction: \(\beta\) ranges from 0 to 1, with 0 meaning high friction and 1 meaning no friction at all. In the literature, a common default value is \(\beta\) = 0.9.
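A minimal sketch of one momentum update (`momentum_step` is an illustrative name, not a library function):

```python
import numpy as np

def momentum_step(W, grad, m, alpha=0.01, beta=0.9):
    """One momentum optimization update. m accumulates a decaying sum of scaled gradients."""
    m = beta * m - alpha * grad  # beta acts as friction on the accumulated velocity
    W = W + m                    # move in the direction of the accumulated momentum
    return W, m
```

With \(\beta\) = 0.9 and a constant gradient, the momentum converges to a step \(1/(1 - \beta) = 10\) times larger than a plain Gradient Descent step, which is where the speed-up comes from.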

Adam#

Adam stands for Adaptive Moment Estimation. It merges RMSProp with momentum optimization. Recall that RMSProp uses an exponentially decaying average of past squared gradients, while momentum optimization tracks past gradients (not squared).

Mathematically:

(139)#\[\begin{split}\begin{align*} \boldsymbol{m} \; &\leftarrow \; \beta_1 \boldsymbol{m} \; - \; (1 - \beta_1) \; \; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}}\\[1ex] \boldsymbol{s} \; &\leftarrow \; \; \beta_2\boldsymbol{s} \;+\; (1 - \beta_2) \; \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \otimes \frac{\partial J\left(\boldsymbol{W}\right)}{\partial \boldsymbol{W}} \\[1ex] \boldsymbol{\widehat{m}} &\leftarrow \frac{\boldsymbol{m}}{1 - \beta_1^t} \\[1ex] \boldsymbol{\widehat{s}} &\leftarrow \frac{\boldsymbol{s}}{1 - \beta_2^t} \\[1ex] \boldsymbol{W} \; &\leftarrow \; \boldsymbol{W} + \alpha \; \boldsymbol{\widehat{m}} \oslash \sqrt{\boldsymbol{\widehat{s}} + \epsilon} \end{align*} \end{split}\]

with \(t\) the iteration number (starting at 1).

The first step is not exactly momentum optimization: instead of an exponentially decaying sum, Adam computes an exponentially decaying average of the gradients.

The algorithm seems complicated at first glance, especially the steps computing \(\boldsymbol{\widehat{m}}\) and \(\boldsymbol{\widehat{s}}\). These divisions are bias corrections: since \(\boldsymbol{m}\) and \(\boldsymbol{s}\) are initialized at zero, they are biased toward zero at the start of training, and dividing by \(1 - \beta_1^t\) and \(1 - \beta_2^t\) boosts them during the first iterations so they do not stay near zero.

“Many extra parameters on top of the learning rate, so what is the point?”, you may ask. In practice the momentum decay \(\beta_1\) is set to 0.9, the scaling decay \(\beta_2\) to 0.999 and \(\epsilon\) to 10\(^{-7}\). And while the learning rate is still a hyperparameter (usually set to 0.001 at the start), it is adapted to the optimization problem at hand by the algorithm. In that sense, the Adam optimizer is almost hyperparameter free.
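Putting expression (139) together, here is a minimal sketch of one Adam update (`adam_step` is an illustrative name; real implementations add further details):

```python
import numpy as np

def adam_step(W, grad, m, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update at iteration t (t starts at 1)."""
    m = beta1 * m - (1 - beta1) * grad          # decaying average of gradients (momentum part)
    s = beta2 * s + (1 - beta2) * grad * grad   # decaying average of squared gradients (RMSProp part)
    m_hat = m / (1 - beta1 ** t)                # bias correction: boosts m early in training
    s_hat = s / (1 - beta2 ** t)                # bias correction: boosts s early in training
    W = W + alpha * m_hat / np.sqrt(s_hat + eps)  # + sign because m stores the negative gradient direction
    return W, m, s
```

In practice you would rely on a library implementation, for instance `keras.optimizers.Adam(learning_rate=0.001)` in Keras.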

Learn More