Published as a conference paper at ICLR 2015
ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION
Diederik P. Kingma*
University of Amsterdam, OpenAI
dpkingma@openai.com
Jimmy Lei Ba
University of Toronto
jimmy@psi.utoronto.ca
ABSTRACT
We introduce Adam, an algorithm for first-order gradient-based optimization of
stochastic objective functions, based on adaptive estimates of lower-order mo-
ments. The method is straightforward to implement, is computationally efficient,
has low memory requirements, is invariant to diagonal rescaling of the gradients,
and is well suited for problems that are large in terms of data and/or parameters.
The method is also appropriate for non-stationary objectives and problems with
very noisy and/or sparse gradients. The hyper-parameters have intuitive interpre-
tations and typically require little tuning. Some connections to related algorithms,
by which Adam was inspired, are discussed. We also analyze the theoretical con-
vergence properties of the algorithm and provide a regret bound on the conver-
gence rate that is comparable to the best known results under the online convex
optimization framework. Empirical results demonstrate that Adam works well in
practice and compares favorably to other stochastic optimization methods. Finally,
we discuss AdaMax, a variant of Adam based on the infinity norm.
1 INTRODUCTION
Stochastic gradient-based optimization is of core practical importance in many fields of science and
engineering. Many problems in these fields can be cast as the optimization of some scalar parameter-
ized objective function requiring maximization or minimization with respect to its parameters. If the
function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization
method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same
computational complexity as just evaluating the function. Often, objective functions are stochastic.
For example, many objective functions are composed of a sum of subfunctions evaluated at different
subsamples of data; in this case optimization can be made more efficient by taking gradient steps
w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself
as an efficient and effective optimization method that was central in many machine learning success
stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton
& Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have other
sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. For
all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this
paper is on the optimization of stochastic objectives with high-dimensional parameter spaces. In
these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be
restricted to first-order methods.
We propose Adam, a method for efficient stochastic optimization that only requires first-order gra-
dients with little memory requirement. The method computes individual adaptive learning rates for
different parameters from estimates of first and second moments of the gradients; the name Adam
is derived from adaptive moment estimation. Our method is designed to combine the advantages
of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gra-
dients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary
settings; important connections to these and other stochastic optimization methods are clarified in
section 5. Some of Adam’s advantages are that the magnitudes of parameter updates are invariant to
rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter,
it does not require a stationary objective, it works with sparse gradients, and it naturally performs a
form of step size annealing.
*Equal contribution. Author ordering determined by coin flip over a Google Hangout.
Algorithm 1: Adam, our proposed algorithm for stochastic optimization. See section 2 for details, and for a slightly more efficient (but less clear) order of computation. $g_t^2$ indicates the elementwise square $g_t \odot g_t$. Good default settings for the tested machine learning problems are $\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All operations on vectors are element-wise. With $\beta_1^t$ and $\beta_2^t$ we denote $\beta_1$ and $\beta_2$ to the power $t$.
Require: $\alpha$: Stepsize
Require: $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates
Require: $f(\theta)$: Stochastic objective function with parameters $\theta$
Require: $\theta_0$: Initial parameter vector
  $m_0 \leftarrow 0$ (Initialize 1st moment vector)
  $v_0 \leftarrow 0$ (Initialize 2nd moment vector)
  $t \leftarrow 0$ (Initialize timestep)
  while $\theta_t$ not converged do
    $t \leftarrow t + 1$
    $g_t \leftarrow \nabla_\theta f_t(\theta_{t-1})$ (Get gradients w.r.t. stochastic objective at timestep $t$)
    $m_t \leftarrow \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$ (Update biased first moment estimate)
    $v_t \leftarrow \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$ (Update biased second raw moment estimate)
    $\hat{m}_t \leftarrow m_t / (1 - \beta_1^t)$ (Compute bias-corrected first moment estimate)
    $\hat{v}_t \leftarrow v_t / (1 - \beta_2^t)$ (Compute bias-corrected second raw moment estimate)
    $\theta_t \leftarrow \theta_{t-1} - \alpha \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (Update parameters)
  end while
  return $\theta_t$ (Resulting parameters)
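To connect the pseudo-code to working code, the following is a minimal NumPy sketch of Algorithm 1, not the authors' reference implementation; the noisy quadratic objective, its gradient, and all variable names are invented purely for illustration.

import numpy as np

def adam(grad_fn, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, num_steps=10_000):
    """Minimal sketch of Algorithm 1; returns the final parameter vector."""
    theta = theta0.astype(float).copy()
    m = np.zeros_like(theta)   # 1st moment vector
    v = np.zeros_like(theta)   # 2nd moment vector
    for t in range(1, num_steps + 1):
        g = grad_fn(theta, t)                    # stochastic gradient at timestep t
        m = beta1 * m + (1 - beta1) * g          # biased first moment estimate
        v = beta2 * v + (1 - beta2) * g**2       # biased second raw moment estimate
        m_hat = m / (1 - beta1**t)               # bias-corrected first moment estimate
        v_hat = v / (1 - beta2**t)               # bias-corrected second raw moment estimate
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize a noisy quadratic f_t(theta) = ||theta - 3||^2 + noise.
rng = np.random.default_rng(0)
noisy_grad = lambda theta, t: 2 * (theta - 3.0) + rng.normal(scale=0.1, size=theta.shape)
print(adam(noisy_grad, theta0=np.zeros(5)))      # ends up close to 3.0 in each coordinate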
In section 2 we describe the algorithm and the properties of its update rule. Section 3 explains
our initialization bias correction technique, and section 4 provides a theoretical analysis of Adam’s
convergence in online convex programming. Empirically, our method consistently outperforms other
methods for a variety of models and datasets, as shown in section 6. Overall, we show that Adam is
a versatile algorithm that scales to large-scale high-dimensional machine learning problems.
2 ALGORITHM
See algorithm 1 for pseudo-code of our proposed algorithm Adam. Let f(θ) be a noisy objec-
tive function: a stochastic scalar function that is differentiable w.r.t. parameters θ. We are in-
terested in minimizing the expected value of this function, E[f(θ)] w.r.t. its parameters θ. With
$f_1(\theta), \ldots, f_T(\theta)$ we denote the realisations of the stochastic function at subsequent timesteps $1, \ldots, T$. The stochasticity might come from the evaluation at random subsamples (minibatches) of datapoints, or arise from inherent function noise. With $g_t = \nabla_\theta f_t(\theta)$ we denote the gradient, i.e. the vector of partial derivatives of $f_t$, w.r.t. $\theta$ evaluated at timestep $t$.
The algorithm updates exponential moving averages of the gradient ($m_t$) and the squared gradient ($v_t$), where the hyper-parameters $\beta_1, \beta_2 \in [0, 1)$ control the exponential decay rates of these moving averages. The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient. However, these moving averages are initialized as (vectors of) 0's, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the $\beta$s are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$. See section 3 for more details.
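As a one-step illustration of the bias (the numbers are chosen only for this example): with $m_0 = 0$ and the default $\beta_1 = 0.9$, the first update gives
$$m_1 = \beta_1 m_0 + (1 - \beta_1) g_1 = 0.1\, g_1,$$
which underestimates the gradient's scale by a factor of ten; dividing by $(1 - \beta_1^1) = 0.1$ recovers $\hat{m}_1 = g_1$, an unbiased estimate at $t = 1$.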
Note that the efficiency of algorithm 1 can, at the expense of clarity, be improved upon by changing
the order of computation, e.g. by replacing the last three lines in the loop with the following lines:
$\alpha_t = \alpha \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$ and $\theta_t \leftarrow \theta_{t-1} - \alpha_t \cdot m_t / (\sqrt{v_t} + \hat{\epsilon})$.
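In code, this reordering folds both bias corrections into the stepsize. A sketch of what would replace the last three loop lines of the NumPy example above (same illustrative variable names, which are this document's assumption rather than the authors' code):

        # Reordered update: fold both bias corrections into the stepsize alpha_t.
        alpha_t = alpha * np.sqrt(1 - beta2**t) / (1 - beta1**t)
        theta -= alpha_t * m / (np.sqrt(v) + eps)   # eps here plays the role of eps-hat in the text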
2.1 ADAM'S UPDATE RULE
An important property of Adam's update rule is its careful choice of stepsizes. Assuming $\epsilon = 0$, the effective step taken in parameter space at timestep $t$ is $\Delta_t = \alpha \cdot \hat{m}_t / \sqrt{\hat{v}_t}$. The effective stepsize has two upper bounds: $|\Delta_t| \leq \alpha \cdot (1 - \beta_1)/\sqrt{1 - \beta_2}$ in the case $(1 - \beta_1) > \sqrt{1 - \beta_2}$, and $|\Delta_t| \leq \alpha$ otherwise.
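As a quick arithmetic check with the default hyper-parameters: $1 - \beta_1 = 0.1$ and $\sqrt{1 - \beta_2} = \sqrt{0.001} \approx 0.0316$, so the first case applies and the bound evaluates to
$$|\Delta_t| \leq \alpha \cdot \frac{0.1}{0.0316} \approx 3.16\,\alpha.$$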