Article

Stochastic Gradient Descent for Nonconvex Learning Without Bounded Gradient Assumptions

Journal

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TNNLS.2019.2952219

Keywords

Convergence; Training; Stochastic processes; Optimization; Loss measurement; Learning systems; Learning theory; nonconvex optimization; Polyak-Lojasiewicz condition; stochastic gradient descent (SGD)

Funding

  1. National Key Research and Development Program of China [2017YFB1003102]
  2. National Natural Science Foundation of China [11571078, 11671307, 61672478, 61806091]
  3. Program for University Key Laboratory of Guangdong Province [2017KSYS008]
  4. Program for Guangdong Introducing Innovative and Entrepreneurial Teams [2017ZT07X386]
  5. Alexander von Humboldt Foundation


Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural nets and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results require imposing a nontrivial assumption of uniform boundedness of gradients for all iterates encountered in the learning process, which is hard to verify in practical implementations. In this article, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates, and by relaxing the standard smoothness assumption to Hölder continuity of gradients. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex and gradient-dominated objective functions. Linear convergence is further derived in the case of zero variance.
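The setting described above can be illustrated with a minimal sketch: SGD on a one-dimensional nonconvex objective satisfying the Polyak-Lojasiewicz (gradient-dominated) condition, with an unbiased noisy gradient oracle. The test function f(x) = x^2 + 3 sin^2(x), the noise model, and the step-size schedule below are illustrative assumptions, not the exact choices analyzed in the paper; note the true gradient grows linearly in x, so no uniform gradient bound holds over the whole line.

```python
import math
import random

def f(x):
    # Nonconvex but gradient-dominated (Polyak-Lojasiewicz) objective;
    # its unique stationary point is the global minimizer x* = 0, f(x*) = 0.
    return x * x + 3.0 * math.sin(x) ** 2

def grad(x):
    return 2.0 * x + 3.0 * math.sin(2.0 * x)

def sgd(x0, n_steps=3000, eta0=0.1, noise_std=0.1, seed=0):
    """SGD with a polynomially decaying step size eta_t = eta0 / (1 + t/20).

    Gradients are perturbed by Gaussian noise to mimic a stochastic
    first-order oracle with bounded variance; the gradients themselves
    are unbounded over the domain."""
    rng = random.Random(seed)
    x = x0
    for t in range(n_steps):
        eta = eta0 / (1.0 + t / 20.0)
        g = grad(x) + rng.gauss(0.0, noise_std)  # unbiased noisy gradient
        x -= eta * g
    return x

x_final = sgd(x0=2.0)
print(f"f(x0) = {f(2.0):.4f}, f(x_T) = {f(x_final):.6f}")
```

With the decaying schedule, the objective value settles near the global minimum despite the nonconvexity, consistent with the gradient-dominated rates discussed in the abstract; a constant step size would instead plateau at a noise floor proportional to the step size and the gradient variance.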



Recommended

Article Mathematics, Applied

Convergence of online mirror descent

Yunwen Lei, Ding-Xuan Zhou

APPLIED AND COMPUTATIONAL HARMONIC ANALYSIS (2020)

Article Computer Science, Information Systems

Data-Dependent Generalization Bounds for Multi-Class Classification

Yunwen Lei, Urun Dogan, Ding-Xuan Zhou, Marius Kloft

IEEE TRANSACTIONS ON INFORMATION THEORY (2019)

Article Mathematics, Applied

Analysis of Singular Value Thresholding Algorithm for Matrix Completion

Yunwen Lei, Ding-Xuan Zhou

JOURNAL OF FOURIER ANALYSIS AND APPLICATIONS (2019)

Article Automation & Control Systems

Adaptive nonlinear observer-based sliding mode control of robotic manipulator for handling an unknown payload

Guiying Li, Shuyang Wang, Zhigang Yu

Summary: This article introduces a novel approach to control robotic manipulators with an unknown constant payload, utilizing a nonlinear disturbance observer to estimate external forces induced by the payload. An adaptive technique is used to design the observer gain, which is then integrated with sliding mode control to alleviate chattering. The effectiveness of the proposed methods is validated through simulation results.

PROCEEDINGS OF THE INSTITUTION OF MECHANICAL ENGINEERS PART I-JOURNAL OF SYSTEMS AND CONTROL ENGINEERING (2021)

Article Mathematics, Applied

Differentially private SGD with non-smooth losses

Puyu Wang, Yunwen Lei, Yiming Ying, Hai Zhang

Summary: This paper investigates the privacy and generalization guarantees of differentially private stochastic gradient descent algorithms in stochastic convex optimization with non-smooth convex losses, relaxing the strict assumptions of prior work through output and gradient perturbations.

APPLIED AND COMPUTATIONAL HARMONIC ANALYSIS (2022)

Article Computer Science, Artificial Intelligence

Learning Rates for Stochastic Gradient Descent With Nonconvex Objectives

Yunwen Lei, Ke Tang

Summary: This paper develops novel learning rates of SGD for nonconvex learning by presenting high-probability bounds for both computational and statistical errors. It shows that the complexity of SGD iterates grows in a controllable manner with respect to the iteration number, shedding insights on implicit regularization. By also connecting the study to Rademacher chaos complexities, it slightly refines existing studies on the uniform convergence of gradients.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Noise-Efficient Learning of Differentially Private Partitioning Machine Ensembles

Zhanliang Huang, Yunwen Lei, Ata Kaban

Summary: This paper presents a framework that leverages unlabelled data to reduce noise requirement and improve predictive performance in differentially private decision forests. The framework includes a median splitting criterion for balanced leaves, a geometric privacy budget allocation technique, and a random sampling technique for accurate computation of private splitting points.

MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES, ECML PKDD 2022, PT IV (2023)

Article Computer Science, Artificial Intelligence

Stage-Wise Magnitude-Based Pruning for Recurrent Neural Networks

Guiying Li, Peng Yang, Chao Qian, Richang Hong, Ke Tang

Summary: This article proposes a novel stage-wise pruning method for recurrent neural networks (RNN), which can effectively prune both feedforward and RNN layers. Experimental results show that the proposed method performs significantly better than commonly used RNN pruning methods.

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS (2022)

Proceedings Paper Computer Science, Artificial Intelligence

Fine-grained Generalization Analysis of Vector-valued Learning

Liang Wu, Antoine Ledent, Yunwen Lei, Marius Kloft

Summary: This paper initiates the generalization analysis of regularized vector-valued learning algorithms by presenting bounds with a mild dependency on the output dimension and a fast rate on the sample size. The discussions relax existing assumptions on the restrictive constraint of hypothesis spaces, smoothness of loss functions, and low-noise conditions.

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2021)

Proceedings Paper Computer Science, Artificial Intelligence

Norm-Based Generalisation Bounds for Deep Multi-Class Convolutional Neural Networks

Antoine Ledent, Waleed Mustafa, Yunwen Lei, Marius Kloft

Summary: The study presents generalization error bounds for deep learning with two key improvements, including no explicit dependence on the number of classes and adapting Rademacher analysis of DNNs to incorporate weight sharing. The bounds scale based on the norms of the parameter matrices, rather than the number of parameters, and show that each convolutional filter contributes only once to the bound.

THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE (2021)

Article Automation & Control Systems

Generalization Performance of Multi-pass Stochastic Gradient Descent with Convex Loss Functions

Yunwen Lei, Ting Hu, Ke Tang

Summary: This paper provides optimal capacity-independent and capacity-dependent learning rates for SGD with general convex loss functions, without the need for bounded subgradient or smoothness assumptions, and stated with high probability. This improvement is achieved through a refined estimate on the norm of SGD iterates based on martingale analysis and concentration inequalities on empirical processes.

JOURNAL OF MACHINE LEARNING RESEARCH (2021)

Proceedings Paper Automation & Control Systems

Quaternion-based robust sliding mode control for spacecraft attitude tracking

Zhigang Yu, Guiying Li

PROCEEDINGS OF THE 2019 31ST CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2019) (2019)

Article Information Science & Library Science

Accelerate proposal generation in R-CNN methods for fast pedestrian extraction

Juncheng Wang, Guiying Li

ELECTRONIC LIBRARY (2019)
