MATHEMATICAL ANALYSIS OF THE CONVERGENCE OF OPTIMIZATION ALGORITHMS IN NEURAL NETWORK TRAINING
Abstract
This paper presents a rigorous mathematical analysis of the convergence properties of the key optimization algorithms used in neural network training. The study investigates the dynamics of Gradient Descent (GD), Stochastic Gradient Descent (SGD), and Adam on non-convex loss landscapes. The analysis shows that stochastic methods hold a distinct advantage in escaping saddle points, because the noise in mini-batch gradient estimates perturbs iterates away from regions where the gradient vanishes, while adaptive methods accelerate convergence through coordinate-wise normalization of the update step. The results provide a theoretical foundation for the trade-off between optimization speed and the generalization capability of deep learning models.
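For concreteness, the coordinate-wise normalization referred to above is the mechanism of the standard Adam update of Kingma and Ba (2015). The following is a minimal sketch in conventional notation, which is an assumption of this summary rather than notation taken from the paper itself: theta_t denotes the parameters, g_t the stochastic gradient at step t, alpha the step size, beta_1 and beta_2 the moment decay rates, and epsilon a small stabilizer; all vector operations are element-wise.

% Minimal sketch of the standard Adam update (Kingma & Ba, 2015).
% All operations on vectors are element-wise; dividing by sqrt(v_hat_t)
% is the coordinate-wise normalization discussed in the abstract.
\begin{align*}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t && \text{(first-moment average)} \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2} && \text{(second-moment average)} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}} && \text{(bias correction)} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} && \text{(normalized step)}
\end{align*}

Because each coordinate of the step is divided by a running estimate of that coordinate's gradient magnitude, poorly scaled directions receive proportionally larger steps, which is the source of the acceleration claimed above.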
Keywords
Neural networks, optimization, convergence analysis, gradient descent, stochastic optimization, saddle points, L-smoothness, Adam algorithm.