一种求解凸-PL极小极大优化问题的随机交替梯度投影算法

doi:10.12677/ORF.2023.131031

期刊菜单

一种求解凸-PL极小极大优化问题的随机交替梯度投影算法
A Stochastic Alternating Gradient Projection Algorithm for Convex-PL Minimax Optimization Problems

DOI: 10.12677/ORF.2023.131031, PDF, HTML, XML, 科研立项经费支持
作者: 覃媛媛, 王福胜, 王静：太原师范学院数学与统计学院，山西晋中
关键词: 机器学习；凸-PL；GDA算法；随机梯度；Machine Learning； Convex-PL； GDA Algorithm； Stochastic Gradient

摘要: 针对机器学习中常见的一类非凸极小极大优化问题，本文提出了一种新的随机交替梯度投影算法，该算法基于交替梯度投影算法，在随机环境中用随机梯度进行更新，和随机交替梯度下降–上升算法相比，运算效率大大提高。在高斯分布数据集上的数值实验结果表明新算法是可行有效的。

Abstract: To solve a class of nonconvex minimax optimization problems in machine learning, we propose a new stochastic alternating gradient projection algorithm (SAGP). This algorithm is based on the alternating gradient projection algorithm (AGP), which is updated with stochastic gradients in a stochastic environment. Compared with the stochastic alternating gradient descent ascent algorithm (SAGDA), the operation efficiency is greatly improved. Numerical experiments on Gaussian distributed data sets show that the new algorithm is feasible and effective.

文章引用：覃媛媛, 王福胜, 王静. 一种求解凸-PL极小极大优化问题的随机交替梯度投影算法[J]. 运筹与模糊学, 2023, 13(1): 292-297. https://doi.org/10.12677/ORF.2023.131031

1. 引言

本文考虑以下极小极大优化问题：

$\min_{x \in R^{n}} \max_{y \in R^{m}} f (x, y) ≜ Ε [F (x, y; ξ)]$ ，

其中 $ξ$ 是具有支持度 $Ξ$ 的随机向量， $f (x, y)$ 是光滑函数， $f (x, y)$ 对于 $x$ 是凸的，对于 $y$ 是非凹的。

在机器学习中，许多问题都可表述为 $\min_{x} \max_{y} f (x, y)$ 的形式，如生成性对抗网络(GAN)、分布式非凸优化、多域鲁棒学习、信号处理中的功率控制和收发器设计问题等。求解此类问题最简单的方法之一是梯度下降(GD)的自然泛化，称为梯度下降–上升(GDA)算法，在每次迭代时通过同步或交替的渐变更新来对 $x$ 执行梯度下降步骤，对 $y$ 执行梯度上升步骤。

凸–凹极小极大优化问题对应的变分不等式是单调算子，目前已有较好的成果 [1] [2] [3] [4] [5] 。非凸极小极大优化问题，包括非凸–凹、凸–非凹、非凸–非凹极小极大优化问题，由于其非凸性、非凹性和非光滑性的特殊结构，一般是NP-难的，目前已有一些理论成果。在非凸–凹环境下，求解此类问题的算法有两类，分别是嵌套循环算法 [6] [7] [8] [9] [10] 和单循环算法。单循环算法多基于GDA算法改进，如two-time-scale GDA and stochastic GDA算法 [11] ，GDmax算法 [12] ，混合块逐次逼近算法(HiBSA) [13] ，交替梯度投影算法(AGP) [14] ，proximal alternating GDA算法 [15] ，Smoothed-GDA算法 [16] ，Stoc-AGDA和Stoc-Smoothed-AGDA [17] 和其他改进的GDA算法 [18] [19] 。

在凸–非凹环境下，对于任意给定的 $x$ ，求解 $\max_{y} f (x, y)$ 是NP-难的。几乎现有的嵌套循环算法和部分现有的单循环算法 [12] [13] 都会失去理论保证，因为这些算法都需求解 $\max_{y} f (x, y)$ 。目前只有AGP算法 [14] 可求解目标函数 $f (x, y)$ 是光滑函数的凸–非凹极小极大优化问题，并证明了其收敛性。

凸–非凹极小极大优化问题的一类子问题是凸-PL极小极大优化问题，即 $f (x, y)$ 对于 $x$ 是凸的，对于 $y$ 是满足Polyak-Lojasiewicz(PL)条件的。PL条件最初是由Polyak [20] 引入，并证明了梯度下降以线性速率全局收敛。目标函数的变量 $y$ 满足PL条件的假设比 $y$ 具有强凹性的假设更温和，甚至不要求目标函数在 $y$ 上是凹的，这种假设在线性二次型调节器 [21] 和超参数化神经网络 [22] 中被证明成立。对于极小极大优化问题，有的研究 [17] 是使变量 $y$ 满足PL条件，有的研究 [23] [24] 则是通过使 $x$ 满足PL条件来找到全局解，同时PL条件被证明适用于机器学习中某些非凸应用 [21] [25] ，包括与深度神经网络相关的问题 [22] [26] ，因此PL条件引起了广泛的关注。

非凸–凹环境下，许多算法 [11] [27] 结合随机梯度都取得了较好的收敛结果。受启发于Stoc-Smoothed-AGDA算法 [17] ，本文考虑在凸-PL环境下，将随机梯度与AGP算法 [14] 相结合，从而提出了随机交替梯度投影算法(SAGP)。

2. 交替梯度投影算法

本节首先回顾交替梯度投影算法(AGP) [14] 。考虑以下极小极大优化问题：

$\min_{x \in X} \max_{y \in Y} f (x, y)$ ，

其中 $f : X \times Y \to R$ 是光滑函数， $X \subseteq R^{n}, Y \subseteq R^{m}$ 为非空闭凸集。

交替梯度投影算法是单循环算法，每次迭代通过两个梯度投影步骤更新 $x$ 和 $y$ 。在第 $t$ 次迭代中，AGP算法使用如下辅助函数的梯度进行更新，即：

$\hat{f} (x, y) = f (x, y) + \frac{b_{t}}{2} {‖ x ‖}^{2} - \frac{c_{t}}{2} {‖ y ‖}^{2}$ ，

其中 $b_{t} \geq 0$ 和 $c_{t} \geq 0$ 是正则化参数。在凸–非凹环境下， $c_{t} = 0$ ， ${b_{t}}$ 是非负单调递减序列，以下给出交替梯度投影算法的框架：

算法1 交替梯度投影算法(AGP)

步骤1 输入固定步长 $τ_{1} > 0$ ， $τ_{2} > 0$ ，初始点 $(x_{0}, y_{0})$ 及参数 $b_{0} > 0$ ，

步骤2 计算 $b_{t}$ 并更新 $x_{t}$ ：

$x_{t + 1} = P_{X} (x_{t} - τ_{1} (\nabla_{x} f (x_{t}, y_{t}) + b_{t} x_{t}))$ ，

步骤3 更新 $y_{t}$ ：

$y_{t + 1} = P_{Y} (y_{t} + τ_{2} \nabla_{y} f (x_{t + 1}, y_{t}))$ ，

步骤4 当算法满足终止准则时，停止；否则，令 $t = t + 1$ ，返回步骤2。

3. PL条件

目标函数 $f (x, y)$ 在 $y$ 上的PL条件：对于任意固定的 $x$ ， $\max_{y} f (x, y)$ 有非空解集和有限最优值，存在 $μ > 0$ 使得 ${‖ \nabla_{y} f (x, y) ‖}^{2} \geq 2 μ [\max_{y} f (x, y) - f (x, y)], \forall x, y$ 。

4. 随机交替梯度投影算法

根据文献 [17] 所述，在随机环境中求解NC-PL极小极大优化问题，基于交替梯度下降–上升算法(AGDA)和Smoothed-AGDA算法 [16] ，采用随机梯度 $G_{x} (x, y, ξ)$ 和 $G_{y} (x, y, ξ)$ 进行更新，其中 $G_{x} (x, y, ξ)$ 和 $G_{y} (x, y, ξ)$ 分别是 $\nabla_{x} f (x, y)$ 和 $\nabla_{y} f (x, y)$ 的无偏随机估计且方差有界 $σ^{2} > 0$ ，构造了新算法Stoc-AGDA和Stoc-Smoothed-AGDA。由于上述算法具有良好的数值表现，对于凸-PL极小极大优化问题，本文考虑将随机梯度结合到交替梯度投影算法(AGP)中，以下给出随机交替梯度投影算法的框架：

算法2 随机交替梯度投影算法(SAGP)

步骤1 输入固定步长 $τ_{1} > 0$ ， $τ_{2} > 0$ ，初始点 $(x_{0}, y_{0})$ 及参数 $b_{0} > 0$ ，

步骤2 for all $t = 0, 1, 2, \dots, T - 1$ do，

步骤3 抽取两个独立同分布样本 $ξ_{1}^{t}$ ， $ξ_{2}^{t}$ ，

步骤4 $x_{t + 1} = x_{t} - τ_{1} [G_{x} (x_{t}, y_{t}, ξ_{1}^{t}) + b_{t} x_{t}]$ ，

步骤5 $y_{t + 1} = y_{t} + τ_{2} G_{y} (x_{t + 1}, y_{t}, ξ_{2}^{t})$ ，

步骤6 end for，

步骤7 随机从 ${(x_{t}, y_{t})}_{t = 0}^{T - 1}$ 中均匀抽取 $(\tilde{x}, \tilde{y})$ 。

5. 数值实验

本节将SAGP算法应用于求解具有线性生成器的Toy WGAN模型，本文设置与 [17] 相同，并给出实验结果验证了新算法的有效性，模型表示如下：

$\min_{μ, σ} \max_{ϕ_{1}, ϕ_{2}} F (μ, σ, ϕ_{1}, ϕ_{2}) ≜ Ε_{(x^{r e a l}, z) \sim D} D_{ϕ} (x^{r e a l}) - D_{ϕ} (G_{μ, σ} (z)) - λ {‖ ϕ ‖}^{2}$ ，

其中生成器为 $G_{μ, σ} (z) = μ + σ z$ ，判别器为 $D_{ϕ} (x) = ϕ_{1} x + ϕ_{2} x^{2}$ ， $x$ 是来自生成器的真实数据或假数据， $D$ 为真实数据和潜在变量的分布。真实数据 $x^{r e a l}$ 和潜在变量 $z$ 的高斯分布数据集来自平均值为0、方差为1的正态分布， $x^{r e a l}$ 产生于 $\hat{μ} = 0, \hat{σ} = 0.1$ 。批量大小固定为100，实验重复3次。实验结果如图1所示。

图1对比了SAGP算法和SAGDA算法的收敛速度，其中 $x$ 轴代表迭代次数， $y$ 轴代表随机梯度范数取值范围的变化过程，其中RMSprop算法步长为5e−4，参数为0.9；Adam算法步长为5e−4，参数为0.5和0.9；SAGDA算法步长 $τ_{1}$ = 1e−1， $τ_{2}$ = 5e−1；SAGP算法步长 $τ_{1}$ = 2e−1， $τ_{2}$ = 8e−1。图1和图2分别对应于四种算法在高斯分布数据集上的实验结果对比。从图中可以看出，本文提出的新算法SAGP的收敛速度整体上比SAGDA算法快，有效地验证了新算法的可行性。

Figure 1. The change process of the range of the random gradient norm of the discriminator on the Gaussian distribution data set

图1. 在高斯分布数据集上判别器的随机梯度范数取值范围的变化过程

Figure 2. The change process of the value range of the random gradient norm of the generator on the Gaussian distribution data set

图2. 在高斯分布数据集上生成器的随机梯度范数取值范围的变化过程

6. 小结

本文在凸-PL环境下将交替梯度投影算法(AGP)与随机梯度相结合，提出了一种新的随机交替梯度投影算法(SAGP)。从实验结果分析来看，新算法在高斯分布数据集上优于SAGDA算法，从而验证了新算法的有效性和可行性。

基金项目

山西省基础研究计划(自由探索类)面上项目(202103021224303)；太原师范学院研究生教育创新项目(SYYJSYC-2287)。

参考文献

[1]	Nemirovski, A. (2004) Prox-Method with Rate of Convergence O(1/t) for Variational Inequalities with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle Point Problems. SIAM Journal on Optimization, 15, 229-251. [Google Scholar] [CrossRef]
[2]	Nesterov, Y. (2007) Dual Extrapolation and Its Applications to Solving Variational Inequalities and Related Problems. Mathematical Programming, 109, 319-344. [Google Scholar] [CrossRef]
[3]	Monteiro, R.D.C. and Svaiter, B.F. (2010) On the Complexity of the Hybrid Proximal Extragradient Method for the Iterates and the Ergodic Mean. SIAM Journal on Optimization, 20, 2755-2787. [Google Scholar] [CrossRef]
[4]	Monteiro, R.D.C. and Svaiter, B.F (2011) Complexity of Variants of Tseng’s Modified F-B Splitting and Korpelevich’s Methods for Hemivariational Inequalities with Applications to Saddle-Point and Convex Optimization Problems. SIAM Journal on Optimization, 21, 1688-1720. [Google Scholar] [CrossRef]
[5]	Abernethy, J., Lai, K.A. and Wibisono, A. (2019) Last-Iterate Convergence Rates for Min-Max Optimization. ArXiv Preprint arxiv: 1906.02027.
[6]	Rafique, H., Liu, M., Lin, Q. and Yang, T. (2018) Non-Convex Min-Max Optimization: Provable Algorithms and Applications in Machine Learning. ArXiv Preprint arXiv: 1810.02060.
[7]	Nouiehed, M., Sanjabi, M., Huang, T., Lee, J.D. and Razaviyayn, M. (2019) Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods. 33rd Con-ference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, 8-14 December 2019, 14934-14942.
[8]	Thekumparampil, K.K., Jain, P., Netrapalli, P. and Oh, S. (2019) Efficient Algorithms for Smooth Minimax Optimization. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, 8-14 December 2019, 12659-12670.
[9]	Kong, W. and Monteiro, R.D.C. (2021) An Accelerated Inexact Proximal Point Method for Solving Non-convex Concave Min-Max Problems. SIAM Journal on Optimization, 31, 2558-2585. [Google Scholar] [CrossRef]
[10]	Lin, T., Jin, C. and Jordan, M.I. (2020) Near-Optimal Algorithms for Minimax Optimization. In: Abernethy, J. and Agarwal, S., Eds., Proceedings of Thirty Third Conference on Learning Theory, PMLR 125, ML Research Press, Maastricht, 2738-2779.
[11]	Lin, T., Jin, C. and Jordan, M.I. (2020) On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems. In: Daumé III, H. and Singh, S., Eds., Proceedings of the 37th International Con-ference on Machine Learning, PMLR 119, ML Research Press, Maastricht, 6083-6093.
[12]	Jin, C., Netrapalli, P. and Jordan, M.I. (2019) Minmax Optimization: Stable Limit Points of Gradient Descent Ascent Are Locally Optimal. ArXiv Preprint arXiv: 1902.00618.
[13]	Lu, S., Tsaknakis, I., Hong, M. and Chen, Y. (2020) Hybrid Block Successive Approximation for One-Sided Non-Convex Min-Max Problems: Algorithms and Applications. IEEE Transactions on Signal Processing, 68, 3676-3691. [Google Scholar] [CrossRef]
[14]	Xu, Z., Zhang, H., Xu, Y. and Lan, G. (2020) A Unified Single-loop Alternating Gradient Projection Algorithm for Nonconvex-Concave and Convex-Nonconcave Minimax Problems. ArXiv Preprint arXiv: 2006.02032.
[15]	Boţ, R.I. and Böhm, A. (2020) Alternating Proximal-Gradient Steps for (Stochastic) Noncon-vex-Concave Minimax Problems. ArXiv Preprint arXiv: 2007.13605.
[16]	Zhang, J., Xiao, P., Sun, R. and Luo, Z.-Q. (2020) A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, 6-12 December 2020.
[17]	Yang, J., Orvieto, A., Lucchi, A. and He, N. (2022) Faster Single-Loop Algorithms for Minimax Optimization without Strong Concavity. In: Camps-Valls, G., Ruiz, F.J.R. and Valera, I., Eds., Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, PMLR 151, ML Research Press, Maastricht, 5485-5517.
[18]	Chambolle, A. and Pock, T. (2016) On the Ergodic Convergence Rates of a First-Order Primaldual Algorithm. Mathematical Programming, 159, 253-287. [Google Scholar] [CrossRef]
[19]	Daskalakis, C. and Panageas, I. (2018) The Limit Points of (Optimistic) Gradient Descent in Min-Max Optimization. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, 2-8 December 2018, 9236-9246.
[20]	Polyak, B.T. (1963) Gradient Methods for Minimizing Functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3, 643-653.
[21]	Fazel, M., Ge, R., Kakade, S. and Mesbahi, M. (2018) Global Convergence of Policy Gradient Methods for the Linear Quadratic Regulator. In: Dy, J. and Krause, A., Eds., Proceedings of the 35th International Conference on Machine Learning, PMLR 80, ML Research Press, Maastricht, 1467-1476.
[22]	Liu, C., Zhu, L. and Belkin, M. (2020) Loss Landscapes and Optimization in Over-Parameterized Non-Linearsystems and Neural Net-works. ArXiv Preprint arXiv: 2003.00307.
[23]	Yang, J., Kiyavash, N. and He, N. (2020) Global Convergence and Variance Reduction for a Class of Nonconvex-Nonconcave Minimax Problems. 34th Conference on Neural Information Processing Sys-tems (NeurIPS 2020), Vancouver, 6-12 December 2020, 1153-1165.
[24]	Guo, Z., Liu, M., Yuan, Z., Shen, L., Liu, W. and Yang, T. (2020) Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks. In: Daumé III, H. and Singh, A., Eds., Proceedings of the 37th International Conference on Machine Learning, PMLR 119, ML Research Press, Maas-tricht, 3864-3874.
[25]	Cai, Q., Hong, M., Chen, Y. and Wang, Z. (2019) On the Global Convergence of Imitation Learning: A Case for Linear Quadratic Regulator. ArXiv Preprint arXiv: 1901.03674.
[26]	Du, S., Lee, J., Li, H., Wang, L. and Zhai, X. (2019) Gradient Descent Finds Global Minima of Deep Neural Networks. In: Chaudhuri, K. and Salakhutdinov, R., Eds., Proceedings of the 36th International Conference on Machine Learning, PMLR 97, ML Research Press, Maastricht, 1675-1685.
[27]	Rafique, H., Liu, M., Lin, Q. and Yang, T. (2022) Weakly-Convex-Concave Min-Max Optimization Provable Algo-rithms and Applications in Machine Learning. Optimization Methods and Software, 37, 1087-1121. [Google Scholar] [CrossRef]

为你推荐

友情链接