Theorem: Suppose all loss functions \(f_t\) are \(\lambda\)-strongly convex (\(\lambda\)-SC) and the gradients at the iterates satisfy \(\Vert \nabla f_t(w_t) \Vert_2 \leq G\) for all \(t\). Then OGD with \(\eta_t = \frac{1}{\lambda t}\) achieves \(Reg_T^{\text{OGD}} \leq \frac{G^2}{\lambda}(1+ \log T)\).
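For reference, the proof uses only the standard definition of \(\lambda\)-strong convexity and the projected OGD update, stated here in their usual form:

\[f_t(w^\ast) \geq f_t(w_t) + \nabla f_t(w_t)^T (w^\ast - w_t) + \frac{\lambda}{2} \Vert w^\ast - w_t \Vert_2^2, \qquad w_{t+1} = \Pi_W\left(w_t - \eta_t \nabla f_t(w_t)\right),\]

where \(\Pi_W\) denotes Euclidean projection onto the feasible set \(W\) and \(w^\ast \in \arg\min_{w \in W} \sum_{t=1}^T f_t(w)\) is the best fixed decision in hindsight.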

Proof:

Step 1: for each \(t \in [T]\), upper-bound the instantaneous regret \(f_t(w_t) - f_t(w^\ast)\).

\[\begin{align*} f_t(w_t) - f_t(w^\ast) & \leq \nabla f_t(w_t)^T (w_t - w^\ast) - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 && (\lambda\text{-SC})\\ & \leq \frac{\Vert w_t - w^\ast \Vert^2_2 - \Vert w_{t+1} - w^\ast \Vert^2_2}{2 \eta_t} + \frac{\eta_t}{2}G^2 - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 && (\text{OGD step; see below}) \end{align*}\]
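The second inequality is the standard OGD step bound: since projections are non-expansive,

\[\begin{align*} \Vert w_{t+1} - w^\ast \Vert_2^2 & \leq \Vert w_t - \eta_t \nabla f_t(w_t) - w^\ast \Vert_2^2 \\ & = \Vert w_t - w^\ast \Vert_2^2 - 2\eta_t \nabla f_t(w_t)^T (w_t - w^\ast) + \eta_t^2 \Vert \nabla f_t(w_t) \Vert_2^2, \end{align*}\]

and rearranging together with \(\Vert \nabla f_t(w_t) \Vert_2 \leq G\) gives \(\nabla f_t(w_t)^T (w_t - w^\ast) \leq \frac{\Vert w_t - w^\ast \Vert_2^2 - \Vert w_{t+1} - w^\ast \Vert_2^2}{2\eta_t} + \frac{\eta_t}{2} G^2\).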

Step 2: sum over \(t\) to bound \(Reg_T^{\text{OGD}}\).

\[\begin{align*} Reg_T^{\text{OGD}} & = \sum_{t=1}^T (f_t(w_t) - f_t(w^\ast)) \\ & \leq \sum_{t=1}^T \left( \frac{\Vert w_t - w^\ast \Vert^2_2 - \Vert w_{t+1} - w^\ast \Vert^2_2}{2 \eta_t} + \frac{\eta_t}{2}G^2 - \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 \right) && (\text{Step 1}) \\ & = \sum_{t=2}^T \Vert w_t - w^\ast \Vert_2^2 \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right) + \Vert w_1 - w^\ast \Vert_2^2 \frac{1}{2\eta_1} - \Vert w_{T+1} - w^\ast \Vert_2^2 \frac{1}{2\eta_T} + G^2\sum_{t=1}^T \frac{\eta_t}{2} - \sum_{t=1}^T \frac{\lambda}{2} \Vert w_t - w^\ast \Vert^2_2 \\ & = \sum_{t=2}^T \Vert w_t - w^\ast \Vert_2^2 \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}-\frac{\lambda}{2}\right) + \Vert w_1 - w^\ast \Vert_2^2 \left(\frac{1}{2\eta_1} - \frac{\lambda}{2}\right) - \Vert w_{T+1} - w^\ast \Vert_2^2 \frac{1}{2\eta_T} + G^2\sum_{t=1}^T \frac{\eta_t}{2} \end{align*}\]
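The regrouping in the third line is summation by parts with \(a_t := \Vert w_t - w^\ast \Vert_2^2\):

\[\sum_{t=1}^T \frac{a_t - a_{t+1}}{2\eta_t} = \sum_{t=2}^T a_t \left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right) + \frac{a_1}{2\eta_1} - \frac{a_{T+1}}{2\eta_T},\]

and the last line folds \(-\frac{\lambda}{2}\sum_{t=1}^T a_t\) into these coefficients, splitting off the \(t=1\) term.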

Note that \(\eta_t = 1/(\lambda t)\). Thus, for \(t \geq 2\), \(\frac{1}{2\eta_t}-\frac{1}{2\eta_{t-1}}-\frac{\lambda}{2}=\frac{\lambda t - \lambda (t-1) -\lambda}{2}=0\), and likewise \(\frac{1}{2\eta_1}-\frac{\lambda}{2}=0\). Dropping the remaining non-positive term \(-\frac{1}{2\eta_T}\Vert w_{T+1} - w^\ast \Vert_2^2\), we get

\[\begin{align*} Reg_T^{\text{OGD}} & = \sum_{t=1}^T (f_t(w_t) - f_t(w^\ast)) \\ & \leq G^2\sum_{t=1}^T\frac{\eta_t}{2}=\frac{G^2}{2\lambda}\sum_{t=1}^T\frac{1}{t}\leq \frac{G^2}{2\lambda}(1+\log T)\leq \frac{G^2}{\lambda}(1+\log T), \end{align*}\]

where we used the harmonic-sum bound \(\sum_{t=1}^T \frac{1}{t} \leq 1 + \log T\). This proves the claim (in fact with the sharper constant \(\frac{G^2}{2\lambda}\)).
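The bound is easy to sanity-check numerically. Below is a minimal sketch (not from the paper): OGD with \(\eta_t = 1/(\lambda t)\) on a stream of \(\lambda\)-strongly convex losses \(f_t(w) = \frac{\lambda}{2}\Vert w \Vert_2^2 + g_t^T w\) over the unit ball; the loss family, dimension, and horizon are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 5, 10_000, 1.0

gs = rng.uniform(-1.0, 1.0, size=(T, d))      # linear parts of the losses

def loss(w, g):                                # f_t(w) = lam/2 ||w||^2 + g.w
    return 0.5 * lam * w @ w + g @ w

def grad(w, g):
    return lam * w + g

def project(w):                                # Euclidean projection onto unit ball
    n = np.linalg.norm(w)
    return w if n <= 1.0 else w / n

w, iterates = np.zeros(d), []
for t in range(1, T + 1):                      # OGD with eta_t = 1/(lam * t)
    iterates.append(w)
    w = project(w - grad(w, gs[t - 1]) / (lam * t))

# Best fixed decision in hindsight: projection of the unconstrained minimizer.
w_star = project(-gs.sum(axis=0) / (lam * T))

regret = sum(loss(wt, g) - loss(w_star, g) for wt, g in zip(iterates, gs))
G = max(np.linalg.norm(grad(wt, g)) for wt, g in zip(iterates, gs))
print(f"regret = {regret:.2f}, bound = {G**2 / lam * (1 + np.log(T)):.2f}")
```

The printed regret should stay below the printed bound, exactly as the theorem guarantees.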

The algorithm has a step-size schedule \(\eta_t\) that depends on the strong-convexity parameter \(\lambda\), which must be known in advance. We propose a parameter-free variant that is also data-dependent, and hence can converge faster. Our algorithm is inspired by AdaGrad and MALER, and we show that it also achieves \(Reg_T^A = O(\log T)\).
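The parameter-free algorithm itself is in the paper; as a rough illustration of what a data-dependent step size looks like (this is plain scalar AdaGrad, not our method), the learning rate can be driven by the observed gradient norms instead of a known \(\lambda\):

```python
import numpy as np

def adagrad_ogd(grad_fn, project, d, T, D=1.0, eps=1e-12):
    """Scalar AdaGrad: eta_t = D / sqrt(eps + sum_{s<=t} ||g_s||^2).

    grad_fn(w, t) returns the gradient of f_t at w; D is an assumed bound
    on the diameter of the feasible set. Illustrative only, not the
    parameter-free algorithm from the paper.
    """
    w = np.zeros(d)
    g2_sum = 0.0
    for t in range(1, T + 1):
        g = grad_fn(w, t)
        g2_sum += float(g @ g)
        w = project(w - (D / np.sqrt(eps + g2_sum)) * g)
        yield w
```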

Full paper on arXiv.