Theorem: Suppose all loss functions $f_t$ are $\lambda$-strongly convex ($\lambda$-SC) and their gradients are bounded, $\|\nabla f_t(w)\|_2 \le G$. Then OGD with step size $\eta_t = \frac{1}{\lambda t}$ achieves $\mathrm{Reg}_T^{\mathrm{OGD}} \le \frac{G^2}{2\lambda}(1 + \log T)$.
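For concreteness, here is a minimal sketch of the OGD update the theorem refers to, with the strongly convex step size $\eta_t = 1/(\lambda t)$. The function name, the `grad` callback, and the Euclidean-ball feasible set (with its `radius`) are illustrative assumptions, not part of the theorem.

```python
import numpy as np

def ogd_strongly_convex(grad, w0, lam, T, radius=1.0):
    """Online gradient descent with eta_t = 1 / (lam * t).

    grad(t, w) should return the gradient of the round-t loss at w.
    Iterates are projected onto a Euclidean ball of the given radius
    (an illustrative choice of feasible set).
    """
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    for t in range(1, T + 1):
        eta_t = 1.0 / (lam * t)          # step size from the theorem
        w = w - eta_t * grad(t, w)       # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                # projection onto the ball
            w = w * (radius / norm)
        iterates.append(w.copy())
    return iterates
```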

Proof:

Step 1: For every $t \in [T]$, find an upper bound for $f_t(w_t) - f_t(w^*)$, where $w^*$ is the best fixed decision in hindsight.

$$f_t(w_t) - f_t(w^*) \le \nabla f_t(w_t)^\top (w_t - w^*) - \frac{\lambda}{2}\|w_t - w^*\|_2^2 \quad (\lambda\text{-SC})$$

$$\le \frac{\|w_t - w^*\|_2^2 - \|w_{t+1} - w^*\|_2^2}{2\eta_t} + \frac{\eta_t}{2}G^2 - \frac{\lambda}{2}\|w_t - w^*\|_2^2,$$

where the second inequality expands the OGD update $w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta_t \nabla f_t(w_t)\big)$, uses non-expansiveness of the projection onto the convex feasible set $\mathcal{W}$, and bounds $\|\nabla f_t(w_t)\|_2 \le G$.

Step 2: Find an upper bound for $\mathrm{Reg}_T^{\mathrm{OGD}}$.

$$\mathrm{Reg}_T^{\mathrm{OGD}} = \sum_{t=1}^T \big(f_t(w_t) - f_t(w^*)\big) \le \sum_{t=1}^T \left( \frac{\|w_t - w^*\|_2^2 - \|w_{t+1} - w^*\|_2^2}{2\eta_t} + \frac{\eta_t}{2}G^2 - \frac{\lambda}{2}\|w_t - w^*\|_2^2 \right)$$

$$= \sum_{t=2}^T \|w_t - w^*\|_2^2 \left( \frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} \right) + \frac{\|w_1 - w^*\|_2^2}{2\eta_1} - \frac{\|w_{T+1} - w^*\|_2^2}{2\eta_T} + G^2 \sum_{t=1}^T \frac{\eta_t}{2} - \sum_{t=1}^T \frac{\lambda}{2}\|w_t - w^*\|_2^2$$

$$= \sum_{t=2}^T \|w_t - w^*\|_2^2 \left( \frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} - \frac{\lambda}{2} \right) + \frac{\|w_1 - w^*\|_2^2}{2\eta_1} - \frac{\|w_{T+1} - w^*\|_2^2}{2\eta_T} + G^2 \sum_{t=1}^T \frac{\eta_t}{2} - \frac{\lambda}{2}\|w_1 - w^*\|_2^2$$

Note that $\eta_t = 1/(\lambda t)$. Thus, $\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} - \frac{\lambda}{2} = \frac{\lambda t - \lambda(t-1) - \lambda}{2} = 0$. Moreover, $\frac{1}{2\eta_1} - \frac{\lambda}{2} = 0$, and the term $-\frac{\|w_{T+1} - w^*\|_2^2}{2\eta_T}$ is non-positive and can be dropped. Thus, we get

$$\mathrm{Reg}_T^{\mathrm{OGD}} = \sum_{t=1}^T \big(f_t(w_t) - f_t(w^*)\big) \le G^2 \sum_{t=1}^T \frac{\eta_t}{2} = \frac{G^2}{2\lambda} \sum_{t=1}^T \frac{1}{t} \le \frac{G^2}{2\lambda}(\log T + 1).$$
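As a sanity check, here is a small simulation comparing the cumulative regret of OGD with $\eta_t = 1/(\lambda t)$ against the $\frac{G^2}{2\lambda}(1 + \log T)$ bound. The quadratic losses $f_t(w) = \frac{\lambda}{2}\|w - z_t\|_2^2$, the ball-shaped feasible set, and the specific constants are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, radius, T, d = 1.0, 1.0, 1000, 5

# Round-t loss: f_t(w) = (lam/2) * ||w - z_t||^2, which is lam-strongly convex.
Z = rng.uniform(-radius, radius, size=(T, d)) / np.sqrt(d)

def project(w, r=radius):
    n = np.linalg.norm(w)
    return w if n <= r else w * (r / n)

# Run OGD with eta_t = 1/(lam * t) and record per-round losses.
w = np.zeros(d)
losses, grad_norms = [], []
for t in range(1, T + 1):
    z = Z[t - 1]
    losses.append(0.5 * lam * np.dot(w - z, w - z))
    g = lam * (w - z)                      # gradient of the round-t loss
    grad_norms.append(np.linalg.norm(g))
    w = project(w - g / (lam * t))         # OGD step with eta_t = 1/(lam*t)

# Best fixed comparator in hindsight: the projected mean of the z_t.
w_star = project(Z.mean(axis=0))
best_losses = 0.5 * lam * np.sum((w_star - Z) ** 2, axis=1)

regret = np.sum(losses) - np.sum(best_losses)
G = max(grad_norms)
bound = G ** 2 / (2 * lam) * (1 + np.log(T))
print(f"regret = {regret:.3f}, bound = {bound:.3f}")  # regret should stay below the bound
```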

The algorithm has a hyperparameter $\eta_t$ which in turn depends on $\lambda$. We propose a parameter-free version which is also data-dependent, and hence can achieve faster convergence. Our algorithm is inspired by AdaGrad and MALER. We show that this algorithm has $\mathrm{Reg}_T^A = O(\log T)$.
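The proposed algorithm itself is described in the paper; as background for what a data-dependent step size looks like, here is a generic diagonal-AdaGrad update. This is standard AdaGrad, not the algorithm proposed above, and its base learning rate is still a hyperparameter.

```python
import numpy as np

def adagrad_step(w, g, accum, base_lr=1.0, eps=1e-8):
    """One diagonal-AdaGrad update: per-coordinate step sizes shrink with
    the accumulated squared gradients, adapting to the observed data
    rather than requiring knowledge of the strong-convexity constant."""
    accum = accum + g ** 2
    w_new = w - base_lr * g / (np.sqrt(accum) + eps)
    return w_new, accum
```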

Full paper on arXiv.