Author :
Agarwal, Abhishek ; Negahban, Sahand N. ; Wainwright, Martin J.
Author_Institution :
Microsoft Res., New York, NY, USA
Abstract :
Summary form only given. Stochastic optimization algorithms have many desirable features for large-scale machine learning, and accordingly have been the focus of renewed and intensive study in the last several years (e.g., see the papers [2], [5], [14] and references therein). The empirical efficiency of these methods is backed by strong theoretical guarantees, providing sharp bounds on their convergence rates. These convergence rates are known to depend on the structure of the underlying objective function, with faster rates being possible for objective functions that are smooth and/or (strongly) convex, or optima that have desirable features such as sparsity. More precisely, for an objective function that is strongly convex, stochastic gradient descent enjoys a convergence rate ranging from O(1/T), when feature vectors are extremely sparse, to O(d/T) when feature vectors are dense [10], [6]. A complementary type of condition is that of sparsity, either exact or approximate, in the optimal solution. Sparse models have proven useful in many application areas (see the overview papers [4], [9], [3] and references therein for further background), and many optimization-based statistical procedures seek to exploit such sparsity via ℓ1-regularization. A significant feature of optimization algorithms for sparse problems is their mild logarithmic scaling with the problem dimension [11], [12], [5], [14]. More precisely, it is known [11], [12] that when the optimal solution θ* has at most s non-zero entries, appropriate versions of the stochastic mirror descent algorithm converge at a rate O(s√((log d)/T)). Srebro et al. [13] exploit the smoothness of many common loss functions; in application to sparse linear regression, their analysis yields improved rates of the form O(η√((s log d)/T)), where η is the noise variance. While the √(log d) scaling with the dimension makes these methods attractive in high dimensions, the scaling with respect to the number of iterations T is relatively slow, namely O(1/√T) versus O(1/T) for strongly convex problems. The algorithm proposed in this paper aims to exploit both strong convexity and sparsity. We show that the algorithm has convergence rate O((s log d)/T) for a strongly convex problem with an s-sparse optimum in d dimensions. Moreover, this rate is unimprovable up to constant factors, meaning that no algorithm can converge at a substantially faster rate. The method builds off recent work on multi-step methods for strongly convex problems [7], [8], but involves some new ingredients that are essential to obtain optimal rates for statistical problems with sparse optima. Numerical simulations confirm our theoretical predictions regarding the convergence rate of the algorithm, and demonstrate its performance in comparison to other methods: regularized dual averaging [14] and stochastic gradient descent algorithms. We refer the reader to the full report [1] for more details.
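As a point of reference, the following is a minimal sketch of ℓ1-regularized (proximal) stochastic gradient descent applied to sparse linear regression, i.e., the kind of baseline with an O(1/√T)-type rate that the abstract compares against. It is not the multi-step algorithm proposed in the paper; the function names (l1_prox_sgd, grad_fn), the step-size constant c, and the regularization weight lam are illustrative assumptions.

import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (coordinate-wise soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_prox_sgd(grad_fn, d, T, lam=0.05, c=1.0, seed=0):
    # T iterations of proximal SGD on a composite objective
    # E[loss(theta)] + lam * ||theta||_1; grad_fn(theta, rng) returns a
    # stochastic gradient of the smooth part. Returns the averaged iterate.
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    avg = np.zeros(d)
    for t in range(1, T + 1):
        eta = c / np.sqrt(t)                      # standard O(1/sqrt(t)) step size
        g = grad_fn(theta, rng)
        theta = soft_threshold(theta - eta * g, eta * lam)
        avg += (theta - avg) / t                  # running average of iterates
    return avg

# Usage: sparse linear regression in d = 500 dimensions with an s = 5 sparse optimum.
d, s, sigma = 500, 5, 0.5
theta_star = np.zeros(d)
theta_star[:s] = 1.0

def grad_fn(theta, rng):
    x = rng.standard_normal(d)                    # one fresh sample per iteration
    y = x @ theta_star + sigma * rng.standard_normal()
    return (x @ theta - y) * x                    # gradient of 0.5 * (x'theta - y)^2

theta_hat = l1_prox_sgd(grad_fn, d, T=20000)
print("estimation error:", np.linalg.norm(theta_hat - theta_star))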
Keywords :
gradient methods; learning (artificial intelligence); statistical analysis; stochastic programming; convergence rates; feature vectors; large-scale machine learning; objective function; optimal algorithm; optimization-based statistical procedures; regularized dual averaging algorithm; sparse statistical recovery; sparsity feature; stochastic gradient descent; stochastic optimization; strongly convex problem; Convergence; Educational institutions; Electronic mail; Machine learning algorithms; Optimization; Prediction algorithms; Stochastic processes;