Feature-aware regularization for sparse online learning



    SCIENCE CHINAInformation Sciences

    May 2014, Vol. 57 052104:1–052104:21

    doi: 10.1007/s11432-014-5082-z

    © Science China Press and Springer-Verlag Berlin Heidelberg 2014   info.scichina.com   link.springer.com

    Feature-aware regularization for sparse online learning

    OIWA Hidekazu1, MATSUSHIMA Shin1 & NAKAGAWA Hiroshi1,2

    1Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan;2Information Technology Center, The University of Tokyo, Tokyo 113-8654, Japan

    Received August 31, 2013; accepted November 24, 2013

    Abstract Learning a compact predictive model in an online setting has recently gained a great deal of at-

    tention. The combination of online learning with sparsity-inducing regularization enables faster learning with a

    smaller memory space than the previous learning frameworks. Many optimization methods and learning algo-

    rithms have been developed on the basis of online learning with L1-regularization. L1-regularization tends to

    truncate some types of parameters, such as those that rarely occur or have a small range of values, unless they

    are emphasized in advance. However, the inclusion of a pre-processing step would make it very difficult to pre-

    serve the advantages of online learning. We propose a new regularization framework for sparse online learning.

    We focus on regularization terms, and we enhance the state-of-the-art regularization approach by integrating

    information on all previous subgradients of the loss function into a regularization term. The resulting algorithms

    enable online learning to adjust the intensity of each feature's truncation without pre-processing and eventually

    eliminate the bias of L1-regularization. We analyze theoretical properties of our framework, namely its computational

    complexity and an upper bound on regret. Experiments demonstrated that our algorithms outperformed previous

    methods in many classification tasks.

    Keywords online learning, supervised learning, sparsity-inducing regularization, feature selection, sentiment analysis


    Citation Oiwa H, Matsushima S, Nakagawa H. Feature-aware regularization for sparse online learning. Sci

    China Inf Sci, 2014, 57: 052104(21), doi: 10.1007/s11432-014-5082-z

    1 Introduction

    Online learning is a learning framework where a prediction and parameter update take place in a sequential

    setting each time a learner receives one datum. Online learning is beneficial for learning from large-scale

    data in terms of used memory space and computational complexity. If training data is very large, many

    batch algorithms cannot derive a global optimal solution within a reasonable amount of time because

    the computational cost is very high. If all instances cannot be simultaneously loaded into the main

    memory, optimization in batch learning requires some sort of reformulation in order to arrive at an exact

    solution [1]. Many online learning algorithms update predictors on the basis of only one new instance.

    This means that online learning runs faster and uses a smaller memory space than batch learning does.

    Corresponding author (email: hidekazu.oiwa@gmail.com)


    Online learning algorithms are especially efficient on datasets where the dimension of instances or number

    of instances is very large. Many algorithms have been transformed into online ones for their ease of

    handling large-scale data.
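The predict-then-update cycle described above can be sketched as follows. This is an illustrative hinge-loss example of our own (the function name and step size `eta` are assumptions, not from this paper); it processes one instance at a time, keeping only the current instance and the weight vector in memory:

```python
import numpy as np

def online_learn(stream, dim, eta=0.1):
    """Minimal online learning loop: predict, then update, one instance at a time."""
    w = np.zeros(dim)
    mistakes = 0
    for x, y in stream:                  # labels y in {-1, +1}
        if y * np.dot(w, x) <= 0:        # prediction step: count a mistake
            mistakes += 1
        if y * np.dot(w, x) < 1:         # subgradient step on the hinge loss
            w = w + eta * y * x          # update uses only this one instance
    return w, mistakes
```

Because each update touches a single instance, the memory footprint is independent of the dataset size, which is the property the text highlights.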

    Regularization is a generalization technique to prevent over-fitting of previously received data. L1-

    regularization is a well-known method of deriving a compact predictive model. The functionality of

    L1-regularization is to eliminate parameters that are insignificant for prediction on the fly. A compact

    model is able to reduce the computational time and required memory space for making predictions because

    it deals with a smaller number of parameters. In addition, L1-regularization has a generalization effect

    for preventing over-fitting to the training data.
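The truncation effect of L1-regularization is commonly realized through the soft-thresholding (proximal) operator; a minimal sketch follows (the function name is our own):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: shrinks every entry toward 0
    and sets entries with magnitude below lam exactly to 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# The middle entry falls below the threshold and is eliminated entirely,
# which is how a compact model with fewer active parameters emerges.
out = soft_threshold(np.array([0.3, -0.05, 1.2]), 0.1)
```

Exact zeros, rather than merely small values, are what allow the resulting parameters to be dropped from memory.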

    Online learning with L1-regularization is currently being studied as a way of solving large-scale opti-

    mization problems. Some novel frameworks have been developed in this field. Most of these works are

    subgradient-based; that is, updates are performed by calculating subgradients of the loss functions.

    Subgradient-based online learning has high computational efficiency and high predictive performance.

    There are three major subgradient-based algorithms for sparse online learning:

    1) COMID (Composite Objective MIrror Descent) [2,3];

    2) RDA (Regularized Dual Averaging) [4];

    3) FTPRL (Follow-The-Proximally-Regularized-Leader) [5,6].

    However, these subgradient-based online algorithms do not consider information on features. The plain

    L1-regularization makes parameters closer to 0 without considering their range of values or frequency of

    occurrences. As a result, unemphasized features are easily truncated, even when such features are

    crucial for making a prediction. The occurrence frequency and value range are not usually uniform in

    tasks such as natural language processing and pattern recognition. Some techniques to emphasize them

    have been developed to retain such features in the predictive model1). However, these methods must

    load all data and sum up the occurrence counts of each feature before learning starts, and hence are

    pre-processing steps. As the name implies, pre-processing violates the essence of online learning, in which data are processed sequentially and there is no assumption that feature occurrence counts can be summed over all the data. Previous work, however, has not dealt with these challenges in any detail.
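To illustrate the bias discussed above, consider a hypothetical toy stream (our own construction, not an experiment from this paper) in which feature 0 fires on every instance while feature 1 fires only on every tenth instance, even though both are perfectly predictive; under a uniform truncation threshold the rare feature's weight never survives:

```python
import numpy as np

w = np.zeros(2)
eta, lam = 0.5, 0.05
for t in range(1, 201):
    # feature 0 fires every round, feature 1 only every tenth round
    x = np.array([1.0, 1.0 if t % 10 == 0 else 0.0])
    y = 1.0
    if y * w.dot(x) < 1:                 # hinge-loss subgradient step
        w = w + eta * y * x
    # uniform L1 truncation (soft-thresholding), applied every round
    w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
# w[0] stabilizes near 1, while w[1] is repeatedly shaved back to zero
```

The frequent feature accumulates enough weight to withstand the threshold, whereas the rare feature is truncated between its occurrences, which is exactly the bias that a frequency-aware threshold would avoid.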

    We propose a new sparsity-inducing regularization framework for subgradient-based online learning.

    Our framework enables us to eliminate the bias for retaining informative features in an online setting

    without any pre-processing. The key idea behind our framework is to integrate the absolute values of

    the subgradient of loss functions into the regularization term. We call this framework feature-aware reg-

    ularization for sparse online learning. Our framework can dynamically weaken truncation effects on rare

    features and thereby obtain a set of important features regardless of their occurrence frequency or value

    range. Our extension can be applied to many subgradient-based online learning algorithms, such as COMID, RDA,

    and FTPRL. We present three applications of our framework in this paper: FR-COMID (Feature-aware Regularized COMID), an extension of COMID; FRDA (Feature-aware Regularized Dual Averaging), an extension of RDA; and FTPFRL (Follow-The-Proximally-Feature-aware-Regularized-Leader), an extension of FTPRL. We also analyzed the theoretical aspects of our framework. As a result,

    we derived the same computational cost and the regret upper bound as those for the original algorithms.

    Finally, we evaluated our framework in several experiments. The results revealed that our framework

    improved state-of-the-art algorithms on most datasets in terms of prediction accuracy and sparsity.
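The idea of integrating accumulated subgradient information into the truncation can be sketched as follows. This is an illustrative simplification under our own naming and scaling assumptions, not the paper's exact update rules: the threshold for each coordinate is scaled by its average absolute subgradient so far, so rarely-firing features are truncated more gently:

```python
import numpy as np

def feature_aware_truncate(w, g_abs_sum, t, lam):
    """Per-feature soft-thresholding: coordinate i's threshold is scaled
    by its average absolute subgradient so far, so rarely-firing features
    (small cumulative |g_i|) keep most of their weight."""
    per_feature_lam = lam * g_abs_sum / t    # small threshold for rare features
    return np.sign(w) * np.maximum(np.abs(w) - per_feature_lam, 0.0)
```

For example, with weights [0.2, 0.2], cumulative absolute subgradients [10, 1] over t = 10 rounds, and lam = 0.1, the per-feature thresholds are 0.1 and 0.01, leaving weights 0.1 and 0.19: the rare feature is nearly untouched while the frequent one receives the full truncation.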

    This paper is organized as follows. Section 2 introduces the basic framework of subgradient-based

    online learning and describes the three major sparse online learning algorithms, COMID, RDA, and

    FTPRL. Section 3 introduces the previous work done on sparse online learning. In Section 4, we propose

    our new sparsity-inducing regularization framework, called feature-aware regularization. We show some

    common examples in which previous work causes problems in sparse online learning. As a solution to these problems, we propose a new framework for the three previous algorithms and derive a closed-form update formula for each. Section 5 discusses the upper bound of regret for our algorithms,

    and Section 6 describes the results of several experiments. Section 7 concludes the discussion of our work.


    1) For example, TF-IDF [7] in tasks on natural language processing.


    Table 1 Notation


    |α|        Absolute value of a scalar α
    a          Vector
    a(i)       ith entry of vector a
    A          Matrix
    I          Identity matrix
    A(i,j)     (i,j)th entry of matrix A
    ‖a‖p       Lp norm of vector a
    ⟨a, b⟩     Inner product of vectors a and b
    Rn         n-dimensional Euclidean space
    dom f      Domain of function f
    sgn(·)     Sign of a real number
    argmin f   Unique point for minimizing function f
    Argmin f   Set of minimizing points of function f
    ∂f(a)      Subdifferential of function f at a
    ∇f(a)      Gradient of function f (differentiable)
    B(·, ·)    Bregman divergence

    2 Subgradient-based online learning

    Before explaining the three major sparse online learning algorithms, we will describe the notation used

    in this paper and the problem setting for subgradient-based online learning.

    2.1 Notation

    Our notation to formally describe the problem setting is summarized in Table 1. Scalars are denoted

    in lowercase italic letters, e.g., α, and the absolute value of a scalar α is |α|. Vectors are in lowercase bold letters, such as a. Matrices are in uppercase bold letters, e.g., A. I is the identity matrix. ‖a‖p represents the Lp norm of a vector a, and ⟨a, b⟩ denotes the inner product of two vectors a and b. Rn is an n-dimensional Euclidean space and R+ is a one-dimensional non-negative Euclidean space. Let dom f be

    the domain of function f. A function sgn is defined as sgn(α) = α/|α| where α is a real number (if α = 0, sgn(α) = 0). Let argmin f be a un