Feature-aware regularization for sparse online learning

  • Published on
    24-Jan-2017


  • . RESEARCH PAPER .

    SCIENCE CHINA Information Sciences

    May 2014, Vol. 57 052104:1–052104:21

    doi: 10.1007/s11432-014-5082-z

    © Science China Press and Springer-Verlag Berlin Heidelberg 2014 info.scichina.com link.springer.com

    Feature-aware regularization for sparse online learning

    OIWA Hidekazu1, MATSUSHIMA Shin1 & NAKAGAWA Hiroshi1,2

    1 Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan; 2 Information Technology Center, The University of Tokyo, Tokyo 113-8654, Japan

    Received August 31, 2013; accepted November 24, 2013

    Abstract Learning a compact predictive model in an online setting has recently gained a great deal of attention. The combination of online learning with sparsity-inducing regularization enables faster learning with a smaller memory space than previous learning frameworks. Many optimization methods and learning algorithms have been developed on the basis of online learning with L1-regularization. L1-regularization tends to truncate some types of parameters, such as those that rarely occur or have a small range of values, unless they are emphasized in advance. However, the inclusion of a pre-processing step would make it very difficult to preserve the advantages of online learning. We propose a new regularization framework for sparse online learning. We focus on regularization terms, and we enhance the state-of-the-art regularization approach by integrating information on all previous subgradients of the loss function into a regularization term. The resulting algorithms enable online learning to adjust the intensity of each feature's truncations without pre-processing and eventually eliminate the bias of L1-regularization. We show theoretical properties of our framework, namely the computational complexity and an upper bound on the regret. Experiments demonstrated that our algorithms outperformed previous methods in many classification tasks.

    Keywords online learning, supervised learning, sparsity-inducing regularization, feature selection, sentiment analysis

    Citation Oiwa H, Matsushima S, Nakagawa H. Feature-aware regularization for sparse online learning. Sci China Inf Sci, 2014, 57: 052104(21), doi: 10.1007/s11432-014-5082-z

    1 Introduction

    Online learning is a learning framework in which prediction and parameter updates take place sequentially, each time the learner receives one datum. Online learning is beneficial for learning from large-scale data in terms of memory usage and computational complexity. If the training data is very large, many batch algorithms cannot derive a globally optimal solution within a reasonable amount of time because the computational cost is very high. If all instances cannot be loaded into main memory simultaneously, optimization in batch learning requires some sort of reformulation in order to arrive at an exact solution [1]. Many online learning algorithms update predictors on the basis of only one new instance. This means that online learning runs faster and uses a smaller memory space than batch learning does.

    Corresponding author (email: hidekazu.oiwa@gmail.com)

  • Oiwa H, et al. Sci China Inf Sci May 2014 Vol. 57 052104:2

    Online learning algorithms are especially efficient on datasets where the dimension of instances or the number of instances is very large. Many algorithms have been transformed into online ones to ease the handling of large-scale data.

    Regularization is a generalization technique to prevent over-fitting to previously received data. L1-regularization is a well-known method of deriving a compact predictive model. Its function is to eliminate, on the fly, parameters that are insignificant for prediction. A compact model reduces the computational time and memory space required for making predictions because it deals with a smaller number of parameters. In addition, L1-regularization has a generalization effect that prevents over-fitting to the training data.

    Online learning with L1-regularization is currently being studied as a way of solving large-scale optimization problems. Some novel frameworks have been developed in this field. Most of these works are subgradient-based; that is, the update is performed by calculating the subgradients of loss functions. Subgradient-based online learning has high computational efficiency and high predictive performance. There are three major subgradient-based algorithms for sparse online learning:

    1) COMID (Composite Objective MIrror Descent) [2,3];

    2) RDA (Regularized Dual Averaging) [4];

    3) FTPRL (Follow-The-Proximally-Regularized-Leader) [5,6].

    However, these subgradient-based online algorithms do not consider information on features. Plain L1-regularization pushes parameters toward 0 without considering their range of values or frequency of occurrence. As a result, most unemphatic features are easily truncated, even when such features are crucial for making a prediction. The occurrence frequency and value range are not usually uniform in tasks such as natural language processing and pattern recognition. Some techniques that emphasize such features have been developed to retain them in the predictive model 1). However, these methods must load all data and sum up the occurrence counts of each feature before learning starts, and hence are pre-processing steps. As the name implies, pre-processing undermines the essence of online learning, wherein the data are processed sequentially and it cannot be assumed that feature occurrence counts can be summed over all data. Previous work, however, has not dealt with these challenges in any detail.

    We propose a new sparsity-inducing regularization framework for subgradient-based online learning. Our framework enables us to eliminate this bias and retain informative features in an online setting without any pre-processing. The key idea behind our framework is to integrate the absolute values of the subgradients of loss functions into the regularization term. We call this framework feature-aware regularization for sparse online learning. Our framework can dynamically weaken truncation effects on rare features and thereby obtain a set of important features regardless of their occurrence frequency or value range. Our extension can be applied to many subgradient-based online learning algorithms, such as COMID, RDA, and FTPRL. We show three applications of our framework in this paper: FR-COMID (Feature-aware Regularized COMID), which is an extension of COMID; FRDA (Feature-aware Regularized Dual Averaging), which is an extension of RDA; and FTPFRL (Follow-The-Proximally-Feature-aware-Regularized-Leader), which is an extension of FTPRL. We also analyzed the theoretical aspects of our framework. As a result, we derived the same computational cost and regret upper bound as those of the original algorithms. Finally, we evaluated our framework in several experiments. The results revealed that our framework improved on state-of-the-art algorithms on most datasets in terms of prediction accuracy and sparsity.

    This paper is organized as follows. Section 2 introduces the basic framework of subgradient-based online learning and describes the three major sparse online learning algorithms, COMID, RDA, and FTPRL. Section 3 introduces previous work on sparse online learning. In Section 4, we propose our new sparsity-inducing regularization framework, called feature-aware regularization. We show some common examples in which the previous methods cause problems in sparse online learning. As a solution to these problems, we propose a new framework for the three previous algorithms. For each algorithm, we derive a closed-form update formula. Section 5 discusses the upper bound of regret for our algorithms, and Section 6 describes the results of several experiments. Section 7 concludes the discussion of our framework.

    1) For example, TF-IDF [7] in tasks on natural language processing.


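    Table 1 Notation

    Symbol                          Meaning
    $\lambda$                       Scalar
    $|\lambda|$                     Absolute value
    $\mathbf{a}$                    Vector
    $a^{(i)}$                       $i$th entry of vector $\mathbf{a}$
    $\mathbf{A}$                    Matrix
    $\mathbf{I}$                    Identity matrix
    $A^{(i,j)}$                     $(i,j)$th entry of matrix $\mathbf{A}$
    $\|\mathbf{a}\|_p$              $L_p$ norm
    $\langle \mathbf{a}, \mathbf{b} \rangle$   Inner product
    $\mathbb{R}^n$                  $n$-dimensional Euclidean space
    $\mathrm{dom} f$                Domain of function $f$
    $\mathrm{sgn}(\cdot)$           Sign of a real number
    $\mathrm{argmin} f$             Unique point minimizing function $f$
    $\mathrm{Argmin} f$             Set of minimizing points of function $f$
    $\partial f(\mathbf{a})$        Subdifferential of function $f$ at $\mathbf{a}$
    $\nabla f(\mathbf{a})$          Gradient of function $f$ (differentiable)
    $B_\psi(\cdot, \cdot)$          Bregman divergence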

    2 Subgradient-based online learning

    Before explaining the three major sparse online learning algorithms, we will describe the notation used

    in this paper and the problem setting for subgradient-based online learning.

    2.1 Notation

    Our notation for formally describing the problem setting is summarized in Table 1. Scalars are denoted by lowercase italic letters, e.g., $\lambda$, and the absolute value of a scalar is $|\lambda|$. Vectors are denoted by lowercase bold letters, such as $\mathbf{a}$. Matrices are denoted by uppercase bold letters, e.g., $\mathbf{A}$. $\mathbf{I}$ is the identity matrix. $\|\mathbf{a}\|_p$ represents the $L_p$ norm of a vector $\mathbf{a}$, and $\langle \mathbf{a}, \mathbf{b} \rangle$ denotes the inner product of two vectors $\mathbf{a}, \mathbf{b}$. $\mathbb{R}^n$ is an $n$-dimensional Euclidean space and $\mathbb{R}_+$ is a one-dimensional non-negative Euclidean space. Let $\mathrm{dom} f$ be the domain of function $f$. The function sgn is defined as $\mathrm{sgn}(\lambda) = \lambda/|\lambda|$, where $\lambda$ is a real number (if $\lambda = 0$, $\mathrm{sgn}(\lambda) = 0$). Let $\mathrm{argmin} f$ be the unique point at which function $f$ is a minimum. Let $\mathrm{Argmin} f$ be the set of minimizing points of a function $f$.

    $\partial f(\mathbf{a})$ is the subdifferential of function $f$ at $\mathbf{a}$, that is, the set of all vectors $\mathbf{g} \in \mathbb{R}^n$ satisfying $\forall \mathbf{b},\ f(\mathbf{b}) \ge f(\mathbf{a}) + \langle \mathbf{g}, \mathbf{b} - \mathbf{a} \rangle$. We call any element $\mathbf{g} \in \partial f(\mathbf{a})$ a subgradient of $f$ at $\mathbf{a}$. Even if $f$ is non-differentiable, at least one subgradient exists when $f$ is convex. A function $f$ is convex when it satisfies the following inequality: $\forall t \in [0, 1],\ \forall \mathbf{a}, \mathbf{b},\ f(t\mathbf{a} + (1 - t)\mathbf{b}) \le t f(\mathbf{a}) + (1 - t) f(\mathbf{b})$. When a function $f$ is differentiable, we denote the gradient of $f$ at $\mathbf{a}$ by $\nabla f(\mathbf{a})$.
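    The subgradient inequality above can be checked numerically on a simple non-differentiable convex function. The following is a minimal sketch, not taken from the paper: it assumes the L1 norm as the function, uses the elementwise sign vector as one valid subgradient, and tests the inequality on random points. The function names are hypothetical.

```python
import numpy as np

def l1_norm(a):
    """f(a) = sum_i |a_i|, convex but non-differentiable wherever a_i = 0."""
    return float(np.sum(np.abs(a)))

def l1_subgradient(a):
    """One subgradient of the L1 norm at a.

    np.sign returns 0 where a_i = 0, which is a valid choice because any
    value in [-1, 1] is admissible for such a coordinate.
    """
    return np.sign(a)

def satisfies_subgradient_inequality(f, g, a, points, tol=1e-12):
    """Check f(b) >= f(a) + <g, b - a> for every test point b."""
    return all(f(b) >= f(a) + float(np.dot(g, b - a)) - tol for b in points)

rng = np.random.default_rng(0)
a = np.array([1.5, 0.0, -2.0])
g = l1_subgradient(a)
points = rng.normal(size=(1000, 3))  # arbitrary test points b
subgradient_ok = satisfies_subgradient_inequality(l1_norm, g, a, points)
```

    A finite random check of this kind only illustrates the definition; it does not prove that a vector belongs to the subdifferential.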

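    The Bregman divergence can be described as

    $$B_\psi(\mathbf{a}, \mathbf{b}) = \psi(\mathbf{a}) - \psi(\mathbf{b}) - \langle \nabla\psi(\mathbf{b}), \mathbf{a} - \mathbf{b} \rangle, \tag{1}$$

    where the Bregman function $\psi(\cdot)$ is a continuously differentiable function. The Bregman divergence is a generalized distance function between two vectors. It satisfies

    $$B_\psi(\mathbf{a}, \mathbf{b}) \ge \frac{\alpha}{2} \|\mathbf{a} - \mathbf{b}\|_2^2 \tag{2}$$

    for some scalar $\alpha$. For example, the squared Euclidean distance $B_\psi(\mathbf{a}, \mathbf{b}) = \frac{1}{2}\|\mathbf{a} - \mathbf{b}\|_2^2$ is a well-known example satisfying the properties of the Bregman divergence.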
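    As a small illustration of the squared Euclidean example, the sketch below evaluates the Bregman divergence directly from its definition (the generating function at the first point, minus its value at the second, minus the inner product of its gradient at the second point with the difference). It assumes half the squared L2 norm as the generating function; the helper names are hypothetical.

```python
import numpy as np

def bregman_divergence(psi, grad_psi, a, b):
    """B(a, b) = psi(a) - psi(b) - <grad psi(b), a - b>."""
    return float(psi(a) - psi(b) - np.dot(grad_psi(b), a - b))

def half_squared_norm(v):
    """Generating function psi(v) = 0.5 * ||v||_2^2."""
    return 0.5 * float(np.dot(v, v))

def half_squared_norm_grad(v):
    """Gradient of 0.5 * ||v||_2^2 is v itself."""
    return v

a = np.array([1.0, -2.0, 0.5])
b = np.array([0.0, 1.0, 0.5])

divergence = bregman_divergence(half_squared_norm, half_squared_norm_grad, a, b)
euclidean = 0.5 * float(np.sum((a - b) ** 2))
# Both quantities equal 5.0 for these two vectors.
```

    With this choice of generating function, the lower bound on the divergence holds with equality when the scalar in the bound is 1.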


    2.2 Subgradient-based online learning

    Now let us review the generic problem setting for subgradient-based online learning. The algorithms perform sequential prediction and updating according to the following scheme:

    1. Initialization: start at round $t = 1$. Set a weight vector $\mathbf{w}_1 = \mathbf{0}$. The weight vector space is restricted to a closed convex set $W \subset \mathbb{R}^d$.

    2. Input: at round $t$, receive one input datum. An input datum is described as a $d$-dimensional feature vector $\mathbf{x}_t$ taken from a closed convex set $X \subset \mathbb{R}^d$.

    3. Prediction: make a prediction through the inner product of the feature vector $\mathbf{x}_t$ and the weight vector $\mathbf{w}_t$. The predicted value is denoted by $\hat{y}_t = \langle \mathbf{w}_t, \mathbf{x}_t \rangle$.

    4. Output: observe the true output $y_t$ and incur a loss through a loss function $\ell_t(\cdot)$.

    5. Update: update $\mathbf{w}_t$ to a new weight vector $\mathbf{w}_{t+1}$ using a subgradient of the current loss function at the current weight vector.

    6. Increase the round number $t$ and repeat steps 2 through 5 until no input data remain.

    The ideal algorithm is the one that minimizes the sum of loss functions cumulated in each round for

    any sequence of loss functions. It is equivalent to getting the optimal weight vector for minimizing the

    next loss function.

    A loss function represents the penalties for predictive errors. Thus, the value of the loss function generally depends on the extent of the discrepancy between the predicted value and the true output. $\ell_t$ is a convex loss function of the form $\ell_t(\cdot) : W \to \mathbb{R}_+$, where $t$ is the round index. We assume that each loss function can be written in terms of a function $\hat{\ell}_t$, where $\ell_t(\mathbf{w}_t) = \hat{\ell}_t(\langle \mathbf{w}_t, \mathbf{x}_t \rangle) = \hat{\ell}_t(\hat{y}_t)$ and $\hat{\ell}_t$ is non-decreasing as the distance between $\hat{y}_t$ and $y_t$ increases. This assumption makes inference based on the inner product reasonable.

    We would like to minimize the sum of loss functions. However, it is known that no algorithm can keep this sum below a fixed value for every sequence of loss functions [8]. Therefore, an algorithm's performance is measured not by an absolute evaluation criterion but by a relative one, in comparison with the optimal weight vector. This is the fundamental notion behind the regret, which is used to evaluate the algorithms:

    $$R(T) = \sum_{t=1}^{T} \ell_t(\mathbf{w}_t) - \inf_{\mathbf{w}} \sum_{t=1}^{T} \ell_t(\mathbf{w}). \tag{3}$$

    The first term is the total cost of the loss functions over all rounds while the algorithm runs. The second term is the minimal cumulative cost when we pick the optimal weight vector $\mathbf{w}$ in hindsight. Through the lens of regret minimization, the learner's goal is to achieve as low a regret as possible. If the algorithm's regret is bounded by $o(T)$, that is, if it is sublinear in $T$, the regret per round converges to 0. This means that the loss per datum converges to the same value as that of the best fixed strategy, no matter what datum or loss function is received in each round.
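    To make the two terms of the regret concrete, here is a minimal sketch on a toy problem that is not from the paper. It assumes a one-dimensional weight, an unrestricted weight space, and the squared loss between the weight and a target, so that the best fixed weight in hindsight is simply the mean of the targets. All names are hypothetical.

```python
import numpy as np

def cumulative_loss(ws, ys):
    """Sum over rounds of the toy loss l_t(w_t) = (w_t - y_t)^2."""
    return float(np.sum((np.asarray(ws) - np.asarray(ys)) ** 2))

def regret(ws, ys):
    """Cumulative loss of the played weights minus that of the best fixed weight.

    For the squared loss the infimum over fixed weights is attained at the
    mean of the targets, so the second term has a closed form here.
    """
    best_fixed = float(np.mean(ys))
    return cumulative_loss(ws, ys) - cumulative_loss(np.full(len(ys), best_fixed), ys)

ys = np.array([1.0, 0.0, 1.0, 1.0])
# Played weights: start at 0, then use the mean of the targets seen so far.
ws = np.array([0.0] + [float(np.mean(ys[:t])) for t in range(1, len(ys))])

toy_regret = regret(ws, ys)
average_regret = toy_regret / len(ys)
```

    Dividing the regret by the number of rounds gives the per-round quantity that a sublinear regret bound drives to zero.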

    The subgradient method (SG) [9] is one of the simplest online learning algorithms. It is widely used because of its simplicity and extensive theoretical background. In this method, a weight vector is updated according to the formula

    $$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{g}_t, \quad \text{s.t.} \quad \mathbf{g}_t \in \partial \ell_t(\mathbf{w}_t), \tag{4}$$

    where $\mathbf{g}_t$ is a subgradient of $\ell_t$ with respect to $\mathbf{w}_t$ and $\eta_t$ is the learning rate. The method sequentially updates parameters in order to minimize $\ell_t$. It has been proven that the regret bound of the SG is $O(\sqrt{T})$ when $\eta_t = c/\sqrt{t}$ for any constant $c$, and the regret value per datum vanishes as $T \to \infty$ [10].
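    The online protocol and the subgradient update can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: it assumes the hinge loss for binary labels in {-1, +1}, an unconstrained weight space (no projection step), and a step size that decays as a constant over the square root of the round number. The function names and the constant `c` are hypothetical.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of the hinge loss max(0, 1 - y * <w, x>) with respect to w."""
    if y * float(np.dot(w, x)) < 1.0:
        return -y * x
    return np.zeros_like(w)

def subgradient_method(X, Y, c=1.0):
    """Run one pass of the online subgradient method over the rows of X.

    Returns the final weight vector and the number of rounds in which the
    prediction <w_t, x_t> did not have the same sign as the true label.
    """
    w = np.zeros(X.shape[1])                       # initialization: w_1 = 0
    mistakes = 0
    for t, (x, y) in enumerate(zip(X, Y), start=1):
        y_hat = float(np.dot(w, x))                # prediction via inner product
        if y * y_hat <= 0.0:
            mistakes += 1                          # a zero score counts as a mistake
        g = hinge_subgradient(w, x, y)             # subgradient at the current weight
        w = w - (c / np.sqrt(t)) * g               # update with decaying step size
    return w, mistakes

X = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([1.0, -1.0])
w_final, n_mistakes = subgradient_method(X, Y)
# After two rounds the weights are [1, -1/sqrt(2)].
```

    Each round touches only the current instance, which is the source of the memory and speed advantages discussed in the introduction.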

    2.3 Sparsity-inducing regularization

    Regularization is a well-known technique to obtain desirable structures in accordance with task objectives.

    It is widely used to achieve objectives such as generating a predictive model with lower generalization

    error or a compact model. To apply it, we integrate a regularization term into an optimization problem

    for deriving a desirable predictive model.


    By adding a convex regularization term to the optimization problem, our focus changes from a simple loss minimization problem to the minimization of the sum of loss functions and a regularization term. The sum in this case is defined as follows:

    $$f(\mathbf{w}) = \sum_{t} \ell_t(\mathbf{w}) + \Phi(\mathbf{w}), \tag{5}$$

    where $\Phi(\cdot)$ is a regularization term of the form $\Phi(\cdot) : W \to \mathbb{R}_+$ that is convex in $W$.

    L2- and L1-regularization are the major regularization functions for integrating some structure into a predictive model. L2-regularization takes the squared L2 norm of a weight vector, $\Phi(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2$, and L1-regularization is defined by the L1 norm such that $\Phi(\mathbf{w}) = \lambda\|\mathbf{w}\|_1$, where $\lambda$ is a regularization parameter. L1-regularization is a simple and well-known technique for inducing sparsity in weight vectors.
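    One way to see why the L1 penalty yields sparse weights while the squared L2 penalty does not is to compare their proximal operators. The sketch below is not taken from the paper; it relies on two standard closed forms: minimizing half the squared distance to a vector plus the L1 penalty gives soft-thresholding, and doing the same with the squared L2 penalty gives a uniform rescaling. The names `prox_l1`, `prox_l2`, and `lam` are hypothetical.

```python
import numpy as np

def l1_penalty(w, lam):
    """L1 regularization term: lam * ||w||_1."""
    return lam * float(np.sum(np.abs(w)))

def l2_penalty(w, lam):
    """Squared L2 regularization term: lam * ||w||_2^2."""
    return lam * float(np.dot(w, w))

def prox_l1(v, lam):
    """argmin_w 0.5 * ||w - v||^2 + lam * ||w||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_l2(v, lam):
    """argmin_w 0.5 * ||w - v||^2 + lam * ||w||_2^2 (uniform shrinkage)."""
    return v / (1.0 + 2.0 * lam)

v = np.array([3.0, -0.2, 0.05, -1.5])
lam = 0.5

w_l1 = prox_l1(v, lam)   # entries with magnitude at most lam become exactly zero
w_l2 = prox_l2(v, lam)   # every entry is halved, none becomes zero

nonzeros_l1 = int(np.count_nonzero(w_l1))
nonzeros_l2 = int(np.count_nonzero(w_l2))
```

    In this example the L1 step keeps two of the four entries, whereas the L2 step keeps all four. The L1 threshold is the same for every coordinate, regardless of how often or how strongly the corresponding feature appears.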

    Regularization has several different properties depending o...
