
RESEARCH PAPER

SCIENCE CHINA Information Sciences

May 2014, Vol. 57 052104:1–052104:21

doi: 10.1007/s11432-014-5082-z

© Science China Press and Springer-Verlag Berlin Heidelberg 2014    info.scichina.com    link.springer.com

Feature-aware regularization for sparse online learning

OIWA Hidekazu1, MATSUSHIMA Shin1 & NAKAGAWA Hiroshi1,2

1Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan;
2Information Technology Center, The University of Tokyo, Tokyo 113-8654, Japan

Received August 31, 2013; accepted November 24, 2013

Abstract Learning a compact predictive model in an online setting has recently gained a great deal of attention. The combination of online learning with sparsity-inducing regularization enables faster learning with a smaller memory space than previous learning frameworks. Many optimization methods and learning algorithms have been developed on the basis of online learning with L1-regularization. L1-regularization tends to truncate some types of parameters, such as those that rarely occur or have a small range of values, unless they are emphasized in advance. However, the inclusion of a pre-processing step would make it very difficult to preserve the advantages of online learning. We propose a new regularization framework for sparse online learning. We focus on regularization terms, and we enhance the state-of-the-art regularization approach by integrating information on all previous subgradients of the loss function into the regularization term. The resulting algorithms enable online learning to adjust the intensity of each feature's truncation without pre-processing and eventually eliminate the bias of L1-regularization. We show theoretical properties of our framework, namely its computational complexity and its upper bound of regret. Experiments demonstrated that our algorithms outperformed previous methods in many classification tasks.

Keywords online learning, supervised learning, sparsity-inducing regularization, feature selection, sentiment analysis

Citation Oiwa H, Matsushima S, Nakagawa H. Feature-aware regularization for sparse online learning. Sci China Inf Sci, 2014, 57: 052104(21), doi: 10.1007/s11432-014-5082-z

1 Introduction

Online learning is a learning framework in which prediction and parameter update take place sequentially, each time the learner receives one datum. Online learning is beneficial for learning from large-scale data in terms of memory usage and computational complexity. If the training data are very large, many batch algorithms cannot derive a globally optimal solution within a reasonable amount of time because the computational cost is very high. If all instances cannot be loaded into main memory simultaneously, optimization in batch learning requires some sort of reformulation in order to arrive at an exact solution [1]. Many online learning algorithms update predictors on the basis of only one new instance. This means that online learning runs faster and uses less memory than batch learning does.

Corresponding author (email: hidekazu.oiwa@gmail.com)


Online learning algorithms are especially efficient on datasets where the dimension of instances or the number of instances is very large. Many algorithms have been transformed into online ones for their ease of handling large-scale data.

Regularization is a generalization technique that prevents over-fitting to previously received data. L1-regularization is a well-known method of deriving a compact predictive model. The function of L1-regularization is to eliminate, on the fly, parameters that are insignificant for prediction. A compact model reduces the computational time and memory required for making predictions because it deals with a smaller number of parameters. In addition, L1-regularization has a generalization effect that prevents over-fitting to the training data.

Online learning with L1-regularization is currently being studied as a way of solving large-scale optimization problems. Several novel frameworks have been developed in this field. Most of these works are subgradient-based; that is, the update is performed by calculating subgradients of the loss functions. Subgradient-based online learning has high computational efficiency and high predictive performance. There are three major subgradient-based algorithms for sparse online learning:

1) COMID (Composite Objective MIrror Descent) [2,3];

2) RDA (Regularized Dual Averaging) [4];

3) FTPRL (Follow-The-Proximally-Regularized-Leader) [5,6].

However, these subgradient-based online algorithms do not consider information on features. Plain L1-regularization pushes parameters toward 0 without considering their range of values or frequency of occurrence. As a result, unemphasized features are easily truncated, even when such features are crucial for making a prediction. The occurrence frequency and value range are not usually uniform in tasks such as natural language processing and pattern recognition. Some techniques have been developed to emphasize such features so that they are retained in the predictive model1). However, these methods must load all the data and sum up the occurrence counts of each feature before learning starts, and hence they are pre-processing steps. As the name implies, pre-processing conflicts with the essence of online learning, in which data are processed sequentially and there is no assumption that feature occurrence counts can be summed over all the data. Previous work, however, has not dealt with these challenges in any detail.

We propose a new sparsity-inducing regularization framework for subgradient-based online learning. Our framework enables us to eliminate this bias and retain informative features in an online setting without any pre-processing. The key idea behind our framework is to integrate the absolute values of the subgradients of the loss functions into the regularization term. We call this framework feature-aware regularization for sparse online learning. Our framework can dynamically weaken truncation effects on rare features and thereby obtain a set of important features regardless of their occurrence frequency or value range. Our extension can be applied to many subgradient-based online learning algorithms, such as COMID, RDA, and FTPRL. We show three applications of our framework in this paper: FR-COMID (Feature-aware Regularized COMID), which is an extension of COMID; FRDA (Feature-aware Regularized Dual Averaging), which is an extension of RDA; and FTPFRL (Follow-The-Proximally-Feature-aware-Regularized-Leader), which is an extension of FTPRL. We also analyzed the theoretical aspects of our framework. As a result, we derived the same computational cost and regret upper bound as those of the original algorithms. Finally, we evaluated our framework in several experiments. The results revealed that our framework improved on state-of-the-art algorithms on most datasets in terms of prediction accuracy and sparsity.

This paper is organized as follows. Section 2 introduces the basic framework of subgradient-based online learning and describes the three major sparse online learning algorithms, COMID, RDA, and FTPRL. Section 3 reviews previous work on sparse online learning. In Section 4, we propose our new sparsity-inducing regularization framework, called feature-aware regularization. We show some common examples in which the previous approaches cause problems in sparse online learning. As a solution to these problems, we apply our new framework to the three previous algorithms and, for each algorithm, derive a closed-form update formula. Section 5 discusses the upper bound of regret for our algorithms, and Section 6 describes the results of several experiments. Section 7 concludes the discussion of our framework.

1) For example, TF-IDF [7] in natural language processing tasks.


Table 1 Notation

α            Scalar
|α|          Absolute value
a            Vector
a(i)         ith entry of vector a
A            Matrix
I            Identity matrix
A(i,j)       (i,j)th entry of matrix A
‖a‖_p        Lp norm
⟨a, b⟩       Inner product
R^n          n-dimensional Euclidean space
dom f        Domain of function f
sgn(·)       Sign of a real number
argmin f     Unique point minimizing function f
Argmin f     Set of minimizing points of function f
∂f(a)        Subdifferential of function f at a
∇f(a)        Gradient of function f (differentiable)
B_Φ(·, ·)    Bregman divergence

2 Subgradient-based online learning

Before explaining the three major sparse online learning algorithms, we describe the notation used in this paper and the problem setting for subgradient-based online learning.

2.1 Notation

Our notation for formally describing the problem setting is summarized in Table 1. Scalars are denoted by lowercase italic letters, e.g., α, and the absolute value of a scalar α is |α|. Vectors are in lowercase bold letters, such as a. Matrices are in uppercase bold letters, e.g., A. I is the identity matrix. ‖a‖_p represents the Lp norm of a vector a, and ⟨a, b⟩ denotes the inner product of two vectors a, b. R^n is the n-dimensional Euclidean space and R_+ is the one-dimensional non-negative Euclidean space. Let dom f be the domain of function f. The function sgn is defined as sgn(α) = α/|α| for a real number α (if α = 0, sgn(α) = 0). Let argmin f be the unique point at which function f attains its minimum, and let Argmin f be the set of minimizing points of function f. ∂f(a) is the subdifferential of function f at a, i.e., the set of all vectors g ∈ R^n satisfying ∀b, f(b) ≥ f(a) + ⟨g, b − a⟩. We call an element of the subdifferential of f at a a subgradient of f at a; that is, any g such that g ∈ ∂f(a) is a subgradient. Even if f is non-differentiable, at least one subgradient exists when f is convex. A function f is convex when f satisfies the following inequality: ∀t ∈ [0, 1], ∀a, b, f(ta + (1 − t)b) ≤ tf(a) + (1 − t)f(b). When a function f is differentiable, we denote the gradient of f at a by ∇f(a).
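As a concrete illustration of these definitions, the sign function and a subgradient of the absolute value can be sketched as follows (a minimal Python sketch of our own; the function names are illustrative, not from the paper):

```python
def sgn(alpha):
    """Sign of a real number: alpha / |alpha|, with sgn(0) = 0."""
    return 0.0 if alpha == 0 else alpha / abs(alpha)

def abs_subgradient(alpha):
    """A subgradient of f(x) = |x| at alpha.

    For alpha != 0 the subdifferential is the single point sgn(alpha);
    at alpha = 0 it is the whole interval [-1, 1], and returning 0 is
    one valid choice.
    """
    return sgn(alpha)

# The subgradient inequality f(b) >= f(a) + g * (b - a) holds for every b,
# even at the non-differentiable point a = 0:
a, g = 0.0, abs_subgradient(0.0)
assert all(abs(b) >= abs(a) + g * (b - a) for b in (-2.0, -0.5, 0.0, 1.5))
```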

The Bregman divergence can be described as

B_Φ(a, b) = Φ(a) − Φ(b) − ⟨∇Φ(b), a − b⟩, (1)

where the Bregman function Φ(·) is a continuously differentiable function. The Bregman divergence is a generalized distance function between two vectors. It satisfies

B_Φ(a, b) ≥ (σ/2) ‖a − b‖₂² (2)

for some scalar σ. For example, the squared Euclidean distance B_Φ(a, b) = (1/2) ‖a − b‖₂² is a famous example satisfying the properties of the Bregman divergence.
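For intuition, the squared-Euclidean special case of (1) can be checked numerically (a self-contained sketch assuming NumPy; the choice Φ(a) = (1/2)‖a‖₂² and the helper names are ours):

```python
import numpy as np

def bregman(phi, grad_phi, a, b):
    """Bregman divergence B_Phi(a, b) = Phi(a) - Phi(b) - <grad Phi(b), a - b>."""
    return phi(a) - phi(b) - np.dot(grad_phi(b), a - b)

# Phi(a) = 1/2 ||a||_2^2  =>  B_Phi(a, b) = 1/2 ||a - b||_2^2
phi = lambda a: 0.5 * np.dot(a, a)
grad_phi = lambda a: a          # gradient of Phi is the identity map

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])
d = bregman(phi, grad_phi, a, b)
# d equals the squared Euclidean distance 1/2 * ||a - b||_2^2
assert np.isclose(d, 0.5 * np.dot(a - b, a - b))
```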

Oiwa H, et al. Sci China Inf Sci May 2014 Vol. 57 052104:4

2.2 Subgradient-based online learning

Now let us review the generic problem setting for subgradient-based online learning. The algorithms perform sequential prediction and updating according to the following scheme:

1. Initialization: start at round t = 1. Set a weight vector w_1 = 0. The weight vector space is restricted to a closed convex set W ⊆ R^d.

2. Input: at round t, receive one input datum. An input datum is described as a d-dimensional feature vector x_t taken from a closed convex set X ⊆ R^d.

3. Prediction: make a prediction through the inner product of the feature vector x_t and the weight vector w_t. The predicted value is denoted by ŷ_t = ⟨w_t, x_t⟩.

4. Output: observe the true output y_t and incur a loss through a loss function ℓ_t(·).

5. Update: update w_t to a new weight vector w_{t+1} using a subgradient of the current loss function at the current weight vector.

6. Increase the round number t and repeat steps 2 through 5 until no input data remain.
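The six steps above can be sketched as a generic loop (a minimal Python sketch under our own naming; the hinge loss and the toy data stream are illustrative choices, not the paper's):

```python
import numpy as np

def online_learn(stream, d, eta=0.1):
    """Generic subgradient-based online learning loop (steps 1-6).

    `stream` yields (x_t, y_t) pairs with labels y_t in {-1, +1};
    the hinge loss serves as an illustrative convex loss function.
    """
    w = np.zeros(d)                              # step 1: w_1 = 0
    for x, y in stream:                          # step 2: receive one datum
        y_hat = np.dot(w, x)                     # step 3: predict <w_t, x_t>
        loss = max(0.0, 1.0 - y * y_hat)         # step 4: incur hinge loss
        g = -y * x if loss > 0 else np.zeros(d)  # a subgradient of the loss
        w = w - eta * g                          # step 5: subgradient update
    return w                                     # step 6 is the loop itself

data = [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)]
w = online_learn(data, d=2)
```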

The ideal algorithm is the one that minimizes the sum of the loss functions accumulated over the rounds, for any sequence of loss functions. This is equivalent to obtaining, at each round, the weight vector that is optimal for minimizing the next loss function.

A loss function is a function that represents the penalty for a predictive error. Thus, the value of the loss function generally depends on the extent of the discrepancy between the predicted value and the true output. ℓ_t is a convex loss function of the form ℓ_t(·): W → R_+, where t is the round number. We assume that each loss function can be written as ℓ_t(w_t) = ℓ̂_t(⟨w_t, x_t⟩) = ℓ̂_t(ŷ_t), where ℓ̂_t is non-decreasing as the distance between ŷ_t and y_t increases. This assumption makes inference based on the inner product reasonable.

We would like to minimize the sum of the loss functions. However, it is known that no algorithm can upper-bound this cumulative loss by a certain value for every sequence [8]. Therefore, an algorithm's performance is measured not by an absolute evaluation criterion but by a relative one, comparing it against the optimal weight vector. This is the fundamental notion behind the regret bound used to evaluate the algorithms.

R(T) = ∑_{t=1}^{T} ℓ_t(w_t) − inf_{w∈W} ∑_{t=1}^{T} ℓ_t(w). (3)

The first term is the total cost of the loss functions over all rounds while the algorithm runs. The second term is the minimal cumulative cost when we pick the optimal weight vector w* in hindsight. Through the lens of regret minimization, the learner's goal is to achieve as low a regret as possible. If an algorithm's regret is bounded by o(T), that is, it is sublinear in T, then the regret converges to 0 on average. This means the loss per datum will converge to the same value as in the case of the best fixed strategy, no matter which datum or loss function is received in each round.
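To make (3) concrete, regret can be computed directly for a toy sequence of quadratic losses (a self-contained sketch of our own; the losses, the played iterates, and the grid approximating the infimum over W are all illustrative):

```python
import numpy as np

def regret(losses, weights, candidates):
    """R(T) = sum_t l_t(w_t) - min over candidate w of sum_t l_t(w).

    `losses` are callables l_t, `weights` the iterates w_t the learner
    actually played, and `candidates` a grid approximating inf over W.
    """
    played = sum(l(w) for l, w in zip(losses, weights))
    best = min(sum(l(w) for l in losses) for w in candidates)
    return played - best

# Toy 1-d example: l_t(w) = (w - c_t)^2 with shifting targets c_t.
targets = [1.0, 2.0, 3.0]
losses = [lambda w, c=c: (w - c) ** 2 for c in targets]
weights = [0.0, 1.0, 2.0]             # iterates some algorithm produced
grid = np.linspace(0.0, 4.0, 401)     # fine grid over W = [0, 4]
R = regret(losses, weights, grid)     # played cost 3, best fixed cost 2
```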

The subgradient method (SG) [9] is one of the simplest online learning algorithms. It is widely used because of its simplicity and extensive theoretical background. In this method, the weight vector is updated according to the formula

w_{t+1} = w_t − η_t g_t, s.t. g_t ∈ ∂ℓ_t(w_t), (4)

where g_t is a subgradient of ℓ_t with respect to w_t and η_t is the learning rate. The method sequentially updates the parameters in order to minimize ℓ_t. It has been proven that the regret bound of the SG is O(√T) when η_t = α/√t for any constant α, so the regret per datum vanishes as T → ∞ [10].
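Update (4) with the decaying rate η_t = α/√t can be sketched as follows (a minimal sketch; the quadratic loss and the constant α = 0.5 are our illustrative choices):

```python
import math
import numpy as np

def sg_step(w, g, t, alpha=1.0):
    """One SG step: w_{t+1} = w_t - eta_t * g_t with eta_t = alpha / sqrt(t)."""
    return w - (alpha / math.sqrt(t)) * g

# Minimize l(w) = 1/2 ||w - c||^2, whose gradient at w is w - c.
c = np.array([1.0, -1.0])
w = np.zeros(2)
for t in range(1, 1001):
    g = w - c                      # g_t, here the (unique) subgradient at w_t
    w = sg_step(w, g, t, alpha=0.5)

# After many rounds, w approaches the minimizer c.
```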

2.3 Sparsity-inducing regularization

Regularization is a well-known technique for obtaining desirable structures in accordance with task objectives. It is widely used to achieve objectives such as generating a predictive model with lower generalization error or a compact model. To apply it, we integrate a regularization term into the optimization problem for deriving a desirable predictive model.


By adding a convex regularization term to the optimization problem, our focus changes from a simple loss-minimization problem to the minimization of the sum of the loss functions and a regularization term. The summation in this case is defined as follows:

f(w) = ∑_t ℓ_t(w) + ψ(w), (5)

where ψ(·) is a regularization term of the form ψ(·): W → R_+, and ψ is convex on W. L2- and L1-regularization are the major regularization functions for integrating structure into a predictive model. L2-regularization takes the L2 norm of the weight vector, ψ(w) = λ‖w‖₂², and L1-regularization is defined by the L1 norm, ψ(w) = λ‖w‖₁, where λ is a regularization parameter. L1-regularization is a simple and well-known technique for inducing sparsity in weight vectors.

Regularization has several different properties depending o...