RESEARCH PAPER
SCIENCE CHINA Information Sciences
May 2014, Vol. 57, 052104:1–052104:21
© Science China Press and Springer-Verlag Berlin Heidelberg 2014  info.scichina.com  link.springer.com
Feature-aware regularization for sparse online learning
OIWA Hidekazu1, MATSUSHIMA Shin1 & NAKAGAWA Hiroshi1,2
1Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan; 2Information Technology Center, The University of Tokyo, Tokyo 113-8654, Japan
Received August 31, 2013; accepted November 24, 2013
Abstract  Learning a compact predictive model in an online setting has recently gained a great deal of attention. The combination of online learning with sparsity-inducing regularization enables faster learning with a smaller memory space than previous learning frameworks. Many optimization methods and learning algorithms have been developed on the basis of online learning with L1-regularization. L1-regularization tends to truncate some types of parameters, such as those that rarely occur or have a small range of values, unless they are emphasized in advance. However, the inclusion of a pre-processing step would make it very difficult to preserve the advantages of online learning. We propose a new regularization framework for sparse online learning. We focus on regularization terms, and we enhance the state-of-the-art regularization approach by integrating information on all previous subgradients of the loss function into a regularization term. The resulting algorithms enable online learning to adjust the intensity of each feature's truncation without pre-processing and eventually eliminate the bias of L1-regularization. We show theoretical properties of our framework, namely its computational complexity and regret upper bound. Experiments demonstrated that our algorithms outperformed previous methods in many classification tasks.
Keywords online learning, supervised learning, sparsity-inducing regularization, feature selection, sentiment
Citation Oiwa H, Matsushima S, Nakagawa H. Feature-aware regularization for sparse online learning. Sci
China Inf Sci, 2014, 57: 052104(21), doi: 10.1007/s11432-014-5082-z
1 Introduction

Online learning is a learning framework in which a prediction and a parameter update take place in a sequential setting each time the learner receives one datum. Online learning is beneficial for learning from large-scale data, where batch algorithms cannot derive a globally optimal solution within a reasonable amount of time because the computational cost is very high. If all instances cannot be loaded into main memory simultaneously, optimization in batch learning requires some sort of reformulation in order to arrive at an exact solution. Many online learning algorithms update predictors on the basis of only one new instance. This means that online learning runs faster and uses a smaller memory space than batch learning does.
Corresponding author (email: firstname.lastname@example.org)
Online learning algorithms are especially efficient on datasets where the dimension of the instances or the number of instances is very large. Many algorithms have been recast as online ones because of their ease of handling large-scale data.
Regularization is a generalization technique to prevent over-fitting of previously received data. L1-
regularization is a well-known method of deriving a compact predictive model. The functionality of
L1-regularization is to eliminate parameters that are insignificant for prediction on the fly. A compact
model is able to reduce the computational time and required memory space for making predictions because
it deals with a smaller number of parameters. In addition, L1-regularization has a generalization effect
for preventing over-fitting to the training data.
Online learning with L1-regularization is currently being studied as a way of solving large-scale optimization problems. Some novel frameworks have been developed in this field. Most of these works are subgradient-based, meaning that the update is performed by calculating subgradients of the loss functions. Subgradient-based online learning has high computational efficiency and high predictive performance. There are three major subgradient-based algorithms for sparse online learning:
1) COMID (Composite Objective MIrror Descent) [2,3];
2) RDA (Regularized Dual Averaging);
3) FTPRL (Follow-The-Proximally-Regularized-Leader) [5,6].
However, these subgradient-based online algorithms do not consider information on features. Plain L1-regularization pulls parameters toward 0 without considering their range of values or frequency of occurrence. As a result, unemphasized features are easily truncated, even when such features are crucial for making a prediction. Occurrence frequencies and value ranges are not usually uniform in tasks such as natural language processing and pattern recognition. Some techniques for emphasizing features have been developed to retain them in the predictive model1). However, these methods must load all the data and sum up the occurrence counts of each feature before learning starts, and hence they are pre-processing steps. As the name implies, pre-processing undermines the essence of online learning, wherein data are processed sequentially and there is no assumption that feature occurrence counts can be summed over all the data in advance. Previous work, however, has not dealt with these challenges in any detail.
We propose a new sparsity-inducing regularization framework for subgradient-based online learning. Our framework enables us to eliminate this bias and retain informative features in an online setting without any pre-processing. The key idea behind our framework is to integrate the absolute values of the subgradients of the loss functions into the regularization term. We call this framework feature-aware regularization for sparse online learning. Our framework can dynamically weaken truncation effects on rare features and thereby obtain a set of important features regardless of their occurrence frequency or value range. Our extension can be applied to many subgradient-based online learning algorithms, such as COMID, RDA, and FTPRL. We show three applications of our framework in this paper: FR-COMID (Feature-aware Regularized COMID), which is an extension of COMID; FRDA (Feature-aware Regularized Dual Averaging), which is an extension of RDA; and FTPFRL (Follow-The-Proximally-Feature-aware-Regularized-Leader), which is an extension of FTPRL. We also analyzed the theoretical aspects of our framework. As a result,
we derived the same computational cost and the regret upper bound as those for the original algorithms.
Finally, we evaluated our framework in several experiments. The results revealed that our framework
improved state-of-the-art algorithms on most datasets in terms of prediction accuracy and sparsity.
This paper is organized as follows. Section 2 introduces the basic framework of subgradient-based online learning and describes the three major sparse online learning algorithms: COMID, RDA, and FTPRL. Section 3 introduces previous work on sparse online learning. In Section 4, we propose our new sparsity-inducing regularization framework, called feature-aware regularization. We show some common examples in which the previous work causes problems in sparse online learning. As a solution to these problems, we propose a new framework extending the three previous algorithms. For each algorithm, we derive a closed-form update formula. Section 5 discusses the upper bound of the regret for our algorithms, and Section 6 describes the results of several experiments. Section 7 concludes the discussion of our work.
1) For example, TF-IDF in natural language processing tasks.
Table 1 Notation
|α|          Absolute value of a scalar α
a            Vector
a(i)         ith entry of vector a
I            Identity matrix
A(i,j)       (i, j)th entry of matrix A
‖a‖_p        L_p norm of a vector a
⟨a, b⟩       Inner product
R^n          n-dimensional Euclidean space
dom f        Domain of function f
sgn(·)       Sign of a real number
argmin f     Unique point for minimizing function f
Argmin f     Set of minimizing points of function f
∂f(a)        Subdifferential of function f at a
∇f(a)        Gradient of function f (differentiable)
B_ψ(·, ·)    Bregman divergence
2 Subgradient-based online learning
Before explaining the three major sparse online learning algorithms, we will describe the notation used
in this paper and the problem setting for subgradient-based online learning.
2.1 Notation

Our notation for formally describing the problem setting is summarized in Table 1. Scalars are denoted by lowercase italic letters, e.g., α, and the absolute value of a scalar is |α|. Vectors are in lowercase bold letters, such as a. Matrices are in uppercase bold letters, e.g., A. I is the identity matrix. ‖a‖_p represents the L_p norm of a vector a, and ⟨a, b⟩ denotes the inner product of two vectors a and b. R^n is the n-dimensional Euclidean space, and R_+ is the one-dimensional non-negative Euclidean space. Let dom f be the domain of a function f. The function sgn is defined as sgn(α) = α/|α| for a real number α (if α = 0, sgn(α) = 0). Let argmin f be the unique point at which a function f attains its minimum, and let Argmin f be the set of minimizing points of f. ∂f(a) is the subdifferential of a function f at a, that is, the set of all vectors g ∈ R^n satisfying f(b) ≥ f(a) + ⟨g, b − a⟩ for all b. We call any element g ∈ ∂f(a) a subgradient of f at a. Even if f is non-differentiable, at least one subgradient exists when f is convex. A function f is convex when it satisfies the following inequality: ∀t ∈ [0, 1], ∀a, b, f(ta + (1 − t)b) ≤ t f(a) + (1 − t) f(b). When a function f is differentiable, we denote the gradient of f at a by ∇f(a).
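As a small illustrative check (our own example, not from the paper), the defining inequality f(b) ≥ f(a) + ⟨g, b − a⟩ can be tested numerically; for f(α) = |α|, any g ∈ [−1, 1] is a subgradient at α = 0:

```python
import numpy as np

f = abs  # f(a) = |a|: convex, but not differentiable at 0

def is_subgradient(g, a, points):
    """Check the subgradient inequality f(b) >= f(a) + g*(b - a) at sample points b."""
    return all(f(b) >= f(a) + g * (b - a) - 1e-12 for b in points)

bs = np.linspace(-5, 5, 101)
print(is_subgradient(0.3, 0.0, bs))   # True: any g in [-1, 1] works at a = 0
print(is_subgradient(1.5, 0.0, bs))   # False: 1.5 lies outside the subdifferential
```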
The Bregman divergence can be described as

B_ψ(a, b) = ψ(a) − ψ(b) − ⟨∇ψ(b), a − b⟩, (1)

where the Bregman function ψ(·) is a continuously differentiable function. The Bregman divergence is a generalized distance function between two vectors. It satisfies

B_ψ(a, b) ≥ (σ/2)‖a − b‖²₂ (2)

for some scalar σ. For example, the squared Euclidean distance B_ψ(a, b) = (1/2)‖a − b‖²₂, generated by ψ(a) = (1/2)‖a‖²₂, is a famous example satisfying the properties of the Bregman divergence.
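As a quick sanity check (a minimal sketch in our own code, not the paper's), the divergence generated by ψ(a) = ½‖a‖²₂ coincides with the squared Euclidean distance:

```python
import numpy as np

def bregman(psi, grad_psi, a, b):
    """Bregman divergence B_psi(a, b) = psi(a) - psi(b) - <grad_psi(b), a - b>."""
    return psi(a) - psi(b) - np.dot(grad_psi(b), a - b)

# psi(a) = 0.5 * ||a||_2^2 generates the squared Euclidean distance.
psi = lambda v: 0.5 * np.dot(v, v)
grad_psi = lambda v: v

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(bregman(psi, grad_psi, a, b))   # 6.5
print(0.5 * np.sum((a - b) ** 2))     # 6.5 -- the two coincide
```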
2.2 Subgradient-based online learning

Now let us review the generic problem setting for subgradient-based online learning. The algorithms perform sequential prediction and updating according to the following scheme:
1. Initialization: start at round t = 1. Set the weight vector w_1 = 0. The weight vector space is restricted to a closed convex set W ⊆ R^d.
2. Input: at round t, receive one input datum, described as a d-dimensional feature vector x_t taken from a closed convex set X ⊆ R^d.
3. Prediction: make a prediction through the inner product of the feature vector x_t and the weight vector w_t. The predicted value is denoted by ŷ_t = ⟨w_t, x_t⟩.
4. Output: observe the true output y_t and incur a loss through a loss function ℓ_t(·).
5. Update: update w_t to a new weight vector w_{t+1} using a subgradient of the current loss function at the current weight vector.
6. Increase the round number t and repeat steps 2 through 5 until no input data remain.
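The six steps above can be sketched as a generic loop (a hypothetical skeleton of our own; the squared loss is assumed only for illustration, and the algorithm-specific rule of step 5 is passed in as `update`):

```python
import numpy as np

def run_online(stream, d, update):
    """Generic subgradient-based protocol: one pass, one datum at a time."""
    w = np.zeros(d)                                  # step 1: w_1 = 0
    total_loss = 0.0
    for t, (x, y) in enumerate(stream, start=1):     # step 2: receive x_t
        y_hat = np.dot(w, x)                         # step 3: predict <w_t, x_t>
        total_loss += 0.5 * (y_hat - y) ** 2         # step 4: incur the loss
        g = (y_hat - y) * x                          # subgradient of the loss at w_t
        w = update(w, g, t)                          # step 5: algorithm-specific update
    return w, total_loss                             # step 6 is the loop itself
```

The algorithms discussed in this paper (COMID, RDA, FTPRL) differ only in how `update` consumes the subgradients g_t.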
The ideal algorithm is one that minimizes the sum of the loss functions accumulated over the rounds, for any sequence of loss functions. This is equivalent to finding the optimal weight vector that minimizes the cumulative loss.
A loss function represents the penalty for a predictive error. Thus, the value of the loss function generally depends on the extent of the discrepancy between the predicted value and the true output. ℓ_t is a convex loss function of the form ℓ_t(·) : W → R_+, where t is the round number. We assume that each loss function can be written as ℓ_t(w_t) = ℓ̂_t(⟨w_t, x_t⟩) = ℓ̂_t(ŷ_t), where ℓ̂_t is non-decreasing as the distance between ŷ_t and y_t increases. This assumption makes inference based on the inner product reasonable.
We would like to minimize the sum of the loss functions. However, it is known that no algorithm can guarantee that this sum is bounded above by a fixed value for every possible sequence of loss functions. Therefore, an algorithm's performance is measured not by an absolute evaluation criterion but by a relative one, namely a comparison with the optimal weight vector. This is the fundamental notion behind the regret bound used to evaluate the algorithms. The regret after T rounds is defined as

R(T) = Σ_{t=1}^{T} ℓ_t(w_t) − min_{w∈W} Σ_{t=1}^{T} ℓ_t(w). (3)

The first term is the total cost of the loss functions over all rounds while the algorithm runs. The second term is the minimal cumulative cost when we pick the optimal weight vector w* in hindsight. Through the lens of regret minimization, the learner's goal is to achieve as low a regret as possible. If the algorithm's regret is bounded by o(T), that is, if it is sublinear, then the regret converges to 0 on average. This means the loss per datum will converge to the same value as in the case of the best fixed strategy, no matter what datum or loss function is received in each round.
A subgradient method (SG) is one of the simplest online learning algorithms. It is widely used because of its simplicity and extensive theoretical background. In this method, the weight vector is updated according to the formula

w_{t+1} = w_t − η_t g_t, s.t. g_t ∈ ∂ℓ_t(w_t), (4)

where g_t is a subgradient of ℓ_t with respect to w_t and η_t is the learning rate. The method sequentially updates the parameters in order to minimize ℓ_t. It has been proven that the regret bound of the SG is O(√T) when η_t = η/√t for any constant η, so the regret per datum vanishes as T → ∞.
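As a concrete instance (our own sketch, not code from the paper), rule (4) with the decaying rate η_t = η/√t can be run on a toy least-squares stream:

```python
import numpy as np

def sg_update(w, g, t, eta=0.5):
    """Subgradient method: w_{t+1} = w_t - eta_t * g_t with eta_t = eta / sqrt(t)."""
    return w - (eta / np.sqrt(t)) * g

# Toy stream whose best fixed weight vector is w* = (1, -1).
stream = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)] * 200
w = np.zeros(2)
for t, (x, y) in enumerate(stream, start=1):
    g = (np.dot(w, x) - y) * x        # gradient of the squared loss at w_t
    w = sg_update(w, g, t)
# w approaches (1, -1), so the loss per round vanishes as T grows.
```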
2.3 Sparsity-inducing regularization
Regularization is a well-known technique to obtain desirable structures in accordance with task objectives.
It is widely used to achieve objectives such as generating a predictive model with lower generalization
error or a compact model. To apply it, we integrate a regularization term into an optimization problem
for deriving a desirable predictive model.
By adding a convex regularization term to the optimization problem, our focus changes from a simple loss-minimization problem to minimizing the sum of the loss functions and a regularization term. The objective in this case is defined as follows:

Σ_{t=1}^{T} ℓ_t(w) + Ψ(w), (5)

where Ψ(·) is a regularization term of the form Ψ(·) : W → R_+ and Ψ is convex in W.
L2- and L1-regularization are the major regularization functions for integrating structure into a predictive model. L2-regularization consists of taking the squared L2 norm of the weight vector, Ψ(w) = λ‖w‖²₂, and L1-regularization is defined by the L1 norm, Ψ(w) = λ‖w‖₁, where λ is a regularization parameter. L1-regularization is a simple and well-known technique for inducing sparsity in weight vectors.
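The sparsity-inducing effect of the L1 norm is visible in its proximal operator, the soft-thresholding function that composite-objective methods such as COMID apply coordinate-wise (a minimal sketch in our notation, where `thresh` plays the role of the effective threshold, e.g., λη):

```python
import numpy as np

def soft_threshold(w, thresh):
    """Proximal operator of thresh * ||.||_1:
    argmin_u 0.5 * ||u - w||_2^2 + thresh * ||u||_1, solved coordinate-wise."""
    return np.sign(w) * np.maximum(np.abs(w) - thresh, 0.0)

w = np.array([0.8, -0.05, 0.3, -1.2])
print(soft_threshold(w, 0.1))   # entries with |w_i| <= 0.1 become 0; the rest shrink by 0.1
```

Parameters that rarely receive large updates hover near this threshold and are truncated to exactly 0, which is the bias the feature-aware framework aims to correct.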
Regularization has several different properties depending o...