RESEARCH PAPER
SCIENCE CHINA Information Sciences
May 2014, Vol. 57, 052104:1–052104:21
© Science China Press and Springer-Verlag Berlin Heidelberg 2014    info.scichina.com    link.springer.com
Feature-aware regularization for sparse online learning
OIWA Hidekazu1, MATSUSHIMA Shin1 & NAKAGAWA Hiroshi1,2
1Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8654, Japan;
2Information Technology Center, The University of Tokyo, Tokyo 113-8654, Japan
Received August 31, 2013; accepted November 24, 2013
Abstract  Learning a compact predictive model in an online setting has recently gained a great deal of attention. The combination of online learning with sparsity-inducing regularization enables faster learning with a smaller memory space than previous learning frameworks. Many optimization methods and learning algorithms have been developed on the basis of online learning with L1-regularization. L1-regularization tends to truncate some types of parameters, such as those that rarely occur or have a small range of values, unless they are emphasized in advance. However, including a pre-processing step would make it very difficult to preserve the advantages of online learning. We propose a new regularization framework for sparse online learning. We focus on regularization terms, and we enhance the state-of-the-art regularization approach by integrating information on all previous subgradients of the loss function into the regularization term. The resulting algorithms enable online learning to adjust the intensity of each feature's truncation without pre-processing and eventually eliminate the bias of L1-regularization. We show theoretical properties of our framework, namely its computational complexity and regret upper bound. Experiments demonstrated that our algorithms outperformed previous methods on many classification tasks.
Keywords online learning, supervised learning, sparsity-inducing regularization, feature selection, sentiment
Citation Oiwa H, Matsushima S, Nakagawa H. Feature-aware regularization for sparse online learning. Sci
China Inf Sci, 2014, 57: 052104(21), doi: 10.1007/s11432-014-5082-z
Online learning is a learning framework in which prediction and parameter updates take place in a sequential setting, each time a learner receives one datum. Online learning is beneficial for learning from large-scale data, where batch algorithms cannot derive a globally optimal solution within a reasonable amount of time because the computational cost is very high. If all instances cannot be simultaneously loaded into the main memory, optimization in batch learning requires some sort of reformulation in order to arrive at an exact solution. Many online learning algorithms update predictors on the basis of only one new instance. This means that online learning runs faster and uses a smaller memory space than batch learning does.
Corresponding author (email: firstname.lastname@example.org)
Online learning algorithms are especially efficient on datasets where the dimension of instances or the number of instances is very large. Many algorithms have been transformed into online ones because of their ease of handling large-scale data.
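To make the per-instance update concrete, the following sketch (our illustration, not code from the paper) implements plain online subgradient descent for the hinge loss; only the current instance is held in memory at each round, which is what gives online learning its speed and small memory footprint.

```python
def online_subgradient_descent(stream, dim, eta=0.1):
    """stream yields (x, y) pairs; x is a list of floats, y is -1 or +1."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:
        # Predict from the current weights using this one instance alone.
        score = sum(wi * xi for wi, xi in zip(w, x))
        if y * score <= 0:
            mistakes += 1
        # Update only if the hinge loss max(0, 1 - y<w, x>) is active:
        # step against its subgradient, then discard the instance.
        if y * score < 1.0:
            for i in range(dim):
                w[i] += eta * y * x[i]
    return w, mistakes
```

The learner never revisits past data, so the memory cost is independent of the number of instances seen so far.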
Regularization is a generalization technique to prevent over-fitting of previously received data. L1-
regularization is a well-known method of deriving a compact predictive model. The functionality of
L1-regularization is to eliminate parameters that are insignificant for prediction on the fly. A compact
model is able to reduce the computational time and required memory space for making predictions because
it deals with a smaller number of parameters. In addition, L1-regularization has a generalization effect
for preventing over-fitting to the training data.
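The truncation behavior of L1-regularization can be illustrated by its proximal operator, coordinate-wise soft-thresholding. The sketch below is a generic illustration (not the paper's update rule): every coordinate is shrunk toward zero by the same constant, so parameters with small values are set exactly to zero, which is how the model becomes compact.

```python
def soft_threshold(w, lam):
    """Coordinate-wise proximal operator of lam * ||w||_1."""
    out = []
    for wi in w:
        if abs(wi) <= lam:
            out.append(0.0)                     # truncated to exactly zero
        else:
            sign = 1.0 if wi > 0 else -1.0
            out.append(sign * (abs(wi) - lam))  # shrunk toward zero by lam
    return out
```

Because the shrinkage amount lam is uniform across coordinates, parameters tied to rare or small-valued features are the first to be eliminated.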
Online learning with L1-regularization is currently being studied as a way of solving large-scale optimization problems. Some novel frameworks have been developed in this field. Most of these works are subgradient-based; that is, the update is performed by calculating the subgradients of the loss functions. Subgradient-based online learning has high computational efficiency and high predictive performance. There are three major subgradient-based algorithms for sparse online learning:
1) COMID (Composite Objective MIrror Descent) [2,3];
2) RDA (Regularized Dual Averaging);
3) FTPRL (Follow-The-Proximally-Regularized-Leader) [5,6].
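As one concrete instance, L1-regularized RDA admits a closed-form update in which the running average of all past subgradients is soft-thresholded. The sketch below follows the commonly used form with auxiliary function h(w) = ½‖w‖² and β_t = γ√t; the exact formulation in the cited work may differ in details.

```python
import math

def rda_l1_update(g_sum, t, lam, gamma=1.0):
    """One L1-RDA step: g_sum is the coordinate-wise sum of all past
    subgradients, t the round number. Returns the new weight vector."""
    w = []
    for gs in g_sum:
        g_bar = gs / t                  # average of all past subgradients
        if abs(g_bar) <= lam:
            w.append(0.0)               # truncated to exactly zero
        else:
            # soft-threshold the average, then scale by -sqrt(t)/gamma
            sign = 1.0 if g_bar > 0 else -1.0
            w.append(-(math.sqrt(t) / gamma) * (g_bar - lam * sign))
    return w
```

Averaging over all past subgradients is what distinguishes RDA from COMID-style methods, which use only the most recent subgradient in each proximal step.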
However, these subgradient-based online algorithms do not consider information on features. Plain L1-regularization pushes parameters toward 0 without considering their range of values or frequency of occurrence. As a result, unemphasized features are easily truncated, even when such features are crucial for making a prediction. The occurrence frequency and value range are not usually uniform in tasks such as natural language processing and pattern recognition. Some techniques to emphasize such features have been developed to retain them in the predictive model1). However, these methods must load all the data and sum up the occurrence counts of each feature before learning starts, and hence they are pre-processing steps. As the name implies, pre-processing contradicts the essence of online learning, in which the data are processed sequentially and there is no assumption that all data are available for summing up feature occurrence counts. Previous work, however, has not dealt with these challenges in any detail.
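For concreteness, the footnoted TF-IDF weighting illustrates why such emphasis techniques are pre-processing: no weight can be computed until a full pass over the corpus has produced the document frequencies. The helper below is a generic sketch of standard TF-IDF, not code from the paper.

```python
import math
from collections import Counter

def tf_idf_weights(corpus):
    """corpus: list of token lists. Requires a full pass over all the
    documents up front -- incompatible with a purely online setting."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:                 # pass 1: document frequencies
        df.update(set(doc))
    weights = []
    for doc in corpus:                 # pass 2: per-document TF-IDF
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights
```

In a sequential setting the learner has no access to `n_docs` or the final document frequencies, so this kind of re-weighting cannot be applied on the fly.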
We propose a new sparsity-inducing regularization framework for subgradient-based online learning. Our framework enables us to eliminate this bias and retain informative features in an online setting without any pre-processing. The key idea behind our framework is to integrate the absolute values of the subgradients of the loss functions into the regularization term. We call this framework feature-aware regularization for sparse online learning. Our framework can dynamically weaken truncation effects on rare features and thereby obtain a set of important features regardless of their occurrence frequency or value range. Our extension can be applied to many subgradient-based online learning algorithms, such as COMID, RDA, and FTPRL. We show three applications of our framework in this paper: FR-COMID (Feature-aware Regularized COMID), which is an extension of COMID; FRDA (Feature-aware Regularized Dual Averaging), which is an extension of RDA; and FTPFRL (Follow-The-Proximally-Feature-aware-Regularized-Leader), which is an extension of FTPRL. We also analyzed the theoretical aspects of our framework and showed that our algorithms have the same computational cost and regret upper bounds as the original algorithms. Finally, we evaluated our framework in several experiments. The results revealed that our framework improved on state-of-the-art algorithms on most datasets in terms of prediction accuracy and sparsity.
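The key idea can be sketched schematically as follows (an illustration under our own normalization assumptions, not the paper's exact update rule): accumulate the absolute subgradient values per coordinate, and make each coordinate's truncation threshold proportional to that accumulated magnitude, so that rarely occurring or small-valued features are truncated more gently.

```python
def feature_aware_threshold(abs_grad_sum, lam):
    """abs_grad_sum: per-coordinate sums of |subgradient| over all rounds.
    Returns per-coordinate L1 thresholds: coordinates whose accumulated
    magnitude is small (rare features) get a weaker truncation threshold.
    The max-normalization here is an illustrative choice."""
    max_sum = max(abs_grad_sum) or 1.0     # guard against all-zero sums
    return [lam * s / max_sum for s in abs_grad_sum]
```

Replacing the single global threshold of plain L1-regularization with such per-coordinate thresholds is what allows the truncation intensity to adapt to each feature without any pre-processing pass.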
This paper is organized as follows. Section 2 introduces the basic framework of subgradient-based online learning and describes the three major sparse online learning algorithms: COMID, RDA, and FTPRL. Section 3 introduces previous work on sparse online learning. In Section 4, we propose our new sparsity-inducing regularization framework, called feature-aware regularization. We show some common examples in which the previous work causes problems in sparse online learning. As a solution to these problems, we propose a new framework for the three previous algorithms. For each algorithm, we derive a closed-form update formula. Section 5 discusses the upper bound of regret for our algorithms, and Section 6 describes the results of several experiments. Section 7 concludes the discussion of our work.
1) For example, TF-IDF in tasks on natural language processing.
Table 1  Notation

|α|          Absolute value of a scalar α
a            Vector
a(i)         ith entry of vector a
I            Identity matrix
A(i,j)       (i,j)th entry of matrix A
‖a‖p         Lp norm of vector a
⟨a, b⟩       Inner product of vectors a and b
Rn           n-dimensional Euclidean space
dom f        Domain of function f
sgn(α)       Sign of a real number α
argmin f     Unique point for minimizing function f
Argmin f     Set of minimizing points of function f
∂f(a)        Differential of function f at a
∇f(a)        Gradient of function f (differentiable)
B(·, ·)      Bregman divergence
2 Subgradient-based online learning
Before explaining the three major sparse online learning algorithms, we will describe the notation used
in this paper and the problem setting for subgradient-based online learning.
Our notation for formally describing the problem setting is summarized in Table 1. Scalars are denoted by lowercase italic letters, e.g., α, and the absolute value of a scalar α is |α|. Vectors are in lowercase bold letters, such as a. Matrices are in uppercase bold letters, e.g., A. I is the identity matrix. ‖a‖p represents the Lp norm of a vector a, and ⟨a, b⟩ denotes the inner product of two vectors a and b. Rn is an n-dimensional Euclidean space and R+ is the one-dimensional non-negative Euclidean space. Let dom f be the domain of function f. The function sgn is defined as sgn(α) = α/|α| for a real number α (if α = 0, sgn(α) = 0). Let argmin f be a unique point for minimizing function f.