12
1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEE Transactions on Industrial Informatics 1 Learning Shapelet Patterns from Network-based Time Series Haishuai Wang, Jia Wu, Peng Zhang, and Yixin Chen Abstract—This paper formulates the problem of learning discriminative features (i.e., segments) from networked time series data, considering the linked information among time series. For example, social network users are considered to be social sensors that continuously generate social signals represented as a time series. The discriminative segments are often referred to as shapelets in a time series. Extracting shapelets for time series analysis has been widely studied. However, existing works on shapelet selection assume that the time series are independent and identically distributed (i.i.d.). This assumption restricts their applications to social networked time series analysis since a user’s actions can be correlated to his/her social affiliations. In this paper, we propose a novel network regularized least squares (NetRLS) feature selection model that combines typical time series data and user network data for analysis. Experiments on real-world Twitter, Weibo and DBLP networked time series data demonstrate the performance of the proposed method. NetRLS performs better than the representative baselines on four evaluation criteria, namely classification accuracy, AUC, F1-score, and statistical significance analysis. NetRLS also has competitive running time as the baselines. Index Terms—Feature learning, time series, data mining I. I NTRODUCTION A time series is a sequence of data points made in a temporal order over a continuous time interval [1]. Time series data analysis has attracted increasing interest from data mining experts, given the wide variety of applications for time series analysis, such as economics and finance where we are continually exposed to daily stock market quotations [2], the research of natural phenomena based on natural gas networks [3], power flow analysis for a centralized PV plant [4], and intelligent fault diagnosis for electric machines [5]. The main challenge of time series analysis is to find the discriminative features that can best predict class labels. Recently, a line of enquiry was proposed to solve this challenge: extracting discriminative features, often referred to as shapelets, from the time series. Shapelets are maximally discriminative features This work was supported by the MQNS under Grant no. 9201701203, MQEPS under Grant no. 9201701455, MQRSG under Grant no. 95109718, and the 2018 Collaborative Research Project between Macquarie University and Data61. (Corresponding author: Jia Wu.) H. Wang is with the Department of Computer Science and Engineering, Fairfield University, Fairfield, CT, 06824, and the Department of Biomedical and informatics, Harvard Medical School, Boston, MA 02115, USA (e-mail: haishuai [email protected]). J. Wu is with Department of Computing, Faculty of Science and En- gineering, Macquarie University, Sydney, NSW 2109, Australia (e-mail: [email protected]). P. 
Zhang is with the Ant Financial Services Group, Hangzhou 310012, China (e-mail: [email protected]). Y. Chen is with the Department of Computer Science, Washington Univer- sity in St. Louis, MO, 63110 (e-mail: [email protected]). Social Network Social robots Social users Fig. 1. An example of networked time series. We aim to identify social robots (left) from real social users (right). Each social node is considered to be a social sensor [15] that generates continuous social signals. The i.i.d. assumption no longer stands because the nodes are mutually connected in social networks. of time series that have high prediction accuracy and they are easy to explain [6]. Discovering shapelets has, therefore, become an important branch of time series analysis. To date, there are two types of approaches to time series analysis: distance-based methods and feature-based methods. The former measures the similarity between two time series (e.g., dynamic time warping (DTW) [7]), while the latter con- siders time series as a feature vector so that traditional feature- based learning machines (e.g., SVM or logistic regression) can be applied. In the feature-based methods, the features can be either simple statistics (e.g., mean and variance) or a subsequence of the time series (i.e., shapelets). Shapelets are the discriminative segments of time series that best predict class labels [8]. Shapelet implies the idea of a small sub-shape, and is defined as a time series subsequence with the highest performance in terms of discriminating one class from another [9]. We have also observed methods that extract accurate and interpretable shapelets for time series analysis. Exam- ples include: decision tree-based shapelet extraction [8][10], regression-based shapelet learning [11], [12], and time series transformation methods [13]. On the other hand, recent work [14] proposed a new approach to time series shapelet learning. Instead of searching for shapelets in a candidate pool, they use regression learning with the aim of learning shapelets from the time series. In this way, the shapelets are detached from candidate segments and the learned shapelets may differ from all the candidate segments. More importantly, shapelet learning has very fast runtimes, is scalable to large datasets, and is robust to noise. However, representative methods for shapelet discovery assume the time series to be independent and identically dis- tributed (i.i.d.), hence traditional machine learning models are applicable. In emerging applications such as social networks,

Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

1

Learning Shapelet Patterns from Network-basedTime Series

Haishuai Wang, Jia Wu, Peng Zhang, and Yixin Chen

Abstract—This paper formulates the problem of learningdiscriminative features (i.e., segments) from networked timeseries data, considering the linked information among time series.For example, social network users are considered to be socialsensors that continuously generate social signals represented asa time series. The discriminative segments are often referred toas shapelets in a time series. Extracting shapelets for time seriesanalysis has been widely studied. However, existing works onshapelet selection assume that the time series are independentand identically distributed (i.i.d.). This assumption restricts theirapplications to social networked time series analysis since a user’sactions can be correlated to his/her social affiliations. In thispaper, we propose a novel network regularized least squares(NetRLS) feature selection model that combines typical timeseries data and user network data for analysis. Experimentson real-world Twitter, Weibo and DBLP networked time seriesdata demonstrate the performance of the proposed method.NetRLS performs better than the representative baselines onfour evaluation criteria, namely classification accuracy, AUC,F1-score, and statistical significance analysis. NetRLS also hascompetitive running time as the baselines.

Index Terms—Feature learning, time series, data mining

I. INTRODUCTION

A time series is a sequence of data points made in atemporal order over a continuous time interval [1]. Timeseries data analysis has attracted increasing interest from datamining experts, given the wide variety of applications for timeseries analysis, such as economics and finance where we arecontinually exposed to daily stock market quotations [2], theresearch of natural phenomena based on natural gas networks[3], power flow analysis for a centralized PV plant [4], andintelligent fault diagnosis for electric machines [5]. The mainchallenge of time series analysis is to find the discriminativefeatures that can best predict class labels. Recently, a lineof enquiry was proposed to solve this challenge: extractingdiscriminative features, often referred to as shapelets, from thetime series. Shapelets are maximally discriminative features

This work was supported by the MQNS under Grant no. 9201701203,MQEPS under Grant no. 9201701455, MQRSG under Grant no. 95109718,and the 2018 Collaborative Research Project between Macquarie Universityand Data61. (Corresponding author: Jia Wu.)

H. Wang is with the Department of Computer Science and Engineering,Fairfield University, Fairfield, CT, 06824, and the Department of Biomedicaland informatics, Harvard Medical School, Boston, MA 02115, USA (e-mail:haishuai [email protected]).

J. Wu is with Department of Computing, Faculty of Science and En-gineering, Macquarie University, Sydney, NSW 2109, Australia (e-mail:[email protected]).

P. Zhang is with the Ant Financial Services Group, Hangzhou 310012,China (e-mail: [email protected]).

Y. Chen is with the Department of Computer Science, Washington Univer-sity in St. Louis, MO, 63110 (e-mail: [email protected]).

Social Network

Social robots Social users

Fig. 1. An example of networked time series. We aim to identify socialrobots (left) from real social users (right). Each social node is considered tobe a social sensor [15] that generates continuous social signals. The i.i.d.assumption no longer stands because the nodes are mutually connected insocial networks.

of time series that have high prediction accuracy and theyare easy to explain [6]. Discovering shapelets has, therefore,become an important branch of time series analysis.

To date, there are two types of approaches to time seriesanalysis: distance-based methods and feature-based methods.The former measures the similarity between two time series(e.g., dynamic time warping (DTW) [7]), while the latter con-siders time series as a feature vector so that traditional feature-based learning machines (e.g., SVM or logistic regression)can be applied. In the feature-based methods, the featurescan be either simple statistics (e.g., mean and variance) ora subsequence of the time series (i.e., shapelets). Shapeletsare the discriminative segments of time series that best predictclass labels [8]. Shapelet implies the idea of a small sub-shape,and is defined as a time series subsequence with the highestperformance in terms of discriminating one class from another[9]. We have also observed methods that extract accurateand interpretable shapelets for time series analysis. Exam-ples include: decision tree-based shapelet extraction [8][10],regression-based shapelet learning [11], [12], and time seriestransformation methods [13].

On the other hand, recent work [14] proposed a newapproach to time series shapelet learning. Instead of searchingfor shapelets in a candidate pool, they use regression learningwith the aim of learning shapelets from the time series. Inthis way, the shapelets are detached from candidate segmentsand the learned shapelets may differ from all the candidatesegments. More importantly, shapelet learning has very fastruntimes, is scalable to large datasets, and is robust to noise.

However, representative methods for shapelet discoveryassume the time series to be independent and identically dis-tributed (i.i.d.), hence traditional machine learning models areapplicable. In emerging applications such as social networks,

Page 2: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

2

time series data generated by social users no longer followsthe i.i.d. assumption. For example, each node in Fig. I is asocial sensor that generates social signals [15]. Because tweetsgenerated by social nodes in the same community are highlycorrelated, the i.i.d. assumption is violated and new modelsneed to incorporate network information to be able to analyzethe correlated time series.

To overcome the shortcomings of existing shapelet learningalgorithms being used for time series analysis, in this paper, wepropose a network regularized least squares feature selectionmethod - NetRLS- to incorporate network information forshapelet selection. Compared with previous work, the pro-posed method, NetRLS, has the following superiorities:• NetRLS incorporates the network structure information

from time series data. The network structure providesadditional information for shapelet selection, resulting inselection more discriminative shapelet features and higherclassification accuracy for time series learning.

• NetRLS uses the least squares regression to fit the labelswith respect to the time series information, and adoptgraph regularization to utilize the network information.Therefore, NetRLS is suited for time series data withnetwork structure information, that is networked timeseries.

• NetRLS is more applicable for social networked timeseries analysis since NetRLS can utilize both the timeseries information and link information in the network.Thus, NetRLS could achieve better classification resultof the networked time series data than either using thetime series information or link information alone, whichpresented in the previous work.

Therefore, the main contributions of the paper can besummarized as follows:• Existing time series data analysis is based on the assump-

tion that the i.i.d. condition holds. We show that for timeseries data in a networked environment (such as socialnetworks), the assumption does not stand and the problemof networked time series data analysis is proposed.

• We formulate a novel model to extract discriminativeshapelet features from networked time series data. Thenovel model considers both time series data and net-work structure for shapelet feature selection. The learnedshapelet features from the model can be used to identifytime series data. Since the proposed novel model incorpo-rates both traditional feature selection and network struc-ture, our model extends the traditional feature selectionapproaches to networked time series learning.

• We test the proposed algorithm on real-world Twitter,DBLP and Weibo data sets to demonstrate its perfor-mance. The proposed approach outperforms the represen-tative time series feature selection methods. These resultsshow that the linked time series network is useful foranalysis through five evaluation criteria, namely classifi-cation accuracy, AUC, F1-score, running time as well asstatistical significance.

The remainder of this paper is organized as follows. Wereview traditional time series analysis methods and feature

learning in network approaches in Section II. Section IIIprovides the preliminaries and the problem definition. Weintroduce the proposed NetRLS learning model in SectionIV. In Section V, we conduct the experiments on real-worlddata sets and compare the proposed method with benchmarkapproaches. Lastly, we draw conclusions and discuss futurework in Section VI.

II. RELATED WORK

In this section, we give a brief review of the representativemethods for learning features in a time series, followed by areview of the feature selection approaches in networks.

Networked data has been studied extensively. Most existingresearch efforts for networked data analysis are based on eithertime series or graph structure. The work in [16] modelednetworked data analysis as dynamic data analysis to detectsocial behavioral changes over time. To detect variances orchanges in human context over time, they used time seriesanalysis in cyber space. [17][18] presented graph miningmodels for network analysis. They modeled networked dataas graph structures and proposed subgraph feature selectionapproaches for network graph analysis. However, none of theexisting works modeled data as both time series and networkgraphs for learning.

A. Time Series Feature Learning

Discriminative features for temporal data analysis have beenstudied extensively [19]. For example: bursts [20], periods[21], anomalies [22], motifs [23], shapelets [6], [14], [24] anddiscords [25]. Recently, time series shapelets have attractedincreasing interest in data mining [26], [27]. Shapelets areusually much shorter than the original time series, and thismeans only one shapelet is needed to classify an entire dataset. Shapelets were first proposed by [6], as segments of thetime series that maximally predict the target variable. However,the runtime of brute-force shapelet discovery is not feasibledue to the large number of candidates. Therefore, a series oftechniques to expedite the process has been proposed, suchas abandoning distance computations prematurely or entropypruning of the information gain metric [6]. Mueen et al [28]rely on the reuse of computations and prunes of the searchspace to accelerate the shapelet discovery. Grabocka et al [14]proposed a novel method that learns near-to-optimal shapeletsdirectly, without the need to search exhaustively among a poolof candidates extracted from time-series segments. However,all existing time series approaches address either univariate ormultivariate problems. They ignore the structural informationbehind time series.

B. Feature Selection on Networked Data

Many supervised feature selection algorithms have beenproposed to select informative features from labeled data.A commonly used criterion in feature selection is to scorethe features. Representative methods to score the features arefilter-based, wrapper-based, and the embedded approach [29].The embedded methods combine feature selection with the

Page 3: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

3

TABLE ITABLE OF NOMENCLATURE

NomenclatureX Time SeriesY Label matrix of time seriesΩ Segments from time seriesW Mapping matrix of decision boundaryA Adjacent matrix derived from networks

R G(W) Network regularization termL Undirected graph LaplacianΠ Diagonal matrix

G,V,E Network, Node, and EdgeS ShapeletsU Euclidean distance matrixP Diagonal matrix based on U

S Lower segment space of SW Low dimension space of W

classifier, and are often considered as more effective than thefirst two methods [29]. However, traditional feature selectionapproaches assume that the data are i.d.d., which is not suitablefor networked data. Based on graph regularization and theembedded method, Belkin et al. [30] proposed Laplacian regu-larized least squares (LapRLS) for networked data, and Gu andHan [31] combined linear regression with graph regularizationto select features in networked data, demonstrating that theirnetworked approach outperforms traditional feature selectionmethods.

However, the existing works assume time series are indepen-dent and identically distributed. In fact, the correlations amongtime series can be constructed as a network and the networkinformation can be used for time series analysis. For example,when classifying Twitter time series data, we can utilize usernetwork information to improve the learning results. However,the aforementioned works only use the information from timeseries for analysis, thus, none of them can directly handlenetworked time series.

III. PRELIMINARIES AND PROBLEM DEFINITION

In our problem setting, there are two types of data: anetwork G = (V,E) where nodes |V | = n and edges |E| = l,and a set of time series data denoted by a matrix X ∈ Rq×nwhere the j-th column vector xj = [x

(1)j , x

(2)j , · · · , x(q)j ] ∈ Rq

represents the time series generated by node vj ∈ V . Thereare a total of c class labels denoted by a label matrixY ∈ 0, 1c×n where each row yj ∈ Rc is a unit vectordenoting the label of node vj and xj ∈X . We use table I todemonstrate the nomenclature of this paper.

Time series segments. Consider a sliding window of length t.When the window slides along a time series, a set of segmentscan be obtained. For time series xj ∈ X , we can generate atotal of q− t+1 segments by sliding the window from x

(1)j to

x(q−t+1)j . Thus, for the entire time series X , there are a total

of (q − t+ 1)× n segments, i.e., Ω = [ϕ1, · · · , ϕ(q−t+1)×n]where each ϕj ∈ Ω denotes a segment. Each element s(j,k) isthe distance between time series xj and segment ϕk. Thiscan be defined as the differential minimum distance that

approximately denotes the minimum distance between the timeseries and the segment. Note that the segment length t q,the number of segments (q − t+ 1)× n is very large.

Shapelets. Shapelets are defined as the most discriminativetime series segments [8]. Therefore, time series segmentsare shapelet candidates, and we can use Ω as the featurespace for shapelets selection. To represent each time seriesxj ∈ X in the space Ω, we use a column vector si =[si,1, · · · , s(i,(q−t+1)×n)] to record x′js feature values, whereeach element si,j depends on a distance function between xjand segment ϕj ∈ Ω, i.e., si,j = d(xi,ϕj) (This distancefunction is discussed in Section IV. Challenges). This way, thetime series data set X can be represented by a data matrixS = [s1, s2, · · · , sn] ∈ R(q−t+1)×n,n, where each columnvector sj represents a time series xj in space Ω. Note thateach sj is a ultra-high dimensional vector.

Goal. The objective is to select the most discriminativesegments as shapelets. Consider a multi-class problem withc class labels denotes a mapping matrix W ∈ R(q−t+1)×n,c

where the j−th column stores the classifier wj that identifiesthe j−th class from the remaining c − 1 classes. We expectto obtain a sparse matrix W with only a few non-zero rowvectors by minimizing the L2,0−norm ‖W ‖2,0. The L2,0-norm of W is defined as ‖W‖2,0 = card(‖w1‖2, · · · , ‖wc‖2).wj shrinks to zero if the j-th feature is not discriminative.Therefore, the features corresponding to zero column of Wwill be discarded when performing feature selection. Thisway, a few segments (row vectors in W ) are selected as theshapelets.

Evaluation. To evaluate the selected shapelet features, theclassification performance (such as accuracy) is used forvalidation. Classification of networked time series refers toclassify time series by using structure information. The pre-dicted label for the test time series will be calculated fromW TS.

IV. NETWORK REGULARIZED LEAST SQUARES SHAPELETLEARNING

Network regularization. Network information can help iden-tify the classifiers W . The crux is to use network regular-ization under the rule that: if two nodes are linked together,then they are likely to share the same class label. Technically,consider an undirected network with the adjacent matrix A ∈Rn×n derived from the edge set E, the network regularizationterm RG(W ) can be formulated as:

RG(W ) =1

2

c∑k=1

∑i,j

(wTk si −wT

k sj)2Aij

=c∑

k=1

∑i,j

wTk siAijs

Ti wk −

c∑k=1

∑i,j

wTk siAijs

Tj wk

=c∑

k=1

∑i

wTk siDiis

Ti wk −

c∑k=1

∑i,j

wTk siAijs

Tj wk

=c∑

k=1

wTkS(D −W )STwk

= tr(W TSLSTW ),

(1)

Page 4: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

4

where L = D − A is an undirected graph Laplacian[32] and D is a diagonal matrix called degree matrix withDii =

∑jAij . Based on graph theory, the degree matrix

is a diagonal matrix which contains information about thedegree of each vertex?that is, the number of edges attachedto each vertex [33]. It is used together with the adjacencymatrix to construct the Laplacian matrix of a graph. Eq. (1)can be easily extended to a directed network by replacing theundirected graph Laplacian with a directed graph LaplacianL, as follows:

L = Π− 1

2(ΠP + P TΠ),

where Π is a diagonal matrix and P is the transition matrixof random walk on the directed network [34].

Shapelets selection. We use the embedded feature selection[31] - specifically NetRLS. NetRLS aims to learn c linearclassifiers and select the top-ρ most discriminative shapeletsas shown in Eq. (2):

minW ‖Y −W TS‖2F + α‖W ‖2F + βtr(W TSLSTW )

s.t. : ‖W ‖2,1 ≤ ρ, α, β > 0. (2)

The first two terms in the objective function are the regularizedleast squares and the third term is the network regularization.The constraint ‖W ‖2,1 ≤ ρ is a relaxation of the L2,0

norm ‖W ‖2,0 ≤ ρ and ‖W ‖2,1 is defined as the sumof the l2 norm of all the column vectors wj ∈ W , i.e,‖W ‖2,1 =

∑j ‖wj‖2. ‖W ‖2,1 is a relaxation of ‖W ‖2,0

and the constraint ‖W ‖2,1 ≤ ρ can approximately obtain theresult that at most ρ rows in W are selected.

Challenges. The matrix S is intimidatingly large becausethe segment space Ω is ultra-high. Therefore, Eq. (2) cannotbe solved directly for S. In the sequel, we propose to trimmatrix S by using the correlation of segments.

Keogh et al. [19] conducted an experimental comparison oftime series representations and distance measures. They com-pared eight representation methods, nine similarity measuresand their variants, and tested their performance on 38 time se-ries data sets. They claim that the Euclidean distance is surpris-ing competitive to other more complex approaches, althoughit is very sensitive to misalignments. Because the focus of thispaper is not to introduce new representation/distance methods,we simply use Euclidean distance to measure the similaritybetween segments. However, it is worth noting that our methodcan be extended to the other representation/distance methodsdiscussed in their work [19]. In this paper, we first construct alower dimensional search space (e.g., with only η segments),then select top-ρ shapelets from the segments based on Eq.(2).

Define a diagonal matrix M = diag(0, 1, · · · , 1, 0), whererank(M) = η (q−t+1)×n which means only η elementone in M , and the problem turns to calculating M .

Based on Euclidean distance, we define the distance matrixU in Eq. (3). Note that the matrix is symmetric and non-

negative.

U =

d(ϕ1,ϕ1) · · · d(ϕ1,ϕ(q−t+1)×n)

d(ϕ2,ϕ1) · · · d(ϕ2,ϕ(q−t+1)×n)

.... . .

...d(ϕ(q−t+1)×n,ϕ1) · · · d(ϕ(q−t+1)×n,ϕ(q−t+1)×n)

(3)

Based on the matrix, we define a diagonal matrix P =D − U , where Dii =

∑j Uij . Then, selecting the optimal

S becomes equivalent to selecting the maximum triangleelements, i.e.,

M = argmaxM

tr(PM)

s.t. : rank(M) = η.(4)

We can then obtain the lower segment space S based on S,and W based on W , as shown in Eq. (5).

S = MS ∈ Rη×n, W = MW . (5)

Eq. (4) aims to find the top-ρ segments S from all thesegment candidates S. The constraints denote that only ρsegments from P are selected. The objective function denotesthat we want to obtain the segments that have the maximaldistances to all the other segments. Eq. (4) is easily to solve,because it is equivalent to select the maximal values on thediagonal matrix P .

Convexity. In Eq. (5), we reduced the high dimension Wand S into low dimension space W and S. Once S is replacedwith S in Eq. (2), we want to show whether the problem isconvex. So, gradient-based algorithms are used as the solution.

Theorem 1. The problem in Eq. (2) is convex w.r.t. W andgradient-based algorithms can achieve a global optimum.

Proof. Due to ‖W ‖2F = tr(W TW ), Eq. (2) can be convertedto the following optimization problem with parameters α, β >0,

min‖W ‖2,1≤λ

tr(W T [XXT+αI+βSLST ]W−2tr(STY W T )

(6)Let Λ = SST + αI + βSLST . It is clear that Λ is alwaysnon-negative, and hence it is a positive semi-definite matrix.The constraint ‖W ‖2,1 < ρ is also convex. Therefore, theoptimization problem is convex.

Algorithm. We use a recently proposed gradient-based al-gorithm, the Accelerated Proximal Gradient Decent (APG)algorithm [35], [36], as the solution. The convergence rateof APG is very fast of O( 1√

(ε)). Because Eq. (2) is convex,

APG can achieve a global optimum.Recall that the purpose of APG is to find a sequence of

variables · · · , Wk+1, · · · such that the objective functionconverges to a global minimum. Let f(W ) equals to thefollowing equation:

f(W ) = minW‖Y −WT S‖2F +α‖W‖2F +βtr(WT SLST W ),

Page 5: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

5

5 10 15 20 25 30

−20

0

20

40

60

80

100

120

140

5 10 15 20 25 30

−20

0

20

40

60

80

100

120

140

0 300

150

00

150

3055

A social robot

A VIP user

1

50

51

2

3 4

5

67

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

3940

41

42

43

44

45

46

47

48

49

52

53

54

55

56

57

58

59

60

61

62

63

64

65

66

67

68

69

70

71

72

73

74

75

76

77

78

79

80

81

82

83

84

85

86

87

88

8990

91

92

93

94

95

96

97

98

99

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125126

127

128

129

130

131

132

133

134

135

136

137

138139

140

141

142

143

144

145

146

147

148

149

150

151

152 153

154

155

156

157158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195196

197 198

199200

An ordinary user

Social robots

Active users

Ordinary users

(a) The social network (b) The generated time series data (c) Shapelets from time series of the three user groups

Fig. 2. An example of the generated time series and social network from Twitter. (a) The constructed social network based on the relationships among thesocial users. The network (left) has 200 nodes and 210 edges. Each node represents a social user, and there is an edge if two users have relationships. Socialrobots are densely connected while the remaining two groups of real users are sparsely connected. Each node corresponds to a time series shown in themiddle figure. (b) The generated time series of the three classes of nodes. In the social network (figure (a)), each node (user) can generate different numberof tweets per day, which constructs the time series data. Social robots and active VIP users (middle) have discriminative patterns while ordinary users tendto have flat curves. (c) Shapelets from different types of time series. The shapelets of a typical social robot are concave (right:blue), while the shapelets ofan active VIP user are convex (right:orange). The time series of ordinary users have heavy noise and have difficulty in capturing a shapelet.

Algorithm 1: NetRLS for Shapelets Selection on Net-worked Time Series.Input : Time series X ∈ Rq×n, Network

G = (V,E), window length t, # of classes c,# of segments η, # of shapelets ρ

Output: Shapelets S∗

Initialize α, β, θ0, W1, γ1 = 1 ;Generate a segment space Ω = [ϕ1, · · · , ϕ(q−t+1)×n] ;Generate an Euclidean distance matrix U based on Ω ;Generate a diagonal matrix P based on U ;Generate a selection matrixM = diag(0, .., 1),rank(M) = η;

SolveM = argmaxM tr(PM) s.t. : rank(M) = η. ;

Generate the candidate shapelet matrix S = SM ;Prune matrices in Eq. (2), L = LM , W = WM ;repeat

while F (Wk) > Gθt−1(Wk+1 ,Wk) do

Set θt−1 = τθt−1endSet θk = θk−1 ;Wk+1 = argmin

WGθk(W , Jk) ;

γk+1 =1+√

1+4γ2k

2 ;Jk+1 = Wk + γk−1

γk+1(Wk+1 − Wk) ;

until Convergence;

Score(i) =√∑

j W2i,j ;

Output: Segments Sk with the largest Scores ;

then Eq. (2) can be relaxed to Eq. (7), as follows:

F (W ) = f(W ) + λ‖W ‖2,1 (7)

where f(W ) is the objective function in Eq. (2) and λ‖W ‖2,1is a relaxation of the constraint. According to the Taylor seriesexpansion, F (W ) approximately equals to Gθk(W , Wk) as

follows,

Gθt(W , Wk) = f(Wk)+ < ∇f(Wk), W − Wk >

+θk2‖W − Wk‖2 + λ‖W ‖2,1

(8)

where ∇f(Wk) is the first order derivative of f(W ) at Wk.Now, the iterative step Wk+1 can be obtained by minimiz-

ing Gθk(W , Wk), i.e.,

Wk+1 = argminW

Gθk(W , Wk)

= argminW

1

2‖W − Vk‖2F +

λ

θk‖W ‖2,1

(9)

where Vk = Wk − 1θk∇f(Wk). It is not difficult to prove

that the solution of Eq. (8) can reduce the objective functionF (W ) and the algorithm is convergent.

Eq. (8) can be further broken down into c separate subprob-lems, each of which has a closed form solution given in Eq.(10), where wi

k+1, wi and vik are the i-th rows of W , W and

Vk respectively.

wik+1 =

(1− λ

θk‖vik‖)vik, if ‖vik‖ > λ

θk

0, otherwise.(10)

Moreover, we construct a linear combination of Wk andWk+1 to update Jk+1 as follows,

Jk+1 = Wk + (γk − 1)(Wk+1 − Wk)(γk+1), (11)

where the sequence of γk is conventionally set to be γk+1 =1+√

1+4γ2k

2 . The algorithm is summarized in Algorithm 1.

V. EXPERIMENTS

These experiments are designed to validate whether the pro-posed NetRLS model, which combines both time series dataand network data, can achieve better performance than usingonly time series data. All the experiments were conductedon a Linux Ubuntu server with 16*2.9GHZ CPU, and were

Page 6: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

6

0.1 1 20 50

0.73

0.74

0.75

10 Parameter values

Acc

ura

cy

Parameter αParameter βParameter λ

(a) Parameter tests on α, β and λ.

10 20 30 40 500.72

0.75

0.77

Parameter ρ

Accura

cy

NetRLSRLS

Network StrengthAccuracy gap: 0.03

(b) NetRLS VS. RLS.

Fig. 3. Parameter test (a) and model comparison (b).

implemented in Matlab. The source codes and data sets areavailable online1,2.

A. Data

We collected a Twitter data set, a DBLP data set and aWeibo data set for validation.

Twitter: The task is to detect social robots that auto-distribute advertisements for viral marketing on social net-works. We located and collected 200 social time series over thelast 30 days from three types of nodes: 1) social robots whichare zombies controlled by a master node, that occasionallydistribute spam over the network. Because we had alreadylocated the master nodes, we could infer the network linkamong these zombie nodes; 2) active social users who arefamous/VIP users. These users are very active and update theirpages frequently. The links between them are sparser than thatof the social robots; and 3) ordinary social users who rarelypost messages and therefore, whose links are sparsest.

The network information shown in Fig. 2 gives a smallportion of the time series data. We calculated the total numberof tweets for each node per day and obtained the time seriesof length 30. There are 20 social robots, 30 active socialusers, and 150 ordinary social users. Intuitively, we observethat social robots have a very short yet sharp time periodfor distributing information. But for the remainder of thetime, these nodes are asleep. Active social users have moreregular and frequent information distribution, while ordinaryusers show low frequency. These basic features can guaranteesatisfactory shapelets for analysis.

DBLP: The task is to determine whether an author isinfluential in a given research field. We retrieved around 700authors from DBLP 3 from the data mining area, i.e., authors inthe ICDM, PKDD/ECML and KDD conferences. There were300 influential authors (e.g., Jiawei Han, Philip S. Yu, etc.) inthe data set, with the remaining authors considered ’normal’in the data mining area. Note that the authors consideredto be normal do not necessarily represent authors who onlypublish a low number of papers each year or a low numberof papers in particular data mining conferences. We labelledthe influential and normal authors based on prior knowledge.Some influential authors are already well-known researchersand their frequent co-authors also publish in the top five datamining conferences. Some authors share surnames, denoted as

1https://github.com/BlindReview/TII/tree/master/Source%20Codes2https://github.com/BlindReview/TII/tree/master/data3http://dblp.uni-trier.de/db/

noise authors. We manually removed the noise and mislabeledauthors. We crawled the number of papers published eachyear to form time series data for each author between 1996and 2015. Then, based on the co-authorship, we constructednetwork information. For example, an edge exists if twoauthors have a co-authorship relationship and the weight isset to 1, otherwise it is set to 0.

Weibo: We test the proposed algorithm on a larger dataset,Sina Weibo, which is the most popular social media platformin China. In social media, users can post messages, and uploadpictures and videos to share in real-time. We crawled 4,000users over two months of Weibo data. Users can be followersand friends, which constructs a linked network. We obtain thenumber of posted messages for every user on a daily basis,which forms time series data with a length of 60. There arealso commercial users who post product information. The taskis to identify the normal users from commercial users.

B. Baseline Methods

To show the power of network information in building aclassifier, we compare the proposed NetRLS model with thefollowing baselines:• RLS: A regularized least squares (RLS) model [37] which

only uses time series data for analysis. Therefore, theonly difference between RLS and NetRLS is the networkregularization term. To disable the network term in RLS,we set parameter β as 0. Other parameters are the same asNetRLS (i.e., α = 10, β = 0, λ = 10). Note that we usedthe same source codes as NetRLS to implement RLS andonly set β = 0.

• LTS: A recent shapelet learning method (denoted as LTS)[11]. LTS uses a new mathematical formalization of thetask via an objective function, and a tailored stochasticgradient learning algorithm is used to solve the problem.LTS enables learning near-to-optimal shapelets directlywithout trying many candidates. The LTS performancehas been demonstrated to be better than the other methods[11], e.g., Fast Shapelets (FSH), which is a fast randomprojection technique on the SAX representation [19]. Thesource codes are publicly available in Python4, MAT-LAB5 and Java6. Concretely, the parameters we used inthe experiments are: K = log(total num segments) ∗(num classes− 1), α = −30, η = 0.1, iterations =1000. Other parameters were tuned via cross-validationfrom L = 0.1 or 0.2, R = 2, 3, λ = 0.01 or 0.1.

• BFDSS: A brute-force discriminative shapelet search(BFDSS) method. The most straightforward way of find-ing the discriminative features is the brute-force method[38]. This baseline is based on Euclidian distance be-tween candidate segments and time series. Given timeseries D, we first generate all the segments with lengthl. Then, the algorithm checks how well each segmentcan separate D into different classes. For each shapeletcandidate, the algorithm calculates the information gain

4https://github.com/mohaseeb/shaplets-python5https://github.com/muvic08/shapelets6http://fs.ismll.de/publicspace/LearningShapelets/

Page 7: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

7

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

Accu

racy

NetRLSRLSLTSBFDSSRRIFW

(a) Accuracy with window length t = 3.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

Accu

racy

NetRLSRLSLTSBFDSSRRIFW

(b) Accuracy with window length t = 4.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

Accu

racy

NetRLSRLSLTSBFDSSRRIFW

(c) Accuracy with window length t = 5.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

Accu

racy

NetRLSRLSLTSBFDSSRRIFW

(d) Accuracy with window length t = 6.

Fig. 4. Accuracy comparison on Twitter data set w.r.t. various window lengths and parameter ρ.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

AUC

NetRLSRLSLTSBFDSSRRIFW

(a) AUC with window length t = 3.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

AUC

NetRLSRLSLTSBFDSSRRIFW

(b) AUC with window length t = 4.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

AUC

NetRLSRLSLTSBFDSSRRIFW

(c) AUC with window length t = 5.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

AUC

NetRLSRLSLTSBFDSSRRIFW

(d) AUC with window length t = 6.

Fig. 5. AUC comparison on Twitter data set w.r.t. various window lengths and parameter ρ.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

Accu

racy

NetRLSRLSLTSBFDSSRRIFW

(a) Accuracy with window length t = 2.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

Accu

racy

NetRLSRLSLTSBFDSSRRIFW

(b) Accuracy with window length t = 3.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

Acc

urac

y

NetRLSRLSLTSBFDSSRRIFW

(c) Accuracy with window length t = 4.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

Acc

urac

y

NetRLSRLSLTSBFDSSRRIFW

(d) Accuracy with window length t = 5.

Fig. 6. Accuracy comparison on DBLP data set w.r.t. various window lengths and parameter ρ.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

AUC

NetRLSRLSLTSBFDSSRRIFW

(a) AUC with window length t = 2.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

AU

C

NetRLSRLSLTSBFDSSRRIFW

(b) AUC with window length t = 3.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

AUC

NetRLSRLSLTSBFDSSRRIFW

(c) AUC with window length t = 4.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

AU

C

NetRLSRLSLTSBFDSSRRIFW

(d) AUC with window length t = 5.

Fig. 7. AUC comparison on DBLP data set w.r.t. various window lengths and parameter ρ.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

Acc

urac

y

NetRLSRLSLTSBFDSSRRIFW

(a) Accuracy with window length t = 7.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

Acc

urac

y

NetRLSRLSLTSBFDSSRRIFW

(b) Accuracy with window length t = 8.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

Acc

urac

y

NetRLSRLSLTSBFDSSRRIFW

(c) Accuracy with window length t = 9.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

Acc

urac

y

NetRLSRLSLTSBFDSSRRIFW

(d) Accuracy with window length t = 10.

Fig. 8. Accuracy comparison on the Weibo data set w.r.t. various window length and parameter ρ.

achieved if using that candidate to separate the data.The algorithm returns the candidates with the highestinformation gain as shapelets. Source codes are alsopublicly available7. We used the same length l withNetRLS as shown in Figs. 4-9.

• RRIFW: This is based on a hybrid filter/wrapper fea-

7http://alumni.cs.ucr.edu/ lexiangy/shapelet.html

ture selection technique [39]. RRIFW represents a filter-wrapper approach that considers relevancy, redundancy,and interaction of the candidate inputs. Even though theproblem definitions of [39] are not exactly the same asours, we design a baseline based on their ideas. First, wemeasure the relevancy, redundancy and interaction amongcandidate features in S based on the information-theoretic

Page 8: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

8

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

NetRLSRLSLTSBFDSSRRIFW

AU

C

(a) AUC with window length t = 7.

10 20 30 40 500.6

0.65

0.7

0.75

0.8

Parameter ρ

NetRLSRLSLTSBFDSSRRIFW

AU

C

(b) AUC with window length t = 8.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

NetRLSRLSLTSBFDSSRRIFW

AU

C

(c) AUC with window length t = 9.

10 20 30 40 500.65

0.7

0.75

0.8

Parameter ρ

NetRLSRLSLTSBFDSSRRIFW

AU

C

(d) AUC with window length t = 10.

Fig. 9. AUC comparison on the Weibo data set w.r.t. various window length and parameter ρ.

criteria of mutual information (MI) and interaction gain(IG). Then, we generate a feature subset by maximizingrelevancy and interaction, and minimizing redundancyduring the filtering stage. In the last step, the fine-tuning of the adjustable parameter occurs during thewrapper stage. Since the baseline is related to relevancy,redundancy, interaction, filter and wrapper, we call itRRIFW for short.

C. Measures and Parameter testing

We measure performance through classification accuracy,AUC, t-test, F1-score as well as running time. We used 70%of the data set for training and the remaining 30% for testing.Fig. 3(a) shows the parameter tests with respect to α, β, λ onthe Twitter data set. The parameters are α = 1, β = 1, λ = 1by default. We observed that when the parameter β was set toa small value of 0.1, the model produced the worst result. Thisis because almost all network information is omitted from theanalysis. The sparse terms ‖W‖2 and ‖W‖2,1 achieved thebest results when the parameters λ and β were equal to 10.Thus, we set the parameters as α = 10, β = 10, λ = 10.

D. Accuracy and AUC Comparisons

Fig. 3(b) shows the improvement of NetRLS over RLS onthe Twitter data set. We observe that NetRLS has a consistentlyhigher accuracy than RLS, with the accuracy gap being 3% onaverage. The accuracy gap reflects the power of the networkdata. This is because social time series data contain more noisethan traditional time series data and the network data is usefulin improving performance.

Figs. 4 and 5 show the results of the comparisons ofNetRLS, RLS, LTS, BFDSS and RRIFW on the Twitterdata set. Figs. 6 and 7 show the results on the DBLP data set.The experimental results on the Weibo data set are reportedin Figs. 8 and 9. We compare these methods with differentpairs of parameters: candidate segment (shapelet) space ρ andwindow length t. We use a longer window length on the Weibodata set because the length of the time series is longer thanthe Twitter and DBLP data sets.

From the results, we can conclude that: 1) given a shapeletspace ρ, increasing window length t will not guarantee betterprediction results. When the length is 4 (for the Twitter dataset) and 5 (for the DBLP data set), both NetRLS and RLSobtain relatively better results; 2) given a window length t,increasing ρ will generally improve the performance, but theimprovement is insignificant when ρ > 30. For example, onthe Twitter data set, NetRLS is 0.760 when ρ = 30 and t = 4,

0 5 10 15 200

1

(a) The learned shapelets from the Twitter data set. The left most one is fromsocial robots, the middle one is from VIP users, and the right most one is fromnormal users.

0 5 10 15 200

1

(b) The learned shapelets from the DBLP data set. The left most one is fromnormal authors, while the right most one is from influential authors.

Fig. 10. An illustration of the shapelets learned by NetRSL on the Twitterand DBLP datasets.

which is the same as ρ = 50. Considering that increasing ρwill generate more candidate segments and increase memoryand computation costs, ρ can be set to between 30 and 50; 3)given the same ρ and t, NetRLS obtains higher accuracy thanRLS because NetRLS can model both time series and networkdata and obtain more accurate and robust results; 4) NetRLSoutperforms BFDSS and RRIFW because BFDSS and RRIFWonly consider shapelet distance or mutual information. BFDSSis a straightforward method that generates some redundantand useless features. The reason why we use an exhaustionmethod for finding shapelets as a baseline is that the timeseries is short so therefore the search space is low. Thus, thesimple yet efficient brute-force shapelet search algorithm isapplicable in this case. If the datasets are large or the lengthof the time series is long, the brute-force baseline will beinapplicable. RRIFW considers feature relevancy, redundancyand interaction but fails to involve network information; 5)NetRLS performs better than LTS. LTS shows high accuracyon the UCR time series data sets but produces a differentresult in the social network data sets (Twitter and DBLP).This is because there are undistinguishable users and noisewhen based solely on the time series data. For example, in theTwitter data set, a famous user may have only posted a fewtweets during the date ranges retrieved in our data set. Thiscould easily be categorized as a normal user if the numberof tweets was the only consideration. In the DBLP data set,an author may only focus on the top conferences in datamining, and may contribute significantly to them, but theirtotal number of papers published per year may be limited.Without additional data, this might indicate a classificationas a non-influential author; and 6) from the results on theWeibo data set, we make observations similar to those made onTwitter data set. The classification accuracy improves with an

Page 9: Learning Shapelet Patterns from Network-based Time Series Dataychen/public/TII.pdf · 2019-03-10 · this way, the shapelets are detached from candidate segments and the learned shapelets

1551-3203 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TII.2018.2885700, IEEETransactions on Industrial Informatics

9

TABLE IIT-TEST (p-VALUE) COMPARISON ON THE TWITTER DATA

TwitterAccuracy AUC

t=3 t=4 t=5 t=6 t=3 t=4 t=5 t=6

NetRLS v.s. RLS 0.00100 0.0139 0.0016 0.00097 0.0413 0.0212 0.0295 0.0011

NetRLS v.s. LTS 0.00082 0.0352 0.0500 0.3246 0.0485 0.0422 0.0424 0.0392

NetRLS v.s. BFDSS 0.00018 0.0044 0.0101 0.00009 0.0318 0.0338 0.0200 0.0072

NetRLS v.s. RRIFW 0.00850 0.0491 0.0718 0.0390 0.0496 0.0341 0.2262 0.0441

TABLE IIIT-TEST (p-VALUE) COMPARISON ON THE DBLP DATA

DBLPAccuracy AUC

t=2 t=3 t=4 t=5 t=2 t=3 t=4 t=5

NetRLS v.s. RLS 0.0070 0.0019 0.0441 0.0048 0.2466 0.0285 0.0913 0.0454

NetRLS v.s. LTS 0.3943 0.0381 0.4681 0.3672 0.0494 0.0374 0.0411 0.0482

NetRLS v.s. BFDSS 0.0112 0.0011 0.0130 0.0174 0.0049 0.0206 0.0044 0.0157

NetRLS v.s. RRIFW 0.2635 0.0451 0.0453 0.0446 0.7127 0.7990 0.0494 0.1475

TABLE IVT-TEST (p-VALUE) COMPARISON ON THE WEIBO DATA

WeiboAccuracy AUC

t=7 t=8 t=9 t=10 t=7 t=8 t=9 t=10

NetRLS v.s. RLS 0.00086 0.0081 0.0271 0.0016 0.0023 0.0012 0.0070 0.0051

NetRLS v.s. LTS 0.0042 0.0021 0.0305 0.0474 0.0031 0.0038 0.0086 0.0103

NetRLS v.s. BFDSS 0.0135 0.0237 0.0229 0.0006 0.0035 0.0003 0.0002 0.0038

NetRLS v.s. RRIFW 0.0299 0.2844 0.2080 0.0255 0.0046 0.0340 0.0249 0.0093

TABLE VF1 SCORE UNDER THE PARAMETER ρ = 40 AND VARIOUS WINDOW LENGTH t ON THE DBLP AND TWITTER DATASETS.

DBLP Twitter

t=2 t=3 t=4 t=5 t=3 t=4 t=5 t=6

NetRLS 0.78±0.14 0.79±0.07 0.80±0.06 0.82±0.07 0.77±0.09 0.77±0.02 0.76±0.04 0.78±0.08

RLS 0.75±0.07 0.76±0.09 0.76±0.12 0.75±0.10 0.72±0.07 0.73±0.09 0.74±0.09 0.72±0.07

LTS 0.72±0.13 0.75±0.11 0.75±0.05 0.76±0.11 0.69±0.11 0.72±0.12 0.78±0.10 0.73±0.12

BFDSS 0.73±0.21 0.73±0.08 0.74±0.11 0.75±0.04 0.68±0.18 0.72±0.07 0.75±0.06 0.71±0.14

RRIFW 0.75±0.17 0.75±0.10 0.75±0.14 0.77±0.11 0.74±0.13 0.76±0.12 0.76±0.10 0.73±0.11

TABLE VIF1 SCORE UNDER THE PARAMETER ρ = 40 AND VARIOUS WINDOW

LENGTH t ON THE WEIBO DATASETS.

Weibo

t=7 t=8 t=9 t=10

NetRLS 0.80±0.09 0.81±0.05 0.84±0.12 0.83±0.08

RLS 0.76±0.09 0.79±0.13 0.81±0.06 0.80±0.05

LTS 0.77±0.21 0.81±0.17 0.82±0.13 0.81±0.08

BFDSS 0.75±0.15 0.77±0.09 0.79±0.09 0.75±0.13

RRIFW 0.76±0.07 0.79±0.07 0.82±0.05 0.80±0.03

increase of parameter ρ but remains stable after ρ = 40. This isbecause the algorithms select more discriminative features as ρincreases while reaching the maximum discriminative featuresets when ρ = 40.

We added a t-test comparison to show the significantimprovement of our approach over the other baselines. Theseresults are detailed in Tables II, III and IV. Since a t-test is ananalysis of two populations means through the use of statisticalexamination, it is able to test the difference (or improvement)between the proposed approach and the baselines. T-testreturns a test decision for the null hypothesis and the p-valueof the test. The tables show the p-value from the t-test (witha confidence level of α = 0.05) between NetRLS and each


TABLE VII
COMPARISONS BETWEEN THE GRAPH-BASED BASELINE AND OTHER METHODS.

          NetRLS  RLS   LTS   BFDSS  RRIFW  MMGC
DBLP      0.79    0.76  0.75  0.74   0.75   0.76
Twitter   0.77    0.74  0.78  0.75   0.76   0.73
Weibo     0.84    0.79  0.81  0.76   0.81   0.76


Fig. 10 lists the shapelets learned by NetRLS on the Twitter and DBLP data sets. Fig. 10(a) shows that the social robots exhibit sharply fluctuating shapelets, as they usually post a mass of tweets in a short period of time, while normal users tend to post tweets represented by regularly fluctuating shapelets. VIP users post tweets with gently fluctuating shapelets because they regularly post many tweets. Fig. 10(b) shows that normal authors have subtly changing shapelets, because the number of papers published by a normal author does not change significantly from year to year, while influential authors have shapelets that change greatly due to new coauthors or new topics.

E. F1 Score Comparison

We also compared the F1 scores of the proposed approach and the baselines on the Twitter, DBLP and Weibo datasets, as shown in Tables V and VI. Since the F1 score is the harmonic mean of precision and recall, we use it to measure performance in consideration of both. As observed from Tables V and VI, NetRLS has the best F1 score in most cases, indicating that the proposed algorithm achieves both high precision and high recall. When the window length is 5 on the Twitter data, LTS achieved the best F1 score; this is because LTS outperformed NetRLS in terms of accuracy at t = 5, as shown in Fig. 4(c). The other baselines also achieved their best F1 scores at window length 5 on the Twitter data, which indicates that learned shapelets of length 5 represent and classify the time series well given the characteristics (e.g., the time series length) of the Twitter data. Although LTS had a better F1 score than NetRLS at t = 5 on the Twitter dataset, NetRLS achieved a similar and competitive F1 score in this special case. In addition, we observe that NetRLS has stable F1 performance w.r.t. different window lengths on the Twitter dataset, whereas LTS, which relies only on the time series data, is sensitive to the window length.
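As a reminder of how the metric in Tables V and VI is computed, the following minimal sketch evaluates the F1 score as the harmonic mean of precision and recall; the label vectors here are hypothetical placeholders.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall), binary labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(f1_score(y_true, y_pred))  # precision = 1.0, recall = 0.75 -> F1 ~ 0.857
```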

F. Running Time

A comparison of the CPU running time of the proposed method with its competitors is reported in Fig. 11. From the figure, we can see that the running time generally increases linearly with respect to the size of the data sets, which means the method scales well to large data sets.

[Bar chart: CPU time (seconds), 0 to 2000, for NetRLS, RLS, LTS, BFDSS and RRIFW on the Twitter, DBLP and Weibo data sets.]

Fig. 11. Running time comparisons on various data sets.

From Fig. 11, we can also observe that NetRLS has a lower CPU running time than BFDSS, because NetRLS reduces the search space and converges quickly. NetRLS incurs a running time similar to RLS, LTS and RRIFW. Considering the accuracy improvement, the efficiency of the proposed approach is acceptable.

G. Discussions

1) Comparison with graph-based baseline: Since our data sets include both time series data and graph network data, we also compare the proposed algorithm with a graph-based approach that uses only the network data for the learning task. A recent graph-based approach for social network data is reported in [17]. It handles subgraph features in a streaming fashion because of the high dimensionality of the subgraph feature space, and proposes a max-margin graph classifier that scores each subgraph for subgraph feature selection. In this paper, since the dimensionality of the subgraph features from our data sets is not high, we implemented their max-margin graph classifier without streaming subgraph features (denoted MMGC): we first mine frequent subgraphs with gSpan, and then input all of the mined subgraphs to the max-margin graph classifier for learning. The results are shown in Table VII. For the other approaches, we used ρ = 40 and t = 4 for DBLP, t = 5 for Twitter, and t = 9 for Weibo. From the table, we can see that the graph-based approach performs worse than the proposed method, because the time series data carry significant information for classifying the objects in these data sets. The proposed networked time series model thus exploits rich time series information to improve performance, and both time series data and network data should be used for networked time series learning.
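The following is a minimal sketch of such a pipeline, assuming a hypothetical mine_frequent_subgraphs helper in place of a real gSpan implementation and using a linear SVM as a generic max-margin classifier; the exact scoring rule of MMGC in [17] may differ.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mine_frequent_subgraphs(graphs, min_support):
    """Hypothetical stand-in for a gSpan miner: should return a list of
    frequent subgraph patterns, each exposing occurs_in(graph)."""
    raise NotImplementedError("plug in a real gSpan implementation here")

def subgraph_features(graphs, patterns):
    # Encode each graph as a binary vector of subgraph occurrences.
    return np.array([[1.0 if p.occurs_in(g) else 0.0 for p in patterns]
                     for g in graphs])

def train_mmgc(train_graphs, labels, min_support=0.1):
    patterns = mine_frequent_subgraphs(train_graphs, min_support)
    X = subgraph_features(train_graphs, patterns)
    clf = LinearSVC(C=1.0)  # generic max-margin linear classifier
    clf.fit(X, labels)
    return patterns, clf
```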

2) Selection of shapelets: We discuss two problems regarding the selection of shapelets from networked time series.

First, one may ask: “Can network information alter the shapelets?” In Eq. (2), the optimization problem has only one variable, the classifier weight W, so the network information only changes the classifier boundaries, not the shapelets themselves. In fact, our solution can be seen as a two-step hierarchical approach that first solves Eq. (4) for a trimmed, concise segment space, and then solves Eq. (2) to obtain the shapelets.


In our experiments, the network links are relatively sparse⁸, because social time series data often contain heavy noise and social node labels are usually incorrect. Network link information can somewhat alleviate the noise and mislabeling problems, thus improving classification accuracy.

⁸Fig. 2 shows the 200 nodes and 210 edges. Because we have located all the social robots, all the links between social robots are captured. On the other hand, the edges between real users, including VIP users and ordinary users, are relatively sparse. Even though the network is sparse, we can still observe a performance improvement, as shown in Fig. 3(b).

The second question is: in which cases can the link information alter the shapelets themselves? We can relax Eq. (2) into a more flexible problem that optimizes the objective function over both variables W and S. In that case, the third term, the network regularization L, will also affect the optimal value of S. However, similar to the shapelet learning work in [11], if we also optimize w.r.t. S, we will achieve higher accuracy, but at the cost of obtaining shapelets that may differ slightly from all the segments derived from the time series data.

3) Superiority: From the extensive results in Sections V-D, V-E and V-F, we can see that NetRLS outperforms the other baselines, followed by RLS, while BFDSS has the worst performance in most cases. The superior performance of NetRLS and RLS indicates two points: 1) our feature selection method efficiently selects discriminative shapelet features, since RLS, which retains only the feature selection terms of the objective function, is the second-best method; and 2) the structure information helps to better select shapelet features and improves performance. Since the structure information provides additional evidence for distinguishing different categories of time series, it can refine the shapelets produced by the feature selection part.

VI. CONCLUSION

In this paper, a network regularized least squares feature selection method (NetRLS) was proposed to incorporate network structure information into shapelet selection. Our work does not assume that time series are independent and identically distributed (i.i.d.), hence enabling the use of rich network structure information to improve performance. The experiments and comparisons on real-world Twitter, DBLP and Weibo data demonstrate that NetRLS outperforms representative time series shapelet learning algorithms and is suitable for a wide range of learning tasks. This work suggests some interesting directions for future research: 1) the problem could be further extended by combining a deep learning framework, such as graph embedding with CNN- or RNN-based time series learning, for networked time series learning; and 2) the idea could also be applied to multivariate time series. For example, a multivariate time series can be augmented by a graph that describes how the variables (or nodes) are connected.

APPENDIX

A. Explanation of Eq. (2).

In the objective function Eq. (2), we use the L2,0 norm on W to achieve feature selection, i.e., ‖W‖2,0 ≤ ρ. To select the top-ρ most discriminative shapelets, the objective function can be formulated as follows:

\min_{W} \|Y - W^{T}S\|_{F}^{2} + \alpha\|W\|_{F}^{2} + \beta\,\mathrm{tr}(W^{T}SLS^{T}W)
\quad \text{s.t. } \|W\|_{2,0} \le \rho, \quad \alpha, \beta > 0.    (A.1)

‖W‖2,0 guarantees that at most ρ rows of W are selected. To solve the objective function, we rewrite it as the following regularized problem:

\min_{W} \|Y - W^{T}S\|_{F}^{2} + \alpha\|W\|_{F}^{2} + \beta\,\mathrm{tr}(W^{T}SLS^{T}W) + \lambda\|W\|_{2,0}    (A.2)

However, ‖W‖2,0 makes the objective function in Eq. (2) non-smooth and non-convex. Therefore, we relax ‖W‖2,0 to its convex hull [40], and obtain the following convex problem:

\min_{W} \|Y - W^{T}S\|_{F}^{2} + \alpha\|W\|_{F}^{2} + \beta\,\mathrm{tr}(W^{T}SLS^{T}W)
\quad \text{s.t. } \|W\|_{2,1} \le \rho, \quad \alpha, \beta > 0.    (A.3)
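The paper's exact optimizer is not reproduced here, but one standard way to handle an objective of this form (in the spirit of the composite-gradient methods cited in [35], [36]) is proximal gradient descent on the penalized version, where the proximal operator of the ‖·‖2,1 term is row-wise group soft-thresholding. The sketch below is a minimal illustration under assumed matrix shapes and a fixed step size.

```python
import numpy as np

def prox_l21(W, thresh):
    """Proximal operator of thresh * ||W||_{2,1}: row-wise group
    soft-thresholding. Rows with small L2 norm are zeroed out, which
    induces the row sparsity used for shapelet selection."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(norms, 1e-12))
    return W * scale

def solve_relaxed(S, Y, L, alpha=0.1, beta=0.1, lam=0.1,
                  step=1e-3, n_iter=500):
    """Proximal gradient descent on the penalized form of (A.3):
        min_W ||Y - W^T S||_F^2 + alpha*||W||_F^2
              + beta*tr(W^T S L S^T W) + lam*||W||_{2,1}
    Assumed shapes: S (k x n) segment matrix, Y (c x n) label matrix,
    L (n x n) symmetric graph Laplacian, W (k x c)."""
    k, _ = S.shape
    c = Y.shape[0]
    W = np.zeros((k, c))
    SLS = S @ L @ S.T  # precompute the k x k regularization term
    for _ in range(n_iter):
        R = Y - W.T @ S                      # residual, c x n
        grad = (-2.0 * S @ R.T               # from ||Y - W^T S||_F^2
                + 2.0 * alpha * W            # from alpha*||W||_F^2
                + 2.0 * beta * (SLS @ W))    # from the trace term
        W = prox_l21(W - step * grad, step * lam)
    return W
```

Rows of W that the thresholding drives exactly to zero correspond to discarded segments, so the number of surviving rows plays the role of the budget ρ.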

REFERENCES

[1] Z. Xing, J. Pei, and E. Keogh, “A brief survey on sequence classification,” ACM SIGKDD Explorations Newsletter, vol. 12, no. 1, pp. 40–48, 2010.

[2] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes, “Correlating financial time series with micro-blogging activity,” in Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM), 2012, pp. 513–522.

[3] S. Askari, N. Montazerin, and M. H. F. Zarandi, “High-frequency modeling of natural gas networks from low-frequency nodal meter readings using time-series disaggregation,” IEEE Transactions on Industrial Informatics, vol. 12, no. 1, pp. 136–147, Feb 2016.

[4] M. Emmanuel, R. Rayudu, and I. Welch, “Impacts of power factor control schemes in time series power flow analysis for centralized PV plant using wavelet variability model,” IEEE Transactions on Industrial Informatics, vol. PP, no. 99, pp. 1–1, 2017.

[5] R. Liu, G. Meng, B. Yang, C. Sun, and X. Chen, “Dislocated time series convolutional neural architecture: An intelligent fault diagnosis approach for electric machine,” IEEE Transactions on Industrial Informatics, vol. 13, no. 3, pp. 1310–1320, June 2017.

[6] L. Ye and E. Keogh, “Time series shapelets: A new primitive for data mining,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2009, pp. 947–956.

[7] E. Keogh and C. A. Ratanamahatana, “Exact indexing of dynamic time warping,” Knowledge and Information Systems, vol. 7, no. 3, pp. 358–386, 2005.

[8] L. Ye and E. Keogh, “Time series shapelets: a novel technique that allows accurate, interpretable and fast classification,” Data Mining and Knowledge Discovery, vol. 22, no. 1-2, pp. 149–182, 2011.

[9] L. Zhu, C. Lu, and Y. Sun, “Time series shapelet classification based online short-term voltage stability assessment,” IEEE Transactions on Power Systems, vol. 31, no. 2, pp. 1430–1439, 2016.

[10] T. Rakthanmanon and E. Keogh, “Fast shapelets: A scalable algorithm for discovering time series shapelets,” in Proceedings of the Thirteenth SIAM Conference on Data Mining (SDM), 2013, pp. 668–676.

[11] J. Grabocka, N. Schilling, M. Wistuba, and L. Schmidt-Thieme, “Learning time-series shapelets,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014, pp. 392–401.

[12] G. A. Susto, A. Schirru, S. Pampuri, and S. McLoone, “Supervised aggregative feature extraction for big data time series regression,” IEEE Transactions on Industrial Informatics, vol. 12, no. 3, pp. 1243–1252, 2016.


[13] J. Lines, L. M. Davis, J. Hills, and A. Bagnall, “A shapelet transform for time series classification,” in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2012, pp. 289–297.

[14] J. Grabocka, N. Schilling, M. Wistuba, and L. Schmidt-Thieme, “Learning time-series shapelets,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2014, pp. 392–401.

[15] T. Sakaki, M. Okazaki, and Y. Matsuo, “Tweet analysis for real-time event detection and earthquake reporting system development,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, pp. 919–931, 2013.

[16] K. Nusratullah, S. A. Khan, A. Shah, and W. H. Butt, “Detecting changes in context using time series analysis of social network,” in SAI Intelligent Systems Conference (IntelliSys), 2015, pp. 996–1001.

[17] H. Wang, P. Zhang, X. Zhu, I. W.-H. Tsang, L. Chen, C. Zhang, and X. Wu, “Incremental subgraph feature selection for graph classification,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 1, pp. 128–142, 2017.

[18] X. Kong and S. Y. Philip, “Graph classification in heterogeneous networks,” in Encyclopedia of Social Network Analysis and Mining. Springer, 2014, pp. 641–648.

[19] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh, “Querying and mining of time series data: experimental comparison of representations and distance measures,” Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1542–1552, 2008.

[20] J. Kleinberg, “Bursty and hierarchical structure in streams,” Data Mining and Knowledge Discovery, vol. 7, no. 4, pp. 373–397, 2003.

[21] M. G. Elfeky, W. G. Aref, and A. K. Elmagarmid, “Periodicity detection in time series databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 7, pp. 875–887, 2005.

[22] L. Wei, N. Kumar, V. N. Lolla, E. J. Keogh, S. Lonardi, and C. A. Ratanamahatana, “Assumption-free anomaly detection in time series,” in SSDBM, vol. 5, 2005, pp. 237–242.

[23] B. Liu, J. Li, C. Chen, W. Tan, Q. Chen, and M. Zhou, “Efficient motif discovery for large-scale time series in healthcare,” IEEE Transactions on Industrial Informatics, vol. 11, no. 3, pp. 583–590, 2015.

[24] L. Zhu, C. Lu, Z. Y. Dong, and C. Hong, “Imbalance learning machine based power system short-term voltage stability assessment,” IEEE Transactions on Industrial Informatics, 2017.

[25] D. Yankov, E. Keogh, J. Medina, B. Chiu, and V. Zordan, “Detecting time series motifs under uniform scaling,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2007, pp. 844–853.

[26] A. McGovern, D. H. Rosendahl, R. A. Brown, and K. K. Droegemeier, “Identifying predictive multi-dimensional time series motifs: an application to severe weather prediction,” Data Mining and Knowledge Discovery, vol. 22, no. 1-2, pp. 232–258, 2011.

[27] J. Grabocka, M. Wistuba, and L. Schmidt-Thieme, “Fast classification of univariate and multivariate time series through shapelet discovery,” Knowledge and Information Systems, vol. 49, no. 2, pp. 429–454, 2016.

[28] A. Mueen, E. Keogh, and N. Young, “Logical-shapelets: an expressive primitive for time series classification,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2011, pp. 1154–1162.

[29] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[30] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” The Journal of Machine Learning Research, vol. 7, pp. 2399–2434, 2006.

[31] Q. Gu and J. Han, “Towards feature selection in network,” in Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), 2011, pp. 1175–1184.

[32] S. Boyd, “Convex optimization of graph Laplacian eigenvalues,” in Proceedings of the International Congress of Mathematicians, vol. 3, no. 1-3, 2006, pp. 1311–1319.

[33] F. Chung, L. Lu, and V. Vu, “Spectra of random graphs with given expected degrees,” Proceedings of the National Academy of Sciences, vol. 100, no. 11, pp. 6313–6318, 2003.

[34] K. Costello, “Random walks on directed graphs,” 2005.

[35] Y. Nesterov et al., “Gradient methods for minimizing composite objective function,” 2007.

[36] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.

[37] T. Pahikkala and A. Airola, “RLScore: regularized least-squares learners,” Journal of Machine Learning Research, vol. 17, pp. 1–5, 2016.

[38] L. Ye and E. Keogh, “Time series shapelets: a new primitive for data mining,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2009, pp. 947–956.

[39] O. Abedinia, N. Amjady, and H. Zareipour, “A new feature selection technique for load and price forecast of electrical power systems,” IEEE Transactions on Power Systems, vol. 32, no. 1, pp. 62–74, 2017.

[40] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

Haishuai Wang is an Assistant Professor in the Department of Computer Science and Engineering, Fairfield University, Fairfield, CT, USA. He is also a Visiting Assistant Professor in the Department of Biomedical Informatics at Harvard Medical School. He received his Ph.D. degree in computer science from the Center of Artificial Intelligence at the University of Technology Sydney, Australia. His research focuses on data mining, machine learning, and applications in bioinformatics and social networks.

Jia Wu received the PhD degree in computer science from the University of Technology Sydney, Australia. He is currently a Lecturer in the Department of Computing, Faculty of Science and Engineering, Macquarie University, Sydney, Australia. His research focuses on data mining and machine learning. Since 2009, Dr. Wu has published more than 100 papers in top-tier journals (such as TPAMI, TKDE, TNNLS, TCYB, TKDD) and conferences (such as IJCAI, AAAI, ICDM, SDM, CIKM) in these areas.

Peng Zhang is a Senior Algorithm Staff member at Ant Financial Services Group. Before that, he was a Lecturer at the University of Technology Sydney, Australia. He received his Ph.D. from the Chinese Academy of Sciences in 2009. He has published more than 100 research papers in major artificial intelligence journals and conferences, including TPAMI, TNNLS and TKDE. He continuously serves as a PC member of leading artificial intelligence conferences, such as ICLR, ICML, NIPS, KDD, AAAI and IJCAI.

Yixin Chen received the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 2005. He is currently an Associate Professor of Computer Science at Washington University in St. Louis, St. Louis, MO, USA. His research interests include data mining, machine learning, artificial intelligence, optimization, and cyber-physical systems. He is an Associate Editor for the ACM TIST and the IEEE TKDE, and serves on the Editorial Board of the JAIR.