
Cleanits: A Data Cleaning System for Industrial Time Series

Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, Zijue Li, Jianzhong Li, Hong Gao
Harbin Institute of Technology
92 West Dazhi Street, Harbin, China

[email protected], {wangzh, itx351, lizijue, lijzh, honggao}@hit.edu.cn

ABSTRACT

The large volume of time series generated by machines has enormous value in intelligent industry. Knowledge can be discovered from high-quality time series and used for production optimization and anomaly detection in industry. However, raw sensor data always contain many errors. This requires a sophisticated cleaning strategy and a well-designed system for industrial data cleaning. Motivated by this, we introduce Cleanits, a system for industrial time series cleaning. It implements an integrated cleaning strategy for detecting and repairing three kinds of errors in industrial time series. We develop reliable data cleaning algorithms that consider both the features of industrial time series and domain knowledge. We demonstrate Cleanits with two real datasets from power plants. The system detects and repairs multiple kinds of dirty data precisely, and improves the quality of industrial time series effectively. Cleanits has a friendly user interface, and result visualization along with logs is available during each cleaning process.

PVLDB Reference Format:
Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, Zijue Li, Jianzhong Li, and Hong Gao. Cleanits: A Data Cleaning System for Industrial Time Series. PVLDB, 12(12): 1786-1789, 2019.
DOI: https://doi.org/10.14778/3352063.3352066

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 12, No. 12
ISSN 2150-8097.
DOI: https://doi.org/10.14778/3352063.3352066

1. INTRODUCTION

Industrial big data evaluation plays a key role in intelligent manufacturing. With the rapid growth of data from the industrial internet, knowledge discovery in industrial data contributes to the improvement of industrial processes as well as to anomaly detection. To achieve this, high-quality data is the basic premise for information extraction and valuable knowledge discovery. The demand for high-quality industrial data has grown stricter [1].

The time series generated by sensors is a common form of industrial data. To describe the status of a workflow with multiple machines, multi-dimensional time series are generated, with each dimension (i.e., attribute) belonging to one sensor. In reality, it is challenging to obtain high-quality time series due to the following major quality problems.

• Missing values. Some dimensions of the time series often contain missing values at discrete time points or during a time period. Machine sensors may fail to collect the value at a time point. Short transmission faults are another reason for the absence of these values.

• Inconsistent attribute values. Signal interference may happen when equipment undergoes a working condition transition. As a result, an attribute may record the information of another attribute during a time period. This leads to subsequence inconsistency problems over a certain time duration.

• Abnormal values and subsequences. Imprecise or dirty values are prevalent in industrial time series [2]. Unplanned machine failures give rise to sudden value changes at some time points. Unexpected troubles in sensor recording also lead to short abnormal sequences in various attributes.

Many data cleaning techniques have been developed to solve data quality problems [3, 4], and cleaning platforms are applied in various data repairing tasks. Unfortunately, most of them do not apply to time series cleaning, especially for industrial time series. On the one hand, data cleaning tasks are always industry-specific problems. Cleaning methods hardly perform well without guidance from domain knowledge and an adequate understanding of the industrial process. On the other hand, the data reflects multi-modal working conditions (e.g., smooth, step, and sparse patterns [1]) of machines. Errors in different patterns are difficult to identify or repair without a well-designed cleaning approach.

Thus, comprehensive time series cleaning approaches have become an urgent demand in intelligent manufacturing. Motivated by this, we develop Cleanits, which performs effective data cleaning on multi-dimensional industrial time series. Our system focuses on the following objectives.

(1) Effectiveness in industrial data. After thorough research into industrial practice, we develop three data cleaning functions in Cleanits, namely missing value imputation, matching inconsistent attribute values, and anomaly detection and repairing. These approaches, together with the processing order among them, are verified to be effective in the demo scenarios with real industrial datasets.

(2) Domain knowledge support. Domain knowledge has a significant influence on industrial data cleaning. Cleanits is designed to make use of domain knowledge, including sequence constraints and machine working patterns. Thus, Cleanits provides reliable data repairing results at precise locations in each sequence of the time series.

(3) Customized and user-friendly cleaning modes. Considering the parameter settings and the balance between time cost and the performance of different cleaning strategies, we design multiple data cleaning modes. Users can choose a proper mode according to their data quality requirements.

To summarize, we make the following contributions in this paper. 1) We develop Cleanits, a data cleaning system for industrial time series. 2) Cleanits implements three repairing functions to effectively improve the quality of multi-dimensional time series. 3) Cleanits provides a well-considered interface design for users to carry out customized data cleaning. 4) We run Cleanits on real machine sensor data from two power plants to demonstrate the system functions.

Figure 1: Framework overview of Cleanits
(Diagram: the original time series passes through Missing Values Imputation, Matching Inconsistent Attribute Values, and Anomaly Detection and Repairing to yield high-quality data; the Metrics Analysis and User Interaction modules support these steps, with result visualization and cleaning logs.)

2. SYSTEM OVERVIEW

Figure 1 shows the framework of Cleanits, which has three cleaning functions: 1) missing value imputation, 2) matching inconsistent attribute values, and 3) anomaly detection and repairing. In addition, it has a metrics analysis module for parameter measurement and a user interaction module.

Missing value imputation. In the first step, Cleanits detects missing values in each attribute of the input time series and fills the missing parts so that they satisfy the statistical properties of each attribute. The system provides two imputation strategies: a data-based approach and a model-based approach. In the data-based approach, important sequence features are described semantically by combining domain knowledge with the clean sample data. Cleanits repairs missing values with sequence constraints [5] and short-window variance constraints [3]. It also provides learning-based imputation results for sequences according to the patterns found by sub-sequence pattern mining.

Matching inconsistent attribute values. We then detect and repair inconsistent attribute values. We develop an exact maximum weight matching based on the Kuhn-Munkres (KM) algorithm [6] and a fast greedy matching algorithm. We also provide pruning and optimization for the matching algorithms, so that users can select a cleaning mode that balances time cost against cleaning accuracy.

Anomaly detection and repairing. After inconsistent subsequences have been matched to the right attributes, we conduct anomaly detection and repairing as the third cleaning step. Cleanits not only cleans abnormal values at discrete data points, but also captures abnormal subsequences effectively. During this step, users can see exactly where the abnormal cases occur. Cleanits visualizes the detected anomalies, enabling users to modify and label anomaly cases via the interaction module. The system allows users to upload manual repair results, and these labelled data are applied in the later training process.

Considering the complex working conditions reflected by industrial data, we design three kinds of anomaly repairing methods: statistics-based, model-based, and data-based approaches. The system chooses proper algorithms among them according to both statistical analysis and the cleaning mode selected by users.

Metrics analysis. This module provides measurements for the data cleaning functions. Cleanits executes four computing sections in this module: statistical computation, sub-sequence pattern mining, correlation analysis, and time cost computation.

Statistical computation. Important statistical indicator values are computed to capture sequence features reliably. These indicators guide the repair of errors and are further used to verify the reliability of the cleaning results.

Subsequence pattern mining. Various working conditions of machines appear as multi-modal sequence patterns in the whole dataset. We propose a subsequence pattern mining process to capture different patterns (e.g., stationary, floating, and smooth patterns) in each attribute. The system decides on appropriate cleaning solutions according to the features of different sequence patterns. Thus, it provides targeted cleaning results instead of generic ones.

Correlation analysis. Since the data records the working conditions of a machine group, some attributes may be related. On the one hand, some attributes record data belonging to the same machine, so they describe similar sequence patterns. On the other hand, some attributes record collaborative working patterns among a local machine group, so their values are likely to be correlated with each other. We compute the correlation among attributes according to both domain knowledge and statistical metrics. Thus, dirty data in an attribute can be repaired effectively according to its strongly correlated attributes. With correlation analysis on multi-dimensional time series, Cleanits achieves effectiveness in both detection and repairing.

User interaction. This module is designed for easy-to-use interaction between users and the system. Before the cleaning process starts, users can upload a sample of clean data. These labelled training data are used in the model-based cleaning methods. In addition, domain knowledge (e.g., sequence constraints, working state transitions, and attribute correlations) contributes to accurate error detection.

During the cleaning process, users can set parameters depending on their own requirements. Since some cleaning functions have quite a few parameters, Cleanits also offers default settings. We design four cleaning modes for users: 1) Overall mode implements a thorough cleaning with all parameters set by users. It achieves good cleaning performance, but has the highest time cost; 2) High-efficiency mode implements a fast coarse-grained cleaning with the least time cost. Only a small part of the parameters are determined manually; 3) Automation mode runs all cleaning functions with default parameter settings. All repairing methods are determined by the system, and users are not involved in the cleaning process; and 4) Time restriction mode repairs data within a running time limit set by users. The cleaning result reflects the balance between the time cost and the cleaning granularity.

3. IMPLEMENTATIONS

In this section, we first introduce basic definitions in time series quality, and then discuss the cleaning functions designed in Cleanits. Note that missing values can be detected easily by checking for null values between space characters in sequences, and the imputation strategy is similar to abnormal value repairing. We therefore focus on two cleaning functions and introduce matching inconsistent attribute values in Section 3.2 and anomaly repairing in Section 3.3.
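The null-value check mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the space delimiter, and the linear interpolation rule are hypothetical stand-ins, whereas Cleanits itself imputes with constraint-based and learning-based methods.

```python
from typing import List, Optional


def parse_sequence(line: str, sep: str = " ") -> List[Optional[float]]:
    """Split one sequence; an empty token between separators is a missing value."""
    return [float(tok) if tok else None for tok in line.split(sep)]


def missing_positions(values: List[Optional[float]]) -> List[int]:
    """Indices of the elements whose value is absent."""
    return [i for i, v in enumerate(values) if v is None]


def interpolate_missing(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill gaps linearly between the nearest known neighbours (assumed rule)."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for i in missing_positions(values):
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is not None and right is not None:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
        elif left is not None:      # gap at the end: carry the last value forward
            filled[i] = values[left]
        elif right is not None:     # gap at the start: carry the first value back
            filled[i] = values[right]
    return filled


row = parse_sequence("1.0 2.0  4.0")   # two spaces in a row mark one missing value
print(missing_positions(row))
print(interpolate_missing(row))
```

A constraint-aware imputation would additionally check each filled value against the sequence and variance constraints before accepting it.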

3.1 Preliminaries

A time series $S \in \mathbb{R}^{N \times M}$ is denoted by $S = \{S_1, \ldots, S_M\}$, where $M = |attr(S)|$ is the total number of attributes. $S_i = \langle s_1, \ldots, s_N \rangle$, $i \in [1, M]$, is a sequence on attribute $A_i$ of $S$, where $N = |S_i|$ is the total number of elements in $S_i$, i.e., the length of $S_i$. Each element is $s_j = (x_j, t_j)$, $j \in [1, N]$, where $x_j$ is a real-valued number with timestamp $t_j$, and $(j < k) \Leftrightarrow (t_j < t_k)$. A subsequence $S_{i.[l,n]} = \langle s_l, \ldots, s_n \rangle$ is a contiguous part of $S_i$ beginning at element $s_l$ and ending at $s_n$. $T_{[l:n]}$ is the time interval of $S_{i.[l,n]}$.

The inconsistency problem. A time interval $T_{[l:n]}$ is inconsistent iff it satisfies: (i) $\exists S_{i.[l,n]} \nsubseteq S_i$, but $S_{i.[l,n]} \subseteq S_j$ ($i, j \in [1, M]$ and $i \neq j$), where each $S_{i.[l,n]} \nsubseteq S_i$ is one mismatch in $T_{[l:n]}$ and the total number of such mismatches is no less than 2, and (ii) the length of $T_{[l:n]}$ is no less than a given threshold, i.e., $|n - l + 1| \geq \delta$.

The anomaly problem. For a sequence $S_i = \langle s_1, \ldots, s_N \rangle$ in $S$, $F_1(s) = [f_l, f_u]$ is a value prediction function of the element $s$ in $S_i$, where $f_l$ (resp. $f_u$) represents the minimum (resp. maximum) expected value of $s$. $s = (x, t)$ is identified as an abnormal point iff $x \notin [f_l, f_u]$. $F_2(S_{[l,n]})$ is a pattern prediction function of the subsequence $S_{[l,n]}$. $S_{[l,n]}$ is identified as a $k$-length abnormal sequence iff $Dist(S_{[l,n]}, F_2(S_{[l,n]}))$ is larger than a required distance threshold.
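To make the two anomaly definitions concrete, here is a minimal Python sketch. The prediction function `median_band` (a band of fixed tolerance around the sequence median) and the Euclidean choice for Dist are assumptions made only for illustration; the paper does not fix either of them at this point.

```python
import math
from statistics import median
from typing import Callable, List, Sequence, Tuple

# F1 maps (sequence, position) to the expected value range [f_l, f_u].
RangePredictor = Callable[[Sequence[float], int], Tuple[float, float]]


def abnormal_points(xs: Sequence[float], f1: RangePredictor) -> List[int]:
    """Indices j whose value x_j falls outside [f_l, f_u] = F1(s_j)."""
    flagged = []
    for j, x in enumerate(xs):
        f_l, f_u = f1(xs, j)
        if not (f_l <= x <= f_u):
            flagged.append(j)
    return flagged


def is_abnormal_subsequence(sub: Sequence[float],
                            predicted: Sequence[float],
                            threshold: float) -> bool:
    """True iff Dist(sub, F2(sub)) exceeds the threshold (Euclidean Dist assumed)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(sub, predicted)))
    return dist > threshold


def median_band(tolerance: float) -> RangePredictor:
    """Hypothetical F1: the sequence median plus or minus a fixed tolerance."""
    def f1(xs: Sequence[float], j: int) -> Tuple[float, float]:
        m = median(xs)
        return (m - tolerance, m + tolerance)
    return f1


print(abnormal_points([5.0, 5.2, 9.0, 5.1], median_band(1.0)))
print(is_abnormal_subsequence([1.0, 2.0, 3.0], [1.0, 2.0, 7.0], 0.5))
```

Passing the predictor as a function keeps the detection loop independent of how the expected range is obtained, whether from constraints, statistics, or a trained model.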

Figure 2: Matching inconsistent subsequences
(Diagram labels: training data; inconsistent time series; statistical computing; feature vectors extraction (the mean vector, the slope vector, the average critical difference); similarity matrix computing; training classifier (random forest); classifier prediction; confidence computing; instance distributions; bipartite graphs; matching on bipartite graphs (KM algorithm, greedy algorithm); parameter settings; consistent time series.)

3.2 Matching Inconsistent Attributes

The inconsistency repair solution in Cleanits is shown in Figure 2. We first make classifier predictions and then match inconsistent subsequences to their corresponding attributes. Each sequence is treated as a class, with several feature vectors extracted from the computed similarity matrix. We build the classifier on random forest, considering its efficiency on large-scale data and its high performance on multi-dimensional time series.

When classifier prediction begins, we compute instance distributions for each sequence, i.e., the set of confidence values of subsequences matched to each attribute. Accordingly, we construct a bipartite graph $G = (N_{S'}, N_S, E, W)$ and use the confidence as the weight of an edge. That is, $w(S'_i, S_j)$ is the confidence of matching $S'_i$ to $S_j$, where $S'_i \in N_{S'}$ and $S_j \in N_S$.

We design two matching approaches to achieve precise consistency repairing, chosen according to the distribution of matching confidences. If most of the matching weights are high, we use the KM algorithm to obtain a matching with the maximum sum of weights over the edges of $G$. If there exist several low matching confidences, we use a greedy matching strategy. We first select the edge $(S'_i, S_j)$ from $E(G)$ with $\max_{S'_i \in S', S_j \in S} w(S'_i, S_j)$ and add it to the result set. We then delete all edges incident to either $S'_i$ or $S_j$. We repeat these steps until $E(G)$ is empty, and obtain a consistent time series.
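The greedy strategy described above can be sketched in a few lines of Python. The weight matrix and function names are hypothetical. For comparison, `optimal_matching` finds the maximum-weight matching by brute force over permutations; it is only a stand-in for the KM algorithm on tiny square inputs, not an implementation of it.

```python
from itertools import permutations
from typing import Dict, List


def greedy_matching(weights: List[List[float]]) -> Dict[int, int]:
    """weights[i][j] is the confidence of matching subsequence S'_i to attribute S_j.

    Visiting edges in descending weight and skipping any edge whose endpoint is
    already matched is equivalent to repeatedly taking the heaviest remaining
    edge and deleting all edges incident to its endpoints.
    """
    edges = sorted(
        ((w, i, j) for i, row in enumerate(weights) for j, w in enumerate(row)),
        key=lambda e: -e[0],
    )
    used_rows, used_cols, result = set(), set(), {}
    for w, i, j in edges:
        if i not in used_rows and j not in used_cols:
            result[i] = j
            used_rows.add(i)
            used_cols.add(j)
    return result


def optimal_matching(weights: List[List[float]]) -> Dict[int, int]:
    """Exact maximum-weight matching by exhaustive search (square matrix only)."""
    n = len(weights)
    best = max(permutations(range(n)),
               key=lambda p: sum(weights[i][p[i]] for i in range(n)))
    return dict(enumerate(best))


W = [[0.9, 0.8],
     [0.8, 0.1]]
print(greedy_matching(W))    # takes the 0.9 edge first, leaving only the 0.1 edge
print(optimal_matching(W))   # prefers the two 0.8 edges, total weight 1.6
```

The example shows why both approaches are offered: the greedy result is fast to compute but can lock in one strong edge at the expense of the total weight, which an exact matching avoids.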

3.3 Anomaly Detection and Repairing

An unexpected change in either the value or the pattern is recognized as an anomaly in industrial time series [1]. We first recognize machine working conditions and divide each sequence into intervals according to different working patterns. After that, we detect and repair abnormal data points and subsequences.

For abnormal data point repairing, we identify unexpected values according to both sequential dependencies (sd) [5] and the window-variance constraints proposed in our previous work [7]. An sd expresses the required value difference between two consecutive time points, i.e., $sd_i : T \rightarrow_{[G_1, G_2]} S_i$, where $0 < G_1 \leq G_2$, $S_i \in S$, and $T$ is the timestamp set. A $w$-variance constraint describes the required variance value $\delta$ for a $w$-length sequence. The value $s_i$ at $t_i$ is considered an abnormal point if the variance exceeds $\delta$ in $k$ of the $w$ intervals involving $s_i$, i.e., $s_{[i-w+1,i]}, s_{[i-w+2,i+1]}, \ldots, s_{[i,i+w-1]}$. After detection, we use statistics-based methods as well as the sd solution to repair abnormal points with the maximum likelihood defined in our models.
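The two point-level checks can be sketched as follows. Several details here are assumptions rather than statements from the paper: the gap of an sd is taken as the absolute difference of consecutive values, "variance over δ in k of the w intervals" is read as "at least k windows", population variance is used, and windows that would run past either end of the sequence are skipped.

```python
from statistics import pvariance
from typing import List, Sequence, Tuple


def sd_violations(xs: Sequence[float], g1: float, g2: float) -> List[Tuple[int, int]]:
    """Pairs (j, j+1) whose gap |x_{j+1} - x_j| lies outside [g1, g2]."""
    return [(j, j + 1) for j in range(len(xs) - 1)
            if not (g1 <= abs(xs[j + 1] - xs[j]) <= g2)]


def variance_abnormal_points(xs: Sequence[float], w: int,
                             delta: float, k: int) -> List[int]:
    """Indices i for which at least k of the w-length windows containing i
    have a variance above delta."""
    flagged = []
    for i in range(len(xs)):
        over = 0
        for start in range(i - w + 1, i + 1):
            if start < 0 or start + w > len(xs):
                continue  # window would leave the sequence
            if pvariance(xs[start:start + w]) > delta:
                over += 1
        if over >= k:
            flagged.append(i)
    return flagged


series = [1.0] * 5 + [10.0] + [1.0] * 4      # one spike at index 5
print(variance_abnormal_points(series, w=3, delta=1.0, k=3))
print(sd_violations([1.0, 1.1, 1.2, 5.0, 1.3], g1=0.05, g2=0.5))
```

Requiring all three windows (k = 3 with w = 3) isolates the spike itself: its neighbours sit in only two high-variance windows and are therefore not flagged.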

Figure 3: System Interfaces
(Panels: (a) Correlation and pattern mining; (b) Repairing abnormal data points; (c) Repairing abnormal subsequences.)

In abnormal subsequence repairing, we check pattern transitions in each divided sequence interval. Abnormal patterns are repaired with the known pattern transition model. Within each interval, we compute attribute correlations and detect candidate abnormal subsequences with Bollinger bands [8]. We then use an LSTM-based training approach to determine and repair the truly abnormal data.
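A minimal sketch of Bollinger-band candidate detection within one pattern interval is given below. It assumes a variant in which the band for each point is built from the preceding `window` values (moving average plus or minus `k` standard deviations), so that the point under test does not widen its own band; the window size, `k`, and the grouping of flagged points into runs are illustrative choices, not parameters reported for Cleanits.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def bollinger_candidates(xs: Sequence[float], window: int = 20,
                         k: float = 2.0) -> List[int]:
    """Indices whose value leaves the band mean +/- k * std of the previous
    `window` values."""
    flagged = []
    for i in range(window, len(xs)):
        history = xs[i - window:i]
        mid = mean(history)
        width = k * pstdev(history)
        if not (mid - width <= xs[i] <= mid + width):
            flagged.append(i)
    return flagged


def group_runs(indices: Sequence[int]) -> List[Tuple[int, int]]:
    """Merge consecutive indices into (start, end) candidate subsequences."""
    runs: List[Tuple[int, int]] = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


values = [10.0, 10.2, 10.0, 10.2, 10.0, 10.2, 15.0, 10.0, 10.2]
flagged = bollinger_candidates(values, window=4, k=2.0)
print(flagged, group_runs(flagged))
```

The resulting runs are only candidates; in the pipeline described above, a trained model then decides which of them are truly abnormal.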

4. DEMONSTRATIONS

We intend to demonstrate all three data cleaning functions with the visualization of each cleaning step, and show how the system works with real-life industrial data.

Data sources. We demonstrate the whole data cleaning process with two real-life industrial time series.

(1) Temperature control system data. The data comes from a large-scale fossil-fuel power plant, with 80 attributes describing the water temperature control machine group. We process and analyze sequences on 1050K time points.

(2) Fans group system data. The data comes from a wind power plant and has 150 attributes describing the working conditions of fan-machine groups. Data is collected every 5 seconds, and 1620K time points have been used in the system.

Demo scenarios. Cleanits first allows users to input the clean sample data and the semantic domain knowledge. After that, users upload the industrial time series and choose one cleaning mode. As shown in Fig. 3(a), Cleanits first runs sequence pattern mining and correlation analysis on the time series, and logs begin to be generated. Attributes with strong correlation are listed on the page, and different working patterns are recognized and shown in the right bar in Fig. 3(a). Accordingly, different subsequence patterns are marked in different colors in each sequence.

The three cleaning functions are executed in order. In the missing value imputation step, users can set the value range for sequence constraints and the window size for variance constraints. Cleanits repairs the missing data with the methods discussed in Section 2. After that, the system begins to match inconsistent attribute values. Users can check the graphs of inconsistent subsequences and the matching result by clicking the corresponding box in the left-side column. Cleanits allows users to determine whether the matching result is correct. Once the attributes are consistent, the metrics (e.g., statistical indicators and correlation analysis) are updated.

After both the missing values and the inconsistencies have been repaired, anomaly detection starts. As shown in Fig. 3(b), users can set parameters in the right-side bar. Abnormal data are highlighted in the sequence shown in the middle of the page (Fig. 3(b) and 3(c)), so users can see the anomalous parts clearly in the sequence visualization. Figure 3(c) shows that Cleanits provides recommended repairing results and also allows users to determine the abnormal parts and repair the values manually. Abnormal values modified by users are treated as labelled sample data, which are used to optimize the training models in the cleaning functions for later data cleaning tasks.

5. ACKNOWLEDGMENTS

This work is supported by National Key Research and Development Program of China grant 2016YFB1000703; NSFC grants U1509216, U1866602, and 61602129; and Microsoft Research Asia. Hongzhi Wang and Xiaoou Ding contributed to the work equally and should be regarded as co-first authors. Hongzhi Wang is the corresponding author.

6. REFERENCES

[1] Meir Toledano, Ira Cohen, Yonatan Ben-Simhon, and Inbal Tadeski. Real-time anomaly detection system for time series at scale. In Proceedings of the KDD Workshop on Anomaly Detection, pages 56–65, 2017.

[2] Aoqian Zhang, Shaoxu Song, Jianmin Wang, and Philip S. Yu. Time series data cleaning: From anomaly detection to anomaly repairing. PVLDB, 10(10):1046–1057, 2017.

[3] Aoqian Zhang, Shaoxu Song, and Jianmin Wang. Sequential data cleaning: A statistical approach. In Proceedings of the International Conference on Management of Data, SIGMOD Conference, pages 909–924, 2016.

[4] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Re. Holoclean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017.

[5] Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, and Divesh Srivastava. Sequential dependencies. PVLDB, 2(1):574–585, 2009.

[6] James Munkres. Algorithms for the assignment and transportation problems. Journal of the Society for Industrial and Applied Mathematics, 5(1):32–38, 1957.

[7] Wei Yin, Tianbai Yue, Hongzhi Wang, Yanhao Huang, and Yaping Li. Time series cleaning under variance constraints. In Database Systems for Advanced Applications - DASFAA International Workshops, pages 108–113, 2018.

[8] Bar Koer. Bollinger bands approach on boosting abc algorithm and its variants. Applied Soft Computing, 49:292–312, 2016.
