HAL Id: hal-01609256
https://hal.archives-ouvertes.fr/hal-01609256
Submitted on 3 Oct 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Dynamic time warping-based imputation for univariate time series data
Thi-Thu-Hong Phan, Emilie Poisson Caillault, Alain Lefebvre, André Bigand

To cite this version: Thi-Thu-Hong Phan, Emilie Poisson Caillault, Alain Lefebvre, André Bigand. Dynamic time warping-based imputation for univariate time series data. Pattern Recognition Letters, Elsevier, 2017, 10.1016/j.patrec.2017.08.019. hal-01609256



Dynamic Time Warping-based imputation for univariate time series data

Thi-Thu-Hong PHAN (a,b,∗), Émilie POISSON CAILLAULT (a,c,∗), Alain LEFEBVRE (c), André BIGAND (a)

(a) Univ. Littoral Côte d’Opale, EA 4491-LISIC, F-62228 Calais, France
(b) Vietnam National University of Agriculture, Department of Computer Science, Hanoi, Vietnam
(c) IFREMER, LER BL, F-62321 Boulogne-sur-mer, France

Abstract

Time series with missing values occur in almost every domain of applied sciences. Ignoring missing values can lead to a loss of efficiency and unreliable results, especially for large missing sub-sequence(s). This paper proposes an approach to fill in large gap(s) within time series data under the assumption of effective information. To impute the missing values, we find the sub-sequence most similar to the sub-sequence before (resp. after) the missing values, then complete the gap with the next (resp. previous) sub-sequence of the most similar one. The Dynamic Time Warping algorithm is applied to compare sub-sequences, and is combined with a shape-feature extraction algorithm to reduce insignificant solutions. Eight well-known, real-world data sets are used to evaluate the performance of the proposed approach in comparison with five other methods on different indicators. The results show that our approach is the most robust one for time series data having high auto-correlation and cross-correlation, strong seasonality, large gap(s), and complex distribution.

Keywords: Imputation, Missing data, Univariate time series, DTW, Similarity

1. Introduction

Recent advances in monitoring systems, communication and information technology, storage capacity and remote sensing systems make it possible to consider huge time series databases. These databases have been collected over many years with intraday samplings. However, they are usually incomplete due to sensor failures, communication/transmission problems or bad weather conditions for manual measures or maintenance. This is particularly the case for marine samples (Rousseeuw et al. (2013), Ceong et al. (2012)). Incomplete data are problematic (Gómez-Carracedo et al. (2014)) because most data analysis algorithms and most statistical software packages are not designed to handle this kind of data.

∗Corresponding authors. Email addresses: [email protected] (Thi-Thu-Hong PHAN), [email protected] (Émilie POISSON CAILLAULT)

Preprint submitted to Pattern Recognition Letters, October 3, 2017

Let us consider some terminology and a real marine data set to illustrate the problem. A time series x = {x_t | t = 1, 2, ..., N} is a set of N observations successively indexed in time, occurring at uniform intervals. A single hole at index t is an isolated missing value where the observations at times t − 1 and t + 1 are available; we note x_t = NA (NA stands for not available). A hole of size T, also called a gap, is an interval [t : t + T − 1] of consecutive missing values and is denoted x[t : t + T − 1] = NA. We define a gap as large when T is larger than the known-process change, so the threshold depends on each application. At the MAREL Carnot station, a marine water monitoring platform in the eastern English Channel, France (Lefebvre (2015)), 19 large time series, such as fluorescence, turbidity and oxygen saturation, are collected every 20 minutes. These data contain single holes and large gaps. For example, the oxygen saturation series has 131,472 observations, of which only 81.9% are available. This series comprises 4,004 isolated missing values and many runs of consecutive missing data. The size of these gaps varies from one hour to a few months; the largest gap has 3,044 points, corresponding to 42 days. Single holes and gaps with T shorter than the tide duration (807 missing points) could easily be replaced by local averages. For the other gaps, the phytoplankton bloom dynamics or composition changes too fast to use linear or spline imputation methods.
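The definitions of single holes and gaps above can be illustrated with a minimal Python sketch. It assumes that missing values are encoded as None or NaN and uses 0-based indices (the text uses 1-based indices); the function name find_gaps is hypothetical and not part of the paper.

```python
import math


def find_gaps(series):
    """Return (start, size) pairs, one per run of consecutive missing values.

    Assumption: a value is missing when it is None or a float NaN.
    Indices are 0-based.
    """
    gaps = []
    start = None
    for t, value in enumerate(series):
        missing = value is None or (isinstance(value, float) and math.isnan(value))
        if missing and start is None:
            start = t                      # a new run of missing values begins
        elif not missing and start is not None:
            gaps.append((start, t - start))  # the run ended just before t
            start = None
    if start is not None:                  # the series ends inside a gap
        gaps.append((start, len(series) - start))
    return gaps


x = [1.0, None, 3.0, 4.0, None, None, None, 8.0]
gaps = find_gaps(x)
single_holes = [g for g in gaps if g[1] == 1]  # isolated missing values
larger_gaps = [g for g in gaps if g[1] > 1]    # gaps of size T > 1
```

Whether a gap counts as large still depends on the application-specific threshold discussed in the text; the sketch only separates single holes from longer runs.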

Another classical solution consists in ignoring missing data or in listwise deletion. But it is easy to imagine that this drastic solution may lead to serious problems, especially for time series data (where the considered values depend on past values). The first potential consequence of this method is information loss, which can reduce efficiency (Noor et al. (2014)). The second consequence is systematic differences between observed and unobserved data, which lead to biased and unreliable results (Hawthorne and Elliott (2005)).

Therefore, it is crucial to propose a new technique to estimate missing values. One promising approach to solving missing data problems is the adoption of imputation techniques (Junninen et al. (2004)). These techniques should ensure that the obtained results are efficient (having minimal standard errors) and reliable (effective, respecting the curve shape).

To our knowledge, there is no application for filling large gap(s) in univariate time series. We therefore investigate and propose an algorithm to complete large gap(s) of univariate time series based on Dynamic Time Warping (Sakoe and Chiba (1978)). We do not deal with all the missing data over the entire series; instead, we focus on each large gap, where a change in series shape could occur over the duration of the gap. Further, the distribution of missing values or of the entire signal could be very difficult to estimate, so it is necessary to make some assumptions. Our approach assumes that the information about missing values exists within the univariate time series, and it takes into account the time series characteristics.

This paper is organized as follows. First, we discuss the related work in Section 2. The analysis of time series data is discussed in Section 3. The proposed approach is introduced in Section 4. Experimental results and discussion on 8 data sets are presented in Section 5. The conclusion is set out in Section 6.

2. Related work

In the literature, missing data mechanisms are divided into three categories, each based on one possible cause: "Missing data are completely random" (Missing Completely At Random, MCAR), "Missing data are random" (Missing At Random, MAR) and "Missing data are not random" (Not Missing At Random, NMAR) (Little and Rubin (2014)). Understanding the causes that produce missing data is important when developing an imputation task, because it helps to select an appropriate imputation algorithm (Moritz et al. (2015)). In practice, however, understanding the causes remains challenging when the missing data cannot be known at all, or when these data have a complex distribution (Gómez-Carracedo et al. (2014)). Similarly, assigning sub-sequences of missing values to a category can be blurry (Moritz et al. (2015)). Most current research focuses on the three types of missing data defined above in order to find corresponding imputation methods. Regarding imputation methods, a large number of successful approaches have been proposed for completing missing data.

Concerning the imputation task for multivariate time series, many studies have used machine learning techniques, such as Shah et al. (2014), Liao et al. (2014), Rahman et al. (2015), and model-based techniques, such as Raghunathan and Siscovick (1996), Schafer (1997), Van Buuren et al. (1999), Raghunathan et al. (2001), Royston (2007), Joseph et al. (2009), Stuart et al. (2009), Lee and Carlin (2010), Spratt et al. (2010), Gelman et al. (2015), Deng et al. (2016). The efficiency of these algorithms is based on correlations between signals or their features, and missing values are estimated from the observed values. However, handling missing values within univariate time series data differs from multivariate time series techniques: we can rely only on the available values of this single variable to estimate the incomplete values of the time series. Moritz et al. (2015) showed that imputing univariate time series data is a particularly challenging task.

Fewer studies are devoted to the imputation task for univariate time series. Allison (2001) and Bishop (2006) proposed simply substituting the mean or the median of the available values for each missing value. These simple algorithms provide the same result for all missing values, leading to biased results and underestimated standard errors (Crawford et al. (1995), Sterne et al. (2009)). Other imputation techniques for univariate time series are linear interpolation, spline interpolation and nearest neighbor interpolation. These techniques were studied for missing data imputation in air quality data sets (Junninen et al. (2004)). The results showed that univariate methods depend on the size of the gap in time: the larger the gap, the less effective the technique. Walter et al. (Walter.O et al. (2013)) compared the performance of three methods for univariate time series, namely ARIMA (Autoregressive Integrated Moving Average), SARIMA (Seasonal ARIMA), and linear regression. The linear regression method was more efficient and effective than the other two methods, but only when the data were rearranged in periods. This study treated non-stationary seasonal time series data but did not take into account series without seasonality. Chiewchanwattana et al. proposed the Varied-Window Similarity Measure (VWSM) algorithm (Chiewchanwattana et al. (2007)). This method performs better than spline interpolation, multiple imputation, and the optimal completion strategy fuzzy c-means algorithms. However, this research focused only on filling one isolated missing value and did not consider missing sub-sequences. Moritz et al. (2015) gave an overview of univariate time series imputation, comparing six imputation methods. Nevertheless, this study considered only the MCAR type.

3. Time series characterization

Filling large gaps within time series first requires characterizing the data. This step extracts useful information from the data set and makes it easier to exploit. The four specific components of time series are trend, seasonal, cyclical and random change:

1. Trend component: the change of the variable(s) over a long monitoring period. If there is a trend within the time series data (i.e. in the data on average), the measurements tend to increase (or decrease) over time.

2. Seasonal component: this component takes into account intra-interval fluctuations, meaning that there is a regular and repeated pattern of peaks and valleys within the time series related to a calendar period such as seasons, quarters, months, weekdays, and so on.

3. Cyclical component: this component is similar to the seasonal one; the difference is that its cycle duration is more than one year.

4. Random change component: this component considers random fluctuations around the trend; these can affect the cyclical and seasonal variations of the observed sequence, but they cannot be predicted from previous data (the past of the time series).

There are different techniques to decompose time series into components. “Decompose a time series into seasonal, trend and irregular components using moving averages” (R stats package, R Core Team (2016)) is the most common technique. In this study, we use this technique to analyze time series data.
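To make the moving-average decomposition more concrete, here is a simplified Python sketch of a classical additive decomposition. It is an illustration under stated assumptions, not the R implementation used in the study: the function names are hypothetical, the model is assumed additive, the series is assumed complete, and it should cover at least two full periods.

```python
def moving_average_trend(x, period):
    """Centered moving average; None where the window does not fit."""
    if period % 2 == 1:
        weights = [1.0 / period] * period
    else:
        # even period: window of period + 1 points with half weights at both ends
        weights = [0.5 / period] + [1.0 / period] * (period - 1) + [0.5 / period]
    half = len(weights) // 2
    trend = [None] * len(x)
    for t in range(half, len(x) - half):
        trend[t] = sum(w * x[t - half + k] for k, w in enumerate(weights))
    return trend


def decompose_additive(x, period):
    """Split x into trend, seasonal and irregular parts (x = trend + seasonal + irregular)."""
    trend = moving_average_trend(x, period)
    detrended = [xi - ti if ti is not None else None for xi, ti in zip(x, trend)]
    # average the detrended values at each position within the period
    phase_means = []
    for phase in range(period):
        values = [d for d in detrended[phase::period] if d is not None]
        phase_means.append(sum(values) / len(values))
    offset = sum(phase_means) / period  # center the seasonal figures on zero
    seasonal = [phase_means[t % period] - offset for t in range(len(x))]
    irregular = [xi - ti - si if ti is not None else None
                 for xi, ti, si in zip(x, trend, seasonal)]
    return trend, seasonal, irregular


# toy series: linear trend plus a zero-mean pattern of period 4
pattern = [1.0, -1.0, 2.0, -2.0]
series = [0.5 * t + pattern[t % 4] for t in range(24)]
trend, seasonal, irregular = decompose_additive(series, 4)
```

On this toy series the sketch recovers the linear trend and the repeating pattern, and the irregular part is zero wherever the trend is defined.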

The auto-correlation function (ACF) provides an additional important indication of the properties of time series (i.e. how past and future data points are related). Therefore, it can be used to identify the possible structure of time series data, and to create reliable forecasts and imputations (Moritz et al. (2015)). High auto-correlation values mean that the future is strongly correlated with the past. Fig. 1 shows the auto-correlation of the Mackey-Glass chaotic, water level and Google data sets in our experiment.
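As an illustration of how such auto-correlation values can be obtained, the following Python sketch computes the usual sample ACF of a complete series. The function name acf is hypothetical, and the sketch assumes a series without missing values and with non-zero variance.

```python
import math


def acf(x, max_lag):
    """Sample auto-correlation of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return [
        sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


# toy example: a sine wave with a period of 12 samples
wave = [math.sin(2 * math.pi * t / 12) for t in range(120)]
r = acf(wave, 12)
```

For this periodic toy series the ACF is strongly positive at lag 12 (one full period) and strongly negative at lag 6 (half a period), which is the kind of structure that makes past values informative about missing ones.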

4. The proposed method - DTWBI

In this part, we present a new method for imputing missing values of univariate time series data.

A time series x is referred to as an incomplete time series when it contains missing values (values that are Not Available, NA). Recall that the portion of a time series between two points x_t and x_{t+T−1} with x_i = NA (i = t : t + T − 1) is called a gap of size T at position t. In this paper, we consider a gap large when T ≥ 6% N for small time series (N < 10,000) or when T is larger than the known-process change.

The proposed approach finds the sub-sequence (Qs) most similar to a query (Q), where Q (cf. Fig. 2) is the sub-sequence before a gap of size T at position t (Q = x[t − T : t − 1]), and completes this gap with the sub-sequence that follows Qs.

Figure 1: ACF of Mackey-Glass chaotic, water level and Google time series (three panels: "ACF of Mackey-Glass chaotic time series", "ACF of water time series", "ACF of google time series"; x-axis: Lag, y-axis: ACF)

To find the similar sub-sequence Qs, we use the principles of Dynamic Time Warping, DTW (Sakoe and Chiba (1978)), applied to data transformed from the original data to Derivative Dynamic Time Warping (DDTW) data (Keogh and Pazzani (2001)). The DDTW data are used because they carry information about the shape of the sequence (Keogh and Pazzani (2001)). The dynamics and the shape of the data before a gap are a key point of our method. Elastic matching is used to find a window similar to the query Q of size T in the search database. Once the most similar window is identified, the window that follows it is copied to the location of the missing values. Fig. 2 describes the different steps of our approach.
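For readers unfamiliar with the DDTW transformation, the sketch below shows the derivative estimate usually attributed to Keogh and Pazzani (2001): each interior point is replaced by the average of the slope to its left neighbor and the slope through its two neighbors, and the first and last estimates are copied from their neighbors. The function name derivative_transform is hypothetical, and whether the authors use exactly this estimate is an assumption.

```python
def derivative_transform(q):
    """Estimate the local derivative of sequence q (DDTW-style transform).

    Interior points: ((q[i] - q[i-1]) + (q[i+1] - q[i-1]) / 2) / 2.
    The first and last values reuse the nearest interior estimate.
    """
    if len(q) < 3:
        raise ValueError("need at least 3 points to estimate derivatives")
    interior = [
        ((q[i] - q[i - 1]) + (q[i + 1] - q[i - 1]) / 2.0) / 2.0
        for i in range(1, len(q) - 1)
    ]
    return [interior[0]] + interior + [interior[-1]]


line = [0.0, 2.0, 4.0, 6.0]
shifted = [v + 10.0 for v in line]
d_line = derivative_transform(line)
d_shifted = derivative_transform(shifted)
```

Because the transform depends only on differences, two sequences with the same shape but different offsets give the same derivative sequence, which is why it emphasizes shape over level.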

Figure 2: Diagram of DTWBI method for univariate time series imputation

The details of the DTWBI (DTW-Based Imputation) algorithm are given in Algorithm 1. In the proposed method, the shape-feature extraction algorithm (Phan et al. (2016)) is applied before the DTW algorithm in order to reduce computation time. Since the time complexity of DTW is O(N^2), this step is very useful for decreasing the computation time of the DTW method. A reference window is selected for DTW cost calculation only if the correlation between its shape-features (also called global features) and those of the query is very high. In addition, we apply the shape-feature extraction algorithm because it better represents the shape and dynamics of a series through 9 elements, such as moments (the 1st, 2nd and 3rd moments), number of peaks, entropy, etc. (see Phan et al. (2016) for more detail). This is an important objective of the proposed method. Algorithm 1 only covers finding similar windows before the gap. To find similar windows after the gap, the method just needs to shift the corresponding index.
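The core idea described above (take the window before the gap as the query, find the most similar window by DTW cost, copy the window that follows it into the gap) can be sketched in Python as follows. This is a simplified illustration, not the authors' Algorithm 1: it works on the raw values, omits the DDTW transform and the shape-feature pre-filter, scans every candidate window, and assumes missing values are encoded as None with 0-based indices. The names dtw_cost and fill_gap_before are hypothetical.

```python
import math


def dtw_cost(a, b):
    """Classic DTW cost between sequences a and b (absolute difference as local cost)."""
    previous = [0.0] + [math.inf] * len(b)
    for i in range(1, len(a) + 1):
        current = [math.inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            current[j] = local + min(previous[j], current[j - 1], previous[j - 1])
        previous = current
    return previous[len(b)]


def fill_gap_before(x, t, T):
    """Fill the gap x[t:t+T] using the window that follows the best match to the query."""
    if t < T or t + T > len(x):
        raise ValueError("need T values before the gap and a gap inside the series")
    query = x[t - T:t]
    if any(v is None for v in query):
        raise ValueError("the query window must be fully observed")
    best_cost, best_start = math.inf, None
    for s in range(len(x) - 2 * T + 1):
        block = x[s:s + 2 * T]      # candidate window plus the window that follows it
        if any(v is None for v in block):
            continue                # skip anything touching missing values
        cost = dtw_cost(query, block[:T])
        if cost < best_cost:
            best_cost, best_start = cost, s
    if best_start is None:
        raise ValueError("no fully observed candidate window found")
    filled = list(x)
    filled[t:t + T] = x[best_start + T:best_start + 2 * T]
    return filled


# toy periodic series with one gap of size 6 at position 18
pattern = [0, 1, 2, 3, 2, 1]
truth = pattern * 6
incomplete = list(truth)
incomplete[18:24] = [None] * 6
completed = fill_gap_before(incomplete, 18, 6)
```

The exhaustive scan makes the quadratic cost of each DTW comparison visible, which is the motivation given in the text for pre-filtering candidate windows with global shape features.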


5. Experimental results and discussion

5.1. Data presentation

In this study, we analyze 8 data sets in order to evaluate the performance of the proposed technique. Four data sets come from the TSA package (Hyndman and Khandakar (2008)): Airpassenger, Beersales, Google, and SP. These data sets are chosen because they are commonly used in the literature. We also choose other data sets from various domains and places:

1. Airpassenger - Monthly total international airline passengers from 01/1960 to 12/1971.

2. Beersales - Monthly beer sales in millions of barrels, from 01/1975 to 12/1990.

3. Google - Daily returns of the Google stock from 08/20/04 to 09/13/06.

4. SP - Quarterly S&P Composite Index, 1936Q1 - 1977Q4.

5. CO2 concentrations - This data set contains monthly mean CO2 concentrations at the Mauna Loa Observatory from 1974 to 1987 (Thoning et al. (1989)).

6. Mackey-Glass chaotic - The data are generated from the Mackey-Glass equation, a nonlinear time-delay differential equation (Mackey and Glass (1977)).

7. Phu Lien temperature - This data set is composed of monthly mean air temperature at the Phu Lien meteorological station in Vietnam from 1/1961 to 12/2014.

8. Water level - The MAREL Carnot data in France, acquired from 2005 up to today. For our study, we focus on the water level, with a sampling frequency of 20 minutes, from 01/1/2005 to 31/12/2009 (Lefebvre (2015)).

Table 1 summarizes characteristics of the data sets.

Table 1: Data characteristics

No.  Data set name          No. of instants  Trend (Y/N)  Seasonal (Y/N)  Frequency
1    Air passenger          144              Y            Y               Monthly
2    Beersales              192              Y            Y               Monthly
3    Google                 521              N            N               Daily
4    SP                     168              Y            Y               Quarterly
5    CO2 concentrations     160              Y            Y               Monthly
6    Mackey-Glass chaotic   1201             N            N
7    Phu Lien temperature   648              N            Y               Monthly
8    Water level            131472           N            Y               20 minutes

5.2. Univariate time series imputation algorithms

The performance of the proposed method is compared with that of 5 existing methods for univariate time series (na.interp, na.locf, na.approx, na.aggregate, na.spline). All these methods are implemented in the R language (na stands for Not Available):

1. na.interp (forecast R-package): linear interpolation for non-seasonal series, and Seasonal Trend decomposition using Loess (STL decomposition) for seasonal series, to replace missing values (Hyndman and Khandakar (2008)). A seasonal model is fitted to the data, then interpolation is performed on the seasonally adjusted series before re-seasonalizing. This method is therefore especially suited to data with strong and clear seasonality.

2. na.locf (last observation carried forward) (zoo R-package): any missing value is replaced by the most recent non-NA value prior to it (Zeileis and Grothendieck (2005)). Conceptually, this method assumes that the outcome does not change after the last observed value, i.e. that there has been no time effect since the last observed data.

3. na.approx (zoo R-package): generic function for replacing each NA with interpolated values (Zeileis and Grothendieck (2005)).

4. na.aggregate (zoo R-package): generic function for replacing each NA with aggregated values. This allows imputing with the overall mean, monthly means, etc. (Zeileis and Grothendieck (2005)). In our experiment, we use the overall mean.

5. na.spline (zoo R-package): polynomial (cubic) interpolation to fill in missing data (Zeileis and Grothendieck (2005)).
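As a rough illustration of what three of these baselines compute, the following Python sketch gives simplified analogues of last observation carried forward, overall-mean replacement and linear interpolation. The function names are hypothetical and the code is not the zoo or forecast implementation; missing values are assumed to be encoded as None.

```python
def fill_locf(x):
    """Replace each missing value by the most recent observed value before it."""
    filled, last = [], None
    for value in x:
        if value is not None:
            last = value
        filled.append(last)  # leading missing values stay None
    return filled


def fill_overall_mean(x):
    """Replace each missing value by the mean of all observed values."""
    observed = [v for v in x if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in x]


def fill_linear(x):
    """Linearly interpolate between the observed neighbors of each gap."""
    filled = list(x)
    known = [i for i, v in enumerate(x) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            filled[i] = x[left] + (x[right] - x[left]) * (i - left) / span
    return filled  # leading and trailing missing values are left untouched


sample = [1.0, None, None, 4.0, None, 6.0]
by_locf = fill_locf(sample)
by_mean = fill_overall_mean(sample)
by_line = fill_linear(sample)
```

The first two analogues insert a constant inside every gap, which matters for the shape-related indicator discussed later, while interpolation produces a straight segment between the gap's neighbors.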

5.3. Imputation performance indicators

After the completion of missing values, we assess the performance of our method and compare it with existing imputation methods based on four different metrics, described as follows:

1. Similarity: Sim(y, x) indicates the similarity between actual data (X) and imputation data (Y). It is calculated by:

\[ Sim(y, x) = \frac{1}{T} \sum_{i=1}^{T} \frac{1}{1 + \frac{|y_i - x_i|}{\max(x) - \min(x)}} \tag{1} \]

where T is the number of missing values. A higher similarity (similarity value ∈ [0, 1]) indicates a better ability of the method to complete missing values.

2. NMAE: The Normalized Mean Absolute Error between the imputed value y and the respective true value time series x is computed as:

\[ NMAE(y, x) = \frac{1}{T} \sum_{i=1}^{T} \frac{|y_i - x_i|}{V_{max} - V_{min}} \tag{2} \]

where V_max and V_min are the maximum and the minimum values of the input time series (the time series with missing data), ignoring the missing values. A lower NMAE means better performance of the method for the imputation task.

3. RMSE: The Root Mean Square Error is defined as the square root of the average squared difference between the imputed value y and the respective true value time series x. This indicator is very useful for measuring overall precision or accuracy. In general, the most effective method has the lowest RMSE.

\[ RMSE(y, x) = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (y_i - x_i)^2} \tag{3} \]

4. FSD: The Fraction of Standard Deviation of the imputed value y and the respective true value time series x is defined as follows:

\[ FSD(y, x) = 2 \cdot \frac{|SD(y) - SD(x)|}{SD(y) + SD(x)} \tag{4} \]

This fraction indicates whether a method is acceptable or not (here SD stands for Standard Deviation). For the imputation task, the closer FSD is to 0, the closer the imputation values are to the real values.
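The four indicators can be written down directly from their definitions, as in the Python sketch below. The function names are hypothetical, x holds the true values and y the imputed values over the gap, and the code assumes non-degenerate inputs (x not constant, V_max different from V_min). The population standard deviation is used for SD; this choice is an assumption, and FSD is the same with the sample standard deviation because the scaling factor cancels in the ratio.

```python
import math
import statistics


def similarity(y, x):
    """Eq. (1): mean of 1 / (1 + |y_i - x_i| / (max(x) - min(x)))."""
    value_range = max(x) - min(x)
    return sum(1.0 / (1.0 + abs(yi - xi) / value_range) for yi, xi in zip(y, x)) / len(x)


def nmae(y, x, v_max, v_min):
    """Eq. (2): mean absolute error normalized by the range of the input series."""
    return sum(abs(yi - xi) / (v_max - v_min) for yi, xi in zip(y, x)) / len(x)


def rmse(y, x):
    """Eq. (3): square root of the mean squared error."""
    return math.sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(x))


def fsd(y, x):
    """Eq. (4): 2 * |SD(y) - SD(x)| / (SD(y) + SD(x))."""
    sd_y, sd_x = statistics.pstdev(y), statistics.pstdev(x)
    return 2.0 * abs(sd_y - sd_x) / (sd_y + sd_x)


true_values = [1.0, 2.0, 3.0, 4.0]
constant_fill = [2.5, 2.5, 2.5, 2.5]  # e.g. a mean-style imputation
```

With these toy values, a perfect imputation gives a similarity of 1 and zero NMAE, RMSE and FSD, whereas the constant fill gives an RMSE of sqrt(1.25) and an FSD of exactly 2.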

5.4. Experiment protocol

We cannot compare the ability of imputation algorithms on real missing data because the true values are not available. Therefore, we have to create simulated missing gaps on complete data to compare the performance of imputation algorithms. To assess the results, we use a technique based on three steps. In the first step, we create artificial missing data by deleting data values from known time series. The second step consists in applying the imputation algorithms to complete the missing data. Finally, the third step compares the performance of the proposed method with published methods using the imputation performance indicators defined previously.

In the present study, 5 missing data levels are considered on 8 data sets. If the size of a data set (its number of instants) is less than or equal to 10,000 samples, we create gaps of different sizes: 6%, 7.5%, 10%, 12.5% and 15% of the overall data set size. In contrast, when the size of a data set is greater than 10,000 sampling points, gaps are built at rates of 0.6%, 0.75%, 1%, 1.25% and 1.5% of the data set size (here the largest gap of the water level time series is 1,972 missing values, corresponding to the missing rate of 1.5%). For each missing rate, the algorithms are run 10 times with randomly selected missing positions in the data. We thus run 50 iterations for each data set.
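The gap-creation step of this protocol can be sketched in Python as follows. The function names are hypothetical. Rounding the gap size to the nearest integer is an assumption, chosen because it reproduces the gap sizes quoted in the text (1,972 points at 1.5% of 131,472; 9 points at 6% of 144). Restricting the start position so that T observed values remain before the gap is also an assumption made for this illustration.

```python
import random


def missing_rates(n):
    """Missing rates used for a series of n points (thresholds from the text)."""
    if n <= 10000:
        return [0.06, 0.075, 0.10, 0.125, 0.15]
    return [0.006, 0.0075, 0.01, 0.0125, 0.015]


def make_gap(series, rate, rng):
    """Delete one random gap; return (series_with_gap, start, true_values)."""
    n = len(series)
    size = round(n * rate)                     # assumed rounding rule
    start = rng.randrange(size, n - size + 1)  # keep `size` points before the gap
    masked = list(series)
    masked[start:start + size] = [None] * size
    return masked, start, series[start:start + size]


rng = random.Random(0)         # fixed seed so that the run is repeatable
complete = list(range(144))    # stand-in for a complete series of 144 points
runs = []
for rate in missing_rates(len(complete)):
    for _ in range(10):        # 10 random positions per rate, 50 runs in total
        runs.append(make_gap(complete, rate, rng))
```

Each run yields a series with one simulated gap together with the deleted true values, which can then be passed to an imputation method and scored with the indicators of Section 5.3.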

5.5. Results and discussion

5.5.1. Comparison of quantitative performance

Table 2 shows the average imputation results of the DTWBI, na.interp, na.locf, na.approx, na.aggregate and na.spline methods applied to 8 data sets using 4 indicators: similarity, NMAE, RMSE and FSD.

• Airpassenger, Beersales, Google, SP data sets

The Airpassenger data set has both trend and seasonality components. Table 2 indicates that when the gap size is greater than or equal to 10%, the proposed method has the highest similarity and the lowest NMAE and RMSE.

On the Beersales data set, for the similarity and RMSE indicators, the na.interp method provides the best result and our approach is second. In contrast, our method has better results on the NMAE and FSD indicators at every missing rate. Comparing the na.interp method with the na.approx one on the Airpassenger and Beersales data sets, we see that na.interp performs better than na.approx on every indicator and at every level of missing data. This corresponds to the fact that these two data sets have a clear seasonality component. The na.interp method takes the seasonality factor into account, so it handles seasonality better than na.approx does, although both algorithms use interpolation to complete missing data.

On the Airpassenger and Beersales data sets, the na.aggregate approach gives less efficient results than na.interp. But on the Google series, the na.aggregate method yields the best performance: the highest similarity and the smallest NMAE and RMSE. Since this data set has no trend, this method leads to the best result. For the SP data set, the na.aggregate method still shows good performance on NMAE and RMSE, but its similarity is lower than on the Google series. The na.aggregate method replaces missing values by the overall mean. However, the SP series has a clear trend; therefore, the na.aggregate method does not seem to be effective on series with a strong trend.

In all data sets, the FSD value of the na.aggregate and na.locf methods always equals 2, because they use the same value for all missing data (the last value for na.locf; the overall mean for na.aggregate).
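The value 2 follows directly from the FSD definition, assuming the true values in the gap are not themselves constant (SD(x) > 0). A constant imputation has SD(y) = 0, so

\[ FSD(y, x) = 2 \cdot \frac{|0 - SD(x)|}{0 + SD(x)} = 2 \]

which is also the largest value FSD can take, since |SD(y) − SD(x)| never exceeds SD(y) + SD(x).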

• CO2 concentrations, Mackey-Glass chaotic, Phu Lien temperature, water level data sets

These data sets have a seasonality component (except the Mackey-Glass chaotic series, although this data set repeats regularly), no trend (except the CO2 concentrations data set) and high auto-correlation. Our method demonstrates the best ability to complete missing data on these series: the highest similarity and the lowest NMAE, RMSE and FSD at every missing level. By comparison, on the Airpassenger, Beersales, Google and SP data sets the similarity of our approach is lower, but the difference in this indicator between the proposed method and the best method is small. On the four data sets of the present group, in contrast, our method outperforms the existing techniques on every indicator and at every missing rate, and the differences in these indicators between the proposed method and the other ones are quite large. The results confirm that the imputation values generated by the proposed method are close to the real values on data sets having high auto-correlation (see Fig. 1: the maximum ACF values of the water and chaotic series are approximately 1), which means that there is a strong relationship between the available and the unknown values. After the proposed method, the second best is na.aggregate on the Mackey-Glass chaotic, Phu Lien temperature and water level series. As mentioned above (Table 1), these data sets have no trend, which is why na.aggregate can demonstrate its ability. However, on the CO2 series, which has a clear trend unlike these 3 data sets, the performance of this method is the worst.

Although the na.interp method is well suited to handling data sets with a seasonality component, it does not show this capability on these 4 data sets. It gives the same results as the na.approx method, and lower results than our approach and the na.aggregate one (on the Mackey-Glass chaotic, Phu Lien temperature and water series). For every data set, the na.spline method shows the lowest performance, and on the water series in particular it performs worst at completing missing values. This means that the spline method is not suitable for this task.

5.5.2. Comparison of the visual performance

Table 2 gives a quantitative comparison of 6 different methods for the task of completing missing values. In this part, Fig. 3, 4, 5, 7, and 8 compare the visual imputation performance of the different methods.

Fig. 3 presents the shape of the imputation values of 5 existing methods (na.interp, na.locf, na.approx, na.aggregate and na.spline) together with the true values at position 106, for a gap size of 9 on the Airpassenger series. As we can see in Table 2, at low rates of missing data the proposed approach is less effective than the na.interp and na.aggregate methods on the Airpassenger time series. However, looking at Fig. 4, we find that the shape of the imputation values generated by the DTWBI method is very similar to the shape of the true values. Despite its high similarity and low RMSE and NMAE, the na.aggregate method yields imputation values whose shape (Fig. 3) does not follow the true values as well as the proposed method's does (Fig. 4). As analyzed above, the na.interp method deals better with the seasonal factor, so its imputed values are asymptotic to the real values (Fig. 3).

Fig. 5 illustrates the visual comparison of DTWBI imputation values and real values on the water level series at position 23,282 and at a 0.6% rate of missing values (corresponding to 789 missing points). The proposed method again proves its capability for the task of completing missing values. We see that the shape of the imputation values generated from our method and the one of the true values


Table 2: Average imputation performance indexes of six methods on eight data sets

Gap    Method       | Airpassenger                | Beersales                   | Google                      | SP
size                | Sim    NMAE   RMSE   FSD    | Sim    NMAE   RMSE   FSD    | Sim    NMAE   RMSE   FSD    | Sim    NMAE   RMSE   FSD
6%     DTWBI        | 0.777  0.034  21.1   0.24   | 0.88   0.035  0.7    0.14   | 0.83   0.14   0.034  0.43   | 0.74   0.026  35.5   0.7
6%     na.interp    | 0.85   0.019  11.1   0.24   | 0.89   0.063  0.6    0.15   | 0.83   0.11   0.032  1.11   | 0.74   0.028  36.3   0.54
6%     na.locf      | 0.76   0.044  26.3   2      | 0.81   0.129  1.2    2      | 0.81   0.126  0.036  2      | 0.75   0.022  29.2   2
6%     na.approx    | 0.77   0.037  21.8   1.01   | 0.8    0.136  1.3    1.5    | 0.83   0.11   0.032  1.11   | 0.73   0.028  37     1.03
6%     na.aggregate | 0.8    0.033  20.1   2      | 0.83   0.111  1.1    2      | 0.86   0.082  0.024  2      | 0.78   0.021  26.5   2
6%     na.spline    | 0.71   0.057  35.1   0.52   | 0.68   0.26   2.3    0.55   | 0.5    1.813  0.473  1.02   | 0.63   0.045  56.8   0.41
7.5%   DTWBI        | 0.782  0.035  20.6   0.3    | 0.87   0.038  0.7    0.1629 | 0.84   0.131  0.032  0.33   | 0.76   0.03   38.9   0.52
7.5%   na.interp    | 0.86   0.023  13.6   0.3    | 0.885  0.067  0.6    0.163  | 0.83   0.119  0.034  1.18   | 0.78   0.024  33.1   0.67
7.5%   na.locf      | 0.77   0.046  27.4   2      | 0.81   0.123  1.2    2      | 0.82   0.126  0.035  2      | 0.77   0.026  34.8   2
7.5%   na.approx    | 0.74   0.053  31.3   1.49   | 0.8    0.132  1.3    1.51   | 0.83   0.119  0.034  1.18   | 0.78   0.025  34     1.1
7.5%   na.aggregate | 0.81   0.033  20.2   2      | 0.82   0.112  1.1    2      | 0.87   0.081  0.024  2      | 0.8    0.022  29.1   2
7.5%   na.spline    | 0.6    0.112  65.4   0.45   | 0.6    0.404  3.5    0.43   | 0.44   3.652  0.963  1.38   | 0.69   0.042  54.5   0.55
10%    DTWBI        | 0.887  0.02   12.7   0.36   | 0.84   0.054  1      0.13   | 0.84   0.132  0.032  0.23   | 0.81   0.029  40.1   0.57
10%    na.interp    | 0.86   0.021  13.1   0.34   | 0.89   0.068  0.7    0.18   | 0.85   0.105  0.03   1.22   | 0.82   0.025  36.3   0.56
10%    na.locf      | 0.79   0.042  26.1   2      | 0.82   0.13   1.3    2      | 0.83   0.131  0.035  2      | 0.81   0.026  36.9   2
10%    na.approx    | 0.79   0.041  24.6   1.03   | 0.82   0.124  1.2    1.24   | 0.85   0.105  0.03   1.22   | 0.83   0.024  33.5   1.14
10%    na.aggregate | 0.81   0.035  22.1   2      | 0.84   0.111  1.1    2      | 0.87   0.084  0.024  2      | 0.82   0.023  31.7   2
10%    na.spline    | 0.62   0.134  78.3   0.52   | 0.55   0.558  4.9    0.67   | 0.42   4.684  1.118  1.13   | 0.76   0.049  63.2   0.45
12.5%  DTWBI        | 0.893  0.02   12.6   0.36   | 0.87   0.039  0.7    0.12   | 0.85   0.138  0.032  0.23   | 0.8    0.03   41.9   0.61
12.5%  na.interp    | 0.86   0.023  14.8   0.39   | 0.89   0.068  0.6    0.15   | 0.85   0.115  0.032  1.27   | 0.81   0.028  38.8   0.52
12.5%  na.locf      | 0.8    0.044  26.9   2      | 0.82   0.127  1.2    2      | 0.84   0.129  0.035  2      | 0.81   0.027  36.1   2
12.5%  na.approx    | 0.79   0.043  26.7   0.95   | 0.8    0.147  1.4    1.28   | 0.85   0.115  0.032  1.27   | 0.825  0.027  35.6   1.06
12.5%  na.aggregate | 0.82   0.035  21.8   2      | 0.84   0.109  1.1    2      | 0.88   0.083  0.024  2      | 0.824  0.024  31     2
12.5%  na.spline    | 0.64   0.129  76.8   0.67   | 0.61   0.458  4      0.77   | 0.39   2.143  0.532  1.4    | 0.61   0.113  132.4  0.69
15%    DTWBI        | 0.895  0.02   12.8   0.36   | 0.84   0.054  1      0.1    | 0.85   0.133  0.031  0.29   | 0.81   0.029  40.7   0.59
15%    na.interp    | 0.86   0.025  15.6   0.35   | 0.89   0.069  0.7    0.17   | 0.86   0.11   0.031  0.99   | 0.79   0.033  43.6   0.49
15%    na.locf      | 0.79   0.047  28.2   2      | 0.82   0.126  1.2    2      | 0.84   0.127  0.034  2      | 0.81   0.028  36.3   2
15%    na.approx    | 0.8    0.043  26.5   1.17   | 0.83   0.117  1.1    1.42   | 0.86   0.11   0.031  0.99   | 0.81   0.032  41     1
15%    na.aggregate | 0.83   0.035  22.1   2      | 0.84   0.11   1.1    2      | 0.89   0.079  0.023  2      | 0.82   0.025  32     2
15%    na.spline    | 0.55   0.175  106.1  0.95   | 0.49   0.731  6.3    0.88   | 0.34   12.339 2.928  1.6    | 0.61   0.136  162.5  0.68

Gap    Method       | CO2 concentrations          | Mackey-Glass Chaotic        | Phu Lien temperature        | Water level
size                | Sim    NMAE   RMSE   FSD    | Sim    NMAE   RMSE   FSD    | Sim    NMAE   RMSE   FSD    | Sim    NMAE   RMSE   FSD
6%     DTWBI        | 0.93   0.001  0.3    0.04   | 0.95   0.005  0.01   0.03   | 0.88   0.06   1.7    0.08   | 0.95   0.009  0.1    0.05
6%     na.interp    | 0.75   0.055  1.6    1.5    | 0.79   0.031  0.04   0.81   | 0.8    0.142  3.1    0.63   | 0.81   0.042  0.5    1.05
6%     na.locf      | 0.73   0.059  1.7    2      | 0.77   0.036  0.05   2      | 0.77   0.173  3.8    2      | 0.8    0.043  0.4    2
6%     na.approx    | 0.75   0.055  1.6    1.5    | 0.79   0.031  0.04   0.81   | 0.8    0.142  3.1    0.63   | 0.81   0.042  0.5    1.05
6%     na.aggregate | 0.45   0.185  4.7    2      | 0.82   0.025  0.03   2      | 0.83   0.114  2.4    2      | 0.83   0.035  0.4    2
6%     na.spline    | 0.75   0.057  1.6    0.75   | 0.65   0.072  0.09   0.38   | 0.61   0.413  8.5    0.52   | 0.3    0.654  6.6    1.61
7.5%   DTWBI        | 0.93   0.001  0.4    0.05   | 0.93   0.008  0.01   0.02   | 0.8788 0.061  1.7    0.06   | 0.96   0.007  0.1    0.02
7.5%   na.interp    | 0.74   0.057  1.6    1.38   | 0.8    0.031  0.04   1.04   | 0.79   0.147  3.2    0.98   | 0.82   0.038  0.4    0.97
7.5%   na.locf      | 0.76   0.053  1.6    2      | 0.77   0.038  0.05   2      | 0.77   0.171  3.7

20.

810.

043

0.5

2na

.app

rox

0.74

0.05

71.

61.

380.

80.

031

0.04

1.04

0.79

0.14

73.

20.

980.

820.

038

0.4

0.97

na.a

ggre

gate

0.45

0.18

64.

72

0.83

0.02

50.

032

0.83

0.11

32.

42

0.83

0.03

60.

42

na.s

plin

e0.

740.

058

1.6

0.79

0.69

0.06

20.

080.

390.

580.

701

14.5

0.8

0.2

1.22

812

1.71

10%

DT

WB

I0.

930.

001

0.4

0.04

0.93

0.00

80.

010.

010.

8791

0.06

31.

80.

050.

970.

005

0.1

0.03

na.in

terp

0.76

0.05

11.

40.

880.

810.

030.

040.

980.

810.

137

30.

580.

810.

041

0.4

0.91

na.lo

cf0.

760.

054

1.6

20.

790.

036

0.05

20.

770.

176

3.8

20.

810.

043

0.5

2na

.app

rox

0.76

0.05

11.

40.

880.

810.

030.

040.

980.

810.

137

30.

580.

810.

041

0.4

0.91

na.a

ggre

gate

0.44

0.19

74.

92

0.83

0.02

50.

032

0.83

0.11

42.

42

0.83

0.03

60.

42

na.s

plin

e0.

660.

098

2.9

0.26

0.71

0.05

80.

080.

330.

490.

8817

.81.

040.

181.

5715

.51.

79

12.5

%D

TW

BI

0.94

0.00

10.

30.

040.

920.

009

0.02

0.01

0.88

10.

065

1.8

0.04

0.96

0.00

60.

10.

03na

.inte

rp0.

780.

049

1.5

1.39

0.8

0.03

30.

041.

130.

790.

163

3.5

1.44

0.81

0.04

40.

51.

21na

.locf

0.75

0.05

71.

72

0.79

0.03

60.

052

0.78

0.18

3.8

20.

810.

043

0.5

2na

.app

rox

0.78

0.04

91.

51.

390.

80.

033

0.04

1.13

0.79

0.16

33.

51.

440.

810.

044

0.5

1.21

na.a

ggre

gate

0.44

0.2

52

0.84

0.02

50.

032

0.84

0.11

62.

42

0.83

0.03

60.

42

na.s

plin

e0.

710.

073

2.2

0.38

0.61

0.09

30.

120.

630.

550.

653

13.7

0.99

0.25

0.96

9.8

1.74

15%

DT

WB

I0.

940.

001

0.3

0.04

0.92

0.01

0.02

0.01

0.88

20.

066

1.8

0.05

0.96

0.00

70.

10.

04na

.inte

rp0.

760.

053

1.6

1.46

0.81

0.03

0.04

0.99

0.81

0.14

53.

21

0.81

0.04

40.

51.

6na

.locf

0.77

0.05

21.

62

0.79

0.03

70.

052

0.79

0.17

53.

82

0.81

0.04

30.

52

na.a

ppro

x0.

760.

053

1.6

1.46

0.81

0.03

0.04

0.99

0.81

0.14

53.

21

0.81

0.04

40.

51.

6na

.agg

rega

te0.

430.

202

5.1

20.

840.

025

0.03

20.

840.

117

2.5

20.

830.

036

0.4

2na

.spl

ine

0.69

0.08

52.

50.

580.

570.

129

0.16

0.73

0.44

1.26

826

.31.

270.

211.

185

11.8

1.83

9

Figure 3: Visual comparison of imputed values of different imputation methods (na.interp, na.approx, na.aggregate, na.locf, na.spline) with true values on the Airpassenger series at position 106 with a gap size of 9. Axes: Time (Month) versus Number of Air Passengers.

Figure 4: Visual comparison of imputed values of the proposed method (DTWBI) with true values on the Airpassenger series at position 106 with a gap size of 9. Axes: Time (Month) versus Number of Air Passengers.

are almost completely identical. Fig. 6 shows the matching pairs between the query and the most similar reference window for this case. The values of the matching pairs are very close, which explains why the DTWBI imputation values are so similar to the true values. In contrast to our approach, the seasonal handling of the na.interp method is ineffective on the water level data set: it does not perform as well as on the Airpassenger series (Fig. 3), and its performance is the same as that of the na.approx method (Fig. 7). Fig. 8 highlights the clear inadequacy of the na.spline method for completing missing values in series with high auto-correlation and a large gap (789 missing values in this case).

In this paper, we also calculate the cross-correlation (CC) coefficient between the query and each reference window, and then take the maximum coefficient. CC indicates whether a pattern (here, the query) exists in the database. A high CC value means that the pattern recurs in the database, so it can easily be found. Table 3 reports the maximum cross-correlation between the query and the reference windows.

Table 3: The maximum of cross-correlation between the query and reference windows.

Gap size    Data set
            #1      #2      #3      #4      #5      #6      #7      #8
6%          0.88    0.92    0.58    0.78    0.99    1       0.91    1
7.5%        0.91    0.91    0.55    0.74    0.99    0.99    0.91    1
10%         0.94    0.87    0.5     0.67    0.98    0.99    0.91    1
12.5%       0.95    0.89    0.44    0.65    0.98    0.99    0.9     1
15%         0.95    0.85    0.4     0.65    0.98    0.99    0.9     1

#1: Airpassenger, #2: Beersales, #3: Google, #4: SP, #5: CO2 concentrations, #6: Mackey-Glass chaotic, #7: Phu Lien temperature, #8: water level
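The maximum cross-correlation between a query and the reference windows can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that CC is the zero-lag normalized (Pearson) correlation between the query and each equally long window, and the function names pearson and max_cross_correlation are hypothetical.

```python
import math

def pearson(a, b):
    """Zero-lag normalized cross-correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0  # constant window: treated as uncorrelated (assumption)
    return sum(p * q for p, q in zip(da, db)) / denom

def max_cross_correlation(database, query, step=1):
    """Slide a window of len(query) over database; return (best_cc, best_start)."""
    T = len(query)
    best_cc, best_start = -1.0, None
    for start in range(0, len(database) - T + 1, step):
        cc = pearson(query, database[start:start + T])
        if best_start is None or cc > best_cc:
            best_cc, best_start = cc, start
    return best_cc, best_start

# Usage: a strictly periodic series contains the query pattern again,
# so the maximum CC is close to 1.
series = [math.sin(2 * math.pi * t / 20) for t in range(200)]
cc, start = max_cross_correlation(series[:100], series[100:120])
```

A maximum close to 1, as for data sets #5 to #8 in Table 3, suggests that the query pattern recurs in the database.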

These results are readily interpreted. For four data sets (CO2 concentrations, Mackey-Glass chaotic series, Phu Lien temperature, and water level), the cross-correlation between the query and the reference windows is very high at every missing level (Table 3). This corresponds to the results in Table 2: the proposed method yields the highest similarity and the lowest NMAE, RMSE, and FSD, which means that the imputation values generated by the DTWBI method are very close to the true ones. For the Google (#3) and SP (#4) data sets, the CC values are not high, which is why our approach does not demonstrate its ability there. For the Airpassenger data set (#1), when CC is greater than or equal to 0.94, the proposed method yields better results than the other methods. On the Beersales data set (#2), DTWBI gives better results when CC is higher than when it is lower.

Figure 5: Visual comparison of imputed values of the proposed method (DTWBI) with true values on the water level series at position 23,282 with a gap size of 789. Axes: Time (20-minute) versus Water level (m).

Figure 6: Visual comparison of the query (Q) with the similar window (Qs) on the water level series at position 23,282 with a gap size of 789. Axes: Time (20-minute) versus Water level (m).

Figure 7: Visual comparison of imputed values of different methods (na.interp, na.approx, na.aggregate, na.locf) with true values on the water level series at position 23,282 with a gap size of 789. Axes: Time (20-minute) versus Water level (m).

Figure 8: Visual comparison of imputed values of the spline method (na.spline) with true values on the water level series at position 23,282 with a gap size of 789. Axes: Time (20-minute) versus Water level (m).

From these results, we can see that the proposed method gives the best performance when the CC coefficient is high (> 0.9). Indeed, CC is an indicator of pattern recurrence in the data. Based on this indicator, we can predict whether a pattern may occur in the data before or after the position under consideration. From the above analyses, our algorithm outperforms the other imputation methods when data sets have high auto-correlation and cross-correlation, no trend, strong seasonality, and a complex distribution, especially in the case of large gap(s). High cross-correlation means that these data sets are recurrent; in other words, the time series repeat themselves over some periods. The drawback of this method is its computation time: the proposed algorithm may take a long time to find the imputation values when the given data set is large, because it searches all possible sliding windows to find the reference window with the maximum similarity to the query.
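To give a sense of scale for this cost, the sketch below counts the DTW matrix cells filled in one pass of the sliding-window search. It assumes classical unconstrained DTW (T × T cells per comparison), a search database of length t − 2T as in Algorithm 1, and that every window passes the cosine filter, so the figure is an upper bound. The function name dtw_cell_evaluations is hypothetical.

```python
def dtw_cell_evaluations(n_sdb, T, step=1):
    """Upper bound on DTW matrix cells filled in one sliding-window pass."""
    n_windows = max(0, (n_sdb - T) // step + 1)  # candidate reference windows
    return n_windows * T * T                     # T x T cells per DTW comparison

# Water level case of Fig. 5: gap of 789 values starting at position 23,282.
T = 789
n_sdb = 23282 - 2 * T            # search database ends 2T before the gap
cells = dtw_cell_evaluations(n_sdb, T)
cells_coarse = dtw_cell_evaluations(n_sdb, T, step=10)
```

Under these assumptions, the step-1 pass fills on the order of 10^10 cells, and a larger increment reduces the count roughly in proportion to the step.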

6. Conclusion

In this paper, we have proposed a new imputation method for univariate time series data, namely the DTWBI method. The methodology has been tested on 8 data sets: Airpassenger, Beersales, Google, SP, CO2 concentrations, Mackey-Glass chaotic, Phu Lien temperature, and water level. The accuracy of the values imputed by DTWBI is compared with that of 5 existing methods (na.interp, na.locf, na.approx, na.aggregate, and na.spline) using 4 quantitative indicators (similarity, NMAE, RMSE, and FSD). We also compare the visual performance of these methods. The experiments show that our approach gives better results than the other existing methods and is the most robust method for time series with high cross-correlation and auto-correlation, large gap(s), complex distribution, and strong seasonality. However, the proposed framework is restricted to applications where the necessary assumption of recurring data in the time series holds (high cross-correlation indicator), and it requires substantial computation time for very large missing intervals. In future work, we plan to extend the proposed approach to complete missing values in multivariate time series data.

Acknowledgments

This work was kindly supported by the Ministry of Education and Training Vietnam International Education Development, the French government, the region Hauts-de-France in the framework of the project CPER 2014-2020 MARCO, and the European Commission's H2020 program with the Joint European Research Infrastructure for Coastal Observations JERICO-Next.

References

Allison, P.D., 2001. Missing Data. volume 136 of Quan-titative Applications in the Social Sciences. Sage Pub-lication.

Bishop, C.M., 2006. Pattern Recognition and Ma-chine Learning (Information Science and Statistics).Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Ceong, H.T., Kim, H.J., Park, J.S., 2012. Discovery ofand recovery from failure in a costal marine usn ser-vice. Journal of Information and Communication Con-vergence Engineering 1.

Chiewchanwattana, S., Lursinsap, C., Henry Chu, C.H.,2007. Imputing incomplete time-series data based onvaried-window similarity measure of data sequences.Pattern Recognition Letters 28, 1091–1103.

Crawford, S.L., Tennstedt, S.L., McKinlay, J.B., 1995. Acomparison of anlaytic methods for non-random miss-ingness of outcome data. Journal of Clinical Epidemi-ology 48, 209–219.

Deng, Y., Chang, C., Ido, M.S., Long, Q., 2016. Multi-ple Imputation for General Missing Data Patterns in thePresence of High-dimensional Data. Scientific Reports6, 21689.

Gelman, A., Hill, J., Su, Y.S., Yajima, M., Pittau, M.,Goodrich, B., Si, Y., Kropko, J., 2015. Mi: MissingData Imputation and Model Checking.

Gómez-Carracedo, M., Andrade, J., López-Mahía, P.,Muniategui, S., Prada, D., 2014. A practical com-parison of single and multiple imputation methods tohandle complex missing data in air quality datasets.Chemometrics and Intelligent Laboratory Systems 134,23–33.

Hawthorne, G., Elliott, P., 2005. Imputing cross-sectionalmissing data: Comparison of common techniques. TheAustralian and New Zealand Journal of Psychiatry 39,583–590.

Hyndman, R., Khandakar, Y., 2008. Automatic timeseries forecasting: the forecast package for r, usedpackage in 2016. Journal of Statistical Software ,1–22URL: http://www.jstatsoft.org/article/view/v027i03.

Joseph, J.G., El-Mohandes, A.A.E., Kiely, M., El-Khorazaty, M.N., Gantz, M.G., Johnson, A.A., Katz,K.S., Blake, S.M., Rossi, M.W., Subramanian, S.,2009. Reducing Psychosocial and Behavioral Preg-nancy Risk Factors: Results of a Randomized Clini-cal Trial Among High-Risk Pregnant African Ameri-can Women. American Journal of Public Health 99,1053–1061.

Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J.,Kolehmainen, M., 2004. Methods for imputation of

13

missing values in air quality data sets. AtmosphericEnvironment 38, 2895–2907.

Keogh, E.J., Pazzani, M.J., 2001. Derivative DynamicTime Warping., in: Sdm, SIAM. pp. 5–7.

Lee, K.J., Carlin, J.B., 2010. Multiple Imputation forMissing Data: Fully Conditional Specification VersusMultivariate Normal Imputation. American Journal ofEpidemiology 171, 624–632.

Lefebvre, A., 2015. MAREL Carnot data andmetadata from Coriolis Data Centre. SEANOE.http://doi.org/10.17882/39754.

Liao, S.G., Lin, Y., Kang, D.D., Chandra, D., Bon, J.,Kaminski, N., Sciurba, F.C., Tseng, G.C., 2014. Miss-ing value imputation in high-dimensional phenomicdata: Imputable or not, and how? BMC Bioinformatics15, 346.

Little, R.J.A., Rubin, D.B., 2014. Statistical Analysis withMissing Data. John Wiley & Sons. Google-Books-ID:AyVeBAAAQBAJ.

Mackey, M.C., Glass, L., 1977. Oscillation and chaosin physiological control systems. Science (New York,N.Y.) 197, 287–289.

Moritz, S., Sardá, A., Bartz-Beielstein, T., Zaefferer, M.,Stork, J., 2015. Comparison of different Methods forUnivariate Time Series Imputation in R. arXiv preprintarXiv:1510.03924 .

Noor, N.M., Al Bakri Abdullah, M.M., Yahaya, A.S.,Ramli, N.A., 2014. Comparison of Linear Interpola-tion Method and Mean Method to Replace the MissingValues in Environmental Data Set. Materials ScienceForum 803, 278–281.

Phan, T.T.H., Caillault, E.P., Bigand, A., 2016. Compara-tive study on supervised learning methods for identify-ing phytoplankton species, in: 2016 IEEE Sixth Inter-national Conference on Communications and Electron-ics (ICCE), IEEE. pp. 283–288.

R Core Team, 2016. R: A Language and Environmentfor Statistical Computing. R Foundation for Statisti-cal Computing. Vienna, Austria. URL: http://www.R-project.org/.

Raghunathan, T.E., Lepkowski, J.M., Van Hoewyk, J.,Solenberger, P., 2001. A multivariate technique formultiply imputing missing values using a sequence ofregression models. Survey methodology 27, 85–96.

Raghunathan, T.E., Siscovick, D.S., 1996. A Multiple-Imputation Analysis of a Case-Control Study of theRisk of Primary Cardiac Arrest Among Pharmacologi-cally Treated Hypertensives on JSTOR. Royal Statisti-cal Society. Series C (Applied Statistics) 45, 335–352.

Rahman, S.A., Huang, Y., Claassen, J., Heintzman, N.,Kleinberg, S., 2015. Combining Fourier and lagged k-nearest neighbor imputation for biomedical time seriesdata. Journal of Biomedical Informatics 58, 198–207.

Rousseeuw, K., Caillault, E.P., Lefebvre, A., Hamad, D.,2013. Monitoring system of phytoplankton blooms byusing unsupervised classifier and time modeling, in:2013 IEEE International Geoscience and Remote Sens-ing Symposium-IGARSS, IEEE. pp. 3962–3965.

Royston, P., 2007. Multiple imputation of missing values:Further update of ice, with an emphasis on interval cen-soring. Stata Journal 7, 445–464.

Sakoe, H., Chiba, S., 1978. Dynamic Programming Al-gorithm Optimization for Spoken Word Recognition.IEEE transactions on acoustics, speech, and signal pro-cessing 16, 43–49.

Schafer, J., 1997. Analysis of Incomplete MultivariateData. Chapman and Hall, London.

Shah, A.D., Bartlett, J.W., Carpenter, J., Nicholas, O.,Hemingway, H., 2014. Comparison of random forestand parametric imputation models for imputing miss-ing data using MICE: A CALIBER study. AmericanJournal of Epidemiology 179, 764–774.

Spratt, M., Carpenter, J., Sterne, J.A.C., Carlin, J.B.,Heron, J., Henderson, J., Tilling, K., 2010. Strategiesfor Multiple Imputation in Longitudinal Studies. Amer-ican Journal of Epidemiology 172, 478–487.

Sterne, J.A.C., White, I.R., Carlin, J.B., Spratt, M., Roys-ton, P., Kenward, M.G., Wood, A.M., Carpenter, J.R.,2009. Multiple imputation for missing data in epidemi-ological and clinical research: Potential and pitfalls.BMJ (Clinical research ed.) 338, b2393.

14

Stuart, E.A., Azur, M., Frangakis, C., Leaf, P., 2009. Mul-tiple Imputation With Large Data Sets: A Case Studyof the Children’s Mental Health Initiative. AmericanJournal of Epidemiology 169, 1133–1139.

Thoning, K.W., Tans, P.P., Komhyr, W.D., 1989. Atmo-spheric carbon dioxide at Mauna Loa Observatory. II- Analysis of the NOAA GMCC data, 1974-1985 94,8549–8565.

Van Buuren, S., Boshuizen, H.C., Knook, D.L., others,1999. Multiple imputation of missing blood pressurecovariates in survival analysis. Statistics in medicine18, 681–694.

Walter.O, Y., Kihoro, J.M., Athiany, K.H.O., W, K.H.,2013. Imputation of incomplete non- stationary sea-sonal time series data. Mathematical Theory and Mod-eling 3, 142–154.

Zeileis, A., Grothendieck, G., 2005. zoo: S3 infrastruc-ture for regular and irregular time series, used pack-age in 2016. URL: https://www.jstatsoft.org/v014/i06, doi:10.18637/jss.v014.i06.

Algorithm 1 DTWBI algorithm

Input: x = {x1, x2, ..., xN}: incomplete time series
       t: index of a gap (position of the first missing value of the gap)
       T: size of the gap
       θ_cos: cosine threshold (≤ 1)
       step_threshold: increment for finding a threshold
       step_sim_win: increment for finding a similar window
Output: y: completed (imputed) time series

Step 1: Transform x to DDTW data: Dx = DDTW(x)
Step 2: Construct a query Q, the temporal window before the missing data: Q = Dx[t − T : t − 1]
Step 3: Build a search database before the gap, SDB = Dx[1 : t − 2T], and delete all lines containing a missing parameter: SDB = SDB \ {dx_j, dx_j = NA}
Step 4: Find the threshold
    i ← 1; DTW_costs ← NULL
    while i <= length(SDB) do
        k ← i + T − 1
        Create a reference window: R(i) = SDB[i : k]
        Calculate global features of Q and R(i): gfQ, gfR
        Compute cosine coefficient: cos = cosine(gfQ, gfR)
        if cos ≥ θ_cos then
            Calculate DTW cost: cost = DTW_cost(Q, R(i))
            Save the cost to DTW_costs
        end if
        i ← i + step_threshold
    end while
    threshold = min{DTW_costs}
Step 5: Find similar windows on the SDB
    i ← 1; Lop ← NULL
    while i < length(SDB) do
        k ← i + T − 1
        Create a reference window: R(i) = SDB[i : k]
        Calculate global features of Q and R(i): gfQ, gfR
        Compute cosine coefficient: cos = cosine(gfQ, gfR)
        if cos ≥ θ_cos then
            Calculate DTW cost: cost = DTW_cost(Q, R(i))
            if cost < threshold then
                Save position of R(i) to Lop
            end if
        end if
        i ← i + step_sim_win
    end while
Step 6: Replace the missing values at position t by the vector after the Qs window having the minimum DTW cost in the Lop list.
return y: the imputed series
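The steps of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation. It uses 0-based indices, marks the gap with None, and assumes there are no other missing values before the gap and that t ≥ 3T. The derivative uses the Keogh and Pazzani estimate in the interior but one-sided differences at both ends (an assumption of this sketch). The feature vector in global_features (min, max, mean, standard deviation) is hypothetical. Step 5 is simplified to keep the cheapest window of the fine pass when it does not exceed the threshold, falling back to the coarse pass otherwise.

```python
import math

def ddtw(x):
    """Derivative transform: Keogh-Pazzani estimate inside, one-sided at the ends."""
    n = len(x)
    d = [0.0] * n
    for i in range(1, n - 1):
        d[i] = ((x[i] - x[i - 1]) + (x[i + 1] - x[i - 1]) / 2.0) / 2.0
    if n >= 2:
        d[0] = x[1] - x[0]      # assumption: forward difference at the start
        d[-1] = x[-1] - x[-2]   # assumption: backward difference at the end
    return d

def dtw_cost(a, b):
    """Classical DTW cost with absolute-difference local distance."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for ai in a:
        cur = [inf] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            cur[j] = abs(ai - bj) + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1]

def global_features(w):
    """Hypothetical global features: min, max, mean, standard deviation."""
    m = sum(w) / len(w)
    sd = math.sqrt(sum((v - m) ** 2 for v in w) / len(w))
    return [min(w), max(w), m, sd]

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def best_window(sdb, q, theta_cos, step):
    """Return (min DTW cost, start) over windows that pass the cosine filter."""
    T, gfq = len(q), global_features(q)
    best_cost, best_pos = float("inf"), None
    for i in range(0, len(sdb) - T + 1, step):
        r = sdb[i:i + T]
        if cosine(gfq, global_features(r)) >= theta_cos:
            cost = dtw_cost(q, r)
            if cost < best_cost:
                best_cost, best_pos = cost, i
    return best_cost, best_pos

def dtwbi(x, t, T, theta_cos=0.9, step_threshold=2, step_sim_win=1):
    """Fill the gap x[t:t+T] using the window most similar to the query before it."""
    dx = ddtw(x[:t])                  # Step 1: derivative of the data before the gap
    q = dx[t - T:t]                   # Step 2: query just before the gap
    sdb = dx[:t - 2 * T]              # Step 3: search database
    threshold, pos = best_window(sdb, q, theta_cos, step_threshold)  # Step 4
    cost, fine_pos = best_window(sdb, q, theta_cos, step_sim_win)    # Step 5
    if fine_pos is not None and cost <= threshold:
        pos = fine_pos
    if pos is None:
        raise ValueError("no reference window passed the cosine filter")
    y = list(x)
    y[t:t + T] = x[pos + T:pos + 2 * T]  # Step 6: values following the best window
    return y

# Usage: hide 10 samples of a period-20 sine and recover them.
truth = [math.sin(2 * math.pi * k / 20) for k in range(200)]
x = list(truth)
x[150:160] = [None] * 10
y = dtwbi(x, t=150, T=10)
```

On this strictly recurrent series the selected window is aligned with the query, so the imputed values coincide with the hidden ones; on real data the quality depends on how strongly the pattern recurs.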
