
    AAL Data Cluster Analysis  

    BACHELORARBEIT 

    zur Erlangung des akademischen Grades 

    Bachelor of Science 

    im Rahmen des Studiums 

    Medieninformatik und Visual Computing 

    eingereicht von 

    Dzenan Hamzic Matrikelnummer 0327029 

     

    an der Fakultät für Informatik 

    der Technischen Universität Wien 

     

Betreuung: Ao.Univ.Prof. Dipl.Ing. Dr.techn. Wolfgang Zagler
Mitwirkung: Projektass. Dipl.Ing. Peter Mayer

    Wien, 28. Juni 2016 

     

     Dzenan Hamzic                          Dipl.Ing. Peter Mayer               Dr.techn. Wolfgang Zagler 

     


    Erklärung zur Verfassung der Arbeit 

     Dzenan Hamzic Stavangergasse 4/7/3 1220 Wien [email protected] 

Hiermit erkläre ich, dass ich diese Arbeit selbstständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

     

     

    Wien, 28. Juni 2016  Dzenan Hamzic 

       


    Acknowledgments 

    First and foremost I would like to thank Amra Omeragic for her patience and loving support                               throughout the long writing process. 

I would like to thank Peter Mayer and Wolfgang Zagler for their contributions and for navigating this project with their advice, ideas and revisions.

I must also thank my mother and father, Nevresa and Enver Hamzic, for all of their patience, support and love, and for being my biggest fans.

    Finally, I would like to thank all those who stood by me in good and bad times. 

     

    Dzenan Hamzic   

     

       


      

    Abstract   

The eHome project of the Vienna University of Technology [1] is an R&D project whose goal is to provide assistive technologies for the private households of older people, with the idea of giving them the possibility of longer and independent living in their own homes. The eHome system consists of an adaptive, intelligent network of wireless sensors for activity monitoring with a central context-aware embedded system [2].

The primary goal of this thesis is to investigate unsupervised prediction and clustering possibilities for user behaviour, based on time-series data collected from infrared temperature sensors in the eHome environment. Three different prediction approaches are described: the Hourly Based Event Binning approach is compared to two clustering algorithms, Hierarchical Clustering and the Dirichlet Process GMM. Prediction rates are measured on data from three different test persons.

This thesis first examines two different approaches for event detection from infrared signal data. In a second stage, three different methods for unsupervised prediction analytics are discussed and tested on selected datasets. The parameter settings of the clustering algorithms for time-series data are also discussed and tested in detail. Finally, the prediction performance results are compared and each method's advantages and disadvantages are discussed.

The practical part of this thesis is implemented in an IPython notebook, using Python 2.7 on 64-bit Ubuntu Linux 12.04 LTS. Data analysis is implemented with Python's Pandas library; visualisations are made with the Matplotlib and Seaborn libraries.

The results reveal that prediction accuracy depends on data quantity and on the spread of the data points. The simplest method in the prediction comparison, Hourly Based Binning, nevertheless gave the best prediction rates overall. By contrast, Dirichlet Process Gaussian Mixture Model clustering shows its best prediction performance on smaller, well-spread training datasets. With further parameter tuning, the prediction rates of Dirichlet Process GMM clustering could be improved further, coming very close to, or even outperforming, Hourly Based Binning. Due to the unknown distribution and the wide spread of the data, choosing the right threshold parameter for Hierarchical Clustering was trickier than initially assumed; despite the initial assumptions, this method turned out to be the least applicable for unsupervised prediction analytics on the used datasets.


Contents

Erklärung zur Verfassung der Arbeit ........ 3
Acknowledgments ........ 4
Abstract ........ 5
1. Introduction ........ 8
   1.1. Motivation ........ 8
   1.2. Outline ........ 8
   1.3. The Goal ........ 8
   1.4. Methodology ........ 9
2. Theory Section ........ 10
   2.1. Gaussian Mixture Model ........ 10
   2.2. Dirichlet Process GMM ........ 12
   2.3. Hierarchical Clustering ........ 13
3. Implementation ........ 15
   3.1. Software Specification ........ 15
   3.2. Data and Sensors ........ 15
   3.3. Data Visualisation and Inspection ........ 16
   3.4. Sensor Values Discretisation and Extraction ........ 18
   3.5. Unsupervised Event Extraction ........ 24
   3.6. Data Structure for Event Analysis ........ 28
   3.7. Data Quality and Quantity ........ 29
   3.8. Predictive Analysis ........ 33
      3.8.1. Hourly Binning Analysis ........ 34
      3.8.2. Clustering Analysis ........ 37
         3.8.2.1. Hierarchical Clustering Analysis ........ 37
         3.8.2.2. Dirichlet Process Gaussian Mixture Model Clustering Analysis ........ 43
4. Summary ........ 52
   4.1. Results ........ 52
   4.2. Discussion ........ 53
Literature ........ 55
List of Figures ........ 56
List of Tables ........ 58
List of Abbreviations ........ 59

     

               

                   


    1. Introduction    

We are living in the times of "smart homes". Sensors register our every movement. We can use these data in various ways, for example to prevent leakage, disasters and robbery, or to automate doors, lights, etc. On the other hand, we can use these data to support older people by detecting whether they are behaving differently than usual, in order to trigger help, for example by sending an emergency call or by cutting the power supply to prevent a disaster. The eHome project of the Vienna University of Technology [1] is an R&D project whose goal is to provide assistive technologies for the private households of older people, with the idea of giving them the possibility of longer and independent living in their own homes.

     

1.1 Motivation

Most people want to be supported by their loved ones when they are old. At the same time, they do not want to be a burden to them. In many cases such people live alone. Many of them have difficulties walking and are prone to falling, without being able to get up again or to call for help. As another example, if a cooking plate stays on without being noticed, the house could be set on fire. In order to prevent such a disaster, the goal is to detect events like these. Once detected, it is possible to trigger an alarm, to remind the person to turn off the plate, or to send someone to help the person get up. For these and many other reasons there is a need for a system like eHome that detects unusual and dangerous behaviour and supports older people by providing them with assistive technologies.

1.2 Outline

Multiple wireless sensors of different types, such as a temperature sensor, IIR (infrared temperature sensor), REED (magnetic contact sensor) and PIR (passive infrared sensor), are placed in the house. The data from all sensors are recorded and analysed on eHome's central unit. Events like cooking and leaving the flat must first be correctly recognised in the sensor data. Having recognised and extracted the events, it can be further investigated whether any behavioural patterns or event accumulations can be recognised in the data, which would then be useful for predicting the future behaviour of the person.

1.3 The Goal

The goal is to have solutions that are as reliable as possible for detecting the typical times of a person's behaviour, such as cooking, without any supervised intervention. If the person usually cooks in the intervals between 10 AM and 1 PM and between 3 PM and 5 PM, it should be possible to find such time intervals. It should also be possible to obtain a probability of an event occurring in any given hour or time span.


1.4 Methodology

Three methods for possible event prediction are discussed. The Hourly Binning of detected events is the simplest solution; it returns event probabilities for every given hour. The two clustering approaches, Dirichlet Process Gaussian Mixture Model and Hierarchical Clustering, are used for finding event accumulations in time spans. Prediction possibilities are measured by splitting the whole dataset of a test person 50:50 into training and testing datasets and then comparing them for event hits and misses.


2. Theory

Every person is different and behaves differently. Some people normally cook at 12 PM, others at 6 AM. Some people cook only once a day, while others cook three or even four times a day. Some people do not cook at all. This leads to the conclusion that no fixed number of cooking events can be assigned to each person. With this in mind, the eHome system must be able to find the number of events taking place in the homes of older people in an unsupervised way.

The clustering algorithms chosen in this section are later used for finding event accumulations and for event clustering on time-series data from the eHome project's infrared sensors. The Dirichlet Process Gaussian Mixture Model and Hierarchical Clustering need no fixed number of clusters to be specified, which makes them appropriate for the unsupervised discovery of possible behavioural patterns.

2.1 Gaussian Mixture Model

Common and important clustering techniques are based on probability density estimation using the Gaussian Mixture Model and Expectation Maximisation.

The K-means algorithm uses just single points in feature space as cluster centres. Each data point is then assigned to the nearest cluster using the Euclidean distance from the cluster centre. Using only the Euclidean distance, the K-means algorithm is not well suited for overlapping data points in feature space or for clusters that do not form a circular shape.

Convex sets: in Euclidean space, an object is convex if for every pair of points within the object, every point on the straight line segment that joins them is also within the object [3].

K-means is often viewed as a special case of a Gaussian Mixture Model. GMM models can be seen as an extension of K-means models in which clusters are modelled with Gaussian distributions, using not only their means but also covariances that describe their ellipsoid shapes. The covariance parameter in a GMM can be constrained to spherical, diagonal, tied or full.


     

     Figure 2.1 Gaussian Mixture Model density estimation [4]. 

Figure 2.1 shows 3 Gaussian components. Each component is described by a Gaussian distribution, so each has a mean μ, a variance or covariance σ², and a "size" (mixing weight) π. The mean is responsible for the shift of the distribution, the variance determines how wide or narrow the component is, and π is the component's height.

The goal of performing a GMM clustering on a dataset is to find the model parameters (mean and variance of each component) so that the model fits the data as well as possible. This best-fit estimation usually translates into maximising the likelihood of the data under the GMM model.

Likelihood maximisation is performed by the Expectation Maximisation algorithm [B1]. The EM algorithm proceeds iteratively in 2 steps.

The Expectation step treats the Gaussian components' means, covariances and sizes as fixed. For each data point i and each cluster c, a probability value R_ic of that data point belonging to cluster c is computed. If the particular probability value R_ic is not high, data point i does not belong to cluster c. The clearest possible assignment of a data point to one cluster is R_ic = 1.

The Maximisation step starts from the assigned R_ic probabilities and updates the Gaussian components' parameters mean, covariance and size. For each cluster c the parameters are updated using the estimated R_ic probabilities as weights.

Each iterative step increases the log-likelihood of the model. Prior knowledge of the cluster number is assumed for this clustering algorithm. A short fitting sketch follows the advantages and disadvantages listed below.

Advantages:

- Fastest algorithm for learning mixture models
- No limits on cluster shape and size


     Disadvantages: 

- Bad performance on small datasets
- Fixed number of components
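The following is a minimal, hedged sketch of how such a GMM fit could look with scikit-learn's current GaussianMixture API (the scikit-learn version used in the thesis exposed the same model as sklearn.mixture.GMM); the toy data and parameter values are illustrative assumptions only:

# Minimal GMM/EM fitting sketch; toy data and settings are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

# toy 1-D data: two accumulations of event times (seconds since midnight)
X = np.concatenate([np.random.normal(9 * 3600, 1800, 50),
                    np.random.normal(18 * 3600, 2400, 50)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, covariance_type='full')  # cluster number fixed in advance
gmm.fit(X)                        # EM: E-step computes responsibilities, M-step updates parameters
print(gmm.means_, gmm.weights_)   # component means and mixing weights
labels = gmm.predict(X)           # hard cluster assignment for each data point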

       

2.2 Dirichlet Process Gaussian Mixture Model

As noted earlier, the Gaussian Mixture Model and K-means algorithms assume a fixed number of components to be found, but in most real-world problems the data are unstructured and no exact conclusion about the distribution of the data points can be made in advance.

The DPGMM is a Gaussian Mixture Model variant in which no prior knowledge of the cluster number is necessary. It uses a maximum-number-of-clusters parameter as an upper bound for the number of components to be found. Setting this parameter to e.g. five components should find all possible clusters in the data, up to a maximum of five. The algorithm should not simply split the data into five components but deliver the real cluster number and cluster the data accordingly. This is illustrated in the diagram below.

Figure 2.2 Gaussian Mixture Model and Dirichlet Process GMM, both initialised with 5 components [5].

This upper-bound parameter should be loosely coupled with the real cluster number. In comparison to the Gaussian Mixture Model, the Dirichlet Process GMM uses one additional parameter α (alpha) to specify the concentration of the data points. The alpha parameter controls the number of components used to fit the data. Lowering alpha clusters the data points more tightly, since the expected number of clusters is alpha*log(N); increasing it produces more clusters in any finite set of points. Given a low data quantity, the DPGMM tends to fit the data points to only a single component.


Depending on the data type and data distribution, the DPGMM allows one to set which cluster parameters are updated in the training process. It can be set up to update w (weights), m (means) and c (covariances), or any combination of the three.

The Dirichlet Process can be explained by the "Chinese restaurant process", which satisfies the following properties [6][7]:

- "Rich get richer" property: the more people sit at a table, the higher the chance of a new person joining it

- There is always a small probability that a newly entering person starts a new table
- The probability of a new group is set by the concentration parameter alpha

Tables in the Chinese restaurant paradigm correspond to the components of the GMM.

Advantages:

- Relatively stable (no big changes with small parameter tuning)
- Less tuning needed
- No need to specify the component number (only a loose upper bound)

     Disadvantages: 

- The Dirichlet Process makes inference slower (though not by much)
- Implicit biases (sometimes it is better to use finite mixture models such as a plain GMM)
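A minimal sketch of this kind of clustering, assuming the current scikit-learn API (BayesianGaussianMixture with a Dirichlet Process prior; the scikit-learn version used in the thesis exposed this as sklearn.mixture.DPGMM with an explicit alpha parameter); the data and parameter values are illustrative only:

# Minimal Dirichlet-Process-GMM-style sketch; data and parameter values are illustrative.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# three accumulations of event times (seconds since midnight)
X = np.concatenate([np.random.normal(h * 3600, 1800, 40) for h in (9, 13, 19)]).reshape(-1, 1)

dpgmm = BayesianGaussianMixture(
    n_components=10,                                   # loose upper bound, not the true cluster count
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=0.1)                    # plays the role of alpha: smaller -> fewer clusters
dpgmm.fit(X)
print(np.round(dpgmm.weights_, 3))                     # unused components receive weights close to zero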

2.3 Hierarchical Clustering

Hierarchical methods need no cluster number and no cluster seed specification. Hierarchical clustering methods produce a nested series of clusters and try to capture the underlying data structure by constructing a tree of clusters.

Two hierarchical approaches are possible [B2]. In the bottom-up approach, every data object starts as a cluster by itself. Nearby clusters are iteratively merged into bigger clusters until all clusters are merged into a single cluster at the highest hierarchy level or some stopping criterion is met. The top-down approach starts from one big cluster containing all data points at the highest level of the hierarchy. Going towards the bottom of the hierarchy, this method repeatedly splits clusters, resulting in smaller and smaller clusters, until every data point is a cluster by itself or some stopping criterion is met. Depending on the distances between the data objects, a threshold for forming flat clusters has to be set. Both approaches can use distance as a stopping criterion.

Computing the distance between all data points in two clusters is an expensive operation, especially on big datasets. Therefore, hierarchical clustering offers multiple algorithms for computing distances between clusters.


Single-link algorithm: computes the distance between the two nearest points, one from each cluster.
Complete-link algorithm: computes the distance between the two furthest points, one from each cluster (the opposite of the single-link algorithm).
Centroid algorithm: computes the distance between the centre points of the two clusters.
Average-link algorithm: computes the average distance over all pairs of data points, one from each cluster.

Advantages:

- Can provide more insight into the data (possible cluster hierarchy)
- Simple to implement
- Can provide clusters at different levels of granularity

     Disadvantages: 

- No reassignment of data objects to other clusters
- Time complexity O(n³)
- The distance matrix requires O(n²) memory space
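A minimal sketch of bottom-up hierarchical clustering with SciPy, contrasting some of the linkage strategies listed above; the toy data and the distance threshold are illustrative assumptions only:

# Minimal hierarchical clustering sketch; data and threshold are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 1-D data: two well-separated groups
X = np.concatenate([np.random.normal(0, 1, 20),
                    np.random.normal(10, 1, 20)]).reshape(-1, 1)

for method in ('single', 'complete', 'average', 'centroid'):
    Z = linkage(X, method=method)                        # bottom-up merging of the nearest clusters
    labels = fcluster(Z, t=3.0, criterion='distance')    # flat clusters cut at a distance threshold
    print(method, len(np.unique(labels)))                # number of flat clusters per linkage method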

          

               


3. Implementation

3.1 Software Specification

The practical part of this thesis is implemented in IPython Notebook [8] on Ubuntu 12.04 LTS. The Python version is 2.7.10 with 64-bit Anaconda 2.3.0 [9]. Anaconda is a completely free scientific Python distribution; it includes more than 300 of the most popular Python packages for science, math, engineering and data analysis. NumPy is used for mathematical functions like transposing, rounding and others, Pandas for CSV data-table management, scikit-learn for the Dirichlet Process GMM algorithm implementation, and SciPy for the Hierarchical Clustering implementation. Visualisations are made using the Matplotlib and Seaborn [10] libraries.

     

     

     

3.2 Data and Sensors

The eHome system consists of an adaptive, intelligent network of wireless sensors placed in the homes of older people for activity monitoring, together with a central context-aware embedded system. In each home the data are monitored by the central system in real time. The data subsets of the monitored test persons used in this thesis are collections of all sensor events recorded in a time frame of a few months.

Data events from wireless sensors such as accelerometers, temperature sensors, IIR (infrared temperature sensor), REED (magnetic contact sensor) and PIR (passive infrared sensor) [2] are recorded to CSV-formatted data files. Each line represents a single sensor event, and each new day begins with a new data file. The recorded data are comma-separated and have the following format:

     day.month.year hour:min:sec, unixtimestamp, milliseconds, sensor type, event type, event                 subtype, sensor ID, network ID, sensorvalue. 

The CSV file looks as follows:

     

     

     

     


     

     

2010.05.15 00:03:28,1273881808,716,R,0,4,573769,173,1,22.600000,2130706433
2010.05.15 00:03:30,1273881810,74,R,0,6,573771,183,1,0.000000,2130706433
2010.05.15 00:05:55,1273881955,232,R,0,6,573805,174,1,0.000000,2130706433
2010.05.15 00:06:29,1273881989,272,R,0,6,573811,173,1,3.000000,2130706433
2010.05.15 00:06:51,1273882011,301,R,0,6,573817,178,1,3.000000,2130706433
2010.05.15 00:06:57,1273882017,581,R,0,6,573822,177,1,3.000000,2130706433
2010.05.15 00:07:04,1273882024,338,R,0,6,573824,181,1,3.000000,2130706433
2010.05.15 00:07:05,1273882025,162,R,0,6,573826,175,1,3.000000,2130706433
2010.05.15 00:07:22,1273882042,59,R,0,6,573828,166,1,3.000000,2130706433
2010.05.15 00:07:37,1273882057,313,R,0,3,573832,180,1,24.500000,2130706433
2010.05.15 00:07:42,1273882062,834,R,0,6,573834,171,1,0.000000,2130706433
2010.05.15 00:08:52,1273882132,638,R,0,3,573851,146,1,19.500000,2130706433
2010.05.15 00:09:05,1273882145,308,R,0,3,573857,175,1,21.500000,2130706433
2010.05.15 00:10:44,1273882244,775,R,0,6,573877,176,1,3.000000,2130706433
2010.05.15 00:10:52,1273882252,784,R,0,6,573880,146,1,0.000000,2130706433
2010.05.15 00:11:37,1273882297,607,R,0,6,573890,180,1,3.000000,2130706433
2010.05.15 00:11:55,1273882315,719,R,0,6,573897,182,1,0.000000,2130706433
2010.05.15 00:13:30,1273882410,808,R,0,6,573916,183,1,0.000000,2130706433
2010.05.15 00:13:31,1273882411,141,R,0,4,573918,173,1,22.600000,2130706433
2010.05.15 00:15:55,1273882555,966,R,0,6,573952,174,1,0.000000,2130706433
2010.05.15 00:16:31,1273882591,700,R,0,6,573959,173,1,3.000000,2130706433

    Table 3.1. Sensor data sample from CSV file.  

     

The marked lines indicate a temperature of 22.6 °C recorded by the infrared sensor with ID 173.
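A minimal sketch of loading such a day file with Pandas; the column names below are assumptions inferred from the sample above, not the project's actual schema:

# Minimal loading sketch; column names are assumed for illustration.
import pandas as pd

cols = ['datetime', 'unix_ts', 'ms', 'sensor_type', 'event_type', 'event_subtype',
        'event_id', 'sensor_id', 'network_id', 'value', 'extra']
df = pd.read_csv('2010_05_15.csv', header=None, names=cols)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y.%m.%d %H:%M:%S')

# keep only the infrared cooking sensor (ID 173 in the sample above)
iir = df[df['sensor_id'] == 173].set_index('datetime')
print(iir['value'].head())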

     

     

     

     

    3.3 Data Visualisation and Inspection  

    The first step of almost every data analysis is to get to know the data. In this case the data of the                                           Infrared Temperature Sensor (placed by the cooking plate) and Magnetic Contact Sensors                       (placed on doors) are going to be inspected by visualising their values. 


     Figure 3.1 Cooking Sensor data overview. 

As can be seen from Figure 3.1, the cooking plate was active multiple times on 15.05.2010. It can be noticed that the temperature is not as high as usual on cooking plates; the reason is that the sensor is not placed directly on the cooking plate but to the side. By visually inspecting the cooking sensor temperature values, 3 big peaks and multiple small peaks can be seen. Such small peaks are of no interest for this thesis because they do not imply cooking; cooking usually takes longer than 10 minutes. Event length detection and filtering are discussed in further sections. The IIR (infrared temperature sensor) cooking sensor continuously delivers temperature values. The minimal sending interval is 1 minute, the maximal sending interval is 10 minutes, and it is triggered by a temperature change of ±0.5 °C. This setup can be visually confirmed in Figure 3.2 below.

     

     

Figure 3.2 Infrared sensor sending intervals.

It can be seen that the sensor fired multiple times in the interval between 8:46 and 8:56, which indicates a strong increase in temperature and suggests that the cooking plate was turned on.

Sensors like the PIR sensor (movement sensor) work with discrete values: a one is saved whenever movement is detected, otherwise nothing is saved. Such discrete values are easier to work with, and the clustering algorithms can be fed with them directly. Below is the visualisation of the Passive Infrared sensor values on 19 and 20 May 2010 in the home of Test Person 1.

      

    Figure 3.3 PIR Sensor values visualisation for TP1. 

     

Figure 3.3 shows nicely whether there was any movement in the house. The value gaps, which could indicate sleeping or being outdoors, can easily be filled with virtual sensor values in order to be processed further.

In order to do density estimation on the infrared sensor's time-series values, some kind of conversion to discrete values is needed. When working with time-series data, the density-estimation clustering algorithms should only be fed with values recorded while the cooking plate is turned on. All other values from the cooking sensor can be filtered out of the dataset or set to 0.

Multiple steps are needed in order to discretise the values from the cooking-plate infrared sensor. There are multiple possible solutions to this issue, and some of them are discussed in the next section of the thesis.

     

     

3.4 Sensor Values Discretisation and Extraction

As noted earlier, the IIR sensor values need discretisation. The idea is to implement some kind of sampling on the continuous values of the IIR cooking sensor. This is needed to be able to extract the cooking events from the rest of the sensor events in the dataset.

Given a discrete signal from the infrared temperature sensor, two approaches for cooking event extraction from a dataset are discussed. The goal is to be able to detect cooking events reliably and correctly in a single dataset.


The first approach is the sequencing of temperature rises that belong together. The second approach is unsupervised event extraction using a clustering algorithm for grouping temperature increases that belong together; the latter is discussed in the next section. The three big peaks from Figure 3.1 need to be discretised, since only "heating in progress" on the cooking plate is of interest. This suggests that positive temperature increases should be inspected, and leads to the conclusion that the first step of transforming the temperature signal should be a difference operation on the temperature values.

     

The result of the temperature difference operation is visualised in the following diagram.

      

    Figure 3.4 Temperature differences.  

     

Figure 3.4 clearly shows the temperature peaks. Small increases in temperature can be categorised as signal noise (±0.1 °C). This noise was present in most of the datasets and can clearly be seen e.g. between 03:00 and 06:00. Negative temperature differences can simply be filtered out of the new dataset. The positive "noise" in the temperature signal (+0.1 °C) also has to be filtered out in order to keep only the relevant temperature increases. This can be dealt with using two approaches.

First: taking the sensor's sensitivity of ±0.5 °C into consideration, every temperature difference below that threshold could be set to 0.

Second: counting the absolute probability of every positive temperature difference and filtering out (setting to 0) all values below some threshold. This threshold could be set by finding the values which have probabilities of e.g. 15% or 20% in the difference dataset. Using this approach demands a detailed analysis of the sensor behaviour.
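A minimal sketch of the difference-and-threshold idea, assuming the Pandas DataFrame iir of cooking-sensor readings built in the earlier loading sketch:

# Minimal difference-and-threshold sketch; builds on the assumed `iir` DataFrame from above.
iir = iir.sort_index()
iir['temp_diff'] = iir['value'].diff()         # temperature change with respect to the previous reading

# first variant: drop everything below the sensor's +/-0.5 degC sensitivity
iir['is_rising'] = iir['temp_diff'] >= 0.5     # binary column marking "heating in progress"
rises = iir[iir['is_rising']].copy()           # only the relevant positive increases remain
print(rises[['value', 'temp_diff']].head())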

     

    Dzenan Hamzic                   TU Wien                   19 

  •  

    Below is the table of probabilities for every temperature difference.   

0.0 °C: 62x, 34%
0.1 °C: 48x, 26%
0.2 °C: 4x, 2%
0.3 °C: 2x, 1%
0.4 °C: 3x, 1%
0.5 °C: 1x, 0%
0.6 °C: 22x, 12%
0.7 °C: 12x, 6%
0.8 °C: 8x, 4%
0.9 °C: 5x, 2%
1.0 °C: 2x, 1%
1.1 °C: 1x, 0%
1.2 °C: 2x, 1%
1.6 °C: 1x, 0%
1.9 °C: 1x, 0%
2.5 °C: 1x, 0%
2.9 °C: 1x, 0%
3.1 °C: 1x, 0%
3.3 °C: 1x, 0%
3.6 °C: 1x, 0%
5.8 °C: 1x, 0%

Figure 3.5 Absolute probabilities of temperature-value differences.

     

    Almost every provided dataset was noisy. One of the two above mentioned methods can be used                               to get the clean signal. After successful implementation of this step the dataset should no longer                               contain the sensor noise. 

     

     

    Figure 3.6. Filtered temperature differences. 

     

The filtered signal can now be manipulated easily. One could add a new binary column to the new dataset which indicates the positive increases in temperature. This column would set ones in rows which indicate that the cooking plate is turned on, isolating the needed signal events from the rest of the dataset. By plotting the temperature values marked as true in the new column we get the following diagram:

     

     


       

    Figure 3.7 Isolated cooking signals.  

The three big cooking peaks from Figure 3.1 are now isolated and clearly visible in Figure 3.7. The newly created binary column contains the sampled cooking signal, which can be manipulated further.

     Figure 3.8 Sampled temperature rises. 

     

Figure 3.8 shows the data points where the temperature was rising. Some stand-alone points, like the one between 11:19:00 and 11:49:00, could indicate that the cooking plate was turned on only briefly. Such single standing points indicate no cooking and should also be filtered out.

One possibility for finding such single temperature increases is to check whether the data points are isolated relatively far from the others. If such points are not in the vicinity (duration based) of other positive temperature-rising sequences, they can be filtered out; a simple check of the timing of the previous and succeeding positive temperature increase would be enough. Taking into account that some cooking plates "turn off" for short periods of time after reaching a certain temperature and then turn on again, the threshold for filtering such single-standing events should be carefully chosen and not be too short.


If there is no positive temperature increase before or after such an isolated point, it can be marked as "NO SEQUENCE EVENT", or -1 in the following example. The following listing demonstrates the idea of events that stand close to other positive temperature increases.

     

2010-05-15 09:32:29, -0.3, 27.7, 28.0, sequence_event: -1
2010-05-15 09:42:32, -0.4, 27.3, 27.7, sequence_event: -1
2010-05-15 09:47:39,  0.6, 27.9, 27.3, sequence_event:  0
2010-05-15 09:50:39,  0.7, 28.6, 27.9, sequence_event:  1
2010-05-15 09:51:40,  0.7, 29.3, 28.6, sequence_event:  2
2010-05-15 09:55:41,  0.6, 29.9, 29.3, sequence_event:  3
2010-05-15 09:57:41,  0.7, 30.6, 29.9, sequence_event:  4
2010-05-15 10:00:42,  0.8, 31.4, 30.6, sequence_event:  5
2010-05-15 10:04:43,  0.7, 32.1, 31.4, sequence_event:  6
2010-05-15 10:10:44,  0.8, 32.9, 32.1, sequence_event:  7
2010-05-15 10:17:46,  3.1, 36.0, 32.9, sequence_event:  8
2010-05-15 10:18:46, -1.9, 34.1, 36.0, sequence_event: -1    a) belongs to sequence
2010-05-15 10:20:47,  0.9, 35.0, 34.1, sequence_event:  0
2010-05-15 10:21:47,  0.8, 35.8, 35.0, sequence_event:  1
2010-05-15 10:22:47,  2.5, 38.3, 35.8, sequence_event:  2
2010-05-15 10:23:49, -1.0, 37.3, 38.3, sequence_event: -1    b) belongs to sequence
2010-05-15 10:25:49,  0.7, 38.0, 37.3, sequence_event:  0
2010-05-15 10:29:04, -0.9, 37.1, 38.0, sequence_event: -1
2010-05-15 10:31:04, -0.9, 36.2, 37.1, sequence_event: -1

Table 3.2. Sequencing cooking sensor events.
(Date-time, temperature difference from previous event, current temperature, previous event temperature, is_in_sequence)

Table 3.2 demonstrates the idea of sequencing the positive temperature differences. Events a and b should be merged with the other sequences, since they are very close to the preceding and following sensor events. If they were e.g. 15 or 20 minutes away from another positive increase, they could be filtered out. Having the sensor events sequenced and the single temperature increases filtered out, one additional field can be added to the dataset that indicates the sequence name. This is needed in order to allow a programmatic selection of distinct cooking events from the dataset; the idea is to have some "select distinct" or "group by" functionality on the dataset. Table 3.3 demonstrates the idea.

     

     

     

     

     

     


2010-05-15 08:46:17,  0.6, 23.2, sequence_event:  0, sequence_name: A
2010-05-15 08:48:17,  1.1, 24.3, sequence_event:  1, sequence_name: A
2010-05-15 08:49:18,  0.6, 24.9, sequence_event:  2, sequence_name: A
2010-05-15 08:50:18,  0.6, 25.5, sequence_event:  3, sequence_name: A
2010-05-15 08:52:18,  1.0, 26.5, sequence_event:  4, sequence_name: A
2010-05-15 08:53:19,  0.6, 27.1, sequence_event:  5, sequence_name: A
2010-05-15 08:55:19,  0.9, 28.0, sequence_event:  6, sequence_name: A
2010-05-15 08:57:20,  0.9, 28.9, sequence_event:  7, sequence_name: A
2010-05-15 08:59:20,  0.6, 29.5, sequence_event:  8, sequence_name: A
2010-05-15 09:02:21,  0.7, 30.2, sequence_event:  9, sequence_name: A
2010-05-15 09:05:22,  0.6, 30.8, sequence_event: 10, sequence_name: A
2010-05-15 09:07:22, -0.7, 30.1, sequence_event: -1, nan
2010-05-15 09:10:23, -0.7, 29.4, sequence_event: -1, nan
2010-05-15 09:14:24, -0.7, 28.7, sequence_event: -1, nan
2010-05-15 09:22:27, -0.7, 28.0, sequence_event: -1, nan
2010-05-15 09:32:29, -0.3, 27.7, sequence_event: -1, nan
2010-05-15 09:42:32, -0.4, 27.3, sequence_event: -1, nan
2010-05-15 09:47:39,  0.6, 27.9, sequence_event:  0, sequence_name: B
2010-05-15 09:50:39,  0.7, 28.6, sequence_event:  1, sequence_name: B
2010-05-15 09:51:40,  0.7, 29.3, sequence_event:  2, sequence_name: B
2010-05-15 09:55:41,  0.6, 29.9, sequence_event:  3, sequence_name: B
2010-05-15 09:57:41,  0.7, 30.6, sequence_event:  4, sequence_name: B
2010-05-15 10:00:42,  0.8, 31.4, sequence_event:  5, sequence_name: B
2010-05-15 10:04:43,  0.7, 32.1, sequence_event:  6, sequence_name: B
2010-05-15 10:10:44,  0.8, 32.9, sequence_event:  7, sequence_name: B
2010-05-15 10:17:46,  3.1, 36.0, sequence_event:  8, sequence_name: B

     Table 3.3 Naming the sequences. 

    (Datetime, temperature difference from previous event, current temperature, sequence event number, sequence name) 

     

    With sequences named, the cooking events can now be programmatically selected. The distinct                         cooking event sequences can now be visualised. 

     Figure 3.9 Programmatically selected cooking events. 

     


    Found events are summarised in the following table.  

Sequence       A         B         C         D         E
Minutes        21.1      31.1      3.0       4.0       7.0
Seconds        1265.1    1867.5    182.0     241.0     421.7
Start          08:46:17  09:47:39  10:20:47  12:29:56  12:36:58
End            09:05:00  10:17:46  10:22:47  12:32:57  12:42:59
Count events   11        9         3         4         2

Table 3.4. Programmatically selecting cooking sequence events.

     

     

Table 3.4 lists 5 distinct cooking events. Sequences B and C are very close to one another, as are sequences D and E. Looking at the start and end times of those sequences leads to the conclusion that such neighbouring sequences can be merged into larger ones, since they are separated by only a few minutes.

The merging of sequences can be done programmatically, with some threshold variable used to check the distances between sequences (as in the sketch below), or by some clustering algorithm which groups data events that belong together. The latter is discussed in the next section.
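A minimal sketch of the programmatic variant, assuming the rises DataFrame of positive temperature increases from the earlier sketches; the 15-minute gap threshold is an illustrative value, not the project's setting:

# Minimal programmatic sequence grouping/merging sketch; the gap threshold is illustrative.
import pandas as pd

GAP = pd.Timedelta(minutes=15)
gaps = rises.index.to_series().diff()            # time since the previous positive increase
rises['sequence_id'] = (gaps > GAP).cumsum()     # start a new sequence whenever the gap is too large

# per-sequence summary: start, end, duration in minutes and number of sensor events
summary = rises.groupby('sequence_id').apply(
    lambda g: pd.Series({'start': g.index.min(),
                         'end': g.index.max(),
                         'minutes': (g.index.max() - g.index.min()).total_seconds() / 60.0,
                         'events': len(g)}))
print(summary)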

     

     

3.5 Unsupervised Event Extraction

Another approach for extracting cooking events from the datasets is to feed an unsupervised clustering algorithm with the sampled cooking data. The advantage of this method is that cooking sequences close to one another are merged automatically. The precondition for this approach is, as already mentioned, the transformation of the continuous data into a discrete format and the removal of single peaks in the temperature values.

Hierarchical clustering can be used for the unsupervised grouping of temperature-increasing sequences. As noted earlier, some cooking plates turn off for a short period of time and then turn on again, producing multiple positive temperature increases close to one another; hierarchical clustering merges such temperature-increasing sequences nicely. It also performs well on sparse data, which is the case for the positive temperature increases. Hierarchical clustering is discussed further in the theory section (see Chapter 2.3). Time-series data need to be converted before the hierarchical clustering algorithm can be fed with them.


The time dimension needs to be converted into a distance dimension. One possible solution is to convert the sensor timestamps into the distance from the beginning of the day: a new column in the dataset holds the seconds (plus milliseconds) elapsed since the beginning of the day for every sensor event.

Having the cooking temperature values sampled and the delta times from the beginning of the day computed, hierarchical clustering can be tested. Using the default parameters of the algorithm [11] with criterion="distance", a threshold of 1000 seconds (about 16 minutes) and "single" linkage, which reduces computing time, the following clusters are found (a minimal sketch of this clustering call is shown after Table 3.4 below):

     Figure 3.10 Hierarchical Clustering of sequenced datapoints. 

     

Cluster        1         2         3
Minutes        11        21.1      34.2
Seconds        662.7     1265.1    2049.5
Start          12:29:56  08:46:17  09:47:39
End            12:42:59  09:05:22  10:22:47
Count events   6         11        12

Table 3.4. Hierarchical clustering cluster statistics.
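A minimal sketch of the clustering call described above, assuming the rises DataFrame with a DatetimeIndex from the earlier sketches:

# Minimal sketch of the distance-threshold clustering call; assumes the `rises` DataFrame from above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# time of day in seconds since midnight, as a one-column feature matrix
idx = rises.index
seconds = (np.array(idx.hour) * 3600 + np.array(idx.minute) * 60
           + np.array(idx.second)).astype(float)
X = seconds.reshape(-1, 1)

Z = linkage(X, method='single')                      # single linkage keeps the computation cheap
labels = fcluster(Z, t=1000, criterion='distance')   # cut at 1000 s (~16 min) between events
print(np.unique(labels))                             # one label per detected cooking cluster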

     


Figure 3.10 and Table 3.4 show 3 larger data-event groupings. All small temperature peaks are merged with those closest to them. This result can be checked against the three big peaks from Figure 3.1: as can be seen, the found clusters correspond exactly to all cooking peaks in the dataset.

     

This method has been tested on all provided datasets and has given good results in recognising cooking events. The threshold of 1000 seconds with single linkage has given very good results in recognising and merging temperature increases in the event detection step.

Another example of multiple temperature increases close to one another, all of them indicating one and the same cooking event, is shown in the following diagram. The green cluster consists of multiple temperature increases, which the hierarchical clustering algorithm nicely groups together.

     

     Figure 3.11 Hierarchical Clustering of sequenced datapoints. 

     

Figure 3.11 shows a good example of a long cooking event in which the periodic turning off of the cooking plate is clearly visible.

      

Both methods of cooking event extraction have been tested on multiple datasets. They are reliable methods for detecting and extracting cooking events, and in the event detection step each method can be used alone as well as in combination with the other. Hierarchical clustering provides great results in the step of merging shorter temperature increases with longer ones. Since the quantity of data is not huge, algorithmic complexity plays no big role in this case. It also reduces programming complexity and removes the need for a duration threshold between temperature increases. Additionally, one could also implement a temperature-increase delta check: if the temperature difference in a peak is not significant, which implies no cooking, that peak is not of interest and can be filtered out.

     

A successful and reliable implementation of cooking event detection and extraction on a single dataset is the basis for automating that method over all provided datasets at once. One data structure should be built to hold all relevant data of the cooking events for each day; this is discussed in the next section of the thesis.

3.6 Data Structure for Event Analysis

One unified data structure is needed to hold all recognised events from a given test person's dataset, in order to allow further analysis of that person's behaviour.

This data structure may be implemented in various ways. One possibility is some kind of nested dictionary or multi-level key-value structure. Since the practical part of this thesis is implemented in Python, a nested dictionary was used to hold all extracted cooking events.

The nested dictionary has the following form:

# first-level dictionary
outputDict = {}
# the key is the date of the data-source file
key = day_date.strftime('%Y_%m_%d')
# second-level dictionary for that day
outputDict[key] = {}
# number of event clusters extracted with hierarchical clustering
outputDict[key]['hclustersnr'] = number_of_hclusters_found
# number of programmatically extracted events (DISTINCT selection)
outputDict[key]['eventgroupsnr'] = number_of_eventgroups_found
# third-level dictionaries holding the lists of extracted events
outputDict[key]['hclusters'] = {}
outputDict[key]['eventgroups'] = {}

The 'hclusters' and 'eventgroups' dictionaries can now be filled with the detected cooking events:

# for each event detected with hierarchical clustering (1st, 2nd, ..., n-th cooking)
outputDict[key]['hclusters'][1] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
outputDict[key]['hclusters'][2] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
# ...
outputDict[key]['hclusters'][n] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]

# for each programmatically detected event (1st, 2nd, ..., n-th cooking)
outputDict[key]['eventgroups'][1] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
outputDict[key]['eventgroups'][2] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]
# ...
outputDict[key]['eventgroups'][n] = [eventNumber, length, startTime, endTime, meanTime, sensorEventsNumber]

Such a data structure gives many possibilities for further data manipulation and analysis. For each day, the number of detected events is saved; for each single cooking event the length, start time, end time, mean time, sensor-event counter and event number are saved. Such a data structure also allows sorting by date, which is needed to divide the test person's dataset into training and testing parts (a small sketch follows below).
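A minimal sketch of such a date-sorted 50:50 split, assuming the outputDict structure above:

# Minimal date-sorted 50:50 split sketch over the nested dictionary above.
sorted_days = sorted(outputDict.keys())     # '%Y_%m_%d' keys sort chronologically as strings
half = len(sorted_days) // 2
train_days, test_days = sorted_days[:half], sorted_days[half:]
training = {day: outputDict[day] for day in train_days}
testing = {day: outputDict[day] for day in test_days}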


3.7 Data Quantity and Quality

Data quality and data quantity are of great importance for measuring prediction rates and for finding event accumulations. No pattern can be learned if the data collection period is too short. If the test person rarely cooks, or uses the cooking plate only briefly to heat up meals, longer periods of data collection are needed to detect reliable patterns.

     

Having extracted all events from the test person's datasets and stored them in one single data structure, the cooking event distribution can be inspected. The dataset statistics shown in the following table are useful for getting insight into possible underlying event patterns and the person's behaviour. All this can help in the further analysis of the prediction methods.

     

    Three randomly chosen test persons are going to be compared. Below are the dataset statistics. 

     

Test Person               TP10       TP2        TP3
Mean Cooking Time:        11:00:45   10:23:09   11:08:37
Cooking Standard Dev:     03:19:34   03:43:24   03:51:26
Days Cooked:              26         75         90
Days No-Cooking:          45         52         26
Total Days:               71         127        116
Percentage Cooked:        36%        59%        77%
Percentage No-Cooking:    63%        40%        22%

Table 3.5 Test persons dataset statistics.

     

The "Total Days" parameter indicates the number of days for which data were recorded. Test persons 2 and 3 have significantly more recorded data than test person 10; those 2 datasets are good examples where prediction rates should be higher. The "Percentage Cooked" and "Percentage No-Cooking" parameters indicate the event density for a given person.

     

     

The visualisation of the event occurrence density over the recording time period gives another good insight into a person's behaviour patterns and possible hints regarding prediction rates.

     

     

     

Figure 3.12 TP10 Cooking density overview.

Figure 3.13 TP3 Cooking density overview.

Figure 3.14 TP2 Cooking density overview.

Figure 3.12 shows that test person 10 uses the cooking plate mostly once a day, and on most days no cooking was detected. Making reliable predictions from such a sparse dataset may be a hard task, taking the relatively small amount of data into consideration.

Figures 3.13 and 3.14 show the relatively good data quality recorded for test persons 2 and 3. Test person 2 has only one big gap where no cooking event was registered. Finding behavioural patterns in dense datasets like these two should be possible; this statement is put to the test later in the thesis.

     

     

3.8 Predictive Analysis

This section analyses two different approaches to predicting a test person's behaviour: hourly based event grouping versus event accumulations found by clustering algorithms. The main goal is to obtain predictions that are as reliable and precise as possible.

     

The first approach is to bin the detected events by hour and calculate the hourly probability of an event occurring. This is the easier and less computationally intensive approach of the two. Many say that the simplest method is usually the best, but we will put this saying to the test.

The second approach uses clustering algorithms to discover possible accumulations of data points. Two clustering algorithms are put to the prediction analysis test: Hierarchical Clustering, which was already used for cooking event detection, and the Dirichlet Process Gaussian Mixture Model.

Hierarchical Clustering has already shown good performance in clustering sparse, discretised sensor signals. Having extracted all cooking events from the given datasets, a new dataset with dense data is created; the question is how Hierarchical Clustering performs on such datasets.

     

The test person's datasets containing all extracted cooking events (the nested dictionary from the previous section) are split 50:50 into training and testing datasets. The fifty-fifty split is rather unusual, but the idea is to find out whether event accumulations can be found and whether the cooking patterns are predictable at all.

     

In the case of hourly-binning predictions, the hit rate is measured by checking whether the mean times of the testing dataset points fall into any of the hourly bins from the training dataset. In the case of clustering predictions, the hit rate is measured by checking whether the mean times of the data points from the testing dataset fall into any of the clusters found in the training dataset (a small sketch of this check follows).
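A minimal sketch of the cluster-based hit/miss check; representing a training cluster by the interval between its earliest and latest event time is an assumption for illustration, not necessarily the thesis's exact rule:

# Minimal cluster-based hit/miss sketch; the interval representation is an illustrative assumption.
def hit_rate(test_mean_times, training_clusters):
    # training_clusters: list of (start_seconds, end_seconds) intervals from the training set
    hits = sum(1 for t in test_mean_times
               if any(start <= t <= end for start, end in training_clusters))
    return float(hits) / len(test_mean_times)

# e.g. two clusters around 09:00-10:30 and 18:00-19:00 (times in seconds since midnight)
clusters = [(9 * 3600, 10.5 * 3600), (18 * 3600, 19 * 3600)]
print(hit_rate([9.5 * 3600, 14 * 3600], clusters))   # -> 0.5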

      


3.8.1 Hourly Binning Analysis

Cooking events from the training dataset are binned into hour intervals. If the mean time of a cooking event is 8:45, it is put into the 8-hour bin. Events with mean times very close to the next hour (like 10:59) are also binned according to their hour; such events could arguably be assigned to the next or previous hour, which may be an issue in the further analysis of hourly binning predictions. A minimal binning sketch is shown below.
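A minimal sketch of the hourly binning and the per-hour event probability, assuming a list training_mean_times of datetime objects (the mean times of the training events):

# Minimal hourly binning sketch; `training_mean_times` is an assumed list of datetimes.
from collections import Counter

hour_bins = Counter(t.hour for t in training_mean_times)      # number of events per hour bin 0..23
total = float(sum(hour_bins.values()))
hour_prob = {h: hour_bins[h] / total for h in range(24)}      # probability of cooking in each hour

def is_hit(test_time):
    # a testing event is a hit if its hour bin was ever occupied in the training data
    return hour_bins[test_time.hour] > 0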

Having the data from the training dataset binned, the probability of an event occurring at a certain hour (bin) can be calculated. The same binning is then applied to the testing dataset and the overlapping counts are calculated. The visualisation of the binned events of the training and testing datasets for TP2 is shown below, preceded by a minimal code sketch of the procedure.
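The sketch assumes that the event mean times are available as Python datetime objects; the function and variable names are illustrative only:

from collections import Counter

def hourly_bins(mean_times):
    """Count cooking events per hour-of-day bin (0-23)."""
    return Counter(t.hour for t in mean_times)

def hourly_probabilities(mean_times):
    """Relative frequency of cooking events per hourly bin."""
    bins = hourly_bins(mean_times)
    total = float(sum(bins.values()))
    return {hour: count / total for hour, count in bins.items()}

def binning_hit_rate(train_times, test_times):
    """Share of test events whose mean time falls into an hourly bin
    that is occupied in the training data."""
    occupied = set(t.hour for t in train_times)
    hits = sum(1 for t in test_times if t.hour in occupied)
    return hits / float(len(test_times))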

     Figure 3.15 TP2 training vs testing data bins comparison. 

      

Some behavioural patterns are visible at first sight. The most probable bins had the biggest match rates, with the exception of the 4 AM bin. Only the 14-hour and 21-hour bins had no matches. Visualising the hourly cooking probability gives an even better insight into the prediction possibilities.

       

      


     Figure 3.16 TP2 training vs testing hourly probability comparison. 

     

Figures 3.15 and 3.16 show a clear resemblance in test person 2's behaviour. Training and testing dataset visualisations can be made for the other test persons to get a better intuition of the possible end results.

      Figure 3.17 TP3 training vs testing data bins and hourly probability comparison. 

     

Below is the binning visualisation of test person 10's training and testing datasets, which clearly shows the need for better quality or a larger amount of data.


      

    Figure 3.18 TP10 training vs testing data bins and hourly probability comparison.   

Due to the rare cooking events and the small size of test person 10's dataset, as shown in Figure 3.18, excellent prediction results are not to be expected.

     

The table below summarises the hourly binning prediction rates on all three datasets.

     

    Test Person  TP2  TP3  TP10 

    Hit Rate  92.40 %  100.00 %   62.00 % 

    Miss Rate  7.60 %  0.00 %  38.00 % 

    Training set size  60  87  21 

    Testing set size  79  103  15 

     Table 3.6. Hourly binning prediction analysis summary. 

     

Hourly based pattern discovery yields relatively good prediction rates. Test person 3's dataset was the biggest in size and reached a 100% prediction rate. Test person 10's unsatisfactory numbers can be explained by the data shortage. The thing to note about hourly binning prediction is that the prediction rates are higher if the datapoints are nicely distributed across the hourly bins: if, for example, the test person cooked just once in every hour of the training dataset, the prediction rate would be 100%. The advantage of hourly based binning is that the cooking probability can be computed for every single hour.

Taking the low computational demands into consideration, hourly binning is the way to go if the provided hardware has only limited processing resources.

     

     

3.8.2 Clustering Analysis

Clustering is unsupervised learning. It tries to learn only from the provided data. The main idea is to find similarity measures and group similar objects together.

Cooking plate temperature values have no labels and no known structures. In order to find any hidden structures in the temperature values, unsupervised learning is the way to go. Unsupervised learning uses machine learning algorithms to discover and describe key data features and is closely related to density estimation in statistics. Depending on the data, different techniques are applicable, and combinations of different unsupervised learning methods can be used to produce better results. The main problem in most unsupervised learning methods is finding a good number of clusters.

Fed with the same data, different unsupervised learning algorithms behave differently and produce different results. Choosing the optimal solution for a specific dataset is more an art than a science. A lot of testing of different methods and parameter settings is needed to be able to choose an appropriate method.

     

In this section two clustering algorithms are compared on their unsupervised clustering hit/miss performance. Since Hierarchical Clustering has fewer parameters to play and test with, it is a good decision to start with it. A cluster number eventually found by Hierarchical Clustering can then be used as the maximum-components parameter of the Dirichlet Process GMM.

     

     

3.8.2.1 Hierarchical Clustering Analysis

As noted earlier, Hierarchical Clustering takes a threshold input parameter as the basic distance measure between data clusters. The cooking event data have two dimensions: the distance from the beginning of the day in seconds and a binary value indicating that a cooking event took place. Taking the time distance between cooking events into consideration, the Hierarchical Clustering algorithm is set up with criterion = "distance".

Typical cooking takes place a few times a day, and the distance between cooking events should normally be at least 30 minutes. Considering this, the Hierarchical Clustering algorithm can be set up with a threshold of 1800 seconds. Lowering the threshold would produce more clusters with smaller distances between them.
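A minimal sketch of this setup is given below. It assumes SciPy's agglomerative clustering applied to the time dimension only (event mean times in seconds since midnight); the linkage method is an assumption, since only the criterion and the threshold are specified here:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_cooking_events(event_seconds, threshold=1800):
    """Cluster cooking event mean times (seconds since midnight) and cut
    the dendrogram at a fixed time-distance threshold."""
    X = np.asarray(event_seconds, dtype=float).reshape(-1, 1)
    Z = linkage(X, method='single')  # merges driven by pairwise time distances
    return fcluster(Z, t=threshold, criterion='distance')

# example: three morning events and one evening event (times in seconds)
labels = cluster_cooking_events([8 * 3600, 8 * 3600 + 900, 9 * 3600, 20 * 3600])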

     

Below are the Hierarchical Clustering execution results on the data of test persons 2, 3 and 10, with criterion "distance" and a threshold of 1800 seconds.

     Figure 3.19 Hierarchical Clustering on TP3 training data with threshold of 1800 seconds. 

     

Figure 3.19 shows the clusters found in the training data. The test data (red points) are plotted only for comparison. The blue and green lines indicate the hourly probability of cooking events occurring in the training and testing datasets respectively. The black horizontal line stands for the standard deviation of the hourly probabilities. This parameter could be an answer to the question of which clusters should be kept: the hourly probability standard deviation seemed to be a good threshold parameter for filtering rare events on the time dimension. This can be visually inspected by looking at clusters 5, 6 and 7 (CL 5, CL 6, CL 7), which are based on single points. Cluster 2 with 3 points and half of cluster 3 would also be filtered out. The following figure demonstrates hourly std. filtering on the TP3 dataset.

     

     

     

     

     


       

     Figure 3.20 Threshold filtered TP3 dataset. 

     

There are only 3 clusters left in Figure 3.20, which is pretty close to the real cluster number. Hourly std. threshold filtering could therefore be used to estimate a rough cluster number, which could in turn serve as an input parameter for K-means or DPGMM; a sketch of this filtering idea follows.
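This is only one possible reading of the filtering step, sketched under the assumption that events falling into hours whose cooking probability lies below the standard deviation of all hourly probabilities are dropped (the dictionary of hourly probabilities is assumed to be keyed by hour of day, as in the binning sketch above):

import numpy as np

def filter_by_hourly_std(event_seconds, hourly_prob):
    """Drop events that fall into hours whose cooking probability lies
    below the standard deviation of the hourly probabilities."""
    threshold = np.std(list(hourly_prob.values()))
    return [s for s in event_seconds
            if hourly_prob.get(int(s // 3600), 0.0) >= threshold]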

     

There is an alternative to hourly std. based filtering: one could use the cluster duration or the number of datapoints in a cluster as constraints. After executing the Hierarchical Clustering method on the dataset, a cluster duration check can be made to filter out clusters that are too short. This approach would filter out the single-point clusters, and clusters shorter than e.g. 10 minutes could be filtered out as well. Additionally, a constraint on the number of datapoints per cluster can be applied: if a cluster contains fewer than e.g. 5 datapoints, it could be filtered out. This can be seen on cluster 2, which has only 3 points; although it contains only 3 datapoints, it is 17 minutes in duration. Using these filtering methods would result in the same clusters as in Figure 3.20.
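A sketch of this alternative filtering is given below; the representation of a cluster as a dictionary with 'start' and 'end' times and a list of 'points' is an assumption made for illustration only:

from datetime import timedelta

def filter_clusters(clusters, min_duration=timedelta(minutes=10), min_points=5):
    """Keep only clusters that last long enough and contain enough events."""
    kept = []
    for c in clusters:
        long_enough = (c['end'] - c['start']) >= min_duration
        dense_enough = len(c['points']) >= min_points
        if long_enough and dense_enough:
            kept.append(c)
    return kept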

     

     The table below shows some cluster statistics from Figure 3.19. 

     

     


      

    Cluster  1  2  3  4  5  6  7 

    Start  16:06:46  14:16:34  12:32:26  05:39:21  15:13:27  04:34:26  21:29:20 

    End  18:22:47  14:33:40  13:40:08  11:56:54  15:13:27  04:34:26  21:29:20 

    Duration  2:16:01  0:17:06  1:07:42  6:17:33  0:00:00  0:00:00  0:00:00 

    Elements  18  3  9  57  1  1  1 

    Percentage  20%  3%  10%  63%  1%  1%  1% 

    Count hits  19  0  7  70  0  0  0 

    Hit percentage  16%  0%  6%  61%  0%  0%  0% 

     Table 3.7. Hierarchical Clustering cluster statistics of TP3 training data. 

Using the same parameters as described at the beginning of the section, visualisations and cluster statistics are produced for TP2 and TP10 respectively.

     Figure 3.21 Hierarchical Clustering on TP2 training data. 

         


    Cluster #  1  2  3  4  9  10  11 

    Start  08:05:33  04:08:46  06:36:29  05:25:40  16:59:45  15:08:28  15:56:04 

    End  13:01:07  04:45:19  07:24:44  05:30:06  18:11:43  15:12:53  16:00:47 

    Duration  4:55:34  0:36:33  0:48:15  0:04:26  1:11:58  0:04:25  0:04:43 

    Elements  37  9  3  2  5  2  2 

Percentage  57%  14%  4%  3%  7%  3%  3%

    Hit count  48  2  1  1  4  0  0 

Hit percentage  60%  2%  1%  1%  5%  0%  0%

     Table 3.8. Hierarchical Clustering cluster statistics of TP2 training data. 

        Clusters and statistics for TP 10.   

     Figure 3.22. Hierarchical clustering on TP10 training data. 

          


    Cluster  1  2  3  4  5  6 

    Start  08:55:49  12:33:33  11:50:19  05:54:14  07:36:25  14:44:41 

    End  10:45:17  12:53:05  11:50:19  06:22:49  07:36:25  14:44:41 

    Duration  1:49:28  0:19:32  0:00:00  0:28:35  0:00:00  0:00:00 

    Elements  10  3  1  5  1  1 

    Percentage  47%  14%  4%  23%  4%  4% 

    Hit count  3  0  0  0  0  0 

Hit percentage  21%  0%  0%  0%  0%  0%

     Table 3.9. Hierarchical Clustering cluster statistics of TP10 training data. 

As can be seen from diagrams 3.19, 3.21 and 3.22, filtering with the hourly probability std. threshold would only lead to an even worse hit rate, since it would cut off most of the good clusters. It can, however, be used to find a realistic cluster number, which can serve as an initialisation parameter for other clustering techniques.

Below are the statistics of the clustering hit/miss performance on all three datasets. The results stem from clustering with no thresholds and with the parameter initialisation described at the beginning of the section. Single-point clusters are of no relevance for the hit/miss performance, since they have the same starting and ending time.

    Test Person  TP3  TP2  TP10 

    Hits  96  56  3 

    Misses  17  23  11 

    Hit Rate  84%  70%  21% 

    Miss Rate  16%  30%  79% 

    Training set size  87  60  21 

    Testing set size  103  79  15 

     Table 3.10: Summary of hierarchical clustering on hit/miss performance. 

Every single datapoint from the testing dataset (the second data portion) was checked for membership in any cluster from the training dataset; more precisely, whether the mean time of a second-portion event lies between any cluster's start and end time.

The results in Table 3.10 show a direct correlation between training set size and hit rate. The TP10 results are unsatisfactory due to the insufficient amount of data.
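A minimal sketch of this membership check, assuming each training cluster carries 'start' and 'end' times comparable with the event mean times (a hypothetical structure, matching the filtering sketch above):

def cluster_hit_rate(clusters, test_mean_times):
    """Share of test events whose mean time falls inside the
    [start, end] interval of any training cluster."""
    hits = sum(1 for t in test_mean_times
               if any(c['start'] <= t <= c['end'] for c in clusters))
    return hits / float(len(test_mean_times))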


3.8.2.2 Dirichlet Process GMM Clustering Analysis

The first test for the Dirichlet Process GMM is to see whether it groups the datapoints into a reasonable number of clusters. The best way to test this is to set a high maximum number of components; the Dirichlet Process GMM should then deliver the real density estimates as described in the theory section.

Initialising the DPGMM with 20 as the maximum component number, and leaving alpha, the covariance type "diag", the cluster update parameters w ("weights"), m ("means") and c ("covariances") combined, and the number of iterations (10) at their defaults, results in the diagram below [12].
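A minimal sketch of this initialisation, assuming the DPGMM class that shipped with scikit-learn at the time (it has since been deprecated in favour of BayesianGaussianMixture); the data values and variable names are illustrative only:

import numpy as np
from sklearn.mixture import DPGMM  # available in older scikit-learn releases

# cooking event mean times in seconds since midnight, one feature per row
train_seconds = [29700, 30600, 43200, 44100, 68400]  # illustrative values
X = np.asarray(train_seconds, dtype=float).reshape(-1, 1)

# 20 components as upper bound; alpha, n_iter and the update parameters
# ('wmc' = weights, means, covariances) are left at their defaults
dpgmm = DPGMM(n_components=20, covariance_type='diag')
dpgmm.fit(X)
labels = dpgmm.predict(X)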

     Figure 3.23 DPGMM run with default parameter setting. 

The DPGMM simply puts all the data into one single cluster. Increasing the alpha parameter, which should lead to a larger number of clusters, makes no difference at all. Increasing the number of iterations also returns no better results.

To make a useful model with the Dirichlet Process GMM, the distribution of the data has to be closely inspected. How are the data aligned? Is there any possible convergence between the data points? Are there any obvious accumulation points?

All cooking events are aligned on one horizontal line; incremented by seconds, they form one long straight line. This can be seen as perfect convergence. Accumulation points are possible but not very frequent. Small, well-spread data chunks are also visible, which characterises the cooking event data as sparse and noisy.


All the points converge to one straight line, which may be the reason why the Dirichlet Process GMM initialised with the covariance update puts them into one single cluster.

As discussed in the theory section, each Gaussian component can be described by a mean, a variance/covariance and a size (mixing weight) π. The mean is responsible for the distribution shift, the variance describes the component's width and π is responsible for the component's height (the share of datapoints in the direct neighbourhood that are clustered together).

This leads to the conclusion that the DPGMM's cluster update parameters and the covariance types have to be experimented with in order to obtain a meaningful model.

Since the data are perfectly aligned, the covariance can be removed from the training process. Removing the covariance from the cluster update parameters, leaving them at 'wm' only, means that only the weights (component heights π) and the means are updated during clustering; the training process then results in the diagram below. A short sketch of this parameter change is given first.
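The sketch again assumes the older scikit-learn DPGMM API, in which the params argument controls which quantities are updated during training; data values and names are illustrative only:

import numpy as np
from sklearn.mixture import DPGMM  # available in older scikit-learn releases

train_seconds = [29700, 30600, 43200, 44100, 68400]  # illustrative values
X = np.asarray(train_seconds, dtype=float).reshape(-1, 1)

# update only the weights and means ('wm'); the covariances stay fixed
dpgmm_wm = DPGMM(n_components=20, covariance_type='diag', params='wm')
dpgmm_wm.fit(X)
labels_wm = dpgmm_wm.predict(X)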

     Figure 3.24 DPGMM Clustering with weights and means combined on  TP3 dataset.  

Looking at Figure 3.24 above, four clusters are found. All four correspond to the hourly probabilities of cooking events occurring (blue line). The cluster probabilities are calculated based on the number of data objects they contain. Outliers far away from the others, such as the first point from the right (red cluster) and the first point from the left (blue cluster), are not put into separate single-point clusters but are grouped with their nearest objects. This widens the clusters in length, which is beneficial for the hit rate. This seems to be an acceptable solution for clustering with the DPGMM.


Removing the weights parameter ("w") results in the same cluster configuration as shown in Figure 3.24. This means that for this kind of time-series data the mean parameter "m", which shifts the centre of a Gaussian component, is crucial. Changing the covariance type between all four possible options had no influence on the result [11]. The following figure shows the Gaussian components for the TP3 data.

     Figure 3.25 TP3 training vs testing datasets gaussian components. 

Table 3.11 gives some cluster insights for comparison.

    Cluster  2  4  5  12 

    Start  04:34:26  07:23:44  09:59:54  14:16:34 

    End  07:06:44  09:34:48  13:40:08  21:29:20 

    Duration  2:32:18  2:11:04  3:40:14  7:12:46 

    Elements  22  17  28  23 

    Percentage  24%  18%  31%  25% 

    Hit count  20  16  37  25 

    Hit percentage  17%  14%  32%  22% 

     Table 3.11. DPGMM cluster statistics and hit performance on TP3. 

      


Cluster five contains 31% of the training dataset. The cluster starts at 10:00 and ends at 13:40; looking at the test data, this is the most probable cooking time of test person 3. Compared with the testing dataset, this cluster has the highest hit percentage of 32%. The second biggest cluster is number 12. It is the longest cluster, with a duration of over 7 hours, due to the outlier point at its end. Although this increases the possible hit rate (depending on the data quality), it is highly improbable that the person typically cooks at 9 PM. The same can be said for cluster 2, which has an outlier at its beginning.

Initialising the DPGMM with the same training data and setting the maximum number of components to 2 with the "means" parameter would result in 2 clusters, combining clusters 2 with 4 and 5 with 12. Setting the maximum number of components to 3 would result in combining clusters 2 and 4.